FEATURE WEIGHTING PROJECT SELECTION ANALOGY BASED ESTIMATION
William Sica
--------------------------------------------------------------------------

STATUS OF PROJECT / SUMMARY OF RESULTS

My data does not agree with Li's over a sufficient number of trials. In
individual trials it appears promising, but over a large number of trials
those low numbers do not hold. Currently, one of the following is the case:

1) I am doing something wrong somewhere, or the data in the paper is
   wrong/misleading.
2) I randomize the data splitting every trial, whereas in Li's experiment the
   data was random-split once. This is what I believe is most likely.
3) Li's paper says "best variants of all cost estimation methods", and this
   means the best trial over X number of trials, not the average, etc.
4) Something I don't know.

--------------------------------------------------------------------------

INDEX

Conclusion
Experiment 1: Copying Li's Results
Experiment 2: Approximating Li's Results over More Trials
Experiment 3: FWPSABE vs. Guessing vs. ABE
Experiment 4: Approximate Desharnais
Observations

--------------------------------------------------------------------------

CONCLUSION

The fundamental problem we address is increasing the accuracy of estimates
produced by Analogy Based Estimation (ABE) using stochastic search techniques
to optimize feature weighting and project selection (FW and PS). Accuracy is
most commonly measured by Mean MRE (MMRE), and that is the primary measure
here.

FW and PS were shown to be an improvement over ABE in certain cases, as seen
in Experiments 3 and 4. Initial data [Experiments 1 and 2] seemed to indicate
that a GA was not an optimal search technique for optimizing FW and PS.
However, Experiment 4 seems to indicate otherwise, and further research is
needed before genetic algorithms can be ruled out for this problem.

--------------------------------------------------------------------------

EXPERIMENT 1: EX1.PNG

Experiment 1 follows the experiment in the paper as closely as possible,
using:

    Albrecht Dataset
    3 Nearest Neighbors
    Mean Solution Function
    Euclidean Distance

I was able to perform 11 trials. A time/convergence analysis found the
following:

    Minimum time: 12 min 40 sec (converged at generation 3)
    Maximum time: 36 min 35 sec (converged at generation 555)
    Mean time:    20 min  0 sec (converged at generation 110)

The elite function, as in the paper, requires 100 * (# Features) + C
generations, where C is the convergence generation. Each generation took an
average of 1.48 seconds. The program runs from 701 to 7,000 generations for
the Albrecht dataset, so approximate possible min/max times are 17 to 170
minutes.

I did not find the time/convergence data to be useful for optimization,
because the estimates being produced were not as good as in the paper. As
seen in the data:

    Mean Training MMRE  .85872
    Mean Testing MMRE   1.1106

Li's data stated [Training MMRE .56, Testing MMRE .41]. Li's data point is
plotted on the graph, and it can be seen that it is plausible for a single
run of the experiment; there are values with lower testing and training MMRE
than Li's data point. As a comparison, ABE without FW or PS over 500 trials
had mean MMRE 1.0633, outperforming optimized FWPSABE.
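For reference, the following is a minimal sketch (in Python) of the ABE
configuration used above (Euclidean distance, k nearest neighbors, mean
solution function) and of the MMRE measure. The function and parameter names
(abe_estimate, mmre, weights, k) are illustrative only; they are not the
identifiers used in the actual experiment code or in Li's paper.

    import math

    def abe_estimate(target, train_features, train_efforts, weights=None, k=3):
        """Estimate effort for `target` as the mean effort of its k nearest
        training projects under a (feature-weighted) Euclidean distance."""
        if weights is None:
            weights = [1.0] * len(target)   # plain ABE: all features weighted equally
        dists = []
        for feats, effort in zip(train_features, train_efforts):
            d = math.sqrt(sum(w * (a - b) ** 2
                              for w, a, b in zip(weights, target, feats)))
            dists.append((d, effort))
        dists.sort(key=lambda pair: pair[0])
        nearest = [effort for _, effort in dists[:k]]
        return sum(nearest) / len(nearest)  # mean solution function

    def mmre(actuals, predictions):
        """Mean Magnitude of Relative Error: mean of |actual - predicted| / actual."""
        return sum(abs(a - p) / a for a, p in zip(actuals, predictions)) / len(actuals)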
---------------------------------------------------------------------

EXPERIMENT 2: EX2.PNG

As in Experiment 1, except 5 generations were used to get data quickly. Data
was gathered over 500 trials.

    Mean Training MMRE  .75948
    Mean Testing MMRE   1.0461

This is a good approximation of the longer trials, and actually an
improvement. It still does not perform as well as ABE without FWPS over many
trials.

--------------------------------------------------------------------------

EXPERIMENT 3: EX3.PNG

This data was gathered accidentally, but it helps answer a question posed in
Experiments 1 and 2: is FWPSABE better than ABE for estimation, and, more
importantly, better than random guessing?

The data gathered shows the results of FWPSABE using Li's genetic algorithm
with:

    Desharnais Dataset
    Nearest Neighbors
    Mean Solution Function
    Euclidean Distance
    5 Generations

Instead of estimating effort, it was estimating which language the project
used, a numerical representation of a discrete attribute with possible values
1, 2, or 3.

    Just guessing has expected MMRE .224
    ABE has mean MMRE .414 over 1,000 trials
    FWPSABE has training MMRE .21241 over 60 trials
    FWPSABE has testing MMRE .28391 over 60 trials

The optimized FWPSABE function outperformed the expected value of guessing.
Looking at the data, a large number of trials also outperformed guessing in
both the testing and training phases. FWPS also outperformed pure ABE in this
experiment.

--------------------------------------------------------------------------

EXPERIMENT 4: EX4.PNG

Experiment 4 takes a leap of faith, as follows:

    - Experiment 2 approximated Experiment 1 well with a small number of
      trials.
    - This approximation should hold for a different dataset.

Effort estimation was performed, with the same ABE configuration as
Experiment 3, for 49 trials.

    ABE has mean MMRE 1.4149 over 500 trials
    Li's data has training MMRE .55
    Li's data has testing MMRE .42
    FWPSABE has training MMRE .46172 over 49 trials
    FWPSABE has testing MMRE .62943 over 49 trials

Even with a low number of generations, FWPSABE outperformed ABE by a small
margin. The results here are more promising, and they show that random
splitting of the datasets is more harmful to the smaller Albrecht dataset.
This is evidence towards the idea that Li may have randomly split the data
once.

The data reveals that almost no trials outperformed the goal of Li's data
point. This is different from the tests on the Albrecht dataset, in which a
large number of points were clumped below or near Li's data point. A larger
number of generations may be needed, hence why the "leap of faith" was just
that.

--------------------------------------------------------------------------

OBSERVATIONS

The GA rarely finds the absolute minimum MMRE of the fitness function. This
minimum is 0, because sending a chromosome with 0 for all project-selection
bits to the ABE test causes it to fail. I ran some tests with very small
data, and the GA was able to cheat and exploit this. (A sketch of how the
fitness evaluation could guard against this degenerate chromosome follows at
the end of this section.)

The difference between Albrecht and Desharnais caused by repeated random
splitting also seems to indicate the FWPSABE method will be better for larger
datasets.

I also found it highly interesting that in some trials the Training MMRE was
very high while the Testing MMRE was very low. This and the opposite were
very common in the Albrecht dataset. This showed that while the random
splitting hurt performance by contributing bad data splits, the optimization
obtained was still good.
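As referenced above, the following is a minimal sketch of how the GA fitness
evaluation could reject the degenerate all-zero project-selection chromosome
instead of letting the ABE test fail with an MMRE of 0. The chromosome layout
(real-valued feature weights followed by one selection bit per training
project), the use of a leave-one-out training MMRE as the fitness value, and
the names used here are assumptions for illustration, not the actual
experiment code; it reuses the abe_estimate and mmre helpers sketched after
Experiment 1.

    def fitness(chromosome, n_features, train_features, train_efforts, k=3):
        """Lower is better: leave-one-out MMRE over the selected training
        projects, with a guard against degenerate selections."""
        weights = chromosome[:n_features]    # real-valued feature weights
        selection = chromosome[n_features:]  # one bit per training project
        selected = [(f, e) for bit, f, e
                    in zip(selection, train_features, train_efforts) if bit]
        if len(selected) < k:
            return float('inf')              # guard: too few projects selected
        feats = [f for f, _ in selected]
        efforts = [e for _, e in selected]
        predictions = []
        for i, (target, _) in enumerate(selected):
            rest_feats = feats[:i] + feats[i + 1:]
            rest_efforts = efforts[:i] + efforts[i + 1:]
            predictions.append(abe_estimate(target, rest_feats, rest_efforts,
                                            weights, k))
        return mmre(efforts, predictions)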