Must add: this is not outlier removal; add the boosting background. Related work should cover Shepperd's three categories of techniques, not two, and place this work in that context. We are only exploring one factor in isolation; exploring all combinations is a lengthy process and beyond the scope of this paper.
--------
Missing from this study is FSS (feature subset selection). FSS is an important technique, but it is not used in all papers (e.g., see Mendes 2003). We are currently adding FSS to our rig, but the relative merits of FSS and/or case selection will have to be revisited in light of the results of this paper.

Adaptation: officially, adaptation is a major topic in CBR, and there is much talk about it. Riesbeck complains that much of that talk is spurious. We therefore did not explore adaptation here, instead simply returning the median of the analogies' results. As seen in our results, this strategy proved most effective.

Scaling and similarity measure: like many before us (e.g., Mendes 2003, Keung 2008) we (a) normalize all columns to 0..1 via min..max and (b) use the standard Euclidean similarity measure sqrt(sum_x w_x*(x0 - x1)^2) with all w_x = 1. In our planned FSS measures, we plan to use the same formula, where w_x is the value of each feature x as found by our feature subset selector (see the sketch after this note block).
--------
One approach is to make everything selectable and let some selection mechanism work out which option is best (see Chiu 2007 and Y.F. Li et al., The Journal of Systems and Software 82 (2009) 241-252). This approach is tempting when we cannot find some pattern in the data that can be exploited by a learner. Certainly, in the past, we have set up very large experiments where we explored many different options via very large nested for loops. The resulting rig took days to run and the results were inconclusive: the generated effort estimates exhibited such a large variance (PROMISE 2007) that we could not detect whether one method was any better than another.

We have two objections to such brute-force try-all-options approaches. First, a brute-force approach is unnecessary if we can recognize which features of the data select for better learning performance; such an informed approach may be able to do better than brute force. Second, it can be hard to distinguish the results of a brute-force analysis: given the variance of the predictions seen in effort estimation methods, it can be unclear what is learned from such a study. Sometimes the variance of the effort estimates is so high that it is hard to see which methods are working best. We have spent many years trying to tame this variance and, even after elaborate FSS schemes, we still get errors over 100% (see Menzies 2006). Consequently, there may be no clear winner even after all that brute force has been expended.

The ready availability of cloud computing facilities makes us expect to see more such massive exploration studies. However, before exercising large-scale CPU farms, there might be an advantage in reflecting over the data.

Chiu, N.H., Huang, S.J., 2007. The adjusted analogy-based software effort estimation based on similarity distances. Journal of Systems and Software 80 (4), 628-640.
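A minimal sketch, in Python, of the min-max normalization and (optionally weighted) Euclidean similarity measure described in the scaling note above. The NumPy-based helpers and their names are our own illustration, not code from our rig.

```python
import numpy as np

def normalize(table):
    """Scale every column of a numeric feature table to 0..1 via min..max."""
    table = np.asarray(table, dtype=float)
    lo, hi = table.min(axis=0), table.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (table - lo) / span

def similarity_distance(x0, x1, w=None):
    """Euclidean distance sqrt(sum_x w_x*(x0 - x1)^2); all w_x default to 1."""
    x0, x1 = np.asarray(x0, dtype=float), np.asarray(x1, dtype=float)
    w = np.ones_like(x0) if w is None else np.asarray(w, dtype=float)
    return float(np.sqrt(np.sum(w * (x0 - x1) ** 2)))
```

With w supplied by a feature subset selector, the same function would cover the planned FSS variant.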
--------
This parameter refers to K, the number of most similar projects (from Mendes 2003). The number of analogies refers to the number of most similar cases that will be used to generate the estimate. According to Angelis and Stamelos (2000), when small data sets are used it is reasonable to consider only a small number of analogies. Several studies in software engineering have restricted their analysis to the closest analogy (k=1) (Briand et al., 1999, 2000; Myrtveit and Stensrud, 1999). However, we decided to use one, two and three analogies, similarly to Jeffery et al. (2001), Angelis and Stamelos (2000), Schofield (1998), Mendes et al. (2000), Mendes et al. (2001a) and Jeffery et al. (2000). Some studies suggested k = 1, i.e., the single project that is closest to the project being estimated (Walkerden and Jeffery, 1999; Auer et al., 2006; Chiu and Huang, 2007) [these are referenced elsewhere, but we cannot find a neighborhood comment in Auer et al. (2006) or Chiu and Huang (2007)]. However, we set K = {1, 2, 3, 4, 5}, since many studies recommend K equal to two or three (Shepperd and Schofield, 1997; Mendes et al., 2003; Jorgensen et al., 2003; Huang and Chiu, 2006).
--------
Abstract. Background: MRE not relevant; how about: non-parametric methods deserve much study. There is much prior work trying many learners. Rather than continue that search, we argue that we need to tune our data mining method via feature extraction from the data set.

Method: we refine our estimation method by considering under what cases this method will perform best/worst. Then we explore our training data to isolate and separate those best/worst cases. Best-case test results were found by learning only from the best-case training data. A theoretical drawback of this approach is that, by ignoring the hard cases during training, we will also fail on the hard cases during testing. If that were so, then our performance results on test cases not seen during training would be very poor. Experimentally, we show that this poor performance does not occur. Rather, when we compare our method to a set of similar methods, our policy of {\em ignoring the hard training cases} is very effective. Our results clearly show that, for the harder test cases that generate the largest errors, TEAC results in far smaller errors than any other method explored in this study.

We therefore make two recommendations. Our general recommendation is that, when commissioning an induction algorithm for software engineering data, it is useful to understand the assumptions of that algorithm as well as the kinds of data that would violate those assumptions. With this knowledge, it may be possible to improve how that algorithm is applied. Our more specific recommendation relates to software project effort estimation using analogies. For this estimation task:
\bi
\item The number of related projects changes with each test instance.
\item Therefore we argue against the current practice of fixing the size of the local neighborhood.
\item Rather, we should determine the local neighborhood on a per-instance basis.
\ei
This came from reflecting over the properties of our data set: we isolated a core premise from the reasoning-by-analogy community (locality means uniformity) and tested it on data.
--------
Shepperd economics roadmap: the work essentially falls into one of three techniques:
* algorithmic or parametric models
* induced prediction systems (via some kind of machine learning method)
* human centric techniques (usually referred to as expert judgement)
--------
"Experiments with Analogy-X for Software Cost Estimation", Jacky Keung and Barbara Kitchenham: Our recent research has proposed a method for assessing whether data-intensive case-based reasoning (sometimes referred to as analogy) is an appropriate method for predicting effort on a specific dataset [7] [6].
Although usually unstated, the basic hypothesis underlying the use of data-intensive case-based reasoning for software project effort estimation is: "Projects that are similar with respect to project and product factors such as size and complexity will be similar with respect to project effort." Based on this principle, tools such as ANGEL [2] [8] compute a similarity measure, using project and product features, between a new project and the projects in an historical database [2]. An effort estimate for the new project is then based on the actual effort of the k most similar projects in the database. The value of k is determined by trial and error for a particular dataset. There are several alternative strategies for constructing the estimate for the new project, for example a simple average of the k most similar projects, or a weighted average. One major problem with this method is that tools such as ANGEL will provide an estimate even if the data set is completely inappropriate for case-based estimation [8]. However, our recent research [7] has identified a method for testing whether the hypothesis underlying analogy is valid for a particular dataset, which is analogous to assessing whether a regression line produces a statistically significant fit for a particular dataset.
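Below is a small, self-contained Python sketch of the ANGEL-style analogy procedure described in the notes above: normalize the historical projects, rank them by Euclidean distance to the new project, and estimate effort from the k most similar projects. It uses the median of the analogies (matching the adaptation note earlier) and tries K = {1, 2, 3, 4, 5}; the simple or weighted average variants mentioned above would be one-line changes. The data and function names are illustrative only, not the actual rig.

```python
import numpy as np

def estimate_effort(history, efforts, new_project, k=3):
    """Analogy-based estimate: effort of the k most similar historical projects.

    history     : (n_projects, n_features) array of historical feature values
    efforts     : (n_projects,) array of the actual efforts of those projects
    new_project : (n_features,) feature vector of the project to be estimated
    k           : number of analogies (most similar projects) to use
    """
    lo, hi = history.min(axis=0), history.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    norm_hist = (history - lo) / span                 # min..max normalize to 0..1
    norm_new = (new_project - lo) / span
    dists = np.sqrt(((norm_hist - norm_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                   # indices of the k most similar projects
    return float(np.median(efforts[nearest]))         # median adaptation, no other adjustment

# Illustrative use with fake data, trying several neighborhood sizes.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    history = rng.random((20, 4))                     # 20 fake projects, 4 features
    efforts = rng.random(20) * 1000                   # fake effort values
    new_project = rng.random(4)
    for k in (1, 2, 3, 4, 5):
        print(k, estimate_effort(history, efforts, new_project, k=k))
```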