Must add: this is not outlier removal; add the boosting background. Related work should cover Shepperd's three categories of techniques, not two, and place this work in that context. We are only exploring one factor in isolation; exploring all combinations is a lengthy process and beyond the scope of this paper.
--------
Missing from this study is FSS (feature subset selection). FSS is an important technique, but it is not used in all papers (e.g., see Mendes 2003). We are currently adding FSS to our rig, but the relative merits of FSS and/or case selection will have to be revisited in light of the results of this paper.

Adaptation: officially, adaptation is a major topic in CBR, and there is much talk about it. Riesbeck complains that much of that talk is spurious. We therefore did not explore adaptation here, instead simply returning the median of the analogies' results. As seen in our results, this strategy proved most effective.

Scaling and similarity measure: like many before us (e.g., Mendes 2003, Keung 2008) we (a) normalize all columns to 0..1 via min..max and (b) use the standard Euclidean similarity measure sqrt(sum_x w_x*(x0 - x1)^2) with all w_x = 1. In our planned FSS measures, we plan to use the same formula, where w_x is the value of each feature x as found by our feature subset selector (see the sketch after this note block).
--------
One approach is to make everything selectable and let some selection mechanism work out which option is best (see Chiu 2007 and Y.F. Li et al., The Journal of Systems and Software 82 (2009) 241-252). This approach is tempting when we cannot find some pattern in the data that can be exploited by a learner. Certainly, in the past, we have set up very large experiments where we explored many different options via very large nested for loops. The resulting rig took days to run and the results were inconclusive: the generated effort estimates exhibited such a large variance (PROMISE 2007) that we could not detect whether one method was any better than another.

We have two objections to such brute-force try-all-options approaches. First, a brute-force approach is unnecessary if we can recognize which features of the data select for better learning performance; such an informed approach may be able to do better than brute force. Second, it can be hard to distinguish the results of a brute-force analysis: given the variance of the predictions seen in effort estimation methods, it can be unclear what is learned from such a study. Sometimes the variance of the effort estimates is so high that it is hard to see which methods are working best. We have spent many years trying to tame this variance and, even after elaborate FSS schemes, we still get errors over 100% (see Menzies 2006). Consequently, there may be no clear winner even after all that brute force has been expended.

The ready availability of cloud computing facilities makes us expect to see more such massive exploration studies. However, before exercising large-scale CPU farms, there might be an advantage in reflecting over the data.

Chiu, N.H., Huang, S.J., 2007. The adjusted analogy-based software effort estimation based on similarity distances. Journal of Systems and Software 80 (4), 628-640.
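A minimal sketch, in Python, of the min-max normalization and (optionally weighted) Euclidean similarity measure described in the scaling note above. The NumPy-based helpers and their names are our own illustration, not code from our rig.

```python
import numpy as np

def normalize(table):
    """Scale every column of a numeric feature table to 0..1 via min..max."""
    table = np.asarray(table, dtype=float)
    lo, hi = table.min(axis=0), table.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (table - lo) / span

def similarity_distance(x0, x1, w=None):
    """Euclidean distance sqrt(sum_x w_x*(x0 - x1)^2); all w_x default to 1."""
    x0, x1 = np.asarray(x0, dtype=float), np.asarray(x1, dtype=float)
    w = np.ones_like(x0) if w is None else np.asarray(w, dtype=float)
    return float(np.sqrt(np.sum(w * (x0 - x1) ** 2)))
```

With w supplied by a feature subset selector, the same function would cover the planned FSS variant.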
--------
This parameter refers to K, the number of most similar projects (from Mendes 2003). The number of analogies refers to the number of most similar cases that will be used to generate the estimate. According to Angelis and Stamelos (2000), when small data sets are used it is reasonable to consider only a small number of analogies. Several studies in software engineering have restricted their analysis to the closest analogy (k=1) (Briand et al., 1999, 2000; Myrtveit and Stensrud, 1999). However, we decided to use one, two and three analogies, similarly to Jeffery et al. (2001), Angelis and Stamelos (2000), Schofield (1998), Mendes et al. (2000), Mendes et al. (2001a) and Jeffery et al. (2000). Some studies suggested k = 1, i.e., the single project that is closest to the project being estimated (Walkerden and Jeffery, 1999; Auer et al., 2006; Chiu and Huang, 2007) [these are referenced elsewhere, but we cannot find a neighborhood comment in Auer et al. (2006) or Chiu and Huang (2007)]. However, we set K = {1, 2, 3, 4, 5}, since many studies recommend K equal to two or three (Shepperd and Schofield, 1997; Mendes et al., 2003; Jorgensen et al., 2003; Huang and Chiu, 2006).
--------
Abstract. Background: MRE not relevant; how about: non-parametric methods deserve much study. There is much prior work trying many learners. Rather than continue that search, we argue that we need to tune our data mining method via feature extraction from the data set.

Method: we refine our estimation method by considering under what cases this method will perform best/worst. Then we explore our training data to isolate and separate those best/worst cases. Best-case test results were found by learning only from the best-case training data. A theoretical drawback of this approach is that, by ignoring the hard cases during training, we will also fail on the hard cases during testing. If that were so, then our performance results on test cases not seen during training would be very poor. Experimentally, we show that this poor performance does not occur. Rather, when we compare our method to a set of similar methods, our policy of {\em ignoring the hard training cases} is very effective. Our results clearly show that, for the harder test cases that generate the largest errors, TEAC results in far smaller errors than any other method explored in this study.

We therefore make two recommendations. Our general recommendation is that, when commissioning an induction algorithm for software engineering data, it is useful to understand the assumptions of that algorithm as well as the kinds of data that would violate those assumptions. With this knowledge, it may be possible to improve how that algorithm is applied. Our more specific recommendation relates to software project effort estimation using analogies. For this estimation task:
\bi
\item The number of related projects changes with each test instance.
\item Therefore we argue against the current practice of fixing the size of the local neighborhood.
\item Rather, we should determine the local neighborhood on a per-instance basis.
\ei
This came from reflecting over the properties of our data set: we isolated a core premise from the reasoning-by-analogy community (locality means uniformity) and tested it on data.
--------
Shepperd economics roadmap: the work essentially falls into one of three techniques:
* algorithmic or parametric models
* induced prediction systems (via some kind of machine learning method)
* human centric techniques (usually referred to as expert judgement)
--------
"Experiments with Analogy-X for Software Cost Estimation", Jacky Keung and Barbara Kitchenham: Our recent research has proposed a method for assessing whether data-intensive case-based reasoning (sometimes referred to as analogy) is an appropriate method for predicting effort on a specific dataset [7] [6].
Although usually unstated, the basic hypothesis underlying the use of data-intensive case-based reasoning for software project effort estimation is: "Projects that are similar with respect to project and product factors such as size and complexity will be similar with respect to project effort." Based on this principle, tools such as ANGEL [2] [8] compute a similarity measure, using project and product features, between a new project and the projects in an historical database [2]. An effort estimate for the new project is then based on the actual effort of the k most similar projects in the database. The value of k is determined by trial and error for a particular dataset. There are several alternative strategies for constructing the estimate for the new project, for example a simple average of the k most similar projects, or a weighted average. One major problem with this method is that tools such as ANGEL will provide an estimate even if the data set is completely inappropriate for case-based estimation [8]. However, our recent research [7] has identified a method for testing whether the hypothesis underlying analogy is valid for a particular dataset, which is analogous to assessing whether a regression line produces a statistically significant fit for a particular dataset.
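Below is a small, self-contained Python sketch of the ANGEL-style analogy procedure described in the notes above: normalize the historical projects, rank them by Euclidean distance to the new project, and estimate effort from the k most similar projects. It uses the median of the analogies (matching the adaptation note earlier) and tries K = {1, 2, 3, 4, 5}; the simple or weighted average variants mentioned above would be one-line changes. The data and function names are illustrative only, not the actual rig.

```python
import numpy as np

def estimate_effort(history, efforts, new_project, k=3):
    """Analogy-based estimate: effort of the k most similar historical projects.

    history     : (n_projects, n_features) array of historical feature values
    efforts     : (n_projects,) array of the actual efforts of those projects
    new_project : (n_features,) feature vector of the project to be estimated
    k           : number of analogies (most similar projects) to use
    """
    lo, hi = history.min(axis=0), history.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    norm_hist = (history - lo) / span                 # min..max normalize to 0..1
    norm_new = (new_project - lo) / span
    dists = np.sqrt(((norm_hist - norm_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                   # indices of the k most similar projects
    return float(np.median(efforts[nearest]))         # median adaptation, no other adjustment

# Illustrative use with fake data, trying several neighborhood sizes.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    history = rng.random((20, 4))                     # 20 fake projects, 4 features
    efforts = rng.random(20) * 1000                   # fake effort values
    new_project = rng.random(4)
    for k in (1, 2, 3, 4, 5):
        print(k, estimate_effort(history, efforts, new_project, k=k))
```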