\documentclass{article}
\usepackage{url}

\begin{document}

\title{Editorial: Special Issue on Repeatable Experiments in Software Engineering}
\author{Tim Menzies,\\
Lane Department of Computer Science and Electrical Engineering,\\
West Virginia University, WV, USA\\
\url{tim@menzies.us}}
\maketitle

Welcome to the special issue of Empirical Software Engineering on repeatable experiments in software engineering. Earlier and shorter versions of the papers presented here first appeared at the PROMISE 2007 workshop in Minneapolis.

The PROMISE project has been running for four years now and aims to create large libraries of repeatable experiments in software engineering. PROMISE is somewhat different from other workshops that deal with learning from software data\footnote{e.g. the Mining Software Repositories series co-located with ICSE (\url{http://www.msrconf.org/}) and the new DEFECTS series co-located with ISSTA (\url{http://pages.cpsc.ucalgary.ca/~zimmerth/defects-2008/}).} in two ways. First, PROMISE emphasizes the data mining approach to generalization and, as such, is very concerned with the experimental methods used to generate the results. Second, PROMISE is more than just a workshop series. The project actually has three parts:
\begin{enumerate}
\item The PROMISE repository (\url{http://promisedata.org/?cat=11}): Currently, the repository holds 76 data sets and is growing rapidly (around 40\% more data sets each year). Where applicable, each data set is linked to papers describing current best results from that data (in many cases, those papers are PROMISE publications). All the data sets are stored in a content management system that allows registered users to add comments to the data. In this way, the community's experience with that data (tips, traps for beginners, etc.) can be stored in a central, web-accessible location.
\item The annual PROMISE conference: Each year, the PROMISE community meets to reflect on old results and examine new ones. Starting as a small workshop at ICSE 2005, PROMISE has now grown into a conference, and PROMISE 2009 will be co-located with ICSE 2009 (Vancouver, British Columbia).
\item Journal special issues: Each year, the authors of the best PROMISE papers are invited to revise and extend their papers, then submit them to a peer-reviewed journal.
\end{enumerate}

This special issue contains four such papers. Each paper generalizes some specific project experience to build general models. The range of models is quite diverse, and the conclusions drawn from those models are fascinating.

For example, a widely-held view in the software engineering community is that defects grow linearly with module size; i.e., larger modules have more faults. According to this view, it makes sense to inspect larger modules before exploring smaller ones. In ``Theory of Relative Defect Proneness: Replicated Studies on the Functional Form of the Size-Defect Relationship for Software Modules'', Koru et al. argue convincingly that this view is 100\% wrong. In numerous data sets, they find a power-law relationship between module size and defects in which smaller modules are proportionally more fault-prone. Therefore, it is far more effective to focus verification efforts on smaller modules before moving to the larger ones.
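To see why such a power law favors the smaller modules, consider a back-of-the-envelope sketch (the functional form follows the paper's general finding, but the symbols and the assumption on the exponent are mine, for illustration only). If the number of defects $d$ in a module of size $s$ grows as
\[
d = c\,s^{b}, \qquad 0 < b < 1,
\]
then the defect density is
\[
\frac{d}{s} = c\,s^{\,b-1},
\]
which decreases as $s$ grows. Under such a model, each line inspected in a small module is expected to reveal more defects than a line inspected in a large one, so working through the smaller modules first yields more defects per unit of inspection effort.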
In their paper, Koru et al. offer some speculations on why this power-law relationship holds; in this editorial, I offer one more. Perhaps, in this modern era of refactoring, separation of concerns, and auto-generated code, programmers must split their ideas into tiny pieces scattered all across the code base. Once divided in this way, the interactions of all these tiny pieces become hard to understand.

Another widely-held view is that the best way to build defect models is by data mining. According to this view, since human experts do not understand all the nuances of their code, we must use automatic data mining to reveal the land mines buried in our systems. In this view, the construction of a defect model is a three-step process:
\begin{enumerate}
\item collecting domain knowledge (previous results, expert knowledge);
\item building a skeleton of the model based on step 1, including as-yet-unknown parameters;
\item estimating the model parameters using historical data.
\end{enumerate}
Any practitioner in this area will report that step \#3 can be extremely difficult: it is often quite hard to obtain reliable data of the required granularity, or in the required volume, from which our conclusions could later be generalized. In {\em On the effectiveness of early life cycle defect prediction with Bayesian Nets}, Fenton et al. offer an alternative approach that avoids step \#3. Working with domain experts, Fenton et al. built a causal model (a Bayesian net) for predicting the number of residual defects that are likely to be found during independent testing or operational usage. Note that this approach supports steps \#1 and \#2, but does not require step \#3. Their results are impressive in two respects. First, their Bayesian net makes very accurate defect predictions (an $R^2$ of 0.93 between predicted and actual defects). Second, since their method does not require detailed historical data, it can be applied very early in the process life cycle.

Yet another widely-held view is that ``too many cooks spoil the broth''; i.e., code units will be less fault-prone if they are written and maintained by only a few, or even just one, programmer. Another common belief is that a developer who works for the first time on a file that has previously been written or maintained by others is more likely to introduce faults into the software than one who has prior experience with the code. In {\em Do Too Many Cooks Spoil the Broth? Using the Number of Developers to Enhance Defect Prediction Models}, Weyuker \& Ostrand explored the value of adding development team size and experience with specific code units to an existing defect prediction model. They found that this information improved defect prediction, but only by a negligible amount.

Our final paper is concerned with the assessment of models once they are built. There exists a large number of competing data mining methods for generating software defect models, and deciding which model is ``best'' is a non-trivial task. In {\em Techniques for Evaluating Fault Prediction Models}, Jiang et al. comment that the comparison of fault prediction models is a multi-dimensional problem. Rarely will one model or modeling technique prove to be the ``best'' for all possible uses in software quality assessment. Overall classification performance is not the ultimate goal in itself; rather, optimizing project cost and maximizing the efficiency of software verification procedures typically top the agenda. Their paper describes ``cost curves'', a methodological generalization of cost-sensitive numerical performance indices that offers a succinct graphical comparison of model performance across a wide range of module misclassification costs.
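As a rough sketch of the idea behind such curves (the notation and the specific form below are mine, offered for illustration rather than as the paper's exact formulation): a defect predictor with false positive rate $\mathit{FPR}$ and false negative rate $\mathit{FNR}$ can be drawn as the straight line
\[
\mathit{NEC}(x) \;=\; \mathit{FNR}\cdot x \;+\; \mathit{FPR}\cdot(1-x),
\qquad 0 \le x \le 1,
\]
where the normalized expected cost $\mathit{NEC}$ is plotted against a single value $x$ that folds together the proportion of fault-prone modules and the relative costs of missing a fault versus raising a false alarm; for instance, one common choice is
\[
x \;=\; \frac{p(+)\,C_{FN}}{p(+)\,C_{FN} + p(-)\,C_{FP}},
\]
with $p(+)$ and $p(-)$ the proportions of fault-prone and fault-free modules and $C_{FN}$, $C_{FP}$ the two misclassification costs. The model whose line lies lowest over the operating conditions of interest is preferred, so a wide range of cost assumptions can be compared in a single picture.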
This special issue is the result of much work by a large group of people:
\begin{itemize}
\item First and foremost, I gratefully acknowledge the dedication of the reviewers of these papers. These reviewers were kind enough to review multiple versions of these papers, and to do so in record time.
\item Also, special thanks are due to my fellow members of the PROMISE steering committee: Gary Boetticher (general chair), Tom Ostrand, and Gunther Ruhe.
\item Last but not least, this issue would not have been possible without the support of the team at the Empirical Software Engineering journal: Lionel Briand was kind enough to support this issue, and Racquel Anievas was exceptionally helpful during the review and production process.
\end{itemize}

For more information on the PROMISE project, see \url{http://promisedata.org}. I hope that, soon, I will be reading your papers at a forthcoming PROMISE conference, or that other researchers will be using the data you contribute to the PROMISE repository.

\end{document}