Summary

Data mining NASA project data: five examples where data mining found clear quality predictors (effort, defects). In only \underline{one} of those cases is that data source still active. All that lost data. What to do?

Background

[figure: the Gartner hype cycle]

History:
1980s (early): AI summer.
1980s (late): bubble bursts, AI winter.
1990s: great success with planning, data mining.
2000s: still, some doubts re AI. But, lots of successes of AI (data mining) for SE.

This talk

This stuff works: five such success stories. The main problem with data mining is organizational, not technological: despite clear success, $\frac{4}{5}$ of those data sources have vanished. Easier to plan than to act.

What is data mining? Diamonds in the dust

Summarization: not 1000 records, but 3 rules.

Example \#1: text mining issue reports

[Menzies, 2007]: 901 NASA records from the PITS issue tracker: \{severity, free text\}.

\begin{tabular}{lcl}
severity & frequency & \\
1 (panic) & 0 & \\
2 & 311 & \rule{30mm}{2mm}\\
3 & 356 & \rule{35mm}{2mm}\\
4 & 208 & \rule{20mm}{2mm}\\
5 (yawn) & 26 & \rule{2mm}{2mm}
\end{tabular}

Output: all unique words, sorted by ``magic'' (see below); rules learned from the top N words.

10-way cross-val results for severity 2 (the $F$ column follows from $r$ and $p$; e.g., row one: $2 \cdot 0.81 \cdot 0.93 / (0.81+0.93) \approx 0.87$):

\begin{tabular}{lrrrl}
 & $r$ & $p$ & $F$ & \\
N words & recall (prob. detection) & precision & $\frac{2rp}{r+p}$ & \\\hline
100 & 0.81 & 0.93 & 0.87 & \rule{43mm}{2mm}\\
50 & 0.80 & 0.90 & 0.85 & \rule{42mm}{2mm}\\
25 & 0.79 & 0.92 & 0.85 & \rule{42mm}{2mm}\\
12 & 0.74 & 0.92 & 0.82 & \rule{41mm}{2mm}\\
6 & 0.71 & 0.94 & 0.81 & \rule{40mm}{2mm}\\
3 & 0.74 & 0.82 & 0.78 & \rule{38mm}{2mm}
\end{tabular}

{\scriptsize
\begin{verbatim}
if      (rvm <= 0) and (srs = 3) => severity=4 (52.0/21.0)
else if (srs >= 2)               => severity=2 (289.0/54.0)
else                             => severity=3 (558.0/246.0)
\end{verbatim}}

Diamonds in the dust: not 9414 words total, or 1662 unique words, but the 3 most predictive.

Methods

Too numerous to list here.

Use, don't dictate. Basili et al.'s ``learning organizations'' carefully design their data collection. Me? I chase the meat truck: use whatever data is around, given the savage pace of organizational change.

Less is more: pruning features is a good idea.

Yours ain't mine: it is always best to relearn with local data.

Less is more (1)

Feature subset selection (FSS): given $y = \beta_0 + \beta_1 f_1 + \beta_2 f_2 + \beta_3 f_3 + \ldots$, variance in $y$ is reduced by pruning some of the $f_i$. But don't prune too much: prune every $f_i$ and all that remains is $y=\beta_0$.

Apply domain knowledge. E.g., in text mining, TF*IDF ({\em term frequency $\times$ inverse document frequency}) selects frequent terms that appear in only a small number of documents:
\[ TFIDF = F[i,j] \times \log(IDF) \]
where $F[i,j]$ is the frequency of word $i$ in document $j$, and $IDF = \#documents/(\#documents\;containing\;i)$. The above study used the top 100 TF*IDF words.

Exponential-time FSS: try all $2^F$ subsets of the $F$ features on a {\em target learner}. Possible, with small feature sets, with some heuristic search (best first, STALE=5)~\cite{kohavi97}.

Linear-time FSS (not as thorough as $2^F$): sort the $F$ features, somehow; then try the first $1\le f\le F$ features. The above study sorted the top 100 TF*IDF terms on InfoGain. Initially, $H(C) = -\sum_{c\in C}p(c)\log_2 p(c)$. After seeing feature $f$, $H(C|f) = -\sum_{x\in f}p(x)\sum_{c\in C}p(c|x)\log_2 p(c|x)$. So $InfoGain = H(C) - H(C|f)$. (A sketch of this pipeline appears below.)

Less is more (2)

Overfitting avoidance: after learning ``it''... try throwing away bits of ``it''. E.g., RIPPER~\cite{cohen95a}: if you learn a conjunction, prune it with greedy back-select; if you learn a set of rules, prune the set with greedy back-select; for the surviving rules, try replacing each one with a dumb alternative or a carefully selected modification. Very fast: $O(m(\log m)^2)$ for $m$ examples. Often produces smaller theories than other methods (e.g., see above).
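To make the ``Less is more (1)'' recipe concrete, here is a minimal sketch (mine, not the code from the PITS study; all names are illustrative) that scores words by TF*IDF, keeps the top 100, then re-ranks those by InfoGain:

{\scriptsize
\begin{verbatim}
import math
from collections import Counter

def tfidf(docs):
    """Score each word: total frequency * log(#docs / #docs containing it)."""
    n, tf, df = len(docs), Counter(), Counter()
    for doc in docs:
        tf.update(doc)            # term frequencies
        df.update(set(doc))       # document frequencies
    return {w: tf[w] * math.log(n / df[w]) for w in tf}

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def infogain(word, docs, labels):
    """H(C) minus the expected entropy after splitting on word presence."""
    groups = {}
    for doc, lab in zip(docs, labels):
        groups.setdefault(word in doc, []).append(lab)
    after = sum(len(g)/len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

# Toy usage: docs are token lists, labels are severities.
docs = [["crash", "panic", "timer"], ["typo", "docs"], ["crash", "timer"]]
labels = [2, 4, 2]
scores = tfidf(docs)
top100 = sorted(scores, key=scores.get, reverse=True)[:100]
ranked = sorted(top100, key=lambda w: infogain(w, docs, labels), reverse=True)
print(ranked[:3])   # the few most predictive words
\end{verbatim}}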
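And a minimal sketch of the greedy back-select idea behind RIPPER's pruning step (again illustrative, not Cohen's implementation): keep deleting the conjunct whose removal does not hurt the score on a pruning set.

{\scriptsize
\begin{verbatim}
def back_select(rule, score):
    """Greedy back-select: drop conditions from a conjunction while the
    score (e.g., accuracy on a held-out pruning set) does not decrease."""
    best, best_score = list(rule), score(rule)
    improved = True
    while improved and best:
        improved = False
        for i in range(len(best)):
            shorter = best[:i] + best[i+1:]
            if score(shorter) >= best_score:   # prefer shorter rules on ties
                best, best_score, improved = shorter, score(shorter), True
                break
    return best

# Toy usage: conditions are predicates over a record; score is accuracy
# of "all conditions hold => severity=2" on a tiny pruning set.
rule = [lambda r: r["srs"] >= 2, lambda r: r["rvm"] <= 0]
prune_set = [({"srs": 3, "rvm": 0}, 2), ({"srs": 1, "rvm": 1}, 3)]
def accuracy(rule):
    ok = sum(all(c(r) for c in rule) == (sev == 2) for r, sev in prune_set)
    return ok / len(prune_set)
print(len(back_select(rule, accuracy)))  # how many conditions survive
\end{verbatim}}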
Example \#2: effort estimation

NASA COCOMO data; results from Menzies et al., TSE 2006. Learners for continuous classes; many methods, including feature subset selection. Results, per subset of the COCOMO data (a sketch of the error measure follows the table):

\begin{tabular}{l|r|r|r}
 & \multicolumn{3}{c}{percentile of $\frac{pred-actual}{actual}$}\\\cline{2-4}
subset & 50\% & 65\% & 75\% \\\hline
mode= embedded & -9 & 26 & 60 \\
project= X & -6 & 16 & 46 \\
all & -4 & 12 & 31 \\
year= 1975 & -3 & 19 & 39 \\
mode= semidetached & -3 & 10 & 22 \\
fg= g & -3 & 11 & 29 \\
center= 5 & -3 & 20 & 50 \\
category= missionplanning & -1 & 25 & 50 \\
project= gro & -1 & 9 & 19 \\
center= 2 & 0 & 11 & 21 \\
year= 1980 & 4 & 29 & 58 \\
category= avionicsmonitoring & 6 & 32 & 56 \\\hline
mean & 0 & 23 & 48
\end{tabular}
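One way to compute such percentile-of-relative-error summaries, as a minimal sketch (the names and the percentile convention are mine):

{\scriptsize
\begin{verbatim}
def error_percentiles(pred, actual, qs=(50, 65, 75)):
    """Percentiles of relative error 100*(pred - actual)/actual."""
    errs = sorted(100 * (p - a) / a for p, a in zip(pred, actual))
    idx = lambda q: min(len(errs) - 1, int(q / 100 * len(errs)))
    return {q: errs[idx(q)] for q in qs}

# Toy usage: over-estimates show up as positive percentages,
# under-estimates as negative ones.
print(error_percentiles([100, 120, 80], [100, 100, 100]))
\end{verbatim}}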
Example \#3: early lifecycle defect detection

NASA defect data: 5 projects [Menzies et al., 2007].

\begin{tabular}{|r@{~=~}p{2.5cm}|r@{~=}p{4.1cm}|}\hline
\multicolumn{2}{|c|}{Derived}& \multicolumn{2}{|c|}{Raw features}\\\hline
co&Consequence& am&Artifact Maturity\\
dv&Development& as&Asset Safety\\
ep&Error Potential& cl&CMM Level\\
pr&Process& cx&Complexity\\
sc&Software Characteristic& di&Degree of Innovation\\
\multicolumn{2}{|c|}{~}&do&Development Organization\\
\multicolumn{2}{|c|}{~}&dt&Use of Defect Tracking System\\
\multicolumn{2}{|c|}{~}&ex&Experience\\
\multicolumn{2}{|c|}{~}&fr&Use of Formal Reviews\\
\multicolumn{2}{|c|}{~}&hs&Human Safety\\
\multicolumn{2}{|c|}{~}&pf&Performance\\
\multicolumn{2}{|c|}{~}&ra&Re-use Approach\\
\multicolumn{2}{|c|}{~}&rm&Use of Risk Management System\\
\multicolumn{2}{|c|}{~}&ss&Size of System\\
\multicolumn{2}{|c|}{~}&uc&Use of Configuration Management\\
\multicolumn{2}{|c|}{~}&us&Use of Standards\\\hline
\end{tabular}

The derived features are weighted sums of the raw features:

{\scriptsize
\begin{alltt}
function CO(    tmp) { tmp = 0.35*AS + 0.65*PF;
                       return round((HS > tmp) ? HS : tmp) }
function EP()        { return round(0.579*DV() + 0.249*PR() + 0.172*SC()) }
function SC()        { return 0.547*CX + 0.351*DI + 0.102*SS }
function DV()        { return 0.828*EX + 0.172*DO }
function PR()        { return 0.226*RA + 0.242*AM + formality() }
function formality() { return 0.0955*US + 0.0962*UC + 0.0764*CL +
                              0.1119*FR + 0.0873*DT + 0.0647*RM }
\end{alltt}}

Learner = RIPPER. Feature subset selection: exhaustive $2^F$ (see the sketch after this example).

\begin{center}
\scriptsize
\begin{tabular}{r|l@{}r|rrrr|rl@{~}r}
\multicolumn{3}{c}{~}&\multicolumn{4}{c}{$f$-measures (from \eq{F})}& \\\cline{4-7}
treatment & features & \#features & \_12 & \_3 & \_4 & \_5 & \multicolumn{2}{l}{$F=\left(\sum f\right)/4$}&\\\cline{1-9}
 & & & & & & & & &\\
A & all - L1 - L2 - group(6) & 8 & 0.97 & 0.95 & 0.97 & 0.99 & 0.97 & \rule{49mm}{2mm}&\\
B & all - L1 - L2 - group(5 + 6) & 7 & 0.95 & 0.94 & 0.97 & 0.96 & 0.96 & \rule{48mm}{2mm}&\\
C & all - L1 - L2 - group(4 + 5 + 6) & 6 & 0.93 & 0.95 & 0.98 & 0.93 & 0.95 & \rule{47mm}{2mm}&\\
D & all - L1 - L2 & 16 & 0.94 & 0.94 & 0.93 & 0.96 & 0.94 & \rule{47mm}{2mm}&\\
E & all - L1 - L2 - group(3 + 4 + 5 + 6) & 4 & 0.93 & 0.97 & 0.90 & 0.87 & 0.92 & \rule{46mm}{2mm}& $\Rightarrow$ see \fig{rxE}\\
%F & L1 + L2 & 2 & 0.96 & 0.89 & 0.97 & 0.70 & 0.88 & \rule{44mm}{2mm}&\\
F & \{co*ep, co, ep\} & 3 & 0.94 & 0.84 & 0.55 & 0.70 & 0.76 & \rule{38mm}{2mm}&\\
G & L1 & 1 & 0.67 & 0.69 & 0.00 & 0.46 & 0.45 & \rule{22mm}{2mm}&\\
H & just ``us'' & 1 & 0.64 & 0.60 & 0.00 & 0.00 & 0.31 & \rule{15mm}{2mm}&\\
I & L2 & 1 & 0.57 & 0.00 & 0.32 & 0.00 & 0.22 & \rule{11mm}{2mm}&
\end{tabular}
\end{center}

How often was each feature selected?

\begin{center}
\footnotesize
\begin{tabular}{rrlr}
 & & & number of times\\
group & feature & notes & selected\\\hline
1&us & use of standards & 10\\\hline
2&uc & config management & 9\\
 &ra & reuse approach & 9\\
 &am & artifact maturity & 9\\\hline
3&fr & formal reviews & 8\\
 &ex & experience & 8\\\hline
4&ss & size of system & 7\\\hline
5&rm & risk management & 6\\\hline
6&cl & CMM level & 5\\
 &dt & defect tracking & 5\\
 &do & development organization & 4\\
 &di & degree of innovation & 4\\
 &hs & human safety & 3\\
 &as & asset safety & 2\\
 &cx & complexity & 2\\
 &pf & performance & 1
\end{tabular}
\end{center}

The learned rules (feature codes as in the legend above):

\begin{center}
\begin{tabular}{@{rule~}rrl@{~then~}c}
1&if& $uc \ge 2 \wedge us = 1$ & \_5\\
2&else if& $am = 3$ & \_5\\
3&else if& $uc \ge 2 \wedge am = 1 \wedge us \le 2$ & \_5\\
4&else if& $am = 1 \wedge us = 2$ & \_4\\
5&else if& $us = 3 \wedge ra \ge 4$ & \_4\\
6&else if& $us = 1$ & \_3\\
7&else if& $ra = 3$ & \_3\\
8&else if& true & \_12
\end{tabular}
\end{center}
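Both this example and Example \#4 below select features by subsetting. A minimal sketch of the exhaustive $2^F$ variant, assuming scikit-learn is available (the decision-tree learner here is my stand-in for the target learner):

{\scriptsize
\begin{verbatim}
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def exhaustive_fss(X, y, names, cv=10):
    """Try every non-empty feature subset on a target learner; keep the
    best mean cross-val score. X is a numpy array, one column per feature.
    Only feasible for small F (2^F subsets); for larger F, use heuristic
    search (best first, STALE=5 [kohavi97])."""
    best = (-1.0, ())
    for k in range(1, len(names) + 1):
        for idx in combinations(range(len(names)), k):
            score = cross_val_score(DecisionTreeClassifier(),
                                    X[:, list(idx)], y, cv=cv).mean()
            best = max(best, (score, idx))
    score, idx = best
    return [names[i] for i in idx], score
\end{verbatim}}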
Example \#4: defect prediction from static code

Menzies et al., TSE 2007.

Prior state of the art:
\bi
\item the {\em IEEE Metrics 2002} panel~\cite{shu02}:
\bi
\item manual software reviews find ${\approx}60\%$ of defects;
\item Raffo: the defect detection capability of industrial review methods is $pd=TR(min, mode, max)$, from $pd=TR(35, 50, 65)\%$ for full Fagan inspections~\cite{fagan76} down to $pd=TR(13, 21, 30)\%$ for less-structured inspections;
\item data mining methods using static code measures (Halstead, McCabe, lines of code, misc.): prob\{detection, false alarm\} = \{36, 17\}\%.
\ei
\ei

Understanding the distributions added a lot: all the numeric distributions were logarithmic, so every numeric was filtered through $num = \log(num < 0.00001\;?\;0.00001 : num)$, as sketched below.
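That filter, as a one-liner sketch (the clamp at 0.00001 guards against $\log 0$; names mine):

{\scriptsize
\begin{verbatim}
import math

def log_filter(row):
    """Replace each numeric static-code measure with its log; values
    below 0.00001 are clamped so that log() stays defined."""
    return [math.log(max(num, 0.00001)) for num in row]

print(log_filter([120.0, 4.0, 0.0]))   # e.g. a row of [loc, v(g), ...]
\end{verbatim}}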
Feature subset selection with InfoGain:

\begin{center}
\begin{tabular}{rr|rr|c|c}
\multicolumn{2}{c}{~}& \multicolumn{2}{c}{\%}&selected attributes&selection\\\cline{3-4}
data & N & pd & pf & (see \fig{attrs})& method \\\hline
pc1 & 100 & 48 & 17 & 3, 35, 37 & exhaustive subsetting\\
mw1 & 100 & 52 & 15 & 23, 31, 35 & iterative subsetting\\
kc3 & 100 & 69 & 28 & 16, 24, 26 & iterative subsetting\\
cm1 & 100 & 71 & 27 & 5, 35, 36 & iterative subsetting\\
pc2 & 100 & 72 & 14 & 5, 39 & iterative subsetting\\
kc4 & 100 & 79 & 32 & 3, 13, 31 & iterative subsetting\\
pc3 & 100 & 80 & 35 & 1, 20, 37 & iterative subsetting\\
pc4 & 100 & 98 & 29 & 1, 4, 39 & iterative subsetting\\\hline
all & 800 & 71 & 25 &\multicolumn{2}{c}{}
\end{tabular}
\end{center}

(Here $pd$ = probability of detection and $pf$ = probability of false alarm.)

\begin{center}
\begin{tabular}{llll}
 & frequency & & \\
ID & in \fig{best} & what & type\\\hline
1 & 2 & loc\_blanks & locs\\
3 & 2 & call\_pairs & misc\\
4 & 1 & loc\_code\_and\_command & locs\\
5 & 2 & loc\_comments & locs\\
13 & 1 & edge\_count & misc\\
16 & 1 & loc\_executable & locs\\
20 & 1 & I & H (derived Halstead)\\
23 & 1 & B & H (derived Halstead)\\
24 & 1 & L & H (derived Halstead)\\
26 & 1 & T & H (derived Halstead)\\
31 & 2 & node\_count & misc\\
35 & 3 & $\mu_2$ & h (raw Halstead)\\
36 & 1 & $\mu_1$ & h (raw Halstead)\\
37 & 2 & number\_of\_lines & locs\\
39 & 2 & percent\_comments & misc
\end{tabular}
\end{center}

So: there is no ``best'' set of attributes, and no ``best'' model.

Yours ain't mine

General methods for learning useful theories (feature subset selection, RIPPER, etc.) transfer between sites; the specific theories learned are quite variable. Kitchenham et al.~\cite{kitch07} take great care to document this effect. They conducted a systematic review of ten projects, comparing estimates built from historical data {\em within} the same company against data {\em imported} from another. In no case was it better to use data from other sites, and sometimes importing such data yielded significantly worse estimates.

\begin{center}
\includegraphics[width=3.5in]{infogain.pdf}
\end{center}
InfoGain for the KC3 attributes, calculated from \eq{infogain}. Lines show means and t-bars show standard deviations after 10 trials on 90\% of the training data (randomly selected).

Other examples

Example \#5: defect prediction (NASA SEL)

Martin Shepperd, IEEE TSE 2004. NASA SEL defect data: more than 200 projects over 15 years. Predicting defects: accuracy is very high (over 95\%) and the false-negative rate is very low.

In summary...

Five NASA data sources:
Eg \#1: text mining a NASA issue database (PITS)
Eg \#2: effort estimation from NASA data (COCOMO)
Eg \#3: early lifecycle severity prediction (SILAP)
Eg \#4: defect prediction from NASA static code data (MDP)
Eg \#5: defect prediction (NASA SEL)
All of which yield strong predictors for quality (effort, defects). Only \underline{one} of which is still active: PITS.

Recommendations: Stop!

Stop debating what data to collect. Create a central register for all NASA's software components; it records just each component and which super-component it is ``part-of''. Report to some central location: all defect reports, plus COCOMO features on all sub-systems. All reports carry {\em no} project identifier, just an anonymous join key into the central register.

Stop debating how to store data. Comma-separated or XML files, one per component, are just fine (one possible layout is sketched below).

Stop hiding data. Make the anonymized project data open source; leverage the international data mining community.

Stop publishing general models. Rather, publish general methods for building specific models, and methods for selecting relevant data from other sources.
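To make that storage recommendation concrete, one possible layout (entirely hypothetical file and field names), with the register as the only place that can de-anonymize the join keys:

{\scriptsize
\begin{verbatim}
# register.csv: one row per component; the only file that knows the hierarchy
key,part_of
c1017,c0042
c0042,(top)

# c1017-defects.csv: defect reports, no project identifier, just the key
key,date,severity,text
c1017,2007-03-12,3,"timer overflow on restart"

# c1017-cocomo.csv: COCOMO features for the component
key,kloc,rely,cplx,acap
c1017,31,4,5,3
\end{verbatim}}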