Summary

Data mining NASA project data: five examples where data mining found clear quality predictors (effort, defects). In only \underline{one} of those cases is that data source still active. All that lost data. What to do?

Background

[figure: the Gartner hype cycle]

History:
1980s (early): AI summer.
1980s (late): bubble bursts, AI winter.
1990s: great success with planning, data mining.
2000s: still, some doubts re AI. But, lots of successes of AI (data mining) for SE.

This talk

This stuff works: five such success stories. The main problem with data mining is organizational, not technological: despite clear success, $\frac{4}{5}$ of those data sources have vanished. Easier to plan than to act.

What is data mining? Diamonds in the dust

Summarization: not 1000 records, but 3 rules.

Example \#1: text mining issue reports

[Menzies, 2007]: 901 NASA records from the PITS issue tracker: \{severity, free text\}.

\begin{tabular}{lcl}
severity & frequency & \\
1 (panic) & 0 & \\
2 & 311 & \rule{30mm}{2mm}\\
3 & 356 & \rule{35mm}{2mm}\\
4 & 208 & \rule{20mm}{2mm}\\
5 (yawn) & 26 & \rule{2mm}{2mm}
\end{tabular}

Output: all unique words, sorted by ``magic'' (see below); rules learned from the top N words.

10-way cross-val results for severity 2 (the $F$ column follows from $r$ and $p$; e.g., row one: $2 \cdot 0.81 \cdot 0.93 / (0.81+0.93) \approx 0.87$):

\begin{tabular}{lrrrl}
 & $r$ & $p$ & $F$ & \\
N words & recall (prob. detection) & precision & $\frac{2rp}{r+p}$ & \\\hline
100 & 0.81 & 0.93 & 0.87 & \rule{43mm}{2mm}\\
50 & 0.80 & 0.90 & 0.85 & \rule{42mm}{2mm}\\
25 & 0.79 & 0.92 & 0.85 & \rule{42mm}{2mm}\\
12 & 0.74 & 0.92 & 0.82 & \rule{41mm}{2mm}\\
6 & 0.71 & 0.94 & 0.81 & \rule{40mm}{2mm}\\
3 & 0.74 & 0.82 & 0.78 & \rule{38mm}{2mm}
\end{tabular}

{\scriptsize
\begin{verbatim}
if      (rvm <= 0) and (srs = 3) => severity=4 (52.0/21.0)
else if (srs >= 2)               => severity=2 (289.0/54.0)
else                             => severity=3 (558.0/246.0)
\end{verbatim}}

Diamonds in the dust: not 9414 words total, or 1662 unique words, but the 3 most predictive.

Methods

Too numerous to list here.

Use, don't dictate. Basili et al.'s ``learning organizations'' carefully design their data collection. Me? I chase the meat truck: use whatever data is around, given the savage pace of organizational change.

Less is more: pruning features is a good idea.

Yours ain't mine: it is always best to relearn with local data.

Less is more (1)

Feature subset selection (FSS): given $y = \beta_0 + \beta_1 f_1 + \beta_2 f_2 + \beta_3 f_3 + \ldots$, variance in $y$ is reduced by pruning some of the $f_i$. But don't prune too much: prune every $f_i$ and all that remains is $y=\beta_0$.

Apply domain knowledge. E.g., in text mining, TF*IDF ({\em term frequency $\times$ inverse document frequency}) selects frequent terms that appear in only a small number of documents:
\[ TFIDF = F[i,j] \times \log(IDF) \]
where $F[i,j]$ is the frequency of word $i$ in document $j$, and $IDF = \#documents/(\#documents\;containing\;i)$. The above study used the top 100 TF*IDF words.

Exponential-time FSS: try all $2^F$ subsets of the $F$ features on a {\em target learner}. Possible, with small feature sets, with some heuristic search (best first, STALE=5)~\cite{kohavi97}.

Linear-time FSS (not as thorough as $2^F$): sort the $F$ features, somehow; then try the first $1\le f\le F$ features. The above study sorted the top 100 TF*IDF terms on InfoGain. Initially, $H(C) = -\sum_{c\in C}p(c)\log_2 p(c)$. After seeing feature $f$, $H(C|f) = -\sum_{x\in f}p(x)\sum_{c\in C}p(c|x)\log_2 p(c|x)$. So $InfoGain = H(C) - H(C|f)$. (A sketch of this pipeline appears below.)

Less is more (2)

Overfitting avoidance: after learning ``it''... try throwing away bits of ``it''. E.g., RIPPER~\cite{cohen95a}: if you learn a conjunction, prune it with greedy back-select; if you learn a set of rules, prune the set with greedy back-select; for the surviving rules, try replacing each one with a dumb alternative or a carefully selected modification. Very fast: $O(m(\log m)^2)$ for $m$ examples. Often produces smaller theories than other methods (e.g., see above).
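To make the ``Less is more (1)'' recipe concrete, here is a minimal sketch (mine, not the code from the PITS study; all names are illustrative) that scores words by TF*IDF, keeps the top 100, then re-ranks those by InfoGain:

{\scriptsize
\begin{verbatim}
import math
from collections import Counter

def tfidf(docs):
    """Score each word: total frequency * log(#docs / #docs containing it)."""
    n, tf, df = len(docs), Counter(), Counter()
    for doc in docs:
        tf.update(doc)            # term frequencies
        df.update(set(doc))       # document frequencies
    return {w: tf[w] * math.log(n / df[w]) for w in tf}

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def infogain(word, docs, labels):
    """H(C) minus the expected entropy after splitting on word presence."""
    groups = {}
    for doc, lab in zip(docs, labels):
        groups.setdefault(word in doc, []).append(lab)
    after = sum(len(g)/len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

# Toy usage: docs are token lists, labels are severities.
docs = [["crash", "panic", "timer"], ["typo", "docs"], ["crash", "timer"]]
labels = [2, 4, 2]
scores = tfidf(docs)
top100 = sorted(scores, key=scores.get, reverse=True)[:100]
ranked = sorted(top100, key=lambda w: infogain(w, docs, labels), reverse=True)
print(ranked[:3])   # the few most predictive words
\end{verbatim}}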
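And a minimal sketch of the greedy back-select idea behind RIPPER's pruning step (again illustrative, not Cohen's implementation): keep deleting the conjunct whose removal does not hurt the score on a pruning set.

{\scriptsize
\begin{verbatim}
def back_select(rule, score):
    """Greedy back-select: drop conditions from a conjunction while the
    score (e.g., accuracy on a held-out pruning set) does not decrease."""
    best, best_score = list(rule), score(rule)
    improved = True
    while improved and best:
        improved = False
        for i in range(len(best)):
            shorter = best[:i] + best[i+1:]
            if score(shorter) >= best_score:   # prefer shorter rules on ties
                best, best_score, improved = shorter, score(shorter), True
                break
    return best

# Toy usage: conditions are predicates over a record; score is accuracy
# of "all conditions hold => severity=2" on a tiny pruning set.
rule = [lambda r: r["srs"] >= 2, lambda r: r["rvm"] <= 0]
prune_set = [({"srs": 3, "rvm": 0}, 2), ({"srs": 1, "rvm": 1}, 3)]
def accuracy(rule):
    ok = sum(all(c(r) for c in rule) == (sev == 2) for r, sev in prune_set)
    return ok / len(prune_set)
print(len(back_select(rule, accuracy)))  # how many conditions survive
\end{verbatim}}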
Example \#2: effort estimation

NASA COCOMO data; results from Menzies et al., TSE 2006. Learners for continuous classes; many methods, including feature subset selection. Results, per subset of the COCOMO data (a sketch of the error measure follows the table):

\begin{tabular}{l|r|r|r}
 & \multicolumn{3}{c}{percentile of $\frac{pred-actual}{actual}$}\\\cline{2-4}
subset & 50\% & 65\% & 75\% \\\hline
mode= embedded & -9 & 26 & 60 \\
project= X & -6 & 16 & 46 \\
all & -4 & 12 & 31 \\
year= 1975 & -3 & 19 & 39 \\
mode= semidetached & -3 & 10 & 22 \\
fg= g & -3 & 11 & 29 \\
center= 5 & -3 & 20 & 50 \\
category= missionplanning & -1 & 25 & 50 \\
project= gro & -1 & 9 & 19 \\
center= 2 & 0 & 11 & 21 \\
year= 1980 & 4 & 29 & 58 \\
category= avionicsmonitoring & 6 & 32 & 56 \\\hline
mean & 0 & 23 & 48
\end{tabular}
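One way to compute such percentile-of-relative-error summaries, as a minimal sketch (the names and the percentile convention are mine):

{\scriptsize
\begin{verbatim}
def error_percentiles(pred, actual, qs=(50, 65, 75)):
    """Percentiles of relative error 100*(pred - actual)/actual."""
    errs = sorted(100 * (p - a) / a for p, a in zip(pred, actual))
    idx = lambda q: min(len(errs) - 1, int(q / 100 * len(errs)))
    return {q: errs[idx(q)] for q in qs}

# Toy usage: over-estimates show up as positive percentages,
# under-estimates as negative ones.
print(error_percentiles([100, 120, 80], [100, 100, 100]))
\end{verbatim}}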
Example \#3: early lifecycle defect detection

NASA defect data: 5 projects [Menzies et al., 2007].

\begin{tabular}{|r@{~=~}p{2.5cm}|r@{~=}p{4.1cm}|}\hline
\multicolumn{2}{|c|}{Derived}& \multicolumn{2}{|c|}{Raw features}\\\hline
co&Consequence& am&Artifact Maturity\\
dv&Development& as&Asset Safety\\
ep&Error Potential& cl&CMM Level\\
pr&Process& cx&Complexity\\
sc&Software Characteristic& di&Degree of Innovation\\
\multicolumn{2}{|c|}{~}&do&Development Organization\\
\multicolumn{2}{|c|}{~}&dt&Use of Defect Tracking System\\
\multicolumn{2}{|c|}{~}&ex&Experience\\
\multicolumn{2}{|c|}{~}&fr&Use of Formal Reviews\\
\multicolumn{2}{|c|}{~}&hs&Human Safety\\
\multicolumn{2}{|c|}{~}&pf&Performance\\
\multicolumn{2}{|c|}{~}&ra&Re-use Approach\\
\multicolumn{2}{|c|}{~}&rm&Use of Risk Management System\\
\multicolumn{2}{|c|}{~}&ss&Size of System\\
\multicolumn{2}{|c|}{~}&uc&Use of Configuration Management\\
\multicolumn{2}{|c|}{~}&us&Use of Standards\\\hline
\end{tabular}

The derived features are weighted sums of the raw features:

{\scriptsize
\begin{alltt}
function CO(    tmp) { tmp = 0.35*AS + 0.65*PF;
                       return round((HS > tmp) ? HS : tmp) }
function EP()        { return round(0.579*DV() + 0.249*PR() + 0.172*SC()) }
function SC()        { return 0.547*CX + 0.351*DI + 0.102*SS }
function DV()        { return 0.828*EX + 0.172*DO }
function PR()        { return 0.226*RA + 0.242*AM + formality() }
function formality() { return 0.0955*US + 0.0962*UC + 0.0764*CL +
                              0.1119*FR + 0.0873*DT + 0.0647*RM }
\end{alltt}}

Learner = RIPPER. Feature subset selection: exhaustive $2^F$ (see the sketch after this example).

\begin{center}
\scriptsize
\begin{tabular}{r|l@{}r|rrrr|rl@{~}r}
\multicolumn{3}{c}{~}&\multicolumn{4}{c}{$f$-measures (from \eq{F})}& \\\cline{4-7}
treatment & features & \#features & \_12 & \_3 & \_4 & \_5 & \multicolumn{2}{l}{$F=\left(\sum f\right)/4$}&\\\cline{1-9}
 & & & & & & & & &\\
A & all - L1 - L2 - group(6) & 8 & 0.97 & 0.95 & 0.97 & 0.99 & 0.97 & \rule{49mm}{2mm}&\\
B & all - L1 - L2 - group(5 + 6) & 7 & 0.95 & 0.94 & 0.97 & 0.96 & 0.96 & \rule{48mm}{2mm}&\\
C & all - L1 - L2 - group(4 + 5 + 6) & 6 & 0.93 & 0.95 & 0.98 & 0.93 & 0.95 & \rule{47mm}{2mm}&\\
D & all - L1 - L2 & 16 & 0.94 & 0.94 & 0.93 & 0.96 & 0.94 & \rule{47mm}{2mm}&\\
E & all - L1 - L2 - group(3 + 4 + 5 + 6) & 4 & 0.93 & 0.97 & 0.90 & 0.87 & 0.92 & \rule{46mm}{2mm}& $\Rightarrow$ see \fig{rxE}\\
%F & L1 + L2 & 2 & 0.96 & 0.89 & 0.97 & 0.70 & 0.88 & \rule{44mm}{2mm}&\\
F & \{co*ep, co, ep\} & 3 & 0.94 & 0.84 & 0.55 & 0.70 & 0.76 & \rule{38mm}{2mm}&\\
G & L1 & 1 & 0.67 & 0.69 & 0.00 & 0.46 & 0.45 & \rule{22mm}{2mm}&\\
H & just ``us'' & 1 & 0.64 & 0.60 & 0.00 & 0.00 & 0.31 & \rule{15mm}{2mm}&\\
I & L2 & 1 & 0.57 & 0.00 & 0.32 & 0.00 & 0.22 & \rule{11mm}{2mm}&
\end{tabular}
\end{center}

How often was each feature selected?

\begin{center}
\footnotesize
\begin{tabular}{rrlr}
 & & & number of times\\
group & feature & notes & selected\\\hline
1&us & use of standards & 10\\\hline
2&uc & config management & 9\\
 &ra & reuse approach & 9\\
 &am & artifact maturity & 9\\\hline
3&fr & formal reviews & 8\\
 &ex & experience & 8\\\hline
4&ss & size of system & 7\\\hline
5&rm & risk management & 6\\\hline
6&cl & CMM level & 5\\
 &dt & defect tracking & 5\\
 &do & development organization & 4\\
 &di & degree of innovation & 4\\
 &hs & human safety & 3\\
 &as & asset safety & 2\\
 &cx & complexity & 2\\
 &pf & performance & 1
\end{tabular}
\end{center}

The learned rules (feature codes as in the legend above):

\begin{center}
\begin{tabular}{@{rule~}rrl@{~then~}c}
1&if& $uc \ge 2 \wedge us = 1$ & \_5\\
2&else if& $am = 3$ & \_5\\
3&else if& $uc \ge 2 \wedge am = 1 \wedge us \le 2$ & \_5\\
4&else if& $am = 1 \wedge us = 2$ & \_4\\
5&else if& $us = 3 \wedge ra \ge 4$ & \_4\\
6&else if& $us = 1$ & \_3\\
7&else if& $ra = 3$ & \_3\\
8&else if& true & \_12
\end{tabular}
\end{center}
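Both this example and Example \#4 below select features by subsetting. A minimal sketch of the exhaustive $2^F$ variant, assuming scikit-learn is available (the decision-tree learner here is my stand-in for the target learner):

{\scriptsize
\begin{verbatim}
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def exhaustive_fss(X, y, names, cv=10):
    """Try every non-empty feature subset on a target learner; keep the
    best mean cross-val score. X is a numpy array, one column per feature.
    Only feasible for small F (2^F subsets); for larger F, use heuristic
    search (best first, STALE=5 [kohavi97])."""
    best = (-1.0, ())
    for k in range(1, len(names) + 1):
        for idx in combinations(range(len(names)), k):
            score = cross_val_score(DecisionTreeClassifier(),
                                    X[:, list(idx)], y, cv=cv).mean()
            best = max(best, (score, idx))
    score, idx = best
    return [names[i] for i in idx], score
\end{verbatim}}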
Example \#4: defect prediction from static code

Menzies et al., TSE 2007.

Prior state of the art:
\bi
\item the {\em IEEE Metrics 2002} panel~\cite{shu02}:
\bi
\item manual software reviews find ${\approx}60\%$ of defects;
\item Raffo: the defect detection capability of industrial review methods is $pd=TR(min, mode, max)$, from $pd=TR(35, 50, 65)\%$ for full Fagan inspections~\cite{fagan76} down to $pd=TR(13, 21, 30)\%$ for less-structured inspections;
\item data mining methods using static code measures (Halstead, McCabe, lines of code, misc.): prob\{detection, false alarm\} = \{36, 17\}\%.
\ei
\ei

Understanding the distributions added a lot: all the numeric distributions were logarithmic, so every numeric was filtered through $num = \log(num < 0.00001\;?\;0.00001 : num)$, as sketched below.
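That filter, as a one-liner sketch (the clamp at 0.00001 guards against $\log 0$; names mine):

{\scriptsize
\begin{verbatim}
import math

def log_filter(row):
    """Replace each numeric static-code measure with its log; values
    below 0.00001 are clamped so that log() stays defined."""
    return [math.log(max(num, 0.00001)) for num in row]

print(log_filter([120.0, 4.0, 0.0]))   # e.g. a row of [loc, v(g), ...]
\end{verbatim}}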
Feature subset selection with InfoGain:

\begin{center}
\begin{tabular}{rr|rr|c|c}
\multicolumn{2}{c}{~}& \multicolumn{2}{c}{\%}&selected attributes&selection\\\cline{3-4}
data & N & pd & pf & (see \fig{attrs})& method \\\hline
pc1 & 100 & 48 & 17 & 3, 35, 37 & exhaustive subsetting\\
mw1 & 100 & 52 & 15 & 23, 31, 35 & iterative subsetting\\
kc3 & 100 & 69 & 28 & 16, 24, 26 & iterative subsetting\\
cm1 & 100 & 71 & 27 & 5, 35, 36 & iterative subsetting\\
pc2 & 100 & 72 & 14 & 5, 39 & iterative subsetting\\
kc4 & 100 & 79 & 32 & 3, 13, 31 & iterative subsetting\\
pc3 & 100 & 80 & 35 & 1, 20, 37 & iterative subsetting\\
pc4 & 100 & 98 & 29 & 1, 4, 39 & iterative subsetting\\\hline
all & 800 & 71 & 25 &\multicolumn{2}{c}{}
\end{tabular}
\end{center}

(Here $pd$ = probability of detection and $pf$ = probability of false alarm.)

\begin{center}
\begin{tabular}{llll}
 & frequency & & \\
ID & in \fig{best} & what & type\\\hline
1 & 2 & loc\_blanks & locs\\
3 & 2 & call\_pairs & misc\\
4 & 1 & loc\_code\_and\_command & locs\\
5 & 2 & loc\_comments & locs\\
13 & 1 & edge\_count & misc\\
16 & 1 & loc\_executable & locs\\
20 & 1 & I & H (derived Halstead)\\
23 & 1 & B & H (derived Halstead)\\
24 & 1 & L & H (derived Halstead)\\
26 & 1 & T & H (derived Halstead)\\
31 & 2 & node\_count & misc\\
35 & 3 & $\mu_2$ & h (raw Halstead)\\
36 & 1 & $\mu_1$ & h (raw Halstead)\\
37 & 2 & number\_of\_lines & locs\\
39 & 2 & percent\_comments & misc
\end{tabular}
\end{center}

So: there is no ``best'' set of attributes, and no ``best'' model.

Yours ain't mine

General methods for learning useful theories (feature subset selection, RIPPER, etc.) transfer between sites; the specific theories learned are quite variable. Kitchenham et al.~\cite{kitch07} take great care to document this effect. They conducted a systematic review of ten projects, comparing estimates built from historical data {\em within} the same company against data {\em imported} from another. In no case was it better to use data from other sites, and sometimes importing such data yielded significantly worse estimates.

\begin{center}
\includegraphics[width=3.5in]{infogain.pdf}
\end{center}
InfoGain for the KC3 attributes, calculated from \eq{infogain}. Lines show means and t-bars show standard deviations after 10 trials on 90\% of the training data (randomly selected).

Other examples

Example \#5: defect prediction (NASA SEL)

Martin Shepperd, IEEE TSE 2004. NASA SEL defect data: more than 200 projects over 15 years. Predicting defects: accuracy is very high (over 95\%) and the false-negative rate is very low.

In summary...

Five NASA data sources:
Eg \#1: text mining a NASA issue database (PITS)
Eg \#2: effort estimation from NASA data (COCOMO)
Eg \#3: early lifecycle severity prediction (SILAP)
Eg \#4: defect prediction from NASA static code data (MDP)
Eg \#5: defect prediction (NASA SEL)
All of which yield strong predictors for quality (effort, defects). Only \underline{one} of which is still active: PITS.

Recommendations: Stop!

Stop debating what data to collect. Create a central register for all NASA's software components; it records just each component and which super-component it is ``part-of''. Report to some central location: all defect reports, plus COCOMO features on all sub-systems. All reports carry {\em no} project identifier, just an anonymous join key into the central register.

Stop debating how to store data. Comma-separated or XML files, one per component, are just fine (one possible layout is sketched below).

Stop hiding data. Make the anonymized project data open source; leverage the international data mining community.

Stop publishing general models. Rather, publish general methods for building specific models, and methods for selecting relevant data from other sources.
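To make that storage recommendation concrete, one possible layout (entirely hypothetical file and field names), with the register as the only place that can de-anonymize the join keys:

{\scriptsize
\begin{verbatim}
# register.csv: one row per component; the only file that knows the hierarchy
key,part_of
c1017,c0042
c0042,(top)

# c1017-defects.csv: defect reports, no project identifier, just the key
key,date,severity,text
c1017,2007-03-12,3,"timer overflow on restart"

# c1017-cocomo.csv: COCOMO features for the component
key,kloc,rely,cplx,acap
c1017,31,4,5,3
\end{verbatim}}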