\section{Adding Business Knowledge}

\PARstart{I}f limited information content is the problem, then the solution is clear: give the data miner more information. However, as shown above, it is {\em not} useful to use more examples of the {\em same} features (e.g. more static code features). Rather, we need to add {\em more} information of {\em different} kinds. One different kind of knowledge, not found in the training data of \fig{data}, is the criterion $C_i$ by which a learner will be assessed. This criterion can change according to the business application. For example:
\bi
\item In $application_1$, a quality assurance (QA) team has insufficient budget to inspect all the code. Therefore, they need some {\em sorting} policy that increases the frequency of defective modules among those ranked high in the sort. In this application, the QA team wants to {\em minimize} their inspection effort while {\em maximizing} the number of defective modules they discover.
\item In $application_2$, the manager of a new contract between a V\&V consultancy and a client may feel the need to impress the client. In that case, she may direct her engineers to scan all the code looking for a single high-severity error. The priorities of $application_2$ differ from those of $application_1$: the former is not overly concerned with high false alarm rates, while the latter is not overly concerned with high code coverage.
\item In $application_3$, a company is maintaining a large existing code base. In this situation, there may be no backlog of modules to inspect, so there may be no need for data miners to prioritize the inspection effort.
\ei
In theory, data miners could use the evaluation criterion $C_i$ to guide their search for better theories. In practice, this is often not the case. Typically, a learner grows its predictive model using some hard-wired criterion (for a comprehensive survey on the range of criteria, see~\cite{furnkranz05}). Worse still, the evaluation criterion $C_1$ used by a learner during training is often different from the evaluation criterion $C_2$ used when studying test data. For example:
\bi
\item[$C_1$]: During {\em training}, a decision-tree learner may stop branching if the diversity of the instances in a leaf of a branch\footnote{For numeric classes, this diversity measure might be the standard deviation of the class feature. For discrete classes, the diversity measure might be the entropy measure used in J48~\cite{witten05}.} falls below some heuristic threshold.
\item[$C_2$]: During {\em testing}, the learned decision tree might be tested on any of the criteria shown in \fig{measures}.
\ei
By this line of reasoning, we have an explanation for the ceiling effect described above: {\em our learners are blind to their purpose}. If some business application demands $C_i$ and our learners train on $C_j$ ($i\not=j$), then it is hardly surprising that they cannot find ways to improve their performance. Therefore, we need a new kind of learner: one where some user-specified criterion is applied during training {\em and} testing.
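To make these competing criteria concrete, here is a minimal sketch (our own illustration, in Python; it is not code from any learner studied in this paper) that scores one predictor's confusion matrix under two of the applications above. The names {\tt measures}, {\tt c1}, and {\tt c2} are ours, and inspection effort is crudely approximated as the fraction of modules flagged for inspection, assuming equal-sized modules:
\begin{verbatim}
# A sketch, not production code: scoring one predictor under
# different business criteria C_i (assumes a non-degenerate
# confusion matrix: tp+fn > 0, fp+tn > 0, tp+fp > 0).
def measures(tp, fp, tn, fn):
    return dict(
        pd     = tp / (tp + fn),             # prob. of detection
        pf     = fp / (fp + tn),             # prob. of false alarm
        prec   = tp / (tp + fp),             # precision
        effort = (tp + fp) / (tp+fp+tn+fn))  # fraction inspected

# application_1: find many defects for little inspection effort
def c1(m): return m["pd"] - m["effort"]

# application_2: reward detection; ignore false alarms and effort
def c2(m): return m["pd"]

# Example: flagging half the code to catch 2/3 of the defects
# scores well for application_2 but poorly for application_1.
m = measures(tp=20, fp=30, tn=40, fn=10)
print(c1(m), c2(m))   # approx. 0.17 vs 0.67
\end{verbatim}
Since each $C_i$ maps the same confusion matrix to a different score, a predictor tuned for $application_1$ need not score well for $application_2$, and vice versa.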
Some learning schemes support some biasing; for example:
\bi
\item The {\em cost-sensitive learners} discussed by Elkan~\cite{elkan01};
\item The {\em ROC ensembles} discussed by Fawcett~\cite{fawcett01}, where the conclusion is computed from some summation of the conclusions of the ensemble of ROC curves\footnote{ROC = receiver-operator characteristic curves, such as graphs of PD-vs-PF or PD-vs-precision.}, proportionally weighted to yield a new learner that approximates $C$.
\ei
At best, such biasing is only an indirect control of the hard-wired criteria. If the underlying criterion used to guide the search is orthogonal to the success criterion of, say, $application_1$, then no amount of cost-sensitive learning or ensemble combination will result in a learner that supports that business application.

The rest of this paper tests the speculation that a learner that uses the {\em same} evaluation criterion during training {\em and} testing can break the performance ceiling. Our experiments will compare manual methods, some standard data miners, and a new learner called WHICH~\cite{milton08}. WHICH is described in detail in Appendix D. The following features of its design are most relevant to our current discussion:
\bi
\item WHICH is a very simple learner that just loops over the space of possible feature ranges, trying various combinations (a sketch of this loop is given after this list).
\item The evaluation criterion $C_i$ is not applied in a post-processing step. Rather, it is wired into the inner loop of WHICH's learning: all the current combinations are sorted by the evaluation criterion $C_i$, and highly ranked combinations are more likely to be picked for future combinations.
\item WHICH's evaluation criterion can be customized. The current implementation offers ten different $C_i$, including all the ones listed in \fig{measures}, and it would be trivial to add other criteria.
\item For this study, we disabled all of WHICH's ``smarts'' {\em except} $C_i$. For example, WHICH can optimize each combination via back-selects that discard superfluous details, and it can perform heuristic pruning to speed up inference; all such optimizations and prunings were disabled here.
\item That is, apart from $C_i$, WHICH has no ``smarts'': it is {\em only} an evaluation criterion applied during learning.
\ei
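To illustrate that design, here is a minimal sketch of a WHICH-style inner loop (our own simplification, in Python; see Appendix D for the actual algorithm). The name {\tt which}, the fixed {\tt lives} budget, and the treatment of a rule as a plain set of feature ranges are simplifying assumptions of this sketch; what matters is that the user-supplied criterion {\tt score} (some $C_i$) drives the sort inside the loop:
\begin{verbatim}
import random

# A sketch of a WHICH-style loop: the user-supplied criterion
# "score" (some C_i) is applied INSIDE learning, not afterwards.
def which(ranges, score, lives=100):
    # One candidate rule per feature range (e.g. "loc>100"),
    # kept sorted best-first by the evaluation criterion C_i.
    stack = sorted([[r] for r in ranges], key=score, reverse=True)
    while lives > 0:
        lives -= 1
        # Pick two rules, biased toward the top of the sort:
        # random()**2 skews the chosen index toward zero.
        a = stack[int(random.random()**2 * len(stack))]
        b = stack[int(random.random()**2 * len(stack))]
        new = sorted(set(a + b))  # combine their feature ranges
        if new not in stack:
            stack = sorted(stack + [new], key=score, reverse=True)
    return stack[0]               # best-scoring rule found
\end{verbatim}
Swapping one $C_i$ for another changes nothing but the {\tt score} argument. This is the sense in which WHICH is {\em only} an evaluation criterion applied during learning.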
\input{application}