\documentclass[10pt]{article}
\usepackage{rotating}
\usepackage{lineno,times,graphicx}
\usepackage{url}
\usepackage{alltt}
\title{\vspace{1cm}\includegraphics[width=2in]{HAMLETSkullHCSealous.jpg}\\\vspace{1cm} The HAMLET Project: Progress Report\vspace{1cm}}
\author{Tim Menzies\\LCSEE, WVU\\\url{tim@menzies.us}}
\date{}
\newcommand{\be}{\begin{enumerate}}
\newcommand{\ee}{\end{enumerate}}
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
\newcommand{\eq}[1]{Equation~\ref{eq:#1}}
\newcommand{\fig}[1]{Figure~\ref{fig:#1}}
\usepackage{lastpage,fancyheadings}
\footrulewidth 0.0pt\headrulewidth 0.1pt
\lhead{}\rhead{}\chead{}\rfoot{}\lfoot{}\cfoot{}
\rhead{{\scriptsize \today}}
\lhead{\includegraphics[width=6cm]{wvulogo.jpg}}
\usepackage{eso-pic}
\usepackage{color}
\usepackage{type1cm}
\makeatletter
\AddToShipoutPicture{%
\setlength{\@tempdimb}{.5\paperwidth}%
\setlength{\@tempdimc}{.5\paperheight}%
\setlength{\unitlength}{1pt}%
\makebox(600,900){\rotatebox{45}{\textcolor[gray]{0.9}{\fontsize{3cm}{3cm}\selectfont{Not for release}}}}}
\makeatother
\begin{document}
\linenumbers
\maketitle
\thispagestyle{fancy}
\pagestyle{fancy}
\vspace{1cm}{\bf About this document:} This report is the first in a series of three documents:
\be
\item This report describes progress on the HAMLET ``anything browser''. It is an interim internal document between WVU and the National Archives and is not intended for wider distribution. Report 3, listed below, will update this document and will be suitable for broad release.
\item Report 2, due August 30, will describe the difference between the current version of HAMLET and our proposed new work.
\item Report 3, due September 30, will offer detailed technical information on HAMLET.
\ee
\newpage
\rhead{\thepage\ of \pageref{LastPage}}
\lhead{Menzies: HAMLET - progress report}
\tableofcontents
\newpage
\listoffigures
\newpage
\section{Summary}
HAMLET is an ``anything browser'' that aims to offer advice on what engineering parts are relevant to a partially complete current design. HAMLET makes minimal assumptions about the nature of the engineering documents being explored and is designed to be ``glue'' that permits the searching of technical data in large heterogeneous collections. Using technology from AI, information retrieval, program comprehension, and text mining, HAMLET allows designers to dig up prior designs, study those designs, then apply insights from those historical designs to new tasks.
\begin{figure}[!b]
\begin{center}
\includegraphics[width=5in]{archeology.jpg}
\end{center}
\caption{Digging up old designs. From \protect\cite{booch04}.}
\end{figure}
\newpage
\section{Why HAMLET? (motivation)}
A recent NSF-funded workshop\footnote{ Collaborative Expedition Workshop \#74, June 10, 2008, at NSF. ``Overcoming I/O Bottlenecks in Full Data Path Processing: Intelligent, Scalable Data Management from Data Ingest to Computation Enabling Access and Discovery''. \url{http://colab.cim3.net/cgi-bin/wiki.pl?ExpeditionWorkshop/TowardScalableDataManagement_2008_06_10} } highlighted current directions in long term technical document retention. While much progress was reported on:
\bi
\item systems issues of handling and sharing very large data collections (e.g. SLASH);
\item scalable methods of building customized views (e.g. iRODS),
\ei
there was little mention of the cognitive issues of how users might browse and synthesize data from massive data collections of technical documents.
For example, here at WVU, we are mid-way through a review of the use of STEP/EXPRESS for long term technical document retention\footnote{See reports from Mucino.}. STEP/EXPRESS is commonly used as an inter-lingua to transfer technical data between CAD/CAM packages. Strange to say, while STEP/EXPRESS is useful for transferring and understanding technical documents {\em today}, it does not appear to be suitable for understanding technical documents from {\em yesterday}. In theory, there is nothing stopping STEP/EXPRESS from recording and storing all aspects of a project. In many ways, STEP/EXPRESS is as expressive as other technical document standards (e.g. UML). STEP/EXPRESS offers a generic method for storing part-of and isa information, constraints, types, and the rules associated with a technical document. However, in practice, the theoretical potential of STEP/EXPRESS is not realized for the following reasons.
\subsection{Heterogeneity}
The reality of archival systems is that STEP/EXPRESS documents are stored {\em alongside} a much larger set of supporting documents in multiple formats. A recent study\footnote{\url{http://www.b-eye-network.com/view/2098}} concluded that:
\bi
\item 80 percent of business is conducted on unstructured information;
\item 85 percent of all data stored is held in an unstructured format (e.g. the unstructured text descriptions of issues found in PITS);
\item unstructured data doubles every three months.
\ei
That is, if we could learn how to understand large heterogeneous collections that include STEP/EXPRESS knowledge as well as numerous other products in a wide variety of formats, it would be possible to reason and learn from a very wide range of data.
\subsection{Incomplete meta-knowledge}
Much work has focused on the creation of cliched sets of EXPRESS schemas. Forty such {\em application protocols} (APs) have been defined~\cite{jardim06}, including AP-203 (for geometry) and AP-213 (for numerical control). The list of currently defined application protocols is very extensive (see \fig{aps}). These APs are the cornerstone of STEP tools: the tools offer specialized support screens and import/export facilities for the APs. While much effort went into their creation, very few of these APs have been stress-tested in the field within information systems. That is, the majority of these APs have been {\em written} more than they have been {\em read} (exceptions: the above-mentioned AP-203 and AP-213 are frequently used and reused in $21^{st}$ century CAD/CAM manufacturing processes).
\subsection{Incomplete tool support}
Perhaps because of the relative immaturity of the APs, current CAD/CAM tools offer limited support for the STEP APs. While most tools support geometry (AP-203), the support for the other APs in \fig{aps} is minimal (to say the least).
\subsection{Incomplete design rationale support}
From a cognitive perspective, STEP/EXPRESS does not support the entire design cycle. Rather, it only supports the last stages of design and not all the interim steps along the way. For more on this issue, see the appendix.
\subsection{Limited Historical Use}
For all the above reasons, highly structured technical documents in formats like STEP/EXPRESS are in the minority in the archival systems we have examined. We are aware of large STEP/EXPRESS repositories but these are often inaccessible for a variety of reasons. While this situation might change in the future (e.g.
if all the above issues were suddenly fixed and all organizations switched to using highly structured technical documentation), the historical record would still be starved for large numbers of examples.
\begin{figure}
\begin{center}
\begin{tabular}{ll}
AP & area\\\hline
201 & Explicit Drafting\\
202 & Associative Drafting\\
203 & Configuration Controlled Design\\
204 & Mechanical Design Using Boundary Representation\\
205 & Mechanical Design Using Surface Representation\\
206 & Mechanical Design Using Wireframe Representation\\
207 & Sheet Metal Dies and Blocks\\
208 & Life Cycle Product Change Process\\
209 & Design Through Analysis of Composite and Metallic Structures\\
210 & Electronic Printed Circuit Assembly, Design and Manufacturing\\
211 & Electronics Test Diagnostics and Remanufacture\\
212 & Electrotechnical Plants\\
213 & Numerical Control Process Plans for Machined Parts\\
214 & Core Data for Automotive Mechanical Design Processes\\
215 & Ship Arrangement\\
216 & Ship Molded Forms\\
217 & Ship Piping\\
218 & Ship Structures\\
219 & Dimensional Inspection Process Planning for CMMs\\
220 & Printed Circuit Assembly Manufacturing Planning\\
221 & Functional Data and Schematic Representation for Process Plans\\
222 & Design Engineering to Manufacturing for Composite Structures\\
223 & Exchange of Design and Manufacturing DPD for Composites\\
224 & Mechanical Product Definition for Process Planning\\
225 & Structural Building Elements Using Explicit Shape Rep\\
226 & Shipbuilding Mechanical Systems\\
227 & Plant Spatial Configuration\\
228 & Building Services\\
229 & Design and Manufacturing Information for Forged Parts\\
230 & Building Structure frame steelwork\\
231 & Process Engineering Data\\
232 & Technical Data Packaging\\
233 & Systems Engineering Data Representation\\
234 & Ship Operational logs, records and messages\\
235 & Materials Information for products\\
236 & Furniture product and project\\
237 & Computational Fluid Dynamics\\
238 & Integrated CNC Machining\\
239 & Product Life Cycle Support\\
240 & Process Planning\\
\end{tabular}
\end{center}
\caption{STEP Application Protocols.}\label{fig:aps}
\end{figure}
\newpage
\section{What is HAMLET? (overview)}
HAMLET is an ``anything browser'' that glues together and searches technical data in large heterogeneous collections. HAMLET offers the following services:
\bi
\item Provides an import facility for a variety of artifacts. The generic parsing facility is designed to be extensible so, in theory, it would be possible to extend the range of artifacts processed by HAMLET.
\item Provides standard visualization techniques for those artifacts; e.g. fast graphical browsing of used and used-by links. While state-of-the-art, these standard visualization techniques quickly overload and exhaust the user (for example, see \fig{fig8}). Hence, HAMLET uses AI, text mining, program comprehension, and information retrieval techniques to offer a set of filtered views on the data.
\ei
\begin{figure}
\begin{center}
\includegraphics[width=5in]{screens/8.jpg}
\end{center}
\caption[Too many neighboring concepts]{ Too many neighboring concepts: note the overdose of information resulting from a simple visualization of the query ``what terms are used by the code in the left-hand-side screen?''. This figure shows the terms referenced by the code sample on the left-hand-side, to a depth of eight links (i.e.
directly referenced terms, terms referenced by something referenced in the left-hand-side code, and so on, to a maximum nesting depth of eight links). The depth of nesting for the query is controlled by the slider at bottom right. The code snippet is in JAVA but a similar query on an EXPRESS schema would result in a similar information overload. The lesson of this figure is that special tools are required: we should not show {\em all} the nested references; just the {\em relevant} ones. HAMLET's display of relevant local information is shown in \fig{sample}. }\label{fig:fig8}
\end{figure}
HAMLET's concept of operation is as follows:
\bi
\item A designer identifies a corpus of historical technical documents that are relevant to a new design task. This is a fast process and may be as simple as ``give me all designs that use graphical user interfaces'' or even ``give me all designs created by my company in the last ten years''.
\item HAMLET focuses on that corpus. Note that, in the usual case, HAMLET has pre-processed that corpus and so this focus process is not slow.
\item The designer starts creating a new design in the same style as that corpus. Somewhere in that creation process, the designer asks for an audit of the current, possibly incomplete, design.
\item HAMLET responds by offering links to sets of related designs, augmented as follows:
\bi
\item ``What else''; i.e. a set of designs that share many features of the designer's current ideas;
\item ``What not''; i.e. a set of concepts {\em not} found in related designs.
\ei
\fig{sample} offers an example of these two lists.
\begin{figure}
\begin{center}
\includegraphics[width=5in]{screens/5.jpg}
\end{center}
\caption[A sample HAMLET screen]{A sample HAMLET screen. Top-left is some {\em new} technical specification offered by the user. Top-right are all the concepts in {\em old} designs related to the new design. Note that this top-right screen is overly complex to read. Bottom-left are the 50 concepts {\em nearest} to the new design. In the 50 nearest concepts, there exist certain concepts not found in the new design. These are shown in the {\em what else} list, bottom-right. Also, the new design contains certain concepts not found in the 50 nearest concepts. These are shown in the {\em what not} list (also, bottom-right). }\label{fig:sample}
\end{figure}
Designers are not obliged to always add the ``what else'' results or always avoid the ``what not'' results. However, those two queries will allow a designer to assess their current design with respect to the space of prior designs.
\item The designer and HAMLET enter a feedback loop where the designer reviews HAMLET's ``what else'' and ``what not'' lists. HAMLET learns the designer's preferences and, subsequently, uses that knowledge to offer results relevant to that user's current design task.
\item From the feedback loop, preference knowledge can be learned. Given a community of designers working on related tasks, HAMLET will be able to quickly learn which prior designs are relevant for that community.
\ei
In short, HAMLET suggests to the designer what ``to do or not to do'' (hence its name).
\subsection{Generic Parsing}
Internally, HAMLET makes minimal assumptions about the form of the technical document:
\bi
\item A document contains slots and slots can be atomic or point to other documents;
\item The network of pointers between documents represents the space of connected designs.
\ei
A generic parser class implements a standard access protocol for this internal model. By sub-classing that parser, it is possible to quickly process new document types. Currently, HAMLET's parsers can import:
\bi
\item STEP/EXPRESS
\item XML
\item Text documents structured as follows: sub-headings within headings, paragraphs within sub-headings, sentences within paragraphs, words within sentences;
\item JAVA: This JAVA import allows ready access to very large corpora of structured technical information (i.e. every open source JAVA program on the web). Hence, in the sequel, we will make extensive use of JAVA examples since they permit tests of scalability.
\ei
\subsection{The Geometry of Design}
HAMLET treats technical document comprehension as a geometric problem:
\bi
\item Old designs are clustered into groups.
\item A new design represents some point around those clusters.
\item To compute ``what else'', HAMLET finds the cluster nearest the new design and looks for differences between the new design and the average design in that cluster.
\item To compute ``what not'', HAMLET looks for parts of the current design that are not usually found in the nearest cluster.
\ei
While simple in concept, the challenge of HAMLET is three-fold:
\be
\item Doing all the above in a scalable manner; i.e. linear or sub-linear time processing. HAMLET handles this in a variety of ways, including methods borrowed from GOOGLE.
\item Doing all the above for a heterogeneous corpus. HAMLET handles multiple formats in the corpus by storing all documents in a minimalistic internal format (a document contains slots and slots can be atomic or point to other documents).
\item While the minimal format permits rapid extension of HAMLET to other document types, it raises the issue of false alarms. Like any information retrieval task, HAMLET returns some false negatives (i.e. incorrect ``what not'' results) and false positives (i.e. incorrect ``what else'' results). HAMLET therefore builds profiles on each user that learn their particular preferences.
\ee
\subsection{Related Work}
HAMLET draws on much of the literature on AI, information retrieval, text mining, and program comprehension. Formally, HAMLET is a {\em suggestion system} that augments standard queries with (a)~suggestions that are near to the current, partially complete design and (b)~suggestions of additions to the current design that would make it very unusual with respect to the space of all prior designs~\cite{pu06}. Having said that, comprehension of archival technical documentation has some features that make HAMLET's task different from that of other systems:
\bi
\item Zhai et al.~\cite{Zhai04across-collection} discuss cross-collection mixture models that seek to discover latent common themes from large document collections. This work assumes that all documents are expressed in a simple ``bag of words'' model (i.e. no links between documents). In HAMLET, on the other hand, documents are stored in a connected graph showing neighborhoods of related terms.
\item The HIPIKAT system of \v{C}ubrani\'{c} and Murphy~\cite{murphy03} explores comprehension of heterogeneous technical products. Software development projects produce a large number of artifacts, including source code, documentation, bug reports, e-mail, newsgroup articles, and version information. The information in these artifacts can be helpful to a software developer trying to perform a task, such as adding a new feature to a system.
Unfortunately, it is often difficult for a software developer to locate the right information amongst the huge amount of data stored. HIPIKAT recommends relevant software development artifacts based on the context in which a developer requests help from HIPIKAT. While a landmark system, HIPIKAT is hard-wired into a particular set of development tools (ECLIPSE) and the scalability of the tool has not been demonstrated by the authors.
\item Hill et al.'s DORA system~\cite{1321637} stores JAVA programs using a combination of ``bag of words'' as well as neighborhood information. Searching in DORA is a two-fold process where:
\bi
\item information retrieval on the bag of words finds candidate connections;
\item topology queries on the candidates' neighbors return the strongly related terms.
\ei
The drawback with DORA is that it assumes a non-heterogeneous corpus (everything in DORA is a JAVA class) and its search algorithms will not scale to very large examples (since they require more than linear time and memory).
\ei
\newpage
\section{A Session With HAMLET}
(The following screens were generated from a slightly different version of HAMLET than those shown above.)
\subsection{Logging in}
HAMLET gives each user a unique profile. By storing the decisions made by the user regarding what kinds of things they like/dislike in these profiles, HAMLET can utilize the latest collaborative filtering techniques to bring documents similar to those rated as liked to the top of query results, while filtering out the documents disliked.
\begin{figure}[!h]
\begin{center}
\includegraphics{screens/hello.jpg}
\end{center}
\caption{Logging in. }\label{fig:logging}
\end{figure}
\newpage
\subsection{The Interface}
HAMLET contains several components other than the UI shown below. The pre-processor is a tool used by HAMLET to generate datasets which can then be loaded into the UI, allowing the user to query, rank, and visualize the documents found in the loaded dataset. The majority of all machine learning takes place within the pre-processor. This is where tasks like term frequency / document frequency generation, term selection, clustering, and learner training occur. After being run through the pre-processor, each document within a collection is assigned a vector representation which describes what terms are present in the document and at what frequency.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=5in]{screens/blank.jpg}
\end{center}
\caption{A HAMLET screen. }\label{fig:scream}
\end{figure}
\newpage
\subsection{HAMLET output}
Pictured below is HAMLET's output after running a given query (the ``unit data'' text box). The results are shown in the far left box. The far right boxes (what else and what not) describe the deltas between your query and the currently selected item in the list on the left side. {\em What else} can be described as ``what do I need to add to my thing to make it more like the selected thing'', while {\em what not} is more like ``what do I need to remove from my thing to make it more like the selected thing''. The sample dataset being run here was built from the open source machine learning library, WEKA, which is written in JAVA.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=5in]{screens/whatelsewhatnot.jpg}
\end{center}
\caption{HAMLET output. }\label{fig:out}
\end{figure}
\newpage
\subsection{A Web of Connections}
HAMLET is a framework that supports both the {\em semantic} and the {\em syntactic} structure inherent in data.
Displayed below is an example of the former: a web of hyper-links connecting relevant documents to each other (generated from JAVA source code). When this kind of information is combined with syntactic information (the actual text of a document, e.g. term counts), a powerful information retrieval system can be created that supports the ability to {\em walk} through the data. In our application, by clicking on a document in the graph, you are shown all of the documents connected to the selected document. By hopping from document to document and tweaking visualization parameters along the way, it is possible to truly walk through the dataset.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=5in]{screens/web.jpg}
\end{center}
\caption{Displaying connections. }\label{fig:wb}
\end{figure}
\newpage
\subsection{3-D Visualization}
In addition to visualizing 2-D semantic information, HAMLET also supports the ability to visualize each document vector as a point in 3-D space. To facilitate this, HAMLET utilizes dimensionality reduction in two stages. In the first stage, the list of all possible terms in a given collection is analyzed to determine the most relevant terms (this reduces the dimensionality from around 20,000 to 100). In the second stage, the 100-dimensional document vectors are run through a fast (nearly linear) dimensionality reduction algorithm called FastMap, which finds the intrinsic geometry in the 100-dimensional space and projects it into a 3-D space suitable for visualization.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=5in]{screens/3dtwo.jpg}
\end{center}
\caption{ 3-D visualization. }\label{fig:vis3d}
\end{figure}
\newpage
\section{How Does HAMLET Work? (the details)}
This section offers a technical description of the internals of HAMLET. In that description, the term ``document'' will be used since that is consistent with the information retrieval literature. Note that HAMLET's ``document'' may be much smaller than, say, a Microsoft Word .doc file. A HAMLET ``document'' is just the smallest unit of comprehension for a particular type of entry in an archive. Indeed, depending on the parser being used, a ``document'' may be:
\bi
\item An EXPRESS data type
\item A JAVA method
\item A paragraph in an English document
\item Some XML snippet.
\ei
The methods described in this section are in a state of flux. HAMLET is a prototype system and we are constantly changing the internal algorithms. In particular:
\bi
\item All slower-than-linear algorithms are being replaced with linear algorithms.
\item Our preliminary experiments suggest that many common methods may in fact be superfluous for comprehension of technical documents. For example: (a)~we may soon be dropping the stopping and stemming methods described below; (b)~the value of discretization during InfoGain processing is not clear at this time.
\item Any threshold value described in this section (e.g. using the top $k=100$ Tf*IDF terms) will most probably change in the very near future as we tune HAMLET to the problem of archival storage.
\ei
\subsection{Dimensionality Reduction}
A major issue within HAMLET is {\em dimensionality reduction}. Standard AI learning methods work well for problems whose examples are fully described using a few dozen (or fewer) attributes~\cite{witten99}. But a tool that processes a corpus of archival technical documents must handle thousands of unique words, of which any particular document may mention only a few~\cite{byrn99,salton83introduction}.
Therefore, before we can apply learning to technical document comprehension, we have to reduce the number of dimensions (i.e. attributes) in the problem. There are several standard methods for dimensionality reduction such as {\em tokenization, stop lists, stemming, Tf*IDF}, {\em InfoGain}, PCA, and FastMap. All these methods are discussed below.
\subsubsection{Tokenization}
In HAMLET's parser, words are reduced to simple tokens by (e.g.) removing all punctuation marks, then converting all upper case characters to lower case.
\subsubsection{Stop lists}
Another way to reduce dimensionality is to remove ``dull'' words via a {\em stop list}. \fig{stop} shows a sample of the stop list used in HAMLET.
\begin{figure}
\begin{alltt}\scriptsize a        about    across   again    against  almost
alone    along    already  also     although always
am       among    amongst  amount   an       and
another  any      anyhow   anyone   anything anyway
anywhere are      around   as       at
...      ...      ...      ...      ...
\end{alltt}
\caption[Stop words]{A sample of the 262 stop words used in this study.}\label{fig:stop}
\end{figure}
\subsubsection{Stemming}
Terms with a common stem will usually have similar meanings. For example, all these words relate to the same concept:
\bi
\item CONNECT
\item CONNECTED
\item CONNECTING
\item CONNECTION
\item CONNECTIONS
\ei
Porter's stemming algorithm~\cite{porter80} is the standard stemming tool. It repeatedly applies a set of pruning rules to the end of words until the surviving words are unchanged. The pruning rules ignore the semantics of a word and just perform syntactic pruning (e.g. \fig{stem1}).
\begin{figure}
\scriptsize
\begin{verbatim}
RULE              EXAMPLE
----------------  -----------------------------
ATIONAL -> ATE    relational     -> relate
TIONAL  -> TION   conditional    -> condition
                  rational       -> rational
ENCI    -> ENCE   valenci        -> valence
ANCI    -> ANCE   hesitanci      -> hesitance
IZER    -> IZE    digitizer      -> digitize
ABLI    -> ABLE   conformabli    -> conformable
ALLI    -> AL     radicalli      -> radical
ENTLI   -> ENT    differentli    -> different
ELI     -> E      vileli         -> vile
OUSLI   -> OUS    analogousli    -> analogous
IZATION -> IZE    vietnamization -> vietnamize
ATION   -> ATE    predication    -> predicate
ATOR    -> ATE    operator       -> operate
ALISM   -> AL     feudalism      -> feudal
IVENESS -> IVE    decisiveness   -> decisive
FULNESS -> FUL    hopefulness    -> hopeful
OUSNESS -> OUS    callousness    -> callous
ALITI   -> AL     formaliti      -> formal
IVITI   -> IVE    sensitiviti    -> sensitive
BILITI  -> BLE    sensibiliti    -> sensible
\end{verbatim}
\caption{Some stemming rules.}\label{fig:stem1}
\end{figure}
Porter's stemming algorithm has been coded in any number of languages\footnote{\url{http://www.tartarus.org/martin/PorterStemmer}} such as the Perl {\em stemming.pl} used in this study.
\subsubsection{Tf*IDF}
Tf*Idf is shorthand for ``term frequency times inverse document frequency''. This calculation models the intuition that jargon usually contains technical words that appear a lot, but only in a small number of paragraphs. For example, in a document describing a spacecraft, the terminology relating to the power supply may appear frequently in the sections relating to power, but nowhere else in the document. Calculating Tf*Idf is a relatively simple matter:
\bi
\item Let there be $Words$ words and $Documents$ documents in the corpus;
\item Let some word $I$ appear $Word[I]$ times;
\item Let $Document[I]$ be the number of documents containing $I$.
\ei
Then:
\[ Tf*Idf = Word[I]/Words * \log(Documents/Document[I]) \]
The standard way to use this measure is to cull all but the $k$ top Tf*Idf-ranked stopped, stemmed tokens. This study used $k=100$.
\subsubsection{InfoGain}
According to the $InfoGain$ measure, the {\em best} words are those that {\em most simplify} the target concept (in our case, the distribution of frequencies seen in the terms). Concept ``simplicity'' is measured using information theory. For example, suppose a data set has 80\% severity=5 issues and 20\% severity=1 issues; i.e. it has a distribution $C_0$ with terms $c(1)=${\em severity=5} and $c(2)=${\em severity=1} occurring with frequencies $n(1)=0.8$ and $n(2)=0.2$. The number of bits required to encode that distribution $C_0$ is $H(C_0)$, defined as follows:
\begin{equation}\label{eq:ent}{ \left.\begin{array}{rcl} N&=&\sum_{c{\in}C}n(c)\\ p(c)& =& n(c)/N \\ H(C)&=& -{\sum}_{c{\in}C} p(c)\log_2p(c) \end{array}\right\} }\end{equation}
After discretizing numeric data\footnote{E.g. given an attribute's minimum and maximum values, replace a particular value $n$ with $(n - min)/((max - min)/10)$. For more on discretization, see~\cite{dou95}.}, then if $A$ is a set of attributes, the number of bits required to encode a class after observing an attribute is:
\[H(C|A) = -{\sum}_{a{\in}A} p(a) {\sum}_{c{\in}C} p(c|a)\log_2p(c|a)\]
The highest ranked attribute $A_i$ is the one with the largest {\em information gain}; i.e. the one that most reduces the encoding required for the data {\em after} using that attribute; i.e.
\begin{equation}\label{eq:infogain} InfoGain(A_i) = H(C) - H(C|A_i) \end{equation}
where $H(C)$ comes from \eq{ent}. In this study, we will use InfoGain to find the top $N=10$ most informative tokens.
\subsubsection{PCA and FastMap}
Numerous data mining methods check if the available features can be combined in useful ways. These methods offer two useful services:
\be
\item Latent important structures within a data set can be discovered.
\item A large set of features can be mapped to a smaller set; it then becomes possible for users to manually browse complex data.
\ee
For example, principal components analysis (PCA)~\cite{Dil84} has been widely applied to resolve problems with structural code measurements; e.g.~\cite{Muns91}. PCA identifies the distinct orthogonal sources of variation in a data set, while mapping the raw features onto a set of uncorrelated features that represent essentially the same information contained in the original data. For example, the data shown in two dimensions of \fig{pca} (left-hand-side) could be approximated in a single latent feature (right-hand-side).
\begin{figure}[!b]
\begin{center}
\includegraphics[angle=270,width=2in]{pca.pdf}
\end{center}
\caption[PCA: example]{The two features in the left plot can be mapped to the right plot via one latent feature. }\label{fig:pca}
\end{figure}
Since PCA combines many features into fewer latent features, the structure of PCA-based models may be very simple.
For example, previously~\cite{me07c}, we have used PCA and a decision tree learner to find the following predictor for defective software modules:~\\
\newcommand{\setuptabs}{ \hspace*{.3in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.1in} \hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in} \hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\kill }
\begin{minipage}{\linewidth}
{\footnotesize
\begin{tabbing}
\setuptabs
\>if \>$domain_1 \le 0.180$\\
\>then NoDefects\\
\>else \>if $domain_1 > 0.180$\\
\> \> then \>\> if $domain_1 \le 0.371$ then NoDefects\\
\> \> else \>\> if $domain_1 > 0.371$ then Defects\\
\end{tabbing}}
\end{minipage}
Here, ``$domain_1$'' is one of the latent features found by PCA. This tree seems very simple, yet it is very hard to explain to business users since ``$domain_1$'' is calculated using a very complex weighted sum (in this sum, $v(g), ev(g), iv(g)$ are McCabe or Halstead static code metrics~\cite{mccabe76,halstead77} or variants on line counts):
\begin{equation}\label{eq:latent} {\scriptsize \begin{array}{r@{}c@{~}l} domain_1&=& 0.241*loc +0.236*v(g) \\ &&+0.222*ev(g)+0.236*iv(g) +0.241*n \\ &&+0.238*v -0.086*l +0.199*d \\ && +0.216*i +0.225*e +0.236*b +0.221*t\\ && +0.241*lOCode +0.179*lOComment\\ && +0.221*lOBlank +0.158*lOCodeAndComment \\ && +0.163*uniq_Op +0.234*uniq_Opnd\\ && +0.241*total_Op +0.241*total_Opnd \\ && +0.236*branchCount\\ \end{array}} \end{equation}
Nevertheless, such latent dimensions can be used to generate visualizations that show users spatial distances between concepts in technical documents. For example, \fig{3d} shows a 100-D space of prior designs converted to a 3-D representation. In the conversion process, the three top-most domains were computed and the 100-D space mapped to the 3-D space.
\begin{figure}
\begin{center}
\includegraphics[width=4in]{screens/3a.png}
\end{center}
\caption[Displaying a 100-D space of documents]{A 100-D space presented in 3-D. The point in red shows the current design and the four points in green show the nearest 4 technical documents (and the gray points show other documents that have been ``trimmed'' away using the techniques discussed below).}\label{fig:3d}
\end{figure}
PCA is the traditional method of performing dimensionality reduction. It suffers from scale-up problems (for large data sets with many terms, the calculation of the correlation matrix between all terms is prohibitively computationally expensive). FastMap is a heuristic stochastic algorithm that performs the same task as PCA, but does so in far less time and memory~\cite{faloutsos95fastmap}. Our own experiments with the two methods showed that both yield similar structures.
\subsection{Trimming}
Trimming is a simple heuristic to prune remote points. It is a fast and simple method for focusing design reviews on near-by concepts. Trimming runs like this:
\bi
\item The user specifies some maximum number of documents $N$.
\item The $N$ documents nearest to the user's partially specified design are accessed. These documents are sorted $1,2,3,4...N$ according to their distance to the current design.
\item The delta between document $i$ and $i+1$ in the sort order is computed and the document $i$ with the maximum delta (most difference between it and $i+1$) is declared ``the trim point''.
\item All documents $i+1,i+2,...N$ are trimmed away.
\ei
For example, \fig{trim} shows the number of related documents before and after trimming to a maximum depth of $N=25$.
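To make the trimming heuristic concrete, the following sketch shows one way the steps above could be coded. It is an illustrative sketch only, written in JAVA (the language of the current prototype); the names used here (e.g. {\tt Doc}, {\tt distanceTo}, {\tt trim}) are hypothetical and do not correspond to HAMLET's actual internal classes.
{\scriptsize
\begin{verbatim}
import java.util.*;

// Hypothetical document type: anything that can report its distance
// to another document in the (reduced) feature space.
interface Doc { double distanceTo(Doc other); }

class Trimmer {
  // Return the documents nearest to "design", cut at the largest gap
  // ("the trim point") in the sorted distances.
  static List<Doc> trim(Doc design, List<Doc> corpus, int n) {
    // 1. Sort the corpus by distance to the partial design and
    //    keep the N nearest documents.
    List<Doc> nearest = new ArrayList<>(corpus);
    nearest.sort(Comparator.comparingDouble((Doc d) -> d.distanceTo(design)));
    nearest = nearest.subList(0, Math.min(n, nearest.size()));

    // 2. Find the position i where the jump in distance between
    //    document i and document i+1 is largest.
    int trimPoint = nearest.size() - 1;
    double maxDelta = -1;
    for (int i = 0; i + 1 < nearest.size(); i++) {
      double delta = nearest.get(i + 1).distanceTo(design)
                   - nearest.get(i).distanceTo(design);
      if (delta > maxDelta) { maxDelta = delta; trimPoint = i; }
    }

    // 3. Documents beyond the trim point (i+1, i+2, ... N) are discarded.
    return nearest.subList(0, trimPoint + 1);
  }
}
\end{verbatim}}
Note that, in this sketch, the trim itself is a single pass over at most $N$ sorted documents, so the dominant cost is the initial distance computation and sort.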
\begin{figure}
\begin{center}
\noindent \includegraphics[width=5in]{screens/13.jpg} ~\\
\noindent \includegraphics[width=5in]{screens/12.jpg}
\end{center}
\caption[Trimming]{Trimming. Trimming is a simple heuristic to prune remote points. It is a fast and simple method for focusing design reviews on near-by concepts. The top picture, left-hand-side, shows in green the 25 documents closest to some partial design developed by the user. The contents of the third closest document, highlighted in blue, are shown in the center screen. The bottom picture shows the same set of nearby documents, with trimming enabled (see the check box shown in red). Note that now only four documents are displayed to the user. }\label{fig:trim}
\end{figure}
\subsection{Canopy Clustering}
HAMLET adopts a geometric view of technical documents floating in an N-dimensional space (and, perhaps, those dimensions are reduced by the above methods). The documents are clustered and new designs can be critiqued with respect to their position relative to old designs. A naive clustering algorithm runs in $O(N^2)$ where $N$ is the number of terms being clustered and all terms are assessed with respect to all other terms. For large archival collections, this is too slow. Various improvements over this naive strategy include ball trees, KD-trees, and cover trees~\cite{bey06}. While all these methods are useful, their ability to scale to very large examples is an open question. An alternative to traditional clustering methods is {\em canopy clustering}~\cite{775116,347123}. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical because of the size of the data set. The algorithm proceeds as follows:
\bi
\item Cheaply partition the data into overlapping subsets, called ``canopies'';
\item Perform more expensive clustering, but only within these canopies.
\ei
In the case of text mining applications like HAMLET, the initial cheap clustering can be performed using an inverted index; i.e. a sparse matrix representation in which, for each word, we can directly access the list of documents containing that word. The great majority of the documents, which have no words in common with the partial design constructed by the engineer, need never be considered. Thus we can use an inverted index to efficiently calculate a distance metric that is based on (say) the number of words two documents have in common.
\subsection{Cluster Classification}
A key task with HAMLET is recognizing which cluster is nearest the partial design offered by HAMLET's user. Once that cluster is identified, we can easily compute the delta between the current design and the centroid of that cluster. This delta comprises two lists:
\bi
\item What to add (the ``what else'' list);
\item What to remove (the ``what not'' list).
\ei
A key requirement for our classifier is that it handles low frequency counts: we anticipate that archival data sets will contain many terms, but only very few of them will appear in any particular document or the user's partial design. Rennie et al.~\cite{Rennie03tacklingthe} report a variant of a Naive Bayes classifier called TWCNB that uses normalized Tf*IDF counts with the following properties:
\bi
\item It handles low frequency data sets;
\item It performs almost as well as more complex support vector machine implementations;
\item Better yet, it is very simple to implement and runs in linear time (which makes it suitable for scaling up to a very large corpus).
\ei
Currently, we are experimenting with TWCNB within HAMLET.
\subsection{User Profiling with the Rocchio Algorithm}
All information retrieval systems, including HAMLET, suffer from false alarms; i.e. returning query results that the user does not find relevant. The Rocchio algorithm is a standard method for pruning the results of an information retrieval search with user feedback~\cite{hayes06,hayes07}. The algorithm reports the delta between the positive document vectors (those relating to the positive examples approved by the user) and the negative ones (those relating to the negative examples rejected by the user). The terms in this sum are weighted by tuning parameters that are determined via experimentation, but Joachims recommends weighting the positive information four times higher than the negative~\cite{joachims97}. We are currently exploring methods to augment Rocchio with TWCNB.
\newpage
\section{Next Steps}
\begin{raggedleft} {\em Plan to throw one away; you will anyhow. - Frederick Brooks, ``The Mythical Man-Month'' (1975)} \end{raggedleft} ~\\
In its current form, HAMLET is a tool suitable for in-house use in a research lab. Much work is required before it can be released as a stand-alone tool for end-users.
\bi
\item The current system was designed as a throw-away prototype to serve as an experimental workbench for numerous implementation ideas. For example, HAMLET is currently implemented in JAVA since this was a language familiar to the student programmers working on this project. The limitations of JAVA, in terms of runtime, are becoming apparent. We wish to avoid a port to ``C'' and are looking into better internal JAVA-based data structures.
\item As to other matters, the scalability of HAMLET's queries has yet to be demonstrated. Theoretically, the algorithms within HAMLET's current design run in linear time, or less. However, this scalability must be empirically verified.
\item Also, the user profile system that takes input from the users, then dynamically tunes the results of each query, is still being designed. Until that feature is designed, implemented, and tuned, users of HAMLET will suffer from too many false positives. This user profile feature must be built.
\item Finally, HAMLET's ability to link between documents in heterogeneous collections has yet to be tested. This test must be conducted.
\ei
\newpage
\appendix
\section{Model for Early Life cycle Decisions}
One reason that STEP/EXPRESS is not used more is that, from a cognitive perspective, it does not support the entire design cycle. Rather, it only supports the last stages of design and not all the interim steps along the way.
\subsection{Early Life cycle Models are Different}
STEP/EXPRESS is a {\em heavy-weight} ontology designed to unambiguously represent categorical design decisions. There is much evidence that before human designers make categorical decisions, they often prefer to ``play around a little''. Informal ultra-lightweight sketches are a common technique for illustrating and sharing intuitions. For example, in early life cycle requirements engineering, software engineers often pepper their discussions with numerous hastily-drawn sketches. Kremer argues convincingly that such visual forms have numerous advantages for acquiring knowledge~\cite{kremer98}. Other researchers take a similar view: ``Pictures are superior to texts in a sense that they are abstract, instantly comprehensible, and universal''~\cite{hirakawa94}.
Normally, hastily-drawn sketches are viewed as precursors to some other, more precise modeling techniques which necessitate further analysis. Such further formalization may not always be practical. The time and expertise required to build formal models do not accommodate the pace with which humans tend to revise their ideas. Also, such further formalization may not be advisable. Goel~\cite{goel92} studied the effects of diagram formalism on system design. In an {\em ill-structured diagram} (such as a freehand sketch using pencil and paper), the denotation and type of each visual element are ambiguous. In a {\em well-structured diagram} (such as those created using MacDraw or STEP/EXPRESS), each visual element clearly denotes one thing of one class only. Goel found that ill-structured tools generated more design variants --- more drawings, more ideas, more re-use of old ideas --- than well-structured tools.
\subsection{Beyond STEP: Early Life cycle Models}
The premise of the following early life cycle ultra-lightweight modeling frameworks is that (a)~such preference criteria are important; and that (b)~it is hence necessary to include preference constructs in the modeling language:
\bi
\item Mylopoulos' {\em soft-goal} graphs~\cite{mylo92,mylo99} represent knowledge about non-functional requirements. Primitives in soft goal modeling include statements of partial influence such as {\em helps} and {\em hurts}.
\item A common framework in the design rationale community is a ``questions-options-criteria'' (QOC) graph~\cite{shum94}. In QOC graphs, questions suggest options and deciding on a certain option can raise other questions. Options shown in a box denote selected options. Options are assessed by criteria and criteria are gradual knowledge; i.e. they {\em tend to support} or {\em tend to reject} options. QOCs can succinctly summarize lengthy debates; e.g. 480 sentences uttered in a debate between two analysts on interface options can be displayed in a QOC graph on a single page~\cite{maclean96}. Saaty's Analytic Hierarchy Process (AHP)~\cite{saaty80} is a variant of QOC.
\item Another widely used framework is Quality Function Deployment diagrams (QFD)~\cite{akao90}. Yet another framework is the DDP system developed by Cornford and Feather~\cite{feather03}. Whereas AHP and QOC propagate influences over hierarchies, QFD (and DDP) propagate influences over matrices.
\ei
\subsection{Why Record Early Life cycle Models?}
At first glance, the imprecision of the above modeling frameworks appears to make them inferior to precise and unambiguous STEP models. However, it is this very imprecision that makes them valuable. Firstly, one reason for preferring the above ultra-lightweight ontologies over STEP/EXPRESS is that, at least for group design activities:
\bi
\item During fast-paced design discussions, only brief statements can be collected. Otherwise, the dialogue within a meeting would be hampered by the time required to record, in tedious detail, the exchange of ideas.
\item If the ontology gets more elaborate, then experts may be unable to understand technical arguments from outside their own field of expertise.
\ei
Secondly, experience at NASA\footnote{This section is based on numerous interviews with NASA designers working on the reuse of old Apollo designs for NASA's next-generation Mars rocket. } suggests that a vital missing component of any one design is the set of all the other designs rejected before arriving at the final system.
Researchers in the design rationale community~\cite{moran96} advise that these rejected dead-end designs are useful to determine the preference criteria used in selecting the appropriate design.
\subsection{Summary}
Modeling languages that capture design option\underline{s} are different from modeling languages that record the final detailed design. One reason that STEP/EXPRESS is not used more is, we conjecture, that it is a ``final design capture'' tool and not an ``initial design options exploration'' tool. While ``final design capture'' tools are useful for transferring current designs between different CAD/CAM tools, such tools may not be preferred for reviewing and understanding designs decades after they are created. Nevertheless, the above early life cycle modeling tools may not be indicated for the task of comprehending archival technical documents. The augmentation of large archival storage with detailed design alternatives and preference knowledge (e.g. as captured in DDP, AHP, QFD, or soft goals) requires a detailed analysis of existing artifacts\footnote{e.g. see the {\em Software Archaeology} example offered by Grady Booch~\cite{booch04}.}. Hence it may not be indicated for archival storage where the corpus size is very large and there is a lack of access to the relevant domain experts. Recent advances in program comprehension and text mining indicate that it may be possible to infer such preference knowledge from multiple instances of finished designs. Hence, this work.
\newpage
\bibliographystyle{abbrv}
\addcontentsline{toc}{section}{References}
\bibliography{refs,refstimm}
\end{document}