\documentclass[10pt]{article}
\usepackage{rotating}
\usepackage{lineno,times,graphicx}
\usepackage{url}
\usepackage{alltt}
\title{\vspace{1cm}\includegraphics[width=2in]{HAMLETSkullHCSealous.jpg}\\\vspace{1cm} The HAMLET Project: Progress Report\vspace{1cm}}
\author{Tim Menzies\\LCSEE, WVU\\\url{tim@menzies.us}}
\date{}
\newcommand{\be}{\begin{enumerate}}
\newcommand{\ee}{\end{enumerate}}
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
\newcommand{\eq}[1]{Equation~\ref{eq:#1}}
\newcommand{\fig}[1]{Figure~\ref{fig:#1}}
\usepackage{lastpage,fancyheadings}
\footrulewidth 0.0pt\headrulewidth 0.1pt
\lhead{}\rhead{}\chead{}\rfoot{}\lfoot{}\cfoot{}
\rhead{{\scriptsize \today}}
\lhead{\includegraphics[width=6cm]{wvulogo.jpg}}
\usepackage{eso-pic}
\usepackage{color}
\usepackage{type1cm}
\makeatletter
\AddToShipoutPicture{%
\setlength{\@tempdimb}{.5\paperwidth}%
\setlength{\@tempdimc}{.5\paperheight}%
\setlength{\unitlength}{1pt}%
\makebox(600,900){\rotatebox{45}{\textcolor[gray]{0.9}{\fontsize{3cm}{3cm}\selectfont{Not for release}}}}}
\makeatother
\begin{document}
\linenumbers
\maketitle
\thispagestyle{fancy}
\pagestyle{fancy}
\vspace{1cm}{\bf About this document:} This report is the first in a series of three documents:
\be
\item This report describes progress on the HAMLET ``anything browser''. It is an interim internal document between WVU and the National Archives and is not intended for wider distribution. Report 3, listed below, will update this document and will be suitable for broad release.
\item Report 2, due August 30, will describe the difference between the current version of HAMLET and our proposed new work.
\item Report 3, due September 30, will offer detailed technical information on HAMLET.
\ee
\newpage
\rhead{\thepage\ of \pageref{LastPage}}
\lhead{Menzies: HAMLET - progress report}
\tableofcontents
\newpage
\listoffigures
\newpage
\section{Summary}
HAMLET is an ``anything browser'' that aims to offer advice on what engineering parts are relevant to a partially complete current design. HAMLET makes minimal assumptions about the nature of the engineering documents being explored and is designed to be ``glue'' that permits the searching of technical data in large heterogeneous collections. Using technology from AI, information retrieval, program comprehension, and text mining, HAMLET allows designers to dig up prior designs, study those designs, then apply insights from those historical designs to new tasks.
\begin{figure}[!b]
\begin{center}
\includegraphics[width=5in]{archeology.jpg}
\end{center}
\caption{Digging up old designs. From \protect\cite{booch04}.}
\end{figure}
\newpage
\section{Why HAMLET? (motivation)}
A recent NSF-funded workshop\footnote{ Collaborative Expedition Workshop \#74, June 10, 2008, at NSF. ``Overcoming I/O Bottlenecks in Full Data Path Processing: Intelligent, Scalable Data Management from Data Ingest to Computation Enabling Access and Discovery''. \url{http://colab.cim3.net/cgi-bin/wiki.pl?ExpeditionWorkshop/TowardScalableDataManagement_2008_06_10} } highlighted current directions in long term technical document retention. While much progress was reported on:
\bi
\item systems issues of handling and sharing very large data collections (e.g. SLASH);
\item scalable methods of building customized views (e.g. iRODS),
\ei
there was little mention of the cognitive issues of how users might browse and synthesize data from massive data collections of technical documents.
For example, here at WVU, we are mid-way through a review of the use of STEP/EXPRESS for long term technical document retention\footnote{See reports from Mucino.}. STEP/EXPRESS is commonly used as an inter-lingua to transfer technical data between CAD/CAM packages. Strange to say, while STEP/EXPRESS is useful for transferring and understanding technical documents {\em today}, it does not appear to be suitable for understanding technical documents from {\em yesterday}. In theory, there is nothing stopping STEP/EXPRESS from recording and storing all aspects of a project. In many ways, STEP/EXPRESS is as expressive as other technical document standards (e.g. UML). STEP/EXPRESS offers a generic method for storing part-of and isa information, constraints, types, and the rules associated with a technical document. However, in practice, the theoretical potential of STEP/EXPRESS is not realized for the following reasons.
\subsection{Heterogeneity}
The reality of archival systems is that STEP/EXPRESS documents are stored {\em alongside} a much larger set of supporting documents in multiple formats. A recent study\footnote{\url{http://www.b-eye-network.com/view/2098}} concluded that:
\bi
\item 80 percent of business is conducted on unstructured information;
\item 85 percent of all data stored is held in an unstructured format (e.g. the unstructured text descriptions of issues found in PITS);
\item unstructured data doubles every three months.
\ei
That is, if we could learn how to understand large heterogeneous collections that include STEP/EXPRESS knowledge as well as numerous other products in a wide variety of formats, it would be possible to reason and learn from a very wide range of data.
\subsection{Incomplete meta-knowledge}
Much work has focused on the creation of cliched sets of EXPRESS schemas. Forty such {\em application protocols} (APs) have been defined~\cite{jardim06}, including AP-203 (for geometry) and AP-213 (for numerical control). The list of currently defined application protocols is very extensive (see \fig{aps}). These APs are the cornerstone of STEP tools: the tools offer specialized support screens and import/export facilities for the APs. While much effort went into their creation, very few of these APs have been stress-tested in the field within information systems. That is, the majority of these APs have been {\em written} more than they have been {\em read} (exceptions: the above-mentioned AP-203 and AP-213 are frequently used and reused in $21^{st}$ century CAD/CAM manufacturing processes).
\subsection{Incomplete tool support}
Perhaps because of the relative immaturity of the APs, current CAD/CAM tools offer limited support for the STEP APs. While most tools support geometry (AP-203), the support for the other APs in \fig{aps} is minimal (to say the least).
\subsection{Incomplete design rationale support}
From a cognitive perspective, STEP/EXPRESS does not support the entire design cycle. Rather, it only supports the last stages of design and not all the interim steps along the way. For more on this issue, see the appendix.
\subsection{Limited Historical Use}
For all the above reasons, highly structured technical documents in formats like STEP/EXPRESS are in the minority in the archival systems we have examined. We are aware of large STEP/EXPRESS repositories but these are often inaccessible for a variety of reasons. While this situation might change in the future (e.g.
if all the above issues were suddenly fixed and all organizations switched to using highly structured technical documentation), the historical record would still be starved for large numbers of examples.
\begin{figure}
\begin{center}
\begin{tabular}{ll}
AP & area\\\hline
201 & Explicit Drafting\\
202 & Associative Drafting\\
203 & Configuration Controlled Design\\
204 & Mechanical Design Using Boundary Representation\\
205 & Mechanical Design Using Surface Representation\\
206 & Mechanical Design Using Wireframe Representation\\
207 & Sheet Metal Dies and Blocks\\
208 & Life Cycle Product Change Process\\
209 & Design Through Analysis of Composite and Metallic Structures\\
210 & Electronic Printed Circuit Assembly, Design and Manufacturing\\
211 & Electronics Test Diagnostics and Remanufacture\\
212 & Electrotechnical Plants\\
213 & Numerical Control Process Plans for Machined Parts\\
214 & Core Data for Automotive Mechanical Design Processes\\
215 & Ship Arrangement\\
216 & Ship Molded Forms\\
217 & Ship Piping\\
218 & Ship Structures\\
219 & Dimensional Inspection Process Planning for CMMs\\
220 & Printed Circuit Assembly Manufacturing Planning\\
221 & Functional Data and Schematic Representation for Process Plans\\
222 & Design Engineering to Manufacturing for Composite Structures\\
223 & Exchange of Design and Manufacturing DPD for Composites\\
224 & Mechanical Product Definition for Process Planning\\
225 & Structural Building Elements Using Explicit Shape Rep\\
226 & Shipbuilding Mechanical Systems\\
227 & Plant Spatial Configuration\\
228 & Building Services\\
229 & Design and Manufacturing Information for Forged Parts\\
230 & Building Structure frame steelwork\\
231 & Process Engineering Data\\
232 & Technical Data Packaging\\
233 & Systems Engineering Data Representation\\
234 & Ship Operational logs, records and messages\\
235 & Materials Information for products\\
236 & Furniture product and project\\
237 & Computational Fluid Dynamics\\
238 & Integrated CNC Machining\\
239 & Product Life Cycle Support\\
240 & Process Planning\\
\end{tabular}
\end{center}
\caption{STEP Application Protocols.}\label{fig:aps}
\end{figure}
\newpage
\section{What is HAMLET? (overview)}
HAMLET is an ``anything browser'' that glues together and searches technical data in large heterogeneous collections. HAMLET offers the following services:
\bi
\item Provides an import facility for a variety of artifacts. The generic parsing facility is designed to be extensible so, in theory, it would be possible to extend the range of artifacts processed by HAMLET.
\item Provides standard visualization techniques for those artifacts; e.g. fast graphical browsing of used and used-by links. While state-of-the-art, these standard visualization techniques quickly overload and exhaust the user (for example, see \fig{fig8}). Hence, HAMLET uses AI, text mining, program comprehension, and information retrieval techniques to offer a set of filtered views on the data.
\ei
\begin{figure}
\begin{center}
\includegraphics[width=5in]{screens/8.jpg}
\end{center}
\caption[Too many neighboring concepts]{ Too many neighboring concepts: note the overdose of information resulting from a simple visualization of the query ``what terms are used by the code in the left-hand-side screen?''. This figure shows the terms referenced by the code sample on the left-hand-side, to a depth of eight links (i.e.
directly referenced terms, terms referenced by something referenced in the left-hand-side code, and so on, to a maximum nesting depth of eight links). The depth of nesting for the query is controlled by the slider at bottom right. The code snippet is in JAVA but a similar query on an EXPRESS schema would result in a similar information overload. The lesson of this figure is that special tools are required: we should not show {\em all} the nested references; just the {\em relevant} ones. HAMLET's display of relevant local information is shown in \fig{sample}. }\label{fig:fig8}
\end{figure}
HAMLET's concept of operation is as follows:
\bi
\item A designer identifies a corpus of historical technical documents that are relevant to a new design task. This is a fast process and may be as simple as ``give me all designs that use graphical user interfaces'' or even ``give me all designs created by my company in the last ten years''.
\item HAMLET focuses on that corpus. Note that, in the usual case, HAMLET has pre-processed that corpus and so this focus process is not slow.
\item The designer starts creating a new design in the same style as that corpus. Somewhere in that creation process, the designer asks for an audit of the current, possibly incomplete, design.
\item HAMLET responds by offering links to sets of related designs, augmented as follows:
\bi
\item ``What else''; i.e. a set of designs that share many features of the designer's current ideas;
\item ``What not''; i.e. a set of concepts {\em not} found in related designs.
\ei
\fig{sample} offers an example of these two lists.
\begin{figure}
\begin{center}
\includegraphics[width=5in]{screens/5.jpg}
\end{center}
\caption[A sample HAMLET screen]{A sample HAMLET screen. Top-left is some {\em new} technical specification offered by the user. Top-right are all the concepts in {\em old} designs related to the new design. Note that this top-right screen is overly complex to read. Bottom-left are the 50 concepts {\em nearest} to the new design. In the 50 nearest concepts, there exist certain concepts not found in the new design. These are shown in the {\em what else} list, bottom-right. Also, the new design contains certain concepts not found in the 50 nearest concepts. These are shown in the {\em what not} list (also, bottom-right). }\label{fig:sample}
\end{figure}
Designers are not obliged to always add the ``what else'' results or always avoid the ``what not'' results. However, those two queries will allow a designer to assess their current design with respect to the space of prior designs.
\item The designer and HAMLET enter a feedback loop where the designer reviews HAMLET's ``what else'' and ``what not'' lists. HAMLET learns the designer's preferences and, subsequently, uses that knowledge to offer results relevant to that user's current design task.
\item From the feedback loop, preference knowledge can be learned. Given a community of designers working on related tasks, HAMLET will be able to quickly learn which prior designs are relevant for that community.
\ei
In short, HAMLET suggests to the designer what ``to do or not to do'' (hence its name).
\subsection{Generic Parsing}
Internally, HAMLET makes minimal assumptions about the form of the technical document:
\bi
\item A document contains slots and slots can be atomic or point to other documents;
\item The network of pointers between documents represents the space of connected designs.
\ei
A generic parser class implements a standard access protocol for this internal model. By sub-classing that parser, it is possible to quickly process new document types. Currently, HAMLET's parsers can import:
\bi
\item STEP/EXPRESS
\item XML
\item Text documents structured as follows: sub-headings within headings, paragraphs within sub-headings, sentences within paragraphs, words within sentences;
\item JAVA: This JAVA import allows ready access to very large corpora of structured technical information (i.e. every open source JAVA program on the web). Hence, in the sequel, we will make extensive use of JAVA examples since they permit tests of scalability.
\ei
\subsection{The Geometry of Design}
HAMLET treats technical document comprehension as a geometric problem:
\bi
\item Old designs are clustered into groups.
\item A new design represents some point around those clusters.
\item To compute ``what else'', HAMLET finds the cluster nearest the new design and looks for differences between the new design and the average design in that cluster.
\item To compute ``what not'', HAMLET looks for parts of the current design that are not usually found in the nearest cluster.
\ei
While simple in concept, the challenge of HAMLET is three-fold:
\be
\item Doing all the above in a scalable manner; i.e. linear or sub-linear time processing. HAMLET handles this in a variety of ways, including methods borrowed from GOOGLE.
\item Doing all the above for a heterogeneous corpus. HAMLET handles multiple formats in the corpus by storing all documents in a minimalistic internal format (a document contains slots and slots can be atomic or point to other documents).
\item While the minimal format permits rapid extension of HAMLET to other document types, it raises the issue of false alarms. Like any information retrieval task, HAMLET returns some false negatives (i.e. incorrect ``what not'' results) and false positives (i.e. incorrect ``what else'' results). HAMLET therefore builds profiles on each user that learn their particular preferences.
\ee
\subsection{Related Work}
HAMLET draws on much of the literature on AI, information retrieval, text mining, and program comprehension. Formally, HAMLET is a {\em suggestion system} that augments standard queries with (a)~suggestions that are near to the current, partially complete design and (b)~suggestions of additions to the current design that would make it very unusual with respect to the space of all prior designs~\cite{pu06}. Having said that, comprehension of archival technical documentation has some features that make HAMLET's task different from that of other systems:
\bi
\item Zhai et al.~\cite{Zhai04across-collection} discuss cross-collection mixture models that seek to discover latent common themes from large document collections. This work assumes that all documents are expressed in a simple ``bag of words'' model (i.e. no links between documents). In HAMLET, on the other hand, documents are stored in a connected graph showing neighborhoods of related terms.
\item The HIPIKAT system of \v{C}ubrani\'{c} and Murphy~\cite{murphy03} explores comprehension of heterogeneous technical products. Software development projects produce a large number of artifacts, including source code, documentation, bug reports, e-mail, newsgroup articles, and version information. The information in these artifacts can be helpful to a software developer trying to perform a task, such as adding a new feature to a system.
Unfortunately, it is often difficult for a software developer to locate the right information amongst the huge amount of data stored. HIPIKAT recommends relevant software development artifacts based on the context in which a developer requests help from HIPIKAT. While a landmark system, HIPIKAT is hard-wired into a particular set of development tools (ECLIPSE) and the scalability of the tool has not been demonstrated by the authors.
\item Hill et al.'s DORA system~\cite{1321637} stores JAVA programs using a combination of ``bag of words'' as well as neighborhood information. Searching in DORA is a two-fold process where:
\bi
\item information retrieval on the bag of words finds candidate connections;
\item topology queries on the candidates' neighbors return the strongly related terms.
\ei
The drawback with DORA is that it assumes a non-heterogeneous corpus (everything in DORA is a JAVA class) and its search algorithms will not scale to very large examples (since they require more than linear time and memory).
\ei
\newpage
\section{A Session With HAMLET}
(The following screens were generated from a slightly different version of HAMLET than those shown above.)
\subsection{Logging in}
HAMLET gives each user a unique profile. By storing the decisions made by the user regarding what kinds of things they like/dislike in these profiles, HAMLET can utilize the latest collaborative filtering techniques to bring documents similar to those rated as liked to the top of query results, while filtering out the documents disliked.
\begin{figure}[!h]
\begin{center}
\includegraphics{screens/hello.jpg}
\end{center}
\caption{Logging in. }\label{fig:logging}
\end{figure}
\newpage
\subsection{The Interface}
HAMLET contains several components other than the UI shown below. The pre-processor is a tool used by HAMLET to generate datasets which can then be loaded into the UI, allowing the user to query, rank, and visualize the documents found in the loaded dataset. The majority of all machine learning takes place within the pre-processor. This is where tasks like term frequency / document frequency generation, term selection, clustering, and learner training occur. After being run through the pre-processor, each document within a collection is assigned a vector representation which describes what terms are present in the document and at what frequency.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=5in]{screens/blank.jpg}
\end{center}
\caption{A HAMLET screen. }\label{fig:scream}
\end{figure}
\newpage
\subsection{HAMLET output}
Pictured below is HAMLET's output after running a given query (the ``unit data'' text box). The results are shown in the far left box. The far right boxes (what else and what not) describe the deltas between your query and the currently selected item in the list on the left side. {\em What else} can be described as ``what do I need to add to my thing to make it more like the selected thing'', while {\em what not} is more like ``what do I need to remove from my thing to make it more like the selected thing''. The sample dataset being run here was built from the open source machine learning library, WEKA, which is written in JAVA.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=5in]{screens/whatelsewhatnot.jpg}
\end{center}
\caption{HAMLET output. }\label{fig:out}
\end{figure}
\newpage
\subsection{A Web of Connections}
HAMLET is a framework that supports both the {\em semantic} and the {\em syntactic} structure inherent in data.
Displayed below is an example of the former: a web of hyper-links connecting relevant documents to each other (generated from JAVA source code). When this kind of information is combined with syntactic information (the actual text of a document, e.g. term counts), a powerful information retrieval system can be created that supports the ability to {\em walk} through the data. In our application, by clicking on a document in the graph, you are shown all of the documents connected to the selected document. By hopping from document to document and tweaking visualization parameters along the way, it is possible to truly walk through the dataset.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=5in]{screens/web.jpg}
\end{center}
\caption{Displaying connections. }\label{fig:wb}
\end{figure}
\newpage
\subsection{3-D Visualization}
In addition to visualizing 2-D semantic information, HAMLET also supports the ability to visualize each document vector as a point in 3-D space. To facilitate this, HAMLET utilizes dimensionality reduction in two stages. In the first stage, the list of all possible terms in a given collection is analyzed to determine the most relevant terms (this reduces the dimensionality from around 20,000 to 100). In the second stage, the 100-dimensional document vectors are run through a fast (nearly linear) dimensionality reduction algorithm called FastMap, which finds the intrinsic geometry in the 100-dimensional space and projects it into a 3-D space suitable for visualization.
\begin{figure}[!h]
\begin{center}
\includegraphics[width=5in]{screens/3dtwo.jpg}
\end{center}
\caption{ 3-D visualization. }\label{fig:vis3d}
\end{figure}
\newpage
\section{How Does HAMLET Work? (the details)}
This section offers a technical description of the internals of HAMLET. In that description, the term ``document'' will be used since that is consistent with the information retrieval literature. Note that HAMLET's ``document'' may be much smaller than, say, a Microsoft Word .doc file. A HAMLET ``document'' is just the smallest unit of comprehension for a particular type of entry in an archive. Indeed, depending on the parser being used, a ``document'' may be:
\bi
\item An EXPRESS data type
\item A JAVA method
\item A paragraph in an English document
\item Some XML snippet.
\ei
The methods described in this section are in a state of flux. HAMLET is a prototype system and we are constantly changing the internal algorithms. In particular:
\bi
\item All slower-than-linear algorithms are being replaced with linear algorithms.
\item Our preliminary experiments suggest that many common methods may in fact be superfluous for comprehension of technical documents. For example: (a)~we may soon be dropping the stopping and stemming methods described below; (b)~the value of discretization during InfoGain processing is not clear at this time.
\item Any threshold value described in this section (e.g. using the top $k=100$ Tf*IDF terms) will most probably change in the very near future as we tune HAMLET to the problem of archival storage.
\ei
\subsection{Dimensionality Reduction}
A major issue within HAMLET is {\em dimensionality reduction}. Standard AI learning methods work well for problems whose examples are fully described using a few dozen (or fewer) attributes~\cite{witten99}. But a tool that processes a corpus of archival technical documents must handle thousands of unique words, of which any particular document may mention only a few~\cite{byrn99,salton83introduction}.
Therefore, before we can apply learning to technical document comprehension, we have to reduce the number of dimensions (i.e. attributes) in the problem. There are several standard methods for dimensionality reduction such as {\em tokenization, stop lists, stemming, Tf*IDF}, {\em InfoGain}, PCA, and FastMap. All these methods are discussed below.
\subsubsection{Tokenization}
In HAMLET's parser, words are reduced to simple tokens by (e.g.) removing all punctuation marks, then converting all upper case characters to lower case.
\subsubsection{Stop lists}
Another way to reduce dimensionality is to remove ``dull'' words via a {\em stop list}. \fig{stop} shows a sample of the stop list used in HAMLET.
\begin{figure}
\begin{alltt}\scriptsize a        about    across   again    against  almost
alone    along    already  also     although always
am       among    amongst  amount   an       and
another  any      anyhow   anyone   anything anyway
anywhere are      around   as       at
...      ...      ...      ...      ...
\end{alltt}
\caption[Stop words]{A sample of the 262 stop words used in this study.}\label{fig:stop}
\end{figure}
\subsubsection{Stemming}
Terms with a common stem will usually have similar meanings. For example, all these words relate to the same concept:
\bi
\item CONNECT
\item CONNECTED
\item CONNECTING
\item CONNECTION
\item CONNECTIONS
\ei
Porter's stemming algorithm~\cite{porter80} is the standard stemming tool. It repeatedly applies a set of pruning rules to the end of words until the surviving words are unchanged. The pruning rules ignore the semantics of a word and just perform syntactic pruning (e.g. \fig{stem1}).
\begin{figure}
\scriptsize
\begin{verbatim}
RULE              EXAMPLE
----------------  -----------------------------
ATIONAL -> ATE    relational     -> relate
TIONAL  -> TION   conditional    -> condition
                  rational       -> rational
ENCI    -> ENCE   valenci        -> valence
ANCI    -> ANCE   hesitanci      -> hesitance
IZER    -> IZE    digitizer      -> digitize
ABLI    -> ABLE   conformabli    -> conformable
ALLI    -> AL     radicalli      -> radical
ENTLI   -> ENT    differentli    -> different
ELI     -> E      vileli         -> vile
OUSLI   -> OUS    analogousli    -> analogous
IZATION -> IZE    vietnamization -> vietnamize
ATION   -> ATE    predication    -> predicate
ATOR    -> ATE    operator       -> operate
ALISM   -> AL     feudalism      -> feudal
IVENESS -> IVE    decisiveness   -> decisive
FULNESS -> FUL    hopefulness    -> hopeful
OUSNESS -> OUS    callousness    -> callous
ALITI   -> AL     formaliti      -> formal
IVITI   -> IVE    sensitiviti    -> sensitive
BILITI  -> BLE    sensibiliti    -> sensible
\end{verbatim}
\caption{Some stemming rules.}\label{fig:stem1}
\end{figure}
Porter's stemming algorithm has been coded in any number of languages\footnote{\url{http://www.tartarus.org/martin/PorterStemmer}} such as the Perl {\em stemming.pl} used in this study.
\subsubsection{Tf*IDF}
Tf*Idf is shorthand for ``term frequency times inverse document frequency''. This calculation models the intuition that jargon usually contains technical words that appear a lot, but only in a small number of paragraphs. For example, in a document describing a spacecraft, the terminology relating to the power supply may appear frequently in the sections relating to power, but nowhere else in the document. Calculating Tf*Idf is a relatively simple matter:
\bi
\item Let there be $Words$ words and $Documents$ documents in the corpus;
\item Let some word $I$ appear $Word[I]$ times;
\item Let $Document[I]$ be the number of documents containing $I$.
\ei
Then:
\[ Tf*Idf = Word[I]/Words * \log(Documents/Document[I]) \]
The standard way to use this measure is to cull all but the $k$ top Tf*Idf-ranked stopped, stemmed tokens. This study used $k=100$.
\subsubsection{InfoGain}
According to the $InfoGain$ measure, the {\em best} words are those that {\em most simplify} the target concept (in our case, the distribution of frequencies seen in the terms). Concept ``simplicity'' is measured using information theory. For example, suppose a data set has 80\% severity=5 issues and 20\% severity=1 issues; i.e. it has a distribution $C_0$ with terms $c(1)=${\em severity=5} and $c(2)=${\em severity=1} occurring with frequencies $n(1)=0.8$ and $n(2)=0.2$. The number of bits required to encode that distribution $C_0$ is $H(C_0)$, defined as follows:
\begin{equation}\label{eq:ent}{ \left.\begin{array}{rcl} N&=&\sum_{c{\in}C}n(c)\\ p(c)& =& n(c)/N \\ H(C)&=& -{\sum}_{c{\in}C} p(c)\log_2p(c) \end{array}\right\} }\end{equation}
After discretizing numeric data\footnote{E.g. given an attribute's minimum and maximum values, replace a particular value $n$ with $(n - min)/((max - min)/10)$. For more on discretization, see~\cite{dou95}.}, then if $A$ is a set of attributes, the number of bits required to encode a class after observing an attribute is:
\[H(C|A) = -{\sum}_{a{\in}A} p(a) {\sum}_{c{\in}C} p(c|a)\log_2p(c|a)\]
The highest ranked attribute $A_i$ is the one with the largest {\em information gain}; i.e. the one that most reduces the encoding required for the data {\em after} using that attribute; i.e.
\begin{equation}\label{eq:infogain} InfoGain(A_i) = H(C) - H(C|A_i) \end{equation}
where $H(C)$ comes from \eq{ent}. In this study, we will use InfoGain to find the top $N=10$ most informative tokens.
\subsubsection{PCA and FastMap}
Numerous data mining methods check if the available features can be combined in useful ways. These methods offer two useful services:
\be
\item Latent important structures within a data set can be discovered.
\item A large set of features can be mapped to a smaller set; it then becomes possible for users to manually browse complex data.
\ee
For example, principal components analysis (PCA)~\cite{Dil84} has been widely applied to resolve problems with structural code measurements; e.g.~\cite{Muns91}. PCA identifies the distinct orthogonal sources of variation in a data set, while mapping the raw features onto a set of uncorrelated features that represent essentially the same information contained in the original data. For example, the data shown in two dimensions of \fig{pca} (left-hand-side) could be approximated in a single latent feature (right-hand-side).
\begin{figure}[!b]
\begin{center}
\includegraphics[angle=270,width=2in]{pca.pdf}
\end{center}
\caption[PCA: example]{The two features in the left plot can be mapped to the right plot via one latent feature. }\label{fig:pca}
\end{figure}
Since PCA combines many features into fewer latent features, the structure of PCA-based models may be very simple.
For example, previously~\cite{me07c}, we have used PCA and a decision tree learner to find the following predictor for defective software modules:~\\
\newcommand{\setuptabs}{ \hspace*{.3in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.1in} \hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in} \hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\kill }
\begin{minipage}{\linewidth}
{\footnotesize
\begin{tabbing}
\setuptabs
\>if \>$domain_1 \le 0.180$\\
\>then NoDefects\\
\>else \>if $domain_1 > 0.180$\\
\> \> then \>\> if $domain_1 \le 0.371$ then NoDefects\\
\> \> else \>\> if $domain_1 > 0.371$ then Defects\\
\end{tabbing}}
\end{minipage}
Here, ``$domain_1$'' is one of the latent features found by PCA. This tree seems very simple, yet it is very hard to explain to business users since ``$domain_1$'' is calculated using a very complex weighted sum (in this sum, $v(g), ev(g), iv(g)$ are McCabe or Halstead static code metrics~\cite{mccabe76,halstead77} or variants on line counts):
\begin{equation}\label{eq:latent} {\scriptsize \begin{array}{r@{}c@{~}l} domain_1&=& 0.241*loc +0.236*v(g) \\ &&+0.222*ev(g)+0.236*iv(g) +0.241*n \\ &&+0.238*v -0.086*l +0.199*d \\ && +0.216*i +0.225*e +0.236*b +0.221*t\\ && +0.241*lOCode +0.179*lOComment\\ && +0.221*lOBlank +0.158*lOCodeAndComment \\ && +0.163*uniq_Op +0.234*uniq_Opnd\\ && +0.241*total_Op +0.241*total_Opnd \\ && +0.236*branchCount\\ \end{array}} \end{equation}
Nevertheless, such latent dimensions can be used to generate visualizations that show users spatial distances between concepts in technical documents. For example, \fig{3d} shows a 100-D space of prior designs converted to a 3-D representation. In the conversion process, the three top-most domains were computed and the 100-D space mapped to the 3-D space.
\begin{figure}
\begin{center}
\includegraphics[width=4in]{screens/3a.png}
\end{center}
\caption[Displaying a 100-D space of documents]{A 100-D space presented in 3-D. The point in red shows the current design and the four points in green show the nearest 4 technical documents (and the gray points show other documents that have been ``trimmed'' away using the techniques discussed below).}\label{fig:3d}
\end{figure}
PCA is the traditional method of performing dimensionality reduction. It suffers from scale-up problems (for large data sets with many terms, the calculation of the correlation matrix between all terms is prohibitively computationally expensive). FastMap is a heuristic stochastic algorithm that performs the same task as PCA, but does so in far less time and memory~\cite{faloutsos95fastmap}. Our own experiments with the two methods showed that both yield similar structures.
\subsection{Trimming}
Trimming is a simple heuristic to prune remote points. It is a fast and simple method for focusing design reviews on near-by concepts. Trimming runs like this:
\bi
\item The user specifies some maximum number of documents $N$.
\item The $N$ documents nearest to the user's partially specified design are accessed. These documents are sorted $1,2,3,4...N$ according to their distance to the current design.
\item The delta between document $i$ and $i+1$ in the sort order is computed and the document $i$ with the maximum delta (most difference between it and $i+1$) is declared ``the trim point''.
\item All documents $i+1,i+2,...N$ are trimmed away.
\ei
For example, \fig{trim} shows the number of related documents before and after trimming to a maximum depth of $N=25$.
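To make the trimming heuristic concrete, the following sketch shows one way the steps above could be coded. It is an illustrative sketch only, written in JAVA (the language of the current prototype); the names used here (e.g. {\tt Doc}, {\tt distanceTo}, {\tt trim}) are hypothetical and do not correspond to HAMLET's actual internal classes.
{\scriptsize
\begin{verbatim}
import java.util.*;

// Hypothetical document type: anything that can report its distance
// to another document in the (reduced) feature space.
interface Doc { double distanceTo(Doc other); }

class Trimmer {
  // Return the documents nearest to "design", cut at the largest gap
  // ("the trim point") in the sorted distances.
  static List<Doc> trim(Doc design, List<Doc> corpus, int n) {
    // 1. Sort the corpus by distance to the partial design and
    //    keep the N nearest documents.
    List<Doc> nearest = new ArrayList<>(corpus);
    nearest.sort(Comparator.comparingDouble((Doc d) -> d.distanceTo(design)));
    nearest = nearest.subList(0, Math.min(n, nearest.size()));

    // 2. Find the position i where the jump in distance between
    //    document i and document i+1 is largest.
    int trimPoint = nearest.size() - 1;
    double maxDelta = -1;
    for (int i = 0; i + 1 < nearest.size(); i++) {
      double delta = nearest.get(i + 1).distanceTo(design)
                   - nearest.get(i).distanceTo(design);
      if (delta > maxDelta) { maxDelta = delta; trimPoint = i; }
    }

    // 3. Documents beyond the trim point (i+1, i+2, ... N) are discarded.
    return nearest.subList(0, trimPoint + 1);
  }
}
\end{verbatim}}
Note that, in this sketch, the trim itself is a single pass over at most $N$ sorted documents, so the dominant cost is the initial distance computation and sort.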
\begin{figure}
\begin{center}
\noindent \includegraphics[width=5in]{screens/13.jpg} ~\\
\noindent \includegraphics[width=5in]{screens/12.jpg}
\end{center}
\caption[Trimming]{Trimming. Trimming is a simple heuristic to prune remote points. It is a fast and simple method for focusing design reviews on near-by concepts. The top picture, left-hand-side, shows in green the 25 documents closest to some partial design developed by the user. The contents of the third closest document, highlighted in blue, are shown in the center screen. The bottom picture shows the same set of nearby documents, with trimming enabled (see the check box shown in red). Note that now only four documents are displayed to the user. }\label{fig:trim}
\end{figure}
\subsection{Canopy Clustering}
HAMLET adopts a geometric view of technical documents floating in an N-dimensional space (and, perhaps, those dimensions are reduced by the above methods). The documents are clustered and new designs can be critiqued with respect to their position relative to old designs. A naive clustering algorithm runs in $O(N^2)$ where $N$ is the number of terms being clustered and all terms are assessed with respect to all other terms. For large archival collections, this is too slow. Various improvements over this naive strategy include ball trees, KD-trees, and cover trees~\cite{bey06}. While all these methods are useful, their ability to scale to very large examples is an open question. An alternative to traditional clustering methods is {\em canopy clustering}~\cite{775116,347123}. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical because of the size of the data set. The algorithm proceeds as follows:
\bi
\item Cheaply partition the data into overlapping subsets, called ``canopies'';
\item Perform more expensive clustering, but only within these canopies.
\ei
In the case of text mining applications like HAMLET, the initial cheap clustering can be performed using an inverted index; i.e. a sparse matrix representation in which, for each word, we can directly access the list of documents containing that word. The great majority of the documents, which have no words in common with the partial design constructed by the engineer, need never be considered. Thus we can use an inverted index to efficiently calculate a distance metric that is based on (say) the number of words two documents have in common.
\subsection{Cluster Classification}
A key task with HAMLET is recognizing which cluster is nearest the partial design offered by HAMLET's user. Once that cluster is identified, we can easily compute the delta between the current design and the centroid of that cluster. This delta comprises two lists:
\bi
\item What to add (the ``what else'' list);
\item What to remove (the ``what not'' list).
\ei
A key requirement for our classifier is that it handles low frequency counts: we anticipate that archival data sets will contain many terms, but only very few of them will appear in any particular document or the user's partial design. Rennie et al.~\cite{Rennie03tacklingthe} report a variant of a Naive Bayes classifier called TWCNB that uses normalized Tf*IDF counts with the following properties:
\bi
\item It handles low frequency data sets;
\item It performs almost as well as more complex support vector machine implementations;
\item Better yet, it is very simple to implement and runs in linear time (which makes it suitable for scaling up to a very large corpus).
\ei
Currently, we are experimenting with TWCNB within HAMLET.
\subsection{User Profiling with the Rocchio Algorithm}
All information retrieval systems, including HAMLET, suffer from false alarms; i.e. returning query results that the user does not find relevant. The Rocchio algorithm is a standard method for pruning the results of an information retrieval search with user feedback~\cite{hayes06,hayes07}. The algorithm reports the delta between the positive document vectors (those relating to the positive examples approved by the user) and the negative ones (those relating to the negative examples rejected by the user). The terms in this sum are weighted by tuning parameters that are determined via experimentation, but Joachims recommends weighting the positive information four times higher than the negative~\cite{joachims97}. We are currently exploring methods to augment Rocchio with TWCNB.
\newpage
\section{Next Steps}
\begin{raggedleft} {\em Plan to throw one away; you will anyhow. - Frederick Brooks, ``The Mythical Man-Month'' (1975)} \end{raggedleft} ~\\
In its current form, HAMLET is a tool suitable for in-house use in a research lab. Much work is required before it can be released as a stand-alone tool for end-users.
\bi
\item The current system was designed as a throw-away prototype to serve as an experimental workbench for numerous implementation ideas. For example, HAMLET is currently implemented in JAVA since this was a language familiar to the student programmers working on this project. The limitations of JAVA, in terms of runtime, are becoming apparent. We wish to avoid a port to ``C'' and are looking into better internal JAVA-based data structures.
\item As to other matters, the scalability of HAMLET's queries has yet to be demonstrated. Theoretically, the algorithms within HAMLET's current design run in linear time, or less. However, this scalability must be empirically verified.
\item Also, the user profile system that takes input from the users, then dynamically tunes the results of each query, is still being designed. Until that feature is designed, implemented, and tuned, users of HAMLET will suffer from too many false positives. This user profile feature must be built.
\item Finally, HAMLET's ability to link between documents in heterogeneous collections has yet to be tested. This test must be conducted.
\ei
\newpage
\appendix
\section{Model for Early Life cycle Decisions}
One reason that STEP/EXPRESS is not used more is that, from a cognitive perspective, it does not support the entire design cycle. Rather, it only supports the last stages of design and not all the interim steps along the way.
\subsection{Early Life cycle Models are Different}
STEP/EXPRESS is a {\em heavy-weight} ontology designed to unambiguously represent categorical design decisions. There is much evidence that before human designers make categorical decisions, they often prefer to ``play around a little''. Informal ultra-lightweight sketches are a common technique for illustrating and sharing intuitions. For example, in early life cycle requirements engineering, software engineers often pepper their discussions with numerous hastily-drawn sketches. Kremer argues convincingly that such visual forms have numerous advantages for acquiring knowledge~\cite{kremer98}. Other researchers take a similar view: ``Pictures are superior to texts in a sense that they are abstract, instantly comprehensible, and universal''~\cite{hirakawa94}.
Normally, hastily-drawn sketches are viewed as precursors to some other, more precise modeling techniques which necessitate further analysis. Such further formalization may not always be practical. The time and expertise required to build formal models do not accommodate the pace with which humans tend to revise their ideas. Also, such further formalization may not be advisable. Goel~\cite{goel92} studied the effects of diagram formalism on system design. In an {\em ill-structured diagram} (such as a freehand sketch using pencil and paper), the denotation and type of each visual element are ambiguous. In a {\em well-structured diagram} (such as those created using MacDraw or STEP/EXPRESS), each visual element clearly denotes one thing of one class only. Goel found that ill-structured tools generated more design variants --- more drawings, more ideas, more re-use of old ideas --- than well-structured tools.
\subsection{Beyond STEP: Early Life cycle Models}
The premise of the following early life cycle ultra-lightweight modeling frameworks is that (a)~such preference criteria are important; and that (b)~it is hence necessary to include preference constructs in the modeling language:
\bi
\item Mylopoulos' {\em soft-goal} graphs~\cite{mylo92,mylo99} represent knowledge about non-functional requirements. Primitives in soft goal modeling include statements of partial influence such as {\em helps} and {\em hurts}.
\item A common framework in the design rationale community is a ``questions-options-criteria'' (QOC) graph~\cite{shum94}. In QOC graphs, questions suggest options and deciding on a certain option can raise other questions. Options shown in a box denote selected options. Options are assessed by criteria and criteria are gradual knowledge; i.e. they {\em tend to support} or {\em tend to reject} options. QOCs can succinctly summarize lengthy debates; e.g. 480 sentences uttered in a debate between two analysts on interface options can be displayed in a QOC graph on a single page~\cite{maclean96}. Saaty's Analytic Hierarchy Process (AHP)~\cite{saaty80} is a variant of QOC.
\item Another widely used framework is Quality Function Deployment diagrams (QFD)~\cite{akao90}. Yet another framework is the DDP system developed by Cornford and Feather~\cite{feather03}. Whereas AHP and QOC propagate influences over hierarchies, QFD (and DDP) propagate influences over matrices.
\ei
\subsection{Why Record Early Life cycle Models?}
At first glance, the imprecision of the above modeling frameworks appears to make them inferior to precise and unambiguous STEP models. However, it is this very imprecision that makes them valuable. Firstly, one reason for preferring the above ultra-lightweight ontologies over STEP/EXPRESS is that, at least for group design activities:
\bi
\item During fast-paced design discussions, only brief statements can be collected. Otherwise, the dialogue within a meeting would be hampered by the time required to record, in tedious detail, the exchange of ideas.
\item If the ontology gets more elaborate, then experts may be unable to understand technical arguments from outside their own field of expertise.
\ei
Secondly, experience at NASA\footnote{This section is based on numerous interviews with NASA designers working on the reuse of old Apollo designs for NASA's next-generation Mars rocket. } suggests that a vital missing component of any one design is the set of all the other designs rejected before arriving at the final system.
Researchers in the design rationale community~\cite{moran96} advise that these rejected dead-end designs are useful to determine the preference criteria used in selecting the appropriate design.
\subsection{Summary}
Modeling languages that capture design option\underline{s} are different from modeling languages that record the final detailed design. One reason that STEP/EXPRESS is not used more is, we conjecture, that it is a ``final design capture'' tool and not an ``initial design options exploration'' tool. While ``final design capture'' tools are useful for transferring current designs between different CAD/CAM tools, such tools may not be preferred for reviewing and understanding designs decades after they are created. Nevertheless, the above early life cycle modeling tools may not be indicated for the task of comprehending archival technical documents. The augmentation of large archival storage with detailed design alternatives and preference knowledge (e.g. as captured in DDP, AHP, QFD, or soft goals) requires a detailed analysis of existing artifacts\footnote{e.g. see the {\em Software Archaeology} example offered by Grady Booch~\cite{booch04}.}. Hence it may not be indicated for archival storage where the corpus size is very large and there is a lack of access to the relevant domain experts. Recent advances in program comprehension and text mining indicate that it may be possible to infer such preference knowledge from multiple instances of finished designs. Hence, this work.
\newpage
\bibliographystyle{abbrv}
\addcontentsline{toc}{section}{References}
\bibliography{refs,refstimm}
\end{document}