Thesis Background Notes
===========================================

* Stick to only semantic information and disregard structural information to increase generality across domains? (not limited to just source code but also plain English); we can always build the links like in your ICSE paper
* Compare different methods of dimensionality reduction and their effect on cluster accuracy? (LSI, FastMap, Tf*Idf term selection, InfoGain, ?)
* Compare different clustering methods (complete sub-graph based, k-means, GENIC, BUNCH)
* Use metrics proposed in ICSE '01 paper
* Should we use cluster labeling?

short-term goals:

* Research distance metrics' effect on clustering
* Research/implement ways of measuring clustering accuracy
* Write abstract
* Build datasets
* Implement LSI (can we just use PCA on tfidf?)
* Fix version of hamlet for demos
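
Possible starting point for the "measuring clustering accuracy" goal: a minimal sketch of cluster purity, a generic external measure that is not necessarily one of the ICSE '01 metrics. It assumes a hand-labeled reference decomposition exists; the `purity` function and the toy labels below are hypothetical and only illustrate the idea.

```python
from collections import Counter


def purity(assigned, reference):
    """Fraction of items that carry the majority reference label of their cluster.

    assigned  -- cluster id produced by the clustering method, one per item
    reference -- hand-assigned reference label, one per item (assumed available)
    """
    if len(assigned) != len(reference):
        raise ValueError("label lists must have the same length")
    if not assigned:
        raise ValueError("no items to score")

    # Count reference labels inside each produced cluster.
    by_cluster = {}
    for cluster_id, ref_label in zip(assigned, reference):
        by_cluster.setdefault(cluster_id, Counter())[ref_label] += 1

    # Each cluster contributes the size of its largest reference group.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(assigned)


# Toy data (made up): six documents, three produced clusters.
assigned = [0, 0, 0, 1, 1, 2]
reference = ["parser", "parser", "ui", "ui", "ui", "io"]
score = purity(assigned, reference)
```

Caveat: purity reaches 1.0 trivially when every item is its own cluster, so it should only be compared across runs with a similar number of clusters, or paired with a second measure.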