BEGIN { FS=OFS="," }
@include "lib.awk"
@include "reader.awk"
@include "dist.awk"
"""

Spectral Projection
-------------------

Spectral learners use eigenvectors to reason about data. Eigenvectors
are a construct, derived from a table of data, that combines related
features and ignores irrelevant ones. Eigenvectors are linear
combinations of the raw features, where the value _cos(θ)_ between a
raw feature and an eigenvector shows the impact of the former on the
latter (if _cos(θ)=0_ then _θ=90_ degrees, and that feature does not
contribute to the overall direction of the data).

An example of spectral learning is the spectral clusterer implemented
in [tiles.auk](?tiles.auk). But before we look at that, we need the
`project` support tool shown in this file.

As a pre-processor to spectral learning, the `project` code projects
N-dimensional data onto a 2-d space defined by an approximation to the
major eigenvector of some data. This approximation is generated using
the FASTMAP heuristic, which finds the dimension of greatest
variability. FASTMAP is similar to Principal Component Analysis (PCA):
it defines the direction of greatest variability to be a line drawn
between the two most distant points in the examples. This line
approximates the first component of PCA, but it is generated much
faster (linear time for FASTMAP, quadratic time for PCA). The second
dimension is defined by drawing a line perpendicular to the first.

Studies have shown the effectiveness of FASTMAP for software
engineering data:

+ _Tim Menzies, Andrew Butcher, David Cok, Andrian Marcus, Lucas
  Layman, Forrest Shull, Burak Turhan, Thomas Zimmermann, "Local versus
  Global Lessons for Defect Prediction and Effort Estimation," IEEE
  Transactions on Software Engineering, vol. 39, no. 6, pp. 822-834,
  June 2013._

The FASTMAP heuristic works as follows:

+ Pick any example _Z_ at random;
+ Find the example _X_ furthest from _Z_;
+ Find the example _Y_ furthest from _X_;
+ Find the distance _c=dist(X,Y)_;
+ For all examples _E_, find the distances _a,b_ to _X,Y_.

## Details

+ The raw distance measure is the Euclidean measure used before in
  [dist.auk](?dist.auk).
+ From the cosine rule, we can project an example onto the line from
  _X_ to _Y_ at _x=(a^2 + c^2 - b^2)/(2c)_.
+ From Pythagoras, we can then find _y=sqrt(a^2 - x^2)_.

## Code

Top-level driver:

"""
function project(_Table,t,    d,east,west,x,y) {
  d    = anyi(data)                # Z: any example, picked at random
  east = furthest(d,   _Table[t])  # X: the example furthest from Z
  west = furthest(east,_Table[t])  # Y: the example furthest from X
  project0(east,west,_Table[t],x,y)
  return widen(_Table,t,x,y)
}
"""

The worker:

+ If we ever find something further away than `c` (by more than the 5%
  slack allowed by `bigger`), then start again: see the recursive calls
  to `project0`.

Note that this worker populates two arrays `x,y`, whose indexes are row
numbers and whose values are the _(x,y)_ coordinates of each row _d_.

"""
function project0(east,west,_Table,x,y,    a,b,c,d,some,bigger) {
  printf("+")               # progress marker: one "+" per (re)start
  bigger = 1.05
  some   = 0.000001         # handles a tedious div/zero error
  c      = dist(data[east],data[west],_Table)
  for(d in data) {
    a = dist(data[d],data[east],_Table)
    b = dist(data[d],data[west],_Table)
    if (b > c*bigger) return project0(east,d,_Table,x,y)
    if (a > c*bigger) return project0(d,west,_Table,x,y)
    printf(".")             # progress marker: one "." per row
    x[d] = (a^2 + c^2 - b^2) / (2*c + some)  # cosine rule
    y[d] = (a^2 - x[d]^2)^0.5                # Pythagoras
  }
}
"""

Once these _(x,y)_ values are calculated for each example, we can build
a new table with extra columns for these new values (the code below
also adds a `$_Hell` score for each row and a `_ZZ` column, initialized
to zero, for later use).
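
To see the arithmetic in action, here is a tiny stand-alone sanity
check using made-up distances (this sketch is not part of auk): let
_a=4, b=3, c=5_; i.e., _X,E,Y_ form a 3-4-5 right triangle with the
right angle at _E_.

    # hypothetical check of the projection math; run with: gawk -f check.awk
    BEGIN {
      a = 4; b = 3; c = 5            # made-up dist(E,X), dist(E,Y), dist(X,Y)
      x = (a^2 + c^2 - b^2) / (2*c)  # cosine rule: (16+25-9)/10 = 3.2
      y = (a^2 - x^2)^0.5            # Pythagoras:  sqrt(5.76)   = 2.4
      print x, y                     # prints: 3.2 2.4
    }

As a cross-check, the distance from the projected point _(3.2,2.4)_
back to _Y_ at _(c,0)=(5,0)_ is _sqrt(1.8^2 + 2.4^2) = 3 = b_, as
required.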
""" function widen(_Table,t,x,y, w,d,wider,adds,c,hell) { copy(name[t],adds) push("$_XX",adds) push("$_YY",adds) push("$_Hell",adds) push("_ZZ",adds) w = "_" t makeTable(adds,w,_Table) for(d in data[t]) { hell= fromHell(data[t][d],_Table[t]) copy(data[t][d],wider) push(x[d], wider) push(y[d], wider) push(hell,wider) push(0, wider) addRow(wider,_Table[w]) } return w } """ ## Output Here's the 26 dimensions of [nasa93demo](https://raw.github.com/timm/auk/master/data/nasa93dem.csv) """