@include "project.awk" @include "table.awk" """ Spectral Clustering ------------------- Spectral learners use eigenvectors to reason about data. Eigenvectors are a construct derived from a table of data that combine related features and ignore irrelevant features Eigenvectors are linear combinations of the raw features where the value _cos(θ)_ between a raw feature and an eigenvector shows the impact of the former on the later (if _cos(θ)=0_ then _θ=90_ degrees and that feature does not contribute to the overall direction of the data. An example of spectral learning is the spectral clusterer shown in this file. `Tiles` finds the first two dimensions of all rows then recursively partitions those two dimensions on their median point. ## Code ### Tiles This function first calls [project.auk](?project.auk) which populates two arras _(x,y)_ (you can't see it here since those arrays are part of the `_Tables_` structure. Then, it determines a minimum number of examples, less than which the recursion should stop (see `big`). The recursion then starts `tiles`. """ function tiles(_Tables,t, _Tile,m,cl,xy) { xy = project(_Tables,t) tiny = 4 pre = "" m = length(datas[xy]) big = 2*sqrt(m) cl = 1 watch= 1 centers="centroids" makeTable(names[xy],centers,_Tables) tiles0(_Tables[xy],_Tile) tiles4(1,m,1,m,_Tables[xy],_Tile,_Tables,cl) return centers } """ This code four indexes that mark the min and max range of the _(x,y)_ ranges (see the last line of `tiles` that calls `1,m,1,m` for _xmin,xmax,ymin,ymax_). """ function tiles0(_Table0,_Tile, x,y,z,d,at) { x = colnum0["$_XX"] y = colnum0["$_YY"] z = colnum0["_ZZ"] for(d in data0) { at[d]["d"] = d at[d]["x"] = data0[d][x] at[d]["y"] = data0[d][y] } asort(at,xs,"xsort") asort(at,ys,"ysort") } """ Three tricks: + Sort all the `x,y` values _once_ in `tiles0` then pass them sorted to the recursive call. + `Tiles4` calls `tile1` which recursively calls `tiles`. + The code carries around an id number called `cl` which may be updated `tile1` if ever a new cluster is created + We also carry around one additional table that stores the centroids of the new created tables. If we make a new leaf cluster, we also add the centroid of the new cluster to the _clusters_ table. Code: """ function tiles4(x0,x2,y0,y2,_Table0,_Tile,_Tables,cl, x,y) { x= x0 + int((x2 - x0)/2) y= y0 + int((y2 - y0)/2) cl= tile1(x0 ,x ,y0 ,y, _Table0,_Tile,_Tables,cl) cl= tile1(x0 ,x ,y+1 ,y2, _Table0,_Tile,_Tables,cl) cl= tile1(x+1 ,x2 ,y0 ,y, _Table0,_Tile,_Tables,cl) cl= tile1(x+1 ,x2 ,y+1 ,y2, _Table0,_Tile,_Tables,cl) return cl } """ The code in `tile1` runs as follows: + First, it reflects over the ranges `x0,x2` and `y0,y2` to see what examples fall into the range `x0` to `x2` and `y0` to `y2`. The index of each such row is stored in the `has` array. + Then the if statement that tests for `watch` optionally prints some debug information. + Thirdly, + If there are enough items in `has`, then recurse to try splitting that into four using `tiles4`. + Else, if there enough examples, create a new lead cluster and add all the items in `has` to a new table. + The new cluster gets a new id number which means that we have to increment the `++cl` cluster index value. Note that `tile1` and `tiles4` must return the cluster id after all the cluster values `cl` are updated. """ function tile1(x0,x2,y0,y2, _Table0,_Tile,_Tables,cl, x,y,has) { #----------------- # search : find things inside x0..x2 and y0..y2 for(x=x0; x<=x2; x++) for(y=y0; y<=y2; y++) if (xs[x]["d"] == ys[y]["d"]) has[x] = xs[x]["d"] #------------------- # show : some debug info (optional) if (watch) print sprintf("%3s: ",cl) pre x0,x2,y0,y2, "#" length(has) #------------------- # recurse : when there is enough data if (length(has) >= big) { pre = pre "|.." return tiles4(x0,x2,y0,y2,_Table0,_Tile,_Tables,cl) } # ------------------ # otherwise, new cluster: make a new leaf, only when there is enough data if (length(has) > tiny) makeNewTable(has, ++cl, # <=== we are now making a new cluster _Table0,_Tile,_Tables) #------------------ # keep track of the number of leaf clusters seen so far return cl } """ Note the trace when `watch=1` over a data set with 93 rows. Observe how, at the top-level, it works from _1,47_ then _48,93_ (the last number of each line is the number of examples in that split): 1: 1,47,1,47,#22 1: |..1,24,1,24,#8 2: |..1,24,25,47,#6 3: |..25,47,1,24,#0 3: |..25,47,25,47,#8 4: 1,47,48,93,#25 4: |..1,24,48,70,#7 5: |..1,24,71,93,#3 5: |..25,47,48,70,#3 5: |..25,47,71,93,#12 6: 48,93,1,47,#25 6: |..48,70,1,24,#0 6: |..48,70,25,47,#5 7: |..71,93,1,24,#16 8: |..71,93,25,47,#4 8: 48,93,48,93,#21 8: |..48,70,48,70,#12 9: |..48,70,71,93,#6 10: |..71,93,48,70,#1 10: |..71,93,71,93,#2 Here's some helper code. For each leaf cluster it: + Creates one new table for each leaf + Adds the centroid for that new table to a global table called `centers` that holds all the centroids. Code: """ function makeNewTable(has,cl,_Table0,_Tile,_Tables, z,one,d,row1,row2) { cl = cl*100 makeTable(name0,cl,_Tables) z = colnum0["_ZZ"] for(one in has) { d = has[one] copy(data0[d],row1) row1[z] = cl addRow(row1,_Tables[cl]) } centroid(_Tables[cl],row2) row2[z] = cl addRow(row2,_Tables[centers]) } """ Note the last three lines of the above: it adds the centroid of the new cluster to the _centroids_ table (which is called `centers`). ## Demo Code It prints the generated tables, including the `centers` table. """ function _tiles( _Tables, t,centers,name) { # tilesMain("data/nasa93dem.csv") tilesMain("/home/timm/svns/mine/mine/trunk/doc/13/fraun/fraun-oct15-2013.csv") } function tilesMain(f, _Tables, t,centers,name) { resetSeed() t=0 readcsv(f,t,_Tables) centers= tiles(_Tables,t) tableprint(_Tables[centers],"%5.3f") for(name in names) if(name != t) tableprint(_Tables[name],"%5.3f") } # # function _tiles1( _Table) { # readcsv("data/autompg.csv",0,_Table) # tiles(_Table[0]) # } """ Here are the centroids from [nasa93dem](https://raw.github.com/timm/auk/master/data/nasa93dem.csv), map onto an _(x,y)_ space. And here are the output (all the centroids and all the other clusters): [tiles.out](https://raw.github.com/timm/auk/master/var/tiles.out) Sample plotting data in a file called `mydata.sh`: gnuplot<