/timm's /charming /python /tricks
Download where.py.
Read more on How to be Charming (in Python).
001: """ 002: 003: (Can be view as 004: [html](http://menzies.us/cs472/?niching) or [raw 005: Python](http://unbox.org/open/trunk/472/14/spring/src/where.py).) 006: 007: How to Find Niche Solutions 008: =========================== 009: 010: ### Why Niche? 011: 012: When exploring complex multi-objective surfaces, it 013: is trite to summarize that space with one number 014: such _max distance from hell_ (where _hell_ is the 015: point in objective space where all goals have their 016: worse values). For example, all the following points 017: have the same _distance from hell_, but they reflect 018: very different solutions: 019: <center> 020: <img src="http://unbox.org/open/trunk/472/14/spring/doc/img/gunsbutter.png"> 021: </center> 022: 023: A better way to do it is _niching_; i.e. generate 024: the Pareto frontier, then cluster that frontier and 025: report a (small) number of random samples from each 026: cluster. In this approach, "distance from hell" can 027: still be used internally to guide a quick search for 028: places to expand the frontier. But after that, we 029: can isolate interesting different parts of the 030: solution space. 031: 032: For example, when exploring options for configuring 033: London Ambulances, Veerappa and Letier use optimization to 034: reject thousands of options (shown in red) in order 035: to isolate clusters of better solutions C1,C2,etc 036: (shown in green). In this example, the goal is to 037: minimize X and maximize Y . 038: <center> 039: <img src="http://unbox.org/open/trunk/472/14/spring/doc/img/amus.png" width=400> 040: </center> 041: 042: (REF: V. Veerappa and E. Letier, "Understanding 043: clusters of optimal solutions in multi-objective 044: decision problems, in RE' 2011, Trento, Italy, 045: 2011, pp. 89-98.) 046: 047: ### What is a Niche? 048: 049: According to Deb and Goldberg (1989) _a niche is 050: viewed as an organism's task in the environment and 051: a species is a collection of organisms with similar 052: features_. 053: 054: + REF: Kalyanmoy Deb and David 055: E. Goldberg. 1989. [An Investigation of Niche and 056: Species Formation in Genetic Function 057: Optimization](http://goo.gl/oLIxo1). In 058: Proceedings of the 3rd International Conference on 059: Genetic Algorithms, J. David Schaffer (Ed.). Morgan 060: Kaufmann Publishers Inc., San Francisco, CA, USA, 061: 42-50. 062: 063: Their definition seems to suggest that niching means 064: clustering in objective space and species are the 065: things that form in each such cluster. Strange to 066: say, some of their examples report _niches_ using 067: decision space so all we can say is that it is an 068: engineering decision whether or not you _niche_ in 069: objective space or _niche_ in decision space. Note 070: that the above example from Veerappa and Letier 071: build niches in objective space while my code, shown 072: below, builds niches in decision space. 073: 074: ### How to Niche? 075: 076: In any case, in decision or objective space, the 077: open issue is now to fun the niches? A naive 078: approach is to compute distances between all 079: individuals. This can be very slow, especially if 080: this calculation is called repeatedly deep inside 081: the inner-most loop of a program. In practice, any 082: program working with distance spends most of its 083: time computing those measures. Various proposals 084: exist to prevent that happening: 085: 086: + _Canopy clustering_: McCallum, A.; Nigam, K.; and 087: Ungar L.H. 
A better way to do it is _niching_; i.e. generate
the Pareto frontier, then cluster that frontier and
report a (small) number of random samples from each
cluster. In this approach, "distance from hell" can
still be used internally to guide a quick search for
places to expand the frontier. But after that, we
can isolate interesting, different parts of the
solution space.

For example, when exploring options for configuring
London ambulances, Veerappa and Letier used
optimization to reject thousands of options (shown
in red) in order to isolate clusters of better
solutions C1, C2, etc. (shown in green). In this
example, the goal is to minimize X and maximize Y.

<center>
<img src="http://unbox.org/open/trunk/472/14/spring/doc/img/amus.png" width=400>
</center>

(REF: V. Veerappa and E. Letier, "Understanding
clusters of optimal solutions in multi-objective
decision problems," in RE 2011, Trento, Italy,
2011, pp. 89-98.)

### What is a Niche?

According to Deb and Goldberg (1989), _a niche is
viewed as an organism's task in the environment and
a species is a collection of organisms with similar
features_.

+ REF: Kalyanmoy Deb and David E. Goldberg. 1989.
  [An Investigation of Niche and Species Formation
  in Genetic Function Optimization](http://goo.gl/oLIxo1).
  In Proceedings of the 3rd International Conference
  on Genetic Algorithms, J. David Schaffer (Ed.).
  Morgan Kaufmann Publishers Inc., San Francisco,
  CA, USA, 42-50.

Their definition seems to suggest that niching means
clustering in objective space, and that species are
the things that form in each such cluster. Strange
to say, some of their examples report _niches_ using
decision space, so all we can say is that it is an
engineering decision whether you _niche_ in
objective space or in decision space. Note that the
above example from Veerappa and Letier builds niches
in objective space, while my code, shown below,
builds niches in decision space.

### How to Niche?

In any case, in decision or objective space, the
open issue is how to find the niches. A naive
approach is to compute distances between all
individuals. This can be very slow, especially if
this calculation is called repeatedly deep inside
the inner-most loop of a program. In practice, any
program working with distances spends most of its
time computing those measures. Various proposals
exist to prevent that happening:

+ _Canopy clustering_: McCallum, A.; Nigam, K.; and
  Ungar, L.H. (2000), [Efficient Clustering of High
  Dimensional Data Sets with Application to
  Reference Matching](http://goo.gl/xwIzN),
  Proceedings of the sixth ACM SIGKDD international
  conference on Knowledge discovery and data
  mining, 169-178.
+ _Incremental stochastic k-means_: D. Sculley,
  [Web-Scale K-Means Clustering](http://goo.gl/V8BQs),
  WWW 2010, April 26-30, 2010, Raleigh, North
  Carolina, USA.
+ _Triangle inequality tricks_ (which work very well
  indeed): G. Hamerly,
  [Making k-means Even Faster](http://goo.gl/hk3Emn),
  2010 SIAM International Conference on Data Mining.

My own favorite trick is WHERE, shown below. It uses
a data mining trick to recursively divide the space
of decisions in two, then four, then eight, etc.
REF: [Local vs. global lessons for defect prediction
and effort estimation](http://menzies.us/pdf/12gense.pdf),
T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman,
F. Shull, B. Turhan, IEEE Transactions on Software
Engineering 39 (6), 822-834, 2013.

WHERE uses a linear-time trick (called the FastMap
heuristic) to find two distant solutions: these are
the _poles_, called _west_ and _east_. Note that the
following takes only _O(2N)_ distance calculations:

"""
def fastmap(m,data):
  "Divide data into two using distance to two distant items."
  one  = any(data)             # 1) pick anything
  west = furthest(m,one,data)  # 2) west is as far as you can go from anything
  east = furthest(m,west,data) # 3) east is as far as you can go from west
  c    = dist(m,west,east)
  # now find everyone's distance
  xsum, lst = 0.0,[]
  for one in data:
    a     = dist(m,one,west)
    b     = dist(m,one,east)
    x     = (a*a + c*c - b*b)/(2*c) # cosine rule
    xsum += x
    lst  += [(x,one)]
  # now cut data according to the mean distance
  cut, wests, easts = xsum/len(data), [], []
  for x,one in lst:
    where  = wests if x < cut else easts
    where += [one]
  return wests,west, easts,east
"""

In the above:

+ _m_ is some model that generates candidate
  solutions that we wish to niche.
+ _(west,east)_ are not _the_ most distant points
  (that would require _O(N*N)_ distance
  calculations), but they are at least very distant
  from each other.

Here, _x_ is each item's projection onto the line
joining the poles: if _a,b_ are an item's distances
to _west,east_ and _c_ is the distance between the
poles, then the cosine rule gives
_x = (a*a + c*c - b*b)/(2*c)_.

This code needs some helper functions. _dist_ uses
the standard Euclidean measure. Note that you tune
what it uses to define the niches (decisions or
objectives) using the _what_ parameter:

"""
def dist(m,i,j,
         what = lambda x: x.dec):
  "Euclidean distance 0 <= d <= 1 between decisions."
  d1,d2  = what(i), what(j)
  n      = len(d1)
  deltas = 0
  for d in range(n):
    n1      = norm(m, d, d1[d])
    n2      = norm(m, d, d2[d])
    inc     = (n1-n2)**2
    deltas += inc
  return deltas**0.5 / n**0.5
"""
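For example, once the rest of this file is loaded
(see the demo model and support code, below), the
same pair of individuals can be niched either way.
A small sketch:

    random.seed(1)
    m    = "model"
    i, j = candidate(m), candidate(m)
    scores(m,i) ; scores(m,j)                # fill in the objectives
    print dist(m,i,j)                        # distance between decisions
    print dist(m,i,j, what= lambda x: x.obj) # distance between objectives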
The _dist_ function normalizes all the raw values
zero to one, using _norm_:

"""
def norm(m,x,n):
  "Normalizes value n on variable x within model m, 0..1."
  return (n - lo(m,x)) / (hi(m,x) - lo(m,x) + 0.0001)
"""

Now we can define _furthest_:

"""
def furthest(m,i,all,
             init   = 0,
             better = lambda x,y: x>y):
  "Find which of all is furthest from 'i'."
  out,d = i,init
  for j in all:
    if not i == j:
      tmp = dist(m,i,j)
      if better(tmp,d): out,d = j,tmp
  return out
"""

WHERE finds everyone else's distance from the poles
and divides the data at the mean of those distances.
This all stops if:

+ Any division has _tooFew_ solutions (say, less
  than the _sqrt_ of the total number of solutions).
+ Something has gone horribly wrong and you are
  recursing _tooDeep_.

This code is controlled by a set of _slots_. For
example, if _slots.prune_ is true, we may ignore
some sub-tree (this process is discussed later on).
Also, if _slots.verbose_ is true, the _show_
function prints out a little tree showing the
progress (and to print indents in that tree, we use
the string _slots.b4_). For example, here's WHERE
dividing 100 solutions:

    100
    |.. 50
    |.. |.. 25
    |.. |.. |.. 11
    |.. |.. |.. |.. 6.
    |.. |.. |.. |.. 5.
    |.. |.. |.. 14
    |.. |.. |.. |.. 6.
    |.. |.. |.. |.. 8.
    |.. |.. 25
    |.. |.. |.. 12
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 7.
    |.. |.. |.. 13
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 8.
    |.. 50
    |.. |.. 25
    |.. |.. |.. 13
    |.. |.. |.. |.. 7.
    |.. |.. |.. |.. 6.
    |.. |.. |.. 12
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 7.
    |.. |.. 25
    |.. |.. |.. 11
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 6.
    |.. |.. |.. 14
    |.. |.. |.. |.. 7.
    |.. |.. |.. |.. 7.

Here are the slots:

"""
class Slots():
  "Place to read/write named slots."
  id = -1
  def __init__(i,**d):
    i.id = Slots.id = Slots.id + 1
    i.override(d)
  def override(i,d): i.__dict__.update(d); return i
  def __eq__(i,j)  : return i.id == j.id
  def __ne__(i,j)  : return i.id != j.id
  def __repr__(i)  : return '{' + showd(i.__dict__) + '}'

def where0(**other):
  return Slots(minSize  = 10,      # min leaf size
               depthMin = 2,       # no pruning till this depth
               depthMax = 10,      # max tree depth
               wriggle  = 0.2,     # min difference of 'better'
               prune    = True,    # pruning enabled?
               b4       = '|.. ',  # indent string
               verbose  = False,   # show trace info?
               hedges   = 0.38     # strict=0.38, relax=0.17
              ).override(other)
"""
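Note the _override_ idiom: callers name just the
slots they want to change, and everything else keeps
its default. For example:

    slots = where0(verbose = True, minSize = 20)
    print slots.minSize  # 20, as overridden
    print slots.depthMax # 10, the default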
WHERE returns clusters, where each cluster contains
multiple solutions.

"""
def where(m,data,slots=where0()):
  out = []
  where1(m,data,slots,0,out)
  return out

def where1(m, data, slots, lvl, out):
  def tooDeep(): return lvl > slots.depthMax
  def tooFew() : return len(data) < slots.minSize
  def show(suffix):
    if slots.verbose:
      print slots.b4*lvl + str(len(data)) + suffix
  if tooDeep() or tooFew():
    show(".")
    out += [data]
  else:
    show("")
    wests,west, easts,east = fastmap(m,data)
    goLeft, goRight = maybePrune(m,slots,lvl,west,east)
    if goLeft:
      where1(m, wests, slots, lvl+1, out)
    if goRight:
      where1(m, easts, slots, lvl+1, out)
"""

Is this useful? Well, in the following experiment, I
clustered 32, 64, 128, and 256 individuals using
either WHERE or a dumb greedy approach called GAC
that (a) finds everyone's closest neighbor; (b)
combines each such pair into a super-node; then (c)
repeats, recursively, for the super-nodes.

<center>
<img src="http://unbox.org/open/trunk/472/14/spring/doc/img/gacWHEREruntimes.png">
</center>

WHERE is _much_ faster than GAC since it builds a
tree of clusters of height log(N), making only
_O(2N)_ distance calculations at each level.

### Experimental Extensions

Lately I've been experimenting with a system that
prunes as it divides the data. GALE checks for
domination between the poles and ignores data in
halves with a dominated pole. This means that for
_N_ solutions we only ever have to evaluate
_2*log(N)_ of them, which is useful if each
evaluation takes a long time.

The niches found in this way contain non-dominated
poles; i.e. they are approximations to the Pareto
frontier. Preliminary results show that this is a
useful approach, but you should treat those results
with a grain of salt.

In any case, this code supports that pruning as an
optional extra (enabled using the _slots.prune_
flag). In summary, this code says that if the scores
for the poles differ by more than _slots.wriggle_
and one pole has a better score than the other, then
ignore the half with the worse pole:

"""
def maybePrune(m,slots,lvl,west,east):
  "Usually, go left then right, unless dominated."
  goLeft, goRight = True,True # default
  if slots.prune and lvl >= slots.depthMin:
    sw = scores(m, west)
    se = scores(m, east)
    if abs(sw - se) > slots.wriggle: # big enough to consider
      if se > sw: goLeft  = False    # no left
      if sw > se: goRight = False    # no right
  return goLeft, goRight
"""

Note that I do not allow pruning until we have
descended at least _slots.depthMin_ into the tree.
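To see that rule in isolation, here is a sketch with
made-up numbers: with _wriggle_ = 0.2, pole scores of
0.5 (west) and 0.8 (east) differ by 0.3, which is big
enough to prune, so the west half is ignored:

    def pruneDecision(sw, se, wriggle):
      goLeft = goRight = True
      if abs(sw - se) > wriggle:
        if se > sw: goLeft  = False
        if sw > se: goRight = False
      return goLeft, goRight

    print pruneDecision(0.5, 0.8, 0.2) # (False, True)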
Support Code
------------

### Dull, low-level stuff

"""
import sys,math,random
sys.dont_write_bytecode = True

def go(f):
  "A decorator that runs code at load time."
  print "\n# ---|", f.__name__,"|-----------------"
  if f.__doc__: print "#", f.__doc__
  f()

# random stuff
by   = lambda x: random.uniform(0,x)
seed = random.seed
any  = random.choice

# pretty-prints for lists
def gs(lst): return [g(x) for x in lst]
def g(x)   : return float('%g' % x)
"""

### More interesting, low-level stuff

"""
def timing(f,repeats=10):
  "How long does 'f' take to run?"
  import time
  time1 = time.clock()
  for _ in range(repeats):
    f()
  return (time.clock() - time1)*1.0/repeats

def showd(d):
  "Pretty print a dictionary."
  def one(k,v):
    if isinstance(v,list):
      v = gs(v)
    if isinstance(v,float):
      return ":%s %g" % (k,v)
    return ":%s %s" % (k,v)
  return ' '.join([one(k,v) for k,v in
                   sorted(d.items())
                   if not "_" in k])

class Num:
  "An accumulator for numbers."
  def __init__(i): i.n = i.m2 = i.mu = 0.0
  def s(i): return (i.m2/(i.n - 1))**0.5 # standard deviation
  def __add__(i,x): # incremental update of mean and m2 (Welford-style)
    i.n  += 1
    delta = x - i.mu
    i.mu += delta*1.0/i.n
    i.m2 += delta*(x - i.mu)
"""

### Model-specific Stuff

WHERE talks to models via the following
model-specific functions. Here, we must invent some
made-up model that builds individuals with five
decisions and four objectives. In practice, you
would **start** here to build hooks from WHERE into
your model (which is the **m** passed in to these
functions):

"""
def decisions(m) : return [0,1,2,3,4]
def objectives(m): return [0,1,2,3]
def lo(m,x)      : return 0.0
def hi(m,x)      : return 1.0
def w(m,o)       : return 1 # weight: 1 to maximize, -1 to minimize
def score(m, individual):
  d, o = individual.dec, individual.obj
  o[0] = d[0] * d[1]
  o[1] = d[2] ** d[3]
  o[2] = 2.0*d[3]*d[4] / (d[3] + d[4])
  o[3] = d[2]/(1 + math.e**(-1*d[2]))
  individual.changed = True
"""
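For instance, to hook WHERE into a model whose
decisions have different legal ranges, you might
replace the hooks above along these lines (a
hypothetical sketch; _RANGES_ is an invented name,
not used elsewhere in this file):

    RANGES = {0: (0,10), 1: (-1,1), 2: (0,100)} # made-up legal ranges

    def decisions(m) : return sorted(RANGES.keys())
    def lo(m,x)      : return RANGES[x][0]
    def hi(m,x)      : return RANGES[x][1]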
### Model-general Stuff

Using the model-specific stuff, WHERE defines some
useful general functions:

"""
def some(m,x):
  "With variable x of model m, pick one value at random."
  return lo(m,x) + by(hi(m,x) - lo(m,x))

def candidate(m):
  "Return an unscored individual."
  return Slots(changed = True,
               scores  = None,
               obj     = [None] * len(objectives(m)),
               dec     = [some(m,d)
                          for d in decisions(m)])

def scores(m,t):
  "Score an individual."
  if t.changed:
    score(m,t)
    new, n = 0, 0
    for o in objectives(m):
      x    = t.obj[o]
      n   += 1
      tmp  = norm(m,o,x)**2
      if w(m,o) < 0:
        tmp = 1 - tmp
      new += tmp
    t.scores  = (new**0.5)*1.0 / (n**0.5)
    t.changed = False
  return t.scores
"""

### Demo stuff

To run these at load time, add _@go_ (uncommented)
on the line before.

Checks that we can find close and distant things:

"""
@go
def _distances():
  def closest(m,i,all):
    return furthest(m,i,all,10**32,lambda x,y: x < y)
  random.seed(1)
  m   = "any"
  pop = [candidate(m) for _ in range(4)]
  for i in pop:
    j = closest(m,i,pop)
    k = furthest(m,i,pop)
    print "\n",\
          gs(i.dec), g(scores(m,i)),"\n",\
          gs(j.dec),"closest ", g(dist(m,i,j)),"\n",\
          gs(k.dec),"furthest", g(dist(m,i,k))
    print i
"""

A standard call to WHERE, pruning disabled:

"""
@go
def _where():
  random.seed(1)
  m, max, pop, kept = "model",100, [], Num()
  for _ in range(max):
    one  = candidate(m)
    kept + scores(m,one)
    pop += [one]
  slots = where0(verbose = True,
                 minSize = max**0.5,
                 prune   = False,
                 wriggle = 0.3*kept.s())
  n = 0
  for leaf in where(m, pop, slots):
    n += len(leaf)
  print n,slots
"""

Compares WHERE to GAC:

"""
@go
def _whereTiming():
  def allPairs(data):
    n = 8.0/3*(len(data)**2 - 1) # number of distance evals in GAC vs WHERE
    for _ in range(int(n+0.5)):
      d1 = any(data)
      d2 = any(data)
      dist("M",d1,d2)
  random.seed(1)
  for max in [32,64,128,256]:
    m, pop, kept = "model",[], Num()
    for _ in range(max):
      one  = candidate(m)
      kept + scores(m,one)
      pop += [one]
    slots = where0(verbose  = False,
                   minSize  = 2, # emulate GAC
                   depthMax = 1000000,
                   prune    = False,
                   wriggle  = 0.3*kept.s())
    t1 = timing(lambda: where(m, pop, slots),10)
    t2 = timing(lambda: allPairs(pop),10)
    print max,t1,t2, int(100*t2/t1)
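# Finally, to close the loop on the introduction (report a small number of
# random samples from each cluster), here is a sketch of one more demo in
# the same style. Add @go on the line before to run it at load time:

def _representatives():
  "Print one random representative per niche."
  random.seed(1)
  m   = "model"
  pop = [candidate(m) for _ in range(100)]
  for leaf in where(m, pop, where0(minSize = 100**0.5, prune = False)):
    print len(leaf), gs(any(leaf).dec) # 'any' is random.choice (see above)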
This file is part of Timm's charming Python tricks.
© 2014, Tim Menzies:
tim.menzies@gmail.com,
http://menzies.us.
Timm's charming Python tricks is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Timm's charming Python tricks is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this software. If not, see http://www.gnu.org/licenses.