/timm's /charming /python /tricks

where.py

Download where.py.
Read more on How to be Charming (in Python).


001: """
002: 
003: (Can be view as
004: [html](http://menzies.us/cs472/?niching) or [raw
005: Python](http://unbox.org/open/trunk/472/14/spring/src/where.py).)
006: 
007: How to Find Niche Solutions
008: ===========================
009: 
010: ### Why Niche?
011: 
012: When exploring complex multi-objective surfaces, it
013: is trite to summarize that space with one number
014: such _max distance from hell_ (where _hell_ is the
015: point in objective space where all goals have their
016: worse values). For example, all the following points
017: have the same _distance from hell_, but they reflect
018: very different solutions:
019: <center>
020: <img src="http://unbox.org/open/trunk/472/14/spring/doc/img/gunsbutter.png">
021: </center>
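
To make this concrete, here is a minimal sketch (my
illustration, not part of WHERE) of that measure,
assuming every objective is normalized 0..1 and
maximized, so that _hell_ is the origin:

    def distanceFromHell(objs):
      "Euclidean distance from the point where every goal is worst."
      return (sum(o**2 for o in objs) / len(objs))**0.5

    # two very different trade-offs, identical scores:
    # distanceFromHell([1.0, 0.0]) == distanceFromHell([0.0, 1.0]) == 0.7071...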

A better way to do it is _niching_; i.e. generate
the Pareto frontier, then cluster that frontier and
report a (small) number of random samples from each
cluster.  In this approach, "distance from hell" can
still be used internally to guide a quick search for
places to expand the frontier. But after that, we
can isolate interestingly different parts of the
solution space.

For example, when exploring options for configuring
London Ambulances, Veerappa and Letier use optimization to
reject thousands of options (shown in red) in order
to isolate clusters of better solutions C1,C2,etc
(shown in green). In this example, the goal is to
minimize X and maximize Y.
<center>
<img src="http://unbox.org/open/trunk/472/14/spring/doc/img/amus.png" width=400>
</center>

(REF: V. Veerappa and E. Letier, "Understanding
 clusters of optimal solutions in multi-objective
 decision problems," in RE'11, Trento, Italy,
 2011, pp. 89-98.)

### What is a Niche?

According to Deb and Goldberg (1989), _a niche is
viewed as an organism's task in the environment and
a species is a collection of organisms with similar
features_.

+ REF: Kalyanmoy Deb and David
E. Goldberg. 1989. [An Investigation of Niche and
Species Formation in Genetic Function
Optimization](http://goo.gl/oLIxo1). In
Proceedings of the 3rd International Conference on
Genetic Algorithms, J. David Schaffer (Ed.). Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA,
42-50.

Their definition seems to suggest that niching means
clustering in objective space and species are the
things that form in each such cluster. Strange to
say, some of their examples report _niches_ using
decision space, so all we can say is that it is an
engineering decision whether you _niche_ in
objective space or in decision space. Note
that the above example from Veerappa and Letier
builds niches in objective space while my code, shown
below, builds niches in decision space.

### How to Niche?

In either space, decision or objective, the
open issue is how to find the niches. A naive
approach is to compute the distances between all
individuals. This can be very slow, especially if
this calculation is called repeatedly deep inside
the inner-most loop of a program. In practice, any
program working with distances spends most of its
time computing those measures. Various proposals
exist to prevent that from happening:

+ _Canopy clustering_: McCallum, A.; Nigam, K.; and
   Ungar L.H. (2000) [Efficient Clustering of High
   Dimensional Data Sets with Application to
   Reference Matching](http://goo.gl/xwIzN),
   Proceedings of the sixth ACM SIGKDD international
   conference on Knowledge discovery and data
   mining, 169-178.
+ _Incremental stochastic k-means_: [Web-Scale
  K-Means Clustering](http://goo.gl/V8BQs),
  WWW 2010, April 26-30, 2010, Raleigh, North
  Carolina, USA.
+ _Triangle inequality tricks_ (which work very well
  indeed; see the sketch after this list): [Making k-means
  Even Faster](http://goo.gl/hk3Emn).
  G. Hamerly, 2010 SIAM International Conference on
  Data Mining.
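
As a taste of that last idea, here is a sketch (my own
illustration, not Hamerly's full algorithm) of one
triangle-inequality trick. Since
_dist(x,c) >= dist(best,c) - dist(x,best)_, whenever
_dist(best,c) >= 2*dist(x,best)_ the candidate _c_
cannot beat _best_, so its distance to _x_ need never
be computed (center-to-center distances are assumed
cheap or cached, as in k-means):

    def closestCenter(x, centers, dist):
      "Find x's nearest center, skipping hopeless candidates."
      best, bestD = centers[0], dist(x, centers[0])
      for c in centers[1:]:
        if dist(best, c) < 2*bestD: # else c cannot win: skip it
          d = dist(x, c)
          if d < bestD: best, bestD = c, d
      return best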

My own favorite trick is WHERE, shown below. It uses
a data mining trick to recursively divide the space
of decisions in two, then four, then eight,
etc. REF: [Local vs. global lessons for defect
prediction and effort
estimation](http://menzies.us/pdf/12gense.pdf) T.
Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F.
Shull, B. Turhan, IEEE Transactions on Software
Engineering 39 (6), 822-834, 2013.

WHERE uses a linear-time trick (called the FastMap
heuristic) to find two distant solutions; these are
the _poles_ called _west_ and _east_. Note that the
following takes only _O(2N)_ distance calculations:

"""
def fastmap(m,data):
  "Divide data into two using distance to two distant items."
  one  = any(data)             # 1) pick anything
  west = furthest(m,one,data)  # 2) west is as far as you can go from anything
  east = furthest(m,west,data) # 3) east is as far as you can go from west
  c    = dist(m,west,east)
  # now find everyone's distance
  xsum, lst = 0.0,[]
  for one in data:
    a = dist(m,one,west)
    b = dist(m,one,east)
    x = (a*a + c*c - b*b)/(2*c) # cosine rule
    xsum += x
    lst  += [(x,one)]
  # now cut data according to the mean distance
  cut, wests, easts = xsum/len(data), [], []
  for x,one in lst:
    where = wests if x < cut else easts
    where += [one]
  return wests,west, easts,east
138: """
139: 
140: In the above:
141: 
142: + _m_ is some model that generates candidate
143:   solutions that we wish to niche.
144: + _(west,east)_ are not _the_ most distant points
145:   (that would require _N*N) distance
146:   calculations). But they are at least very distant
147:   to each other.
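
As a quick sanity check of the cosine rule used above
(a stand-alone illustration): a point at distance _a_
from _west_ and _b_ from _east_, with the poles _c_
apart, projects onto the west-east axis at
_x = (a*a + c*c - b*b)/(2*c)_:

    def project(a, b, c): return (a*a + c*c - b*b) / (2.0*c)

    # project(3, 4, 5) ==> 1.8  (a 3-4-5 triangle)
    # project(0, 5, 5) ==> 0.0  (sitting on west)
    # project(5, 0, 5) ==> 5.0  (sitting on east)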

This code needs some helper functions. _dist_ uses
the standard Euclidean measure. Note that you tune
what it uses to define the niches (decisions or
objectives) using the _what_ parameter:

"""
def dist(m,i,j,
         what = lambda x: x.dec):
  "Euclidean distance 0 <= d <= 1 between decisions"
  d1,d2  = what(i), what(j)
  n      = len(d1)
  deltas = 0
  for d in range(n):
    n1 = norm(m, d, d1[d])
    n2 = norm(m, d, d2[d])
    inc = (n1-n2)**2
    deltas += inc
  return deltas**0.5 / n**0.5
"""

_dist_ normalizes all the raw values to the range zero
to one, using _norm_:

"""
def norm(m,x,n) :
  "Normalizes value n on variable x within model m, 0..1."
  # the small constant dodges divide-by-zero when hi(m,x) == lo(m,x)
  return (n - lo(m,x)) / (hi(m,x) - lo(m,x) + 0.0001)
"""

Now we can define _furthest_:

"""
def furthest(m,i,all,
             init = 0,
             better = lambda x,y: x>y):
  "find which of all is furthest from 'i'"
  out,d = i,init
  for j in all:
    if not i == j:
      tmp = dist(m,i,j)
      if better(tmp,d): out,d = j,tmp
  return out
"""

WHERE finds everyone else's distance from the poles
and divides the data at the mean of those
distances. The recursion stops if:

+ Any division has _tooFew_ solutions (say,
  less than _sqrt_ of the total number of
  solutions); or
+ Something has gone horribly wrong and you are
  recursing _tooDeep_.

This code is controlled by a set of _slots_. For
example, if _slots.prune_ is true, we may ignore
some sub-tree (this process is discussed, later on).
Also, if _slots.verbose_ is true, the _show_
function prints out a little tree showing the
progress (and to print indents in that tree, we use
the string _slots.b4_). For example, here's WHERE
dividing 100 solutions:

    100
    |.. 50
    |.. |.. 25
    |.. |.. |.. 11
    |.. |.. |.. |.. 6.
    |.. |.. |.. |.. 5.
    |.. |.. |.. 14
    |.. |.. |.. |.. 6.
    |.. |.. |.. |.. 8.
    |.. |.. 25
    |.. |.. |.. 12
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 7.
    |.. |.. |.. 13
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 8.
    |.. 50
    |.. |.. 25
    |.. |.. |.. 13
    |.. |.. |.. |.. 7.
    |.. |.. |.. |.. 6.
    |.. |.. |.. 12
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 7.
    |.. |.. 25
    |.. |.. |.. 11
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 6.
    |.. |.. |.. 14
    |.. |.. |.. |.. 7.
    |.. |.. |.. |.. 7.

Here are the slots:

"""
class Slots():
  "Place to read/write named slots."
  id = -1
  def __init__(i,**d) :
    i.id = Slots.id = Slots.id + 1 # give each instance a unique id
    i.override(d)
  def override(i,d): i.__dict__.update(d); return i
  def __eq__(i,j)  : return i.id == j.id
  def __ne__(i,j)  : return i.id != j.id
  def __repr__(i)  : return '{' + showd(i.__dict__) + '}'

def where0(**other):
  return Slots(minSize  = 10,    # min leaf size
               depthMin= 2,      # no pruning till this depth
               depthMax= 10,     # max tree depth
               wriggle = 0.2,    # min difference of 'better'
               prune   = True,   # pruning enabled?
               b4      = '|.. ', # indent string
               verbose = False,  # show trace info?
               hedges  = 0.38    # strict=0.38,relax=0.17
   ).override(other)
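# For example (a hypothetical usage sketch, not called below):
#
#   s = where0(verbose=True) # start from the defaults, then override some
#   s.verbose                # ==> True
#   s.minSize                # ==> 10 (an untouched default)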
267: """
268: 
269: WHERE returns clusters, where each cluster contains
270: multiple solutions.
271: 
272: """
273: def where(m,data,slots=where0()):
274:   out = []
275:   where1(m,data,slots,0,out)
276:   return out
277: 
278: def where1(m, data, slots, lvl, out):
279:   def tooDeep(): return lvl > slots.depthMax
280:   def tooFew() : return len(data) < slots.minSize
281:   def show(suffix): 
282:     if slots.verbose: 
283:       print slots.b4*lvl + str(len(data)) + suffix
284:   if tooDeep() or tooFew():
285:     show(".")
286:     out += [data]
287:   else:
288:     show("")
289:     wests,west, easts,east = fastmap(m,data)
290:     goLeft, goRight = maybePrune(m,slots,lvl,west,east)
291:     if goLeft: 
292:       where1(m, wests, slots, lvl+1, out)
293:     if goRight: 
294:       where1(m, easts, slots, lvl+1, out)
295: """

Is this useful? Well, in the following experiment, I
clustered 32, 64, 128, 256 individuals using either WHERE or
a dumb greedy approach called GAC that (a) finds
everyone's closest neighbor; (b) combines each such
pair into a super-node; (c) then repeats
(recursively) for the super-nodes.

<center>
<img src="http://unbox.org/open/trunk/472/14/spring/doc/img/gacWHEREruntimes.png">
</center>

WHERE is _much_ faster than GAC since it builds
a tree of clusters of height log(N) by making, at each
level, only O(2N) calls to FastMap; GAC's repeated
nearest-neighbor search needs closer to O(N*N)
distance calculations.
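
Back-of-the-envelope (an illustration, not a measurement):

    N = 256
    2*N*math.log(N,2) # WHERE: O(2N) calls per level, log2(N) levels ==>  4096
    N*N               # GAC-style all-pairs                          ==> 65536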

### Experimental Extensions

Lately I've been experimenting with a system that
prunes as it divides the data. GALE checks for
domination between the poles and ignores data in
halves with a dominated pole. This means that for
_N_ solutions we only ever have to evaluate
_2*log(N)_ of them, which is useful if each
evaluation takes a long time.

The niches found in this way
contain non-dominated poles; i.e. they are
approximations to the Pareto frontier.
Preliminary results show that this is a useful
approach but you should treat those results with a
grain of salt.

In any case, this code supports that pruning as an
optional extra (enabled using the
_slots.prune_ flag). In summary, this code says that if
the scores for the poles differ by more than
_slots.wriggle_, and one pole has a better score than
the other, then ignore the worse pole's half.

338: """
339: def maybePrune(m,slots,lvl,west,east):
340:   "Usually, go left then right, unless dominated."
341:   goLeft, goRight = True,True # default
342:   if  slots.prune and lvl >= slots.depthMin:
343:     sw = scores(m, west)
344:     se = scores(m, east)
345:     if abs(sw - se) > slots.wriggle: # big enough to consider
346:       if se > sw: goLeft   = False   # no left
347:       if sw > se: goRight  = False   # no right
348:   return goLeft, goRight
349: """
350: 
351: Note that I do not allow pruning until we have
352: descended at least _slots.depthMin_ into the tree.
353: 
354: 
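
For example (a made-up illustration), suppose
_slots.wriggle = 0.2_ and, once past _slots.depthMin_,
the poles score _sw = 0.9_ and _se = 0.5_:

    abs(0.9 - 0.5) > 0.2 # ==> True: west wins by enough...
    # ...so goLeft=True, goRight=False: the east half is never explored
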
Support Code
------------

### Dull, low-level stuff

"""
import sys,math,random
sys.dont_write_bytecode = True

def go(f):
  "A decorator that runs code at load time."
  print "\n# ---|", f.__name__,"|-----------------"
  if f.__doc__: print "#", f.__doc__
  f()

# random stuff
by   = lambda x: random.uniform(0,x)
seed = random.seed
any  = random.choice

# pretty-prints for lists
def gs(lst) : return [g(x) for x in lst]
def g(x)    : return float('%g' % x)
"""

### More interesting, low-level stuff

"""
def timing(f,repeats=10):
  "How long does 'f' take to run?"
  import time
  time1 = time.clock()
  for _ in range(repeats):
    f()
  return (time.clock() - time1)*1.0/repeats

def showd(d):
  "Pretty print a dictionary."
  def one(k,v):
    if isinstance(v,list):
      v = gs(v)
    if isinstance(v,float):
      return ":%s %g" % (k,v)
    return ":%s %s" % (k,v)
  return ' '.join([one(k,v) for k,v in
                    sorted(d.items())
                     if not "_" in k])

class Num:
  "An accumulator for numbers, using Welford's online algorithm."
  def __init__(i): i.n = i.m2 = i.mu = 0.0
  def s(i)       : return (i.m2/(i.n - 1))**0.5 # sample standard deviation
  def __add__(i,x): # 'num + x' folds x into the accumulator, in place
    i.n   += 1
    delta  = x - i.mu
    i.mu  += delta*1.0/i.n
    i.m2  += delta*(x - i.mu)
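# e.g. (a quick check): after adding the eight numbers 2,4,4,4,5,5,7,9
# one at a time via 'num + x', num.mu is 5.0 and the sample deviation
# num.s() is (32/7.0)**0.5, or about 2.14.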
412: """
413: 
414: ### Model-specific Stuff
415: 
416: WHERE talks to models via the the following model-specific functions.
417: Here, we must invent some made-up model that builds
418: individuals with 4 decisions and 3 objectives.
419: In practice, you would **start** here to build hooks from WHERE into your model
420: (which is the **m** passed in to these functions).
421: 
422: """
423: def decisions(m) : return [0,1,2,3,4]
424: def objectives(m): return [0,1,2,3]
425: def lo(m,x)      : return 0.0
426: def hi(m,x)      : return  1.0
427: def w(m,o)       : return 1 # min,max is -1,1
428: def score(m, individual):
429:   d, o = individual.dec, individual.obj
430:   o[0] = d[0] * d[1]
431:   o[1] = d[2] ** d[3]
432:   o[2] = 2.0*d[3]*d[4] / (d[3] + d[4])
433:   o[3] =  d[2]/(1 + math.e**(-1*d[2]))
434:   individual.changed = True
435: """
436: 
437: The call to 
438: ### Model-general stuff
439: 
440: Using the model-specific stuff, WHERE defines some
441: useful general functions.
442: 
443: """
def some(m,x) :
  "With variable x of model m, pick one value at random."
  return lo(m,x) + by(hi(m,x) - lo(m,x))

def candidate(m):
  "Return an unscored individual."
  return Slots(changed = True,
            scores=None,
            obj = [None] * len(objectives(m)),
            dec = [some(m,d)
                   for d in decisions(m)])
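# e.g. (an illustration): candidate(m).dec is a list of five random
# numbers, one per decision, each between lo(m,x) and hi(m,x).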

def scores(m,t):
  "Score an individual."
  if t.changed:
    score(m,t)
    new, n = 0, 0
    for o in objectives(m):
      x   = t.obj[o]
      n  += 1
      tmp = norm(m,o,x)**2
      if w(m,o) < 0:    # for minimization goals,
        tmp = 1 - tmp   # flip the normalized value
      new += tmp
    t.scores = (new**0.5)*1.0 / (n**0.5)
    t.changed = False
  return t.scores
471: """
472: 
473: ### Demo stuff
474: 
475: To run these at load time, add _@go_ (uncommented) on the line before.
476: 
477: Checks that we can find lost and distant things:
478: 
479: """
@go
def _distances():
  def closest(m,i,all):
    return furthest(m,i,all,10**32,lambda x,y: x < y)
  random.seed(1)
  m   = "any"
  pop = [candidate(m) for _ in range(4)]
  for i in pop:
    j = closest(m,i,pop)
    k = furthest(m,i,pop)
    print "\n",\
          gs(i.dec), g(scores(m,i)),"\n",\
          gs(j.dec),"closest ", g(dist(m,i,j)),"\n",\
          gs(k.dec),"furthest", g(dist(m,i,k))
    print i
"""

A standard call to WHERE, pruning disabled:

"""
@go
def _where():
  random.seed(1)
  m, max, pop, kept = "model",100, [], Num()
  for _ in range(max):
    one = candidate(m)
    kept + scores(m,one) # fold the score into the Num accumulator
    pop += [one]
  slots = where0(verbose = True,
               minSize = max**0.5,
               prune   = False,
               wriggle = 0.3*kept.s())
  n=0
  for leaf in where(m, pop, slots):
    n += len(leaf)
  print n,slots
"""

Compares WHERE to GAC:

"""
@go
def _whereTiming():
  def allPairs(data):
    n = 8.0/3*(len(data)**2 - 1) # num distance evals, WHERE vs GAC
    for _ in range(int(n+0.5)):
      d1 = any(data)
      d2 = any(data)
      dist("M",d1,d2)
  random.seed(1)
  for max in [32,64,128,256]:
    m, pop, kept = "model",[], Num()
    for _ in range(max):
      one = candidate(m)
      kept + scores(m,one)
      pop += [one]
    slots = where0(verbose = False,
                minSize = 2, # emulate GAC
                depthMax=1000000,
                prune   = False,
                wriggle = 0.3*kept.s())
    t1 =  timing(lambda : where(m, pop, slots),10)
    t2 =  timing(lambda : allPairs(pop),10)
    print max,t1,t2, int(100*t2/t1)

This file is part of Timm's charming Python tricks.
© 2014, Tim Menzies: tim.menzies@gmail.com, http://menzies.us.

Timm's charming Python tricks are free software: you can redistribute them and/or modify them under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Timm's charming Python tricks are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this software. If not, see http://www.gnu.org/licenses.