"""

(Can be viewed as
[html](http://menzies.us/cs472/?niching) or [raw
Python](http://unbox.org/open/trunk/472/14/spring/src/where.py).)

How to Find Niche Solutions
===========================

### Why Niche?

When exploring complex multi-objective surfaces, it
is simplistic to summarize that space with one number
such as _max distance from hell_ (where _hell_ is the
point in objective space where all goals take their
worst values). For example, all the following points
have the same _distance from hell_, yet they reflect
very different solutions:
<center>
<img src="http://unbox.org/open/trunk/472/14/spring/doc/img/gunsbutter.png">
</center>

A better approach is _niching_; i.e. generate
the Pareto frontier, then cluster that frontier and
report a (small) number of random samples from each
cluster. In this approach, "distance from hell" can
still be used internally to guide a quick search for
places to expand the frontier. But after that, we
can isolate interestingly different parts of the
solution space.

For example, when exploring options for configuring
London ambulances, Veerappa and Letier use optimization to
reject thousands of options (shown in red) in order
to isolate clusters of better solutions C1, C2, etc.
(shown in green). In this example, the goal is to
minimize X and maximize Y.
<center>
<img src="http://unbox.org/open/trunk/472/14/spring/doc/img/amus.png" width=400>
</center>

(REF: V. Veerappa and E. Letier, "Understanding
Clusters of Optimal Solutions in Multi-Objective
Decision Problems," RE 2011, Trento, Italy,
2011, pp. 89-98.)

### What is a Niche?

According to Deb and Goldberg (1989) _a niche is
viewed as an organism's task in the environment and
a species is a collection of organisms with similar
features_.

+ REF: Kalyanmoy Deb and David
  E. Goldberg. 1989. [An Investigation of Niche and
  Species Formation in Genetic Function
  Optimization](http://goo.gl/oLIxo1). In
  Proceedings of the 3rd International Conference on
  Genetic Algorithms, J. David Schaffer (Ed.). Morgan
  Kaufmann Publishers Inc., San Francisco, CA, USA,
  42-50.

Their definition suggests that niching means
clustering in objective space, and that species are
the things that form in each such cluster. Strange to
say, some of their examples report _niches_ in
decision space, so all we can say is that it is an
engineering decision whether you _niche_ in
objective space or in decision space. Note
that the above example from Veerappa and Letier
builds niches in objective space while my code, shown
below, builds niches in decision space.

### How to Niche?

In any case, in decision or objective space, the
open issue is how to find the niches. A naive
approach is to compute distances between all
individuals. This can be very slow, especially if
this calculation is called repeatedly deep inside
the inner-most loop of a program. In practice, any
program working with distance spends most of its
time computing those measures. Various proposals
exist to prevent that happening:

+ _Canopy clustering_: McCallum, A.; Nigam, K.; and
  Ungar, L.H. (2000) [Efficient Clustering of High
  Dimensional Data Sets with Application to
  Reference Matching](http://goo.gl/xwIzN),
  Proceedings of the Sixth ACM SIGKDD International
  Conference on Knowledge Discovery and Data
  Mining, 169-178.
+ _Incremental stochastic k-means_: D. Sculley (2010) [Web-Scale
  K-Means Clustering](http://goo.gl/V8BQs),
  WWW 2010, April 26-30, 2010, Raleigh, North
  Carolina, USA.
+ _Triangle inequality tricks_ (which work very well
  indeed): G. Hamerly (2010) [Making k-means Even Faster](http://goo.gl/hk3Emn),
  SIAM International Conference on Data Mining.

My own favorite trick is WHERE, shown below. It uses
a data mining trick to recursively divide the space
of decisions in two, then four, then eight,
etc. REF: [Local vs. global lessons for defect
prediction and effort
estimation](http://menzies.us/pdf/12gense.pdf), T.
Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F.
Shull, B. Turhan, IEEE Transactions on Software
Engineering 39(6), 822-834, 2013.

WHERE uses a linear-time trick (called the FastMap
heuristic) to find two distant solutions; these are
the _poles_ called _west_ and _east_. Note that the
following takes only _O(2N)_ distance calculations:

"""
def fastmap(m,data):
  "Divide data into two using distance to two distant items."
  one  = any(data)             # 1) pick anything
  west = furthest(m,one,data)  # 2) west is as far as you can go from anything
  east = furthest(m,west,data) # 3) east is as far as you can go from west
  c    = dist(m,west,east)
  # now find everyone's distance
  xsum, lst = 0.0, []
  for one in data:
    a = dist(m,one,west)
    b = dist(m,one,east)
    x = (a*a + c*c - b*b)/(2*c) # cosine rule
    xsum += x
    lst  += [(x,one)]
  # now cut data according to the mean distance
  cut, wests, easts = xsum/len(data), [], []
  for x,one in lst:
    where = wests if x < cut else easts
    where += [one]
  return wests, west, easts, east
"""

In the above:

+ _m_ is some model that generates candidate
  solutions that we wish to niche.
+ _(west,east)_ are not _the_ most distant points
  (that would require _N*N_ distance
  calculations). But they are at least very distant
  from each other.

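As an aside, here's a quick sanity check (a standalone sketch, not
part of WHERE) that the cosine-rule projection used above really
does recover a point's position along the west-east axis:

    import math
    west, east, one = (0.0, 0.0), (10.0, 0.0), (3.0, 4.0)
    def dist2d(p,q): return math.hypot(p[0]-q[0], p[1]-q[1])
    a = dist2d(one, west)         # 5.0
    b = dist2d(one, east)         # sqrt(65)
    c = dist2d(west, east)        # 10.0
    print (a*a + c*c - b*b)/(2*c) # 3.0 = one's coordinate on the west-east axis
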
This code needs some helper functions. _Dist_ uses
the standard Euclidean measure. Note that you choose
whether the niches are defined by decisions or by
objectives via the _what_ parameter:

"""
def dist(m,i,j,
         what = lambda x: x.dec):
  "Euclidean distance 0 <= d <= 1 between decisions"
  d1,d2  = what(i), what(j)
  n      = len(d1)
  deltas = 0
  for d in range(n):
    n1 = norm(m, d, d1[d])
    n2 = norm(m, d, d2[d])
    inc = (n1-n2)**2
    deltas += inc
  return deltas**0.5 / n**0.5
"""

_Dist_ does its normalization via _norm_, shown next, which maps each raw value onto the range zero to one.

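By the way, to niche in objective space rather than decision space,
it is a one-line swap of the _what_ parameter (here _obj_ is the
objectives list that _candidate_, defined later in this file, gives
each individual, so the individuals must be scored first):

    d = dist(m, i, j, what = lambda x: x.obj)
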
"""
def norm(m,x,n):
  "Normalizes value n on variable x within model m, 0..1."
  return (n - lo(m,x)) / (hi(m,x) - lo(m,x) + 0.0001) # the small constant dodges divide-by-zero
"""

Now we can define _furthest_:

"""
def furthest(m,i,all,
             init   = 0,
             better = lambda x,y: x > y):
  "Find which of 'all' is furthest from 'i'."
  out,d = i,init
  for j in all:
    if not i == j:
      tmp = dist(m,i,j)
      if better(tmp,d): out,d = j,tmp
  return out
"""

WHERE finds everyone else's distance from the poles
and divides the data at the mean of those
distances. This recursion stops if:

+ Any division has _tooFew_ solutions (say,
  less than _sqrt_ of the total number of
  solutions).
+ Something has gone horribly wrong and you are
  recursing _tooDeep_.

This code is controlled by a set of _slots_. For
example, if _slots.prune_ is true, we may ignore
some sub-tree (this process is discussed later on).
Also, if _slots.verbose_ is true, the _show_
function prints out a little tree showing the
progress (and to print indents in that tree, we use
the string _slots.b4_). For example, here's WHERE
dividing 100 solutions:

    100
    |.. 50
    |.. |.. 25
    |.. |.. |.. 11
    |.. |.. |.. |.. 6.
    |.. |.. |.. |.. 5.
    |.. |.. |.. 14
    |.. |.. |.. |.. 6.
    |.. |.. |.. |.. 8.
    |.. |.. 25
    |.. |.. |.. 12
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 7.
    |.. |.. |.. 13
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 8.
    |.. 50
    |.. |.. 25
    |.. |.. |.. 13
    |.. |.. |.. |.. 7.
    |.. |.. |.. |.. 6.
    |.. |.. |.. 12
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 7.
    |.. |.. 25
    |.. |.. |.. 11
    |.. |.. |.. |.. 5.
    |.. |.. |.. |.. 6.
    |.. |.. |.. 14
    |.. |.. |.. |.. 7.
    |.. |.. |.. |.. 7.

Here are the slots:

"""
class Slots():
  "Place to read/write named slots."
  id = -1
  def __init__(i,**d):
    i.id = Slots.id = Slots.id + 1
    i.override(d)
  def override(i,d): i.__dict__.update(d); return i
  def __eq__(i,j):   return i.id == j.id
  def __ne__(i,j):   return i.id != j.id
  def __repr__(i):   return '{' + showd(i.__dict__) + '}'

def where0(**other):
  return Slots(minSize = 10,      # min leaf size
               depthMin= 2,       # no pruning till this depth
               depthMax= 10,      # max tree depth
               wriggle = 0.2,     # min difference of 'better'
               prune   = True,    # pruning enabled?
               b4      = '|.. ',  # indent string
               verbose = False,   # show trace info?
               hedges  = 0.38     # strict=0.38, relax=0.17
               ).override(other)
"""
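
For example, the defaults can be overridden per call (the values
here are just for illustration):

    slots = where0(verbose = True, minSize = 4)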

WHERE returns clusters, where each cluster contains
multiple solutions.

"""
def where(m,data,slots=where0()):
  out = []
  where1(m,data,slots,0,out)
  return out

def where1(m, data, slots, lvl, out):
  def tooDeep(): return lvl > slots.depthMax
  def tooFew() : return len(data) < slots.minSize
  def show(suffix):
    if slots.verbose:
      print slots.b4*lvl + str(len(data)) + suffix
  if tooDeep() or tooFew():
    show(".")
    out += [data]
  else:
    show("")
    wests, west, easts, east = fastmap(m,data)
    goLeft, goRight = maybePrune(m,slots,lvl,west,east)
    if goLeft:
      where1(m, wests, slots, lvl+1, out)
    if goRight:
      where1(m, easts, slots, lvl+1, out)
"""

Is this useful? Well, in the following experiment, I
clustered 32, 64, 128, 256 individuals using WHERE or
a dumb greedy approach called GAC that (a) finds
everyone's closest neighbor; (b) combines each such
pair into a super-node; (c) then repeats
(recursively) for the super-nodes.

<center>
<img src="http://unbox.org/open/trunk/472/14/spring/doc/img/gacWHEREruntimes.png">
</center>

WHERE is _much_ faster than GAC since it builds
a tree of clusters of height log(N) by, at each
step, making only O(2N) calls to FastMap. (GAC is
not part of this file; a sketch of it appears below.)

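Here is a minimal sketch of GAC in the style of this file, reusing
_furthest_ and _Slots_ from this code. The merge rule (a super-node
holds the midpoint of the merged pair's decisions) is my own
simplification; the timed experiment may have combined nodes
differently:

    def gac(m, data):
      "Sketch: greedily merge closest pairs, then recurse on the merges."
      def closest(i,rest):
        return furthest(m, i, rest, 10**32, lambda x,y: x < y)
      if len(data) < 2:
        return data
      used, supers = set(), []
      for one in data:
        if id(one) in used: continue
        rest = [two for two in data
                if id(two) not in used and two is not one]
        if not rest:
          supers += [one]               # odd one out survives as-is
          break
        two = closest(one, rest)        # (a) find the closest neighbor
        used.add(id(one)); used.add(id(two))
        mid = [(a+b)/2.0 for a,b in zip(one.dec, two.dec)]
        supers += [Slots(dec = mid,     # (b) combine pair into a super-node
                         obj = one.obj[:],
                         changed = True)]
      return gac(m, supers)             # (c) recurse on the super-nodes
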
### Experimental Extensions

Lately I've been experimenting with a system that
prunes as it divides the data. GALE checks for
domination between the poles and ignores data in
halves with a dominated pole. This means that for
_N_ solutions we only ever have to evaluate
_2*log(N)_ of them (e.g. for _N=256_, just 16
evaluations, since only the two poles are evaluated
at each of the log2(N)=8 levels), which is useful
if each evaluation takes a long time.

The niches found in this way
contain non-dominated poles; i.e. they are
approximations to the Pareto frontier.
Preliminary results show that this is a useful
approach, but you should treat those results with a
grain of salt.

In any case, this code supports that pruning as an
optional extra (enabled via the _slots.prune_
flag). In summary, this code says that if the
scores for the poles differ by more than
_slots.wriggle_, and one pole has a better score
than the other, then ignore the half with the
worse pole.

"""
def maybePrune(m,slots,lvl,west,east):
  "Usually, go left then right, unless dominated."
  goLeft, goRight = True, True # default
  if slots.prune and lvl >= slots.depthMin:
    sw = scores(m, west)
    se = scores(m, east)
    if abs(sw - se) > slots.wriggle: # big enough to consider
      if se > sw: goLeft  = False    # no left
      if sw > se: goRight = False    # no right
  return goLeft, goRight
"""

Note that I do not allow pruning until we have
descended at least _slots.depthMin_ levels into the tree.


Support Code
------------

### Dull, low-level stuff

"""
import sys,math,random
sys.dont_write_bytecode = True

def go(f):
  "A decorator that runs code at load time."
  print "\n# ---|", f.__name__,"|-----------------"
  if f.__doc__: print "#", f.__doc__
  f()

# random stuff
by   = lambda x: random.uniform(0,x)
seed = random.seed
any  = random.choice # nb: deliberately shadows Python's builtin any()

# pretty-prints for lists
def gs(lst): return [g(x) for x in lst]
def g(x)   : return float('%g' % x)
"""

### More interesting, low-level stuff

"""
def timing(f,repeats=10):
  "How long does 'f' take to run?"
  import time
  time1 = time.clock()
  for _ in range(repeats):
    f()
  return (time.clock() - time1)*1.0/repeats

def showd(d):
  "Pretty print a dictionary."
  def one(k,v):
    if isinstance(v,list):
      v = gs(v)
    if isinstance(v,float):
      return ":%s %g" % (k,v)
    return ":%s %s" % (k,v)
  return ' '.join([one(k,v) for k,v in
                   sorted(d.items())
                   if not "_" in k])

class Num:
  "An accumulator for numbers (Welford's incremental mean and variance)."
  def __init__(i): i.n = i.m2 = i.mu = 0.0
  def s(i): return (i.m2/(i.n - 1))**0.5 # standard deviation
  def __add__(i,x):    # add one number, updating the mean and m2 in place
    i.n  += 1
    delta = x - i.mu
    i.mu += delta*1.0/i.n
    i.m2 += delta*(x - i.mu)
"""

### Model-specific Stuff

WHERE talks to models via the following model-specific functions.
Here, we must invent some made-up model that builds
individuals with five decisions and four objectives.
In practice, you would **start** here to build hooks from WHERE into your model
(which is the **m** passed in to these functions).

"""
def decisions(m) : return [0,1,2,3,4]
def objectives(m): return [0,1,2,3]
def lo(m,x)      : return 0.0
def hi(m,x)      : return 1.0
def w(m,o)       : return 1 # min,max is -1,1
def score(m, individual):
  d, o = individual.dec, individual.obj
  o[0] = d[0] * d[1]
  o[1] = d[2] ** d[3]
  o[2] = 2.0*d[3]*d[4] / (d[3] + d[4]) # harmonic mean of d[3],d[4]
  o[3] = d[2]/(1 + math.e**(-1*d[2]))
  individual.changed = True
"""

### Model-general stuff

Using the model-specific stuff, WHERE defines some
useful general functions.

"""
def some(m,x):
  "With variable x of model m, pick one value at random."
  return lo(m,x) + by(hi(m,x) - lo(m,x))

def candidate(m):
  "Return an unscored individual."
  return Slots(changed = True,
               scores  = None,
               obj     = [None] * len(objectives(m)),
               dec     = [some(m,d)
                          for d in decisions(m)])

def scores(m,t):
  "Score an individual (normalized distance from 'hell')."
  if t.changed:
    score(m,t)
    new, n = 0, 0
    for o in objectives(m):
      x    = t.obj[o]
      n   += 1
      tmp  = norm(m,o,x)**2
      if w(m,o) < 0: # for minimization objectives, flip the direction
        tmp = 1 - tmp
      new += tmp
    t.scores  = (new**0.5)*1.0 / (n**0.5)
    t.changed = False
  return t.scores
"""

### Demo stuff

To run these demos at load time, add _@go_ (uncommented) on the line before each.

Checks that we can find close and distant things:

"""
@go
def _distances():
  def closest(m,i,all):
    return furthest(m,i,all,10**32,lambda x,y: x < y)
  random.seed(1)
  m   = "any"
  pop = [candidate(m) for _ in range(4)]
  for i in pop:
    j = closest(m,i,pop)
    k = furthest(m,i,pop)
    print "\n",\
          gs(i.dec), g(scores(m,i)),"\n",\
          gs(j.dec),"closest ", g(dist(m,i,j)),"\n",\
          gs(k.dec),"furthest", g(dist(m,i,k))
    print i
"""

A standard call to WHERE, pruning disabled:

"""
@go
def _where():
  random.seed(1)
  m, max, pop, kept = "model", 100, [], Num()
  for _ in range(max):
    one = candidate(m)
    kept + scores(m,one) # Num's __add__ updates the accumulator in place
    pop += [one]
  slots = where0(verbose = True,
                 minSize = max**0.5,
                 prune   = False,
                 wriggle = 0.3*kept.s())
  n = 0
  for leaf in where(m, pop, slots):
    n += len(leaf)
  print n, slots
"""

Compares WHERE to GAC:

"""
@go
def _whereTiming():
  def allPairs(data):
    n = 8.0/3*(len(data)**2 - 1) # numevals in GAC vs WHERE
    for _ in range(int(n+0.5)):
      d1 = any(data)
      d2 = any(data)
      dist("M",d1,d2)
  random.seed(1)
  for max in [32,64,128,256]:
    m, pop, kept = "model", [], Num()
    for _ in range(max):
      one  = candidate(m)
      kept + scores(m,one)
      pop += [one]
    slots = where0(verbose  = False,
                   minSize  = 2, # emulate GAC
                   depthMax = 1000000,
                   prune    = False,
                   wriggle  = 0.3*kept.s())
    t1 = timing(lambda: where(m, pop, slots),10)
    t2 = timing(lambda: allPairs(pop),10)
    print max, t1, t2, int(100*t2/t1)

This file is part of Timm's charming Python tricks.
© 2014, Tim Menzies: tim.menzies@gmail.com, http://menzies.us.
Timm's charming Python tricks is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
It is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this software. If not, see http://www.gnu.org/licenses.