
statsnotes.py

Download statsnotes.py.
Read more on How to be Charming (in Python).


001: """
002: <em>(Note to students: before you get scared about all this, note that I've coded
003: this up for you- see the demos in
004: [statsd.py](http://unbox.org/open/trunk/472/14/spring/var/code/statsd.html).
005: But you should not just use things
006: you do not understand. So read and enjoy.)</em>
007: 
008: Comparing Different Optimizers
009: =============================
010: 
011: 
012: 
013: For the most part, we are concerned with very
014: high-level issues that strike to the heart of the
015: human condition:
016: 
017: - What does it mean to find controlling principles in the world?
018: - How can we find those principles better, faster, cheaper?
019: 
020: But sometimes we have to leave those lofty heights
021: to discuss more pragmatic issues. Specifically, how
022: to present the results of an optimizer and,
023: sometimes, how to compare and rank the results from
024: different optimizers.
025: 
026: Note that there is no best way, and often the way we
027: present results depends on our goals, the data we
028: are procesing, and the audience we are trying to
029: reach.  So the statistical methods discussed below
030: are more like first-pass approximations to something
031: you may have to change extensively, depending on the
032: task at hand.
033: 
034: In any case, in order to have at least
035: one report that that you quickly generate, then....
036: 
037: Theory
038: ------
039: 
The test that one optimizer is better than another can be recast
as four checks on the _distribution_ of performance scores:

1. Visualize the data, somehow.
2. Check if the distributions are _significantly different_.
3. Check if the central tendency of one distribution is _better_
   than the other; e.g. compare their mean values.
4. Check that the difference between the central tendencies is not some _small effect_.
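
As a roadmap, here is a minimal sketch of those four checks strung together. It
assumes both optimizers are scored on a 0..1 range and it leans on the _xtile_,
_bootstrap_ and _a12small_ functions introduced later in these notes
(_xtile_ lives in do.py):

    def report(xs, ys):
      show = lambda lst: xtile(lst, lo=0, hi=1, width=25,
                               show=lambda s: " %3.2f" % s)
      print("xs " + show(xs))                    # 1. visualize, always
      print("ys " + show(ys))
      mu = lambda lst: sum(lst) / float(len(lst))
      worse, better = sorted([xs, ys], key=mu)   # 3. which central tendency is better?
      if not bootstrap(worse, better):           # 2. are they significantly different?
        return "statistically indistinguishable"
      if a12small(better, worse):                # 4. is the difference only a small effect?
        return "better, but only by a small effect"
      return "ys wins" if better is ys else "xs wins"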

The first step is very important. Stats should
always be used as sanity checks on intuitions gained
by other means. So look at the data before making,
possibly bogus, inferences from it. For example,
here are some charts showing the effects on a
population as we apply more and more of some
treatment. Note that the mean of the populations
remains unchanged, yet we might still endorse the
treatment since it reduces the uncertainty
associated with each population.

<center>
<img width=300 
src="http://unbox.org/open/trunk/472/14/spring/doc/img/index_customers_clip_image002.jpg">
</center>

One possible bogus inference would be to apply the
third test without the second since, if the second
_significance_ test fails, then the third _better_ test could give misleading results.
For example, returning
to the above distributions, note the large overlap
in the top two curves in those plots.  When
distributions exhibit a very large overlap, it is
very hard to determine if one is really different from
the other.  So large variances can mean that even if
the means are _better_, we cannot really say that
the values in one distribution are usually better
than those in the other.


Pragmatics
----------

There are several pragmatic issues associated with these tests.


### Experimental Design 

In a very famous quote, [Ernest Rutherford](http://en.wikipedia.org/wiki/Ernest_Rutherford) once said

+  _"If your experiment needs statistics, you ought to have done a better experiment."_

In this view, it is a mistake to perform arcane statistical tests on large sets
of results- far better to design a better experiment in the first place.

In this subject,
we have seen one such paper that
[explored multiple options in optimization](http://unbox.org/doc/pso/Off-The-Shelf_PSO.pdf)
but did not explore all combinations. Rather, its authors walked through the space
of options, comparing a few options at a time.

While 
[that's a good paper](http://unbox.org/doc/pso/Off-The-Shelf_PSO.pdf), it does not explore
_interaction effects_ where some options, in combination with others, have unintended
side-effects. So all hail [Ernest Rutherford](http://en.wikipedia.org/wiki/Ernest_Rutherford),
but sometimes you just gotta look at many options.

So, on with the show.

### Effect size

Much recent commentary has
remarked that two sets of numbers can be
_significantly different_, but the size of that
difference is so small as to be very boring. For
example, the blue and red lines in the following are
significantly different:

<center>
<img width=300 
src="http://unbox.org/open/trunk/472/14/spring/doc/img/distributed.png">
</center>

But when you look at the _size_ of
the difference, it is so small that you have
to say it's kinda boring. The lesson here is that
even if two distributions are _significantly different_,
and even if their means are _better_, then if the _effect size_
is very small we should not report a difference in the populations.

In the following, we will use the _a12_ test for effect size.


### Handling Distributions of Many Shapes

Data may be _shaped_ in funny ways. Consider the following
distributions:

<center>
<img width=400 
src="http://unbox.org/open/trunk/472/14/spring/doc/img/dists.gif">
</center>

Note that these are not all smooth symmetrical bell-shaped curves
that rise to a single maximum value. Why is this important? Well, several
of the widely used tests of _statistical significance_ assume such shapes.
One method that does not is the _bootstrap sampling_ method discussed below- but
that requires hundreds to thousands of resamples of the data (which can be
slow for very large distributions). So, pragmatically, if we want to avoid
assuming that the data fits some particular shape then we need to somehow
restrict the number of times we apply slower methods like bootstrapping.

Note that one reason we use the _a12_ test for effect size is that
this particular test makes no assumptions about the shape of the data being explored.

Note also that bootstrapping can be very, very slow indeed. A constant plea
in the bootstrapping literature is "tell us how to make it run faster".
In the following, we will minimize the calls to bootstrapping as follows:

+ Only bootstrap if the effect size is not small.

That is, instead of checking for effect size _after_ checking for _significant difference_,
we check _before_. This is important since it avoids unnecessary (and very slow)
calls to bootstrapping.
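
In code, that ordering is just a short-circuit. Here is a minimal sketch; the
argument order (better-scoring list first) and the defaults are assumptions,
and _a12small_ and _bootstrap_ are the functions defined later in these notes:

    def different(better, worse, b=1000, conf=0.05):
      "Cheap effect-size check first; the slow bootstrap only if really needed."
      if a12small(better, worse):  # no more than a small effect? then stop here...
        return False               # ...and never pay for the bootstrap
      return bootstrap(better, worse, b=b, conf=conf)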

### The Confidence Problem

A second pragmatic issue is the _confidence_
issue. Suppose we have four optimizers, each of
which is controlled by four parameters
(e.g. mutation rate etc), and you are testing these
on 10 models (Fonseca, ZDT, etc). If you break the
parameters into (say) three big chunks (e.g. lo,
medium, high), in effect you are comparing results
on 10*4*3<sup>4</sup>=3,240 _treatments_; i.e. over
5,000,000 comparisons.  Now it is sensible to
explore treatments within each model separately, which means you are
"only" comparing pairs drawn from 324 treatments, which is
still over 50,000 comparisons.

Why is this a problem? Well, tests for
_significant difference_ report the probability
that members of population1 do not overlap
population2. In practice, most populations overlap a
little, so standard practice is to restrict the
overlap test to some small number; e.g. report that
two populations are significantly different if 95%
of the members are not found in the overlap.

Now can you see the problem? If these tests are 95% confident
and we run over 50,000 such comparisons, our results
now have confidence 0.95<sup>50,000</sup>, which is
vanishingly close to zero (i.e. not confident at all).

An alternative to the above is to sort all the
results from different treatments on their mean value,
then compare the optimizer at position _i_ to position
_i+1_.
Note that this procedure requires _N-1_ comparisons
for _N_ treatments.
So now for 324 treatments, our statistical tests are
confident to 0.95<sup>323</sup>=0.000006 percent- which
is better than before but still very very very poor.

Yet another alternative is to sort the data then
conduct a binary chop on the results, and only run
the significance tests on each chop.  For our 324
treatments, this will generate results with
a confidence of 0.95<sup>log2(324)</sup>=65%.

This can be improved further by only chopping the
data at points where the mean value of the
treatments in each chop are most different.  This is
the _Scott-Knott_ procedure discussed below.  If
some population has mean _mu_ and some chop divides
the _n_ members of that population into _n1_ and
_n2_ groups with means _mu1_ and _mu2_, then the
expected value of the difference in the mean before and
after the chop is:

     delta = n1/n*(mu - mu1)**2 + n2/n*(mu - mu2)**2  

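For example, with some made-up numbers: if a population of _n=4_ values
_[0.2, 0.3, 0.7, 0.8]_ has _mu=0.5_, then

     chop in the middle : delta = 2/4*(0.5 - 0.25)**2 + 2/4*(0.5 - 0.75)**2 = 0.0625
     chop after one item: delta = 1/4*(0.5 - 0.2)**2  + 3/4*(0.5 - 0.6)**2  = 0.03

so the middle chop, which best separates the low values from the high ones, wins.
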
Scott-Knott picks the chop that maximizes _delta_.
In the code shown below, Scott-Knott also rejects chops
that:

- Generate tiny groups in the data; e.g. less than 4 individuals in any chop;
- Are different by only some _small effect_, implemented using, say, the _a12_ test discussed below.

As mentioned above, this second step (running _a12_ before checking for significant differences)
is a way to reduce the time required for bootstrapping.

More importantly, the fewer chops we explore, the fewer times we run
bootstrap and the fewer times we run into the confidence problem.
Just as a "what-if", suppose our Scott-Knott run means we only
have to chop the data 20 times (which is actually more than usual).
If so, then our confidence in the conclusions would be 0.95<sup>log2(20)</sup>, which is
80 percent- which is not bad for a study looking at 324 options.
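
As a sanity check, the arithmetic behind all those confidence numbers is easy
to reproduce (a small sketch):

    from math import log

    def confidence(comparisons, alpha=0.95):
      "Confidence remaining after running some number of tests, each alpha confident."
      return alpha ** comparisons

    print(confidence(323))          # each sorted treatment vs its neighbor: about 0.00000006
    print(confidence(log(324, 2)))  # binary chop over 324 treatments: about 0.65
    print(confidence(log(20, 2)))   # a Scott-Knott run that chops 20 times: about 0.80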

Enough theory; on with the code.


## How to...

### Visualization

As said above, stats should
always be used as sanity checks on intuitions gained
by other means.

Suppose we had two optimizers which, in 10 repeated
runs each, generated the following performance scores:

"""
    def _tile2():
      def show(lst):
        return xtile(lst, lo=0, hi=1, width=25,
                     show=lambda s: " %3.2f" % s)
      print "one", show([0.21, 0.29, 0.28, 0.32, 0.32,
                         0.28, 0.29, 0.41, 0.42, 0.48])
      print "two", show([0.71, 0.92, 0.80, 0.79, 0.78,
                         0.9,  0.71, 0.82, 0.79, 0.98])
"""

When faced with new data, always chant the following mantra:

+ _First_ visualize it to get some intuitions;
+ _Then_ apply some statistics to double check those intuitions.

That is, it is _strongly recommended_ that, prior to
doing any statistical work, an analyst generates a
visualization of the data. Percentile charts are a
simple way to display very large populations in very
little space. For example, here are our results from
_one_, displayed on a range from 0.00 to 1.00.

    one         * --|            , 0.28,  0.29,  0.32,  0.41,  0.48
    two             |    -- * -- , 0.71,  0.79,  0.80,  0.90,  0.98

In this percentile chart, the 2nd and 3rd
percentiles are shown as little dashes left and right of the
median value, which is shown with a _"*"_ (learner _two_'s
3rd percentile is so small that it actually
disappears in this display).  The vertical bar _"|"_
shows half way between the display's min and max (in
this case, that would be (0.0+1.00)/2 = 0.50).

From the above, we could write a little report that
shows the mean and rank performance of the two
learners.

    one  :mu 0.32 :rank 1
    two  :mu 0.80 :rank 2

From this report, if the goal was to maximize some
factor, then it is clear we would recommend _two_
over optimizer _one_.

#### Xtile

The advantage of percentile charts is that we can
show a lot of data in very little space. For
example, here's 2000 numbers shown as a _quintile_
chart on two lines.

+ Quintiles divide the data into the 10th, 30th,
  50th, 70th, 90th percentiles (sketched below).
+ Dashes (_"-"_) mark the (10,30)th and
  (70,90)th percentile ranges;
+ White space marks the (30,50)th and
  (50,70)th percentile ranges.
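
To be concrete about what those five summary numbers are, here is a rough
sketch of how they could be computed (the real display work is done by
_xtile_ in do.py):

    def quintiles(lst):
      "The 10th, 30th, 50th, 70th and 90th percentiles of a list (a rough sketch)."
      lst = sorted(lst)
      return [lst[int(p * (len(lst) - 1))] for p in [0.1, 0.3, 0.5, 0.7, 0.9]]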

Consider two distributions, of 1000 samples each:
one shows the square root of a _rand()_ and the other
shows the square of a _rand()_.
"""

    def _tile():
      import random
      r = random.random
      def show(lst):
        return xtile(lst, lo=0, hi=1, width=25,
                     show=lambda s: " %3.2f" % s)
      print "one", show([r()**0.5 for x in range(1000)])
      print "two", show([r()**2   for x in range(1000)])

"""

In the following quintile charts, we show these distributions:

+ The range is 0 to 1.
+ One line shows the square of 1000 random numbers;
+ The other line shows the square root of 1000 random numbers.

Note the brevity of the display:

    one        -----|    *  ---  , 0.32,  0.55,  0.70,  0.84,  0.95
    two --    *     |--------    , 0.01,  0.10,  0.27,  0.51,  0.85

As before, the median value is shown with a _"*"_, and
the point half-way between min and max (in this
case, 0.5) is shown as a vertical bar _"|"_.

For details on how to draw percentile charts, see _xtile_
in [do.py](http://unbox.org/open/trunk/472/14/spring/var/code/do.html).


### The Scott-Knott Procedure

Suppose we have four optimizers which have been run
four times each; now there is more wriggle in their
results:

    one   0.34 0.49 0.51  0.8
    two   0.6  0.9  0.8   0.9
    three 0.7  0.9  0.8   0.6
    four  0.2  0.3  0.35  0.4

To rank these, we'd first sort them by their mean
and maybe add a little bar chart to the report:

    four  :mu 0.3125   ******
    one   :mu 0.535    ***********
    three :mu 0.75     ***************
    two   :mu 0.8      ****************
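
(Such a report is easy to generate; a minimal sketch, not the exact code used
above, where the bar width and the rounding are assumptions:)

    def bar(label, lst, width=20):
      mu = sum(lst) / float(len(lst))
      print("%-5s :mu %-8s %s" % (label, mu, "*" * int(width * mu + 0.5)))

    bar("four",  [0.2,  0.3,  0.35, 0.4])
    bar("one",   [0.34, 0.49, 0.51, 0.8])
    bar("three", [0.7,  0.9,  0.8,  0.6])
    bar("two",   [0.6,  0.9,  0.8,  0.9])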

Intuitively, the learners seem to fall into three groups:

+ Highest scores: two and three;
+ Lowest scores: four;
+ Somewhere in-between: one.

The problem though is that it is a little hard to
check if those are the right groups since some of
these learners generate numbers very close to each
other.

Just as an aside, part of the problem is
insufficient experimentation. Four runs barely
exercises the optimizers so it does not give any of
them a chance to show their true worth. I recommend
at least 10 to 20 repeats- but then another problem
arises; i.e. too many numbers to read and
understand.

If we explore the above using a _Scott-Knott_
procedure, then we would:

+ Sort the learners by their mean (as above);
+ Recursively _cut_ the list in two, stopping when
  one half of the cut was similar
  to the other half.
+ <em>Scott AJ and Knott M (1974) Cluster analysis method for
   grouping means in the analysis of variance. Biometrics 30:
   507-512.</em>

There are many ways to find the _cut_. Following the
recommendations of Mittas and Angelis, we proceed as
follows.

+  <em>Nikolaos Mittas, Lefteris Angelis: Ranking and Clustering Software Cost Estimation 
   Models through a Multiple Comparisons Algorithm. IEEE Trans. Software Eng. 39(4): 537-551 (2013)</em>

Mittas and Angelis find the cut by collecting:

+ The mean `mu` of all the data;
+ The means `mu1,mu2` of all the data below and above the cut;
+ Then returning the cut that most divides the data.

More specifically, according to Mittas and
Angelis, the best cut is the one that _maximizes
the difference in the mean_ before and after the
cut.
"""

    def minMu(parts,all,big,same):
      cut,left,right = None,None,None
      before, mu     = 0, all.mu
      # "leftRight" (defined elsewhere) yields each candidate cut
      # as (cut index, summary of the left part, summary of the right part)
      for i,l,r in leftRight(parts):
        if big(l.n) and big(r.n):
          if not same(l,r):
            n = all.n * 1.0
            x = l.n/n*(mu - l.mu)**2 + \
                r.n/n*(mu - r.mu)**2
            if x > before:
              before,cut,left,right = x,i,l,r
      return cut,left,right
"""

In the above, _l.n_ and _r.n_ are the number of
measurements left and right of the cut (and _n_ =
_l.n + r.n_). So the above function gives most
weight to cuts that most change the mean for
the largest number of measurements.

The _minMu_ function is a low-level procedure. A higher-level
tool calls _minMu_ to find one cut, then recurses on each half of that cut.

"""

    def rdiv(data,  # a list of class Nums
             all,   # all the data combined into one num
             div,   # function: find the best split
             big,   # function: rejects small splits
             same): # function: rejects similar splits
      def recurse(parts,all,rank=0):
        cut,left,right = div(parts,all,big,same)
        if cut:
          # if cut, rank "right" higher than "left"
          rank = recurse(parts[:cut],left,rank) + 1
          rank = recurse(parts[cut:],right,rank)
        else:
          # if no cut, then all get same rank
          for part in parts:
            part.rank = rank
        return rank
      recurse(sorted(data),all)
      return data
"""
Finally, the above is called by _scottknott_, which
recursively splits the data, maximizing the expected
difference in the mean before and after each
split (and rejecting splits that leave 3 or fewer items on either side).

"""
    def scottknott(data,small=3,b=250, conf=0.05):
      def theSame(one, two):
        if a12small(two, one): return True
        return  not bootstrap(one, two, b=b, conf=conf)
      all  = reduce(lambda x,y:x+y,data)
      same = lambda l,r: theSame(l.saw(), r.saw())
      big  = lambda    n: n > small
      return rdiv(data,all,minMu,big,same)

"""

(The _a12small_ and _bootstrap_ functions will be
explained below.)

When this is run on the following data from
"_x1,x2,x3,x4,x5_", we see sensible groupings:

"""
    def rdivDemo(data):
      data = map(lambda lst:Num(lst[0],lst[1:],keep=512),
                 data)
      for x in sorted(scottknott(data),key=lambda y:y.rank):
        print x.rank, x.name, gs([x.mu, x.s])

    def rdiv2():
      rdivDemo([
            ["x1",0.34, 0.49, 0.51, 0.6],
            ["x2",0.6,  0.7,  0.8,  0.9],
            ["x3",0.15, 0.25, 0.4,  0.35],
            ["x4",0.6,  0.7,  0.8,  0.9],
            ["x5",0.1,  0.2,  0.3,  0.4]
            ])
"""

Note that `rdiv` performs a binary chop on the list
of optimizers. This is important since it means we
can rank _N_ learners using _log2(N)_
comparisons- which matters for the confidence
reasons discussed above.

For full details on the Scott-Knott procedure, see
[stats.py](http://unbox.org/open/trunk/472/14/spring/var/code/stats.html).


### The A12 test

I prefer a test for _small effect_ that does not
_sweat the small stuff_; i.e. one that ignores small
differences between items in the samples. My
preferred test for _small effect_ has:

+ a simple intuition;
+ no (say) Gaussian assumptions about the data;
+ and a solid lineage in the literature.

Such a test is Vargha and Delaney's _A12 statistic_.
The statistic was proposed in Vargha and Delaney's
2000 paper and was endorsed in many places, including
Arcuri and Briand's ICSE 2011 paper.

+ <em>A. Vargha and H. D. Delaney. A critique and
  improvement of the CL common language effect size
  statistics of McGraw and Wong. Journal of
  Educational and Behavioral Statistics,
  25(2):101-132, 2000.</em>
+ <em>Andrea Arcuri, Lionel C. Briand: A practical guide
  for using statistical tests to assess randomized
  algorithms in software engineering. ICSE 2011:
  1-10.</em>

After I describe it to you, you will wonder why
anyone would ever want to use anything else.  Given
a performance measure seen in _m_ measures of _X_
and _n_ measures of _Y_, the A12 statistic measures
the probability that running algorithm _X_ yields
higher values than running another algorithm _Y_.
Specifically, it counts how often we see larger
numbers in _X_ than in _Y_ (and if the same numbers
are found in both, we add a half mark):

     a12 = #(X.i > Y.j)/(n*m) + 0.5*#(X.i == Y.j)/(n*m)

According to Vargha and Delaney, the difference
between two populations is:

+ _large_ if `a12` is over 71%;
+ _medium_ if `a12` is over 64%;
+ _small_ if `a12` is 56%, or less.
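
For example, with the made-up lists _X=[2,3,4]_ and _Y=[1,2,3]_ (so _n*m=9_),
there are six pairs where _X.i > Y.j_ and two ties, so

     a12 = 6/9 + 0.5*2/9 = 0.78

which counts as a _large_ difference.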

The code is very simple- just remember to sort _lst1_ and
_lst2_ before doing the comparisons:

"""
    def a12small(lst1, lst2): return a12(lst1,lst2) <= 0.56

    def a12(lst1,lst2, gt= lambda x,y: x > y):
      "how often is x in lst1 more than y in lst2?"
      def loop(t,t1,t2):
        while t1.i < t1.n and t2.i < t2.n:
          h1 = t1.l[t1.i]
          h2 = t2.l[t2.i]
          if gt(h1,h2):
            t1.i  += 1; t1.gt += t2.n - t2.i
          elif h1 == h2:
            t2.i  += 1; t1.eq += 1; t2.eq += 1
          else:
            t2,t1  = t1,t2
        return t.gt*1.0, t.eq*1.0
      #--------------------------
      lst1 = sorted(lst1, cmp=gt)
      lst2 = sorted(lst2, cmp=gt)
      n1   = len(lst1)
      n2   = len(lst2)
      t1   = Thing(l=lst1,i=0,eq=0,gt=0,n=n1)
      t2   = Thing(l=lst2,i=0,eq=0,gt=0,n=n2)
      gt,eq= loop(t1, t1, t2)
      return gt/(n1*n2) + eq/2/(n1*n2)

"""
### Bootstrap Tests

(_For more details on this section, see p220 to 223 of Efron's book "An introduction to the bootstrap"._)

Formally, _a12_ is actually a _post hoc effect size
test_, which is applied _after_ some hypothesis test
for statistically significant difference between two
populations.

Such statistical tests check if there is enough of a
difference between two lists of numbers to falsify
some hypothesis (e.g. the values in list1 are less than
those in list2). Usual practice is to make some
parametric assumption (e.g. that the numbers come
from a Gaussian distribution).

Another method, called _bootstrapping_, makes
no parametric assumption. The way it works is to
define some _testStatistic_ and apply it to:

1. The original two lists; 
2. A "bootstrap" sample;
   i.e. two artificial lists created by
   sampling with replacement from the original lists.
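
("Sampling with replacement" is nothing more exotic than the following sketch;
the bootstrap code below does the same job inline, via its _one_ helper:)

    import random

    def bootstrapSample(lst):
      "A new list, the same size as lst, whose items are drawn from lst with replacement."
      return [random.choice(lst) for _ in lst]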

A bootstrap test runs the _testStatistic_:

+ Once on the original pair of lists...
+ Then (say) 1000 times on 1000 bootstrap samples.

Then we count how often the _testStatistic_ of a bootstrap sample
exceeds the value seen with
the original lists (see the last line of the following code):

"""
    import random  # for random.uniform, used by "any" below

    def bootstrap(y0,z0,conf=0.05,b=1000):
      class total():
        def __init__(i,some=[]):
          i.sum = i.n = i.mu = 0 ; i.all=[]
          for one in some: i.put(one)
        def put(i,x):
          i.all.append(x);
          i.sum +=x; i.n += 1; i.mu = float(i.sum)/i.n
        def __add__(i1,i2): return total(i1.all + i2.all)
      def testStatistic(y,z):
        tmp1 = tmp2 = 0
        for y1 in y.all: tmp1 += (y1 - y.mu)**2
        for z1 in z.all: tmp2 += (z1 - z.mu)**2
        s1    = float(tmp1)/(y.n - 1)
        s2    = float(tmp2)/(z.n - 1)
        delta = z.mu - y.mu
        if s1+s2:
          delta =  delta/((s1/y.n + s2/z.n)**0.5)
        return delta
      def one(lst): return lst[ int(any(len(lst))) ]
      def any(n)  : return random.uniform(0,n)
      y, z   = total(y0), total(z0)
      x      = y + z
      tobs   = testStatistic(y,z)
      yhat   = [y1 - y.mu + x.mu for y1 in y.all]
      zhat   = [z1 - z.mu + x.mu for z1 in z.all]
      bigger = 0.0
      for i in range(b):
        if testStatistic(total([one(yhat) for _ in yhat]),
                         total([one(zhat) for _ in zhat])) > tobs:
          bigger += 1
      return bigger / b < conf
"""

In the above, the bootstrap sample is generated with the call to _one_.
This data is collected and summarized with the _total_ class.
Also, the _testStatistic_ function checks if two means are different, tempered
by the size of the two lists and the variance of the data.

One issue with the above is the number of
bootstrap samples. There are some statistical results that
say 200 to 300 bootstrap samples are enough. Note
that the more bootstrap samples, the slower the
statistical test (which is why folks often use the
faster parametric methods- even if the parametric
methods make the wrong assumptions about the data).

For this reason, this code runs _bootstrap_
within a Scott-Knott that calls it only on "interesting" chops
(i.e. those that are not too small, that divide the data on more than a small
effect, and that maximize the expected value of the delta of the mean).

## Demos

All the above is coded up.
See [statsd.py](http://unbox.org/open/trunk/472/14/spring/var/code/statsd.html).
Share and enjoy.

"""

This file is part of Timm's charming Python tricks.
© 2014, Tim Menzies: tim.menzies@gmail.com, http://menzies.us.

Timm's charming Python tricks are free software: you can redistribute it and/or modify it under the terms of the GNU Lesser Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Timm's charming Python tricks are distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU Lesser Public License along with this software. If not, see http://www.gnu.org/licenses.