Leake: explanation is an inference procedure, different for each task
information is smaller than data; to reduce data, we must apply some bias
    bias makes us blind; bias lets us see (the future)
timm: to convert data to info, we need two classes of tools:
    model-based
    data-based

Model-based
    BSC (balanced scorecard) perspective: perspectives, each with
        objectives (highest-level concepts)
        measures (some subset of which are the observables)
        targets (measure <op> threshold)
        initiatives (what are we doing about this)
    Tufte: informative diagrams
        space shuttle example
        skew the example with correlated columns

Data-based
    dimensionality reduction, e.g.
        remove correlated columns
        remove non-informative attributes
            TF-IDF: reject all but the top k most interesting words
                interesting if frequent OR rare
                F[i,j] = frequency of item i in thing j
                interesting = F[i,j] * log((number of things) / (number of things with item i))
            entropy
        remove irrelevant rows
    add synthetic attributes, e.g.
        values for missing attributes, e.g. combine other attributes
        decrease granularity on numerics, e.g. replace nums with quartiles

Validation
    best: on new data
    worst: on old data (can't predict error on new data)
    middle: cross-validation
        jack-knifing: leave one out
        bootstrap: 66% random samples
            note, trick: boosting - focus on prior mis-classified examples
        10-way: usual
        3-way: for small data sets

The distinction between model-based and data-based is soft: background models inform data processing.
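The TF-IDF weighting above can be sketched as follows, taking "things" to be documents and "items" to be words; the function name and the sample documents are illustrative, not from the notes.

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each (word, doc) pair as F[i,j] * log(N / docs-containing-word-i)."""
    n = len(docs)
    # df[w] = number of documents containing word w
    df = Counter(w for doc in docs for w in set(doc))
    scores = {}
    for j, doc in enumerate(docs):
        freq = Counter(doc)  # F[i,j]
        for word, f in freq.items():
            scores[(word, j)] = f * math.log(n / df[word])
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "the", "cat"]]
scores = tfidf(docs)
# a word in every document ("the") scores 0; rarer words score higher
```

Note how the log term implements the "frequent OR rare" intuition: a word present in every document gets log(N/N) = 0 no matter how often it repeats, so only words that discriminate between documents survive the top-k cut.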
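The "replace nums with quartiles" step above can be sketched as follows; the cut-point choice (simple rank positions rather than interpolated percentiles) and the Q1..Q4 labels are assumptions for illustration.

```python
def quartile_bins(values):
    """Decrease granularity: map each numeric value to its quartile label."""
    ranked = sorted(values)
    n = len(ranked)
    # cut points at the 25th, 50th, and 75th rank positions
    cuts = [ranked[n // 4], ranked[n // 2], ranked[3 * n // 4]]

    def bin_of(v):
        for q, cut in enumerate(cuts, start=1):
            if v <= cut:
                return "Q%d" % q
        return "Q4"

    return [bin_of(v) for v in values]

labels = quartile_bins([1, 2, 3, 4, 5, 6, 7, 8])
```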
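The resampling schemes in the validation notes can be sketched as below. This shows only how the train/test splits are formed; the learner itself is out of scope, and the function names are illustrative. On the bootstrap, the expected fraction of distinct examples landing in the training sample is 1 - 1/e, about 63%, which is presumably what the "66%" in the notes approximates.

```python
import random

def leave_one_out(data):
    """Jack-knifing: yield (train, test) pairs, holding out one example at a time."""
    for i in range(len(data)):
        yield data[:i] + data[i + 1:], [data[i]]

def bootstrap(data, rng=random):
    """Draw n examples with replacement for training; unseen examples form the test set."""
    train = [rng.choice(data) for _ in data]
    test = [x for x in data if x not in train]
    return train, test

splits = list(leave_one_out([1, 2, 3]))
train, test = bootstrap(list(range(10)), random.Random(0))
```

10-way (and, for small data sets, 3-way) cross-validation sits between these: partition the data into k folds and rotate each fold through the test role.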