Learning Trees

Given this input:

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

WEKA learns this decision tree:

outlook = sunny
|   humidity <= 75: yes 
|   humidity > 75: no 
outlook = overcast: yes 
outlook = rainy
|   windy = TRUE: no 
|   windy = FALSE: yes 

Q: how?

A: "iterative dichotomization". This is a general algorithm defined by the operators measure, split, recurse, stop, condense, prune.

  • Split things up
  • Recurse on each split

How to learn a tree

  • Given a bag of mixed-up stuff.
    • Need a measure of how "mixed-up" a bag is.
  • Split: find a test that divides the bag into two or more new sub-bags
    • such that each sub-bag is less mixed-up;
    • each split is the root of a sub-tree.
  • Recurse: repeat for each sub-bag
    • i.e. on just the data that falls into each part of the split
      • Need a stop rule
      • Condense the instances that fall into each final sub-bag (leaf)
  • Prune back the generated tree.

Different tree learners result from different choices for these operators, e.g.

  • condense: report the majority class of the instances reaching the leaf
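
A minimal skeleton of that recipe, parameterized by the operators above (an illustrative Python sketch; `find_best_split`, `stop`, and `condense` are assumed callables supplied by the learner, not any particular library's API):

    # Generic "iterative dichotomization" skeleton; the concrete choices of
    # measure/split, stop, condense (and a later prune pass) define the learner.
    def learn_tree(instances, find_best_split, stop, condense):
        if stop(instances):                            # stop rule, e.g. too few instances
            return ("leaf", condense(instances))       # condense, e.g. majority class
        split = find_best_split(instances)             # split that most reduces "mixed-up-ness"
        if split is None:                              # no split helps: make a leaf anyway
            return ("leaf", condense(instances))
        test, sub_bags = split                         # e.g. ("outlook", {"sunny": [...], ...})
        return ("node", test,
                {branch: learn_tree(sub_bag, find_best_split, stop, condense)
                 for branch, sub_bag in sub_bags.items()})   # recurse on each sub-bag
    # Pruning is a separate pass over the finished tree.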

Example: C4.5 (a.k.a. J48)

Q: which attribute is the best to split on?

A: the one which will result in the smallest tree:

Heuristic: choose the attribute that produces the "purest" nodes (purity = not-mixed-up)

e.g. Outlook = sunny

  • info((2,3)) = entropy(2/5, 3/5) = -2/5 * log2(2/5) - 3/5 * log2(3/5) = 0.971 bits

Outlook = overcast

  • info((4,0)) = entropy(1, 0) = -1 * log2(1) - 0 * log2(0) = 0 bits   (0 * log2(0) is taken as 0)

Outlook = rainy

  • info((3,2)) = entropy(3/5, 2/5) = -3/5 * log2(3/5) - 2/5 * log2(2/5) = 0.971 bits

Expected info for Outlook = Weighted sum of the above

  • info((3,2),(4,0),(3,2)) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.693

Computing the information gain

  • i.e. information before splitting minus information after splitting
  • e.g. gain for the attributes of the weather data:
  • gain("Outlook") = info((9,5)) - info((2,3),(4,0),(3,2)) = 0.940 - 0.693 = 0.247 bits
  • gain("Temperature") = 0.029 bits
  • gain("Humidity") = 0.152 bits
  • gain("Windy") = 0.048 bits

Problem: Numeric Variables

No problem: sort the instances on the numeric attribute, consider a binary split at each threshold between adjacent distinct values, and score each candidate threshold with the same information-gain measure (this is how the humidity <= 75 test in the tree above arises).
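
A sketch of that threshold search (illustrative Python; the example uses the humidity values and play labels of the five "sunny" rows from the data above):

    from math import log2
    from collections import Counter

    def info(labels):                 # entropy, in bits, of a list of class labels
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Try a binary split at each midpoint between adjacent distinct values;
        return the threshold with the highest information gain."""
        pairs = sorted(zip(values, labels))
        before = info(labels)
        best = (None, 0.0)
        for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
            if v1 == v2:
                continue                                   # no threshold between equal values
            t = (v1 + v2) / 2                              # candidate threshold
            left = [l for v, l in pairs if v <= t]
            right = [l for v, l in pairs if v > t]
            gain = before - (len(left) / len(pairs) * info(left)
                             + len(right) / len(pairs) * info(right))
            if gain > best[1]:
                best = (t, gain)
        return best

    # humidity and play for the five sunny rows: 85/no, 90/no, 95/no, 70/yes, 70/yes
    print(best_threshold([85, 90, 95, 70, 70], ["no", "no", "no", "yes", "yes"]))
    # -> (77.5, ~0.971): any threshold between 70 and 85 splits the classes perfectly;
    #    the tree above prints the test as humidity <= 75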

Other tree learners

Regression trees

  • Differences from decision trees:
    • Splitting criterion: minimize intra-subset variation
    • Pruning criterion: based on a numeric error measure
    • Leaf node predicts the average class value of the training instances reaching that node
  • Can approximate piecewise constant functions
  • Easy to interpret, e.g. (transcribed as code after this list):
              curb-weight <= 2660 :
              |   curb-weight <= 2290 :
              |   |   curb-weight <= 2090 :
              |   |   |   length <= 161 : price=6220
              |   |   |   length >  161 : price=7150
              |   |   curb-weight >  2090 : price=8010
              |   curb-weight >  2290 :
              |   |   length <= 176 : price=9680
              |   |   length >  176 :
              |   |   |   normalized-losses <= 157 : price=10200
              |   |   |   normalized-losses >  157 : price=15800
              curb-weight >  2660 :
              |   width <= 68.9 : price=16100
              |   width >  68.9 : price=25500
    
  • More sophisticated version: model trees
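
The printed tree is just nested comparisons with a constant prediction at each leaf; a direct transcription (an illustrative Python function, using the attribute names from the tree above):

    def predict_price(curb_weight, length, normalized_losses, width):
        """Direct transcription of the regression tree above: every leaf is a constant."""
        if curb_weight <= 2660:
            if curb_weight <= 2290:
                if curb_weight <= 2090:
                    return 6220 if length <= 161 else 7150
                return 8010
            if length <= 176:
                return 9680
            return 10200 if normalized_losses <= 157 else 15800
        return 16100 if width <= 68.9 else 25500

    print(predict_price(curb_weight=2500, length=180, normalized_losses=100, width=66))  # -> 10200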

Model trees

  • Regression trees with linear regression functions at each node, e.g.:
              curb-weight <= 2660 :
              |   curb-weight <= 2290 : LM1
              |   curb-weight >  2290 :
              |   |   length <= 176 : LM2
              |   |   length >  176 : LM3
              curb-weight >  2660 :
              |   width <= 68.9 : LM4
              |   width >  68.9 : LM5
              .
              LM1:  price = -5280 + 6.68 * normalized-losses
                                  + 4.44 * curb-weight
                                  + 22.1 * horsepower - 85.8 * city-mpg
                                  + 98.6 * highway-mpg
              LM2:  price = 9680
              LM3:  price = -1100 + 91 * normalized-losses
              LM4:  price = 9940 + 47.5 * horsepower
              LM5:  price = -19000 + 13.2 * curb-weight
    
  • Linear regression is applied to the instances that reach a node after the full regression tree has been built
  • Only a subset of the attributes is used for each LR model
  • Namely the attributes occurring in the subtree below that node (and possibly the attributes occurring on the path to the root)
  • Fast: the overhead for LR is small because usually only a small subset of the attributes is used in the tree
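
Prediction with the model tree shown above (before smoothing) is the same kind of routing as in the regression tree, except that each leaf evaluates its linear model (an illustrative Python transcription):

    def predict_price_model_tree(curb_weight, length, width, normalized_losses,
                                 horsepower, city_mpg, highway_mpg):
        """Route an instance to a leaf of the model tree above and evaluate its LM."""
        if curb_weight <= 2660:
            if curb_weight <= 2290:                        # LM1
                return (-5280 + 6.68 * normalized_losses + 4.44 * curb_weight
                        + 22.1 * horsepower - 85.8 * city_mpg + 98.6 * highway_mpg)
            if length <= 176:                              # LM2
                return 9680
            return -1100 + 91 * normalized_losses          # LM3
        if width <= 68.9:                                  # LM4
            return 9940 + 47.5 * horsepower
        return -19000 + 13.2 * curb_weight                 # LM5

    print(predict_price_model_tree(curb_weight=3000, length=190, width=70,
                                   normalized_losses=120, horsepower=150,
                                   city_mpg=20, highway_mpg=27))   # LM5 -> 20600.0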

Building the tree

  • Splitting criterion: standard deviation reduction (SDR) over the i subsets T1..Ti produced by a candidate split (see the sketch after this list)
  • SDR = sd(T) - sum( |Ti| / |T| * sd(Ti) )
    • where T is the set of instances reaching the node, Ti is the subset falling into the i-th branch, and |T| is the number of instances in a set
  • Termination criteria (important when building trees for numeric prediction):
    • Standard deviation becomes smaller than a certain fraction of the sd for the full training set (e.g. 5%)
    • Too few instances remain (e.g. fewer than four)
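
A small sketch of the SDR computation (Python; whether sd is the population or sample standard deviation is a detail that does not matter for the idea):

    from statistics import pstdev   # population standard deviation; the exact variant is a detail

    def sdr(T, subsets):
        """Standard deviation reduction: sd(T) - sum(|Ti|/|T| * sd(Ti))."""
        return pstdev(T) - sum(len(Ti) / len(T) * pstdev(Ti) for Ti in subsets)

    prices = [6220, 7150, 8010, 9680, 10200, 15800, 16100, 25500]
    left, right = prices[:4], prices[4:]     # e.g. a candidate split on curb-weight
    print(sdr(prices, [left, right]))        # a positive value: the split reduces variation
    # the candidate split with the largest SDR is chosen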

Smoothing (Model Trees)

  • The naive prediction method outputs the value of the LR model at the corresponding leaf node
  • Performance can be improved by smoothing predictions with the LR models at the internal nodes
  • The predicted value is then a weighted average of the LR models along the path from the root to the leaf
  • Smoothing formula: p' = (np + kq) / (n + k)   (sketched below)
    • p' is what gets passed up the tree
    • p is what got passed up from the node below
    • q is the value predicted by the linear model at this node
    • n is the number of training examples that fall down to here
    • k is a ("magic") smoothing constant; default = 2
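
The formula as code (a one-line sketch; the numbers in the usage example are made up for illustration):

    def smooth(p, q, n, k=2):
        """Blend the prediction p coming up from below with this node's own
        linear-model prediction q; n = number of instances reaching the node below."""
        return (n * p + k * q) / (n + k)

    # e.g. a leaf's LM predicts 10200 from 6 training instances and the
    # parent node's LM predicts 9500:
    print(smooth(p=10200, q=9500, n=6))      # -> 10025.0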