Overfitting in decision trees
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
2015-2016 Emily Fox & Carlos Guestrin Machine Learning Specialization
Review of loan default prediction
[Figure: decision tree for loan applications — the root splits on Credit? (excellent, fair, poor), with deeper splits on Income? (high, low) and Term? (3 years, 5 years); leaves predict Safe or Risky.]
Model complexity
[Figure: training data with 18 positive and 13 negative y values at the root, plotted against x[1]; as tree depth grows, the decision boundary becomes increasingly complex.]
Why does training error decrease with depth?
Occam's razor: prefer the simpler explanation.
Symptoms: S1 and S2.
- Diagnosis 1 (2 diseases): D1 explains S1, D2 explains S2.
- Diagnosis 2 (1 disease): D3 explains both symptoms S1 and S2. ← SIMPLER
[Figure: training error and true error vs. tree depth. Training error keeps decreasing as trees grow from simple to complex, while true error falls and then rises again; a good max_depth lies in between.]
Early stopping condition 1:
Limit depth of tree
[Figure: classification error vs. tree depth; growth is stopped once the tree reaches the chosen max_depth.]
Picking a value for max_depth

Choose max_depth using a validation set or cross-validation.

[Figure: classification error vs. tree depth; pick the max_depth that minimizes validation error.]
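The selection rule can be sketched in a few lines of plain Python. The validation errors below are made-up illustrative numbers, not results from the course dataset:

```python
# Hypothetical validation error measured for each candidate max_depth
# (numbers are illustrative only).
validation_error = {1: 0.35, 2: 0.28, 3: 0.22, 4: 0.21, 5: 0.23, 6: 0.27}

# Keep the depth with the lowest validation error.
best_depth = min(validation_error, key=validation_error.get)
print(best_depth)  # depth 4 minimizes validation error in this example
```

In practice each entry of the dictionary would come from training a tree at that depth and evaluating it on held-out data.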
Early stopping condition 2:
Use classification error to
limit depth of tree
[Figure: root split on Credit?; the excellent branch is a Safe leaf, and decision stumps are built recursively on the data subsets where Credit = fair and Credit = poor.]
Split selection for credit=poor

[Figure: root split on Credit? with safe/risky counts per branch — excellent: 16/0, fair: 1/2, poor: 5/16. No split of the poor node improves classification error: Stop!]

Splits for credit=poor    Classification error
(no split)                0.24
split on term             0.24
split on income           0.24

No candidate split improves on the error of the majority-class leaf (5/21 ≈ 0.24), so we stop and predict Risky.
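The stopping test can be sketched directly: compare the leaf's majority-class error against the weighted error after each candidate split. The data below is a hypothetical stand-in for the credit=poor subset (5 safe, 16 risky loans; the term values are invented for illustration):

```python
from collections import Counter

def node_error(labels):
    """Classification error of predicting the majority class at a node."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1 - majority_count / len(labels)

def split_error(rows, labels, feature):
    """Weighted classification error after splitting on one feature."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    n = len(labels)
    return sum(len(g) / n * node_error(g) for g in groups.values())

# Hypothetical credit=poor subset: 5 safe, 16 risky.
labels = ["safe"] * 5 + ["risky"] * 16
rows = [{"term": "3 years"}] * 10 + [{"term": "5 years"}] * 11  # illustrative

err_leaf = node_error(labels)                   # 5/21, about 0.24
err_term = split_error(rows, labels, "term")    # also 5/21 here

# Early stopping condition 2: don't split if error doesn't improve
# (small tolerance guards against floating-point rounding).
stop = err_term >= err_leaf - 1e-12
```

With these numbers the split on term leaves the weighted error unchanged, so the condition fires and the node becomes a leaf.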
Early stopping condition 3: Stop if the number of data points in a node is too small.
Summary of decision trees
with early stopping
(OPTIONAL)
Stopping condition summary
Stopping conditions:
1. All examples have the same target value
2. No more features to split on
[Figure: training error and true error vs. tree depth, from simple to complex trees; training error keeps falling while true error eventually rises past the best max_depth.]
Is early stopping condition 2 a good idea?
[Figure: classification error vs. tree depth; condition 2 stops growth at the first depth where classification error shows zero decrease.]
Early stopping condition 2:
Don't stop if error doesn't decrease?

Counterexample: y = x[1] xor x[2]

x[1]    x[2]    y
False   False   False
False   True    True
True    False   True
True    True    False

Root: 2 positive and 2 negative examples, so majority-class error = 0.5. Splitting on x[1] (or x[2]) alone leaves each branch with one positive and one negative example, so no single split decreases the error.

Tree                                   Classification error
with early stopping condition 2        0.5
without early stopping condition 2     0.0

Without condition 2, splitting on x[1] and then on x[2] within each branch yields four pure leaves and zero training error.
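The xor failure case can be checked numerically with a few lines of plain Python (no ML library), reusing the idea of majority-class error:

```python
from itertools import product

# y = x1 xor x2: all four input combinations of the truth table.
data = [((x1, x2), x1 != x2) for x1, x2 in product([False, True], repeat=2)]

def error(points):
    """Majority-class classification error on a set of (x, y) points."""
    pos = sum(1 for _, y in points if y)
    return min(pos, len(points) - pos) / len(points)

root_error = error(data)  # 2 positive, 2 negative -> 0.5

# A single split on x1 puts one positive and one negative example
# in each branch, so the weighted error stays at 0.5.
left = [p for p in data if p[0][0]]
right = [p for p in data if not p[0][0]]
split_error = (len(left) * error(left) + len(right) * error(right)) / len(data)

# Depth-2 tree (split on x1, then x2): every leaf holds one example.
full_error = sum(error([p]) for p in data) / len(data)

print(root_error, split_error, full_error)  # 0.5 0.5 0.0
```

Condition 2 sees no improvement at the root (0.5 → 0.5) and halts, yet the depth-2 tree reaches zero error.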
Early stopping condition 2: Pros and cons
Pros:
- A reasonable heuristic for early stopping that avoids useless splits
Cons:
- Too short-sighted: we may miss good splits that occur right after useless splits
Pruning: simplify the tree after it is built — don't stop too early.

[Figure: a complex tree is simplified into a simpler tree after training; training error keeps decreasing with tree depth, so early stopping is not the only option.]
Example 1: Which tree is simpler?

[Figure: pairs of candidate trees rooted at Credit? (excellent, fair, poor). In each pair, one tree keeps splitting on Income? and Term? (3 years, 5 years), while the other stops after fewer splits with Safe/Risky leaves; the tree with fewer leaves is marked SIMPLER.]
Balance simplicity & predictive power
[Figure: a deep tree splitting on Credit?, Income?, and Term? (3 years, 5 years) is labeled "too complex, risk of overfitting"; a near-trivial tree that predicts mostly Risky is labeled "too simple, high classification error".]
Desired total quality format
Want to balance:
i. How well the tree fits the data
ii. Complexity of the tree

Total cost = measure of fit + measure of complexity

- Measure of fit (classification error): a large value means a bad fit to the training data.
- Measure of complexity: a large value means the tree is likely to overfit.
Consider a specific total cost

Total cost = classification error + λ · number of leaf nodes

C(T) = Error(T) + λ L(T)

where λ ≥ 0 is a tuning parameter:
- If λ = 0: only the fit matters — standard tree learning, growing the full (overfitting-prone) tree.
- If λ = ∞: only complexity matters — the tree collapses to the root.
- If λ is in between: the fit and the complexity of the tree are balanced.
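The effect of λ can be sketched directly. The candidate trees and their error/leaf counts below are hypothetical numbers chosen for illustration:

```python
def total_cost(error, num_leaves, lam):
    """C(T) = Error(T) + lambda * L(T)."""
    return error + lam * num_leaves

# Hypothetical candidate trees: name -> (training error, number of leaves).
trees = {"stump": (0.35, 2), "medium": (0.25, 6), "deep": (0.15, 40)}

def best_tree(lam):
    """Tree minimizing total cost for a given lambda."""
    return min(trees, key=lambda name: total_cost(*trees[name], lam))

print(best_tree(0.0))     # lambda = 0: fit only -> the deepest tree wins
print(best_tree(0.01))    # in between: a balance of fit and complexity
print(best_tree(1000.0))  # very large lambda: complexity dominates -> smallest tree
```

Sweeping λ traces out exactly the trade-off on the slide: small λ favors complex trees, large λ favors tiny ones.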
Use total cost to simplify trees

Complex tree → total-quality-based pruning → simpler tree

Step 1: Consider a split as a candidate for pruning.

[Figure: tree T rooted at Credit? (excellent, fair, poor), with splits on Income? (high, low) and Term? (3 years, 5 years) and Safe/Risky leaves; the bottom Term? split is the candidate for pruning.]
Step 2: Compute total cost C(T) of split

C(T) = Error(T) + λ L(T), with λ = 0.03

Tree       Error    #Leaves    Total
T          0.25

[Figure: tree T rooted at Credit?, with the bottom Term? split (3 years, 5 years) marked as the candidate for pruning.]
Step 3: Undo the split to form Tsmaller

C(T) = Error(T) + λ L(T), with λ = 0.03

Tree       Error    #Leaves    Total
T          0.25     6          0.43
Tsmaller   0.26

[Figure: in Tsmaller, the candidate Term? split is replaced by a leaf node.]
Step 4: Prune if total cost is lower: C(Tsmaller) ≤ C(T)

C(T) = Error(T) + λ L(T), with λ = 0.03

Tree       Error    #Leaves    Total
T          0.25     6          0.43
Tsmaller   0.26     5          0.41

Tsmaller has worse training error but lower overall cost. Replace the split by a leaf node? YES!
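The pruning decision on this slide reduces to a single comparison; a minimal sketch reproducing the numbers with λ = 0.03:

```python
lam = 0.03  # tuning parameter from the slide

def total_cost(error, num_leaves):
    """C(T) = Error(T) + lambda * L(T)."""
    return error + lam * num_leaves

c_T = total_cost(0.25, 6)        # tree T: 0.25 + 0.03 * 6 = 0.43
c_smaller = total_cost(0.26, 5)  # Tsmaller: 0.26 + 0.03 * 5 = 0.41

# Prune whenever the smaller tree has lower (or equal) total cost.
prune = c_smaller <= c_T
print(prune)  # True -> replace the split by a leaf node
```

Even though Tsmaller has slightly worse training error, removing one leaf lowers the total cost, so the split is pruned.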
Step 5: Repeat Steps 1-4 for every split
[Figure: the full tree rooted at Credit? (excellent, fair, poor), with splits on Income? (high, low) and Term? (3 years, 5 years); decide for each split whether it can be pruned.]