
Overfitting in decision trees

Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
Review of loan default prediction

[Figure: loan applications flow into an intelligent loan application review system, which labels each application as Safe or Risky.]


Decision tree review

T(x_i) = traverse the decision tree with input x_i

[Figure: example decision tree for a loan application x_i]
Start -> Credit?
  excellent -> Safe
  fair      -> Term?  (3 years -> Risky, 5 years -> Safe)
  poor      -> Income?
                 high -> Term?  (3 years -> Risky, 5 years -> Safe)
                 low  -> Risky
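To make T(x_i) concrete, here is a minimal Python sketch of the traversal, assuming a simple nested-dict tree representation (an illustrative format, not the course's actual implementation):

```python
# Minimal sketch of T(x_i): traverse a decision tree to classify one input.
# The nested-dict tree format is an illustrative assumption.
loan_tree = {
    "feature": "credit",
    "branches": {
        "excellent": "Safe",
        "fair": {"feature": "term",
                 "branches": {"3 years": "Risky", "5 years": "Safe"}},
        "poor": {"feature": "income",
                 "branches": {
                     "high": {"feature": "term",
                              "branches": {"3 years": "Risky", "5 years": "Safe"}},
                     "low": "Risky"}},
    },
}

def traverse(tree, x):
    """Follow the branch matching x's feature value until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):                 # internal node
        node = node["branches"][x[node["feature"]]]
    return node                                   # leaf prediction: "Safe" or "Risky"

x_i = {"credit": "poor", "income": "high", "term": "5 years"}
print(traverse(loan_tree, x_i))                   # -> "Safe"
```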


Overfitting review



Overfitting in logistic regression

[Figure: classification error vs. model complexity; training error keeps decreasing while true error eventually rises.]

Overfitting if there exists w* such that:
  training_error(w*) > training_error(ŵ)
  true_error(w*)     < true_error(ŵ)
where ŵ are the learned parameters.


Overfitting ⇒ Overconfident predictions

[Figure: decision boundaries of logistic regression with degree-6 features vs. degree-20 features.]


Overfitting in decision trees



Decision stump (Depth 1): split on x[1]

[Figure: decision stump]
y values (-, +):  Root: 18, 13
  x[1] < -0.07   -> 13, 3
  x[1] >= -0.07  -> 4, 11


What happens when we increase depth?
Training error reduces with depth.

Tree depth       depth = 1   depth = 2   depth = 3   depth = 5   depth = 10
Training error   0.22        0.13        0.10        0.03        0.00

[Figure: decision boundary for each depth, growing more complex as depth increases.]
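The exact numbers above come from the lecture's dataset, which is not available here, but the trend is easy to reproduce. A hedged sketch using scikit-learn on synthetic 2-D data (both are assumptions; the course itself used a different toolkit):

```python
# Sketch: training error falls as max depth grows. Synthetic data, so the
# exact numbers will differ from the table above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] + 0.3 * rng.randn(400) > 0).astype(int)  # noisy boundary

for depth in [1, 2, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    train_error = 1.0 - model.score(X, y)
    print(f"depth = {depth:2d}   training error = {train_error:.2f}")
```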


Deeper trees → lower training error

[Figure: training error vs. tree depth; training error falls steadily, reaching 0.0 at depth 10.]


Training error = 0: Is this model perfect?
Depth 10 (training error = 0.0)

NOT PERFECT
Why does training error reduce with depth?

Loan status (Safe / Risky):  Root: 22, 18
Split on Credit?
  excellent -> 9, 0    (predict Safe)
  good      -> 9, 4    (predict Safe)
  fair      -> 4, 14   (predict Risky)

Tree              Training error
(root)            0.45
split on credit   0.20

Training error improved by 0.25 because of the split.
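A small sketch that reproduces these numbers, assuming each leaf predicts its majority class (the counts are the ones shown on this slide):

```python
# Classification error of a node or split when each leaf predicts its majority class.
def node_error(counts):
    """counts: list of (n_safe, n_risky) per leaf. Error = mistakes / total."""
    mistakes = sum(min(safe, risky) for safe, risky in counts)
    total = sum(safe + risky for safe, risky in counts)
    return mistakes / total

print(node_error([(22, 18)]))                    # root: 18/40 = 0.45
print(node_error([(9, 0), (9, 4), (4, 14)]))     # split on credit: 8/40 = 0.20
```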
Feature split selection algorithm

Given a subset of data M (a node in a tree):
  For each feature h_i(x):
    1. Split the data of M according to feature h_i(x)
    2. Compute the classification error of the split
  Choose the feature h*(x) with the lowest classification error

By design, the chosen split never increases training error.
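A sketch of this selection step in Python, assuming categorical features stored as dicts and majority-class predictions at the leaves (the data format and helper names are assumptions, not the course's implementation):

```python
from collections import Counter, defaultdict

def classification_error(labels):
    """Error of a node that predicts its majority class."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)

def best_split(data, labels, features):
    """Pick the feature whose split gives the lowest weighted classification error.

    data: list of dicts mapping feature name -> categorical value (assumed format).
    Returns (None, current_error) if no feature strictly improves the error.
    """
    best_feature, best_error = None, classification_error(labels)  # error with no split
    n = len(labels)
    for feature in features:
        groups = defaultdict(list)
        for x, y in zip(data, labels):
            groups[x[feature]].append(y)     # 1. split the node's data by feature value
        split_error = sum(len(g) / n * classification_error(g) for g in groups.values())
        if split_error < best_error:         # 2. keep the split with the lowest error
            best_feature, best_error = feature, split_error
    return best_feature, best_error
```

Note that this sketch returns None when no feature strictly lowers the error, which is exactly the situation early stopping condition 2 (later in this lecture) reacts to.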


Decision trees overfitting on loan data


Principle of Occam's razor:
Simpler trees are better


Principle of Occam's Razor
"Among competing hypotheses, the one with the fewest assumptions should be selected."
- William of Occam, 13th Century

Symptoms: S1 and S2
Diagnosis 1 (2 diseases): diseases D1 and D2, where D1 explains S1 and D2 explains S2
   OR
Diagnosis 2 (1 disease): disease D3 explains both symptoms S1 and S2   <- SIMPLER


Occam's Razor for decision trees
When two trees have similar classification error on the validation set, pick the simpler one.

Complexity       Train error   Validation error
Simple           0.23          0.24
Moderate         0.12          0.15   <- same validation error
Complex          0.07          0.15   <- same validation error
Super complex    0.00          0.18   <- overfit
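One way to code this tie-breaking rule, using the numbers from the table; the candidate names, complexity ranks, and tolerance are illustrative assumptions:

```python
# Among trees whose validation error is within a small tolerance of the best,
# pick the least complex one.
candidates = [
    # (name, complexity rank, train error, validation error)
    ("simple",        1, 0.23, 0.24),
    ("moderate",      2, 0.12, 0.15),
    ("complex",       3, 0.07, 0.15),
    ("super complex", 4, 0.00, 0.18),
]

best_val = min(c[3] for c in candidates)
tolerance = 0.005                                  # illustrative choice
acceptable = [c for c in candidates if c[3] <= best_val + tolerance]
chosen = min(acceptable, key=lambda c: c[1])       # simplest among the near-best
print(chosen[0])                                   # -> "moderate"
```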
Which tree is simpler?

[Figure: two candidate trees; the one with fewer decision nodes is labeled SIMPLER.]


Modified tree learning problem
Find a simple decision tree with low classification error.

[Figure: a spectrum of candidate trees T1(X), T2(X), ..., T4(X), ranging from simple to complex.]


How do we pick simpler trees?

1. Early stopping: Stop the learning algorithm before the tree becomes too complex.
2. Pruning: Simplify the tree after the learning algorithm terminates.


Early stopping for learning decision trees


Deeper trees → increasing complexity
Model complexity increases with depth.

[Figure: decision boundaries at depth = 1, depth = 2, and depth = 10, growing progressively more complex.]


Early stopping condition 1:
Limit the depth of a tree



Restrict tree learning to shallow trees?

[Figure: classification error vs. tree depth; training error keeps falling while true error rises for complex (deep) trees; a max_depth cutoff marks where to stop.]
Early stopping condition 1: Limit depth of tree
Stop tree building when depth = max_depth.

[Figure: the same error-vs-depth plot, with tree growth cut off at max_depth.]
Picking the value of max_depth?
Use a validation set or cross-validation.

[Figure: error-vs-depth plot; max_depth is chosen using validation data.]
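A sketch of choosing max_depth with a held-out validation set, again assuming scikit-learn and synthetic data (the course used its own toolkit, so this is only illustrative):

```python
# Pick max_depth by minimizing error on a held-out validation set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] + 0.3 * rng.randn(400) > 0).astype(int)   # synthetic labels

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

best_depth, best_valid_error = None, float("inf")
for depth in range(1, 16):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    valid_error = 1.0 - model.score(X_valid, y_valid)
    if valid_error < best_valid_error:
        best_depth, best_valid_error = depth, valid_error

print(f"chosen max_depth = {best_depth}  (validation error = {best_valid_error:.2f})")
```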
Early stopping condition 2:
Use classification error to
limit depth of tree



Decision tree recursion review

Loan status (Safe / Risky):  Root: 22, 18
Split on Credit?
  excellent -> 16, 0   (all Safe: stop, predict Safe)
  fair      -> 1, 2    (build decision stump with the subset of data where Credit = fair)
  poor      -> 5, 16   (build decision stump with the subset of data where Credit = poor)
Split selection for credit = poor

Loan status (Safe / Risky):  Root: 22, 18
Split on Credit?
  excellent -> 16, 0   (Safe)
  fair      -> 1, 2
  poor      -> 5, 16

Splits for credit = poor    Classification error
(no split)                  0.24
split on term               0.24
split on income             0.24

No split improves classification error. Stop!


Early stopping condition 2:
No split improves classification error

Loan status (Safe / Risky):  Root: 22, 18
Split on Credit?
  excellent -> 16, 0   (Safe)
  fair      -> 1, 2    (build decision stump with the subset of data where Credit = fair)
  poor      -> 5, 16   (early stopping condition 2 triggers: predict Risky)

Splits for credit = poor    Classification error
(no split)                  0.24
split on term               0.24
split on income             0.24


Practical notes about stopping when classification error doesn't decrease

1. Typically, add a "magic parameter" ε:
   - Stop if the error doesn't decrease by more than ε.
2. There are some pitfalls to this rule (see the pruning section).
3. Still, it is very useful in practice.
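A minimal sketch of rule 1, with ε as a tuning parameter (the threshold value is an assumption; the error numbers come from earlier slides):

```python
# Only accept a split if it improves classification error by more than epsilon.
def worth_splitting(error_before, error_after, epsilon=0.01):
    """Return True if the split reduces error by more than epsilon."""
    return (error_before - error_after) > epsilon

print(worth_splitting(0.24, 0.24))   # no improvement  -> False (stop)
print(worth_splitting(0.45, 0.20))   # big improvement -> True  (keep splitting)
```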


Early stopping condition 3:
Stop if number of data points
contained in a node is too small



Can we trust nodes with very few points?

Loan status (Safe / Risky):  Root: 22, 18
Split on Credit?
  excellent -> 16, 0   (Safe)
  fair      -> 1, 2    (only 3 data points! stop recursing, predict Risky)
  poor      -> 5, 16
Early stopping condition 3:
Stop when the number of data points in a node <= N_min   (example: N_min = 10)

Loan status (Safe / Risky):  Root: 22, 18
Split on Credit?
  excellent -> 16, 0   (Safe)
  fair      -> 1, 2    (3 <= N_min: early stopping condition 3 triggers, predict Risky)
  poor      -> 5, 16   (Risky)
Summary of decision trees with early stopping


Early stopping: Summary

1. Limit tree depth: Stop splitting after a certain depth.
2. Classification error: Do not consider any split that does not cause a sufficient decrease in classification error.
3. Minimum node size: Do not split an intermediate node which contains too few data points.


Greedy decision tree learning

Step 1: Start with an empty tree.
Step 2: Select a feature to split the data on.
For each split of the tree:
  Step 3: If there is nothing more to do, make predictions.
           (stopping conditions 1 & 2, or early stopping conditions 1, 2 & 3)
  Step 4: Otherwise, go to Step 2 and continue (recurse) on this split.
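Putting the recursion and the stopping conditions together, here is a hedged sketch of the greedy learner. It reuses best_split and classification_error from the split-selection sketch above; the data format, default thresholds, and helper names are assumptions rather than the course's reference implementation:

```python
from collections import Counter, defaultdict
# Reuses classification_error and best_split from the split-selection sketch above.

def majority(labels):
    """Most common label in this node (the leaf prediction)."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(data, labels, features, depth=0,
               max_depth=10, min_error_gain=0.0, min_node_size=10):
    """Greedy recursion with the stopping / early stopping conditions above.

    data: list of dicts mapping feature name -> categorical value (assumed format).
    """
    # Stopping conditions 1 & 2: node is pure, or no features left to split on.
    if not features or classification_error(labels) == 0.0:
        return majority(labels)
    # Early stopping conditions 1 & 3: depth limit reached, or too few points.
    if depth >= max_depth or len(labels) <= min_node_size:
        return majority(labels)

    feature, split_error = best_split(data, labels, features)
    # Early stopping condition 2: no split improves error by more than min_error_gain.
    if feature is None or classification_error(labels) - split_error <= min_error_gain:
        return majority(labels)

    # Step 4: recurse on the subset of data in each branch.
    groups = defaultdict(lambda: ([], []))
    for x, y in zip(data, labels):
        groups[x[feature]][0].append(x)
        groups[x[feature]][1].append(y)
    remaining = [f for f in features if f != feature]
    return {"feature": feature,
            "branches": {value: build_tree(xs, ys, remaining, depth + 1,
                                           max_depth, min_error_gain, min_node_size)
                         for value, (xs, ys) in groups.items()}}
```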


Overfitting in Decision Trees: Pruning
(OPTIONAL)
Stopping condition summary

Stopping conditions:
1. All examples have the same target value.
2. No more features to split on.

Early stopping conditions:
1. Limit tree depth.
2. Do not consider splits that do not cause a sufficient decrease in classification error.
3. Do not split an intermediate node which contains too few data points.


Exploring some challenges with early stopping conditions


Challenge with early stopping condition 1
It is hard to know exactly when to stop.

[Figure: classification error vs. tree depth (training and true error); the right max_depth cutoff between simple and complex trees is not obvious.]
Is early stopping condition 2 a good idea?

[Figure: classification error vs. tree depth; the algorithm stops early because of a zero decrease in classification error.]
Early stopping condition 2:
Don't stop if the error doesn't decrease???

y = x[1] xor x[2]

x[1]    x[2]    y
False   False   False
False   True    True
True    False   True
True    True    False

Root (y values True / False): 2, 2
Error = 2/4 = 0.5

Tree      Classification error
(root)    0.5


Consider a split on x[1]

y = x[1] xor x[2]
Root (True / False): 2, 2.  Error = 2/4 = 0.5

Split on x[1]:
  x[1] = True   -> 1 True, 1 False
  x[1] = False  -> 1 True, 1 False

Tree             Classification error
(root)           0.5
split on x[1]    0.5
Consider a split on x[2]

Split on x[2]:
  x[2] = True   -> 1 True, 1 False
  x[2] = False  -> 1 True, 1 False

Tree             Classification error
(root)           0.5
split on x[1]    0.5
split on x[2]    0.5

Neither feature improves the training error. Stop now???
Final tree with early stopping condition 2

y = x[1] xor x[2]
The tree is just the root, predicting True (a majority-class tie-break).

Tree                                 Classification error
with early stopping condition 2      0.5


Without early stopping condition 2

y = x[1] xor x[2]
Root (True / False): 2, 2
Split on x[1]:
  x[1] = True  -> split on x[2]:  x[2] = True -> False,  x[2] = False -> True
  x[1] = False -> split on x[2]:  x[2] = True -> True,   x[2] = False -> False

Tree                                 Classification error
with early stopping condition 2      0.5
without early stopping condition 2   0.0
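A tiny check of this xor example: every depth-1 split leaves the error at 0.5, so condition 2 stops at the root, yet the depth-2 tree gets everything right.

```python
# y = x[1] xor x[2]: no single split reduces the error, but the depth-2 tree is perfect.
data = [((False, False), False), ((False, True), True),
        ((True, False), True),  ((True, True), False)]   # (x[1], x[2]) -> y

def error(points, predict):
    """Fraction of points misclassified by the given predictor."""
    return sum(predict(x) != y for x, y in points) / len(points)

print(error(data, lambda x: True))          # root only (predict True):       0.5
print(error(data, lambda x: x[0]))          # a depth-1 tree splitting on x[1]: 0.5
print(error(data, lambda x: x[1]))          # a depth-1 tree splitting on x[2]: 0.5
print(error(data, lambda x: x[0] != x[1]))  # depth-2 tree (x[1] xor x[2]):   0.0
```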
Early stopping condition 2: Pros and Cons

Pros:
- A reasonable heuristic for early stopping, avoiding useless splits.

Cons:
- Too short-sighted: we may miss good splits that occur right after a useless split (as in the xor example above).


Tree pruning



Two approaches to picking simpler trees

1. Early stopping: Stop the learning algorithm before the tree becomes too complex.
2. Pruning: Simplify the tree after the learning algorithm terminates.
   Pruning complements early stopping.
Pruning: Intuition
Train a complex tree, then simplify it later.

[Figure: a complex tree is simplified into a smaller tree.]


Pruning motivation

[Figure: classification error vs. tree depth (training and true error). Rather than stopping too early on the simple-tree side, grow the tree and simplify it after it is built.]
Example 1: Which tree is simpler?

Tree 1: the full loan tree from the decision tree review (Credit? with Income? and Term? subtrees).
Tree 2: Credit?  excellent -> Safe,  fair -> Safe,  poor -> Risky.   <- SIMPLER
Example 2: Which tree is simpler???

Tree 1: one split on Credit?  excellent -> Safe,  good -> Safe,  fair -> Safe,  bad -> Risky,  poor -> Risky
Tree 2: one split on Term?    3 years -> Risky,  5 years -> Safe


Simple measure of complexity of a tree

L(T) = number of leaf nodes in T

Example: the single split  Credit? -> {excellent, good, fair, bad, poor}  has L(T) = 5.


Which tree has lower L(T)?

Tree T1: one split on Credit? with five leaves   -> L(T1) = 5
Tree T2: one split on Term? with two leaves      -> L(T2) = 2   <- SIMPLER
Balance simplicity & predictive power

Too complex (risk of overfitting): the full loan tree with Credit?, Income?, and Term? splits.
Too simple (high classification error): a tree with no splits that predicts Risky for everyone.
Desired total quality format

Want to balance:
  i.  how well the tree fits the data
  ii. the complexity of the tree

Total cost = measure of fit + measure of complexity

  measure of fit: classification error (a large value means a bad fit to the training data)
  measure of complexity: a large value means the tree is likely to overfit
Consider a specific total cost

Total cost = classification error + number of leaf nodes
           =      Error(T)        +        L(T)


Balancing fit and complexity

Total cost:  C(T) = Error(T) + λ L(T),   where λ >= 0 is a tuning parameter

If λ = 0:           only training error matters, so we get the usual (complex) tree.
If λ = ∞:           only complexity matters, so we get a tree with no splits (just the root).
If λ is in between: we trade off fit against complexity.
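As a sketch, the total cost is one line of arithmetic; the example numbers reuse the error and leaf counts from the pruning walkthrough later in this lecture:

```python
# Total cost C(T) = Error(T) + lambda * L(T).
def total_cost(error, num_leaves, lam):
    return error + lam * num_leaves

print(total_cost(0.25, 6, 0.00))   # lambda = 0: only training error matters -> 0.25
print(total_cost(0.25, 6, 0.03))   # moderate lambda trades fit for simplicity -> 0.43
```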
Use total cost to simplify trees

[Figure: total-quality-based pruning turns a complex tree into a simpler tree.]


Tree pruning algorithm



Pruning: Intuition

Tree T: the full loan tree.
  Start -> Credit?
    excellent -> Safe
    fair      -> Term?  (3 years -> Risky, 5 years -> Safe)
    poor      -> Income?
                   high -> Term?  (3 years -> Risky, 5 years -> Safe)
                   low  -> Risky


Step 1: Consider a split

In tree T, the bottom Term? split (under Income? = high) is the candidate for pruning.
Step 2: Compute the total cost C(T) of the tree with the split

C(T) = Error(T) + λ L(T),  with λ = 0.03

Tree   Error   #Leaves   Total
T      0.25    6         0.43
Step 3: Undo the split to get Tsmaller

Replace the candidate Term? split by a leaf node (Safe), giving tree Tsmaller.

C(T) = Error(T) + λ L(T),  with λ = 0.03

Tree       Error   #Leaves   Total
T          0.25    6         0.43
Tsmaller   0.26    5         0.41
Step 4: Prune if the total cost is lower, C(Tsmaller) <= C(T)

Tree       Error   #Leaves   Total
T          0.25    6         0.43
Tsmaller   0.26    5         0.41

Tsmaller has worse training error but lower overall cost. Replace the split by a leaf node? YES!
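A quick arithmetic check of the table, assuming λ = 0.03 (the value that makes the totals come out to 0.43 and 0.41):

```python
lam = 0.03
C_T        = 0.25 + lam * 6    # Error(T) + lam * L(T)               = 0.43
C_Tsmaller = 0.26 + lam * 5    # Error(Tsmaller) + lam * L(Tsmaller) = 0.41
print(f"{C_T:.2f}  {C_Tsmaller:.2f}  prune: {C_Tsmaller <= C_T}")
```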
Step 5: Repeat Steps 1-4 for every split

Decide, for each split in the tree, whether it can be pruned.

[Figure: the loan tree after pruning, with each remaining split considered in turn.]


Decision tree pruning algorithm

Start at the bottom of tree T and traverse up, applying prune_split to each decision node M.

prune_split(T, M):
1. Compute the total cost of tree T:  C(T) = Error(T) + λ L(T)
2. Let Tsmaller be the tree obtained by pruning the subtree below M (replacing it with a leaf).
3. Compute the total cost of Tsmaller:  C(Tsmaller) = Error(Tsmaller) + λ L(Tsmaller)
4. If C(Tsmaller) < C(T), prune: replace T by Tsmaller.
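A hedged sketch of this bottom-up procedure on the nested-dict tree format used in the earlier sketches; the data routing, helper names, and majority-class leaf choice are assumptions, not the course's reference code:

```python
from collections import Counter

def count_leaves(tree):
    """L(T): number of leaf nodes in the (nested-dict) tree."""
    if not isinstance(tree, dict):
        return 1
    return sum(count_leaves(child) for child in tree["branches"].values())

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["feature"]]]
    return tree

def mistakes(tree, data, labels):
    return sum(predict(tree, x) != y for x, y in zip(data, labels))

def prune(tree, data, labels, lam, n_total=None):
    """Bottom-up pruning: replace a subtree by a majority-class leaf whenever
    doing so does not raise the total cost C(T) = Error(T) + lam * L(T)."""
    if not isinstance(tree, dict) or not labels:
        return tree
    n_total = n_total or len(labels)

    # Prune the children first, each on the subset of data that reaches it.
    branches = {}
    for value, child in tree["branches"].items():
        sub = [(x, y) for x, y in zip(data, labels) if x[tree["feature"]] == value]
        sub_x = [x for x, _ in sub]
        sub_y = [y for _, y in sub]
        branches[value] = prune(child, sub_x, sub_y, lam, n_total)
    tree = {"feature": tree["feature"], "branches": branches}

    # Compare keeping this subtree vs. replacing it by one majority-class leaf.
    # Only this node's points and leaves differ between T and Tsmaller, so this
    # local comparison matches comparing the whole-tree total costs.
    leaf = Counter(labels).most_common(1)[0][0]
    cost_keep  = mistakes(tree, data, labels) / n_total + lam * count_leaves(tree)
    cost_prune = mistakes(leaf, data, labels) / n_total + lam * 1
    return leaf if cost_prune <= cost_keep else tree
```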


Summary of overfitting in decision trees


What you can do now

- Identify when overfitting occurs in decision trees
- Prevent overfitting with early stopping
  - Limit tree depth
  - Do not consider splits that do not reduce classification error
  - Do not split intermediate nodes with only a few points
- Prevent overfitting by pruning complex trees
  - Use a total cost formula that balances classification error and tree complexity
  - Use the total cost to prune potentially complex trees into simpler ones


Thank you to Dr. Krishna Sridhar
Staff Data Scientist, Dato, Inc.
