
Decision Trees, Random Forests and Random Ferns

Peter Kovesi
What do I want to do?

Take an image.
Identify the distinct regions of stuff in the image.
Mark the boundaries of these regions.
Recognize and label the stuff in each region.
Recognizing Textures
Manual Classification
Textons
Fundamental micro-structures in natural images

Apply a bank of filters to a set of sample images of a texture.
Perform clustering on the filter outputs to find groupings of filter outputs that tend to co-occur for that texture.
These clusters form textons that are stored in a dictionary for future use.
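As a rough sketch of this step (not the exact pipeline used here), assuming the filter-bank responses for the sample images have already been stacked into a responses array with one row of filter outputs per pixel, the clustering can be done with scikit-learn's KMeans; the number of clusters k is a placeholder.

import numpy as np
from sklearn.cluster import KMeans

def build_texton_dictionary(responses, k=32, seed=0):
    """Cluster filter-bank responses; the cluster centres are the textons."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(responses)
    return km.cluster_centers_          # shape (k, num_filters)

# Example with random stand-in data (real input would be filter outputs).
responses = np.random.randn(10000, 36)
textons = build_texton_dictionary(responses, k=32)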

Filter Bank
Statistical Approach to Texture Classification from Single Image
Step 1: Build the texton dictionary

Figure 4 (Varma and Zisserman 2005). Learning stage I: Generating the texton dictionary. Multiple, unregistered images from the training set of a particular texture class are convolved with a filter bank. The resultant filter responses are aggregated and clustered into textons using the K-Means algorithm. Textons from different texture classes are combined to form the texton dictionary.
Texton dictionary built from coral images
Step 2: Build models of the textures

A set of training images for each texture is filtered and the dictionary textons closest to the filter outputs are found.
The histogram of textons found in the image forms the model of the texture corresponding to the training image.
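A minimal sketch of Step 2 under the same assumptions: each filtered pixel is assigned to its nearest dictionary texton, and the normalized histogram of assignments becomes the model for that training image. The name texton_histogram is illustrative.

import numpy as np

def texton_histogram(responses, textons):
    """Assign each pixel's filter response to its nearest texton and
    return the normalized histogram of texton labels (the texture model)."""
    # Squared distances from every response to every texton.
    d2 = ((responses[:, None, :] - textons[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(textons)).astype(float)
    return hist / hist.sum()

# Build one model histogram per training image of each texture class.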
Step 3: Texture recognition

The image of the unknown texture is filtered and the dictionary textons closest to the filter outputs are found.

The histogram of textons found in the image is then compared against the histograms of the training texture images to find the closest match.

Figure 6 (Varma and Zisserman). Classification stage: A novel image is classified by forming its histogram and then using a nearest neighbour classifier to pick the closest model to it (in the χ² sense). The novel image is declared as belonging to the texture class of the closest model.
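A sketch of the matching step, assuming the χ² distance mentioned in the caption; chi2_distance and classify are illustrative names.

import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def classify(query_hist, model_hists, model_labels):
    """Nearest-neighbour classification of a texton histogram."""
    dists = [chi2_distance(query_hist, m) for m in model_hists]
    return model_labels[int(np.argmin(dists))]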
Problems

While the papers report good results I am having trouble replicating them.

Cluster centres seem to change dramatically on different training sets.

How many clusters should I use?

Clustering takes a long time.

I don't know which filters produce the most useful data for separating different textures.

I have lots of 8 megapixel images, each pixel with 36 features.
Machine Learning Algorithms

K-means clustering: An unsupervised algorithm that learns which things go together. The user has to specify K.

Bayes Classifier: Assumes features are Gaussian distributed and independent of each other. For each class, find the mean and variance of its attributes. Then, given some attributes, compute the probability that it is a member of each class and take the most probable one. Works surprisingly well and can handle large data sets.

Decision Trees: Finds data features and thresholds that best split the data into separate classes. This is repeated recursively until the data has been split into homogeneous (or mostly homogeneous) groups. Can immediately identify the features that are most important.

Boosting: A collection of weak classifiers (typically single-level decision trees). During training each classifier learns a weight for its vote from its accuracy on the data. Classifiers are trained one by one; data that is poorly represented by earlier classifiers is given a higher weighting so that subsequent classifiers pay more attention to points where the errors are large.

Random Forests: An ensemble of decision trees. During learning, tree nodes are split using a random subset of data features. All trees vote to produce a final answer. Can be one of the most accurate techniques.
Machine Learning Algorithms

Expectation Maximization (EM) / Maximum Likelihood Estimation (MLE): Typically we assume the data is a mixture of Gaussians. In this case EM fits N multidimensional Gaussians to the data. The user has to specify N.

Neural Networks / Multilayer Perceptron: Slow to train but fast to run; design is a bit of an art, but can be the best performer on some problems.

Support Vector Machines: The algorithm finds hyperplanes that maximally separate classes. Projecting the data into higher dimensions makes the data more likely to be linearly separable. Works well when there is limited data.
Machine Learning Problems
(Figures: the fitted model compared with the true function for each case.)

Bias: The model assumptions are too strong. It cannot fit the data well. Errors on training data and on test data will be large.

Variance: The model fits the training data too well and has included the noise. It cannot generalize. Errors on training data will be small but errors on test data will be large.
Decision Tree for predicting Californian house prices from latitude and longitude

(Tree diagram: the root splits on Latitude < 38.485, with further splits on thresholds such as Longitude < -121.655, Latitude < 39.355, Latitude < 37.925 and Longitude < -118.315; each leaf holds a predicted log house price, ranging from about 11.16 to 12.54.)
Recursive partitioning of the data
Deciding how to split nodes

A nice split

Histogram of classes at node

Condition?

true false
Deciding how to split nodes

A not so useful split

Histogram of classes at node

Condition?

true false
Deciding how to split nodes
Which attribute of the data at a node provides the highest information gain?

(Histograms: low entropy vs. high entropy.)

Entropy: H(X) = -Σᵢ pᵢ log pᵢ

Specific Conditional Entropy: H(X | Y = v)
= The entropy of X among only those records in which Y = v

Conditional Entropy: H(X | Y)
= The average specific conditional entropy of X
= Σⱼ P(Y = vⱼ) H(X | Y = vⱼ)

Information Gain: IG(X | Y) = H(X) - H(X | Y)

H(X) indicates the randomness of X
H(X | Y) indicates the randomness of X assuming I know Y
The difference, H(X) - H(X | Y), indicates the reduction in randomness achieved by knowing Y
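The quantities above translate directly into code. A small sketch with illustrative names, assuming a categorical attribute:

import numpy as np

def entropy(labels):
    """H(X) = -sum_i p_i log2 p_i over the class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(labels, attribute):
    """H(X | Y) = sum_j P(Y = v_j) H(X | Y = v_j)."""
    h = 0.0
    for v in np.unique(attribute):
        mask = attribute == v
        h += mask.mean() * entropy(labels[mask])
    return h

def information_gain(labels, attribute):
    """IG(X | Y) = H(X) - H(X | Y)."""
    return entropy(labels) - conditional_entropy(labels, attribute)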
(Figure: example histograms for the entropy H(X), the specific conditional entropies H(X | Y = v1), H(X | Y = v2), H(X | Y = v3), and the conditional entropy H(X | Y) = Σⱼ P(Y = vⱼ) H(X | Y = vⱼ).)
Information Gain from thresholding a real-valued attribute

Define IG(X | Y:t) as H(X) - H(X | Y:t)

Define H(X | Y:t) =
H(X | Y < t) P(Y < t) + H(X | Y >= t) P(Y >= t)

IG(X | Y:t) is the information gain for predicting X if all you know is whether Y is less than or greater than t
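A corresponding sketch for a real-valued attribute, scanning candidate thresholds at the midpoints of the sorted attribute values (one common choice, not the only one); the names are illustrative.

import numpy as np

def _entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def threshold_information_gain(labels, y, t):
    """IG(X | Y:t) = H(X) - [H(X | Y < t) P(Y < t) + H(X | Y >= t) P(Y >= t)]."""
    below = y < t
    if below.all() or not below.any():
        return 0.0                                    # degenerate split
    h_cond = (below.mean() * _entropy(labels[below])
              + (~below).mean() * _entropy(labels[~below]))
    return _entropy(labels) - h_cond

def best_threshold(labels, y):
    """Pick the midpoint threshold with the highest information gain."""
    vals = np.unique(y)
    candidates = (vals[:-1] + vals[1:]) / 2.0
    if len(candidates) == 0:
        return None, 0.0                              # constant attribute: nothing to split on
    gains = [threshold_information_gain(labels, y, t) for t in candidates]
    best = int(np.argmax(gains))
    return candidates[best], gains[best]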
A Decision Tree represents a structured plan of a set of attributes to test in order to predict the output.

To decide which attribute should be tested first, simply find the one with the highest information gain.

Then recurse (see the sketch below).

Stop when:
All records at a node have the same output, or
All records at a node have the same attributes; in this case we make a classification based on the majority output.

The tree directly provides an ordering of the importance of each attribute in making the classification.
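A compact sketch of this recursion, assuming categorical attributes; the stopping rules follow the slide and the helper names are illustrative.

import numpy as np
from collections import Counter

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(labels, column):
    h = entropy(labels)
    for v in np.unique(column):
        mask = column == v
        h -= mask.mean() * entropy(labels[mask])
    return h

def build_tree(X, y, attributes):
    """X: (n, d) array of categorical attributes, y: class labels,
    attributes: list of column indices still available to test."""
    if len(np.unique(y)) == 1 or not attributes:
        return Counter(y).most_common(1)[0][0]        # leaf: majority class
    gains = [info_gain(y, X[:, a]) for a in attributes]
    best = attributes[int(np.argmax(gains))]          # highest information gain first
    node = {"attribute": best, "children": {}}
    for v in np.unique(X[:, best]):
        mask = X[:, best] == v
        remaining = [a for a in attributes if a != best]
        node["children"][v] = build_tree(X[mask], y[mask], remaining)
    return node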

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

Applet Demo
The tree maps each data point to a leaf. Each leaf stores the distribution of classes that end up there.
Overfitting

If we expand the tree as far as we can go we are likely to end up with many leaf nodes that contain only one record.

It is likely that we have fitted the noise in the data. This will result in the training set error being very small, but the test set error being high.

Pruning:
Starting at the bottom of the tree, delete splits that do not add predictive power. Use a chi-squared test to decide whether the distributions generated by the split are significantly different. You have to provide a threshold which represents your willingness to fit noise.
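One way to realize this test, as a sketch assuming scipy is available; the significance level alpha plays the role of the noise-fitting threshold, and the function name is illustrative.

import numpy as np
from scipy.stats import chi2_contingency

def split_is_significant(left_labels, right_labels, classes, alpha=0.05):
    """Chi-squared test on the class counts in the two children of a split;
    if the distributions do not differ significantly the split can be pruned."""
    table = np.array([[np.sum(left_labels == c) for c in classes],
                      [np.sum(right_labels == c) for c in classes]])
    table = table[:, table.sum(axis=0) > 0]   # drop classes absent from both children
    if table.shape[1] < 2:
        return False                          # only one class present: nothing to test
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha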
Random Forests

An ensemble of decision trees. During learning, tree nodes are split using a random subset of data features. All trees vote to produce a final answer.

Why do this?

It was found that optimal cut points can depend strongly on the training set used (high variance).

This led to the idea of using multiple trees to vote for a result.

For the use of multiple trees to be most effective the trees should be as independent as possible. Splitting using a random subset of features hopefully achieves this.

Averaging the outputs of trees reduces overfitting to noise.

Pruning is not needed.


Leo Breiman, "Random Forests", Machine Learning Vol 45 No 1, 2001

Typically 5 to 100 trees are used. Often only a few trees are needed.

Results seem fairly insensitive to the number of random attributes that are tested for each split. A common default is to use the square root of the number of attributes (see the sketch after this list).

Trees are fast to generate because fewer attributes have to be tested for each split and no pruning is needed.

Memory needed to store the trees can be large.
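These defaults map directly onto scikit-learn's RandomForestClassifier; the data below is a random stand-in just to show the parameters.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: 1000 samples with 36 features (e.g. per-pixel filter outputs).
X = np.random.randn(1000, 36)
y = np.random.randint(0, 4, size=1000)

forest = RandomForestClassifier(
    n_estimators=100,        # number of trees that vote on the answer
    max_features="sqrt",     # random subset size: sqrt(number of attributes)
    random_state=0,
)
forest.fit(X, y)
prediction = forest.predict(X[:1])   # all trees vote; the majority wins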


Extremely Randomized Trees

Not only randomly select a subset of attributes to evaluate for each split but, in the case of numerical attributes, randomly select the threshold to split the value on (sketched below).

They seem to work slightly better than Random Forests, though this may be a result of the slightly different information gain score used.

Completely random trees can also work well. Here a single attribute is selected at random for each split. No evaluation of the attribute split is therefore needed. Trees are trivial to generate.
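A sketch of the extra randomization for one numerical attribute, with illustrative names; scikit-learn's ExtraTreesClassifier implements the full ensemble version.

import numpy as np

def random_threshold_split(values, rng):
    """Extremely randomized split: draw the threshold uniformly between the
    attribute's min and max instead of searching for the best cut point."""
    t = rng.uniform(values.min(), values.max())
    return t, values < t        # threshold and mask of records going left

rng = np.random.default_rng(0)
values = np.array([1.2, 3.4, 0.7, 2.9, 5.1])
t, left_mask = random_threshold_split(values, rng)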

Pierre Geurts, D. Ernst, L. Wehenkel, "Extremely Randomized Trees", Machine Learning Vol 63 No 1, 2006
Some results taken from the Geurts paper.

(Plots compare a single tree, a Random Forest, and Extremely Random Trees.)
Importance of a Particular Feature Variable

Breiman's algorithm:

1. Train a classifier.

2. Perform validation to determine the accuracy of the classifier.

3. For each data point randomly choose a new value for the feature variable from among the values that the feature has in the rest of the data set. (This ensures the distribution of the feature values remains unchanged but the meaning of the feature variable is destroyed.)

4. Run the classifier on the altered data and measure its accuracy. If the accuracy is degraded badly then the feature is very important.

5. Restore the data and repeat the process using every other feature variable. The result will be an ordering of each feature variable by its importance.
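A minimal sketch of steps 3 to 5, assuming model is any fitted classifier with a predict method; here the column is shuffled rather than resampled, which likewise preserves the distribution while destroying the association. (scikit-learn's permutation_importance offers a ready-made version.)

import numpy as np

def permutation_importance(model, X, y, rng):
    """Shuffle each feature column in turn (keeping its distribution but
    destroying its meaning) and record the drop in accuracy."""
    baseline = np.mean(model.predict(X) == y)
    drops = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - np.mean(model.predict(X_perm) == y))
    return np.argsort(drops)[::-1]            # feature indices, most important first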
Regression Trees vs Classification Trees

A Regression Tree attempts to predict a continuous numerical value rather than a discrete classification.

Evaluation of each node split has to be made on the variance of the split distributions rather than the information gain (see the sketch below).

(Figure: two candidate splits where the entropy is equal but the variance is not.)

A large Random Forest or set of Extremely Randomized Trees acts as a linear interpolator.
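A sketch of the variance-based split score, the regression-tree analogue of information gain; the name is illustrative.

import numpy as np

def variance_reduction(y, left_mask):
    """Score a regression split by the reduction in variance it achieves."""
    right_mask = ~left_mask
    if left_mask.all() or right_mask.all():
        return 0.0
    return (np.var(y)
            - left_mask.mean() * np.var(y[left_mask])
            - right_mask.mean() * np.var(y[right_mask]))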
(Plots: the interpolation produced by 1 Extremely Random Tree vs. 100 Extremely Random Trees.)
Random Ferns

Extending the randomness and simplicity even further

A fern can be thought of as a constrained tree where the same binary test is performed at each level of the tree.

Özuysal, Fua and Lepetit, CVPR 2007
Özuysal, Calonder, Lepetit and Fua, PAMI 2009 (diagrams taken from Özuysal's pages)
Recognizing keypoints with Random Ferns

Keypoint features fᵢ are the sign of the intensity difference of two pixels at random locations in a patch about the keypoint.

Each keypoint has N features but each fern is constructed from a random subset of S features.

(Diagram: Fern 1, Fern 2, Fern 3.)

The output of each feature test can be concatenated to form a binary number. This corresponds to the index of the leaf node that we end up at in the equivalent constrained tree.
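A sketch of how one fern maps a patch to its leaf index, assuming the pixel-comparison test described above; the pixel pairs and patch size are illustrative (S = 11 matches the typical value quoted later).

import numpy as np

def fern_index(patch, pixel_pairs):
    """Concatenate the S binary pixel-difference tests of one fern into an
    integer leaf index in the range [0, 2**S)."""
    index = 0
    for (r1, c1), (r2, c2) in pixel_pairs:
        bit = 1 if patch[r1, c1] > patch[r2, c2] else 0
        index = (index << 1) | bit
    return index

# Each fern has its own list of S random pixel pairs, fixed at training time.
rng = np.random.default_rng(0)
S, patch_size = 11, 32
pixel_pairs = [((rng.integers(patch_size), rng.integers(patch_size)),
                (rng.integers(patch_size), rng.integers(patch_size)))
               for _ in range(S)]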
Training: Example views of each keypoint are passed through each fern. A histogram of the leaf indices that each keypoint class ends up at is built up.

(Diagram: Fern 1, Fern 2, Fern 3.)

Recognition: The output of the feature tests on the candidate keypoint places us at a leaf node in each fern. Each fern gives a probability for each of the possible keypoints. These are combined assuming independence between the fern distributions.
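A sketch of training and recognition under these assumptions, given leaf indices already computed for every fern (e.g. with the sketch above); the names are illustrative, and the add-one smoothing is an assumed but common way to avoid zero probabilities.

import numpy as np

def train_ferns(leaf_indices_per_fern, labels, num_classes, S):
    """leaf_indices_per_fern[m][i] is the leaf index of training sample i in fern m.
    Build one (num_classes, 2**S) histogram per fern, with add-one smoothing."""
    hists = []
    for leaf_indices in leaf_indices_per_fern:
        h = np.ones((num_classes, 2 ** S))
        for idx, c in zip(leaf_indices, labels):
            h[c, idx] += 1
        hists.append(h / h.sum(axis=1, keepdims=True))
    return hists

def classify_keypoint(leaf_indices, hists):
    """Combine the per-fern class probabilities assuming independence between
    ferns (sum of log probabilities) and return the most likely class."""
    log_post = sum(np.log(h[:, idx]) for h, idx in zip(hists, leaf_indices))
    return int(np.argmax(log_post))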
Random Ferns are Semi-Naïve Bayes Classifiers

Typical values used by Özuysal, Fua and Lepetit:
Number of features: N = 450
Number of Ferns: M = 30 to 50
Number of features/Fern: S = 11

Consider the options:

One large Fern made up from all the features -> 2^N parameters (too large!)

N single-feature Ferns (Naïve Bayes Classifier) -> N parameters
Assuming each feature is independent is too simplistic and keypoint pose variations are not handled well.

M Ferns, each consisting of S features -> M x 2^S parameters
Assumes that each group of S features is independent (Semi-Naïve Bayes Classifier). Varying M and S allows tuning of complexity and performance.
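Worked through with the typical values above: a single fern over all N = 450 features would need 2^450 (roughly 10^135) table entries per keypoint class; a Naïve Bayes classifier needs only 450; and M = 50 ferns of S = 11 features need 50 x 2^11 = 102,400 entries per class, which is small enough to store while still modelling the dependencies within each group of S features.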
Recognizing Textures with Trees?
Some are starting to do this:
Maree, Geurts and Wehenkel 2009
Shotton, Johnson and Cipolla CVPR 2008
Tree/Fern-based learning algorithms

Simple and can perform very well.

Training is fast and leaf histograms can be incrementally updated.

Can require considerable memory to store.

Stochastic force seems to outgun careful design!
