
Summary of HW Assigned

Assignment                                          Assigned             Due
HW0: Term paper proposal, first draft               4 Sep 2013           13 Sep 2013
HW1: WEKA features                                  6 Sep 2013, Friday   13 Sep 2013
HW2: Decision Tree: Play Tennis, Paper & Pencil     9 Sep 2013, Monday   16 Sep 2013
HW3: MATLAB Program on IRIS Data, Decision Trees    11 Sep 2013          23 Sep 2013

Classification: Example

IRIS Data Classification using Decision Trees
MATLAB Statistical Package

Where to Get Data?

The UC Irvine Data Sets
http://archive.ics.uci.edu/ml/datasets/Iris
This is perhaps the best known database to be found in
the pattern recognition literature.
Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.)
The data set contains 3 classes of 50 instances each,
where each class refers to a type of iris plant.
One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

Fisher's IRIS Data Set

150 data items that contain measurements
about iris flowers from 3 species:
Setosa (50), Versicolor (50), and Virginica (50)

The data include information about four
features of the flowers:
sepal length, sepal width, petal length, petal width.

Attribute Information
Col.
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. Class1: Iris Setosa
6. Class2: Iris Versicolour
7. Class3: Iris Virginica

Flower Parts

Load Iris Data into MATLAB

Visit the UCI ML web site
Download the iris data into a file and name it
fisheriris
Study the data
See how the sepal measurements differ
between species. You can use the two
columns containing sepal measurements.

MATLAB Instruction: gscatter

Scatter plot by group

h = gscatter(...)
gscatter(x,y,group) creates a scatter plot of x and y, grouped by group.
x and y are vectors of the same size. group is a grouping variable in the form of
a categorical variable, vector, string array, or cell array of strings.
gscatter(x,y,group,clr,sym,siz) specifies the color, marker type, and size for each
group. clr is a string array of colors recognized by the plot function.
The default for clr is 'bgrcmyk'. sym is a string array of symbols recognized by
the plot command, with the default value '.'. siz is a vector of sizes, with the
default determined by the 'DefaultLineMarkerSize' property. If you do not
specify enough values for all groups, gscatter cycles through the specified
values as needed.
gscatter(x,y,group,clr,sym,siz,doleg) controls whether a legend is displayed on the
graph (doleg is 'on', the default) or not (doleg is 'off').

MATLAB: Loading Data & Scatter Plotting

load fisheriris
gscatter(meas(:,1), meas(:,2), species, 'rgb', 'csd');
xlabel('Sepal length');
ylabel('Sepal width');
N = size(meas,1);
NOTE:
'rgb' = red, green, blue
'csd' = circle, square, diamond

Resulting Scatter Plot

For Home Work #3

For training data, use Fisher's sepal measurements
for iris versicolor and virginica: (Try this at home)
load fisheriris
SL = meas(51:end,1);
SW = meas(51:end,2);
group = species(51:end);
h1 = gscatter(SL,SW,group,'rb','v^',[],'off');
set(h1,'LineWidth',2)
legend('Fisher versicolor','Fisher virginica',...
'Location','NW')

What is the Problem to be Solved?

Suppose you measure a sepal and petal from
an iris, and you need to determine its species
on the basis of those measurements.
The classify function can perform classification
using different types of discriminant analysis.
(Later, we will use the classify function, which uses
Linear/Quadratic Discriminant Analysis, LDA
& QDA.)
Now let us use a Decision Tree Classifier.

Decision Trees

A decision tree is a classification algorithm.
A decision tree is a set of simple rules, such as "if the
sepal length is less than 5.45, classify the specimen as
setosa."
Decision trees are also nonparametric because they do
not require any assumptions about the distribution of
the variables in each class.
The classregtree class creates a decision tree.
Create a decision tree for the iris data and see how
well it classifies the irises into species.
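The tree-building step described above can be sketched as follows. This is a minimal sketch, assuming the classregtree class used later in these slides, fitting on the first two columns of meas (the sepal measurements); the 'names' values are illustrative labels:

```matlab
% Fit a decision tree on the two sepal measurements
load fisheriris                              % loads meas (150x4) and species
t = classregtree(meas(:,1:2), species, ...
                 'names', {'SL','SW'});      % label the two predictors
view(t)                                      % display the fitted tree
```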

Visualizing the Decision Boundaries

It is interesting to see how the decision tree
method divides the plane.
Use the same technique as above to visualize
the regions assigned to each species.
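One way to do this can be sketched as below. This is a minimal sketch assuming a fitted tree t from the earlier step; the grid limits are illustrative values that cover the sepal measurement ranges:

```matlab
% Evaluate the tree on a dense grid and color each point by predicted class
[x, y] = meshgrid(4:0.05:8, 2:0.05:4.5);      % grid over sepal length/width
grid_pts = [x(:), y(:)];
pred = t.eval(grid_pts);                      % predicted species per grid point
gscatter(grid_pts(:,1), grid_pts(:,2), pred, 'rgb', '...');
xlabel('Sepal length'); ylabel('Sepal width');
```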

One Method


Decision Tree


Reading the Tree

This cluttered-looking tree uses a series of rules
of the form "SL < 5.45" to classify each specimen
into one of 19 terminal nodes.
To determine the species assignment for an
observation, start at the top node and apply the
rule. If the point satisfies the rule you take the
left path, and if not you take the right path.
Ultimately you reach a terminal node that
assigns the observation to one of the three
species.

Home Work #3

Your job is to reproduce the results shown so
far (and no more) by using MATLAB.
You can use the code suggested by me, or you
can modify it.
I cannot guarantee that the code suggested by
me works on your system.
You can take two weeks to do this.

Rest of the Slides

I will stop the Decision Tree discussion at this
point.
We will pick it up later.
The next topic is Neural Nets.

Re-Substitution & CV Errors

dtclass = t.eval(meas(:,1:2));
bad = ~strcmp(dtclass, species);
dtResubErr = sum(bad) / N
dtClassFun = @(xtrain,ytrain,xtest) ...
    (eval(classregtree(xtrain,ytrain), xtest));
dtCVErr = crossval('mcr', meas(:,1:2), species, ...
    'predfun', dtClassFun, 'partition', cp)
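The snippet above assumes that t (the fitted tree) and cp (a cross-validation partition) were created on earlier slides, which appear here only as images. A minimal sketch of those definitions, with the fold count as an illustrative choice:

```matlab
load fisheriris
N = size(meas,1);
t = classregtree(meas(:,1:2), species, 'names', {'SL','SW'});  % fit the tree
cp = cvpartition(species, 'kfold', 10);   % stratified 10-fold CV partition
```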

Comment

For the decision tree algorithm, the cross-validation
error estimate is significantly larger than the
re-substitution error.
This shows that the generated tree overfits the training
set. In other words, this is a tree that classifies the
original training set well, but the structure of the tree
is sensitive to this particular training set, so its
performance on new data is likely to degrade.
It is often possible to find a simpler tree that performs
better than a more complex tree on new data.

Decision Tree Pruning

Try pruning the tree.
First compute the re-substitution error for
various subsets of the original tree.
Then compute the cross-validation error for
these sub-trees.
A graph shows that the re-substitution error is
overly optimistic. It always decreases as the tree
size grows, but beyond a certain point, increasing
the tree size increases the cross-validation error
rate.
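The pruning-code slide appears here only as an image; a minimal sketch of how this is typically done with classregtree's test method, assuming t from the earlier slides:

```matlab
% Re-substitution and cross-validation error at each pruning level
resubcost = test(t, 'resub');
[cost, secost, ntermnodes, bestlevel] = test(t, 'cross', ...
                                             meas(:,1:2), species);
plot(ntermnodes, cost, 'b-', ntermnodes, resubcost, 'r--');
xlabel('Number of terminal nodes');
ylabel('Cost (misclassification error)');
legend('Cross-validation', 'Re-substitution');
pt = prune(t, bestlevel);   % prune to the best level found
```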

Pruning Code

Performance of Pruned Tree

Shape of Pruned Tree

pt = prune(t, bestlevel);
view(pt)

Additional Analysis
on the Iris Data Set

Ranges of the Measurements
of Iris Flowers (in centimeters)

Sepal length   Sepal width   Petal length   Petal width
4.3–7.9        2.0–4.4       1.0–6.9        0.1–2.5

Ranges Within the Classes

               Setosa     Versicolor   Virginica
Sepal length   4.3–5.8    4.9–7.0      4.9–7.9
Sepal width    2.3–4.4    2.0–3.4      2.2–3.8
Petal length   1.0–1.9    3.0–5.1      4.5–6.9
Petal width    0.1–0.6    1.0–1.8      1.4–2.5

Granulated Ranges
(of sepal length)

4.3–4.9   Setosa
4.9–5.8   Setosa, Versicolor, Virginica
5.8–7.0   Versicolor, Virginica
7.0–7.9   Virginica

Granulated Ranges
(of sepal width)

2.0–2.2   Versicolor
2.2–2.3   Versicolor, Virginica
2.3–3.4   Setosa, Versicolor, Virginica
3.4–3.8   Setosa, Virginica
3.8–4.4   Setosa

Granulated Ranges
(of petal length)

1.0–1.9   Setosa
3.0–4.5   Versicolor
4.5–5.1   Versicolor, Virginica
5.1–6.9   Virginica

Granulated Ranges
(of petal width)

0.1–0.6   Setosa
1.0–1.4   Versicolor
1.4–1.8   Versicolor, Virginica
1.8–2.5   Virginica

Fuzzy Models: Linguistic Labels
(for sepal length)

4.3–4.9   short sepal          A11
4.9–5.8   medium long sepal    A12
5.8–7.0   long sepal           A13
7.0–7.9   very long sepal      A14

Fuzzy Models: Linguistic Labels
(for sepal width)

2.0–2.2   very narrow sepal    A21
2.2–2.3   narrow sepal         A22
2.3–3.4   medium wide sepal    A23
3.4–3.8   wide sepal           A24
3.8–4.4   very wide sepal      A25

Fuzzy Models: Linguistic Labels
(for petal length)

1.0–1.9   very short petal     A31
3.0–4.5   medium long petal    A32
4.5–5.1   long petal           A33
5.1–6.9   very long petal      A34

Fuzzy Models: Linguistic Labels
(for petal width)

0.1–0.6   very narrow petal    A41
1.0–1.4   medium wide petal    A42
1.4–1.8   wide petal           A43
1.8–2.5   very wide petal      A44
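The interval tables above can be applied programmatically. A minimal sketch in MATLAB, with the breakpoints copied from the petal-width table and the label codes A41–A44 as on the slides; note that the 0.6–1.0 interval is not covered by the table, so it is marked as a gap here:

```matlab
% Map a petal width (cm) to its linguistic label from the table above
edges  = [0.1 0.6 1.0 1.4 1.8 2.5];
labels = {'very narrow petal (A41)', '(gap)', 'medium wide petal (A42)', ...
          'wide petal (A43)', 'very wide petal (A44)'};
pw = 1.6;                                             % example measurement
idx = find(pw >= edges(1:end-1) & pw <= edges(2:end), 1);
disp(labels{idx})                                     % wide petal (A43)
```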
