
Classification in Data Mining

Decision tree algorithm


Classification
Sl No | Acad Background | Work Experience | CGPA (0-9 Scale) | Interview Score | Performance in Company
1     | Science  | < 1 year  | > 7.5     | Poor | Excellent
2     | Arts     | > 3 years | 6.5 - 7.5 | Good | Good
3     | Commerce | 1-3 years | 6.5 - 7.5 | Good | Good
4     | Engg     | > 3 years | < 6.5     | Good | Good
5     | Engg     | < 1 year  | 6.5 - 7.5 | Poor | Fair
6     | Science  | 1-3 years | < 6.5     | Good | Excellent
7     | Engg     | 1-3 years | > 7.5     | Poor | Good
8     | Science  | > 3 years | < 6.5     | Good | Fair
9     | Engg     | 1-3 years | 6.5 - 7.5 | Good | Good
10    | Science  | < 1 year  | 6.5 - 7.5 | Poor | Fair
11    | Science  | > 3 years | < 6.5     | Good | Fair
12    | Commerce | 1-3 years | > 7.5     | Good | Excellent
13    | Engg     | > 3 years | < 6.5     | Good | Fair
...
?     | Engg     | 1-3 years | < 6.5     | Good | ?????

BDM 2017 (Decision Tree) 2


Example

Classification: a simple example

BDM 2017 (Decision Tree) 3


What is Classification?
Classification is learning a function that maps
(classifies) a data record into one of several
predefined classes
A data record, described by a number of attributes, is
called an instance or a case
Predefined classes are based on the values of the
class attribute
Learning is done by analysing records whose classes
were identified historically (e.g., by managers); these
records are used as training samples

BDM 2017 (Decision Tree) 4


Questions
Which items are bought together?

Which customers are likely to respond to offers?

Which loan applicant will pay in time?

What is the price of a used Toyota Corolla?

Which group of employees performs similarly?

Which credit card transaction is normal or fraudulent?


BDM 2017 (Decision Tree) 5
Questions
Which items may be put on sale?

Which customers have similar buying patterns?

Which medicine to recommend in certain diseases?

Which symptoms may appear together?

How to reduce the empty truck space in transportation?

BDM 2017 (Decision Tree) 6


Example findings
Findings | Organization | Suggested Explanation
Buy diapers, likely to buy beer | Osco Drug | Daddy needs a beer
Dolls and candy bars | Walmart | Kids come along for errands
Staplers reveal hires | A large retailer | Stapler purchases are often part of an office kit
Mac users book expensive hotels | Orbitz | Macs are expensive, so their users have more financial power

BDM 2017 (Decision Tree) 7


Example findings
Findings | Organization | Suggested Explanation
Crime rises after elections | Researchers in India | Incumbent politicians crack down on crime before the election
Hungry judges rule negatively | Columbia University and Ben Gurion University (Israel) | Hunger and fatigue leave decision makers feeling less forgiving
Retirement is bad for health | University of Zurich | Unhealthy habits after retirement
Vegetarians miss fewer flights | An airline | Availability of veg meals provides an incentive to travel
BDM 2017 (Decision Tree) 8
Question

A company plans to suggest a product to potential
customers. A database of 1 million customer records
exists; 20,000 purchases (2%) is the goal.
How to go about it?

BDM 2017 (Decision Tree) 9


An Example
A company plans to suggest a product to potential
customers. A database of 1 million customer records
exists; 20,000 purchases (2%) is the goal.
Instead of contacting all 1,000,000 customers, only
100,000 were contacted and their responses to the
sale offer were recorded. This subset was used to
train a classifier to predict which customers would
decide to buy the product.
The remaining 900,000 customers were then presented
to the classifier, which classified 32,000 of them as
potential buyers. These 32,000 customers were
contacted, and the 2% response goal was achieved.
Total savings: $2,500,000.

BDM 2017 (Decision Tree) 10


Classification

Supervised learning: a two-step process
Learning y = f(X) and applying the function
Training data set and new data
Training data set, test data set and new data
Performance measure of the classifier
Overfitting
Training data set, validation data set, test data
set and new data (when multiple models are compared)
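A rough sketch of this two-step process, assuming scikit-learn and pandas (which these slides do not use); the toy DataFrame loosely mirrors the opening table, and all names are illustrative:

```python
# Sketch only: partition the data, learn y = f(X) on the training set, then
# apply the fitted classifier to held-out data. scikit-learn and pandas, the
# column names, and the toy records are all illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Background": ["Science", "Arts", "Commerce", "Engg", "Engg", "Science",
                   "Engg", "Science", "Engg", "Science", "Science", "Commerce"],
    "Experience": ["<1", ">3", "1-3", ">3", "<1", "1-3",
                   "1-3", ">3", "1-3", "<1", ">3", "1-3"],
    "Performance": ["Excellent", "Good", "Good", "Good", "Fair", "Excellent",
                    "Good", "Fair", "Good", "Fair", "Fair", "Excellent"],
})
X = pd.get_dummies(df.drop(columns="Performance"))   # one-hot encode predictors
y = df["Performance"]

# Training, validation and test partitions (here 50% / 25% / 25%)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.5, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)   # step 1: learn f

print("validation accuracy:", model.score(X_valid, y_valid))           # compare models here
print("test accuracy:", model.score(X_test, y_test))                   # unbiased final check
```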

BDM 2017 (Decision Tree) 11


Popular classification methods

Decision tree induction


Bayesian classification
K-nearest neighbours
Classification by neural networks

BDM 2017 (Decision Tree) 12


Characteristics of Classifiers
Accuracy: correctly specifies the class labels of new or
previously unseen data; measured as the number of
records in the test set correctly classified
Speed: how fast the classifier works
Robustness: ability to classify correctly given noisy
data
Scalability: ability to construct the classifier for large
amounts of data
Interpretability: level of insight provided by the
classifier

BDM 2017 (Decision Tree) 13


Decision Trees
One of the most widely used and practical
methods for classification through inductive
inference
a method for approximating discrete-valued
target functions
instances are represented by attribute-value
pairs
Greedy, top-down recursive divide-and-conquer
approach

BDM 2017 (Decision Tree) 14


Decision Trees and Rules
Goal: Classify or predict an outcome based on a
set of predictors
The output is a set of rules

Example:
Goal: classify a record as will buy computer or
will not buy
Rule might be IF (Income > 92.5) AND
(Education = poor) AND (FamilySize = small)
THEN buy = no (class = 0)
Rules are represented by tree diagrams

BDM 2017 (Decision Tree) 15


Input data
age?
  youth       -> student?
                   no  -> no
                   yes -> yes
  middle_aged -> yes
  senior      -> credit_rating?
                   fair      -> yes
                   excellent -> no

attribute, node, branch, test, decision and terminal node


Every path constitutes a decision rule

BDM 2017 (Decision Tree) 16


How to generate a decision tree?

BDM 2017 (Decision Tree) 17


Rule Generation

Once a decision tree has been constructed, it
is a simple matter to convert it into an
equivalent set of rules (a function)
To generate rules, trace each path in the
decision tree, from root node to leaf node,
recording the test outcomes as antecedents
and the leaf-node classification as the
consequent
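As a hedged illustration, scikit-learn's export_text (an assumption; the slides do not use scikit-learn) prints exactly these root-to-leaf paths for a fitted tree; the toy data and feature names below are made up:

```python
# Sketch: each root-to-leaf path of a fitted tree is one rule (tests are the
# antecedents, the leaf class is the consequent). Toy data; scikit-learn's
# export_text is used here as an illustration, not as the slides' tool.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 0], [30, 1], [45, 0], [50, 1], [60, 0], [65, 1]]   # [age, student]
y = ["no", "yes", "yes", "yes", "yes", "no"]                 # buys_computer

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "student"]))   # one printed path per rule
```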

BDM 2017 (Decision Tree) 18


Rules
If age = youth and student = no then
buys_computer = no
If age = youth and student = yes then
buys_computer = yes
If age = middle_aged then buys_computer = yes
If age = senior and credit_rating = fair then
buys_computer = yes
If age = senior and credit_rating = excellent then
buys_computer = no

BDM 2017 (Decision Tree) 19


Basic Decision Tree Algorithm
Greedy approach - identifies which attribute would be the
best classifier for the target attribute in a given data set:
the splitting attribute A
A is discrete-valued (k values): k branches;
remove A from the attribute list
A is continuous: 2 branches at the midpoint of two known
adjacent values, A <= split_point and A > split_point,
and do not remove A from the list
A is discrete-valued and a binary split is required:
2 branches, A in S_A or not (for binary decision trees)
No backtracking
BDM 2017 (Decision Tree) 20
Basic Decision Tree Algorithm

Termination of recursive partitioning:
all tuples in D belong to the same class
no remaining attributes to partition D; majority
voting employed
D_j is empty; majority voting in D employed

ID3, C4.5, C5.0 are decision tree algorithms
CART (Classification and Regression Trees)
is a binary decision tree algorithm (XLMiner)
BDM 2017 (Decision Tree) 21
Attribute selection
A heuristic for selecting the splitting criterion
The objective is to separate a given data partition
into individual classes: ideally, pure partitions
Every attribute is given a score, and the attribute
with the best score is chosen (impurity function)
Information Gain, Gain Ratio and the Gini Index
(XLMiner uses a variation of the Gini Index called
delta splitting rules) are popular attribute
selection measures

BDM 2017 (Decision Tree) 23


Attribute selection
Entropy - a measure from information theory,
characterizes the (im)purity, or homogeneity,
of an arbitrary collection of examples
Entropy - Information content of the set D -
Examine the attributes to add at the next level of
the tree using an entropy calculation
Choose the attribute that minimizes the entropy

BDM 2017 (Decision Tree) 24


How to compute entropy?

Entropy(D) = - Σ_c p_c log2(p_c)
           = - Σ_c (n_c / n) log2(n_c / n)

where p_c = n_c / n is the proportion of the n records in D that belong to class c.
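A minimal sketch of this calculation in Python (not part of the slides); the first check reproduces the [15+, 10-] example worked out a few slides later.

```python
# A minimal entropy helper matching the formula above (illustrative only).
from math import log2

def entropy(counts):
    """Entropy of a set, given the record count of each class."""
    n = sum(counts)
    # classes with count 0 (or holding all n records) contribute 0 to the sum
    return sum(-(c / n) * log2(c / n) for c in counts if 0 < c < n)

print(entropy([15, 10]))   # ~0.971: 15 positive and 10 negative examples
print(entropy([20, 0]))    # 0: perfectly homogeneous set
print(entropy([10, 10]))   # 1.0: two classes equally represented
```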

BDM 2017 (Decision Tree) 25


Meaning of entropy value

Entropy varies between zero and one when
the members belong to only two classes;
otherwise it may be greater than 1
The entropy is zero when the set is perfectly
homogeneous.
The entropy is one (for two classes) when the set is
maximally inhomogeneous, i.e., the two classes are
equally represented.

BDM 2017 (Decision Tree) 26


How entropy changes

The disorder (entropy) of a set containing members of
two classes, A and B, can be plotted as a function of the
fraction of the set belonging to class A.
In the diagram below, the total number of
instances in both classes combined is two.

BDM 2017 (Decision Tree) 27


How entropy changes

In the diagram below, the total number of
instances in both classes is eight.

BDM 2017 (Decision Tree) 28


Entropy - a measure of homogeneity

To illustrate, suppose D is a collection of 25
examples, including 15 positive and 10
negative examples [15+, 10-].
The entropy of D relative to this classification is
Entropy(D) = - (15/25) log2 (15/25) - (10/25) log2 (10/25)
           ≈ 0.971

BDM 2017 (Decision Tree) 29


Interpretation of entropy value
Entropy is 0 if all members of D belong to the same
class.
If all members are positive (p_pos = 1), then p_neg = 0, and
Entropy(D) = -1 log2(1) - 0 log2(0) = -1*0 - 0 = 0
(taking 0 log2(0) = 0).
Entropy is 1 (its maximum for two classes) when the collection
contains an equal number of positive and negative examples.
If the collection contains unequal numbers of positive and
negative examples, the entropy is between 0 and 1.
The higher the entropy, the more difficult it is to make a prediction.
The value of entropy can be greater than 1, depending on the
number of different classes the members of D belong to.

BDM 2017 (Decision Tree) 30


Information gain

Information gain measures the expected reduction
in entropy based on additional information
A measure of the effectiveness of an attribute in
classifying the training data, information gain, is
simply the expected reduction in entropy caused by
partitioning the examples according to this attribute.
Information gain(D,A) = Entropy of the original
collection D - expected value of the entropy after D
is partitioned using attribute A.

BDM 2017 (Decision Tree) 31


Choice of Attribute

BDM 2017 (Decision Tree) 32


Information gain

Gain(D, A) = Entropy(D) - Entropy(D, A)
           = Info(D) - Info(D, A)
           = - Σ_c (n_c / n) log2(n_c / n)                              [entropy of D]
             - Σ_b (n_b / n) × ( - Σ_c (n_bc / n_b) log2(n_bc / n_b) )  [average entropy after branching on A]

where n is the number of records in D, n_c the number in class c, n_b the number falling
into branch b of attribute A, and n_bc the number in branch b that belong to class c.
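A minimal sketch of this formula, assuming the class counts per branch are already tabulated; the example numbers are the well-known buys_computer counts (9 yes, 5 no, split on age), and all names are illustrative.

```python
# Sketch of Gain(D, A): entropy of D minus the weighted average entropy of the
# branches produced by splitting on A.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    n = sum(parent_counts)
    weighted = sum(sum(b) / n * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.247 (textbooks round to 0.246)
```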

BDM 2017 (Decision Tree) 34


Gini Index for CART, XLMiner
Gini index for a data set D containing k classes:
Gini(D) = 1 - Σ_i p_i²

p_i = proportion of cases in data set D that belong to class i

Gini(D) = 0 when all cases belong to the same class
Gini(D) is at a maximum when all classes are equally
represented (= 0.50 in the binary case)
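A minimal sketch of this index from class counts (illustrative; XLMiner's delta splitting rule is described above as a variation of it):

```python
# Sketch of the Gini index for a set, given the record count of each class.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))    # 0.0: all cases in one class
print(gini([5, 5]))     # 0.5: maximum for two equally represented classes
print(gini([5, 5, 5]))  # ~0.667: the maximum grows with the number of classes
```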

BDM 2017 (Decision Tree) 35


ID3, C4.5 (C5.0), CART
ID3 uses Information Gain, which favours attributes with a
large number of values

C4.5 uses the maximum Gain Ratio =
Gain(D,A) / SplitInfo(D,A)
where SplitInfo(D,A) = - Σ_b (n_b / n) log2(n_b / n)
(a sketch of the gain ratio follows this list)

CART uses Gini Index


CHAID uses the statistical chi-square (χ²) test for independence
C-SEP, G-statistic are other measures
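A minimal sketch of the gain ratio mentioned in the C4.5 item above; the helper and the example counts mirror the information-gain sketch earlier and are illustrative only.

```python
# Sketch of C4.5's gain ratio = Gain(D, A) / SplitInfo(D, A).
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, branch_counts):
    n = sum(parent_counts)
    gain = entropy(parent_counts) - sum(sum(b) / n * entropy(b) for b in branch_counts)
    split_info = -sum(sum(b) / n * log2(sum(b) / n) for b in branch_counts if sum(b) > 0)
    return gain / split_info if split_info > 0 else 0.0

print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.156 for the age split
```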

BDM 2017 (Decision Tree) 36


Classification trees
Classification trees in XLMiner
Main distinction: Categorical vs. Numeric variables
Numeric
Continuous
Integer
Categorical
Ordered (low, medium, high)
Unordered (male, female)

BDM 2017 (Decision Tree) 37


Variable handling
Numeric
Most algorithms in XLMiner take numeric data
May occasionally need to bin into categories for
outcome variables
IBM Modeler accepts Categorical variables

BDM 2017 (Decision Tree) 38


Example: Riding Mowers

Data: 24 households classified as owning or not
owning riding mowers
Predictors = Income, Lot Size

BDM 2017 (Decision Tree) 39


Income Lot_Size Ownership
60.0 18.4 owner
85.5 16.8 owner
64.8 21.6 owner
61.5 20.8 owner
87.0 23.6 owner
110.1 19.2 owner
108.0 17.6 owner
82.8 22.4 owner
69.0 20.0 owner
93.0 20.8 owner
51.0 22.0 owner
81.0 20.0 owner
75.0 19.6 non-owner
52.8 20.8 non-owner
64.8 17.2 non-owner
43.2 20.4 non-owner
84.0 17.6 non-owner
49.2 17.6 non-owner
59.4 16.0 non-owner
66.0 18.4 non-owner
47.4 16.4 non-owner
33.0 18.8 non-owner
51.0 14.0 non-owner
63.0 14.8 non-owner
BDM 2017 (Decision Tree) 40
How to split into Binary tree
Order records according to one variable, say
lot size
Find midpoints between successive values
E.g., first midpoint is 14.4 (halfway between
14.0 and 14.8)
Divide records into those with lot size > 14.4
and those <= 14.4
After evaluating that split, try the next one,
which is 15.4 (halfway between 14.8 and
16.0)
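A small sketch of these candidate split points, computed from the distinct Lot_Size values in the riding-mower table (illustrative code, not XLMiner's):

```python
# Candidate split points for a numeric variable: the midpoints between
# successive sorted values.
lot_sizes = [14.0, 14.8, 16.0, 16.4, 16.8, 17.2, 17.6, 18.4, 18.8, 19.2,
             19.6, 20.0, 20.4, 20.8, 21.6, 22.0, 22.4, 23.6]

values = sorted(set(lot_sizes))
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(midpoints[:3])   # [14.4, 15.4, 16.2] -- the first candidate splits
```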
BDM 2017 (Decision Tree) 41
The first split: Lot Size = 19,000

BDM 2017 (Decision Tree) 42


Second Split: Income = $84,000

BDM 2017 (Decision Tree) 43


After All Splits

BDM 2017 (Decision Tree) 44


Note: Categorical Variables for binary
splitting
Examine all possible ways in which the
categories can be split.
E.g., categories A, B, C can be split 3 ways (enumerated in the sketch after this list)
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
With many categories, # of splits becomes huge
XLMiner supports only binary categorical
variables and binary trees
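A small sketch of that enumeration; the helper name is illustrative:

```python
# Enumerate every two-way split of a categorical variable's values.
# For k categories there are 2**(k - 1) - 1 distinct binary splits.
from itertools import combinations

def binary_splits(categories):
    cats = list(categories)
    anchor = cats[0]                      # keep the first category on the left
    for r in range(1, len(cats)):         # so each split is listed only once
        for left in combinations(cats, r):
            if anchor in left:
                right = tuple(c for c in cats if c not in left)
                yield left, right

for left, right in binary_splits(["A", "B", "C"]):
    print(left, right)   # {A}|{B,C}, {A,B}|{C}, {A,C}|{B} -- 3 splits
```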
BDM 2017 (Decision Tree) 45
Recursive Partitioning Steps
Pick one of the predictor variables, x_i
Pick a value of x_i, say s_i, that divides the
training data into two or more (not necessarily
equal) portions
Measure how pure or homogeneous each
of the resulting portions is
Pure = containing records of mostly one class
The idea is to pick x_i and s_i to maximize purity
Repeat the process
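These steps can be sketched as a small recursive partitioner; the code below is an illustrative toy (binary splits on numeric predictors, Gini impurity, a depth limit), not CART's or XLMiner's implementation.

```python
# Toy recursive partitioning: at each node, try every midpoint split of every
# predictor, keep the split with the lowest weighted Gini impurity, and recurse.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Return (feature index, split value, weighted impurity) of the purest binary split."""
    best = None
    for i in range(len(rows[0][0])):                       # try each predictor x_i
        values = sorted({x[i] for x, _ in rows})
        for a, b in zip(values, values[1:]):               # midpoints as candidate s_i
            s = (a + b) / 2
            left = [y for x, y in rows if x[i] <= s]
            right = [y for x, y in rows if x[i] > s]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (i, s, score)
    return best

def grow(rows, depth=0, max_depth=3):
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if gini(labels) == 0 or depth == max_depth:            # pure node or depth limit: leaf
        return majority
    split = best_split(rows)
    if split is None:
        return majority
    i, s, _ = split
    left = [r for r in rows if r[0][i] <= s]
    right = [r for r in rows if r[0][i] > s]
    if not left or not right:
        return majority
    return {"feature": i, "split": s,
            "left": grow(left, depth + 1, max_depth),
            "right": grow(right, depth + 1, max_depth)}

# Toy usage with (Income, Lot_Size) pairs taken from four riding-mower records
data = [((60.0, 18.4), "owner"), ((85.5, 16.8), "owner"),
        ((75.0, 19.6), "non-owner"), ((52.8, 20.8), "non-owner")]
print(grow(data))   # splits on feature 1 (Lot_Size) at 19.0 for this toy subset
```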

BDM 2017 (Decision Tree) 46


Recursive Partitioning
Obtain overall impurity measure (weighted
avg. of individual rectangles)
At each successive stage, compare this
measure across all possible splits in all
variables
Choose the split that reduces impurity the
most
Chosen split points become nodes on the
tree

BDM 2017 (Decision Tree) 47


First Split: The Tree

BDM 2017 (Decision Tree) 48


Tree after second split

BDM 2017 (Decision Tree) 49


Tree Structure
Split points become nodes on tree (circles
with split value in center)
Rectangles represent leaves (terminal
points, no further splits, classification value
noted)
Numbers on lines between nodes indicate #
cases
Read down tree to derive rule, e.g.
If lot size < 19, and if income > 84.75, then class
= owner
BDM 2017 (Decision Tree) 50
Determining Leaf Node Label
Each leaf node's label is determined by
voting of the records within it, and by the
cutoff value
Records within each leaf node are from
the training data
The default cutoff of 0.5 means that the leaf
node's label is the majority class
Cutoff = 0.75 requires 75% or more of the
records in the leaf to be of class 1 in order to
label it a '1' node
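A tiny sketch of this cutoff-based labelling for a 0/1 outcome; the leaf's records are made up.

```python
# Label a leaf from the training records it contains, using a cutoff on the
# proportion of class-1 records (0.5 = simple majority vote).
def leaf_label(leaf_records, cutoff=0.5):
    proportion_of_ones = sum(leaf_records) / len(leaf_records)
    return 1 if proportion_of_ones >= cutoff else 0

leaf = [1, 1, 1, 0, 0]            # five training records in one leaf (60% are 1)
print(leaf_label(leaf))           # 1 with the default cutoff of 0.5 (majority vote)
print(leaf_label(leaf, 0.75))     # 0: fewer than 75% of the records are 1
```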

BDM 2017 (Decision Tree) 51


Tree after all splits

BDM 2017 (Decision Tree) 52


Classifier Issues

Accuracy of a classifier

Training, Test and Validation data set

Overfitting

Classifier performance

BDM 2017 (Decision Tree) 53


Accuracy of Classifier

The main measure is the (classification)
accuracy, which is the number of correctly
classified instances in the test set divided
by the total number of instances in the test
set.
The classifier with the highest accuracy is
preferred.
Some researchers use the error rate, which is
1 - accuracy.
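A tiny sketch of the accuracy and error-rate calculation, on made-up test-set labels:

```python
# accuracy = correctly classified test instances / total test instances,
# error rate = 1 - accuracy.
actual    = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print("accuracy:", accuracy, "error rate:", 1 - accuracy)   # 0.75 and 0.25
```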
BDM 2017 (Decision Tree) 54
Training, Validation and Test Partition
When a model is developed on
training data, it can overfit the
training data (hence need to assess
on validation)
Assessing multiple models on same
validation data can overfit validation
data
Some methods use the validation
data to choose a parameter. This
too can lead to overfitting the
validation data
BDM 2017 (Decision Tree) 55
Training, Validation and Test Partition
Solution: final selected model is
applied to a test partition to give
unbiased estimate of its
performance on new data

BDM 2017 (Decision Tree) 56


The Problem of Overfitting

Models can produce highly complex
explanations of relationships between
variables
The fit may be excellent
When used with new data, models of great
complexity do not do so well

BDM 2017 (Decision Tree) 57


100% fit not useful for new data
[Scatter plot of Revenue vs. Expenditure: a curve that fits the training points 100% is not useful for new data]

BDM 2017 (Decision Tree) 58


Overfitting (cont.)

Causes:
Too many predictors
A model with too many parameters
Trying many different models

Consequence: the deployed model will not work as
well as expected with completely new data

BDM 2017 (Decision Tree) 59


Stopping Tree Growth

The natural end of the process is 100% purity in
each leaf
This overfits the data: the tree ends up fitting
noise in the data
Overfitting leads to low predictive accuracy on
new data
Past a certain point, the error rate on the
validation data starts to increase

BDM 2017 (Decision Tree) 60


Full Tree Error Rate

BDM 2017 (Decision Tree) 61


Pruning
Idea of pruning is to find that point at which
the validation error begins to rise

Generate smaller trees by pruning leaves

Measure the error rates for the smaller trees

BDM 2017 (Decision Tree) 62


How to keep Decision Tree small
Pruning
Prune before rules are generated (prepruning)
Not to further split a node
The node becomes a leaf
The leaf holds the most frequent class of the subset at that node
Use measures such as information gain to halt splitting if the
gain is below a threshold
Prune after rules are generated (postpruning, used by
CART)
Remove subtrees from a fully grown tree
Go from bottom, compute cost complexity at an internal node
N as a function of error rate (percentage of tuples
misclassified) with and without the subtree
Prune if cost complexity is lower
Use pruning set, not the training set
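A hedged sketch of CART-style cost-complexity postpruning using scikit-learn (an assumption; the slides use XLMiner): candidate alphas come from the pruning path and each pruned tree is scored on a held-out set rather than the training set.

```python
# Cost-complexity postpruning: fit pruned trees for each candidate alpha and
# keep the one that does best on the held-out pruning/validation set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:                       # one candidate subtree per alpha
    alpha = max(alpha, 0.0)                         # guard against tiny negative values
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_valid, y_valid)            # evaluate on the pruning set
    if score > best_score:
        best_alpha, best_score = alpha, score

print("chosen alpha:", best_alpha, "validation accuracy:", best_score)
```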

BDM 2017 (Decision Tree) 63


Advantages of Decision tree methods

Easy to use and understand
Produce rules that are easy to interpret &
implement
Variable selection & reduction is automatic
Do not require the assumptions of
statistical models
Can work without extensive handling of
missing data

BDM 2017 (Decision Tree) 64


Disadvantage of decision tree methods

Decision trees are less appropriate for
estimation tasks where the goal is to predict
the value of a continuous attribute.
Decision trees are prone to errors in
classification problems with many classes and a
relatively small number of training examples.
Decision trees can be computationally
expensive to train.
Since the process deals with one variable at a
time, there is no way to capture interactions
between variables
BDM 2017 (Decision Tree) 65
Summary
Classification Trees are an easily
understandable and transparent method for
predicting or classifying new records
A tree is a graphical representation of a set of
rules
Trees must be pruned to avoid over-fitting of
the training data
As trees do not make any assumptions about
the data structure, they usually require large
samples

BDM 2017 (Decision Tree) 66


Steps in Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, CART,
neural networks, etc.)
7. Iterative implementation and tuning
8. Assess results; compare models
9. Deploy best model
BDM 2017 (Decision Tree) 67
SEMMA by SAS

Sample
Explore
Modify
Model
Assess

CRISP-DM (used by IBM SPSS Modeler)

BDM 2017 (Decision Tree) 68
