
Classification in Data Mining

Decision tree algorithm


Classification
Sl No | Acad Background | Work Experience | CGPA (0-9 Scale) | Interview Score | Performance in Company
1     | Science  | < 1 year  | > 7.5     | Poor | Excellent
2     | Arts     | > 3 years | 6.5 - 7.5 | Good | Good
3     | Commerce | 1-3 years | 6.5 - 7.5 | Good | Good
4     | Engg     | > 3 years | < 6.5     | Good | Good
5     | Engg     | < 1 year  | 6.5 - 7.5 | Poor | Fair
6     | Science  | 1-3 years | < 6.5     | Good | Excellent
7     | Engg     | 1-3 years | > 7.5     | Poor | Good
8     | Science  | > 3 years | < 6.5     | Good | Fair
9     | Engg     | 1-3 years | 6.5 - 7.5 | Good | Good
10    | Science  | < 1 year  | 6.5 - 7.5 | Poor | Fair
11    | Science  | > 3 years | < 6.5     | Good | Fair
12    | Commerce | 1-3 years | > 7.5     | Good | Excellent
13    | Engg     | > 3 years | < 6.5     | Good | Fair
...
?     | Engg     | 1-3 years | < 6.5     | Good | ?????

BDM 2017 (Decision Tree) 2


Example

Classification: a simple example

BDM 2017 (Decision Tree) 3


What is Classification?
Classification is learning a function that maps
(classifies) a data record into one of several
predefined classes
A data record, described by a number of attributes, is
called an instance or a case
Predefined classes are based on the values of the
class attribute
Learning is done by analysing records whose classes
were identified historically (e.g., by managers); these
records are used as training samples

BDM 2017 (Decision Tree) 4


Questions
Which items are bought together?

Which customers are likely to respond to offers?

Which loan applicant will pay in time?

What is the price of a used Toyota Corolla?

Which group of employees performs similarly?

Which credit card transaction is normal or fraudulent?


BDM 2017 (Decision Tree) 5
Questions
Which items may be put on sale?

Which customers have similar buying patterns?

Which medicine to recommend in certain diseases?

Which symptoms may appear together?

How to reduce the empty truck space in transportation?

BDM 2017 (Decision Tree) 6


Example findings
Findings | Organization | Suggested Explanation
Buy diapers, likely to buy beer | Osco Drug | Daddy needs a beer
Dolls and candy bars | Walmart | Kids come along for errands
Staplers reveal hires | A large retailer | Stapler purchases are often part of an office kit
Mac users book expensive hotels | Orbitz | Macs are expensive, so their users have more financial power

BDM 2017 (Decision Tree) 7


Example findings
Findings | Organization | Suggested Explanation
Crime rises after elections | Researchers in India | Incumbent politicians crack down on crime before the election
Hungry judges rule negatively | Columbia University and Ben Gurion University (Israel) | Hunger and fatigue leave decision makers feeling less forgiving
Retirement is bad for health | University of Zurich | Unhealthy habits after retirement
Vegetarians miss fewer flights | An airline | Availability of veg meals provides an incentive to travel
BDM 2017 (Decision Tree) 8
Question

A company plans to suggest a product to potential
customers. A database of 1 million customer records
exists; 20,000 purchases (2%) is the goal.
How to go about it?

BDM 2017 (Decision Tree) 9


An Example
A company plans to suggest a product to potential
customers. A database of 1 million customer records
exists; 20,000 purchases (2%) is the goal.
Instead of contacting all 1,000,000 customers, only
100,000 were contacted and their responses to the
sale offer were recorded. This subset was used to
train a classifier to predict which customers would
decide to buy the product.
The remaining 900,000 customers were then presented
to the classifier, which classified 32,000 of them as
potential buyers. These 32,000 customers were
contacted, and the 2% response goal was achieved.
Total savings: $2,500,000.

BDM 2017 (Decision Tree) 10


Classification

Supervised learning: a two-step process
Learning y = f(X) and applying the function
Training data set and new data
Training data set, test data set and new data
Performance measure of the classifier
Overfitting
Training data set, validation data set, test data
set and new data (when multiple models are compared)
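A rough sketch of this two-step process, assuming scikit-learn and pandas (which these slides do not use); the toy DataFrame loosely mirrors the opening table, and all names are illustrative:

```python
# Sketch only: partition the data, learn y = f(X) on the training set, then
# apply the fitted classifier to held-out data. scikit-learn and pandas, the
# column names, and the toy records are all illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Background": ["Science", "Arts", "Commerce", "Engg", "Engg", "Science",
                   "Engg", "Science", "Engg", "Science", "Science", "Commerce"],
    "Experience": ["<1", ">3", "1-3", ">3", "<1", "1-3",
                   "1-3", ">3", "1-3", "<1", ">3", "1-3"],
    "Performance": ["Excellent", "Good", "Good", "Good", "Fair", "Excellent",
                    "Good", "Fair", "Good", "Fair", "Fair", "Excellent"],
})
X = pd.get_dummies(df.drop(columns="Performance"))   # one-hot encode predictors
y = df["Performance"]

# Training, validation and test partitions (here 50% / 25% / 25%)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.5, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)   # step 1: learn f

print("validation accuracy:", model.score(X_valid, y_valid))           # compare models here
print("test accuracy:", model.score(X_test, y_test))                   # unbiased final check
```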

BDM 2017 (Decision Tree) 11


Popular classification methods

Decision tree induction


Bayesian classification
K-nearest neighbours
Classification by neural networks

BDM 2017 (Decision Tree) 12


Characteristics of Classifiers
Accuracy: correctly specifies the class labels of new or
previously unseen data; measured as the number of
records in the test set correctly classified
Speed: how fast the classifier works
Robustness: ability to classify correctly given noisy
data
Scalability: ability to construct the classifier for large
amounts of data
Interpretability: level of insight provided by the
classifier

BDM 2017 (Decision Tree) 13


Decision Trees
One of the most widely used and practical
methods for classification through inductive
inference
a method for approximating discrete-valued
target functions
instances are represented by attribute-value
pairs
Greedy, top-down recursive divide-and-conquer
approach

BDM 2017 (Decision Tree) 14


Decision Trees and Rules
Goal: Classify or predict an outcome based on a
set of predictors
The output is a set of rules

Example:
Goal: classify a record as will buy computer or
will not buy
Rule might be IF (Income > 92.5) AND
(Education = poor) AND (FamilySize = small)
THEN buy = no (class = 0)
Rules are represented by tree diagrams

BDM 2017 (Decision Tree) 15


Input data
age?
  youth       -> student?
                   no  -> no
                   yes -> yes
  middle_aged -> yes
  senior      -> credit_rating?
                   fair      -> yes
                   excellent -> no

attribute, node, branch, test, decision and terminal node


Every path constitutes a decision rule

BDM 2017 (Decision Tree) 16


How to generate a decision tree?

BDM 2017 (Decision Tree) 17


Rule Generation

Once a decision tree has been constructed, it
is a simple matter to convert it into an
equivalent set of rules (a function)
To generate rules, trace each path in the
decision tree, from root node to leaf node,
recording the test outcomes as antecedents
and the leaf-node classification as the
consequent
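As a hedged illustration, scikit-learn's export_text (an assumption; the slides do not use scikit-learn) prints exactly these root-to-leaf paths for a fitted tree; the toy data and feature names below are made up:

```python
# Sketch: each root-to-leaf path of a fitted tree is one rule (tests are the
# antecedents, the leaf class is the consequent). Toy data; scikit-learn's
# export_text is used here as an illustration, not as the slides' tool.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 0], [30, 1], [45, 0], [50, 1], [60, 0], [65, 1]]   # [age, student]
y = ["no", "yes", "yes", "yes", "yes", "no"]                 # buys_computer

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "student"]))   # one printed path per rule
```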

BDM 2017 (Decision Tree) 18


Rules
If age = youth and student = no then
buys_computer = no
If age = youth and student = yes then
buys_computer = yes
If age = middle_aged then buys_computer = yes
If age = senior and credit_rating = fair then
buys_computer = yes
If age = senior and credit_rating = excellent then
buys_computer = no

BDM 2017 (Decision Tree) 19


Basic Decision Tree Algorithm
Greedy approach - identifies which attribute would be the
best classifier for the target attribute in a given data set:
the splitting attribute A
A is discrete-valued (k values): k branches;
remove A from the attribute list
A is continuous: 2 branches at the midpoint of two known
adjacent values, A <= split_point and A > split_point,
and do not remove A from the list
A is discrete-valued and a binary split is required:
2 branches, A in S_A or not (for binary decision trees)
No backtracking
BDM 2017 (Decision Tree) 20
Basic Decision Tree Algorithm

Termination of recursive partitioning:
all tuples in D belong to the same class
no remaining attributes to partition D; majority
voting employed
D_j is empty; majority voting in D employed

ID3, C4.5, C5.0 are decision tree algorithms
CART (Classification and Regression Trees)
is a binary decision tree algorithm (XLMiner)
BDM 2017 (Decision Tree) 21
Attribute selection
A heuristic for selecting the splitting criterion
The objective is to separate a given data partition
into individual classes: ideally, pure partitions
Every attribute is given a score, and the attribute
with the best score is chosen (impurity function)
Information Gain, Gain Ratio and the Gini Index
(XLMiner uses a variation of the Gini Index called
delta splitting rules) are popular attribute
selection measures

BDM 2017 (Decision Tree) 23


Attribute selection
Entropy - a measure from information theory,
characterizes the (im)purity, or homogeneity,
of an arbitrary collection of examples
Entropy - Information content of the set D -
Examine the attributes to add at the next level of
the tree using an entropy calculation
Choose the attribute that minimizes the entropy

BDM 2017 (Decision Tree) 24


How to compute entropy?

Entropy(D) = - Σ_c p_c log2(p_c)
           = - Σ_c (n_c / n) log2(n_c / n)

where p_c = n_c / n is the proportion of the n records in D that belong to class c.
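A minimal sketch of this calculation in Python (not part of the slides); the first check reproduces the [15+, 10-] example worked out a few slides later.

```python
# A minimal entropy helper matching the formula above (illustrative only).
from math import log2

def entropy(counts):
    """Entropy of a set, given the record count of each class."""
    n = sum(counts)
    # classes with count 0 (or holding all n records) contribute 0 to the sum
    return sum(-(c / n) * log2(c / n) for c in counts if 0 < c < n)

print(entropy([15, 10]))   # ~0.971: 15 positive and 10 negative examples
print(entropy([20, 0]))    # 0: perfectly homogeneous set
print(entropy([10, 10]))   # 1.0: two classes equally represented
```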

BDM 2017 (Decision Tree) 25


Meaning of entropy value

Entropy varies between zero and one when
the members belong to only two classes;
otherwise it may be greater than 1
The entropy is zero when the set is perfectly
homogeneous.
The entropy is one (for two classes) when the set is
maximally inhomogeneous, i.e., the two classes are
equally represented.

BDM 2017 (Decision Tree) 26


How entropy changes

The disorder (entropy) of a set containing members of
two classes, A and B, can be plotted as a function of the
fraction of the set belonging to class A.
In the diagram below, the total number of
instances in both classes combined is two.

BDM 2017 (Decision Tree) 27


How entropy changes

In the diagram below, the total number of
instances in both classes is eight.

BDM 2017 (Decision Tree) 28


Entropy - a measure of homogeneity

To illustrate, suppose D is a collection of 25
examples, including 15 positive and 10
negative examples [15+, 10-].
The entropy of D relative to this classification is
Entropy(D) = - (15/25) log2 (15/25) - (10/25) log2 (10/25)
           ≈ 0.971

BDM 2017 (Decision Tree) 29


Interpretation of entropy value
Entropy is 0 if all members of D belong to the same
class.
If all members are positive (p_pos = 1), then p_neg = 0, and
Entropy(D) = -1 log2(1) - 0 log2(0) = -1*0 - 0 = 0
(taking 0 log2(0) = 0).
Entropy is 1 (its maximum for two classes) when the collection
contains an equal number of positive and negative examples.
If the collection contains unequal numbers of positive and
negative examples, the entropy is between 0 and 1.
The higher the entropy, the more difficult it is to make a prediction.
The value of entropy can be greater than 1, depending on the
number of different classes the members of D belong to.

BDM 2017 (Decision Tree) 30


Information gain

Information gain measures the expected reduction
in entropy based on additional information
A measure of the effectiveness of an attribute in
classifying the training data, information gain, is
simply the expected reduction in entropy caused by
partitioning the examples according to this attribute.
Information gain(D,A) = Entropy of the original
collection D - expected value of the entropy after D
is partitioned using attribute A.

BDM 2017 (Decision Tree) 31


Choice of Attribute

BDM 2017 (Decision Tree) 32


Information gain

Gain(D, A) = Entropy(D) - Entropy(D, A)
           = Info(D) - Info(D, A)
           = - Σ_c (n_c / n) log2(n_c / n)                              [entropy of D]
             - Σ_b (n_b / n) × ( - Σ_c (n_bc / n_b) log2(n_bc / n_b) )  [average entropy after branching on A]

where n is the number of records in D, n_c the number in class c, n_b the number falling
into branch b of attribute A, and n_bc the number in branch b that belong to class c.
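A minimal sketch of this formula, assuming the class counts per branch are already tabulated; the example numbers are the well-known buys_computer counts (9 yes, 5 no, split on age), and all names are illustrative.

```python
# Sketch of Gain(D, A): entropy of D minus the weighted average entropy of the
# branches produced by splitting on A.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    n = sum(parent_counts)
    weighted = sum(sum(b) / n * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.247 (textbooks round to 0.246)
```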

BDM 2017 (Decision Tree) 34


Gini Index for CART, XLMiner
Gini index for a data set D containing k classes:
Gini(D) = 1 - Σ_i p_i²

p_i = proportion of cases in data set D that belong to class i

Gini(D) = 0 when all cases belong to the same class
Gini(D) is at a maximum when all classes are equally
represented (= 0.50 in the binary case)
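A minimal sketch of this index from class counts (illustrative; XLMiner's delta splitting rule is described above as a variation of it):

```python
# Sketch of the Gini index for a set, given the record count of each class.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))    # 0.0: all cases in one class
print(gini([5, 5]))     # 0.5: maximum for two equally represented classes
print(gini([5, 5, 5]))  # ~0.667: the maximum grows with the number of classes
```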

BDM 2017 (Decision Tree) 35


ID3, C4.5 (C5.0), CART
ID3 uses Information Gain, which favours attributes with a
large number of values

C4.5 uses the maximum Gain Ratio =
Gain(D,A) / SplitInfo(D,A)
where SplitInfo(D,A) = - Σ_b (n_b / n) log2(n_b / n)
(a sketch of the gain ratio follows this list)

CART uses Gini Index


CHAID uses the statistical chi-square (χ²) test for independence
C-SEP, G-statistic are other measures
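A minimal sketch of the gain ratio mentioned in the C4.5 item above; the helper and the example counts mirror the information-gain sketch earlier and are illustrative only.

```python
# Sketch of C4.5's gain ratio = Gain(D, A) / SplitInfo(D, A).
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, branch_counts):
    n = sum(parent_counts)
    gain = entropy(parent_counts) - sum(sum(b) / n * entropy(b) for b in branch_counts)
    split_info = -sum(sum(b) / n * log2(sum(b) / n) for b in branch_counts if sum(b) > 0)
    return gain / split_info if split_info > 0 else 0.0

print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.156 for the age split
```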

BDM 2017 (Decision Tree) 36


Classification trees
Classification trees in XLMiner
Main distinction: Categorical vs. Numeric variables
Numeric
Continuous
Integer
Categorical
Ordered (low, medium, high)
Unordered (male, female)

BDM 2017 (Decision Tree) 37


Variable handling
Numeric
Most algorithms in XLMiner take numeric data
May occasionally need to bin into categories for
outcome variables
IBM Modeler accepts Categorical variables

BDM 2017 (Decision Tree) 38


Example: Riding Mowers

Data: 24 households classified as owning or not
owning riding mowers
Predictors = Income, Lot Size

BDM 2017 (Decision Tree) 39


Income Lot_Size Ownership
60.0 18.4 owner
85.5 16.8 owner
64.8 21.6 owner
61.5 20.8 owner
87.0 23.6 owner
110.1 19.2 owner
108.0 17.6 owner
82.8 22.4 owner
69.0 20.0 owner
93.0 20.8 owner
51.0 22.0 owner
81.0 20.0 owner
75.0 19.6 non-owner
52.8 20.8 non-owner
64.8 17.2 non-owner
43.2 20.4 non-owner
84.0 17.6 non-owner
49.2 17.6 non-owner
59.4 16.0 non-owner
66.0 18.4 non-owner
47.4 16.4 non-owner
33.0 18.8 non-owner
51.0 14.0 non-owner
63.0 14.8 non-owner
BDM 2017 (Decision Tree) 40
How to split into Binary tree
Order records according to one variable, say
lot size
Find midpoints between successive values
E.g., first midpoint is 14.4 (halfway between
14.0 and 14.8)
Divide records into those with lot size > 14.4
and those <= 14.4
After evaluating that split, try the next one,
which is 15.4 (halfway between 14.8 and
16.0)
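A small sketch of these candidate split points, computed from the distinct Lot_Size values in the riding-mower table (illustrative code, not XLMiner's):

```python
# Candidate split points for a numeric variable: the midpoints between
# successive sorted values.
lot_sizes = [14.0, 14.8, 16.0, 16.4, 16.8, 17.2, 17.6, 18.4, 18.8, 19.2,
             19.6, 20.0, 20.4, 20.8, 21.6, 22.0, 22.4, 23.6]

values = sorted(set(lot_sizes))
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(midpoints[:3])   # [14.4, 15.4, 16.2] -- the first candidate splits
```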
BDM 2017 (Decision Tree) 41
The first split: Lot Size = 19,000

BDM 2017 (Decision Tree) 42


Second Split: Income = $84,000

BDM 2017 (Decision Tree) 43


After All Splits

BDM 2017 (Decision Tree) 44


Note: Categorical Variables for binary
splitting
Examine all possible ways in which the
categories can be split.
E.g., categories A, B, C can be split 3 ways (enumerated in the sketch after this list)
{A} and {B, C}
{B} and {A, C}
{C} and {A, B}
With many categories, # of splits becomes huge
XLMiner supports only binary categorical
variables and binary trees
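A small sketch of that enumeration; the helper name is illustrative:

```python
# Enumerate every two-way split of a categorical variable's values.
# For k categories there are 2**(k - 1) - 1 distinct binary splits.
from itertools import combinations

def binary_splits(categories):
    cats = list(categories)
    anchor = cats[0]                      # keep the first category on the left
    for r in range(1, len(cats)):         # so each split is listed only once
        for left in combinations(cats, r):
            if anchor in left:
                right = tuple(c for c in cats if c not in left)
                yield left, right

for left, right in binary_splits(["A", "B", "C"]):
    print(left, right)   # {A}|{B,C}, {A,B}|{C}, {A,C}|{B} -- 3 splits
```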
BDM 2017 (Decision Tree) 45
Recursive Partitioning Steps
Pick one of the predictor variables, x_i
Pick a value of x_i, say s_i, that divides the
training data into two or more (not necessarily
equal) portions
Measure how pure or homogeneous each
of the resulting portions is
Pure = containing records of mostly one class
The idea is to pick x_i and s_i to maximize purity
Repeat the process
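These steps can be sketched as a small recursive partitioner; the code below is an illustrative toy (binary splits on numeric predictors, Gini impurity, a depth limit), not CART's or XLMiner's implementation.

```python
# Toy recursive partitioning: at each node, try every midpoint split of every
# predictor, keep the split with the lowest weighted Gini impurity, and recurse.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Return (feature index, split value, weighted impurity) of the purest binary split."""
    best = None
    for i in range(len(rows[0][0])):                       # try each predictor x_i
        values = sorted({x[i] for x, _ in rows})
        for a, b in zip(values, values[1:]):               # midpoints as candidate s_i
            s = (a + b) / 2
            left = [y for x, y in rows if x[i] <= s]
            right = [y for x, y in rows if x[i] > s]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (i, s, score)
    return best

def grow(rows, depth=0, max_depth=3):
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if gini(labels) == 0 or depth == max_depth:            # pure node or depth limit: leaf
        return majority
    split = best_split(rows)
    if split is None:
        return majority
    i, s, _ = split
    left = [r for r in rows if r[0][i] <= s]
    right = [r for r in rows if r[0][i] > s]
    if not left or not right:
        return majority
    return {"feature": i, "split": s,
            "left": grow(left, depth + 1, max_depth),
            "right": grow(right, depth + 1, max_depth)}

# Toy usage with (Income, Lot_Size) pairs taken from four riding-mower records
data = [((60.0, 18.4), "owner"), ((85.5, 16.8), "owner"),
        ((75.0, 19.6), "non-owner"), ((52.8, 20.8), "non-owner")]
print(grow(data))   # splits on feature 1 (Lot_Size) at 19.0 for this toy subset
```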

BDM 2017 (Decision Tree) 46


Recursive Partitioning
Obtain overall impurity measure (weighted
avg. of individual rectangles)
At each successive stage, compare this
measure across all possible splits in all
variables
Choose the split that reduces impurity the
most
Chosen split points become nodes on the
tree

BDM 2017 (Decision Tree) 47


First Split: The Tree

BDM 2017 (Decision Tree) 48


Tree after second split

BDM 2017 (Decision Tree) 49


Tree Structure
Split points become nodes on tree (circles
with split value in center)
Rectangles represent leaves (terminal
points, no further splits, classification value
noted)
Numbers on lines between nodes indicate #
cases
Read down tree to derive rule, e.g.
If lot size < 19, and if income > 84.75, then class
= owner
BDM 2017 (Decision Tree) 50
Determining Leaf Node Label
Each leaf node's label is determined by
voting of the records within it, and by the
cutoff value
Records within each leaf node are from
the training data
The default cutoff of 0.5 means that the leaf
node's label is the majority class
Cutoff = 0.75 requires 75% or more of the
records in the leaf to be of class 1 in order to
label it a '1' node
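A tiny sketch of this cutoff-based labelling for a 0/1 outcome; the leaf's records are made up.

```python
# Label a leaf from the training records it contains, using a cutoff on the
# proportion of class-1 records (0.5 = simple majority vote).
def leaf_label(leaf_records, cutoff=0.5):
    proportion_of_ones = sum(leaf_records) / len(leaf_records)
    return 1 if proportion_of_ones >= cutoff else 0

leaf = [1, 1, 1, 0, 0]            # five training records in one leaf (60% are 1)
print(leaf_label(leaf))           # 1 with the default cutoff of 0.5 (majority vote)
print(leaf_label(leaf, 0.75))     # 0: fewer than 75% of the records are 1
```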

BDM 2017 (Decision Tree) 51


Tree after all splits

BDM 2017 (Decision Tree) 52


Classifier Issues

Accuracy of a classifier

Training, Test and Validation data set

Overfitting

Classifier performance

BDM 2017 (Decision Tree) 53


Accuracy of Classifier

The main measure is the (classification)
accuracy, which is the number of correctly
classified instances in the test set divided
by the total number of instances in the test
set.
The classifier with the highest accuracy is
preferred.
Some researchers use the error rate, which is
1 - accuracy.
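A tiny sketch of the accuracy and error-rate calculation, on made-up test-set labels:

```python
# accuracy = correctly classified test instances / total test instances,
# error rate = 1 - accuracy.
actual    = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print("accuracy:", accuracy, "error rate:", 1 - accuracy)   # 0.75 and 0.25
```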
BDM 2017 (Decision Tree) 54
Training, Validation and Test Partition
When a model is developed on
training data, it can overfit the
training data (hence need to assess
on validation)
Assessing multiple models on same
validation data can overfit validation
data
Some methods use the validation
data to choose a parameter. This
too can lead to overfitting the
validation data
BDM 2017 (Decision Tree) 55
Training, Validation and Test Partition
Solution: final selected model is
applied to a test partition to give
unbiased estimate of its
performance on new data

BDM 2017 (Decision Tree) 56


The Problem of Overfitting

Models can produce highly complex
explanations of relationships between
variables
The fit may be excellent
When used with new data, models of great
complexity do not do so well

BDM 2017 (Decision Tree) 57


100% fit not useful for new data
[Scatter plot of Revenue vs. Expenditure: a curve that fits the training points 100% is not useful for new data]

BDM 2017 (Decision Tree) 58


Overfitting (cont.)

Causes:
Too many predictors
A model with too many parameters
Trying many different models

Consequence: the deployed model will not work as
well as expected with completely new data

BDM 2017 (Decision Tree) 59


Stopping Tree Growth

The natural end of the process is 100% purity in
each leaf
This overfits the data: the tree ends up fitting
noise in the data
Overfitting leads to low predictive accuracy on
new data
Past a certain point, the error rate on the
validation data starts to increase

BDM 2017 (Decision Tree) 60


Full Tree Error Rate

BDM 2017 (Decision Tree) 61


Pruning
Idea of pruning is to find that point at which
the validation error begins to rise

Generate smaller trees by pruning leaves

Measure the error rates for the smaller trees

BDM 2017 (Decision Tree) 62


How to keep Decision Tree small
Pruning
Prune before rules are generated (prepruning)
Not to further split a node
The node becomes a leaf
The leaf holds the most frequent class of the subset at that node
Use measures such as information gain to halt splitting if the
gain is below a threshold
Prune after rules are generated (postpruning, used by
CART)
Remove subtrees from a fully grown tree
Go from bottom, compute cost complexity at an internal node
N as a function of error rate (percentage of tuples
misclassified) with and without the subtree
Prune if cost complexity is lower
Use pruning set, not the training set
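A hedged sketch of CART-style cost-complexity postpruning using scikit-learn (an assumption; the slides use XLMiner): candidate alphas come from the pruning path and each pruned tree is scored on a held-out set rather than the training set.

```python
# Cost-complexity postpruning: fit pruned trees for each candidate alpha and
# keep the one that does best on the held-out pruning/validation set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:                       # one candidate subtree per alpha
    alpha = max(alpha, 0.0)                         # guard against tiny negative values
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_valid, y_valid)            # evaluate on the pruning set
    if score > best_score:
        best_alpha, best_score = alpha, score

print("chosen alpha:", best_alpha, "validation accuracy:", best_score)
```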

BDM 2017 (Decision Tree) 63


Advantages of Decision tree methods

Easy to use and understand
Produce rules that are easy to interpret &
implement
Variable selection & reduction is automatic
Do not require the assumptions of
statistical models
Can work without extensive handling of
missing data

BDM 2017 (Decision Tree) 64


Disadvantage of decision tree methods

Decision trees are less appropriate for
estimation tasks where the goal is to predict
the value of a continuous attribute.
Decision trees are prone to errors in
classification problems with many classes and a
relatively small number of training examples.
Decision trees can be computationally
expensive to train.
Since the process deals with one variable at a
time, there is no way to capture interactions
between variables
BDM 2017 (Decision Tree) 65
Summary
Classification Trees are an easily
understandable and transparent method for
predicting or classifying new records
A tree is a graphical representation of a set of
rules
Trees must be pruned to avoid over-fitting of
the training data
As trees do not make any assumptions about
the data structure, they usually require large
samples

BDM 2017 (Decision Tree) 66


Steps in Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, CART,
neural networks, etc.)
7. Iterative implementation and tuning
8. Assess results; compare models
9. Deploy best model
BDM 2017 (Decision Tree) 67
SEMMA by SAS

Sample
Explore
Modify
Model
Assess

CRISP-DM (used by IBM SPSS Modeler)

BDM 2017 (Decision Tree) 68
