
Decision Trees

-CHAID & CART Analyses

Pristine

Pristine www.edupristine.com

Decision Trees Analyses

I. Data Mining and Decision Trees
II. Decision Tree Example
III. CHAID analyses
IV. CART

CHAID Analyses

I. Data Mining and Decision Trees
II. Decision Tree Example
III. CHAID analyses - Method and Algorithms
IV. Using R to run CHAID analysis
V. Case
VI. CHAID Demonstration in SPSS

Data Mining
Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
Goals differ from one problem to another, and the data mining techniques vary accordingly. At a high level, data mining techniques can be classified as directed or undirected.

Directed
Goal is to predict, estimate, classify, or characterize the behavior of some pre-identified target variable.
- Classification
- Estimation
- Prediction

Undirected
Goal is to discover structure in the data set as a whole.
- Description & Visualization
- Association Rule or Affinity Grouping
- Clustering

Classification & Decision Tree Analysis

Classification is used to develop a model that maps a data item into one of several predefined classes.

Decision Tree Analysis
- Builds classification and regression trees.
- Starts with a pre-identified target variable, in other words the dependent variable. This is the initial node.
- The initial node is split into two or more child nodes.
- Splitting is based on statistical analysis used by decision tree algorithms (CHAID is one such algorithm).

Classification & Decision Tree Analysis: Example

Case: A pizza outlet carried out a promotional campaign and reached out to 1000 individuals in a locality. Of the 1000 individuals approached, 200 responded by visiting the outlet. The outlet also collected the personal details of the customers. After the campaign period was over, the outlet carried out an analysis to classify the customers into various classes:

Response (rate 20%, sample 1000)
  Married (rate 25%, sample 500)
    Male (rate 40%, sample 300)
    Female (rate 15%, sample 300)
  Divorced (rate 15%, sample 200)
    Pet=no (rate 40%, sample 50)
    Pet=yes (rate 6.67%, sample 150)
  Single (rate 15%, sample 300)

Segments in order of responsiveness
1. Married Male: response rate 40%, sample size 300
2. Divorced with no pets: response rate 40%, sample size 50
3. Married Female: response rate 15%, sample size 300
4. Singles: response rate 15%, sample size 300
5. Divorced with pets: response rate 6.67%, sample size 150
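
As a rough illustration of how the ranked list above follows from the leaf nodes of the tree, here is a minimal Python sketch. The tuple layout and the tie-breaking rule (larger sample first) are assumptions made for illustration; the rates and sample sizes are the ones quoted in the example.

```python
# Leaf segments of the pizza-campaign tree: (name, response rate in %, sample size).
segments = [
    ("Married Male", 40.0, 300),
    ("Married Female", 15.0, 300),
    ("Divorced with no pets", 40.0, 50),
    ("Divorced with pets", 6.67, 150),
    ("Singles", 15.0, 300),
]

# Sort by response rate, highest first; ties are broken by sample size (an assumption).
ranked = sorted(segments, key=lambda s: (-s[1], -s[2]))

for position, (name, rate, size) in enumerate(ranked, start=1):
    print(f"{position}. {name}: response rate {rate}%, sample size {size}")
```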

Decision Tree Algorithms

- CHAID (Chi-squared Automatic Interaction Detector)
- C&RT (Classification and Regression Tree)
- QUEST (Quick Unbiased Efficient Statistical Tree)

CHAID Analyses

I. CHAID analyses - Method and Algorithms
II. Using R to run CHAID analysis
III. Case

CHAID Method and Components

- CHAID was originally designed to handle categorical variables only.
- Data mining tools such as SAS E-Miner, R, and SPSS have extended the algorithm to handle nominal, ordinal, and continuous dependent variables.
- One or more predictor variables. Predictor variables can be continuous, ordinal, or nominal.
- One target variable. The target variable must be categorical.

CHAID Algorithms
- A CHAID tree is a decision tree constructed by repeatedly splitting subsets of the space into two or more child nodes, beginning with the entire data set.
- To determine the best split at any node, CHAID merges any allowable pair of categories of the predictor variable (the set of allowable pairs is determined by the type of predictor variable being studied) if there is no statistically significant difference within the pair with respect to the target variable.
- The process is repeated until no non-significant pair is found.
- The resulting set of categories of the predictor variable is the best split with respect to that predictor variable.
- This process is followed for all predictor variables.
- The split that gives the best prediction is selected, and the node is split.
- The process repeats recursively until one of the stopping rules is triggered.
- The significance of a split is tested by means of a chi-square test based on the contingency-table approach.
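
The pairwise merge test described above can be sketched in a few lines of Python. This is a simplified illustration, not the full CHAID procedure: it only runs a Pearson chi-square test on every pair of categories of one predictor and reports the least significant pair. The counts are the Status_Checking_Acc figures from the case study later in this deck; the 0.05 threshold, the function name and the variable names are assumptions.

```python
import math
from itertools import combinations

# (observations, defaults) per Status_Checking_Acc category, from the case study tables.
counts = {
    "A11": (1370, 675),
    "A12": (1345, 520),
    "A13": (315, 70),
    "A14": (1970, 230),
}

ALPHA = 0.05  # assumed merge threshold, not taken from the slides


def chi_square_pair(a, b):
    """Pearson chi-square test on the 2x2 table (category x default / no default)."""
    table = [(a[1], a[0] - a[1]), (b[1], b[0] - b[1])]
    total = a[0] + b[0]
    column_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    statistic = 0.0
    for row in table:
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


pair_results = {
    (x, y): chi_square_pair(counts[x], counts[y])
    for x, y in combinations(counts, 2)
}

# The pair with the largest p-value is the first candidate for merging.
least_significant = max(pair_results, key=lambda pair: pair_results[pair][1])
can_merge = pair_results[least_significant][1] > ALPHA

for pair, (statistic, p_value) in pair_results.items():
    print(pair, f"chi-square = {statistic:.2f}, p = {p_value:.2e}")
print("Least significant pair:", least_significant, "merge:", can_merge)
```

With these counts every pair differs significantly at the assumed threshold, so no categories would be merged in this simplified version.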

Customer segments who default on payment due

Romanov, an analytics consultant, works with Credit One Bank. His manager gave him data containing credit and personal information for a group of customers. Some of the customers had defaulted on payment of the balance due. The manager asked him to identify the classes of customers with a higher-than-average default rate using a decision tree. Romanov has no knowledge of running a CHAID analysis.
Now suppose he approaches you and requests your help to complete the assignment. Let's help Romanov solve the problem.

Case - CHAID analysis

In the course of helping Romanov complete his task, we will walk him through the following steps:
- Variable identification
  - Identifying the dependent (response) variable
  - Identifying the independent (explanatory) variables
  - Variable categorization (e.g. numeric, categorical, discrete, continuous)
  - Creation of a data dictionary
- Response variable exploration
  - Distribution analysis
  - Outlier treatment (not required for binary variables)
- Running the CHAID analysis using R
  - Importing data in R
  - Selecting the variables
  - Running the analysis
  - Interpreting the results

Understanding the Data

- Response variable
  - Default_on_Payment
  - Binary in nature
  - Takes (0,1) as values corresponding to (no default, default)
- Independent variables (also called attributes in the credit industry)
  - 20 attributes
  - 3 numerical and 17 categorical
- For details refer to
  - Data dictionary: Data Dictionary- Analysis_of_Default.docx
  - Data file: Analysis_of_Default.xlsx

Response variable analysis

Frequency distribution
Default_On_Payment  # Observations
0                   3505
1                   1495
Total               5000

Event rate (for a binary response variable)
- Proportion of 1s in the total number of observations
- Here, Default_On_Payment rate = event rate
- Event rate = 1495/5000 = 29.9%
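
The event-rate arithmetic can be reproduced with a short Python sketch. The dictionary below simply restates the frequency distribution above; the variable names are illustrative.

```python
# Frequency distribution of Default_On_Payment: value -> number of observations.
frequency = {0: 3505, 1: 1495}

total = sum(frequency.values())
event_rate = frequency[1] / total  # share of observations with value 1 (default)

print(f"Event rate = {frequency[1]}/{total} = {event_rate:.1%}")
```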

Bivariate Analysis (of Independent variables)

1. Status_Checking_Acc
Category  # Observations  Default_On_Payment  Default Rate
A11       1370            675                 49.27%
A12       1345            520                 38.66%
A13       315             70                  22.22%
A14       1970            230                 11.68%

2. Duration_in_Months
Value     # Observations  Default_On_Payment  Default Rate
12        1795            380                 21.2%
24        2055            610                 29.7%
36        715             285                 39.9%
48        355             180                 50.7%
60        80              40                  50.0%
72        5               5                   100.0%
These are transformed values. For original values refer to Analysis_of_Default.xlsx

3. Credit_History
Category  # Observations  Default_On_Payment  Default Rate
A30       200             125                 62.50%
A31       245             140                 57.14%
A32       2650            840                 31.70%
A33       440             140                 31.82%
A34       1465            250                 17.06%
Bivariate Analysis (of Independent variables)

4. Purposre_Credit_Taken
Category  # Observations  Default_On_Payment  Default Rate
A40       1170            445                 38.03%
A41       515             85                  16.50%
A42       905             290                 32.04%
A43       1400            305                 21.79%
A44       60              20                  33.33%
A45       110             40                  36.36%
A46       250             110                 44.00%
A48       45              5                   11.11%
A49       485             170                 35.05%
A410      60              25                  41.67%

Category codes:
A40: car (new)
A41: car (used)
A42: furniture/equipment
A43: radio/television
A44: domestic appliances
A45: repairs
A46: education
A47: (vacation - does not exist?)
A48: retraining
A49: business
A410: others

5. Credit_Amount
Value     # Observations  Default_On_Payment  Default Rate
4000      3770            975                 25.86%
11000     1085            420                 38.71%
12000     145             100                 68.97%
These are transformed values. For original values refer to Analysis_of_Default.xlsx

6. Savings_Acc
Category  # Observations  Default_On_Payment  Default Rate
A61       3015            1080                35.82%
A62       515             170                 33.01%
A63       315             55                  17.46%
A64       240             30                  12.50%
A65       915             160                 17.49%
Bivariate Analysis (of Independent variables)

7. Years_At_Present_Employment
Category  # Observations  Default_On_Payment  Default Rate
A71       310             115                 37.10%
A72       860             350                 40.70%
A73       1695            515                 30.38%
A74       870             195                 22.41%
A75       1265            320                 25.30%

8. Inst_Rt_Income
Value     # Observations  Default_On_Payment  Default Rate
1         680             170                 25.00%
2         1155            305                 26.41%
3         785             225                 28.66%
4         2380            795                 33.40%

9. Marital_Status_Gender
Category  # Observations  Default_On_Payment  Default Rate
A91       250             100                 40.00%
A92       1550            540                 34.84%
A93       2740            730                 26.64%
A94       460             125                 27.17%

10. Other_Debtors_Guarantors
Category  # Observations  Default_On_Payment  Default Rate
A101      4535            1355                29.88%
A102      205             90                  43.90%
A103      260             50                  19.23%

Bivariate Analysis (of Independent variables)

11. Current_Address_Yrs
Value     # Observations  Default_On_Payment  Default Rate
1         650             180                 27.69%
2         1540            480                 31.17%
3         745             215                 28.86%
4         2065            620                 30.02%

12. Property
Category  # Observations  Default_On_Payment  Default Rate
A121      1410            295                 20.92%
A122      1160            355                 30.60%
A123      1660            510                 30.72%
A124      770             335                 43.51%

13. Age: Refer to Analysis_of_Default.xlsx

14. Other_Inst_Plans
Category  # Observations  Default_On_Payment  Default Rate
A141      695             285                 41.01%
A142      235             95                  40.43%
A143      4070            1115                27.40%

15. Housing
Category  # Observations  Default_On_Payment  Default Rate
A151      895             350                 39.11%
A152      3565            925                 25.95%
A153      540             220                 40.74%

Bivariate Analysis (of Independent variables)

16. Num_CC
Value     # Observations  Default_On_Payment  Default Rate
1         3165            995                 31.44%
2         1665            460                 27.63%
3         140             30                  21.43%
4         30              10                  33.33%

17. Job
Category  # Observations  Default_On_Payment  Default Rate
A171      110             35                  31.82%
A172      1000            280                 28.00%
A173      3150            925                 29.37%
A174      740             255                 34.46%

18. Dependents
Value     # Observations  Default_On_Payment  Default Rate
1         4225            1265                29.94%
2         775             230                 29.68%

19. Telephone
Category  # Observations  Default_On_Payment  Default Rate
A191      2980            930                 31.21%
A192      2020            565                 27.97%

20. Foreign_Worker
Category  # Observations  Default_On_Payment  Default Rate
A201      4815            1475                30.63%
A202      185             20                  10.81%

Code for CHAID

R outputs for CHAID
For better preview see the file saved as pdf through R

CHAID Analysis Demonstration in SPSS

Preparing SPSS to run CHAID analysis

Running a simple CHAID analysis - Summary

Running a simple CHAID analysis - Tree

Branches in descending order of default rate:
1. A11- (A33; A30; A31) {rate = 74.5%, sample = 235}
2. A12- (A30; A31) {rate = 63.6%, sample = 165}
3. A11- A32 {rate = 51.2%, sample = 800}
4. A12- A32 {rate = 38.4%, sample = 730}
5. A12- (A34; A33) {rate = 30%, sample = 450}
6. A11- A34 {rate = 26.9%, sample = 335}
7. A14- (A33; A30; A31) {rate = 24.1%, sample = 270}
8. A13 {rate = 22.2%, sample = 315}
9. A14- A32 {rate = 12.3%, sample = 935}
10. A14- A34 {rate = 6.5%, sample = 765}

Running a simple CHAID analysis - Tree Table

CHAID Analysis Result in Decision making

Group1 (Customers having Default Rate Higher than Overall Average)
- A11- (A33; A30; A31) {rate = 74.5%, sample = 235}
- A12- (A30; A31) {rate = 63.6%, sample = 165}
- A11- A32 {rate = 51.2%, sample = 800}
- A12- A32 {rate = 38.4%, sample = 730}
- A12- (A34; A33) {rate = 30%, sample = 450}

Group2 (Customers having Default Rate Lower than Overall Average)
- A11- A34 {rate = 26.9%, sample = 335}
- A14- (A33; A30; A31) {rate = 24.1%, sample = 270}
- A13 {rate = 22.2%, sample = 315}
- A14- A32 {rate = 12.3%, sample = 935}
- A14- A34 {rate = 6.5%, sample = 765}

Management should focus on lending credit to customers falling in Group2.
Group1 customers have higher risk than average. Management should improve its risk management strategy to manage these customers.
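
The grouping of branches into Group1 and Group2 can be sketched in Python as a comparison of each terminal branch against the overall default rate of 29.9%. The branch labels, rates and sample sizes are the ones listed in the CHAID result; the data layout and variable names are assumptions for illustration.

```python
OVERALL_RATE = 29.9  # overall default rate in %, from the response variable analysis

# Terminal branches of the CHAID tree: (branch, default rate in %, sample size).
branches = [
    ("A11- (A33; A30; A31)", 74.5, 235),
    ("A12- (A30; A31)", 63.6, 165),
    ("A11- A32", 51.2, 800),
    ("A12- A32", 38.4, 730),
    ("A12- (A34; A33)", 30.0, 450),
    ("A11- A34", 26.9, 335),
    ("A14- (A33; A30; A31)", 24.1, 270),
    ("A13", 22.2, 315),
    ("A14- A32", 12.3, 935),
    ("A14- A34", 6.5, 765),
]

group1 = [b for b in branches if b[1] > OVERALL_RATE]   # higher risk than average
group2 = [b for b in branches if b[1] <= OVERALL_RATE]  # lower risk than average

# Sanity check: the sample-weighted branch rates should reproduce the overall rate.
total_sample = sum(size for _, _, size in branches)
weighted_rate = sum(rate * size for _, rate, size in branches) / total_sample

print("Group1:", [name for name, _, _ in group1])
print("Group2:", [name for name, _, _ in group2])
print(f"Weighted default rate: {weighted_rate:.1f}% over {total_sample} customers")
```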

CART Analyses

I. CART analyses - Method and Algorithms
II. Using R to run CART analysis
III. Case
IV. CART Demonstration using SPSS

CART Method and Components

- A non-parametric technique that uses the methodology of tree building.
- Data mining tools such as SAS E-Miner, R, and SPSS have extended the algorithm to handle nominal, ordinal, and continuous dependent variables.
- One or more predictor variables. Predictor variables can be continuous, ordinal, or nominal.
- One target variable. The target variable can be categorical or continuous.

CART Algorithms

It makes use of recursive partitioning:
- Take all of your data.
- Consider all possible values of all variables.
- Select the variable/value (X = t1) that produces the greatest separation in the target.
- (X = t1) is called a split.
- If X < t1, send the data point to the left; otherwise, send it to the right.
- Now repeat the same process on these two nodes.
- You get a tree.
Note: CART uses only binary splits (unlike CHAID, which can also produce non-binary trees).
Separation can be defined in many ways:
- Regression trees (continuous target): use the sum of squared errors.
- Classification trees (categorical target): choice of entropy, Gini measure, or the twoing splitting rule.
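
One step of the recursive partitioning above can be sketched in Python for a single numeric predictor. This is a minimal illustration under stated assumptions, not the CART implementation used in R or SPSS: it uses the Gini measure, works on the aggregated Duration_in_Months counts from the case study, and searches only that one variable. The function names are hypothetical.

```python
# (duration in months, observations, defaults) from the Duration_in_Months table.
rows = [
    (12, 1795, 380),
    (24, 2055, 610),
    (36, 715, 285),
    (48, 355, 180),
    (60, 80, 40),
    (72, 5, 5),
]


def gini(n_obs, n_default):
    """Gini impurity of a node with a binary target."""
    if n_obs == 0:
        return 0.0
    p = n_default / n_obs
    return 2.0 * p * (1.0 - p)


def weighted_gini(threshold):
    """Impurity after the binary split X < threshold (left) vs X >= threshold (right)."""
    n_left = sum(n for value, n, _ in rows if value < threshold)
    d_left = sum(d for value, _, d in rows if value < threshold)
    n_right = sum(n for value, n, _ in rows if value >= threshold)
    d_right = sum(d for value, _, d in rows if value >= threshold)
    total = n_left + n_right
    return (n_left * gini(n_left, d_left) + n_right * gini(n_right, d_right)) / total


# Every observed value except the smallest is a candidate threshold t1.
candidates = [value for value, _, _ in rows[1:]]
best_threshold = min(candidates, key=weighted_gini)

for t in candidates:
    print(f"X < {t}: weighted Gini = {weighted_gini(t):.4f}")
print("Best split: X <", best_threshold)
```

A full tree would repeat the same search over all variables and then recurse into the left and right nodes until a stopping rule applies.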

Some differences between CART and CHAID

- The dependent variable for CHAID must be categorical; for CART it can be metric.
- Different splitting algorithms (e.g., CHAID uses a chi-squared test on contingency tables).
- CHAID splits into multiple groups; CART makes binary splits.
- Different stopping criteria.

Customer segments who default on payment due

Romanov, an analytics consultant, works with Credit One Bank. His manager gave him data containing credit and personal information for a group of customers. Some of the customers had defaulted on payment of the balance due. The manager asked him to identify the classes of customers with a higher-than-average default rate using a decision tree. Romanov has no knowledge of running a CART analysis.
Now suppose he approaches you and requests your help to complete the assignment. Let's help Romanov solve the problem.

Case - CART analysis

In the course of helping Romanov complete his task, we will walk him through the following steps:
- Variable identification
  - Identifying the dependent (response) variable
  - Identifying the independent (explanatory) variables
  - Variable categorization (e.g. numeric, categorical, discrete, continuous)
  - Creation of a data dictionary
- Response variable exploration
  - Distribution analysis
  - Outlier treatment (not required for binary variables)
- Running the CART analysis using R
  - Importing data in R
  - Selecting the variables
  - Running the analysis
  - Interpreting the results


R Code for CART - Regression Tree

R Outputs for Regression tree
For better preview see the file saved as pdf through R

R Code for CART - Classification Tree

R Outputs for Classification tree
For better preview see the file saved as pdf through R

CART Analysis Demonstration in SPSS

Preparing SPSS to run CART analysis

Running a simple CART analysis - Summary

Running a simple CART analysis - Tree (Compressed View)

Use this tree result along with the tabular result present in the next slide.

Terminal nodes with default rate less than the overall average of 29.9%
- Node 6: 25.8%
- Node 12: 20%
- Node 13: 26.9%
- Node 17: 6.5%
- Node 18: 12.3%
- Node 22: 25.7%

Terminal nodes with default rate greater than the overall average of 29.9%
- Node 14: 32.7%
- Node 19: 51.2%
- Node 20: 75%
- Node 21: 38.4%

Running a simple CART analysis Table


Running a simple CART analysis - Tree (Expanded View 1)

Running a simple CART analysis - Tree (Expanded View 2)

Running a simple CART analysis - Comparison with CHAID result

For a simple analysis using only two variables (Status_Checking_Account and Credit_History), CART and CHAID produce results with the same level of accuracy.


CART vs. CHAID Comparison - Using all the available variables

CHAID produces more accurate results than CART (79.3% vs 77.4%).

When to use CART and when to use CHAID

- Depends on the type of response variable
  - CART when numeric or metric or binary
  - CHAID when binary (provided it produces a better result than CART)
- CHAID should be used when the goal is to describe or understand the relationship between a response variable and a set of explanatory variables.
- CART is better suited for creating a model that has high prediction accuracy on new cases.