1 - Chaid and Cart

Decision Trees - CHAID & CART Analyses
Pristine www.edupristine.com

Decision Trees Analyses

I. Data Mining and Decision Trees
II. Decision Tree Example
III. CHAID analyses
IV. CART

CHAID Analyses

I. Data Mining and Decision Trees
II. Decision Tree Example
III. CHAID analyses - Method and Algorithms
IV. Using R to run CHAID analysis
V. Case
VI. CHAID Demonstration in SPSS

Data Mining

Data mining is the nontrivial extraction of implicit, previously unknown, and
potentially useful information from data.

Goals differ from one problem to another, and the data mining technique varies
accordingly. At a high level, data mining techniques can be classified as
directed or undirected.

Directed: the goal is to predict, estimate, classify, or characterize the
behavior of some pre-identified target variable.
- Classification
- Estimation
- Prediction

Undirected: the goal is to discover structure in the data set as a whole.
- Description & Visualization
- Association Rule or Affinity Grouping
- Clustering

Classification & Decision Tree Analysis

Classification is used to develop a model that maps a data item into one of
several predefined classes.

Decision Tree Analysis
- Builds classification and regression trees.
- Starts with a pre-identified target variable, in other words the dependent
  variable. The full sample forms the initial node.
- The initial node is split into two or more child nodes.
- Splitting is based on the statistical analysis used by the decision tree
  algorithm (CHAID is one such algorithm).

Classification & Decision Tree Analysis: Example

Case: A pizza outlet carried out a promotional campaign and reached out to 1000
individuals in a locality. Out of the 1000 individuals approached, 200 responded
by visiting the outlet. The outlet also collected the personal details of the
customers. After the campaign period was over, the outlet carried out an
analysis in order to classify the customers into various classes:

Node                       Response rate   Sample size
Response (all)             20%             1000
  Married                  25%             500
    Male                   40%             300
    Female                 15%             300
  Divorced                 15%             200
    Pet=no                 40%             50
    Pet=yes                6.67%           150
  Single                   15%             300

Segments in order of responsiveness

Segment                    Response rate   Sample size
1. Married Male            40%             300
2. Divorced with no pets   40%             50
3. Married Female          15%             300
4. Singles                 15%             300
5. Divorced with pets      6.67%           150
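
The ordering of the segments can be reproduced with a short sort. The sketch below is in Python (the deck itself only names R and SPSS) and uses the leaf figures exactly as given in the tree. Breaking ties in response rate by sample size is an assumption about how the slide ordered the segments, not something the deck states.

```python
# Leaf segments from the pizza campaign tree: (name, response rate in %, sample size).
segments = [
    ("Married Male", 40.0, 300),
    ("Married Female", 15.0, 300),
    ("Divorced with no pets", 40.0, 50),
    ("Divorced with pets", 6.67, 150),
    ("Singles", 15.0, 300),
]

# Assumption: order by response rate (descending), then by sample size (descending).
# Python's sort is stable, so exact ties keep their input order.
ranked = sorted(segments, key=lambda s: (-s[1], -s[2]))

for position, (name, rate, size) in enumerate(ranked, start=1):
    print(f"{position}. {name}: response rate {rate}%, sample size {size}")
```

Ranking on the rate alone would leave the two 40% segments tied, which is why a second sort key is needed to place the 300-customer segment ahead of the 50-customer one.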

Decision Tree Algorithms

- CHAID (Chi-square Automatic Interaction Detector)
- C&RT (Classification and Regression Tree)
- QUEST (Quick, Unbiased, Efficient Statistical Tree)

CHAID Analyses

I. CHAID analyses - Method and Algorithms
II. Using R to run CHAID analysis
III. Case

CHAID Method and Components

- CHAID was originally designed to handle categorical variables only.
- Data mining tools such as SAS E-Miner, R, and SPSS have extended the
  algorithm to handle nominal, ordinal, and continuous dependent variables.
- One or more predictor variables. Predictor variables can be continuous,
  ordinal, or nominal.
- One target variable. The target variable must be categorical.

CHAID Algorithms

- A CHAID tree is a decision tree that is constructed by repeatedly splitting
  subsets of the data into two or more child nodes, beginning with the entire
  data set.
- To determine the best split at any node, CHAID merges any allowable pair of
  categories of the predictor variable (the set of allowable pairs is
  determined by the type of predictor variable being studied) if there is no
  statistically significant difference within the pair with respect to the
  target variable.
- The process is repeated until no non-significant pair is found.
- The resulting set of categories of the predictor variable is the best split
  with respect to that predictor variable.
- This process is followed for all predictor variables.
- The split that gives the best prediction (the most significant one) is
  selected, and the node is split.
- The process repeats recursively until one of the stopping rules is triggered.
- The significance of a split is tested by means of a chi-square test on the
  contingency table of the predictor categories against the target variable.
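
The category-merging step can be sketched in a few lines. This is a simplified illustration, not the procedure implemented in SPSS or R. It is written in Python, assumes a binary target and a nominal predictor (so every pair of categories is allowed to merge), uses an unadjusted Pearson chi-square test with a hypothetical significance level `alpha = 0.05`, and skips the Bonferroni adjustment that CHAID implementations normally apply. The counts are the Credit_History figures from the case study in this deck, taken over the whole sample. The function names are invented for the example.

```python
import math
from itertools import combinations

def chi2_p_value_2x2(a, b):
    """Pearson chi-square p-value (1 degree of freedom) for two categories.

    a and b are (observations, defaults) pairs for a binary target.
    """
    rows = [(a[1], a[0] - a[1]), (b[1], b[0] - b[1])]  # (defaults, non-defaults)
    total = a[0] + b[0]
    col_totals = [rows[0][0] + rows[1][0], rows[0][1] + rows[1][1]]
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

def merge_categories(counts, alpha=0.05):
    """Repeatedly merge the least significantly different pair of categories."""
    groups = {(name,): tuple(value) for name, value in counts.items()}
    while len(groups) > 1:
        best_pair, best_p = None, -1.0
        for g1, g2 in combinations(groups, 2):
            p = chi2_p_value_2x2(groups[g1], groups[g2])
            if p > best_p:
                best_pair, best_p = (g1, g2), p
        if best_p <= alpha:  # every remaining pair differs significantly
            break
        g1, g2 = best_pair
        n1, d1 = groups.pop(g1)
        n2, d2 = groups.pop(g2)
        groups[g1 + g2] = (n1 + n2, d1 + d2)
    return groups

# Credit_History: category -> (observations, defaults), from the case study.
credit_history = {
    "A30": (200, 125),
    "A31": (245, 140),
    "A32": (2650, 840),
    "A33": (440, 140),
    "A34": (1465, 250),
}

merged = merge_categories(credit_history)
for group, (n, d) in merged.items():
    print(f"{' + '.join(group)}: default rate {d / n:.1%}, sample {n}")
```

Because the test here runs on the whole sample rather than inside a node of the tree, the groupings it finds need not match the node-level groupings shown in the SPSS tree.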

Case: Customer segments who default on payment due

Romanov, an analytics consultant, works with Credit One Bank. His manager
gave him data containing the credit and personal information of a group of
customers. Some of the customers had defaulted on payment of the balance due.
The manager asked him to use a decision tree to identify the classes of
customers with a higher-than-average default rate. Romanov has no knowledge of
running a CHAID analysis.
Now suppose he approaches you and requests your help to complete the
assignment. Let's help Romanov solve the problem.

Case - CHAID analysis

In the course of helping Romanov complete his task, we will walk him
through the following steps:

- Variable identification
  - Identifying the dependent (response) variable
  - Identifying the independent (explanatory) variables
  - Variable categorization (e.g. numeric, categorical, discrete, continuous)
  - Creation of a data dictionary
- Response variable exploration
  - Distribution analysis
  - Outlier treatment (not required for binary variables)
- Running the CHAID analysis using R
  - Importing data in R
  - Selecting the variables
  - Running the analysis
  - Interpreting the results

Understanding the Data

- Response variable: Default_on_Payment
  - Binary in nature
  - Takes (0, 1) as values, corresponding to (no default, default)
- Independent variables (also called attributes in the credit industry)
  - 20 attributes
  - 3 numerical and 17 categorical
- For details refer to:
  - Data dictionary: Data Dictionary- Analysis_of_Default.docx
  - Data file: Analysis_of_Default.xlsx

Response variable analysis

Frequency distribution

Default_On_Payment   # Observations
0                    3505
1                    1495
Total                5000

Event rate (for a binary response variable)
- Proportion of 1s in the total number of observations
- Here, Default_On_Payment rate = event rate
- Event rate = 1495/5000 = 29.9%
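
The event rate calculation is simple enough to check directly. The following is a minimal Python sketch that starts from the two frequency counts in the table. The variable names are illustrative and are not taken from the course material.

```python
# Frequency distribution of Default_On_Payment: value -> number of observations.
frequency = {0: 3505, 1: 1495}

total_observations = sum(frequency.values())
# Event rate: share of observations where the response equals 1 (default).
event_rate = frequency[1] / total_observations

print(f"Total observations: {total_observations}")
print(f"Event rate: {event_rate:.1%}")
```

This overall rate is the benchmark against which the default rate of every category and every tree node is later compared.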

Bivariate Analysis (of Independent variables)

1. Status_Checking_Acc

Status_Checking_Acc   # Observations   Default_On_Payment   Default Rate
A11                   1370             675                  49.27%
A12                   1345             520                  38.66%
A13                   315              70                   22.22%
A14                   1970             230                  11.68%

2. Duration_in_Months

Duration_in_Months    # Observations   Default_On_Payment   Default Rate
12                    1795             380                  21.2%
24                    2055             610                  29.7%
36                    715              285                  39.9%
48                    355              180                  50.7%
60                    80               40                   50.0%
72                    5                5                    100.0%

These are transformed values.
For original values refer to

3. Credit_History

Credit_History        # Observations   Default_On_Payment   Default Rate
A30                   200              125                  62.50%
A31                   245              140                  57.14%
A32                   2650             840                  31.70%
A33                   440              140                  31.82%
A34                   1465             250                  17.06%

4. Purposre_Credit_Taken

Purposre_Credit_Taken   # Observations   Default_On_Payment   Default Rate
A40                     1170             445                  38.03%
A41                     515              85                   16.50%
A42                     905              290                  32.04%
A43                     1400             305                  21.79%
A44                     60               20                   33.33%
A45                     110              40                   36.36%
A46                     250              110                  44.00%
A48                     45               5                    11.11%
A49                     485              170                  35.05%
A410                    60               25                   41.67%

Purpose codes:
A40: car (new)
A41: car (used)
A42: furniture/equipment
A43: radio/television
A44: domestic appliances
A45: repairs
A46: education
A47: (vacation - does not exist?)
A48: retraining
A49: business
A410: others

5. Credit_Amount

Credit_Amount           # Observations   Default_On_Payment   Default Rate
4000                    3770             975                  25.86%
11000                   1085             420                  38.71%
12000                   145              100                  68.97%

6. Savings_Acc

Savings_Acc             # Observations   Default_On_Payment   Default Rate
A61                     3015             1080                 35.82%
A62                     515              170                  33.01%
A63                     315              55                   17.46%
A64                     240              30                   12.50%
A65                     915              160                  17.49%

7. Years_At_Present_Employment

Years_At_Present_Employment   # Observations   Default_On_Payment   Default Rate
A71                           310              115                  37.10%
A72                           860              350                  40.70%
A73                           1695             515                  30.38%
A74                           870              195                  22.41%
A75                           1265             320                  25.30%

8. Inst_Rt_Income

Inst_Rt_Income                # Observations   Default_On_Payment   Default Rate
1                             680              170                  25.00%
2                             1155             305                  26.41%
3                             785              225                  28.66%
4                             2380             795                  33.40%

9. Marital_Status_Gender

Marital_Status_Gender         # Observations   Default_On_Payment   Default Rate
A91                           250              100                  40.00%
A92                           1550             540                  34.84%
A93                           2740             730                  26.64%
A94                           460              125                  27.17%

10. Other_Debtors_Guarantors

Other_Debtors_Guarantors      # Observations   Default_On_Payment   Default Rate
A101                          4535             1355                 29.88%
A102                          205              90                   43.90%
A103                          260              50                   19.23%

11. Current_Address_Yrs

Current_Address_Yrs   # Observations   Default_On_Payment   Default Rate
1                     650              180                  27.69%
2                     1540             480                  31.17%
3                     745              215                  28.86%
4                     2065             620                  30.02%

12. Property

Property              # Observations   Default_On_Payment   Default Rate
A121                  1410             295                  20.92%
A122                  1160             355                  30.60%
A123                  1660             510                  30.72%
A124                  770              335                  43.51%

13. Age: Refer to

14. Other_Inst_Plans

Other_Inst_Plans      # Observations   Default_On_Payment   Default Rate
A141                  695              285                  41.01%
A142                  235              95                   40.43%
A143                  4070             1115                 27.40%

15. Housing

Housing               # Observations   Default_On_Payment   Default Rate
A151                  895              350                  39.11%
A152                  3565             925                  25.95%
A153                  540              220                  40.74%

16. Num_CC

Num_CC                # Observations   Default_On_Payment   Default Rate
1                     3165             995                  31.44%
2                     1665             460                  27.63%
3                     140              30                   21.43%
4                     30               10                   33.33%

17. Job

Job                   # Observations   Default_On_Payment   Default Rate
A171                  110              35                   31.82%
A172                  1000             280                  28.00%
A173                  3150             925                  29.37%
A174                  740              255                  34.46%

18. Dependents

Dependents            # Observations   Default_On_Payment   Default Rate
1                     4225             1265                 29.94%
2                     775              230                  29.68%

19. Telephone

Telephone             # Observations   Default_On_Payment   Default Rate
A191                  2980             930                  31.21%
A192                  2020             565                  27.97%

20. Foreign_Worker

Foreign_Worker        # Observations   Default_On_Payment   Default Rate
A201                  4815             1475                 30.63%
A202                  185              20                   10.81%
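
Each default rate in the bivariate tables is the number of defaults in a category divided by the number of observations in that category. As a hedged illustration, the Python sketch below recomputes the Status_Checking_Acc column and compares each category with the overall event rate. The dictionary layout and variable names are assumptions made for the example.

```python
# Status_Checking_Acc counts from the bivariate table:
# category -> (observations, defaults).
status_checking_acc = {
    "A11": (1370, 675),
    "A12": (1345, 520),
    "A13": (315, 70),
    "A14": (1970, 230),
}

total_obs = sum(n for n, _ in status_checking_acc.values())
total_defaults = sum(d for _, d in status_checking_acc.values())
overall_rate = total_defaults / total_obs  # event rate of the whole sample

default_rates = {cat: d / n for cat, (n, d) in status_checking_acc.items()}

for cat, rate in default_rates.items():
    side = "above" if rate > overall_rate else "below"
    print(f"{cat}: {rate:.2%} ({side} the overall rate of {overall_rate:.1%})")
```

A predictor whose categories sit far from the overall rate on both sides, as here, is a natural candidate for the first split of a tree.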

Code for CHAID

R outputs for CHAID

For better preview see the file saved as pdf through R

CHAID Analysis Demonstration in SPSS

Preparing SPSS to run CHAID analysis

Running a simple CHAID analysis - Summary

Running a simple CHAID analysis - Tree

Branches in descending order of default rate:

1. A11 - (A33; A30; A31) {rate = 74.5%, sample = 235}
2. A12 - (A30; A31) {rate = 63.6%, sample = 165}
3. A11 - A32 {rate = 51.2%, sample = 800}
4. A12 - A32 {rate = 38.4%, sample = 730}
5. A12 - (A34; A33) {rate = 30%, sample = 450}
6. A11 - A34 {rate = 26.9%, sample = 335}
7. A14 - (A33; A30; A31) {rate = 24.1%, sample = 270}
8. A13 {rate = 22.2%, sample = 315}
9. A14 - A32 {rate = 12.3%, sample = 935}
10. A14 - A34 {rate = 6.5%, sample = 765}

Running a simple CHAID analysis - Tree Table

CHAID Analysis Result in Decision making

Group1 (Customers having Default Rate Higher than Overall Average)
- A11 - (A33; A30; A31) {rate = 74.5%, sample = 235}
- A12 - (A30; A31) {rate = 63.6%, sample = 165}
- A11 - A32 {rate = 51.2%, sample = 800}
- A12 - A32 {rate = 38.4%, sample = 730}
- A12 - (A34; A33) {rate = 30%, sample = 450}

Group2 (Customers having Default Rate Lower than Overall Average)
- A11 - A34 {rate = 26.9%, sample = 335}
- A14 - (A33; A30; A31) {rate = 24.1%, sample = 270}
- A13 {rate = 22.2%, sample = 315}
- A14 - A32 {rate = 12.3%, sample = 935}
- A14 - A34 {rate = 6.5%, sample = 765}

Management should focus on lending credit to customers falling in Group2.
Group1 customers have higher risk than average. Management should improve its
risk management strategy to manage these customers.
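
The split into Group1 and Group2 is a comparison of each terminal branch with the overall default rate of 29.9%. The Python sketch below, an illustration rather than part of the SPSS output, performs that comparison and also checks that the ten branches account for the whole sample. The tuple layout and variable names are assumptions.

```python
# Terminal branches of the CHAID tree: (branch, default rate in %, sample size).
branches = [
    ("A11 - (A33; A30; A31)", 74.5, 235),
    ("A12 - (A30; A31)", 63.6, 165),
    ("A11 - A32", 51.2, 800),
    ("A12 - A32", 38.4, 730),
    ("A12 - (A34; A33)", 30.0, 450),
    ("A11 - A34", 26.9, 335),
    ("A14 - (A33; A30; A31)", 24.1, 270),
    ("A13", 22.2, 315),
    ("A14 - A32", 12.3, 935),
    ("A14 - A34", 6.5, 765),
]
OVERALL_RATE = 29.9  # overall event rate in %, from the response variable analysis

group1 = [b for b in branches if b[1] > OVERALL_RATE]   # riskier than average
group2 = [b for b in branches if b[1] <= OVERALL_RATE]  # safer than average

# Consistency check: the branches should cover the sample, and their
# size-weighted default rate should be close to the overall rate.
total_sample = sum(size for _, _, size in branches)
implied_defaults = sum(rate / 100 * size for _, rate, size in branches)
implied_rate = 100 * implied_defaults / total_sample

print(f"Group1: {len(group1)} branches, {sum(b[2] for b in group1)} customers")
print(f"Group2: {len(group2)} branches, {sum(b[2] for b in group2)} customers")
print(f"Sample covered: {total_sample}, implied default rate: {implied_rate:.1f}%")
```

The branch A12 - (A34; A33) sits at 30%, only just above the average, so its group membership is sensitive to rounding of the reported rates.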

CART Analyses

I. CART analyses - Method and Algorithms
II. Using R to run CART analysis
III. Case
IV. CART Demonstration using SPSS

CART Method and Components

- A non-parametric technique that uses the methodology of tree building.
- Data mining tools such as SAS E-Miner, R, and SPSS have extended the
  algorithm to handle nominal, ordinal, and continuous dependent variables.
- One or more predictor variables. Predictor variables can be continuous,
  ordinal, or nominal.
- One target variable. The target variable can be categorical or continuous.

CART Algorithms

CART makes use of recursive partitioning:

- Take all of your data.
- Consider all possible values of all variables.
- Select the variable/value (X = t1) that produces the greatest separation in
  the target. (X = t1) is called a split.
- If X < t1, send the data point to the left; otherwise, send it to the right.
- Now repeat the same process on these two nodes.
- You get a tree.

Note: CART uses only binary splits (unlike CHAID, which can also produce
non-binary trees).

Separation can be defined in many ways:
- Regression trees (continuous target): sum of squared errors.
- Classification trees (categorical target): a choice of entropy, the Gini
  measure, or the twoing splitting rule.
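
To make the idea of "greatest separation" concrete, the sketch below searches for the best binary split of a single variable using the Gini measure. It is a minimal Python illustration under stated assumptions, not the R or SPSS implementation. It works on the aggregated Duration_in_Months counts from the case study, considers only that one variable, and ignores stopping rules and pruning. The function names are invented for the example.

```python
# Duration_in_Months from the bivariate table: value -> (observations, defaults).
duration = {12: (1795, 380), 24: (2055, 610), 36: (715, 285),
            48: (355, 180), 60: (80, 40), 72: (5, 5)}

def gini(n, d):
    """Gini impurity of a node with n cases, d of them defaults (binary target)."""
    p = d / n
    return 2 * p * (1 - p)

def weighted_gini(threshold):
    """Weighted Gini impurity of the two children of the split X < threshold."""
    left = [v for x, v in duration.items() if x < threshold]
    right = [v for x, v in duration.items() if x >= threshold]
    n_left, d_left = sum(n for n, _ in left), sum(d for _, d in left)
    n_right, d_right = sum(n for n, _ in right), sum(d for _, d in right)
    total = n_left + n_right
    return (n_left * gini(n_left, d_left) + n_right * gini(n_right, d_right)) / total

# Candidate thresholds: every observed value except the smallest,
# so that both children are non-empty.
candidates = sorted(duration)[1:]
best_threshold = min(candidates, key=weighted_gini)

root_n = sum(n for n, _ in duration.values())
root_d = sum(d for _, d in duration.values())
print(f"Root Gini: {gini(root_n, root_d):.4f}")
print(f"Best split: Duration_in_Months < {best_threshold}, "
      f"weighted Gini: {weighted_gini(best_threshold):.4f}")
```

A full CART run repeats this search over every variable at every node and keeps the split with the lowest weighted impurity.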

Some differences between CART and CHAID

- The dependent variable for CHAID must be categorical; for CART it can also
  be metric.
- Different splitting algorithms (e.g., CHAID uses a chi-squared test on
  contingency tables).
- CHAID splits into multiple groups; CART makes binary splits.
- Different stopping criteria.

Case: Customer segments who default on payment due

Romanov, an analytics consultant, works with Credit One Bank. His manager
gave him data containing the credit and personal information of a group of
customers. Some of the customers had defaulted on payment of the balance due.
The manager asked him to use a decision tree to identify the classes of
customers with a higher-than-average default rate. Romanov has no knowledge of
running a CART analysis.
Now suppose he approaches you and requests your help to complete the
assignment. Let's help Romanov solve the problem.

Case - CART analysis

In the course of helping Romanov complete his task, we will walk him
through the following steps:

- Variable identification
  - Identifying the dependent (response) variable
  - Identifying the independent (explanatory) variables
  - Variable categorization (e.g. numeric, categorical, discrete, continuous)
  - Creation of a data dictionary
- Response variable exploration
  - Distribution analysis
  - Outlier treatment (not required for binary variables)
- Running the CART analysis using R
  - Importing data in R
  - Selecting the variables
  - Running the analysis
  - Interpreting the results

Understanding the Data

- Response variable: Default_on_Payment
  - Binary in nature
  - Takes (0, 1) as values, corresponding to (no default, default)
- Independent variables (also called attributes in the credit industry)
  - 20 attributes
  - 3 numerical and 17 categorical
- For details refer to:
  - Data dictionary: Data Dictionary- Analysis_of_Default.docx
  - Data file

Response variable analysis

Frequency distribution

Default_On_Payment   # Observations
0                    3505
1                    1495
Total                5000

Event rate (for a binary response variable)
- Proportion of 1s in the total number of observations
- Here, Default_On_Payment rate = event rate
- Event rate = 1495/5000 = 29.9%

R Code for CART - Regression Tree

R Outputs for Regression tree

R Code for CART - Classification Tree

R Outputs for Classification tree

CART Analysis Demonstration in SPSS

Preparing SPSS to run CART analysis

Running a simple CART analysis - Summary

Running a simple CART analysis - Tree (Compressed View)

Use this tree result along with the tabular result present in the next slide.

Terminal nodes with default rate less than the overall average of 29.9%:
- Node 6: 25.8%
- Node 12: 20%
- Node 13: 26.9%
- Node 17: 6.5%
- Node 18: 12.3%
- Node 22: 25.7%

Terminal nodes with default rate greater than the overall average of 29.9%:
- Node 14: 32.7%
- Node 19: 51.2%
- Node 20: 75%
- Node 21: 38.4%

Running a simple CART analysis - Table

Running a simple CART analysis - Tree (Expanded View - 1)

Running a simple CART analysis - Tree (Expanded View - 2)

Running a simple CART analysis - Comparison with CHAID result

For a simple analysis using only two variables (Status_Checking_Account and
Credit_History), CART and CHAID produce results with the same level of
accuracy.

CART vs. CHAID Comparison - Using all the available variables

CHAID produces more accurate results than CART (79.3% vs 77.4%).

When to use CART and when to use CHAID

- The choice depends on the type of response variable:
  - CART when it is numeric (metric) or binary
  - CHAID when it is binary (provided it produces a better result than CART)
- CHAID should be used when the goal is to describe or understand the
  relationship between a response variable and a set of explanatory variables.
- CART is better suited for creating a model that has high prediction accuracy
  on new cases.
