Logistic Regression Notes

A COMPARISON OF MULTIPLE
REGRESSION, LOGISTIC REGRESSION AND

DISCRIMINATION FUNCTION IN
CLASSIFICATION OF OBSERVATIONS
by: Dr. Yap Bee Wah
UNIVERSITI TEKNOLOGI MARA
Faculty of Information Technology & Quantitative Sciences
[Figure: scatter plot of PETALEN against PETALWID by TYPE (iris virginica, iris versicolor, iris setosa)]
Kolokium Statistik 24 Julai 2004 Th 5, FTMSK
KOLOKIUM STATISTIK 2004,
FTMSK
OVERVIEW OF PRESENTATION
Introduction
Multiple Regression
Logistic Regression
Discriminant Function
Methodology (Model Building and Evaluation Process)
Results
Conclusion
Introduction:
Two (2) pioneering studies
Efron (1975) studied the relative efficiency of logistic regression and normal
discriminant analysis.
He found that, typically, logistic regression is between one half and two thirds
as effective as normal discrimination.
(Efron, B. (1975). The Efficiency of Logistic Regression Compared to Normal
Discriminant Analysis. Journal of the American Statistical Association,
December 1975, Volume 70, Number 352, Theory and Methods Section)

Press and Wilson (1978) compared logistic regression and parametric
discriminant analysis and concluded that logistic regression is preferable to
parametric discriminant analysis in cases for which the variables do not have
multivariate normal distributions.
However, for normal distributions, logistic regression is less efficient than
parametric discriminant analysis.
(Press, S. J. & Wilson, S. (1978). Choosing between logistic regression and
discriminant analysis. Journal of the American Statistical Association, 73, 699-705)

Introduction to Multiple Linear
Regression
Multiple linear regression is a useful statistical
modeling technique for describing the relationship
between a quantitative response (dependent) variable and one
or more predictor variables.
When the response variable is dichotomous (2
categories) or polytomous (more than 2
categories), logistic regression or discriminant
analysis is frequently used instead to model the
relationship.
Multiple Regression Model
Consider k predictor variables. The
multiple regression model is stated as
follows:

$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$

$\beta_0, \beta_1, \dots, \beta_k$: regression coefficients
$Y_i$: must be a quantitative variable
Regression: Research Application Example
IS Faculty Research Productivity: Influential Factors
and Implications

by: Qing Hu & T. Grandon Gill (Florida Atlantic University)
(Information Resources Management Journal, Vol 13, No 2,
2000)
Response variable: Research Productivity (annual rate of publication)
Predictor variables:
Number of years in IS faculty
Percentage of time allocated for teaching
Percentage of time allocated for research
Percentage of time allocated for academic services
Type of degree
Introduction to Logistic regression
Allows estimating the probability of an
event happening
Useful for modeling data with a
dichotomous dependent variable (Y) (e.g.
survive/die; purchase/do not purchase;
pass/fail)
Allows a mixture of quantitative and
qualitative predictor variables (X).
Application examples

Example 1
Dependent variable: $Y = 1$ if survive, $0$ otherwise
Independent variables:
$X_1$: age
$X_2$: length of illness
$X_3$: dosage level
$X_4$: gender

Example 2
Dependent variable: $Y = 1$ if settle credit card bills, $0$ otherwise
Independent variables:
$X_1$: age
$X_2$: number of credit cards
$X_3$: income
$X_4$: number of children
$X_5$: gender

Logit model, otherwise known as the
logistic regression model
For k explanatory variables and i = 1, 2, ..., n,
the model is

$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$

where $p_i = P(Y_i = 1)$.
The left-hand side is referred to as the logit or log-odds.

We can solve the logit equation to obtain:

$\Pr(Y = 1) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)]}$

In mathematical terms, this formula is called the
logistic function and can be written as:

$f(z) = \frac{1}{1 + e^{-z}}$, where $z = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$
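The relationship between the logit and the logistic function can be checked numerically. The following is a minimal sketch in Python; the function names `logit` and `logistic` and the value p = 0.75 are illustrative choices, not part of the original slides.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Logistic function f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.75                        # illustrative probability
z = logit(p)                    # log(0.75 / 0.25) = log(3)
print(round(z, 4))              # 1.0986
print(round(logistic(z), 4))    # 0.75, the logistic function inverts the logit
```

Applying `logistic` to the log-odds recovers the original probability, which is what "solving the logit equation" means in practice.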
Simple logit model
Let Y and X be defined as follows:

$Y = 1$ if develop lung cancer, $0$ otherwise
$X_1 = 1$ if smoker, $0$ otherwise

$\log\left[\frac{P(Y = 1)}{P(Y = 0)}\right] = \beta_0 + \beta_1 X_1$

Hence,
$\log[\text{odds}(Y = 1 \mid X_1 = 1)] = \beta_0 + \beta_1$
$\log[\text{odds}(Y = 1 \mid X_1 = 0)] = \beta_0$

$OR_{S \text{ vs } NS} = \frac{\text{odds(smoker)}}{\text{odds(nonsmoker)}} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$

OR (odds ratio): a ratio of 2 odds
Interpretation of the odds ratio
If, for example, $\beta_1 = 1.0986$, then the odds ratio is $e^{1.0986} = 3$.

This odds ratio (OR) indicates that the odds of developing lung cancer
for a smoker are 3 times the odds for a nonsmoker.
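The odds ratio compares odds, not probabilities. A minimal Python sketch of the smoker example follows; the slope 1.0986 is taken from the slide, while the intercept `beta0 = -2.0` is a hypothetical value chosen only so that probabilities can be computed.

```python
import math

def prob(z):
    """Logistic function: probability from the linear predictor z."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

beta0 = -2.0      # hypothetical intercept, not from the slides
beta1 = 1.0986    # slope from the smoker example

p_nonsmoker = prob(beta0)            # X1 = 0
p_smoker = prob(beta0 + beta1)       # X1 = 1
odds_ratio = odds(p_smoker) / odds(p_nonsmoker)

print(round(odds_ratio, 3))                 # 3.0, equal to exp(beta1)
print(round(p_smoker / p_nonsmoker, 3))     # ratio of probabilities, smaller than 3
```

Whatever intercept is assumed, the odds ratio stays at exp(beta1), while the ratio of the two probabilities depends on the baseline risk.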
Introduction to Discriminant Analysis
An appropriate technique for classifying or separating individuals
into different groups (dependent variable) based on a set of
quantitative independent random variables
Involves deriving the linear combination of predictor variables
(called the discriminant function) that will discriminate best
between the given groups
The main objective of discriminant analysis is to predict group
membership based on a set of quantitative variables.
Assumptions: The predictor variables for each group have a
multivariate normal distribution

Scatter Plot of Income Vs Lotsize
[Figure: scatter plot of INCOME against LOTSIZE by GROUP (nonowners, owners)]

Can we find a discriminant function based on income and lot size to
predict whether a homeowner will or will not purchase a lawn
mower? (Johnson and Wichern, Applied Multivariate Statistical
Analysis, Prentice Hall, 2002).
We can classify a new observation ($x_0$)
using:
1) Linear or quadratic discriminant functions
2) Posterior probabilities

$P(\pi_k \mid x) = P(x \text{ comes from } \pi_k \text{ given that } x \text{ was observed}) = \frac{p_k f_k(x)}{\sum_{i=1}^{g} p_i f_i(x)}$

where $p_k$ is the prior probability of Group $k$ and $f_k(x)$ is the density of $x$ in Group $k$.
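The posterior probability rule can be illustrated with a small sketch. It assumes, purely for illustration, one predictor with a univariate normal density in each group; the priors, means and standard deviations below are hypothetical and do not come from the slides.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def posteriors(x, priors, means, sds):
    """Posterior P(group k | x) = p_k f_k(x) / sum_i p_i f_i(x)."""
    weighted = [p * normal_pdf(x, m, s) for p, m, s in zip(priors, means, sds)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-group setup with equal priors and equal spread.
post = posteriors(x=1.0, priors=[0.5, 0.5], means=[0.0, 2.0], sds=[1.0, 1.0])
print(post)   # x is equidistant from both means, so each posterior is 0.5
```

An observation is then allocated to the group with the largest posterior probability.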

Classification for two (2) normal populations
Homoscedastic case (when $\Sigma_1 = \Sigma_2$)
Allocate $x_0$ to $\pi_1$ if

$(\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1} x_0 - \frac{1}{2} (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1} (\bar{x}_1 + \bar{x}_2) \ge \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]$

Otherwise allocate $x_0$ to $\pi_2$.

Here $x_0$ is an observation, the left-hand side is the linear discriminant
function, $c(1|2)$ and $c(2|1)$ are the costs of misclassification, and
$p_1$ and $p_2$ are the prior probabilities.
Note: Assume $c(1|2) = c(2|1)$ and $p_1 = p_2$ if they are unknown.
Hence, the right-hand side is $\ln(1) = 0$.
Source: Johnson & Wichern, 2002

Classification for two (2) normal
populations:
Heteroscedastic case (when $\Sigma_1 \ne \Sigma_2$)
Allocate $x_0$ to $\pi_1$ if

$-\frac{1}{2} x_0' (S_1^{-1} - S_2^{-1}) x_0 + (\bar{x}_1' S_1^{-1} - \bar{x}_2' S_2^{-1}) x_0 - k \ge \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]$

where

$k = \frac{1}{2} \ln\left(\frac{|S_1|}{|S_2|}\right) + \frac{1}{2} (\bar{x}_1' S_1^{-1} \bar{x}_1 - \bar{x}_2' S_2^{-1} \bar{x}_2)$

Otherwise allocate $x_0$ to $\pi_2$.
The left-hand side is the quadratic discriminant function.
Source: Johnson & Wichern, 2002
Example: Admission into graduate programs
based on GPA and GMAT
Response variable (Y): $\pi_1$: admit; $\pi_2$: do not admit
Independent variables (X): $X_1$: undergraduate GPA; $X_2$: GMAT score

$n_1 = 31$, $n_2 = 28$

$\bar{x}_1 = \begin{bmatrix} 3.40 \\ 561.23 \end{bmatrix}$, $\bar{x}_2 = \begin{bmatrix} 2.48 \\ 447.07 \end{bmatrix}$

$S_1 = \begin{bmatrix} 0.0435 & 0.05809 \\ 0.05809 & 4618.2473 \end{bmatrix}$, $S_2 = \begin{bmatrix} 0.02969 & -5.4038 \\ -5.4038 & 2246.9046 \end{bmatrix}$

$S_{pooled} = \begin{bmatrix} 0.03695 & -2.5290 \\ -2.5290 & 3494.939 \end{bmatrix}$, $S_{pooled}^{-1} = \begin{bmatrix} 28.4837 & 0.02061 \\ 0.02061 & 0.0003 \end{bmatrix}$
Assume the homoscedastic case (Box's M test of equality of
covariance matrices not significant). Classify a student with
GPA = 3.21 and GMAT = 497.

$\hat{y} = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1} x_0 - \frac{1}{2} (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1} (\bar{x}_1 + \bar{x}_2)$

$= [0.92 \;\; 114.16] \begin{bmatrix} 28.4837 & 0.02061 \\ 0.02061 & 0.0003 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \frac{1}{2} [0.92 \;\; 114.16] \begin{bmatrix} 28.4837 & 0.02061 \\ 0.02061 & 0.0003 \end{bmatrix} \begin{bmatrix} 5.88 \\ 1008.3 \end{bmatrix}$

$= 28.5578 x_1 + 0.05321 x_2 - 110.7857$

Since $\hat{y} = 28.5578(3.21) + 0.05321(497) - 110.7857 > 0$,
classify the student into $\pi_1$: admit.
Source: Johnson & Wichern, 2002
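The arithmetic of this example can be reproduced with a short sketch. It uses only the group means and the inverse pooled covariance matrix printed on the slide; the function name `lda_score` is an illustrative choice, and equal costs and priors are assumed so that the cutoff is 0.

```python
# Values as printed on the slide (admission example).
xbar1 = [3.40, 561.23]            # group 1: admit
xbar2 = [2.48, 447.07]            # group 2: do not admit
S_pooled_inv = [[28.4837, 0.02061],
                [0.02061, 0.0003]]

def lda_score(x0):
    """(xbar1 - xbar2)' S^-1 x0 - 0.5 (xbar1 - xbar2)' S^-1 (xbar1 + xbar2)."""
    d = [a - b for a, b in zip(xbar1, xbar2)]      # difference of means
    s = [a + b for a, b in zip(xbar1, xbar2)]      # sum of means
    # Row vector a' = d' S^-1
    a = [sum(d[i] * S_pooled_inv[i][j] for i in range(2)) for j in range(2)]
    linear_part = sum(a[j] * x0[j] for j in range(2))
    constant = 0.5 * sum(a[j] * s[j] for j in range(2))
    return linear_part - constant

y = lda_score([3.21, 497])        # GPA = 3.21, GMAT = 497
print(round(y, 2))                # 7.33
print("admit" if y >= 0 else "do not admit")   # admit
```

The score is positive, which agrees with the slide's conclusion that the student is classified into the admit group.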
Classification with several populations

Allocate $x_0$ to $\pi_k$
if the linear discriminant score
$d_k(x_0)$ = the largest of $d_1(x_0), d_2(x_0), \dots, d_g(x_0)$

where

$d_i(x) = \bar{x}_i' S_{pooled}^{-1} x - \frac{1}{2} \bar{x}_i' S_{pooled}^{-1} \bar{x}_i + \ln p_i$, $i = 1, 2, \dots, g$
(Fisher's discriminant function given in SPSS/SAS output)

Quadratic discriminant score:

$d_i^Q(x) = -\frac{1}{2} \ln|S_i| - \frac{1}{2} (x - \bar{x}_i)' S_i^{-1} (x - \bar{x}_i) + \ln p_i$, $i = 1, 2, \dots, g$

NOTE:
(1) Equal covariance matrices: use the linear discriminant score
(2) Unequal covariance matrices: use the quadratic discriminant score
(Source: Johnson & Wichern, 2002)
Assessing the performance of the
classification functions

Error rate: percentage of observations
misclassified

                     Predicted membership
Actual membership    Owners    Non-owners    Sample size
Owners               n_1c      n_1m          n_1
Non-owners           n_2m      n_2c          n_2

$\text{Error rate} = \frac{n_{1m} + n_{2m}}{n_1 + n_2}$
Comparing the performance of multiple regression,
logistic regression, and discrimination functions in
classification of observations
These three statistical methods were applied to a
data set to compare their ability to
classify a baby as low birth weight or normal birth weight
based on several predictor variables.

Dependent variable
Y = Birth weight (g)

Independent variables
X1 = Race (Malay, Chinese and Indian)
X2 = Gender (male, female)
X3 = Mother's age (years)
X4 = Father's income (RM)
X5 = Parity (children)
X6 = Abortion (yes, no)
X7 = Mother's height (cm)
X8 = Vitamin (mg)
X9 = Weight gain (kg)
X10 = Antenatal visits (number of times)

Data set (collected in 1997) courtesy of
Hospital Kuala Lumpur
Methodology (The Process of Developing and
Evaluating the Models)
1. Split data into the training data set (n1 = 365) and the validation data set (n2 = 50).
2. Build the model(s) using the training data set.
3. Check the model adequacy using plots of residuals and other diagnostics.
4. Are remedial measures needed? If yes, revise the model and check it again; if no, continue.
5. Evaluate the performance of the model using the validation data set.
6. Find the probabilities of misclassification E1, E2 and E3.
7. Compare the error rates E1, E2 and E3.
8. Select the best model.
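The split into 365 training and 50 validation observations could be produced as in the sketch below. The slides do not say how the split was made, so random assignment, the seed and the helper name `split_indices` are all assumptions made for illustration.

```python
import random

def split_indices(n_total, n_valid, seed=0):
    """Randomly partition range(n_total) into training and validation index lists."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    idx = list(range(n_total))
    rng.shuffle(idx)
    valid = sorted(idx[:n_valid])      # first n_valid shuffled indices
    train = sorted(idx[n_valid:])      # remaining indices
    return train, valid

# 365 + 50 = 415 observations in total (sizes from the slides).
train_idx, valid_idx = split_indices(n_total=415, n_valid=50)
print(len(train_idx), len(valid_idx))   # 365 50
```

Models are then fitted on the rows in `train_idx` only, and the rows in `valid_idx` are held back for computing the error rates.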
SPSS Results (Multiple Linear Regression
Analysis)

ANOVA
Model 1       Sum of Squares   df    Mean Square    F        Sig.
Regression    16531299         4     4132824.770    15.500   .000
Residual      95990816         360   266641.157
Total         1.13E+08         364

Coefficients
Model 1       B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.
(Constant)    -1532.707            857.305                            -1.788   .075
PARITY        45.828               17.534       .131                  2.614    .009
MUM_HEIG      23.679               5.506        .210                  4.300    .000
WGHTGAIN      39.234               9.698        .210                  4.046    .000
ANT_VST       51.366               14.606       .178                  3.517    .000
PARITY, MUM_HEIG, WGHTGAIN and ANT_VST are the significant predictor variables.

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .383   .147       .137                516.37
The final estimated regression function is:

Birth Weight = -1532.707 + 45.828(Parity) +
23.679(Mother's Height) + 39.234(Weight Gain) +
51.366(Antenatal Visits)

SPSS Results (Multiple Linear Regression)
Multiple Regression Results
Interpretation of the estimated regression coefficients:

1. For parity (b1 = 45.828): for every additional child in
the family, the birth weight of babies will increase by
approximately 46 g, holding mother's height, weight gain and
antenatal visits constant.
2. For mother's height (b2 = 23.679): the birth
weight of babies will increase by approximately 24 g for every
1 cm increase in mother's height, holding parity, weight gain
and antenatal visits constant. (Birth weight is higher for taller
mothers.)
3. For weight gain (b3 = 39.234): the birth weight
of babies will increase by approximately 39 g for every
1 kg increase in weight gain, holding parity, mother's height
and antenatal visits constant.
4. For antenatal visits (b4 = 51.366): the birth
weight of babies will increase by approximately 51 g for
every additional antenatal visit,
holding parity, mother's height and weight gain constant.

Checking Model Adequacy Through
Diagnostic Plots
[Figures: Q-Q Plot of Residuals (Expected Normal against Observed Value); Plot of Residuals against Fitted Values (Regression Standardized Residual against Regression Standardized Predicted Value)]
Notes: Kolmogorov-Smirnov = 0.045, p-value = 0.077
Skewness = - 0.153
Kurtosis = 0.048

No violation of regression model assumptions of normal
errors with constant variance.
Evaluating Regression Model Performance
Through Error Rate
The estimated regression function is then used to
predict the birth weight of the 50 observations in
the validation sample
Predicted values below 2500 g were classified as
low birth weight; all other predicted values were classified as
normal birth weight.
The following classification table gives the true
and predicted category obtained.
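The prediction and classification step can be sketched as follows. The coefficients and the 2500 g cutoff come from the slides; the function names and the example mother (parity 1, height 155 cm, weight gain 10 kg, 8 antenatal visits) are hypothetical.

```python
# Estimated regression coefficients from the SPSS output.
COEF = {
    "const": -1532.707,
    "parity": 45.828,
    "height": 23.679,
    "gain": 39.234,
    "visits": 51.366,
}

def predict_birth_weight(parity, mother_height_cm, weight_gain_kg, antenatal_visits):
    """Predicted birth weight in grams from the fitted regression function."""
    return (COEF["const"]
            + COEF["parity"] * parity
            + COEF["height"] * mother_height_cm
            + COEF["gain"] * weight_gain_kg
            + COEF["visits"] * antenatal_visits)

def classify(predicted_g, cutoff_g=2500.0):
    """Low birth weight if the prediction falls below the cutoff."""
    return "low" if predicted_g < cutoff_g else "normal"

w = predict_birth_weight(1, 155, 10, 8)   # hypothetical mother
print(round(w, 1))    # 2986.6
print(classify(w))    # normal
```

Because the regression predicts the mean birth weight, predictions tend to cluster near the overall average, so few observations fall below the cutoff.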

Classification Table.

                  Predicted
Observed          Normal weight   Low weight   Total
Normal weight     34              0            34
Low weight        15              1            16
Total             49              1            50

Error rate for Multiple Regression Model:
$E_1 = \frac{15 + 0}{50} = 0.30$

APPLYING LOGISTIC
REGRESSION

Dependent variable
Y = 1 if low birth weight (birth weight < 2500 g), 0 otherwise

Independent variables
X1 = Race (Malay, Chinese and Indian)
X2 = Gender (male, female)
X3 = Mother's age (years)
X4 = Father's income (RM)
X5 = Parity (children)
X6 = Abortion (yes, no)
X7 = Mother's height (cm)
X8 = Vitamin (mg)
X9 = Weight gain (kg)
X10 = Antenatal visits (number of times)
SPSS Results for Multiple Logistic Regression.

Step 1      B        S.E.    Wald     df   Sig.   Exp(B)
WGHTGAIN    -.193    .052    13.666   1    .000   .824
ANT_VST     -.194    .070    7.624    1    .006   .824
ABORT(1)    .648     .308    4.428    1    .035   1.912
MUM_HEIG    -.108    .028    15.136   1    .000   .898
Constant    18.247   4.380   17.356   1    .000   8.4E+07

The estimated logistic regression
model obtained:

$P(Y_j = 1) = \frac{1}{1 + e^{-z_j}}$

where
$z_j$ = 18.247 - 0.193(Weight gain) - 0.194(Antenatal visits) +
0.648(History of abortion) - 0.108(Mother's height)
Interpretation of the odds ratio $e^{\beta}$
(each holding the other predictors constant)

1. For weight gain: the odds ratio (0.824) means that for every 1 kg increase
in weight gain, the odds of low birth weight decrease by about 18%.

2. For antenatal visits: the odds ratio (0.824) indicates that for each
additional antenatal visit, the odds of
low birth weight decrease by about 18%.

3. For abortion: the odds ratio (1.912) indicates that the odds of having a
baby with low birth weight for a mother who has had abortion(s) are
approximately 2 times the odds for a mother with no
history of abortion(s).

4. For mother's height: the odds ratio (0.898) indicates that the odds of low
birth weight decrease by about 10% for every 1 cm increase in mother's height.
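The Exp(B) column of the SPSS output is simply the exponential of each coefficient B. The sketch below recomputes the odds ratios from the reported coefficients and expresses each as a percentage change in the odds; the dictionary names are illustrative.

```python
import math

# Logistic regression coefficients (B) from the SPSS output.
B = {
    "WGHTGAIN": -0.193,
    "ANT_VST": -0.194,
    "ABORT(1)": 0.648,
    "MUM_HEIG": -0.108,
}

# Odds ratio for a one-unit increase in each predictor.
odds_ratios = {name: math.exp(b) for name, b in B.items()}

for name, ratio in odds_ratios.items():
    pct_change = (ratio - 1.0) * 100.0     # percentage change in the odds
    print(f"{name}: OR = {ratio:.3f}, change in odds = {pct_change:+.1f}%")
```

An odds ratio below 1 corresponds to a negative percentage change (lower odds of low birth weight), and an odds ratio above 1 to a positive one.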

Logistic Regression Results (Contd)
Assessing the Goodness of Fit of the Model

SPSS Results of Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      378.206             .124                   .180

-2 Log likelihood: for comparisons of models
Cox & Snell and Nagelkerke R Square: analogous to R-square in regression

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      7.236        8    .511

No significant lack of fit
Evaluating the performance of the logistic
regression model
The estimated logistic function is then used to predict the 50
observations in the validation data set.
If $P(Y_j = 1) > 0.50$, $j = 1, 2, \dots, 50$,
we classify the observation as belonging to $\pi_1$ (low birth weight).
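The classification rule can be sketched in a few lines. The coefficients and the 0.50 cutoff are from the slides; the function names and the example mother (weight gain 10 kg, 8 antenatal visits, no history of abortion, height 155 cm) are hypothetical.

```python
import math

def p_low_birth_weight(weight_gain_kg, antenatal_visits, abortion, mother_height_cm):
    """Estimated P(Y = 1); abortion is coded 1 for a history of abortion, else 0."""
    z = (18.247
         - 0.193 * weight_gain_kg
         - 0.194 * antenatal_visits
         + 0.648 * abortion
         - 0.108 * mother_height_cm)
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, cutoff=0.50):
    """Low birth weight if the estimated probability exceeds the cutoff."""
    return "low" if p > cutoff else "normal"

p = p_low_birth_weight(10, 8, 0, 155)   # hypothetical mother
print(round(p, 3))    # 0.122
print(classify(p))    # normal
```

A cutoff other than 0.50 could be used when the two kinds of misclassification carry different costs; the slides use 0.50.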

Error Rate for the Logistic Regression
Model

                  Predicted
Observed          Normal weight   Low weight   Total
Normal weight     33              1            34
Low weight        11              5            16
Total             44              6            50

$E_2 = \frac{11 + 1}{50} = 0.24$
Discriminant Analysis (Checking the
assumption of multivariate normal distribution)
Variables          Normal birth weight     Low birth weight
Mother's age       Approximately Normal    Approximately Normal
Father's income    Nonnormal               Nonnormal
Parity             Approximately Normal    Approximately Normal
Mother's height    Approximately Normal    Approximately Normal
Vitamin            Approximately Normal    Approximately Normal
Weight gain        Approximately Normal    Approximately Normal
Antenatal visit    Approximately Normal    Approximately Normal
Chi-square plots for checking multivariate
normality

[Figures: chi-square plots of Mahalanobis Distance against chisq for the Low Birth Weight Group and the Normal Birth Weight Group]
Chi-square plots indicate both
groups have approximate
multivariate normal distributions.

Discriminant Analysis Results

Box's M Test of Equality of Covariance Matrices
$H_0: \Sigma_1 = \Sigma_2$
$H_1: \Sigma_1 \ne \Sigma_2$

Box's M   12.83
F         2.11
df1       6
df2       217363
Sig.      0.049

Can assume equal covariance matrices (note that Sig. = 0.049 is borderline at the 5% level).
Use Fisher's Linear Discriminant Function.
SPSS Output (Discriminant Functions)
Classification Function Coefficients
              WEI_CODE
              0 normal weight   1 low weight
MUM_HEIG      6.699             6.601
WGHTGAIN      .397              .238
ANT_VST       2.965             2.797
(Constant)    -532.689          -515.224
Fisher's linear discriminant functions

Normal birth weight category:
$d_1$ = -532.689 + 6.699(mother's height) + 0.397(weight gain) +
2.965(antenatal visits)

Low birth weight category:
$d_2$ = -515.224 + 6.601(mother's height) + 0.238(weight gain) +
2.797(antenatal visits)
Discriminant Analysis Results (Contd)
Classification Results
                                            Predicted Group Membership
                          WEI_CODE          normal weight   low weight   Total
Original          Count   normal weight     170             96           266
                          low weight        28              71           99
                  %       normal weight     63.9            36.1         100.0
                          low weight        28.3            71.7         100.0
Cross-validated   Count   normal weight     169             97           266
                          low weight        28              71           99
                  %       normal weight     63.5            36.5         100.0
                          low weight        28.3            71.7         100.0
Cross-validation error rate of the
model = 0.34
Evaluating the performance of the
discriminant functions
The estimated discriminant functions are then used
to predict the group membership of the 50
observations in the validation data set.
If $d_2(x_j) > d_1(x_j)$, where $d_1$ and $d_2$ are the scores for the normal
and low birth weight categories respectively,
we classify the observation into $\pi_1$ (low birth weight).
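The rule of allocating to the category with the larger discriminant score can be sketched as below. The coefficients are the classification function coefficients from the SPSS output; the function names and the two example mothers are hypothetical.

```python
def discriminant_scores(mother_height_cm, weight_gain_kg, antenatal_visits):
    """Fisher's linear discriminant scores for the two birth weight categories."""
    d_normal = (-532.689 + 6.699 * mother_height_cm
                + 0.397 * weight_gain_kg + 2.965 * antenatal_visits)
    d_low = (-515.224 + 6.601 * mother_height_cm
             + 0.238 * weight_gain_kg + 2.797 * antenatal_visits)
    return d_normal, d_low

def classify(mother_height_cm, weight_gain_kg, antenatal_visits):
    """Allocate to the category with the larger discriminant score."""
    d_normal, d_low = discriminant_scores(mother_height_cm, weight_gain_kg, antenatal_visits)
    return "low" if d_low > d_normal else "normal"

# Two hypothetical mothers (height in cm, weight gain in kg, number of visits).
print(classify(155, 10, 8))   # normal
print(classify(145, 5, 3))    # low
```

Only the difference between the two scores matters for the allocation, so the rule is equivalent to checking the sign of a single linear function of the three predictors.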
Evaluate Discriminant Functions Performance
Through Error Rate
Classification Table.

                  Predicted
Observed          Normal weight   Low weight   Total
Normal weight     22              12           34
Low weight        6               10           16
Total             28              22           50

$E_3 = \frac{12 + 6}{50} = 0.36$
Summary of the Models Performances

Statistical model                  Significant variables                                                       Error rate
1. Multiple linear regression      Mother's height, weight gain, antenatal visits, parity                     0.30
2. Multiple logistic regression    Mother's height, weight gain, antenatal visits, history of abortion(s)     0.24
3. Discriminant                    Mother's height, weight gain, antenatal visits                             0.36

Note: Comparing the models with the same significant predictor variables, the
error rates for Multiple Regression, Logistic Regression and
Discriminant Analysis are 0.28, 0.26 and 0.36 respectively.
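The three reported error rates can be recomputed from the validation classification tables given earlier in the talk. The sketch below assumes rows are observed classes and columns are predicted classes, in the order (normal weight, low weight); the function and variable names are illustrative.

```python
def error_rate(table):
    """Proportion misclassified; table[i][j] = count observed in class i, predicted j."""
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return (total - correct) / total

# Validation classification tables (n = 50) from the slides.
tables = {
    "multiple regression": [[34, 0], [15, 1]],
    "logistic regression": [[33, 1], [11, 5]],
    "discriminant analysis": [[22, 12], [6, 10]],
}

rates = {name: error_rate(t) for name, t in tables.items()}
for name, r in rates.items():
    print(f"{name}: {r:.2f}")            # 0.30, 0.24 and 0.36 respectively

print(min(rates, key=rates.get))         # logistic regression
```

The overall error rate hides how the errors are distributed: the regression model, for example, misclassifies 15 of the 16 low birth weight babies even though its overall rate is 0.30.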
Conclusion of Study
The significant predictor variables affecting the birth
weight of babies are weight gain, number of
antenatal visits, parity, mother's height and history
of abortions.
The logistic regression model is found to be the
best model in this study as it has the lowest error
rate on the validation data set (0.24).
Some interesting research papers
(1) Logistic Regression for Data Mining and High-Dimensional Classification
Paul Komarek (PhD thesis, Carnegie Mellon University,
2004, 138 pages)
(www.autonlab.org/autonweb/showPaper.jsp?ID=komarek:Ir_thesis)
(2) Predicting Housing Value: A Comparison of Multiple
Regression and Artificial Neural Networks
Nghiep Nguyen & Al Cripps
Journal of Real Estate Research, Vol 22, p313-336, 2001.
(3) Application of f-regression to fuzzy classification problem
Boris Izyumov
Proceedings of 3rd International Conference on Fuzzy
Logic and Technology (EUS,2003), Zittau, Germany
(2003), pp781-766
(4) Assessing and Predicting Information and Communication
Technology Literacy in Education Undergraduates
JoAnne Davies (PhD thesis, Department of Educational
Psychology, Edmonton, Alberta, 2002)
(5) Discriminant Analysis for recognition of human face
images
Kamran Etemad & Rama Chellappa
J. Optical Soc. Of America, Vol 14, No 8, 1997
RECOMMENDED REFERENCES
Hosmer, D. W. & Lemeshow, S. (1989). Applied Logistic
Regression. John Wiley & Sons, Inc.
Johnson, R. A. & Wichern, D. W. (2002). Applied
Multivariate Statistical Analysis (5th Edition). New Jersey: Prentice
Hall.
Montgomery, D. C., Peck, E. A. & Vining, G. G. (2001).
Introduction to Linear Regression Analysis. John Wiley & Sons, Inc.
Hair, J. F., Anderson, R. E., Tatham, R. L. & Black, W. C. (1998).
Multivariate Data Analysis (5th Edition). Prentice Hall.
Tabachnick, B. & Fidell, L. S. (2001). Using Multivariate Statistics (4th
Edition). Pearson.
Malhotra, N. K. (2004). Marketing Research: An Applied Orientation.
Pearson Education, Inc.

FACULTY OF INFORMATION TECHNOLOGY &
QUANTITATIVE SCIENCES, UiTM
