
Dr. Robert Chapman
June 2016

PH 507

Multivariable linear and logistic regression analysis

Note: In this course there is time for only a very superficial
introduction to multivariable regression. It is a very broad
topic, and I encourage you to learn more about it on your
own.

Slide 1


Regression analysis enables us to describe the relationship between a
dependent variable (Y) and multiple independent variables (X) at the same
time. It also lets us estimate (predict) the value of the dependent variable
at specific values of multiple independent variables.
Regression analysis enables us to evaluate the relative importance of
independent variables.
There are several kinds of regression analysis, including linear regression
and logistic regression.
In linear regression the DEPENDENT variable is continuous.
In logistic regression the DEPENDENT variable is dichotomous: it has only
2 levels (yes/no; 0/1). Example: presence or absence of an illness.
In both linear and logistic regression, INDEPENDENT variables can be
any type of data, e.g., continuous, dichotomous, ordinal. (Be careful with
nominal independent variables.)
We construct regression models to describe relationships between
independent and dependent variables. The model has 1 dependent
variable and 1 or more independent variables. When there is >1
independent variable, we call it multivariable regression or multiple
regression.
Slide 2
In linear regression, we use the idea of a straight line (Y = a + bX) to
make regression models. The word "linear" means "relating to a line."
Recall that a = the Y-intercept (the value of Y when X=0), and b = the
slope of the line. Slope has units. Slope = change in Y / unit change in
X, also called the rise over the run. If we know a and b, we can specify
the location of the line.
In fact, Y = a + bX can be regarded as a simple linear regression model,
with only 1 independent variable, X.
We can also make linear models with more than 1 independent variable.
These take the form Y = a + b1X1 + b2X2 + ... + bnXn,
where a = the Y-intercept and each X = a different independent variable.
Each b (b1, b2, ..., bn) is the slope for its independent variable.
The slopes (the b's) are multipliers of their X's. The b's are also called
coefficients or betas or effect estimates.
Sometimes we see b written as the Greek beta (β), and a written as the
Greek alpha (α). Sometimes a is written as b0 or β0.
The a and all the b's are known as model parameters. Linear and logistic
regression are both types of parametric analysis (usually).
Slide 3
We will use 2 SPSS data files:

AchievementAptitudePH507Chapman0813.sav. Total 9 observations.


Achievement is the dependent variable. Aptitude is the independent
variable. Both are continuous.

HighSchoolTestScoresPH507Chapman0813.sav. Total 200 observations.
The dependent variables are test scores, expressed as both continuous
and dichotomous data. There are several continuous and categorical
independent variables.

Slide 4


Data file AchievementAptitudePH507Chapman0813.sav. Achievement is
the dependent variable (Y). Aptitude is the independent variable (X).
Means of aptitude (Xbar) and achievement (Ybar). We will come back to
these means later.

Descriptive Statistics
                        N    Mean
APTITUDE Aptitude       9    7.11
ACHIEVMT Achievement    9    8.00
Valid N (listwise)      9

Sum of (Achievement - Ybar)² = Sum of (Achievement - 8)² = 86. This is
the total variation in achievement, or the total variation in Y. Note that
this is the numerator of the variance of Y. We will come back to this later.

                                   N    Sum   Mean   SD
ymybsqar (Achievment-mean)**2,     9    86    9.56   8.805
part of SSyy

Slide 5
Scatter diagram of aptitude (X) and achievement (Y)

[Scatter plot: Aptitude (4 to 10) on the horizontal axis, Achievement on the vertical axis.]

We can see that achievement increases as aptitude increases (positive
or direct relationship between aptitude and achievement).
How can we describe this relationship quantitatively and usefully?
Slide 6
Both achievement and aptitude are continuous, so we could use
correlation analysis to describe this relationship.

Pearson correlation between aptitude and achievement
(correlation coefficient = r). Note: r² = 0.677.

Correlations
                                       Aptitude    Achievement
Aptitude      Pearson Correlation      1.000       .823**
              Sig. (2-tailed)          .           .006
              N                        9           9
Achievement   Pearson Correlation      .823**      1.000
              Sig. (2-tailed)          .006        .
              N                        9           9
**. Correlation is significant at the 0.01 level (2-tailed).

We can also describe the relationship with linear regression.
Slide 7
Basis for fitting the regression line. Pieces of total variation. The formula for
the straight line is Y = a + bX. We call the points on the line Ycap
(estimated Y). Ycap is also written Ŷ.

[Scatter plot of achievement versus aptitude showing the fitted line Ycap = a + bX
and the horizontal line at the mean of Y (Ybar = 8). For each data point, the total
deviation splits into two pieces: Y - Ybar = (Y - Ycap) + (Ycap - Ybar).
Ycap - Ybar is the variation explained, the piece for the relationship, related to the
total variation in Y. Y - Ycap is the error or residual, the piece for lack of relationship.]
Slide 8
(Y - Ycap) is the error, the distance from Y to the line, also known as the
deviation or residual.
Ycap = a + bX, so (Y - Ycap) = (Y - a - bX).
We fit the regression line so that the sum of the squared deviations from
the line is a minimum, that is
Σ(Y - a - bX)² = minimum
In calculus, when something is a minimum, its derivative is zero. We can
solve for a and b by differentiating Σ(Y - a - bX)² with respect to a, then
with respect to b. Then we set the derivatives to zero. The solutions are
a = Ybar - bXbar, same as Ybar = a + bXbar. In other words,
the regression line passes through the point (Xbar, Ybar).
b = Σ[(Y - Ybar)(X - Xbar)] / Σ(X - Xbar)² (Also see slide 15.)
This is called the least-squares solution for a and b. The fitted line is
called the least-squares line, and linear regression is often called least-
squares regression.
Note: The equations above are for the situation when there is only one
independent variable. The equations are more complicated for more than
one independent variable, but the basic concepts are the same.
Slide 9
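Optional: a minimal Python sketch of the least-squares formulas above. The least_squares function works for any x and y lists; the check below uses only the summary values reported on slides 5, 10, and 15 (Xbar = 7.11, Ybar = 8.00, SSxy = 41.00, SSxx = 28.89), not the raw data file.

def least_squares(x, y):
    """Return (a, b) for the simple linear regression line Y = a + bX."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # Sum[(X - Xbar)(Y - Ybar)]
    ss_xx = sum((xi - xbar) ** 2 for xi in x)                       # Sum[(X - Xbar)^2]
    b = ss_xy / ss_xx        # slope
    a = ybar - b * xbar      # intercept; the line passes through (Xbar, Ybar)
    return a, b

# Check against the summary values from the aptitude/achievement example:
b = 41.00 / 28.89            # = 1.419 (slide 15)
a = 8.00 - b * 7.11          # = -2.09, matching the fitted constant on slides 10 and 14
print(round(b, 3), round(a, 3))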
Here is the least-squares regression line for the relationship between
aptitude and achievement. Note that the line passes through (Xbar,Ybar).
Note that R-square (R Sq) = the square of the Pearson correlation
coefficient. This shows that there is a close relationship between
correlation and linear regression. The equation for this line is
achievement = -2.092 + 1.419(aptitude).

[Scatter plot of achievement (Y) versus aptitude (X) with the fitted least-squares line.
The line passes through the point (Xbar = 7.11, Ybar = 8.00). R Sq Linear = 0.677.]
Slide 10


SPSS output from linear regression for relationship between
aptitude and achievement

The square root of R Square = the "correlation coefficient" for the whole
regression model, the same as the correlation coefficient that we saw before.

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .823a   .677       .630                1.99
a. Predictors: (Constant), Aptitude

R Square is a measure of the overall relationship between the regression
model and the dependent variable: the proportion of total variation in Y
that the model "explains."
Slide 11
For each data point, we have seen that (Y - Ybar) = (Y - Ycap) + (Ycap - Ybar),
so for the whole data set, Σ(Y - Ybar) = Σ(Y - Ycap) + Σ(Ycap - Ybar).

We can also prove that Σ(Y - Ybar)² = Σ(Y - Ycap)² + Σ(Ycap - Ybar)²

Recall that Σ(Y - Ybar)² is the total variation of the dependent variable, Y.

Σ(Ycap - Ybar)² is the piece for the relationship between dependent and
independent variables (the piece for regression).

Σ(Y - Ycap)² is the piece for error (residual or deviation).

Thus the total variation = the piece for relationship + the piece for error.
This is the basis of the ANOVA table in linear regression. In other words,
the total sum of squares = sum of squares for regression + sum of squares
for residual (SStotal = SSregression + SSresidual). R² for the model = the
piece for regression / total variation = SSregression / SStotal = proportion
of total variation that the model explains. See the ANOVA table in the next
slide.

Slide 12


SPSS ANOVA table for linear regression. Abbreviations: SS = sum of
squares, MS = mean square, df = degrees of freedom

ANOVAb
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  58.188           1    58.188        14.646   .006a
   Residual    27.812           7    3.973
   Total       86.000           8
a. Predictors: (Constant), APTITUDE Aptitude
b. Dependent Variable: ACHIEVMT Achievement

The amount of variation explained = SSregression = Σ(Ycap - Ybar)².
The amount of variation not explained = SSresidual = SSerror = Σ(Y - Ycap)².
Total variation of the dependent variable = Σ(Y - Ybar)² = SSregression + SSresidual.
Mean square = the sum of squares / d.f.
F-test for the whole regression model = MSregression / MSresidual, d.f. = 1, 7.
Variation explained / total variation = 58.188 / 86 = 0.677 = R² = SSregression / SStotal.
Total d.f. = N - 1. D.f. for regression = number of independent variables.
D.f. for residual (error) = (N - 1 - number of independent variables).
Slide 13
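Optional: a short Python sketch that reproduces the arithmetic behind this ANOVA table from the sums of squares reported above.

# Recompute R-squared and the F statistic from the ANOVA sums of squares.
ss_regression = 58.188
ss_residual = 27.812
ss_total = ss_regression + ss_residual         # 86.000 = total variation, Sum[(Y - Ybar)^2]

n = 9                                          # observations
k = 1                                          # independent variables
df_regression = k                              # 1
df_residual = n - 1 - k                        # 7

r_squared = ss_regression / ss_total           # 0.677, proportion of variation explained
ms_regression = ss_regression / df_regression  # 58.188
ms_residual = ss_residual / df_residual        # 3.973
f_statistic = ms_regression / ms_residual      # about 14.65, d.f. = 1, 7
print(round(r_squared, 3), round(f_statistic, 2))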
Coefficientsa
                      Unstandardized Coefficients   Standardized Coefficients
Model                 B         Std. Error          Beta                        t        Sig.
1  (Constant)         -2.092    2.720                                           -.769    .467
   APTITUDE Aptitude  1.419     .371                .823                        3.827    .006
a. Dependent Variable: ACHIEVMT Achievement

B is the regression coefficient (slope). Units are change in achievement per unit
change in aptitude. The constant is the Y-intercept of the regression line.
t = B / standard error. d.f. = residual d.f. = (N - 1 - no. of independent variables).
Sig. is the 2-tailed p-value for the t-statistic. We will ignore standardized coefficients.
The modeled relationship is achievement = -2.092 + 1.419(aptitude). The model
predicts that achievement increases by 1.419 units for each unit increase in aptitude.
An excellent web site for statistical calculators and test statistics is
www.danielsoper.com.
Slide 14
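Optional: a sketch of how the t statistic for the aptitude coefficient is formed; the p-value line uses SciPy only as a cross-check of the SPSS output and assumes SciPy is available.

from scipy import stats

b = 1.419          # regression coefficient (slope) for aptitude
se = 0.371         # its standard error
df_residual = 7    # N - 1 - number of independent variables = 9 - 1 - 1

t = b / se                               # about 3.83 (SPSS reports 3.827 from unrounded values)
p = 2 * stats.t.sf(abs(t), df_residual)  # two-tailed p-value, about 0.006
print(round(t, 3), round(p, 3))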
OPTIONAL: Where the slope (B) comes from

Σ[(Y - Ybar)(X - Xbar)] = SSxy = numerator of b in slide 9
Σ(X - Xbar)² = SSxx = denominator of b in slide 9

                                    N    Sum     Mean   SD
xmxbymyb (xminxbar)(yminybar),      9    41.00   4.56   5.878
part of SSxy
xmxbsqar (xminxbar)**2,             9    28.89   3.21   3.639
part of SSxx

b = slope = Σ[(Y - Ybar)(X - Xbar)] / Σ(X - Xbar)² = 41.00 / 28.89 = 1.419
= SSxy / SSxx
Note: The sums of squares for regression and residual are different
from SSxx and SSxy.
Slide 15
95% confidence intervals around regression coefficients

Use the t-statistic with 2-tailed p-value = 0.05 and d.f. = residual d.f. (= 7).
In this example, t = 2.365.

Coefficientsa
                 Unstandardized Coefficients                           95% Confidence Interval for B
Model            B         Std. Error    Beta     t        Sig.        Lower Bound    Upper Bound
1  (Constant)    -2.092    2.720                  -.769    .467        -8.523         4.338
   Aptitude      1.419     .371          .823     3.827    .006        .542           2.296
a. Dependent Variable: Achievement

1.419 - (2.365)(0.371) = 0.542.
1.419 + (2.365)(0.371) = 2.296.
Slide 16
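Optional: the confidence-interval arithmetic above as a short Python sketch; SciPy is used only to look up the critical t value of 2.365 for 7 residual degrees of freedom.

from scipy import stats

b = 1.419                                  # slope for aptitude
se = 0.371                                 # standard error of the slope
df_residual = 7

t_crit = stats.t.ppf(0.975, df_residual)   # 2.365 for a 2-tailed p of 0.05
lower = b - t_crit * se                    # 0.542
upper = b + t_crit * se                    # 2.296
print(round(t_crit, 3), round(lower, 3), round(upper, 3))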
Structure of data file HighSchoolTestScoresPH507Chapman0813.sav

Slide 17


Assessing associations of writing score with 3 independent variables, one at
a time (bivariate analysis). Starting with gender (male = 0, female = 1).

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .256a   .066       .061                9.185
a. Predictors: (Constant), female DichotomousVariableForGender

ANOVAb
Model          Sum of Squares   df    Mean Square   F        Sig.
1  Regression  1176.21          1     1176.2        13.943   .000a
   Residual    16702.7          198   84.357
   Total       17878.9          199
a. Predictors: (Constant), female DichotomousVariableForGender
b. Dependent Variable: write WritingTestScore

Coefficientsa
                               Unstandardized Coefficients             95% Confidence Interval for B
Model                          B        Std. Error   Beta    t        Sig.    Lower Bound   Upper Bound
1  (Constant)                  50.121   .963                 52.057   .000    48.222        52.020
   female Dichotomous          4.870    1.304        .256    3.734    .000    2.298         7.442
   VariableForGender
a. Dependent Variable: write WritingTestScore
Slide 18
Weight in kilograms

Coefficientsa
                               Unstandardized Coefficients             95% Confidence Interval for B
Model                          B        Std. Error   Beta     t        Sig.    Lower Bound   Upper Bound
1  (Constant)                  76.131   8.308                 9.163    .000    59.747        92.515
   weight WeightInKilograms    -.558    .198         -.197    -2.820   .005    -.948         -.168
a. Dependent Variable: write WritingTestScore

3-level SES scale (SES). Note: the SES scale is ordinal categorical, but we
are considering it to be continuous in this example.

Coefficientsa
                               Unstandardized Coefficients             95% Confidence Interval for B
Model                          B        Std. Error   Beta     t        Sig.    Lower Bound   Upper Bound
1  (Constant)                  47.195   1.982                 23.814   .000    43.287        51.103
   ses Ordinal3LevelSESScale   2.715    .910         .207     2.985    .003    .921          4.510
a. Dependent Variable: write WritingTestScore
Slide 19
In bivariate analysis, all 3 independent variables are statistically
significantly associated with writing score. Now we will enter all 3 into
one single multivariable (multiple) linear regression model.

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .352   .124       .111                8.939

ANOVAb
Model          Sum of Squares   df    Mean Square   F       Sig.
1  Regression  2218.70          3     739.568       9.256   .000a
   Residual    15660.2          196   79.899
   Total       17878.9          199
a. Predictors: (Constant), ses Ordinal3LevelSESScale, weight
WeightInKilograms, female DichotomousVariableForGender
b. Dependent Variable: write WritingTestScore
Slide 20
Weight is no longer statistically significant, probably because it is
associated with gender (girls are lighter than boys).

Coefficientsa
                                         Unstandardized Coefficients           95% Confidence Interval for B
Model                                    B        Std. Error   t       Sig.    Lower Bound   Upper Bound
1  (Constant)                            42.663   13.070       3.264   .001    16.887        68.439
   female DichotomousVariableForGender   5.515    1.934        2.852   .005    1.701         9.328
   weight WeightInKilograms              .013     .288         .046    .963    -.554         .580
   ses Ordinal3LevelSESScale             3.186    .882         3.611   .000    1.446         4.926
a. Dependent Variable: write WritingTestScore
Slide 21


A reasonable final linear regression model.

Coefficientsa
                                         Unstandardized Coefficients           95% Confidence Interval for B
Model                                    B        Std. Error   t        Sig.    Lower Bound   Upper Bound
1  (Constant)                            43.261   2.112        20.480   .000    39.096        47.427
   female DichotomousVariableForGender   5.448    1.276        4.269    .000    2.931         7.964
   ses Ordinal3LevelSESScale             3.185    .880         3.621    .000    1.450         4.919
a. Dependent Variable: write WritingTestScore
Slide 22


USING THE REGRESSION MODEL TO ESTIMATE (PREDICT) VALUE OF Y
AT SPECIFIC VALUES OF X.
The general regression equation for this model is
Y = a + b1x1 + b2x2
The specific regression equation is
writing score = 43.261 + 5.448(female) + 3.185(SES)
What writing score does this model predict:
for females when SES=2?
Y = 43.261 + 5.448(1) + 3.185(2) = 55.08
for males when SES=3?
Y = 43.261 + 5.448(0) + 3.185(3) = 52.82
for males when SES=0? (NOTE: This is not realistic.)
Y = 43.261 + 5.448(0) + 3.185(0) = 43.261. Note that this is the Y-
intercept. In a multivariable model, the Y-intercept is the predicted
value of Y when all the independent variables (X's) = 0.

Slide 23
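Optional: the predictions above written as a small Python function, using the coefficients of the final model on slide 22; the function and argument names are illustrative only.

def predicted_writing_score(female, ses):
    """Predicted writing score from the final linear model (slide 22).
    female: 0 = male, 1 = female; ses: 3-level socioeconomic scale."""
    return 43.261 + 5.448 * female + 3.185 * ses

print(predicted_writing_score(female=1, ses=2))   # 55.08, females with SES = 2
print(predicted_writing_score(female=0, ses=3))   # 52.82, males with SES = 3
print(predicted_writing_score(female=0, ses=0))   # 43.26, the Y-intercept (not realistic)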


Introduction to logistic regression
In linear regression we model a continuous dependent variable, Y, with
the regression function (RF) Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn. We
directly estimate the value of Y using this function. The Y-intercept and
the coefficients of the X's are derived by minimizing the sum of squared
deviations between the observed data and the regression function (least-
squares method).
But often we do not wish to model a continuous dependent variable.
Rather, we wish to model the probability of occurrence of a dichotomous
dependent variable. Such a dichotomous variable could be presence or
absence of an illness. We might wish to assess the influence of
independent variables on the probability of having the illness. We could
use linear regression to do this. However, in the real world, probability
can vary only between 0 and 1. In linear regression, as we have seen,
the model can predict Y-values <0 and >1. Thus, it is usually not
appropriate to use linear regression when we want to model probability
with dichotomous dependent variables, because linear regression can
predict probabilities that cannot exist in the real world.
Slide 24
So we need to find a way to ensure that modeled probability can vary only
between 0 and 1. To do this, we use binary logistic regression to model
the logarithm of the odds of occurrence of the dependent variable
(log(odds)). We do not model probability directly. To model log(odds), we
let Y = the modeled log(odds) = the regression function
a + b1X1 + b2X2 + b3X3 + ... + bnXn. Again, this is the same function as in
linear regression, but now it equals modeled log(odds). We will abbreviate
the regression function as RF. (The log(odds) is often called the logit.)
The base of natural logarithms is e, a fundamental constant, = 2.71828...
The antilog of a natural logarithm, X, = e^X, which we can write as exp(X).
Exp means exponential, same meaning as antilog.
Now consider probability (P) and odds (O). Recall that
O = P / (1 - P) and P = O / (O + 1).
Again, we model the log(odds) in logistic regression. Odds simply = the
antilog of the log odds = exp(log(odds)) =
exp(a + b1X1 + b2X2 + b3X3 + ... + bnXn) = exp(regression function) =
exp(RF)
So P = exp(RF) / (exp(RF) + 1). This is a logistic function, and it is at the
heart of logistic regression. It can also be written
P = 1 / (1 + 1/exp(RF)) = 1 / (1 + e^(-RF))
Slide 25
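Optional: a minimal Python sketch of these conversions between probability, odds, and log(odds); the function names are illustrative only.

import math

def odds_from_probability(p):
    return p / (1 - p)                            # O = P / (1 - P)

def probability_from_odds(o):
    return o / (o + 1)                            # P = O / (O + 1)

def probability_from_log_odds(rf):
    """rf = the regression function a + b1*X1 + ... + bn*Xn = modeled log(odds)."""
    return math.exp(rf) / (math.exp(rf) + 1)      # same as 1 / (1 + math.exp(-rf))

# The logistic function maps any log(odds) into the interval (0, 1):
for rf in (-5, -1, 0, 1, 5):
    print(rf, round(probability_from_log_odds(rf), 3))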
Odds can vary from zero to infinity, so log(odds) can vary from minus infinity
to plus infinity. However, because of the mathematical relationship between
odds and probability, the modeled probability cannot go below zero or above
1. This is realistic, because probability can only vary between 0 and 1 in the
real world. This is the main reason why we use logistic regression, not linear
regression, to model probability with dichotomous dependent variables.

[Graph: a logistic function relating log(odds), on the horizontal axis (about -60 to +60),
to probability, on the vertical axis (0.00 to 1.00). Log(odds) can vary from -infinity to
+infinity, but probability can vary only from 0 to 1. Recall that log(odds) = the
regression function.]
Slide 26
Visual comparison of linear regression (the red line) and logistic
regression (the blue curve). In the linear model, probability can go
below zero and above one. In the logistic model, probability is forced
to stay between zero and one. (LP = linear probability)

[Graph comparing the two models, with modeled probability on the vertical axis.]
Slide 27


Logistic regression calculates the parameters of the RF (a and the b's)
by a process called maximum likelihood estimation. This maximizes the
likelihood of occurrence of the observed data from a theoretical
underlying distribution. In logistic regression, the maximum likelihood
estimation process uses the binomial distribution. This is different from
the least-squares method used in linear regression.
The model constant or Y-intercept (a) is the modeled log odds when all
the independent variables equal zero. The coefficients (b1, b2, b3,...,bn)
are logs of odds ratios for a one-unit increase in each independent
variable (e.g., increase from 0 to 1 in the variable for gender, increase
of one level in a socioeconomic scale, increase of 1 kg in body weight)
To summarize, in binary logistic regression we obtain the modeled log
odds of occurrence of the dependent variable (e.g., occurrence of the
illness of interest). From this we can calculate the modeled odds. We
can then calculate the modeled probability from the modeled odds.
(We cannot calculate modeled probability from case-control data.)

(Note: there are types of logistic regression other than binary logistic.
In these the dependent variable is categorical but not dichotomous. We
will discuss only binary logistic regression in this class.)
Slide 28
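Optional: for readers working outside SPSS, a rough sketch of fitting a binary logistic model by maximum likelihood with the Python statsmodels package. The file name and column names (hiwrite, female, ses) are assumptions about how the SPSS data might be exported and would need checking.

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("HighSchoolTestScores.csv")   # hypothetical CSV export of the .sav file
X = sm.add_constant(df[["female", "ses"]])     # adds the model constant (Y-intercept)
model = sm.Logit(df["hiwrite"], X).fit()       # maximum likelihood estimation

print(model.summary())                         # coefficients are log odds ratios
print(np.exp(model.params))                    # exponentiate to get odds ratios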
Linear and logistic regression: some similarities and differences
Some similarities
Both enable us to evaluate more than one independent variable
simultaneously (at the same time).
Independent variables can be continuous or categorical in both linear
and logistic regression.
Both enable us to assess the relative importance of multiple independent
variables
Both can give modeled magnitudes of effect, and confidence intervals
and p-values, for independent variables.
Both use statistical models (regression models or prediction models) of
the form, Y = a + b1x1 + b2x2 + ... + bnxn.
Both derive specific constant values of the model coefficients (a and
b1 to bn).
Both allow calculation of the predicted (modeled) level of the dependent
variable at specific levels of the independent variables.
Slide 29
Some differences between linear regression and binary logistic regression

Characteristic              Linear Regression         Logistic Regression
Dependent variable          Continuous                Dichotomous (binary, 2 levels only),
                                                      e.g., does or does not have illness,
                                                      usually coded 0,1
Underlying probability      Normal (continuous        Binomial (dichotomous dependent
distribution                dependent variable)       variable)
What is modeled             The dependent             Log odds of occurrence of the
(predicted)                 variable itself           dependent variable, e.g., log odds of
                                                      illness prevalence
Y-intercept (constant)      Value of Y when all       Log odds when all Xs equal zero
                            Xs equal zero
Model coefficients (Bs)     Change in Y per unit      Log of the odds ratio per unit
                            increase in X             increase in X*

* We calculate the exponential (antilog) of the coefficient to obtain the odds ratio per
unit increase in the independent variable. For example, if the coefficient is 0.66, the
modeled odds ratio per unit increase is e^0.66 = 1.93.
Slide 30
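Optional: the footnote's example checked directly in Python; exponentiating a logistic regression coefficient gives the odds ratio per unit increase.

import math

coefficient = 0.66                   # a logistic regression coefficient (a log odds ratio)
odds_ratio = math.exp(coefficient)   # about 1.93, the modeled OR per unit increase
print(round(odds_ratio, 2))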
Some differences between linear and binary logistic regression (continued)

Characteristic                 Linear Regression              Logistic Regression
Model fitting procedure        Least-squares (minimizes       Maximum likelihood
(deriving the model's          the sum of squared
y-intercept and coefficients)  deviations from the
                               regression equation)
Evaluating how well the        R-squared (the proportion      Likelihood ratio chi-square
model describes (fits) the     of total variation             test for the model, d.f. =
data                           "explained" by the model).     no. of independent
                               Also the F-test for the        variables. Also Hosmer-
                               model                          Lemeshow chi-square test
Test statistic for individual  t-statistic with degrees of    Wald chi-square statistic
coefficients in the model      freedom = the residual d.f.*   with 1 d.f.**
Statistic used in calculating  t-statistic with 2-tailed      1.96 (z-statistic from the
the 95% confidence             probability = 0.05 and         normal distribution, for
interval                       degrees of freedom = the       which 2-tailed probability =
                               residual d.f.*                 0.05)

Numerator d.f. = no. of independent variables. Denominator d.f. = residual d.f.
* Residual degrees of freedom = (N - number of independent variables - 1)
** The Wald chi-square statistic = (coefficient / standard error)²
Slide 31
In the variable view of dataset
HighSchoolTestScoresPH507Chapman0813.sav, you will find a variable
called HIWRITE. This is a dichotomous variable with values 0 and 1.
0 = lower writing score, 1 = higher writing score. (In binary logistic
regression, we usually code the dependent variable 0,1)

We will use HIWRITE as the dependent variable in logistic regression.


We will do a bivariate analysis with the same 3 independent variables as
we used in linear regression.

With logistic regression we are not modeling the writing score directly,
as we did with linear regression.

Instead, we are modeling the log of the odds of having a high writing
score. We then use this model information to calculate probability of
having a high writing score, using the method given previously (log odds
to odds to probability).

Slide 32


In SPSS, go to "regression." Click "binary logistic." Put variable HIWRITE
in the box for the dependent variable. Put FEMALE in the box for
covariates (covariates are the same as independent variables). Click OK.
You will find the model summary at the end of the SPSS output.
Variables in the Equation
                     B       S.E.    Wald     df   Sig.    Exp(B)
Step 1a   female     .993    .302    10.837   1    .001    2.699
          Constant   .022    .210    .011     1    .917    1.022
a. Variable(s) entered on step 1: female.

The model coefficient, B, for being female is 0.993. The exponential of
0.993, Exp(B), is e^0.993 = 2.699. Remember that this is an odds ratio. In
this example the modeled odds of having a high writing score are 2.699
times higher in females than males. This difference is statistically
significant (p=0.001). The Constant is the Y-intercept, = the log(odds)
when all independent variables = 0.
When B = 0, exp(B) = 1. Exp(B) is the odds ratio (OR) per unit increase
in the independent variable, so when OR = 1 there is no association
between the independent and the dependent variable.
Slide 33
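Optional: a quick sketch checking the numbers in this output. Exp(B) is e raised to the coefficient, and the Wald statistic is (B / standard error) squared, as noted on slide 31.

import math

b = 0.993                       # coefficient for female
se = 0.302                      # its standard error

odds_ratio = math.exp(b)        # about 2.70, the modeled OR for females vs. males
wald = (b / se) ** 2            # about 10.8, compared with the chi-square distribution, 1 d.f.
print(round(odds_ratio, 3), round(wald, 2))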
Now use binary logistic regression to separately analyze relationships of weight
and SES with HIWRITE. You will find that each of these variables, like FEMALE,
is significantly associated with HIWRITE.

Variables in the Equation
                     B        S.E.    Wald    df   Sig.    Exp(B)
Step 1a   weight     -.108    .045    5.699   1    .017    .897
          Constant   5.083    1.919   7.019   1    .008    161.27
a. Variable(s) entered on step 1: weight.

Variables in the Equation
                     B       S.E.    Wald     df   Sig.    Exp(B)
Step 1a   ses        .683    .214    10.207   1    .001    1.979
          Constant   -.840   .447    3.537    1    .060    .432
a. Variable(s) entered on step 1: ses.

The modeled odds ratio for having a higher writing score, per kilogram increase in
weight, is 0.897. The modeled OR for having a higher writing score, per unit
increase in SES, is 1.979.
In the bivariate analysis in logistic regression, as in linear regression, all 3
independent variables are significantly associated with the dependent variable.
Slide 34
Here is the output for the logistic regression model in which all 3
independent variables are included.

Variables in the Equation
                     B        S.E.    Wald     df   Sig.    Exp(B)
Step 1a   female     1.348    .487    7.653    1    .006    3.851
          weight     .020     .071    .081     1    .776    1.020
          ses        .867     .233    13.868   1    .000    2.380
          Constant   -2.749   3.243   .719     1    .397    .064
a. Variable(s) entered on step 1: female, weight, ses.

As in the linear regression example, FEMALE and SES remain significant,
and WEIGHT completely loses significance. Now we will construct a new
model in which we include only FEMALE and SES.
Note: It is important to look at the standard errors (SEs) of the Bs for the
independent variables. In this example, these SEs are all < 1. This
increases confidence that the model is giving trustworthy information. If
one or more SEs are very large, a new model should be constructed. (This
does not apply to the SE of the constant.)
Slide 35
Model with FEMALE and SES (with confidence intervals).

Variables in the Equation
                                                                95.0% C.I. for EXP(B)
                     B        S.E.   Wald     df   Sig.   Exp(B)   Lower   Upper
Step 1a   female     1.246    .327   14.514   1    .000   3.477    1.831   6.601
          ses        .864     .232   13.821   1    .000   2.373    1.505   3.744
          Constant   -1.841   .547   11.314   1    .001   .159
a. Variable(s) entered on step 1: female, ses.
Again, the SEs are both < 1. This is good.

The regression function (model) is
LogOdds of high writing score = 1.246(female) + 0.864(ses) - 1.841.
In this new model, both independent variables have statistically significant
positive relationships with HIWRITE (ORs > 1). In this example the overall
impression from linear regression and logistic regression is similar. (This is
not always true.)
Slide 36
What are the modeled (predicted) log odds, odds, and probability of
having a high writing score when gender=female and SES=2?
Log odds = 1.246(1) + 0.864(2) - 1.841 = 1.133
Odds = exp(1.133) = 3.105
Probability = 3.105 / (3.105 + 1) = 0.756 = 75.6%
What are the modeled (predicted) log odds, odds, and probability of
having a high writing score when gender=male and SES=1?
Log odds = 1.246(0) + 0.864(1) - 1.841 = -0.977
Odds = exp(-0.977) = 0.376
Probability = 0.376 / (0.376 + 1) = 0.273 = 27.3%
What are the modeled (predicted) log odds, odds, and probability of
having a high writing score when gender=male and SES=0? (NOTE: this
is not realistic.)
Log odds = -1.841. This is simply the Y-intercept
Odds = exp(-1.841) = 0.159
Probability = 0.159 / (0.159 + 1) = 0.137 = 13.7%
What is the highest value in the data range of log odds, odds, and
probability?
Slide 37
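Optional: the three worked examples above follow the same log odds to odds to probability chain; here it is as a small Python function using the coefficients of the model on slide 36 (the function name is illustrative only).

import math

def predicted_probability(female, ses):
    """Modeled probability of a high writing score (model on slide 36).
    female: 0 = male, 1 = female; ses: 3-level socioeconomic scale."""
    log_odds = 1.246 * female + 0.864 * ses - 1.841
    odds = math.exp(log_odds)
    return odds / (odds + 1)

print(round(predicted_probability(female=1, ses=2), 3))   # 0.756 (75.6%)
print(round(predicted_probability(female=0, ses=1), 3))   # 0.273 (27.3%)
print(round(predicted_probability(female=0, ses=0), 3))   # 0.137 (13.7%), from the intercept alone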
IMPORTANT NOTE: Logistic regression can be used to calculate odds
ratios from case-control data. HOWEVER, it is not possible to calculate
modeled probability of disease with case-control data. Remember that
we cannot calculate probability of disease or odds of disease from case
control data. Therefore, we cannot use logistic regression to calculate
modeled probability of disease from case-control data.

NOTE: If your calculator does not accept negative logs, recall that
e^(negative log) = (1 / e^(positive log)). For example, e^(-3.143) = (1 / e^(3.143)). Your
calculator may be able to calculate (1 / e^(positive log)).

Slide 38


