
Logistic Regression - Part I

Statistics 149 Spring 2006

Copyright © 2006 by Mark E. Irwin

Logistic Regression Model


Let Yi be the response for the ith observation, where

Yi = 1 if success, 0 if failure.

Then Yi ~ Bin(1, π(xi)), where xi is the level of the predictor for observation i. An equivalent way of thinking of this is to model

E(Yi|xi) = π(xi)

One possible model would be

π(xi) = β0 + β1xi

As we saw last class, this model will give invalid values for π for extreme xs. In fact (assuming β1 > 0), there is a problem if

x < −β0/β1    or    x > (1 − β0)/β1

We would still like to keep a linear type predictor. One way to do this is to apply a link function g(·) to transform the mean to a linear function, i.e.

g(μ) = β0 + β1x

As mentioned last class, we want the function g(·) to transform the interval [0, 1] to (−∞, ∞). While there are many possible choices for this, there is one that matches up with what we have seen before, the logit transformation

logit(π) = log ω = log (π / (1 − π))

So instead of looking at things on the probability scale, let's look at things on the log odds (η) scale. Transforming back gives

ω = π / (1 − π) = e^η = e^(β0 + β1x)

and

π = ω / (1 + ω) = e^η / (1 + e^η) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

Note the standard Bernoulli distribution results hold here:

E(Y|X) = π = e^(β0 + β1x) / (1 + e^(β0 + β1x))

and

Var(Y|X) = π(1 − π) = e^(β0 + β1x) / (1 + e^(β0 + β1x))²
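As a quick aside (not in the original notes), these transformations are easy to compute directly in R. A minimal sketch, where the helper names logit and expit are just illustrative (base R's qlogis() and plogis() implement the same functions):

logit <- function(p) log(p / (1 - p))              # maps (0, 1) to (-Inf, Inf)
expit <- function(eta) exp(eta) / (1 + exp(eta))   # maps (-Inf, Inf) back to (0, 1)

p <- c(0.1, 0.5, 0.9)
all.equal(expit(logit(p)), p)   # TRUE: the two functions are inverses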


While the above just assumes a single predictor, it is trivial to extend this to multiple predictors; just set

logit(π) = β0 + β1x1 + ... + βpxp = Xβ

giving

ω = π / (1 − π) = e^(Xβ) = e^(β0 + β1x1 + ... + βpxp)

and

π = ω / (1 + ω) = e^(Xβ) / (1 + e^(Xβ)) = e^(β0 + β1x1 + ... + βpxp) / (1 + e^(β0 + β1x1 + ... + βpxp))


Example: Low Birth Weight in Infants

Hosmer and Lemeshow (1989) look at a data set on 189 births at Baystate Medical Center, Springfield, Mass. during 1986, with the main interest being in low birth weight.

low: birth weight less than 2.5 kg (0/1)
age: age of mother in years
lwt: weight of mother (lbs) at last menstrual period
race: white/black/other
smoke: smoking status during pregnancy
ptl: number of previous premature labours
ht: history of hypertension (0/1)
ui: has uterine irritability (0/1)
ftv: number of physician visits in first trimester
bwt: actual birth weight (grams)

We will focus on maternal age for now.

This data set is available in R in the data frame birthwt, though you may need to give the command library(MASS) to access it.
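For instance, a minimal sketch of getting the data and making the plot below (the jittering and labels are my choices, not from the notes):

library(MASS)     # contains the birthwt data frame
str(birthwt)      # 189 observations on low, age, lwt, race, smoke, ptl, ht, ui, ftv, bwt

# Plot the 0/1 low birth weight indicator against maternal age,
# with a little vertical jitter so overlapping points are visible
plot(jitter(low, amount = 0.02) ~ age, data = birthwt,
     xlab = "Maternal Age", ylab = "Low Birth Weight")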
[Figure: scatterplot of the 0/1 Low Birth Weight indicator against Maternal Age (roughly 15 to 45 years).]

Let's fit the model logit(π(age)) = β0 + β1·age in R with the function glm.

> birthwt.glm <- glm(low ~ age, data=birthwt, family=binomial)
> summary(birthwt.glm)

Call:
glm(formula = low ~ age, family = binomial, data = birthwt)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0402 -0.9018 -0.7754  1.4119  1.7800

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 231.91  on 187  degrees of freedom
AIC: 235.91

The fitted curves are

logit(π̂(age)) = 0.385 − 0.051·age

ω̂(age) = e^(0.385 − 0.051·age)

π̂(age) = e^(0.385 − 0.051·age) / (1 + e^(0.385 − 0.051·age))
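The fitted probability curve shown in the next figure can be computed with predict(); a minimal sketch (the age grid and plotting details are mine):

# Fitted probabilities pi-hat(age) over a grid of maternal ages
age.grid <- data.frame(age = 14:45)
pi.hat   <- predict(birthwt.glm, newdata = age.grid, type = "response")

# Overlay the fitted curve on the scatterplot of low vs age
plot(jitter(low, amount = 0.02) ~ age, data = birthwt,
     xlab = "Maternal Age", ylab = "Low Birth Weight")
lines(age.grid$age, pi.hat)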
[Figure: the same scatterplot of Low Birth Weight against Maternal Age (15 to 45), with the fitted curve π̂(age) overlaid.]


So in this case, it appears that as age increases, the probability/odds of having a low birth weight baby decreases.

What is the effect of changing x on logit(π), ω, and π? Let's see what happens as x goes to x + Δx in the single predictor case.

logit(π(x + Δx)) = β0 + β1(x + Δx) = β0 + β1x + β1Δx = logit(π(x)) + β1Δx

So the log odds work the same way as linear regression. Changing x by one leads to a change in log odds of β1.

ω(x + Δx) = e^(β0 + β1(x + Δx)) = e^(β0 + β1x + β1Δx) = ω(x)·e^(β1Δx) = ω(x)·(e^(β1))^Δx

So for this model, changing x has a multiplicative effect on the odds. Increasing x by 1 leads to multiplying the odds by e^(β1). Increasing x by another 1 leads to another multiplication by e^(β1).
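For the low birth weight fit above, this multiplicative factor can be read straight off the estimated coefficient (a small check, not part of the original slides; the helper omega() below is just for illustration):

exp(coef(birthwt.glm)["age"])   # about 0.95: estimated factor by which the odds change per year of age

# Equivalently, the ratio of fitted odds at age 26 versus age 25
omega <- function(a) exp(sum(coef(birthwt.glm) * c(1, a)))   # fitted odds at age a
omega(26) / omega(25)                                        # same value, about 0.95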

Another way of thinking of this is through the odds ratio

ω(x + Δx) / ω(x) = e^(β1Δx) = (e^(β1))^Δx

Note that the difference in odds depends on x (through the odds), as

ω(x + Δx) − ω(x) = ω(x)·(e^(β1Δx) − 1)

So the bigger ω(x), the bigger the absolute difference.

For π there is not a nice relationship, as π(x) has an S shape:

π(x + Δx) = e^(β0 + β1(x + Δx)) / (1 + e^(β0 + β1(x + Δx))) = π(x) + something ugly

As can be seen in the following figure, the change in π(x) depends on π(x), and not in a nice way. However, the biggest changes occur when π(x) ≈ 0.5, and the size of the change decreases as π(x) approaches 0 and 1.

[Figure: two panels showing the odds ω = e^η and the probability π = e^η / (1 + e^η) as functions of the log odds η. Increasing η by 1 always multiplies the odds by a factor of e ≈ 2.718, but the corresponding change in π depends on where you are on the curve (a change of about 0.231 near π = 0.5 versus about 0.072 further out).]


These formulas also imply that the sign of β1 indicates whether ω(x) and π(x) increase (β1 > 0) or decrease (β1 < 0) as x increases.

So in the example, since β̂1 = −0.051, older mothers should be less likely to have low birth weight babies (ignoring the effects of other predictors). More precisely, each additional year of age lowers the odds of a low birth weight birth by a factor of e^(−0.051) = 0.95 per year. Actually, there isn't enough evidence to declare age to be statistically significant, but the data does suggest the previous statements.

In the case of multiple predictors, you need to be a bit more careful. If there are no interaction terms in the model, e.g. nothing like

logit(π) = β0 + β1x1 + β2x2 + β12x1x2

then the previous ideas go through if you fix all but one xj and only allow one to vary.

For example, for the model

logit(π(x1, x2)) = β0 + β1x1 + β2x2

the log odds satisfy

logit(π(x1 + Δx, x2)) = β0 + β1(x1 + Δx) + β2x2 = β0 + β1x1 + β1Δx + β2x2 = logit(π(x1, x2)) + β1Δx

implying that the odds satisfy

ω(x1 + Δx, x2) = e^(β0 + β1(x1 + Δx) + β2x2) = e^(β0 + β1x1 + β1Δx + β2x2) = ω(x1, x2)·e^(β1Δx) = ω(x1, x2)·(e^(β1))^Δx

However, for the interaction model

logit(π(x1, x2)) = β0 + β1x1 + β2x2 + β12x1x2

the log odds satisfy

logit(π(x1 + Δx, x2)) = β0 + β1(x1 + Δx) + β2x2 + β12(x1 + Δx)x2
                      = β0 + β1x1 + β2x2 + β12x1x2 + Δx(β1 + β12x2)
                      = logit(π(x1, x2)) + Δx(β1 + β12x2)

implying the effect of changing x1 depends on the level of x2, as

ω(x1 + Δx, x2) = e^(β0 + β1(x1 + Δx) + β2x2 + β12(x1 + Δx)x2)
               = e^(β0 + β1x1 + β2x2 + β12x1x2 + Δx(β1 + β12x2))
               = ω(x1, x2)·e^(Δx(β1 + β12x2))
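In R, such an interaction model can be specified with * (or :) in the model formula. A minimal sketch using the mother's weight lwt as a second predictor (this particular choice is just for illustration):

# Main effects plus interaction: logit(pi) = b0 + b1*age + b2*lwt + b12*age*lwt
birthwt.int <- glm(low ~ age * lwt, data = birthwt, family = binomial)
summary(birthwt.int)

# With the interaction term in the model, the change in log odds for a
# one-unit increase in age is beta_age + beta_age:lwt * lwt, so it depends on lwt.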


Fitting the Logistic Regression Model


One approach you might think of for fitting the model would be to do least squares on the data (xi, logit(Yi)). Unfortunately this approach has some problems.

If X ~ Bin(n, π) and p̂ = X/n, then

Var(logit(p̂)) ≈ 1 / (nπ(1 − π))

so we don't have constant variance.

If Y ~ Bin(1, π), then logit(Y) equals −∞ or ∞.

While there are ways around this (weighted least squares & fudging the Yis), a better approach is maximum likelihood.

Maximum Likelihood Estimation


Let's assume that Y1, Y2, ..., Yn are an independent random sample from a population with a distribution described by the density (or pmf if discrete) f(Yi|θ). The parameter θ might be a vector θ = (θ1, θ2, ..., θp). Then the likelihood function is

L(θ) = ∏_{i=1}^n f(Yi|θ)

The maximum likelihood estimate (MLE) of θ is

θ̂ = arg sup L(θ)

i.e. the value of θ that maximizes the likelihood function. One way of thinking of the MLE is that it's the value of the parameter that is most consistent with the data.
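As a toy illustration of this definition (not from the notes), the sketch below finds the MLE of a Bernoulli success probability numerically; for this simple model the answer is just the sample proportion, and the made-up data are only for demonstration.

# Numerical MLE of pi for an iid Bernoulli(pi) sample
y <- c(1, 0, 0, 1, 1, 0, 1, 1, 0, 1)               # made-up 0/1 data
lik <- function(pi) prod(dbinom(y, size = 1, prob = pi))
optimize(lik, interval = c(0, 1), maximum = TRUE)$maximum
mean(y)   # the closed form MLE (0.6); the numerical answer should agree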

So for logistic regression, the likelihood function has the form

L(β) = ∏_{i=1}^n πi^yi (1 − πi)^(1 − yi)
     = ∏_{i=1}^n (πi / (1 − πi))^yi (1 − πi)
     = ∏_{i=1}^n ωi^yi (1 − πi)
     = ∏_{i=1}^n (e^(β0 + β1xi))^yi · (1 / (1 + e^(β0 + β1xi)))

where logit(πi) = log ωi = β0 + β1xi.


One approach to maximizing the likelihood is via calculus, by solving the equations

∂L(θ)/∂θ1 = 0,  ∂L(θ)/∂θ2 = 0,  ...,  ∂L(θ)/∂θp = 0

with respect to the parameter θ.

Note that when determining MLEs, it is usually easier to work with the log likelihood function

l(θ) = log L(θ) = Σ_{i=1}^n log f(Yi|θ)

It has the same optimum since log is an increasing function, and it is easier to work with since derivatives of sums are usually much nicer than derivatives of products.


Thus we can solve the score equations

∂l(θ)/∂θ1 = Σ_{i=1}^n ∂ log f(Yi|θ)/∂θ1 = 0
∂l(θ)/∂θ2 = Σ_{i=1}^n ∂ log f(Yi|θ)/∂θ2 = 0
...
∂l(θ)/∂θp = Σ_{i=1}^n ∂ log f(Yi|θ)/∂θp = 0

for θ instead.


For logistic regression, the log likelihood function is

l(β) = Σ_{i=1}^n (yi log πi + (1 − yi) log(1 − πi))
     = Σ_{i=1}^n (yi log ωi + log(1 − πi))
     = Σ_{i=1}^n (yi(β0 + β1xi) − log(1 + e^(β0 + β1xi)))
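To make this concrete, here is a minimal sketch (the function names are mine) that codes this log likelihood directly, maximizes it numerically, and compares the result with glm():

# Log likelihood l(beta) for the model logit(pi_i) = b0 + b1 * age_i
loglik.logistic <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x
  sum(y * eta - log(1 + exp(eta)))
}

# optim() minimizes by default; fnscale = -1 makes it maximize
fit <- optim(c(0, 0), loglik.logistic,
             x = birthwt$age, y = birthwt$low,
             control = list(fnscale = -1))
fit$par             # should be close to (0.38458, -0.05115)
coef(birthwt.glm)   # the glm() answer for comparison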


Normally there are no closed form solutions to these equations, as can be seen from the score equations

∂l(β)/∂β0 = Σ_{i=1}^n ( yi − e^(β0 + β1xi) / (1 + e^(β0 + β1xi)) ) = 0

∂l(β)/∂β1 = Σ_{i=1}^n ( xiyi − xi e^(β0 + β1xi) / (1 + e^(β0 + β1xi)) ) = 0

These equations will need to be solved by numerical methods, such as Newton-Raphson or iteratively reweighted least squares. However, there are some special cases, which we will discuss later, where there are closed form solutions (a nice function of the (xi, yi)).
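Below is a rough sketch (my own, not from the notes) of Newton-Raphson in its iteratively reweighted least squares form for the single-predictor model above; glm() does essentially this, with much more care about starting values and convergence:

# IRLS for logit(pi_i) = b0 + b1 * x_i
x <- birthwt$age; y <- birthwt$low
X <- cbind(1, x)                      # design matrix
beta <- c(0, 0)                       # starting values
for (iter in 1:25) {
  eta <- drop(X %*% beta)
  pi  <- exp(eta) / (1 + exp(eta))    # current fitted probabilities
  W   <- pi * (1 - pi)                # working weights
  z   <- eta + (y - pi) / W           # working response
  beta <- drop(solve(t(X) %*% (W * X), t(X) %*% (W * z)))   # weighted LS step
}
beta                                  # should be close to coef(birthwt.glm)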


Key Properties of MLEs


1. For large n, MLEs are nearly unbiased (they are consistent).

2. Var(θ̂) can be estimated. The information matrix I(θ) is a p × p matrix with entries

   Iij(θ) = −∂²l(θ)/∂θi∂θj

   Then the inverse of the observed information matrix satisfies

   Var(θ̂) ≈ I⁻¹(θ̂)

3. Among approximately unbiased estimators, the MLE has a variance smaller than any other estimator.

4. For large n, the sampling distribution of an MLE is approximately normal. This implies we can get confidence intervals for the βi easily.

5. If θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ), for any nice function g(·). (Transformations of MLEs are MLEs - the invariance property.)

An example of where this is useful is the estimation of success probabilities in logistic regression. So

π̂(x) = e^(β̂0 + β̂1x) / (1 + e^(β̂0 + β̂1x))

is the MLE of

π(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

So the fitted curve on the earlier plot is the MLE of the probabilities of a low birth weight for ages 14 to 45.


Least Squares Versus Maximum Likelihood in Normal Based Regression


One question some of you may be asking is: if maximum likelihood is so nice, why was least squares used earlier in the book for regression?

If Yi|Xi ~ind N(Xiβ, σ²), i = 1, ..., n, then the least squares and the maximum likelihood estimates of β are exactly the same. The log likelihood function in this case can be written as

l(β, σ²) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (yi − β0 − β1xi1 − ... − βpxip)²

So maximizing this with respect to β is the same as minimizing the least squares criterion.

The one place where there is a slight difference is in estimating σ². The MLE is

σ̂²_MLE = SSE/n = ((n − p − 1)/n) σ̂²

where σ̂² = SSE/(n − p − 1) is the usual unbiased method of moments estimator.
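A quick numerical check of this relationship, using a normal linear model fit to the birth weight data (the choice of bwt ~ age is mine, purely for illustration):

# Compare the unbiased estimator and the MLE of sigma^2
fit.lm <- lm(bwt ~ age, data = birthwt)
n <- nrow(birthwt); p <- 1                 # one predictor plus an intercept
SSE <- sum(resid(fit.lm)^2)

SSE / (n - p - 1)   # unbiased estimator; equals summary(fit.lm)$sigma^2
SSE / n             # MLE: smaller by the factor (n - p - 1)/n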


Inference on Individual βs
As mentioned before, β̂j is approximately normally distributed with mean βj and a variance we can estimate. So we can base our inference on the result

z = (β̂j − βj) / SE(β̂j)  ~approx.  N(0, 1)

Thus an approximate confidence interval for βj is

β̂j ± z_(α/2) SE(β̂j)

where z_(α/2) is the usual normal critical value.

While R doesn't give you these CIs directly, it does give you the information needed to calculate them.

> birthwt.glm <- glm(low ~ age, data=birthwt, family=binomial)
> summary(birthwt.glm)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0402 -0.9018 -0.7754  1.4119  1.7800

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 231.91  on 187  degrees of freedom
AIC: 235.91


In addition, you can get the β̂s and standard errors into vectors, and create the confidence intervals, by the following code

> betahat <- coef(birthwt.glm)
> betahat
(Intercept)         age
 0.38458192 -0.05115294
> se.betahat <- sqrt(diag(vcov(birthwt.glm)))
> se.betahat
(Intercept)         age
 0.73212479  0.03151376
> me.betahat <- qnorm(0.975) * se.betahat
> ci.betahat <- cbind(Lower=betahat - me.betahat,
+                     Upper=betahat + me.betahat)
> ci.betahat   # 95% approximate CIs
                 Lower      Upper
(Intercept) -1.0503563 1.81952014
age         -0.1129188 0.01061290
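As a side note (not in the original slides), confint() can also produce intervals for glm fits directly; with MASS loaded it uses profile likelihood rather than the normal approximation, so its answers need not match the Wald intervals above exactly:

> confint(birthwt.glm)           # profile likelihood intervals
> confint.default(birthwt.glm)   # Wald intervals; matches the hand calculation above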

In addition, testing the hypothesis H0: βj = 0 vs HA: βj ≠ 0 is usually done by Wald's test

z = β̂j / SE(β̂j)

which is compared to the N(0, 1) distribution. This is given in the standard R output with the summary command.


> birthwt.glm <- glm(low ~ age, data=birthwt, family=binomial)
> summary(birthwt.glm)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

So in this case age does not appear to be statistically significant. However, remember there are a number of potential confounders ignored in this analysis, so you may not want to read too much into this.

As in regular regression, inference on the intercept usually is not interesting. In this case β0 gives information about mothers with age = 0, a situation that can't happen.
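Those z values and p-values can be reproduced by hand from the betahat and se.betahat vectors computed earlier (a small check, not part of the original slides):

> z <- betahat / se.betahat   # Wald statistics for H0: beta_j = 0
> 2 * pnorm(-abs(z))          # two-sided p-values: about 0.599 for the intercept, 0.105 for age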

