
Logistic Regression - Part I

Statistics 149 Spring 2006

Copyright © 2006 by Mark E. Irwin

Logistic Regression Model


Let Yi be the response for the ith observation, where

Yi = 1 if success, 0 if failure.

Then Yi ~ Bin(1, π(xi)), where xi is the level of the predictor for observation i. An equivalent way of thinking of this is to model

E(Yi|xi) = π(xi)

One possible model would be

π(xi) = β0 + β1xi

As we saw last class, this model will give invalid values for π for extreme xs. In fact (assuming β1 > 0), there is a problem if

x < −β0/β1    or    x > (1 − β0)/β1

We would still like to keep a linear type predictor. One way to do this is to apply a link function g(·) to transform the mean to a linear function, i.e.

g(μ) = β0 + β1x

As mentioned last class, we want the function g(·) to transform the interval [0, 1] to (−∞, ∞). While there are many possible choices for this, there is one that matches up with what we have seen before, the logit transformation

logit(π) = log ω = log (π / (1 − π))

So instead of looking at things on the probability scale, let's look at things on the log odds (η) scale. Transforming back gives

ω = π / (1 − π) = e^η = e^(β0 + β1x)

and

π = ω / (1 + ω) = e^η / (1 + e^η) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

Note the standard Bernoulli distribution results hold here:

E(Y|X) = π = e^(β0 + β1x) / (1 + e^(β0 + β1x))

and

Var(Y|X) = π(1 − π) = e^(β0 + β1x) / (1 + e^(β0 + β1x))²
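As a quick aside (not in the original notes), these transformations are easy to compute directly in R. A minimal sketch, where the helper names logit and expit are just illustrative (base R's qlogis() and plogis() implement the same functions):

logit <- function(p) log(p / (1 - p))              # maps (0, 1) to (-Inf, Inf)
expit <- function(eta) exp(eta) / (1 + exp(eta))   # maps (-Inf, Inf) back to (0, 1)

p <- c(0.1, 0.5, 0.9)
all.equal(expit(logit(p)), p)   # TRUE: the two functions are inverses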


While the above just assumes a single predictor, it is trivial to extend this to multiple predictors; just set

logit(π) = β0 + β1x1 + ... + βpxp = Xβ

giving

ω = π / (1 − π) = e^(Xβ) = e^(β0 + β1x1 + ... + βpxp)

and

π = ω / (1 + ω) = e^(Xβ) / (1 + e^(Xβ)) = e^(β0 + β1x1 + ... + βpxp) / (1 + e^(β0 + β1x1 + ... + βpxp))


Example: Low Birth Weight in Infants

Hosmer and Lemeshow (1989) look at a data set on 189 births at Baystate Medical Center, Springfield, Mass. during 1986, with the main interest being in low birth weight.

low: birth weight less than 2.5 kg (0/1)
age: age of mother in years
lwt: weight of mother (lbs) at last menstrual period
race: white/black/other
smoke: smoking status during pregnancy
ptl: number of previous premature labours
ht: history of hypertension (0/1)
ui: has uterine irritability (0/1)
ftv: number of physician visits in first trimester
bwt: actual birth weight (grams)

We will focus on maternal age for now.

This data set is available in R in the data frame birthwt, though you may need to give the command library(MASS) to access it.
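For instance, a minimal sketch of getting the data and making the plot below (the jittering and labels are my choices, not from the notes):

library(MASS)     # contains the birthwt data frame
str(birthwt)      # 189 observations on low, age, lwt, race, smoke, ptl, ht, ui, ftv, bwt

# Plot the 0/1 low birth weight indicator against maternal age,
# with a little vertical jitter so overlapping points are visible
plot(jitter(low, amount = 0.02) ~ age, data = birthwt,
     xlab = "Maternal Age", ylab = "Low Birth Weight")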
[Figure: scatterplot of the 0/1 Low Birth Weight indicator against Maternal Age (roughly 15 to 45 years).]

Let's fit the model logit(π(age)) = β0 + β1·age in R with the function glm.

> birthwt.glm <- glm(low ~ age, data=birthwt, family=binomial)
> summary(birthwt.glm)

Call:
glm(formula = low ~ age, family = binomial, data = birthwt)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0402 -0.9018 -0.7754  1.4119  1.7800

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 231.91  on 187  degrees of freedom
AIC: 235.91

The fitted curves are

logit(π̂(age)) = 0.385 − 0.051·age

ω̂(age) = e^(0.385 − 0.051·age)

π̂(age) = e^(0.385 − 0.051·age) / (1 + e^(0.385 − 0.051·age))
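The fitted probability curve shown in the next figure can be computed with predict(); a minimal sketch (the age grid and plotting details are mine):

# Fitted probabilities pi-hat(age) over a grid of maternal ages
age.grid <- data.frame(age = 14:45)
pi.hat   <- predict(birthwt.glm, newdata = age.grid, type = "response")

# Overlay the fitted curve on the scatterplot of low vs age
plot(jitter(low, amount = 0.02) ~ age, data = birthwt,
     xlab = "Maternal Age", ylab = "Low Birth Weight")
lines(age.grid$age, pi.hat)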
[Figure: the same scatterplot of Low Birth Weight against Maternal Age (15 to 45), with the fitted curve π̂(age) overlaid.]


So in this case, it appears that as age increases, the probability/odds of having a low birth weight baby decreases.

What is the effect of changing x on logit(π), ω, and π? Let's see what happens as x goes to x + Δx in the single predictor case.

logit(π(x + Δx)) = β0 + β1(x + Δx) = β0 + β1x + β1Δx = logit(π(x)) + β1Δx

So the log odds work the same way as linear regression. Changing x by one leads to a change in log odds of β1.

ω(x + Δx) = e^(β0 + β1(x + Δx)) = e^(β0 + β1x + β1Δx) = ω(x)·e^(β1Δx) = ω(x)·(e^(β1))^Δx

So for this model, changing x has a multiplicative effect on the odds. Increasing x by 1 leads to multiplying the odds by e^(β1). Increasing x by another 1 leads to another multiplication by e^(β1).
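For the low birth weight fit above, this multiplicative factor can be read straight off the estimated coefficient (a small check, not part of the original slides; the helper omega() below is just for illustration):

exp(coef(birthwt.glm)["age"])   # about 0.95: estimated factor by which the odds change per year of age

# Equivalently, the ratio of fitted odds at age 26 versus age 25
omega <- function(a) exp(sum(coef(birthwt.glm) * c(1, a)))   # fitted odds at age a
omega(26) / omega(25)                                        # same value, about 0.95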

Another way of thinking of this is through the odds ratio

ω(x + Δx) / ω(x) = e^(β1Δx) = (e^(β1))^Δx

Note that the difference in odds depends on x (through the odds), as

ω(x + Δx) − ω(x) = ω(x)·(e^(β1Δx) − 1)

So the bigger ω(x), the bigger the absolute difference.

For π there is not a nice relationship, as π(x) has an S shape:

π(x + Δx) = e^(β0 + β1(x + Δx)) / (1 + e^(β0 + β1(x + Δx))) = π(x) + something ugly

As can be seen in the following figure, the change in π(x) depends on π(x), and not in a nice way. However, the biggest changes occur when π(x) ≈ 0.5, and the size of the change decreases as π(x) approaches 0 and 1.

[Figure: two panels showing the odds ω = e^η and the probability π = e^η / (1 + e^η) as functions of the log odds η. Increasing η by 1 always multiplies the odds by a factor of e ≈ 2.718, but the corresponding change in π depends on where you are on the curve (a change of about 0.231 near π = 0.5 versus about 0.072 further out).]


These formulas also imply that the sign of β1 indicates whether ω(x) and π(x) increase (β1 > 0) or decrease (β1 < 0) as x increases.

So in the example, since β̂1 = −0.051, older mothers should be less likely to have low birth weight babies (ignoring the effects of other predictors). More precisely, each additional year of age lowers the odds of a low birth weight birth by a factor of e^(−0.051) = 0.95 per year. Actually, there isn't enough evidence to declare age to be statistically significant, but the data does suggest the previous statements.

In the case of multiple predictors, you need to be a bit more careful. If there are no interaction terms in the model, e.g. nothing like

logit(π) = β0 + β1x1 + β2x2 + β12x1x2

then the previous ideas go through if you fix all but one xj and only allow one to vary.

For example, for the model

logit(π(x1, x2)) = β0 + β1x1 + β2x2

the log odds satisfy

logit(π(x1 + Δx, x2)) = β0 + β1(x1 + Δx) + β2x2 = β0 + β1x1 + β1Δx + β2x2 = logit(π(x1, x2)) + β1Δx

implying that the odds satisfy

ω(x1 + Δx, x2) = e^(β0 + β1(x1 + Δx) + β2x2) = e^(β0 + β1x1 + β1Δx + β2x2) = ω(x1, x2)·e^(β1Δx) = ω(x1, x2)·(e^(β1))^Δx

However, for the interaction model

logit(π(x1, x2)) = β0 + β1x1 + β2x2 + β12x1x2

the log odds satisfy

logit(π(x1 + Δx, x2)) = β0 + β1(x1 + Δx) + β2x2 + β12(x1 + Δx)x2
                      = β0 + β1x1 + β2x2 + β12x1x2 + Δx(β1 + β12x2)
                      = logit(π(x1, x2)) + Δx(β1 + β12x2)

implying the effect of changing x1 depends on the level of x2, as

ω(x1 + Δx, x2) = e^(β0 + β1(x1 + Δx) + β2x2 + β12(x1 + Δx)x2)
               = e^(β0 + β1x1 + β2x2 + β12x1x2 + Δx(β1 + β12x2))
               = ω(x1, x2)·e^(Δx(β1 + β12x2))
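In R, such an interaction model can be specified with * (or :) in the model formula. A minimal sketch using the mother's weight lwt as a second predictor (this particular choice is just for illustration):

# Main effects plus interaction: logit(pi) = b0 + b1*age + b2*lwt + b12*age*lwt
birthwt.int <- glm(low ~ age * lwt, data = birthwt, family = binomial)
summary(birthwt.int)

# With the interaction term in the model, the change in log odds for a
# one-unit increase in age is beta_age + beta_age:lwt * lwt, so it depends on lwt.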


Fitting the Logistic Regression Model


One approach you might think of for fitting the model would be to do least squares on the data (xi, logit(Yi)). Unfortunately this approach has some problems.

If X ~ Bin(n, π) and p̂ = X/n, then

Var(logit(p̂)) ≈ 1 / (nπ(1 − π))

so we don't have constant variance.

If Y ~ Bin(1, π), then logit(Y) equals −∞ or ∞.

While there are ways around this (weighted least squares & fudging the Yis), a better approach is maximum likelihood.

Maximum Likelihood Estimation


Let's assume that Y1, Y2, ..., Yn are an independent random sample from a population with a distribution described by the density (or pmf if discrete) f(Yi|θ). The parameter θ might be a vector θ = (θ1, θ2, ..., θp). Then the likelihood function is

L(θ) = ∏_{i=1}^n f(Yi|θ)

The maximum likelihood estimate (MLE) of θ is

θ̂ = arg sup L(θ)

i.e. the value of θ that maximizes the likelihood function. One way of thinking of the MLE is that it's the value of the parameter that is most consistent with the data.
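As a toy illustration of this definition (not from the notes), the sketch below finds the MLE of a Bernoulli success probability numerically; for this simple model the answer is just the sample proportion, and the made-up data are only for demonstration.

# Numerical MLE of pi for an iid Bernoulli(pi) sample
y <- c(1, 0, 0, 1, 1, 0, 1, 1, 0, 1)               # made-up 0/1 data
lik <- function(pi) prod(dbinom(y, size = 1, prob = pi))
optimize(lik, interval = c(0, 1), maximum = TRUE)$maximum
mean(y)   # the closed form MLE (0.6); the numerical answer should agree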

So for logistic regression, the likelihood function has the form

L(β) = ∏_{i=1}^n πi^yi (1 − πi)^(1 − yi)
     = ∏_{i=1}^n (πi / (1 − πi))^yi (1 − πi)
     = ∏_{i=1}^n ωi^yi (1 − πi)
     = ∏_{i=1}^n (e^(β0 + β1xi))^yi · (1 / (1 + e^(β0 + β1xi)))

where logit(πi) = log ωi = β0 + β1xi.


One approach to maximizing the likelihood is via calculus, by solving the equations

∂L(θ)/∂θ1 = 0,  ∂L(θ)/∂θ2 = 0,  ...,  ∂L(θ)/∂θp = 0

with respect to the parameter θ.

Note that when determining MLEs, it is usually easier to work with the log likelihood function

l(θ) = log L(θ) = Σ_{i=1}^n log f(Yi|θ)

It has the same optimum since log is an increasing function, and it is easier to work with since derivatives of sums are usually much nicer than derivatives of products.


Thus we can solve the score equations

∂l(θ)/∂θ1 = Σ_{i=1}^n ∂ log f(Yi|θ)/∂θ1 = 0
∂l(θ)/∂θ2 = Σ_{i=1}^n ∂ log f(Yi|θ)/∂θ2 = 0
...
∂l(θ)/∂θp = Σ_{i=1}^n ∂ log f(Yi|θ)/∂θp = 0

for θ instead.


For logistic regression, the log likelihood function is

l(β) = Σ_{i=1}^n (yi log πi + (1 − yi) log(1 − πi))
     = Σ_{i=1}^n (yi log ωi + log(1 − πi))
     = Σ_{i=1}^n (yi(β0 + β1xi) − log(1 + e^(β0 + β1xi)))
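To make this concrete, here is a minimal sketch (the function names are mine) that codes this log likelihood directly, maximizes it numerically, and compares the result with glm():

# Log likelihood l(beta) for the model logit(pi_i) = b0 + b1 * age_i
loglik.logistic <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x
  sum(y * eta - log(1 + exp(eta)))
}

# optim() minimizes by default; fnscale = -1 makes it maximize
fit <- optim(c(0, 0), loglik.logistic,
             x = birthwt$age, y = birthwt$low,
             control = list(fnscale = -1))
fit$par             # should be close to (0.38458, -0.05115)
coef(birthwt.glm)   # the glm() answer for comparison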


Normally there are no closed form solutions to these equations, as can be seen from the score equations

∂l(β)/∂β0 = Σ_{i=1}^n ( yi − e^(β0 + β1xi) / (1 + e^(β0 + β1xi)) ) = 0

∂l(β)/∂β1 = Σ_{i=1}^n ( xiyi − xi e^(β0 + β1xi) / (1 + e^(β0 + β1xi)) ) = 0

These equations will need to be solved by numerical methods, such as Newton-Raphson or iteratively reweighted least squares. However, there are some special cases, which we will discuss later, where there are closed form solutions (a nice function of the (xi, yi)).
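Below is a rough sketch (my own, not from the notes) of Newton-Raphson in its iteratively reweighted least squares form for the single-predictor model above; glm() does essentially this, with much more care about starting values and convergence:

# IRLS for logit(pi_i) = b0 + b1 * x_i
x <- birthwt$age; y <- birthwt$low
X <- cbind(1, x)                      # design matrix
beta <- c(0, 0)                       # starting values
for (iter in 1:25) {
  eta <- drop(X %*% beta)
  pi  <- exp(eta) / (1 + exp(eta))    # current fitted probabilities
  W   <- pi * (1 - pi)                # working weights
  z   <- eta + (y - pi) / W           # working response
  beta <- drop(solve(t(X) %*% (W * X), t(X) %*% (W * z)))   # weighted LS step
}
beta                                  # should be close to coef(birthwt.glm)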


Key Properties of MLEs


1. For large n, MLEs are nearly unbiased (they are consistent).

2. Var(θ̂) can be estimated. The information matrix I(θ) is a p × p matrix with entries

   Iij(θ) = −∂²l(θ)/∂θi∂θj

   Then the inverse of the observed information matrix satisfies

   Var(θ̂) ≈ I⁻¹(θ̂)

3. Among approximately unbiased estimators, the MLE has a variance smaller than any other estimator.

4. For large n, the sampling distribution of an MLE is approximately normal. This implies we can get confidence intervals for the βi easily.

5. If θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ), for any nice function g(·). (Transformations of MLEs are MLEs - the invariance property.)

An example of where this is useful is the estimation of success probabilities in logistic regression. So

π̂(x) = e^(β̂0 + β̂1x) / (1 + e^(β̂0 + β̂1x))

is the MLE of

π(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

So the fitted curve on the earlier plot is the MLE of the probabilities of a low birth weight for ages 14 to 45.


Least Squares Versus Maximum Likelihood in Normal Based Regression


One question some of you may be asking is: if maximum likelihood is so nice, why was least squares used earlier in the book for regression?

If Yi|Xi ~ind N(Xiβ, σ²), i = 1, ..., n, then the least squares and the maximum likelihood estimates of β are exactly the same. The log likelihood function in this case can be written as

l(β, σ²) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (yi − β0 − β1xi1 − ... − βpxip)²

So maximizing this with respect to β is the same as minimizing the least squares criterion.

The one place where there is a slight difference is in estimating σ². The MLE is

σ̂²_MLE = SSE/n = ((n − p − 1)/n) σ̂²

where σ̂² = SSE/(n − p − 1) is the usual unbiased method of moments estimator.
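A quick numerical check of this relationship, using a normal linear model fit to the birth weight data (the choice of bwt ~ age is mine, purely for illustration):

# Compare the unbiased estimator and the MLE of sigma^2
fit.lm <- lm(bwt ~ age, data = birthwt)
n <- nrow(birthwt); p <- 1                 # one predictor plus an intercept
SSE <- sum(resid(fit.lm)^2)

SSE / (n - p - 1)   # unbiased estimator; equals summary(fit.lm)$sigma^2
SSE / n             # MLE: smaller by the factor (n - p - 1)/n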


Inference on Individual βs
As mentioned before, β̂j is approximately normally distributed with mean βj and a variance we can estimate. So we can base our inference on the result

z = (β̂j − βj) / SE(β̂j)  ~approx.  N(0, 1)

Thus an approximate confidence interval for βj is

β̂j ± z_(α/2) SE(β̂j)

where z_(α/2) is the usual normal critical value.

While R doesn't give you these CIs directly, it does give you the information needed to calculate them.

> birthwt.glm <- glm(low ~ age, data=birthwt, family=binomial)
> summary(birthwt.glm)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0402 -0.9018 -0.7754  1.4119  1.7800

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 231.91  on 187  degrees of freedom
AIC: 235.91


In addition, you can get the β̂s and standard errors into vectors, and create the confidence intervals, by the following code

> betahat <- coef(birthwt.glm)
> betahat
(Intercept)         age
 0.38458192 -0.05115294
> se.betahat <- sqrt(diag(vcov(birthwt.glm)))
> se.betahat
(Intercept)         age
 0.73212479  0.03151376
> me.betahat <- qnorm(0.975) * se.betahat
> ci.betahat <- cbind(Lower=betahat - me.betahat,
+                     Upper=betahat + me.betahat)
> ci.betahat   # 95% approximate CIs
                 Lower      Upper
(Intercept) -1.0503563 1.81952014
age         -0.1129188 0.01061290
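As a side note (not in the original slides), confint() can also produce intervals for glm fits directly; with MASS loaded it uses profile likelihood rather than the normal approximation, so its answers need not match the Wald intervals above exactly:

> confint(birthwt.glm)           # profile likelihood intervals
> confint.default(birthwt.glm)   # Wald intervals; matches the hand calculation above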

In addition, testing the hypothesis H0: βj = 0 vs HA: βj ≠ 0 is usually done by Wald's test

z = β̂j / SE(β̂j)

which is compared to the N(0, 1) distribution. This is given in the standard R output with the summary command.


> birthwt.glm <- glm(low ~ age, data=birthwt, family=binomial)
> summary(birthwt.glm)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

So in this case age does not appear to be statistically significant. However, remember there are a number of potential confounders ignored in this analysis, so you may not want to read too much into this.

As in regular regression, inference on the intercept usually is not interesting. In this case β0 gives information about mothers with age = 0, a situation that can't happen.
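Those z values and p-values can be reproduced by hand from the betahat and se.betahat vectors computed earlier (a small check, not part of the original slides):

> z <- betahat / se.betahat   # Wald statistics for H0: beta_j = 0
> 2 * pnorm(-abs(z))          # two-sided p-values: about 0.599 for the intercept, 0.105 for age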

