Then $Y_i \sim \text{Bin}(1, \pi(x_i))$, where $x_i$ is the level of the predictor for observation $i$. An equivalent way of thinking of this is to model $E(Y_i \mid x_i) = \pi(x_i)$. One possible model would be
$$\pi(x_i) = \beta_0 + \beta_1 x_i$$
Logistic Regression Model
As we saw last class, this model will give invalid values for $\pi$ for extreme $x$s. In fact (taking $\beta_1 > 0$) there is a problem if
$$x < -\frac{\beta_0}{\beta_1} \quad (\text{where } \pi < 0) \qquad \text{or} \qquad x > \frac{1-\beta_0}{\beta_1} \quad (\text{where } \pi > 1)$$
We would still like to keep a linear-type predictor. One way to do this is to apply a link function $g(\cdot)$ to transform the mean to a linear function, i.e. $g(\pi) = \beta_0 + \beta_1 x$.
As mentioned last class, we want the function $g(\cdot)$ to transform the interval $[0,1]$ to $(-\infty, \infty)$. While there are many possible choices for this, there is one that matches up with what we have seen before, the logit transformation
$$\text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right) = \log \omega$$
where $\omega = \pi/(1-\pi)$ is the odds.
So instead of looking at things on the probability scale, let's look at things on the log odds ($\log \omega$) scale. Transforming back gives
$$\omega = \frac{\pi}{1-\pi} = e^{\text{logit}(\pi)} = e^{\beta_0+\beta_1 x}$$
and
$$\pi = \frac{\omega}{1+\omega} = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}$$
Note the standard Bernoulli distribution results hold here:
$$E(Y \mid X) = \pi = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} \qquad \text{and} \qquad \text{Var}(Y \mid X) = \pi(1-\pi) = \frac{e^{\beta_0+\beta_1 x}}{\left(1+e^{\beta_0+\beta_1 x}\right)^2}$$
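As a quick numeric sanity check of the transformations above, here is a minimal Python sketch of the logit link and its inverse (the coefficient values `b0`, `b1` are made up for illustration):

```python
import math

def logit(p):
    """Log odds: maps (0, 1) to (-inf, inf)."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit: maps (-inf, inf) back to (0, 1)."""
    return math.exp(eta) / (1 + math.exp(eta))

# Hypothetical coefficients, for illustration only.
b0, b1 = 0.5, -0.1

x = 2.0
pi = inv_logit(b0 + b1 * x)   # E(Y | x) under the model
var = pi * (1 - pi)           # Var(Y | x) for a Bernoulli response

print(round(pi, 4), round(var, 4))
print(round(logit(pi), 4))    # recovers b0 + b1*x = 0.3
```

Applying `logit` to `inv_logit(eta)` recovers the linear predictor exactly, which is the point of the link function.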
While the above assumes a single predictor, it is trivial to extend this to multiple predictors; just set
$$\text{logit}(\pi) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p = x^T\beta$$
giving
$$\pi = \frac{e^{x^T\beta}}{1+e^{x^T\beta}} \qquad \text{and} \qquad 1 - \pi = \frac{1}{1+e^{x^T\beta}}$$
Example: Low Birth Weight in Infants

Hosmer and Lemeshow (1989) look at a data set on 189 births at Baystate Medical Center, Springfield, Mass. during 1986, with the main interest being in low birth weight.

low: birth weight less than 2.5 kg (0/1)
age: age of mother in years
lwt: weight of mother (lbs) at last menstrual period
race: white/black/other
smoke: smoking status during pregnancy
ptl: number of previous premature labours
ht: history of hypertension (0/1)
ui: has uterine irritability (0/1)
ftv: number of physician visits in first trimester
bwt: actual birth weight (grams)

We will focus on maternal age for now.
This data set is available in R in the data frame birthwt, though you may need to give the command library(MASS) to access it.
[Figure: low birth weight indicator (0/1) plotted against maternal age, ages 15 to 45.]
Let's fit the model $\text{logit}(\pi(\text{age})) = \beta_0 + \beta_1\,\text{age}$ in R with the function glm.
> birthwt.glm <- glm(low ~ age, data=birthwt, family=binomial)
> summary(birthwt.glm)

Call:
glm(formula = low ~ age, family = binomial, data = birthwt)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0402 -0.9018 -0.7754  1.4119  1.7800

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 231.91  on 187  degrees of freedom
AIC: 235.91
The fitted curves are
$$\text{logit}(\hat\pi(\text{age})) = 0.385 - 0.051\,\text{age}$$
$$\hat\omega(\text{age}) = e^{0.385 - 0.051\,\text{age}}$$
$$\hat\pi(\text{age}) = \frac{e^{0.385 - 0.051\,\text{age}}}{1+e^{0.385 - 0.051\,\text{age}}}$$
[Figure: the same data with the fitted curve $\hat\pi(\text{age})$ overlaid, ages 15 to 45.]
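A short Python sketch evaluating the fitted curve at a few ages (coefficients copied from the R output above; this is a numeric illustration, not part of the original analysis):

```python
import math

# Fitted coefficients from the R summary() output.
b0, b1 = 0.38458, -0.05115

def pi_hat(age):
    """Estimated probability of a low birth weight at a given maternal age."""
    eta = b0 + b1 * age
    return math.exp(eta) / (1 + math.exp(eta))

for age in (15, 25, 35, 45):
    print(age, round(pi_hat(age), 3))
```

The estimated probability decreases steadily across the observed age range, matching the downward-sloping fitted curve.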
So in this case, it appears that as age increases, the probability/odds of having a low birth weight baby decreases. What is the effect of changing $x$ on $\text{logit}(\pi)$, $\omega$, and $\pi$? Let's see what happens as $x$ goes to $x + \Delta x$ in the single predictor case.
$$\text{logit}(\pi(x+\Delta x)) = \beta_0 + \beta_1(x+\Delta x) = \beta_0 + \beta_1 x + \beta_1\Delta x = \text{logit}(\pi(x)) + \beta_1\Delta x$$
So the log odds work the same way as linear regression. Changing $x$ by one leads to a change in log odds of $\beta_1$.
$$\omega(x+\Delta x) = e^{\beta_0+\beta_1(x+\Delta x)} = e^{\beta_0+\beta_1 x + \beta_1\Delta x} = \omega(x)\, e^{\beta_1\Delta x} = \omega(x)\left(e^{\beta_1}\right)^{\Delta x}$$
So for this model, changing $x$ has a multiplicative effect on the odds. Increasing $x$ by 1 leads to multiplying the odds by $e^{\beta_1}$. Increasing $x$ by another 1 leads to another multiplication by $e^{\beta_1}$.
Another way of thinking of this is through the odds ratio
$$\frac{\omega(x+\Delta x)}{\omega(x)} = e^{\beta_1\Delta x} = \left(e^{\beta_1}\right)^{\Delta x}$$
Note that the difference in odds depends on $x$ (through the odds), as
$$\omega(x+\Delta x) - \omega(x) = \omega(x)\left(e^{\beta_1\Delta x} - 1\right)$$
So the bigger $\omega(x)$, the bigger the absolute difference. For $\pi$ there is not a nice relationship, as $\pi(x)$ has an S shape:
$$\pi(x+\Delta x) = \frac{e^{\beta_0+\beta_1(x+\Delta x)}}{1+e^{\beta_0+\beta_1(x+\Delta x)}} = \pi(x) + \text{something ugly}$$
As can be seen in the following figure, the change $\Delta\pi(x)$ depends on $\pi(x)$, and not in a nice way. However, the biggest changes occur when $\pi(x) \approx 0.5$, and the size of the change decreases as $\pi(x)$ approaches 0 and 1.
[Figure: two panels plotting the odds $\omega = e^{\beta_0+\beta_1 x}$ (left, scale 0 to 50) and the probability $\pi = e^{\beta_0+\beta_1 x}/(1+e^{\beta_0+\beta_1 x})$ (right, scale 0 to 1) against $x$; a unit change in $x$ multiplies the odds by a constant factor $e^{\beta_1}$, while the corresponding change in $\pi$ (0.231 in the marked example) depends on where $\pi(x)$ sits on the S-shaped curve.]
These formulas also imply that the sign of $\beta_1$ indicates whether $\omega(x)$ and $\pi(x)$ increase ($\beta_1 > 0$) or decrease ($\beta_1 < 0$) as $x$ increases. So in the example, since $\hat\beta_1 = -0.051$, older mothers should be less likely to have low birth weight babies (ignoring the effects of other predictors). More precisely, each additional year of age lowers the odds of a low birth weight birth by a factor of $e^{-0.051} = 0.95$ per year. Actually, there isn't enough evidence to declare age to be statistically significant, but the data does suggest the previous statements. In the case of multiple predictors, you need to be a bit more careful. If there are no interaction terms in the model, i.e. nothing like
$$\text{logit}(\pi) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2$$
then the previous ideas go through if you fix all but one $x_j$ and only allow one to vary.
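The contrast between the constant odds ratio and the varying probability change can be checked numerically. A Python sketch using the fitted coefficients from the example:

```python
import math

# Fitted coefficients from the example.
b0, b1 = 0.38458, -0.05115

def odds(x):
    return math.exp(b0 + b1 * x)

def prob(x):
    return odds(x) / (1 + odds(x))

# The odds ratio for a one-unit increase is the same at every x ...
or_20 = odds(21) / odds(20)
or_40 = odds(41) / odds(40)

# ... but the change in probability is not.
dp_20 = prob(21) - prob(20)
dp_40 = prob(41) - prob(40)

print(round(or_20, 4), round(or_40, 4), round(math.exp(b1), 4))
print(round(dp_20, 4), round(dp_40, 4))
```

Both odds ratios equal $e^{\hat\beta_1} \approx 0.95$, while the probability change at age 20 (where $\hat\pi$ is nearer 0.5) is larger in magnitude than at age 40.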
For example, for the model
$$\text{logit}(\pi(x_1, x_2)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
the log odds satisfy
$$\text{logit}(\pi(x_1+\Delta x, x_2)) = \beta_0 + \beta_1(x_1+\Delta x) + \beta_2 x_2 = \beta_0 + \beta_1 x_1 + \beta_1\Delta x + \beta_2 x_2 = \text{logit}(\pi(x_1, x_2)) + \beta_1\Delta x$$
implying that the odds satisfy
$$\omega(x_1+\Delta x, x_2) = e^{\beta_0+\beta_1(x_1+\Delta x)+\beta_2 x_2} = e^{\beta_0+\beta_1 x_1+\beta_1\Delta x+\beta_2 x_2} = \omega(x_1, x_2)\, e^{\beta_1\Delta x} = \omega(x_1, x_2)\left(e^{\beta_1}\right)^{\Delta x}$$
However, for the interaction model
$$\text{logit}(\pi) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2$$
the log odds satisfy
$$\text{logit}(\pi(x_1+\Delta x, x_2)) = \beta_0 + \beta_1(x_1+\Delta x) + \beta_2 x_2 + \beta_{12}(x_1+\Delta x)x_2 = \text{logit}(\pi(x_1, x_2)) + \Delta x(\beta_1 + \beta_{12} x_2)$$
implying the effect of changing $x_1$ depends on the level of $x_2$, as
$$\omega(x_1+\Delta x, x_2) = e^{\beta_0+\beta_1(x_1+\Delta x)+\beta_2 x_2+\beta_{12}(x_1+\Delta x)x_2} = \omega(x_1, x_2)\, e^{\Delta x(\beta_1+\beta_{12} x_2)}$$
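A small numeric illustration of how the odds multiplier depends on $x_2$ under an interaction model; the coefficient values below are hypothetical:

```python
import math

# Hypothetical coefficients for an interaction model (illustration only).
b0, b1, b2, b12 = -1.0, 0.3, 0.5, -0.2

def odds(x1, x2):
    return math.exp(b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2)

# Odds ratio for a one-unit increase in x1, at two different levels of x2.
or_at_x2_0 = odds(1, 0) / odds(0, 0)   # equals e^{b1}
or_at_x2_2 = odds(1, 2) / odds(0, 2)   # equals e^{b1 + 2*b12}

print(round(or_at_x2_0, 4), round(math.exp(b1), 4))
print(round(or_at_x2_2, 4), round(math.exp(b1 + 2 * b12), 4))
```

With these made-up numbers the same one-unit increase in $x_1$ multiplies the odds by about 1.35 at $x_2 = 0$ but by about 0.90 at $x_2 = 2$, so the "effect" of $x_1$ cannot be summarized by a single odds ratio.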
Why not just use least squares on the logit scale? If $\hat p$ is a sample proportion based on $n$ observations, then
$$\text{Var}(\text{logit}(\hat p)) \approx \frac{1}{n\pi(1-\pi)}$$
so we don't have constant variance. Worse, if $Y \sim \text{Bin}(1, \pi)$, then $\text{logit}(Y)$ equals $\infty$ or $-\infty$. While there are ways around this (weighted least squares and fudging the $Y_i$s), a better approach is maximum likelihood.
Fitting the Logistic Regression Model
Given observations $Y_1, \ldots, Y_n$ with density $f(y \mid \theta)$, the likelihood function is
$$L(\theta) = \prod_{i=1}^{n} f(Y_i \mid \theta)$$
The maximum likelihood estimate (MLE) of $\theta$ is
$$\hat\theta = \arg\sup_\theta L(\theta)$$
i.e. the value of $\theta$ that maximizes the likelihood function. One way of thinking of the MLE is that it's the value of the parameter that is most consistent with the data.
Maximum Likelihood Estimation
For logistic regression, the likelihood is
$$L(\beta) = \prod_{i=1}^{n} \pi_i^{y_i}(1-\pi_i)^{1-y_i} = \prod_{i=1}^{n} \left(\frac{\pi_i}{1-\pi_i}\right)^{y_i}(1-\pi_i) = \prod_{i=1}^{n} \left(e^{\beta_0+\beta_1 x_i}\right)^{y_i} \frac{1}{1+e^{\beta_0+\beta_1 x_i}}$$
One approach to maximizing the likelihood is via calculus, by solving the equations
$$\frac{\partial}{\partial\theta_1} L(\theta) = 0, \quad \frac{\partial}{\partial\theta_2} L(\theta) = 0, \quad \ldots, \quad \frac{\partial}{\partial\theta_p} L(\theta) = 0$$
with respect to the parameter $\theta$. Note that when determining MLEs, it is usually easier to work with the log likelihood function
$$l(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$
It has the same optimum since log is an increasing function and it is easier to work with since derivatives of sums are usually much nicer than derivatives of products.
So we will maximize
$$l(\beta) = \log L(\beta) = \sum_{i=1}^{n} \left[ y_i(\beta_0+\beta_1 x_i) - \log\left(1+e^{\beta_0+\beta_1 x_i}\right) \right]$$
for $\beta$ instead.
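The log likelihood simplification can be checked numerically: on any data set, the term-by-term Bernoulli log likelihood and the simplified form agree. A Python sketch with made-up toy data:

```python
import math

# Toy data: predictor values and 0/1 responses (illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]
b0, b1 = -2.0, 0.8

def pi_i(x):
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

# Log of the Bernoulli likelihood, term by term.
ll_direct = sum(y * math.log(pi_i(x)) + (1 - y) * math.log(1 - pi_i(x))
                for x, y in zip(xs, ys))

# Simplified form: l(beta) = sum_i [ y_i*(b0 + b1*x_i) - log(1 + e^{b0 + b1*x_i}) ].
ll_simplified = sum(y * (b0 + b1 * x) - math.log(1 + math.exp(b0 + b1 * x))
                    for x, y in zip(xs, ys))

print(round(ll_direct, 6), round(ll_simplified, 6))
```

The two expressions are algebraically identical, so they match to floating-point precision at any $\beta$.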
Normally there are no closed form solutions to these equations, as can be seen from the score equations
$$\frac{\partial l(\beta)}{\partial\beta_0} = \sum_{i=1}^{n} \left( y_i - \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}} \right) = 0$$
$$\frac{\partial l(\beta)}{\partial\beta_1} = \sum_{i=1}^{n} x_i \left( y_i - \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}} \right) = 0$$
These equations will need to be solved by numerical methods, such as Newton-Raphson or iteratively reweighted least squares. However, there are some special cases, which we will discuss later, where there are closed form solutions (a nice function of the $(x_i, y_i)$).
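To give a feel for how the numerical solution works, here is a bare-bones Newton-Raphson iteration for the two-parameter case, sketched in Python (the data set is synthetic; a real analysis would use glm in R):

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Newton-Raphson for logit(pi) = b0 + b1*x (two-parameter case)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        # Score vector g and observed information (negative Hessian) entries h.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = p * (1 - p)
            g0 += y - p
            g1 += x * (y - p)
            h00 += w
            h01 += x * w
            h11 += x * x * w
        # Solve the 2x2 system H * step = g and update beta.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Small synthetic data set (illustration only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)

# At the MLE the first score equation is (numerically) zero.
score0 = sum(y - 1 / (1 + math.exp(-(b0 + b1 * x))) for x, y in zip(xs, ys))
print(round(b0, 3), round(b1, 3))
print(abs(score0) < 1e-8)
```

At convergence both score equations from the notes are satisfied to machine precision, which is exactly the stopping criterion numerical fitters use.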
4. For large $n$, the sampling distribution of an MLE is approximately normal. This implies we can get confidence intervals for $\beta_i$ easily.

5. If $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$, for any nice function $g(\cdot)$. (Transformations of MLEs are MLEs - the invariance property.) An example of where this is useful is the estimation of success probabilities in logistic regression:
$$\hat\pi(x) = \frac{e^{\hat\beta_0+\hat\beta_1 x}}{1+e^{\hat\beta_0+\hat\beta_1 x}}$$
is the MLE of $\pi(x)$.
So the fitted curve on the earlier plot is the MLE of the probabilities of a low birth weight for ages 14 to 45.
So maximizing this with respect to $\beta$ is the same as minimizing the least squares criterion.
Least Squares Versus Maximum Likelihood in Normal Based Regression
The one place where there is a slight difference is in estimating $\sigma^2$. The MLE is
$$\hat\sigma^2 = \frac{\text{SSE}}{n} = \frac{n-p-1}{n}\, s^2$$
where $s^2 = \text{SSE}/(n-p-1)$ is the usual unbiased method of moments estimator.
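A trivial numeric check of the relationship between the two estimators (all numbers hypothetical):

```python
# Hypothetical sample size, number of predictors, and residual sum of squares.
n, p = 20, 2
sse = 54.0

s2_unbiased = sse / (n - p - 1)   # usual unbiased method of moments estimator
s2_mle = sse / n                  # maximum likelihood estimator

# The MLE is the unbiased estimator shrunk by the factor (n-p-1)/n.
print(round(s2_mle, 6), round(s2_unbiased * (n - p - 1) / n, 6))
```

The MLE is always slightly smaller than the unbiased estimator, though the difference vanishes as $n$ grows.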
Inference on Individual $\beta$s
As mentioned before, $\hat\beta_j$ is approximately normally distributed with mean $\beta_j$ and a variance we can estimate. So we can base our inference on the result
$$z = \frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \;\overset{\text{approx.}}{\sim}\; N(0, 1)$$
While R doesn't give you these CIs directly, it does give you the information needed to calculate them.
> birthwt.glm <- glm(low ~ age, data=birthwt, family=binomial)
> summary(birthwt.glm)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0402 -0.9018 -0.7754  1.4119  1.7800

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 231.91  on 187  degrees of freedom
AIC: 235.91
In addition, you can extract the $\hat\beta$s and standard errors into vectors, and create the confidence intervals, with the following code

> betahat <- coef(birthwt.glm)
> betahat
(Intercept)         age
 0.38458192 -0.05115294
> se.betahat <- sqrt(diag(vcov(birthwt.glm)))
> se.betahat
(Intercept)         age
 0.73212479  0.03151376
> me.betahat <- qnorm(0.975) * se.betahat
> ci.betahat <- cbind(Lower=betahat - me.betahat, Upper=betahat + me.betahat)
> ci.betahat   # 95% approximate CIs
                 Lower      Upper
(Intercept) -1.0503563 1.81952014
age         -0.1129188 0.01061290
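The same interval arithmetic can be reproduced outside R. A Python sketch using the estimates and standard errors reported above (tiny differences from R's output come from rounding the inputs):

```python
# Estimates and standard errors copied from the R summary() output.
est = {"(Intercept)": 0.38458, "age": -0.05115}
se = {"(Intercept)": 0.73212, "age": 0.03151}

z = 1.959964  # 97.5% standard normal quantile, i.e. qnorm(0.975)

# 95% Wald intervals: estimate +/- z * SE.
ci = {name: (est[name] - z * se[name], est[name] + z * se[name]) for name in est}
for name, (lo, hi) in ci.items():
    print(name, round(lo, 5), round(hi, 5))
```

The age interval matches R's (-0.1129, 0.0106) to four decimal places; since it contains 0, it agrees with the non-significant Wald test for age.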
In addition, testing the hypothesis $H_0: \beta_j = 0$ vs $H_A: \beta_j \neq 0$ is usually done by Wald's test
$$z = \frac{\hat\beta_j}{SE(\hat\beta_j)}$$
which is compared to the $N(0, 1)$ distribution. This is given in the standard R output with the summary command.
> birthwt.glm <- glm(low ~ age, data=birthwt, family=binomial)
> summary(birthwt.glm)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

So in this case age does not appear to be statistically significant. However, remember there are a number of potential confounders ignored in this analysis, so you may not want to read too much into this. As in regular regression, inference on the intercept usually is not interesting. In this case $\beta_0$ gives information about mothers with age = 0, a situation that can't happen.