
URBN LOFTS

Logistic Regression
1837 LOFT STREET, ANYTOWN, NY 50080

Early uses of logistic regression were in biomedical studies, for
instance, to model whether subjects have a particular condition such
as lung cancer. The past 25 years have seen much use in social science
research, for modeling opinions and behavior decisions, and in
business applications. In credit scoring, logistic regression is used
to model the probability that a subject is creditworthy.

For instance, a model for the probability that a subject pays a bill
on time may use predictors such as the size of the bill, annual
income, occupation, mortgage and debt obligations, percentage of bills
paid on time in the past, and other aspects of an applicant's credit
history.

Another area of increasing application is genetics, for example to
estimate quantitative trait loci effects by modeling the probability
that an offspring inherits an allele of one type instead of another
type as a function of phenotypic values on various traits for that
offspring.

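For a binary response variable $Y$ and an explanatory variable $X$, let
$\pi(x) = P(Y = 1 \mid X = x) = 1 - P(Y = 0 \mid X = x)$. The logistic
regression model is

$$\pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}.$$

Equivalently, the logit (log odds) has the linear relationship

$$\operatorname{logit}[\pi(x)] = \log \frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x.$$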
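The two forms of the model can be checked numerically. The following is a minimal sketch, assuming hypothetical parameter values (`alpha = -1.0`, `beta = 0.5`) chosen only for illustration; the function names `prob` and `logit` are likewise illustrative.

```python
import math

def prob(x, alpha, beta):
    """Success probability pi(x) = exp(alpha + beta*x) / (1 + exp(alpha + beta*x))."""
    eta = alpha + beta * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# Hypothetical parameter values, not estimates from any data set.
alpha, beta = -1.0, 0.5

for x in (0.0, 2.0, 4.0):
    p = prob(x, alpha, beta)
    # The logit of pi(x) should reproduce the linear predictor alpha + beta*x.
    print(x, round(p, 4), round(logit(p), 4), alpha + beta * x)
```

Applying `logit` to `prob(x, alpha, beta)` returns the linear predictor, which is the equivalence between the probability form and the log-odds form of the model.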

The sign of $\beta$ determines whether $\pi(x)$ is increasing or
decreasing as $x$ increases.
The rate of climb or descent increases as $|\beta|$ increases; as
$\beta \to 0$ the curve flattens to a horizontal straight line.
When $\beta = 0$, $Y$ is independent of $X$.

For quantitative $x$ with $\beta > 0$, the curve for $\pi(x)$ has the
shape of the CDF of the logistic distribution. Since the logistic
density is symmetric, $\pi(x)$ approaches 1 at the same rate that it
approaches 0.
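These properties can be illustrated with a short sketch. The parameter values below are hypothetical, and the check uses the fact that the curve is symmetric about the point $x = -\alpha/\beta$ where $\pi(x) = 0.5$.

```python
import math

def prob(x, alpha, beta):
    """pi(x) written as 1 / (1 + exp(-(alpha + beta*x))), an equivalent form."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

# Effect of the sign and size of beta (alpha fixed at 0, values hypothetical).
for beta in (2.0, 0.5, 0.0, -0.5):
    curve = [round(prob(x, 0.0, beta), 3) for x in (-2, -1, 0, 1, 2)]
    print("beta =", beta, curve)

# Symmetry: pi(m + d) = 1 - pi(m - d), where m = -alpha/beta gives pi(m) = 0.5.
alpha, beta = -1.0, 0.5
m = -alpha / beta
gaps = [abs(prob(m + d, alpha, beta) - (1.0 - prob(m - d, alpha, beta)))
        for d in (0.5, 1.0, 3.0)]
print("largest symmetry gap:", max(gaps))
```

With `beta = 0.0` every printed probability is the same, which matches the statement that $Y$ is then independent of $X$.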

Exponentiating both sides of the equation shows that the odds are an
exponential function of $x$.

This provides a basic interpretation for the magnitude of $\beta$: the
odds multiply by $e^{\beta}$ for every 1-unit increase in $x$.

In other words, $e^{\beta}$ is an odds ratio, the odds at $X = x + 1$
divided by the odds at $X = x$.

The intercept parameter $\alpha$ is not usually of particular interest.

By centering the predictor about 0 [i.e., replacing $x$ by
$(x - \bar{x})$], $\alpha$ becomes the logit at $x = \bar{x}$, and thus
$e^{\alpha}/(1 + e^{\alpha}) = \pi(\bar{x})$. As in ordinary regression,
centering is also helpful in complex models containing quadratic or
interaction terms to reduce correlations among model parameter
estimates.
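A small numerical sketch of the odds-ratio interpretation and of centering follows. The parameter values and the predictor mean `x_bar` are hypothetical, used only to show the arithmetic.

```python
import math

def odds(x, alpha, beta):
    """Odds pi(x) / (1 - pi(x)) = exp(alpha + beta*x) under the logistic model."""
    return math.exp(alpha + beta * x)

# Hypothetical parameter values.
alpha, beta = -1.0, 0.5

# Odds ratio for a 1-unit increase in x; it does not depend on the starting x.
ratio = odds(3.0, alpha, beta) / odds(2.0, alpha, beta)
print(ratio, math.exp(beta))

# Centering: alpha + beta*x = (alpha + beta*x_bar) + beta*(x - x_bar),
# so the intercept of the centered model is the logit at x = x_bar.
x_bar = 2.4  # hypothetical predictor mean
alpha_centered = alpha + beta * x_bar
prob_at_mean = math.exp(alpha_centered) / (1.0 + math.exp(alpha_centered))
print(alpha_centered, prob_at_mean)
```

The slope is unchanged by centering; only the intercept moves, which is why the centered intercept is directly interpretable as the log odds at the mean.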

To illustrate logistic regression, we re-analyze the horseshoe crab
data introduced last time.
The binary response is whether a female crab has any male crabs
residing nearby (satellites). For crab $i$, let $y_i = 1$ if she has at
least one satellite and $y_i = 0$ if she has none.
Here, we use the female crab's carapace width as the predictor.
In the succeeding graph, $y_i = 1$ tends to occur relatively more often
at higher $x$ values; in fact, all crabs with width > 29 cm have
satellites.

In each of the eight width categories, we computed the sample
proportion of crabs having satellites and the mean width for the crabs
in that category.
A curve based on smoothing the data using the generalized additive
modeling method, assuming a binomial response and logit link, is also
in the graph.
This curve shows a roughly increasing trend and is more informative
than viewing the binary data alone.
It suggests that an S-shaped regression function may describe this
relationship relatively well.

The ML fit is

$$\operatorname{logit}[\hat{\pi}(x)] = -12.351 + 0.497x.$$

Substituting $x = 26.3$ cm, the mean width level in this sample,
$\hat{\pi}(x) = 0.674$.
The estimated probability equals 0.5 when

$$x = -\hat{\alpha}/\hat{\beta} = 12.351/0.497 = 24.8.$$

The estimated odds of a satellite multiply by
$\exp(\hat{\beta}) = \exp(0.497) = 1.64$ for each 1-cm increase in
width; that is, there is a 64% increase.
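The three quantities above can be recomputed from the fitted coefficients. This sketch assumes the estimates are $\hat{\alpha} = -12.351$ and $\hat{\beta} = 0.497$, the rounded values used in the calculations above; because they are rounded, the results can differ from the reported figures in the last digit.

```python
import math

# Rounded coefficient estimates as they appear in the calculations above.
alpha_hat, beta_hat = -12.351, 0.497

def fitted_prob(width):
    """Estimated probability of at least one satellite at a given width (cm)."""
    eta = alpha_hat + beta_hat * width
    return math.exp(eta) / (1.0 + math.exp(eta))

p_mean = fitted_prob(26.3)        # at the sample mean width; about 0.67
x_half = -alpha_hat / beta_hat    # width where the estimated probability is 0.5
odds_mult = math.exp(beta_hat)    # multiplicative effect on the odds per 1 cm

print(round(p_mean, 3), round(x_half, 2), round(odds_mult, 2))
```

The width at which the fitted probability is 0.5 is where the linear predictor is zero, so it is obtained by solving $\hat{\alpha} + \hat{\beta}x = 0$.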

The statistic $z = \hat{\beta}/\text{s.e.} = 0.497/0.102 = 4.89$
provides strong evidence of a positive width effect (P < 0.0001).
The equivalent Wald chi-squared statistic, $z^2 = 23.89$, has df = 1.

The maximized log likelihoods equal $-112.88$ under $H_0\colon \beta = 0$
and $-97.23$ for the full model.
The likelihood-ratio statistic equals
$-2[-112.88 - (-97.23)] = 31.31$, with df = 1.
This provides even stronger evidence than the Wald test.
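Both tests can be reproduced approximately from the rounded figures quoted above. This is a sketch using only the standard library: the helper `chi2_sf_df1` is an illustrative name, and it relies on the identity that a chi-squared variable with df = 1 is a squared standard normal. Working from rounded inputs gives roughly 4.87 for $z$ and 31.30 for the likelihood-ratio statistic, slightly different from the reported values, presumably because those were computed from unrounded estimates.

```python
import math

beta_hat, se = 0.497, 0.102                  # rounded estimate and standard error
loglik_null, loglik_full = -112.88, -97.23   # maximized log likelihoods

# Wald test of H0: beta = 0.
z = beta_hat / se
wald_chi2 = z ** 2

# Likelihood-ratio test: -2 * (L0 - L1).
lr_stat = -2.0 * (loglik_null - loglik_full)

def chi2_sf_df1(x):
    """Upper-tail probability of chi-squared with df = 1: P(Z^2 > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

print(round(z, 2), round(wald_chi2, 2), chi2_sf_df1(wald_chi2))
print(round(lr_stat, 2), chi2_sf_df1(lr_stat))
```

Both P-values are far below 0.0001, and the likelihood-ratio statistic is the larger of the two.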

The Wald 95% confidence interval for $\beta$ is
$0.497 \pm 1.96(0.102)$, or (0.298, 0.697).
The table reports a likelihood-ratio confidence interval of
(0.308, 0.709), based on the profile likelihood function.
The confidence interval for the effect on the odds per 1-cm increase
in width equals $(e^{0.308}, e^{0.709}) = (1.36, 2.03)$.
We infer that a 1-cm increase in width is associated with at least a
36% increase and at most a doubling of the odds of a satellite.
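The interval arithmetic can be sketched as follows. The Wald limits are recomputed from the rounded estimate and standard error, so the lower limit may differ from the reported one in the third decimal; the profile-likelihood limits are taken as reported, not recomputed.

```python
import math

beta_hat, se = 0.497, 0.102   # rounded estimate and standard error
z_crit = 1.96                 # standard normal quantile for 95% confidence

# Wald interval for beta: estimate plus or minus z_crit standard errors.
wald_ci = (beta_hat - z_crit * se, beta_hat + z_crit * se)

# Reported likelihood-ratio (profile likelihood) interval for beta.
lr_ci = (0.308, 0.709)

# Exponentiating the endpoints gives an interval for the odds ratio exp(beta).
odds_ci = tuple(math.exp(b) for b in lr_ci)

print([round(v, 3) for v in wald_ci])
print([round(v, 2) for v in odds_ci])
```

Because the exponential function is increasing, transforming the endpoints of an interval for $\beta$ gives a valid interval for $e^{\beta}$ with the same confidence level.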

In practice, there is no guarantee that a certain logistic regression
model fits the data well.
For any type of binary data, one way to detect lack of fit uses a
likelihood-ratio test to compare the model to more complex ones.
A more complex model might contain a nonlinear effect.
Models with multiple predictors would consider interaction.
If more complex models do not fit better, this provides some assurance
that the model chosen is reasonable.

For models with a continuous explanatory variable, $X^2$ and $G^2$ do
not have approximate chi-squared distributions, because there are very
few observations at each value of $x$.
One solution is to bin the continuous variable into several categories
and then perform a goodness-of-fit test on the grouped data.
If that test indicates a good fit, this suggests that the model with
the continuous explanatory variable also fits adequately.

In each width category, the fitted value for a "yes" response is the
sum of the estimated probabilities $\hat{\pi}(x)$ for all crabs having
width in that category.
The fitted value for a "no" response is the sum of $1 - \hat{\pi}(x)$
for those crabs.
The result is an 8 by 2 table of observed and fitted counts. The
Pearson chi-squared statistic sums, over the cells of this table, the
squared difference between the observed and fitted counts divided by
the fitted count.
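The grouping calculation described above can be sketched as follows. The widths and responses are made-up values in two categories, not the horseshoe crab data, and the coefficients are the rounded estimates quoted earlier; the sketch also computes the deviance form $G^2 = 2\sum \text{observed} \cdot \log(\text{observed}/\text{fitted})$.

```python
import math

ALPHA_HAT, BETA_HAT = -12.351, 0.497   # rounded estimates from the ML fit

def fitted_prob(width):
    """Estimated probability of a satellite at the given width (cm)."""
    return 1.0 / (1.0 + math.exp(-(ALPHA_HAT + BETA_HAT * width)))

# Hypothetical (width, y) pairs grouped into two width categories.
groups = {
    "<= 25 cm": [(23.1, 0), (24.0, 0), (24.6, 1), (24.9, 0)],
    "> 25 cm": [(25.5, 1), (26.2, 1), (27.0, 0), (28.3, 1), (29.5, 1)],
}

rows = []
x2 = g2 = 0.0
for label, crabs in groups.items():
    obs_yes = sum(y for _, y in crabs)
    obs_no = len(crabs) - obs_yes
    fit_yes = sum(fitted_prob(w) for w, _ in crabs)        # sum of pi-hat
    fit_no = sum(1.0 - fitted_prob(w) for w, _ in crabs)   # sum of 1 - pi-hat
    rows.append((label, obs_yes, obs_no, fit_yes, fit_no))
    for obs, fit in ((obs_yes, fit_yes), (obs_no, fit_no)):
        x2 += (obs - fit) ** 2 / fit                # Pearson contribution
        if obs > 0:                                 # 0 * log(0) is taken as 0
            g2 += 2.0 * obs * math.log(obs / fit)   # deviance contribution

for row in rows:
    print(row)
print("X2 =", round(x2, 3), "G2 =", round(g2, 3))
```

Within each category the fitted "yes" and "no" counts add up to the number of crabs in that category, which is a quick check that the fitted table was built correctly.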

The statistics are $X^2 = 5.3$ and $G^2 = 6.2$.
The constructed table has eight binomial samples, one for each width
setting.
The model has two parameters, so df = 8 − 2 = 6.
Neither $X^2$ nor $G^2$ shows evidence of lack of fit (P-value > 0.4).
Thus, we can feel more comfortable about using the model for the
original ungrouped data.
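The P-values can be checked without a statistics library. This sketch uses the closed-form upper-tail probability of a chi-squared distribution with even degrees of freedom; the helper name `chi2_sf_even_df` is illustrative.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(chi-squared_df > x) for even df.

    Uses exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k      # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

df = 8 - 2   # eight binomial samples minus two model parameters
p_x2 = chi2_sf_even_df(5.3, df)   # about 0.51
p_g2 = chi2_sf_even_df(6.2, df)   # about 0.40

print(round(p_x2, 3), round(p_g2, 3))
```

Both values exceed 0.4, in line with the conclusion that neither statistic shows evidence of lack of fit.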