Session 10 (Logistic Regression)

URBN LOFTS
URBN LOFTS
Logistic REGRESSION
URBN LOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
Early uses of logistic regression were in
biomedical studies, for instance, to model whether
subjects have a particular condition such as lung
cancer. The past 25 years have seen much use in
social science research, for modeling opinions and
behavior decisions, and in business applications. In

credit-scoring, logistic regression is used to model
the probability that a subject is credit worthy.
URBN LOFTS
URBN LOFTS
For instance, the probability that a subject
pays a bill on time may use predictors such as the
size of the bill, annual income, occupation,
mortgage and debt obligations, percentage of bills
paid on time in the past, and other aspects of an
applicant's credit history.
URBN LOFTS
URBN LOFTS
Another area of increasing application is
genetics, such as to estimate quantitative trait loci
effects by modeling the probability that an
offspring inherits an allele of one type instead of
another type as a function of phenotypic values on
various traits for that offspring.
URBN LOFTS
URBN LOFTS
For binary response variable Y and an explanatory X, let
= = 1 = = 1 = 0 = . The logistic
regression model is
exp +
=
1 + exp +
Equivalently, the logit (log odds) has the linear relationship

=
= +
1
URBN LOFTS
URBN LOFTS
The sign determines whether (x) is increasing or

decreasing as x increases.
The rate of climb or descent increases as ||
increases; as 0 the curve flattens to a horizontal
straight line.
When = 0, Y is independent of X.
For quantitative x with > 0, the curve for (x) has

the shape of the CDF of the logistic distribution. Since
the logistic density is symmetric, (x) approaches 1
at the same rate that it approaches 0.
URBN LOFTS
URBN LOFTS
Exponentiating both sides of the equation shows that the odds

are an exponential function of x.
This provides a basic interpretation for the magnitude of : The

odds multiply by for every 1 -unit increase in x.
In other words, is an odds ratio, the odds at X = x + 1 divided

by the odds at X = x.
The intercept parameter is not usually of particular interest.
By centering the predictor about 0 [i.e., replacing x by (x )],

becomes the logit at x = , and thus /(1 + ) = (). As in
ordinary regression, centering is also helpful in complex models
containing quadratic or interaction terms to reduce correlations
among model parameter estimates.
URBN LOFTS
URBN LOFTS
To illustrate logistic regression, we re-analyze the
horseshoe crab data introduced last time.
The binary response is whether a female crab has any

male crabs residing nearby (satellites). For crab i, let
yi = 1 if she has at least one satellite and yi = 0 if she
has none.
Here, we use as a predictor the female crab's carapace
width.
In the succeeding graph, it appears that yi = 1 tends to

occur relatively more often at higher x values, in fact,
all crabs with width >29 cm have satellites.
URBN LOFTS
URBN LOFTS
URBN LOFTS
URBN LOFTS
In each of the eight width categories, we computed the
sample proportion of crabs having satellites and the
mean width for the crabs in that category.
A curve based on smoothing the data using the
generalized additive modeling method, assuming a
binomial response and logit link is also in the graph
This curve shows a roughly increasing trend and is more
informative than viewing the binary data alone.
It suggests that an S-shaped regression function may

describe this relationship relatively well.
URBN LOFTS
URBN LOFTS
URBN LOFTS
URBN LOFTS
The ML fit is
Substituting x = 26.3 cm, the mean width level in this

sample, (x) = 0.674.
The estimated probability equals 0.5 when
12.351
x= =
= 24.8
0.497
The estimated odds of a satellite multiply by exp() =

exp(0.497) = 1.64 for each 1-cm increase in width; that
is, there is a 64% increase.
URBN LOFTS
URBN LOFTS
The statistic z = /s.e. = 0.497/0.102 = 4.89 provides
strong evidence of a positive width effect (P < 0.0001).
The equivalent Wald chi-squared statistic, z2 = 23.89,
has df = 1.
The maximized log likelihoods equal 112.88 under H0:

= 0 and 97.23 for the full model.
The likelihood-ratio statistic equals
2[112.88 (-97.23)] = 31.31, with df = 1.
This provides even stronger evidence than the Wald
test.
URBN LOFTS
URBN LOFTS
The Wald 95% confidence interval for is 0.497
1.96(0.102), or (0.298, 0.697).
The table reports a likelihood-ratio confidence interval
of (0.308, 0.709), based on the profile likelihood
function.
The confidence interval for the effect on the odds per
1-cm increase in width equals (e0.308, e0.709) =
(1.36,2.03).
We infer that a 1-cm increase in width has at least a
36% increase and at most a doubling in the odds of a
satellite.
URBN LOFTS
URBN LOFTS
In practice, there is no guarantee that a certain logistic
regression model fits the data well.
For any type of binary data, one way to detect lack of fit
uses a likelihood-ratio test to compare the model to more
complex ones.
A more complex model might contain a nonlinear effect.
Models with multiple predictors would consider interaction.
If more complex models do not fit better, this provides
URBN LOFTS chosen is reasonable.
some assurance that the model
URBN LOFTS
For models with a continuous explanatory variable, X2 and
G2 do not approximate chi-squared distributions due to very
few counts for each value of x.
One solution for this is to bin the continuous variable
into several categories and then perform a goodness-of-fit

test.
If the test resulted to a good fit, then it can be inferred
that the model with the continuous explanatory variable
has also a good fit.
URBN LOFTS
URBN LOFTS
URBN LOFTS
URBN LOFTS
In each width category, the fitted value for a "yes" response
is the sum of the estimated probabilities (x) for all crabs
having width in that category
The fitted value for a "no" response is the sum of 1 (x)
for those crabs.

In this table we have an 8 by 2 table, to compute for the
chi-squared statistic, we square the difference of the count
for observed and the fitted values.
URBN LOFTS
URBN LOFTS
Their values are X2 = 5.3 and G2 = 6.2.
The constructed table has eight binomial samples, one for
each width setting.
The model has two parameters, so df = 8 2 = 6.
Neither X2 nor G2 shows evidence of lack of fit (P-value >
0.4).
Thus, we can feel more comfortable about using the model
for the original ungrouped data.
URBN LOFTS

Session 10 (Logistic Regression)

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Session 10 (Logistic Regression)

Uploaded by

Copyright:

Available Formats

URBN LOFTS

behavior decisions, and in business applications. In

applicant's credit history.

various traits for that offspring.

The sign determines whether (x) is increasing or

For quantitative x with > 0, the curve for (x) has

Exponentiating both sides of the equation shows that the odds

This provides a basic interpretation for the magnitude of : The

In other words, is an odds ratio, the odds at X = x + 1 divided

The intercept parameter is not usually of particular interest.

By centering the predictor about 0 [i.e., replacing x by (x )],

The binary response is whether a female crab has any

In the succeeding graph, it appears that yi = 1 tends to

It suggests that an S-shaped regression function may

Substituting x = 26.3 cm, the mean width level in this

The estimated odds of a satellite multiply by exp() =

The maximized log likelihoods equal 112.88 under H0:

into several categories and then perform a goodness-of-fit

for those crabs.

You might also like