Robert Chapman
June 2016
PH 507
Descriptive Statistics

Variable               N   Mean
APTITUDE Aptitude      9   7.11
ACHIEVMT Achievement   9   8.00
Valid N (listwise)     9

Variable                            N   Sum   Mean   SD
ymybsqar (Achievement - mean)**2,   9   86    9.56   8.805
part of SSyy
[Figure: scatter diagram of aptitude (X) and achievement (Y).]
Correlations (r² = 0.677)

                                   Aptitude   Achievement
Aptitude     Pearson Correlation   1.000      .823**
             Sig. (2-tailed)       .          .006
             N                     9          9
Achievement  Pearson Correlation   .823**     1.000
             Sig. (2-tailed)       .006       .
             N                     9          9
**. Correlation is significant at the 0.01 level (2-tailed).
[Figure: scatter of aptitude (X) and achievement (Y) with the regression line Ycap = a + bX and the mean of Y (Ybar = 8) marked.]
For each data point, Y - Ybar is the total variation in Y, and it splits into two pieces:
Y - Ybar = (Y - Ycap) + (Ycap - Ybar)
Ycap - Ybar is the variation explained, the piece for relationship. Y - Ycap is the error or residual, the piece for lack of relationship. Ycap is also written Y-hat.
(Y - Ycap) is the error, the distance from Y to the line, also known as the
deviation or residual.
Ycap = a + bX, so (Y - Ycap) = (Y - a - bX).
We fit the regression line so that the sum of the squared deviations from
the line is a minimum, that is
Σ(Y - a - bX)² = minimum
In calculus, when something is a minimum, its derivative is zero. We can
solve for a and b by differentiating (Y - a - bX)2 with respect to a, then
with respect to b. Then we set the derivatives to zero. The solutions are
a = Ybar - bXbar, same as Ybar = a + bXbar. In other words,
the regression line passes through the point (Xbar,Ybar).
b = Σ(Y - Ybar)(X - Xbar) / Σ(X - Xbar)² (Also see slide 15.)
This is called the least-squares solution for a and b. The fitted line is
called the least-squares line, and linear regression is often called least-
squares regression.
Note: The equations above are for the situation when there is only one
independent variable. The equations are more complicated for more than
one independent variable, but the basic concepts are the same.
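The least-squares formulas can be checked numerically. A minimal Python sketch, using only the sums reported in the summary tables in these slides (Σ(X - Xbar)(Y - Ybar) = 41.00, Σ(X - Xbar)² = 28.89, Xbar = 7.11, Ybar = 8.00):

```python
# Slope and intercept for simple (one-X) least-squares regression:
#   b = sum((Y - Ybar) * (X - Xbar)) / sum((X - Xbar)**2)
#   a = Ybar - b * Xbar
# The sums below come from the descriptive tables in these slides (n = 9).
ss_xy = 41.00   # sum of (X - Xbar)(Y - Ybar)
ss_xx = 28.89   # sum of (X - Xbar)**2
x_bar = 7.11
y_bar = 8.00

b = ss_xy / ss_xx        # slope
a = y_bar - b * x_bar    # intercept; the line passes through (Xbar, Ybar)

print(round(b, 3))  # 1.419, the Aptitude coefficient in the SPSS output
print(round(a, 2))  # -2.09, matching -2.092 up to rounding of the inputs
```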
Here is the least-squares regression line for the relationship between
aptitude and achievement. Note that the line passes through (Xbar,Ybar).
Note that R-square (R Sq) = the square of the Pearson correlation
coefficient. This shows that there is a close relationship between
correlation and linear regression. The equation for this line is
achievement = -2.092 + 1.419(aptitude).
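A quick numeric check of this equation in Python, using only numbers from the slide:

```python
# Fitted least-squares line from the slide:
#   achievement = -2.092 + 1.419 * aptitude
def predict_achievement(aptitude):
    return -2.092 + 1.419 * aptitude

# The line passes through (Xbar, Ybar): predicting at Xbar = 7.11
# gives approximately Ybar = 8.00.
print(round(predict_achievement(7.11), 2))  # 8.0

# R Sq is the square of the Pearson correlation: .823**2 = .677.
print(round(0.823 ** 2, 3))  # 0.677
```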
[Figure: least-squares regression line on the aptitude (X) and achievement (Y) scatter. The line passes through (Xbar = 7.11, Ybar = 8.00). R Sq Linear = 0.677.]
Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .823a   .677       .630                1.99
a. Predictors: (Constant), Aptitude
Thus the total variation = the piece for relationship + the piece for error.
This is the basis of the ANOVA table in linear regression. In other words,
the total sum of squares = sum of squares for regression + sum of squares
for residual (SStotal = SSregression + SSresidual). R2 for the model = the
piece for regression / total variation = SSregression / SS total = proportion
of total variation that the model explains. See the ANOVA table in the next
slide.
Variable                               N   Sum     Mean   SD
xmxbymyb (xminxbar)(yminybar),         9   41.00   4.56   5.878
part of SSxy
xmxbsqar (xminxbar)**2, part of SSxx   9   28.89   3.21   3.639
Coefficients(a)

                 Unstandardized        Standardized                  95% Confidence
                 Coefficients          Coefficients                  Interval for B
Model            B        Std. Error   Beta           t       Sig.   Lower Bound   Upper Bound
1  (Constant)    -2.092   2.720                       -.769   .467   -8.523        4.338
   Aptitude      1.419    .371         .823           3.827   .006   .542          2.296
a. Dependent Variable: Achievement
ANOVA(b)

Model           Sum of Squares   df    Mean Square   F        Sig.
1  Regression   1176.21          1     1176.2        13.943   .000a
   Residual     16702.7          198   84.357
   Total        17878.9          199
a. Predictors: (Constant), female DichotomousVariableForGender
b. Dependent Variable: write WritingTestScore
Coefficients(a)

                         Unstandardized        Standardized                   95% Confidence
                         Coefficients          Coefficients                   Interval for B
Model                    B        Std. Error   Beta           t        Sig.   Lower Bound   Upper Bound
1  (Constant)            50.121   .963                        52.057   .000   48.222        52.020
   female Dichotomous    4.870    1.304        .256           3.734    .000   2.298         7.442
   VariableForGender
a. Dependent Variable: write WritingTestScore
Weight in kilograms (weight):

Coefficients(a)

                              Unstandardized        Standardized                   95% Confidence
                              Coefficients          Coefficients                   Interval for B
Model                         B        Std. Error   Beta           t        Sig.   Lower Bound   Upper Bound
1  (Constant)                 76.131   8.308                       9.163    .000   59.747        92.515
   weight WeightInKilograms   -.558    .198         -.197          -2.820   .005   -.948         -.168
a. Dependent Variable: write WritingTestScore
3-level SES scale (SES): Note: SES scale is ordinal categorical, but we
are considering it to be continuous in this example.
Coefficients(a)

                        Unstandardized        Standardized                   95% Confidence
                        Coefficients          Coefficients                   Interval for B
Model                   B        Std. Error   Beta           t        Sig.   Lower Bound   Upper Bound
1  (Constant)           47.195   1.982                       23.814   .000   43.287        51.103
   ses Ordinal3Level    2.715    .910         .207           2.985    .003   .921          4.510
   SESScale
a. Dependent Variable: write WritingTestScore
In bivariate analysis, all 3 independent variables are statistically
significantly associated with writing score. Now we will enter all 3 into
one single multivariable (multiple) linear regression model.
Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .352   .124       .111                8.939

ANOVA(b)

Model           Sum of Squares   df    Mean Square   F       Sig.
1  Regression   2218.70          3     739.568       9.256   .000a
   Residual     15660.2          196   79.899
   Total        17878.9          199
a. Predictors: (Constant), ses Ordinal3LevelSESScale, weight WeightInKilograms, female DichotomousVariableForGender
b. Dependent Variable: write WritingTestScore
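The R Square in the Model Summary can be reproduced from the ANOVA sums of squares. A quick Python check, using only the values printed above:

```python
# R-square = SSregression / SStotal, with sums of squares taken from the
# ANOVA table for the 3-predictor model of writing score.
ss_regression = 2218.70
ss_residual = 15660.2
ss_total = 17878.9

print(round(ss_regression / ss_total, 3))     # 0.124, as in the Model Summary
print(round(ss_regression + ss_residual, 1))  # 17878.9, i.e. SStotal
```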
Weight is no longer statistically significant, probably because it is
associated with gender (girls are lighter than boys).
Visual comparison of linear regression (the red line) and logistic
regression (the blue curve). In the linear model, probability can go
below zero and above one. In the logistic model, probability is forced
to stay between zero and one. (LP = linear probability)
[Figure: modeled probability under the linear and logistic models.]
(Note: there are types of logistic regression other than binary logistic.
In these the dependent variable is categorical but not dichotomous. We
will discuss only binary logistic regression in this class.)
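The difference can be sketched in Python. The coefficients a = -4.0 and b = 0.8 below are hypothetical illustrative values, not from the slides:

```python
import math

# Linear probability model: the prediction a + b*x can fall below 0
# or rise above 1, which is impossible for a probability.
def linear_probability(x, a=-4.0, b=0.8):
    return a + b * x

# Binary logistic model: models the log odds, so the probability
# 1 / (1 + e^(-log odds)) always stays strictly between 0 and 1.
def logistic_probability(x, a=-4.0, b=0.8):
    return 1 / (1 + math.exp(-(a + b * x)))

print(linear_probability(0))               # -4.0: impossible probability
print(linear_probability(10))              # 4.0: also impossible
print(round(logistic_probability(0), 3))   # 0.018, inside (0, 1)
print(round(logistic_probability(10), 3))  # 0.982, inside (0, 1)
```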
Linear and logistic regression: some similarities and differences
Some similarities
Both enable us to evaluate more than one independent variable
simultaneously (at the same time).
Independent variables can be continuous or categorical in both linear
and logistic regression.
Both enable us to assess the relative importance of multiple independent
variables.
Both can give modeled magnitudes of effect, and confidence intervals
and p-values, for independent variables.
Both use statistical models (regression models or prediction models) of
the form, Y = a + b1x1 + b2x2 + ... + bnxn.
Both derive specific constant values of the model coefficients (a and
b1 to bn).
Both allow calculation of the predicted (modeled) level of the dependent
variable at specific levels of the independent variables.
Some differences between linear regression and binary logistic regression

Characteristic            Linear Regression                Logistic Regression
Y-intercept (constant)    Value of Y when all Xs           Log odds when all Xs equal zero
                          equal zero
Model coefficients (Bs)   Change in Y per unit             Log of the odds ratio per unit
                          increase in X                    increase in X*
* We calculate the exponential (antilog) of the coefficient to obtain the odds ratio per
unit increase in the independent variable. For example, if the coefficient is 0.66, the
modeled odds ratio per unit increase is e^0.66 = 1.93.
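This antilog step in Python, using the footnote's example coefficient of 0.66:

```python
import math

# Odds ratio per unit increase in X = e raised to the logistic coefficient.
coefficient = 0.66
odds_ratio = math.exp(coefficient)
print(round(odds_ratio, 2))  # 1.93
```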
Some differences between linear and binary logistic regression (continued)
Characteristic                  Linear Regression            Logistic Regression
Statistic used in calculating   t-statistic with 2-tailed    1.96 (z-statistic from the
the 95% confidence interval     probability = 0.05 and       normal distribution, for which
                                degrees of freedom = the     2-tailed probability = 0.05)
                                residual d.f.*
With logistic regression we are not modeling the writing score directly,
as we did with linear regression.
Instead, we are modeling the log of the odds of having a high writing
score. We then use this model information to calculate probability of
having a high writing score, using the method given previously (log odds
to odds to probability).
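A Python sketch of that chain (log odds to odds to probability), using the fitted coefficients from the logistic output shown later in these slides (Constant = -1.841, female = 1.246, ses = .864); the input ses = 2 is just an illustrative value:

```python
import math

# Fitted model from the slides: log odds = -1.841 + 1.246*female + 0.864*ses
def log_odds(female, ses):
    return -1.841 + 1.246 * female + 0.864 * ses

def probability(female, ses):
    odds = math.exp(log_odds(female, ses))  # odds = e^(log odds)
    return odds / (1 + odds)                # probability = odds / (1 + odds)

# Modeled probability of a high writing score for female = 1, ses = 2:
print(round(probability(1, 2), 2))  # about 0.76
```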
The modeled odds ratio for having a higher writing score, per kilogram increase in
weight, is 0.897. The modeled OR for having a higher writing score, per unit
increase in SES, is 1.979.
In the bivariate analysis in logistic regression, as in linear regression, all 3
independent variables are significantly associated with the dependent variable.
Here is the output for the logistic regression model in which all 3
independent variables are included.
                                                             95.0% C.I. for EXP(B)
             B        S.E.   Wald     df   Sig.   Exp(B)   Lower   Upper
Step 1(a)
  female     1.246    .327   14.514   1    .000   3.477    1.831   6.601
  ses        .864     .232   13.821   1    .000   2.373    1.505   3.744
  Constant   -1.841   .547   11.314   1    .001   .159
a. Variable(s) entered on step 1: female, ses.
Again, the SEs are both < 1.
This is good.
NOTE: If your calculator does not accept negative logs, recall that
e^(negative log) = 1 / e^(positive log). For example, e^-3.143 = 1 / e^3.143. Your
calculator may be able to calculate 1 / e^(positive log).
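The identity is easy to confirm in Python:

```python
import math

# e^(-3.143) and 1 / e^(3.143) are the same number.
direct = math.exp(-3.143)
via_reciprocal = 1 / math.exp(3.143)
print(abs(direct - via_reciprocal) < 1e-12)  # True
```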