Chapter 14
Multiple Regression and Correlation Analysis
1.
a.
b.
c.
2.
a.
b.
c.
d.
e.
3.
a.
b.
4.
a.
b.
c.
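5.
a. $s_{Y.12} = \sqrt{\dfrac{SSE}{n-(k+1)}} = \sqrt{\dfrac{583.693}{65-(2+1)}} = \sqrt{9.414} = 3.068$
b. $R^2 = \dfrac{SSR}{SS\ total} = \dfrac{661.6 - 583.693}{661.6} = .118$
The independent variables explain 11.8% of the variation.
c. $R^2_{adj} = 1 - \dfrac{SSE/[n-(k+1)]}{SS\ total/(n-1)} = 1 - \dfrac{583.693/[65-(2+1)]}{661.6/(65-1)} = 1 - \dfrac{9.414}{10.3375} = 1 - .911 = .089$
6.
a. $s_{Y.12345} = \sqrt{\dfrac{SSE}{n-(k+1)}} = \sqrt{\dfrac{2647.38}{52-(5+1)}} = \sqrt{57.55} = 7.59$
b. $R^2 = \dfrac{SSR}{SS\ total} = \dfrac{6357.38 - 2647.38}{6357.38} = .584$
The independent variables explain 58.4% of the variation.
c. $R^2_{adj} = 1 - \dfrac{SSE/[n-(k+1)]}{SS\ total/(n-1)} = 1 - \dfrac{2647.38/46}{6357.38/51} = 1 - \dfrac{57.55}{124.65} = 1 - .462 = .538$
7.
a.
b.
c.
d.
e.
f.
8.
a.
b.
c.
d.
e.
9.
a.
b.
c.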
d.
10.
The regression equation is: Commissions = -315 + 2.15 Calls + 0.0941 Miles - 0.000556 X1X2
Predictor   Coef         SE Coef     T       P
Constant    -314.6       162.6       -1.93   0.067
Calls       2.150        1.159       1.86    0.078
Miles       0.09410      0.05944     1.58    0.128
X1X2        -0.0005564   0.0004215   -1.32   0.201
The t value corresponding to the interaction term is -1.32, with a p-value of 0.201. This is
not significant, so we conclude there is no interaction.
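To make the meaning of the interaction term in problem 10 concrete, here is a minimal Python sketch. It assumes that X1X2 is the product Calls x Miles, which the output does not state outright, and the function names and the mileages in the loop are hypothetical. With an interaction term in the model, the fitted effect of one more call is no longer constant but depends on miles driven.

```python
# Coefficients copied from the problem 10 output table.
B0 = -314.6
B_CALLS = 2.150
B_MILES = 0.09410
B_INTERACTION = -0.0005564  # assumed to multiply Calls * Miles


def predict_commissions(calls: float, miles: float) -> float:
    """Fitted commissions under the assumed interaction model."""
    return B0 + B_CALLS * calls + B_MILES * miles + B_INTERACTION * calls * miles


def effect_of_one_more_call(miles: float) -> float:
    """Change in fitted commissions per additional call at a given mileage."""
    return B_CALLS + B_INTERACTION * miles


if __name__ == "__main__":
    # Hypothetical mileages, chosen only to show how the slope shifts.
    for miles in (0, 1000, 2000):
        print(miles, round(effect_of_one_more_call(miles), 4))
```

Because the interaction coefficient was not significant, the solution drops it; the sketch only illustrates what the term would have meant had it been retained.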
11.
a.
b.
c.
In the stepwise procedure, the number of bidders enters the equation first. Then the
interaction term enters. The variable age would not be included as it is not significant.
Response is Price on 3 predictors, with N = 25
Step         1       2
Constant     4507    4540
Bidders      -57     -256
T-Value      -3.53   -5.59
P-Value      0.002   0.000
X1X2                 2.25
T-Value              4.49
P-Value              0.000
S            295     218
R-Sq         35.11   66.14
R-Sq(adj)    32.29   63.06
12.
The number of family members would enter first. Then their income would come in next.
None of the other variables is significant.
Response is Size on 4 predictors, with N = 10
Step         1        2
Constant     1271.6   712.7
Family       541      372
T-Value      6.12     3.59
P-Value      0.000    0.009
Income                12.1
T-Value               2.26
P-Value               0.059
S            336      273
R-Sq         82.42    89.82
R-Sq(adj)    80.22    86.92
13.
a. n = 40
b. 4
c. $R^2 = 750/1250 = 0.60$
d. $s_{Y.1234} = \sqrt{500/35} = 3.7796$
e. $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$
$H_1$: Not all $\beta$s equal 0
$H_0$ is rejected if F > 2.65
$F = \dfrac{750/4}{500/35} = 13.125$
$H_0$ is rejected. At least one $\beta_i$ does not equal zero.
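The arithmetic in problem 13 can be checked with a short Python sketch. The function name and dictionary keys are illustrative only. The inputs are the sums of squares (SSR = 750, SSE = 500), n = 40 and k = 4 from the solution, and the critical value 2.65 is taken from the solution rather than computed.

```python
import math


def regression_summary(ssr: float, sse: float, n: int, k: int) -> dict:
    """R-squared, standard error of estimate, and global F from ANOVA sums of squares."""
    ss_total = ssr + sse
    df_error = n - (k + 1)
    return {
        "r2": ssr / ss_total,
        "std_error": math.sqrt(sse / df_error),
        "f": (ssr / k) / (sse / df_error),
    }


summary = regression_summary(ssr=750, sse=500, n=40, k=4)
CRITICAL_F = 2.65  # critical value quoted in the solution
reject_h0 = summary["f"] > CRITICAL_F

if __name__ == "__main__":
    print(summary, reject_h0)
```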
14.
$H_0: \beta_1 = 0$          $H_0: \beta_2 = 0$
$H_1: \beta_1 \neq 0$       $H_1: \beta_2 \neq 0$
$H_0$ is rejected if t < -2.074 or t > 2.074
$t = \dfrac{2.676}{0.56} = 4.779$          $t = \dfrac{0.880}{0.710} = 1.239$
The second variable can be deleted.
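The decision rule in problem 14 can be sketched in Python as follows. The variable labels `b1` and `b2` are illustrative names for the two coefficients, and the critical value 2.074 is copied from the solution rather than computed.

```python
CRITICAL_T = 2.074  # two-tailed critical value quoted in the solution

# (coefficient, standard error) pairs from problem 14; labels are illustrative.
coefficients = {"b1": (2.676, 0.56), "b2": (0.880, 0.710)}

# t ratio = coefficient / standard error of the coefficient
t_ratios = {name: coef / se for name, (coef, se) in coefficients.items()}

# Keep a variable only when |t| exceeds the critical value.
keep = {name: abs(t) > CRITICAL_T for name, t in t_ratios.items()}

if __name__ == "__main__":
    for name in coefficients:
        print(name, round(t_ratios[name], 3), "keep" if keep[name] else "delete")
```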
15.
a. n = 26
b. $R^2 = 100/140 = 0.7143$
c. 1.4142, found by $\sqrt{2}$
d. $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$     $H_1$: Not all $\beta$s are 0
Reject $H_0$ if F > 2.71
Computed F = 10.0. Reject $H_0$. At least one regression coefficient is not zero.
e. $H_0$ is rejected in each case if t < -2.086 or t > 2.086.
X1 and X5 should be dropped.
16.
$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$
$H_1$: Not all $\beta$s equal zero.
df1 = 5
df2 = 20 - (5 + 1) = 14, so $H_0$ is rejected if F > 2.96
Source        SS       df   MS       F
Regression    448.28   5    89.656   17.58
Error         71.40    14   5.10
Total         519.68   19
So $H_0$ is rejected. Not all the regression coefficients equal zero.
17.
a. $28,000
b. 0.5809, found by $R^2 = \dfrac{SSR}{SS\ total} = \dfrac{3050}{5250}$
c. 9.199, found by $\sqrt{84.62}$
d. $H_0$ is rejected if F > 2.97 (approximately). Computed F = 1016.67/84.62 = 12.01.
$H_0$ is rejected. At least one regression coefficient is not zero.
e. If computed t is to the left of -2.056 or to the right of 2.056, the null hypothesis in
each of these cases is rejected. Computed t for X2 and X3 exceed the critical value.
Thus, population and advertising expenses should be retained and number of
competitors, X1, dropped.
18.
a. The strongest relationship is between sales and income (0.964). A problem could
occur if both cars and outlets are part of the final solution. Also, outlets and
income are strongly correlated (0.825). This is called multicollinearity.
b. $R^2 = \dfrac{1593.81}{1602.89} = 0.9943$
c. $H_0$ is rejected. At least one regression coefficient is not zero. The computed value of
F is 140.42.
d. Delete outlets and bosses. Critical values are -2.776 and 2.776.
e. $R^2 = \dfrac{1593.66}{1602.89} = 0.9942$
f.
g.
19.
a.
20.
a.
b.
c.
d.
e.
f.
g.
There does not appear to be a pattern to the residuals according to the following
MINITAB plot.
[MINITAB plot: residuals (Y - Y') versus fitted values Y']
21.
a. The correlation of Screen and Price is 0.893. So there does appear to be a linear
relationship between the two.
b. Price is the dependent variable.
c. The regression equation is Price = -2484 + 101 Screen. For each inch increase in
screen size, the price increases $101 on average.
d. Using dummy indicator variables for Sharp and Sony, the regression equation is
Price = -2308 + 94.1 Screen + 15 Manufacturer Sharp + 381 Manufacturer Sony
Holding screen size constant, Sharp can obtain on average $15 more than Samsung, and
Sony can collect an additional $381.
e. Here is some of the output.
Predictor            Coef      SE Coef   T       P
Constant             -2308.2   492.0     -4.69   0.000
Screen               94.12     10.83     8.69    0.000
Manufacturer_Sharp   15.1      171.6     0.09    0.931
Manufacturer_Sony    381.4     168.8     2.26    0.036
The p-value for Sharp is relatively large, so the null hypothesis that its coefficient is
zero would not be rejected. That means Sharp may not have any real advantage over
Samsung. On the other hand, the p-value for the Sony coefficient is quite small. That
indicates the difference is unlikely to be due to chance and there is some real advantage
to Sony over Samsung.
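The indicator-variable equation in problem 21 can be read as three parallel lines, one per manufacturer, with Samsung as the baseline. The sketch below uses the coefficients from the output table; the function name and the 40-inch example are hypothetical.

```python
# Coefficients from the problem 21 output table (Samsung is the baseline brand).
COEF = {"Constant": -2308.2, "Screen": 94.12, "Sharp": 15.1, "Sony": 381.4}


def predict_price(screen_inches: float, manufacturer: str) -> float:
    """Fitted price; each indicator is 1 for its brand and 0 otherwise."""
    sharp = 1 if manufacturer == "Sharp" else 0
    sony = 1 if manufacturer == "Sony" else 0
    return (COEF["Constant"] + COEF["Screen"] * screen_inches
            + COEF["Sharp"] * sharp + COEF["Sony"] * sony)


if __name__ == "__main__":
    # Hypothetical 40-inch set, priced for each manufacturer.
    for brand in ("Samsung", "Sharp", "Sony"):
        print(brand, round(predict_price(40, brand), 2))
```

At any fixed screen size the gap between two brands is just the difference of their indicator coefficients, which is why the solution can quote $15 and $381 without naming a screen size.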
f.
[Histogram of the residuals: Frequency versus RESI1]
g.
[Plot: RESI1 versus FITS1]
22.
a. The correlation of Income and Median Age = 0.721. So there does appear to be a
linear relationship between the two.
b. Income is the dependent variable.
c. The regression equation is Income = 22805 + 362 Median Age. For each year
increase in age, the income increases $362 on average.
d. Using an indicator variable for a coastal county, the regression equation is
Income = 29619 + 137 Median Age + 10640 Coastal.
That changes the estimated effect of an additional year of age to $137 and the
effect of living in the coastal area as adding $10,640 to income.
e. Notice the p-values are virtually zero on the output following. That indicates both
variables are significant influences on income.
Predictor     Coef      SE Coef   T          P
Constant      29619.2   0.3       90269.67   0.000
Median Age    136.887   0.008     18058.14   0.000
Coastal       10639.7   0.2       54546.92   0.000
f.
[Histogram of the residuals RESI2: Mean -3.23376E-12, StDev 0.2005, N 9]
The variation is about the same across the different fitted values.
[Plot: Residuals vs. Fits, RESI2 versus FITS2]
23.
a.
[Scatterplot of Sales vs Advertising, Accounts, Competitors, Potential]
Sales seem to fall with the number of competitors and rise with the number of accounts and
potential.
b.
Pearson correlations
              Sales    Advertising   Accounts   Competitors
Advertising   0.159
Accounts      0.783    0.173
Competitors   -0.833   -0.038        -0.324
Potential     0.407    -0.071        0.468      -0.202
The number of accounts and the market potential are moderately correlated.
c.
The regression equation is
Sales = 178 + 1.81 Advertising + 3.32 Accounts - 21.2 Competitors + 0.325 Potential
Predictor     Coef       SE Coef   T        P
Constant      178.32     12.96     13.76    0.000
Advertising   1.807      1.081     1.67     0.109
Accounts      3.3178     0.1629    20.37    0.000
Competitors   -21.1850   0.7879    -26.89   0.000
Potential     0.3245     0.4678    0.69     0.495
S = 9.60441   R-Sq = 98.9%   R-Sq(adj) = 98.7%
Analysis of Variance
Source           DF   SS       MS      F        P
Regression       4    176777   44194   479.10   0.000
Residual Error   21   1937     92
Total            25   178714
The computed F value is quite large. So we can reject the null hypothesis that all of
the regression coefficients are zero. We conclude that some of the independent
variables are effective in explaining sales.
d.
Market potential and advertising have large p-values (0.495 and 0.109, respectively).
You would probably cut them.
e.
f.
[Histogram of the Residuals (response is Sales): Frequency versus Residual]
The histogram looks to be normal. There are no problems shown in this plot.
g.
The variance inflation factor for both variables is 1.1. Both values are less than 10,
which indicates the independent variables are not strongly correlated with each other,
so there is no problem.
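The VIF of 1.1 in problem 23 can be reproduced from the correlation matrix. This sketch assumes the two variables left in the model are Accounts and Competitors (the ones not flagged for removal), whose Pearson correlation is -0.324. With only two independent variables, the R-squared from regressing one on the other is simply the squared correlation, so VIF = 1 / (1 - r^2).

```python
def vif_two_predictors(r: float) -> float:
    """Variance inflation factor when the model has exactly two predictors.

    With two predictors, regressing one on the other gives R^2 = r^2,
    so VIF = 1 / (1 - r^2) and is the same for both variables.
    """
    return 1.0 / (1.0 - r ** 2)


# Assumed pair: Accounts and Competitors, correlation from the Pearson table.
R_ACCOUNTS_COMPETITORS = -0.324
vif = vif_two_predictors(R_ACCOUNTS_COMPETITORS)

if __name__ == "__main__":
    print(round(vif, 1))
```

For models with more than two predictors the same formula applies, but the R-squared must come from regressing each predictor on all the others.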
24.
a.
b.
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_1$: Not all $\beta$s = 0
$F = \dfrac{11.3437/3}{2.2446/21} = 35.38$
All coefficients are significant. Do not delete any.
The residuals appear to be random. No problem.
[Plot: Residual versus Fitted]
c.
d.
e.
25.
a.
b.
c.
d.
e.
$R^2 = \dfrac{21.182}{23.857} = 0.888$
[Histogram of the residuals: Frequency versus Residual]
f.
The variance is the same as we move from small values to large. So there is no
heteroscedasticity problem; the homoscedasticity assumption is met.
[Plot: RESI1 versus FITS2]
26.
a. Y = 28.2 + 0.0287X1 + 0.650X2 - 0.049X3 - 0.004X4 + 0.723X5
b. $R^2 = 0.750$
c & d.
Predictor   Coef        Stdev      t-ratio   P
Constant    28.242      2.986      9.46      0.000
Value       0.028669    0.004970   5.77      0.000
Years       0.6497      0.2412     2.69      0.014
Age         -0.04895    0.03126    -1.57     0.134
Mortgage    -0.000405   0.001269   -0.32     0.753
Sex         0.7227      0.2491     2.90      0.009
s = 0.5911   R-sq = 75.0%   R-sq(adj) = 68.4%
Analysis of Variance
Source       DF   SS        MS       F       p
Regression   5    19.8914   3.9783   11.39   0.000
Error        19   6.6390    0.3494
Total        24   26.5304
Consider dropping the variables age and mortgage.
e.
The new regression equation is: Y = 29.81 + 0.0253X1 + 0.41X2 + 0.708X5
The R-square value is 0.7161.
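The R-sq and R-sq(adj) figures reported for the full model in problem 26 follow directly from its ANOVA table. A minimal Python sketch, with illustrative function names, using SSE = 6.6390, SS total = 26.5304, n = 25 (total df of 24 plus 1) and k = 5:

```python
def r_squared(sse: float, ss_total: float) -> float:
    """Coefficient of multiple determination."""
    return 1 - sse / ss_total


def adjusted_r_squared(sse: float, ss_total: float, n: int, k: int) -> float:
    """R-squared adjusted for the number of independent variables k."""
    return 1 - (sse / (n - (k + 1))) / (ss_total / (n - 1))


r2 = r_squared(6.6390, 26.5304)
r2_adj = adjusted_r_squared(6.6390, 26.5304, n=25, k=5)

if __name__ == "__main__":
    print(round(r2, 3), round(r2_adj, 3))
```

The adjustment penalizes each added variable through the error degrees of freedom, which is why the adjusted value is noticeably lower than R-sq when two of the five variables contribute little.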
27.
a. $\hat{Y} = 651.9 + 13.422X_1 - 6.710X_2 + 205.65X_3 - 33.45X_4$
b. $R^2 = 0.433$, which is somewhat low for this type of study.
c. $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$
$H_1$: Not all $\beta$s = 0
Reject $H_0$ if F > 2.76
$F = \dfrac{1066830/4}{1398651/25} = 4.77$
$H_0$ is rejected. Not all the $\beta$s equal zero.
d. Using the 0.05 significance level, reject the hypothesis that the regression coefficient
is 0 if t < -2.060 or t > 2.060. Service and gender should remain in the analysis; age
and job should be dropped.
e. Following is the computer output using the independent variables service and gender.
Predictor   Coef     Stdev   t-ratio   P
Constant    784.2    316.8   2.48      0.020
Service     9.021    3.106   2.90      0.007
Gender      224.41   87.35   2.57      0.016
Analysis of Variance
Source       DF   SS        MS       F      p
Regression   2    998779    499389   9.19   0.001
Error        27   1466703   54322
Total        29   2465481
A man earns $224 more per month than a woman. The difference between technical
and clerical jobs is not significant.
28.
a.
b.
c.
d.
e.
$R^2 = \dfrac{18.41}{22.29} = 0.826$
[Histogram of the residuals: Frequency versus Residual]
f.
The variance is the same as we move from small values to large. So there is no
heteroscedasticity problem; the homoscedasticity assumption is met.
[Plot: Residuals Versus the Fitted Values (response is Food)]
29.
a.
b.
c.
d.
e.
[Histogram of the residuals: Frequency versus Residuals]
f.
[Plot: Residuals Versus the Fitted Values (response is P/E)]
30.
g.
a.
b.
c.
d.
e.
[Histogram of the Residuals (response is Tip): Frequency versus Residual]
[Plot: Residual versus Fitted Value]
31.
a.
The global F test shows there is a significant relationship between sales and the
number of commercials.
b.
[Histogram of the Residuals (response is Sales): Frequency versus Residual]
a.
b.
Source           SS       MS       F       P
Regression       36.677   36.677   41.25   0.000
Residual Error   11.559   0.889
Total            48.236
33.
a.
Source           SS           MS           F       P
Regression       5966725061   1988908354   39.83   0.000
Residual Error   798944439    49934027
Total            6765669500
The computed F is 39.83. It is much larger than the critical value 3.2388. The p-value
is also quite small. Thus the null hypothesis that all the regression coefficients
are zero can be rejected. At least one of the multiple regression coefficients is
different from zero.
b.
Predictor         Coef      SE Coef   T       P
Constant          -118929   19734     -6.03   0.000
Loan              1.6268    0.1809    8.99    0.000
Monthly Payment   2.06      14.95     0.14    0.892
Payments Made     50.3      134.9     0.37    0.714
The null hypothesis is that the coefficient is zero in the individual test. It would be
rejected if t is less than -2.1199 or more than 2.1199. In this case the t value for the
loan variable is larger than the critical value. Thus, it should not be removed.
However, the monthly payment and payments made variables would likely be
removed.
c.
The revised regression equation is: Auction Price = -119893 + 1.67 Loan.
34.
35.
f.
g.
h.
[Character plot: RESI1 versus FITS1, with FITS1 running from 175 to 275]
36.
a.
The equation is: Wins = 54.05 + 318.5 Batting + 0.0057 SB - 0.200 Errors
- 11.656 ERA + 0.0702 HR - 0.478 Surface
For each additional point that the team batting average increases, the number
of wins goes up by 0.318. Each stolen base adds 0.0057 to average wins. Each error
lowers the average number of wins by 0.200. An increase of one in the ERA
decreases the number of wins by 11.7. Each home run adds 0.0702 wins. Playing on
an artificial surface lowers the number of wins by 0.48.
b.
$R^2 = \dfrac{1640}{2505} = 0.655$. Sixty-five percent of the variation in the number of wins is
explained by the six independent variables.
c.
          Wins     Batting   SB       Errors   ERA     HR
Batting   0.496
SB        0.147    0.386
Errors    -0.342   -0.224    -0.058
ERA       -0.542   -0.027    0.103    -0.067
HR        0.187    0.042     0.062    0.093    0.052
Surface   -0.184   -0.126    0.027    0.118    0.161   -0.096
Batting and ERA have the strongest correlations with winning. Stolen bases and surface
have the weakest correlations with winning. The highest correlation among the
independent variables is 0.386. So multicollinearity does not appear to be a problem.
d.
To conduct the global test: $H_0: \beta_1 = \beta_2 = \dots = \beta_6 = 0$, $H_1$: Not all $\beta$s = 0
At the 0.05 significance level, $H_0$ is rejected if F > 2.528.
Source           DF   SS       MS       F      P
Regression       6    1639.6   273.27   7.26   0.000
Residual Error   23   865.3    37.62
Total            29   2504.9
Since the p-value is virtually zero, you can reject the null hypothesis that all of the
net regression coefficients are zero and conclude at least one of the variables has a
significant relationship to winning.
e.
Surface would be deleted first because of its large p-value.
Predictor   Coef       SE Coef   T       P
Constant    54.05      34.17     1.58    0.127
Batting     318.5      111.0     2.87    0.009
SB          0.00569    0.03687   0.15    0.879
Errors      -0.19986   0.08147   -2.45   0.022
ERA         -11.656    2.578     -4.52   0.000
HR          0.07024    0.03815   1.84    0.079
Surface     -0.478     3.866     -0.12   0.903
The p-values for stolen bases and surface are large. Think about deleting these two
variables.
f.
The reduced regression equation is:
Wins = 57.88 + 3336.6 Avg - 0.182 Errors - 11.264 ERA
g.
[Histogram of the residuals: Frequency versus RESI1]
h.
[Plot: RESI1 versus FITS1]
37.
a.
b.
c.
        Wage    Ed      Fe      Ex      Age
Ed      0.408
Fe      0.356   0.082
Ex      0.071   0.440   0.051
Age     0.167   0.253   0.037   0.980
Union   0.052   0.083   0.128   0.236   0.273
Education has the strongest correlation (0.408) with wage. Age and experience are very
highly correlated (0.980). Age is removed to avoid multicollinearity.
d.
Analysis of Variance
Source           DF   SS            MS           F       P
Regression       4    10395077453   2598769363   13.69   0.000
Residual Error   95   18038129030   189875042
Total            99   28433206483
The p-value is so small we reject the null hypothesis of no significant coefficients and
conclude at least one of the variables is a predictor of annual wage.
e.
Predictor   Coef     SE Coef   T       P
Constant    14174    8720      1.63    0.107
Ed          3325.1   566.0     5.87    0.000
Fe          -11675   2796      -4.18   0.000
Ex          448.0    119.7     3.74    0.000
Union       5355     3813      1.40    0.163
Union membership does not appear to be significant.
f. The reduced regression equation is: Wage = 12228 + 3160 Ed - 11149 Fe + 396 Ex.
g.
[Histogram of the residuals: Frequency versus RESI1]
The residuals appear to be approximately normally distributed.
h.
[Plot: RESI1 versus FITS1]
38.
a.
The regression equation is: Unemployment = 79.0 + 0.439 65 & over - 0.423 Life
Expectancy - 0.465 Literacy %.
b.
RSq = 0.555
c.
            Unemploy   65 & over   Life Exp
65 & over   0.467
Life Exp    0.637      0.640
Literacy    0.684      0.794       0.681
The correlation between the percent of the population over 65 and the literacy rate is
0.794, which is above the 0.7 rule of thumb level, indicating multicollinearity. One
of those two variables should be dropped.
d.
Analysis of Variance
Source           DF   SS        MS       F       P
Regression       3    1316.92   438.97   15.38   0.000
Residual Error   37   1056.16   28.54
Total            40   2373.08
The p-value is so small we reject the null hypothesis of no significant coefficients and
conclude at least one of the variables is a predictor of unemployment.
e.
Predictor   Coef      SE Coef   T       P
Constant    78.98     11.93     6.62    0.000
65 & over   0.4390    0.2785    1.58    0.123
Life Exp    -0.4228   0.1698    -2.49   0.017
Literacy    -0.4647   0.1345    -3.46   0.001
The variable percent of the population over 65 appears to be insignificant because its
p-value is well above 0.10. Hence it should be dropped from the analysis.
f.
The regression equation is:
Unemployment = 67.0 - 0.357 Life Expectancy - 0.334 Literacy %
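The multicollinearity screen used in part c of problem 38 compares each pairwise correlation among the independent variables with the 0.7 rule of thumb. A minimal Python sketch of that check follows; the dictionary layout and variable names are illustrative, and the correlations are entered as they appear in the correlation table.

```python
RULE_OF_THUMB = 0.7  # threshold quoted in the solution

# Pairwise correlations among the independent variables only.
correlations = {
    ("65 & over", "Life Exp"): 0.640,
    ("65 & over", "Literacy"): 0.794,
    ("Life Exp", "Literacy"): 0.681,
}

# Flag any pair whose absolute correlation exceeds the threshold.
flagged = [pair for pair, r in correlations.items() if abs(r) > RULE_OF_THUMB]

if __name__ == "__main__":
    for first, second in flagged:
        print(f"Possible multicollinearity: {first} and {second}")
```

Only one pair is flagged, which matches the advice to drop one of those two variables.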
g.
[Histogram of the residuals: Frequency versus RESI1]
[Plot: RESI1 versus FITS2]
There may be some heteroscedasticity as the variance seems larger for the bigger values.