Assignment 1 EMET2007

Jaron Lee
April 23, 2014
1
1.a
$X \sim N(2, 3^2)$.
$P(X > -3) = 0.9522$
$P(X < 7) = 0.9522$
$$\begin{aligned}
P(|X - 2| > 2) &= P(X - 2 > 2) + P(-X + 2 > 2) \\
&= P(X > 4) + P(X < 0) \\
&= 0.2525 + 0.2525 \\
&= 0.505
\end{aligned}$$
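The three probabilities above can be cross-checked numerically. The following is a minimal Python sketch using only the standard library's `statistics.NormalDist`; the variable names are illustrative and not part of the original working.

```python
from statistics import NormalDist

# X ~ N(mean 2, standard deviation 3), as stated in 1.a
X = NormalDist(mu=2, sigma=3)

p_gt_minus3 = 1 - X.cdf(-3)           # P(X > -3)
p_lt_7 = X.cdf(7)                     # P(X < 7)
p_abs = (1 - X.cdf(4)) + X.cdf(0)     # P(|X - 2| > 2) = P(X > 4) + P(X < 0)

print(round(p_gt_minus3, 4), round(p_lt_7, 4), round(p_abs, 4))
```

By symmetry about the mean, the first two probabilities coincide, and the two tails in the third are equal.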
1.b
We want to find c such that $P(|T| > c) = 0.05$, i.e. $P(T > c) = P(T < -c) = 0.025$. We calculate c first:
[1] 2.446912
Now we find c such that $P(T > c) = 0.05$.
[1] 1.94318
1.c
We want to find c such that $P(\chi^2 > c) = 0.05$ for df = 2. Then c is given by:
[1] 5.991465
We want to find d such that $P(\chi^2 > d) = 0.01$. Then d is given by:
[1] 9.21034
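For df = 2 specifically, the chi-square distribution is an exponential distribution with mean 2, so $P(\chi^2 > c) = e^{-c/2}$ and the critical value has the closed form $c = -2\ln\alpha$. The sketch below uses that special case as an independent check on the two values above; the function name is hypothetical, and the shortcut does not apply to other degrees of freedom.

```python
from math import exp, log


def chi2_df2_critical(alpha):
    """Upper-tail critical value c with P(chi2 > c) = alpha, valid for df = 2 only."""
    return -2.0 * log(alpha)


c = chi2_df2_critical(0.05)
d = chi2_df2_critical(0.01)

# Sanity check: the df = 2 survival function exp(-c/2) recovers the tail probability.
tail_c = exp(-c / 2)

print(round(c, 6), round(d, 5))
```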
1.d
We want to find c such that $P(F_{3,50} > c) = 0.05$. Then c is given by:
[1] 2.790008
2
2.a
For ATNDRTE, the min, max and mean values are respectively:
[1] 6.25
[1] 100
[1] 81.70956
For PRIGPA, the min, max and mean values are respectively:
[1] 0.857
[1] 3.93
[1] 2.586775
For ACT the min, max and mean values are respectively:
[1] 13
[1] 32
[1] 22.51029
2.b
Using a linear model we obtain the following summary:
Call:
lm(formula = ATNDRTE ~ PRIGPA + ACT, data = attend)
Residuals:
Min 1Q Median 3Q Max
-65.373 -6.765 2.125 9.635 29.615
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 75.700 3.884 19.49 <2e-16 ***
PRIGPA 17.261 1.083 15.94 <2e-16 ***
ACT -1.717 0.169 -10.16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.38 on 677 degrees of freedom
Multiple R-squared: 0.2906, Adjusted R-squared: 0.2885
F-statistic: 138.7 on 2 and 677 DF, p-value: < 2.2e-16
The intercept represents the predicted percentage of classes attended when the cumulative GPA prior to term and
the ACT score are both zero. This intercept is unlikely to be meaningful in practice, because the predictor values
(i.e. 0) required to achieve it fall outside the range of values for the predictors as seen in 2a).
2.c
The slope of PRIGPA represents the change in percentage points of classes attended given a one point
increase in the prior cumulative GPA, all other factors constant. The given value of 17.261 indicates that
a higher prior cumulative GPA is associated with a higher percentage of classes attended. This is unsurprising:
intuitively, well-performing students attend class.
The slope of ACT represents the change in percentage points of classes attended given a one point
increase in the ACT score, all other factors constant. The value given is -1.717. This is surprising, as it
indicates that students who perform better on the standardised test tend to attend fewer classes. We would
expect a relationship similar to PRIGPA above. However, the magnitude is fairly small, so not much weight
should be put on interpreting it.
2.d
The prediction function returns a result of 104.3705, i.e. 104.37% of classes attended. This does not make sense,
since you cannot attend more than all your classes. In this case, the predicted value for ATNDRTE falls outside
the range of ATNDRTE in the dataset. By inspection of the data we find there exists a student with these values
of the explanatory variables: student 569 obtained ACT = 20, PRIGPA = 3.65, but the percentage of classes attended
was 87.5%.
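As a rough check, the prediction can be reproduced by plugging PRIGPA = 3.65 and ACT = 20 into the fitted equation from 2.b. The sketch below uses the coefficients as rounded in the printed summary, so it agrees with 104.3705 only to about two decimal places; the variable names are illustrative.

```python
# Coefficients as printed (rounded) in the 2.b regression summary
b_intercept = 75.700
b_prigpa = 17.261
b_act = -1.717

# Explanatory values quoted for student 569
prigpa = 3.65
act = 20

predicted_atndrte = b_intercept + b_prigpa * prigpa + b_act * act
observed_atndrte = 87.5

print(round(predicted_atndrte, 2))
print(predicted_atndrte > 100)  # the prediction exceeds the feasible maximum of 100%
```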
2.e
The prediction function returns the difference in class attendance rates (A - B) as 25.8434. Thus the difference
in class attendance rates is 25.84 percentage points.
2.f
[Figure 1: Plot of the residuals against PRIGPA. Horizontal axis: PRIGPA, ticks from 1.0 to 4.0; vertical axis: resid(eqn1), ticks from -60 to 20.]
The relationship between the residuals and PRIGPA is very weak, perhaps slightly negative. However,
the range of the residuals is not symmetric about zero: there are a few points lying very far in the negative
direction. The spread of residuals is largest for PRIGPA between 2 and 3. The implication is that the variance of
the residuals is not constant across values of PRIGPA, i.e. the errors appear heteroskedastic. This is cause for
concern as it violates the homoskedasticity assumption of the DGP for the multiple linear regression model.
2.g
Call:
lm(formula = TERMGPA ~ ATNDRTE + PRIGPA + ACT, data = attend)
Residuals:
Min 1Q Median 3Q Max
-1.71915 -0.27337 0.01315 0.29626 1.55887
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.060329 0.168597 -6.289 5.73e-10 ***
ATNDRTE 0.017412 0.001335 13.040 < 2e-16 ***
PRIGPA 0.574736 0.044125 13.025 < 2e-16 ***
ACT 0.033403 0.006303 5.299 1.57e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4996 on 676 degrees of freedom
Multiple R-squared: 0.5421, Adjusted R-squared: 0.54
F-statistic: 266.7 on 3 and 676 DF, p-value: < 2.2e-16
The estimated coefficient on ACT represents the change in term GPA given a one unit change in the ACT
score, all other factors held constant. It is very small because we would not expect the results of a single college
entrance examination to have much of an impact on the current GPA of a student years later. As we have seen
earlier the ACT is a poor predictor of PRIGPA, so we would expect it to also perform poorly for term GPA.
2.h
If we hold all other factors constant, then the following holds: $\Delta TERMGPA = (0.0174)\,\Delta ATNDRTE$. At the
median, ATNDRTE = 87.5, so $\Delta TERMGPA = 0.1(0.0174)(87.5)$. Hence the change in TERMGPA = 0.1524
units at the median of the variables.
2.i
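$$ATNDRTE = \alpha_0 + \alpha_1 PRIGPA + \alpha_2 ACT + \varepsilon$$
$$TERMGPA = \beta_0 + \beta_1 ATNDRTE + \beta_2 PRIGPA + \beta_3 ACT + u$$
Thus after substituting appropriately the estimated model for TERMGPA will look like:
$$TERMGPA = \beta_0 + \beta_1\alpha_0 + (\beta_1\alpha_1 + \beta_2)\,PRIGPA + (\beta_1\alpha_2 + \beta_3)\,ACT$$
Let:
$$\begin{aligned}
\gamma_0 &= \beta_0 + \beta_1\alpha_0 && \text{New intercept} \\
\gamma_1 &= \beta_1\alpha_1 + \beta_2 && \text{Coefficient on PRIGPA} \\
\gamma_2 &= \beta_1\alpha_2 + \beta_3 && \text{Coefficient on ACT}
\end{aligned}$$
Then using the coefficients from above we get:
$$\begin{aligned}
\gamma_0 &= -1.0603 + 0.0174 \times 75.7004 = 0.2578 \\
\gamma_1 &= 0.0174 \times 17.2606 + 0.5747 = 0.8753 \\
\gamma_2 &= 0.0174 \times (-1.7166) + 0.0334 = 0.0035
\end{aligned}$$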
2.j
We estimate the model as follows:
Call:
lm(formula = TERMGPA ~ PRIGPA + ACT, data = attend)
Residuals:
Min 1Q Median 3Q Max
-2.31498 -0.29553 0.04514 0.33470 1.69377
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.257753 0.150849 1.709 0.088 .
PRIGPA 0.875274 0.042065 20.808 <2e-16 ***
ACT 0.003514 0.006564 0.535 0.593
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5585 on 677 degrees of freedom
Multiple R-squared: 0.4269, Adjusted R-squared: 0.4252
F-statistic: 252.1 on 2 and 677 DF, p-value: < 2.2e-16
The estimates agree very closely with the calculations in part i).
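The agreement can be verified by composing the two fitted equations numerically. The sketch below substitutes the ATNDRTE equation into the TERMGPA equation using the printed coefficients from parts g) and i), and compares the result with the estimates printed in part j). The variable names are illustrative, and small discrepancies are expected because the inputs are rounded.

```python
# ATNDRTE ~ PRIGPA + ACT (values as quoted in part i)
a0, a1, a2 = 75.7004, 17.2606, -1.7166

# TERMGPA ~ ATNDRTE + PRIGPA + ACT (part g summary)
b0, b1, b2, b3 = -1.060329, 0.017412, 0.574736, 0.033403

# Substitute the attendance equation into the term GPA equation
g0 = b0 + b1 * a0   # new intercept
g1 = b1 * a1 + b2   # coefficient on PRIGPA
g2 = b1 * a2 + b3   # coefficient on ACT

# Direct estimates from TERMGPA ~ PRIGPA + ACT (part j summary)
direct = (0.257753, 0.875274, 0.003514)

for composed, estimated in zip((g0, g1, g2), direct):
    print(round(composed, 4), round(estimated, 4))
```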
2.k
Part j) suggests that the effects of omitting ATNDRTE from the TERMGPA model are fairly minimal. These
results appear to suggest that PRIGPA and ACT together explain ATNDRTE sufficiently well that a suitable
equation in these variables may substitute for actual values of ATNDRTE.
However, the adjusted $R^2$ does fall from 0.54 to 0.43, meaning that the model in part j) explains the
variation in TERMGPA comparatively less well.
2.l
Call:
lm(formula = ATNDRTE ~ PRIGPA + ACT + SOPH + FROSH, data = attend)
Residuals:
Min 1Q Median 3Q Max
-64.382 -6.720 2.012 9.463 30.379
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.8777 4.1726 16.987 <2e-16 ***
PRIGPA 18.2016 1.1216 16.229 <2e-16 ***
ACT -1.6920 0.1681 -10.066 <2e-16 ***
SOPH 1.1008 1.4485 0.760 0.4475
FROSH 5.1710 1.7302 2.989 0.0029 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.29 on 675 degrees of freedom
Multiple R-squared: 0.3017, Adjusted R-squared: 0.2976
F-statistic: 72.92 on 4 and 675 DF, p-value: < 2.2e-16
SOPH returns a p-value of 0.4475. This means we fail to reject the null hypothesis that the coefficient for SOPH
is zero (two-tailed test). This variable is not statistically significant, implying that being a sophomore or not does
not have a significant effect on the intercept.
FROSH returns a p-value of 0.0029. This is statistically significant, as we can reject the null hypothesis (that
the coefficient for the FROSH variable is zero) at the $\alpha = 0.01$ level. Thus the intercept shift for freshmen
is significant.
Multicollinearity may occur here, since being a freshman and being a sophomore are not independent events.
Running a correlation test, we find that $r = -0.6419$. This is a fairly strong negative correlation. Thus a
joint hypothesis test may be necessary. Let the null be $H_0: \beta_{FROSH} = \beta_{SOPH} = 0$. We run the test at the
$\alpha = 0.05$ level. Then $F_{test} = 5.3881 > F_{critical} = 3.0091$. Hence we have enough evidence to reject the null:
the two variables are jointly significant in the model.
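One way to sanity-check the joint test is the $R^2$ form of the F statistic, $F = \frac{(R^2_{ur} - R^2_{r})/q}{(1 - R^2_{ur})/df_{ur}}$, using the printed $R^2$ of the model with SOPH and FROSH (unrestricted) and of the 2.b model without them (restricted). This is a sketch under the assumption that the reported statistic is the standard exclusion-restriction F test; because the printed $R^2$ values are rounded to four decimals, the result is close to, but not exactly, 5.3881.

```python
# Printed values from the two ATNDRTE regression summaries
r2_unrestricted = 0.3017   # ATNDRTE ~ PRIGPA + ACT + SOPH + FROSH
r2_restricted = 0.2906     # ATNDRTE ~ PRIGPA + ACT
q = 2                      # number of restrictions (SOPH and FROSH coefficients both zero)
df_unrestricted = 675      # residual degrees of freedom of the unrestricted model

f_stat = ((r2_unrestricted - r2_restricted) / q) / ((1 - r2_unrestricted) / df_unrestricted)

f_critical = 3.0091        # 5% critical value as quoted in the text
reject_null = f_stat > f_critical

print(round(f_stat, 2), reject_null)
```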
3
3.a
$$s = \sqrt{\frac{SSR}{n - 2}} = \sqrt{\frac{110}{100 - 2}} = 1.059$$
3.b
$$sleephours = 7.8 - 0.1 \times 8 = 7 \text{ hours of sleep per day}$$
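The arithmetic in 3.a and 3.b is short enough to check directly. A minimal sketch, with illustrative variable names:

```python
from math import sqrt

# 3.a: standard error of the regression from SSR = 110 and n = 100
ssr = 110
n = 100
s = sqrt(ssr / (n - 2))

# 3.b: predicted sleep for someone working 8 hours per day
work_hours = 8
sleep_hours = 7.8 - 0.1 * work_hours

print(round(s, 3), round(sleep_hours, 1))
```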
3.c
We first note that the standard error of the slope estimate is given by $s.e.(b_1) = \sqrt{\frac{SSR}{(n-2)\sum (x_i - \bar{x})^2}}$. Thus we
need the sum of squared deviations of the $x_i$ from $\bar{x}$ to calculate the standard error of the slope estimate.
3.d
Assume a 7 day working week.
The new model would look like: $sleephours = 54.6 - 0.1\,workhours$.
Note that only the intercept changes, but not the slope, since the change in units applies to both axes.
$R^2$ should not change, since the model has not fundamentally changed and explains the same proportion of
variation in the model. Consider that $R^2 = 1 - \frac{SSR}{SST}$. The change in units will scale both SSR and SST by the
same amount and so $R^2$ will not change.
Consider that $SSR = \sum e^2$. Since all the errors are now increased by a factor of 7, the new $SSR = \sum (7e)^2 = 49\,SSR = 5390$.
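These scaling claims can be illustrated numerically. The sketch below fits a simple regression by hand on a small made-up dataset (the daily figures are hypothetical, not the data behind the question), then refits after multiplying both variables by 7 to convert them to weekly units. The `ols` helper is an illustrative name.

```python
def ols(x, y):
    """Return (intercept, slope, SSR, R^2) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, ssr, 1 - ssr / sst


# Hypothetical daily hours, for illustration only
work_daily = [6.0, 7.5, 8.0, 9.0, 10.0]
sleep_daily = [7.4, 7.0, 6.9, 7.1, 6.6]

daily = ols(work_daily, sleep_daily)
weekly = ols([7 * w for w in work_daily], [7 * s for s in sleep_daily])

print("intercept ratio:", weekly[0] / daily[0])
print("slope difference:", weekly[1] - daily[1])
print("SSR ratio:", weekly[2] / daily[2])
print("R^2 difference:", weekly[3] - daily[3])
```

Up to floating-point error, the intercept is multiplied by 7, the slope and $R^2$ are unchanged, and the SSR is multiplied by 49.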
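The new standard error of the slope follows from:
$$s.e.(b_1) = \sqrt{\frac{SSR}{(n - 2)\sum (x_i - \bar{x})^2}}$$
Since the change in units applies to both axes, workhours is also multiplied by 7, so $\sum (x_i - \bar{x})^2$ is scaled by 49 along with SSR:
$$s.e.(b_1)_{new} = \sqrt{\frac{49\,SSR}{(n - 2)\,49\sum (x_i - \bar{x})^2}} = s.e.(b_1)$$
The standard error of the slope estimate would therefore not change, which is consistent with the slope itself being unchanged.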
4
4.a
According to the Gauss Markov DGP assumptions, $E(\varepsilon_i) = 0$. Then this implies that the sum of all errors is in
fact zero since $E(\varepsilon_i) = \frac{\sum_{i=1}^{n} \varepsilon_i}{n}$ where $n \neq 0$. Hence true.
4.b
Consider $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ as the simple linear model. The point $(\bar{x}, \bar{y})$ is included in the model because by
definition $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. Hence true.
4.c
Consider that $Cov(x_i, \hat{u}_i) = E[(x_i - \bar{x})(\hat{u}_i - \bar{\hat{u}})] = E(x_i \hat{u}_i) - \bar{x}\,\bar{\hat{u}}$.
$\bar{x}$ (the mean of the $x$s) is not necessarily zero. However, the sum of the residuals is zero. This implies that
$\bar{\hat{u}}$ (the mean of the residuals) is zero, so $\bar{x}\,\bar{\hat{u}}$ is zero.
$$E(x_i \hat{u}_i) = \frac{1}{n}\sum_{i=1}^{n} x_i \hat{u}_i$$
This term does not have to be zero since there are no restrictions on arbitrary $x_i$, $\hat{u}_i$. Hence $Cov(x_i, \hat{u}_i)$ is
not zero.
5
5.a
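$\bar{x} = 2.5$
$\bar{y} = 5.75$
$$\begin{aligned}
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
&= \frac{(4 - 2.5)(4 - 5.75) + (3 - 2.5)(6 - 5.75) + (2 - 2.5)(6 - 5.75) + (1 - 2.5)(7 - 5.75)}{(4 - 2.5)^2 + (3 - 2.5)^2 + (2 - 2.5)^2 + (1 - 2.5)^2} \\
&= -0.9
\end{aligned}$$
$$\begin{aligned}
b_0 &= \bar{y} - b_1 \bar{x} \\
&= 5.75 - (-0.9)(2.5) \\
&= 8
\end{aligned}$$
Hence the estimated model is $\hat{y} = 8 - 0.9x$.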
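The slope and intercept can be recomputed from the four data points that appear in the working above, $(x, y) = (4, 4), (3, 6), (2, 6), (1, 7)$. A minimal sketch in plain Python, with illustrative variable names:

```python
# Data points read off the worked terms in 5.a
x = [4, 3, 2, 1]
y = [4, 6, 6, 7]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sum of cross-deviations and sum of squared deviations in x
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx           # slope: ratio of the two sums
b0 = y_bar - b1 * x_bar    # intercept

print(x_bar, y_bar, round(b1, 4), round(b0, 4))
```

Note that the slope is the ratio of the two sums, not a sum of term-by-term ratios.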
5.b
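$$\begin{aligned}
e_1 &= y_1 - \hat{y}_1 = 4 - 4.4 = -0.4 \\
e_2 &= y_2 - \hat{y}_2 = 6 - 5.3 = 0.7 \\
e_3 &= y_3 - \hat{y}_3 = 6 - 6.2 = -0.2 \\
e_4 &= y_4 - \hat{y}_4 = 7 - 7.1 = -0.1
\end{aligned}$$
$$\sum_{i=1}^{4} e_i = -0.4 + 0.7 - 0.2 - 0.1 = 0$$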
5.c
$$R^2 = 1 - \frac{SSR}{SST}$$
$$SSR = \sum_{i=1}^{4} (e_i)^2 = (-0.4)^2 + (0.7)^2 + (-0.2)^2 + (-0.1)^2 = 0.7$$
$$SST = \sum_{i=1}^{4} (y_i - \bar{y})^2 = (4 - 5.75)^2 + (6 - 5.75)^2 + (6 - 5.75)^2 + (7 - 5.75)^2 = 4.75$$
$$R^2 = 1 - \frac{0.7}{4.75} = 0.8526$$
5.d
$$\sum_{i=1}^{4} (x_i - \bar{x})^2 = (4 - 2.5)^2 + (3 - 2.5)^2 + (2 - 2.5)^2 + (1 - 2.5)^2 = 5$$
$$s.e.(b_1) = \sqrt{\frac{SSR}{(n - 2)\sum (x_i - \bar{x})^2}} = \sqrt{\frac{0.7}{(4 - 2)(5)}} = 0.2646$$
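The residuals, $R^2$ and the slope's standard error from 5.b to 5.d can be checked together. The sketch below takes the fitted line $\hat{y} = 8 - 0.9x$ from 5.a as given and uses the same four data points; the variable names are illustrative.

```python
from math import sqrt

# Data points and fitted coefficients from 5.a
x = [4, 3, 2, 1]
y = [4, 6, 6, 7]
b0, b1 = 8.0, -0.9

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# 5.b: residuals e_i = y_i - y_hat_i
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# 5.c: sums of squares and R^2
ssr = sum(e ** 2 for e in residuals)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ssr / sst

# 5.d: standard error of the slope
s_xx = sum((xi - x_bar) ** 2 for xi in x)
se_b1 = sqrt(ssr / ((n - 2) * s_xx))

print([round(e, 1) for e in residuals])
print(round(ssr, 4), round(sst, 4), round(r_squared, 4), round(se_b1, 4))
```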