
Econ 1629: Applied Research Methods

Assignment 4 Suggested Solutions

Section I – Traffic Fatalities


Variable   Description
state      State ID (FIPS) Code
allmort    Number of Vehicle Fatalities per year
allnite    Number of Night-time Vehicle Fatalities per year
pop        Population
yngdrv     Proportion of Drivers Age 15-24
breath     Breath Test Law for Suspected Drunk Drivers [Dummy Variable]
jaild      Mandatory Jail Sentence for Drunk Driving [Dummy Variable]
comserd    Mandatory Community Service for Drunk Driving [Dummy Variable]
beertax    Tax on Case of Beer (in dollars)
mlda       Minimum Legal Drinking Age

(1) Create summary statistics for the above variables, and write one paragraph summarizing the
main findings. The goal of this paragraph is to provide the background you would like the
reader to have before you present any information about the effect of certain policies in
reducing traffic fatalities.
. *Question 1
. sum state allmort allnite pop yngdrv breath jaild comserd beertax mlda

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       state |        48     30.1875    15.44883          1         56
     allmort |        48      974.75    981.0172        104       5390
     allnite |        48    179.5625    186.9059         18        936
         pop |        48     5074334     5347035   478999.7   2.83e+07
      yngdrv |        48     .161999    .0221605    .073137    .220724
-------------+--------------------------------------------------------
      breath |        48          .5    .5052912          0          1
       jaild |        47    .2978723    .4622673          0          1
     comserd |        47     .212766    .4136881          0          1
     beertax |        48    .4798154     .434836   .0433109   2.194418
        mlda |        48    20.96875    .2165064       19.5         21

Here is an example of a summary paragraph describing the data:
The data for this study provide information about traffic fatalities and selected policies
to address driving under the influence of alcohol in 48 U.S. states for 1988. The data also
include population figures and the proportion of young drivers for each state.
In 1988, the average number of vehicle-related fatalities per state was 974.8,
ranging from a minimum of 104 to a maximum of 5,390. Part of this wide range is due to the
variation in the populations of the states, which range from less than 500,000 residents to
over 28 million, as well as variation in the proportion of young drivers (min of 7% and max
of 22%), who may be particularly likely to be involved in traffic fatalities. Policies related to
driving under the influence of alcohol varied across states: 50% required a breath test for
suspected drunk drivers, and, among the 47 states with non-missing data, 30% had a mandatory
jail sentence for convicted drunk drivers and 21% required community service for convicted
drunk drivers. All states had a beer tax,
which averaged about 48 cents per case of beer but varied widely across the states, from
less than a nickel to over $2.

(2) Create a new variable (called vfr) that indicates the vehicle fatality rate (defined as the
number of vehicle fatalities per 100,000 people living in the state). What is the average
vehicle fatality rate in this sample?
. *Question 2
. gen vfr=(allmort/pop)*100000

. sum vfr

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         vfr |        48    20.69594    5.211829    12.3111    32.3591

The average fatality rate, by state, in this sample is 20.7 fatalities per 100,000 people.

(3) Create a new variable (called nightvfr) that indicates the night-time vehicle fatality rate
(defined as the number of night-time vehicle fatalities per 100,000 people living in the state).
What is the average night-time vehicle fatality rate in this sample?
. *Question 3
. gen nightvfr=(allnite/pop)*100000

. sum nightvfr

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    nightvfr |        48    3.680209    .9816015    1.71598   6.104852

The average night-time fatality rate, by state, in the sample is 3.7 fatalities per 100,000 people.

(4) Many people believe that younger drivers are more likely to get into accidents, especially
when road conditions are more challenging, such as after dark (auto insurers underwrite
policies reflecting this assumption). Which states have the lowest and highest proportions of
young drivers, those drivers aged 15-24? Where do these states rank in terms of vfr and
nightvfr?
Indiana has the lowest percentage of young drivers, at 7.3%, and New Mexico has the
highest percentage of young drivers, at 22.1%.
New Mexico has the second-highest rate of overall vehicle fatalities and the highest rate of
night-time vehicle fatalities. Indiana is in the middle for both of these statistics: 25th highest
(or 24th lowest) for overall vehicle fatality rate and 21st highest (or 28th lowest) for night-time
vehicle fatality rate.
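
The rankings above can be reproduced with a few commands. This is a minimal sketch, assuming the assignment dataset is in memory and that vfr and nightvfr were created as in Questions 2 and 3; the names rank_vfr and rank_nightvfr are hypothetical and chosen only for illustration.

```stata
* Assumes the assignment data are loaded and vfr / nightvfr already exist
* rank_vfr and rank_nightvfr are hypothetical names; rank 1 = highest rate
egen rank_vfr = rank(-vfr)
egen rank_nightvfr = rank(-nightvfr)

* Sort by the proportion of young drivers and list the first and last state
sort yngdrv
list state yngdrv vfr rank_vfr nightvfr rank_nightvfr if _n == 1 | _n == _N
```

The first listed row is the state with the lowest proportion of young drivers and the last is the state with the highest; the state column shows the FIPS code, which still has to be matched to a state name.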

(5) Many night-time car accidents involve alcohol. We are interested in estimating the
relationship between the night-time fatality rate (nightvfr) and whether the state has a breath
test law for suspected drunk drivers (breath) using the following PRF:
nightvfr = β0 + β1 breath + u

(a) What sign would you expect β1 to have? Explain.


Most people expect β1 to be negative since they expect that having a mandatory
breath test should discourage drunk driving and should therefore lead to lower
traffic fatalities.
Other people would argue that states that have high traffic fatality rates may be
more likely to establish a breath test law, which would lead to β1 being positive.
Both answers are reasonable.

(b) Estimate the above PRF using the data provided. Remember to use the option for
robust standard errors. Interpret the coefficient on the variable breath. Is the
coefficient on the variable breath statistically significant at the 5% level?
 

. *Question 5
. regress nightvfr breath, robust

Linear regression                               Number of obs     =         48
                                                F( 1, 46)         =       2.46
                                                Prob > F          =     0.1236
                                                R-squared         =     0.0508
                                                Root MSE          =      .9667

------------------------------------------------------------------------------
             |               Robust
    nightvfr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      breath |  -.4377213   .2790617    -1.57   0.124    -.9994434    .1240009
       _cons |    3.89907   .2289213    17.03   0.000     3.438275    4.359865
------------------------------------------------------------------------------

On average, states with breath test laws have a night-time vehicle fatality rate that is 0.44 deaths
per 100,000 people lower than states that do not have this policy.
The t-statistic on the breath coefficient is -1.57, which in absolute value does not exceed the 5% critical
value of 1.96 (or about 2.01 using the t distribution with 46 degrees of freedom). Equivalently, the
p-value of 0.124 is greater than 0.05. Therefore, the coefficient is not statistically significant at the 5% level.
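
As a quick check, the test can be reproduced by hand from the reported coefficient and robust standard error. The sketch below uses only the numbers printed in the regression output, so it does not require the dataset.

```stata
* t-statistic = coefficient / robust standard error (values copied from the output)
display -.4377213 / .2790617

* Two-sided 5% critical value from the t distribution with 46 degrees of freedom
display invttail(46, 0.025)

* Two-sided p-value for the t-statistic
display 2 * ttail(46, abs(-.4377213 / .2790617))
```

The first and third lines should match the t and P>|t| columns reported by Stata (-1.57 and 0.124), and the second shows why the small-sample critical value is slightly larger than 1.96.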

(c) Is the sign of β̂1 the one you expected? Explain.


The answer to this question depends on which of the two explanations was offered in part (a). If a
negative coefficient was predicted in part (a), then the sign is as expected. If a positive coefficient
was predicted, then we can reconcile our results with reasoning along the lines of the first
explanation – that breath test laws discourage drunk driving and thus are associated with fewer
night-time traffic fatalities (per 100,000 people).

(d) Is β1 likely to represent the causal effect of a breath-test law on the night-time
vehicle fatality rate? Explain.
No. We can say that having a breath test law is associated with a lower fatality rate, but we cannot
say that we observe a lower vehicle fatality rate in states with breath test laws because of those
laws. States that establish breath test laws may be extremely vigilant about drunk driving or want
to discourage any tendencies for people to drive while under the influence. They may also be more
generally safety-conscious with a population of more cautious drivers. States without breath test
laws may have more lenient attitudes about drinking and driving and more popular opposition to
these types of laws – though these same “cultural” issues are likely to contribute to a higher vehicle
fatality rate.
In other words, states that have these laws may be different from states that do not have these laws
in ways that also contribute to vehicle fatality rates.

(e) From a policymaker’s perspective, why would we care if β1 represented the causal
effect of a breath-test law on the night-time vehicle fatality rate? [1 paragraph]

If this were indeed a causal effect, then establishing a breath test law could be an effective way of
reducing the night-time traffic fatality rate (as long as the associated benefits are larger than the
costs of implementing such policies). If β1 were simply reflecting an association but not a causal
one, then implementing a breath test law is not likely to lead to a reduction in the night-time traffic
fatality rate, on its own. So from a policymaker’s perspective, it would be crucial to know whether
this association is causal or not.

(6) Other policies that could be used to reduce night-time traffic fatalities, particularly those
linked to alcohol use, include: mandatory jail sentences for drunk driving (jaild), mandatory
community service for drunk-driving offenders (comserd), and a beer tax (beertax). This
question asks you to examine how these policies and the proportion of the state’s drivers
who are “young” drivers relate to the night-time vehicle fatality rate.
(a) Estimate the following PRF (remember to specify robust standard errors):
nightvfr = β0 + β1 breath + β2 jaild + β3 comserd + β4 beertax + β5 yngdrv + u
. *Question 6
. regress nightvfr breath jaild comserd beertax yngdrv, robust

Linear regression                               Number of obs     =         47
                                                F( 5, 41)         =       0.90
                                                Prob > F          =     0.4912
                                                R-squared         =     0.1223
                                                Root MSE          =     .98306

------------------------------------------------------------------------------
             |               Robust
    nightvfr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      breath |  -.2212992   .3078117    -0.72   0.476    -.8429375    .4003391
       jaild |   .6884527   .4825511     1.43   0.161     -.286079    1.662984
     comserd |  -.3982458   .5069437    -0.79   0.437    -1.422039    .6255477
     beertax |   .2700878   .3547532     0.76   0.451    -.4463508    .9865265
      yngdrv |   2.170482   9.186893     0.24   0.814    -16.38282    20.72379
       _cons |   3.196759   1.247599     2.56   0.014     .6771815    5.716337
------------------------------------------------------------------------------

(b) Interpret the coefficient on jaild


States with a mandatory jail sentence for drunk drivers have a night-time vehicle fatality rate that is
0.69 deaths higher (per 100,000 people) than states without this policy, holding breath, comserd,
beertax, and yngdrv constant.

(c) Interpret the coefficient on beertax
An additional dollar in the beer tax per case is associated with an increase in the night-time
vehicle fatality rate of 0.27 additional fatalities per 100,000 people, holding breath, jaild,
comserd, and yngdrv constant.

(d) Interpret the coefficient on yngdrv


An increase of 1 percentage point (0.01) in the proportion of a state’s drivers who are age 15-24 is
associated with 0.022 additional night-time vehicle fatalities per 100,000 residents (2.17 × 0.01),
controlling for breath, jaild, comserd, and beertax.
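
Because yngdrv is stored as a proportion, the raw coefficient of 2.17 corresponds to a change from 0 to 1. One way to read the effect directly in percentage points is to rescale the variable before estimating. This is a sketch that assumes the assignment dataset is in memory; yngdrv_pct is a hypothetical variable name.

```stata
* Assumes the assignment data are loaded; yngdrv_pct is a hypothetical name
gen yngdrv_pct = yngdrv * 100
regress nightvfr breath jaild comserd beertax yngdrv_pct, robust

* The coefficient on yngdrv_pct should equal the original coefficient divided by 100
display 2.170482 / 100
```

Rescaling a regressor changes only its own coefficient and standard error; the t-statistic and all other coefficients stay the same.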

(7) An alternative way to examine the association between yngdrv and nightvfr is to divide states
into 3 groups according to the proportion of young drivers.
(a) Create the following 3 variables – lowyng, medyng, and highyng:
gen lowyng=(yngdrv<0.15) if yngdrv!=.
gen medyng=(yngdrv>=0.15 & yngdrv<0.17) if yngdrv!=.
gen highyng=(yngdrv>=0.17) if yngdrv!=.

Be sure to check your results as suggested in the assignment to make sure that each state is
coded for only one of the young driver classifications!
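
One way to carry out this check is sketched below. It assumes the three dummies were created exactly as in part (a); ngroups is a hypothetical variable name.

```stata
* Each state with non-missing yngdrv should fall in exactly one group
gen ngroups = lowyng + medyng + highyng
assert ngroups == 1 if !missing(yngdrv)

* One-way tabulations show how many states are in each group
tab1 lowyng medyng highyng
```

If the assert line runs without an error message, every state is coded into one and only one of the three classifications.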

(b) Estimate the following PRF (remember to specify robust standard errors):
nightvfr = β0 + β1 breath + β2 jaild + β3 comserd + β4 beertax + β5 lowyng + β6 medyng + u

. regress nightvfr breath jaild comserd beertax lowyng medyng, robust

Linear regression                               Number of obs     =         47
                                                F( 6, 40)         =       1.83
                                                Prob > F          =     0.1169
                                                R-squared         =     0.1613
                                                Root MSE          =     .97292

------------------------------------------------------------------------------
             |               Robust
    nightvfr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      breath |  -.1972487   .3184678    -0.62   0.539    -.8408961    .4463987
       jaild |   .7008364    .457141     1.53   0.133    -.2230801    1.624753
     comserd |  -.3977197   .4788624    -0.83   0.411    -1.365537    .5700973
     beertax |    .336772   .2974011     1.13   0.264    -.2642982    .9378421
      lowyng |   .0236589   .4775119     0.05   0.961    -.9414287    .9887465
      medyng |   .4055903    .376392     1.08   0.288    -.3551264    1.166307
       _cons |   3.289217   .5665373     5.81   0.000     2.144203    4.434232
------------------------------------------------------------------------------
(c) Interpret the coefficient on medyng
Relative to states with high proportions of young drivers (17% or more of drivers age 15-24),
states with medium proportions of young drivers (at least 15% but less than 17%) experience 0.41
more night-time vehicle fatalities per 100,000 residents, holding breath, jaild, comserd, and
beertax constant.

(d) Why did we not include highyng in the PRF above?


Highyng is our "base" or "omitted" category. So the coefficients on medyng and lowyng are relative
to that highyng group.

[Intuitively, we are not leaving out anything because the states with highyng are just represented by
the combination (medyng=0 and lowyng=0). The states with medyng are represented by (medyng=1
and lowyng=0) and the states with lowyng are represented by (medyng=0 and lowyng=1). So the
combination of these two dummy variables accounts for all three of our categories.

In fact, if we included highyng in the regression, this would be a violation of assumption 4 (no
perfect multicollinearity). To see that it violates assumption 4, notice that
lowyng+medyng+highyng=1, which is exactly the variable that multiplies the constant term. If you
try it in Stata, you will see that Stata cannot estimate all of these coefficients and automatically
drops one of the collinear variables.]
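
The choice of omitted category is arbitrary and only changes the reference group. As an illustration, the sketch below re-estimates the same model with lowyng as the base category instead of highyng; it assumes the dataset and the three dummies from part (a) are in memory.

```stata
* Same model, but lowyng is now the omitted (base) category
regress nightvfr breath jaild comserd beertax medyng highyng, robust

* Coefficients implied by the output in part (b), where highyng was the base:
* medyng relative to lowyng
display .4055903 - .0236589
* highyng relative to lowyng
display 0 - .0236589
```

The two specifications produce identical fitted values, so the new dummy coefficients are just differences of the old ones; the coefficients on breath, jaild, comserd, and beertax do not change.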

(8) One variable that might also contribute to vehicle fatalities is the minimum legal drinking
age. The dataset does include such a variable (mlda). In practice, however, this variable
is of limited use in estimating a regression equation explaining nightvfr. Why is this the
case?

. *Question 8
. sum mlda

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        mlda |        48    20.96875    .2165064       19.5         21

. tab mlda

    minimum |
      legal |
   drinking |
        age |      Freq.     Percent        Cum.
------------+-----------------------------------
       19.5 |          1        2.08        2.08
         21 |         47       97.92      100.00
------------+-----------------------------------
      Total |         48      100.00

There is so little variation in the minimum legal drinking age that it is of little value in explaining
the variation in the outcome variable, night-time vehicle fatality rates. Only one of the 48 states
does not have a minimum legal drinking age of 21.

Section II – Errors in Variables
(Your numeric results may differ because of different random draws, but results should
qualitatively be similar)
No Errors
The first worksheet shows the true data from a population (X and Y), and the true results of
regressing a model of the form:
Y = bX + a

(1) Graph the true relationship between X and Y. Make sure your axes are labeled.

(2) Write the true values of a and b (shown on the graph)


a = 2.0297
b = 0.4589

Random Error
Now let’s see what happens to results when there is error in the variables. We’ll start by
exploring the case where the error is totally random (each observation is equally likely to have
the same error). For example, if we were measuring people’s heights and were equally likely to
over- or under-estimate the height of tall and short people by the same amount, the error
would be random.

(3) How do you think error will affect estimates? (Be specific.)
Here your initial intuitions regarding the effect of error on the estimates may diverge. It is
important to keep in mind that errors in the dependent variable (Y) and in the independent
variable (X) will have different kinds of effects on both the level and the variance of the
estimates. We explore this in greater detail below.

Error in X variable

         Error                               Estimate
Trial    Mean     Correlation with True X    a         b
1        -0.01    -0.24                      2.1121    0.2722
2        -0.03     0.04                      2.2224    0.0133
3        -0.07     0.03                      2.1306    0.2690
4        -0.01    -0.03                      2.0865    0.3328
5        -0.03    -0.27                      2.1072    0.2980
6         0.00     0.19                      2.0944    0.3096
7        -0.09    -0.13                      2.1629    0.1900
8        -0.03    -0.15                      2.1340    0.2357
Mean     -0.03    -0.07                      2.1313    0.2401

(4) Compare the mean estimates to the true values. Does random error in the X variable seem
to affect the estimates of a and b? How?
Keep in mind that your values will differ from those in the table above. In this table, a appears
to be slightly overestimated, by about 5%, and b appears to be substantially underestimated,
by nearly 50%.
Let us focus on the difference between the true parameter b and its estimated value. When N
is large, it can be shown that the estimated value for b can be written as follows:

$$\hat{b} = b \cdot \left( \frac{\sigma_X}{\sigma_X + \sigma_\varepsilon} \right)$$

where $\sigma_X$ is the variance of X and $\sigma_\varepsilon$ is the variance of the error added to X. Notice that if
$\sigma_\varepsilon = 0$, or there is no noise, then the fraction to the right of b is equal to one and the
estimated value is correct. Whenever $\sigma_\varepsilon > 0$, or noise is added to X, then the fraction to the
right of b is less than one and the estimated value is biased towards zero. This is commonly
referred to as “Attenuation Bias”, since it attenuates the effect of X on Y.

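
The attenuation result can also be checked by simulation. The sketch below is self-contained and does not use the assignment spreadsheet: the intercept (2), slope (0.46), noise levels, sample size, and seed are all assumptions chosen for illustration, with X and the added noise both given a variance of 1 so that the formula predicts the slope is cut in half.

```stata
clear
set seed 1629
set obs 10000

* Hypothetical "true" data: Y = 2 + 0.46*X + u
generate x = rnormal(0, 1)
generate y = 2 + 0.46 * x + rnormal(0, 0.5)

* Mismeasured X: random noise with the same variance as X itself
generate x_noisy = x + rnormal(0, 1)

* Slope using the true X versus the noisy X
regress y x
regress y x_noisy

* Slope predicted by the attenuation formula: 0.46 * 1 / (1 + 1)
display 0.46 * 1 / (1 + 1)
```

With a large sample, the slope on x_noisy should be close to the predicted value and well below the slope on x.
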
Error in Y variable

         Error                               Estimate
Trial    Mean     Correlation with True Y    a         b
1         0.12    -0.20                      2.0473     0.7056
2         0.06     0.02                      2.1719     0.2737
3        -0.04    -0.27                      2.1609     0.0727
4         0.10     0.14                      2.0862     0.5626
5        -0.07     0.02                      1.9487     0.4810
6        -0.03    -0.09                      2.2329    -0.0901
7         0.00    -0.35                      2.1465     0.1852
8        -0.05    -0.02                      1.8430     0.7829
Mean      0.01    -0.09                      2.0797     0.3717

(5) Does random error in the Y variable seem to affect the estimates of a and b? How?
Compared to the solutions in (4), the random error in Y does not have a substantive effect on
the levels of our estimates. It does, however, increase the variance of these estimates, as
evidenced by the larger spread in recorded values for both a and b in the table above.
To understand this better, recall that the standard errors for OLS estimates are directly
proportional to the standard deviation of the error term in the PRF. Since the error added to
Y in this question is uncorrelated with X, OLS simply treats it as part of the PRF error term,
leading to more variance in our estimates and larger standard errors.
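
A small simulation illustrates this point. As before, this is only a sketch: the data-generating values, sample size, and seed are assumptions, not the values used in the spreadsheet.

```stata
clear
set seed 1629
set obs 200

* Hypothetical "true" data: Y = 2 + 0.46*X + u
generate x = rnormal(0, 1)
generate y = 2 + 0.46 * x + rnormal(0, 0.5)

* Mismeasured Y: random noise that is unrelated to X
generate y_noisy = y + rnormal(0, 1)

* Compare the slope and its standard error with and without the noise
regress y x
display "SE of slope with true Y:  " _se[x]
regress y_noisy x
display "SE of slope with noisy Y: " _se[x]
```

The slope is centered on the same value in both regressions, but the standard error should be noticeably larger in the second one because the added noise is absorbed into the error term.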

Error in X and Y variables

         Estimate
Trial    a         b
1        2.2340    -0.0973
2        2.1282     0.1891
3        2.1792     0.2051
4        2.1957     0.0579
5        2.1279     0.0195
6        2.2621     0.1007
7        2.1192     0.1803
8        2.2627     0.0868
Mean     2.1886     0.0928

(6) Does random error in the X and Y variable seem to affect the estimates of a and b? How?
These results show the combined effects from (4) and (5) above: not only is b
underestimated as in (4), but the variance of both estimates is also high as in (5).
(7) Over the past 3 tests, the error has been random. What do you notice about the mean error
and the correlation of the error with the variable?
The mean error tends to be close to zero, relative to the size of our estimates. This is also
true of the mean correlation between the errors and the variables. This is what we would
expect, because the errors in the preceding cases are independent of both X and Y.

Systematic Error

         Error                               Estimate
Trial    Mean     Correlation with True Y    a         b
1        0.48     0.18                       2.0557    1.5101
2        0.47     0.16                       2.1728    1.2223
3        0.49     0.46                       2.1999    1.1903
4        0.46     0.60                       2.1502    1.2541
5        0.63     0.34                       1.9942    2.0071
6        0.54     0.37                       2.1844    1.3404
7        0.42     0.28                       2.1734    1.1048
8        0.50     0.17                       2.2259    1.1731
Mean     0.50     0.32                       2.1446    1.3503

(8) Does systematic error in the Y variable seem to affect the estimates of a and b?
The estimates of a seem to be slightly over-estimated, as in previous cases. The most
noticeable effect of the error is on b in this case, however: the mean estimate (1.35) is roughly
three times the true value (0.46), an over-estimate of almost 200%! Intuitively, this is because the
error is correlated with Y, and therefore with X, so OLS attributes part of the error to the effect of
X. As we can see in the table above, the error is positively correlated with the true value of Y
(mean correlation of 0.32), and this causes the substantial over-estimation of b that we see in the
final column.
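
The mechanism can be illustrated with a simulation. The spreadsheet's exact error process is not described here, so the sketch below uses one hypothetical form of systematic error (mean 0.5 and increasing one-for-one with the true Y); all numeric values, the sample size, and the seed are assumptions.

```stata
clear
set seed 1629
set obs 10000

* Hypothetical "true" data: Y = 2 + 0.46*X + u
generate x = rnormal(0, 1)
generate y = 2 + 0.46 * x + rnormal(0, 0.5)

* Hypothetical systematic error: mean 0.5, rising with the true value of Y
generate err = 0.5 + 1.0 * (y - 2) + rnormal(0, 0.5)
generate y_obs = y + err

* The error is positively correlated with the true Y
correlate err y

* Slope using the true Y versus the mismeasured Y
regress y x
regress y_obs x
```

Under this assumed error process the observed Y equals 2.5 + 0.92*X plus noise, so the slope on x roughly doubles and the intercept rises by about 0.5. A different form of systematic error would bias the estimates by a different amount, or in a different direction.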

Conclusions

(9) In what cases would you expect errors in a variable to affect regression estimates?
Measurement error is inescapable, but when we understand its effects on our estimates, it
does not prevent us from making valid inference. Random errors in Y only have the effect of
increasing our standard errors, which reduces the precision of our estimates but does not
bias them in any way. Random errors in X bias our estimates toward zero, but we can easily
incorporate this into our interpretation of our results because we know the direction in
which the bias occurs. Systematic measurement errors are the kind that we have to be most
concerned about, because they can substantially bias our estimates. Moreover, if we do not
know the characteristics of such errors, we will not even be able to say anything about the
sign, let alone the size, of the bias that they cause.

Section III – Omitted Variable Bias
As indicated in class, the sign of the bias (positive or negative) of β̂1 when omitting X2 in the
estimation of Y = β0 + β1X1 + β2X2 + u can be summarized in the table below.

            Corr (X1,X2) >0    Corr (X1,X2) <0
β2>0              +                  -
β2<0              -                  +
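
The sign pattern in the table follows from the standard omitted variable bias formula. As a sketch, write $\hat{\alpha}_1$ for the slope from the short regression of Y on X1 alone (this symbol is introduced here for illustration). In large samples:

```latex
E[\hat{\alpha}_1] \approx \beta_1 + \beta_2 \cdot \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}
\qquad\Longrightarrow\qquad
\text{bias} = \beta_2 \cdot \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}
```

Since $\mathrm{Var}(X_1) > 0$ and the covariance has the same sign as the correlation, the sign of the bias is the sign of β2 multiplied by the sign of Corr (X1,X2), which gives the four cells of the table.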

Illustrate each of the 4 cases in the table with an example that does not draw on the material
in section I of this problem set or specific examples we have discussed in class (in other
words, please think of examples other than vehicle fatality rates; reperfusion treatment after a
heart attack; training programs, education, and wages; and test scores and class size!).
Assume the regression equation you would like to estimate is Y = β0 + β1X1 + β2X2 + u, but
for some reason (lack of data or knowledge, for example), you end up omitting X2 from the
regression.

For each example:


1) Describe the variables Y, X1, and X2.
2) Indicate the sign of the correlation between X1 and X2, and explain why you would
expect this sign
3) Indicate the sign of β2, and explain why you would expect this sign
4) Indicate the sign of the bias if you were to omit X2 from the regression
5) Indicate how your estimated β̂1 is likely to change when omitting X2 from the regression,
i.e. will it get closer to or further from zero relative to when you estimate the full
regression (with both X1 and X2 as explanatory variables)? Explain your reasoning both
in technical terms (using your answers to the previous questions) and in terms a
policymaker can understand.

Case 1: β2>0, Corr (X1,X2)>0

Suppose we would like to estimate the following equation:

wage = β0 + β1 educ + β2faminc + u

(1) So:
Y=wage= wage earned (measured in $/hour)
X1=educ=completed education (measured in years of schooling)
X2=faminc= family income (measured in thousands of dollars)

(2) Corr (educ, faminc) is probably positive since individuals with high family income are
more likely to be able to afford going to college than individuals with low family income.

(3) β2 is probably positive because individuals with high family income may have certain
characteristics (such as good family connections) that may allow them to get better-paying
jobs more easily than individuals with low family income.

(4) Since β2>0 and Corr (educ, faminc)>0, then bias>0. This means that if we were to
estimate:

wage = α0 + α1 educ + v

Then the expected value of α̂1 will be higher than β1, i.e. E[α̂1] > β1.

(5) Given that β1 is likely to be positive (since individuals with high levels of schooling are
more likely to get high wages than individuals with low levels of schooling), by omitting
faminc from the regression we are likely to overestimate the effect of education on wages.
We would be attributing to education what is really due to family income.
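
This case can be illustrated with simulated data. The sketch below is hypothetical: the coefficients (β1 = 1, β2 = 0.5), the way educ depends on faminc, the sample size, and the seed are all assumptions chosen so that β2>0 and Corr (educ, faminc)>0.

```stata
clear
set seed 1629
set obs 10000

* Hypothetical data: family income raises both education and wages
generate faminc = rnormal(50, 10)
generate educ = 8 + 0.1 * faminc + rnormal(0, 2)
generate wage = 1 + 1 * educ + 0.5 * faminc + rnormal(0, 3)

* Full regression: the coefficient on educ should be close to beta1 = 1
regress wage educ faminc

* Short regression omitting faminc: the coefficient on educ is biased upward
regress wage educ
```

With these assumed values, Cov(educ, faminc)/Var(educ) is about 10/5 = 2, so the bias is about 0.5 × 2 = 1 and the short-regression coefficient on educ should be close to 2 rather than 1.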

Case 2: β2<0, Corr (X1, X2)>0

Suppose we would like to estimate the following equation:

birthweight = β0 + β1 cigs + β2alcohol + u

(1) So:
Y= birthweight =Weight at birth of a newborn baby (usually thought of as a measure
of a baby’s health/development)
X1=cigs=number of cigarettes mother smoked during pregnancy
X2=alcohol=number of alcoholic drinks mother drank during pregnancy

(2) Corr (cigs, alcohol) is probably positive since mothers who smoked during pregnancy
were probably more likely to engage in other potentially dangerous behavior (to the child)
such as drinking.

(3) β2 is probably negative given that doctors advise women not to drink frequently during
pregnancy because of the possible harm this may cause to the baby’s development.

(4) Since β2<0 and Corr (cigs, alcohol)>0, then bias<0. This means that if we were to
estimate:

birthweight = α0 + α1 cigs + v

Then the expected value of α̂1 will be lower than β1, i.e. E[α̂1] < β1.

(5) Given that β1 is likely to be negative (since mothers who smoke during pregnancy are
more likely to have low birthweight babies than mothers who don’t smoke), by omitting
alcohol from the regression we are likely to obtain an estimate that is more negative (further
from zero) than β1. In other words, we are likely to overestimate the harmful effect of smoking on
birthweight. We would be attributing to smoking what is really due to alcohol drinking.

Case 3: β2>0, Corr (X1, X2)<0

Suppose we would like to estimate the following equation:

life expectancy = β0 + β1 cigs + β2 exercise+ u

(1) So:
Y= life expectancy
X1= cigs = annual average cigarette consumption
X2= exercise = annual average hours of exercise

(2) Corr (cigs, exercise) is probably negative since people who exercise are less likely to
smoke.
(3) β2 is probably positive given that frequent exercise may increase life expectancy.

(4) Since β2>0 and Corr (cigs, exercise)<0, then bias<0. This means that if we were to
estimate:

life expectancy = α0 + α1 cigs + v

Then the expected value of α̂1 will be lower than β1, i.e. E[α̂1] < β1.

(5) Given that β1 is likely to be negative (since smokers on average live fewer years than
non-smokers), by omitting exercise from the regression we are likely to obtain an estimate that is
more negative (further from zero) than β1. In other words, we are likely to overestimate the
harmful effect of smoking on life expectancy. We would be attributing to smoking what is really
due to lack of exercise.

Case 4: β2<0, Corr (X1, X2)<0

Suppose we would like to estimate the following equation:

testscore_i = β0 + β1 schoolspending_i + β2 povrate_i + u_i

(1) So:
Y=testscore_i=average test score of students in school district i
X1=schoolspending_i=spending of school district i
X2=povrate_i=poverty rate in school district i

(2) Corr (schoolspending, povrate) is probably negative since communities that have high
poverty rates are likely to have school districts that have low levels of resources.

(3) β2 is probably negative since poor communities are usually associated with low academic
performance.

(4) Since β2<0 and Corr (schoolspending, povrate)<0, then bias>0. This means that if we were to
estimate:

testscore_i = α0 + α1 schoolspending_i + v_i

Then the expected value of α̂1 will be higher than β1, i.e. E[α̂1] > β1.

(5) Given that β1 is likely to be positive (since we would think that higher spending is associated
with higher, or at least not lower, test scores), by omitting povrate from the regression we are
likely to overestimate the effect of school spending on test scores. We would be attributing to
school spending what is really due to socio-economic conditions (as measured by povrate).
