
Summary

While there is a lot of text and there are many pictures below, these are the most important things you should take away from the Cereal example:

1. Always make sure to directly observe the data to get a feel for what it shows us (and build
intuition for what relationships may exist).
a. A correlation matrix, scatterplots, distributions, and a scan for missing data are a good place to start.
b. If time series data, plot each data series against time
2. After fitting the model, check the following:
a. Are there leverage points (leverage plots)?
b. Are there any systematic trends in the residuals as a function of the predicted Y’s (residuals-by-predicted plot)? If so, a Box-Cox transformation may be needed.
c. Is there multicollinearity present in the data (VIF)?
d. How good is my fit overall (check R^2)?
e. Do I have any insignificant predictors in the model?
3. For time series data, check the following:
a. Does my data have any issue with autocorrelation? (Do the Durbin-Watson test)
b. Can I predict well on holdout data? (Do a holdout validation as per below)
i. If the data is not time series data, can instead do a split-sample
4. When comparing how good two models are against one another on holdout data, compare the
Mean Squared Error. This is equal to the Bias^2 + Variance. See below for how this is calculated.
5. Occam’s Razor applies very well in a prediction setting – simple models tend to generalize better to future data than more complicated models.

Directly Observing the Data

It helps to look at the raw data over time. This is the raw data as a function of time (Analyze → Fit Y by X, then set Y = Vt, Pt, and At, while X = time):
Volume has generally trended upwards, although at a faster rate at the beginning and end of the
dataset. It just so happens that these were the same periods when price was moving downwards. There
is a less clear relationship between volume and advertising.
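For readers working outside JMP, here is a minimal sketch of this same first look at the data in Python (pandas + matplotlib). The file name cereal.csv and the column names Vt, Pt, At, and time are assumptions that simply mirror the variable names used in this handout.

# A rough sketch of the "directly observe the data" step, assuming a CSV with
# columns Vt (volume), Pt (price), At (advertising), and time.
import pandas as pd
import matplotlib.pyplot as plt

cereal = pd.read_csv("cereal.csv")  # hypothetical file name

# Plot each series against time, one panel per variable
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8, 8))
for ax, col in zip(axes, ["Vt", "Pt", "At"]):
    ax.plot(cereal["time"], cereal[col])
    ax.set_ylabel(col)
axes[-1].set_xlabel("time")
plt.show()

# Correlation matrix and a quick scan for missing values
print(cereal[["Vt", "Pt", "At", "time"]].corr())
print(cereal.isna().sum())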

Getting Model Fit

A log-linear regression of volume against price, advertising, and time yields the following output (Analyze → Fit Model):

The residuals-by-predicted plot looks fairly well-behaved:
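As a hedged, non-JMP illustration, the same fit and residuals-by-predicted plot could be produced in Python with statsmodels, reusing the hypothetical cereal DataFrame loaded in the sketch above (LnVt, LnPt, and LnAt are simply log transforms of the raw columns):

# A rough Python equivalent of the Fit Model step.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Log transforms, mirroring the log-linear specification in the handout
cereal["LnVt"] = np.log(cereal["Vt"])
cereal["LnPt"] = np.log(cereal["Pt"])
cereal["LnAt"] = np.log(cereal["At"])

# Fit LnVt on LnPt, LnAt, and time; summary() shows coefficients, R^2, t-stats
model = smf.ols("LnVt ~ LnPt + LnAt + time", data=cereal).fit()
print(model.summary())

# Residuals-by-predicted plot
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="grey")
plt.xlabel("Predicted LnVt")
plt.ylabel("Residual")
plt.show()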


Leverage Plots

Ever wonder what those leverage plots are? As discussed in class, leverage plots show the relationship of Y against an individual predictor after we “strip out” the effect of the other covariates that are also in the regression (for more information, see here). For example, the left-most leverage plot below is the relationship between LnVt and time, controlling for LnAt and LnPt. The dark red line represents the actual regression slope coefficient for time, while the lightly shaded red band represents the confidence interval for that slope. Leverage points are points that heavily influence the slope of the line – these are points that should potentially be looked into (maybe they are outliers). Points that heavily influence the slope coefficient are not “bad” per se, but they should be looked into because they are singlehandedly affecting the regression.
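JMP’s leverage plots are essentially added-variable (partial regression) plots, so a rough Python stand-in, assuming the fitted model object from the regression sketch above, might look like this:

# Added-variable (partial regression) plots, one panel per predictor,
# analogous to JMP's leverage plots.
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig = plt.figure(figsize=(10, 8))
sm.graphics.plot_partregress_grid(model, fig=fig)
plt.show()

# Hat values (leverage) flag the points that pull hardest on the fitted slopes
influence = model.get_influence()
print(influence.hat_matrix_diag)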

Assessing *Need* For Log Transforms (vs Convenience in Interpreting Results)

Is the overall goodness of fit any better than if we didn’t use a log-linear regression? This is the actual
versus expected plot:

These are the residuals as a function of the predicted values:


Side by side, these plots look almost identical to the previous plots. So in terms of goodness of fit, the
results are pretty much the same. The benefit of the log-linear regression here is interpretability: we can say that a given percentage change in one variable is associated with a particular percentage change in the thing that we are predicting (the coefficients can be read as elasticities).

Summary of fit – we can remove advertising from the model, and there are no apparent multicollinearity
issues (as per the VIF column):

The signs are as we would expect them to be. Higher advertising helps (marginally), while lower prices help (significantly). Demand generally moves up over time.
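A quick sketch of the VIF check in Python, again assuming the cereal DataFrame with the log-transformed columns from the earlier sketch:

# VIF for each predictor in the design matrix; values well above ~5-10 would
# signal multicollinearity (the constant's "VIF" can be ignored).
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(cereal[["LnPt", "LnAt", "time"]])
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))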

Residuals over time and autocorrelation

It can also be helpful to make sure that there are no issues with the residuals varying systematically over
time. Here, there are no apparent issues. No trend as a function of time that I can see by eye:
However, there is also a specific type of systematic variation over time which can be hard to discern by
eye – this is called “autocorrelation.” The idea behind autocorrelation is basically that if you
underpredicted last period, you may be more likely to underpredict again this period – this is called
“positive autocorrelation.” Positive autocorrelation can come about, for example, when we are
modeling volume of a product sold, and it just so happens that demand is sensitive to the overall
macroeconomy but we do not have a predictor in our model for macroeconomic conditions like GDP. In this case, if the economy goes through a recession, we may have a string of underpredictions, creating
positive autocorrelation. Negative autocorrelation implies that underpredicting last period makes it
more likely to overpredict this period. This can be tested with the “Durbin-Watson” test:

All you need to pay attention to are the third and fourth numbers above – these are the autocorrelation
coefficient and the p-value. This would suggest that there is no significant autocorrelation.
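As a rough non-JMP equivalent, statsmodels exposes the Durbin-Watson statistic directly (values near 2 suggest no autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation). This assumes the fitted model object from the earlier sketch:

# Durbin-Watson statistic computed on the regression residuals
from statsmodels.stats.stattools import durbin_watson

print(durbin_watson(model.resid))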

Holdout validation

We can also predict on holdout data. This is the most natural form of validation when we are working
with time series data (if we are not working with time series data, a split sample validation, as described
in the lecture slides, may be most appropriate). I train on roughly the first 75% of the data and predict the final 25% (in the data table, highlight rows 28-36, right-click, select ‘Hide and Exclude’, then run the regression in the usual way via Analyze → Fit Model). I fit two models –
the first model includes all predictors (LnAt, LnPt, and time) while the second includes only the
significant predictors (LnPt and time).
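A hedged Python sketch of the same holdout setup, assuming the cereal DataFrame from the earlier sketches has 36 rows, with rows 28-36 serving as the holdout period as described above:

# Calibration period: rows 1-27; holdout period: rows 28-36 (0-based iloc below)
import statsmodels.formula.api as smf

train = cereal.iloc[:27]
test = cereal.iloc[27:]

# Model 1: all predictors; Model 2: only the significant predictors
full = smf.ols("LnVt ~ LnPt + LnAt + time", data=train).fit()
reduced = smf.ols("LnVt ~ LnPt + time", data=train).fit()

# Save predictions for every row (both periods), as done with JMP's
# prediction formula columns
pred_full = full.predict(cereal)
pred_reduced = reduced.predict(cereal)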

To see how well we predict, plot expected versus actual LnVt as a function of time across all the data points. I do this by saving the predictions for all data points (red triangle in top-left → Save Columns → Prediction Formula) for both models.
To visualize the results, unexclude and unhide the holdout data. Then go to Graph → Graph Builder, highlight LnVt and the predictions under the two models, and drag them into the Y area. Drag ‘time’ into the X area.
It is at point 28 that the holdout period starts (pointed to with red arrow above).

We can make this into a pretty graph. Click ‘Done’ in the top left, then change the axes names, legend
names, and title as desired.
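A rough matplotlib version of the same picture, reusing the predictions from the holdout sketch above:

# Actual LnVt and both sets of predictions against time, with a marker at the
# start of the holdout period (row 28).
import matplotlib.pyplot as plt

plt.plot(cereal["time"], cereal["LnVt"], label="Actual LnVt")
plt.plot(cereal["time"], pred_full, label="Prediction (all predictors)")
plt.plot(cereal["time"], pred_reduced, label="Prediction (LnPt and time only)")
plt.axvline(cereal["time"].iloc[27], linestyle="--", color="grey")
plt.xlabel("time")
plt.ylabel("LnVt")
plt.legend()
plt.title("Actual vs. predicted volume, calibration and holdout periods")
plt.show()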

Computing the mean squared error

We want a quantitative way to know how well we did. To do this, we will want to analyze the residuals in the calibration period and in the holdout period. This first requires that we create a column to distinguish the calibration period from the holdout period. Do this as we normally would in Excel –
create the column, then in row 1, type “Training”. Then copy and paste that down so that rows 2 to 27 are the same. In row 28, the start of the holdout period, type “Testing”. Copy and paste that down to
the end of the data set.

Now let’s get the residuals. The residuals are obtained by creating a new column (double-click in the empty column header area), then right-clicking the column header → Formula…, then defining the formula (the actual LnVt minus the saved prediction formula) as follows:
Clicking OK generates the residuals data (make sure to name the column appropriately). The data should
look like this:

These are the residuals in the calibration period when we include all predictors versus when we exclude advertising (Analyze → Distribution → set Y equal to the two sets of residuals, and set By = ‘Validation’, then Red Triangle → Stack):
As we would expect, the error is lower in the calibration period when we include the advertising predictor (both sets of residuals have a mean of ~0, but the Std Dev of the residuals is 0.0281 for the regression that excludes the Advertising variable versus 0.0255 for the full regression). However, these are the corresponding residuals in the holdout period:
A standard measure of the goodness of fit of a predictor is the mean squared error, which is equal to
the bias^2 plus the variance. The Bias here is equal to the ‘Mean’ above while the Variance here is equal
to the ‘Std Dev’ squared. By this measure, the MSE is nearly double for the more complex model:
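A short sketch of the same calculation in Python, reusing the test split and predictions from the earlier holdout sketch: the bias is the mean holdout residual and the variance is the (population) variance of those residuals, so MSE = bias^2 + variance = mean(residual^2).

# Compute the MSE decomposition on the holdout residuals for both models
import numpy as np

for label, pred in [("all predictors", pred_full),
                    ("LnPt and time only", pred_reduced)]:
    resid = test["LnVt"] - pred.loc[test.index]   # holdout residuals
    bias = resid.mean()
    variance = resid.var(ddof=0)                  # population variance = Std Dev squared
    print(label, "MSE =", bias**2 + variance, "check:", np.mean(resid**2))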

This is an example of Occam’s Razor at work. However, it also means removing advertising from the
model. Some managers may not be too happy with that. There are often two goals with models – one to
predict the future, the other to improve the future. Very frequently, these goals oppose one another.

What does this imply about what is truly driving the data? The model that includes advertising and
trains on the first 75% of the data has the following coefficients:
Mild price elasticity and very strong effect of advertising. Exclude advertising, and we see the following:

Price elasticity is stronger, as is the beneficial effect of time. This of course has clear implications for
what the firm should do going forward.
