Regression
Correlation Analysis
Correlation analysis expresses the relationship between two data series as a single number. The correlation coefficient measures how closely related two data series are; specifically, it measures the linear association between two variables.
The sample covariance:

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

The sample variance and standard deviation:

$$s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \qquad s_X = \sqrt{s_X^2}$$

The sample correlation coefficient:

$$r = \frac{\mathrm{Cov}(X, Y)}{s_X \, s_Y}$$

The test statistic for the significance of the correlation:

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
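As a check on the formulas above, the covariance, correlation coefficient, and t-statistic can be computed directly. A minimal Python sketch (the function name and data are illustrative):

```python
import math

def correlation_stats(x, y):
    """Sample covariance, correlation coefficient, and the t-statistic
    for testing H0: the population correlation is zero."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    r = cov / (sx * sy)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return cov, r, t

cov, r, t = correlation_stats([1, 2, 3, 4], [2, 3, 5, 4])
print(round(r, 4), round(t, 4))  # r = 0.8, t ≈ 1.8856
```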
Linear Regression
Linear regression assumes a linear relationship between the dependent variable (Y) and the independent variable (X).
$$Y_i = b_0 + b_1 X_i + \varepsilon_i, \qquad i = 1, \ldots, n$$

The standard error of estimate (SEE) measures the standard deviation of the residuals:

$$\mathrm{SEE} = \left[\frac{\sum_{i=1}^{n} (Y_i - \hat{b}_0 - \hat{b}_1 X_i)^2}{n - 2}\right]^{1/2} = \left[\frac{\sum_{i=1}^{n} e_i^2}{n - 2}\right]^{1/2}$$
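The slope, intercept, and SEE follow directly from these formulas. A small sketch with made-up data (the function name is illustrative):

```python
import math

def ols_simple(x, y):
    """Least-squares estimates b0, b1 and the standard error of estimate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: covariance term divided by the variation in x.
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    # SEE: standard deviation of the residuals, with n - 2 degrees of freedom.
    sse = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    see = math.sqrt(sse / (n - 2))
    return b0, b1, see

b0, b1, see = ols_simple([1, 2, 3, 4], [2, 3, 5, 4])
print(b0, b1, round(see, 4))  # 1.5, 0.8, SEE ≈ 0.9487
```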
Coefficient of Determination
The coefficient of determination measures the
fraction of the total variation in the dependent
variable that is explained by the independent
variable.
$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
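The same decomposition can be computed directly from actual and fitted values. A minimal sketch (the data and fitted values below are hypothetical):

```python
def r_squared(y, y_hat):
    """R^2 = 1 - unexplained variation / total variation."""
    my = sum(y) / len(y)
    ss_total = sum((v - my) ** 2 for v in y)
    ss_unexplained = sum((v - f) ** 2 for v, f in zip(y, y_hat))
    return 1 - ss_unexplained / ss_total

# Hypothetical observations and fitted values from some regression.
r2 = r_squared([2, 3, 5, 4], [2.3, 3.1, 3.9, 4.7])
print(round(r2, 4))  # 0.64
```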
Hypothesis Testing
We can test to see if the slope coefficient is significant by using
a t-test.
$$t = \frac{\hat{b}_1 - b_1}{s_{\hat{b}_1}}$$

A confidence interval for the slope coefficient:

$$\hat{b}_1 \pm t_c \, s_{\hat{b}_1}$$
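Using the standard result that $s_{\hat{b}_1} = \mathrm{SEE} / \sqrt{\sum (X_i - \bar{X})^2}$, the t-statistic for the slope can be sketched as follows (function name and data are illustrative):

```python
import math

def slope_t_stat(x, y, b1_null=0.0):
    """t = (b1_hat - b1_null) / s_b1 for a simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    # SEE, then the standard error of the slope estimate.
    see = math.sqrt(sum((b - b0 - b1 * a) ** 2
                        for a, b in zip(x, y)) / (n - 2))
    s_b1 = see / math.sqrt(sxx)
    return (b1 - b1_null) / s_b1

print(round(slope_t_stat([1, 2, 3, 4], [2, 3, 5, 4]), 4))  # ≈ 1.8856
```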
ANOVA
Analysis of variance (ANOVA) is a statistical procedure for
dividing the total variability of a variable into components that can
be attributed to different sources.
$$F = \frac{\mathrm{RSS}/1}{\mathrm{SSE}/(n - 2)} = \frac{\text{Mean regression sum of squares}}{\text{Mean squared error}}$$

where

$$\mathrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2, \qquad \mathrm{RSS} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$
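The F-statistic for a simple regression can be computed from the actual and fitted values. A minimal sketch (data and fitted values are hypothetical):

```python
def anova_f(y, y_hat):
    """F = (RSS / 1) / (SSE / (n - 2)) for a one-regressor model."""
    n = len(y)
    my = sum(y) / n
    rss = sum((f - my) ** 2 for f in y_hat)          # explained variation
    sse = sum((v - f) ** 2 for v, f in zip(y, y_hat))  # unexplained variation
    return (rss / 1) / (sse / (n - 2))

f_stat = anova_f([2, 3, 5, 4], [2.3, 3.1, 3.9, 4.7])
print(round(f_stat, 4))  # ≈ 3.5556
```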
Multiple Regression
$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$
A slope coefficient, bj , measures how much the dependent variable,
Y , changes when the independent variable, Xj , changes by one
unit, holding all other independent variables constant.
In practice, software programs are used to estimate the multiple
regression model.
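As the text notes, estimation is done in software in practice. A minimal sketch using NumPy's least-squares solver on made-up data (the variable names and values are illustrative):

```python
import numpy as np

# Hypothetical data: n = 5 observations, k = 2 independent variables.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.0, 3.0, 7.0, 7.0, 10.0])

# Prepend an intercept column and solve for [b0, b1, b2] by least squares.
A = np.column_stack([np.ones(len(X)), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(b)  # here y = X1 + X2 exactly, so b ≈ [0, 1, 1]
```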
ANOVA
$$F = \frac{\mathrm{RSS}/k}{\mathrm{SSE}/[n - (k + 1)]} = \frac{\text{Mean regression sum of squares}}{\text{Mean squared error}} = \frac{\mathrm{MSR}}{\mathrm{MSE}}$$
Adjusted R²
Adjusted R² is a measure of goodness of fit that accounts for additional explanatory variables.

$$\bar{R}^2 = 1 - \left(\frac{n - 1}{n - k - 1}\right)(1 - R^2)$$
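The adjustment is a one-line computation. A sketch with hypothetical inputs:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Hypothetical regression with n = 20 observations and k = 3 regressors.
print(adjusted_r2(0.8, 20, 3))  # 0.7625
```

Note that adding a regressor always raises R² but raises adjusted R² only if the new variable improves the fit enough to offset the lost degree of freedom.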
Regressions with
Heteroskedasticity
Serial Correlation
When regression errors are correlated across
observations, we say that they are serially correlated (or
autocorrelated).
Serial correlation most typically arises in time-series
regressions.
The principal problem caused by serial correlation in a linear regression is an incorrect estimate of the regression coefficient standard errors.
Positive serial correlation is serial correlation in which a
positive error for one observation increases the chance
of a positive error for another observation.
The Durbin-Watson statistic can be used to test for serial correlation:

$$\mathrm{DW} = \frac{\sum_{t=2}^{T} (\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1})^2}{\sum_{t=1}^{T} \hat{\varepsilon}_t^2}$$
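The statistic is straightforward to compute from the regression residuals. A minimal sketch (the residual values are made up); values near 2 are consistent with no serial correlation, values below 2 with positive serial correlation, and values above 2 with negative serial correlation:

```python
def durbin_watson(residuals):
    """DW = sum_{t=2}^T (e_t - e_{t-1})^2 / sum_{t=1}^T e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals: a sign of negative serial correlation (DW > 2).
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
```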
Multicollinearity
Multicollinearity occurs when two or more independent
variables (or combinations of independent variables) are
highly (but not perfectly) correlated with each other.
Multicollinearity does not affect the consistency of the OLS estimates of the regression coefficients, but the estimates become extremely imprecise and unreliable.
The classic symptom of multicollinearity is a high R2 (and
significant F-statistic) even though the t-statistics on the
estimated slope coefficients are not significant.
The most direct solution to multicollinearity is excluding
one or more of the regression variables.
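One common diagnostic, not named in the text, is the variance inflation factor, $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing the j-th independent variable on the remaining ones. A sketch using NumPy (data and function name are illustrative):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of the regressor matrix X:
    1 / (1 - R_j^2), with R_j^2 from regressing X[:, j] on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Hypothetical regressor matrix; large VIFs (often > 5 or 10) flag collinearity.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0], [5.0, 2.0]])
print(vif(X, 1))  # ≈ 1.33: little collinearity in this toy data
```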
Problem               Effect                        Solution
Heteroskedasticity    Incorrect standard errors     Correct for conditional
                                                    heteroskedasticity
Serial correlation    Incorrect standard errors     Correct for serial
                                                    correlation
Multicollinearity     Imprecise, unreliable         Exclude one or more
                      coefficient estimates         regression variables
Model Specification
Model specification refers to the set of variables included in the regression and the regression equation's functional form.
Possible misspecifications include:
One or more important variables could be omitted from the regression.
One or more of the regression variables may need to be transformed (for example, by taking the natural logarithm of the variable) before estimating the regression.
The regression model pools data from different samples that should not be pooled.