
Assumptions of Regression

Assumptions
• Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has the following key assumptions:
• Sample Size
• Outliers
• Linear relationship
• Multivariate normality
• No or little multicollinearity
• No auto-correlation
• Homoscedasticity
Sample Size

• A note about sample size: in linear regression, the rule of thumb is that the analysis requires at least 20 cases per independent variable.
• Stevens (1996, p. 72) recommends that ‘for social science research, about 15 participants per predictor are needed for a reliable equation’.
• Tabachnick and Fidell (2007, p. 123) give a formula for calculating sample size requirements, taking into account the number of independent variables that you wish to use: N > 50 + 8m (where m = number of independent variables).
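As a rough illustration (a minimal sketch; the function name and the choice of five predictors are invented for the example), the three rules of thumb give different minimum sample sizes:

def required_n(m):
    # Minimum N suggested by each rule of thumb for m independent variables.
    return {
        "20 cases per predictor": 20 * m,
        "Stevens (1996), 15 per predictor": 15 * m,
        "Tabachnick & Fidell (2007), N > 50 + 8m": 50 + 8 * m + 1,
    }

print(required_n(5))  # e.g., with 5 predictors: 100, 75 and 91 cases respectively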
Outliers
• Multiple regression is very sensitive to outliers
(very high or very low scores). Checking for
extreme scores should be part of the initial
data screening. You should do this for all the
variables, both dependent and independent,
that you will be using in your regression
analysis.
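A minimal screening sketch in Python (the column names, the simulated data and the 3-standard-deviation cutoff are illustrative assumptions, not part of the original text):

import numpy as np
import pandas as pd

# Illustrative data frame holding the dependent (y) and independent (x1, x2) variables.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "y": rng.normal(50, 10, 200),
    "x1": rng.normal(0, 1, 200),
    "x2": rng.normal(100, 15, 200),
})

# Flag scores more than 3 standard deviations from the mean of each column.
z_scores = (df - df.mean()) / df.std()
extreme = z_scores.abs() > 3
print(extreme.sum())              # number of extreme scores per variable
print(df[extreme.any(axis=1)])    # rows containing at least one extreme score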
Linear relationship
• Linear regression needs the relationship
between the independent and dependent
variables to be linear.
• It is also important to check for outliers since
linear regression is sensitive to outlier effects.
• The linearity assumption can best be tested with scatter plots; the following two examples depict cases where little or no linearity is present.
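A simple way to produce such scatter plots (a sketch that reuses the illustrative df from the outlier example above; a roughly straight-line pattern supports the assumption):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, ["x1", "x2"]):
    ax.scatter(df[col], df["y"], alpha=0.6)   # each predictor against the dependent variable
    ax.set_xlabel(col)
    ax.set_ylabel("y")
plt.tight_layout()
plt.show()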
Multivariate Normality

• Secondly, the linear regression analysis requires all variables to be multivariate normal.
• This assumption can best be checked with a
histogram or a Q-Q-Plot.
• Normality can be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov test. When the data are not normally distributed, a non-linear transformation (e.g., a log transformation) might fix this issue.
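A sketch of both checks (the residuals array is a placeholder; in practice you would use the residuals of your fitted model, and a Lilliefors-corrected test would be more exact than a plain Kolmogorov-Smirnov test with estimated parameters):

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

residuals = np.random.normal(0, 1, 200)   # placeholder for the residuals of a fitted model

# Q-Q plot: points close to the reference line suggest normality.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Kolmogorov-Smirnov goodness-of-fit test against a normal distribution.
stat, p = stats.kstest(residuals, "norm", args=(residuals.mean(), residuals.std()))
print(f"KS statistic = {stat:.3f}, p = {p:.3f}")

# If the data are clearly non-normal, a log transformation is one common fix:
# y_log = np.log(y)   # assumes y > 0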
Multicollinearity

• Thirdly, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.
• Multicollinearity may be tested with three central criteria (see the sketch after this list):
1) Correlation matrix – when computing the matrix of Pearson’s bivariate correlations among all independent variables, the correlation coefficients need to be smaller than .7.
2) Tolerance – the tolerance measures the influence of one independent variable on all other independent variables; the tolerance is calculated with an initial linear regression analysis. Tolerance is defined as T = 1 – R² for this first-step regression. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there certainly is.
3) Variance Inflation Factor (VIF) – the variance inflation factor of the linear regression is defined as VIF = 1/T. With VIF > 10 there is an indication that multicollinearity may be present; with VIF > 100 there is certainly multicollinearity among the variables.
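All three criteria can be computed with pandas and statsmodels; the sketch below uses an invented predictor matrix in which x3 is deliberately constructed to be collinear with x1:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = 0.9 * X["x1"] + rng.normal(scale=0.1, size=200)   # nearly a copy of x1

# 1) Correlation matrix: look for coefficients larger than .7
print(X.corr().round(2))

# 2) and 3) Tolerance and VIF for each predictor (T = 1/VIF)
X_const = sm.add_constant(X)
for i, name in enumerate(X.columns, start=1):   # index 0 is the constant
    vif = variance_inflation_factor(X_const.values, i)
    print(f"{name}: VIF = {vif:.2f}, Tolerance = {1 / vif:.3f}")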
Multicollinearity
• If multicollinearity is found in the data, centering
the data (that is deducting the mean of the
variable from each score) might help to solve the
problem.
• However, the simplest way to address the
problem is to remove independent variables with
high VIF values.
• Another alternative for tackling the problem is to conduct a factor analysis and rotate the factors to ensure independence of the factors in the linear regression analysis.
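The first two remedies are one-liners in pandas (a sketch continuing with the illustrative X from the example above; dropping x3 is only an example of removing the predictor with the highest VIF):

# Centering: subtract each predictor's mean from its scores.
X_centered = X - X.mean()

# Simplest remedy: drop the independent variable with the highest VIF.
X_reduced = X.drop(columns=["x3"])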
Autocorrelation
• Fourth, linear regression analysis requires that
there is little or no autocorrelation in the data.
Autocorrelation occurs when the residuals are
not independent from each other. For
instance, this typically occurs in stock prices,
where the price is not independent from the
previous price.
• In other words, the value of y(x+1) is not independent of the value of y(x).
Autocorrelation
• While a scatterplot allows you to check for
autocorrelations, you can test the linear regression
model for autocorrelation with the Durbin-Watson
test.
• Durbin-Watson’s d tests the null hypothesis that the
residuals are not linearly auto-correlated. While d can
assume values between 0 and 4, values around 2
indicate no autocorrelation. As a rule of thumb values
of 1.5 < d < 2.5 show that there is no auto-correlation
in the data.
• However, the Durbin-Watson test only analyses linear
autocorrelation and only between direct neighbors,
which are first order effects.
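A sketch of the test using statsmodels (the data and coefficients are invented for the example; durbin_watson is applied to the residuals of the fitted OLS model):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=200)

model = sm.OLS(y, X).fit()
d = durbin_watson(model.resid)
print(f"Durbin-Watson d = {d:.2f}")   # values around 2 indicate no first-order autocorrelation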
Homoscedasticity
• The assumption of homoscedasticity (meaning “same
variance”) is central to linear regression models.
Homoscedasticity describes a situation in which the
error term (that is, the “noise” or random disturbance
in the relationship between the independent variables
and the dependent variable) is the same across all
values of the independent variables.
• Heteroscedasticity (the violation of homoscedasticity)
is present when the size of the error term differs across
values of an independent variable. The impact of
violating the assumption of homoscedasticity is a
matter of degree, increasing as heteroscedasticity
increases.
Homoscedasticity
• A scatter plot is a good way to check whether the data are homoscedastic (meaning the residuals are equal across the regression line).
• The following scatter plots show examples of
data that are not homoscedastic (i.e.,
heteroscedastic)
• Examining a scatterplot of the residuals against the predicted values of the dependent variable reveals heteroscedasticity as a classic cone-shaped pattern.
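Such a residual plot can be produced directly from a fitted statsmodels model (a sketch reusing the model from the Durbin-Watson example; a fan or cone shape that widens with the predicted values indicates heteroscedasticity):

import matplotlib.pyplot as plt

plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()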
Reference
• http://www.statisticssolutions.com/assumptions-of-linear-regression/
