
Simple Linear Regression

Regression is a statistical method that attempts to represent the relationship between two variables by approximating this relationship with a straight line.
Since not all relationships are linear (straight-line) in fashion, simple LR only works well for bivariate data that has a linear relationship.
Regression analysis develops a linear equation showing how the two variables are related.
Requires a Y-intercept (b0) and a slope (b1).

Computing Regression Line

[Figure: scatter plot of data versus time; a straight line is drawn through the scattered points]

Computing Regression Line

[Figure: the same scatter plot of data versus time with a fitted line, showing the vertical deviations e1, e2, e3, e4 between the observations and the line]

Computing Regression Line: Least Squares Line

The least squares line is the straight line that best passes through the points of a scatter diagram.
The least squares line is the line through the data that minimizes the sum of the squared differences between the observations and the line (these differences are commonly called residuals):

$$\sum e^2 = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2$$

Sum of Squares of Error

How can we determine the sum of squares of error? By using the following formulas; remember, the regression line minimizes this value (SSE):

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = b_0 + b_1 x_i$$

$\hat{y}$ (Y-hat) is the y value predicted by the regression line; $y$ is the actual observed y value.

Least Squares Formula

$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$
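As a quick sanity check, here is a minimal NumPy sketch of these formulas, applied to the salary-and-experience data used later in this deck (variable names are illustrative):

```python
import numpy as np

# Salary-and-experience example used later in this deck
x = np.array([15, 10, 20, 5, 15, 5], dtype=float)    # experience (years)
y = np.array([30, 35, 55, 22, 40, 27], dtype=float)  # salary ($thousand)

n = len(x)
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = Sxy / Sxx                  # slope
b0 = y.mean() - b1 * x.mean()   # Y-intercept

y_hat = b0 + b1 * x
SSE = np.sum((y - y_hat) ** 2)  # the quantity the least squares line minimizes

print(b0, b1)  # approx. 15.32 and 1.673, matching the example below
```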

Straight-Line Relationship
Y = b0 + b1X
b0 represents the Y-intercept, which is the value of Y when X = 0.
b1 is the slope of the line, which is the amount of change in Y for a unit change in X.

Assumptions of Simple LR

[Figure: at each value of X (x = 1, 2, 3, 4), the conditional distribution of Y given X is a normal curve whose mean lies on the regression line]

Note

$$E(y \mid x) = \mu_{y|x} = \beta_0 + \beta_1 x$$
$$\hat{y} = b_0 + b_1 x$$
$$y = b_0 + b_1 x + e$$

The first two equations are deterministic model representations for the mean of Y given X, but not for an actual y value (which includes an error term). In any case, the derived LR equation, as long as it has proven stable and reliable, can be used to predict either the average y-value or the actual y-value.

Assumptions of Simple LR

$E(\varepsilon \mid X) = 0$ (zero mean)
$\mathrm{Var}(\varepsilon \mid X) = \sigma^2$ (constant, homogeneous/homoscedastic variance)
$\varepsilon$ is normally distributed.
The value of $\varepsilon$ associated with any particular value of Y is independent of the $\varepsilon$ associated with any other value of Y, as if the errors came from a random sample:

$$Y_1 = \beta_0 + \beta_1 x_1 + \varepsilon_1$$
$$Y_2 = \beta_0 + \beta_1 x_2 + \varepsilon_2$$

with $\varepsilon_1$ and $\varepsilon_2$ independent.

Another Note

$$\varepsilon \mid X \sim N(0, \sigma) \quad \Rightarrow \quad Y \mid X \sim N(\beta_0 + \beta_1 x, \; \sigma)$$

An unbiased estimate of $\sigma^2$ is:

$$s^2 = \frac{SSE}{n-2} = \frac{S_{yy} - b_1 S_{xy}}{n-2}, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
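Continuing the NumPy sketch from earlier (reusing x, y, n, b1, Sxy), this estimate is a one-liner:

```python
# Unbiased estimate of the error variance (continues the earlier sketch)
Syy = np.sum((y - y.mean()) ** 2)
s2 = (Syy - b1 * Sxy) / (n - 2)   # equivalently SSE / (n - 2)
s = np.sqrt(s2)                   # approx. 6.52 for the salary data
```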

Inferences on Regression Coefficients


On $\beta_1$ using $b_1$:
Confidence interval: $b_1 \pm t_{\alpha/2} \, \dfrac{s}{\sqrt{S_{xx}}}$
Hypothesis testing: $t = \dfrac{b_1 - \beta_{1,0}}{s / \sqrt{S_{xx}}}$

On $\beta_0$ using $b_0$:
Confidence interval: $b_0 \pm t_{\alpha/2} \, s \sqrt{\dfrac{\sum_{i=1}^{n} x_i^2}{n S_{xx}}}$
Hypothesis testing: $t = \dfrac{b_0 - \beta_{0,0}}{s \sqrt{\sum_{i=1}^{n} x_i^2 / (n S_{xx})}}$

Use $v = n - 2$ degrees of freedom for all of these inferences.
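Continuing the sketch (reusing n, x, b0, b1, Sxx, s), the 95% intervals can be computed with SciPy:

```python
from scipy import stats

# 95% confidence intervals for the coefficients (continues the sketch)
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

half_b1 = t_crit * s / np.sqrt(Sxx)
ci_b1 = (b1 - half_b1, b1 + half_b1)

half_b0 = t_crit * s * np.sqrt(np.sum(x ** 2) / (n * Sxx))
ci_b0 = (b0 - half_b0, b0 + half_b0)
```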

Hypothesis Test on the Slope of the Regression Line

How do we know that a significant linear relationship exists using regression? The slope, b1, will give us an indication.

Hypothesis Test on the Slope of the Regression Line

If the slope of the least squares line is zero, there is no linear relationship. However, if the slope of the least squares line is significantly greater than 0 or significantly less than 0, then we can conclude that a linear relationship exists. Therefore, we want to test the following hypotheses:

H0: β1 = 0 (X provides no information)
Ha: β1 ≠ 0 (X does provide information)

Hypothesis Test on the Slope of the Regression Line

Two-tailed: H0: β1 = 0 vs. Ha: β1 ≠ 0; reject H0 if |t*| > t(α/2, n-2)
Upper-tailed: H0: β1 ≤ 0 vs. Ha: β1 > 0; reject H0 if t* > t(α, n-2)
Lower-tailed: H0: β1 ≥ 0 vs. Ha: β1 < 0; reject H0 if t* < -t(α, n-2)

In every case the test statistic is

$$t^* = \frac{b_1}{s_{b_1}}, \qquad s_{b_1} = \frac{s}{\sqrt{S_{xx}}}, \qquad df = n - 2$$
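Continuing the sketch, the two-tailed version of this test:

```python
# Two-tailed t-test of H0: beta1 = 0 (continues the sketch)
s_b1 = s / np.sqrt(Sxx)
t_star = b1 / s_b1
p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)
reject_H0 = p_value < 0.05
```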

Prediction

Confidence interval on the mean response $\mu_{Y|x_0}$:

$$\hat{y}_0 \pm t_{\alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

Prediction interval on a single response $y_0$:

$$\hat{y}_0 \pm t_{\alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
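Continuing the sketch, both intervals at a chosen x0 (here x0 = 20, Mary's experience in the example that follows):

```python
# Interval estimates at x0 = 20 years of experience (continues the sketch)
x0 = 20.0
y0_hat = b0 + b1 * x0

se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)      # mean response
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)  # single response

ci_mean = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
pi_single = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
```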

Example: Salary and Experience

Mary earns $55,000 per year and has 20 years of experience.

For n = 6 employees:

Experience (years): 15, 10, 20, 5, 15, 5
Salary ($thousand): 30, 35, 55, 22, 40, 27

Linear (straight-line) relationship
Increasing relationship: higher salary generally goes with higher experience
Correlation r = 0.8667

[Figure: scatter plot of Salary ($thousand) vs. Experience (years)]

Example: Salary and Experience

The least squares line summarizes the bivariate data: it predicts Y from X with the smallest errors (in the vertical direction, for the Y axis).
Intercept is 15.32 salary (at 0 years of experience)
Slope is 1.673 salary (for each additional year of experience, on average)

Y = b0 + b1X = 15.32 + 1.673X

[Figure: scatter plot of Salary (Y) vs. Experience (X) with the fitted line Y = 15.32 + 1.673X]

Predicted Values and Residuals

The predicted value comes from the prediction equation:
ŷ = b0 + b1X = 15.32 + 1.673X
For example, Mary (with 20 years of experience) has a predicted salary = 15.32 + 1.673(20) = 48.8. So does anyone with 20 years of experience.

The residual is the actual Y minus the predicted Y (Y - Ŷ).
Mary's residual is 55 - 48.8 = 6.2: she earns about $6,200 more than the predicted salary for a person with 20 years of experience.
A person who earns less than predicted will have a negative residual.

Predicted and Residual (continued)

Mary's residual is 6.2 (55 - 48.8).

[Figure: scatter plot of Salary vs. Experience with the fitted line; Mary earns 55 thousand, her predicted value is 48.8, and the vertical gap between the two is her residual]

Simple Linear Regression Model

When we use a straight line to predict values, we use a statistical model of the form:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

$\beta_0 + \beta_1 X$ is the assumed line about which all values of X and Y will fall; $\varepsilon$ is the error term, which contains all other variability not explained by the independent variable (X).

Note: $\beta_0$ and $\beta_1$ refer to the straight line for the population; we will be using sample data and will use b0 and b1 to refer to the straight line for the sample.

Error Variance

The measures most commonly used to assess how well a line fits through a set of points are the error variance and the error standard deviation:

$$s^2 = \frac{SSE}{n-2}, \qquad s = \sqrt{\frac{SSE}{n-2}}$$

What is s?
A measure of the variation of the Y values around the least squares line
The average distance of a prediction from the actual value
The average size of the residuals
The standard deviation of the residuals

Error Variance

Interpretation: similar to a standard deviation. Move the least-squares line up and down by s: about 68% of the data fall within one standard error of estimate of the least-squares line (for a bivariate normal distribution).

[Figure: Salary vs. Experience scatter plot showing the least squares line with parallel bands at (least squares line) + s and (least squares line) - s]

Example: Salary and Experience

Regression and Prediction Error
Predicting Y as ȳ (not using regression): errors are approximately SY = 11.686
Predicting Y as b0 + b1X (using regression): errors are approximately s = 6.52
Errors are smaller when regression is used!
This is often the true payoff for using regression.

Measuring the Strength of the Model


Another item of interest is to determine how well the regression model fits the data. To determine this, we use the coefficient of determination (r²), which gives the percentage of explained variation in the dependent variable using the model.

Coefficient of Determination

$$r^2 = 1 - \frac{SSE}{S_{yy}}$$

r², the coefficient of determination, is the percentage of explained variation in the dependent variable using the simple linear regression model. Taking the square root of the coefficient of determination gives the correlation coefficient r.

Correlation Coefficient

The sample correlation coefficient, r, measures the strength of the linear relationship that exists within a sample of n bivariate data:

$$r = b_1 \sqrt{\frac{S_{xx}}{S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

Note: When one computes r by taking the square root of r², affix the sign of b1 to the final value.
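Continuing the sketch (reusing SSE, Syy, Sxx, Sxy, b1), both quantities in one step:

```python
# Coefficient of determination and correlation (continues the sketch)
r2 = 1 - SSE / Syy
r = np.sign(b1) * np.sqrt(r2)   # affix the sign of b1
# equivalently r = Sxy / np.sqrt(Sxx * Syy); approx. 0.8667 for the salary data
```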

Interpreting the Correlation Coefficient

If r = 1, then X and Y have a perfect positive linear relationship.
If r = -1, then X and Y have a perfect negative linear relationship.
If r = 0, then X and Y have no linear relationship.
If 0 < r < 1, then X and Y are positively related. The closer r is to 1, the stronger the linear relationship.
If 0 > r > -1, then X and Y are negatively related. The closer r is to -1, the stronger the linear relationship.

Correlation Coefficient Summary

r ranges from -1.0 to 1.0.
The larger |r| is, the stronger the linear relationship.
r near zero indicates that there is no linear relationship; X and Y are uncorrelated.
The sign of r tells you whether the relationship between X and Y is positive or negative.
The value of r tells you very little about the slope of the line, except that if r is positive the slope of the line is positive, and if r is negative then the slope is negative.

Examples: Interpreting Correlation

[Figure: scatter plots for various values of rxy. rxy = 1: a perfect straight line tilting up to the right. rxy = 0: no overall tilt, no linear relationship. rxy = -1: a perfect straight line tilting down to the right.]

Significance Test for the Correlation

Due to sampling error, the value of r may not reflect the true relationship in the entire population, especially if the sample is quite small. Therefore, a formal hypothesis test may be needed. Hypotheses tested:

H0: ρ = 0 (no correlation)
Ha: ρ ≠ 0 (correlation exists)

Note: ρ = population correlation coefficient

Hypothesis Test

Another way to test whether a linear relationship exists between the two variables of interest is to use the relationship between b1 and r (they are closely related):

H0: β1 = 0 (no linear relationship exists)
Ha: β1 ≠ 0 (a linear relationship exists)

$$t^* = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = r \sqrt{\frac{n-2}{1-r^2}}$$

This gives exactly the same value for t* as $t^* = b_1 / s_{b_1}$. Reject H0 if |t*| > t(α/2, n-2).
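Continuing the sketch, one can verify the equivalence numerically:

```python
# t-test for the correlation; identical to the slope test (continues the sketch)
t_star_r = r * np.sqrt((n - 2) / (1 - r ** 2))
# t_star_r matches t_star = b1 / s_b1 computed earlier, up to rounding
```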

Hypothesis Test Continued

If one desires to carry out the general test:

H0: ρ = ρ0
Ha: ρ ≠ ρ0 / ρ > ρ0 / ρ < ρ0

one can use:

$$z = \frac{\sqrt{n-3}}{2} \ln\left[\frac{(1+r)(1-\rho_0)}{(1-r)(1+\rho_0)}\right]$$

This works on the assumption that both X and Y follow the bivariate normal distribution.
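Continuing the sketch, with a hypothetical null value rho0 chosen purely for illustration:

```python
# Fisher z-test of H0: rho = rho0 (continues the sketch)
rho0 = 0.5   # hypothetical null value, for illustration only
z = (np.sqrt(n - 3) / 2) * np.log(((1 + r) * (1 - rho0)) / ((1 - r) * (1 + rho0)))
p_two_sided = 2 * stats.norm.sf(abs(z))
```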

Exercise Problems

Problem #5, p. 359 (manual)
Problem #6, p. 359 (Excel)
Problem #7, p. 359 (Excel)
Problem #7, p. 371
Problems #1 and 2, p. 396
Problem #5, p. 380

Check for Model Significance and Adequacy

The ANOVA approach:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$S_{yy} = b_1 S_{xy} + SSE$$
$$SST = SSR + SSE$$

This is similar to testing:

H0: β1 = 0
Ha: β1 ≠ 0

Check for Model Significance and Adequacy

The ANOVA table:

Source of Variation | Sum of Squares | DoF | Mean Square    | Computed F
Regression          | SSR            | 1   | SSR            | SSR/s²
Error               | SSE            | n-2 | s² = SSE/(n-2) |
Total               | SST            | n-1 |                |

Reject H0 if computed F > F_α(1, n-2).
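Continuing the sketch, the same F-test in a few lines (for simple LR, F equals t*²):

```python
# ANOVA F-test for the regression (continues the sketch)
SSR = b1 * Sxy            # regression sum of squares, since Syy = SSR + SSE
F = SSR / s2              # MS(regression) / MS(error)
p_F = stats.f.sf(F, 1, n - 2)
# For simple linear regression, F equals t_star ** 2
```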

Check for Model Significance and Adequacy

If repeated observations are made at several x values, the SSE term shown previously can be further divided into error due to lack of fit and pure experimental error.

Computational formulas, with k = number of distinct values of x, n_i = number of observations at x_i, y_ij = the j-th value of the random variable Y_i, T_i. = Σ_j y_ij, and ȳ_i. = T_i./n_i:

$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n_i - 1}$$

$$s^2 = \frac{\sum_{i=1}^{k} (n_i - 1) s_i^2}{n - k} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2}{n - k} \quad \text{(mean square of pure experimental error)}$$

$$SSE(\text{pure}) = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2, \qquad SS(\text{lack of fit}) = SSE - SSE(\text{pure})$$

Check for Model Significance and Adequacy

The ANOVA table becomes:

Source of Variation | Sum of Squares  | DoF | Mean Square (MS)        | Computed F
Regression          | SSR             | 1   | SSR                     | SSR/s²
Error               | SSE             | n-2 |                         |
  Lack of Fit       | SSE - SSE(pure) | k-2 | (SSE - SSE(pure))/(k-2) | MS(lack of fit)/s²
  Pure Error        | SSE(pure)       | n-k | s²                      |
Total               | SST             | n-1 |                         |

Model significant if F_reg > F_α(1, n-k)
Model adequate if F_lack-of-fit < F_α(k-2, n-k)
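A minimal sketch of this decomposition, assuming NumPy arrays x, y and fitted values y_hat as defined in the earlier sketch:

```python
import numpy as np

def lack_of_fit_ss(x, y, y_hat):
    """Split SSE into lack-of-fit and pure-error sums of squares."""
    sse = np.sum((y - y_hat) ** 2)
    sse_pure = 0.0
    for xv in np.unique(x):              # group the replicates at each distinct x
        grp = y[x == xv]
        sse_pure += np.sum((grp - grp.mean()) ** 2)
    return sse - sse_pure, sse_pure      # SS(lack of fit), SSE(pure)
```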

Model Adequacy

Significant lack of fit means that considerable variation is being caused by higher-order terms: terms in x other than the linear, or first-order, terms.
Illustrations are given in Figures 11.11 and 11.12, pp. 378-379 of the book.

Checking Model Assumptions

1. The errors are normally distributed with a mean of zero.
   Construct a normal probability plot (plot of residuals); if the resulting graph is linear, the normality assumption is verified.
   Conduct a goodness-of-fit test: chi-square, KS, Shapiro-Wilk.
   Run statistical tests on kurtosis and skewness.
2. The variance of the error component is the same for each value of X.
   Plot the residuals against the independent variable, X; if no pattern exists, this assumption holds.
3. The errors are independent of each other.
   Look for autocorrelation.
   Plot sample residuals by time (time series analysis).

Normal Probability Plot

Compares the cumulative distribution of the actual data values with the cumulative distribution of a normal distribution. (If normal, the points should fall around the diagonal straight line.)
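A minimal sketch, assuming matplotlib is available and reusing y and y_hat from the earlier fit:

```python
import matplotlib.pyplot as plt
from scipy import stats

residuals = y - y_hat                  # from the earlier sketch
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()                             # points near the line suggest normality holds
```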

Deviations from Normality

[Figure: normal probability plots illustrating typical deviations from normality]

Statistical Checking for Normality


Deviations from Normality
Kurtosis refers to the peakedness or flatness of the distribution. If normal, kurtosis is zero.
Skewness deals with the symmetry of the distribution; a skewed variable is a variable whose mean is not in the center of the distribution. If normal, skewness is zero.

Checking for Normality

Statistical test for kurtosis:
$$z = \frac{\text{kurtosis}}{\sqrt{24/N}}$$

Statistical test for skewness:
$$z = \frac{\text{skewness}}{\sqrt{6/N}}$$
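A minimal sketch of both z-tests, assuming SciPy's conventions (excess kurtosis, zero for a normal distribution) and reusing the residuals from above:

```python
from scipy import stats
import numpy as np

N = len(residuals)
z_kurt = stats.kurtosis(residuals) / np.sqrt(24 / N)  # excess kurtosis z-score
z_skew = stats.skew(residuals) / np.sqrt(6 / N)       # skewness z-score
# |z| > 1.96 suggests a departure from normality at the 5% level
```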

Checking for Normality

Other Tools
Histogram (good for numerous data points)
Goodness-of-fit tests (good for many data points, 30 or more; but overly sensitive for very large samples, 1000 or more)

Checking Model Assumptions: Equal Variance

[Figure: two residual plots. A: residuals show no pattern, the assumption holds. B: residuals fan out, the assumption is violated]
Checking Model Assumptions: Autocorrelation

[Figure: residuals plotted in time order showing a systematic pattern. Autocorrelation exists, a violation of the errors being independent of each other]

Importance of Assumptions

Normality: t-test, F-test.
Homoscedasticity: t-test, F-test; ensures the variance used in explanation and prediction is distributed across the range of independent variable values.
Absence of correlated errors: confidence that prediction errors are independent of the levels at which one is trying to predict; assurance that no other systematic variable is affecting the results while being left out of the analysis.

On Violation of Assumptions

One violation can be the result of another. For example, a violation of normality can be linked to, or result from, non-constant variance. Likewise, a remedy applied to one can solve another.
Remedy available: data or variable transformation.

Notes on Transformation

Two purposes:
1. Correct violations of statistical assumptions.
2. Improve the correlation between variables.

How to choose?
1. Theoretical basis / nature of the data (e.g., the square root transform works well with frequency count data, the arcsine transform with proportion data)
2. Trial and error

Transformation is not a magic cure for all violations. It will not eliminate all violations, but it can lead to very significant improvements.

Suggested Transforms for Non-normality

[Table: suggested transforms for various non-normal distribution shapes]
Note: The inverse transform usually works well with flat distributions.

Suggested Transforms for Heteroscedasticity

If the cone opens to the right, try the inverse transform.
If the cone opens to the left, try the square root transform.

Some General Guidelines on Transformation

For a noticeable effect of transformation, the ratio of the variable's mean to its standard deviation should be < 4.
If the transformation can be performed on either of two variables (in non-linearity), select the variable with the smallest mean-to-s ratio.
Transformations should be applied to IVs, except in cases of heteroscedasticity. If the relationship is both heteroscedastic and non-linear, there may be a need to transform both the IV and the DV.
Transformation may change the interpretation of the variables. Be careful!

Suggested Transforms for Non-linearity

[Figure: quadrants of curvature patterns with candidate transformations]

In any of the given illustrations, the transformation could be carried out on either the independent or the dependent variable. When multiple transformation possibilities are shown, start with the top method in each quadrant and move downward until linearity is achieved.

Simple Non-linear Regression by Linearization

Function                   | Proper Transformation  | Form of Simple Linear Regression
Exponential: Y = αe^(βx)   | Y* = ln Y              | Regress Y* against x
Power: Y = αx^β            | Y* = log Y, X* = log X | Regress Y* against X*
Reciprocal: Y = α + β(1/X) | X* = 1/X               | Regress Y against X*
Hyperbolic: Y = x/(α + βx) | Y* = 1/Y, X* = 1/X     | Regress Y* against X*
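A minimal sketch of the first row, exponential linearization, with illustrative data only:

```python
import numpy as np

# Fit Y = a * exp(b * x) by regressing ln(y) on x (illustrative data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4])   # roughly exp(x)

ystar = np.log(y)                   # transform: Y* = ln Y
b, ln_a = np.polyfit(x, ystar, 1)   # straight-line fit in the transformed variables
a = np.exp(ln_a)
y_fit = a * np.exp(b * x)           # judge fit on the original, untransformed scale
```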

Notes on Non-Linear Regression by Linearization

A model in the transformed variables that has a proper additive error structure results from a model in the natural variables with a different type of error structure.
Performance criteria (s² and R²) for the transformed model should be based on the values of the residuals in the metric of the untransformed response.
A sample problem will be given in class.

Obtaining The Regression Output

To fit a linear regression using Excel:
Choose Data Analysis, then Regression.
Choose the two data columns for which the regression is to be calculated; the Y variable will be on the vertical axis.
Click the residual plot and normal probability plot options to check the assumptions.

Statistica could also be used.
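If Python is at hand instead, a minimal statsmodels sketch gives comparable output (using the salary data from earlier):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([15, 10, 20, 5, 15, 5], dtype=float)
y = np.array([30, 35, 55, 22, 40, 27], dtype=float)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.summary())   # coefficients, standard errors, R-squared, t- and F-tests
```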

Caveats in Simple LR

Linear model may be wrong: nonlinear? unequal variability? clustering?
Predicting intervention from experience is hard: the relationship may become different if you intervene.
The intercept may not be meaningful if there are no data near X = 0.
Explaining Y from X vs. explaining X from Y: use care in selecting the Y variable to be predicted.
Is there a hidden third factor? Use it to predict better with multiple regression.
