
The Simple Linear Regression Model

Purpose of Regression and Correlation Analysis
Regression Analysis is Used Primarily for Prediction
A statistical model used to predict the values of a dependent or response variable based on values of at least one independent or explanatory variable
Correlation Analysis is Used to Measure Strength of the Association Between Numerical Variables
The Scatter Diagram
[Figure: scatter diagram, a plot of all \((X_i, Y_i)\) pairs]
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Simple Linear Regression Model
Relationship Between Variables Is a Linear Function
The Straight Line that Best Fits the Data
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
\(Y_i\) = Dependent (Response) Variable
\(X_i\) = Independent (Explanatory) Variable
\(\beta_0\) = Y intercept
\(\beta_1\) = Slope
\(\varepsilon_i\) = Random Error
Error Variable: Required Conditions
The error \(\varepsilon\) is a critical part of the regression model.
Four requirements involving the distribution of \(\varepsilon\) must be satisfied.
The probability distribution of \(\varepsilon\) is normal.
The mean of \(\varepsilon\) is zero: \(E(\varepsilon) = 0\).
The standard deviation of \(\varepsilon\) is \(\sigma_\varepsilon\), the same constant for all values of x.
The errors associated with different values of y are all independent.
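Population Linear Regression Model
[Figure: the population regression line \(\mu_{Y|X} = \beta_0 + \beta_1 X_i\) drawn on X-Y axes, with an observed value lying off the line; the vertical gap between the observed value and the line is the random error \(\varepsilon_i\)]
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]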
Sample Linear Regression Model
\[ \hat{Y}_i = b_0 + b_1 X_i \]
\(\hat{Y}_i\) = Predicted Value of Y for observation i
\(X_i\) = Value of X for observation i
\(b_0\) = Sample Y intercept, used as an estimate of the population \(\beta_0\)
\(b_1\) = Sample Slope, used as an estimate of the population \(\beta_1\)
To calculate the estimates of the coefficients that minimize the sum of squared differences between the data points and the line, use the formulas:
\[ b_1 = \frac{\operatorname{cov}(X, Y)}{s_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]
The regression equation that estimates the equation of the first-order linear model is:
\[ \hat{y} = b_0 + b_1 x \]
REGRESSION COEFFICIENTS
To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas (least squares method):
\[ b_1 = \frac{n \sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)}{n \sum X_i^2 - \left(\sum X_i\right)^2} \quad \text{and} \quad b_0 = \bar{Y} - b_1 \bar{X} \]
EXCEL offers several approaches to regression, including trendlines, regression functions and the regression analysis tool.
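The computational least squares formulas can be sketched in a few lines of Python. This is only an illustration: the function name `least_squares` and the toy points (chosen to lie exactly on y = 1 + 2x) are assumptions made for the example, not part of the slides.

```python
def least_squares(xs, ys):
    """Return (b0, b1) using the computational least squares formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - (sum(X))^2)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: Y-bar minus slope times X-bar
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1


# Hypothetical toy data lying exactly on the line y = 1 + 2x
toy_x = [0, 1, 2, 3]
toy_y = [1, 3, 5, 7]
b0, b1 = least_squares(toy_x, toy_y)
print(b0, b1)
```

Because the toy points are perfectly linear, the function should recover the intercept 1 and the slope 2.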


Simple Linear Regression Equation: Example
You wish to examine the relationship between the square footage of produce stores and their annual sales. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Scatter Diagram Example
[Figure: scatter diagram of Annual Sales ($000) against Square Feet for the 7 stores]
Equation for the Best Straight Line
\[ \hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i \]
From Excel Printout:
               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
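As a cross-check on the Excel printout, the coefficients can be recomputed from the 7 stores with the covariance form of the formulas, \(b_1 = \operatorname{cov}(X, Y) / s_x^2\) and \(b_0 = \bar{y} - b_1 \bar{x}\). The sketch below uses plain Python; the variable names are illustrative choices.

```python
# Produce store data from the example (sales in $000)
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

# Sample covariance and sample variance (both divide by n - 1)
cov_xy = sum((x - x_bar) * (y - y_bar)
             for x, y in zip(square_feet, annual_sales)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in square_feet) / (n - 1)

b1 = cov_xy / var_x          # slope
b0 = y_bar - b1 * x_bar      # intercept
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

The printed values should agree with the Intercept and X Variable 1 entries of the printout after rounding.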
Graph of the Best Straight Line
[Figure: the scatter diagram of Annual Sales ($000) against Square Feet with the fitted regression line drawn through the points]
Interpreting the Results
\[ \hat{Y}_i = 1636.415 + 1.487 X_i \]
The slope of 1.487 means that for each increase of one unit in X, Y is estimated to increase by 1.487 units.
Because sales are recorded in $000, for each increase of 1 square foot in the size of the store, the model predicts that expected annual sales increase by $1,487.
Inferences about the Slope: t Test
t Test for a Population Slope
Is there a Linear Relationship Between X and Y?
Null and Alternative Hypotheses
\(H_0: \beta_1 = 0\) (No Linear Relationship)
\(H_1: \beta_1 \neq 0\) (Linear Relationship)
Test Statistic:
\[ t = \frac{b_1 - \beta_1}{S_{b_1}} \quad \text{and} \quad df = n - 2 \]
Where
\[ S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}} \]
Standard Error of Estimate
\[ S_{YX} = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}} \]
The standard deviation of the variation of observations around the regression line
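A minimal sketch of the standard error of estimate for the produce store data: fit the line, collect the residuals \(Y_i - \hat{Y}_i\), and take \(\sqrt{SSE/(n-2)}\). The helper `fit_line` is a hypothetical name introduced only for this illustration.

```python
import math

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000


def fit_line(xs, ys):
    """Least squares intercept and slope (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


b0, b1 = fit_line(square_feet, annual_sales)

# Residuals: observed Y minus predicted Y
residuals = [y - (b0 + b1 * x) for x, y in zip(square_feet, annual_sales)]
sse = sum(e ** 2 for e in residuals)

# Two coefficients were estimated, hence n - 2 degrees of freedom
s_yx = math.sqrt(sse / (len(square_feet) - 2))
print(f"SSE = {sse:.3f}, S_YX = {s_yx:.2f}")
```

The result is in the units of Y, here thousands of dollars of annual sales.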
Example: Produce Stores
Data for 7 Stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Regression Model Obtained:
\[ \hat{Y}_i = 1636.415 + 1.487 X_i \]
The slope of this model is 1.487.
Is there a linear relationship between the square footage of a store and its annual sales?

               t Stat      P-value
Intercept      3.6244333   0.0151488
X Variable 1   9.009944    0.0002812
Inferences about the Slope: t Test Example
\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)
\(\alpha = .05\)
df = 7 - 2 = 5
Critical Value(s): \(\pm 2.5706\) (area .025 in each tail; reject \(H_0\) if t < -2.5706 or t > 2.5706)
Test Statistic: t = 9.009944 (From Excel Printout)
Decision: Reject \(H_0\)
Conclusion: There is evidence of a linear relationship.
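The test statistic in the printout can be reproduced from the raw data. The sketch below computes \(S_{YX}\), then \(S_{b_1}\), then \(t = (b_1 - 0)/S_{b_1}\), and compares it with the critical value 2.5706 quoted on the slide (taken as given here rather than computed). Variable names are illustrative.

```python
import math

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
sxx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(square_feet, annual_sales))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Standard error of estimate, then standard error of the slope
sse = sum((y - (b0 + b1 * x)) ** 2
          for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / (n - 2))
s_b1 = s_yx / math.sqrt(sxx)

# t statistic under H0: beta_1 = 0
t_stat = (b1 - 0.0) / s_b1

t_crit = 2.5706  # two-tailed critical value for alpha = .05, df = 5 (from the slide)
reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.4f}, reject H0: {reject_h0}")
```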
Inferences about the Slope: Confidence Interval Example
Confidence Interval Estimate of the Slope
\[ b_1 \pm t_{n-2} S_{b_1} \]
Excel Printout for Produce Stores
               Lower 95%    Upper 95%
Intercept      475.810926   2797.01853
X Variable 1   1.06249037   1.91077694
At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911), which does not include 0.
Conclusion: There is a significant linear relationship between annual sales and the size of the store.
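The interval can also be rebuilt from numbers already printed by Excel. Since the t Stat for the slope is \(b_1 / S_{b_1}\) under \(H_0: \beta_1 = 0\), the standard error of the slope can be recovered as \(b_1\) divided by the t Stat. The sketch below assumes the tabled critical value 2.5706 for df = 5, so the limits may differ from Excel's in the last few decimals.

```python
# Values copied from the Excel printout for the produce stores
b1 = 1.486633657      # slope (X Variable 1 coefficient)
t_stat = 9.009944     # t Stat for X Variable 1

# Recover the standard error of the slope from t = b1 / S_b1
s_b1 = b1 / t_stat

t_crit = 2.5706       # t value for 95% confidence, df = n - 2 = 5 (from the slide)

margin = t_crit * s_b1
lower = b1 - margin
upper = b1 + margin
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Because the lower limit is above zero, the interval leads to the same conclusion as the t test.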
Measures of Variation: The Sum of Squares
SST = Total Sum of Squares
measures the variation of the \(Y_i\) values around their mean \(\bar{Y}\)
SSR = Regression Sum of Squares
explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
variation attributable to factors other than the relationship between X and Y
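Measures of Variation: The Sum of Squares
[Figure: for an observation at \(X_i\), the vertical distances between \(Y_i\), \(\hat{Y}_i\) and \(\bar{Y}\) that make up the three sums of squares]
\[ SST = \sum (Y_i - \bar{Y})^2, \qquad SSE = \sum (Y_i - \hat{Y}_i)^2, \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2 \]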
Measures of Variation
The Sum of Squares: Example
Excel Output for Produce Stores
             df   SS
Regression   1    30380456.12   (SSR)
Residual     5    1871199.595   (SSE)
Total        6    32251655.71   (SST)
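The three sums of squares in the Excel output can be recomputed directly from their definitions. The following sketch fits the line to the produce store data and then checks that SST = SSR + SSE; variable names are illustrative.

```python
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

# Least squares fit
b1 = (sum((x - x_bar) * (y - y_bar)
          for x, y in zip(square_feet, annual_sales))
      / sum((x - x_bar) ** 2 for x in square_feet))
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in square_feet]

sst = sum((y - y_bar) ** 2 for y in annual_sales)                       # total
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)                  # explained
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(annual_sales, predicted))  # unexplained

print(f"SSR = {ssr:.2f}, SSE = {sse:.3f}, SST = {sst:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```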
Testing the validity of the model
We pose the question:
Is there at least one independent variable linearly related to the dependent variable?
To answer the question we test the hypothesis
\(H_0: \beta_1 = 0\)
\(H_1\): At least one \(\beta_i\) is not equal to 0
If at least one \(\beta_i\) is not equal to zero, the model is valid.
To test these hypotheses we perform an analysis of variance procedure.
ANOVA - Summary Table
Source of Variation   Degrees of Freedom   Sum of Squares    Mean Square (Variance)   F Test Statistic
Explained (Factor)    k - 1                SSR               MSR = SSR/(k - 1)        F = MSR/MSE
Within (Error)        n - k                SSE               MSE = SSE/(n - k)
Total                 n - 1                SST = SSR + SSE
Here k is the number of estimated coefficients, so k = 2 in simple linear regression and the degrees of freedom are 1 and n - 2.

The F test
Construct the F statistic
\[ F = \frac{MSR}{MSE}, \qquad MSR = \frac{SSR}{k - 1}, \qquad MSE = \frac{SSE}{n - k} \]
Rejection region
\[ F > F_{\alpha,\, k-1,\, n-k} \]
[Variation in y] = SSR + SSE.
A large F results from a large SSR. Then much of the variation in y is explained by the regression model, the null hypothesis should be rejected, and the model is valid.
Required conditions must be satisfied.
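For the produce stores, the F statistic can be sketched from the sums of squares in the Excel output. Two assumptions are made here: k = 2 estimated coefficients (matching the printed degrees of freedom 1 and 5), and a critical value obtained by squaring the tabled t value 2.5706, which is valid because with 1 numerator degree of freedom the F critical value equals the squared two-tailed t critical value.

```python
# Sums of squares from the Excel output for the produce stores
ssr = 30380456.12
sse = 1871199.595

n = 7   # observations
k = 2   # estimated coefficients (intercept and slope); an assumption matching df = 1 and 5

msr = ssr / (k - 1)
mse = sse / (n - k)
f_stat = msr / mse

# With 1 numerator df, F critical = (two-tailed t critical)^2
t_crit = 2.5706          # alpha = .05, df = 5, from the slides
f_crit = t_crit ** 2

reject_h0 = f_stat > f_crit
print(f"F = {f_stat:.3f}, critical value = {f_crit:.3f}, reject H0: {reject_h0}")

# In simple linear regression F equals the square of the slope's t statistic
t_stat = 9.009944        # from the Excel printout
print(f"t^2 = {t_stat ** 2:.3f}")
```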
The Coefficient of Determination
\[ r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}} \]
Measures the proportion of variation that is explained by the independent variable X in the regression model
Coefficients of Determination (\(r^2\)) and Correlation (r)
[Figure: four scatter plots of Y against X, each with its fitted line \(\hat{Y}_i = b_0 + b_1 X_i\). Panels: \(r^2 = 1\), r = +1; \(r^2 = 1\), r = -1; \(r^2 = .8\), r = +0.9; \(r^2 = 0\), r = 0]
Measures of Variation: Example
Excel Output for Produce Stores
Regression Statistics
Multiple R          0.9705572
R Square            0.94198129
Adjusted R Square   0.93037754
Standard Error      611.751517
Observations        7
\(r^2 = .94\) (R Square) and \(S_{YX} = 611.75\) (Standard Error)
94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage.
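The R Square and Multiple R entries follow from the sums of squares reported for the produce stores. The short sketch below computes \(r^2 = SSR/SST\) and takes r as the square root of \(r^2\) carrying the sign of the slope; the input values are copied from the Excel output.

```python
import math

# Values from the Excel output for the produce stores
ssr = 30380456.12
sst = 32251655.71
b1 = 1.486633657   # slope; its sign gives the sign of r

r_squared = ssr / sst

# In simple linear regression, r has the same sign as the slope
r = math.copysign(math.sqrt(r_squared), b1)

print(f"r^2 = {r_squared:.4f}, r = {r:.4f}")
```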
Estimation of Predicted Values
Confidence Interval Estimate for \(\mu_{Y|X}\), the Mean of Y given a particular \(X_i\)
\[ \hat{Y}_i \pm t_{n-2}\, S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} \]
\(t_{n-2}\): t value from table with df = n - 2
\(S_{YX}\): standard error of the estimate
The size of the interval varies according to the distance of \(X_i\) from the mean, \(\bar{X}\).
Estimation of Predicted Values
Confidence Interval Estimate for an Individual Response \(Y_i\) at a Particular \(X_i\)
\[ \hat{Y}_i \pm t_{n-2}\, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} \]
The addition of this 1 under the square root increases the width of the interval relative to that for the mean of Y.
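Interval Estimates for Different Values of X
[Figure: fitted line of Y against X with two bands around it, the confidence interval for the mean of Y and the wider confidence interval for an individual \(Y_i\) at a given X, with \(\bar{X}\) marked on the X axis]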
Example: Produce Stores
Data for 7 Stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Regression Model Obtained:
\[ \hat{Y}_i = 1636.415 + 1.487 X_i \]
Predict the annual sales for a store with 2000 square feet.
Estimation of Predicted Values: Example
Confidence Interval Estimate for \(\mu_{Y|X}\)
Find the 95% confidence interval for the average annual sales for stores of 2,000 square feet.
Predicted Sales: \(\hat{Y}_i = 1636.415 + 1.487 X_i = 1636.415 + 1.487(2000) = 4610.42\) ($000)
\(\bar{X} = 2350.29\), \(S_{YX} = 611.75\), \(t_{n-2} = t_5 = 2.5706\)
\[ \hat{Y}_i \pm t_{n-2}\, S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.42 \pm 612.66 \]
Confidence interval for mean Y
Estimation of Predicted Values: Example
Confidence Interval Estimate for Individual Y
Find the 95% confidence interval for annual sales of one particular store of 2,000 square feet.
Predicted Sales: \(\hat{Y}_i = 1636.415 + 1.487 X_i = 1636.415 + 1.487(2000) = 4610.42\) ($000)
\(\bar{X} = 2350.29\), \(S_{YX} = 611.75\), \(t_{n-2} = t_5 = 2.5706\)
\[ \hat{Y}_i \pm t_{n-2}\, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.42 \pm 1687.70 \]
Confidence interval for individual Y
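Both interval formulas can be evaluated side by side for a store of 2,000 square feet. The sketch below is an illustration under two assumptions: the critical value 2.5706 is taken from the slides rather than computed, and the function name `interval_half_width` is hypothetical. It works with unrounded coefficients, so the centre of the interval may differ slightly from a value computed with the rounded equation.

```python
import math

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
sxx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(square_feet, annual_sales))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2
          for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / (n - 2))

t_crit = 2.5706  # t value for 95% confidence, df = n - 2 = 5 (from the slides)


def interval_half_width(x_new, individual):
    """Half-width of the interval at x_new (hypothetical helper).

    individual=False: interval for the mean of Y given x_new
    individual=True:  interval for one individual response at x_new
    """
    extra = 1.0 if individual else 0.0
    distance_term = 1.0 / n + (x_new - x_bar) ** 2 / sxx
    return t_crit * s_yx * math.sqrt(extra + distance_term)


x_new = 2000
y_hat = b0 + b1 * x_new
half_mean = interval_half_width(x_new, individual=False)
half_individual = interval_half_width(x_new, individual=True)

print(f"Predicted sales: {y_hat:.2f} ($000)")
print(f"Mean of Y:    +/- {half_mean:.2f}")
print(f"Individual Y: +/- {half_individual:.2f}")
```

The individual interval is always the wider of the two, because of the extra 1 under the square root.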
