[Figure: residual plots (e1 to e4) against time]
Least-Squares Estimates

$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \hat{y}_i = b_0 + b_1 x_i$$

$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{n\sum_{i=1}^{n}x_i y_i - \left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n\sum_{i=1}^{n}x_i^2 - \left(\sum_{i=1}^{n}x_i\right)^2}$$

$$b_0 = \bar{y} - b_1\bar{x}$$
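As a sketch of these formulas in pure Python, here is the computation of b1 and b0 on the 6-employee experience/salary data used in the worked example later in these notes:

```python
# Least-squares estimates: b1 = Sxy/Sxx, b0 = ybar - b1*xbar.
# Data: the 6-employee experience/salary example from these notes.
x = [15, 10, 20, 5, 15, 5]    # experience (years)
y = [30, 35, 55, 22, 40, 27]  # salary ($ thousand)

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx            # slope
b0 = ybar - b1 * xbar     # intercept

print(round(b0, 2), round(b1, 3))  # 15.32 1.673
```

This reproduces the fitted line Y = 15.32 + 1.673X shown in the scatter plot below.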
Straight-Line Relationship
Y = b0 + b1X
b0 is the Y-intercept: the value of Y when X = 0.
b1 is the slope of the line: the amount of change in Y for a unit change in X.
Assumptions of Simple LR
[Figure: conditional distributions (PDF of Y) at x = 1, 2, 3, 4]
Note

$$\mu_{Y\mid x} = E(Y \mid x) = \beta_0 + \beta_1 x \qquad \text{(population regression line)}$$

$$\hat{y} = b_0 + b_1 x \qquad \text{(fitted line)}$$

$$y = b_0 + b_1 x + e \qquad \text{(observation = fitted value + residual)}$$
Assumptions of Simple LR
$$E(\varepsilon \mid X) = 0 \qquad \text{zero mean}$$

$$\mathrm{Var}(\varepsilon \mid X) = \sigma^2 \qquad \text{constant (homogeneous/homoscedastic) variance}$$

ε is normally distributed.
The value of ε associated with any particular value of Y is independent of the ε associated with any other value of Y, as if the errors came from a random sample:

$$Y_1 = \beta_0 + \beta_1 x_1 + \varepsilon_1, \qquad Y_2 = \beta_0 + \beta_1 x_2 + \varepsilon_2, \ \ldots$$
Another Note

$$\varepsilon \mid X \sim N(0, \sigma) \quad\Longrightarrow\quad Y \mid X \sim N(\beta_0 + \beta_1 x,\ \sigma)$$

An unbiased estimate of σ² is:

$$s^2 = \frac{SSE}{n-2} = \frac{S_{yy} - b_1 S_{xy}}{n-2}, \qquad S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
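A quick check, on the same salary/experience data as the worked example in these notes, that the residual-sum and shortcut forms of SSE agree:

```python
import math

# Unbiased estimate of sigma^2: s^2 = SSE/(n-2), with SSE computed
# both directly from residuals and via the shortcut Syy - b1*Sxy.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

sse_direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sse_short = Syy - b1 * Sxy          # identical, up to rounding
s2 = sse_short / (n - 2)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))    # 42.47 6.52
```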
On β1 using b1, a confidence interval:

$$b_1 \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$

On β0 using b0, a confidence interval:

$$b_0 \pm t_{\alpha/2}\, s\sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n\,S_{xx}}}$$

and a test statistic for Ho: β0 = β00:

$$t = \frac{b_0 - \beta_{00}}{s\sqrt{\sum_{i=1}^{n} x_i^2 \,/\, (n\,S_{xx})}}$$
Ho: β1 = 0 (X provides no information)
Ha: β1 ≠ 0 (X does provide information)

Ho: β1 ≤ 0        Ho: β1 ≥ 0
Ha: β1 > 0        Ha: β1 < 0

In every case the test statistic is

$$t = \frac{b_1}{s_{b_1}}, \qquad s_{b_1} = \frac{s}{\sqrt{S_{xx}}}, \qquad df = n - 2$$
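The slope test above, computed on the notes' salary/experience data:

```python
import math

# t statistic for Ho: beta1 = 0, with s_b1 = s/sqrt(Sxx) and df = n-2.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
s = math.sqrt((Syy - b1 * Sxy) / (n - 2))

s_b1 = s / math.sqrt(Sxx)
t = b1 / s_b1
print(round(t, 2))  # 3.48
```

Since 3.48 exceeds the table value t(0.025, df = 4) = 2.776, Ho: β1 = 0 is rejected at the 5% level: experience does provide information about salary.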
Prediction
Confidence interval on the mean response $\mu_{Y|x_0}$:

$$\hat{y}_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

Prediction interval for a single new response at $x_0$:

$$\hat{y}_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
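A sketch of both intervals on the notes' salary data. The evaluation point x0 = 10 is chosen here only for illustration, and the critical value t(0.025, df = 4) = 2.776 is taken from a t table:

```python
import math

# Mean-response CI vs. prediction interval at x0; the PI adds "1 +"
# inside the square root, so it is always wider.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
s = math.sqrt((Syy - b1 * Sxy) / (n - 2))

x0 = 10                 # illustrative value, not from the notes
y0_hat = b0 + b1 * x0
t_crit = 2.776          # t(0.025), df = n - 2 = 4, from a t table
half_ci = t_crit * s * math.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
half_pi = t_crit * s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
print(round(y0_hat, 1), round(half_ci, 1), round(half_pi, 1))
```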
For n = 6 employees:

Experience (years):  15  10  20   5  15   5
Salary ($ thousand): 30  35  55  22  40  27

Linear (straight-line) relationship
Increasing relationship: higher salary generally goes with higher experience
Correlation r = 0.8667
[Figure: scatter plot of Salary (Y) vs. Experience (X) with the fitted least-squares line Y = 15.32 + 1.673X]
Mary's predicted value is 48.8.
[Figure: the prediction marked on the Salary vs. Experience fitted line]
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
Error Variance
The measure most commonly used to assess how well a line fits through a set of points is the error variance, together with the error standard deviation:

$$s^2 = \frac{SSE}{n-2}, \qquad s = \sqrt{\frac{SSE}{n-2}}$$
What is s?
Measure of the variation of the Y values around the
least squares line
Average distance of prediction from actual
Average size of residuals
Standard deviation of residuals
Error Variance
Interpretation: similar to a standard deviation.
Move the least-squares line up and down by s: about 68% of the data lie within one standard error of the estimate of the least-squares line (for a bivariate normal distribution).
[Figure: Salary vs. Experience scatter with the least-squares line and parallel lines one s above and one s below it]
Coefficient of Determination
$$r^2 = 1 - \frac{SSE}{S_{yy}}$$

r², the coefficient of determination, is the percentage of explained variation in the dependent variable using the simple linear regression model.
Taking the square root of the coefficient of determination gives the correlation coefficient r.
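On the notes' salary data, this recovers the correlation quoted with the example:

```python
import math

# r^2 = 1 - SSE/Syy; its square root, carrying the sign of b1, is r.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
sse = Syy - b1 * Sxy

r2 = 1 - sse / Syy
r = math.copysign(math.sqrt(r2), b1)  # affix the sign of b1
print(round(r, 4))  # 0.8667
```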
Correlation Coefficient
The sample correlation coefficient, r,
measures the strength of the linear
relationship that exists within a sample of n
bivariate data.
$$r = b_1\sqrt{\frac{S_{xx}}{S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

Note: when computing r as the square root of r², affix the sign of b1 to the final value.
r_xy = +1: a perfect straight line tilting up to the right.
r_xy = 0: no overall tilt; no linear relationship.
r_xy = -1: a perfect straight line tilting down to the right.
[Figure: Y vs. X scatter plots illustrating each case]
Hypothesis Test
Another way to test whether a linear relationship exists between the two variables of interest is to use the relationship between b1 and r (they are closely related).

Ho: ρ = 0 (no linear relationship exists)
Ha: ρ ≠ 0 (a linear relationship exists)

$$t^* = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$$

For testing Ho: ρ = ρ0:

$$z = \frac{\sqrt{n-3}}{2}\,\ln\!\left[\frac{(1+r)(1-\rho_0)}{(1-r)(1+\rho_0)}\right]$$

This works under the assumption that both X and Y follow the bivariate normal distribution.
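The t* statistic based on r is algebraically identical to the t = b1/s_b1 statistic from the slope test; a numerical check on the notes' salary data:

```python
import math

# t* = r*sqrt(n-2)/sqrt(1-r^2) equals t = b1/s_b1 exactly.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
s = math.sqrt((Syy - b1 * Sxy) / (n - 2))

r = Sxy / math.sqrt(Sxx * Syy)
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # correlation form
t_b1 = b1 / (s / math.sqrt(Sxx))                     # slope form
print(round(t_r, 4), round(t_b1, 4))
```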
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$S_{yy} = b_1 S_{xy} + SSE$$

$$SST = SSR + SSE$$
Similar to testing:
Ho: β1 = 0
Ha: β1 ≠ 0

Sources of Variation | Sum of Squares | df    | Mean Square      | Computed F
Regression           | SSR            | 1     | SSR              | SSR/s²
Error                | SSE            | n - 2 | s² = SSE/(n - 2) |
Total                | SST            | n - 1 |                  |
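For simple linear regression the ANOVA F statistic is the square of the slope t statistic; a check on the notes' salary data:

```python
import math

# ANOVA decomposition SST = SSR + SSE, and F = MSR/s^2 = t^2.
x = [15, 10, 20, 5, 15, 5]
y = [30, 35, 55, 22, 40, 27]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx

ssr = b1 * Sxy          # regression SS, df = 1
sse = Syy - ssr         # error SS, df = n - 2
s2 = sse / (n - 2)
F = ssr / s2
t = b1 / math.sqrt(s2 / Sxx)
print(round(F, 2), round(t ** 2, 2))
```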
With n_i observations at each distinct value x_i (i = 1, …, k; n_i = no. of observations at x_i):

$$\bar{y}_{i.} = \frac{T_{i.}}{n_i} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}, \qquad s_i^2 = \frac{\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2}{n_i - 1}$$

$$s^2_{\text{pure}} = \frac{\sum_{i=1}^{k}(n_i - 1)\,s_i^2}{n - k} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2}{n - k}$$
Sources of Variation | Sum of Squares  | df    | Mean Square (MS)          | Computed F
Regression           | SSR             | 1     | SSR                       | SSR/s²
Error                | SSE             | n - 2 |                           |
  Lack of Fit        | SSE - SSE(pure) | k - 2 | [SSE - SSE(pure)]/(k - 2) | MS(lack of fit)/s²
  Pure Error         | SSE(pure)       | n - k | s² = SSE(pure)/(n - k)    |
Total                | SST             | n - 1 |                           |
Model Adequacy
Significant lack of fit means that considerable variation is being caused by higher-order terms: terms in x other than the linear, or first-order, terms.
Illustration given in Figures 11.11 and 11.12, pp. 378-379 of the book.
[Figure panels: A shows the assumption holding, B shows the assumption violated]
Importance of Assumptions
Normality: needed for the t-test and F-test.
Homoscedasticity: needed for the t-test and F-test; ensures that the variance used in explanation and prediction is distributed evenly across the range of values of the independent variable.
Absence of correlated errors: confidence that prediction errors are independent of the levels at which one is trying to predict, and assurance that no other systematic variable is affecting the results while being left out of the analysis.
On Violation of Assumptions
One violation can be the result of another.
Example: non-normality can be linked to, or result from, nonconstant variance.
A remedy applied to one can solve another.
Remedy available: data or variable transformation.
Notes on Transformation
Two purposes:
1. Correct violations of statistical assumptions.
2. Improve the correlation between variables.
How to choose?
1. Theoretical basis: nature of the data (e.g. sqrt
Function                   | Proper Transformation  | Form of Simple Linear Regression
Exponential: Y = α e^(βx)  | Y* = ln Y              | Regress Y* against x
Power: Y = α x^β           | Y* = log Y, X* = log X | Regress Y* against X*
Reciprocal: Y = α + β(1/X) | X* = 1/X               | Regress Y against X*
Hyperbolic: Y = X/(α + βX) | Y* = 1/Y, X* = 1/X     | Regress Y* against X*
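A minimal sketch of the exponential row of this table, on hypothetical data generated exactly as Y = 2e^(0.3x), so regressing Y* = ln Y on x should recover β = 0.3 and α = 2:

```python
import math

# Exponential transformation: Y = alpha*exp(beta*x) becomes linear
# after Y* = ln Y, so ordinary least squares on (x, Y*) recovers
# beta as the slope and ln(alpha) as the intercept.
x = [1, 2, 3, 4, 5, 6]                      # hypothetical x values
y = [2 * math.exp(0.3 * xi) for xi in x]    # exact exponential data
ystar = [math.log(yi) for yi in y]          # the transformation Y* = ln Y

n = len(x)
xbar = sum(x) / n
ybar = sum(ystar) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, ystar))
beta = Sxy / Sxx                      # slope of Y* on x -> 0.3
alpha = math.exp(ybar - beta * xbar)  # exp(intercept) -> 2.0
print(round(alpha, 6), round(beta, 6))
```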
Caveats in Simple LR
Linear Model May Be Wrong
Nonlinear? Unequal variability? Clustering?