
Lecture 9: Linear Regression
Goals
Linear regression in R
Estimating parameters and hypothesis testing
with linear models
Develop basic concepts of linear regression from
a probabilistic framework
Regression
Technique used for the modeling and analysis of
numerical data
Exploits the relationship between two or more
variables so that we can gain information about one of
them through knowing values of the other
Regression can be used for prediction, estimation,
hypothesis testing, and modeling causal relationships
Regression Lingo
Y = X₁ + X₂ + X₃
Y: Dependent Variable / Outcome Variable / Response Variable
X: Independent Variable / Predictor Variable / Explanatory Variable
Why Linear Regression?
Suppose we want to model the dependent variable Y in terms of three predictors, X₁, X₂, X₃:
Y = f(X₁, X₂, X₃)
Typically we will not have enough data to try and directly estimate f
Therefore, we usually have to assume that it has some restricted form, such as linear:
Y = X₁ + X₂ + X₃
Linear Regression is a Probabilistic Model
Much of mathematics is devoted to studying variables that are deterministically related to one another:
y = β₀ + β₁x
where β₀ is the intercept and β₁ = Δy/Δx is the slope
But we're interested in understanding the relationship between variables related in a nondeterministic fashion
A Linear Probabilistic Model
Definition: There exist parameters β₀, β₁, and σ² such that for any fixed value of the independent variable x, the dependent variable is related to x through the model equation:
y = β₀ + β₁x + ε
where ε is a random variable assumed to be N(0, σ²)
The line y = β₀ + β₁x is the true regression line
[Figure: observed points deviate from the true regression line by random errors ε₁, ε₂, ε₃]
Implications
The expected value of Y is a linear function of X, but for fixed x, the variable Y differs from its expected value by a random amount
Formally, let x* denote a particular value of the independent variable x; then our linear probabilistic model says:
E(Y | x*) = μ_{Y|x*} = mean value of Y when x is x*
V(Y | x*) = σ²_{Y|x*} = variance of Y when x is x*
Graphical Interpretation
[Figure: the true regression line y = β₀ + β₁x, with mean responses μ_{Y|x₁} = β₀ + β₁x₁ and μ_{Y|x₂} = β₀ + β₁x₂ marked at x₁ and x₂]
For example, if x = height and y = weight, then μ_{Y|x=60} is the average weight for all individuals 60 inches tall in the population
One More Example
Suppose the relationship between the independent variable height (x) and dependent variable weight (y) is described by a simple linear regression model with true regression line y = 7.5 + 0.5x and σ = 3
Q1: What is the interpretation of β₁ = 0.5?
The expected change in weight associated with a 1-unit increase in height
Q2: If x = 20, what is the expected value of Y?
μ_{Y|x=20} = 7.5 + 0.5(20) = 17.5
Q3: If x = 20, what is P(Y > 22)?
P(Y > 22 | x = 20) = P(Z > (22 − 17.5)/3) = 1 − Φ(1.5) = 0.067
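The arithmetic in Q2 and Q3 is easy to check numerically. A minimal sketch in Python (in R, pnorm() plays the role of Φ; the variable names here are illustrative):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, Phi(z), computed via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu = 7.5 + 0.5 * 20           # E(Y | x = 20) = 17.5
p = 1 - phi((22 - mu) / 3.0)  # P(Y > 22 | x = 20), about 0.067
```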
Estimating Model Parameters
Point estimates of β₀ and β₁ are obtained by the principle of least squares, which minimizes:
f(β₀, β₁) = Σᵢ₌₁ⁿ [yᵢ − (β₀ + β₁xᵢ)]²
The resulting estimates are:
β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
β̂₀ = ȳ − β̂₁x̄
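These closed-form estimates can be computed directly. A sketch in Python with made-up data (this is what R's lm(y ~ x) computes under the hood):

```python
import numpy as np

# Illustrative (made-up) data: x = heights in inches, y = weights in lbs
x = np.array([60.0, 62, 64, 66, 68, 70])
y = np.array([115.0, 120, 128, 135, 142, 150])

# Least-squares estimates: slope first, then intercept via b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
```

These agree with np.polyfit(x, y, 1), which solves the same least-squares problem.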
Predicted and Residual Values
Predicted, or fitted, values are values of y predicted by the least-squares regression line obtained by plugging x₁, x₂, …, xₙ into the estimated regression line:
ŷ₁ = β̂₀ + β̂₁x₁
ŷ₂ = β̂₀ + β̂₁x₂
Residuals are the deviations between observed and predicted values:
e₁ = y₁ − ŷ₁
e₂ = y₂ − ŷ₂
[Figure: residuals e₁, e₂, e₃ shown as vertical deviations of the observed yᵢ from the fitted line]
Residuals Are Useful!
They allow us to calculate the error sum of squares (SSE):
SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Which in turn allows us to estimate σ²:
σ̂² = SSE / (n − 2)
As well as an important statistic referred to as the coefficient of determination:
r² = 1 − SSE/SST, where SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
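A short numeric sketch of these quantities, using fabricated data for illustration (in R, summary(lm(...)) reports σ̂ and r² directly):

```python
import numpy as np

# Fabricated, nearly-linear data for illustration
x = np.array([1.0, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
yhat = b0 + b1 * x                    # fitted values
e = y - yhat                          # residuals
sse = np.sum(e ** 2)                  # error sum of squares
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
sigma2_hat = sse / (len(x) - 2)       # estimate of sigma^2
r2 = 1 - sse / sst                    # coefficient of determination
```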
Multiple Linear Regression
Extension of the simple linear regression model to two or
more independent variables
y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
Partial Regression Coefficients: βᵢ = the effect on the dependent variable of increasing the iᵗʰ independent variable by 1 unit, holding all other predictors constant
Expression = Baseline + Age + Tissue + Sex + Error
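The multiple-regression fit generalizes least squares to a design matrix. A minimal sketch with simulated data (Python; the coefficient values and variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated truth: beta0 = 1.0, beta1 = 2.0, beta2 = -1.5, small noise
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])          # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
```

Each estimated slope is the partial effect of its predictor with the others held fixed, which is why the recovered values sit near the simulated truth.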
Categorical Independent Variables
Qualitative variables are easily incorporated into the regression framework through dummy variables
Simple example: sex can be coded as 0/1
What if my categorical variable contains three levels:
xᵢ = 0 if AA, 1 if AG, 2 if GG
Categorical Independent Variables
Previous coding would result in collinearity
The solution is to set up a series of dummy variables; in general, for k levels you need k − 1 dummy variables:
x₁ = 1 if AA, 0 otherwise
x₂ = 1 if AG, 0 otherwise

Genotype    x₁    x₂
AA          1     0
AG          0     1
GG          0     0
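The k − 1 dummy coding in the table can be sketched directly (Python; in R, factor() sets up this coding automatically):

```python
# k-1 dummy coding for a three-level genotype, with GG as the baseline level
genotypes = ["AA", "AG", "GG", "AG", "AA"]

dummies = [(1 if g == "AA" else 0,   # x1: indicator for AA
            1 if g == "AG" else 0)   # x2: indicator for AG
           for g in genotypes]
```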
Hypothesis Testing: Model Utility Test (or
Omnibus Test)
The first thing we want to know after fitting a model is whether any of the independent variables (X's) are significantly related to the dependent variable (Y):
H₀: β₁ = β₂ = … = βₖ = 0
H_A: At least one βᵢ ≠ 0
f = (R²/k) / [(1 − R²)/(n − (k + 1))]
Rejection Region: f ≥ F_{α, k, n−(k+1)}
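The model-utility statistic is a one-liner once R² is known. A sketch in Python (the R², n, and k values are hypothetical):

```python
def omnibus_f(r2, n, k):
    """Model-utility F statistic: (R^2 / k) / ((1 - R^2) / (n - (k + 1)))."""
    return (r2 / k) / ((1 - r2) / (n - (k + 1)))

# Hypothetical example: R^2 = 0.25 from n = 100 observations, k = 3 predictors
f = omnibus_f(0.25, 100, 3)
```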
Equivalent ANOVA Formulation of Omnibus Test
We can also frame this in our now familiar ANOVA framework: partition total variation into two components, SSE (unexplained variation) and SSR (variation explained by the linear model)

Source of Variation    df       Sum of Squares          MS             F
Regression             k        SSR = Σ(ŷᵢ − ȳ)²        SSR/k          MS_R / MS_E
Error                  n − 2    SSE = Σ(yᵢ − ŷᵢ)²       SSE/(n − 2)
Total                  n − 1    SST = Σ(yᵢ − ȳ)²

Rejection Region: f ≥ F_{α, k, n−(k+1)}
F Test For Subsets of Independent Variables
A powerful tool in multiple regression analyses is the ability to compare two models
For instance, say we want to compare:
Full Model: y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + ε
Reduced Model: y = β₀ + β₁x₁ + β₂x₂ + ε
Again, another example of ANOVA:
f = [(SSE_R − SSE_F)/(k − l)] / [SSE_F/(n − (k + 1))]
SSE_R = error sum of squares for the reduced model with l predictors
SSE_F = error sum of squares for the full model with k predictors
Example of Model Comparison
We have a quantitative trait measured in n = 100 individuals and want to test the effects at two markers, M1 and M2.
Full Model: Trait = Mean + M1 + M2 + (M1*M2) + error
Reduced Model: Trait = Mean + M1 + M2 + error
f = [(SSE_R − SSE_F)/(3 − 2)] / [SSE_F/(100 − (3 + 1))] = (SSE_R − SSE_F) / (SSE_F/96)
Rejection Region: F_{α, 1, 96}
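The comparison above can be sketched as a small function (Python; the SSE values below are made-up placeholders, since the slide leaves them symbolic):

```python
def partial_f(sse_r, sse_f, k, l, n):
    """F statistic comparing a reduced (l-predictor) to a full (k-predictor) model."""
    return ((sse_r - sse_f) / (k - l)) / (sse_f / (n - (k + 1)))

# Marker example: k = 3, l = 2, n = 100; SSE values are hypothetical
f = partial_f(sse_r=250.0, sse_f=225.0, k=3, l=2, n=100)
```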
Hypothesis Tests of Individual Regression
Coefficients
Hypothesis tests for each β̂ᵢ can be done by simple t-tests:
H₀: βᵢ = 0 vs. H_A: βᵢ ≠ 0
T = (β̂ᵢ − βᵢ) / se(β̂ᵢ)
Critical value: t_{α/2, n−(k+1)}
Confidence intervals are equally easy to obtain:
β̂ᵢ ± t_{α/2, n−(k+1)} · se(β̂ᵢ)
Checking Assumptions
Critically important to examine data and check assumptions
underlying the regression model
- Outliers
- Normality
- Constant variance
- Independence among residuals
Standard diagnostic plots include:
- scatter plots of y versus xᵢ (outliers)
- qq plot of residuals (normality)
- residuals versus fitted values (independence, constant variance)
- residuals versus xᵢ (outliers, constant variance)
We'll explore diagnostic plots in more detail in R
Fixed -vs- Random Effects Models
In ANOVA and Regression analyses our independent variables can
be treated as Fixed or Random
Fixed Effects: variables whose levels are either sampled
exhaustively or are the only ones considered relevant to the
experimenter
Random Effects: variables whose levels are randomly sampled
from a large population of levels
Example from our recent AJHG paper:
Expression = Baseline + Population + Individual + Error