
BAYESIAN LEARNING - LECTURE 5

Mattias Villani

Division of Statistics and Machine Learning


Department of Computer and Information Science
Linköping University



LECTURE OVERVIEW

▶ Normal model with conjugate prior
▶ The linear regression model
▶ Non-linear regression
▶ Regularization priors



NORMAL MODEL - NORMAL PRIOR

▶ Model

      y1, ..., yn | θ, σ² ∼ N(θ, σ²), iid

▶ Conjugate prior

      θ | σ² ∼ N(μ0, σ²/κ0)
      σ² ∼ Inv-χ²(ν0, σ0²)



NORMAL MODEL WITH NORMAL PRIOR

▶ Posterior

      θ | y, σ² ∼ N(μn, σ²/κn)
      σ² | y ∼ Inv-χ²(νn, σn²)

  where

      μn = κ0/(κ0 + n) · μ0 + n/(κ0 + n) · ȳ
      κn = κ0 + n
      νn = ν0 + n
      νn σn² = ν0 σ0² + (n − 1)s² + κ0 n/(κ0 + n) · (ȳ − μ0)².

▶ Marginal posterior

      θ | y ∼ tνn(μn, σn²/κn)
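A minimal simulation sketch (not part of the original slides; assumes numpy) showing how draws from this joint posterior can be obtained by first sampling σ² and then θ given σ²:

import numpy as np

def sim_normal_posterior(y, mu0, kappa0, nu0, sigma02, n_draws=1000, rng=None):
    # Draw (theta, sigma2) from the conjugate posterior above.
    rng = np.random.default_rng() if rng is None else rng
    n, ybar, s2 = len(y), np.mean(y), np.var(y, ddof=1)

    # Posterior hyperparameters (same formulas as on the slide)
    mu_n = (kappa0 * mu0 + n * ybar) / (kappa0 + n)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    sigma2_n = (nu0 * sigma02 + (n - 1) * s2
                + kappa0 * n / (kappa0 + n) * (ybar - mu0) ** 2) / nu_n

    # sigma2 | y ~ Inv-chi2(nu_n, sigma2_n): scaled draw from a chi-square variate
    sigma2 = nu_n * sigma2_n / rng.chisquare(nu_n, size=n_draws)
    # theta | sigma2, y ~ N(mu_n, sigma2 / kappa_n)
    theta = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
    return theta, sigma2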




THE LINEAR REGRESSION MODEL

▶ The ordinary linear regression model:

      yi = β1 xi1 + β2 xi2 + ... + βk xik + εi
      εi ∼ N(0, σ²), iid.

▶ Parameters θ = (β1, β2, ..., βk, σ²).
▶ Assumptions:
   ▶ E(yi) = β1 xi1 + β2 xi2 + ... + βk xik (linear function)
   ▶ Var(yi) = σ² (homoscedasticity)
   ▶ Corr(yi, yj | X, β, σ²) = 0 for i ≠ j
   ▶ Normality of εi
   ▶ The x's are assumed known (non-random).



LINEAR REGRESSION IN MATRIX FORM

▶ The linear regression model in matrix form

      y = Xβ + ε,   with y (n×1), X (n×k), β (k×1), ε (n×1),

      y = (y1, ..., yn)′,  β = (β1, ..., βk)′,  ε = (ε1, ..., εn)′,

      X is the n×k matrix with rows xi′ = (xi1, ..., xik), i = 1, ..., n.

▶ Usually xi1 = 1 for all i, so that β1 is the intercept.
▶ Likelihood for the full sample

      y | β, σ², X ∼ N(Xβ, σ² In)
LINEAR REGRESSION - UNIFORM PRIOR

▶ Standard non-informative prior: uniform on (β, log σ²)

      p(β, σ²) ∝ σ⁻²

▶ Joint posterior of β and σ²:

      β | σ², y ∼ N(β̂, σ²(X′X)⁻¹)
      σ² | y ∼ Inv-χ²(n − k, s²)

  where β̂ = (X′X)⁻¹X′y and s² = (1/(n − k)) · (y − Xβ̂)′(y − Xβ̂).

▶ Simulate from the joint posterior by iteratively simulating from
   ▶ p(σ² | y)
   ▶ p(β | σ², y)
  (see the sketch below).

▶ Marginal posterior of β:

      β | y ∼ tn−k(β̂, s²(X′X)⁻¹)
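A hedged sketch (not part of the slides; assumes numpy) of the simulation scheme just described: draw σ² from its marginal posterior, then β given σ²:

import numpy as np

def sim_posterior_uniform_prior(y, X, n_draws=1000, rng=None):
    # Joint posterior simulation under the prior p(beta, sigma2) proportional to sigma^-2.
    rng = np.random.default_rng() if rng is None else rng
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)

    betas = np.empty((n_draws, k))
    sigma2s = np.empty(n_draws)
    for i in range(n_draws):
        # sigma2 | y ~ Inv-chi2(n - k, s2)
        sigma2s[i] = (n - k) * s2 / rng.chisquare(n - k)
        # beta | sigma2, y ~ N(beta_hat, sigma2 * (X'X)^-1)
        betas[i] = rng.multivariate_normal(beta_hat, sigma2s[i] * XtX_inv)
    return betas, sigma2s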


 



LINEAR REGRESSION - CONJUGATE PRIOR

▶ Joint prior for β and σ²

      β | σ² ∼ N(μ0, σ²Ω0⁻¹)
      σ² ∼ Inv-χ²(ν0, σ0²)

▶ Posterior

      β | σ², y ∼ N(μn, σ²Ωn⁻¹)
      σ² | y ∼ Inv-χ²(νn, σn²)

      μn = (X′X + Ω0)⁻¹(X′X β̂ + Ω0 μ0)
      Ωn = X′X + Ω0
      νn = ν0 + n
      νn σn² = ν0 σ0² + y′y + μ0′Ω0 μ0 − μn′Ωn μn
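As a sketch (not from the slides; assumes numpy, and the function name is illustrative), the posterior hyperparameters above can be computed directly:

import numpy as np

def conjugate_posterior_params(y, X, mu0, Omega0, nu0, sigma02):
    # Posterior hyperparameters for the conjugate normal / Inv-chi2 prior above.
    n, k = X.shape
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)

    Omega_n = XtX + Omega0
    mu_n = np.linalg.solve(Omega_n, XtX @ beta_hat + Omega0 @ mu0)
    nu_n = nu0 + n
    sigma2_n = (nu0 * sigma02 + y @ y
                + mu0 @ Omega0 @ mu0 - mu_n @ Omega_n @ mu_n) / nu_n
    return mu_n, Omega_n, nu_n, sigma2_n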




POLYNOMIAL REGRESSION

▶ Polynomial regression

      f(xi) = β0 + β1 xi + β2 xi² + ... + βk xi^k
      y = XP β + ε,

  where

      XP = (1, x, x², ..., x^k).
[Figure: quadratic regression fit of y on x, together with the constant, linear and quadratic basis functions over x ∈ [0, 1].]
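A small sketch (not from the slides; assumes numpy) of how the polynomial basis matrix XP can be constructed for a one-dimensional covariate:

import numpy as np

def poly_basis(x, k):
    # n x (k+1) matrix with columns 1, x, x^2, ..., x^k.
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(k + 1)])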



SPLINE REGRESSION

▶ Polynomials are too global. Need more local basis functions.
▶ Truncated power splines, given knot locations k1, ..., km:

      bij = (xi − kj)^p  if xi > kj,  and  bij = 0 otherwise.
[Figure: piece-wise linear regression fit of y on x, together with the constant, linear and (x − 0.4)+ basis functions over x ∈ [0, 1].]



SPLINES, CONT.

▶ Note: given the knots, the non-parametric spline regression model is a
  linear regression of y on the m 'dummy variables' bj :

      y = Xb β + ε,

  where Xb is the basis regression matrix

      Xb = (b1, ..., bm).

▶ It is also common to include an intercept and the linear part of the
  model separately. In this case we have (see the sketch below)

      Xb = (1, x, b1, ..., bm).
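A minimal sketch (not part of the slides; assumes numpy) of how this truncated power spline basis matrix, including intercept and linear part, can be built:

import numpy as np

def spline_basis(x, knots, p=1):
    # X_b = (1, x, b_1, ..., b_m) with b_j = (x - k_j)^p for x > k_j, else 0.
    x = np.asarray(x, dtype=float)
    b = [np.where(x > kj, (x - kj) ** p, 0.0) for kj in knots]
    return np.column_stack([np.ones_like(x), x] + b)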



SMOOTHNESS PRIOR FOR SPLINES

▶ Problem: too many knots lead to over-fitting.
▶ Solution: smoothness/shrinkage/regularization prior

      βi | σ² ∼ N(0, σ²/λ),  iid

▶ Larger λ gives a smoother fit. Note: here we have Ω0 = λI.
▶ Equivalent to a penalized likelihood:

      −2 · log p(β | σ², y, X) ∝ RSS(β) + λβ′β

▶ Posterior mean gives the ridge regression estimator

      β̃ = (X′X + λI)⁻¹ X′y

▶ Shrinkage toward zero: as λ → ∞, β̃ → 0.
▶ When X′X = I,

      β̃ = (1/(1 + λ)) β̂OLS
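A brief sketch (not from the slides; assumes numpy) computing the posterior mean / ridge estimator above:

import numpy as np

def ridge_posterior_mean(y, X, lam):
    # beta_tilde = (X'X + lambda * I)^-1 X'y, the posterior mean under the smoothness prior.
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)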
BAYESIAN SPLINE WITH SMOOTHNESS PRIOR

[Figure: data and estimated E(y|x), LogRatio against Range, for λ = 0, λ = 0.5, λ = 1 and λ = 10.]
SMOOTHNESS PRIOR FOR SPLINES, CONT.

▶ The famous Lasso variable selection method is equivalent to using the
  posterior mode estimate under the prior:

      βi | σ² ∼ Laplace(0, σ²/λ),  iid

  with density

      p(βi) = λ/(2σ²) · exp(−λ|βi|/σ²)

▶ The Bayesian shrinkage prior is interpretable. Not ad hoc.
▶ The Laplace distribution has heavy tails.
   ▶ Laplace: many βi are close to zero, but some βi may be very large.
▶ The normal distribution has light tails.
   ▶ Normal prior: most βi are fairly equal in size, and no single βi can be
     very much larger than the other ones.
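A small sketch (not from the slides; assumes numpy) evaluating this Laplace prior density, e.g. for comparing its tails with the normal smoothness prior:

import numpy as np

def laplace_prior_density(beta, lam, sigma2):
    # p(beta_i) = lambda / (2 * sigma2) * exp(-lambda * |beta_i| / sigma2)
    return lam / (2.0 * sigma2) * np.exp(-lam * np.abs(beta) / sigma2)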
ESTIMATING THE SHRINKAGE

▶ How do we determine the degree of smoothness, λ? Cross-validation
  is one possible approach.
▶ Bayesian: λ is unknown ⇒ use a prior for λ.
▶ One possibility: λ ∼ Inv-χ²(η0, λ0). The user specifies η0 and λ0.
▶ Alternative approach: specify the prior on the degrees of freedom.
▶ Hierarchical setup:

      y | β, σ², X ∼ N(Xβ, σ² In)
      β | σ², λ ∼ N(0, σ²λ⁻¹ Im)
      σ² ∼ Inv-χ²(ν0, σ0²)
      λ ∼ Inv-χ²(η0, λ0)

  so Ω0 = λIm in the previous notation.



REGRESSION WITH ESTIMATED SHRINKAGE

▶ The joint posterior of β, σ² and λ is

      β | σ², λ, y ∼ N(μn, σ²Ωn⁻¹)
      σ² | λ, y ∼ Inv-χ²(νn, σn²)
      p(λ | y) ∝ √(|Ω0| / |X′X + Ω0|) · (νn σn²/2)^(−νn/2) · p(λ)

  where Ω0 = λIm, p(λ) is the prior for λ, and

      μn = (X′X + Ω0)⁻¹ X′y
      Ωn = X′X + Ω0
      νn = ν0 + n
      νn σn² = ν0 σ0² + y′y − μn′Ωn μn
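A hedged sketch (not from the slides; assumes numpy, and log_prior is a user-supplied function) of evaluating the unnormalized log marginal posterior of λ, e.g. over a grid of λ values:

import numpy as np

def log_post_lambda(lam, y, X, nu0, sigma02, log_prior):
    # Unnormalized log p(lambda | y) using the formulas above with Omega0 = lambda * I_m.
    n, m = X.shape
    Omega0 = lam * np.eye(m)
    Omega_n = X.T @ X + Omega0
    mu_n = np.linalg.solve(Omega_n, X.T @ y)
    nu_n = nu0 + n
    nu_sigma2_n = nu0 * sigma02 + y @ y - mu_n @ Omega_n @ mu_n

    _, logdet0 = np.linalg.slogdet(Omega0)
    _, logdetn = np.linalg.slogdet(Omega_n)
    return 0.5 * (logdet0 - logdetn) - 0.5 * nu_n * np.log(nu_sigma2_n / 2.0) + log_prior(lam)

# Evaluate over a grid of lambda values and exponentiate/normalize to approximate p(lambda | y).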



MORE COMPLEXITY

▶ The location of the knots can be treated as unknown, and estimated
  from the data. Joint posterior

      p(β, σ², λ, k1, ..., km | y, X)

▶ The marginal posterior for λ, k1, ..., km is a nightmare.
▶ MCMC can be used to simulate from the joint posterior. Li and Villani
  (2013, SJS).
▶ The basic spline model can be extended with:
   ▶ Heteroscedastic errors (also modelled with a spline)
   ▶ Non-normal errors (student-t or mixture distributions)
   ▶ Autocorrelated/dependent errors (AR process for the error term)
▶ MCMC can again be used to simulate from the joint posterior.
