
Statistical learning
Chapter 20, Sections 1–3

Example

Suppose there are five kinds of bags of candies:

10% are $h_1$: 100% cherry candies
20% are $h_2$: 75% cherry candies + 25% lime candies
40% are $h_3$: 50% cherry candies + 50% lime candies
20% are $h_4$: 25% cherry candies + 75% lime candies
10% are $h_5$: 100% lime candies

Then we observe candies drawn from some bag.
What kind of bag is it? What flavour will the next candy be?
Outline

♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning
♦ Bayes net learning
    – ML parameter learning with complete data
    – linear regression

Posterior probability of hypotheses

[Figure: posterior probability of each hypothesis, $P(h_1 \mid \mathbf{d})$ through $P(h_5 \mid \mathbf{d})$ (vertical axis, 0 to 1), against the number of samples in $\mathbf{d}$ (horizontal axis, 0 to 10)]
Full Bayesian learning

View learning as Bayesian updating of a probability distribution over the hypothesis space

$H$ is the hypothesis variable, values $h_1, h_2, \ldots$, prior $\mathbf{P}(H)$

$j$th observation $d_j$ gives the outcome of random variable $D_j$
training data $\mathbf{d} = d_1, \ldots, d_N$

Given the data so far, each hypothesis has a posterior probability:

$$P(h_i \mid \mathbf{d}) = \alpha P(\mathbf{d} \mid h_i) P(h_i)$$

where $P(\mathbf{d} \mid h_i)$ is called the likelihood

Predictions use a likelihood-weighted average over the hypotheses:

$$\mathbf{P}(X \mid \mathbf{d}) = \sum_i \mathbf{P}(X \mid \mathbf{d}, h_i) P(h_i \mid \mathbf{d}) = \sum_i \mathbf{P}(X \mid h_i) P(h_i \mid \mathbf{d})$$

No need to pick one best-guess hypothesis!

Prediction probability

[Figure: $P(\text{next candy is lime} \mid \mathbf{d})$ (vertical axis, 0.4 to 1) against the number of samples in $\mathbf{d}$ (horizontal axis, 0 to 10)]
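The posterior update and the prediction formula can be tried out on the candy example. The following is a minimal Python sketch, not part of the original slides: the priors and flavour proportions are taken from the example, while the function names `posterior` and `predict_lime` and the observation sequences passed to them are assumptions made for illustration.

```python
# Priors P(h_1)..P(h_5) and lime proportions P(lime | h_i) from the candy example.
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]


def posterior(observations):
    """Return [P(h_i | d)] for a sequence of 'lime' / 'cherry' observations."""
    unnormalized = []
    for prior, p_lime in zip(PRIORS, P_LIME):
        likelihood = 1.0  # P(d | h_i) for i.i.d. draws
        for candy in observations:
            likelihood *= p_lime if candy == "lime" else 1.0 - p_lime
        unnormalized.append(likelihood * prior)
    alpha = 1.0 / sum(unnormalized)  # normalizing constant
    return [alpha * u for u in unnormalized]


def predict_lime(observations):
    """Return P(next candy is lime | d), averaging over all hypotheses."""
    return sum(p_lime * post
               for p_lime, post in zip(P_LIME, posterior(observations)))


if __name__ == "__main__":
    # Hypothetical data: only lime candies are drawn.
    for n in range(0, 11, 2):
        print(n, round(predict_lime(["lime"] * n), 3))
```

With no data the prediction is 0.5; after a single lime candy the posterior becomes (0, 0.1, 0.4, 0.3, 0.2) and the predicted probability of lime rises to 0.65.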
MAP approximation

Summing over the hypothesis space is often intractable
(e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)

Maximum a posteriori (MAP) learning: choose $h_{\mathrm{MAP}}$ maximizing $P(h_i \mid \mathbf{d})$

I.e., maximize $P(\mathbf{d} \mid h_i) P(h_i)$ or $\log P(\mathbf{d} \mid h_i) + \log P(h_i)$

Log terms can be viewed as (negative of)
    bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning

For deterministic hypotheses, $P(\mathbf{d} \mid h_i)$ is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)

Multiple parameters

Red/green wrapper depends probabilistically on flavor:

[Figure: Bayes net Flavor → Wrapper]

    P(F=cherry) = θ

    F         P(W=red | F)
    cherry    θ1
    lime      θ2

Likelihood for, e.g., cherry candy in green wrapper:

$$P(F = \text{cherry}, W = \text{green} \mid h_{\theta,\theta_1,\theta_2}) = P(F = \text{cherry} \mid h_{\theta,\theta_1,\theta_2})\, P(W = \text{green} \mid F = \text{cherry}, h_{\theta,\theta_1,\theta_2}) = \theta \cdot (1 - \theta_1)$$

$N$ candies, $r_c$ red-wrapped cherry candies, etc.:

$$P(\mathbf{d} \mid h_{\theta,\theta_1,\theta_2}) = \theta^{c} (1 - \theta)^{\ell} \cdot \theta_1^{r_c} (1 - \theta_1)^{g_c} \cdot \theta_2^{r_\ell} (1 - \theta_2)^{g_\ell}$$

$$L = [c \log \theta + \ell \log(1 - \theta)] + [r_c \log \theta_1 + g_c \log(1 - \theta_1)] + [r_\ell \log \theta_2 + g_\ell \log(1 - \theta_2)]$$
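MAP learning can be illustrated on the five candy-bag hypotheses by maximizing $\log P(\mathbf{d} \mid h_i) + \log P(h_i)$. The sketch below is an illustration only, not part of the original slides: the priors and lime proportions come from the candy example, and the helper names `log_likelihood` and `map_hypothesis` as well as the observation counts are assumptions.

```python
import math

# Priors P(h_1)..P(h_5) and lime proportions P(lime | h_i) from the candy example.
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]


def log_likelihood(p_lime, n_lime, n_cherry):
    """Return log P(d | h) for i.i.d. draws; -inf if h is inconsistent with d."""
    total = 0.0
    for prob, count in ((p_lime, n_lime), (1.0 - p_lime, n_cherry)):
        if count > 0:
            if prob == 0.0:
                return float("-inf")  # hypothesis rules out an observed flavour
            total += count * math.log(prob)
    return total


def map_hypothesis(n_lime, n_cherry):
    """Return the 1-based index i of the hypothesis h_i with the highest
    log P(d | h_i) + log P(h_i)."""
    scores = [log_likelihood(p, n_lime, n_cherry) + math.log(prior)
              for prior, p in zip(PRIORS, P_LIME)]
    return max(range(len(scores)), key=scores.__getitem__) + 1


if __name__ == "__main__":
    # Hypothetical data: 0, 1, 2 and 3 lime candies in a row.
    for n in range(4):
        print(n, "lime ->", "h%d" % map_hypothesis(n, 0))
```

After three lime candies the MAP hypothesis is already $h_5$, which predicts lime with probability 1, whereas averaging over all hypotheses gives about 0.8. The two predictions move closer together as more data arrive.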
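ML approximation

For large data sets, prior becomes irrelevant

Maximum likelihood (ML) learning: choose $h_{\mathrm{ML}}$ maximizing $P(\mathbf{d} \mid h_i)$

I.e., simply get the best fit to the data; identical to MAP for uniform prior
(which is reasonable if all hypotheses are of the same complexity)

ML is the “standard” (non-Bayesian) statistical learning method

Multiple parameters contd.

Derivatives of $L$ contain only the relevant parameter:

$$\frac{\partial L}{\partial \theta} = \frac{c}{\theta} - \frac{\ell}{1 - \theta} = 0 \quad\Rightarrow\quad \theta = \frac{c}{c + \ell}$$

$$\frac{\partial L}{\partial \theta_1} = \frac{r_c}{\theta_1} - \frac{g_c}{1 - \theta_1} = 0 \quad\Rightarrow\quad \theta_1 = \frac{r_c}{r_c + g_c}$$

$$\frac{\partial L}{\partial \theta_2} = \frac{r_\ell}{\theta_2} - \frac{g_\ell}{1 - \theta_2} = 0 \quad\Rightarrow\quad \theta_2 = \frac{r_\ell}{r_\ell + g_\ell}$$

With complete data, parameters can be learned separately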
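Because each parameter is estimated from its own counts, the three ML estimates reduce to simple frequency ratios. The sketch below illustrates this under assumptions: the function name `ml_parameters`, the `(flavor, wrapper)` tuple encoding, and the sample data are hypothetical and not from the original slides.

```python
from collections import Counter


def ml_parameters(candies):
    """Return ML estimates (theta, theta1, theta2) from complete data.

    candies: iterable of (flavor, wrapper) pairs, with flavor in
    {'cherry', 'lime'} and wrapper in {'red', 'green'}.
    """
    counts = Counter(candies)
    rc, gc = counts[("cherry", "red")], counts[("cherry", "green")]
    rl, gl = counts[("lime", "red")], counts[("lime", "green")]
    c, l = rc + gc, rl + gl
    if c == 0 or l == 0:
        # A ratio below would be 0/0: the estimate is undefined without data.
        raise ValueError("need at least one candy of each flavor")
    theta = c / (c + l)       # P(F = cherry)
    theta1 = rc / (rc + gc)   # P(W = red | F = cherry)
    theta2 = rl / (rl + gl)   # P(W = red | F = lime)
    return theta, theta1, theta2


if __name__ == "__main__":
    # Hypothetical sample of 8 unwrapped candies.
    data = ([("cherry", "red")] * 3 + [("cherry", "green")]
            + [("lime", "red")] + [("lime", "green")] * 3)
    print(ml_parameters(data))
```

Each estimate uses only the counts that appear in its own bracket of $L$, which is why the parameters can be learned separately.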
ML parameter learning in Bayes nets

Bag from a new manufacturer; fraction $\theta$ of cherry candies?
Any $\theta$ is possible: continuum of hypotheses $h_\theta$
$\theta$ is a parameter for this simple (binomial) family of models

[Figure: single-node Bayes net Flavor, with $P(F = \text{cherry}) = \theta$]

Suppose we unwrap $N$ candies, $c$ cherries and $\ell = N - c$ limes
These are i.i.d. (independent, identically distributed) observations, so

$$P(\mathbf{d} \mid h_\theta) = \prod_{j=1}^{N} P(d_j \mid h_\theta) = \theta^{c} \cdot (1 - \theta)^{\ell}$$

Maximize this w.r.t. $\theta$, which is easier for the log-likelihood:

$$L(\mathbf{d} \mid h_\theta) = \log P(\mathbf{d} \mid h_\theta) = \sum_{j=1}^{N} \log P(d_j \mid h_\theta) = c \log \theta + \ell \log(1 - \theta)$$

$$\frac{dL(\mathbf{d} \mid h_\theta)}{d\theta} = \frac{c}{\theta} - \frac{\ell}{1 - \theta} = 0 \quad\Rightarrow\quad \theta = \frac{c}{c + \ell} = \frac{c}{N}$$

Seems sensible, but causes problems with 0 counts!

Example: linear Gaussian model

[Figure: plot of $P(y \mid x)$ over $x$ and $y$, and a plot of $y$ against $x$ for $x$ from 0 to 1]

Maximizing

$$P(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y - (\theta_1 x + \theta_2))^2}{2\sigma^2}}$$

w.r.t. $\theta_1$, $\theta_2$

= minimizing

$$E = \sum_{j=1}^{N} (y_j - (\theta_1 x_j + \theta_2))^2$$

That is, minimizing the sum of squared errors gives the ML solution
for a linear fit assuming Gaussian noise of fixed variance
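The closed-form estimate $\theta = c/N$ for the single-parameter candy model can be checked numerically, and the zero-count problem made concrete. The following Python is a rough sketch under assumptions: the names `log_likelihood`, `ml_theta`, and `grid_argmax`, the grid resolution, and the counts used are all illustrative and not from the original slides.

```python
import math


def log_likelihood(theta, c, l):
    """L = c log(theta) + l log(1 - theta), for 0 < theta < 1."""
    return c * math.log(theta) + l * math.log(1.0 - theta)


def ml_theta(c, l):
    """Closed-form ML estimate theta = c / (c + l) = c / N."""
    if c + l == 0:
        raise ValueError("no observations")
    return c / (c + l)


def grid_argmax(c, l, steps=1000):
    """Brute-force check: the grid point in (0, 1) with the highest L."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda theta: log_likelihood(theta, c, l))


if __name__ == "__main__":
    # Hypothetical counts: 3 cherries and 7 limes.
    print(ml_theta(3, 7), grid_argmax(3, 7))
    # Zero-count problem: no cherries seen yet, so the ML estimate says
    # a cherry candy can never occur.
    print(ml_theta(0, 5))
```

With zero cherries among the candies unwrapped so far, the estimate is exactly 0, so any cherry candy seen later would be assigned probability zero.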
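For the linear Gaussian model, minimizing the sum of squared errors $E$ has a well-known closed-form solution, obtained by setting the derivatives of $E$ with respect to $\theta_1$ and $\theta_2$ to zero. The sketch below uses that standard least-squares result, which the slides do not spell out. The function names `fit_line` and `sum_squared_errors` and the data points are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (theta1, theta2) minimizing sum_j (y_j - (theta1*x_j + theta2))^2.

    Standard least-squares solution:
        theta1 = sum (x - mean_x)(y - mean_y) / sum (x - mean_x)^2
        theta2 = mean_y - theta1 * mean_x
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    theta2 = mean_y - theta1 * mean_x
    return theta1, theta2


def sum_squared_errors(xs, ys, theta1, theta2):
    """E = sum_j (y_j - (theta1 * x_j + theta2))^2."""
    return sum((y - (theta1 * x + theta2)) ** 2 for x, y in zip(xs, ys))


if __name__ == "__main__":
    # Hypothetical noisy observations of a line on [0, 1].
    xs = [0.0, 0.25, 0.5, 0.75, 1.0]
    ys = [0.1, 0.3, 0.45, 0.7, 0.85]
    t1, t2 = fit_line(xs, ys)
    print(t1, t2, sum_squared_errors(xs, ys, t1, t2))
```

The noise variance $\sigma^2$ does not appear in the fit: because it is assumed fixed, it scales the log-likelihood without moving its maximum.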
Summary
Full Bayesian learning gives best possible predictions but is intractable
MAP learning balances complexity with accuracy on training data
Maximum likelihood assumes uniform prior, OK for large data sets

1. Choose a parameterized family of models to describe the data
       requires substantial insight and sometimes new models
2. Write down the likelihood of the data as a function of the parameters
       may require summing over hidden variables, i.e., inference
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero
       may be hard/impossible; modern optimization techniques help