[Figure: single-node Bayes net with node Flavor; parameter θ]

Suppose we unwrap N candies, c cherries and ℓ = N − c limes.
These are i.i.d. (independent, identically distributed) observations, so

    P(d | hθ) = ∏_{j=1}^{N} P(dj | hθ) = θ^c · (1 − θ)^ℓ

Maximize this w.r.t. θ, which is easier for the log-likelihood:

    L(d | hθ) = log P(d | hθ) = Σ_{j=1}^{N} log P(dj | hθ) = c log θ + ℓ log(1 − θ)

    dL(d | hθ)/dθ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c/(c + ℓ) = c/N

Seems sensible, but causes problems with 0 counts!

[Figure: P(y | x) and (x, y) data points with fitted line; axes x and y]

Maximizing  P(y | x) = (1/(√(2π) σ)) e^{−(y − (θ1 x + θ2))² / (2σ²)}  w.r.t. θ1, θ2

    = minimizing  E = Σ_{j=1}^{N} (yj − (θ1 xj + θ2))²

That is, minimizing the sum of squared errors gives the ML solution
for a linear fit assuming Gaussian noise of fixed variance.
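To make the candy estimate concrete, here is a minimal Python sketch (the counts N = 100, c = 30 and the grid search are illustrative assumptions, not from the slides): it evaluates the log-likelihood c log θ + ℓ log(1 − θ) over a grid of θ values and confirms the maximum sits at θ = c/N.

import numpy as np

# Hypothetical counts: N candies unwrapped, c of them cherry (illustrative values).
N, c = 100, 30
ell = N - c  # number of limes

# Log-likelihood L(d | h_theta) = c*log(theta) + ell*log(1 - theta),
# evaluated on a grid of candidate theta values (excluding 0 and 1).
thetas = np.linspace(0.001, 0.999, 999)
log_lik = c * np.log(thetas) + ell * np.log(1 - thetas)

theta_ml = thetas[np.argmax(log_lik)]
print(theta_ml)   # ~0.30, i.e. close to c/N
print(c / N)      # closed-form ML estimate theta = c/N

Note that with c = 0 the estimate becomes θ = 0, so the model would assign zero probability to ever seeing a cherry: this is the 0-count problem flagged above.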
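For the linear Gaussian example, a similar sketch (the synthetic data and the choice of np.polyfit as the least-squares solver are my own, not prescribed by the slides) recovers θ1 and θ2 by minimizing the sum of squared errors, which is exactly the ML solution under fixed-variance Gaussian noise.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = theta1*x + theta2 + Gaussian noise (assumed values for illustration).
theta1_true, theta2_true, sigma = 2.0, 0.5, 0.3
x = rng.uniform(0, 1, size=200)
y = theta1_true * x + theta2_true + rng.normal(0, sigma, size=200)

# Least-squares line fit: minimizes E = sum_j (y_j - (theta1*x_j + theta2))^2,
# the ML solution for a linear fit with fixed-variance Gaussian noise.
theta1_hat, theta2_hat = np.polyfit(x, y, deg=1)
print(theta1_hat, theta2_hat)  # close to the true parameters 2.0 and 0.5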
Summary
Full Bayesian learning gives best possible predictions but is intractable
MAP learning balances complexity with accuracy on training data
Maximum likelihood assumes uniform prior, OK for large data sets
In outline, maximum-likelihood learning proceeds in four steps:
1. Choose a parameterized family of models to describe the data
   – requires substantial insight and sometimes new models
2. Write down the likelihood of the data as a function of the parameters
   – may require summing over hidden variables, i.e., inference
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero
   – may be hard/impossible; modern optimization techniques help
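As a rough end-to-end illustration of steps 1–4, here is a sketch that reuses the candy model (the Bernoulli family and the scipy.optimize.minimize_scalar call are assumptions for illustration, not part of the summary): rather than solving dL/dθ = 0 by hand, it hands the negative log-likelihood to a numerical optimizer, in the spirit of the final remark about modern optimization techniques.

import numpy as np
from scipy.optimize import minimize_scalar

# Step 1: model family -- Bernoulli(theta) over candy flavors (illustrative choice).
# Step 2: log-likelihood of the observed counts as a function of theta.
def neg_log_likelihood(theta, c=30, ell=70):
    return -(c * np.log(theta) + ell * np.log(1 - theta))

# Steps 3-4: instead of setting the derivative to zero analytically,
# let a bounded numerical optimizer find the maximizing theta.
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # ~0.30 = c / N, matching the closed-form answer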