
Naive Bayes and Gaussian Bayes Classifier

Mengye Ren
mren@cs.toronto.edu

October 18, 2015

Naive Bayes

Bayes Rule:

$$p(t|x) = \frac{p(x|t)\,p(t)}{p(x)}$$

Naive Bayes Assumption:

$$p(x|t) = \prod_{j=1}^{D} p(x_j|t)$$

Likelihood function:

$$L(\theta) = p(x, t|\theta) = p(x|t, \theta)\,p(t|\theta)$$
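
For concreteness, here is a minimal numeric sketch (all numbers invented for illustration) that combines the naive Bayes factorization with Bayes rule to score two classes:

```python
import numpy as np

# Invented per-feature likelihoods p(x_j = 1 | t) for two classes.
p_word_given_spam = np.array([0.3, 0.02, 0.05])
p_word_given_ham = np.array([0.01, 0.2, 0.3])
prior = {"ham": 0.6, "spam": 0.4}

x = np.array([1, 0, 1])  # observed binary feature vector

def nb_joint(p, prior_t):
    # Naive Bayes: p(x, t) = p(t) * prod_j p(x_j | t)
    return prior_t * np.prod(np.where(x == 1, p, 1.0 - p))

joint_ham = nb_joint(p_word_given_ham, prior["ham"])
joint_spam = nb_joint(p_word_given_spam, prior["spam"])

# Bayes rule: p(t | x) = p(x, t) / p(x), where p(x) sums over both classes.
print("p(spam | x) =", joint_spam / (joint_ham + joint_spam))
```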

Example: Spam Classification

Each vocabulary word is one feature dimension.

We encode each email as a feature vector x ∈ {0, 1}^{|V|}, where x_j = 1 iff vocabulary word j appears in the email.

We want to model the probability of each word x_j appearing in an email, given that the email is spam or not.

Example words: $10,000, Toronto, Piazza, etc.

Idea: Use a Bernoulli distribution to model p(x_j | t).

Example: p("$10,000" | spam) = 0.3

Bernoulli Naive Bayes

Assuming all data points x^{(i)} are i.i.d. samples, and p(x_j | t) follows a Bernoulli distribution with parameter µ_{jt}:

$$p(x^{(i)}|t^{(i)}) = \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}$$

$$p(t|x) \propto \prod_{i=1}^{N} p(t^{(i)})\, p(x^{(i)}|t^{(i)}) = \prod_{i=1}^{N} p(t^{(i)}) \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}$$

where p(t) = π_t. The parameters π_t and µ_{jt} can be learnt using maximum likelihood.
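
As a sketch of how this likelihood can be evaluated in practice (the helper below is mine, not from the slides), working in the log domain avoids numerical underflow when D is large:

```python
import numpy as np

def bernoulli_log_likelihood(X, t, mu, pi):
    """log L = sum_i [ log pi_{t_i} + sum_j x_ij log mu_{j t_i}
                                     + (1 - x_ij) log(1 - mu_{j t_i}) ].

    X: (N, D) binary data; t: (N,) integer class labels;
    mu: (K, D) with mu[k, j] = p(x_j = 1 | t = k); pi: (K,) class priors.
    """
    mu_t = mu[t]  # (N, D): each row holds the parameters of that point's class
    ll = X * np.log(mu_t) + (1 - X) * np.log(1 - mu_t)
    return np.log(pi[t]).sum() + ll.sum()
```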

Derivation of maximum likelihood estimator (MLE)

θ = [µ, π]

$$\log L(\theta) = \log p(x, t|\theta) = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} + \sum_{j=1}^{D} x_j^{(i)} \log \mu_{jt^{(i)}} + \left(1 - x_j^{(i)}\right) \log\left(1 - \mu_{jt^{(i)}}\right) \right]$$

Want: arg max_θ log L(θ) subject to ∑_k π_k = 1

Derivation of maximum likelihood estimator (MLE)
Take derivative w.r.t. µ:

$$\frac{\partial \log L(\theta)}{\partial \mu_{jk}} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0$$

$$\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left[ x_j^{(i)} \left(1 - \mu_{jk}\right) - \left(1 - x_j^{(i)}\right) \mu_{jk} \right] = 0$$

$$\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \mu_{jk} = \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] x_j^{(i)}$$

$$\mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}$$
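
In code this estimator is a per-class frequency count. A minimal numpy sketch (the optional `eps` adds Laplace smoothing, which goes beyond the plain MLE above; `eps=0` recovers it exactly):

```python
import numpy as np

def fit_mu(X, t, K, eps=0.0):
    # mu[k, j] = (# class-k examples with x_j = 1) / (# class-k examples)
    mu = np.empty((K, X.shape[1]))
    for k in range(K):
        Xk = X[t == k]
        mu[k] = (Xk.sum(axis=0) + eps) / (len(Xk) + 2 * eps)
    return mu
```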

Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive π:

$$\frac{\partial \log L(\theta)}{\partial \pi_k} + \lambda \frac{\partial \sum_\kappa \pi_\kappa}{\partial \pi_k} = 0 \;\Rightarrow\; \lambda = -\frac{1}{\pi_k} \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]$$

$$\pi_k = -\frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}{\lambda}$$

Apply the constraint: ∑_k π_k = 1 ⇒ λ = −N

$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}{N}$$
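
The corresponding one-liner, a sketch of the class-frequency estimate above:

```python
import numpy as np

def fit_pi(t, K):
    # pi_k = (# examples with t = k) / N
    return np.bincount(t, minlength=K) / len(t)
```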

Spam Classification Demo
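
A stand-alone sketch of such a demo, assuming synthetic data in place of a real email corpus (the vocabulary size, parameters, and smoothing choice are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the email data: 200 emails, 6 vocabulary words.
N, D, K = 200, 6, 2
t = rng.integers(0, K, size=N)                       # 0 = ham, 1 = spam
true_mu = np.array([[0.05, 0.2, 0.3, 0.1, 0.4, 0.1],
                    [0.4, 0.05, 0.1, 0.5, 0.1, 0.3]])
X = (rng.random((N, D)) < true_mu[t]).astype(int)    # binary word indicators

# MLE from the previous slides, with a little Laplace smoothing to avoid log(0).
pi = np.bincount(t, minlength=K) / N
mu = np.stack([(X[t == k].sum(0) + 1) / ((t == k).sum() + 2) for k in range(K)])

def predict(X):
    # log p(t=k) + sum_j log Bernoulli(x_j; mu_kj), for each class k
    log_joint = np.log(pi) + X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T
    return log_joint.argmax(axis=1)

print("training accuracy:", (predict(X) == t).mean())
```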

Gaussian Bayes Classifier

Instead of assuming conditional independence of the x_j, we model p(x|t) as a multivariate Gaussian distribution; the dependence between the x_j is encoded in the covariance matrix.

Multivariate Gaussian distribution:

$$f(x) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

µ: mean, Σ: covariance matrix, D: dim(x)
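
A direct numpy transcription of this density (a sketch; for production one would prefer a Cholesky-based log-density for numerical stability):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # f(x) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / sqrt((2*pi)^D det(Sigma))
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

x = np.array([1.0, 2.0])
mu = np.zeros(2)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(gaussian_pdf(x, mu, Sigma))
```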

Derivation of maximum likelihood estimator (MLE)

$$\theta = [\mu, \Sigma, \pi], \qquad Z = \sqrt{(2\pi)^D \det(\Sigma)}$$

$$p(x|t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

$$\log L(\theta) = \log p(x, t|\theta) = \log p(t|\theta) + \log p(x|t, \theta)$$

$$= \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} - \log Z_{t^{(i)}} - \frac{1}{2} \left( x^{(i)} - \mu_{t^{(i)}} \right)^T \Sigma_{t^{(i)}}^{-1} \left( x^{(i)} - \mu_{t^{(i)}} \right) \right]$$

Want: arg max_θ log L(θ) subject to ∑_k π_k = 1

Derivation of maximum likelihood estimator (MLE)

Take derivative w.r.t. µ:

$$\frac{\partial \log L}{\partial \mu_k} = \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \Sigma_k^{-1} \left( x^{(i)} - \mu_k \right) = 0$$

$$\mu_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}$$
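
In code, the estimator is just the per-class sample mean, e.g.:

```python
import numpy as np

def fit_means(X, t, K):
    # mu_k = mean of the x^(i) with t^(i) = k
    return np.stack([X[t == k].mean(axis=0) for k in range(K)])
```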

Derivation of maximum likelihood estimator (MLE)

Take derivative w.r.t. Σ^{-1} (not Σ).

Note:

$$\frac{\partial \det(A)}{\partial A} = \det(A)\left(A^{-1}\right)^T, \qquad \det(A)^{-1} = \det\left(A^{-1}\right), \qquad \frac{\partial\, x^T A x}{\partial A} = x x^T, \qquad \Sigma^T = \Sigma$$

$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = -\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left[ \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} + \frac{1}{2} \left( x^{(i)} - \mu_k \right) \left( x^{(i)} - \mu_k \right)^T \right] = 0$$

Derivation of maximum likelihood estimator (MLE)
$$Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)}$$

$$\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = (2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}}\, (2\pi)^{\frac{D}{2}} \frac{\partial \det\left(\Sigma_k^{-1}\right)^{-\frac{1}{2}}}{\partial \Sigma_k^{-1}}$$

$$= \det\left(\Sigma_k^{-1}\right)^{\frac{1}{2}} \left( -\frac{1}{2} \right) \det\left(\Sigma_k^{-1}\right)^{-\frac{3}{2}} \det\left(\Sigma_k^{-1}\right) \Sigma_k^T = -\frac{1}{2} \Sigma_k$$

$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} \left( x^{(i)} - \mu_k \right) \left( x^{(i)} - \mu_k \right)^T \right] = 0$$

$$\Sigma_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left( x^{(i)} - \mu_k \right) \left( x^{(i)} - \mu_k \right)^T}{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}$$
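
And the covariance estimator is the per-class average outer product, e.g. (a sketch reusing the means `mu` from the previous step):

```python
import numpy as np

def fit_covariances(X, t, mu, K):
    # Sigma_k = (1/N_k) * sum over class k of (x - mu_k)(x - mu_k)^T
    Sigmas = []
    for k in range(K):
        diff = X[t == k] - mu[k]          # (N_k, D)
        Sigmas.append(diff.T @ diff / len(diff))
    return np.stack(Sigmas)
```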

Derivation of maximum likelihood estimator (MLE)

$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}{N}$$

(Same as the Bernoulli case.)

Gaussian Bayes Classifier Demo
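
A self-contained sketch of such a demo on synthetic 2-d data (all parameters invented; `np.cov` with `bias=True` gives the divide-by-N MLE covariance derived above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic 2-d Gaussian classes (made-up parameters).
N = 300
t = rng.integers(0, 2, size=N)
mus_true = np.array([[0.0, 0.0], [2.5, 1.5]])
X = mus_true[t] + rng.normal(size=(N, 2)) @ np.array([[1.0, 0.3], [0.0, 0.7]])

# MLE parameters, per class.
pi = np.bincount(t) / N
mu = np.stack([X[t == k].mean(0) for k in range(2)])
Sigma = np.stack([np.cov(X[t == k].T, bias=True) for k in range(2)])

def log_gaussian(X, mu, S):
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(S), diff)
    return -0.5 * (quad + np.log(np.linalg.det(S)) + len(mu) * np.log(2 * np.pi))

log_post = np.stack([np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k])
                     for k in range(2)], axis=1)
print("training accuracy:", (log_post.argmax(1) == t).mean())
```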

Gaussian Bayes Classifier

If we constrain Σ to be diagonal, then we can rewrite p(x|t) as a product of the p(x_j|t):

$$p(x|t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\frac{1}{2} (x - \mu_t)^T \Sigma_t^{-1} (x - \mu_t) \right)$$

$$= \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\, \Sigma_{t,jj}}} \exp\left( -\frac{(x_j - \mu_{jt})^2}{2\, \Sigma_{t,jj}} \right) = \prod_{j=1}^{D} p(x_j|t)$$

A diagonal covariance matrix satisfies the naive Bayes assumption.
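
A quick numeric check of this factorization (a sketch using scipy's standard density functions; the numbers are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# With a diagonal covariance, the joint Gaussian density factorizes into
# a product of univariate Gaussian densities.
mu = np.array([0.5, -1.0, 2.0])
var = np.array([1.0, 0.25, 2.0])          # diagonal of Sigma
x = np.array([0.0, -0.5, 1.0])

joint = multivariate_normal.pdf(x, mean=mu, cov=np.diag(var))
product = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)))
print(np.allclose(joint, product))        # True
```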

Gaussian Bayes Classifier

Case 1: The covariance matrix is shared among classes:

$$p(x|t) = \mathcal{N}(x|\mu_t, \Sigma)$$

Case 2: Each class has its own covariance:

$$p(x|t) = \mathcal{N}(x|\mu_t, \Sigma_t)$$

Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary is where the posteriors are equal, i.e. π₁ p(x|t = 1) = π₀ p(x|t = 0). Taking logs:

$$\log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)$$

Multiplying by −2 and collecting the prior terms into a constant C:

$$C + x^T \Sigma^{-1} x - 2\mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 = x^T \Sigma^{-1} x - 2\mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0$$

$$2(\mu_0 - \mu_1)^T \Sigma^{-1} x - \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) = -C$$

$$\Rightarrow a^T x - b = 0$$

The decision boundary is a linear function of x (a hyperplane in general).
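
A sketch of one consistent choice of a and b, read off from the log-posterior equality above:

```python
import numpy as np

def linear_boundary(mu0, mu1, Sigma, pi0, pi1):
    # Boundary a^T x - b = 0; a^T x - b > 0 means class 1 is more probable.
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu1 - mu0)
    b = 0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0) - np.log(pi1 / pi0)
    return a, b
```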

Relation to Logistic Regression

We can write the posterior distribution p(t = 0|x) as

$$\frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0\, \mathcal{N}(x|\mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x|\mu_0, \Sigma) + \pi_1\, \mathcal{N}(x|\mu_1, \Sigma)}$$

$$= \left[ 1 + \frac{\pi_1}{\pi_0} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right) \right]^{-1}$$

$$= \left[ 1 + \exp\left( \log \frac{\pi_1}{\pi_0} + (\mu_1 - \mu_0)^T \Sigma^{-1} x + \frac{1}{2} \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) \right) \right]^{-1}$$

$$= \frac{1}{1 + \exp(-w^T x - b)}$$
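
A numeric check of this identity (a sketch with invented parameters; it verifies the complementary form p(t = 1|x) = σ(w^T x + b)):

```python
import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
pi0, pi1 = 0.6, 0.4
Sinv = np.linalg.inv(Sigma)

w = Sinv @ (mu1 - mu0)
b = np.log(pi1 / pi0) + 0.5 * (mu0 @ Sinv @ mu0 - mu1 @ Sinv @ mu1)

x = np.array([1.0, -0.5])

def gauss(x, mu):
    # Unnormalized density; the shared normalizer cancels in the posterior.
    d = x - mu
    return np.exp(-0.5 * d @ Sinv @ d)

post1 = pi1 * gauss(x, mu1) / (pi0 * gauss(x, mu0) + pi1 * gauss(x, mu1))
print(np.isclose(post1, 1 / (1 + np.exp(-(w @ x + b)))))  # True
```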

Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary is where π₁ p(x|t = 1) = π₀ p(x|t = 0). Taking logs (the −½ log det Σ_k normalization terms now differ between classes and are absorbed, with the priors, into the constant C):

$$\log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)$$

$$x^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) x - 2 \left( \mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \right) x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) = C$$

$$\Rightarrow x^T Q x - 2 b^T x + c = 0$$

The decision boundary is a quadratic function. In the 2-d case, it looks like an ellipse, a parabola, or a hyperbola.
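
A sketch reading off Q, b, and c from the equality above (the log-determinant and log-prior terms are folded into c):

```python
import numpy as np

def quadratic_boundary(mu0, mu1, S0, S1, pi0, pi1):
    # Boundary: x^T Q x - 2 b^T x + c = 0
    S0i, S1i = np.linalg.inv(S0), np.linalg.inv(S1)
    Q = S1i - S0i
    b = S1i @ mu1 - S0i @ mu0
    c = (mu1 @ S1i @ mu1 - mu0 @ S0i @ mu0
         + np.log(np.linalg.det(S1) / np.linalg.det(S0))
         - 2 * np.log(pi1 / pi0))
    return Q, b, c
```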

Thanks!

