
Naive Bayes and Gaussian Bayes Classifier

Mengye Ren
mren@cs.toronto.edu

October 18, 2015

Naive Bayes

Bayes Rule:

$$p(t|x) = \frac{p(x|t)\,p(t)}{p(x)}$$

Naive Bayes Assumption:

$$p(x|t) = \prod_{j=1}^{D} p(x_j|t)$$

Likelihood function:

$$L(\theta) = p(x, t|\theta) = p(x|t, \theta)\,p(t|\theta)$$
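
For concreteness, here is a minimal numeric sketch (all numbers invented for illustration) that combines the naive Bayes factorization with Bayes rule to score two classes:

```python
import numpy as np

# Invented per-feature likelihoods p(x_j = 1 | t) for two classes.
p_word_given_spam = np.array([0.3, 0.02, 0.05])
p_word_given_ham = np.array([0.01, 0.2, 0.3])
prior = {"ham": 0.6, "spam": 0.4}

x = np.array([1, 0, 1])  # observed binary feature vector

def nb_joint(p, prior_t):
    # Naive Bayes: p(x, t) = p(t) * prod_j p(x_j | t)
    return prior_t * np.prod(np.where(x == 1, p, 1.0 - p))

joint_ham = nb_joint(p_word_given_ham, prior["ham"])
joint_spam = nb_joint(p_word_given_spam, prior["spam"])

# Bayes rule: p(t | x) = p(x, t) / p(x), where p(x) sums over both classes.
print("p(spam | x) =", joint_spam / (joint_ham + joint_spam))
```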

Example: Spam Classification

Each vocabulary word is one feature dimension.

We encode each email as a feature vector x ∈ {0, 1}^{|V|}, where x_j = 1 iff vocabulary word j appears in the email.

We want to model the probability of each word x_j appearing in an email, given that the email is spam or not.

Example words: $10,000, Toronto, Piazza, etc.

Idea: Use a Bernoulli distribution to model p(x_j | t).

Example: p("$10,000" | spam) = 0.3

Bernoulli Naive Bayes

Assuming all data points x^{(i)} are i.i.d. samples, and p(x_j | t) follows a Bernoulli distribution with parameter µ_{jt}:

$$p(x^{(i)}|t^{(i)}) = \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}$$

$$p(t|x) \propto \prod_{i=1}^{N} p(t^{(i)})\, p(x^{(i)}|t^{(i)}) = \prod_{i=1}^{N} p(t^{(i)}) \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}$$

where p(t) = π_t. The parameters π_t and µ_{jt} can be learnt using maximum likelihood.
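
As a sketch of how this likelihood can be evaluated in practice (the helper below is mine, not from the slides), working in the log domain avoids numerical underflow when D is large:

```python
import numpy as np

def bernoulli_log_likelihood(X, t, mu, pi):
    """log L = sum_i [ log pi_{t_i} + sum_j x_ij log mu_{j t_i}
                                     + (1 - x_ij) log(1 - mu_{j t_i}) ].

    X: (N, D) binary data; t: (N,) integer class labels;
    mu: (K, D) with mu[k, j] = p(x_j = 1 | t = k); pi: (K,) class priors.
    """
    mu_t = mu[t]  # (N, D): each row holds the parameters of that point's class
    ll = X * np.log(mu_t) + (1 - X) * np.log(1 - mu_t)
    return np.log(pi[t]).sum() + ll.sum()
```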

Derivation of maximum likelihood estimator (MLE)

θ = [µ, π]

$$\log L(\theta) = \log p(x, t|\theta) = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} + \sum_{j=1}^{D} x_j^{(i)} \log \mu_{jt^{(i)}} + \left(1 - x_j^{(i)}\right) \log\left(1 - \mu_{jt^{(i)}}\right) \right]$$

Want: arg max_θ log L(θ) subject to ∑_k π_k = 1

Derivation of maximum likelihood estimator (MLE)
Take derivative w.r.t. µ:

$$\frac{\partial \log L(\theta)}{\partial \mu_{jk}} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0$$

$$\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left[ x_j^{(i)} \left(1 - \mu_{jk}\right) - \left(1 - x_j^{(i)}\right) \mu_{jk} \right] = 0$$

$$\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \mu_{jk} = \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] x_j^{(i)}$$

$$\mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}$$
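
In code this estimator is a per-class frequency count. A minimal numpy sketch (the optional `eps` adds Laplace smoothing, which goes beyond the plain MLE above; `eps=0` recovers it exactly):

```python
import numpy as np

def fit_mu(X, t, K, eps=0.0):
    # mu[k, j] = (# class-k examples with x_j = 1) / (# class-k examples)
    mu = np.empty((K, X.shape[1]))
    for k in range(K):
        Xk = X[t == k]
        mu[k] = (Xk.sum(axis=0) + eps) / (len(Xk) + 2 * eps)
    return mu
```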

Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive π:

$$\frac{\partial \log L(\theta)}{\partial \pi_k} + \lambda \frac{\partial \sum_\kappa \pi_\kappa}{\partial \pi_k} = 0 \;\Rightarrow\; \lambda = -\frac{1}{\pi_k} \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]$$

$$\pi_k = -\frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}{\lambda}$$

Apply the constraint: ∑_k π_k = 1 ⇒ λ = −N

$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}{N}$$
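
The corresponding one-liner, a sketch of the class-frequency estimate above:

```python
import numpy as np

def fit_pi(t, K):
    # pi_k = (# examples with t = k) / N
    return np.bincount(t, minlength=K) / len(t)
```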

Spam Classification Demo
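
A stand-alone sketch of such a demo, assuming synthetic data in place of a real email corpus (the vocabulary size, parameters, and smoothing choice are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the email data: 200 emails, 6 vocabulary words.
N, D, K = 200, 6, 2
t = rng.integers(0, K, size=N)                       # 0 = ham, 1 = spam
true_mu = np.array([[0.05, 0.2, 0.3, 0.1, 0.4, 0.1],
                    [0.4, 0.05, 0.1, 0.5, 0.1, 0.3]])
X = (rng.random((N, D)) < true_mu[t]).astype(int)    # binary word indicators

# MLE from the previous slides, with a little Laplace smoothing to avoid log(0).
pi = np.bincount(t, minlength=K) / N
mu = np.stack([(X[t == k].sum(0) + 1) / ((t == k).sum() + 2) for k in range(K)])

def predict(X):
    # log p(t=k) + sum_j log Bernoulli(x_j; mu_kj), for each class k
    log_joint = np.log(pi) + X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T
    return log_joint.argmax(axis=1)

print("training accuracy:", (predict(X) == t).mean())
```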

Gaussian Bayes Classifier

Instead of assuming conditional independence of the x_j, we model p(x|t) as a multivariate Gaussian distribution; the dependence between the x_j is encoded in the covariance matrix.

Multivariate Gaussian distribution:

$$f(x) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

µ: mean, Σ: covariance matrix, D: dim(x)
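
A direct numpy transcription of this density (a sketch; for production one would prefer a Cholesky-based log-density for numerical stability):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # f(x) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / sqrt((2*pi)^D det(Sigma))
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

x = np.array([1.0, 2.0])
mu = np.zeros(2)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(gaussian_pdf(x, mu, Sigma))
```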

Derivation of maximum likelihood estimator (MLE)

$$\theta = [\mu, \Sigma, \pi], \qquad Z = \sqrt{(2\pi)^D \det(\Sigma)}$$

$$p(x|t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

$$\log L(\theta) = \log p(x, t|\theta) = \log p(t|\theta) + \log p(x|t, \theta)$$

$$= \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} - \log Z_{t^{(i)}} - \frac{1}{2} \left( x^{(i)} - \mu_{t^{(i)}} \right)^T \Sigma_{t^{(i)}}^{-1} \left( x^{(i)} - \mu_{t^{(i)}} \right) \right]$$

Want: arg max_θ log L(θ) subject to ∑_k π_k = 1

Derivation of maximum likelihood estimator (MLE)

Take derivative w.r.t. µ:

$$\frac{\partial \log L}{\partial \mu_k} = \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \Sigma_k^{-1} \left( x^{(i)} - \mu_k \right) = 0$$

$$\mu_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}$$
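
In code, the estimator is just the per-class sample mean, e.g.:

```python
import numpy as np

def fit_means(X, t, K):
    # mu_k = mean of the x^(i) with t^(i) = k
    return np.stack([X[t == k].mean(axis=0) for k in range(K)])
```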

Derivation of maximum likelihood estimator (MLE)

Take derivative w.r.t. Σ^{-1} (not Σ).

Note:

$$\frac{\partial \det(A)}{\partial A} = \det(A)\left(A^{-1}\right)^T, \qquad \det(A)^{-1} = \det\left(A^{-1}\right), \qquad \frac{\partial\, x^T A x}{\partial A} = x x^T, \qquad \Sigma^T = \Sigma$$

$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = -\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left[ \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} + \frac{1}{2} \left( x^{(i)} - \mu_k \right) \left( x^{(i)} - \mu_k \right)^T \right] = 0$$

Derivation of maximum likelihood estimator (MLE)
$$Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)}$$

$$\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = (2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}}\, (2\pi)^{\frac{D}{2}} \frac{\partial \det\left(\Sigma_k^{-1}\right)^{-\frac{1}{2}}}{\partial \Sigma_k^{-1}}$$

$$= \det\left(\Sigma_k^{-1}\right)^{\frac{1}{2}} \left( -\frac{1}{2} \right) \det\left(\Sigma_k^{-1}\right)^{-\frac{3}{2}} \det\left(\Sigma_k^{-1}\right) \Sigma_k^T = -\frac{1}{2} \Sigma_k$$

$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} \left( x^{(i)} - \mu_k \right) \left( x^{(i)} - \mu_k \right)^T \right] = 0$$

$$\Sigma_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left( x^{(i)} - \mu_k \right) \left( x^{(i)} - \mu_k \right)^T}{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}$$
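
And the covariance estimator is the per-class average outer product, e.g. (a sketch reusing the means `mu` from the previous step):

```python
import numpy as np

def fit_covariances(X, t, mu, K):
    # Sigma_k = (1/N_k) * sum over class k of (x - mu_k)(x - mu_k)^T
    Sigmas = []
    for k in range(K):
        diff = X[t == k] - mu[k]          # (N_k, D)
        Sigmas.append(diff.T @ diff / len(diff))
    return np.stack(Sigmas)
```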

Derivation of maximum likelihood estimator (MLE)

$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}{N}$$

(Same as the Bernoulli case.)

Gaussian Bayes Classifier Demo
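
A self-contained sketch of such a demo on synthetic 2-d data (all parameters invented; `np.cov` with `bias=True` gives the divide-by-N MLE covariance derived above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic 2-d Gaussian classes (made-up parameters).
N = 300
t = rng.integers(0, 2, size=N)
mus_true = np.array([[0.0, 0.0], [2.5, 1.5]])
X = mus_true[t] + rng.normal(size=(N, 2)) @ np.array([[1.0, 0.3], [0.0, 0.7]])

# MLE parameters, per class.
pi = np.bincount(t) / N
mu = np.stack([X[t == k].mean(0) for k in range(2)])
Sigma = np.stack([np.cov(X[t == k].T, bias=True) for k in range(2)])

def log_gaussian(X, mu, S):
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(S), diff)
    return -0.5 * (quad + np.log(np.linalg.det(S)) + len(mu) * np.log(2 * np.pi))

log_post = np.stack([np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k])
                     for k in range(2)], axis=1)
print("training accuracy:", (log_post.argmax(1) == t).mean())
```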

Gaussian Bayes Classifier

If we constrain Σ to be diagonal, then we can rewrite p(x|t) as a product of the p(x_j|t):

$$p(x|t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\frac{1}{2} (x - \mu_t)^T \Sigma_t^{-1} (x - \mu_t) \right)$$

$$= \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\, \Sigma_{t,jj}}} \exp\left( -\frac{(x_j - \mu_{jt})^2}{2\, \Sigma_{t,jj}} \right) = \prod_{j=1}^{D} p(x_j|t)$$

A diagonal covariance matrix satisfies the naive Bayes assumption.
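
A quick numeric check of this factorization (a sketch using scipy's standard density functions; the numbers are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# With a diagonal covariance, the joint Gaussian density factorizes into
# a product of univariate Gaussian densities.
mu = np.array([0.5, -1.0, 2.0])
var = np.array([1.0, 0.25, 2.0])          # diagonal of Sigma
x = np.array([0.0, -0.5, 1.0])

joint = multivariate_normal.pdf(x, mean=mu, cov=np.diag(var))
product = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)))
print(np.allclose(joint, product))        # True
```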

Gaussian Bayes Classifier

Case 1: The covariance matrix is shared among classes:

$$p(x|t) = \mathcal{N}(x|\mu_t, \Sigma)$$

Case 2: Each class has its own covariance:

$$p(x|t) = \mathcal{N}(x|\mu_t, \Sigma_t)$$

Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary is where the posteriors are equal, i.e. π₁ p(x|t = 1) = π₀ p(x|t = 0). Taking logs:

$$\log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)$$

Multiplying by −2 and collecting the prior terms into a constant C:

$$C + x^T \Sigma^{-1} x - 2\mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 = x^T \Sigma^{-1} x - 2\mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0$$

$$2(\mu_0 - \mu_1)^T \Sigma^{-1} x - \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) = -C$$

$$\Rightarrow a^T x - b = 0$$

The decision boundary is a linear function of x (a hyperplane in general).
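
A sketch of one consistent choice of a and b, read off from the log-posterior equality above:

```python
import numpy as np

def linear_boundary(mu0, mu1, Sigma, pi0, pi1):
    # Boundary a^T x - b = 0; a^T x - b > 0 means class 1 is more probable.
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu1 - mu0)
    b = 0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0) - np.log(pi1 / pi0)
    return a, b
```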

Relation to Logistic Regression

We can write the posterior distribution p(t = 0|x) as

$$\frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0\, \mathcal{N}(x|\mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x|\mu_0, \Sigma) + \pi_1\, \mathcal{N}(x|\mu_1, \Sigma)}$$

$$= \left[ 1 + \frac{\pi_1}{\pi_0} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right) \right]^{-1}$$

$$= \left[ 1 + \exp\left( \log \frac{\pi_1}{\pi_0} + (\mu_1 - \mu_0)^T \Sigma^{-1} x + \frac{1}{2} \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) \right) \right]^{-1}$$

$$= \frac{1}{1 + \exp(-w^T x - b)}$$
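
A numeric check of this identity (a sketch with invented parameters; it verifies the complementary form p(t = 1|x) = σ(w^T x + b)):

```python
import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
pi0, pi1 = 0.6, 0.4
Sinv = np.linalg.inv(Sigma)

w = Sinv @ (mu1 - mu0)
b = np.log(pi1 / pi0) + 0.5 * (mu0 @ Sinv @ mu0 - mu1 @ Sinv @ mu1)

x = np.array([1.0, -0.5])

def gauss(x, mu):
    # Unnormalized density; the shared normalizer cancels in the posterior.
    d = x - mu
    return np.exp(-0.5 * d @ Sinv @ d)

post1 = pi1 * gauss(x, mu1) / (pi0 * gauss(x, mu0) + pi1 * gauss(x, mu1))
print(np.isclose(post1, 1 / (1 + np.exp(-(w @ x + b)))))  # True
```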

Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary is where π₁ p(x|t = 1) = π₀ p(x|t = 0). Taking logs (the −½ log det Σ_k normalization terms now differ between classes and are absorbed, with the priors, into the constant C):

$$\log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)$$

$$x^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) x - 2 \left( \mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \right) x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) = C$$

$$\Rightarrow x^T Q x - 2 b^T x + c = 0$$

The decision boundary is a quadratic function. In the 2-d case, it looks like an ellipse, a parabola, or a hyperbola.
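
A sketch reading off Q, b, and c from the equality above (the log-determinant and log-prior terms are folded into c):

```python
import numpy as np

def quadratic_boundary(mu0, mu1, S0, S1, pi0, pi1):
    # Boundary: x^T Q x - 2 b^T x + c = 0
    S0i, S1i = np.linalg.inv(S0), np.linalg.inv(S1)
    Q = S1i - S0i
    b = S1i @ mu1 - S0i @ mu0
    c = (mu1 @ S1i @ mu1 - mu0 @ S0i @ mu0
         + np.log(np.linalg.det(S1) / np.linalg.det(S0))
         - 2 * np.log(pi1 / pi0))
    return Q, b, c
```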

Thanks!

