
Bayesian Decision Theory

(Sections 2.1-2.2)
Decision problem posed in probabilistic terms

Bayesian Decision Theory: Continuous Features

All the relevant probability values are known
Probability Density
Jain CSE 802, Spring 2013
Course Outline
MODEL INFORMATION
COMPLETE: Bayes Decision Theory -> Optimal Rules
INCOMPLETE:
  Supervised Learning
    Parametric Approach: Plug-in Rules
    Nonparametric Approach: Density Estimation; Geometric Rules (K-NN, MLP)
  Unsupervised Learning
    Parametric Approach: Mixture Resolving
    Nonparametric Approach: Cluster Analysis (Hard, Fuzzy)
Introduction
From sea bass vs. salmon example to abstract
decision making problem

State of nature; a priori (prior) probability
State of nature (which type of fish will be observed next) is
unpredictable, so it is a random variable

The catch of salmon and sea bass is equiprobable

$P(\omega_1) = P(\omega_2)$ (uniform priors)

$P(\omega_1) + P(\omega_2) = 1$ (exclusivity and exhaustivity)
Prior prob. reflects our prior knowledge about how likely we are to
observe a sea bass or salmon; these probabilities may depend on
time of the year or the fishing area!

Bayes decision rule with only the prior information
Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$, otherwise decide $\omega_2$
Error rate = $\min\{P(\omega_1), P(\omega_2)\}$
Suppose now we have a measurement or feature
on the state of nature - say the fish lightness value
Use of the class-conditional probability density
$p(x \mid \omega_1)$ and $p(x \mid \omega_2)$ describe the difference in
lightness feature between populations of sea bass
and salmon

Amount of overlap between the densities determines the goodness of feature
Maximum likelihood decision rule

Assign input pattern x to class $\omega_1$ if
$p(x \mid \omega_1) > p(x \mid \omega_2)$, otherwise $\omega_2$


How does the feature x influence our attitude
(prior) concerning the true state of nature?

Bayes decision rule

Posterior probability, likelihood, evidence
$p(\omega_j, x) = P(\omega_j \mid x)\, p(x) = p(x \mid \omega_j)\, P(\omega_j)$
Bayes formula

$P(\omega_j \mid x) = \dfrac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$

where

$p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j)$

Posterior = (Likelihood × Prior) / Evidence
Evidence $p(x)$ can be viewed as a scale factor that
guarantees that the posterior probabilities sum to 1
$p(x \mid \omega_j)$ is called the likelihood of $\omega_j$ with respect to x; the
category $\omega_j$ for which $p(x \mid \omega_j)$ is large is more likely to be
the true category
$P(\omega_1 \mid x)$ is the probability of the state of nature being $\omega_1$
given that feature value x has been observed

Decision based on the posterior probabilities is called the
Optimal Bayes Decision rule

For a given observation (feature value) x:

if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, decide $\omega_1$
if $P(\omega_1 \mid x) < P(\omega_2 \mid x)$, decide $\omega_2$

To justify the above rule, calculate the probability of error:
$P(\text{error} \mid x) = P(\omega_1 \mid x)$ if we decide $\omega_2$
$P(\text{error} \mid x) = P(\omega_2 \mid x)$ if we decide $\omega_1$
So, for a given x, we can minimize the prob. of error:
decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$;
otherwise decide $\omega_2$

Therefore:
$P(\text{error} \mid x) = \min[P(\omega_1 \mid x), P(\omega_2 \mid x)]$

Thus, for each observation x, Bayes decision rule
minimizes the probability of error
Unconditional error: P(error) obtained by
integration over all x w.r.t. p(x)
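
A minimal sketch of the rule above, in Python: it computes the posteriors with Bayes formula and decides for the class with the larger posterior. The Gaussian class-conditional densities, their parameters, and the function names are assumptions made only for this illustration; the slides do not specify them.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density, used here as an assumed class-conditional p(x | w_j)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, priors, params):
    """Bayes formula: P(w_j | x) = p(x | w_j) P(w_j) / p(x)."""
    joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(params, priors)]
    evidence = sum(joint)  # p(x), the scale factor that makes posteriors sum to 1
    return [j / evidence for j in joint]

def bayes_decide(x, priors, params):
    """Return (index of the chosen class, conditional error P(error | x))."""
    post = posteriors(x, priors, params)
    decision = max(range(len(post)), key=post.__getitem__)
    return decision, 1.0 - post[decision]

# Hypothetical lightness model: class 0 ~ N(4, 1^2), class 1 ~ N(7, 1.5^2)
priors = [0.5, 0.5]
params = [(4.0, 1.0), (7.0, 1.5)]
post = posteriors(5.0, priors, params)
```

For two classes, the returned conditional error equals the smaller of the two posteriors, which is the quantity min[P(w1 | x), P(w2 | x)] in the text.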



Optimal Bayes decision rule

Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$;
otherwise decide $\omega_2$

Special cases:
(i) $P(\omega_1) = P(\omega_2)$: decide $\omega_1$ if
$p(x \mid \omega_1) > p(x \mid \omega_2)$, otherwise $\omega_2$

(ii) $p(x \mid \omega_1) = p(x \mid \omega_2)$: decide $\omega_1$ if
$P(\omega_1) > P(\omega_2)$, otherwise $\omega_2$



Bayesian Decision Theory
Continuous Features
Generalization of the preceding formulation

Use of more than one feature (d features)
Use of more than two states of nature (c classes)
Allowing other actions besides deciding on the state of
nature
Introduce a loss function which is more general than the
probability of error
Allowing actions other than classification primarily
allows the possibility of rejection

Refusing to make a decision when it is difficult to
decide between two classes or in noisy cases!

The loss function specifies the cost of each action
Let $\{\omega_1, \omega_2, \ldots, \omega_c\}$ be the set of c states of nature
(or categories)

Let $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$ be the set of a possible actions

Let $\lambda(\alpha_i \mid \omega_j)$ be the loss incurred for taking
action $\alpha_i$ when the true state of nature is $\omega_j$

General decision rule

$\alpha(x)$ specifies which action to take for every
possible observation x

Conditional risk

$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$

For a given x, suppose we take the action $\alpha_i$; if the true state is $\omega_j$,
we will incur the loss $\lambda(\alpha_i \mid \omega_j)$. $P(\omega_j \mid x)$ is the prob. that the true
state is $\omega_j$. But any one of the c states is possible for the given x.

Overall risk

R = expected value of $R(\alpha(x) \mid x)$ w.r.t. p(x)

Minimizing R $\Leftrightarrow$ minimizing $R(\alpha_i \mid x)$ for i = 1, ..., a

Select the action $\alpha_i$ for which $R(\alpha_i \mid x)$ is minimum

The overall risk R is minimized and the resulting risk
is called the Bayes risk; it is the best performance that
can be achieved!
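
The minimum-risk selection can be sketched in a few lines of Python. The loss matrix below, including a third "reject" action with a constant cost of 0.25, is a made-up example and not taken from the slides; rows index actions and columns index states of nature.

```python
def conditional_risks(loss, post):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x), one value per action."""
    return [sum(l * p for l, p in zip(row, post)) for row in loss]

def min_risk_action(loss, post):
    """Return (index of the action with minimum conditional risk, that risk)."""
    risks = conditional_risks(loss, post)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks[best]

# Hypothetical losses: actions 0 and 1 decide class 0 and class 1 (0-1 loss),
# action 2 rejects the pattern at a fixed cost of 0.25.
loss = [
    [0.0, 1.0],
    [1.0, 0.0],
    [0.25, 0.25],
]

confident = min_risk_action(loss, [0.9, 0.1])   # posterior clearly favors class 0
ambiguous = min_risk_action(loss, [0.6, 0.4])   # posteriors are close
```

With these assumed losses, rejection becomes the minimum-risk action whenever the larger posterior is below 0.75, which illustrates why allowing actions other than classification can lower the risk.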
Two-category classification
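$\alpha_1$: deciding $\omega_1$
$\alpha_2$: deciding $\omega_2$

$\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$:
loss incurred for deciding $\omega_i$ when the true state of nature is $\omega_j$

Conditional risk:

$R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$
$R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$

Bayes decision rule is stated as:
if $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$,
take action $\alpha_1$: decide $\omega_1$

This results in the equivalent rule:
decide $\omega_1$ if:

$(\lambda_{21} - \lambda_{11})\, p(x \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(x \mid \omega_2)\, P(\omega_2)$

and decide $\omega_2$ otherwise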

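Likelihood ratio:

The preceding rule is equivalent to the following rule (assuming $\lambda_{21} > \lambda_{11}$):

if $\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)}$

then take action $\alpha_1$ (decide $\omega_1$); otherwise take action $\alpha_2$
(decide $\omega_2$)

Note that the posterior probabilities are scaled by the loss
differences.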

Interpretation of the Bayes decision rule:

If the likelihood ratio of class $\omega_1$ and class $\omega_2$
exceeds a threshold value (that is independent of
the input pattern x), the optimal action is to decide
$\omega_1$

Maximum likelihood decision rule: the threshold
value is 1; 0-1 loss function and equal class prior
probability
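
The threshold test can be sketched as follows. The function names and the numeric likelihood values are hypothetical; the loss matrix is indexed as loss[i][j] for deciding class i+1 when the true class is j+1, and the sketch assumes that a wrong decision for class 1 costs more than a correct one.

```python
def likelihood_ratio_threshold(loss, prior1, prior2):
    """Threshold (l12 - l22) / (l21 - l11) * P(w2) / P(w1); it does not depend on x."""
    (l11, l12), (l21, l22) = loss
    if l21 <= l11:
        raise ValueError("sketch assumes l21 > l11")
    return (l12 - l22) / (l21 - l11) * (prior2 / prior1)

def decide(lik1, lik2, loss, prior1, prior2):
    """Return 1 if p(x|w1) / p(x|w2) exceeds the threshold, else 2."""
    theta = likelihood_ratio_threshold(loss, prior1, prior2)
    # Compare lik1 > theta * lik2 to avoid dividing by a zero likelihood.
    return 1 if lik1 > theta * lik2 else 2

zero_one = [[0.0, 1.0], [1.0, 0.0]]
asymmetric = [[0.0, 2.0], [1.0, 0.0]]  # wrongly deciding class 1 costs twice as much

theta_ml = likelihood_ratio_threshold(zero_one, 0.5, 0.5)
theta_asym = likelihood_ratio_threshold(asymmetric, 0.5, 0.5)
```

With 0-1 loss and equal priors the threshold is 1, which recovers the maximum likelihood rule; doubling the cost of wrongly deciding class 1 doubles the threshold.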


Bayesian Decision Theory
(Sections 2.3-2.5)
Minimum Error Rate Classification

Classifiers, Discriminant Functions and Decision Surfaces

The Normal Density

Minimum Error Rate Classification
Actions are decisions on classes
If action $\alpha_i$ is taken and the true state of nature is $\omega_j$, then
the decision is correct if $i = j$ and in error if $i \neq j$


Seek a decision rule that minimizes the probability
of error or the error rate


Zero-one (0-1) loss function: no loss for correct decision
and a unit loss for any error

$\lambda(\alpha_i, \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}$, i, j = 1, ..., c

The conditional risk can now be simplified as:

$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$

The risk corresponding to the 0-1 loss function is the
average probability of error
Minimizing the risk requires maximizing the
posterior probability $P(\omega_i \mid x)$, since
$R(\alpha_i \mid x) = 1 - P(\omega_i \mid x)$

For minimum error rate:

Decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$
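Decision boundaries and decision regions

Let $\dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)} = \theta_\lambda$; then decide $\omega_1$ if: $\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \theta_\lambda$

If $\lambda$ is the 0-1 loss function then the threshold involves
only the priors:

$\lambda = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, then $\theta_\lambda = \dfrac{P(\omega_2)}{P(\omega_1)} = \theta_a$

if $\lambda = \begin{pmatrix} 0 & 2 \\ 1 & 0 \end{pmatrix}$, then $\theta_\lambda = \dfrac{2 P(\omega_2)}{P(\omega_1)} = \theta_b$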
Classifiers, Discriminant Functions
and Decision Surfaces
Many different ways to represent pattern
classifiers; one of the most useful is in terms of
discriminant functions
The multi-category case

Set of discriminant functions $g_i(x)$, i = 1, ..., c

Classifier assigns a feature vector x to class $\omega_i$ if:
$g_i(x) > g_j(x)$ for all $j \neq i$
Network Representation of a Classifier
Bayes classifier can be represented in this way, but
the choice of discriminant function is not unique
$g_i(x) = -R(\alpha_i \mid x)$
(max. discriminant corresponds to min. risk!)

For the minimum error rate, we take
$g_i(x) = P(\omega_i \mid x)$

(max. discriminant corresponds to max. posterior!)
$g_i(x) \equiv p(x \mid \omega_i)\, P(\omega_i)$
$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$
(ln: natural logarithm!)

Effect of any decision rule is to divide the feature
space into c decision regions
if $g_i(x) > g_j(x)$ for all $j \neq i$, then x is in $R_i$
(Region $R_i$ means assign x to $\omega_i$)


The two-category case
Here a classifier is a dichotomizer that has two
discriminant functions $g_1$ and $g_2$

Let $g(x) \equiv g_1(x) - g_2(x)$

Decide $\omega_1$ if $g(x) > 0$; otherwise decide $\omega_2$

So, a dichotomizer computes a single
discriminant function g(x) and classifies x
according to whether g(x) is positive or
not.

Computation of $g(x) = g_1(x) - g_2(x)$; two equivalent choices:

$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$

$g(x) = \ln \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \dfrac{P(\omega_1)}{P(\omega_2)}$
The Normal Density
Univariate density: $N(\mu, \sigma^2)$

Normal density is analytically tractable
Continuous density
A number of processes are asymptotically Gaussian
Patterns (e.g., handwritten characters, speech signals) can be
viewed as randomly corrupted versions of a single typical or
prototype pattern (Central Limit theorem)

$p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\dfrac{1}{2} \left( \dfrac{x - \mu}{\sigma} \right)^2 \right]$,

where:
$\mu$ = mean (or expected value) of x
$\sigma^2$ = variance (or expected squared deviation) of x
Multivariate density: $N(\mu, \Sigma)$

Multivariate normal density in d dimensions:

$p(x) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\dfrac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right]$

where:
$x = (x_1, x_2, \ldots, x_d)^t$ (t stands for the transpose of a vector)
$\mu = (\mu_1, \mu_2, \ldots, \mu_d)^t$ mean vector
$\Sigma$ = d×d covariance matrix
$|\Sigma|$ and $\Sigma^{-1}$ are determinant and inverse of $\Sigma$, respectively
The covariance matrix is always symmetric and positive semidefinite; we
assume $\Sigma$ is positive definite so the determinant of $\Sigma$ is strictly positive
Multivariate normal density is completely specified by [d + d(d+1)/2]
parameters
If variables $x_1$ and $x_2$ are statistically independent then the covariance
of $x_1$ and $x_2$ is zero.
Multivariate Normal density

$r^2 = (x - \mu)^t \Sigma^{-1} (x - \mu)$ (squared Mahalanobis distance from x to $\mu$)
Samples drawn from a normal population tend to fall in a single
cloud or cluster; cluster center is determined by the mean vector
and shape by the covariance matrix

The loci of points of constant density are hyperellipsoids whose
principal axes are the eigenvectors of $\Sigma$
Transformation of Normal Variables
Linear combinations of jointly normally distributed random variables are
normally distributed

Coordinate transformation can convert an arbitrary multivariate normal
distribution into a spherical one
Bayesian Decision Theory
(Sections 2-6 to 2-9)
Discriminant Functions for the Normal Density

Bayes Decision Theory: Discrete Features
Discriminant Functions for the
Normal Density
The minimum error-rate classification can be
achieved by the discriminant function

$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$

In case of multivariate normal densities

$g_i(x) = -\dfrac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \dfrac{d}{2} \ln 2\pi - \dfrac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$


Case 1: $\Sigma_i = \sigma^2 I$

(I is the identity matrix)

Features are statistically independent and each
feature has the same variance

$g_i(x) = w_i^t x + w_{i0}$ (linear discriminant function)

where:

$w_i = \dfrac{\mu_i}{\sigma^2}$; $w_{i0} = -\dfrac{1}{2\sigma^2} \mu_i^t \mu_i + \ln P(\omega_i)$

($w_{i0}$ is called the threshold for the i-th category!)
A classifier that uses linear discriminant functions is called
a linear machine


The decision surfaces for a linear machine are pieces of
hyperplanes defined by the linear equations:

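$g_i(x) = g_j(x)$

The hyperplane separating $R_i$ and $R_j$ passes through the point

$x_0 = \dfrac{1}{2} (\mu_i + \mu_j) - \dfrac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln \dfrac{P(\omega_i)}{P(\omega_j)}\, (\mu_i - \mu_j)$

and is orthogonal to the line linking the means!

If $P(\omega_i) = P(\omega_j)$ then $x_0 = \dfrac{1}{2} (\mu_i + \mu_j)$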
Case 2: $\Sigma_i = \Sigma$ (covariance matrices of all classes
are identical but otherwise arbitrary!)
The hyperplane separating $R_i$ and $R_j$ passes through the point

$x_0 = \dfrac{1}{2} (\mu_i + \mu_j) - \dfrac{\ln [P(\omega_i) / P(\omega_j)]}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)}\, (\mu_i - \mu_j)$

The hyperplane separating $R_i$ and $R_j$ is generally
not orthogonal to the line between the means!
If the priors are equal, to classify a feature vector x, measure the
squared Mahalanobis distance from x to each of
the c means; assign x to the category of the
nearest mean
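
A small Python sketch of the shared-covariance classifier, restricted to two features so that the matrix inverse can be written out by hand. The means, covariance, and function names are illustrative assumptions. Each class is scored with minus one half of the squared Mahalanobis distance plus the log prior, which reduces to the nearest-mean rule when the priors are equal.

```python
import math

def inv2(m):
    """Inverse of a 2x2 positive definite matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    return [[d / det, -b / det], [-c / det, a / det]]

def sq_mahalanobis(x, mean, cov_inv):
    """(x - mean)^t cov_inv (x - mean) for 2-dimensional vectors."""
    dx = [x[0] - mean[0], x[1] - mean[1]]
    return sum(dx[i] * cov_inv[i][j] * dx[j] for i in range(2) for j in range(2))

def classify(x, means, priors, cov):
    """Index of the class maximizing -0.5 * r_i^2 + ln P(w_i) (shared covariance)."""
    cov_inv = inv2(cov)
    scores = [-0.5 * sq_mahalanobis(x, m, cov_inv) + math.log(p)
              for m, p in zip(means, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical two-class problem with a common covariance matrix
cov = [[2.0, 0.5], [0.5, 1.0]]
means = [(0.0, 0.0), (3.0, 3.0)]
near_first = classify((0.5, 0.2), means, [0.5, 0.5], cov)
near_second = classify((2.8, 3.1), means, [0.5, 0.5], cov)
midpoint_unequal = classify((1.5, 1.5), means, [0.1, 0.9], cov)
```

The last call shows the effect of the prior term: the midpoint between the means is equidistant from both, so the class with the larger prior wins.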

Discriminant Functions for 1D Gaussian
Case 3: $\Sigma_i$ = arbitrary

The covariance matrices are different for each category

$g_i(x) = x^t W_i x + w_i^t x + w_{i0}$

where:

$W_i = -\dfrac{1}{2} \Sigma_i^{-1}$
$w_i = \Sigma_i^{-1} \mu_i$
$w_{i0} = -\dfrac{1}{2} \mu_i^t \Sigma_i^{-1} \mu_i - \dfrac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$

In the 2-category case, the decision surfaces are
hyperquadrics that can assume any of the general forms:
hyperplanes, pairs of hyperplanes, hyperspheres,
hyperellipsoids, hyperparaboloids, hyperhyperboloids
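
For intuition, the quadratic discriminant can be written out for a single feature (d = 1), where each covariance matrix is just a variance. The sketch below is an illustration under that assumption, with made-up class parameters; the term -(1/2) ln 2π is dropped because it is the same for every class.

```python
import math

def quadratic_discriminant_1d(x, mean, std, prior):
    """g_i(x) = -(x - mu_i)^2 / (2 sigma_i^2) - ln sigma_i + ln P(w_i) for d = 1."""
    return -((x - mean) ** 2) / (2.0 * std ** 2) - math.log(std) + math.log(prior)

def decide_1d(x, classes):
    """classes is a list of (mean, std, prior); return the index with the largest g_i(x)."""
    scores = [quadratic_discriminant_1d(x, m, s, p) for m, s, p in classes]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical classes with the same mean but different variances
classes = [(0.0, 1.0, 0.5), (0.0, 3.0, 0.5)]
center = decide_1d(0.0, classes)
right_tail = decide_1d(5.0, classes)
left_tail = decide_1d(-5.0, classes)
```

With unequal variances the decision region of the broader class is not connected: it wins in both tails, while the narrower class wins near the common mean.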

Decision Regions for Two-Dimensional Gaussian Data
$x_2 = 3.514 - 1.125\, x_1 + 0.1875\, x_1^2$
Error Probabilities and Integrals
2-class problem
There are two types of errors


Multi-class problem
Simpler to compute the prob. of being correct (more
ways to be wrong than to be right)
Error Probabilities and Integrals
Bayes optimal decision boundary in 1-D case
Error Bounds for Normal Densities
The exact calculation of the error for the
general Gaussian case (case 3) is extremely
difficult
However, in the 2-category case the general
error can be approximated analytically to give
us an upper bound on the error




Error Rate of Linear Discriminant Function (LDF)
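Assume a 2-class problem

$p(x \mid \omega_1) \sim N(\mu_1, \Sigma)$, $p(x \mid \omega_2) \sim N(\mu_2, \Sigma)$
$g_i(x) = \log [p(x \mid \omega_i) P(\omega_i)] = -\dfrac{1}{2} (x - \mu_i)^t \Sigma^{-1} (x - \mu_i) + \log P(\omega_i)$ (up to a constant common to both classes)

Due to the symmetry of the problem (identical
$\Sigma$), the two types of errors are identical
Decide $x \in \omega_1$ if $g_1(x) > g_2(x)$, or

$-\dfrac{1}{2} (x - \mu_1)^t \Sigma^{-1} (x - \mu_1) + \log P(\omega_1) > -\dfrac{1}{2} (x - \mu_2)^t \Sigma^{-1} (x - \mu_2) + \log P(\omega_2)$

or

$(\mu_2 - \mu_1)^t \Sigma^{-1} x + \dfrac{1}{2} \left( \mu_1^t \Sigma^{-1} \mu_1 - \mu_2^t \Sigma^{-1} \mu_2 \right) < \log [P(\omega_1) / P(\omega_2)]$

Let $h(x) = (\mu_2 - \mu_1)^t \Sigma^{-1} x + \dfrac{1}{2} \left( \mu_1^t \Sigma^{-1} \mu_1 - \mu_2^t \Sigma^{-1} \mu_2 \right)$
Compute expected values & variances of h(x) when $x \in \omega_1$ & $x \in \omega_2$

$\eta_1 = E[h(x) \mid x \in \omega_1] = (\mu_2 - \mu_1)^t \Sigma^{-1} \mu_1 + \dfrac{1}{2} \left( \mu_1^t \Sigma^{-1} \mu_1 - \mu_2^t \Sigma^{-1} \mu_2 \right) = -\dfrac{1}{2} (\mu_2 - \mu_1)^t \Sigma^{-1} (\mu_2 - \mu_1) = -\eta$

where
$(\mu_2 - \mu_1)^t \Sigma^{-1} (\mu_2 - \mu_1)$ = squared Mahalanobis distance between $\mu_1$ & $\mu_2$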
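Similarly

$\eta_2 = E[h(x) \mid x \in \omega_2] = +\dfrac{1}{2} (\mu_2 - \mu_1)^t \Sigma^{-1} (\mu_2 - \mu_1) = +\eta$

$\sigma_1^2 = E[(h(x) - \eta_1)^2 \mid x \in \omega_1] = E[\{(\mu_2 - \mu_1)^t \Sigma^{-1} (x - \mu_1)\}^2 \mid x \in \omega_1] = (\mu_2 - \mu_1)^t \Sigma^{-1} (\mu_2 - \mu_1) = 2\eta$

$\sigma_2^2 = 2\eta$

$p(h(x) \mid x \in \omega_1) \sim N(-\eta, 2\eta)$
$p(h(x) \mid x \in \omega_2) \sim N(+\eta, 2\eta)$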
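$\varepsilon_1 = P[g_1(x) < g_2(x) \mid x \in \omega_1] = \int_t^{\infty} p(h \mid x \in \omega_1)\, dh$
$= \int_t^{\infty} \dfrac{1}{\sqrt{2\pi}\,\sqrt{2\eta}} \exp\left[ -\dfrac{(h + \eta)^2}{4\eta} \right] dh = \dfrac{1}{\sqrt{2\pi}} \int_{(t + \eta)/\sqrt{2\eta}}^{\infty} e^{-\xi^2/2}\, d\xi$
$= \dfrac{1}{2} - \dfrac{1}{2} \operatorname{erf}\left[ \dfrac{t + \eta}{\sqrt{4\eta}} \right]$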
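$t = \log \dfrac{P(\omega_1)}{P(\omega_2)}$

$\operatorname{erf}(r) = \dfrac{2}{\sqrt{\pi}} \int_0^r e^{-x^2}\, dx$

$\varepsilon_2 = \dfrac{1}{2} - \dfrac{1}{2} \operatorname{erf}\left[ \dfrac{\eta - t}{\sqrt{4\eta}} \right]$

Total probability of error
$P_e = P(\omega_1)\, \varepsilon_1 + P(\omega_2)\, \varepsilon_2$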
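$P(\omega_1) = P(\omega_2) = \dfrac{1}{2} \Rightarrow t = 0$

$\varepsilon_1 = \varepsilon_2 = \dfrac{1}{2} - \dfrac{1}{2} \operatorname{erf}\left[ \dfrac{\eta}{\sqrt{4\eta}} \right] = \dfrac{1}{2} - \dfrac{1}{2} \operatorname{erf}\left[ \dfrac{\sqrt{(\mu_1 - \mu_2)^t \Sigma^{-1} (\mu_1 - \mu_2)}}{2\sqrt{2}} \right]$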
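Mahalanobis distance is a good measure of separation between classes
(i) No Class Separation
$(\mu_1 - \mu_2)^t \Sigma^{-1} (\mu_1 - \mu_2) = 0 \Rightarrow \varepsilon_1 = \varepsilon_2 = \dfrac{1}{2}$
(ii) Perfect Class Separation
$(\mu_1 - \mu_2)^t \Sigma^{-1} (\mu_1 - \mu_2) \to \infty \Rightarrow \varepsilon_1 = \varepsilon_2 \to 0$ ($\operatorname{erf} \to 1$)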
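
The closed-form error rates of the linear discriminant can be evaluated directly with the standard library. The sketch below takes the squared Mahalanobis distance between the two means as input and uses eta = (1/2) r^2 and t = log[P(w1)/P(w2)]; the function name and the numeric example are assumptions for illustration.

```python
import math

def ldf_error_rates(sq_mahalanobis, prior1, prior2):
    """Return (eps1, eps2, total error) for two Gaussian classes with a shared covariance.

    eps1 = 1/2 - 1/2 erf[(t + eta) / sqrt(4 eta)]
    eps2 = 1/2 - 1/2 erf[(eta - t) / sqrt(4 eta)]
    """
    if sq_mahalanobis <= 0:
        raise ValueError("sketch assumes distinct means (positive distance)")
    eta = 0.5 * sq_mahalanobis
    t = math.log(prior1 / prior2)
    scale = math.sqrt(4.0 * eta)
    eps1 = 0.5 - 0.5 * math.erf((t + eta) / scale)
    eps2 = 0.5 - 0.5 * math.erf((eta - t) / scale)
    return eps1, eps2, prior1 * eps1 + prior2 * eps2

# Means two standard deviations apart (r^2 = 4), equal priors
eps1, eps2, total = ldf_error_rates(4.0, 0.5, 0.5)
```

For this example the boundary lies one standard deviation from each mean, so each error equals the upper tail of a standard normal beyond 1, roughly 0.159. As the distance grows the errors shrink toward 0, and as it shrinks toward 0 they approach 1/2.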


Chernoff Bound
To derive a bound for the error, we need the
following inequality

$\min[a, b] \le a^{\beta}\, b^{1-\beta}$ for $a, b \ge 0$ and $0 \le \beta \le 1$

$P(\text{error}) \le P^{\beta}(\omega_1)\, P^{1-\beta}(\omega_2) \int p^{\beta}(x \mid \omega_1)\, p^{1-\beta}(x \mid \omega_2)\, dx$, $0 \le \beta \le 1$

Assume conditional prob. are normal

$\int p^{\beta}(x \mid \omega_1)\, p^{1-\beta}(x \mid \omega_2)\, dx = e^{-k(\beta)}$

where

$k(\beta) = \dfrac{\beta (1 - \beta)}{2} (\mu_2 - \mu_1)^t [(1 - \beta) \Sigma_1 + \beta \Sigma_2]^{-1} (\mu_2 - \mu_1) + \dfrac{1}{2} \ln \dfrac{|(1 - \beta) \Sigma_1 + \beta \Sigma_2|}{|\Sigma_1|^{1-\beta}\, |\Sigma_2|^{\beta}}$

Chernoff bound for P(error) is found by determining the
value of $\beta$ that minimizes $\exp(-k(\beta))$
Error Bounds for Normal Densities
Bhattacharyya Bound
Assume $\beta = 1/2$
computationally simpler
slightly less tight bound
Now, Eq. (73) has the form

$P(\text{error}) \le \sqrt{P(\omega_1) P(\omega_2)} \int \sqrt{p(x \mid \omega_1)\, p(x \mid \omega_2)}\, dx = \sqrt{P(\omega_1) P(\omega_2)}\; e^{-k(1/2)}$

$k(1/2) = \dfrac{1}{8} (\mu_2 - \mu_1)^t \left[ \dfrac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (\mu_2 - \mu_1) + \dfrac{1}{2} \ln \dfrac{\left| \dfrac{\Sigma_1 + \Sigma_2}{2} \right|}{\sqrt{|\Sigma_1|\, |\Sigma_2|}}$

When the two covariance matrices are equal, k(1/2) reduces to
1/8 of the squared Mahalanobis distance between the two means
Error Bounds for Gaussian Distributions
Chernoff Bound
Bhattacharyya Bound ($\beta = 1/2$)
2-category, 2D data
True error using numerical integration = 0.0021
Best Chernoff error bound is 0.008190
Bhattacharyya error bound is 0.008191
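
A one-dimensional sketch of the two bounds, written with the standard library only. It assumes univariate normal class-conditionals, so each covariance matrix is a variance, and it uses the form of k(β) that follows from integrating p^β(x|ω1) p^(1-β)(x|ω2) directly, in which class 1's variance is weighted by (1 - β). The parameter values and the simple grid search are illustrative choices, not the procedure used for the numbers quoted on the slide.

```python
import math

def chernoff_k(beta, m1, var1, m2, var2):
    """k(beta) for univariate normals N(m1, var1) and N(m2, var2)."""
    s = (1.0 - beta) * var1 + beta * var2
    quad = beta * (1.0 - beta) / 2.0 * (m2 - m1) ** 2 / s
    log_det = 0.5 * math.log(s / (var1 ** (1.0 - beta) * var2 ** beta))
    return quad + log_det

def chernoff_bound(beta, p1, p2, m1, var1, m2, var2):
    """P(error) <= P(w1)^beta * P(w2)^(1 - beta) * exp(-k(beta))."""
    return p1 ** beta * p2 ** (1.0 - beta) * math.exp(-chernoff_k(beta, m1, var1, m2, var2))

def best_chernoff(p1, p2, m1, var1, m2, var2, steps=1000):
    """Grid search over beta in (0, 1); returns (tightest bound, beta)."""
    betas = [i / steps for i in range(1, steps)]
    return min((chernoff_bound(b, p1, p2, m1, var1, m2, var2), b) for b in betas)

# Hypothetical problem: N(0, 1) versus N(2, 4), equal priors
args = (0.5, 0.5, 0.0, 1.0, 2.0, 4.0)
bhattacharyya = chernoff_bound(0.5, *args)
chernoff, beta_star = best_chernoff(*args)
```

Because β = 1/2 is one of the grid points, the Chernoff value found this way is never larger than the Bhattacharyya value, which mirrors the "slightly less tight" remark above.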

Neyman-Pearson Rule
Classification, Estimation and Pattern recognition by Young and Calvert
Signal Detection Theory
We are interested in detecting a single weak pulse,
e.g. radar reflection; the internal signal (x) in detector
has mean $\mu_1$ ($\mu_2$) when pulse is absent (present)

$p(x \mid \omega_1) \sim N(\mu_1, \sigma^2)$
$p(x \mid \omega_2) \sim N(\mu_2, \sigma^2)$

Discriminability: ease of determining
whether the pulse is present or not

$d' = \dfrac{|\mu_2 - \mu_1|}{\sigma}$

The detector uses a threshold
x* to determine the presence of pulse
For given threshold, define hit,
false alarm, miss and correct
rejection:
$P(x > x^* \mid x \in \omega_2)$: hit
$P(x > x^* \mid x \in \omega_1)$: false alarm
$P(x < x^* \mid x \in \omega_2)$: miss
$P(x < x^* \mid x \in \omega_1)$: correct rejection
Receiver Operating Characteristic
(ROC)
Experimentally compute hit and false alarm rates for
fixed x*
Changing x* will change the hit and false alarm rates
A plot of hit and false alarm rates is called the ROC
curve
Performance
shown at different
operating points
Operating Characteristic
In practice, distributions may not be Gaussian
and will be multidimensional; ROC curve can still
be plotted
Vary a single control parameter for the decision
rule and plot the resulting hit and false alarm rates
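
As a sketch of how the ROC points arise, the following Python code computes hit and false alarm rates for the Gaussian pulse-detection model by sweeping the threshold x*. It uses `statistics.NormalDist` from the standard library (Python 3.8+); the means, the standard deviation, and the number of thresholds are arbitrary illustrative choices.

```python
from statistics import NormalDist

def d_prime(mu1, mu2, sigma):
    """Discriminability d' = |mu2 - mu1| / sigma."""
    return abs(mu2 - mu1) / sigma

def operating_point(x_star, mu1, mu2, sigma):
    """Return (false alarm rate, hit rate) for threshold x_star."""
    absent = NormalDist(mu1, sigma)    # pulse absent
    present = NormalDist(mu2, sigma)   # pulse present
    false_alarm = 1.0 - absent.cdf(x_star)   # P(x > x* | absent)
    hit = 1.0 - present.cdf(x_star)          # P(x > x* | present)
    return false_alarm, hit

def roc_curve(mu1, mu2, sigma, n=50):
    """Operating points for n thresholds spanning both distributions."""
    lo = min(mu1, mu2) - 4.0 * sigma
    hi = max(mu1, mu2) + 4.0 * sigma
    thresholds = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return [operating_point(t, mu1, mu2, sigma) for t in thresholds]

# Hypothetical detector: mean 0 without the pulse, mean 2 with it, sigma = 1
fa, hit = operating_point(1.0, 0.0, 2.0, 1.0)
curve = roc_curve(0.0, 2.0, 1.0)
```

Raising the threshold lowers both rates together, which is why a single detector traces out a curve rather than a point; a larger d' pushes the curve toward the corner with hit rate 1 and false alarm rate 0.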
Bayes Decision Theory: Discrete Features
Components of x are binary or integer valued; x can
take only one of m discrete values
$v_1, v_2, \ldots, v_m$

Case of independent binary features for 2-category
problem
Let $x = [x_1, x_2, \ldots, x_d]^t$ where each $x_i$ is either 0 or 1, with
probabilities:

$p_i = P(x_i = 1 \mid \omega_1)$
$q_i = P(x_i = 1 \mid \omega_2)$

The discriminant function in this case is:

$g(x) = \sum_{i=1}^{d} w_i x_i + w_0$

where:

$w_i = \ln \dfrac{p_i (1 - q_i)}{q_i (1 - p_i)}$, i = 1, ..., d

and:

$w_0 = \sum_{i=1}^{d} \ln \dfrac{1 - p_i}{1 - q_i} + \ln \dfrac{P(\omega_1)}{P(\omega_2)}$

decide $\omega_1$ if $g(x) > 0$ and $\omega_2$ if $g(x) \le 0$
Bayesian Decision for Three-dimensional
Binary Data
Decision boundary for 3D binary features. Left figure shows the case when $p_i$ = 0.8 and
$q_i$ = 0.5. Right figure shows the case when $p_3 = q_3$ (feature 3 is not providing any
discriminatory information), so the decision surface is parallel to the $x_3$ axis
Consider a 2-class problem with three independent binary
features; class priors are equal and $p_i$ = 0.8 and $q_i$ = 0.5, i = 1, 2, 3
$w_i = \ln[(0.8 \times 0.5) / (0.5 \times 0.2)] = 1.3863$
$w_0 = 3 \ln(0.2 / 0.5) = -2.75$
Decision surface g(x) = 0 is shown below
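
The weights of this example can be checked numerically. The sketch below evaluates w_i = ln[p_i(1 - q_i) / (q_i(1 - p_i))] and w_0 = Σ ln[(1 - p_i)/(1 - q_i)] + ln[P(ω1)/P(ω2)] for independent binary features; the function names are hypothetical. Evaluated this way, the arithmetic gives w_i of about 1.386 and w_0 of about -2.75 for p_i = 0.8, q_i = 0.5 and equal priors.

```python
import math

def binary_discriminant_weights(p, q, prior1, prior2):
    """Return (w, w0) of the linear discriminant for independent binary features."""
    w = [math.log(pi * (1.0 - qi) / (qi * (1.0 - pi))) for pi, qi in zip(p, q)]
    w0 = sum(math.log((1.0 - pi) / (1.0 - qi)) for pi, qi in zip(p, q))
    w0 += math.log(prior1 / prior2)
    return w, w0

def g(x, w, w0):
    """g(x) = sum_i w_i x_i + w0; decide class 1 if g(x) > 0, else class 2."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# Three features with p_i = 0.8, q_i = 0.5 and equal priors
w, w0 = binary_discriminant_weights([0.8] * 3, [0.5] * 3, 0.5, 0.5)
two_ones = g([1, 1, 0], w, w0)
one_one = g([1, 0, 0], w, w0)
```

With these values a pattern needs at least two features equal to 1 to be assigned to class 1: g is slightly positive for two ones and negative for a single one.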

Handling Missing Features
Suppose it is not possible to measure a certain
feature for a given pattern
Possible solutions:
Reject the pattern
Approximate the missing feature
Mean of all the available values for the missing feature
Marginalize over the distribution of the missing feature




Other Topics
Compound Bayes Decision Theory & Context
Consecutive states of nature might not be statistically independent; in
sorting two types of fish, arrival of next fish may not be independent of
the previous fish
Can we exploit such statistical dependence to gain improved
performance (use of context)
Compound decision vs. sequential compound decision problems
Markov dependence
Sequential Decision Making
Feature measurement process is sequential (as in medical diagnosis)
Feature measurement cost
Minimize the no. of features to be measured while achieving a
sufficient accuracy; minimize a combination of feature measurement
cost & classification accuracy
Context in Text Recognition
