
Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.438 Algorithms for Inference


Fall 2014

Course Overview

This course is about performing inference in complex engineering settings, providing a mathematical take on an engineering subject. While driven by applications, the course is not about these applications. Rather, it primarily provides a common foundation for inference-related questions arising in various fields, including but not limited to machine learning, signal processing, artificial intelligence, computer vision, control, and communication. Our focus is on the general problem of inference: learning about the generically hidden state of the world that we care about from available observations. This course is part of a two-course sequence, 6.437 and 6.438:

6.437 Inference and Information is about fundamental principles and concepts of inference. It is offered in the spring.

6.438 Algorithms for Inference is about fundamental structures and algorithms for inference. It is offered (now) in the fall.

A key theme is that constraints imposed on system design are what make engineering challenging and interesting. To this end, 6.438 Algorithms for Inference is about recognizing structure in inference problems that leads to efficient algorithms.

1.1 Examples

To motivate the course, we begin with some highly successful applications of graphical models and simple, efficient inference algorithms.
1. Navigation. Consider control and navigation of spacecraft (e.g., lunar landings, guidance of shuttles) with noisy observations from various sensors, depicted in Figure 1. Abstractly, these scenarios can be viewed as a setup where noisy observations of a hidden state are obtained. Accurately inferring the hidden state allows us to control the spacecraft and achieve a desired trajectory. Formally, such scenarios are well modeled by the undirected Gaussian graphical model shown in Figure 2. An efficient inference algorithm for this graphical model is the Kalman filter, developed in the early 1960s.
[Figure 1: Navigation feedback control in the presence of noisy observations: noise enters the spacecraft dynamics, sensors feed an estimate of the state to the controller, which closes the loop.]

[Figure 2: Undirected Gaussian graphical model for a linear dynamical system, used to model control and navigation of spacecraft: a chain of states s1 - s2 - s3 - s4, each with an observation y1, y2, y3, y4. Efficient inference algorithm: Kalman filtering.]
2. Error correcting codes. In a communication system, we want to send a message, encoded as k information bits, over a noisy channel. Because the channel may corrupt the message, we add redundancy by encoding each message as an n-bit code sequence or codeword, where n > k. The receiver's task is to recover, or decode, the original message bits from the received, corrupted codeword bits. A diagram of this system is shown in Figure 3. The decoding task is an inference problem. The structure of the coded bits plays an important role in inference.
[Figure 3: Simple digital communication system: k info bits → coder → n code bits → channel (corruption) → decoder → recovered codeword → recovered k info bits.]


An example of a scheme to encode message bits as codeword bits is the Hamming (7, 4) code, represented as a graphical model in Figure 4. Here, each 4-bit message is first mapped to a 7-bit codeword before transmission over the noisy channel. Note that there are 2^4 = 16 different 7-bit codewords (among 2^7 = 128 possible 7-bit sequences), each codeword corresponding to one of the 16 possible 4-bit messages. The 16 possible codewords can be described by means of constraints on the codeword bits. These constraints are represented via the graphical model in Figure 4, an example of a factor graph. Note that this code requires 7 bits to be transmitted to convey 4 bits, so the effective code rate is 4/7.
The task of the receiver, as explained above, is to decode the 4-bit message that was sent based on the observations of the 7 received, possibly corrupted coded bits. The goal of the decoding algorithm is to infer the 4 message bits using the constraints in the Hamming code and some noise model for the channel. Given the graphical model representation, efficient decoding can be done using the loopy belief propagation algorithm, even for large codeword lengths (e.g., one million transmitted bits rather than 7).

[Figure 4: Factor graph representing the Hamming (7, 4) code over codeword bits x1, . . . , x7. Efficient inference algorithm: loopy belief propagation.]
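As a minimal sketch of decoding-as-inference (not the loopy belief propagation algorithm mentioned above), the following Python snippet enumerates the 16 codewords of one common Hamming (7, 4) parity-check matrix and decodes a received word by picking the closest codeword in Hamming distance, which is maximum-likelihood (equivalently MAP under a uniform prior on messages) for a binary symmetric channel with crossover probability below 1/2. The particular matrix H below is an assumed convention, not taken from these notes.

    import itertools
    import numpy as np

    # One common parity-check matrix for the Hamming (7,4) code (an assumed convention).
    H = np.array([[1, 0, 1, 0, 1, 0, 1],
                  [0, 1, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1]])

    # The 2^4 = 16 codewords are exactly the 7-bit words c satisfying H c = 0 (mod 2).
    codewords = [np.array(c) for c in itertools.product([0, 1], repeat=7)
                 if not np.any(H.dot(c) % 2)]
    assert len(codewords) == 16

    def decode(received):
        # ML decoding over a binary symmetric channel: pick the codeword
        # closest to the received word in Hamming distance.
        return min(codewords, key=lambda c: np.sum(c != received))

    # Example: flip one bit of a codeword; the decoder recovers it.
    sent = codewords[5]
    received = sent.copy()
    received[2] ^= 1
    print("corrected:", np.array_equal(decode(received), sent))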
3. Voice recognition. Consider the task of recognizing the voice of specific individuals based on an audio signal. One way to do this is to take the audio input, segment it into 10 ms (or similarly small) time intervals, and then capture an appropriate signature or structure (in the form of a frequency response) for each of these time segments (the so-called cepstral coefficient vector, or the features). Speech has structure that is captured through correlation in time, i.e., what one says now and soon after are correlated. A succinct way to represent this correlation is via an undirected graphical model called a hidden Markov model, shown in Figure 5. This model is similar to the graphical model considered for the navigation example earlier. The difference is that the hidden states in this speech example are discrete rather than continuous. The Viterbi algorithm provides an efficient, message-passing implementation for inference in such a setting.

[Figure 5: Hidden Markov model for voice recognition: a chain of hidden states x1 - x2 - x3 - x4, each with an observation y1, y2, y3, y4. Efficient inference algorithm: Viterbi.]

[Figure 6: Markov random field for image processing: a nearest-neighbor grid over pixels x1, x2, x3, x4. Efficient inference algorithm: loopy belief propagation.]
4. Computer vision/image processing. Suppose we have a low-resolution image and want to come up with a higher-resolution version of it. This is an inference problem: we wish to predict or infer from a low-resolution image (observation) how a high-resolution image may appear (inference). Similarly, if an image is corrupted by noise, then denoising the image can be viewed as an inference problem. The key observation here is that images have structural properties. Specifically, in many real-world images, nearby pixels look similar. Such a similarity constraint can be captured by a nearest-neighbor graphical model of the image. This is also known as a Markov random field (MRF). An example of an MRF is shown in Figure 6. The loopy belief propagation algorithm provides an efficient inference solution for such scenarios.

1.2 Inference, Complexity, and Graphs

Here we provide the key topics that will be the focus of this course: inference problems of interest in graphical models and the role played by the structure of graphical models in designing inference algorithms.
Inference problems. Consider a collection of random variables x = (x1, . . . , xN) and let the observations about them be represented by random variables y = (y1, . . . , yN). Let each of these random variables xi, 1 ≤ i ≤ N, take on a value in X (e.g., X = {0, 1} or X = R), and each observation variable yi, 1 ≤ i ≤ N, take on a value in Y. Given an observation y = y, the goal is to say something meaningful about possible realizations x = x for x ∈ X^N. There are two primary computational problems of interest:

1. Calculating (posterior) beliefs:

   px|y(x|y) = px,y(x, y) / py(y)
             = px,y(x, y) / Σ_{x' ∈ X^N} px,y(x', y).        (1)

The denominator in equation (1) is also called the partition function. In general, computing the partition function is expensive. This computation is also called the marginalization operation. For example, suppose X is a discrete set. Then

   px1(x1) = Σ_{x2 ∈ X} px1,x2(x1, x2).        (2)

But this requires |X| operations for a fixed x1. There are |X| possible values of x1, so overall the number of operations needed scales as |X|^2. In general, if we are thinking of N variables, then this starts scaling like |X|^N. This is not surprising since, without any additional structure, a distribution over N variables with each variable taking values in X requires storing a table of size |X|^N, where each entry contains the probability of a particular realization. So without structure, computing the posterior has complexity exponential in the number of variables N.
2. Calculating the most probable configuration, also called the maximum a posteriori (MAP) estimate:

   arg max_{x ∈ X^N} px|y(x|y).        (3)

Since the denominator py(y) does not depend on x,

   arg max_{x ∈ X^N} px|y(x|y) = arg max_{x ∈ X^N} px,y(x, y) / py(y)
                               = arg max_{x ∈ X^N} px,y(x, y).        (4)

Without any additional structure, the above optimization problem requires searching over the entire space X^N, resulting in complexity exponential in the number of variables N.
Structure. Suppose now that the N random variables from before are independent, meaning that we have the following factorization:

   px1,x2,...,xN(x1, x2, . . . , xN) = px1(x1) px2(x2) · · · pxN(xN).        (5)

Then posterior belief calculation can be done separately for each variable. Computing the posterior belief of a particular variable has complexity |X|. Similarly, MAP estimation can be done by finding each variable's assignment that maximizes its own probability. As there are N variables, the computational complexity of MAP estimation is thus N|X|.
Thus, independence, or some form of factorization, enables efficient computation of both posterior beliefs (marginalization) and MAP estimation. By exploiting factorizations of joint probability distributions and representing these factorizations via graphical models, we will achieve huge computational efficiency gains.
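To make the counting argument concrete, here is a small numpy sketch (the alphabet size, number of variables, and random marginals below are arbitrary illustrative choices) contrasting marginalization on a full joint table, which touches |X|^N entries, with the per-variable computation available when the distribution factorizes as in (5).

    import numpy as np

    X, N = 4, 6                      # alphabet size |X| and number of variables (toy values)
    rng = np.random.default_rng(0)

    # Independent case: one length-|X| marginal per variable, N*|X| numbers in total.
    marginals = rng.random((N, X))
    marginals /= marginals.sum(axis=1, keepdims=True)

    # Without structure we would need the full joint table with |X|^N entries.
    joint = np.ones([X] * N)
    for i in range(N):
        shape = [1] * N
        shape[i] = X
        joint = joint * marginals[i].reshape(shape)   # builds the product distribution

    # Marginal of x1: brute force sums |X|^N entries ...
    brute = joint.sum(axis=tuple(range(1, N)))
    # ... while under independence it is just the stored length-|X| vector.
    print(np.allclose(brute, marginals[0]))           # True

    # MAP estimate under independence: maximize each variable separately, N*|X| work.
    map_estimate = marginals.argmax(axis=1)
    print(map_estimate)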
Graphical models. In this course, we will examine the following types of graphical models:
1. Directed acyclic graphs (DAGs), also called Bayesian networks
2. Undirected graphs, also called Markov random fields
3. Factor graphs, which more explicitly capture algebraic constraints
Recall that a graph contains a collection of nodes and edges. Each edge, in the directed case, has an orientation. The nodes represent variables, and edges represent constraints between variables. Later on, we will see that for some inference algorithms, we can also think of each node as a processor and each edge as a communication link.


Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.438 Algorithms For Inference


Fall 2014

Directed Graphical Models

Today we develop the first class of graphical models in the course: directed graphical models. A directed graphical model defines a family of joint probability distributions over a set of random variables. For example, suppose we are told that two random variables x and y are independent. This characterizes a family of joint distributions which all satisfy px,y(x, y) = px(x) py(y). Directed graphs define families of probability distributions through similar factorization properties.
A directed graphical model G consists of nodes V (representing random variables) and directed edges (arrows) E ⊆ V × V. (The notation (i, j) ∈ E means that there is a directed edge from i to j.)
Directed graphs define families of distributions which factor into functions of nodes and their parents. In particular, we assign to each node i a random variable xi and a non-negative-valued function fi(xi, xπi) such that

   Σ_{xi ∈ X} fi(xi, xπi) = 1,        (1)

   Π_i fi(xi, xπi) = p(x1, . . . , xn),        (2)

where πi denotes the set of parents of node i. Assuming the graph is acyclic (has no directed cycles), we must have fi(xi, xπi) = pxi|xπi(xi|xπi), i.e., fi(·, ·) represents the conditional probability distribution of xi conditioned on its parents. If the graph does have a cycle (e.g., Figure 1), then there is no consistent way to assign conditional probability distributions around the cycle. Therefore, we assume that all directed graphical models are directed acyclic graphs (DAGs).
In general, by the chain rule, any joint distribution of any n random variables (x1, . . . , xn) can be written as

   px1,...,xn(x1, . . . , xn) = px1(x1) px2|x1(x2|x1) · · · pxn|x1,...,xn−1(xn|x1, . . . , xn−1).        (3)

By treating each of these terms as one of the functions fi, we observe that the distribution obeys the graph structure shown in Figure 2. This shows that DAGs are universal, in the sense that any distribution can be represented by a DAG. Of particular interest are sparse graphs, i.e., graphs where the number of edges is much smaller than the number of pairs of random variables. Such graphs can lead to efficient inference.
In general, the graph structure plays a key role in determining the size of the representation. For instance, we saw above that a fully connected DAG can represent an arbitrary distribution, and we saw in Lecture 1 that the joint probability table for such a distribution requires |X|^N entries. More generally, the number of parameters required to represent the factorization is of order |X|^{max_i |πi|}, which is dramatically smaller if max_i |πi| ≪ N. Similarly, the graph structure affects the complexity of inference: while inference in a fully connected graph always requires |X|^N time, inference in a sparse graph is often (but not always) much more efficient.
There is a close relationship between conditional independence and factorization of the distribution. We'll first analyze some particular examples from first principles, then look at a more general theory.

[Figure 1: Example of a directed graph with a cycle, which does not correspond to a consistent distribution.]
[Figure 2: A fully connected DAG on x1, x2, x3, x4, . . . is universal, in that it can represent any distribution.]
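As a small concrete sketch of this factorization (the DAG, the CPT values, and the node names below are made up for illustration), the following Python code stores one conditional probability table per node over a binary alphabet, evaluates the joint as the product of p(xi | xπi), and checks that it normalizes; the storage per node is of order |X|^{|πi|+1} rather than |X|^N overall.

    import itertools
    import numpy as np

    # A toy DAG on binary variables: 1 -> 3, 2 -> 3, 3 -> 4 (hypothetical example).
    parents = {1: (), 2: (), 3: (1, 2), 4: (3,)}

    # One CPT per node: cpt[i][parent values] is the distribution over x_i in {0, 1}.
    # These numbers are arbitrary; each row just has to sum to 1.
    cpt = {
        1: {(): np.array([0.6, 0.4])},
        2: {(): np.array([0.3, 0.7])},
        3: {pv: np.array([p, 1 - p])
            for pv, p in zip(itertools.product([0, 1], repeat=2), [0.9, 0.5, 0.4, 0.1])},
        4: {(0,): np.array([0.8, 0.2]), (1,): np.array([0.25, 0.75])},
    }

    def joint(x):
        # x maps node -> value; the joint is the product of f_i(x_i, x_{pi_i}).
        prob = 1.0
        for i, pa in parents.items():
            prob *= cpt[i][tuple(x[j] for j in pa)][x[i]]
        return prob

    # The factorization defines a valid distribution: it sums to 1 over all 2^4 assignments.
    total = sum(joint(dict(zip(parents, vals)))
                for vals in itertools.product([0, 1], repeat=4))
    print(total)   # 1.0 (up to floating point)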

2.1 Examples

Example 1. First, consider the chain graph x → y → z. This graph represents the factorization

   px,y,z = pz|y py|x px.

By matching these terms against the chain rule for general probability distributions,

   px,y,z = pz|y,x py|x px,

we see that pz|y = pz|y,x, i.e., x ⊥ z | y.
Example 2. Now consider the graph x ← y → z, where y is a parent of both x and z. This graph represents the factorization

   px,y,z = pz|y px|y py.

We can match terms similarly to the above example to find that x ⊥ z | y in this graph as well.
Example 3. Now consider the graph x → y ← z. The factorization is

   px,y,z = px py|x,z pz.

By matching terms, we find that x ⊥ z. However, it's no longer true that x ⊥ z | y. Therefore, we see that the direction of the edges matters. The phenomenon captured by this example is known as explaining away. (Suppose we've observed an event which may result from one of two causes. If we then observe one of the causes, this makes the other one less likely, i.e., it explains it away.) The graph structure is called a v-structure.
Example 4. Now consider a fully connected graph on x, y, and z. As we saw before, it can represent any distribution over x, y, and z. That is, we must remove edges in order to introduce independence structure.
Example 5. A graph whose edges form a directed cycle (e.g., x → y → z → x) is not a valid DAG because it contains a cycle.
Example 6. The following graph is obtained by removing the edge z → y from Example 3, leaving x → y with z disconnected. The factorization represented by this graph is

   px,y,z = px py|x pz,

which describes a strict subset of the distributions in Example 3. In general, removing edges increases the number of independencies and decreases the number of distributions the graph can represent.
Example 7. Here is a bigger example with more conditional independencies.

[Figure: a DAG on six nodes x1, . . . , x6.]

As before, we can identify many of them using the factorization properties. In particular, we can read off some conditional independencies by doing the following:
1. Choose a topological ordering of the nodes (i.e., an ordering where any node i comes after all of its parents).
2. Let νi be the set of nodes that are not parents of i (i.e., νi ∩ πi = ∅) but that appear in the topological ordering before i.
3. Then the graph implies the conditional independence xi ⊥ xνi | xπi.
Note that there may be many topological orderings for a graph. With the above procedure, different conditional independencies can be found using different topological orderings. Now, we discuss a simpler and more general procedure for testing conditional independence which does not depend on any particular topological ordering.

2.2 Graph Separation and Conditional Independence

We now introduce the idea of graph separation: testing conditional independence properties by looking for particular kinds of paths in a graph. From Examples 1 and 2 above, we might be tempted to conclude that two variables are dependent if and only if they're connected by a path which isn't blocked by an observed node. However, this criterion fails for Example 3, where x and z are dependent only when the node between them is observed. We can repair this broken intuition, however, by defining a different set of rules for when a path is blocked.
2.2.1 d-separation and Bayes Ball

Let A, B, and C be disjoint subsets of the set of vertices V. To test whether xA and xB are conditionally independent given xC:
1. Shade the nodes in C. Call this the primary shade of a node. In addition, assign a secondary shade to each node as follows:
   All nodes with primary shade also have secondary shade.
   All nodes that are a parent of a node with secondary shade also have secondary shade.
2. Place a ball at each node in A.
3. Let the balls bounce around the graph following the rules shown in Figure 3.
   Remark: Balls do not interact. The shading in rules 1 and 2 refers to primary shading only, while in rule 3 both primary and secondary shading apply.
4. If no ball can reach any node in B, then xA must be conditionally independent of xB given xC.

[Figure 3: Rules for blocking and non-blocking in the Bayes Ball algorithm.]
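Below is a minimal Python sketch of this test written in the standard reachability style (the secondary shading above corresponds to the observed set together with its ancestors, which is what lets balls turn around at v-structures); the DAG at the bottom is a hypothetical example used only to exercise the function, and this is one equivalent implementation rather than the specific rule table of Figure 3.

    from collections import deque

    def d_separated(parents, A, B, C):
        """Return True if x_A is independent of x_B given x_C in the DAG.

        parents: dict node -> set of parents. A, B, C: disjoint sets of nodes.
        """
        children = {v: set() for v in parents}
        for v, ps in parents.items():
            for p in ps:
                children[p].add(v)

        # Secondary shading: C together with all ancestors of C.
        shaded_up = set()
        stack = list(C)
        while stack:
            v = stack.pop()
            if v not in shaded_up:
                shaded_up.add(v)
                stack.extend(parents[v])

        # Bounce balls from A; a state is (node, direction of arrival).
        # 'child' = we arrived from a child (moving up), 'parent' = from a parent (moving down).
        queue = deque((a, 'child') for a in A)
        visited = set()
        while queue:
            node, direction = queue.popleft()
            if (node, direction) in visited:
                continue
            visited.add((node, direction))
            if node in B:
                return False                      # a ball reached B: not blocked
            if direction == 'child' and node not in C:
                # pass through: continue up to parents and down to children
                queue.extend((p, 'child') for p in parents[node])
                queue.extend((c, 'parent') for c in children[node])
            elif direction == 'parent':
                if node not in C:
                    queue.extend((c, 'parent') for c in children[node])
                if node in shaded_up:
                    # v-structure with an observed (descendant) node: ball may turn around
                    queue.extend((p, 'child') for p in parents[node])
        return True

    # Hypothetical example: the v-structure x -> y <- z.
    par = {'x': set(), 'z': set(), 'y': {'x', 'z'}}
    print(d_separated(par, {'x'}, {'z'}, set()))    # True:  x independent of z
    print(d_separated(par, {'x'}, {'z'}, {'y'}))    # False: not independent given y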

2.3 Characterization of DAGs

The following two characterizations are equivalent descriptions of probability distributions:
1. Factorization into a product of conditional probability tables according to the DAG structure
2. Complete list of conditional independencies obtainable by Bayes Ball
Another way of stating this is that the following two lists are equivalent:
1. List all distributions which factorize according to the graph structure.
2. List all possible distributions, and list all the conditional independencies obtainable by Bayes Ball. Discard the distributions which do not satisfy all the conditional independencies.

2.4 Notations/Concepts

A forest is a graph where each node has at most one parent. A connected graph is
one in which there is a path between every pair of nodes. A tree is a connected forest.
A polytree is a singly connected graph. That is, there is at most one path from
any node to any other node. (Note that trees are a special case of polytrees.) Trees
and polytrees will both play an important role in inference.


Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.438 Algorithms For Inference


Fall 2014

Undirected Graphical Models

In this lecture, we discuss undirected graphical models. Recall that directed graphical models were capable of representing any probability distribution (e.g., if the graph was a fully connected graph). The same is true for undirected graphs. However, the two formalisms can express different sets of conditional independencies and factorizations, and one or the other may be more intuitive for particular application domains.
Recall that we defined directed graphical models in terms of factorization into a product of conditional probabilities, and the Bayes Ball algorithm was required to test conditional independencies. In contrast, we define undirected graphical models in terms of conditional independencies, and then derive the factorization properties. In a sense, a directed graph more naturally represents conditional probabilities directly, whereas an undirected graph more naturally represents conditional independence properties.

3.1 Representation

An undirected graphical model is a graph G = (V, E), where the vertices (or nodes) V correspond to variables and the undirected edges E ⊆ V × V tell us about the conditional independence structure. The undirected graph defines a family of probability distributions which satisfy the following graph separation property:
xA ⊥ xB | xC whenever there is no path from a node in A to a node in B which does not pass through a node in C.
As before, the graph represents the family of all distributions which satisfy this property; individual distributions in the family may satisfy additional conditional independence properties. An example of this definition is shown in Figure 1. We note that graph separation can be tested using a standard graph search algorithm. Because the graph separation property can be viewed as a spatial Markov property, undirected graphical models are sometimes called Markov random fields.
Another way to express this definition is as follows: delete all the nodes in C from the graph, as well as any edges touching them. If the resulting graph decomposes into multiple connected components such that A and B belong to different components, then xA ⊥ xB | xC.
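The deletion-based test just described is easy to implement directly; the sketch below (plain Python, with a made-up example graph) removes the nodes in C and runs a breadth-first search from A to see whether any node of B remains reachable.

    from collections import deque

    def separated(edges, A, B, C):
        """Graph-separation test: is x_A independent of x_B given x_C?

        edges: iterable of undirected edges (u, v). A, B, C: disjoint sets of nodes.
        """
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

        # Delete C, then search from A; if we can reach B, the sets are not separated.
        queue = deque(a for a in A if a not in C)
        seen = set(queue)
        while queue:
            node = queue.popleft()
            if node in B:
                return False
            for nbr in adj.get(node, ()):
                if nbr not in C and nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        return True

    # Hypothetical example: a path a - c - b, plus an extra edge a - d.
    E = [('a', 'c'), ('c', 'b'), ('a', 'd')]
    print(separated(E, {'a'}, {'b'}, {'c'}))   # True:  c separates a from b
    print(separated(E, {'a'}, {'b'}, set()))   # False: a path a - c - b exists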

3.2 Directed vs. undirected graphs

We have seen that directed graphs naturally represent factorization properties and undirected graphs naturally represent conditional independence properties. Does this mean that we should always use directed graphs when we have conditional probability distributions and undirected graphs when we have conditional independencies? No: it turns out that each formalism can represent certain families of distributions that the other cannot.

[Figure 1: (a) An undirected graphical model expressing the conditional independence property xA ⊥ xB | xC. (b) When the shaded nodes are removed, the graph decomposes into multiple connected components, such that A and B belong to disjoint sets of components.]
For example, consider the following graph: a four-node cycle x - w - y - z - x, so that x and y are not adjacent and w and z are not adjacent.
Let's try to construct a directed graph to represent the same family of distributions (i.e., the same set of conditional independencies). First, note that it must contain at least the same set of edges as the undirected graph, because any pair of variables connected by an edge depend on each other regardless of whether or not any of the other variables are observed. In order for the graph to be acyclic, one of the nodes must come last in a topological ordering; without loss of generality, let's suppose it is node z. Then z has two incoming edges. Now, no matter what directions we assign to the remaining two edges, we cannot guarantee the property x ⊥ y | w, z (which holds in the undirected graph), because the Bayes Ball can pass along the path x → z ← y when z is observed. Therefore, there is no directed graph that expresses the same set of conditional independencies as the undirected one.
What about the reverse case: can every directed graph be translated into an undirected one while preserving conditional independencies? No, as the example of the v-structure x → y ← z shows. We saw in Lecture 2 that x ⊥ z, but not x ⊥ z | y. By contrast, undirected graphical models have a certain monotonicity property: when additional nodes are observed, the new set of conditional independencies is a superset of the old one. Therefore, no undirected graph can represent the same family of distributions as a v-structure.
An example of a domain more naturally represented using undirected rather than directed graphs is image processing. For instance, consider the problem of image superresolution, where we wish to double the number of pixels along each dimension. We formulate the graphical model shown in Figure 2, where the nodes correspond to pixels and undirected edges connect each pair of neighboring pixels. This graph represents the assumption that each pixel is independent of the rest of the image given its four neighboring pixels. In the superresolution task, we may treat every fourth pixel as observed, as shown in Figure 2.

[Figure 2: Part of an undirected graphical model for an image processing task, image superresolution. Nodes correspond to pixels, and every fourth pixel is observed.]

3.3 Parameterization

Like directed graphical models, undirected graphical models can be characterized either in terms of conditional independence properties or in terms of factorization. Unlike directed graphical models, undirected graphical models do not have a natural factorization into a product of conditional probabilities. Instead, we represent the distribution as a product of functions called potentials, times a normalization constant.
To motivate this factorization, consider a graph with no edge between nodes xi and xj. By definition, xi ⊥ xj | xrest, where xrest is shorthand for all the other nodes in the graph. We find that

   pxall = pxi,xj|xrest pxrest
         = pxi|xrest pxj|xrest pxrest.

Therefore, we conclude that the joint distribution can be factorized in such a way that xi and xj are in different factors.
This motivates the following factorization criterion. A clique is a fully connected set of nodes. A maximal clique is a clique that is not a strict subset of another clique. Given a set of variables x1, . . . , xN and the set C of maximal cliques, define the following representation of the joint distribution:

   px(x) ∝ Π_{C ∈ C} ψC(xC),   i.e.,   px(x) = (1/Z) Π_{C ∈ C} ψC(xC).        (1)

In this representation, Z is called the partition function, and is chosen such that the probabilities corresponding to all joint assignments sum to 1. The functions ψC can be any nonnegative-valued functions (i.e., they do not need to sum to 1), and are sometimes referred to as compatibilities.
The partition function Z can be written explicitly as

   Z = Σ_x Π_{C ∈ C} ψC(xC).

This sum can be quite expensive to evaluate. Fortunately, for many calculations, such as conditional probabilities and finding the most likely joint assignment, we do not need it. For other calculations, such as learning the parameters ψ, we do need it.
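As a concrete sketch of what evaluating Z involves (the three-node chain and the potential values below are invented for illustration), the following Python code computes the partition function of a small pairwise model by brute force; the loop ranges over |X|^N joint assignments, which is exactly the cost we will later try to avoid.

    import itertools

    X = [0, 1]                      # alphabet; any finite set works

    # Maximal cliques and their potentials for a toy 3-node chain x1 - x2 - x3.
    # The numbers are arbitrary nonnegative compatibilities.
    def psi_12(x1, x2): return 2.0 if x1 == x2 else 0.5
    def psi_23(x2, x3): return 1.5 if x2 == x3 else 1.0

    def unnormalized(x1, x2, x3):
        return psi_12(x1, x2) * psi_23(x2, x3)

    # Brute-force partition function: sum over all |X|^N assignments.
    Z = sum(unnormalized(*x) for x in itertools.product(X, repeat=3))

    # With Z in hand, the factorization gives actual probabilities.
    p = {x: unnormalized(*x) / Z for x in itertools.product(X, repeat=3)}
    print(Z, sum(p.values()))       # the probabilities sum to 1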
The complexity of description (number of parameters) is given by

   Σ_{C ∈ C} |X|^{|C|} ≈ |X|^{max_{C ∈ C} |C|}.

As with directed graphical models, the main determinant of the complexity is the number of variables involved in each term of the factorization.¹

¹ Strictly speaking, this approximation does not always hold, as the number of maximal cliques may be exponential in the number of variables. An example of this is given in one of the homeworks. However, it is a good rule of thumb for graphs that arise in practice.

So far, we have defined an arbitrary factorization property based on the graph structure. Is this related to our earlier definition of undirected graphical models in terms of conditional independencies? The relationship was formally established by the following theorem.

Theorem 1 (Hammersley-Clifford). A strictly positive distribution p (i.e., px(x) > 0 for all joint assignments x) satisfies the graph separation property from our definition of undirected graphical models if and only if it can be represented in the factorized form (1).

Proof:
One direction, that the factorization (1) implies satisfaction of the graph separation criterion, is straightforward. The other direction requires non-trivial arguments. For it, we shall provide a proof for binary Markov random fields, i.e., x ∈ {0, 1}^N. This proof is adapted from [Grimmett, "A Theorem about Random Fields," Bull. London Math. Soc., 5 (1973), 81-84].

Now any x ∈ {0, 1}^N is equivalent to a set S(x) ⊆ V = {1, . . . , N}, where S(x) = {i ∈ V : xi = 1}. With this in mind, the probability distribution of X over {0, 1}^N is equivalent to a probability distribution over the set of all subsets of V, namely 2^V. To that end, let us start by defining

   Q(S) = Σ_{A ⊆ S} (−1)^{|S \ A|} log p(X_A = 1, X_{V\A} = 0),        (2)

where 1 and 0 are vectors of all ones and all zeros respectively (of appropriate length). Accordingly,

   Q(∅) = log p(X = 0).

We claim that if S ⊆ V is not a clique of the graphical model G, then

   Q(S) = 0.        (3)

To prove this, we shall use the fact that X is a Markov random field with respect to G. Now, since S is not a clique, there exist i, j ∈ S such that (i, j) ∉ E. Grouping the subsets A ⊆ S into quadruples {B, B ∪ {i}, B ∪ {j}, B ∪ {i, j}} with B ranging over the subsets of S containing neither i nor j, we can write

   Q(S) = Σ_{A ⊆ S} (−1)^{|S \ A|} log p(X_A = 1, X_{V\A} = 0)
        = Σ_{B ⊆ S : i,j ∉ B} (−1)^{|S \ B|} log [ p(X_B = 1, X_{V\B} = 0) p(X_{B∪{i,j}} = 1, X_{V\(B∪{i,j})} = 0) / ( p(X_{B∪{i}} = 1, X_{V\(B∪{i})} = 0) p(X_{B∪{j}} = 1, X_{V\(B∪{j})} = 0) ) ].        (4)

With the notation a_{i,j} = p(X_{B∪{i,j}} = 1, X_{V\(B∪{i,j})} = 0), a_i = p(X_{B∪{i}} = 1, X_{V\(B∪{i})} = 0), a_j = p(X_{B∪{j}} = 1, X_{V\(B∪{j})} = 0), and a_0 = p(X_B = 1, X_{V\B} = 0), we have

   a_{i,j} / (a_j + a_{i,j}) = p(X_i = 1, X_j = 1, X_B = 1, X_{V\(B∪{i,j})} = 0) / p(X_j = 1, X_B = 1, X_{V\(B∪{i,j})} = 0)
                             = p(X_i = 1 | X_j = 1, X_B = 1, X_{V\(B∪{i,j})} = 0)
                             = p(X_i = 1 | X_B = 1, X_{V\(B∪{i,j})} = 0),        (5)

where in the last equality we have used the fact that (i, j) ∉ E, and hence X_i is independent of X_j conditioned on the assignment of all other variables being fixed. In a very similar manner,

   a_i / (a_0 + a_i) = p(X_i = 1 | X_B = 1, X_{V\(B∪{i,j})} = 0).        (6)

From (5)-(6), we conclude that

   a_j / a_0 = a_{i,j} / a_i,

and therefore every logarithm in (4) equals log( a_0 a_{i,j} / (a_i a_j) ) = 0, so that Q(S) = 0. This establishes the claim that Q(S) = 0 if S ⊆ V is not a clique.

From (2), and with the notation G(A) = log p(X_A = 1, X_{V\A} = 0) for all A ⊆ V and μ(S, A) = (−1)^{|S \ A|} for any S, A ⊆ V such that A ⊆ S, we can re-write (2) as

   Q(S) = Σ_{A ⊆ S} μ(S, A) G(A).        (7)

Therefore,

   Σ_{A ⊆ S} Q(A) = Σ_{A ⊆ S} Σ_{B ⊆ A} μ(A, B) G(B)
                  = Σ_{B ⊆ S} Σ_{B ⊆ A ⊆ S} μ(A, B) G(B)
                  = Σ_{B ⊆ S} G(B) Σ_{B ⊆ A ⊆ S} μ(A, B)
                  = Σ_{B ⊆ S} G(B) δ(B, S)
                  = G(S),        (8)

where δ(B, S) is 1 if B = S and 0 otherwise. To see the second-to-last equality, note that for B ⊆ S with B ≠ S, all A such that B ⊆ A ⊆ S can be decomposed into sets with |A \ B| = ℓ, for 0 ≤ ℓ ≤ k := |S \ B|. The number of A with |A \ B| = ℓ is (k choose ℓ). Therefore,

   Σ_{B ⊆ A ⊆ S} μ(A, B) = Σ_{ℓ=0}^{k} (k choose ℓ) (−1)^ℓ = (1 − 1)^k = 0.        (9)

Of course, when B = S, the sum above is trivially equal to 1. Thus, we have that for any x ∈ {0, 1}^N,

   log p(X = x) = G(N(x)) = Σ_{S ⊆ N(x) : S clique} Q(S),        (10)

where N(x) = {i ∈ V : xi = 1}. In summary,

   p(X = x) ∝ exp ( Σ_{S ⊆ V : S clique} P_S(x) ),        (11)

where the potential function P_S : {0, 1}^{|S|} → R, for each clique S ⊆ V, is defined as

   P_S(x) = Q(S) if S ⊆ N(x), and 0 otherwise.        (12)

This completes the proof.

[Figure 3: A one-dimensional Ising model: a chain x1 - x2 - x3 - · · · - xN.]

3.4 Energy interpretation

We now note some connections to statistical physics. The factorization (1) can be rewritten as

   pX(x) = (1/Z) exp{ −H(x) } = (1/Z) exp{ −Σ_{C ∈ C} HC(xC) },

a form known as the Boltzmann distribution. H(x) is sometimes called the Hamiltonian, and relates to the energy of the state x. Effectively, global configurations with low energy are favored over those with high energy. One well-studied example is the Ising model, for which the one-dimensional case is shown in Figure 3. In the one-dimensional case, the variables x1, . . . , xN, called spins, are arranged in a chain and take on values in {+, −}. The pairwise compatibilities either favor or punish states where neighboring spins are identical. For instance, we may define

   H(xi, xi+1) = [ 3/2  1/5 ]
                 [ 1/5  3/2 ]

with rows and columns indexed by the values {+, −} of xi and xi+1, respectively.
There exist factorizations of distributions that cannot be represented by either directed or undirected graphical models. In order to model some such distributions, we will introduce the factor graph in the next lecture. (Note that this does not imply that there exist distributions which cannot be represented in either formalism. In fact, in both formalisms, fully connected graphs can represent any distribution. This observation is uninteresting, because the very point of graphical models is to compactly represent distributions in ways that support efficient learning and inference.) The existence of three separate formalisms for representing families of distributions as graphs is a sign of the immaturity of the field.


Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.438 Algorithms for Inference


Fall 2014

4 Factor Graphs and Comparing Graphical Model Types
We now introduce a third type of graphical model. Beforehand, let us summarize some key perspectives on our first two.
First, for some ordering of variables, we can write any probability distribution as

   px(x1, . . . , xn) = px1(x1) px2|x1(x2|x1) · · · pxn|x1,...,xn−1(xn|x1, . . . , xn−1),

which can be expressed as a fully connected directed graphical model. When the conditional distributions involved do not depend on all the indicated conditioning variables, then some of the edges in the directed graphical model can be removed. This reduces the complexity of inference, since the associated conditional probability tables have a more compact description.
By contrast, undirected graphical models express distributions of the form

   px(x1, . . . , xn) = (1/Z) Π_{c ∈ C} ψc(xc),

where the potential functions ψc are non-negative and C is the set of maximal cliques of an undirected graph G. Remember that the Hammersley-Clifford theorem says that any distribution that factors in this way satisfies the Markov property on the graph, and conversely that if p is strictly positive and satisfies the Markov property for G, then it factors as above. Evidently, any distribution can be expressed using a fully connected undirected graphical model, since this corresponds to a single potential involving all the variables, i.e., a joint probability distribution. Undirected graphical models efficiently represent conditional independencies in the distribution of interest, which are expressed by the removal of edges, and which similarly reduces the complexity of inference.

4.1 Factor graphs

Factor graphs are capable of capturing structure that the traditional directed and undirected graphical models above are not capable of capturing. A factor graph consists of a vector of random variables x = (x1, . . . , xN) and a graph G = (V, E, F), which in addition to the normal variable nodes also has factor nodes F. Furthermore, the graph is constrained to be bipartite between variable nodes and factor nodes.
The joint probability distribution associated to a factor graph is given by

   p(x1, . . . , xN) = (1/Z) Π_{j=1}^{m} fj(x_{fj}).

[Figure 1: A general factor graph: variable nodes x1, . . . , xN on one side, factor nodes on the other.]

For example, in Figure 1, f1 is a function of x1 and x2.


What constraints are imposed on the factors? The factors must be non-negative, but otherwise we're free to choose them. We could of course roll the partition function Z into one of the factors, and that would constrain one of the factors.
The factor graph is not constrained to have factors only for maximal cliques, so we can more explicitly represent the factorization of the joint probability distribution.
It is very easy to encode certain kinds of (especially algebraic) constraints in the factor graph. One example is the Hamming code example from the first day of class, which you will see more of later in the subject. As a more basic example, consider the following.
Example 1. Suppose we have random variables representing taxes (x1), social security (x2), medicare (x3), and foreign aid (x4), with constraints

   x1 ≤ 3,    x2 ≤ 0.5,    x3 ≤ 0.25,    x4 ≤ 0.01,

and finally we need to decrease spending by 1, so x1 + x2 + x3 + x4 ≥ 1. If we were interested in picking uniformly among the assignments that satisfy the constraints, we could encode this distribution conveniently with the factor graph in Figure 2. The resulting distribution is given by

   px(x) ∝ 1{x1 ≤ 3} · 1{x2 ≤ 0.5} · 1{x3 ≤ 0.25} · 1{x4 ≤ 0.01} · 1{x1 + x2 + x3 + x4 ≥ 1}.

[Figure 2: Factor graph encoding the constraints on the budget: each variable xi is attached to an indicator factor (1{x1 ≤ 3}, 1{x2 ≤ 0.5}, 1{x3 ≤ 0.25}, 1{x4 ≤ 0.01}), and all four variables share the factor enforcing x1 + x2 + x3 + x4 ≥ 1.]
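A quick way to see what distribution this factor graph encodes is rejection sampling: the Python sketch below (the box proposal, which assumes the cuts are nonnegative, is an arbitrary choice made for illustration) draws candidate points, keeps only those for which all five indicator factors equal 1, and thereby samples approximately uniformly from the feasible set.

    import numpy as np

    rng = np.random.default_rng(0)
    caps = np.array([3.0, 0.5, 0.25, 0.01])       # x1 <= 3, x2 <= 0.5, x3 <= 0.25, x4 <= 0.01

    def feasible(x):
        # The product of indicator factors is 1 exactly when all constraints hold.
        return np.all(x <= caps) and x.sum() >= 1.0

    # Rejection sampling from a box proposal (assuming nonnegative cuts):
    # propose x_i uniformly in [0, cap_i] and keep the points where every factor is 1.
    samples = []
    while len(samples) < 1000:
        x = rng.uniform(0.0, caps)
        if feasible(x):
            samples.append(x)

    samples = np.array(samples)
    print(samples.mean(axis=0))   # average feasible budget cut under the uniform model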

4.2 Converting Between Graphical Model Types

4.2.1 Converting Undirected Models to Factor Graphs

We can write down the probability distribution associated to the undirected graph on the left of Figure 3, whose maximal cliques are {x1, x3, x4} and {x2, x3, x4}:

   px(x) ∝ f134(x1, x3, x4) f234(x2, x3, x4),

which naturally gives a factor graph representation using the potentials as factor nodes (Figure 3). In general, we can convert an undirected graphical model into a factor graph by defining a factor node for each maximal clique.
How many maximal cliques could an undirected graph have? We can construct an example where the number of maximal cliques scales like n^2, where n is the number of nodes in the undirected graph. Consider a complete bipartite graph where the nodes are evenly split (Figure 4). Given any 3 nodes, 2 of them must lie on the same side and hence be disconnected. Thus, there can't be any 3-node cliques, so all of the edges are maximal cliques, and there are O(n^2) edges. In general there can be exponentially many maximal cliques in an undirected graph (see Problem Set 2).
4.2.2 Converting Factor Graphs to Directed Models

Take an ordering of the variables, say x1, . . . , xn. For each node in turn, find a minimal set U ⊆ {x1, . . . , xi−1} such that xi ⊥ {x1, . . . , xi−1} \ U | U is satisfied, and set xi's parents to be the nodes in U. This amounts to reducing in turn each p(xi | x1, . . . , xi−1) as much as possible using the conditional independencies implied by the factor graph. An example is worked out in Figure 5.

[Figure 3: Representing an undirected graph on x1, x2, x3, x4 as a factor graph.]
[Figure 4: A complete bipartite graph has O(n^2) maximal cliques.]

[Figure 5: Converting a factor graph on x1, x2, x3, x4 into a directed graph, and then into an undirected graph.]
4.2.3 Converting from Directed to Undirected Models

This is done through moralization: completely connect the parents of each node and then drop the directions of the arrows. Moralization "marries" the parents by connecting them together. See Figure 5 for an example.
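Moralization is mechanical enough to write out directly; the short Python sketch below (with a made-up DAG used only as an example) returns the undirected edge set obtained by keeping every directed edge and marrying all pairs of parents of each node.

    import itertools

    def moralize(parents):
        """Return the undirected edge set of the moral graph of a DAG.

        parents: dict node -> iterable of parent nodes.
        """
        edges = set()
        for child, ps in parents.items():
            ps = list(ps)
            # 1) keep every directed edge, dropping its orientation
            edges.update(frozenset((p, child)) for p in ps)
            # 2) 'marry' the parents: connect every pair of parents of this node
            edges.update(frozenset(pair) for pair in itertools.combinations(ps, 2))
        return edges

    # Hypothetical example: the v-structure x -> y <- z gains the edge x - z.
    print(moralize({'x': [], 'z': [], 'y': ['x', 'z']}))
    # {frozenset({'x', 'y'}), frozenset({'y', 'z'}), frozenset({'x', 'z'})}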
The important thing to recognize is that the conversion process is not lossless. In these constructions, any conditional independence implied by the converted graph is satisfied by the original graph. However, in general, some of the conditional independencies implied by the original graph are not implied by the converted graph. How do we know that we've come up with good conversions? We want the converted graph to be "close" to the original graph, for some definition of closeness. We'll explore these notions through I-maps, D-maps, and P-maps.

4.3 Measuring Goodness of Graphical Representations

4.3.1 I-map

Consider a probability distribution D and a graphical model G. Let CI(D) denote the set of conditional independencies satisfied by D, and let CI(G) denote the set of all conditional independencies implied by G.
Definition 1 (I-map). We say G is an independence map, or I-map, for D if CI(G) ⊆ CI(D). In other words, every conditional independence implied by G is satisfied by D if G is an I-map for D.
The complete graph is always an example of an I-map for any distribution, because it implies no conditional independencies.

[Figure 6: Both graphs are I-maps for the distribution that completely factorizes (i.e., px,y = px py).]
[Figure 7: Two example graphs, (a) and (b).]
4.3.2 D-map

Definition 2 (D-map). We say G is a dependence map, or D-map, for D if CI(G) ⊇ CI(D); that is, every conditional independence that D satisfies is implied by G.
The graph with no edges is an example of a D-map for any distribution, because it implies every conditional independence.
4.3.3 P-map

Definition 3 (P-map). We say G is a perfect map, or P-map, for D if CI(G) = CI(D), i.e., if every conditional independence implied by G is satisfied by D and vice versa.
Example 2. Consider three distributions that factor as follows:

   p1 = px py pz,
   p2 = pz|x,y px py,
   p3 = pz|x,y px|y py.

Then the graph in Figure 7(a) is an I-map for p1, a D-map for p3, and a P-map for p2. The graph in Figure 7(b) is an I-map for

   p(x, y, w, z) = (1/Z) f1(x, w) f2(w, y) f3(z, y) f4(x, z)

by the Hammersley-Clifford theorem.


Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.438 Algorithms for Inference


Fall 2014

5 Minimal I-Maps, Chordal Graphs, Trees, and Markov Chains
Recall that some kinds of structure in distributions cannot be efficiently captured with either directed or undirected graphical models. An example of such a distribution is

   pxyz(x, y, z) = f1(x, y) f2(y, z) f3(x, z).        (1)

This is captured by (cf. Hammersley-Clifford) an undirected complete graph on 3 nodes. And since such a complete graph captures any distribution, it fails to capture the special factorization structure in the distribution. By contrast, this structure can be well captured by a factor graph in which there are three factors, one each for f1, f2, and f3. In this respect, we say that a factor graph can provide a finer-grained representation. The complexity of a factor graph representation is |F| · |X|^D, where D is the size of the largest factor (i.e., the number of variables participating in the factor).
Now, all the graphical models are universal, in the sense that each of them can be used to represent any distribution. However, the goal is not merely representation but one of trying to find the most efficient representation in terms of its ability to capture the structure of the underlying distribution. To begin to quantify such a notion of efficiency, we introduced the concepts of I-map, D-map, and P-map. The basic idea was that a P-map captures the conditional independence structure of the distribution precisely. An I-map for a given distribution implies no more independencies than the distribution satisfies.
Typically, we are interested in I-maps that represent a family of distributions that is as small as possible (subject to the constraint that it contains our distribution of interest). For this purpose, we have the following definition:
Definition 1 (Minimal I-Map). A minimal I-map is an I-map with the property that removing any single edge would cause the graph to no longer be an I-map.
In a similar manner, one can define a maximal D-map.
5.1 Generating Minimal I-Maps

Consider a distribution D for which we wish to generate a minimal I-map.
We first consider the construction of a minimal directed I-map. Say there are N variables x1, . . . , xN with chain rule representation

   px(x) = px1(x1) px2|x1(x2|x1) · · · pxN|xN−1,...,x1(xN|xN−1, . . . , x1).        (2)

Now, for each component (conditional probability), reduce it if conditional independence exists, i.e., remove a variable from the conditioning set if the distribution's conditional independencies allow for it. Inherently, the minimal I-map satisfies local minimality, and therefore such a greedy procedure leads to one such minimal I-map.
Now consider the construction of a minimal undirected I-map. Let us start with a valid representation for the given distribution. Iteratively check, for all edges (i, j), whether xi ⊥ xj | xrest, and if edge (i, j) satisfies such a conditional independence, then remove it. Continue doing this iteratively until no more edges can be removed. The algorithm can take at most O(N^3) time.
Theorem 1. If p > 0, the resulting undirected graphical model is the unique undirected minimal I-map for p.
For some distributions, the resulting minimal I-map obtained in this way will actually be a P-map. In this case, the associated directed or undirected graph is a perfect fit for the distribution of interest. Moreover, a further subset of distributions will have both directed and undirected P-maps. Such distributions (and the associated graphical models) are particularly special, and allow for highly efficient inference, as we will see. In the meantime, let's develop the characteristics of such distributions (and graphs).
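For a small discrete distribution given as an explicit joint table, the pairwise test xi ⊥ xj | xrest can be checked numerically. The Python sketch below (a brute-force illustration that is exponential in N, with a made-up three-variable chain distribution) keeps exactly the edges (i, j) for which this pairwise conditional independence fails, which for p > 0 is the graph the iterative edge-removal procedure converges to.

    import itertools
    import numpy as np

    def conditionally_independent(p, i, j, tol=1e-9):
        # Test x_i _||_ x_j | x_rest for a joint table p of shape (|X|,)*N:
        # p(x_i, x_j, x_rest) * p(x_rest) == p(x_i, x_rest) * p(x_j, x_rest) everywhere.
        p_rest = p.sum(axis=(i, j), keepdims=True)
        p_i_rest = p.sum(axis=j, keepdims=True)
        p_j_rest = p.sum(axis=i, keepdims=True)
        return np.allclose(p * p_rest, p_i_rest * p_j_rest, atol=tol)

    def minimal_undirected_imap(p):
        N = p.ndim
        return [(i, j) for i, j in itertools.combinations(range(N), 2)
                if not conditionally_independent(p, i, j)]

    # Toy example: a 3-variable binary Markov chain x0 - x1 - x2, built from its chain rule.
    rng = np.random.default_rng(0)
    p0 = np.array([0.4, 0.6])
    T = rng.dirichlet([2, 2], size=2)           # a random 2x2 transition matrix
    p = np.einsum('a,ab,bc->abc', p0, T, T)     # p(x0) p(x1|x0) p(x2|x1)
    print(minimal_undirected_imap(p))           # expected: [(0, 1), (1, 2)]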

5.2 Moralization and Triangulation

We begin by determining conditions under which converting a directed graphical model to an undirected graphical model does not introduce inefficiency. First recall the conversion: 1) retain all edges from the directed model in the undirected one; and 2) connect every pair of parents of a node with an edge (a process quaintly called moralization). The result of this procedure is called the moralized graph.
Theorem 2. The moralized undirected graphical model of a given directed graphical model is a minimal I-map for the family of distributions represented by the directed graphical model. Further, if moralization does not add any edges, the moralized graph is a P-map.
To understand the latter part of the above result, we need to understand the conversion from undirected graphical models to directed graphical models. This requires the addition of chords. Specifically, for a loop (cycle) in a graph, a chord of the loop is an edge connecting two non-consecutive nodes (of the loop). A graph is called chordal if every cycle of the graph of size ≥ 4 has a chord.
Theorem 3. If a directed acyclic graph is a minimal I-map for an undirected graphical model, then the directed graph must be chordal.
This suggests that to obtain a directed graphical model from an undirected graphical model, one needs to add chords. The detailed procedure is referred to as triangulation of the graph.¹ We shall study this procedure in detail soon, when we begin investigating efficient inference.

¹ It is important to emphasize that the term "triangulation" is a bit generic, in that there are many ways one might imagine creating triangles of nodes in a graph, not all of which result in a chordal graph. We use the term to refer to a very particular procedure for creating a chordal graph.

From the above development, we can now deduce when a set of conditional independencies of a distribution has both directed and undirected P-maps. In particular, an undirected graph G has a directed P-map if and only if G is chordal. Alternatively, a directed graph H has an undirected P-map if and only if moralization of H does not add any edges.

[Figure 1: A Markov chain x1 - x2 - · · · - xN in undirected and directed representations.]

5.3 Simple Examples of Chordal Graphs

5.3.1 Markov Chains

Perhaps the simplest example of a chordal graph corresponds to a Markov chain, as shown in Figure 1. In the figure, the directed graphical model and the undirected graphical model have exactly the same structure (ignoring the directionality of edges). And it is chordal, because there are no loops or cycles in the graph. Another such example is the hidden Markov model that we have seen before (also shown in Figure 2). The random variables of interest are x1, . . . , xN and the observations are y1, . . . , yN, where yi depends on xi and xi immediately depends on xi−1. The undirected and directed graphical representations for both of these distributions are the same, and both of these graphical models are chordal (again, because they do not have any loops).

[Figure 2: Hidden Markov chain in undirected and directed representations.]
5.3.2 Trees

A very slightly richer class of chordal graphs are tree models. Consider an undirected tree, as shown in Figure 3. To make its directed graph counterpart, start with any node as a root and direct edges away from the root along all directions in the tree to obtain the associated directed graphical model, as shown. As we shall see, inference is especially easy on trees.

[Figure 3: Converting from an undirected tree to a directed tree by choosing a 'Root' and directing edges away from it.]
[Figure 4: Chords in the MRF representation of an image; the chordalization shown represents only a subset of the edges to be added.]

5.4 Discussion

As we shall see, chordal graphical models are good representations and allow for efficient inference (with respect to the size of the graphical structure, which could be exponentially large in the number of variables). Figure 4 shows an example of an MRF that we have seen earlier for representing an image. This graphical model (a two-dimensional grid graph) seems nice. But to make it chordal, we have to add lots of edges, making it a lot more complicated.
Figure 5 shows a useful visual classification of distributions in terms of perfect maps, chordal graphical models, trees, and Markov chains.
As a final comment, Markov chains also have other associated graphical representations with which you may already be familiar: state transition diagrams and trellis diagrams (Figure 6). While these representations continue to be useful, it is important not to confuse them with our new graphical model representations of Markov chains: the nodes and edges mean very different things in these different representations!

[Figure 5: Venn diagram of graphical models: within the set of all distributions, those with perfect directed maps and those with perfect undirected maps overlap in the chordal family, which contains the trees, which in turn contain the Markov chains.]

[Figure 6: State transition diagram and trellis diagram for a two-state Markov chain (states s0 and s1).]


Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.438 Algorithms for Inference


Fall 2014

Gaussian Graphical Models

Today we describe how collections of jointly Gaussian random variables can be represented as directed and undirected graphical models. Our knowledge of these graphical models will all carry over to the Gaussian case, with the added benefit that Gaussian random variables will allow us to exploit a variety of linear algebra tools.
Why focus on Gaussians rather than continuous distributions in general? The choice of having a special case for Gaussians is warranted by the many nice properties Gaussian random variables possess. For example, the Gaussian distribution is an example of a stable family, meaning that if we add any two independent Gaussians, we get another Gaussian. The Gaussian distribution is the only continuous, stable family with finite variance. Moreover, the Central Limit Theorem suggests that the family is an attractor, since summing many i.i.d. random variables that need not be Gaussian results in a random variable that converges in distribution to a Gaussian. In fact, under mild conditions, we need not require the random variables being summed to be identically distributed either.

6.1 Multivariate (Jointly) Gaussian Random Variables

There are many equivalent ways to define multivariate Gaussians, also called Gaussian random vectors. Here are a few characterizations of a random vector x being multivariate Gaussian:
(i) Linear combination of i.i.d. scalar Gaussian variables: there exist some matrix A, constant vector b, and random vector u with i.i.d. N(0, 1) entries such that x = Au + b.
(ii) All linear combinations of elements of x are scalar Gaussian random variables: y = a^T x is Gaussian for all a.
(iii) Covariance form: the probability density function of x can be written as

   px(x) = (1 / ((2π)^{N/2} |Λ|^{1/2})) exp{ −(1/2) (x − μ)^T Λ^{-1} (x − μ) },

denoted x ~ N(μ, Λ), with mean μ = E[x] and covariance matrix Λ = E[(x − μ)(x − μ)^T].
(iv) Information form: the probability density function of x can be written as

   px(x) ∝ exp{ −(1/2) x^T J x + h^T x },

denoted x ~ N^{-1}(h, J), with potential vector h and information (or precision) matrix J. Note that J = Λ^{-1} and h = Jμ.
We will focus on the last two characterizations, while exploiting the first two as key properties.

6.2 Operations on Gaussian random vectors

For the covariance and information forms, we consider how the marginalization and conditioning operations are done. Let

   x = [ x1 ]  ~  N( [ μ1 ],  [ Λ11  Λ12 ] )  =  N^{-1}( [ h1 ],  [ J11  J12 ] ).
       [ x2 ]        [ μ2 ]   [ Λ21  Λ22 ]              [ h2 ]   [ J21  J22 ]

Marginalization is easy when we have x represented in covariance form: due to characterization (ii) from earlier, marginals of x are Gaussian. Computing marginals just involves reading off entries from μ and Λ, e.g.,

   x1 ~ N(μ1, Λ11).

In contrast, computing marginals using the information form is more complicated:

   x1 ~ N^{-1}(h', J'),

where h' = h1 − J12 J22^{-1} h2 and J' = J11 − J12 J22^{-1} J21. The expression for J' is called the Schur complement.
Conditioning is easy when we have x represented in information form: we use the fact that conditionals of a Gaussian random vector are Gaussian. Setting the conditioning variables constant in the joint distribution and reading off the quadratic form in the remaining variables, it becomes apparent that conditioning involves reading off entries from J, e.g., when conditioning on x2,

   px1|x2(x1|x2) ∝ px1,x2(x1, x2)
                 ∝ exp{ −(1/2) [x1; x2]^T [J11 J12; J21 J22] [x1; x2] + [h1; h2]^T [x1; x2] }
                 = exp{ −(1/2) ( x1^T J11 x1 + 2 x2^T J21 x1 + x2^T J22 x2 ) + h1^T x1 + h2^T x2 }
                 = exp{ −(1/2) x1^T J11 x1 + (h1 − J12 x2)^T x1 + h2^T x2 − (1/2) x2^T J22 x2 }
                 ∝ exp{ −(1/2) x1^T J11 x1 + (h1 − J12 x2)^T x1 },

where the last step uses the fact that x2, which we are conditioning on, is treated as a constant (and J21 = J12^T by symmetry). In particular, we see that

   x1 | x2 ~ N^{-1}(h1', J11),

where h1' = h1 − J12 x2. While we can read off entries of J to obtain the information matrix for x1|x2, namely J11, the potential vector needs to be updated. Note that conditioning using the covariance form is more complicated, involving a Schur complement:

   x1 | x2 ~ N(μ', Λ'),

where μ' = μ1 + Λ12 Λ22^{-1} (x2 − μ2) and Λ' = Λ11 − Λ12 Λ22^{-1} Λ21.
We can interpret the conditional distribution as follows. Note that μ' = E[x1 | x2], also known as the Bayes least-squares estimate of x1 from x2, is linear in x2, a special property of Gaussians. Moreover,

   μ' = arg min over estimators of the form x̂1(x2) = A x2 + b of E[ ||x1 − x̂1(x2)||^2 ],

where Λ' is the resulting mean-square error for the estimator μ'.
We see that both the covariance and information forms are useful, depending on whether we are marginalizing or conditioning. Converting between the two requires matrix inversion, e.g., solving linear equations. This involves Gaussian elimination and use of the Schur complement, which we will say a little more about at the end of today's lecture.
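The marginalization and conditioning formulas above are easy to sanity-check numerically. The numpy sketch below (with an arbitrary random 3-dimensional Gaussian, split into x1 of dimension 2 and x2 of dimension 1) computes both Schur complements and verifies that the information-form answers match the covariance-form ones.

    import numpy as np

    rng = np.random.default_rng(1)

    # A random positive definite covariance and mean for x = (x1, x2), dim(x1)=2, dim(x2)=1.
    A = rng.standard_normal((3, 3))
    Cov = A @ A.T + 3 * np.eye(3)
    mu = rng.standard_normal(3)
    J = np.linalg.inv(Cov)          # information matrix
    h = J @ mu                      # potential vector

    i, j = slice(0, 2), slice(2, 3) # index blocks for x1 and x2

    # Marginal of x1: trivial in covariance form ...
    mu1, Cov11 = mu[i], Cov[i, i]
    # ... and a Schur complement in information form.
    J_marg = J[i, i] - J[i, j] @ np.linalg.inv(J[j, j]) @ J[j, i]
    h_marg = h[i] - J[i, j] @ np.linalg.inv(J[j, j]) @ h[j]
    print(np.allclose(np.linalg.inv(J_marg), Cov11),
          np.allclose(np.linalg.solve(J_marg, h_marg), mu1))

    # Conditional x1 | x2 = x2obs: trivial in information form ...
    x2obs = np.array([0.7])
    J_cond = J[i, i]
    h_cond = h[i] - J[i, j] @ x2obs
    # ... and a Schur complement in covariance form.
    Cov_cond = Cov[i, i] - Cov[i, j] @ np.linalg.inv(Cov[j, j]) @ Cov[j, i]
    mu_cond = mu[i] + Cov[i, j] @ np.linalg.inv(Cov[j, j]) @ (x2obs - mu[j])
    print(np.allclose(np.linalg.inv(J_cond), Cov_cond),
          np.allclose(np.linalg.solve(J_cond, h_cond), mu_cond))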

6.3 Gaussian graphical models

To represent a Gaussian random vector as a graphical model, we will need to know its conditional independencies. From Λ and J, we can read off the following independencies:
Theorem 1. For x ~ N(μ, Λ), xi ⊥ xj if and only if Λij = 0.
Theorem 2. For x ~ N^{-1}(h, J), xi ⊥ xj | xrest if and only if Jij = 0.
The information matrix J is particularly useful since it describes pairwise Markov conditional independencies and encodes a minimal undirected I-map for x. To obtain the undirected Gaussian graphical model from J, add an edge between xi and xj whenever Jij ≠ 0. To obtain a Gaussian directed graphical model, choose an ordering of the xi's and apply the chain rule:

   px1,...,xn = px1 px2|x1 px3|x2,x1 · · · pxn|xn−1,...,x1.

Note that each factor on the right-hand side is Gaussian, with mean linear in its parents, due to the Bayes least-squares estimate being linear, as discussed previously.

6.4 Gaussian Markov process

An example of a directed Gaussian graphical model is the Gauss-Markov process, shown in Figure 1. We can express the process in innovation form:

   x_{i+1} = A x_i + B v_i,        (1)

where x1 ~ N(μ0, Λ0), the v_i ~ N(0, Λ_v) are i.i.d. and independent of x1, and B v_i is called the innovation. This is a linear dynamical system because the evolution model is linear.

[Figure 1: Gauss-Markov process: a directed chain x1 → x2 → x3 → · · · → xN.]

Consider the case where we do not observe the x_i's directly, i.e., the x_i's are hidden, but we observe y_i, related to each x_i through a Gaussian conditional probability distribution:

   y_i = C x_i + w_i,        (2)

where the w_i ~ N(0, Λ_w) are i.i.d. and independent of the v_j's and x1. The resulting graphical model is shown in Figure 2. Collectively, equations (1) and (2) are referred to as standard state space form.

[Figure 2: Hidden Gauss-Markov process: hidden chain x1 → x2 → · · · → xN with observations y1, . . . , yN.]


Generally, Gaussian inference involves exploiting linear algebraic structure.
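As a sketch of what equations (1) and (2) describe (all dimensions and the particular A, B, C, and noise variances below are arbitrary illustrative choices, with μ0 = 0 assumed), the following numpy code simulates a short hidden Gauss-Markov process in standard state space form.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 50                                  # number of time steps

    # Illustrative scalar model parameters (not from the notes).
    A, B, C = 0.95, 1.0, 1.0
    Lambda0, Lambda_v, Lambda_w = 1.0, 0.1, 0.5

    x = np.empty(N)
    y = np.empty(N)
    x[0] = rng.normal(0.0, np.sqrt(Lambda0))                # x1 ~ N(mu0, Lambda0), mu0 = 0
    for i in range(N):
        if i > 0:
            v = rng.normal(0.0, np.sqrt(Lambda_v))          # innovation noise v_i ~ N(0, Lambda_v)
            x[i] = A * x[i - 1] + B * v                     # state evolution, equation (1)
        w = rng.normal(0.0, np.sqrt(Lambda_w))              # observation noise w_i ~ N(0, Lambda_w)
        y[i] = C * x[i] + w                                 # observation model, equation (2)

    print(x[:5])
    print(y[:5])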

6.5 Matrix inversion

Lastly, we develop the matrix inversion lemma. Let

   M = [ E  F ]
       [ G  H ],

where E and H are invertible. We want to invert M. First, we block-diagonalize M by pre- and post-multiplying by triangular matrices:

   [ I  −F H^{-1} ] [ E  F ] [ I          0 ]   =   [ E − F H^{-1} G   0 ]
   [ 0       I    ] [ G  H ] [ −H^{-1} G   I ]       [ 0                 H ],

which we write compactly as X M Z = W, where M/H := E − F H^{-1} G is called the Schur complement (of H in M). Noting that W^{-1} = Z^{-1} M^{-1} X^{-1}, M^{-1} is given by

   M^{-1} = Z W^{-1} X
          = [ I           0 ] [ (M/H)^{-1}   0      ] [ I  −F H^{-1} ]
            [ −H^{-1} G    I ] [ 0            H^{-1} ] [ 0       I    ].

Taking the determinant of both sides of this equation, and noting that the triangular factors X and Z have unit determinant, yields

   |M|^{-1} = |M^{-1}| = |(M/H)^{-1}| |H^{-1}| = |M/H|^{-1} |H|^{-1},        (3)

where we use the fact that

   det [ A  B ]  =  det [ A  0 ]  =  |A| |D|
       [ 0  D ]         [ C  D ]

whenever A and D are square matrices. Rearranging terms in equation (3) gives |M| = |M/H| |H|, hence the notation for the Schur complement.
We could alternatively decompose M in terms of E and M/E = H − G E^{-1} F to get an expression for M^{-1} that looks similar to the above equation. Equating the two and rearranging terms gives the matrix inversion lemma:

   (M/H)^{-1} = (E − F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H − G E^{-1} F)^{-1} G E^{-1} = E^{-1} + E^{-1} F (M/E)^{-1} G E^{-1}.

This will be useful later when we develop Gaussian inference algorithms.
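The identity is easy to check numerically. The numpy sketch below (block sizes and matrices are arbitrary) builds a random block matrix, compares (M/H)^{-1} against the right-hand side of the lemma, and also verifies the determinant identity |M| = |M/H| |H|.

    import numpy as np

    rng = np.random.default_rng(2)
    n, m = 3, 2                                   # block sizes for E (n x n) and H (m x m)

    E = rng.standard_normal((n, n)) + 5 * np.eye(n)   # keep the blocks comfortably invertible
    H = rng.standard_normal((m, m)) + 5 * np.eye(m)
    F = rng.standard_normal((n, m))
    G = rng.standard_normal((m, n))
    M = np.block([[E, F], [G, H]])

    M_over_H = E - F @ np.linalg.inv(H) @ G       # Schur complement M/H
    M_over_E = H - G @ np.linalg.inv(E) @ F       # Schur complement M/E

    # Matrix inversion lemma: (E - F H^-1 G)^-1 = E^-1 + E^-1 F (H - G E^-1 F)^-1 G E^-1
    lhs = np.linalg.inv(M_over_H)
    Einv = np.linalg.inv(E)
    rhs = Einv + Einv @ F @ np.linalg.inv(M_over_E) @ G @ Einv
    print(np.allclose(lhs, rhs))                  # True

    # Determinant identity: |M| = |M/H| |H|
    print(np.isclose(np.linalg.det(M), np.linalg.det(M_over_H) * np.linalg.det(H)))

    # The top-left block of M^-1 is (M/H)^-1, which is how the lemma arises.
    print(np.allclose(np.linalg.inv(M)[:n, :n], lhs))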




Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.438 Algorithms for Inference


Fall 2014

7 The Elimination Algorithm

Thus far, we have introduced directed, undirected, and factor graphs and discussed conversion between these different kinds of graphs. Today we begin our foray into algorithms for inference on graphical models. Recall from the first lecture that the two main inference tasks of interest are:

Calculating posterior beliefs. We have some initial belief or prior px(·) on an unknown quantity of interest x. We observe y = y, which is related to x via a likelihood model py|x(y|·). We want to compute the distribution px|y(·|y).

Calculating maximum a posteriori estimates. We want to compute a most probable configuration x̂ ∈ arg max_x px|y(x|y).

We look at calculating posterior beliefs today in the context of undirected graphical models. For concreteness, consider a collection of random variables x1, . . . , xN, each taking on values from alphabet X and with their joint distribution represented by an undirected graph G and factoring as

$$p_x(x) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c),$$

where C is the set of maximal cliques of G and Z is the partition function. Marginalization is useful because, for example, we may be interested in learning about x1 based on observing xN, i.e., we want the posterior belief px1|xN(·|xN). Calculating this posterior belief requires us to marginalize out x2, . . . , xN−1. More generally, we want to be able to compute

$$p_{x_A | x_B}(x_A | x_B) = \frac{p_{x_A, x_B}(x_A, x_B)}{p_{x_B}(x_B)}$$

for any disjoint pair of subsets A, B ⊂ {1, 2, . . . , N}, A ∩ B = ∅, where A ∪ B may not consist of all nodes in the graph.

In this lecture, we discuss the Elimination Algorithm for doing marginalization. This algorithm works on all undirected graphs and provides an exact solution, albeit possibly with high computational complexity. The analogous algorithm for DAGs is similar; alternatively, we could convert a DAG into an undirected graph via moralization and then apply the Elimination Algorithm. We first look at marginalization when there are no observations and then explain how observations can be incorporated into the algorithm to yield posterior beliefs.

7.1 Intuition for the Elimination Algorithm

We first develop some intuition for what the Elimination Algorithm does via the following example graph on random variables x1, x2, . . . , x5 ∈ X:

Figure 1: Our running example for today's lecture, an undirected graph on x1, . . . , x5 with edges (1,2), (1,3), (2,5), (3,4), (3,5), (4,5).


The corresponding probability distribution factors as:

$$p_x(x) \propto \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3)\,\psi_{25}(x_2, x_5)\,\psi_{345}(x_3, x_4, x_5), \tag{1}$$

where the ψ's are the potential functions for the maximal cliques in the graph.

Suppose we want the marginal px1(·). The naive brute-force approach is to first compute the joint probability table over all the random variables and then sum out x2, . . . , x5:

$$p_{x_1}(x_1) = \sum_{x_2, x_3, x_4, x_5 \in X} p_x(x), \tag{2}$$

which requires O(|X|^5) operations.


Can we do better? Observe that by plugging factorization (1) into equation (2), we obtain

$$p_{x_1}(x_1) = \sum_{x_2, x_3, x_4, x_5 \in X} p_x(x) \;\propto\; \sum_{x_2, x_3, x_4, x_5 \in X} \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3)\,\psi_{25}(x_2, x_5)\,\psi_{345}(x_3, x_4, x_5).$$

Notice that choosing which order we sum out variables could make a difference in computational complexity!

For example, consider an elimination ordering (5, 4, 3, 2, 1) where we sum out x5 first, x4 second, and so forth until we get to x1, which we do not sum out. Then, we can push around some sums:

$$\begin{aligned}
p_{x_1}(x_1) &\propto \sum_{x_2, x_3, x_4, x_5 \in X} \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3)\,\psi_{25}(x_2, x_5)\,\psi_{345}(x_3, x_4, x_5) \\
&= \sum_{x_2, x_3, x_4 \in X} \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3) \underbrace{\sum_{x_5 \in X} \psi_{25}(x_2, x_5)\,\psi_{345}(x_3, x_4, x_5)}_{\triangleq\, m_5(x_2, x_3, x_4)} \\
&= \sum_{x_2, x_3, x_4 \in X} \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3)\, m_5(x_2, x_3, x_4) \\
&= \sum_{x_2, x_3 \in X} \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3) \underbrace{\sum_{x_4 \in X} m_5(x_2, x_3, x_4)}_{\triangleq\, m_4(x_2, x_3)} \\
&= \sum_{x_2, x_3 \in X} \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3)\, m_4(x_2, x_3) \\
&= \sum_{x_2 \in X} \psi_{12}(x_1, x_2) \underbrace{\sum_{x_3 \in X} \psi_{13}(x_1, x_3)\, m_4(x_2, x_3)}_{\triangleq\, m_3(x_1, x_2)} \\
&= \sum_{x_2 \in X} \psi_{12}(x_1, x_2)\, m_3(x_1, x_2) \;\triangleq\; m_2(x_1).
\end{aligned}$$
What happened here is that we computed intermediate tables m5, m4, m3, and m2. At the very end, to compute the marginal on x1, we just normalize m2 to get px1(x1):

$$p_{x_1}(x_1) = \frac{m_2(x_1)}{\sum_{x_1' \in X} m_2(x_1')}.$$

The above procedure is the Elimination Algorithm for our specific example graph using elimination ordering (5, 4, 3, 2, 1). Note that the intermediate tables are given the letter m to suggest that they can be interpreted as messages. For example, m5(x2, x3, x4) can be thought of as all the information x5 has to give to its neighbors x2, x3, and x4. Later on, we will see more of these so-called message-passing algorithms.
What is the computational complexity? Intuitively, this depends on the elimination ordering. For the elimination ordering (5, 4, 3, 2, 1), note that the most expensive summation carried out was in computing m5, which required forming a table over four variables and summing out x5. This incurs a cost of O(|X|^4). Each of the other summations was cheaper, so the overall computational complexity is O(|X|^4), an improvement over the naive brute-force approach's complexity of O(|X|^5). But can we do better? In fact, we can! Think about what happens if we use elimination ordering (4, 5, 3, 2, 1).

What about if we have some observations? For example, let's say that we observed x3 = a. Then the above procedure still works, provided that we force the value of x3 to be a and do not sum over x3. The normalization step at the end ensures that we have a valid conditional probability distribution, corresponding to the posterior belief for x1 given x3 = a. Another way to see this is to note that observing x3 = a is equivalent to placing a delta-function singleton potential on x3, namely δ3(x3) = 1 if x3 = a and 0 otherwise; then running the Elimination Algorithm as usual gives us the posterior of x1 given x3.

7.2 The Elimination Algorithm

As we saw in our example, the Elimination Algorithm is just about marginalizing out
variables in a particular order, at each step summing only over the relevant potentials
and messages.
In general, the key idea is to maintain a list of active potentials Φ. This list starts off as the collection of all potential functions for our graph, including possibly the delta-function singleton potentials on observed nodes. Each time we eliminate a variable by summing it out, at least one potential function gets removed from the list of active potentials and a new message (which acts as a potential) gets added to the list. In the earlier example, upon eliminating x5, potentials ψ25 and ψ345 were removed from the list of active potentials whereas m5 was added to the list. With this bookkeeping in mind, we present the Elimination Algorithm:

Input: Potentials ψc for c ∈ C, subset A to compute the marginal pxA(·) over, and an elimination ordering I
Output: Marginal pxA(·)

Initialize the active potentials Φ to be the set of input potentials.
for node i in I that is not in A do
    Let Si be the set of all nodes (not including i) that share a potential with node i.
    Let Φi be the set of potentials in Φ involving xi.
    Compute
        mi(xSi) = Σ_{xi} ∏_{φ ∈ Φi} φ(xi, xSi).
    Remove the elements of Φi from Φ.
    Add mi to Φ.
end
Normalize:
    pxA(xA) ∝ ∏_{φ ∈ Φ} φ(xA).

Algorithm 1: The Elimination Algorithm
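The following is a minimal Python sketch of Algorithm 1 for discrete variables. The data structures (potentials as dictionaries keyed by joint assignments) are an illustrative choice, not something prescribed by the notes:

import itertools

def eliminate(potentials, domain, order, keep):
    # `potentials` is a list of (vars, table) pairs, where `vars` is a tuple of
    # variable names and `table` maps each joint assignment (a tuple, in the
    # same order as `vars`) to a nonnegative number.  `domain` is the alphabet X,
    # `order` the elimination ordering I, `keep` the subset A.
    active = list(potentials)
    for i in order:
        if i in keep:
            continue
        involved = [p for p in active if i in p[0]]          # Phi_i
        active = [p for p in active if i not in p[0]]
        S = sorted({v for vs, _ in involved for v in vs if v != i})   # S_i
        msg = {}
        for assign in itertools.product(domain, repeat=len(S)):
            ctx = dict(zip(S, assign))
            total = 0.0
            for xi in domain:                                # sum over x_i
                ctx[i] = xi
                prod = 1.0
                for vs, tab in involved:                     # product over Phi_i
                    prod *= tab[tuple(ctx[v] for v in vs)]
                total += prod
            msg[assign] = total
        active.append((tuple(S), msg))                       # add m_i to Phi
    # Multiply the remaining potentials and normalize over the kept variables.
    keep = sorted(keep)
    out = {}
    for assign in itertools.product(domain, repeat=len(keep)):
        ctx = dict(zip(keep, assign))
        prod = 1.0
        for vs, tab in active:
            prod *= tab[tuple(ctx[v] for v in vs)]
        out[assign] = prod
    Z = sum(out.values())
    return {k: v / Z for k, v in out.items()}

# Hypothetical usage on the running graph of Figure 1 with binary variables and
# uniform potentials (chosen only to exercise the code):
# psi = lambda vs: (vs, {a: 1.0 for a in itertools.product((0, 1), repeat=len(vs))})
# pots = [psi(('x1', 'x2')), psi(('x1', 'x3')), psi(('x2', 'x5')), psi(('x3', 'x4', 'x5'))]
# print(eliminate(pots, (0, 1), ['x5', 'x4', 'x3', 'x2', 'x1'], {'x1'}))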


Let's analyze the algorithm's complexity. At each step i, we have a table of size |X|^{|Si|}. To fill in each element, we sum |X| terms, each a product of |Φi| factors. This means that step i, which results in computing message mi, requires O(|X|^{|Si|} · |X| · |Φi|) = O(|Φi| · |X|^{|Si|+1}) operations. Thus, the total complexity is

$$\sum_i O\big(|\Phi_i| \cdot |X|^{|S_i|+1}\big).$$

We can upper-bound |Φi| with |C| and |Si| with max_i |Si| to obtain

$$\sum_i O\big(|\Phi_i| \cdot |X|^{|S_i|+1}\big) = \sum_i O\big(|C| \cdot |X|^{\max_i |S_i|+1}\big) = O\big(N\, |C|\, |X|^{\max_i |S_i|+1}\big),$$

recalling that big-O is an upper bound. Note that the quantity max_i |Si| dictates how efficient the Elimination Algorithm will be. We next look at a way to compute max_i |Si|.

7.3 Reconstituted Graph

To compute max_i |Si|, we will look at what new edges we introduced during marginalization. Returning to our running example graph from Figure 1 and using elimination ordering (5, 4, 3, 2, 1), upon summing out x5 we are left with potentials ψ12, ψ13, m5(x2, x3, x4), which can be viewed as a new graph where the m5 term couples x2, x3, and x4, i.e., effectively we have added edges (x2, x3) and (x2, x4).

In general, when eliminating a node, the node sends a message that couples all
its neighbors into a clique. With this analysis, in our running example:
After eliminating x5 : make {x2 , x3 , x4 } into a clique (add edges (x2 , x3 ) and
(x2 , x4 ))
After eliminating x4 : make {x2 , x3 } into a clique (already done, so no edge
added)
After eliminating x3 : make {x1 , x2 } into a clique (already done, so no edge
added)
After eliminating x2 : nothing left to do
By introducing the added edges into our original graph from Figure 1, we arrive
at what is called the reconstituted graph:

[Figure: the reconstituted graph, i.e., the graph from Figure 1 together with the added edges (x2, x3) and (x2, x4).] Here, we denote the added edges via dotted lines.


The reconstituted graph has two key properties. First, the largest clique size in the reconstituted graph is max_i |Si| + 1 (note that Si does not include node i), matching our complexity analysis earlier. Second, the reconstituted graph is always chordal, which means that if we want to make a graph chordal, we could just run the Elimination Algorithm and then construct the reconstituted graph. Note that the reconstituted graph depends on the elimination ordering. To see this, try the elimination ordering (4, 5, 3, 2, 1). The minimum complexity achievable using the Elimination Algorithm for a graph is thus governed by min_I max_i |Si|, referred to as the treewidth of the graph.

7.4 Grid graph

We end with a neat result, which will be stated without proof. Consider the grid graph over N = n² nodes, laid out as an n × n lattice. Brute-force marginalization requires O(|X|^N) operations. However, it is possible to choose an elimination ordering where max_i |Si| = √N, using a zig-zag pattern that sweeps the grid one column at a time (as indicated by the arrow in the original figure). In fact, any planar graph has an elimination ordering such that max_i |Si| = O(√N).


8 Inference on trees: sum-product algorithm

Recall the two fundamental inference problems in graphical models:


1. marginalization, i.e. computing the marginal distribution pxi(xi) for some variable xi;
2. mode, or maximum a posteriori (MAP) estimation, i.e. finding the most likely joint assignment to the full set of variables x.
In the last lecture, we introduced the elimination algorithm for performing marginalization in graphical models. We saw that the elimination algorithm always produces an exact solution and can be applied to any graph structure. Furthermore, it is not entirely naive, in that it makes use of the graph structure to save computation. For instance, recall that in the grid graph its computational complexity grows as O(N |X|^{√N}), while the naive brute-force computation requires O(|X|^N) operations.
However, if we wish to retrieve the marginal distributions for multiple different variables, we would need to run the elimination algorithm from scratch each time. This would entail much redundant computation. It turns out that we can recast the computations in the elimination algorithm as messages passed between nodes in the network, and that these messages can be reused between different marginalization queries. The resulting algorithm is known as the sum-product algorithm. In this lecture, we look at the special case of sum-product on trees. In later lectures, we will extend the algorithm to graphs in general. We will also see that a similar algorithm can be applied to obtain the MAP estimate.

8.1 Elimination algorithm on trees

Recall that a graph G is a tree if any two nodes are connected by exactly one path. This definition implies that all tree graphs have exactly N − 1 edges and no cycles. Throughout this lecture, we will use the recurring example of the tree graphical model shown in Figure 1. Suppose we wish to obtain the marginal distribution px1 using the elimination algorithm. The efficiency of the algorithm depends on the elimination ordering chosen. For instance, if we choose the suboptimal ordering (2, 4, 5, 3, 1), the elimination algorithm produces the sequence of graph structures shown in Figure 2. After the first step, all of the neighbors of x2 are connected, resulting in a table of size |X|³. For a more drastic example of a suboptimal ordering, consider the star graph shown in Figure 3. If we eliminate x1 first, the resulting graph is a fully connected graph over the remaining variables, resulting in a table of size |X|^{N−1}. This is clearly undesirable.

Figure 1: A tree-structured graphical model which serves as a recurring example throughout this lecture: x1 is connected to x2 and x3, and x2 is connected to x4 and x5.

Figure 2: The sequence of graph structures obtained from the elimination algorithm on the graph from Figure 1 using the suboptimal ordering (2, 4, 5, 3, 1).

Figure 3: A star-shaped graph and the resulting graph after eliminating the center variable x1.

Figure 4: The sequence of graph structures and messages obtained from the elimination algorithm on the graph from Figure 1 using an optimal ordering (4, 5, 3, 2, 1).
Fortunately, for tree graphs, it is easy to find an ordering which adds no edges. Recall that, in the last lecture, we saw that starting from the edges of a graph is a good heuristic. In the case of trees, edges correspond to leaf nodes, which suggests starting from the leaves. In fact, each time we eliminate a variable, the graph remains a tree, so we can choose an ordering by iteratively removing a leaf node. (Note that the root node must come last in the ordering; however, it is easy to show that every tree has at least two leaves.) In the case of the graph from Figure 1, one such ordering is (4, 5, 3, 2, 1). The resulting graphs and messages are shown in Figure 4.

In a tree graph, the maximal cliques are exactly the edges. Therefore, by the Hammersley-Clifford Theorem, if we assume the joint distribution is strictly positive, we can represent it (up to normalization) as a product of potentials ψij(xi, xj) for each edge (i, j). However, it is often convenient to include unary potentials as well, so we will assume a redundant representation with unary potentials φi(xi) for each variable xi. In other words, we assume the factorization
$$p_x(x) = \frac{1}{Z} \prod_{i \in V} \phi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j). \tag{1}$$

The messages produced in the course of the algorithm are:

$$\begin{aligned}
m_4(x_2) &= \sum_{x_4} \phi_4(x_4)\,\psi_{24}(x_2, x_4) \\
m_5(x_2) &= \sum_{x_5} \phi_5(x_5)\,\psi_{25}(x_2, x_5) \\
m_3(x_1) &= \sum_{x_3} \phi_3(x_3)\,\psi_{13}(x_1, x_3) \\
m_2(x_1) &= \sum_{x_2} \phi_2(x_2)\,\psi_{12}(x_1, x_2)\, m_4(x_2)\, m_5(x_2) \qquad (2)
\end{aligned}$$

Finally, we obtain the marginal distribution over x1 by multiplying the incoming messages with its unary potential, and then normalizing. In particular,

$$p_{x_1}(x_1) \propto \phi_1(x_1)\, m_2(x_1)\, m_3(x_1). \tag{3}$$

Figure 5: All messages required to compute the marginals px1(x1) and px3(x3) for the graph in Figure 1.
Now consider the computational complexity of this algorithm. Each of the messages produced has |X| values, and computing each value requires summing over |X| terms. Since this must be done for each of the N − 1 edges in the graph, the total complexity is O(N |X|²), i.e. linear in the graph size and quadratic in the alphabet size.¹

8.2 Sum-product algorithm on trees

Returning to Figure 1, suppose we want to compute the marginal for another variable, x3. If we use the elimination ordering (5, 4, 2, 1, 3), the resulting messages are:

$$\begin{aligned}
m_5(x_2) &= \sum_{x_5} \phi_5(x_5)\,\psi_{25}(x_2, x_5) \\
m_4(x_2) &= \sum_{x_4} \phi_4(x_4)\,\psi_{24}(x_2, x_4) \\
m_2(x_1) &= \sum_{x_2} \phi_2(x_2)\,\psi_{12}(x_1, x_2)\, m_4(x_2)\, m_5(x_2) \\
m_1(x_3) &= \sum_{x_1} \phi_1(x_1)\,\psi_{13}(x_1, x_3)\, m_2(x_1) \qquad (4)
\end{aligned}$$

Notice that the first three messages m5, m4, and m2 are all strictly identical to the corresponding messages from the previous computation. The only new message to be computed is m1(x3), as shown in Figure 5.
As this example suggests, we can obtain the marginals for every node in the graph by computing 2(N − 1) messages, one for each direction along each edge. When computing the message m_{i→j}(xj), we need the incoming messages m_{k→i}(xi) from its other neighbors k ∈ N(i) \ {j}, as shown in Figure 6. (Note: the \ symbol denotes the difference of two sets, and N(i) denotes the neighbors of node i.) Therefore, we need to choose an ordering over messages such that the prerequisites are available at each step. One way to do this is through the following two-step procedure.

¹ Note that this analysis does not include the time for computing the products in each of the messages. A naive implementation would, in fact, have higher computational complexity. However, with the proper bookkeeping, we can reuse computations between messages to ensure that the total complexity is O(N |X|²).

Figure 6: The message m_{i→j}(xj) depends on each of the incoming messages m_{k→i}(xi) from xi's other neighbors k ∈ N(i) \ {j}.
Choose an arbitrary node i as the root, and generate messages going towards it
using the elimination algorithm described in Section 8.1.
Compute the remaining messages, working outwards from the root.
We now combine these insights into the sum-product algorithm on trees. Messages are computed in the order given above using the rule:

$$m_{i \to j}(x_j) = \sum_{x_i} \phi_i(x_i)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i). \tag{5}$$

Note that, in order to disambiguate messages sent from node i, we explicitly write m_{i→j}(xj) rather than simply mi(xj). Then, the marginal for each variable is obtained using the formula:

$$p_{x_i}(x_i) \propto \phi_i(x_i) \prod_{j \in N(i)} m_{j \to i}(x_i). \tag{6}$$

We note that the sum-product algorithm can also be viewed as a dynamic programming algorithm for computing marginals. This view will become clearer when we discuss hidden Markov models.
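The serial algorithm of equations (5) and (6) is short enough to write out directly. The sketch below is a memoized recursive implementation; the dictionary-based potential representation is an assumption made for illustration only:

import math

def sum_product_tree(nodes, edges, phi, psi, domain):
    # phi[i][x]        : unary potential phi_i(x)
    # psi[(i, j)][x, y]: pairwise potential psi_ij(x, y), stored once per edge
    # Returns the normalized marginal p_{x_i} for every node i.
    nbrs = {i: set() for i in nodes}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)

    def pair(i, j, xi, xj):
        return psi[(i, j)][xi, xj] if (i, j) in psi else psi[(j, i)][xj, xi]

    msgs = {}

    def send(i, j):
        # Message m_{i->j}(x_j) of equation (5).  Because the graph is a tree,
        # the recursion bottoms out at the leaves, and memoization ensures each
        # directed message is computed exactly once.
        if (i, j) not in msgs:
            msgs[(i, j)] = {
                xj: sum(phi[i][xi] * pair(i, j, xi, xj) *
                        math.prod(send(k, i)[xi] for k in nbrs[i] - {j})
                        for xi in domain)
                for xj in domain}
        return msgs[(i, j)]

    marginals = {}
    for i in nodes:                      # equation (6), then normalize
        unnorm = {xi: phi[i][xi] * math.prod(send(k, i)[xi] for k in nbrs[i])
                  for xi in domain}
        Z = sum(unnorm.values())
        marginals[i] = {xi: v / Z for xi, v in unnorm.items()}
    return marginals

# Hypothetical usage on the tree of Figure 1 with binary variables:
# nodes = [1, 2, 3, 4, 5]; edges = [(1, 2), (1, 3), (2, 4), (2, 5)]
# phi = {i: {0: 1.0, 1: 1.0} for i in nodes}
# psi = {e: {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0} for e in edges}
# print(sum_product_tree(nodes, edges, phi, psi, (0, 1)))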

8.3 Parallel sum-product

The sum-product algorithm as described in Section 8.2 is inherently sequential: the messages must be computed in sequence to ensure that the prerequisites are available at each step. However, the algorithm was described in terms of simple, local operations corresponding to different variables, which suggests that it might be parallelizable. This intuition turns out to be correct: if the updates (5) are repeatedly applied in parallel, it is possible to show that the messages will eventually converge to their correct values. More precisely, letting m^t_{i→j}(xj) denote the messages at time step t, we apply the following procedure:
1. Initialize all messages m⁰_{i→j}(xj) = 1 for all (i, j) ∈ E.
2. Iteratively apply the update

$$m^{t+1}_{i \to j}(x_j) = \sum_{x_i} \phi_i(x_i)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m^{t}_{k \to i}(x_i). \tag{7}$$

Intuitively, this procedure resembles fixed point algorithms for solving equations of the form f(x) = x. Fixed point algorithms choose some initial value x⁰ and then iteratively apply the update x^{t+1} = f(x^t). We can view (5) not as an update rule, but as a set of equations to be satisfied. The rule (7) can then be viewed as a fixed point update for (5). You will prove in a homework exercise that this procedure will converge to the correct messages (5) in d iterations, where d is the diameter of the tree (i.e. the length of the longest path).

Note that this parallel procedure entails significant overhead: each iteration of the algorithm requires computing the messages associated with every edge. We saw in Section 8.1 that this requires O(N |X|²) time. This is the price we pay for parallelism. Parallel sum-product is unlikely to pay off in practice unless the diameter of the tree is small. However, in a later lecture we will see that it naturally leads to loopy belief propagation, where the update rule (7) is applied to a graph which isn't a tree.

8.4 Efficient implementation

In our complexity analysis from Section 8.1, we swept under the rug the details of exactly how the messages are computed. Consider the example of the star graph shown in Figure 3, where this time we intelligently choose the center node x1 as the root. When we compute the outgoing messages m_{1→j}(xj), we must first multiply together all the incoming messages m_{k→1}(x1). Since there are N − 2 of these, the product requires roughly |X| N computations. This must be done for each of the N − 1 outgoing messages, so these products contribute approximately |X| N² computations in total. This is quadratic in N, which is worse than the linear dependency we stated earlier. More generally, for a tree graph G, these products require O(|X| Σ_i d_i²) time, where d_i is the degree (number of neighbors) of node i.
However, in parallel sum-product, we can share computation between the different messages by computing them simultaneously as follows:

1. Compute

$$\tilde{m}^{t}_{i}(x_i) = \phi_i(x_i) \prod_{k \in N(i)} m^{t}_{k \to i}(x_i). \tag{8}$$

2. For all j ∈ N(i), compute

$$m^{t+1}_{i \to j}(x_j) = \sum_{x_i \in X} \psi_{ij}(x_i, x_j)\, \frac{\tilde{m}^{t}_{i}(x_i)}{m^{t}_{j \to i}(x_i)}. \tag{9}$$

Using this algorithm, each update (8) can be computed in O(|X| d_i) time, so the cost of these precomputations per iteration is O(|X| Σ_i d_i) = O(|X| N). Computing (9) still takes O(|X|²) per message, and there are O(N) messages, so the overall running time is O(|X|² N) per iteration, or O(|X|² N d) total. (Recall that d is the diameter of the graph.) A similar strategy can be applied to the sequential algorithm to achieve a running time of O(|X|² N).

9 Forward-backward algorithm, sum-product on factor graphs
The previous lecture introduced belief propagation (sum-product), an efficient inference algorithm for tree-structured graphical models. In this lecture, we specialize it further to the so-called hidden Markov model (HMM), a model which is very useful in practice for problems with temporal structure.

9.1 Example: convolution codes

We motivate our discussion of HMMs with a kind of code for communication called a convolution code. In general, the problem of communication is that the sender would like to send a message m, represented as a bit string, to a receiver. The message may be corrupted along the way, so we need to introduce redundancy into the message so that it can be reconstructed accurately even in the presence of noise. To do this, the sender sends a coded message b over a noisy channel. The channel introduces some noise (e.g. by flipping random bits). The receiver receives the received message y and then applies a decoding procedure to get the decoded message m̂. A schematic is shown in Figure 1. Clearly, we desire a coding scheme where m̂ = m with high probability, b is not much larger than m, and m̂ can be efficiently computed from y.

We now discuss one example of a coding scheme, called a convolution code. Suppose the message m consists of N bits. The coded message b will consist of 2N − 1 bits, alternating between the following:

The odd-numbered bits b_{2i−1} repeat the message bits m_i exactly.
The even-numbered bits b_{2i} are the XOR of message bits m_i and m_{i+1}, denoted m_i ⊕ m_{i+1}.

The ratio of the lengths of m and b is called the rate of the code, so this convolution code is a rate-1/2 code, i.e. for every coded message bit, it conveys 1/2 message bit.

We assume an error model called a binary symmetric channel: each of the bits of the coded message is independently flipped with probability ε. We can represent this as a directed graphical model as shown in Figure 2. Note that from the receiver's perspective, only the y_i's are observed, and the task is to infer the m_i's.

In order to perform inference, we must convert this graph into an undirected graphical model. Unfortunately, the straightforward construction, where we moralize the graph, does not result in a tree structure, because of the cliques over m_i, m_{i+1}, and b_{2i}. Instead, we coarsen the representation by combining nodes into supernodes. In particular, we will combine all of the adjacent message bits into variables m_i m_{i+1}, and
Figure 1: A schematic representation of the problem setup for convolution codes: the message m is encoded as the coded message b, corrupted by a binary symmetric channel into the received message y, and then decoded as m̂.

Figure 2: Our convolution code can be represented as a directed graphical model, with message bits m_i, coded bits b_j, and noisy observations y_j.


we will combine pairs of adjacent received message bits y_{2i−1} y_{2i}, as shown in Figure 3. This results in a tree-structured directed graph, and therefore an undirected tree graph; now we can perform sum-product.

9.2 Hidden Markov models

Observe that the graph in Figure 3 is Markov in its hidden states. More generally, a hidden Markov model (HMM) is a graphical model with the structure shown in Figure 4. Intuitively, the variables x_i represent a state which evolves over time and which we don't get to observe, so we refer to them as the hidden state. The variables y_i are signals which depend on the state at the same time step, and in most applications are observed, so we refer to them as observations.

From the definition of directed graphical models, we see that the HMM represents the factorization property

$$P(x_1, \ldots, x_N, y_1, \ldots, y_N) = P(x_1) \prod_{i=2}^{N} P(x_i \mid x_{i-1}) \prod_{j=1}^{N} P(y_j \mid x_j). \tag{1}$$

Observe that we can convert this to the undirected representation shown in Figure 4(b) by taking each of the terms in this product to be a potential. This allows us to

Figure 3: (a) The directed graph from Figure 2 can be converted to a tree-structured graph with combined (supernode) variables m_i m_{i+1} and y_{2i−1} y_{2i}. (b) The equivalent undirected graph.

Figure 4: (a) A hidden Markov model, with hidden states x_1, . . . , x_N and observations y_1, . . . , y_N; (b) its undirected equivalent.

Figure 5: Representation of an HMM as a chain graph over x_1, . . . , x_N, with node potentials φ_i(x_i) and edge potentials ψ_{i,i+1}(x_i, x_{i+1}).


perform inference using sum product on trees. In particular, our goal is typically to
infer the marginal distribution for each of the hidden states given all of the observa
tions. By plugging in the values of the observations, we can convert the HMM to a
chain graph, as shown in Figure 5. This graphical model has two sets of potentials:
1 (x1 ) = P(x1 , y1 )
i (xi ) = P(yi |xi ), i {2, . . . , N }
i,i+1 (xi , xi+1 ) = P(xi+1 |xi ), i {1, . . . , N 1}

(2a)
(2b)
(2c)

The resulting sum-product messages are:

$$\begin{aligned}
m_{1 \to 2}(x_2) &= \sum_{x_1} \phi_1(x_1)\,\psi_{12}(x_1, x_2) & (3) \\
m_{i \to i+1}(x_{i+1}) &= \sum_{x_i} \phi_i(x_i)\,\psi_{i,i+1}(x_i, x_{i+1})\, m_{i-1 \to i}(x_i), \quad 2 \le i \le N-1 & (4) \\
m_{N \to N-1}(x_{N-1}) &= \sum_{x_N} \phi_N(x_N)\,\psi_{N-1,N}(x_{N-1}, x_N) & (5) \\
m_{i \to i-1}(x_{i-1}) &= \sum_{x_i} \phi_i(x_i)\,\psi_{i-1,i}(x_{i-1}, x_i)\, m_{i+1 \to i}(x_i), \quad 2 \le i \le N-1 & (6)
\end{aligned}$$

Belief propagation on HMMs is also known as the forward-backward algorithm. You can easily show that

$$m_{1 \to 2}(x_2) = \sum_{x_1} \phi_1(x_1)\,\psi_{12}(x_1, x_2) = \sum_{x_1} P(x_1, y_1)\, P(x_2 \mid x_1) = \sum_{x_1} P(x_1, y_1, x_2) = P(y_1, x_2),$$

$$m_{2 \to 3}(x_3) = \sum_{x_2} \phi_2(x_2)\,\psi_{23}(x_2, x_3)\, m_{1 \to 2}(x_2) = \sum_{x_2} P(y_2 \mid x_2)\, P(x_3 \mid x_2)\, P(y_1, x_2) = \sum_{x_2} P(x_3, x_2, y_2, y_1) = P(y_1, y_2, x_3).$$
Continuing in this fashion, we can show that

$$m_{i-1 \to i}(x_i) = P(y_1, y_2, \ldots, y_{i-1}, x_i).$$

Similarly,

$$m_{N \to N-1}(x_{N-1}) = \sum_{x_N} \phi_N(x_N)\,\psi_{N-1,N}(x_{N-1}, x_N) = \sum_{x_N} P(y_N \mid x_N)\, P(x_N \mid x_{N-1}) = \sum_{x_N} P(x_N, y_N \mid x_{N-1}) = P(y_N \mid x_{N-1}),$$

$$m_{N-1 \to N-2}(x_{N-2}) = \sum_{x_{N-1}} \phi_{N-1}(x_{N-1})\,\psi_{N-2,N-1}(x_{N-2}, x_{N-1})\, m_{N \to N-1}(x_{N-1}) = \sum_{x_{N-1}} P(y_{N-1} \mid x_{N-1})\, P(x_{N-1} \mid x_{N-2})\, P(y_N \mid x_{N-1}) = P(y_{N-1}, y_N \mid x_{N-2}).$$

Continuing in this fashion, we get:

$$m_{i+1 \to i}(x_i) = P(y_{i+1}, y_{i+2}, \ldots, y_N \mid x_i).$$
9.2.1 α-β Forward-Backward Algorithm and Probabilistic Interpretation of Messages

As we have seen, belief propagation on HMMs takes the form of a two-pass algorithm, consisting of a forward pass and a backward pass. In fact there is considerable flexibility in how computation is structured in the algorithm, even with the two-pass structure. As a result, we refer to this algorithm and its variants collectively as the forward-backward algorithm.

To illustrate how the computation can be rearranged in useful ways, in this section we highlight one variant of the forward-backward algorithm, termed the α-β version for reasons that will become apparent. The α-β forward-backward algorithm was also among the earliest versions developed.
To see the rearrangement of interest, first note that each posterior marginal of interest p_{x_i | y_1, . . . , y_N} is proportional to p_{x_i, y_1, . . . , y_N}, and that from the graphical model in Figure 4, (y_1, . . . , y_i) are d-separated from (y_{i+1}, . . . , y_N) given x_i. Therefore, the joint distribution factorizes as:

$$p_{y_1, \ldots, y_N, x_i} = p_{y_1, \ldots, y_i, x_i}\; p_{y_{i+1}, \ldots, y_N \mid x_i, y_1, \ldots, y_i} \tag{7}$$
$$= p_{y_1, \ldots, y_i, x_i}\; p_{y_{i+1}, \ldots, y_N \mid x_i}. \tag{8}$$

Now we derive recursive update rules for computing each of the two parts. First, the forward messages α:

$$\begin{aligned}
\alpha_i(x_i) &= P(y_1, \ldots, y_i, x_i) & (9) \\
&= \sum_{x_{i-1}} P(y_1, \ldots, y_i, x_i, x_{i-1}) & (10) \\
&= \sum_{x_{i-1}} P(y_1, \ldots, y_{i-1}, x_{i-1})\, P(x_i \mid x_{i-1})\, P(y_i \mid x_i) & (11) \\
&= \sum_{x_{i-1}} \alpha_{i-1}(x_{i-1})\, P(x_i \mid x_{i-1})\, P(y_i \mid x_i), \quad i \in \{2, \ldots, N\}, & (12)
\end{aligned}$$

where α_1(x_1) = P(x_1, y_1).


Then, the backward messages β:

$$\begin{aligned}
\beta_i(x_i) &= P(y_{i+1}, \ldots, y_N \mid x_i) & (13) \\
&= \sum_{x_{i+1}} P(x_{i+1}, y_{i+1}, \ldots, y_N \mid x_i) & (14) \\
&= \sum_{x_{i+1}} P(y_{i+1} \mid x_{i+1})\, P(x_{i+1} \mid x_i)\, P(y_{i+2}, \ldots, y_N \mid x_{i+1}) & (15) \\
&= \sum_{x_{i+1}} P(y_{i+1} \mid x_{i+1})\, P(x_{i+1} \mid x_i)\, \beta_{i+1}(x_{i+1}), \quad i \in \{1, \ldots, N-1\}, & (16)
\end{aligned}$$

where β_N(x_N) = 1.

Now consider the relationship between α and β and our sum-product messages. The α forward message corresponds to part of the formula for the forward message m:

$$m_{i \to i+1}(x_{i+1}) = P(y_1, \ldots, y_i, x_{i+1}) = \sum_{x_i} P(y_1, \ldots, y_i, x_i, x_{i+1}) = \sum_{x_i} P(x_{i+1} \mid x_i)\, P(y_1, \ldots, y_i, x_i) = \sum_{x_i} P(x_{i+1} \mid x_i)\, \alpha_i(x_i). \tag{17}$$

Observe that the α messages and the forward messages in belief propagation perform exactly the same series of sums and products, but they divide up the steps differently. Following an analogous line of reasoning, we find that the backward messages are identical, i.e. β_i(x_i) = m_{i+1→i}(x_i). This gives a useful probabilistic interpretation to the sum-product messages.
We can see directly that the marginal computation corresponding to

$$\gamma_i(x_i) \;\triangleq\; p_{x_i}(x_i \mid y_1, \ldots, y_N) \;\propto\; p_{x_i, y_1, \ldots, y_N}(x_i, y_1, \ldots, y_N) = \phi_i(x_i)\, m_{i-1 \to i}(x_i)\, m_{i+1 \to i}(x_i)$$

in the sum-product algorithm is, in terms of the α and β messages,

$$\gamma_i(x_i) = \frac{\alpha_i(x_i)\, \beta_i(x_i)}{P(y_1, \ldots, y_N)}.$$
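The α and β recursions (12) and (16), together with the γ formula above, translate directly into a few lines of NumPy. The argument names below (pi, T, lik) are hypothetical; note also that a practical implementation would rescale α and β at each step to avoid underflow:

import numpy as np

def forward_backward(pi, T, lik):
    # pi  : length-K vector, P(x_1)
    # T   : K x K matrix, T[a, b] = P(x_{i+1} = b | x_i = a)
    # lik : N x K matrix, lik[i, a] = P(y_i | x_i = a) for the observed y_i
    # Returns gamma[i, a] = P(x_i = a | y_1, ..., y_N) and the likelihood P(y).
    N, K = lik.shape
    alpha = np.zeros((N, K))
    beta = np.ones((N, K))
    alpha[0] = pi * lik[0]                       # alpha_1(x_1) = P(x_1, y_1)
    for i in range(1, N):                        # forward pass, eq. (12)
        alpha[i] = (alpha[i - 1] @ T) * lik[i]
    for i in range(N - 2, -1, -1):               # backward pass, eq. (16)
        beta[i] = T @ (lik[i + 1] * beta[i + 1])
    evidence = alpha[-1].sum()                   # P(y_1, ..., y_N)
    gamma = alpha * beta / evidence
    return gamma, evidence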
Figure 6: An example of factor graph belief propagation: variable x1 is connected to factors fa and fb, factor fb also involves x2 and x3, and x3 is additionally connected to factor fc.


In turn, we note that the likelihood P(y_1, . . . , y_N) can be obtained from the messages at any node:

$$P(y_1, \ldots, y_N) = \sum_{x_i} \alpha_i(x_i)\, \beta_i(x_i), \quad \forall i.$$

One appealing feature of this form of the forward-backward algorithm is that the α messages are, themselves, marginals that are often directly useful in applications. In particular,

α_i(x_i) ∝ P(x_i | y_1, . . . , y_i),

i.e., α_i(x_i) represents the marginal at state node i using data up to observation node i, and as such represents a causal marginal of particular interest in real-time applications. Sometimes these are referred to as filtered marginals, for reasons that we will come back to when we revisit the forward-backward algorithm in the context of Gaussian HMMs.
Finally, it should be noted that several other closely related variants of the forward-backward algorithm are also possible. One is the so-called α-γ version, whereby the recursion for β is replaced with a recursion for γ, which has the benefit of producing the desired marginals directly as messages, and the additional benefit that each piece of data y_i is used only once, in the forward pass. Specifically, the recursion for the γ messages can be shown to be (see, e.g., Jordan's notes, Chapter 12):

$$\gamma_i(x_i) = \sum_{x_{i+1}} \frac{P(x_{i+1} \mid x_i)\,\alpha_i(x_i)}{\sum_{x_i'} P(x_{i+1} \mid x_i')\,\alpha_i(x_i')}\; \gamma_{i+1}(x_{i+1}).$$

9.3 Sum-product on factor graphs

We now consider how to perform the sum-product algorithm on a slightly more general class of graphical models, tree-structured factor graphs. Observe that this is a strictly larger set of graphs than undirected or directed trees. In either of these two cases, there is one factor for each edge in the original graph, so their equivalent factor graph representation is still a tree. However, some non-tree-structured directed or undirected graphs may have tree-structured factor graph representations. One important example is polytrees: recall that these are directed graphs which are tree-structured when we ignore the directions of the edges. For instance, observe that the convolution code directed graph shown in Figure 2 is a polytree, even though its undirected representation (obtained by moralizing) is not a tree. Using factor graph belief propagation, it is possible to perform inference in this graph without resorting to the supernode representation of Section 9.1.
In factor graph belief propagation, messages are sent between variable nodes and factor nodes. As in undirected belief propagation, each node sends messages to one neighbor by multiplying and/or summing messages from its other neighbors. Factor nodes a multiply incoming messages by their factor and sum out all but the relevant variable i:

$$m_{a \to i}(x_i) = \sum_{x_{N(a) \setminus \{i\}}} f_a(x_i, x_{N(a) \setminus \{i\}}) \prod_{j \in N(a) \setminus \{i\}} m_{j \to a}(x_j). \tag{18}$$

The variable nodes simply multiply together their incoming messages:

$$m_{i \to a}(x_i) = \prod_{b \in N(i) \setminus \{a\}} m_{b \to i}(x_i). \tag{19}$$

For instance, consider Figure 6. The messages required to compute px1(x1) are:

$$\begin{aligned}
m_{a \to 1}(x_1) &= f_a(x_1) & (20) \\
m_{2 \to b}(x_2) &= 1 & (21) \\
m_{c \to 3}(x_3) &= f_c(x_3) & (22) \\
m_{3 \to b}(x_3) &= m_{c \to 3}(x_3) & (23) \\
m_{b \to 1}(x_1) &= \sum_{x_2, x_3} f_b(x_1, x_2, x_3)\, m_{2 \to b}(x_2)\, m_{3 \to b}(x_3) & (24)
\end{aligned}$$
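For this small example, the messages (20) to (24) can be evaluated numerically in a few lines. The factor tables below are hypothetical values chosen only to exercise the equations:

import numpy as np

# Hypothetical binary factors for the graph of Figure 6.
fa = np.array([1.0, 2.0])                                  # f_a(x1)
fc = np.array([3.0, 1.0])                                  # f_c(x3)
fb = np.random.default_rng(0).random((2, 2, 2)) + 0.1      # f_b(x1, x2, x3)

m_a_to_1 = fa                                              # eq. (20)
m_2_to_b = np.ones(2)                                      # eq. (21): leaf variable sends 1
m_c_to_3 = fc                                              # eq. (22)
m_3_to_b = m_c_to_3                                        # eq. (23)
m_b_to_1 = np.einsum('ijk,j,k->i', fb, m_2_to_b, m_3_to_b) # eq. (24): sum out x2, x3

p_x1 = m_a_to_1 * m_b_to_1
p_x1 = p_x1 / p_x1.sum()
print(p_x1)    # matches the brute-force marginal of fa(x1) fb(x1,x2,x3) fc(x3)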


10 Sum-product on factor tree graphs, MAP elimination algorithm
The previous lecture introduced the beginnings of sum-product on factor tree graphs with a specific example. In this lecture, we will describe the sum-product algorithm for general factor tree graphs.

10.1 Sum-product for factor trees

Recall that a factor graph consists of a bipartite graph between variable nodes and factor nodes. The graph represents the factorization of the distribution into factors corresponding to factor nodes in the graph, in other words

$$p(x_1, \ldots, x_N) \propto \prod_{i=1}^{M} f_i(x_{c_i}),$$

where c_i is the set of variables connected to factor f_i. As in the case with undirected graphs, we restrict our attention to a special class of graphs where inference is efficient.

Definition 1 (Factor tree graph). A factor tree graph is a factor graph such that the variable nodes form a tree, that is, between any two variable nodes there is a unique path (involving factors).
10.1.1 Serial version

As in the sum-product algorithm for undirected graphs, there is a serial version of the sum-product algorithm for factor tree graphs. Because there are factor nodes and variable nodes, we have two types of messages:

1. Variable-to-factor messages:

$$m_{i \to a}(x_i) = \prod_{b \in N(i) \setminus \{a\}} m_{b \to i}(x_i).$$

The variable node i with degree d_i sends a message towards a that is the product of d_i − 1 terms, for each value in X. This results in a total of O(|X| d_i) operations.

2. Factor-to-variable messages:

$$m_{a \to i}(x_i) = \sum_{x_j : j \in N(a) \setminus \{i\}} f_a(x_i;\, x_j, j \in N(a) \setminus \{i\}) \prod_{j \in N(a) \setminus \{i\}} m_{j \to a}(x_j).$$

The factor node a with degree d_a sends a message towards i that is a summation over |X|^{d_a−1} terms, each involving a product of d_a terms, for each value in X of the neighboring variable. Thus, the total cost is O(|X|^{d_a} d_a).

In summary, complexity scales as O(|E| |X|^d), where d = max_a d_a (over factor nodes) and |E| is the total number of edges in the factor tree graph. Notably, because the graph is a tree, |E| = O(N).¹
10.1.2 Parallel version

We can also run the updates in parallel to get the parallel version of the sum-product algorithm for factor tree graphs:

1. Initialization: for (i, a) ∈ E,
   m⁰_{i→a}(x_i) = 1 and m⁰_{a→i}(x_i) = 1 for all x_i ∈ X.

2. Update: for (i, a) ∈ E, t ≥ 0,

(*)  m^{t+1}_{i→a}(x_i) = ∏_{b ∈ N(i)\a} m^{t}_{b→i}(x_i)

(**) m^{t+1}_{a→i}(x_i) = Σ_{x_j: j ∈ N(a)\i} f_a(x_i; x_j, j ∈ N(a)\i) ∏_{j ∈ N(a)\i} m^{t}_{j→a}(x_j)

3. Marginalization: for t ≥ diameter,

   p^{t+1}_{x_i}(x_i) ∝ ∏_{a ∈ N(i)} m^{t+1}_{a→i}(x_i).

¹ This is the cost to send all of the messages to a single node. If we wanted to send all of the messages in the other direction, this would naively cost O(N² |X|^d). We can reduce this to O(N |X|^d) using a technique that we show in the next section.

As we noticed above, sending m_{a→i} costs O(|X|^{d_a} d_a), and we have to send a similar message to every neighbor of a, so the total complexity will scale as O(|X|^d Σ_a d_a²), and in the worst case Σ_a d_a² will be quadratic in N.

With a little cleverness, we can provide a more efficient implementation of (*) and (**):

(*) Compute

   m̃^{t+1}_{i}(x_i) = ∏_{b ∈ N(i)} m^{t}_{b→i}(x_i),

then

   m^{t+1}_{i→a}(x_i) = m̃^{t+1}_{i}(x_i) / m^{t}_{a→i}(x_i).

Thus the total cost of computing (*) per variable node i is O(|X| d_i) instead of O(|X| d_i²).

(**) Compute

   f̃^{t+1}_{a}(x_k, k ∈ N(a)) = f_a(x_k; k ∈ N(a)) ∏_{k ∈ N(a)} m^{t}_{k→a}(x_k),

then

   m^{t+1}_{a→i}(x_i) = Σ_{x_j: j ∈ N(a)\i} f̃^{t+1}_{a}(x_k, k ∈ N(a)) / m^{t}_{i→a}(x_i).

Thus the total cost of computing (**) per factor node a is O(|X|^{d_a} d_a).

Thus the overall complexity cost for a single iteration of the parallel algorithm will be O(N |X|^d).

10.2 Factor tree graphs ⊃ undirected tree graphs

The factor tree graph representation is strictly more general than the undirected tree graph representation. Recall that by the Hammersley-Clifford theorem, we can represent any undirected graph as a factor graph by creating a factor node for each clique potential and connecting the respective variable nodes. In the case of an undirected tree graph, the only clique potentials are along edges, so it's clear the factor graph representation will also be a tree. Now, we consider an example where the undirected graph is not a tree, but its factor graph is.

As shown in the figure, the graphical model on the left has 4 maximal cliques. The factor graph representation of it is on the right, and it does not have any loops, even though the undirected graph did. A necessary and sufficient condition for this situation is provided in the solution of Problem 2.2.

On the left-hand graph, we cannot apply the sum-product algorithm because the graph is not a tree, but on the right-hand side we can. What happened here? The factor in the middle of the graph has 3 nodes, and as we saw above, the overall complexity of the algorithm depends strongly on the maximal degree node. Taking this example further, consider converting the complete undirected graph on N vertices into a factor graph. It is a star graph with a single factor node, so it is a tree. Unfortunately, the factor node has N neighbors, so the cost of running sum-product is O(N |X|^N), so we don't get anything for free.

10.3 MAP elimination algorithm

The primary focus of the prior material has been on algorithms for computing marginal distributions. In this section, we will be talking about the second inference problem: computing the mode, or MAP configuration.

Specifically, consider a collection of N random variables x1, . . . , xN, each taking values in X, with their joint distribution represented by a graphical model G. For concreteness, let G = (V, E) be an undirected graphical model, so that the distribution factorizes as

$$p_x(x) \propto \prod_{C \in \mathcal{C}} \psi_C(x_C) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),$$

where 𝒞 is the set of maximal cliques in G and Z is the partition function. Our goal is to compute x* ∈ X^N such that

$$x^* \in \arg\max_{x \in X^N} p_x(x).$$

Note that we have written ∈ instead of = because there may be several configurations that result in the maximum, and we are interested in finding one of them.

The sum-product algorithm gives us a way to compute the marginals pxi(xi), so we might wonder if setting x̂_i ∈ arg max_{x_i} pxi(xi) produces a MAP assignment. This works in situations where the variables are independent, there is only one positive configuration, or when the variables are jointly Gaussian. However, consider the following example with binary variables x1 and x2:
px1,x2(0, 0) = 1/3 − t,   px1,x2(0, 1) = 0,
px1,x2(1, 0) = 1/3,       px1,x2(1, 1) = 1/3 + t,

for a small t > 0. Then arg max_{x1} px1(x1) = 1 and arg max_{x2} px2(x2) = 0, but (1, 0) is not the MAP configuration (which is (1, 1)). It is clear that we need a more general algorithm.
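This counterexample is easy to verify numerically; the snippet below uses the joint table above with an assumed value t = 0.05:

import numpy as np

t = 0.05
p = np.array([[1/3 - t, 0.0],            # rows index x1, columns index x2
              [1/3,     1/3 + t]])

print(p.sum(axis=1).argmax())            # argmax of the marginal p_{x1}: 1
print(p.sum(axis=0).argmax())            # argmax of the marginal p_{x2}: 0
print(np.unravel_index(p.argmax(), p.shape))   # the actual MAP: (1, 1)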

10.4 MAP elimination algorithm example

We will start with a motivating example of the MAP elimination algorithm. Consider a distribution over (x1, . . . , x5) described by the graphical model below, with each x_i ∈ {0, 1} for 1 ≤ i ≤ 5.

[Figure: an undirected graph with edges (1,2), (1,3), (2,4), (3,4), (3,5) and edge parameters θ12, θ13, θ24, θ34, θ35.]

$$p(x) \propto \exp(\theta_{12} x_1 x_2 + \theta_{13} x_1 x_3 + \theta_{24} x_2 x_4 + \theta_{34} x_3 x_4 + \theta_{35} x_3 x_5)$$

For this example, we will consider the specific values θ12 = θ34 = θ35 = 1, θ13 = θ24 = −1. So, plugging in those values, our distribution is

$$p(x) \propto \exp(x_1 x_2 - x_1 x_3 - x_2 x_4 + x_3 x_4 + x_3 x_5) = \frac{1}{Z} \underbrace{\exp(x_1 x_2 - x_1 x_3 - x_2 x_4 + x_3 x_4 + x_3 x_5)}_{F(x_1, \ldots, x_5)}$$

Our goal is to efficiently compute

$$x^* \in \arg\max_{x \in X^N} p_x(x) = \arg\max_{x \in X^N} F(x_1, \ldots, x_5).$$

To efficiently compute this mode x*, let us adapt the elimination algorithm for computing the marginals of a distribution. First, we fix an elimination ordering of the nodes: (5, 4, 3, 2, 1). Then

$$\max_x p_x(x) \propto \max_{x_1, \ldots, x_4} \exp(x_1 x_2 - x_1 x_3 - x_2 x_4 + x_3 x_4)\, \underbrace{\max_{x_5} \exp(x_3 x_5)}_{m_5(x_3)},$$

where we have grouped the terms involving x5 and moved the max over x5 past all of the terms that do not involve x5. We see that

m5(x3) = exp(x3)

here. We also store the specific value of x5 that maximized the expression (e.g. when x3 = 1, m5(x3) = e and x5 = 1), and we will see later why this is important to track. In this case, the maximizer over x5 is 1 if x3 = 1, and either 0 or 1 if x3 = 0. We are free to break ties arbitrarily, so we break the tie so that x5*(x3) = x3:

m5(x3) = exp(x3),   x5*(x3) = x3.

[Figure: the reduced graph on x1, . . . , x4 after eliminating x5, with edges (1,2), (1,3), (2,4), (3,4).]

Now we eliminate x4 in the same way, by grouping the terms involving x4 and moving the max over x4 past terms that do not involve it:

$$\max_x p_x(x) \propto \max_{x_1, x_2, x_3} \exp(x_1 x_2 - x_1 x_3)\, m_5(x_3)\, \underbrace{\max_{x_4} \exp(x_3 x_4 - x_2 x_4)}_{m_4(x_2, x_3)},$$

with

m4(x2, x3) = exp(x3 (1 − x2)),   x4*(x2, x3) = 1 − x2.

[Figure: the reduced graph on x1, x2, x3 after eliminating x4; the message m4(x2, x3) couples x2 and x3.]

Similarly for x3,

$$\begin{aligned}
\max_x p_x(x) &\propto \max_{x_1, x_2} \exp(x_1 x_2) \max_{x_3} \exp(-x_1 x_3)\, m_5(x_3)\, m_4(x_2, x_3) \\
&= \max_{x_1, x_2} \exp(x_1 x_2) \max_{x_3} \exp(-x_1 x_3 + x_3 + x_3(1 - x_2)) \\
&= \max_{x_1, x_2} \exp(x_1 x_2)\, \underbrace{\max_{x_3} \exp\big((2 - x_1 - x_2)\, x_3\big)}_{m_3(x_1, x_2)},
\end{aligned}$$

with

m3(x1, x2) = exp(2 − x1 − x2),   x3*(x1, x2) = 1,

since the coefficient 2 − x1 − x2 is always nonnegative (with a tie only when x1 = x2 = 1, which we break toward x3 = 1).
Finally,

$$\max_x p_x(x) \propto \max_{x_1, x_2} \exp(x_1 x_2)\, m_3(x_1, x_2) = \max_{x_1, x_2} \exp(x_1 x_2 + 2 - x_1 - x_2),$$

which is maximized at x1* = x2* = 0. Now we can use the information we stored along the way to compute the MAP assignment: x3*(0, 0) = 1, x4*(x2 = 0, x3 = 1) = 1, x5*(x3 = 1) = 1. Thus the optimal assignment is

x* = (0, 0, 1, 1, 1).
As in the elimination algorithm for computing marginals, the primary determinant
of the computation cost is the size of the maximal clique in the reconstituted graph.
Putting this together, we can describe the MAP elimination algorithm for general
graphs:

Input: Potentials ψc for c ∈ C and an elimination ordering I
Output: x* ∈ arg max_{x ∈ X^N} px(x)

Initialize the active potentials Φ to be the set of input potentials.
for node i in I do
    Let Si be the set of all nodes (not including i and previously eliminated nodes) that share a potential with node i.
    Let Φi be the set of potentials in Φ involving xi.
    Compute
        mi(xSi) = max_{xi} ∏_{φ ∈ Φi} φ(xi, xSi)
        xi*(xSi) ∈ arg max_{xi} ∏_{φ ∈ Φi} φ(xi, xSi),
    where ties are broken arbitrarily.
    Remove the elements of Φi from Φ.
    Add mi to Φ.
end
Produce x* by traversing I in reverse order:
    x*_j = x*_j(x*_k : j < k ≤ N).

Algorithm 1: The MAP Elimination Algorithm
As in the context of computing marginals, the overall cost of the MAP elimination algorithm is bounded above as

$$\text{overall cost} \;\le\; |C| \sum_i |X|^{|S_i|+1} \;\le\; N\, |C|\, |X|^{\max_i |S_i| + 1}.$$


The MAP elimination algorithm and the elimination algorithm for marginals are closely related: they both rely on the way maxima and sums distribute over products. Specifically, the key property was that

$$\max_{x, y} f(x)\, g(x, y) = \max_x \Big( f(x) \max_y g(x, y) \Big), \qquad
\sum_{x, y} f(x)\, g(x, y) = \sum_x \Big( f(x) \sum_y g(x, y) \Big).$$


11 The Max-Product Algorithm

In the previous lecture, we introduced the MAP elimination algorithm for computing a MAP configuration for any undirected graph. Today, we specialize the algorithm to the case of trees. The resulting algorithm is the Max-Product algorithm, which recovers the max-marginal at each node of a graph over random variables x1, . . . , xN:

$$\bar{p}_{x_i}(x_i) = \max_{x_j : j \ne i} p_x(x).$$

If each x̂i ∈ argmax_{x_i} p̄_{x_i}(x_i) is uniquely attained, i.e., there are no ties at each node, then the vector x̂ = (x̂1, x̂2, . . . , x̂N) is the unique global MAP configuration. Otherwise, keeping track of backpointers indicating what maximizing values neighbors took is needed to recover a global MAP configuration. Note that computing x̂i is computationally not the same as first computing the marginal distribution of xi and then looking at what value xi maximizes pxi(xi), because we are actually maximizing over all other xj for j ≠ i as well!

Max-Product relates to MAP elimination in the same way Sum-Product relates to elimination. In particular, for trees, we can just root the tree at an arbitrary node and eliminate from the leaves up to the root, at which point we can obtain the max-marginal at the root node. We then pass messages from the root back to the leaves to obtain max-marginals at all the nodes.

11.1 Max-Product for Undirected Trees

As usual, for an undirected graph G = (V, E) with V = {1, 2, . . . , N}, we define a distribution over x1, . . . , xN via the factorization

$$p_x(x) \propto \prod_{i \in V} \phi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j).$$

We work through an example to illustrate what happens when G is a tree. The intuition is nearly identical to that of Sum-Product. Consider the undirected graphical model shown below (the same tree as before: x1 connected to x2 and x3, and x2 connected to x4 and x5). We run the MAP elimination algorithm with ordering (5, 4, 3, 2, 1), i.e., effectively we are rooting the tree at node 1.

[Figure: the tree and the sequence of reduced graphs obtained as nodes 5, 4, 3, and 2 are eliminated, with messages m5, m4, m3, m2 attached to the surviving nodes.]

Eliminating nodes 5, 4, and 3, we obtain messages:

$$\begin{aligned}
m_5(x_2) &= \max_{x_5 \in X} \phi_5(x_5)\,\psi_{25}(x_2, x_5) \\
m_4(x_2) &= \max_{x_4 \in X} \phi_4(x_4)\,\psi_{24}(x_2, x_4) \\
m_3(x_1) &= \max_{x_3 \in X} \phi_3(x_3)\,\psi_{13}(x_1, x_3)
\end{aligned}$$


Eliminating node 2 yields the message

$$m_2(x_1) = \max_{x_2} \phi_2(x_2)\,\psi_{12}(x_1, x_2)\, m_4(x_2)\, m_5(x_2).$$

Finally, we obtain the max-marginal at node 1:

$$\bar{p}_{x_1}(x_1) = \phi_1(x_1)\, m_2(x_1)\, m_3(x_1),$$

with maximal point

$$\hat{x}_1 \in \operatorname*{argmax}_{x_1 \in X} \bar{p}_{x_1}(x_1) = \operatorname*{argmax}_{x_1 \in X} \phi_1(x_1)\, m_2(x_1)\, m_3(x_1).$$

To obtain max-marginals for all nodes, we could imagine that we instead choose a different node to act as the root and eliminate from the leaves to the root. As with the Sum-Product case, the key insight is that messages computed previously can be reused. The bottom line is that it suffices to use the same serial schedule as for Sum-Product: pass messages from the leaves to the root and then from the root back to the leaves.

Using the above serial schedule for message passing, in general the messages are computed as:

$$m_{i \to j}(x_j) = \max_{x_i \in X} \phi_i(x_i)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i). \tag{1}$$

Once all messages have been computed in both passes, we compute max-marginals at each node:

$$\bar{p}_{x_i}(x_i) = \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i). \tag{2}$$

As mentioned previously, in general we need to backtrack to obtain a global configuration, which requires us to store the argmax for each message:

$$\delta_{i \to j}(x_j) = \operatorname*{argmax}_{x_i \in X} \phi_i(x_i)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i).$$

Then the backtracking works as follows: let x̂i ∈ argmax_{x_i} p̄_{x_i}(x_i) for an arbitrary node i. For each j ∈ N(i), assign x̂j = δ_{j→i}(x̂i). Recurse until all nodes have been assigned a configuration.
Parallel Max-Product for Undirected Trees

As with Sum-Product, we have a parallel or flood schedule for message passing:

1. Initialize all messages m⁰_{i→j}(xj) = 1 for all (i, j) ∈ E.
2. Iteratively apply the update for t = 0, 1, 2, . . . until convergence:

$$m^{t+1}_{i \to j}(x_j) = \max_{x_i \in X} \phi_i(x_i)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m^{t}_{k \to i}(x_i).$$

3. Compute max-marginals:

$$\bar{p}_{x_i}(x_i) = \phi_i(x_i) \prod_{k \in N(i)} m^{t+1}_{k \to i}(x_i).$$

4. Backtrack to recover a MAP configuration.


As with parallel Sum-Product, we can reduce the amount of computation by precom
puting at the beginning of each iteration of step 2:
Y
m
m ti (xi ) =
mtki (xi ).
kN (i)

Then the update equation in step 2 becomes


mt+1
ij (xj ) = max i (xi )ij (xi , xj )
xi X

m
m ti (xi )
.
t
mji
(xi )

Numerical stability, Max-Sum, and Min-Sum

In the message passing and max-marginal computation equations, we are multiplying many potential function values, which can lead to numerical issues. For example, if the potential functions represent probabilities, e.g., if our undirected graph is actually obtained from a DAG so the potentials are conditional probability distributions, then multiplying many probabilities can result in extremely small values below machine precision. A remedy for this is to work in the log domain.

Denote m̂_{i→j}(xj) = log m_{i→j}(xj). Then taking the log of equations (1) and (2) results in the updates for the Max-Sum algorithm, summarized below.

Message-passing:

$$\hat{m}_{i \to j}(x_j) = \max_{x_i \in X} \Big( \log \phi_i(x_i) + \log \psi_{ij}(x_i, x_j) + \sum_{k \in N(i) \setminus \{j\}} \hat{m}_{k \to i}(x_i) \Big).$$

Max-marginals:

$$\bar{p}_{x_i}(x_i) = \exp\Big( \log \phi_i(x_i) + \sum_{k \in N(i)} \hat{m}_{k \to i}(x_i) \Big).$$

We could also take the negative log instead of the log, which results in the Min-Sum algorithm. A rationale for why this might make sense is that if the potentials represent probabilities, then log probabilities take on negative values; if we wish to deal with strictly nonnegative entries, we would use Min-Sum. Denote m̌_{i→j}(xj) = −log m_{i→j}(xj). Taking the negative log of equations (1) and (2), we obtain the update rules below.

Message-passing:

$$\check{m}_{i \to j}(x_j) = \min_{x_i \in X} \Big( -\log \phi_i(x_i) - \log \psi_{ij}(x_i, x_j) + \sum_{k \in N(i) \setminus \{j\}} \check{m}_{k \to i}(x_i) \Big).$$

Max-marginals:

$$\bar{p}_{x_i}(x_i) = \exp\Big( \log \phi_i(x_i) - \sum_{k \in N(i)} \check{m}_{k \to i}(x_i) \Big).$$

Lastly, we remark that if there are ties in MAP configurations, one way to remove the ties is to perturb the potential function values with a little bit of noise. If the model is numerically well-behaved, perturbing the potential function values with a little bit of noise should not drastically change the model, but may buy us node max-marginals that all have unique optima, implying that the global MAP configuration is unique as well.

11.2 Max-Product for Factor Trees

The Max-Product algorithm for factor trees is extremely similar to that for undirected trees, so we just present the parallel-schedule version of Max-Product on a factor tree G = (V, E, F):

1. Initialize all messages m⁰_{i→a}(xi) = m⁰_{a→i}(xi) = 1 for all i ∈ V, a ∈ F.
2. Iteratively apply the update for t = 0, 1, 2, . . . until convergence:

$$m^{t+1}_{a \to i}(x_i) = \max_{x_j : j \in N(a) \setminus \{i\}} f_a\big(\{x_i\} \cup \{x_j : j \in N(a) \setminus \{i\}\}\big) \prod_{j \in N(a) \setminus \{i\}} m^{t}_{j \to a}(x_j)$$

$$m^{t+1}_{i \to a}(x_i) = \prod_{b \in N(i) \setminus \{a\}} m^{t+1}_{b \to i}(x_i)$$

3. Compute max-marginals:

$$\bar{p}_{x_i}(x_i) = \prod_{a \in N(i)} m^{t+1}_{a \to i}(x_i).$$

4. Backtrack to recover a MAP configuration.

11.3 Max-Product for Hidden Markov Models

We revisit our convolution code HMM example from Lecture 9, truncating it to have four hidden states (m1m2, m2m3, m3m4, m4) with the paired observations (y1y2, y3y4, y5y6, y7) attached.

To ease the notation a bit for what's to follow, let's denote m_i m_{i+1} ≜ m_{i,i+1} and y_j y_{j+1} ≜ y_{j,j+1}. We thus obtain the chain m12 − m23 − m34 − m4 with observations y12, y34, y56, y7 hanging off the respective states.

Fix ε ∈ (0, 1/2). Let w ≜ log((1 − ε)/ε) > 0. Then the potential functions are given by:

$$\begin{aligned}
\psi_{m_{i,i+1},\, m_{i+1,i+2}}(ab, cd) &= \mathbb{1}(b = c) \quad \text{for } i \in \{1, 2\}, \\
\psi_{m_{34},\, m_4}(ab, c) &= \mathbb{1}(b = c), \\
\psi_{m_{i,i+1},\, y_{2i-1,2i}}(ab, uv) &= \exp\{\mathbb{1}(a = u)\,w + \mathbb{1}(a \oplus b = v)\,w\} \quad \text{for } i \in \{1, 2, 3\}, \\
\psi_{m_4,\, y_7}(a, u) &= \exp\{\mathbb{1}(a = u)\,w\},
\end{aligned}$$

where a, b, c, d, u, v ∈ {0, 1}, ab denotes a length-2 bit-string, and 1(A) is the indicator function that is 1 when event A is true and 0 otherwise.
Suppose we observe y = (0, 0, 0, 1, 0, 0, 0) and want to infer the MAP configuration for m = (m1, m2, m3, m4). Let's solve this using Max-Sum. The first thing to do is to incorporate the data into the model by plugging in the observed values; this will cause the resulting graphical model to just be a line over m12, m23, m34, m4. The new singleton potentials are:

$$\begin{aligned}
\phi_{m_{12}}(ab) &= \exp\{\mathbb{1}(a = 0)\,w + \mathbb{1}(a \oplus b = 0)\,w\} \\
\phi_{m_{23}}(ab) &= \exp\{\mathbb{1}(a = 0)\,w + \mathbb{1}(a \oplus b = 1)\,w\} \\
\phi_{m_{34}}(ab) &= \exp\{\mathbb{1}(a = 0)\,w + \mathbb{1}(a \oplus b = 0)\,w\} \\
\phi_{m_4}(a) &= \exp\{\mathbb{1}(a = 0)\,w\}
\end{aligned}$$

The pairwise potentials remain the same.
Since we're going to use Max-Sum, we write out the log-potentials:

$$\begin{aligned}
\log \phi_{m_{12}}(ab) &= \mathbb{1}(a = 0)\,w + \mathbb{1}(a \oplus b = 0)\,w \\
\log \phi_{m_{23}}(ab) &= \mathbb{1}(a = 0)\,w + \mathbb{1}(a \oplus b = 1)\,w \\
\log \phi_{m_{34}}(ab) &= \mathbb{1}(a = 0)\,w + \mathbb{1}(a \oplus b = 0)\,w \\
\log \phi_{m_4}(a) &= \mathbb{1}(a = 0)\,w \\
\log \psi_{m_{i,i+1},\, m_{i+1,i+2}}(ab, cd) &= \begin{cases} 0 & \text{if } b = c \\ -\infty & \text{otherwise} \end{cases} \quad \text{for } i \in \{1, 2\} \\
\log \psi_{m_{34},\, m_4}(ab, c) &= \begin{cases} 0 & \text{if } b = c \\ -\infty & \text{otherwise} \end{cases}
\end{aligned}$$

Using the Max-Sum message update equation,

$$\begin{aligned}
\hat{m}_{(i,i+1) \to (i+1,i+2)}(cd)
&= \max_{ab \in \{0,1\}^2} \Big( \log \phi_{m_{i,i+1}}(ab) + \log \psi_{m_{i,i+1},\, m_{i+1,i+2}}(ab, cd) + \hat{m}_{(i-1,i) \to (i,i+1)}(ab) \Big) \\
&= \max_{a \in \{0,1\}} \Big( \log \phi_{m_{i,i+1}}(ac) + \hat{m}_{(i-1,i) \to (i,i+1)}(ac) \Big),
\end{aligned}$$

where the last equality uses the fact that what the pairwise potentials really say is that we only allow transitions where the last bit of the previous state is the first bit of the next state. This means that if we know the next state is cd, then the previous state must end in c, so we shouldn't bother optimizing over all bit-strings of length 2.
Note that if we pass messages from the left-most node to the right-most node, then what m̂_{(i,i+1)→(i+1,i+2)}(cd) represents is the highest score achievable that ends in m_{i+1,i+2} = cd, but that does not include the contribution of the observation associated with m_{i+1,i+2}. The goal of Max-Sum is then to find the MAP configuration with the highest overall score, with contributions from all nodes' singleton potentials and pairwise potentials.
We can visualize this setup using a trellis diagram: each column of the trellis corresponds to one hidden state (m12, m23, m34, m4) and each row to a possible value that state can take; edges connect compatible consecutive values. The log singleton potentials labeling the trellis nodes are:

log φ_m12(00) = 2w,  log φ_m12(01) = 1w,  log φ_m12(10) = 0w,  log φ_m12(11) = 1w
log φ_m23(00) = 1w,  log φ_m23(01) = 2w,  log φ_m23(10) = 1w,  log φ_m23(11) = 0w
log φ_m34(00) = 2w,  log φ_m34(01) = 1w,  log φ_m34(10) = 0w,  log φ_m34(11) = 1w
log φ_m4(0)  = 1w,   log φ_m4(1)  = 0w

For each hidden state's possible value, incoming edges indicate which previous states are possible. The MAP configuration is obtained by summing scores (i.e., the log singleton potential values in this case) along a path that goes from one of the left-most nodes to the right-most node of the trellis diagram. If we keep track of which edges we take that maximize the score along the path we traverse, then we can backtrack once we're at the end to figure out what the best configuration is, as was done in the MAP elimination algorithm from last lecture.

The key idea for the Max-Sum algorithm is that we basically iterate through the vertical layers (i.e., hidden states) in the trellis diagram one at a time from left to right. At each vertical layer, we store the highest possible score achievable that arrives at each possible value that the hidden state can take. Note that for this example, the edges do not have scores associated with them; in general, edges may have scores as well.

With this intuition, we run Max-Sum with the modification that we keep track of which optimizing values give rise to the highest score so far at each hidden state's possible value. At the end, we backtrack to obtain the global MAP configuration (00, 00, 00, 0), which achieves an overall score of 6w. Note that this algorithm of essentially using Max-Product on an HMM while keeping track of optimizing values to enable backtracking at the end is called the Viterbi algorithm, developed in 1967.
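A generic max-sum pass over a chain, with backpointers and backtracking, is sketched below. The array-based representation of node and edge scores is an illustrative assumption; in the trellis example above the edge scores are 0 for allowed transitions and −∞ for forbidden ones, so only the node scores carry weight:

import numpy as np

def viterbi(log_phi, log_psi):
    # log_phi : N x K array of log singleton potentials (node scores)
    # log_psi : (N-1) x K x K array of log pairwise potentials between
    #           consecutive nodes (edge scores)
    # Returns one maximizing configuration as an array of state indices.
    N, K = log_phi.shape
    score = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)
    score[0] = log_phi[0]
    for i in range(1, N):
        # cand[a, b]: best score of a path ending with (node i-1 = a, node i = b)
        cand = score[i - 1][:, None] + log_psi[i - 1] + log_phi[i][None, :]
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0)
    # Backtrack from the best final state.
    x = np.zeros(N, dtype=int)
    x[-1] = score[-1].argmax()
    for i in range(N - 1, 0, -1):
        x[i - 1] = back[i, x[i]]
    return x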


12 Gaussian Belief Propagation

Our inference story so far has focused on discrete random variables, starting with the Elimination algorithm for marginalization. For trees, we saw how there is always a good elimination order, which gave rise to the Sum-Product algorithm for trees, also referred to as belief propagation (BP). For the problem of determining MAP configurations of the variables, we introduced the MAP Elimination algorithm, which is obtained from its marginalization counterpart by replacing summations with maximizations and keeping track of the maximizing argument of every message. For trees, an efficient implementation of elimination yielded the Max-Product algorithm. We then specialized our algorithms to the simplest class of trees, the hidden Markov model, obtaining the forward-backward algorithm as the specialization of Sum-Product, and the Viterbi algorithm as the specialization of Max-Product.

In the case of continuous-valued distributions, we can of course continue to use the Elimination Algorithm, where now summations are replaced with integrals. Obviously our computational complexity analysis does not carry over from the discrete case, as we need to think about the complexity of integration, which could be intractable for arbitrary continuous distributions. However, as we develop in these notes, for the special case of jointly Gaussian distributions, explicit integration can be entirely avoided by exploiting the algebraic structure of the underlying distributions, leading, once again, to highly efficient algorithms!

We begin with the Gaussian Sum-Product algorithm, or equivalently, Gaussian belief propagation.

12.1 Preliminary Example: the Two-Node Case

We begin with a two-node undirected graph for the jointly Gaussian case, observing the message passing structure associated with eliminating one node. The insights here will help us anticipate the structure for the general case of trees.
Our development makes use of a key result from our earlier introduction to Gaussian graphical models, whereby in information form, marginalization is computed via a Schur complement:
Claim 1. If

    x = [x1; x2] ∼ N^{-1}( [h1; h2], [ J11  J12 ; J21  J22 ] ),    (1)

then

    x1 ∼ N^{-1}(h', J'),    (2)

where

    h' = h1 - J12 J22^{-1} h2   and   J' = J11 - J12 J22^{-1} J21.    (3)

Proceeding, consider the case of two jointly Gaussian random variables x1 and x2 distributed according to

    p_{x1,x2}(x1, x2) ∝ exp{ -1/2 x1^T J11 x1 + h1^T x1 - 1/2 x2^T J22 x2 + h2^T x2 - x1^T J12 x2 }
                      = exp{ -1/2 x1^T J11 x1 + h1^T x1 } exp{ -1/2 x2^T J22 x2 + h2^T x2 } exp{ -x1^T J12 x2 },

where we identify the three factors with the node potentials φ1(x1), φ2(x2) and the edge potential ψ12(x1, x2), respectively, and where we want to compute the marginal p_{x1}. Writing out the associated integration (an integration over R^d if x2 is d-dimensional), while interpreting the quantities involved in terms of sum-product algorithm messages, we obtain

    p_{x1}(x1) ∝ ∫ φ1(x1) φ2(x2) ψ12(x1, x2) dx2 = φ1(x1) ∫ φ2(x2) ψ12(x1, x2) dx2 ,

where the remaining integral is precisely the message m_{2→1}(x1).

However, the message m_{2→1} can be rewritten as

    m_{2→1}(x1) = ∫ exp{ -1/2 x2^T J22 x2 + h2^T x2 - x1^T J12 x2 } dx2
                = ∫ exp{ -1/2 [x1; x2]^T [ 0  J12 ; J21  J22 ] [x1; x2] + [0; h2]^T [x1; x2] } dx2 ,

which, conveniently, we can interpret as a marginalization that we can evaluate using a Schur complement! In particular, applying Claim 1 with J11 = 0 and h1 = 0, we then obtain

    m_{2→1}(x1) ∝ N^{-1}(x1; h_{2→1}, J_{2→1}),

where

    h_{2→1} ≜ -J12 J22^{-1} h2   and   J_{2→1} ≜ -J12 J22^{-1} J21.

We remark that the proportionality constant does not matter; indeed, even in the discrete case we had the latitude to normalize messages at each step. However, what is important is the observation that in this case of jointly Gaussian distributions, the messages are Gaussian. In turn, this has important computational implications. Indeed, since Gaussian distributions are characterized by their mean and covariance parameters, in Gaussian inference we need only propagate these parameters. Phrased differently, the mean and covariance parameters constitute equivalent messages in the Gaussian case. As such, the computation involved in the manipulation of these messages in the inference process amounts to linear algebra, whose complexity is easy to characterize!
To complete our initial example, note that the marginal distribution for x1, which is also a Gaussian distribution and is thereby characterized by its mean and covariance, is obtained from the message m_{2→1} via the following final bit of linear algebra:

    p_{x1}(x1) ∝ φ1(x1) m_{2→1}(x1)
              ∝ exp{ -1/2 x1^T J11 x1 + h1^T x1 } exp{ -1/2 x1^T J_{2→1} x1 + h_{2→1}^T x1 }
              = exp{ -1/2 x1^T (J11 + J_{2→1}) x1 + (h1 + h_{2→1})^T x1 }
              ∝ N^{-1}(x1; h1 + h_{2→1}, J11 + J_{2→1}).

In particular, in marginalizing out x2, we see that h_{2→1} is used in updating the potential vector of x1, while J_{2→1} is used in updating the information matrix of x1.
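As a quick numerical illustration of the two-node computation, the following sketch (with made-up values of J and h; the variable names are ours) forms the message parameters via the Schur complement of Claim 1 and checks the resulting marginal against brute-force inversion of the joint information matrix.

```python
# Numerical check of the two-node example (made-up J, h values): the message
# m_{2->1} carries information parameters given by Schur complements, and the
# resulting marginal for x1 matches brute-force marginalization of the joint.
import numpy as np

# information form of the joint over (x1, x2), each 2-dimensional here
J = np.array([[3.0, 0.5, 1.0, 0.2],
              [0.5, 2.0, 0.3, 0.4],
              [1.0, 0.3, 4.0, 0.6],
              [0.2, 0.4, 0.6, 3.0]])
h = np.array([1.0, -0.5, 0.3, 2.0])
J11, J12, J21, J22 = J[:2, :2], J[:2, 2:], J[2:, :2], J[2:, 2:]
h1, h2 = h[:2], h[2:]

# message parameters (Claim 1 with the zero block)
J_msg = -J12 @ np.linalg.inv(J22) @ J21
h_msg = -J12 @ np.linalg.inv(J22) @ h2

# marginal of x1 in information form: add the message to the node potential
J_marg = J11 + J_msg
h_marg = h1 + h_msg

# brute-force check: marginal covariance/mean from the covariance form
Sigma = np.linalg.inv(J)
mu = Sigma @ h
print(np.allclose(np.linalg.inv(J_marg), Sigma[:2, :2]))    # True
print(np.allclose(np.linalg.inv(J_marg) @ h_marg, mu[:2]))  # True
```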

12.2 Undirected Trees

From the two-node example, we saw that messages are Gaussian in the case of Gaussian inference, and that message computation and marginalization involve linear algebraic computation. Thus, when applying the Sum-Product algorithm to the case of Gaussian distributions, our focus is one of determining the associated linear algebra for its implementation, which we now develop, again exploiting the information form of Gaussian distributions and the role of Schur complements in marginalization.
To start, note that the message sent from xi to xj can be written as

    m_{i→j}(xj) = ∫ φi(xi) ψij(xi, xj) ∏_{k∈N(i)\j} m_{k→i}(xi) dxi
                = ∫ exp{ -1/2 xi^T Jii xi + hi^T xi } exp{ -xi^T Jij xj } ∏_{k∈N(i)\j} exp{ -1/2 xi^T J_{k→i} xi + h_{k→i}^T xi } dxi
                = ∫ exp{ -1/2 xi^T ( Jii + Σ_{k∈N(i)\j} J_{k→i} ) xi + ( hi + Σ_{k∈N(i)\j} h_{k→i} )^T xi - xi^T Jij xj } dxi
                = ∫ exp{ -1/2 [xi; xj]^T [ Jii + Σ_{k∈N(i)\j} J_{k→i}   Jij ; Jji   0 ] [xi; xj] + [ hi + Σ_{k∈N(i)\j} h_{k→i} ; 0 ]^T [xi; xj] } dxi .

Applying Claim 1, we obtain

    m_{i→j}(xj) ∝ N^{-1}(xj; h_{i→j}, J_{i→j}),    (4)

where

    h_{i→j} = -Jji ( Jii + Σ_{k∈N(i)\j} J_{k→i} )^{-1} ( hi + Σ_{k∈N(i)\j} h_{k→i} ),    (5)

    J_{i→j} = -Jji ( Jii + Σ_{k∈N(i)\j} J_{k→i} )^{-1} Jij.    (6)

In turn, the marginal computation can be expressed as

    p_{xi}(xi) ∝ φi(xi) ∏_{k∈N(i)} m_{k→i}(xi)
              ∝ exp{ -1/2 xi^T Jii xi + hi^T xi } ∏_{k∈N(i)} exp{ -1/2 xi^T J_{k→i} xi + h_{k→i}^T xi }
              = exp{ -1/2 xi^T ( Jii + Σ_{k∈N(i)} J_{k→i} ) xi + ( hi + Σ_{k∈N(i)} h_{k→i} )^T xi },

from which we obtain that

    xi ∼ N^{-1}(ĥi, Ĵi),

where

    ĥi = hi + Σ_{k∈N(i)} h_{k→i},    Ĵi = Jii + Σ_{k∈N(i)} J_{k→i}.    (7)

Eqs. (5), (6), and (7) thus define the implementation of the update rules for Gaussian BP. As in the general case, we may run these update rules at each node in parallel.
Having now developed the linear algebra that implements Gaussian belief propagation, we can examine the complexity of Gaussian inference, focusing on the serial version of Gaussian BP. In general, of course, different xi can have different dimensions, but to simplify our discussion let us assume that each xi has dimension d. At each iteration, the computational complexity is dominated by the matrix inversion in (5), which need not be repeated for computing (6). If matrix inversion is implemented via Gaussian elimination, its complexity is O(d³) (a lower complexity of O(d^{2.376}) is possible via the Coppersmith-Winograd algorithm). Moreover, as in the case of discrete-valued variables, the number of iterations needed scales with the diameter of the undirected tree, which in turn scales with the number of nodes N. Thus, the overall complexity is O(N d³). By contrast, naively inverting the information matrix of the entire graph J ∈ R^{Nd×Nd} in order to compute marginal means and covariances at each node results in inference complexity of O((Nd)³) = O(N³ d³).
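The update rules (5)-(7) are straightforward to implement directly. Below is a compact sketch on a made-up three-node chain with scalar variables, so the matrix inversions reduce to divisions; the graph, potentials, and schedule are our own choices, and the BP marginals are checked against direct inversion of the full information matrix.

```python
# A compact sketch of Gaussian BP on a tree, implementing update rules (5)-(7).
# The graph, J blocks, and h values below are made up; we use scalar (d = 1)
# variables on the chain 0 - 1 - 2, and check the BP marginals against direct
# inversion of the full information matrix J.
import numpy as np

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]                    # a tree (here, a chain)
Jii = {0: 2.0, 1: 3.0, 2: 2.5}              # diagonal blocks of J
Jij = {(0, 1): 0.8, (1, 2): -0.6}           # off-diagonal blocks (symmetric here)
Jij.update({(j, i): v for (i, j), v in Jij.items()})
h = {0: 1.0, 1: 0.0, 2: -1.0}
nbrs = {i: [j for j in nodes if (i, j) in Jij] for i in nodes}

msg = {}                                    # messages (h_{i->j}, J_{i->j})
def send(i, j):
    Jhat = Jii[i] + sum(msg[(k, i)][1] for k in nbrs[i] if k != j)
    hhat = h[i] + sum(msg[(k, i)][0] for k in nbrs[i] if k != j)
    # eqs (5) and (6), with matrix inverses reduced to division in the scalar case
    msg[(i, j)] = (-Jij[(j, i)] / Jhat * hhat, -Jij[(j, i)] / Jhat * Jij[(i, j)])

for (i, j) in edges:            # serial schedule: left to right on the chain
    send(i, j)
for (i, j) in reversed(edges):  # then right to left
    send(j, i)

# marginals via eq (7)
for i in nodes:
    Jm = Jii[i] + sum(msg[(k, i)][1] for k in nbrs[i])
    hm = h[i] + sum(msg[(k, i)][0] for k in nbrs[i])
    print(i, hm / Jm, 1.0 / Jm)   # mean and variance of node i

# brute-force check against the full information matrix
J = np.array([[2.0, 0.8, 0.0], [0.8, 3.0, -0.6], [0.0, -0.6, 2.5]])
print(np.linalg.inv(J) @ np.array([1.0, 0.0, -1.0]), np.diag(np.linalg.inv(J)))
```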

12.3 Connections to Gaussian Elimination

From our initial discussion of Gaussian distributions, recall that the covariance and information forms are related via

    J = Σ^{-1}   and   h = J μ,    (8)

from which we see that marginalization, i.e., the computation of μ from the information form parameterization, can be obtained by solving a set of linear equations.
Via this observation, it follows that for any (symmetric) positive-definite matrix A that corresponds to the information matrix for a tree graph, the linear equation Ax = b for x may be solved by running Gaussian BP on the associated graph! As such, there is a close connection between the Gaussian elimination algorithm for solving linear equations and Gaussian BP, which we give a flavor of with the following simple example.
Example 1. Consider the following pair of jointly Gaussian variables:

    [x1; x2] ∼ N^{-1}( [3; 3], [ 4  2 ; 2  3 ] ).

In particular, let's look at solving the following for x = (x1, x2):

    [ 4  2 ; 2  3 ] [x1; x2] = [3; 3].

Subtracting 2/3 of the second row from the first yields

    [ 4 - (2/3)·2   2 - (2/3)·3 ; 2   3 ] [x1; x2] = [ 3 - (2/3)·3 ; 3 ].

Simplifying yields

    [ 8/3  0 ; 2  3 ] [x1; x2] = [ 1 ; 3 ],

from which we can back-substitute to obtain x1 = 3/8 and so forth. We leave computing the Gaussian BP messages by first eliminating node x2 as an exercise, noting that the calculations will in fact be identical to those above.
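For concreteness, here is a small numerical companion to Example 1 (the helper names are ours): eliminating x2 via the BP message of Claim 1 reproduces the 8/3 and 1 that appeared in the row reduction, and hence the solution x1 = 3/8.

```python
# Companion to Example 1: eliminating x2 by a Gaussian BP message produces the
# same numbers as the row reduction above (scalar case, inverses are divisions).
import numpy as np

J = np.array([[4.0, 2.0], [2.0, 3.0]])
h = np.array([3.0, 3.0])

# BP message from node 2 to node 1 (Claim 1 / eqs (5)-(6) in the scalar case)
J_21 = -J[0, 1] / J[1, 1] * J[1, 0]        # -2/3 * 2
h_21 = -J[0, 1] / J[1, 1] * h[1]           # -2/3 * 3

# marginal information parameters at node 1, and the resulting mean
J1_hat = J[0, 0] + J_21                    # 4 - 4/3 = 8/3, as in the row reduction
h1_hat = h[0] + h_21                       # 3 - 2 = 1
print(J1_hat, h1_hat, h1_hat / J1_hat)     # 8/3, 1.0, 0.375  (i.e., x1 = 3/8)

# sanity check against solving the full system
print(np.linalg.solve(J, h))               # first component is also 3/8
```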

This begs the question: Does this result contradict our earlier computational complexity analysis suggesting that Gaussian BP is more efficient than naively inverting the entire graph's information matrix using Gaussian elimination? We mention two differences here. First, the equivalence assumes the same elimination ordering, which means that while using Gaussian elimination, we need to essentially reorder the rows so that we eliminate from the leaves to the root, and choosing a bad elimination ordering will induce a significant computational penalty, akin to using the Elimination Algorithm on a tree with a bad ordering (e.g., trying to first eliminate nodes that are not leaves). Second, Gaussian elimination does not know a priori that matrix A is sparse and might, for example, perform many multiplications by 0, whereas Gaussian BP for trees will automatically avoid a lot of these operations. We can modify Gaussian elimination to account for such issues, but the intuition for doing so really comes from the tree structure. In fact, once we account for the tree structure and elimination ordering as described above, Gaussian elimination becomes equivalent to Gaussian BP.

12.4 MAP Configurations in the Gaussian Case

For Gaussian distributions, we are also often interested in our other main inference task, viz., MAP estimation. Conveniently, for jointly Gaussian distributions, marginalization and MAP estimation coincide, i.e., the sum-product and max-product algorithms produce identical results, and thus our development of the sum-product algorithm suffices.
The reason for this is the special structure and symmetries of jointly Gaussian distributions; indeed, the mean and mode (and median, for that matter) of such distributions are identical. Since the mode corresponds to the MAP configuration, and the marginals are parameterized by the mean vector, BP suffices for both.
To verify that the mode of a Gaussian distribution is its mean, it suffices to note that maximizing the distribution N(x; μ, Σ) over x amounts to solving the unconstrained maximization problem

    x̂ = arg max_x { -1/2 (x - μ)^T Σ^{-1} (x - μ) }
      = arg max_x { -1/2 x^T Σ^{-1} x + μ^T Σ^{-1} x - 1/2 μ^T Σ^{-1} μ }.

Taking the gradient of the objective function with respect to x, we obtain

    ∂/∂x ( -1/2 x^T Σ^{-1} x + μ^T Σ^{-1} x - 1/2 μ^T Σ^{-1} μ ) = -Σ^{-1} x + Σ^{-1} μ.

In turn, setting the gradient to 0 we see that, indeed, the MAP configuration is x̂ = μ.


13 BP on Gaussian Hidden Markov Models: Kalman Filtering and Smoothing
As we have seen, when our variables of interest are jointly Gaussian, the sum-product algorithm for inference on trees takes a special form in which the messages themselves are Gaussian. As we also saw earlier, a simple but very important class of trees are the hidden Markov models (HMMs). In this section of the notes we specialize our Gaussian BP algorithm to the class of Gaussian HMMs, which are equivalently characterized as linear dynamical systems (LDSs). As we will see, what results is a version of the forward-backward algorithm for such models that is referred to as Kalman filtering and smoothing.

13.1 Gaussian HMM (State Space Model)

Consider the Gaussian HMM depicted in Fig. 1, where to simplify our development we restrict our attention to zero-mean variables and homogeneous models. In this model, the states xt and the observations yt are d- and d'-dimensional vectors, respectively.
The Gaussian HMM can be expressed as an LDS by exploiting the representation of the variables in innovation form. (Recall that any pair of jointly Gaussian random vectors u, v can be expressed in innovation form: there exists a matrix G and a Gaussian random vector w, the innovation, independent of v, such that u = Gv + w.) In particular, first, the states evolve according to the linear dynamics

    x_{t+1} = A x_t + v_t,    v_t ∼ N(0, Q),    x_0 ∼ N(0, Λ0),    (1a)

where A is a constant matrix, Λ0 and Q are covariance (i.e., positive semidefinite) matrices, and {v_t} is a sequence of independent random vectors that are also independent of x_0. Second, the observations depend on the state according to the linear measurements

    y_t = C x_t + w_t,    w_t ∼ N(0, R),    (1b)

where C is a constant matrix and R is a covariance matrix, and {w_t} is a sequence of independent random variables that is also independent of x_0 and {v_t}. Hence, in terms of our notation, we have the conditional distributions

    x_0 ∼ N(0, Λ0),    x_{t+1} | x_t ∼ N(A x_t, Q),    y_t | x_t ∼ N(C x_t, R).

[Figure 1: A Gaussian HMM as a directed tree, with states x0, x1, x2, . . . , xt and observations y0, y1, y2, . . . , yt. In this model, the variables are jointly Gaussian.]
Such Gaussian HMMs are used to model an extraordinarily broad range of phenomena in practice.

13.2 Inference with Gaussian HMMs

In our development of Gaussian BP for trees, we expressed the algorithm in terms of the information form representation of the Gaussian variables involved. In this section, we further specialize our results to the HMM, and express the quantities involved directly in terms of the parameters of the LDS (1).
To obtain our results, we start by observing that the joint distribution factors as

    p(x_0^t, y_0^t) = p(x0) p(y0 | x0) p(x1 | x0) p(y1 | x1) ··· p(yt | xt),

where we have used x_0^t as shorthand to denote the entire sequence x0, . . . , xt.
Substituting the Gaussian form of the constituent conditional distributions and matching corresponding terms, we then obtain

    p(x_0^t, y_0^t) ∝ exp{ -1/2 x0^T Λ0^{-1} x0 } exp{ -1/2 (y0 - C x0)^T R^{-1} (y0 - C x0) } exp{ -1/2 (x1 - A x0)^T Q^{-1} (x1 - A x0) } ···
                    ∝ exp{ -1/2 x0^T Λ0^{-1} x0 - 1/2 y0^T R^{-1} y0 - 1/2 x0^T C^T R^{-1} C x0 + x0^T C^T R^{-1} y0 }
                      · exp{ -1/2 x1^T Q^{-1} x1 - 1/2 x0^T A^T Q^{-1} A x0 + x1^T Q^{-1} A x0 } ···
                    ∝ exp{ -1/2 x0^T ( Λ0^{-1} + C^T R^{-1} C + A^T Q^{-1} A ) x0 - 1/2 y0^T R^{-1} y0 - 1/2 x1^T Q^{-1} x1 - ··· }
                      · exp{ x0^T C^T R^{-1} y0 + x1^T Q^{-1} A x0 + ··· }
                    ≜ ∏_{k=0}^{t} φk(xk) ∏_{k=1}^{t} ψ_{k-1,k}(x_{k-1}, xk) ∏_{k=0}^{t} βk(yk) ∏_{k=0}^{t} γk(xk, yk),

where

    log φk(xk) = -1/2 x0^T ( Λ0^{-1} + C^T R^{-1} C + A^T Q^{-1} A ) x0 ≜ -1/2 x0^T J0 x0,        k = 0,
               = -1/2 xk^T ( Q^{-1} + C^T R^{-1} C + A^T Q^{-1} A ) xk ≜ -1/2 xk^T Jk xk,         1 ≤ k ≤ t-1,
               = -1/2 xk^T ( Q^{-1} + C^T R^{-1} C ) xk ≜ -1/2 xk^T Jt xk,                        k = t,

    log ψ_{k-1,k}(x_{k-1}, xk) = xk^T Q^{-1} A x_{k-1} ≜ -xk^T Lk x_{k-1},   i.e.,  Lk ≜ -Q^{-1} A,

    log βk(yk) = -1/2 yk^T R^{-1} yk,

    log γk(xk, yk) = xk^T C^T R^{-1} yk ≜ xk^T Mk yk,   i.e.,  Mk ≜ C^T R^{-1}.

Next note that y_0^t are observations, so we condition on these variables. After conditioning on y_0^t, the joint distribution continues to be Gaussian, and simplifies to the following form:

    p_{x_0^t | y_0^t}(x_0^t | y_0^t) ∝ p_{x_0^t, y_0^t}(x_0^t, y_0^t)
                                     ∝ ∏_{k=0}^{t} φk(xk) ∏_{k=0}^{t} γk(xk, yk) ∏_{k=1}^{t} ψ_{k-1,k}(x_{k-1}, xk)
                                     = ∏_{k=0}^{t} exp{ -1/2 xk^T Jk xk + xk^T Mk yk } ∏_{k=1}^{t} exp{ -xk^T Lk x_{k-1} },

where we identify hk ≜ Mk yk. Thus the potential for node k is a Gaussian with information parameters (hk, Jk), and the potential for the edge from node k-1 to node k is a Gaussian with information parameters (0, Lk); see Fig. 2.
Having now expressed the joint distribution as a product of potentials in Gaussian information form, we can easily read off the information matrices and potential vectors required to apply Gaussian BP. In particular, the initial messages are

    J_{0→1} = -L1 J0^{-1} L1^T,
    h_{0→1} = -L1 J0^{-1} h0.

[Figure 2: The equivalent undirected tree for Gaussian HMMs (nodes x0, x1, x2, . . . , xt), with only the first node potential exp{ -1/2 x0^T J0 x0 + h0^T x0 } and the first edge potential exp{ x1^T (-L1) x0 } shown.]
Then, the forward pass is given by the recursion

    J_{i→i+1} = -L_{i+1} ( Ji + J_{i-1→i} )^{-1} L_{i+1}^T,
    h_{i→i+1} = -L_{i+1} ( Ji + J_{i-1→i} )^{-1} ( hi + h_{i-1→i} ),        i = 0, 1, . . . , t-1.

In turn, the backward pass is given by the recursion

    J_{i+1→i} = -L_{i+1}^T ( J_{i+1} + J_{i+2→i+1} )^{-1} L_{i+1},
    h_{i+1→i} = -L_{i+1}^T ( J_{i+1} + J_{i+2→i+1} )^{-1} ( h_{i+1} + h_{i+2→i+1} ),        i = t-1, t-2, . . . , 0.
Finally, the parameters for the marginals for each i are obtained via

    Ĵi = Ji + J_{i-1→i} + J_{i+1→i},
    ĥi = hi + h_{i-1→i} + h_{i+1→i},

from which the mean vectors and covariance matrices associated with the marginals are given by

    μ̂i = Ĵi^{-1} ĥi   and   Σ̂i = Ĵi^{-1}.

As a reminder, the complexity of this algorithm scales as O(t d³) compared to O((t d)³) for a naive implementation of the marginalization. The savings is particularly significant because t ≫ d in typical applications.
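As a sketch of how these information-form recursions look in code, the following example uses made-up model parameters and observations (all variable names are ours) and checks the resulting marginal means against inverting the full posterior information matrix.

```python
# A sketch of the information-form forward/backward recursions above on a tiny
# Gaussian HMM.  All numbers (A, C, Q, R, Lambda0, and the observations y) are
# made up; the BP marginal means are checked against solving the full posterior
# information system directly.
import numpy as np

rng = np.random.default_rng(0)
t, d = 4, 2                                  # time steps 0..t, state dimension d
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.array([[1.0, 0.0]])                   # scalar observations
Q = 0.1 * np.eye(d); R = np.array([[0.5]]); Lam0 = np.eye(d)
y = [rng.normal(size=(1,)) for _ in range(t + 1)]

Qi, Ri = np.linalg.inv(Q), np.linalg.inv(R)
L = -Qi @ A                                  # L_k (same for all k here)
def Jk(k):                                   # node information matrices
    base = C.T @ Ri @ C + (np.linalg.inv(Lam0) if k == 0 else Qi)
    return base + (A.T @ Qi @ A if k < t else 0.0)
h = [C.T @ Ri @ y[k] for k in range(t + 1)]  # node potential vectors h_k = M_k y_k

fwd, bwd = {}, {}                            # messages (h_{i->j}, J_{i->j})
for i in range(t):                           # forward pass
    hp, Jp = fwd.get(i - 1, (0.0, 0.0))
    M = np.linalg.inv(Jk(i) + Jp)
    fwd[i] = (-L @ M @ (h[i] + hp), -L @ M @ L.T)
for i in reversed(range(t)):                 # backward pass
    hp, Jp = bwd.get(i + 1, (0.0, 0.0))
    M = np.linalg.inv(Jk(i + 1) + Jp)
    bwd[i] = (-L.T @ M @ (h[i + 1] + hp), -L.T @ M @ L)

means_bp = []                                # marginal means from combined parameters
for i in range(t + 1):
    hf, Jf = fwd.get(i - 1, (0.0, 0.0))
    hb, Jb = bwd.get(i, (0.0, 0.0))
    means_bp.append(np.linalg.solve(Jk(i) + Jf + Jb, h[i] + hf + hb))

# brute-force check: build the full posterior information form and solve it
Jfull = np.zeros(((t + 1) * d, (t + 1) * d)); hfull = np.zeros((t + 1) * d)
for k in range(t + 1):
    Jfull[k*d:(k+1)*d, k*d:(k+1)*d] = Jk(k); hfull[k*d:(k+1)*d] = h[k].ravel()
for k in range(1, t + 1):
    Jfull[k*d:(k+1)*d, (k-1)*d:k*d] = L; Jfull[(k-1)*d:k*d, k*d:(k+1)*d] = L.T
mu = np.linalg.solve(Jfull, hfull)
print(np.allclose(np.concatenate([m.ravel() for m in means_bp]), mu))  # True
```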

13.3 Kalman Filtering and Smoothing

The implementation of BP for Gaussian HMMs developed in the previous section has a convenient two-pass structure, consisting of a forward and a backward pass. Note that the data yk enters into the computation through the potential vector hk, and thus in the serial version of BP each piece of the data is used three times: once during the forward pass, once during the backward pass, and once during the marginal computation.
There are a variety of ways to rearrange such forward-backward computation for efficiently producing marginals. One such variant corresponds to what is referred to as the Rauch-Tung-Striebel (RTS) algorithm. In this algorithm, the forward and backward passes take the form of what are referred to as Kalman filtering and smoothing, respectively.
The Kalman filter was introduced by R. Kalman in his extraordinarily influential work in 1959 (you are strongly encouraged to read at least the first two pages of that paper to appreciate the broader context as well as the significance of this work). In contrast to the forward pass of the Gaussian BP procedure developed in the previous section, the Kalman filter directly generates a particular set of marginals. Specifically, it generates the marginals p(xi | y_0^i) for i = 0, 1, . . . , t, i.e., a marginal at node i based on the data only through index i; this is what is meant by the term "filtering." Each step in the forward pass is typically implemented in two substeps: a prediction substep that generates p(xi | y_0^{i-1}), followed by an update substep that generates p(xi | y_0^i).
The backward pass of the RTS algorithm, which implements Kalman smoothing, directly generates the desired marginals p(xi | y_0^t), i.e., a marginal at node i based on all the data. Moreover, it does so in a manner that requires only access to the marginals obtained in the forward pass: once the forward pass is complete, the data is no longer needed.
From the above perspectives, the RTS algorithm can be viewed as the Gaussian version of the so-called α, γ variant of the forward-backward algorithm for discrete HMMs. Indeed, suitably normalized, the α messages are the filtered marginals, and the γ messages are the smoothed marginals. The α, γ forward-backward algorithm and its relation to the α, β forward-backward algorithm we introduced as BP for discrete HMMs are developed in Chapter 12 of Jordan's notes.
There are several versions of the Kalman filter, in both covariance and information forms. As an illustration, we summarize a standard version of the former. As notation, we use μ_{i|j} and Σ_{i|j} to refer to the mean vector and covariance matrix, respectively, of the distribution p(xi | y_0^j). We omit the derivation, which involves substituting the Gaussian forms into the α, γ version of the forward-backward algorithm, with summations replaced by integrals, and evaluating the integrals, in the same manner that Gaussian BP was obtained from our original BP equations.
More specifically, the forward (filtering) recursion is

    α(x_{i+1}) = ∫ α(xi) p(x_{i+1} | xi) p(y_{i+1} | x_{i+1}) dxi ,

and the backward (smoothing) recursion is

    γ(xi) = α(xi) ∫ γ(x_{i+1}) p(x_{i+1} | xi) / [ ∫ α(x'_i) p(x_{i+1} | x'_i) dx'_i ] dx_{i+1} ,

where, for reference, recall that the marginals are expressed in terms of our original messages via

    γ(xi) = α(xi) β(xi) / p(y_0^t).
13.3.1 Filtering

The filtering pass produces the mean and covariance parameters μ_{i|i} and Σ_{i|i}, respectively, in sequence for i = 0, 1, . . . , t. Each step consists of the following two substeps:
Prediction Substep: In this substep, we predict the next state x_{i+1} from the current observations y_0^i, whose impact is summarized in the filtered marginal p(xi | y_0^i) and thus does not require re-accessing the data:

    p(xi | y_0^i) → p(x_{i+1} | y_0^i),        i = 0, 1, . . . , t-1.

In terms of the model parameters (1), the prediction recursion for the associated marginal parameters can be derived to be

    μ_{i+1|i} = A μ_{i|i},
    Σ_{i+1|i} = A Σ_{i|i} A^T + Q,        i = 0, 1, . . . , t-1,

where the initialization of the recursion is

    μ_{0|-1} = 0,    (2)
    Σ_{0|-1} = Λ0.    (3)

Update Substep: In this substep, we update the prediction at step i by incorporating the new data y_{i+1}, i.e.,

    p(x_{i+1} | y_0^i) → p(x_{i+1} | y_0^{i+1}),        i = 0, 1, . . . , t-1.

In terms of the model parameters (1), the update recursion for the associated marginal parameters can be derived to be

    μ_{i+1|i+1} = μ_{i+1|i} + G_{i+1} ( y_{i+1} - C μ_{i+1|i} ),
    Σ_{i+1|i+1} = Σ_{i+1|i} - G_{i+1} C Σ_{i+1|i},        i = 0, 1, . . . , t-1,

where

    G_{i+1} = Σ_{i+1|i} C^T ( C Σ_{i+1|i} C^T + R )^{-1}

is a precomputable quantity referred to as the Kalman gain.

13.3.2 Smoothing

The smoothing pass produces the mean and covariance parameters μ_{i|t} and Σ_{i|t}, respectively, in sequence for i = t, t-1, . . . , 0. Each step implements

    p(x_{i+1} | y_0^t) → p(xi | y_0^t),        i = t-1, t-2, . . . , 0.

In terms of the model parameters (1), the smoothing recursion for the associated marginal parameters can be derived to be

    μ_{i|t} = μ_{i|i} + Fi ( μ_{i+1|t} - μ_{i+1|i} ),
    Σ_{i|t} = Σ_{i|i} + Fi ( Σ_{i+1|t} - Σ_{i+1|i} ) Fi^T,        i = t-1, t-2, . . . , 0,

where

    Fi = Σ_{i|i} A^T Σ_{i+1|i}^{-1}.


14 The Junction Tree Algorithm

In the past few lectures, we looked at exact inference on trees over discrete random variables using Sum-Product and Max-Product, and for trees over multivariate Gaussians using Gaussian belief propagation. Specialized to hidden Markov models, we related Sum-Product to the forward-backward algorithm, Max-Product to the Viterbi algorithm, and Gaussian belief propagation to Kalman filtering and smoothing.
We now venture back to general undirected graphs, potentially with loops, and again ask how to do exact inference. Our focus will be on computing marginal distributions. To this end, we could just run the Elimination Algorithm to obtain the marginal at a single node and then repeat this for all other nodes in the graph to find all the marginals. But as in the Sum-Product algorithm, it seems that we should be able to recycle some intermediate calculations. Today, we provide an answer to this bookkeeping with an algorithm that does exact inference on general undirected graphs, referred to as the Junction Tree Algorithm.
At a high level, the basic idea of the Junction Tree Algorithm is to convert the input graph into a tree and then apply Sum-Product. While this may seem too good to be true, alas, there's no free lunch: the fine print is that the nodes in the new tree may have alphabet sizes substantially larger than those of the original graph! In particular, we require the trees to be what are called junction trees.

14.1 Clique trees and junction trees

The idea behind junction trees is that certain probability distributions corresponding to possibly loopy undirected graphs can be reparameterized as trees, enabling us to run Sum-Product on this tree reparameterization and rest assured that we extract exact marginals.
Before defining what junction trees are, we define clique graphs and clique trees. Given a graph G = (V, E), a clique graph of G is a graph whose set of nodes is precisely the set of maximal cliques in G. Next, any clique graph that is a tree is a clique tree. For example, Figures 1b and 1c show two clique trees for the graph in Figure 1a.
[Figure 1: (a) The original graph over x1, . . . , x5, whose maximal cliques are {1, 3, 4}, {1, 2, 4}, and {2, 4, 5}. (b)-(c) Two clique trees for the graph in (a): in (b) the clique node {2, 4, 5} is connected to both {1, 3, 4} and {1, 2, 4}; in (c) the clique node {1, 2, 4} is connected to both {1, 3, 4} and {2, 4, 5}.]

Note that if each xi takes on a value in X, then each clique node in the example clique trees corresponds to a random variable that takes on |X|³ values, since all the maximal clique sizes are 3. In other words, clique nodes are supernodes of multiple random variables in the original graph.
Intuitively, the clique tree in Figure 1b seems inadequate for describing the underlying distribution because maximal cliques {1, 3, 4} and {1, 2, 4} both share node 1, but there is no way to encode a constraint that says that the x1 value that these two clique nodes take on must be the same; we can only put constraints as edge potentials for edges ({1, 3, 4}, {2, 4, 5}) and ({2, 4, 5}, {1, 2, 4}). Indeed, what we need is that the maximal cliques that node 1 participates in must form a connected subtree, meaning that if we delete all other maximal cliques that do not involve node 1, then what we are left with should be a connected tree.
With this guiding intuition, we define the junction tree property. Let Cv denote the set of all maximal cliques in the original graph G = (V, E) that contain node v. Then we say that a clique graph for G satisfies the junction tree property if for all v ∈ V, the set of nodes Cv in the clique graph induces a connected subtree. For example, the clique tree in Figure 1c satisfies the junction tree property.
Finally, we define a junction tree to be a clique tree that satisfies the junction tree property. Thus, the clique tree in Figure 1c is a junction tree. With our previous intuition for why we need the junction tree property, we can now specify constraints on shared nodes via pairwise potentials. We will first work off our example before discussing the general case.
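The junction tree property is easy to check mechanically. Here is a small sketch (our own helper, not part of the lecture) that tests the two clique trees of Figure 1 by verifying, for each node v, that the cliques containing v induce a connected subtree.

```python
# A small check of the junction tree property for the clique trees of Figure 1
# (our own helper; cliques given as frozensets, tree edges as pairs of cliques).
def is_junction_tree(cliques, tree_edges):
    nodes = set().union(*cliques)
    for v in nodes:
        # cliques containing v, and the tree edges among just those cliques
        Cv = [C for C in cliques if v in C]
        sub_edges = [(a, b) for (a, b) in tree_edges if v in a and v in b]
        # since a subgraph of a tree is a forest, it is a connected subtree iff
        # a traversal from one clique in Cv reaches all of Cv
        seen, stack = {Cv[0]}, [Cv[0]]
        while stack:
            c = stack.pop()
            for (a, b) in sub_edges:
                nxt = b if a == c else (a if b == c else None)
                if nxt is not None and nxt not in seen:
                    seen.add(nxt); stack.append(nxt)
        if seen != set(Cv):
            return False
    return True

c134, c124, c245 = frozenset({1, 3, 4}), frozenset({1, 2, 4}), frozenset({2, 4, 5})
cliques = [c134, c124, c245]
print(is_junction_tree(cliques, [(c134, c245), (c245, c124)]))  # Figure 1b: False
print(is_junction_tree(cliques, [(c134, c124), (c124, c245)]))  # Figure 1c: True
```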
The original graph in Figure 1a has corresponding factorization

    p_{x1,...,x5}(x1, x2, x3, x4, x5) ∝ ψ134(x1, x3, x4) ψ124(x1, x2, x4) ψ245(x2, x4, x5),    (1)

whereas the junction tree in Figure 1c has corresponding factorization

    p_{y134,y124,y245}(y134, y124, y245) ∝ φ134(y134) φ124(y124) φ245(y245) ψ_{134,124}(y134, y124) ψ_{124,245}(y124, y245).    (2)

For notation, we'll denote [yV]_S to refer to the value that yV assigns to the nodes in S. For example, [y134]_3 refers to the value that y134 ∈ X³ assigns to node 3. Then note that we can equate distributions (1) and (2) by assigning the following potentials:

    φ134(y134) = ψ134([y134]_1, [y134]_3, [y134]_4)
    φ124(y124) = ψ124([y124]_1, [y124]_2, [y124]_4)
    φ245(y245) = ψ245([y245]_2, [y245]_4, [y245]_5)
    ψ_{134,124}(y134, y124) = 1{ [y134]_1 = [y124]_1 and [y134]_4 = [y124]_4 }
    ψ_{124,245}(y124, y245) = 1{ [y124]_2 = [y245]_2 and [y124]_4 = [y245]_4 }

The edge potentials constrain shared nodes to take on the same value, and once this consistency is ensured, the rest of the potentials are just the original clique potentials.

In the general case, consider a distribution for graph G given by

    px(x) ∝ ∏_{C∈C} ψC(xC),    (3)

where C is the set of maximal cliques in G. Then a junction tree T = (C, F) for G has factorization

    py({yC : C ∈ C}) ∝ ∏_{C∈C} φC(yC) ∏_{(V,W)∈F} ψ_{V,W}(yV, yW),    (4)

where edge potential ψ_{V,W} is given by

    ψ_{V,W}(yV, yW) = 1{ [yV]_S = [yW]_S }   for S = V ∩ W.

Comparing (3) with (1) and (4) with (2), note that, importantly, the yC's are values over cliques. In particular, for V, W ∈ C with S = V ∩ W ≠ ∅, we can plug into the joint distribution (4) values for yV and yW that disagree over the shared nodes in S; of course, the probability of such an assignment will just be 0.
With the singleton and edge potentials defined, applying Sum-Product directly yields marginals for all the maximal cliques. However, we wanted marginals for nodes in the original graph, not marginals over maximal cliques! Fortunately, we can extract node marginals from maximal-clique marginals with just some additional marginalization. In particular, for node v in the original graph, we just look at the marginal distribution for any maximal clique that v participates in and sum out the other variables in this maximal clique's marginal distribution.
So we see that junction trees are useful, as we can apply Sum-Product on them along with some additional marginalization to compute all the node marginals. But a few key questions arise that demand answers:
1. Which graphs (equivalently, which distributions) have junction trees?
2. For a graph that has a junction tree, while we know a junction tree exists for it, how do we actually find it?
3. For a graph that does not have a junction tree, how do we modify it so that it does have a junction tree?
We'll answer these in turn, piecing together the full Junction Tree Algorithm.

14.2 Chordal graphs and junction trees

To answer the first question raised, we skip to the punchline:

Theorem 1. If a graph is chordal, then it has a junction tree.

[Figure 2: Diagram to help explain the proof of Lemma 1.]


In fact, the converse is also true: If a graph has a junction tree, then it must
be chordal. We wont need this direction though and will only prove the forward
direction. First, we collect a couple of lemmas.
Lemma 1. If graph G = (V, E) is chordal, has at least three nodes, and is not fully
connected, then V = ABS where: (i) sets A, B, and S are disjoint, (ii) S separates
A from B, and (iii) S is fully connected.
Proof. Refer to the visual aid in Figure 2. Nodes , V are chosen to be nonad
jacent (requiring the graph to have at least three nodes and not be fully connected),
and S is chosen to be the minimal set of nodes for which any path from to passes
through. Dene A to be the set of nodes reachable from with S removed, and
dene B to be the set of nodes reachable from with S removed. By construction,
S separates A from B and V = A B S where A, B, and S are disjoint. It remains
to show (iii).
Let , S. Because S is the minimal set of nodes for which any path from to
passes through, there must be a path from to as well as a path from to , from
which we conclude that there must be a path from to in A S. Similarly, there
must be a path from to as well as a path from to , from which we conclude
that there must be a path from to in B S.
Suppose for contradiction that there is no edge from to . Consider the shortest
path from to in A S and the shortest path from to in B S. String these
two shortest paths together to form a cycle of at least four nodes. Since we chose
shortest paths, this cycle has no chord, which is a contradiction because G is chordal.
Conclude then that there must be an edge from to . Iterating this argument over
all such , S, we conclude that S is fully connected.

Lemma 2. If graph G = (V, E) is chordal and has at least two nodes, then G has at least two nodes each with all its neighbors connected. Furthermore, if G is not fully connected, then there exist two nonadjacent nodes each with all its neighbors connected.

Proof. We use strong induction on the number of nodes in the graph. The base case of two nodes is trivial, as each node has at most one neighbor.
Now for the inductive step. Our inductive hypothesis is that any graph of size up to N nodes which is chordal and has at least two nodes contains at least two nodes each with all its neighbors connected; moreover, if said graph is not fully connected, then there exist two nonadjacent nodes each with all its neighbors connected.
We want to show that the same holds for any graph G′ over N + 1 nodes which is chordal and has at least two nodes. Note that if G′ is fully connected, then there is nothing to show, since each node has all its neighbors connected, so we can arbitrarily pick any two nodes. The interesting case is when G′ is not fully connected, for which we need to find two nonadjacent nodes each with all its neighbors connected. To do this, we first apply Lemma 1 to decompose the graph into disjoint sets A, B, S with S separating A and B. Note that A ∪ S and B ∪ S are each chordal with at least two nodes and at most N nodes. The idea is that we will show that we can pick one node from A that has all its neighbors connected and also one node from B that has all its neighbors connected, at which point we'd be done. The proof for finding each of these nodes is the same, so we just need to show how to find, say, one node from A that has all its neighbors connected.
By the inductive hypothesis, A ∪ S has at least two nodes each with all its neighbors connected. In particular, if A ∪ S is fully connected, then we can just choose any node in A: its neighbors will all be connected, and it certainly won't have any neighbors in B since S separates A from B, at which point we're done. On the other hand, if A ∪ S is not fully connected, then by the inductive hypothesis there exist two nonadjacent nodes in A ∪ S each with all its neighbors connected, of which at least one must be in A, because if both were in S they would actually be adjacent (since S is fully connected).
We now prove Theorem 1:

Proof. We use induction on the number of nodes in the graph. For the base case, we note that if a chordal graph has only one or two nodes, then trivially the graph has a junction tree.
Onto the inductive step: suppose that any graph of up to N nodes that is chordal has a junction tree. We want to show that a chordal graph G with N + 1 nodes also has a junction tree. By Lemma 2, G has a node α with all its neighbors connected. This means that removing node α from G will result in a graph G′ which is also chordal. By the inductive hypothesis, G′, which has N nodes, has a junction tree T′. We next show that we can modify T′ to obtain a new junction tree T, which is the junction tree for our (N + 1)-node chordal graph G.
Let C be the maximal clique that node α participates in, which consists of α and its neighbors, which are all connected. If C \ α is a maximal-clique node in T′, then we can just add α to this clique node to obtain a junction tree T for G.
Otherwise, if C \ α is not a maximal-clique node in T′, then C \ α, which we know to be a clique, must be a subset of a maximal-clique node D in T′. Then we add C as a new maximal-clique node in T′, which we connect to D to obtain a junction tree T for G; justifying why T is in fact a junction tree just involves checking the junction tree property and noting that α participates only in maximal clique C.

14.3 Finding a junction tree for a chordal graph

We now know that chordal graphs have junction trees. However, the proof given for Theorem 1 is existential and fails to deliver an efficient algorithm. Recalling our second question raised from before, we now ask how we actually construct a junction tree given a chordal graph.
Surprisingly, finding a junction tree for a chordal graph just boils down to finding a maximum-weight spanning tree, which can be efficiently solved using Kruskal's algorithm or Prim's algorithm. More precisely:

Theorem 2. Consider weighted graph H, which is a clique graph for some underlying graph G where we have an edge between two maximal-clique nodes V, W with weight |V ∩ W| whenever this weight is positive. Then a clique tree for underlying graph G is a junction tree if and only if it is a maximum-weight spanning tree in H.
Before we prove this, we state a key inequality. Let T = (C, F) be a clique tree for underlying graph G = (V, E) with maximal cliques C. Denote Se as the separator set for edge e ∈ F, i.e., Se is the set of nodes in common between the two maximal cliques connected by e. Then any v ∈ V satisfies

    Σ_{e∈F} 1{v ∈ Se}  ≤  Σ_{C∈C} 1{v ∈ C} - 1.    (5)

What this is saying is that the number of separator sets v participates in is bounded above by the number of maximal cliques v participates in, minus 1. The intuition for why there's a minus 1 is that the left-hand side is counting edges in a tree and the right-hand side is counting nodes in a tree (remember that a tree with N nodes has N - 1 edges). As a simple example, if v participates in two maximal cliques, then v will participate in at most one separator set, which is precisely the intersection of the two maximal cliques.
With the above intuition, note that the inequality above becomes an equality if and only if T is a junction tree! To see this, after dropping terms on both sides that have nothing to do with node v, the left-hand side counts edges that v is associated with and the right-hand side counts maximal-clique nodes that v is associated with, minus 1. So equality holds exactly when T restricted to the maximal cliques that v participates in forms a connected subtree.


Now we prove Theorem 2:

Proof. Let T = (C, F) be a clique tree for underlying graph G = (V, E). Note that T is a subgraph of weighted graph H. Denote w(T) to be the sum of all the weights along edges of T using the edge weights assigned. We find an upper bound for w(T) using inequality (5):

    w(T) = Σ_{e∈F} |Se|
         = Σ_{e∈F} Σ_{v∈V} 1{v ∈ Se}
         = Σ_{v∈V} Σ_{e∈F} 1{v ∈ Se}
         ≤ Σ_{v∈V} ( Σ_{C∈C} 1{v ∈ C} - 1 )
         = Σ_{v∈V} Σ_{C∈C} 1{v ∈ C} - |V|
         = Σ_{C∈C} Σ_{v∈V} 1{v ∈ C} - |V|
         = Σ_{C∈C} |C| - |V|.

This means that any maximum-weight spanning (clique) tree T for H has weight bounded above by Σ_{C∈C} |C| - |V|, where equality is attained if and only if inequality (5) holds with equality for every v ∈ V, i.e., when T is a junction tree.

14.4 The full algorithm

The last question asked was: if a graph doesn't have a junction tree, how can we modify it so that it does have a junction tree? This question was actually already answered when we discussed the Elimination Algorithm. In particular, when running the Elimination Algorithm, the reconstituted graph is always chordal. So it suffices to run the Elimination Algorithm using any elimination ordering to obtain the (chordal) reconstituted graph.
With this last question answered, we outline the key steps for the Junction Tree Algorithm. As input, we have an undirected graph and an elimination ordering, and the output comprises all the node marginals (a small code sketch of steps 2-4 appears after the list).
1. Chordalize the input graph using the Elimination Algorithm.
2. Find all the maximal cliques in the chordal graph.
3. Determine all the separator set sizes for the maximal cliques to build a (possibly loopy) weighted clique graph.
4. Find a maximum-weight spanning tree in the weighted clique graph from the previous step to obtain a junction tree.
5. Assign singleton and edge potentials to the junction tree.
6. Run Sum-Product on the junction tree.
7. Do additional marginalization to get node marginals for each node in the original input graph.
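As a rough illustration of steps 2-4, the sketch below works on the (already chordal) graph of Figure 1a: it enumerates maximal cliques by brute force, builds the weighted clique graph, and extracts a maximum-weight spanning tree with a greedy Kruskal-style pass. The helper names are ours, and a real implementation would of course also perform step 1 and handle larger graphs less naively.

```python
# Rough sketch of steps 2-4 for the chordal graph of Figure 1a (helper names ours).
from itertools import combinations

# chordal graph of Figure 1a, as an adjacency map
adj = {1: {2, 3, 4}, 2: {1, 4, 5}, 3: {1, 4}, 4: {1, 2, 3, 5}, 5: {2, 4}}

def is_clique(S):
    return all(v in adj[u] for u, v in combinations(S, 2))

# step 2: maximal cliques (brute force over subsets; fine for this tiny graph)
nodes = list(adj)
cliques = [frozenset(S)
           for r in range(2, len(nodes) + 1)
           for S in combinations(nodes, r) if is_clique(S)]
maximal = [C for C in cliques if not any(C < D for D in cliques)]

# step 3: weighted clique graph, edge weight |V ∩ W| whenever it is positive
clique_edges = [(V, W, len(V & W)) for V, W in combinations(maximal, 2) if V & W]

# step 4: maximum-weight spanning tree (greedy Kruskal with a simple union-find)
parent = {C: C for C in maximal}
def find(c):
    while parent[c] != c:
        c = parent[c]
    return c

tree = []
for V, W, w in sorted(clique_edges, key=lambda e: -e[2]):
    if find(V) != find(W):
        parent[find(V)] = find(W)
        tree.append((sorted(V), sorted(W), w))

print([sorted(C) for C in maximal])
print(tree)   # a junction tree, e.g. {1,3,4}-{1,2,4} and {1,2,4}-{2,4,5}
```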
As with the Elimination Algorithm, we incorporate observations by fixing observed variables to take on specific values as a preprocessing step. Also, we note that we've intentionally brushed aside implementation details, which give rise to different variants of the Junction Tree Algorithm that have been named after different people. For example, by accounting for the structure of the edge potentials, we can simplify the Sum-Product message passing equation, resulting in the Shafer-Shenoy algorithm. Additional clever bookkeeping to avoid having to multiply many messages repeatedly yields the Hugin algorithm.
We end by discussing the computational complexity of the Junction Tree Algorithm. The key quantity of interest is the treewidth of the original graph, defined in a previous lecture in terms of the largest maximal clique size under an optimal elimination ordering. Even using an optimal elimination ordering, the largest maximal-clique node in the junction tree has alphabet size exponential in the treewidth, so the Junction Tree Algorithm will consequently have running time exponential in the treewidth.
Yet, the fact that the Junction Tree Algorithm may take exponential time should not be surprising. Since we can encode NP-complete problems such as 3-coloring as inference problems over undirected graphs, if the Junction Tree Algorithm could efficiently solve all inference problems over undirected graphs, then the P vs. NP debate would have long ago been settled.


15-16. Loopy Belief Propagation and Its Properties


Our treatment so far has been of exact algorithms for inference. We focused on trees first and uncovered efficient inference algorithms. To handle loopy graphs, we introduced the Junction Tree Algorithm, which can be computationally intractable depending on the input graph. But many real-world problems involve loopy graphs, which demand efficient inference algorithms. Today, we switch gears to approximate algorithms for inference to tackle such problems. By tolerating some error in our solutions, many intractable problems suddenly fall within reach of efficient inference.
Our first stop will be to look at loopy graphs with at most pairwise potentials. Although we derived the Sum-Product algorithm so that it is exact on trees, nothing prevents us from running its message updates on a loopy graph with pairwise potentials. This procedure is referred to as loopy belief propagation. While mathematically unsound at first glance, loopy BP performs surprisingly well on numerous real-world problems. On the flip side, loopy BP can also fail spectacularly, yielding nonsensical marginal distributions for certain problems. We want to figure out why such phenomena occur.
In guiding our discussion, we first recall the parallel Sum-Product algorithm, where we care about the parallel version rather than the serial version because our graph may not have any notion of leaves (e.g., a ring graph). Let G = (V, E) be a graph over random variables x1, x2, . . . , xN ∈ X distributed as follows:

    px(x) ∝ ∏_{i∈V} exp( φi(xi) ) ∏_{(i,j)∈E} exp( ψij(xi, xj) ).    (1)

Remark: If (i, j) ∈ E, then (j, i) ∉ E, since otherwise we would have ψij and ψji be two separate factors when we only want one of these; moreover, we define ψji ≜ ψij.
Parallel Sum-Product is given as follows.

Initialization: For (i, j) ∈ E and xi, xj ∈ X,

    m^0_{i→j}(xj) ∝ 1 ∝ m^0_{j→i}(xi).

Message passing (t = 0, 1, 2, . . .): For i ∈ V and j ∈ N(i),

    m^{t+1}_{i→j}(xj) ∝ Σ_{xi∈X} exp( φi(xi) ) exp( ψij(xi, xj) ) ∏_{k∈N(i)\j} m^t_{k→i}(xi),

where we normalize our messages: Σ_{xj∈X} m^t_{i→j}(xj) = 1.

Computing node and edge marginals: For (i, j) ∈ E,

    b^t_i(xi) ∝ exp( φi(xi) ) ∏_{k∈N(i)} m^t_{k→i}(xi),    (2)

    b^t_{ij}(xi, xj) ∝ exp( φi(xi) + φj(xj) + ψij(xi, xj) ) ∏_{k∈N(i)\j} m^t_{k→i}(xi) ∏_{l∈N(j)\i} m^t_{l→j}(xj).    (3)
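To make the update concrete, here is a minimal sketch of the parallel message updates on a small loopy graph (a 3-cycle with made-up potentials over a binary alphabet; all names are ours). The resulting beliefs are only approximations of the true marginals, which is exactly the phenomenon studied in this lecture.

```python
# Minimal sketch of parallel loopy BP on a 3-cycle with made-up potentials.
import numpy as np

X = [0, 1]
V = [0, 1, 2]
E = [(0, 1), (1, 2), (0, 2)]                      # a loop
phi = {i: np.array([0.0, 0.5]) for i in V}        # log node potentials
psi = {e: np.array([[0.8, -0.8], [-0.8, 0.8]]) for e in E}   # "agree" coupling
psi.update({(j, i): psi[(i, j)].T for (i, j) in E})
nbrs = {i: [j for j in V if (i, j) in psi] for i in V}

# initialize all messages to uniform
m = {(i, j): np.ones(len(X)) / len(X) for (i, j) in psi}

for _ in range(50):                               # parallel updates
    new = {}
    for (i, j) in m:
        prod = np.exp(phi[i])
        for k in nbrs[i]:
            if k != j:
                prod = prod * m[(k, i)]
        msg = np.exp(psi[(i, j)]).T @ prod        # sum over x_i
        new[(i, j)] = msg / msg.sum()
    m = new

# approximate node marginals (beliefs), eq. (2)
for i in V:
    b = np.exp(phi[i])
    for k in nbrs[i]:
        b = b * m[(k, i)]
    print(i, b / b.sum())
```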
But what exactly are these message-passing equations doing when the graph is not a tree? Let's denote m^t ∈ [0, 1]^{2|E||X|} to be all the messages computed in iteration t stacked up into a giant vector. Then we can view the message passing equations as applying some function F : [0, 1]^{2|E||X|} → [0, 1]^{2|E||X|} to m^t to obtain m^{t+1}: m^{t+1} = F(m^t).
For loopy BP to converge, it must be that repeatedly applying the message-passing update F eventually takes us to what's called a fixed point m* of F:

    m* = F(m*).

This means that if we initialize our messages with m*, then loopy BP would immediately converge, because repeatedly applying F to m* will just keep outputting m*. Of course, in practice, we don't know where the fixed points are, and the hope is that applying F repeatedly starting from some arbitrary point brings us to a fixed point, which would mean that loopy BP actually converges. This leads us naturally to the following questions:
- Does the message-passing update F have a fixed point?
- If F has a fixed point or multiple fixed points, what are they?
- Does F actually converge to a fixed point?
In addressing these questions, we'll harness classical results from numerical analysis and interweave our inference story with statistical physics.

15.1 Brouwer's fixed-point theorem

To answer the first question raised, we note that the message-passing update equations are continuous, so collectively the message-passing update function F is continuous and in fact maps a convex compact set, [0, 1]^{2|E||X|}, to the same convex compact set. (In Euclidean space, a set is compact if and only if it is closed and bounded, by the Heine-Borel theorem, and a set is convex if the line segment connecting any two points in the set is also in the set.) Then we can directly apply Brouwer's fixed-point theorem:

Theorem 1. (Hadamard 1910, Brouwer 1912) Any continuous function mapping from a convex compact set to the same convex compact set has a fixed point.

This result has been popularized by a coffee cup analogy, where we have a cup of coffee and we stir the coffee. The theorem states that after stirring, at least one point must be in the same position as it was prior to stirring! Note that the theorem is an existential statement with no known constructive proof that gives the exact fixed points promised by the theorem. Moreover, as a historical note, this theorem can be used to prove the existence of Nash equilibria in game theory.
Applying Brouwer's fixed-point theorem to F, we conclude that there does in fact exist at least one fixed point of the message-passing equations. But what does this fixed point correspond to? It could be that this giant-vector-message fixed point does not correspond to the correct final messages that produce correct node and edge marginals. Our next goal is to characterize these fixed points by showing that they correspond to local extrema of an optimization problem.

15.2 Variational form of the log partition function

Describing the optimization problem that the message-passing update equations actually solve involves a peculiar detour: we will look at a much harder optimization problem that in general we can't efficiently solve. Then we'll relax this hard optimization problem by imposing constraints to make the problem substantially easier, which will relate directly to our loopy BP message update equations.
The hard optimization problem is to solve for the log partition function of distribution (1). Recall from Lecture 1 that, in general, solving for the partition function is hard. Its logarithm is no easier to compute. But why should we care about the partition function? The reason is that if we have a black box that computes partition functions, then we can get any marginal of the distribution that we want! We'll show this shortly, but first we'll rewrite factorization (1) in the form of a Boltzmann distribution:

    px(x) = (1/Z) exp{ Σ_{i∈V} φi(xi) + Σ_{(i,j)∈E} ψij(xi, xj) } = (1/Z) e^{-E(x)},    (4)

where Z is the partition function and E(x) ≜ -Σ_{i∈V} φi(xi) - Σ_{(i,j)∈E} ψij(xi, xj) refers to energy. Now suppose that we wanted to know p_{xi}(x̄i). Letting x\i denote the collection (x1, x2, . . . , x_{i-1}, x_{i+1}, . . . , xN), we have

    p_{xi}(x̄i) = Σ_{x\i} px(x)
               = Σ_{x\i} (1/Z) exp{ Σ_{i∈V} φi(xi) + Σ_{(i,j)∈E} ψij(xi, xj) }
               = Z(xi = x̄i) / Z,

where Z(xi = x̄i) = Σ_{x\i} exp{ Σ_{i∈V} φi(xi) + Σ_{(i,j)∈E} ψij(xi, xj) } is just the partition function once we fix random variable xi to take on value x̄i. So if computing partition functions were easy, then we could compute any marginal (not just for nodes) easily. Thus, intuitively, computing the log partition function should be hard.
We now state what the hard optimization problem actually is:

    log Z = sup_{μ∈M} { -Σ_{x∈X^N} μ(x) E(x) - Σ_{x∈X^N} μ(x) log μ(x) },    (5)

where M is the set of all distributions over X^N. This expression is a variational characterization of the log partition function, which is a fancy way of saying that we wrote some expression as an optimization problem. We'll show why the right-hand side optimization problem actually does yield log Z momentarily. First, we lend some physical insight into what's going on by parsing the objective function F of the maximization problem in terms of energy:

    F(μ) = -Σ_{x∈X^N} μ(x) E(x) - Σ_{x∈X^N} μ(x) log μ(x)
         = E_μ[-E(x)] + E_μ[-log μ(x)],    (6)

where the first term is the negative of the average energy with respect to μ and the second term is the entropy of μ. So by maximizing F over μ, we are finding the μ that minimizes average energy while maximizing entropy, which makes sense from a physics perspective.
We could also view the optimization problem through the lens of information theory because, as we'll justify shortly, any solution to optimization problem (5) is also a solution to the following optimization problem:

    min_{μ∈M} D(μ ∥ px),    (7)

where D(· ∥ ·) is the Kullback-Leibler divergence, also called the information divergence, between two distributions over the same alphabet:

    D(p ∥ q) = Σ_x p(x) log ( p(x) / q(x) ).

KL divergence is a way to measure how far apart two distributions are; however, it is not a proper distance because it is not symmetric. It offers a key property that we shall use though: D(p ∥ q) ≥ 0 for all distributions p and q defined on the same alphabet, where equality occurs if and only if p and q are the same distribution.
We now show why the right-hand side maximization in equation (5) yields log Z and, along the way, establish that the maximization in (5) has the same solution as the minimization in (7). We begin by rearranging terms in (4) to obtain

    E(x) = -log Z - log px(x).    (8)

Plugging (8) into (6), we obtain

    F(μ) = -Σ_{x∈X^N} μ(x) E(x) - Σ_{x∈X^N} μ(x) log μ(x)
         = Σ_{x∈X^N} μ(x) ( log Z + log px(x) ) - Σ_{x∈X^N} μ(x) log μ(x)
         = Σ_{x∈X^N} μ(x) log Z + Σ_{x∈X^N} μ(x) log px(x) - Σ_{x∈X^N} μ(x) log μ(x)
         = log Z - Σ_{x∈X^N} μ(x) log ( μ(x) / px(x) )
         = log Z - D(μ ∥ px)
         ≤ log Z,

where the last step is because D(μ ∥ px) ≥ 0, with equality if and only if μ ≡ px. In fact, we could set μ to be px, so the upper bound established is attained with μ ≡ px, justifying why the maximization in (5) yields log Z. Moreover, from the second-to-last line in the derivation above, since log Z does not depend on μ, maximizing F(μ) over μ is equivalent to minimizing D(μ ∥ px), as done in optimization problem (7).
So we now even know the value of μ that maximizes the right-hand side of (5). Unfortunately, plugging in μ to be px, we're left with having to sum over all possible configurations x ∈ X^N, which in general is intractable. We next look at a way to relax this problem called the Bethe approximation.
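For a sanity check of the variational characterization (5), the following brute-force sketch (tiny made-up model, our own helper names) verifies that F(μ) attains log Z at μ = px and is smaller for a randomly chosen distribution.

```python
# Brute-force check of the variational characterization (5) on a tiny model.
import itertools
import numpy as np

phi = {0: [0.0, 0.3], 1: [0.2, 0.0], 2: [0.0, 0.1]}
psi = {(0, 1): [[0.5, -0.5], [-0.5, 0.5]], (1, 2): [[0.4, -0.4], [-0.4, 0.4]]}

configs = list(itertools.product([0, 1], repeat=3))
def neg_energy(x):   # -E(x) = sum of the potentials
    return (sum(phi[i][x[i]] for i in phi)
            + sum(psi[(i, j)][x[i]][x[j]] for (i, j) in psi))

w = np.array([np.exp(neg_energy(x)) for x in configs])
Z = w.sum()
p = w / Z

def F(mu):           # objective of (5): average of -E plus entropy
    avg = sum(mu[k] * neg_energy(x) for k, x in enumerate(configs))
    ent = -sum(mu[k] * np.log(mu[k]) for k in range(len(mu)) if mu[k] > 0)
    return avg + ent

print(np.log(Z), F(p))            # equal (up to floating-point error)
mu_rand = np.random.default_rng(2).dirichlet(np.ones(len(configs)))
print(F(mu_rand) <= np.log(Z))    # True for any other distribution
```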

15.3 Bethe approximation

The guiding intuition we use for relaxing the log partition optimization is to look at what happens when px has a tree graph. By what we argued earlier, the solution to the log partition optimization is to set μ ≡ px, so if px has a tree graph, then μ must factorize as a tree as well. Hence, we need not optimize over all possible distributions over X^N for μ; we only need to look at distributions with tree graphs!
With this motivation, we look at how to parameterize μ to simplify our optimization problem, which involves imposing constraints on μ. We'll motivate the constraints we're about to place by way of example. Supposing that px has the tree graph in Figure 1, px factorizes as:
    px(x) = p_{x1}(x1) p_{x2|x1}(x2|x1) p_{x3|x1}(x3|x1)
          = p_{x1}(x1) · [ p_{x1,x2}(x1, x2) / p_{x1}(x1) ] · [ p_{x1,x3}(x1, x3) / p_{x1}(x1) ]
          = p_{x1}(x1) p_{x2}(x2) p_{x3}(x3) · [ p_{x1,x2}(x1, x2) / ( p_{x1}(x1) p_{x2}(x2) ) ] · [ p_{x1,x3}(x1, x3) / ( p_{x1}(x1) p_{x3}(x3) ) ].

A pattern emerges in the last step, which in fact holds in general: if px has tree graph G = (V, E), then its factorization is given by

    px(x) = ∏_{i∈V} p_{xi}(xi) ∏_{(i,j)∈E} p_{xi,xj}(xi, xj) / ( p_{xi}(xi) p_{xj}(xj) ).

Using this observation, we make the crucial step for relaxing the log partition optimization by parameterizing μ by

    μ(x) = ∏_{i∈V} μi(xi) ∏_{(i,j)∈E} μij(xi, xj) / ( μi(xi) μj(xj) ),    (9)

where we have introduced new distributions μi and μij that need to behave like node marginals and pairwise marginals. By forcing μ to have the above factorization, we arrive at the following Bethe variational problem for the log partition function:

    log Z_bethe = max_μ F(μ)    (10)

subject to the constraints:

    μ(x) = ∏_{i∈V} μi(xi) ∏_{(i,j)∈E} μij(xi, xj) / ( μi(xi) μj(xj) )    for all x ∈ X^N,
    μi(xi) ≥ 0                                                           for all i ∈ V, xi ∈ X,
    Σ_{xi∈X} μi(xi) = 1                                                  for all i ∈ V,
    μij(xi, xj) ≥ 0                                                      for all (i, j) ∈ E, xi, xj ∈ X,
    Σ_{xj∈X} μij(xi, xj) = μi(xi)                                        for all (i, j) ∈ E, xi ∈ X,
    Σ_{xi∈X} μij(xi, xj) = μj(xj)                                        for all (i, j) ∈ E, xj ∈ X.

[Figure 1: the three-node tree used in the example above, with node x1 connected to x2 and x3.]

The key takeaway here is that if px has a tree graph, then the optimal μ will factor as a tree, so this new Bethe variational problem is equivalent to our original log partition optimization problem and, furthermore, log Z = log Z_bethe.
However, if px does not have a tree graph but still has the original pairwise potential factorization given in equation (4), then the above optimization may no longer be exact for computing log Z, i.e., we may have log Z_bethe ≠ log Z. Constraining μ to factor into μi's and μij's as in equation (9), where E is the set of edges in px and could be loopy, is referred to as a Bethe approximation, named after physicist Hans Bethe. Note that since the edge set E is fixed and comes from px, we are not optimizing over the space of all tree distributions!
We're now ready to state a landmark result.

Theorem 2. (Yedidia, Freeman, Weiss 2001) The fixed points of the Sum-Product message updates result in node and edge marginals that are local extrema of the Bethe variational problem (10).
We set the stage for the proof by first massaging the objective function a bit to make it clear exactly what we're maximizing once we plug in the equality constraint for μ factorizing into μi's and μij's. This entails a fair bit of algebra. We first compute simplified expressions for log px and log μ, which we'll then use to derive reasonably simple expressions for the average energy and entropy terms of the objective function F.
Without further ado, denoting di to be the degree of node i, we take the log of equation (4):

    log px(x) = -log Z + Σ_{i∈V} φi(xi) + Σ_{(i,j)∈E} ψij(xi, xj)
              = -log Z + Σ_{i∈V} φi(xi) - Σ_{i∈V} di φi(xi) + Σ_{(i,j)∈E} ( ψij(xi, xj) + φi(xi) + φj(xj) )
              = -log Z + Σ_{i∈V} (1 - di) φi(xi) + Σ_{(i,j)∈E} ( ψij(xi, xj) + φi(xi) + φj(xj) ).

Meanwhile, taking the log of the tree factorization imposed for μ given by equation (9), we get:

    log μ(x) = Σ_{i∈V} log μi(xi) + Σ_{(i,j)∈E} ( log μij(xi, xj) - log μi(xi) - log μj(xj) )
             = Σ_{i∈V} (1 - di) log μi(xi) + Σ_{(i,j)∈E} log μij(xi, xj).

Now, recall from equations (6) and (8) that

    F(μ) = E_μ[-E(x)] + E_μ[-log μ(x)] = E_μ[ log Z + log px(x) ] + E_μ[-log μ(x)],

where the two terms are (up to the log Z shift) the negative average energy with respect to μ and the entropy of μ, respectively.

Gladly, we've computed log px and log μ already, so we plug these right in to determine what F is equal to specifically when μ factorizes like a tree:

    F(μ) = E_μ[ log Z + log px(x) ] + E_μ[ -log μ(x) ]
         = E_μ[ Σ_{i∈V} (1 - di) φi(xi) + Σ_{(i,j)∈E} ( ψij(xi, xj) + φi(xi) + φj(xj) ) ]
           + E_μ[ Σ_{i∈V} (1 - di) ( -log μi(xi) ) + Σ_{(i,j)∈E} ( -log μij(xi, xj) ) ]
         = Σ_{i∈V} (1 - di) E_μ[ φi(xi) ] + Σ_{(i,j)∈E} E_μ[ ψij(xi, xj) + φi(xi) + φj(xj) ]
           + Σ_{i∈V} (1 - di) E_μ[ -log μi(xi) ] + Σ_{(i,j)∈E} E_μ[ -log μij(xi, xj) ]
         = Σ_{i∈V} (1 - di) E_{μi}[ φi(xi) ] + Σ_{(i,j)∈E} E_{μij}[ ψij(xi, xj) + φi(xi) + φj(xj) ]
           + Σ_{i∈V} (1 - di) E_{μi}[ -log μi(xi) ] + Σ_{(i,j)∈E} E_{μij}[ -log μij(xi, xj) ]
         ≜ F_bethe(μ),

where the terms E_{μi}[-log μi(xi)] and E_{μij}[-log μij(xi, xj)] are the entropies of μi and μij, and where F_bethe is the negative Bethe free energy. Consequently, the Bethe variational problem (10) can be viewed as minimizing the Bethe free energy. Mopping up F_bethe a little more yields:

    F_bethe(μ) = Σ_{i∈V} (1 - di) Σ_{xi∈X} μi(xi) [ φi(xi) - log μi(xi) ]
                 + Σ_{(i,j)∈E} Σ_{xi,xj∈X} μij(xi, xj) [ ψij(xi, xj) + φi(xi) + φj(xj) - log μij(xi, xj) ].    (11)
We can now rewrite the Bethe variational problem (10) as

    max_μ F_bethe(μ)

subject to the constraints:

    μi(xi) ≥ 0                        for all i ∈ V, xi ∈ X,
    Σ_{xi∈X} μi(xi) = 1               for all i ∈ V,
    μij(xi, xj) ≥ 0                   for all (i, j) ∈ E, xi, xj ∈ X,
    Σ_{xj∈X} μij(xi, xj) = μi(xi)     for all (i, j) ∈ E, xi ∈ X,
    Σ_{xi∈X} μij(xi, xj) = μj(xj)     for all (i, j) ∈ E, xj ∈ X.

Basically, all we did was simplify the objective function by plugging in the factorization for μ. As a result, the stage is now set for us to prove Theorem 2.
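Before the proof, it may help to see eq. (11) as a computation. The sketch below (made-up beliefs and potentials, our own helper names) simply evaluates F_bethe for locally consistent node and edge beliefs on a small loopy graph; Theorem 2 says that loopy BP fixed points correspond to local extrema of this quantity subject to the consistency constraints.

```python
# Evaluate F_bethe from eq. (11) for given node and edge beliefs (made-up values).
import numpy as np

V = [0, 1, 2]
E = [(0, 1), (1, 2), (0, 2)]
deg = {i: sum(1 for e in E if i in e) for i in V}
phi = {i: np.array([0.0, 0.5]) for i in V}
psi = {e: np.array([[0.8, -0.8], [-0.8, 0.8]]) for e in E}

# made-up beliefs: each node 50/50, each edge slightly "agreeing"; these satisfy
# the consistency constraints sum_xj mu_ij(xi, xj) = mu_i(xi)
mu_i = {i: np.array([0.5, 0.5]) for i in V}
mu_ij = {e: np.array([[0.35, 0.15], [0.15, 0.35]]) for e in E}

def F_bethe(mu_i, mu_ij):
    total = 0.0
    for i in V:
        b = mu_i[i]
        total += (1 - deg[i]) * np.sum(b * (phi[i] - np.log(b)))
    for (i, j) in E:
        b = mu_ij[(i, j)]
        contrib = psi[(i, j)] + phi[i][:, None] + phi[j][None, :] - np.log(b)
        total += np.sum(b * contrib)
    return total

print(F_bethe(mu_i, mu_ij))   # negative Bethe free energy for these beliefs
```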
Proof. Our plan of attack is to introduce Lagrange multipliers to solve the now simplified Bethe variational problem. We'll look at where the gradient of the Lagrangian is zero, corresponding to local extrema, and we'll see that at these local extrema, the Lagrange multipliers relate to Sum-Product messages that have reached a fixed point, and the μi's and μij's correspond exactly to Sum-Product's node and edge marginals.
So first, we introduce Lagrange multipliers. We won't introduce any for the nonnegativity constraints because it'll turn out that these constraints aren't active; i.e., the other constraints will already yield local extrema points that always have nonnegative μi and μij. Let's assign Lagrange multipliers to the rest of the constraints:

    Constraint                                   Lagrange multiplier
    Σ_{xi∈X} μi(xi) = 1                          λi
    Σ_{xj∈X} μij(xi, xj) = μi(xi)                λ_{j→i}(xi)
    Σ_{xi∈X} μij(xi, xj) = μj(xj)                λ_{i→j}(xj)

Collectively calling all the Lagrange multipliers λ, the Lagrangian is thus

L(μ, λ) = F_bethe(μ) + Σ_{i∈V} λ_i ( Σ_{x_i∈X} μ_i(x_i) − 1 )
          + Σ_{(i,j)∈E} Σ_{x_i∈X} λ_ji(x_i) ( Σ_{x_j∈X} μ_ij(x_i, x_j) − μ_i(x_i) )
          + Σ_{(i,j)∈E} Σ_{x_j∈X} λ_ij(x_j) ( Σ_{x_i∈X} μ_ij(x_i, x_j) − μ_j(x_j) ).

The local extrema of F_bethe subject to the constraints we imposed on μ are precisely
the points where the gradient of L with respect to μ is zero and all the equality
constraints in the table above are met.
Next, we take derivatives. We'll work with the form of F_bethe given in equation (11),
and we'll use the result that d/dx (x log x) = log x + 1. First, we differentiate
with respect to μ_k(x_k):
∂L(μ, λ)/∂μ_k(x_k) = (1 − d_k)(φ_k(x_k) − log μ_k(x_k) − 1) + λ_k − Σ_{i∈N(k)} λ_ik(x_k)
                   = −(d_k − 1)φ_k(x_k) + (d_k − 1)(log μ_k(x_k) + 1) + λ_k − Σ_{i∈N(k)} λ_ik(x_k).

Setting this to 0 gives

log μ_k(x_k) = φ_k(x_k) + (1/(d_k − 1)) ( Σ_{i∈N(k)} λ_ik(x_k) − λ_k ) − 1,

so

μ_k(x_k) ∝ exp{ φ_k(x_k) + (1/(d_k − 1)) Σ_{i∈N(k)} λ_ik(x_k) }.     (12)

Next, we differentiate with respect to μ_ki(x_k, x_i):

∂L(μ, λ)/∂μ_ki(x_k, x_i) = ψ_ki(x_k, x_i) + φ_k(x_k) + φ_i(x_i) − log μ_ki(x_k, x_i) − 1 + λ_ik(x_k) + λ_ki(x_i).

Setting this to 0 gives:

μ_ki(x_k, x_i) ∝ exp{ ψ_ki(x_k, x_i) + φ_k(x_k) + φ_i(x_i) + λ_ik(x_k) + λ_ki(x_i) }.     (13)

We now introduce variables m_ij(x_j) for i ∈ V and j ∈ N(i), which are cunningly
named as they will turn out to be the same as Sum-Product messages, but right now,
we do not make such an assumption! The idea is that we'll write our λ_ij multipliers
in terms of the m_ij's. But how do we do this? Pattern matching between equation (13)
and the edge marginal equation (3) suggests that we should set

λ_ij(x_j) = Σ_{k∈N(j)\i} log m_kj(x_j)     for all i ∈ V, j ∈ N(i), x_j ∈ X.

With this substitution, equation (13) becomes

μ_ki(x_k, x_i) ∝ exp(ψ_ki(x_k, x_i) + φ_k(x_k) + φ_i(x_i)) ∏_{j∈N(k)\i} m_jk(x_k) ∏_{j∈N(i)\k} m_ji(x_i),     (14)

which means that at a local extremum of F_bethe, the above equation is satisfied. Of
course, this is the same as the edge marginal equation for Sum-Product. We verify a
similar result for μ_i. The key observation is that
Σ_{i∈N(k)} λ_ik(x_k) = Σ_{i∈N(k)} Σ_{j∈N(k)\i} log m_jk(x_k) = (d_k − 1) Σ_{j∈N(k)} log m_jk(x_k),

which follows from a counting argument. Plugging this result directly into equation (12), we obtain

μ_k(x_k) ∝ exp{ φ_k(x_k) + (1/(d_k − 1)) Σ_{i∈N(k)} λ_ik(x_k) }
         = exp{ φ_k(x_k) + (1/(d_k − 1)) (d_k − 1) Σ_{j∈N(k)} log m_jk(x_k) }
         = exp(φ_k(x_k)) ∏_{j∈N(k)} m_jk(x_k),     (15)

matching the Sum-Product node marginal equation.


What we've shown so far is that any local extremum of F_bethe subject to the
constraints we imposed must have μ_i and μ_ij take on the forms given in equations (15)
and (14), which just so happen to match the node and edge marginal equations for
Sum-Product. But we have not yet shown that the messages themselves are at a fixed
point. To show this last step, we examine our equality constraints on μ_ij. It suffices
to just look at what happens for one of them:

Σ_{x_i∈X} μ_ki(x_k, x_i) ∝ Σ_{x_i∈X} exp(ψ_ki(x_k, x_i) + φ_k(x_k) + φ_i(x_i)) ∏_{j∈N(k)\i} m_jk(x_k) ∏_{j∈N(i)\k} m_ji(x_i)

                         = exp(φ_k(x_k)) ∏_{j∈N(k)\i} m_jk(x_k) Σ_{x_i∈X} exp(ψ_ki(x_k, x_i) + φ_i(x_i)) ∏_{j∈N(i)\k} m_ji(x_i),

at which point, comparing the above equation with the μ_k equation (15), perforce we have

m_ik(x_k) ∝ Σ_{x_i∈X} exp(ψ_ki(x_k, x_i) + φ_i(x_i)) ∏_{j∈N(i)\k} m_ji(x_i).

The above equation is just the Sum-Product message-passing update equation! Moreover,
the above is in fact satisfied at a local extremum of F_bethe for all messages m_ij.
Note that there is absolutely no dependence on iterations. This equation says that
once we're at a local extremum of F_bethe, the above must hold for all x_k ∈ X.
This same argument holds for all the other m_ij's, which means that we're at a
fixed point, and such a fixed point is precisely a fixed point of the Sum-Product
message-update equations.
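To make the fixed-point characterization concrete, here is a minimal sketch (Python/NumPy, with hypothetical data structures not taken from the notes) of the parallel Sum-Product message updates whose fixed points the theorem characterizes; node potentials are stored as exp(φ_i) and edge potentials as exp(ψ_ij):

import numpy as np

def loopy_sum_product(node_pot, edge_pot, neighbors, num_iters=50):
    """node_pot[i]: length-|X| array exp(phi_i); edge_pot[(i,j)]: |X|x|X| array exp(psi_ij)
    with rows indexed by x_i; neighbors[i]: list of neighbors of node i."""
    X = len(next(iter(node_pot.values())))
    # initialize all messages m_{i->j} uniformly
    msgs = {(i, j): np.ones(X) / X for i in neighbors for j in neighbors[i]}
    for _ in range(num_iters):
        new_msgs = {}
        for (i, j) in msgs:
            psi = edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T
            # m_{i->j}(x_j) ∝ sum_{x_i} exp(psi_ij + phi_i) * prod_{k in N(i)\j} m_{k->i}(x_i)
            prod_in = node_pot[i].copy()
            for k in neighbors[i]:
                if k != j:
                    prod_in *= msgs[(k, i)]
            m = psi.T @ prod_in
            new_msgs[(i, j)] = m / m.sum()     # renormalize, as assumed in the analysis
        msgs = new_msgs
    # node marginals: mu_i(x_i) ∝ exp(phi_i) * prod_{j in N(i)} m_{j->i}(x_i), as in (15)
    marginals = {}
    for i in neighbors:
        b = node_pot[i].copy()
        for j in neighbors[i]:
            b *= msgs[(j, i)]
        marginals[i] = b / b.sum()
    return marginals

If the messages stop changing, they satisfy the update equation above and hence sit at a local extremum of F_bethe.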
We've established that Sum-Product message updates on loopy graphs do have
at least one fixed point and, moreover, all possible fixed points are local extrema of
F_bethe, the negative Bethe free energy. Unfortunately, Theorem 2 is only a statement
about local extrema and says nothing about whether a fixed point of Sum-Product
message updates is a local maximum or a local minimum of F_bethe. Empirically, it
has been found that fixed points corresponding to maxima of the negative Bethe free
energy, i.e., minima of the Bethe free energy, tend to be more stable. This empirical
result intuitively makes sense since for loopy graphs, we can view local maxima of
F_bethe as trying to approximate log Z, which is what the original hard optimization
problem was after in the first place.
Now that we have characterized the fixed points of the Sum-Product message-passing
equations for loopy (as well as non-loopy) graphs, it remains to discuss whether the
algorithm actually converges to any of these fixed points.

14.4  Loopy BP convergence

Our quest ends with the last question we asked originally, which effectively asks
whether loopy BP converges to any of the fixed points. To answer this, we will sketch
two different methods of analysis and will end by stating a few results on Gaussian
loopy BP and an alternate approach to solving the Bethe variational problem.

14.4.1  Computation trees

In contrast to focusing on fixed points, computation trees provide more of a dynamic
view of loopy BP by visualizing how messages across iterations contribute to the
computation of a node marginal. We illustrate computation trees through an example.
Consider a probability distribution with the loopy graph in Figure 2a.

Figure 2: (a) A loopy graph on nodes 1, 2, and 3. (b) The messages m^1_{2→1} and m^1_{3→1} needed to compute the marginal at node 1 after one iteration. (c) The messages m^0_{3→2} and m^0_{2→3} that those messages in turn depend on.
Suppose we ran loopy BP for exactly one iteration and computed the marginal
at node 1. Then this would require us to look at messages m^1_{2→1} and m^1_{3→1}, where
the superscripts denote the iteration number. We visualize these two dependencies in
Figure 2b. To compute m^1_{2→1}, we needed to look at message m^0_{3→2}, and to compute
m^1_{3→1}, we needed to look at m^0_{2→3}. These dependencies are visualized in Figure 2c.
These dependency diagrams are called computation trees. For the example graph
in Figure 2a, we can also draw out the full computation tree up to iteration t, which
is shown in Figure 3. Note that loopy BP is operating as if the underlying graph were
the computation tree (with the arrows on the edges removed), not realizing that some
nodes are duplicates!
But how can we use the computation tree to reason about loopy BP convergence?
Observe that in the full computation tree up to iteration t, the left and right chains
that descend from the top node each have a repeating pattern of three nodes, which
are circled in Figure 3. Thus, the initial uniform message sent up the left chain will
pass through the repeated block over and over again. In fact, assuming messages
are renormalized at each step, then each time the message passes through the repeated
block of three nodes, we can actually show that it's as if the message was
left-multiplied by a state transition matrix M for a Markov chain!
So if the initial message v on the left chain is from node 2, then after 3t iterations,
the resulting message would be M^t v. This analysis suggests that loopy BP converges
to a unique solution if, as t → ∞, M^t v converges to a unique vector and something
similar happens for the right-hand chain. In fact, from Markov chain theory, we
know that this uniqueness does occur if our alphabet size is finite and M is the
transition matrix for an ergodic Markov chain. This result follows from the Perron-Frobenius
theorem and can be used to show loopy BP convergence for a variety of
graphs with single loops. Complications arise when there are multiple loops.
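As a quick numerical illustration of the M^t v argument (a minimal sketch with a made-up 2×2 column-stochastic matrix M, not the repeated block of any particular graph), repeated multiplication by an ergodic M drives any normalized starting message to the same limiting vector:

import numpy as np

# hypothetical column-stochastic "repeated block" matrix for a binary alphabet
M = np.array([[0.7, 0.4],
              [0.3, 0.6]])

def iterate_block(v, t):
    """Apply the repeated block t times, renormalizing like loopy BP does."""
    v = np.asarray(v, dtype=float)
    for _ in range(t):
        v = M @ v
        v /= v.sum()          # renormalization does not change the direction
    return v

print(iterate_block([1.0, 0.0], 50))   # both starting messages converge to the
print(iterate_block([0.1, 0.9], 50))   # same limit, the Perron eigenvector of M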


14.4.2  Attractive fixed points

Another avenue for analyzing convergence is a general fixed-point result in numerical
analysis that basically answers: if we're close to a fixed point of some iterated
function, does applying the function push us closer toward the fixed point? One way
to answer this involves looking at the gradient of the iterated function, which in our
case is the Sum-Product message update F defined earlier, acting on giant vectors
containing all messages.
We'll look at the 1D case first to build intuition. Suppose f: R → R has fixed point
x* = f(x*). Let x^t be our guess of the fixed point x* at iteration t, where x^0 is our
initial guess and we set x^{t+1} = f(x^t). What we would like is that at each iteration
the error decreases, i.e., the error |x^{t+1} − x*| should be smaller than the error
|x^t − x*|. More formally, we want

|x^{t+1} − x*| ≤ γ |x^t − x*|     for some γ < 1.     (16)

In fact, we could chain these together to bound our total error at iteration t + 1 in
terms of our initial error:

|x^{t+1} − x*| ≤ γ|x^t − x*| ≤ γ(γ|x^{t−1} − x*|) ≤ γ(γ(γ|x^{t−2} − x*|)) ≤ ... ≤ γ^{t+1}|x^0 − x*|,     (17)

which implies that as t → ∞, the error is driven to 0.


We're now going to relate this back to f, but first we'll need to recall the Mean-Value
Theorem: let a < b. If f: [a,b] → R is continuous on [a,b] and differentiable on (a,b),
then f(a) − f(b) = f'(c)(a − b) for some c ∈ (a,b). Then

|x^{t+1} − x*| = |f(x^t) − f(x*)| = |f'(y)(x^t − x*)|     (via the Mean-Value Theorem)
               = |f'(y)| |x^t − x*|,

for some y in an open interval (a,b) with a < b and x^t, x* ∈ [a,b]. Note that if
|f'(y)| ≤ γ for all y close to the fixed point x*, then we exactly recover condition (16),
which drives our error to 0 over time, as shown in inequality (17). Phrased another way,
if |f'(y)| < 1 for all y close to the fixed point x*, then if our initial guess is close
enough to x*, we will actually converge to x*.

Figure 3: The full computation tree for the graph of Figure 2a, drawn up to iteration t; the repeating three-node blocks on the left and right chains are circled.

Extending this to the n-dimensional case is fairly straightforward. We'll basically
apply a multi-dimensional version of the Mean-Value Theorem. Let f: R^n → R^n have
fixed point x* = f(x*). We update x^{t+1} = f(x^t). Let f_i denote the i-th component
of the output of f, i.e.,

x^{t+1} = f(x^t) = (f_1(x^t), f_2(x^t), ..., f_n(x^t))^T.
Then

|x^{t+1}_i − x*_i| = |f_i(x^t) − f_i(x*)|
                   = |∇f_i(y_i)^T (x^t − x*)|     (multi-dimensional Mean-Value Theorem)
                   ≤ ‖∇f_i(y_i)‖_1 ‖x^t − x*‖_∞,

for some y_i in a neighborhood whose closure includes x^t and x*, and where the
inequality² holds because for any a, b ∈ R^n,

|a^T b| = | Σ_{k=1}^n a_k b_k | ≤ Σ_{k=1}^n |a_k||b_k| ≤ ( Σ_{k=1}^n |a_k| ) ( max_{i=1,...,n} |b_i| ) = ‖a‖_1 ‖b‖_∞.
Lastly, taking the max of both sides across all i = 1, 2, ..., n, we get

‖x^{t+1} − x*‖_∞ ≤ ( max_{i=1,...,n} ‖∇f_i(y_i)‖_1 ) ‖x^t − x*‖_∞.

Then if ‖∇f_i(z)‖_1 is strictly less than a positive constant S for all z in an open ball
whose closure includes x^t and x* and for all i, then

‖x^{t+1} − x*‖_∞ ≤ ( max_{i=1,...,n} ‖∇f_i(y_i)‖_1 ) ‖x^t − x*‖_∞ ≤ S ‖x^t − x*‖_∞,

and provided that S < 1 and x^0 is sufficiently close to x*, iteratively applying f will
indeed cause x^t to converge to x*.
In fact, if ‖∇f_i(x*)‖_1 is strictly less than 1 for all i, then by continuity and
differentiability of f, there will be some neighborhood B around x* for which
‖∇f_i(z)‖_1 < 1 for all z ∈ B and for all i. In this case, x* is called an attractive
fixed point. Note that this condition is sufficient but not necessary for convergence.
This gradient condition can be applied to the Sum-Product message update F; the
resulting analysis depends on the node and edge potentials.
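A minimal numerical sketch of the contraction argument (Python/NumPy, using the made-up 1D map f(x) = cos x rather than an actual BP update): the iteration converges because |f'(x*)| < 1 at the fixed point, and the per-step error ratio settles at roughly |f'(x*)|.

import numpy as np

f = lambda x: np.cos(x)              # hypothetical map; fixed point x* ≈ 0.7391
x_star = 0.7390851332151607          # the fixed point of cos, for reference
x = 3.0                              # initial guess
for t in range(25):
    new_x = f(x)
    # error shrinks by roughly |f'(x*)| = |sin(x*)| ≈ 0.674 per step near x*
    print(t, abs(new_x - x_star) / abs(x - x_star))
    x = new_x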
² In fact, this is a special case of what's called Hölder's inequality, which generalizes Cauchy-Schwarz by upper-bounding absolute values of dot products by products of dual norms.

14.4.3  Final remarks

We've discussed the existence of at least one loopy BP fixed point, and we've shown
that any such fixed point is a local extremum encountered in the Bethe variational
problem. Whether loopy BP converges is tricky; we sketched two different approaches
one could take. Luckily, some work has already been done for us: Weiss and Freeman
(2001) showed that if Gaussian loopy BP converges, then we are guaranteed that the
means are correct but not necessarily the covariances; additional conditions ensure
that the covariances are correct as well. And if loopy BP fails to converge, then a
recent result by Shih (2011) gives an algorithm with running time O(n²(1/ε)ⁿ) that
is guaranteed to solve the Bethe variational problem with error at most ε.

17  Variational Inference

Prompted by loopy graphs for which exact inference is computationally intractable,
we tread into algorithms for approximate inference, in search of efficient solutions
that may incur some error. We began with loopy belief propagation, which meant applying
the parallel Sum-Product algorithm to loopy graphs with at most pairwise potentials.
Our analysis revealed that the problem could be viewed in terms of approximating
the log partition function of the distribution of interest.
Today we extend this train of analysis, presenting a general approach that approximates
distributions we care about by modifying the hard log partition optimization.
Previously, we applied a Bethe approximation to the log partition optimization to
recover an optimization problem intricately related to loopy BP. The Bethe approximation
amounted to changing the space we're optimizing over to a simpler set, which
slammed down the computational complexity of inference. But we could have easily
chosen a different space to optimize over, such as some family of probability
distributions, which would yield different approximating distributions! This general
approach is called variational inference, and cleverly selecting which family of
distributions to optimize over will offer us lower and upper bounds on the log
partition function.
We'll first review the log partition optimization and Bethe approximation before
plunging into how to bound the log partition function for distributions with at most
pairwise potentials. Each bound will directly yield an algorithm that gives an
approximating distribution. But variational inference works for distributions with
potentials on larger cliques as well! We'll save this for the end, when we'll also
briefly inject variational inference with an information-theoretic interpretation.

17.1  Log partition optimization and Bethe approximation

We blaze through some previously stated results while defining a few new variables.
Unless otherwise stated, we take the distribution p_x of x ∈ X^N, defined over graph
G = (V, E), to have factorization

p_x(x) = (1/Z) exp{ Σ_{i∈V} φ_i(x_i) + Σ_{(i,j)∈E} ψ_ij(x_i, x_j) },     (1)

where Z is the partition function.
Denoting H(μ) ≜ E_μ[−log μ(x)] = −Σ_x μ(x) log μ(x) to be the entropy of distribution μ,
the hard log partition variational problem is:

log Z = sup_{μ∈M} { E_μ[ Σ_{i∈V} φ_i(x_i) + Σ_{(i,j)∈E} ψ_ij(x_i, x_j) ] + H(μ) },     (2)

where M is the set of all distributions over X^N.


Applying the Bethe approximation amounted to saying that rather than optimizing
over M, we'll optimize over μ with factorization

μ(x) = ∏_{i∈V} μ_i(x_i) ∏_{(i,j)∈E} μ_ij(x_i, x_j) / (μ_i(x_i) μ_j(x_j)),     (3)

where the μ_i's and μ_ij's are constrained to behave like node and edge marginals.
Extremely important is the fact that E in the factorization is precisely the edge set in
the graph for p_x, which may be loopy! Adding the above factorization constraint on μ
to the log partition variational problem and scrawling a few lines of algebra, we're
left with the Bethe variational problem below.

log Z_bethe = sup_μ { Σ_{i∈V} E_{μ_i}[φ_i(x_i)] + Σ_{(i,j)∈E} E_{μ_ij}[ψ_ij(x_i, x_j)]
                       + Σ_{i∈V} (1 − d_i) H(μ_i) + Σ_{(i,j)∈E} H(μ_ij) }     (4)

subject to:

    μ_i(x_i) ≥ 0                              for all i ∈ V, x_i ∈ X
    Σ_{x_i∈X} μ_i(x_i) = 1                    for all i ∈ V
    μ_ij(x_i, x_j) ≥ 0                        for all (i,j) ∈ E, x_i, x_j ∈ X
    Σ_{x_j∈X} μ_ij(x_i, x_j) = μ_i(x_i)       for all (i,j) ∈ E, x_i ∈ X
    Σ_{x_i∈X} μ_ij(x_i, x_j) = μ_j(x_j)       for all (i,j) ∈ E, x_j ∈ X

where d_i is the degree of node i.

We get out approximate node marginals μ_i and approximate edge marginals μ_ij.
Also, we previously argued that if p_x has a tree graph, then log Z = log Z_bethe.
However, if p_x has a loopy graph, then log Z and log Z_bethe may differ, and how much
they differ is messy but can be described by loop series expansions. Unfortunately, it's
unclear whether log Z_bethe is a lower or upper bound for log Z, as we'll discuss later.
But why is bounding the log partition function interesting? Recall that the
marginal over a subset S ⊆ V can be computed by

p_{x_S}(x̄_S) = [ Σ_{x_{\S}} exp{ Σ_{i∈V} φ_i(x_i) + Σ_{(i,j)∈E} ψ_ij(x_i, x_j) } ] / Z = Z(x_S = x̄_S) / Z,

where x_{\S} denotes all the x_i variables not in x_S, and Z(x_S = x̄_S) is the partition
function evaluated when x_S is fixed to have value x̄_S. If we can bound log partition
functions, then we can bound partition functions. Thus, finding a lower bound on
Z(x_S = x̄_S) and an upper bound on Z would give us a lower bound for p_{x_S}(x̄_S).
Meanwhile, finding an upper bound on Z(x_S = x̄_S) and a lower bound on Z would give
an upper bound for p_{x_S}(x̄_S). So it's possible to sandwich p_{x_S}(x̄_S) into an interval!

17.2  Lower bound using mean field

Imagine we're asked to optimize over 10 items, but we're too lazy or it's too expensive
to check all 10. So we stop after checking five of them and return the best solution
so far. Surely our solution provides a lower bound for the best solution, since if we
checked the rest of the items, our solution would either stay the same or improve.
This idea is precisely what we'll use to secure a lower bound on log Z: rather than
optimizing over all distributions in M, we'll optimize over a subset of M. Our solution
will thus yield a lower bound on the log partition function.
The simplest family of distributions that is guaranteed to be a subset of M is the
set of distributions that fully factorize into singleton factors:

μ(x) = ∏_{i∈V} μ_i(x_i).

This factorization is called the mean field factorization, and the family of distributions
with such a factorization will be denoted M_MF. At first glance, mean field might
seem too simplistic as there is no neighbor structure; all the x_i's are independent
under μ! But as we'll see, by optimizing over M_MF, the solution will involve looking
at neighbors in the original graph. Furthermore, in the literature, the mean field
factorization is far and away the most popular way to do variational inference.
By looking at the original log partition variational problem and plugging in the
mean field factorization constraint on μ, a few lines of algebra show that the mean
field variational inference problem is as follows.

log Z_MF = max_μ { Σ_{i∈V} E_{μ_i}[φ_i(x_i)] + Σ_{(i,j)∈E} E_{μ_i μ_j}[ψ_ij(x_i, x_j)] + Σ_{i∈V} H(μ_i) }     (5)

subject to:

    μ_i(x_i) ≥ 0                   for all i ∈ V, x_i ∈ X
    Σ_{x_i∈X} μ_i(x_i) = 1         for all i ∈ V

The μ_i's can be viewed as approximate node marginals of p_x. As we've already
justified, we are guaranteed that log Z_MF ≤ log Z.
Let's actually optimize for μ by slapping in some Lagrange multipliers and setting
derivatives to 0. As with the Bethe variational problem, we need not introduce Lagrange
multipliers for the nonnegativity constraints; we'll find that without explicitly
enforcing nonnegativity, our solution has nonnegative μ_i(x_i)'s anyway. Thus,
for each i, we introduce a Lagrange multiplier λ_i for the constraint Σ_{x_i∈X} μ_i(x_i) = 1.
The Lagrangian is given by:
L(μ, λ) = Σ_{i∈V} E_{μ_i}[φ_i(x_i)] + Σ_{(i,j)∈E} E_{μ_i μ_j}[ψ_ij(x_i, x_j)] + Σ_{i∈V} H(μ_i)
          + Σ_{i∈V} λ_i ( Σ_{x_i∈X} μ_i(x_i) − 1 )

        = Σ_{i∈V} Σ_{x_i∈X} μ_i(x_i) φ_i(x_i) + Σ_{(i,j)∈E} Σ_{x_i,x_j∈X} μ_i(x_i) μ_j(x_j) ψ_ij(x_i, x_j)
          − Σ_{i∈V} Σ_{x_i∈X} μ_i(x_i) log μ_i(x_i) + Σ_{i∈V} λ_i ( Σ_{x_i∈X} μ_i(x_i) − 1 ).

Hence,

∂L(μ, λ)/∂μ_i(x_i) = φ_i(x_i) + Σ_{j∈N(i)} Σ_{x_j∈X} μ_j(x_j) ψ_ij(x_i, x_j) − (log μ_i(x_i) + 1) + λ_i.

Setting this to zero, we obtain

μ_i(x_i) ∝ exp{ φ_i(x_i) + Σ_{j∈N(i)} Σ_{x_j∈X} μ_j(x_j) ψ_ij(x_i, x_j) }.

As advertised earlier, the solution at node i involves its neighbors. To actually compute
the approximate marginals μ_i now, we typically need to do an iterative update, such as:

μ_i^{t+1}(x_i) ∝ exp{ φ_i(x_i) + Σ_{j∈N(i)} Σ_{x_j∈X} μ_j^t(x_j) ψ_ij(x_i, x_j) },

where t indexes iteration numbers. If we manage to procure an optimal μ, then
plugging it back into the objective function yields log Z_MF. Analyzing convergence in
general requires work, similar to analyzing loopy BP convergence. For example, if p_x
has structure that makes the mean field variational problem (5) a convex optimization
problem, then our iterative updates will converge to the globally optimal solution.
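To make the update concrete, here is a minimal sketch (Python/NumPy, with hypothetical data structures) of naive mean field coordinate updates for a pairwise MRF with node potentials φ_i and edge potentials ψ_ij:

import numpy as np

def mean_field(phi, psi, neighbors, num_iters=100):
    """phi[i]: length-|X| array of node potentials; psi[(i,j)]: |X|x|X| array of
    edge potentials (rows indexed by x_i); neighbors[i]: list of neighbors of i.
    Returns approximate marginals mu[i]."""
    X = len(next(iter(phi.values())))
    mu = {i: np.ones(X) / X for i in neighbors}          # uniform initialization
    for _ in range(num_iters):
        for i in neighbors:                              # sweep over nodes
            log_mu = phi[i].copy()
            for j in neighbors[i]:
                psi_ij = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
                # add sum_{x_j} mu_j(x_j) * psi_ij(x_i, x_j)
                log_mu += psi_ij @ mu[j]
            log_mu -= log_mu.max()                       # for numerical stability
            mu[i] = np.exp(log_mu)
            mu[i] /= mu[i].sum()                         # normalize
    return mu

Sweeping nodes in place (rather than updating all μ_i simultaneously) is the usual coordinate-update variant; either schedule has the same fixed-point equations, namely the ones derived above.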

17.3  Other lower bounds

Because M_MF ⊆ M, restricting the log partition variational problem to optimize over
M_MF instead of M results in a lower bound for log Z. Of course, optimizing over any
subset of M yields a lower bound. For example, in principle we could optimize over
M_tree, the space of all tree distributions over nodes V. This would be overkill as there
are N^{N−2} such trees. We could instead restrict our attention to one specific tree τ
that connects all of V. Then we could optimize over M_tree(τ), defined to be the family
of distributions with factorization (3) except where the edges are from tree τ. Unlike
the Bethe approximation, since τ is a tree, we are never optimizing over a loopy graph.
A hierarchical diagram relating these families is in Figure 1.

Figure 1: Hierarchy of a few families of probability distributions.


Denoting log Z_tree and log Z_tree(τ) to be the outputs of the log partition variational
problem restricted to M_tree and M_tree(τ) respectively, our earlier argument and
the hierarchy diagram indicate that:

log Z_MF ≤ log Z_tree(τ) ≤ log Z_tree ≤ log Z.

Of course, other subsets of M can be chosen; we're just giving the above families as
examples since they're easy to describe.
Lastly, we mention that the family of distributions M_bethe(G), corresponding to
distributions with factorization (3) where specifically the nodes and edges are those
of graph G, is not necessarily a subset of M and therefore may include members
that, strictly speaking, don't correspond to any valid distribution over X^N. To see
this, consider a 3-node fully-connected graph G. Then members of M_bethe(G) have
factorization

μ(x) = μ_1(x_1) μ_2(x_2) μ_3(x_3) · μ_12(x_1,x_2)/(μ_1(x_1)μ_2(x_2)) · μ_23(x_2,x_3)/(μ_2(x_2)μ_3(x_3)) · μ_13(x_1,x_3)/(μ_1(x_1)μ_3(x_3))

     = μ_12(x_1,x_2)/μ_1(x_1) · μ_23(x_2,x_3)/μ_2(x_2) · μ_13(x_1,x_3)/μ_3(x_3),

which, invoking the definition of conditional probability, would mean that μ has a
cyclic directed graphical model! So even though we're guaranteed that the μ_i's and
μ_ij's are valid distributions, a solution to the Bethe variational problem may have a
joint "distribution" that does not correspond to any consistent distribution over X^N.
This explains why log Z_bethe may not be a lower bound for log Z.

17.4  Upper bound using tree-reweighted belief propagation

Providing an upper bound for the log partition function turns out to be less straightforward.
One way to do this is via what's called tree-reweighted belief propagation,
which looks at convex combinations¹ of trees. We'll only sketch the method here,
deferring details to Wainwright, Jaakkola, and Willsky's paper.²
Suppose we want an upper bound on the log partition function for p_x defined over
the graph in Figure 2a with only edge potentials. We'll instead look at its spanning
trees, where we cleverly assign new edge potentials as shown in Figures 2b, 2c, and 2d.
Here, each tree is given a weight of 1/3 in the convex combination.

Figure 2: A loopy graph (a) and all its spanning trees τ1 (b), τ2 (c), and τ3 (d). Edge potentials are shown next to each edge.
Where do the edge potentials in the spanning trees come from? We'll illustrate the
basic idea by explaining how the potentials for edge (1,2) across the spanning trees
are defined. The same idea works for defining the potentials on the other edges of the
spanning trees. Observe that

ψ_12 = (1/3)·(3/2)ψ_12 + (1/3)·(0·ψ_12) + (1/3)·(3/2)ψ_12,

where each 1/3 is the weight of a tree (τ1, τ2, τ3) and (3/2)ψ_12, 0·ψ_12, and (3/2)ψ_12
are edge (1,2)'s potentials in trees τ1, τ2, and τ3, respectively.
The next critical observation involves optimizing over the family of distributions
M_tree(τk) defined in the previous section, where τk = (V, E_k) is now one of our spanning
trees. We shall denote ψ_ij^{τk} to be the potential for edge (i,j) ∈ E in tree τk, where
if the edge isn't present in tree τk, then the potential is just the 0 function. As a
reminder, we are assuming that we only have edge potentials, which is fine since if
we had singleton potentials, these could always be folded into edge potentials.

¹ A convex combination of n items x_1, ..., x_n is like a linear combination except that the weights must be nonnegative and sum to 1, e.g., Σ_{i=1}^n λ_i x_i is a convex combination of x_1, ..., x_n if all the λ_i's are nonnegative and Σ_{i=1}^n λ_i = 1.
² See M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky's "A New Class of Upper Bounds on the Log Partition Function" (2005).

We arrive at the following pivotal equation:

E_μ[ Σ_{(i,j)∈E} ψ_ij(x_i, x_j) ] = E_μ[ Σ_{(i,j)∈E} (1/3) Σ_{k=1}^3 ψ_ij^{τk}(x_i, x_j) ]
                                  = (1/3) Σ_{k=1}^3 E_μ[ Σ_{(i,j)∈E} ψ_ij^{τk}(x_i, x_j) ]
                                  = (1/3) Σ_{k=1}^3 E_μ[ Σ_{(i,j)∈E_k} ψ_ij^{τk}(x_i, x_j) ].     (6)

We are now ready to upper-bound the log partition function:

log Z = sup_{μ∈M} { E_μ[ Σ_{(i,j)∈E} ψ_ij(x_i, x_j) ] + H(μ) }

      = sup_{μ∈M} { (1/3) Σ_{k=1}^3 E_μ[ Σ_{(i,j)∈E_k} ψ_ij^{τk}(x_i, x_j) ] + H(μ) }     (using equation (6))

      = sup_{μ∈M} { (1/3) Σ_{k=1}^3 ( E_μ[ Σ_{(i,j)∈E_k} ψ_ij^{τk}(x_i, x_j) ] + H(μ) ) }

      ≤ (1/3) Σ_{k=1}^3 sup_{μ∈M} { E_μ[ Σ_{(i,j)∈E_k} ψ_ij^{τk}(x_i, x_j) ] + H(μ) },

where the last step uses the fact that for any functions f_1, f_2 defined over the same
domain and range, we have

sup_x {f_1(x) + f_2(x)} ≤ sup_x f_1(x) + sup_x f_2(x).

Note that for each spanning tree τk,

sup_{μ∈M} { E_μ[ Σ_{(i,j)∈E_k} ψ_ij^{τk}(x_i, x_j) ] + H(μ) } = sup_{μ∈M_tree(τk)} { E_μ[ Σ_{(i,j)∈E_k} ψ_ij^{τk}(x_i, x_j) ] + H(μ) },

since the underlying distribution is actually just over tree τk, so it suffices to optimize
over the family M_tree(τk). This optimization is a specific case where the Bethe
variational problem is exact, so we can solve it using belief propagation!
Of course, we've only worked out a simple example for which we could have
chosen different weights for our convex combination, not just one where the weights
are all equal. Furthermore, we didn't give an algorithm that actually produces an
approximating distribution! Worse yet, for larger graphs still defined over pairwise
potentials, the number of spanning trees explodes, and optimizing the choice of weights
to get the tightest upper bound requires care. Luckily, all of these loose ends are
resolved in the paper by Wainwright et al. We encourage those who are interested to
peruse the paper for the gory details.

17.5  Larger cliques and information-theoretic interpretation

We now consider the case where p_x still has graph G = (V, E) but now has factorization

p_x(x) = (1/Z) ∏_{C∈C} ψ_C(x_C) = (1/Z) exp{ Σ_{C∈C} log ψ_C(x_C) },     (7)

where C is the set of maximal cliques, which may be larger than just pairwise cliques.
The log partition variational problem in this case is

log Z = sup_{μ∈M} { E_μ[ Σ_{C∈C} log ψ_C(x_C) ] + H(μ) }.     (8)

As a sanity check, we repeat a calculation from the loopy BP lecture to ensure that
the right-hand side optimization does yield log Z. First note that

Σ_{C∈C} log ψ_C(x_C) = log Z + log p_x(x).

Then

E_μ[ Σ_{C∈C} log ψ_C(x_C) ] + H(μ) = E_μ[ Σ_{C∈C} log ψ_C(x_C) − log μ(x) ]
                                   = E_μ[ log Z + log p_x(x) − log μ(x) ]
                                   = log Z − E_μ[ log( μ(x) / p_x(x) ) ]
                                   = log Z − D(μ ‖ p_x),     (9)

where D(p ‖ q) is the KL divergence between distributions p and q over the same
alphabet. Since D(p ‖ q) ≥ 0 with equality if and only if p ≡ q, plugging this
inequality into (9) gives

E_μ[ Σ_{C∈C} log ψ_C(x_C) ] + H(μ) ≤ log Z,

where equality is achieved by setting μ ≡ p_x. Thus, indeed the right-hand side
maximization of (8) yields log Z. Note that our earlier discussion on lower-bounding
the log partition function easily carries over to this more general case. However, the
upper bound requires some heavier machinery, which we'll skip.
The last line of (9) says that maximizing the objective of (8) over μ ∈ M is
tantamount to minimizing D(μ ‖ p_x), so

argmin_{μ∈M} D(μ ‖ p_x) = argmax_{μ∈M} { E_μ[ Σ_{C∈C} log ψ_C(x_C) ] + H(μ) }.

This has a nice information-theoretic implication: constraining which family of
distributions we optimize over can be viewed in terms of KL divergence. For example, if
we just want the approximating mean field distribution and don't care about the value
of log Z_MF, then we solve

argmin_{μ∈M_MF} D(μ ‖ p_x).     (10)

Thus, variational inference can be viewed as finding the member within a family of
approximating distributions that is closest to the original distribution p_x in KL divergence!
This can be interpreted as projecting p_x onto a family of approximating distributions
that we must pre-specify.
We end by solving for optima of the mean field variational problem (10) using the more
general clique factorization (7) for p_x. The steps are nearly identical to the pairwise
factorization case but involve a little more bookkeeping. As before, we introduce a
Lagrange multiplier λ_i for the constraint Σ_{x_i∈X} μ_i(x_i) = 1. Then the Lagrangian is

L(μ, λ) = D(μ ‖ p_x) + Σ_{i∈V} λ_i ( Σ_{x_i∈X} μ_i(x_i) − 1 )

        = E_μ[ log( μ(x) / p_x(x) ) ] + Σ_{i∈V} λ_i ( Σ_{x_i∈X} μ_i(x_i) − 1 )

        = E_μ[log μ(x)] − E_μ[log p_x(x)] + Σ_{i∈V} λ_i ( Σ_{x_i∈X} μ_i(x_i) − 1 )

        = Σ_{i∈V} E_μ[log μ_i(x_i)] − E_μ[ −log Z + Σ_{C∈C} log ψ_C(x_C) ] + Σ_{i∈V} λ_i ( Σ_{x_i∈X} μ_i(x_i) − 1 )

        = Σ_{i∈V} Σ_{x_i∈X} μ_i(x_i) log μ_i(x_i) − Σ_{C∈C} Σ_{x_C∈X^{|C|}} ( ∏_{j∈C} μ_j(x_j) ) log ψ_C(x_C)
          + Σ_{i∈V} λ_i ( Σ_{x_i∈X} μ_i(x_i) − 1 ) + log Z.

Thus,

∂L(μ, λ)/∂μ_i(x_i) = log μ_i(x_i) + 1 − Σ_{C∈C: i∈C} Σ_{x_C\x_i} log ψ_C(x_C) ∏_{j∈C, j≠i} μ_j(x_j) + λ_i.

Setting this to 0 gives the mean field update equation

μ_i(x_i) ∝ exp{ Σ_{C∈C: i∈C} Σ_{x_C\x_i} log ψ_C(x_C) ∏_{j∈C, j≠i} μ_j(x_j) }.
17.6  Concluding remarks

We've presented variational inference as a way to approximate distributions. This
approach has interpretations of approximating the log partition function or doing an
information-theoretic projection. We've also given a flavor of the calculations involved
in obtaining update rules (which describe an algorithm) that find an approximating
distribution. In practice, simple approximating distributions such as mean field are
typically used because more complicated distributions can have update rules that are
hard to derive or, even if we do have formulas for them, computationally expensive
to compute.
Unfortunately, simple approximating distributions may not characterize our original
distribution well. A different way to characterize a distribution is through samples
from it. Intuitively, if we have enough samples, then we have a good idea of what the
distribution looks like. With this inspiration, our next stop will be how to sample
from a distribution without knowing its partition function, leading to a different class
of approximate inference algorithms.


18  Markov Chain Monte Carlo Methods and Approximate MAP
In the recent lectures we have explored ways to perform approximate inference in
graphical models with structure that makes efficient inference intractable. First we
saw loopy belief propagation and then found that it fits into the larger framework of
variational approximation. Unfortunately, simple distributions may not always
approximate our distribution well. In these cases, we may settle for characterizing the
distribution with samples. Intuitively, if we have enough samples, we can recover any
important information about the distribution.
Today, in the first half, we'll see a technique for sampling from a distribution
without knowing the partition function. Amazingly, we can easily create a Markov
chain whose stationary distribution is the distribution of interest. This area is very
rich and we'll only briefly scratch the surface of it. In the second half, we'll switch
gears and look at an algorithm for approximately finding the MAP through graph
partitioning.
First, we'll see how sampling can capture essentially any aspect of interest of the
distribution.

18.1  Why sampling?

Suppose we are given samples {x^1, ..., x^N} from p_x(x). Recall that the sample mean

(1/N) Σ_{i=1}^N f(x^i)

is an unbiased estimator of E[f(x)] for any f, irrespective of whether the samples are
independent, because of the linearity of expectation. If the samples are i.i.d., then
by the law of large numbers, the sample mean converges to the true expectation:

(1/N) Σ_{i=1}^N f(x^i) → E[f(x)]

as N → ∞. With different choices of f, we can capture essentially any aspect of
interest of p. For example, choosing
- f(x) = (x − E[x])² gives the variance,
- f(x) = −log(p(x)) gives the differential entropy,
- f(x) = 1_{x > x*} gives p(x > x*), where x* is a parameter.
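As a minimal sketch of this idea (Python/NumPy, using a toy Gaussian as a stand-in for samples from some p(x)), the same array of samples can be reused to estimate several such functionals:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)   # pretend these came from p(x)

mean_est = samples.mean()
var_est = ((samples - mean_est) ** 2).mean()              # f(x) = (x - E[x])^2
tail_est = (samples > 3.0).mean()                         # f(x) = 1{x > x*} with x* = 3

print(mean_est, var_est, tail_est)   # ≈ 1.0, ≈ 4.0, and ≈ P(x > 3) ≈ 0.159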

So if we have samples from the joint distribution p(x), we can probe aspects of the
distribution. However, our previous algorithms focused on the marginals p(x_i). Can
we use samples from the joint distribution p(x) to tell us about the marginals? In
fact, it is simple to see that if x^1, ..., x^N are samples from the joint distribution,
then x^1_i, ..., x^N_i are samples from the marginal distribution p(x_i). Hence, if we have
samples from the joint distribution, we can project to the components to get samples
from the marginals. Alternatively, if we're interested in just the marginals, we might
come up with an algorithm to sample from the marginal distributions directly. One
might wonder if there is any advantage to doing this. The marginal distributions
tend to be much less complex than the joint distribution, so it may turn out to be
much simpler to sample from the marginal distributions directly, rather than sampling
from the joint distribution and projecting. For the moment, we'll focus on generating
samples from p(x), and in the next lecture we'll return to sampling from the marginal
distributions.
It's clear that the sampling framework is powerful, but naively drawing samples
from p(x) when we don't know the partition function is intractable. In the next
section, we describe one approach to sampling from p(x) called Metropolis-Hastings.

18.2  Markov Chain Monte Carlo

Suppose we are interested in sampling from p_x(x), but we only know p_x up to a
multiplicative constant (i.e. p_x(x) = p̃_x(x)/Z and we can calculate p̃_x(x)). Initially,
this seems like an immensely complicated problem because we do not know Z.
Our approach will be to construct a Markov chain P whose stationary distribution π
is equal to p_x while only using p̃_x in our construction. Once we have created
the Markov chain, we can start from an arbitrary x, run the Markov chain until it
converges to π, and we will have a sample from p_x. Such an approach is called a
Markov Chain Monte Carlo approach. To develop this, we will have to answer:
1. How do we construct such a Markov chain P?
2. How long does it take for the Markov chain to converge to its stationary distribution?
We'll describe the Metropolis-Hastings algorithm to answer the first question. To
answer the second, we'll look at the mixing time of Markov chains through Cheeger's
inequality.
18.2.1  Metropolis-Hastings

First, we'll introduce some notation to make the exposition clearer. Let Ω be the
state space of possible values of x, and we'll assume x is discrete. For example, if x
were a binary vector of length 10, then Ω would be {0,1}^10 and have 2^10 elements. We
will construct a Markov chain P, which we will represent as a matrix [P_ij] where the
(i,j) entry corresponds to the probability of transitioning from state i to state j.

Figure 1: Detailed balance says that π_i P_ij = π_j P_ji; in other words, the probability flowing from i to j is the same as the flow from j to i.
We want the stationary distribution of P, denoted by a vector π = [π_i], to be equal
to p_x (i.e. π_i = p_x(i)). Furthermore, we will require P to be a reversible Markov chain.

Definition 1 (Reversible Markov chain). A Markov chain P is called reversible with
respect to π if it satisfies

π_i P_ij = π_j P_ji.

This equation is also referred to as detailed balance.
Intuitively, detailed balance says that the probability flowing from i to j is the
same as the amount of probability flowing from j to i, where by "probability flow from
i to j" we mean π_i P_ij (i.e. the probability of being in i and transitioning to j).
Importantly, if P is reversible with respect to π and we did not assume π was the
stationary distribution, detailed balance implies that π is a stationary distribution
because

Σ_j π_j P_ji = Σ_j π_i P_ij = π_i Σ_j P_ij = π_i.

So showing that P satisfies the detailed balance equation with π is one way of showing
that π is a stationary distribution of P.
To obtain such a P, we will start with a proposed Markov chain K, which we will
modify to create P; as we'll see, the conditions that K must satisfy are very mild,
and K may have little or no relation to p_x. Again, we will represent K as a matrix
[K_ij], and we require that
- K_ii > 0 for all i ∈ Ω, and
- G(K) = (Ω, E(K)) is connected, where E(K) ≜ {(i,j) : K_ij K_ji > 0}.
In other words, all self-transitions must be possible and it must be possible to move
from any state to any other state in some number of transitions of the Markov chain.

Then, define

R(i,j) ≜ min( 1, p̃_x(j) K_ji / (p̃_x(i) K_ij) ) =¹ min( 1, p_x(j) K_ji / (p_x(i) K_ij) )

and

P_ij ≜ K_ij R(i,j)          for j ≠ i,
P_ii ≜ 1 − Σ_{j≠i} P_ij.

This is the Metropolis-Hastings Markov chain P that we were after. Now it remains
to show that p_x is the stationary distribution of P, and as we commented above, it
suffices to show that P satisfies the detailed balance equation with p_x.

Lemma 1. For all i, j ∈ Ω, detailed balance p_x(i) P_ij = p_x(j) P_ji holds.

Proof. For (i,j) ∉ E(K) this is trivially true. For (i,j) ∈ E(K), without loss of
generality let p_x(j) K_ji ≥ p_x(i) K_ij. This implies that R(i,j) = 1 and
R(j,i) = p_x(i) K_ij / (p_x(j) K_ji). Then

p_x(i) P_ij = p_x(i) K_ij = p_x(i) K_ij · ( p_x(j) K_ji / (p_x(j) K_ji) )
            = ( p_x(i) K_ij / (p_x(j) K_ji) ) K_ji p_x(j) = R(j,i) K_ji p_x(j) = P_ji p_x(j).

Thus, we conclude that p_x is the stationary distribution of P, as desired.


Because the matrices describing K and P are enormously large³, it can be helpful
to think of K and P as describing a process that explains how to generate a new state
j in the Markov chain given our current state i. From this perspective, the process
describing P is as follows: starting from state i, to generate state j,
- Generate j' according to K with current state i.
- Flip a coin with bias R(i, j').
  - If heads, then the new state is j = j'.
  - If tails, then the new state is j = i, the old state.
This gives a convenient description of P, which can easily be implemented in code.
Also, R(i,j) is commonly referred to as the acceptance probability because it describes
the probability of accepting the proposed new state j'.
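Here is a minimal sketch of that process (Python/NumPy, with hypothetical function arguments, not a fixed library API): log_p_tilde returns log p̃_x of a state, and propose / proposal_logprob describe the proposal chain K.

import numpy as np

def metropolis_hastings(log_p_tilde, propose, proposal_logprob, x0, num_steps, rng):
    """Run the Metropolis-Hastings chain for num_steps starting from x0.
    log_p_tilde(x): log of the unnormalized target (Z never appears).
    propose(x, rng): draws j' ~ K(x, .); proposal_logprob(x, y): log K(x, y)."""
    x = x0
    chain = []
    for _ in range(num_steps):
        y = propose(x, rng)
        # acceptance probability R(x, y) = min(1, p~(y) K(y, x) / (p~(x) K(x, y)))
        log_ratio = (log_p_tilde(y) + proposal_logprob(y, x)
                     - log_p_tilde(x) - proposal_logprob(x, y))
        if np.log(rng.random()) < min(0.0, log_ratio):
            x = y                      # accept the proposed state
        chain.append(x)
    return chain

For a symmetric proposal the two proposal_logprob terms cancel, and the ratio reduces to p̃_x(y)/p̃_x(x), as in the MRF example below.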
¹ The equality holds because Z cancels out.
² The astute reader will notice that the ratio in R(i,j) is directly related to the detailed balance equation.
³ They're |Ω| × |Ω|, and the reason we cannot calculate p_x is that Ω is so large.

Intuitively, we can think of Metropolis-Hastings as forcing K to be reversible in
a specific way. Given an arbitrary K, there's no reason that p_x(i) K_ij will equal
p_x(j) K_ji; in other words, the probability flow from i to j will not necessarily equal
the flow from j to i. To fix this, we could scale the flow on one side to match the
other, and that's what Metropolis-Hastings does. To make this notion more precise,
let R(p_x) be the space of all reversible Markov chains that have p_x as a stationary
distribution. Then the Metropolis-Hastings algorithm takes K and gives us a
P ∈ R(p_x) that satisfies

Theorem 1.
P = argmin_{Q∈R(p_x)} d(K, Q) ≜ argmin_{Q∈R(p_x)} Σ_i p_x(i) Σ_{j≠i} |K_ij − Q_ij|.

Hence, P is the ℓ1-projection of K onto R(p_x).


18.2.2  Example: MRF

We'll describe a simple example of using Metropolis-Hastings to sample from an MRF.
Suppose we have x_1, ..., x_n binary with

p(x) ∝ exp{ Σ_{i∈V} φ_i(x_i) + Σ_{(i,j)∈E} ψ_ij(x_i, x_j) } ≜ exp{ U(x) }

for some graph G. Here Ω = {0,1}^n, so it has 2^n elements. Suppose we take
K = [1/2^n], the matrix with all entries equal to 1/2^n; that is, the probability of
transitioning from i to any j is the same. Then the Metropolis-Hastings algorithm
would give

P_ij = K_ij min( 1, exp(U(j)) / exp(U(i)) ) = (1/2^n) min( 1, exp(U(j) − U(i)) ).

Is there any downside to choosing such a simple K? If i has moderate probability,
the chance of randomly choosing a j that has higher probability is very low, so we're
very unlikely to transition away from i. Thus it may take a long time for the Markov
chain to reach its stationary distribution.
18.2.3  Example: Gibbs Sampling

Gibbs sampling is an example of Metropolis-Hastings where x = (x_1, ..., x_n) and K
is defined by the following process for going from x to x':
- Select k ∈ {1, ..., n} from a uniform distribution.
- Set x'_{−k} = x_{−k} and sample x'_k from p(x'_k | x_{−k}),
where x_{−k} ≜ (x_1, ..., x_{k−1}, x_{k+1}, ..., x_n). The practical applicability of Gibbs
sampling depends on the ease with which samples can be drawn from the conditional
distributions p(x_k | x_{−k}). In the case of undirected graphical models, the conditional
distribution of an individual node depends only on its neighboring nodes, so in many
cases it is simple to sample from the conditional distributions.
If p > 0, then it follows that the graph for K is connected and all self-transitions
are possible. Below, we'll see that K satisfies detailed balance, so the acceptance
probability will always be 1; hence the Metropolis-Hastings transition matrix P will
be equal to K.
Lemma 2. K satisfies detailed balance with respect to p_x.

Proof. For x and x', we must show that p(x) K_{xx'} = p(x') K_{x'x}. If x = x', the
equation is satisfied trivially. Suppose x ≠ x' and they differ in at least two positions.
By construction K_{xx'} = K_{x'x} = 0, so the equation is again satisfied trivially.
Lastly, suppose x ≠ x' and they differ in exactly one position k. Then,

p(x) K_{xx'} = p(x) (1/n) p(x'_k | x_{−k})
             = (1/n) p(x_k | x_{−k}) p(x_{−k}) p(x'_k | x_{−k})
             = (1/n) p(x_k | x'_{−k}) p(x')
             = p(x') K_{x'x},

using the fact that x_{−k} = x'_{−k}.
In practice, Gibbs sampling works well, and in many cases it is simple to implement.
Explicitly, we start from an initial state x^0 and generate putative samples x^1, ..., x^T
according to the following process:
for t = 0, ..., T−1:
- Select i ∈ {1, ..., n} uniformly.
- Set x^{t+1}_{−i} = x^t_{−i} and sample x^{t+1}_i from p(x^{t+1}_i | x^t_{−i}).
However, all of the caveats about using the samples generated by Metropolis-Hastings
apply. For example, we need to run the Markov chain until it has reached its stationary
distribution, so we need to toss out a number of initial samples in a process called
burn-in. In the next section, we'll see theoretical results on the time it takes
the Markov chain to reach its stationary distribution, but in practice people rely on
heuristics. More advanced forms of Gibbs sampling exist, such as block Gibbs sampling
as seen on Problem Set 8, but their full development is beyond the scope of this class.
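The sketch below (Python/NumPy, assuming binary variables and the pairwise MRF p(x) ∝ exp(U(x)) from the earlier example; the data structures are hypothetical) implements exactly this loop; each conditional only involves a node's neighbors:

import numpy as np

def gibbs_sampler(phi, psi, neighbors, num_steps, rng, burn_in=1000):
    """phi[i]: length-2 array of node potentials; psi[(i,j)]: 2x2 array of edge
    potentials (rows indexed by x_i); neighbors[i]: list of neighbors of node i."""
    nodes = list(neighbors)
    x = {i: int(rng.integers(2)) for i in nodes}         # arbitrary initial state
    samples = []
    for t in range(num_steps + burn_in):
        i = nodes[rng.integers(len(nodes))]              # pick a node uniformly
        # unnormalized log p(x_i = v | x_{-i}) for v in {0, 1}
        log_cond = phi[i].copy()
        for j in neighbors[i]:
            psi_ij = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            log_cond = log_cond + psi_ij[:, x[j]]
        p1 = 1.0 / (1.0 + np.exp(log_cond[0] - log_cond[1]))   # p(x_i = 1 | x_{-i})
        x[i] = int(rng.random() < p1)
        if t >= burn_in:                                 # discard burn-in samples
            samples.append(dict(x))
    return samples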

18.3  Mixing Time

Now that we've constructed the Markov chain, we can sample from it, but we want
samples from the stationary distribution. So we turn to the second question: how long
does it take for the Markov chain to converge to its stationary distribution?
For simplicity of exposition, we'll focus on a generic reversible Markov chain P
with state space Ω and unique stationary distribution π. We'll also assume that
P is regular, that is, P^k > 0 for some k > 0. Intuitively, this means that for some
k > 0, it is possible to transition from any i ∈ Ω to any j ∈ Ω in exactly k steps.
Additionally, we'll assume that P is a lazy Markov chain, which means that P_ii > 0
for all i ∈ Ω. This is a mild condition because we can take any Markov chain Q
and turn it into a lazy Markov chain, without changing its stationary distribution, by
considering (1/2)(Q + I). The lazy condition ensures that all of the eigenvalues of P
are positive, and it does not substantially increase the mixing time.
We're interested in measuring the time it takes P to get from any initial state to
its stationary distribution. Precisely, we'll define

Definition 2 (ε-mixing time of P). Given ε > 0, T_mix(ε) is the smallest time such
that for t ≥ T_mix(ε),

‖μP^t − π‖_TV ≤ ε

for any initial distribution μ, where ‖μP^t − π‖_TV = Σ_i |(μP^t)_i − π_i| is the total
variation.
To get a bound on T_mix(ε), we will focus our attention on what multiplying by P does.
Recall that as we repeatedly apply P to any vector, the eigenvector with the largest
eigenvalue dominates, and how quickly it dominates is determined by the second largest
eigenvalue. We will exploit this intuition to get a bound on T_mix(ε) that depends on
the gap between the largest and second largest eigenvalues.
The following is technical and is included for completeness. First, we'll bound the
total variation by a term that does not depend on the initial distribution μ:

Σ_i |(μP^t)_i − π_i| = Σ_i | Σ_j μ_j (P^t)_{ji} − π_i |
                     = Σ_i | Σ_j μ_j ((P^t)_{ji} − π_i) |
                     ≤ Σ_{ij} μ_j |(P^t)_{ji} − π_i|
                     ≤ ‖μ‖_1 max_j Σ_i |(P^t)_{ji} − π_i|
                     = max_j Σ_i |(P^t)_{ji} − π_i|,

where we used Hölder's inequality. Now we show how to bound Σ_i |(P^t)_{ji} − π_i| for
every j:

Σ_i |(P^t)_{ji} − π_i| = Σ_i | (P^t)_{ji}/π_i − 1 | π_i
                       ≤ [ Σ_i ( (P^t)_{ji}/π_i − 1 )² π_i ]^{1/2} [ Σ_i π_i ]^{1/2},

where we used Cauchy-Schwarz to get the inequality. After some algebraic manipulation,

[ Σ_i ( (P^t)_{ji}/π_i − 1 )² π_i ]^{1/2} = [ Σ_i ( (P^t)²_{ji}/π_i − 2(P^t)_{ji} + π_i ) ]^{1/2}
                                          = [ Σ_i (P^t)²_{ji}/π_i − 1 ]^{1/2}.

Now we'll use the reversibility of P:

Σ_i (P^t)²_{ji} / π_i = Σ_i (P^t)_{ji} (P^t)_{ji} / π_i = Σ_i (P^t)_{ji} (P^t)_{ij} / π_j = (P^{2t})_{jj} / π_j,

where we used the reversibility of P to exchange π_j (P^t)_{ji} for π_i (P^t)_{ij}. Putting
this together, we conclude that

Σ_i |(P^t)_{ji} − π_i| ≤ [ (P^{2t})_{jj} / π_j − 1 ]^{1/2}.

So we need to find a bound on the diagonal entries of P^{2t}. Consider the following
matrix:

M ≜ diag(√π) P diag(√π)^{−1},

where diag(√π) is the diagonal matrix with √π along the diagonal. Note that M is
symmetric because

M_ij = (√π_i / √π_j) P_ij = π_i P_ij / (√π_i √π_j) = π_j P_ji / (√π_i √π_j) = (√π_j / √π_i) P_ji = M_ji,

using the reversibility of P. Recall the spectral theorem from linear algebra, which says
that for any real symmetric matrix A, there exist orthonormal eigenvectors v_1, ..., v_n
with eigenvalues λ_1 ≥ ... ≥ λ_n such that

A = Σ_{k=1}^n λ_k v_k v_k^T = V diag(λ) V^T,

where V is the matrix with v_1, ..., v_n as columns and λ = (λ_1, ..., λ_n).


Decomposing M in this way shows that

M^t = (V diag(λ) V^T)^t = V diag(λ)^t V^T

because V is orthogonal. Hence

P^t = diag(√π)^{−1} M^t diag(√π) = diag(√π)^{−1} V diag(λ)^t V^T diag(√π).

From this representation of P^t, we conclude that

(P^t)_{jj} = Σ_i λ_i^t (v_i)_j².

It can be shown by the Perron-Frobenius theorem that λ_1 = 1 and |λ_k| < 1 for
k > 1, and that the first (left) eigenvector of P is π. M is similar⁴ to P, hence they
have the same eigenvalues and their eigenspaces have the same dimensions. By
construction, a left eigenvector u of P gives an eigenvector u diag(√π)^{−1} of M with
the same eigenvalue. Thus π diag(√π)^{−1} = √π is an eigenvector of M, and because
the eigenspace corresponding to the eigenvalue 1 has dimension 1, v_1 ∝ √π.
Furthermore, ‖v_1‖_2 = 1 because V is orthogonal, so we conclude that v_1 = √π.

⁴ Matrices A and B are similar if A = C^{−1} B C for some invertible matrix C.

Using this, we can simplify the expression for (P^{2t})_{jj} and upper bound it:

(P^{2t})_{jj} = π_j + Σ_{i=2}^n λ_i^{2t} (v_i)_j²
             ≤ π_j + λ_2^{2t} Σ_{i=2}^n (v_i)_j²
             ≤ π_j + λ_2^{2t} Σ_{i=1}^n (v_i)_j²
             = π_j + λ_2^{2t},

because V is orthogonal. Putting this into our earlier expression yields




[ (P^{2t})_{jj} / π_j − 1 ]^{1/2} ≤ [ (π_j + λ_2^{2t}) / π_j − 1 ]^{1/2} = [ λ_2^{2t} / π_j ]^{1/2}.

Putting all of the bounds together, we have

‖μP^t − π‖_TV ≤ max_j [ λ_2^{2t} / π_j ]^{1/2} = λ_2^t ( 1 / min_j π_j )^{1/2}.

Setting the right-hand side equal to ε and solving for t gives

t = ( log ε + (1/2) log(min_j π_j) ) / log λ_2.

This gives an upper bound on the time it takes the Markov chain to mix so that
‖μP^t − π‖_TV ≤ ε. Because we used inequalities to arrive at t, we can only conclude that

T_mix(ε) ≤ t = ( log(1/ε) + (1/2) log(1/min_j π_j) ) / log(1/λ_2)
            ≤ ( log(1/ε) + (1/2) log(1/min_j π_j) ) / (1 − λ_2).

So, as expected, the mixing time depends on the gap between the largest and second
largest eigenvalues. In this case, Cheeger's celebrated inequality states that

1 / (1 − λ_2) ≤ 2 / Φ²,


where the conductance Φ of P is defined as

Φ = Φ(P) = min_S [ Σ_{i∈S, j∈S^c} π_i P_ij ] / [ π(S) π(S^c) ],

where π(S) = Σ_{k∈S} π_k and similarly for π(S^c). Conductance takes the minimum
over S of the probability of starting in S and transitioning to S^c in one time step,
normalized by the sizes of S and S^c. If the conductance is small, then there is a set
S such that transitioning out of S is difficult, so if the Markov chain gets stuck in S,
it will be unlikely to leave S; hence we would expect the mixing time to be large.

Figure 2: A split of the state space into S and S^c; the conductance measures the probability of transitions across the cut.
Thus we conclude that

T_mix(ε) ≤ (2 / Φ²) ( log(1 / min_i π_i) + log(1/ε) ).

In fact, it can be shown that without converting our Markov chain to a lazy Markov
chain, we improve the bound by a factor of 2:

T_mix(ε) ≤ (1 / Φ²) ( log(1 / min_i π_i) + log(1/ε) ).
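A minimal numerical sketch (Python/NumPy, using a made-up 3-state lazy reversible chain) of the quantities appearing in these bounds: the second eigenvalue λ_2, the conductance Φ, and the resulting mixing-time estimates.

import numpy as np
from itertools import combinations

# hypothetical lazy, reversible (here: symmetric) chain on 3 states
P = np.array([[0.70, 0.20, 0.10],
              [0.20, 0.60, 0.20],
              [0.10, 0.20, 0.70]])
# stationary distribution: left eigenvector of P with eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

lam = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
lam2 = lam[1]                                     # second largest eigenvalue

def conductance(P, pi):
    n = len(pi)
    best = np.inf
    for r in range(1, n):
        for S in combinations(range(n), r):
            Sc = [i for i in range(n) if i not in S]
            flow = sum(pi[i] * P[i, j] for i in S for j in Sc)
            best = min(best, flow / (pi[list(S)].sum() * pi[Sc].sum()))
    return best

phi = conductance(P, pi)
eps = 0.01
print("lambda_2:", lam2, "conductance:", phi)
print("spectral bound:", (np.log(1/eps) + 0.5*np.log(1/pi.min())) / (1 - lam2))
print("Cheeger bound :", (2/phi**2) * (np.log(1/pi.min()) + np.log(1/eps)))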
18.3.1  Two-state example

Consider the simple Markov chain depicted in Figure 3 with a single binary variable x.
By symmetry, the stationary distribution is π = [0.5, 0.5]. It's clear that Φ is minimized
when S = {0} and S^c = {1}, so that

Φ = 0.5 / (0.5 × 0.5) = 2.

Figure 3: A simple two-state Markov chain.

18.3.2  Concluding Remarks

We've seen how to sample from a distribution without knowing the partition function
via Metropolis-Hastings. And we saw how we could bound the time it takes for the
Markov chain to reach its stationary distribution from any initial distribution. In
the next lecture we'll continue to discuss sampling techniques; in particular, we'll
focus on techniques for a restricted set of models. For now, we'll take a brief aside to
talk about approximate MAP algorithms.

18.4  Approximate MAP and Partitioning

Analogously to loopy belief propagation, we can run max-product (or min-sum) on a
loopy graph to approximate the MAP. Unfortunately, we have limited understanding
of the approximation it generates, as is the case with loopy belief propagation. The
cases that are well understood suggest that max-product is akin to linear programming
relaxation, but the discussion of this is beyond the scope of this class⁵.
As we saw earlier, in the Gaussian setup, MAP and inference are equivalent. An
interesting fact is that if Gaussian BP converges on a loopy graph, the estimated
means are always correct⁶.
Instead of pursuing loopy max-product, we'll focus on another generic procedure
for approximating the MAP based on graph partitionings. The key steps in this
approach are:
1. Partition the graph into small disjoint sets.
2. Estimate the MAP for each partition independently.
⁵ For more information, see S. Sanghavi, D. Malioutov, and A. Willsky's "Belief Propagation and LP Relaxation for Weighted Matching in General Graphs" (2011).
⁶ For more information, see Y. Weiss and W. Freeman's "Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology" (2001).

3. Concatenate the MAPs for the subproblems to form a global estimate.
Because of its simplicity, this algorithm seems almost too good to be true, and in
fact, given a specific partition of a graph, we can choose the clique potentials so that
this algorithm gives a poor approximation to the MAP. The key is to exploit
randomness! Instead of choosing a single partition, we'll define a distribution on
partitions, and when we select a partition from this distribution, we can guarantee
that on average the algorithm will do well.
When a clever distribution on partitions of the graph exists, this produces a
linear time algorithm and can be quite accurate. Unfortunately, not all graphs admit
a good distribution on partitions, but in this case, we can still produce a bound on
the approximation error. In the following section, we'll precisely define the algorithm
and derive bounds on the approximation error. At the end, we'll explore how to find
a clever distribution on partitions.
To make the notion of a good distribution on partitions precise, we'll define

Definition 3. An (ε, k)-partitioning of graph G = (V, E) is a distribution on finite
partitions of V such that for any partition {V_1, ..., V_M} with nonzero probability,
|V_m| ≤ k for all 1 ≤ m ≤ M. Furthermore, we require that for any e ∈ E,
p(e ∈ E_c) ≤ ε, where E_c = E \ ∪_m (V_m × V_m) is the set of cut edges and the
probability is with respect to the distribution on partitions.

Intuitively, an (ε, k)-partitioning is a weighted set of partitions such that in every
partition all of the V_m are small and the set of cut edges is small. This aligns well
with our algorithm: the subproblems will be small because each V_m is small, so our
algorithm will be efficient, and because our algorithm evaluates the MAP for each
partition independently, it misses out only on the information contained on the cut
edges, so as long as the set of cut edges is small, we do not miss much by ignoring them.

Let us consider a simple example: an N × N grid graph G. We'll show that it's
possible to find an (ε, 1/ε²)-partitioning of G for any ε > 0. In this case, k = 1/ε².
Our strategy will be to first construct a single partition that has |V_m| ≤ k and a small
|E_c|. Then we will construct a distribution on partitions that satisfies the constraint
that for any e ∈ E, p(e ∈ E_c) ≤ ε.
Sub-divide the grid into √k × √k squares, each containing k nodes (Figure 5).
There are M ≜ N²/k such sub-squares; call them V_1, ..., V_M. By construction
|V_m| ≤ k. The edges in E_c are the ones that cross between sub-squares. The number
of edges crossing out of each such square is at most 4√k, so the total number of such
edges is at most 4M√k · (1/2), where the 1/2 accounts for double counting of edges.
The total number of edges in the grid is roughly 2N². Therefore, the fraction of cut
edges is 2√k · (N²/k) · 1/(2N²) = 1/√k = ε.
Thinking of the sub-division into √k × √k squares as a coarse grid, we could
shift the grid to the right and/or down to create a new partition.

Figure 4: The nodes are partitioned into subsets V_1, ..., V_M; the red edges are the cut edges E_c, and |E_c| ≈ ε|E| on average.

Figure 5: The original grid is sub-divided into a grid of √k × √k squares.


If we randomly shift the entire sub-grid uniformly over 0, ..., √k − 1 positions to the
right and then uniformly over 0, ..., √k − 1 positions down, this gives a distribution
on partitions. By symmetry, it ensures the distributional guarantee that p(e ∈ E_c) ≤ ε.
Thus the grid graph admits an (ε, 1/ε²)-partitioning for any ε > 0.
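A minimal sketch of this randomized construction (Python, with hypothetical naming; nodes are labeled by their (row, col) coordinates in the N × N grid):

import random

def sample_grid_partition(N, eps, rng=random):
    """Sample one partition from the shifted-grid (eps, 1/eps^2)-partitioning of an
    N x N grid: nodes are grouped into sqrt(k) x sqrt(k) blocks whose boundaries
    are shifted by a uniformly random offset."""
    side = max(1, round(1.0 / eps))          # sqrt(k), the block side length
    dr = rng.randrange(side)                 # random vertical shift
    dc = rng.randrange(side)                 # random horizontal shift
    blocks = {}
    for r in range(N):
        for c in range(N):
            key = ((r + dr) // side, (c + dc) // side)   # which block (r, c) falls in
            blocks.setdefault(key, []).append((r, c))
    return list(blocks.values())             # the parts V_1, ..., V_M

# an edge {u, v} is cut iff u and v land in different parts; averaged over many
# sampled partitions, the fraction of cut edges is about eps
parts = sample_grid_partition(N=12, eps=0.25)
print(max(len(p) for p in parts))            # each part has at most 1/eps^2 = 16 nodes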
18.4.1  Approximate MAP using (ε, k)-partitionings

In this section, we'll prove a bound on the approximation error when we have an
(ε, k)-partitioning. For our analysis we restrict our attention to pairwise MRFs
that have non-negative potentials, so p takes the form

p_x(x) ∝ exp{ Σ_{i∈V} φ_i(x_i) + Σ_{(i,j)∈E} ψ_ij(x_i, x_j) } ≜ exp{ U(x) }

for a graph G = (V, E) and where φ_i, ψ_ij ≥ 0. Then formally, the approximate MAP
algorithm is:
1. Given an (ε, k)-partitioning of G, sample a partition {V_1, ..., V_M} of V.
2. For each 1 ≤ m ≤ M: using max-product on G_m = (V_m, E ∩ (V_m × V_m)), find

   x̂^m ∈ argmax_{y∈X^{|V_m|}} { Σ_{i∈V_m} φ_i(y_i) + Σ_{(i,j)∈E: i,j∈V_m} ψ_ij(y_i, y_j) } ≜ argmax_y U_m(y).

3. Set x̂ = (x̂^1, ..., x̂^M) as an approximation of the MAP.
We can get a handle on the approximation error by understanding how much error
arises from ignoring the edge potentials corresponding to E_c. If we use an (ε, k)-partitioning,
then we expect E_c to be small, so we can bound our approximation error.
The following theorem makes this intuition rigorous.

Theorem 2 (Jung-Shah). E[U(x̂)] ≥ U(x*)(1 − ε), where the expectation is taken over
the (ε, k)-partitioning.

Proof.

U(x*) = Σ_{i∈V} φ_i(x*_i) + Σ_{(i,j)∈E} ψ_ij(x*_i, x*_j)

      = Σ_{m=1}^M ( Σ_{i∈V_m} φ_i(x*_i) + Σ_{(i,j)∈E: i,j∈V_m} ψ_ij(x*_i, x*_j) ) + Σ_{(i,j)∈E_c} ψ_ij(x*_i, x*_j)

      = Σ_{m=1}^M U_m(x*) + Σ_{(i,j)∈E_c} ψ_ij(x*_i, x*_j)

      ≤ U(x̂) + Σ_{(i,j)∈E} 1_{(i,j)∈E_c} ψ_ij(x*_i, x*_j),

where the inequality uses U_m(x*) ≤ U_m(x̂^m) for each m together with the
nonnegativity of the cut-edge terms that U(x̂) additionally contains. Therefore,
taking the expectation with respect to the randomness in the partitioning and using
the facts that φ_i, ψ_ij ≥ 0 and U(x*) ≥ Σ_{(i,j)∈E} ψ_ij(x*_i, x*_j), we get

U(x*) ≤ E[U(x̂)] + ε U(x*).

This means that given our choice of ε, we can ensure that E[U(x̂)] is close to the
correct answer (i.e. the approximation error is small).
18.4.2  Generating (ε, k)-partitionings

Weve seen that as long as we have an (f, k)-partitioning for an MRF, then we can
make guarantees about the approximation algorithm. Now we will show that a large
class of graphs have (f, k)-partitionings. First well describe a procedure for generating
a potential (f, k)-paritioning and then well see which class of graphs this realizes
an (f, k)-partitioning.
The procedure for generating a potential (f, k)-partitioning on a graph is given by
the following method for sampling a partition
1. Given G = (V, E), k, and f > 0. Dene the truncated geometric distribution
with parameter f truncated at k as follows
(
(1 f)l1 f l < k
p(x = l) =
.
(1 f)k1 l = k
2. Order the nodes V arbitrarily 1, . . . , N . For node i:
Sample Ri from a truncated geometric distribution with parameter f trun
cated at k.
16

Assign all nodes within distance7 Ri from i color i. If the node is already
colored, recolor it to i.
3. All nodes with the same color form a partition.
This gives a partition and denes a distribution on partitions. The questions is: for
what graphs G is this distribution an (f, k)-partitioning?
Intuitively, for any given node, we want the number of nodes within some distance
of it not to grow too quickly. Precisely,
Denition 4 (Poly-growth graph). A graph G is a poly-growth graph if there exists
> 0, C > 0 such that for any vertex v in the graph,
|Nv (r)| Cr ,
where Nv (r) is the number of nodes within distance r of v in G.
In this case, we know that
Theorem 3 (Jung-Shah). If G is a poly-growth graph then by selecting k = ( E log E ),
8
the above procedure results in an (f, Ck ) partition.9
This shows that we have a large class of graphs where we can apply the procedure
to generate an (f, k)-partitioning, which guarantees that our approximation error is
small and controlled by our choice of f.

Where distance is dened as the path length on the graph.

This notation means that k is asymptotically bounded above and below by


9
A similar procedure exists for all planar graphs.

17

log 7 .


19 Approximate Inference: Importance Sampling, Particle Filters
Thus far, we have focused on developing inference algorithms, exact and approximate,
for discrete graphical models and Gaussian graphical models. The exact methods were
the elimination algorithm and the junction tree algorithm for generic graphs, and Belief
Propagation (BP) (sum-product/max-product) for tree graphical models. The approximate
inference methods were variational approximation, graph partitioning, and Markov Chain
Monte Carlo (MCMC). However, we did not discuss the scenario where variables are
continuous and not necessarily Gaussian. Let us understand the difficulty involved in
such a scenario.
Why approximate? Consider the scenario of a continuous-valued Hidden Markov
Model (HMM). Recall that an HMM is tree-structured, and hence, from a graphical
structure perspective, BP is exact. Therefore, it is not immediately clear why any
approximation is needed. To that end, recall that when the variables were jointly
Gaussian, BP could be easily implemented because
(a) the messages in BP were Gaussian,
(b) the marginals were Gaussian,
(c) and we did not need to directly evaluate integrals, instead, we propagated means
and covariance parameters.
One of the difficulties with applying BP to general distributions is that we have to pass
distributions as messages, but describing a distribution can be quite complex, which
means that sending a message can be expensive. In the Gaussian case, we avoided
this difficulty because Gaussians are simply parametrized by a mean and covariance.
The other key was that the messages turned out to be Gaussian, so everything stayed
in the same family of distributions. Unfortunately, this does not happen for arbitrary
continuous distributions.

Gaussian models are widely used and powerful, but they may not always be the right
choice. This motivates considering the following general continuous HMM¹:

    x_0 ∼ p_{0|−1}(·)                    : initialization
    x_{n+1} | x_n ∼ p_{x_{n+1}|x_n}(·)   : dynamics
    y_n | x_n ∼ p_{y_n|x_n}(·)           : observations

¹ For exposition, we'll assume that all of the variables are scalars, but everything explained here naturally extends to vector-valued variables.

We're interested in the marginal of x_n after observing y_0, . . . , y_n, i.e.
p_{n|n}(x_n | y_0, . . . , y_n), also called the posterior marginal. (Here we use the
shorthand p_{n|m} to denote p_{x_n | y_0,...,y_m}.) Recall that we can calculate it exactly
via BP in an iterative manner in two steps:

Prediction: transitioning to a new state,

    p_{n+1|n}(x_{n+1} | y_0, . . . , y_n) = ∫_{x_n} p_{x_{n+1}|x_n}(x_{n+1} | x_n) p_{n|n}(x_n | y_0, . . . , y_n) dx_n,

where the second factor in the integrand is the result of the previous iteration.

Update: folding in a new observation,

    p_{n+1|n+1}(x_{n+1} | y_0, . . . , y_{n+1}) = (1/Z) p_{y_{n+1}|x_{n+1}}(y_{n+1} | x_{n+1}) p_{n+1|n}(x_{n+1} | y_0, . . . , y_n),

where Z is the partition function, which is an implicit integral.

In practice, it is very hard to use these steps exactly, because both steps involve
calculating an integral. In the Gaussian case, these steps reduced to simple linear
algebra, but in general calculating the integrals for arbitrary distributions is intractable,
so we use approximations instead; this leads to approximate inference.
This is precisely the type of approximation that we discuss in this lecture. We
shall do so in the context of the tree graph structure offered by the HMM. The method
is called particle filtering and can be seen as sequential Monte Carlo building upon
importance sampling. This lecture develops the method of particle filtering for HMMs.
It should be noted that an adaptation of MCMC (using an appropriate Metropolis-Hastings
rule for the continuous setting) can also be used for approximate inference in such a
scenario. However, particle filtering allows exploitation of graphical structure, like
that of the HMM, for efficiency.
Remarks. It should be noted that there are other approaches known in the literature.
For example, we can modify the Kalman Filter algorithm that we derived earlier to
handle nonlinear dynamical systems (NLDS) by linearizing the state space equations
about the current state estimate; in this case, it is called the Extended Kalman Filter
(EKF). The EKF has a limited scope of applicability, but when it can be applied, it
is very useful. In fact, the EKF was used by the Apollo missions. Unfortunately, it
is hard to analyze theoretically. We shall not discuss EKF here.

19.1 Particle Filter for HMMs with General Continuous Distributions

Consider an HMM with general continuous state and observation distributions as
depicted in Figure 1: we want to estimate the hidden states x_0, . . . , x_n given the
associated observations y_0, . . . , y_n. Concretely, we want to find
p_{x_i | y_0,...,y_n}(x_i | y_0, . . . , y_n) for 0 ≤ i ≤ n. To that end, let us gather
what we know.
Figure 1: Example of an HMM: hidden states x_0, x_1, . . . , x_n with corresponding
observations y_0, y_1, . . . , y_n.


We can sample from pxi () by sampling from px0 ,...,xn (), 0 i n. Since pxi ()
is the marginal of px0 ,...,xn () for 0 i n, simply extracting the ith coordinate of
samples from px0 ,...,xn () is the same as sampling from pxi (). In other words, suppose
we take S samples from px0 ,...,xn (),
{(x0 (s), . . . , xn (s)}Ss=1 .

(1)

Then extracting the ith coordinate,


{xi (s)}Ss=1

(2)

is a set of S samples from pxi (). Likewise, suppose we have a set of observations
y0 , . . . , yn ; then extracting the ith coordinate from samples from px0 ,...,xn |y0 ,...,yn (|y0 , . . . , yn )
yields samples from pxi |y0 ,...,yn (|y0 , . . . , yn ).
We can sample from px0 ,...,xn () by using the Markov property. Note that
by the Markov property of our model, xi is conditionally independent of x0 , . . . , xi2
given xi1 . Therefore, in order to generate sth sample from px0 ,...,xn (), we rst sample
x0 (s) px0 (),

(3)

and then for all i = 1, . . . , n, we sample


xi (s) pxi |xi1 (|xi1 (s)).

(4)

We can approximate expectations with samples. Suppose we wish to compute
the expectation of some function f(·) with respect to the (marginal) distribution of
x_i, i.e. E_{p_{x_i}}[f(x_i)]. Given a set of S samples {x_i(s)}_{s=1}^S from p_{x_i}(·),
by the strong law of large numbers, we can approximate the expectation as

    (1/S) Σ_{s=1}^S f(x_i(s)) → E_{p_{x_i}}[f(x_i)]   as S → ∞.                    (5)

This approximation is valid for a large class of functions, including all bounded
measurable functions. For example, if f is the indicator function f(x) = 1[x ∈ A] for a
set A, this can be used to compute p_{x_i}(x_i ∈ A):

    (1/S) Σ_{s=1}^S 1[x_i(s) ∈ A] → E_{p_{x_i}}[1[x_i ∈ A]] = p_{x_i}(x_i ∈ A)   as S → ∞.    (6)

However, we can't sample directly from p_{x_0,...,x_n | y_0,...,y_n}(· | y_0, . . . , y_n).
Of course, we aren't really interested in the prior p_{x_0,...,x_n}(·); we have observations
y_0, . . . , y_n from the HMM and want to incorporate this information by using the
posterior distribution p_{x_0,...,x_n | y_0,...,y_n}(· | y_0, . . . , y_n). By the rules of
conditional probability, the posterior given observations y_0, . . . , y_n may be expressed as
(using the notation v_0^n = (v_0, . . . , v_n) for any vector v, and assuming throughout
that well-defined densities exist so that conditional densities can be written)

    p_{x_0,...,x_n | y_0,...,y_n}(x_0^n | y_0^n) = p_{y_0,...,y_n | x_0,...,x_n}(y_0^n | x_0^n) p_{x_0,...,x_n}(x_0^n) / p_{y_0,...,y_n}(y_0^n).

Given the model description, we know p_{x_0,...,x_n} as well as p_{y_0,...,y_n | x_0,...,x_n},
both of which factorize nicely. However, we may not know p_{y_0,...,y_n}(·). Therefore,
given the observation y_0^n, the conditional density of x_0, . . . , x_n takes the form

    p_{x_0,...,x_n | y_0,...,y_n}(x_0^n | y_0^n) = (1/Z) p_{x_0,...,x_n}(x_0^n) ∏_{i=0}^{n} p_{y_i|x_i}(y_i | x_i).    (7)

In the above, Z represents an unknown normalization constant. Therefore, the challenge
is: can we sample from a distribution that has an unknown normalization constant?
Importance sampling will rescue us from this challenge as explained next.

19.2 Importance Sampling

Let us re-state the challenge abstractly: we are given a distribution with density π such that

    π(x) = q(x) / Z,                                                              (8)

where q(x) is a known function and Z is an unknown normalization constant. We want
to sample from π so as to estimate

    E_π[f(x)] = ∫ f(x) π(x) dx.                                                   (9)

We can sample from a known distribution with density φ, potentially very different
from π. Importance sampling provides the means to solve this problem:

• Sample x(1), x(2), . . . , x(S) from the known distribution with density φ.

• Compute the weights w(1), . . . , w(S) as

      w(s) = q(x(s)) / φ(x(s)).                                                   (10)

• Output the estimate of E_π[f(x)] as

      Ê(S) = Σ_{s=1}^S w(s) f(x(s)) / Σ_{s=1}^S w(s).                             (11)

Next we provide justification for this method: as S → ∞, the estimate Ê(S)
converges to the desired value E_π[f(x)]. To that end, we need the following definition:
let the support of a distribution with density p be

    supp(p) = {x : p(x) > 0}.                                                     (12)

Theorem 1. Let supp(π) ⊆ supp(φ). Then as S → ∞,

    Ê(S) → E_π[f(x)]                                                              (13)

with probability 1.

Proof. Using the strong law of large numbers, it follows that with probability 1,

    (1/S) Σ_{s=1}^S w(s) f(x(s)) → E_φ[ (q(x)/φ(x)) f(x) ]
                                 = ∫_{supp(φ)} (q(x)/φ(x)) f(x) φ(x) dx
                                 = ∫_{supp(φ)} Z π(x) f(x) dx
                                 = Z E_π[f(x)],                                   (14)

where we have used the fact that q(x) = 0 for x ∉ supp(φ). Using a similar argument,
it follows that with probability 1,

    (1/S) Σ_{s=1}^S w(s) → Z E_π[1] = Z.                                          (15)

This leads to the desired conclusion.


Remarks. It is worth noting that as long as supp(π) is contained in supp(φ), the
estimate converges irrespective of the choice of φ. This is quite remarkable. Alas, it
comes at a cost: the choice of φ determines the variance of the estimator and hence
the number of samples S required to obtain a good estimate.
Figure 2: Depiction of importance sampling. φ(x) is the proposal distribution and
π(x) is the distribution we're trying to sample from. On average we will have to draw
many samples from φ to get a sample in a region where φ has low probability, but as
we can see in this example, that may be where π has most of its probability mass.
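To make steps (10)-(11) concrete, here is a minimal self-normalized importance sampling
sketch in Python. The target π, proposal φ, and test function are illustrative choices
made for this sketch, not examples from the notes.

```python
import numpy as np

# Self-normalized importance sampling, equations (10)-(11).
# Illustrative target: a standard Gaussian restricted to x > 1, known only up to Z.
# Illustrative proposal: a broad zero-mean Gaussian we can sample from directly.

rng = np.random.default_rng(0)

def q(x):
    # Unnormalized target density q(x) = exp(-x^2/2) * 1[x > 1]
    return np.exp(-0.5 * x**2) * (x > 1)

def phi_sample(S):
    return rng.normal(loc=0.0, scale=2.0, size=S)

def phi_density(x):
    return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

def importance_sampling_estimate(f, S=100_000):
    x = phi_sample(S)                    # samples x(1..S) ~ phi
    w = q(x) / phi_density(x)            # weights w(s) = q(x(s)) / phi(x(s))
    return np.sum(w * f(x)) / np.sum(w)  # self-normalized estimate (11)

# Example: estimate E_pi[x], the mean of the truncated Gaussian target.
print(importance_sampling_estimate(lambda x: x))
```

Note how the unknown constant Z never appears: it cancels in the ratio (11), exactly as
in the proof of Theorem 1.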

19.3 Particle Filtering: HMM

Now we shall utilize importance sampling to obtain samples from the distribution
p_{x_0,...,x_n | y_0,...,y_n}(x_0^n | y_0^n), which in the above notation corresponds to π.
Therefore, the function q is given by

    q(x_0^n) = p_{x_0,...,x_n}(x_0^n) ∏_{i=0}^{n} p_{y_i|x_i}(y_i | x_i).          (16)

The known distribution that we shall use as the proposal is none other than
p_{x_0,...,x_n}(·), which in the above notation is φ. Therefore, for any given x_0^n,
the weight function w is given by

    w_0^n ≜ w(x_0^n) = q(x_0^n) / φ(x_0^n) = ∏_{i=0}^{n} p_{y_i|x_i}(y_i | x_i).   (17)

In summary, we estimate E_{p_{x_0,...,x_n | y_0,...,y_n}}[f(x_0, . . . , x_n) | y_0, . . . , y_n] as follows.

1. Obtain S i.i.d. samples x_0^n(s), 1 ≤ s ≤ S, from p_{x_0,...,x_n}.

2. Compute the corresponding weights w_0^n(s) = ∏_{i=0}^{n} p_{y_i|x_i}(y_i | x_i(s)), 1 ≤ s ≤ S.

3. Output the estimate of E_{p_{x_0,...,x_n | y_0,...,y_n}}[f(x_0, . . . , x_n) | y_0, . . . , y_n] as

       Σ_{s=1}^S w_0^n(s) f(x_0^n(s)) / Σ_{s=1}^S w_0^n(s).                        (18)

Sequential implementation. It is worth observing that the above described algorithm
can be implemented sequentially. Specifically, given the samples x_0^n(s) and the
corresponding weights w_0^n(s), we can obtain x_0^{n+1}(s) and w_0^{n+1}(s) as follows:

• Sample x_{n+1}(s) ∼ p_{x_{n+1}|x_n}(· | x_n(s)).

• Compute w_0^{n+1}(s) = w_0^n(s) · p_{y_{n+1}|x_{n+1}}(y_{n+1} | x_{n+1}(s)).

Such a sequential implementation is particularly useful in various settings, just like
sequential inference using BP; a code sketch of this sequential update follows.
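The sketch below implements the sequential update above for a small nonlinear,
non-Gaussian state-space model. The specific dynamics and observation model are
illustrative assumptions made for this sketch (they are not from the notes), and no
resampling step is added, matching the algorithm exactly as stated.

```python
import numpy as np

# Sequential importance-sampling particle filter for an HMM.
# Assumed illustrative model:
#   x_{n+1} = 0.9 x_n + v_n,     v_n ~ N(0, 1)
#   y_n     = x_n^2 / 5 + w_n,   w_n ~ N(0, 1)

rng = np.random.default_rng(1)
S = 5000  # number of particles

def sample_dynamics(x):
    return 0.9 * x + rng.normal(size=x.shape)

def obs_likelihood(y, x):
    # p_{y_n | x_n}(y | x) under the assumed observation model
    return np.exp(-0.5 * (y - x**2 / 5.0)**2) / np.sqrt(2 * np.pi)

def particle_filter(ys):
    particles = rng.normal(size=S)                     # x_0(s) ~ p_{x_0} (standard normal here)
    weights = obs_likelihood(ys[0], particles)
    weights /= weights.sum()
    estimates = [np.sum(weights * particles)]          # estimate of E[x_0 | y_0]
    for y in ys[1:]:
        particles = sample_dynamics(particles)                 # x_{n+1}(s) ~ p(.|x_n(s))
        weights = weights * obs_likelihood(y, particles)       # w_0^{n+1}(s) = w_0^n(s) p(y_{n+1}|x_{n+1}(s))
        weights /= weights.sum()
        estimates.append(np.sum(weights * particles))          # posterior-mean estimate of x_{n+1}
    return np.array(estimates)

# Simulate data from the assumed model and run the filter.
T = 50
x = np.zeros(T); y = np.zeros(T)
x[0] = rng.normal(); y[0] = x[0]**2 / 5.0 + rng.normal()
for n in range(1, T):
    x[n] = 0.9 * x[n - 1] + rng.normal()
    y[n] = x[n]**2 / 5.0 + rng.normal()
print(particle_filter(y)[:5])
```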

19.4 Particle Filtering: Extending to Trees

In this section, well show how we can use a particle approach to do approximate
inference in general tree structured graphs. As in the previous section, well break
down the BP equations into two steps and show how particles are updated in each
step. The HMM structure was particularly nice and well see how the additional
exibility of general trees makes the computation more dicult and how the tools
weve learnt so far can surmount these issues.
Recall that for general tree structured graphs, the BP message equations were

    m_{i→j}(x_j) = ∫_{x_i} ψ_{ij}(x_i, x_j) φ_i(x_i) ∏_{l∈N(i)\j} m_{l→i}(x_i) dx_i.

We can write this as two steps:

    β_{i→j}(x_i) ≜ φ_i(x_i) ∏_{l∈N(i)\j} m_{l→i}(x_i),                             (19)

    m_{i→j}(x_j) = ∫_{x_i} ψ_{ij}(x_i, x_j) β_{i→j}(x_i) dx_i.                      (20)

From the messages, the marginals can be computed as

    p_i(x_i) ∝ φ_i(x_i) ∏_{l∈N(i)} m_{l→i}(x_i).

Naturally, we'll use weighted particle sets to represent the messages m_{i→j} and the
beliefs β_{i→j}. For the details to work out nicely, we require some additional
normalization conditions on the potentials:

    ∫_{x_i} φ_i(x_i) dx_i < ∞,

    ∫_{x_j} ψ_{ij}(x_i, x_j) dx_j < ∞   for all x_i.

The first issue that we run into is how to calculate β_{i→j}. If we represent each
m_{l→i} as a weighted particle set, we have to answer what it means to multiply
weighted particle sets. Another way of thinking about a weighted particle set
approximation to m_{l→i} is

    m_{l→i}(x_i) ≈ Σ_s w_{l→i}(s) δ(x_i − x_{l→i}(s)),

where δ is the Dirac delta³. Then it is clear that multiplication of weighted particle
sets just means multiplication of these approximations. But this does not solve the
problem; consider the product m_{l→i}(x_i) · m_{l'→i}(x_i):

    m_{l→i}(x_i) m_{l'→i}(x_i) ≈ Σ_{s,s'} w_{l→i}(s) w_{l'→i}(s') δ(x_i − x_{l→i}(s)) δ(x_i − x_{l'→i}(s')).

The distributions are continuous, so x_{l→i}(s) ≠ x_{l'→i}(s') with probability 1, and
hence the product would be 0. The solution to this problem is to place a Gaussian
kernel around each particle. Explicitly, instead of approximating m_{l→i} as a sum of
delta functions, we'll approximate it as

    m_{l→i}(x_i) ≈ Σ_s w_{l→i}(s) N(x_i; x_{l→i}(s), J_{l→i}^{−1}),

where J_{l→i} is an information matrix and the whole sum is referred to as a mixture
of Gaussians. This takes care of computing the beliefs.
As a result, β_{i→j} is a product of d − 1 mixtures of Gaussians, where d is the degree
of node i. It isn't hard to see that β_{i→j} is itself a mixture of Gaussians, but with
S^{d−1} components. So, computing the integral in the message equation and generating
new particles x_{i→j}(s) boils down to sampling from a mixture of S^{d−1} Gaussians.
Sampling from a mixture of Gaussians is usually not difficult; unfortunately, this
mixture has exponentially many components, so here it is no small task. One approach
to this problem is to use MCMC, either Metropolis-Hastings or Gibbs sampling⁴. This
gives us the new particles for our message, and we can iterate to compute weighted
particle approximations for the rest of the messages and marginals.
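To see concretely why the component count blows up, here is a small sketch, written
for these notes as an illustration, that forms the exact product of two one-dimensional
Gaussian-mixture messages: every pair of components combines into a single Gaussian,
so two S-component messages yield an S²-component product, and d − 1 of them yield
S^{d−1} components.

```python
import numpy as np

# Exact product of two 1-D Gaussian mixtures
#   sum_s w1[s] N(x; m1[s], v1[s])  and  sum_t w2[t] N(x; m2[t], v2[t]).
# Each pair of components multiplies into one Gaussian with a scale factor,
# so the product has len(w1) * len(w2) components.

def gaussian_pdf(x, m, v):
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

def product_of_mixtures(w1, m1, v1, w2, m2, v2):
    """Return (weights, means, variances) of the product mixture, weights renormalized."""
    W, M, V = [], [], []
    for a in range(len(w1)):
        for b in range(len(w2)):
            v = 1.0 / (1.0 / v1[a] + 1.0 / v2[b])        # combined variance
            m = v * (m1[a] / v1[a] + m2[b] / v2[b])      # combined mean
            z = gaussian_pdf(m1[a], m2[b], v1[a] + v2[b])  # scale: N(m1[a]; m2[b], v1[a]+v2[b])
            W.append(w1[a] * w2[b] * z); M.append(m); V.append(v)
    W = np.array(W)
    return W / W.sum(), np.array(M), np.array(V)

# Two 3-component messages give a 9-component product.
w, m, v = product_of_mixtures(
    np.array([0.5, 0.3, 0.2]), np.array([-1.0, 0.0, 2.0]), np.array([0.5, 1.0, 0.3]),
    np.array([0.4, 0.4, 0.2]), np.array([0.5, 1.5, -0.5]), np.array([0.4, 0.8, 1.0]))
print(len(w))  # 9 = 3 * 3 components
```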
To summarize, we could extend particle methods to BP on general tree structures
in a natural way. In doing so, we had to overcome two obstacles: multiplying
weighted particle sets and sampling from a product of mixtures of Gaussians. We
overcame the first issue by replacing the delta functions with Gaussian kernels. We
tackled the second issue using familiar tools, namely MCMC. The end result is an
approximate inference algorithm to calculate posterior marginals on general tree
structured graphs with continuous distributions!

³ The Dirac delta has the special property that if f is a function then ∫ f(x) δ(x) dx = f(0). It essentially picks out the value of f at 0.
⁴ See "Efficient Multiscale Sampling From Products of Gaussian Mixtures", Ihler et al. (2003), for more information about this approach.
19.4.1 Concluding Remarks

In the big picture, we combined our BP algorithm with importance sampling to
do approximate inference of marginals for continuous distributions. This rounds out
our toolbox of algorithms; now we can infer posterior marginals for both discrete and
continuous distributions. Of course, we could have just used the Monte Carlo methods
from the previous lecture to simulate the whole joint distribution. The problem with
that approach is that it would be very slow and inefficient compared to the particle
filtering approach.


20 Learning Graphical Models

In this course, we rst saw how distributions could be represented using graphs.
Next, we saw how to perform statistical inference eciently in these distributions by
exploiting the graph structure. We now begin the third and nal phase of the course,
on learning graphical models from data.
In many situations, we can avoid learning entirely because the right model follows
from physical constraints. For instance, the structure and parameters may be dictated
by laws of nature, such as Newtonian mechanics. Alternatively, we may be analyzing
a system, such as an error correcting code, which was specifically engineered to have
certain probabilistic dependencies. However, when neither of these applies, we can't
set the parameters a priori, and instead we must learn them from data. The data we
use for learning the model is known as training data.
The learning task for a graphical model involves learning the graphical structure
and the associated parameters (or potentials). The training data could contain
observations of all variables or only partial observations. Given this, the following is
a natural set of learning scenarios:

1. The graphical structure is known, but the parameters are unknown. The observations
   available to learn the parameters involve all variables.

   Example. If we have two variables x_1 and x_2, we may be interested in determining
   whether they are independent. This is equivalent to determining whether or not
   there is an edge between x_1 and x_2, which in turn is equivalent to determining
   whether the associated edge parameters are non-trivial or not.

   [x_1 — x_2   vs.   x_1    x_2 (no edge)]

2. The graphical structure is known, but the parameters are unknown. The observations
   available to learn the parameters involve only a subset (partial set) of variables,
   while the others are unobserved (hidden).

   Example. The naive Bayes model, as shown in the following figure, is often used
   for clustering. In it, we assume every observation x = (x_1, . . . , x_N) is generated
   by first choosing a cluster assignment u, and then independently sampling each x_i
   from some distribution p_{x_i|u}. We observe x, but not u. And we do know the
   graphical structure.

   [Figure: naive Bayes model with hidden cluster assignment u and children x_1, x_2, . . . , x_N.]

3. Both the graphical structure and the parameters are unknown. The observations
   available to learn the parameters involve all variables.

   Example. We are given a collection of image data for hand-written characters.
   The goal is to use it to produce a graphical model for each character so as to
   utilize it for automated character recognition: given an unknown image of a
   character, find the probability of it being a particular character, say c, using the
   graphical model; declare it to be the character c for which the probability is
   maximum. At gray-scale, the image for a given character can be viewed as a
   graphical model with each pixel representing a binary variable. The data is
   available for all pixels/variables, but the graphical structure and associated
   parameters are unknown.
4. Both the graphical structure and the parameters are unknown. The observations
   available to learn the parameters involve only a subset (partial set) of variables,
   while the others are unobserved (hidden).

   Example. Let x = (x_1, . . . , x_N) denote the prices of stocks at a given time and
   let the variables y = (y_1, . . . , y_M) capture the related unobservable ambient
   state of the world, e.g. internal information about business operations, etc. We
   only observe x. The structure of the graphical model between the variables x, y
   is entirely unknown, and y is unobserved. The goal is to learn the graphical model
   and associated parameters between these variables to potentially better predict
   future price variation or develop a better investment portfolio.

We shall spend the remainder of the lectures dealing with each of these scenarios, one
at a time. For each of these scenarios, we shall distinguish the cases of directed and
undirected graphical models. As we discuss learning algorithms, we'll find ourselves
using the inference algorithms that we have learnt thus far, utilized in an unusual
manner.

20.1 Learning Simplest Graphical Model: One Node

We shall start with the question of learning the simplest possible graphical model:
a graph with one node. As we shall see, it will form the backbone of our first learning
task in general: known graph structure, unknown parameters, and all variables observed.

Let x be a one-node graphical model, or simply a random variable. Let p(x; θ) be a
distribution over a random variable x parameterized by θ. For instance, p(x; θ) may
be a Bernoulli distribution where x = 1 with probability θ and x = 0 with probability
1 − θ. We treat the parameter θ as nonrandom, but unknown. We can view p(x; θ) as
a function of x or of θ:

• p(·; θ) is the pdf or pmf for a distribution parameterized by θ,
• L(θ; x) ≜ p(x; θ) is the likelihood function for a given observation x.

We are given observations in terms of S samples, D = {x(1), . . . , x(S)}. The goal is
to learn the parameter θ.

A reasonable philosophy for learning an unknown parameter given observations is
maximum likelihood estimation: choose the parameter under which the observations
are most likely, i.e. have maximum likelihood. That is, if we have one observation
x = x(1), then

    θ̂_ML(x) = arg max_θ p(x; θ).

We shall assume that when multiple observations are present, they are generated
independently (and from the same unknown distribution). Therefore, the likelihood
function for D = {x(1), . . . , x(S)} is given by

    L(θ; D) ≜ L(θ; x(1), . . . , x(S)) = ∏_{s=1}^S p(x(s); θ).                     (1)

Often we find it easier to work with the log of the likelihood function, or log-likelihood:

    ℓ(θ; D) ≜ ℓ(θ; x(1), . . . , x(S)) ≜ log L(θ; x(1), . . . , x(S))              (2)
            = Σ_{s=1}^S log p(x(s); θ).                                            (3)

Observe that the likelihood function is invariant to permutation of the data, because
we assumed the observations were all drawn i.i.d. from p(·; θ). This suggests we should
pick a representation of the data which throws away the order. One such representation
is the empirical distribution of the data, which gives the relative frequency of different
symbols in the alphabet:

    p̂(a) = (1/S) Σ_{s=1}^S 1{x(s) = a}.

Using the shorthand D = {x(1), . . . , x(S)}, p̂_D(a) is another way of writing the
empirical distribution. We can also write the log-likelihood function as ℓ(θ; D). Note
that p̂ is a vector of length |X|. We can think of it as a sufficient statistic of the data,
since it is sufficient to infer the maximum likelihood parameters.
20.1.1 Bernoulli variable

Going back to the Bernoulli distribution (or coin flips), p̂_D(0) is the empirical
fraction of 0s in the sample, and p̂_D(1) is the empirical fraction of 1s. The
log-likelihood function is given by

    ℓ(θ; D) = S p̂_D(1) log θ + S p̂_D(0) log(1 − θ).

We find the maximum of this function by differentiating and setting the derivative
to zero:

    0 = ∂ℓ/∂θ = S p̂_D(1)/θ − S p̂_D(0)/(1 − θ).

Since p̂_D(0) = 1 − p̂_D(1), we have

    p̂_D(1)/θ = (1 − p̂_D(1))/(1 − θ).

Therefore, we obtain

    θ̂_ML = p̂_D(1).

That is, the empirical estimate is the maximum likelihood estimate. By the strong law
of large numbers, the empirical distribution will eventually converge to the correct
probabilities, so we see that maximum likelihood is consistent in this case.
20.1.2 General discrete variable

When the variables take values in a general discrete alphabet X = {a_1, . . . , a_M}
rather than {0, 1}, we use the multinomial distribution rather than the Bernoulli
distribution. In other words, p(x = a_m) = θ_m, where the parameters θ_m satisfy

    Σ_{m=1}^M θ_m = 1,    θ_m ∈ [0, 1].

The likelihood function is given by

    L(θ; D) = ∏_{m=1}^M θ_m^{S p̂_D(a_m)}.

Following exactly the same reasoning as in the Bernoulli case, we find that the maximum
likelihood estimate of the parameters is given by the empirical distribution, i.e.

    θ̂_ML = ( p̂_D(a_m) )_{1 ≤ m ≤ M}.
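As a quick sanity check of this result, here is a minimal sketch in Python (the data
below are fabricated for illustration): the ML estimate of the multinomial parameters
is simply the vector of empirical frequencies, and it approaches the true parameters as
S grows, consistent with the consistency argument above.

```python
import numpy as np

# ML estimation for a multinomial over an M-symbol alphabet encoded as 0,...,M-1:
# theta_ML is the empirical distribution of the samples.

def ml_multinomial(samples, M):
    counts = np.bincount(samples, minlength=M)   # number of times each symbol occurs
    return counts / counts.sum()                 # theta_ML = empirical frequencies

rng = np.random.default_rng(0)
true_theta = np.array([0.2, 0.5, 0.3])
D = rng.choice(3, size=1000, p=true_theta)       # S = 1000 i.i.d. samples
print(ml_multinomial(D, 3))                      # close to true_theta for large S
```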
20.1.3 Information theoretic interpretation

Earlier in the course, we saw that we could perform approximate inference in graphical
models by solving a variational problem minimizing the information divergence between
the true distribution p and an approximating distribution q. It turns out that maximum
likelihood has an analogous interpretation. We can rewrite the log-likelihood function
in terms of information theoretic quantities:

    ℓ(θ; D) = Σ_{s=1}^S log p(x(s); θ)
            = S Σ_{a∈X} p̂_D(a) log p(a; θ)
            = S E_{p̂_D}[log p(x; θ)]
            = S ( −H(p̂_D) − D(p̂_D ∥ p(·; θ)) ).

We can ignore the entropy term because it is a function of the empirical distribution,
and therefore fixed. Therefore, maximizing the likelihood is equivalent to minimizing
the information divergence D(p̂_D ∥ p(·; θ)). Note the difference between this
variational problem and the one we considered in our discussion of approximate
inference: there we were minimizing D(q ∥ p) with respect to q, whereas here we are
minimizing with respect to the second argument.

Recall that the KL divergence is zero when the two arguments are the same distribution.
In the multinomial case, since we are optimizing over the set of all distributions over
a finite alphabet X, we can match the distribution exactly, i.e. set p(·; θ) = p̂_D.
However, in most interesting problems, we cannot match the data distribution exactly.
(If we could, we would simply be overfitting.) Instead, we generally optimize over a
restricted class of distributions parameterized by θ.

20.2 Learning Parameters for Directed Graphical Model

Now we discuss how parameter estimation for the one-node graphical model extends
neatly to learning the parameters of a generic directed graphical model.

Let us start with an example of a directed graphical model as shown in Figure 1.
This graphical model obeys the factorization

    p(x_1, . . . , x_4) = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2, x_3).

The joint distribution can be represented with four sets of parameters θ_1, . . . , θ_4.
Each θ_i defines a multinomial distribution corresponding to each joint assignment of
the parents of x_i. Suppose we have S samples of the 4-tuple (x_1, x_2, x_3, x_4). The
log-likelihood function can be written as

    ℓ(θ; x) = log p(x_1; θ_1) + log p(x_2 | x_1; θ_2) + log p(x_3 | x_1; θ_3) + log p(x_4 | x_2, x_3; θ_4).
Figure 1: A directed graphical model over x_1, x_2, x_3, x_4 (x_1 is a parent of x_2 and
x_3; x_2 and x_3 are parents of x_4).
Observe that each θ_i appears in exactly one of these terms, so all of the terms can be
optimized separately. Since each individual term is simply the likelihood function of
a multinomial distribution, we simply plug in the empirical distributions as before.
Precisely, in the above example, we compute the empirical conditional distribution of
each variable x_i given its parents, e.g. p̂_{x_2|x_1}(·|x_1) for all possible values of
x_1. And this will be the maximum likelihood estimate of the directed graphical model!
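The sketch below carries this out for the Figure 1 DAG: each conditional probability
table is estimated independently as an empirical conditional distribution. The parent
structure, the binary alphabet, and the data are illustrative assumptions made for this
sketch.

```python
import numpy as np
from collections import Counter

# ML estimation of CPTs for a known DAG from complete data: each table is just an
# empirical conditional distribution, estimated independently of the others.

# parents[i] lists the parents of node i for the Figure 1 DAG:
# x1 -> x2, x1 -> x3, (x2, x3) -> x4  (nodes are 0-indexed here).
parents = {0: (), 1: (0,), 2: (0,), 3: (1, 2)}

def ml_cpts(samples, parents, M=2):
    """samples: array of shape (S, N) with entries in {0,...,M-1}."""
    cpts = {}
    for i, pa in parents.items():
        counts = Counter()
        for row in samples:
            counts[(tuple(row[list(pa)]), row[i])] += 1
        cpt = {}
        for (pa_val, xi), c in counts.items():
            total = sum(counts[(pa_val, v)] for v in range(M))   # normalize per parent assignment
            cpt[(pa_val, xi)] = c / total
        cpts[i] = cpt
    return cpts

# Fabricated binary data: S = 8 samples over (x1, x2, x3, x4).
D = np.array([[0,0,0,0],[0,0,1,1],[1,1,0,1],[1,1,1,1],
              [0,1,0,0],[1,0,1,1],[0,0,0,1],[1,1,1,0]])
print(ml_cpts(D, parents)[1])   # empirical p̂(x2 | x1)
```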
Remarks. It is worth highlighting two features of this problem which were neces
sary for the maximum likelihood problem to decompose into independent subprob
lems. First, we had complete observations, i.e. all of the observation tuples gave the
values of all of the variables. Second, the parameters were treated as independent
for each conditional probability distribution (CPD). If either of these assumptions is
violated, the maximum likelihood problem does not decompose into an independent
subproblem for each variable.

20.3 Bayesian parameter estimation

Here we discuss a method that utilizes prior information about unknown parameters
to learn them from data. It can also be viewed as a way to avoid so-called overfitting
by means of a penalty. Again, as discussed above, this method extends naturally to
learning directed graphical models as long as the priors over parameters corresponding
to different variables are independent.

Recall that in maximum likelihood parameter estimation, we modeled the parameter θ
as nonrandom but unknown. In Bayesian parameter estimation, we model θ as a random
variable. We assume we are given a prior distribution p(θ) as well as a distribution
p(x|θ) corresponding to the likelihood function. Notice that we use the notation p(x|θ)
rather than p(x; θ) because θ is a random variable. The probability of the data x is
therefore given by

    p(x) = ∫ p(θ) p(x|θ) dθ.

In ML estimation, we tried to find a particular value of θ to use. In Bayesian
estimation, we instead use a weighted mixture of all possible values of θ, where the
weights correspond to the posterior distribution

    p(θ | x(1), . . . , x(S)).
In other words, in order to make predictions about future samples, we use the
predictive distribution

    p(x' | x(1), . . . , x(S)) = ∫ p(θ | x(1), . . . , x(S)) p(x' | θ) dθ.

For an arbitrary choice of prior, it is quite cumbersome to represent the posterior
distribution and to compute this integral. However, for certain priors, called conjugate
priors, the posterior takes the same form as the prior, making these computations
convenient. For the case of the multinomial distribution, the conjugate prior is the
Dirichlet prior. The parameter θ is drawn from the Dirichlet distribution
Dirichlet(α_1, . . . , α_M) if

    p(θ) ∝ ∏_{m=1}^M θ_m^{α_m − 1}.

It turns out that if {x(1), . . . , x(S)} are i.i.d. samples from a multinomial
distribution over an alphabet of size M, then if θ ∼ Dirichlet(α_1, . . . , α_M), the
posterior is given by

    θ | D ∼ Dirichlet(α_1 + S p̂_D(a_1), . . . , α_M + S p̂_D(a_M)).

The predictive distribution also turns out to have a convenient form:

    p(x' = a_m | x(1), . . . , x(S)) = (α_m + S p̂_D(a_m)) / (Σ_{m'=1}^M α_{m'} + S).
In the special case of Bernoulli random variables,

    p(θ) = (1/Z) θ^{α_1 − 1} (1 − θ)^{α_0 − 1},

    p(θ | D) = (1/Z') θ^{α_1 + S p̂_D(1) − 1} (1 − θ)^{α_0 + S p̂_D(0) − 1}.

The special case of the Dirichlet distribution for binary alphabets is also called the
beta distribution.

In a fully Bayesian setting, we would make predictions about new data by integrating
out θ. However, if we want to choose a single value of θ, we can use the MAP criterion:

    θ̂_MAP = arg max_θ p(θ | D).

When we differentiate and set the derivative equal to zero, we find that

    θ̂_MAP = (α_1 + S p̂_D(1) − 1) / (α_0 + α_1 + S − 2).

Observe that, unlike the maximum likelihood criterion, the MAP criterion takes the
number of training examples S into account. In particular, when S is small, the MAP
value is close to the prior; when S is large, the MAP value is close to the empirical
distribution. In this sense, Bayesian parameter estimation can be seen as controlling
overfitting by penalizing more complex models, i.e. those farther from the prior.
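A minimal numeric sketch of these formulas for the Bernoulli/Beta case is given below;
the prior pseudo-counts and observations are fabricated for illustration.

```python
import numpy as np

# Bayesian estimation of a Bernoulli parameter with a Beta prior (the 2-symbol
# Dirichlet case): posterior update, MAP estimate, and predictive probability.

alpha1, alpha0 = 2.0, 2.0            # assumed prior pseudo-counts for this sketch
D = np.array([1, 0, 1, 1, 0, 1, 1])  # fabricated observations
S, S1, S0 = len(D), D.sum(), len(D) - D.sum()

post_alpha1 = alpha1 + S1            # alpha_1 + S * p̂_D(1)
post_alpha0 = alpha0 + S0            # alpha_0 + S * p̂_D(0)

theta_ml = S1 / S
theta_map = (post_alpha1 - 1) / (post_alpha0 + post_alpha1 - 2)
predictive_p1 = post_alpha1 / (post_alpha0 + post_alpha1)   # p(x' = 1 | D)

print(theta_ml, theta_map, predictive_p1)
```

With only S = 7 samples the MAP estimate sits between the empirical fraction and the
prior mean, matching the small-S behavior described above.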


21 Learning parameters of an undirected graphical model
Today we shall see how to learn the parameters, or potentials, of an undirected
graphical model from observations of all variables, using knowledge of the graphical
structure.

Let G = (V, E) denote the associated undirected graph over N vertices, V = {1, . . . , N}
and E ⊆ V × V. Let the associated random variables x_1, . . . , x_N take values in a
finite alphabet X. For the purpose of this lecture, we shall restrict attention to
X = {0, 1}; however, the treatment of this lecture naturally extends to any finite
alphabet X. We shall assume that p(x) > 0 for all x ∈ X^N. Recall that by the
Hammersley-Clifford theorem, stated and proved in an earlier lecture,

    p(x) ∝ exp( Σ_{C∈C(G)} V_C(x_C) ),                                             (1)

where C(G) is the set of all cliques of G, including the empty set; x_C = (x_i)_{i∈C};
and V_C(·) is the potential function associated with clique C, of the form

    V_C(x_C) = Q(C)   if x_C = 1 (all ones),
               0      otherwise.                                                    (2)

We observe i.i.d. samples from this distribution, denoted as D = {x(1), . . . , x(S)}.
The goal is to learn the potential functions associated with the cliques of G from D.

21.1 A simple example: binary pair-wise graphical model

Before we consider the general graph, let us consider a graph that does not have a
triangle. That is, C(G) consists of all vertices V, all edges E, and the empty set. As
before, X = {0, 1}. Given this, any such graphical model can be represented as

    p(x) = (1/Z) exp( Σ_{i∈V} θ_i x_i + Σ_{(i,j)∈E} θ_ij x_i x_j ),                (3)

for θ_i, θ_ij ∈ R for all i ∈ V, (i, j) ∈ E; Z is the normalization constant. Given
this, the question boils down to learning the parameters θ from D, where

    θ = (θ_i, i ∈ V; θ_ij, (i, j) ∈ E).

If we had a very large number of samples (≫ 2^N), so that each assignment x ∈ {0, 1}^N
is sampled enough times, we could estimate the empirical probability of each assignment
x ∈ {0, 1}^N. Now

    log p(e_i) − log p(0) = θ_i,

where e_i is the assignment in {0, 1}^N with ith element 1 and everything else 0, and 0
is the assignment with all elements 0. Similarly,

    log p(e_ij) − log p(e_i) − log p(e_j) + log p(0) = θ_ij,

where e_ij is the assignment in {0, 1}^N with ith and jth elements 1 and everything else
0. Therefore, with a potentially exponentially large number of samples, it is possible to
learn the parameters θ. Clearly this is highly inefficient. The question is whether it's
feasible to learn the parameters with a lot fewer samples.
We shall utilize the quintessential property of the undirected graphical model: for
any i ∈ V, given x_{N(i)}, x_i is independent of x_{V\(N(i)∪{i})}, where
N(i) = {j ∈ V : (i, j) ∈ E}. That is,

    p(x_i = · | x_{N(i)} = 0) = p(x_i = · | x_{V\{i}} = 0).

Therefore,

    log p(x_i = 1 | x_{N(i)} = 0) − log p(x_i = 0 | x_{N(i)} = 0)
        = log p(x_i = 1 | x_{V\{i}} = 0) − log p(x_i = 0 | x_{V\{i}} = 0)
        = log p(x_i = 1, x_{V\{i}} = 0) − log p(x_i = 0, x_{V\{i}} = 0)
        = log p(e_i) − log p(0)
        = θ_i.

Thus, by estimating the conditional probabilities p(x_i = · | x_{N(i)} = 0) for all
i ∈ V, we can learn the parameters θ_i, i ∈ V. Similarly, by estimating
p(x_i = 1, x_j = 1 | x_{N(i)∪N(j)\{i,j}} = 0), it's feasible to learn θ_ij for all
(i, j) ∈ E.

Therefore, it's feasible to learn the parameters θ with the number of samples scaling
as 2^{2d}, if |N(i)| ≤ d for all i ∈ V; this is quite efficient when d ≪ N.
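Here is a sketch of this estimator in Python, written for these notes as an illustration:
it reads θ_i and θ_ij off empirical conditional probabilities in which the relevant
neighborhoods are conditioned to be all zeros. It assumes an S × N array D of {0, 1}
samples, a known neighborhood structure, and enough samples that every conditional
probability used is estimated from nonzero counts.

```python
import numpy as np

def estimate_theta_i(D, i, nbrs_i):
    """theta_i = log p(x_i=1 | x_{N(i)}=0) - log p(x_i=0 | x_{N(i)}=0)."""
    mask = np.all(D[:, list(nbrs_i)] == 0, axis=1) if nbrs_i else np.ones(len(D), dtype=bool)
    rows = D[mask]                                    # samples with x_{N(i)} = 0
    p1 = (rows[:, i] == 1).mean()
    p0 = (rows[:, i] == 0).mean()
    return np.log(p1) - np.log(p0)

def estimate_theta_ij(D, i, j, nbrs_i, nbrs_j):
    """theta_ij from conditionals with x_{N(i) u N(j) \\ {i,j}} = 0."""
    cond = sorted((set(nbrs_i) | set(nbrs_j)) - {i, j})
    mask = np.all(D[:, cond] == 0, axis=1) if cond else np.ones(len(D), dtype=bool)
    rows = D[mask]
    def frac(a, b):
        return ((rows[:, i] == a) & (rows[:, j] == b)).mean()
    # log p(1,1|0) - log p(1,0|0) - log p(0,1|0) + log p(0,0|0) = theta_ij
    return np.log(frac(1, 1)) - np.log(frac(1, 0)) - np.log(frac(0, 1)) + np.log(frac(0, 0))
```

Only samples in which the conditioning neighborhood is all zeros contribute to each
estimate, which is the source of the 2^{2d} sample-complexity scaling mentioned above.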

21.2 Generic undirected graphical model

For any graphical model with binary valued variables, learning the associated clique
potentials V_C(·) boils down to learning the constants Q(C) as in (2). They can be
learnt in a very similar manner as in the example of the binary pairwise graphical
model. Concretely, for any i ∈ V,

    log p(x_i = 1 | x_{N(i)} = 0) − log p(x_i = 0 | x_{N(i)} = 0)
        = log p(x_i = 1 | x_{V\{i}} = 0) − log p(x_i = 0 | x_{V\{i}} = 0)
        = log p(x_i = 1, x_{V\{i}} = 0) − log p(x_i = 0, x_{V\{i}} = 0)
        = log p(e_i) − log p(0)
        = Q({i}).

And more generally, for any clique C ⊆ V,

    log p(x_C = 1 | x_{N(C)} = 0) − log p(x_C = 0 | x_{N(C)} = 0) = Σ_{C' ⊆ C} Q(C').

Using the above identity, it's possible to learn the constants associated with cliques
of increasing sizes iteratively. The number of samples required scales as 2^{d_c},
where d_c is a bound on the number of neighbors of the vertices in any clique
(including the vertices in the clique) for the graph G.


22 Parameter Estimation from Partial Observations
We've looked at learning parameters given the graph structure for directed and undirected
graphical models when all variables are observed. Today, we consider the case when we
don't observe all the variables, focusing just on parameter estimation. A key issue
we'll face is that the parameters get coupled, complicating the estimation process.

22.1 Latent variables

There are numerous examples where we only observe the variables partially and are
required to learn the parameters associated with the entire graphical model. For example,
consider a classification task such as determining whether an email received is spam
or ham. Unfortunately, email senders don't actually give a label for whether the email
they sent is spam or ham. Thus, treating the label as a random variable, the label is
hidden and we want to infer what the label is given observations, which for email
classification could be features extracted from the email content.

We give a sense of why latent variables could make estimation more difficult via
the following example:
Example 1. Consider the following two-node graph, with y a parent of x:

    y → x

with

    p_{x,y}(x, y; θ) = p_y(y; θ_y) p_{x|y}(x|y; θ_{x|y}) = θ_y(y) θ_{x|y}(x|y),

where

    θ_y = [θ_y(y)]_{y∈Y},    θ_{x|y} = [θ_{x|y}(x|y)]_{x∈X, y∈Y}.

Assuming that the model is fully observed, then as we've seen previously, the ML
estimates come from empirical distribution matching:

    θ̂_y(y) = p̂_y(y),    θ̂_{x|y}(x|y) = p̂_{x,y}(x, y) / p̂_y(y).

Now consider if we instead only observed y. Then the log likelihood is given by

    ℓ(θ; y) = log Σ_x p_{x,y}(x, y; θ) = log Σ_x θ_y(y) θ_{x|y}(x|y),

for which we cannot push the log inside the summation. As a result, we cannot separate
θ_y from θ_{x|y}, so the parameters θ_y and θ_{x|y} are said to mix. Consequently,
adjusting θ_y could affect θ_{x|y}.
The example above illustrates the issue of parameter coupling when dealing with
latent variables. As a result, we are often left resorting to iterative algorithms that
numerically compute ML estimates. For example, if the log likelihood is differentiable,
then we can use gradient ascent. More generally, we can view ML estimation as a
generic hill-climbing problem. But we haven't yet exploited the fact that we're dealing
with a probability distribution, which is what we'll do next.
22.1.1 The Expectation-Maximization (EM) algorithm

Let y = (y_1, . . . , y_N) be the set of observed variables and x = (x_1, . . . , x_{N'})
be the set of latent variables, where N and N' need not be the same. We'll refer to
(y, x) as the complete data, where we'll assume that we know the joint distribution
p_{y,x}(·, ·; θ) with parameter θ. Given the observation y = y, our goal is to find the
ML estimate

    θ̂_ML = arg max_θ p_y(y; θ) = arg max_θ log p_y(y; θ),

and we denote ℓ(θ; y) ≜ log p_y(y; θ). We'll refer to the log likelihood ℓ(θ; y) that we
want to maximize as the incomplete log likelihood, whereas we'll call
ℓ_c(θ; y, x) = log p_{y,x}(y, x; θ) the complete log likelihood, which would be the
relevant log likelihood to maximize if we had observed the complete data.
The key idea behind the EM algorithm is the following. We will introduce a distribution
q over the hidden variables x that we get to choose. Notationally, we'll write q(·|y)
to indicate that we want q to depend on our (fixed) observations y. Then

    ℓ(θ; y) = log p_y(y; θ)
            = log Σ_x p_{y,x}(y, x; θ)
            = log Σ_x q(x|y) · p_{y,x}(y, x; θ) / q(x|y)
            = log E_{q(·|y)}[ p_{y,x}(y, x; θ) / q(x|y) ]
            ≥ E_{q(·|y)}[ log ( p_{y,x}(y, x; θ) / q(x|y) ) ]      (Jensen's inequality)
            = Σ_x q(x|y) log ( p_{y,x}(y, x; θ) / q(x|y) )
            ≜ L(q, θ).                                             (1)

In particular, L(q, θ) is easier to compute since we've pushed the log inside the
summation, and we were able to do so only because we introduced the distribution q and
used Jensen's inequality and the concavity of the log function. So rather than maximizing
ℓ(θ; y), which is hard to compute, the EM algorithm maximizes the lower bound L(q, θ)
by alternating between maximizing over q and then over θ:

    E-step: q^{(i+1)} = arg max_q L(q, θ^{(i)})
    M-step: θ^{(i+1)} = arg max_θ L(q^{(i+1)}, θ)

We make a few remarks. First, in the M-step, by treating the distribution q as fixed,
we are solving

    arg max_θ L(q, θ) = arg max_θ E_{q(·|y)}[ log ( p_{y,x}(y, x; θ) / q(x|y) ) ]
                      = arg max_θ { E_{q(·|y)}[log p_{y,x}(y, x; θ)] − E_{q(·|y)}[log q(x|y)] }
                      = arg max_θ E_{q(·|y)}[log p_{y,x}(y, x; θ)],                   (2)

since the second term does not depend on θ. We denote
E_{q(·|y)}[log p_{y,x}(y, x; θ)] ≜ ℓ_c^q(θ; y), the expected complete log likelihood
with respect to q(·|y).

Note that ℓ_c^q(θ; y) is essentially an estimate of the incomplete log likelihood, as we
are taking the expectation over the hidden variables with respect to q(·|y), effectively
summing the hidden variables out but not through a marginalization. Also, note that it's
the expectation of the complete log likelihood, so deriving the optimal θ will turn out
to be nearly identical to deriving an ML estimate for the complete log likelihood, where
we will fill in expected counts related to the hidden variables x.
Second, the E-step can be solved explicitly. Note that if we choose
q^{(i+1)}(·|y) = p_{x|y}(·|y; θ^{(i)}), then

    L(p_{x|y}(·|y; θ^{(i)}), θ^{(i)})
        = Σ_x p_{x|y}(x|y; θ^{(i)}) log [ p_{y,x}(y, x; θ^{(i)}) / p_{x|y}(x|y; θ^{(i)}) ]
        = Σ_x p_{x|y}(x|y; θ^{(i)}) log [ p_{x|y}(x|y; θ^{(i)}) p_y(y; θ^{(i)}) / p_{x|y}(x|y; θ^{(i)}) ]
        = Σ_x p_{x|y}(x|y; θ^{(i)}) log p_y(y; θ^{(i)})
        = log p_y(y; θ^{(i)})               (since Σ_x p_{x|y}(x|y; θ^{(i)}) = 1)
        = ℓ(θ^{(i)}; y).

So we have L(p_{x|y}(·|y; θ^{(i)}), θ^{(i)}) = ℓ(θ^{(i)}; y), while from inequality (1)
we know that ℓ(θ; y) ≥ L(q, θ) for all distributions q over the random variable x, so it
must be that choosing q^{(i+1)}(·|y) = p_{x|y}(·|y; θ^{(i)}) is optimal for the E-step.
By plugging the optimal choice of q from the E-step into the M-step, specifically
equation (2), we see that we can combine both steps of an EM iteration into one update:

    θ^{(i+1)} = arg max_θ E_{p_{x|y}(·|y; θ^{(i)})}[ log p_{y,x}(y, x; θ) ]
              = arg max_θ E[ log p_{y,x}(y, x; θ) | y = y; θ^{(i)} ].                (3)

Hence, in the E-step, we can compute the expectation
E[log p_{y,x}(y, x; θ) | y = y; θ^{(i)}] rather than computing out the full conditional
distribution p_{x|y}(x|y; θ^{(i)}), and in the M-step, we maximize this expectation with
respect to θ, justifying the name of the algorithm.

We end with two properties of the algorithm. First, the sequence of parameter estimates
θ^{(i)} produced never decreases the likelihood and will reach a local maximum of it.
Second, often the first few iterations are the most useful, and then the algorithm slows
down. One way to accelerate convergence after the first few iterations is to switch to
gradient ascent.
Example: Parameter estimation for hidden Markov models. Specializing to the case of
HMMs, the EM algorithm is called the Baum-Welch algorithm (Baum et al. 1970; an
efficient implementation described by Welch in 2003). Consider a homogeneous HMM
given by the diagram below.

[Figure: an HMM with hidden states x_0, x_1, x_2, . . . , x_T and observations y_0, y_1, y_2, . . . , y_T.]

We'll assume that x_t and y_t take on values in {1, 2, . . . , M}. The model parameters
are θ = {π, A, η}, where:

• π is the initial state distribution (π_i = p_{x_0}(i)),
• A is the transition matrix (a_ij = p_{x_{t+1}|x_t}(j|i)),
• η is the emission distribution (η_ij = p_{y_t|x_t}(j|i)).

We want an ML estimate of θ given only the observation y = y.

We'll use the following data representation trick: let x^i_t = 1{x_t = i}. Effectively,
we represent the state x_t ∈ {1, 2, . . . , M} as a bit-string (x^1_t, x^2_t, . . . , x^M_t),
where exactly one of the x^i_t's is 1 and the rest are 0. We'll do the same for y_t,
letting y^j_t = 1{y_t = j}. This way of representing each state will prove extremely
helpful in the sequel.
Before looking at the case where x is hidden, we first suppose that we observe the
complete data, where we have one observation for each x_t and one observation for each
y_t. Then optimizing the parameters involves maximizing the following:

    log p_{x,y}(x, y; θ)
      = log [ π_{x_0} · ∏_{t=0}^{T−1} a_{x_t, x_{t+1}} · ∏_{t=0}^{T} η_{x_t, y_t} ]
      = log [ ∏_{i=1}^{M} π_i^{1{x_0=i}} · ∏_{t=0}^{T−1} ∏_{i=1}^{M} ∏_{j=1}^{M} a_ij^{1{x_t=i, x_{t+1}=j}} · ∏_{t=0}^{T} ∏_{i=1}^{M} ∏_{j=1}^{M} η_ij^{1{x_t=i, y_t=j}} ]
      = log [ ∏_{i=1}^{M} π_i^{x^i_0} · ∏_{t=0}^{T−1} ∏_{i,j} a_ij^{x^i_t x^j_{t+1}} · ∏_{t=0}^{T} ∏_{i,j} η_ij^{x^i_t y^j_t} ]
      = Σ_{i=1}^{M} x^i_0 log π_i + Σ_{i,j} ( Σ_{t=0}^{T−1} x^i_t x^j_{t+1} ) log a_ij
            + Σ_{i,j} ( Σ_{t=0}^{T} x^i_t y^j_t ) log η_ij                              (4)
      =: Σ_i x^i_0 log π_i + Σ_{i,j} m_ij log a_ij + Σ_{i,j} n_ij log η_ij.

Note that m_ij is just the number of times we see state i followed by state j in the
data. Similarly, n_ij is the number of times we see state i emit observation j.

By introducing Lagrange multipliers for the constraints Σ_{i=1}^{M} π_i = 1,
Σ_{j=1}^{M} a_ij = 1 for all i, and Σ_{j=1}^{M} η_ij = 1 for all i, followed by setting
a bunch of partial derivatives to 0, the ML estimates are given by:

    π̂_i = x^i_0,
    â_ij = m_ij / Σ_{k=1}^{M} m_ik     (fraction of transitions from i that go to j),    (5)
    η̂_ij = n_ij / Σ_{k=1}^{M} n_ik     (fraction of times state i emits observation j).

We now consider the case when x is hidden, so we'll use the EM algorithm. What we'll
find is that our calculations above will not go to waste! Writing out the expectation
to be maximized in an EM iteration, i.e. equation (3), we have

    E_{p_{x|y}(·|y; θ^{(ℓ)})}[ log p_{x,y}(x, y; θ) ]
      = E_{p_{x|y}(·|y; θ^{(ℓ)})}[ Σ_{i=1}^{M} x^i_0 log π_i + Σ_{t=0}^{T−1} Σ_{i,j} x^i_t x^j_{t+1} log a_ij + Σ_{t=0}^{T} Σ_{i,j} x^i_t y^j_t log η_ij ]
      = Σ_{i=1}^{M} E_{p_{x_0|y}(·|y; θ^{(ℓ)})}[x^i_0] log π_i
        + Σ_{i,j} ( Σ_{t=0}^{T−1} E_{p_{x_t,x_{t+1}|y}(·,·|y; θ^{(ℓ)})}[x^i_t x^j_{t+1}] ) log a_ij
        + Σ_{i,j} ( Σ_{t=0}^{T} E_{p_{x_t|y}(·|y; θ^{(ℓ)})}[x^i_t] y^j_t ) log η_ij,

where the coefficients of log a_ij and log η_ij play the roles of m_ij and n_ij from
before; call them m̄_ij and n̄_ij, and call the coefficient of log π_i simply π̄_i.

Something magical happens: note that using the sum-product algorithm with one forward
and one backward pass, we can compute all node marginals p_{x_t|y}(·|y; θ^{(ℓ)}) and
edge marginals p_{x_t,x_{t+1}|y}(·,·|y; θ^{(ℓ)}). Once we have these marginals, we can
directly compute all the expectations E_{p_{x_0|y}(·|y; θ^{(ℓ)})}[x^i_0] and
E_{p_{x_t,x_{t+1}|y}(·,·|y; θ^{(ℓ)})}[x^i_t x^j_{t+1}]; hence, we can compute all the
π̄_i, m̄_ij, and n̄_ij. Computing these expectations is precisely the E-step. As for the
M-step, we can reuse our ML estimate formulas (5) from before, except using our new
values of π̄_i, m̄_ij, and n̄_ij.

For this problem, we can actually write down the expectations needed in terms of node
and edge marginal probabilities. Using the fact that indicator random variables are
Bernoulli, we have:

    E_{p_{x_t|y}(·|y; θ^{(ℓ)})}[x^i_t] = P(x_t = i | y = y; θ^{(ℓ)}),
    E_{p_{x_t,x_{t+1}|y}(·,·|y; θ^{(ℓ)})}[x^i_t x^j_{t+1}] = P(x_t = i, x_{t+1} = j | y = y; θ^{(ℓ)}).

Then we can simplify our ML estimates as follows:

    π̂_i^{(ℓ+1)} = π̄_i = E_{p_{x_0|y}(·|y; θ^{(ℓ)})}[x^i_0] = P(x_0 = i | y = y; θ^{(ℓ)}).    (6)

Next, note that the denominator of the ML estimate for â_ij can be written as

    Σ_{k=1}^{M} m̄_ik = Σ_{k=1}^{M} Σ_{t=0}^{T−1} E_{p_{x_t,x_{t+1}|y}(·,·|y; θ^{(ℓ)})}[x^i_t x^k_{t+1}]
                      = Σ_{t=0}^{T−1} E[ x^i_t Σ_{k=1}^{M} x^k_{t+1} ]
                      = Σ_{t=0}^{T−1} E[x^i_t]        (one bit is 1 and the rest are 0 for x_{t+1})
                      = Σ_{t=0}^{T−1} P(x_t = i | y = y; θ^{(ℓ)}).

A similar result holds for the denominator of the ML estimate for η̂_ij. Thus,

    â_ij^{(ℓ+1)} = m̄_ij / Σ_{k=1}^{M} m̄_ik
                 = Σ_{t=0}^{T−1} P(x_t = i, x_{t+1} = j | y = y; θ^{(ℓ)}) / Σ_{t=0}^{T−1} P(x_t = i | y = y; θ^{(ℓ)}),    (7)

    η̂_ij^{(ℓ+1)} = n̄_ij / Σ_{k=1}^{M} n̄_ik
                 = Σ_{t=0}^{T} P(x_t = i, y_t = j | y = y; θ^{(ℓ)}) / Σ_{t=0}^{T} P(x_t = i | y = y; θ^{(ℓ)}).             (8)

In summary, each iteration of the Baum-Welch algorithm does the following:

    E-step: Given the current parameter estimates θ^{(ℓ)} = {π̂^{(ℓ)}, Â^{(ℓ)}, η̂^{(ℓ)}},
    run the sum-product algorithm to compute all node and edge marginals.

    M-step: Using the node and edge marginals, compute
    θ^{(ℓ+1)} = {π̂^{(ℓ+1)}, Â^{(ℓ+1)}, η̂^{(ℓ+1)}} using equations (6), (7), and (8).
For the E-step, if we use the α-β forward-backward algorithm, the edge marginals can
be recovered using the following equation:

    P(x_t = x_t, x_{t+1} = x_{t+1} | y = y)
      ∝ m_{t−1→t}(x_t) φ_t(x_t) ψ_{t,t+1}(x_t, x_{t+1}) φ_{t+1}(x_{t+1}) m_{t+2→t+1}(x_{t+1})
      = m_{t−1→t}(x_t) p(y_t|x_t) p(x_{t+1}|x_t) p(y_{t+1}|x_{t+1}) m_{t+2→t+1}(x_{t+1})
      = [ Σ_{x_{t−1}} α_{t−1}(x_{t−1}) p(x_t|x_{t−1}) ] p(y_t|x_t) p(x_{t+1}|x_t) p(y_{t+1}|x_{t+1}) β_{t+1}(x_{t+1})
        (recall how the BP messages and the α, β messages relate)
      = α_t(x_t) p(x_{t+1}|x_t) p(y_{t+1}|x_{t+1}) β_{t+1}(x_{t+1}).

Hence, the edge marginals at iteration ℓ are of the form

    P(x_t = i, x_{t+1} = j | y = y; θ^{(ℓ)}) ∝ α_t(i) â_ij^{(ℓ)} η̂^{(ℓ)}_{j, y_{t+1}} β_{t+1}(j).    (9)

We end our discussion of Baum-Welch with some key takeaways. First, note the interplay
between inference and modeling: in the E-step, we're doing inference given our previous
parameter estimates to obtain quantities that help us improve the parameter estimates
in the M-step.

Second, we were able to recycle our ML estimation calculations from the fully observed
case. What we ended up introducing are essentially soft counts, e.g., the new m̄_ij is
the expected number of times we see state i followed by state j.

Third, by linearity of expectation, the expected log likelihood for the HMM only ended
up depending on expectations over nodes and edges, and as a result, we only needed to
compute the node and edge marginals of the distribution q^{(ℓ+1)} = p_{x|y}(·|y; θ^{(ℓ)}).
We never needed to compute the full joint distribution p_{x|y}(·|y; θ^{(ℓ)})!
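The sketch below ties the pieces together in Python: a scaled α-β forward-backward pass
for the E-step and the closed-form updates (6)-(8) for the M-step. The scaling scheme,
the random initialization, and the synthetic data are implementation choices made for
this sketch, not prescriptions from the notes.

```python
import numpy as np

# Baum-Welch (EM for a homogeneous HMM) with scaled forward-backward recursions.

rng = np.random.default_rng(0)

def forward_backward(pi, A, eta, y):
    T, M = len(y), len(pi)
    alpha = np.zeros((T, M)); beta = np.zeros((T, M)); c = np.zeros(T)
    alpha[0] = pi * eta[:, y[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * eta[:, y[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (eta[:, y[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                      # node marginals P(x_t = i | y)
    xi = np.zeros((T - 1, M, M))              # edge marginals P(x_t = i, x_{t+1} = j | y), cf. (9)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (eta[:, y[t + 1]] * beta[t + 1])[None, :] / c[t + 1]
    return gamma, xi

def baum_welch(y, M, n_iter=50):
    K = y.max() + 1                           # observation alphabet size
    pi = np.full(M, 1.0 / M)
    A = rng.random((M, M)); A /= A.sum(1, keepdims=True)
    eta = rng.random((M, K)); eta /= eta.sum(1, keepdims=True)
    for _ in range(n_iter):
        gamma, xi = forward_backward(pi, A, eta, y)            # E-step
        pi = gamma[0]                                          # update (6)
        A = xi.sum(0) / gamma[:-1].sum(0)[:, None]             # update (7)
        eta = np.zeros_like(eta)
        for j in range(K):
            eta[:, j] = gamma[y == j].sum(0)
        eta /= gamma.sum(0)[:, None]                           # update (8)
    return pi, A, eta

# Illustrative run on synthetic observations over the alphabet {0, 1, 2}:
y = rng.integers(0, 3, size=200)
pi, A, eta = baum_welch(y, M=2)
print(np.round(A, 3))
```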


23 Learning Structure in Directed Graphs

Last time, we looked at parameter estimation for a directed graphical model given its
graphical model structure. Today, we'll instead focus on learning the graphical model
structure for DAGs. As with the previous lecture, we provide both frequentist and
Bayesian perspectives.

We'll stick to a setup where we have N random variables x_1, x_2, . . . , x_N that are
nodes in a graphical model, where each x_i has alphabet size M. The number of possible
edges is thus (N choose 2) = N(N−1)/2, and since an edge is either present or not in a
graph, there are 2^{N(N−1)/2} possible graphs over the N nodes. Viewing each graph as
a different model, we treat learning graphical model structure as a model selection
problem. Of course, we'd be on a road to nowhere without any data, so we'll assume that
we have at our disposal K i.i.d. observations of x = (x_1, . . . , x_N). In particular,
each observation has N values and no random variable is hidden. The two main questions
that arise are:

• How do we score any particular model (i.e., graph) given data?
• How do we find a model with the highest score?

We present the frequentist perspective first before looking at the Bayesian perspective.

23.1 Frequentist Perspective: Likelihood Score

As a reminder, the frequentist perspective has the fundamental interpretation that the
parameters of interest, which in our case are the parameters θ_G of the graphical model
G, are deterministic but unknown. With this interpretation, it does not make sense to
talk about the parameter θ_G having a distribution, since it's deterministic!

Moving right along, note that for a fixed model (i.e., graph), evaluating the log
likelihood for the model at the ML estimate for the model gives what we'll refer to as
the likelihood score, which we could use to score a graph! In particular, it seems that
the likelihood score should be higher for more plausible graphs. Let's formalize this
idea. Recycling notation from the previous lecture, denote ℓ((G, θ_G); D) to be the log
likelihood of the graphical model (G, θ_G) evaluated on our observations D. Then a best
graph is one that is a solution to the following optimization problem:

    max_{G, θ_G} ℓ((G, θ_G); D) = max_G [ max_{θ_G} ℓ((G, θ_G); D) ] = max_G ℓ̂(G; D),

where the inner maximization involves ML estimation given the graph structure G, and
ℓ̂(G; D) ≜ ℓ((G, θ̂_G^{ML}); D) is the likelihood score.

We'll now look at what maximizing the likelihood score means for DAGs, for which we
will relate the likelihood score to mutual information and entropy. First, we recall
some definitions before forging on. The mutual information of two random variables u
and v is given by

    I(u; v) ≜ Σ_{u,v} p_{u,v}(u, v) log [ p_{u,v}(u, v) / (p_u(u) p_v(v)) ],

which is symmetric (i.e., I(u; v) = I(v; u)) and non-negative. The entropy of a random
variable u is given by

    H(u) ≜ −Σ_u p_u(u) log p_u(u) ≥ 0.
We'll use the following formula:

    H(u|v) = −Σ_{u,v} p_{u,v}(u, v) log p_{u|v}(u|v)
           = −Σ_{u,v} p_{u,v}(u, v) log [ p_{u,v}(u, v) / p_v(v) ]
           = −Σ_{u,v} p_{u,v}(u, v) log [ p_{u,v}(u, v) p_u(u) / (p_u(u) p_v(v)) ]
           = −Σ_{u,v} p_{u,v}(u, v) [ log ( p_{u,v}(u, v) / (p_u(u) p_v(v)) ) + log p_u(u) ]
           = −Σ_{u,v} p_{u,v}(u, v) log [ p_{u,v}(u, v) / (p_u(u) p_v(v)) ] − Σ_u p_u(u) log p_u(u)
           = −I(u; v) + H(u).                                                        (1)

Finally, whenever we are looking at mutual information or entropy that comes from
empirical distributions, we'll put on hats: Î and Ĥ.


For a DAG G with parameters θ_G, we have

    p_{x_1,...,x_N}(x_1, . . . , x_N; θ_G) = ∏_{i=1}^N p_{x_i | x_{π_i}}(x_i | x_{π_i}; θ_G^i),

where π_i denotes the set of parents of node i and θ_G^i denotes the parameters (i.e.,
table entries) of the conditional probability table p_{x_i | x_{π_i}}; note that which
nodes are parents of node i depends on the graph G. We can more explicitly write out
how the conditional probability table p_{x_i|x_{π_i}} relates to its parameters θ_G^i:

    p_{x_i | x_{π_i}}(x_i | x_{π_i}; θ_G^i) = [θ_G^i]_{x_i, x_{π_i}}.

Assuming that parameters from different tables are independent, we can build the
likelihood score from the likelihood scores of individual conditional probability
tables:

    ℓ̂(G; D) = Σ_{i=1}^N ℓ̂_i(G; D_i),                                                (2)

where D_i here refers to the data corresponding to the random variables {x_i, x_{π_i}}.
Recall that the ML estimate θ̂_G^i of θ_G^i is the empirical distribution:

    [θ̂_G^i]_{x_i, x_{π_i}} = p̂_{x_i | x_{π_i}}(x_i | x_{π_i}) = p̂_{x_i, x_{π_i}}(x_i, x_{π_i}) / p̂_{x_{π_i}}(x_{π_i}).

Furthermore,

    ℓ̂_i(G; D_i) = Σ_{a,b} p̂_{x_i, x_{π_i}}(a, b) log p̂_{x_i | x_{π_i}}(a|b)
                = −Ĥ(x_i | x_{π_i}) = Î(x_i; x_{π_i}) − Ĥ(x_i),                      (3)

where the last step uses equation (1). Putting together (2) and (3), we arrive at

    ℓ̂(G; D) = Σ_{i=1}^N ℓ̂_i(G; D_i) = Σ_{i=1}^N [ Î(x_i; x_{π_i}) − Ĥ(x_i) ]
             = Σ_{i=1}^N Î(x_i; x_{π_i}) − Σ_{i=1}^N Ĥ(x_i),

where the final sum of node entropies is independent of G. The main message here is
that given a topological ordering (i.e., a node ordering where parents occur before
children in the ordering), we can evaluate the likelihood score by summing up empirical
mutual information terms while subtracting off the node entropies. Note that all graphs
will have the same node entropies getting subtracted off! Hence, for comparing graphs,
we only need to compare the sum of empirical mutual information terms per graph.
Example 1. (Chow-Liu 1968) Suppose we restrict ourselves to optimizing over trees
across all N nodes. In this scenario, all models are equally complex in the sense that
they each have N − 1 edges. Our earlier analysis suggests that all we need to do to
find a tree with the highest likelihood score is to find one where the sum of the
empirical mutual information terms is largest. This is just a maximum spanning tree
problem!

In particular, we can just compute the empirical mutual information between every pair
of nodes and then run Kruskal's algorithm: we sort the empirical mutual information
values across all pairs. Then, starting from an empty (no edges) graph on the N nodes,
we iterate through the edges from largest mutual information value to smallest, adding
an edge if it does not introduce a cycle. We repeat this until we have N − 1 edges.

Care must be taken when assigning edge orientations to get a directed tree, since we do
not want any node having more than one parent, which would have required looking at an
empirical mutual information that involves more than two nodes. A code sketch of the
undirected tree-building step is given below.
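The following sketch implements the procedure just described (an illustration written
for these notes): compute pairwise empirical mutual information and grow a maximum
spanning tree greedily with a union-find cycle check. The synthetic data at the bottom
are fabricated for the example.

```python
import numpy as np
from itertools import combinations

def empirical_mi(D, i, j, M):
    """Empirical mutual information I_hat(x_i; x_j) from data D of shape (K, N)."""
    joint = np.zeros((M, M))
    for a, b in zip(D[:, i], D[:, j]):
        joint[a, b] += 1
    joint /= joint.sum()
    pi, pj = joint.sum(1), joint.sum(0)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / np.outer(pi, pj)[nz])).sum())

def chow_liu_tree(D, M):
    K, N = D.shape
    edges = sorted(((empirical_mi(D, i, j, M), i, j) for i, j in combinations(range(N), 2)),
                   reverse=True)                      # largest mutual information first
    parent = list(range(N))                           # union-find for cycle detection
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]; u = parent[u]
        return u
    tree = []
    for mi, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                                  # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j, mi))
        if len(tree) == N - 1:
            break
    return tree

# Illustrative binary data with a chain dependency x0 -> x1 -> x2:
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 2000)
x1 = (x0 ^ (rng.random(2000) < 0.1)).astype(int)
x2 = (x1 ^ (rng.random(2000) < 0.1)).astype(int)
print(chow_liu_tree(np.column_stack([x0, x1, x2]), M=2))
```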
Example 2. Our second example will reveal a critical pitfall of using the likelihood
score to evaluate the quality of a model. In particular, what we'll see is that the
likelihood score favors more complicated models!

Suppose we're choosing between the graphs G_0 (x and y disconnected) and G_1 (an edge
from x to y). Then

    ℓ̂(G_0; D) = −Ĥ(x) − Ĥ(y),
    ℓ̂(G_1; D) = Î(x; y) − Ĥ(x) − Ĥ(y).

Note that we have

    ℓ̂(G_1; D) − ℓ̂(G_0; D) = Î(x; y) ≥ 0,

which means that the more complicated model G_1 will always be at least as good as G_0.
Thus, it is "safe" to always prefer G_1. This phenomenon extends to larger models as
well, where the likelihood score will favor more complex models. We'll address this by
placing priors on parameters, bringing us into Bayesian statistics.

23.2 Bayesian Perspective: Bayesian Score

Unlike frequentists, Bayesians say that since we don't know the parameters θ_G of a
graph G, we might as well treat these parameters as random. In particular, we place a
prior distribution p(G) on the graph G and a prior p(θ_G | G) on the parameters θ_G
given the graph G. Then the quality of a graph can be quantified by its posterior
distribution evaluated on our data D. Applying Bayes' rule, the posterior is given by

    p(G|D) = p(D|G) p(G) / p(D).

Since the data D is observed and fixed, maximizing the above across all graphs is the
same as just maximizing the numerator; in particular, we'll maximize the log of the
numerator, which we'll call the Bayesian score ℓ_B:

    ℓ_B(G; D) = log p(D|G) + log p(G),                                              (4)

where

    p(D|G) = ∫ p(D|G, θ_G) p(θ_G|G) dθ_G                                            (5)

is called the marginal likelihood. Note, importantly, that we marginalize out the
random parameters θ_G.

It is possible to approximate the marginal likelihood using a Laplace approximation,
which is beyond the scope of this course and involves approximating the integrand of
(5) as a Gaussian. The end result is that we have

    p(D|G) ≈ p(D | θ̂_G^{ML}, G) · p(θ̂_G^{ML} | G) Δ_D,

where Δ_D is a width associated with the Gaussian and the factor p(θ̂_G^{ML} | G) Δ_D
is the Occam factor, which turns out to favor simpler models, alluding to Occam's razor.
23.2.1 Marginal Likelihoods with Unknown Distributions

We'll now study a graph with a single node, which may not seem particularly exciting,
but in fact looking at the marginal likelihood for this single-node graph and comparing
it with the likelihood score in an asymptotic regime will give us a good sense of how
the Bayesian score differs from the likelihood score.
Suppose random variable x takes on values in {1, 2, ..., M}. Then x is a single-trial
multinomial random variable (this single-trial case is also called a categorical
random variable) with parameters θ = (θ_1, ..., θ_M). We choose the prior p_θ(·; α) to be
the conjugate prior, i.e., θ is Dirichlet with parameters α = (α_1, ..., α_M). Then if we
have K i.i.d. observations x_1, ..., x_K of x, where we let K(m) denote the number of
times alphabet symbol m occurs, then
    p(\mathcal{D}; \alpha) = p_{x_1, \ldots, x_K}(x_1, \ldots, x_K)
    = \int p_\theta(\theta; \alpha) \prod_{k=1}^{K} p_{x_k \mid \theta}(x_k \mid \theta) \, d\theta
    = \int p_\theta(\theta; \alpha) \prod_{k=1}^{K} \prod_{m=1}^{M} \theta_m^{\mathbf{1}\{x_k = m\}} \, d\theta
    = \int p_\theta(\theta; \alpha) \prod_{m=1}^{M} \prod_{k=1}^{K} \theta_m^{\mathbf{1}\{x_k = m\}} \, d\theta
    = \int p_\theta(\theta; \alpha) \prod_{m=1}^{M} \theta_m^{K(m)} \, d\theta
    = \int \frac{\Gamma\!\left(\sum_{i=1}^{M} \alpha_i\right)}{\prod_{i=1}^{M} \Gamma(\alpha_i)} \prod_{i=1}^{M} \theta_i^{\alpha_i - 1} \prod_{m=1}^{M} \theta_m^{K(m)} \, d\theta
    = \frac{\Gamma\!\left(\sum_{i=1}^{M} \alpha_i\right)}{\prod_{i=1}^{M} \Gamma(\alpha_i)} \int \prod_{m=1}^{M} \theta_m^{\alpha_m + K(m) - 1} \, d\theta,    (6)

where Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt is the gamma function. Note that Γ(n) = (n − 1)! for
positive integer n, so it is a continuous extension of the factorial function. Next,
observe that the integral in (6) just evaluates to the partition function of a Dirichlet
distribution with parameters (α_1 + K(1), α_2 + K(2), ..., α_M + K(M)). Hence,
    \int \prod_{m=1}^{M} \theta_m^{\alpha_m + K(m) - 1} \, d\theta = \frac{\prod_{i=1}^{M} \Gamma(\alpha_i + K(i))}{\Gamma\!\left(\sum_{i=1}^{M} (\alpha_i + K(i))\right)} = \frac{\prod_{i=1}^{M} \Gamma(\alpha_i + K(i))}{\Gamma\!\left(\sum_{i=1}^{M} \alpha_i + K\right)}.    (7)

Stitching together (6) and (7), we get

    p(\mathcal{D}; \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{M} \alpha_i\right)}{\prod_{i=1}^{M} \Gamma(\alpha_i)} \cdot \frac{\prod_{i=1}^{M} \Gamma(\alpha_i + K(i))}{\Gamma\!\left(\sum_{i=1}^{M} \alpha_i + K\right)} = \frac{\Gamma\!\left(\sum_{i=1}^{M} \alpha_i\right)}{\Gamma\!\left(\sum_{i=1}^{M} \alpha_i + K\right)} \prod_{i=1}^{M} \frac{\Gamma(\alpha_i + K(i))}{\Gamma(\alpha_i)}.    (8)
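
Equation (8) is convenient to evaluate in log space with the log-gamma function. Below is a small Python sketch (not from the notes; the count vector and pseudocounts are illustrative).

    # Log marginal likelihood of i.i.d. categorical observations under a Dirichlet
    # prior, i.e. the log of equation (8), computed stably via log-gamma.
    import numpy as np
    from scipy.special import gammaln

    def log_marginal_likelihood(counts, alpha):
        counts = np.asarray(counts, dtype=float)   # K(m) for m = 1, ..., M
        alpha = np.asarray(alpha, dtype=float)     # Dirichlet hyperparameters
        K = counts.sum()
        return (gammaln(alpha.sum()) - gammaln(alpha.sum() + K)
                + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

    # With M = 2 and alpha = (1, 1) this reduces to log(1 / ((K + 1) * C(K, K(1)))),
    # matching equation (9) below.
    print(log_marginal_likelihood([30, 70], [1.0, 1.0]))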
We'll now see what happens with asymptotics. For simplicity, we shall assume M = 2
and α_1 = α_2 = 1. Effectively this means that x is Bernoulli, except that we'll let it
take on values in {1, 2} instead of {0, 1}. Then using equation (8),
    p(\mathcal{D}; \alpha) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)} \cdot \frac{\Gamma(\alpha_1 + K(1))\,\Gamma(\alpha_2 + K(2))}{\Gamma(\alpha_1 + \alpha_2 + K)}
    = \frac{\Gamma(2)}{\Gamma(1)\,\Gamma(1)} \cdot \frac{\Gamma(1 + K(1))\,\Gamma(1 + K(2))}{\Gamma(2 + K)}
    = \frac{K(1)!\,K(2)!}{(K + 1)!}
    = \frac{K(1)!\,K(2)!}{(K + 1)\,K!}
    = \frac{1}{K + 1} \cdot \frac{1}{K!/(K(1)!\,K(2)!)}
    = \frac{1}{K + 1} \binom{K}{K(1)}^{-1}.    (9)

We shall use Stirling's approximation of the binomial coefficient:

    \binom{n}{m} \approx e^{n H(\mathrm{Ber}(m/n))} \sqrt{\frac{n}{2\pi m (n - m)}},    (10)

where H(Ber(m/n)) denotes the entropy of a Bernoulli random variable with parameter
m/n. Combining (9) and (10) results in

    p(\mathcal{D}; \alpha) = \frac{1}{K + 1} \binom{K}{K(1)}^{-1}
    \approx \frac{1}{K + 1} \sqrt{\frac{2\pi K(1) K(2)}{K}} \, e^{-K H(\mathrm{Ber}(K(1)/K))}
    = \frac{1}{K + 1} \sqrt{\frac{2\pi K(1) K(2)}{K}} \, e^{-K \hat{H}(x)}.

Taking the log of both sides yields the approximate Bayesian score, where we note
that there is no p(G) term since there's only one possible graph:

    \ell_B(\mathcal{D}) = \log p(\mathcal{D}; \alpha) \approx -K \hat{H}(x) - \log(K + 1) + \frac{1}{2} \log\!\big(2\pi K(1) K(2)\big) - \frac{1}{2} \log K.    (11)
In contrast, the likelihood score is

    \hat{\ell}(\mathcal{D}) = \log p(\mathcal{D}; \hat{\theta}_{\mathrm{ML}})
    = \log\!\left\{ \big(\hat{\theta}_{1}^{\mathrm{ML}}\big)^{K(1)} \big(\hat{\theta}_{2}^{\mathrm{ML}}\big)^{K(2)} \right\}
    = \log\!\left[ \left(\frac{K(1)}{K}\right)^{K(1)} \left(\frac{K(2)}{K}\right)^{K(2)} \right]
    = K(1) \log\frac{K(1)}{K} + K(2) \log\frac{K(2)}{K}
    = K \left[ \frac{K(1)}{K} \log\frac{K(1)}{K} + \frac{K(2)}{K} \log\frac{K(2)}{K} \right]
    = -K \hat{H}(x).    (12)

Comparing (11) and (12), we see that the Bayesian score incurs an extra term of
order log K, which essentially penalizes a model when there is insufficient data. In fact, a
more general result holds, which we'll mention at the end of the lecture: the Bayesian
score asymptotically penalizes model complexity when we have a large number of
samples K.
23.2.2 Marginal Likelihoods with DAGs

Let's return to Example 2 considered in the likelihood score setup and see how the
Bayesian score differs. Assuming that p(G0) = p(G1), comparing the Bayesian
score (4) just involves comparing the marginal likelihoods p(D|G). Note that G0
has parameters θ_x, θ_y ∈ [0, 1]^M, whereas G1 has parameters θ_x ∈ [0, 1]^M and θ_y =
(θ_{y|1}, θ_{y|2}, ..., θ_{y|M}) with each θ_{y|m} ∈ [0, 1]^M. Of course, each length-M parameter
vector describing a distribution must sum to 1.
The marginal likelihood for G0 is given by

    p(\mathcal{D} \mid G_0) = \int\!\!\int p(\theta_x, \theta_y \mid G_0)\, p(\mathcal{D} \mid \theta_x, \theta_y, G_0)\, d\theta_x\, d\theta_y
    = \int\!\!\int p(\theta_x \mid G_0)\, p(\theta_y \mid G_0) \prod_{k=1}^{K} p(x_k \mid \theta_x, G_0)\, p(y_k \mid \theta_y, G_0)\, d\theta_x\, d\theta_y
    = \left( \int p(\theta_x \mid G_0) \prod_{k=1}^{K} p(x_k \mid \theta_x, G_0)\, d\theta_x \right) \left( \int p(\theta_y \mid G_0) \prod_{k=1}^{K} p(y_k \mid \theta_y, G_0)\, d\theta_y \right).

Meanwhile, the marginal likelihood for G1 is given by

    p(\mathcal{D} \mid G_1) = \int\!\!\int p(\theta_x, \theta_y \mid G_1)\, p(\mathcal{D} \mid \theta_x, \theta_y, G_1)\, d\theta_x\, d\theta_y
    = \int\!\!\int p(\theta_x \mid G_1) \left( \prod_{m=1}^{M} p(\theta_{y|m} \mid G_1) \right) \left( \prod_{k=1}^{K} p(x_k \mid \theta_x, G_1)\, p(y_k \mid x_k, \theta_{y|x_k}, G_1) \right) d\theta_x\, d\theta_y
    = \int\!\!\int p(\theta_x \mid G_1) \left( \prod_{k=1}^{K} p(x_k \mid \theta_x, G_1) \right) \left( \prod_{m=1}^{M} p(\theta_{y|m} \mid G_1) \right) \prod_{k=1}^{K} p(y_k \mid x_k, \theta_{y|x_k}, G_1)\, d\theta_x\, d\theta_y
    = \int\!\!\int p(\theta_x \mid G_1) \left( \prod_{k=1}^{K} p(x_k \mid \theta_x, G_1) \right) \left( \prod_{m=1}^{M} p(\theta_{y|m} \mid G_1) \right) \prod_{m=1}^{M} \prod_{k=1}^{K} p(y_k \mid x_k, \theta_{y|x_k}, G_1)^{\mathbf{1}\{x_k = m\}}\, d\theta_x\, d\theta_y
    = \left( \int p(\theta_x \mid G_1) \prod_{k=1}^{K} p(x_k \mid \theta_x, G_1)\, d\theta_x \right) \prod_{m=1}^{M} \int p(\theta_{y|m} \mid G_1) \prod_{k=1}^{K} p(y_k \mid x_k, \theta_{y|x_k}, G_1)^{\mathbf{1}\{x_k = m\}}\, d\theta_{y|m}.

If we choose the hyperparameters (which are the Dirichlet parameters α) of θ_x to be the
same across the two models G0 and G1, then the first factors involving an integral
over θ_x in the marginal likelihoods p(D|G0) and p(D|G1) are the same, so we just need
to compare the second factors; whichever is larger dictates the model favored by the
Bayesian score:

    \int p(\theta_y \mid G_0) \prod_{k=1}^{K} p(y_k \mid \theta_y, G_0)\, d\theta_y
    \quad \text{versus} \quad
    \prod_{m=1}^{M} \int p(\theta_{y|m} \mid G_1) \prod_{k=1}^{K} p(y_k \mid x_k, \theta_{y|x_k}, G_1)^{\mathbf{1}\{x_k = m\}}\, d\theta_{y|m}.

What will happen is that when we have only a little data, we will tend to prefer G0 ,
and when we have more data, we will prefer G1 provided that x and y actually are
correlated. The transition point at which we switch between favoring G0 and G1
depends on the strength of the correlation between x and y .
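
To make this concrete, here is a hedged Python sketch (not part of the notes) that compares the two y-factors above under Dirichlet(1, ..., 1) priors, on synthetic data in which y copies x most of the time; the data-generating process and sample sizes are made up for illustration.

    # Bayesian comparison of G0 (x and y independent) versus G1 (x -> y): only the
    # y-factors of the marginal likelihoods differ, as argued above.
    import numpy as np
    from scipy.special import gammaln

    def log_dirichlet_multinomial(counts, alpha=1.0):
        counts = np.asarray(counts, dtype=float)
        a = np.full_like(counts, alpha)
        return (gammaln(a.sum()) - gammaln(a.sum() + counts.sum())
                + np.sum(gammaln(a + counts) - gammaln(a)))

    def y_factor_scores(x, y, M):
        score_G0 = log_dirichlet_multinomial(np.bincount(y, minlength=M))   # one CPD for y
        score_G1 = sum(log_dirichlet_multinomial(np.bincount(y[x == m], minlength=M))
                       for m in range(M))                                   # one CPD per value of x
        return score_G0, score_G1

    rng = np.random.default_rng(0)
    M = 2
    for K in [10, 100, 1000]:
        x = rng.integers(0, M, size=K)
        y = np.where(rng.random(K) < 0.7, x, rng.integers(0, M, size=K))    # y correlated with x
        g0, g1 = y_factor_scores(x, y, M)
        print(K, "favors", "G1" if g1 > g0 else "G0")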
23.2.3 Approximations for large K

For large K, there is a general result that the Bayesian score satisfies

    \ell_B(G; \mathcal{D}) = \underbrace{\hat{\ell}(G; \mathcal{D}) - \frac{\log K}{2} \dim(G)}_{\text{Bayesian Information Criterion (BIC)}} + O(1),

where dim(G) is the number of independent parameters in model G and the trailing O(1)
term does not depend on K. Note that ℓ̂(G; D) is the likelihood score and scales
linearly with K (a specific example that showed this is equation (12)). Importantly,
the second term, which comes from the Occam factor, penalizes complex models.
However, as the number of observations K → ∞, since the likelihood score grows
linearly in K, it will dominate the second term, pushing us toward the frequentist
solution and finding the correct model, provided that the correct model has nonzero
probability under our prior p(G).
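
As a small illustration (not from the notes), the sketch below evaluates the BIC approximation for the single-node categorical example of Section 23.2.1, where the likelihood score is −K Ĥ(x) and dim(G) = M − 1; the counts are made up.

    # BIC approximation to the Bayesian score for a single categorical variable.
    import numpy as np

    def bic_score(counts):
        counts = np.asarray(counts, dtype=float)
        K = counts.sum()
        p_hat = counts / K
        loglik = np.sum(counts[counts > 0] * np.log(p_hat[counts > 0]))   # = -K * H_hat(x)
        dim_G = len(counts) - 1          # independent parameters of the categorical model
        return loglik - 0.5 * np.log(K) * dim_G

    print(bic_score([30, 70]))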


24 Learning Exponential Family Models

So far, in our discussion of learning, we have focused on discrete variables from finite
alphabets. We derived a convenient form for the MAP estimate in the case of DAGs:

    \theta_{i, \pi_i} = p_{x_i \mid x_{\pi_i}}(\cdot \mid \cdot) = \hat{p}_{x_i \mid x_{\pi_i}}(\cdot \mid \cdot).    (1)

In other words, we chose the CPD entries such that the model distribution p
matches the empirical distribution p̂. We will see that this is a special case of a more
general result which holds for many families of distributions which take a particular
form. However, we can't apply (1) directly to continuous random variables, because
the empirical distribution p̂ is a mixture of delta functions, whereas we ultimately
want a continuous distribution p. Instead, we must choose p so as to match certain
statistics of the empirical distribution.

24.1 Gaussian Parameter Estimation

To build intuition, we begin with the special case of Gaussian distributions. Suppose
we are given an i.i.d. sample x_1, ..., x_K from a Gaussian distribution p(·; μ, σ²).
Then, by definition,

    \mu = E_p[x], \qquad \sigma^2 = E_p[(x - \mu)^2].

This suggests setting the parameters equal to the corresponding empirical statistics.
In fact, it is straightforward to check, by setting the gradient to zero, that the maximum
likelihood parameter estimates are¹

    \hat{\mu}_{\mathrm{ML}} = \frac{1}{K} \sum_{k=1}^{K} x_k, \qquad \hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{K} \sum_{k=1}^{K} (x_k - \hat{\mu}_{\mathrm{ML}})^2.

¹In introductory statistics classes, we are often told to divide by K − 1 rather than K when
estimating the variance of a Gaussian. This is in order that the resulting estimate be
unbiased. The maximum likelihood estimate, as it turns out, is biased. This is not necessarily a bad
thing, because the biased estimate also has lower variance, i.e., there is a tradeoff between bias and
variance.

The corresponding ML estimates for a multivariate Gaussian are given by:

    \hat{\mu}_{\mathrm{ML}} = \frac{1}{K} \sum_{k=1}^{K} x_k, \qquad \hat{\Sigma}_{\mathrm{ML}} = \frac{1}{K} \sum_{k=1}^{K} (x_k - \hat{\mu}_{\mathrm{ML}})(x_k - \hat{\mu}_{\mathrm{ML}})^{T}.
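
A minimal numpy sketch of these estimators follows (the synthetic data are illustrative); note the divide-by-K covariance.

    # ML estimates for a multivariate Gaussian: sample mean and the (biased,
    # divide-by-K) sample covariance, exactly as in the formulas above.
    import numpy as np

    def gaussian_ml(X):
        """X is a (K samples) x (N dimensions) array."""
        mu = X.mean(axis=0)
        centered = X - mu
        Sigma = centered.T @ centered / X.shape[0]   # divide by K, not K - 1
        return mu, Sigma

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0.0, 1.0], [[2.0, 0.5], [0.5, 1.0]], size=500)
    mu_hat, Sigma_hat = gaussian_ml(X)
    print(mu_hat)
    print(Sigma_hat)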

24.1.1 ML Parameter Estimation in Gaussian DAGs

We saw that in a Gaussian DAG, the CPDs take the form

    p(x_i \mid x_{\pi_i}) = N(\beta_0 + \beta_1 u_1 + \cdots + \beta_L u_L;\ \sigma^2),

where x_{π_i} = (u_1, ..., u_L). We can rewrite this in innovation form:

    x_i = \beta_0 + \beta_1 u_1 + \cdots + \beta_L u_L + w,

where w ~ N(0, σ²) is independent of the u_l's. This representation highlights the
relationship with linear least squares estimation, and shows that β̂_ML is the solution
to a set of L + 1 linear equations.
The parameters of this CPD are θ = (β_0, ..., β_L, σ²). When we set the gradient
of the log-likelihood to zero, we get:

    \hat{\beta}_0 = \hat{\mu}_x - \hat{\Sigma}_{xu} \hat{\Sigma}_{uu}^{-1} \hat{\mu}_u
    \hat{\beta} = \hat{\Sigma}_{uu}^{-1} \hat{\Sigma}_{ux}
    \hat{\sigma}^2 = \hat{\Sigma}_{xx} - \hat{\Sigma}_{xu} \hat{\Sigma}_{uu}^{-1} \hat{\Sigma}_{ux},
where

    \hat{\mu}_x = \frac{1}{K} \sum_{k=1}^{K} x_k
    \hat{\Sigma}_{xx} = \frac{1}{K} \sum_{k=1}^{K} (x_k - \hat{\mu}_x)^2
    \hat{\Sigma}_{xu} = \frac{1}{K} \sum_{k=1}^{K} (x_k - \hat{\mu}_x)(u_k - \hat{\mu}_u)^T
    \hat{\Sigma}_{uu} = \frac{1}{K} \sum_{k=1}^{K} (u_k - \hat{\mu}_u)(u_k - \hat{\mu}_u)^T.

Recall, however, that if p(x | u) = N(β_0 + β^T u, σ_0²), then

    \beta_0 = \mu_x - \Sigma_{xu} \Sigma_{uu}^{-1} \mu_u
    \beta = \Sigma_{uu}^{-1} \Sigma_{ux}
    \sigma_0^2 = \Sigma_{xx} - \Sigma_{xu} \Sigma_{uu}^{-1} \Sigma_{ux}.

In other words, ML in Gaussian DAGs is another example of moment matching.
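
Here is a short, hypothetical Python sketch that fits a linear-Gaussian CPD by plugging the empirical moments into the formulas above; the synthetic parents and coefficients are made up.

    # ML fit of a linear-Gaussian CPD p(x | u) = N(beta0 + beta^T u, sigma^2)
    # using the empirical moments, i.e. moment matching.
    import numpy as np

    def fit_linear_gaussian_cpd(x, U):
        """x: length-K array of the child; U: (K x L) array of parent values."""
        K = len(x)
        mu_x, mu_u = x.mean(), U.mean(axis=0)
        xc, Uc = x - mu_x, U - mu_u
        Sigma_uu = Uc.T @ Uc / K
        Sigma_ux = Uc.T @ xc / K                     # Sigma_xu is its transpose
        Sigma_xx = xc @ xc / K
        beta = np.linalg.solve(Sigma_uu, Sigma_ux)   # beta = Sigma_uu^{-1} Sigma_ux
        beta0 = mu_x - beta @ mu_u                   # beta0 = mu_x - Sigma_xu Sigma_uu^{-1} mu_u
        sigma2 = Sigma_xx - beta @ Sigma_ux          # sigma^2 = Sigma_xx - Sigma_xu Sigma_uu^{-1} Sigma_ux
        return beta0, beta, sigma2

    rng = np.random.default_rng(0)
    U = rng.normal(size=(1000, 2))
    x = 1.0 + U @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=1000)
    print(fit_linear_gaussian_cpd(x, U))   # roughly (1.0, [2.0, -1.0], 0.25)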
24.1.2 Bayesian Parameter Estimation in Gaussian DAGs

In our discussion of discrete DAGs, we derived the ML estimates, and then derived
Bayesian parameter estimates in order to reduce variance. Now we do the equivalent
for Gaussian DAGs. Observe that the likelihood function for univariate Gaussians
takes the form

    p(\mathcal{D}; \mu, \sigma^2) \propto \frac{1}{\sigma^K} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{k=1}^{K} (x_k - \mu)^2 \right)
    = \frac{1}{\sigma^K} \exp\!\left( -\frac{1}{2\sigma^2} \left[ \sum_{k=1}^{K} x_k^2 - 2\mu \sum_{k=1}^{K} x_k + K\mu^2 \right] \right).    (2)

As in the discrete case, we use this functional form to derive conjugate priors. First,
if σ² is known, we see the conjugate prior takes the form

    p(\mu) \propto \exp\!\left(a\mu - b\mu^2\right).

In other words, the conjugate prior for the mean parameter of a Gaussian with known
variance is a Gaussian.
What if the mean is known and we want a prior over the variance? We simply
take the functional form of (2), but with respect to σ this time, to find a conjugate
prior

    p(\sigma) \propto \sigma^{-a} \exp\!\left(-\frac{b}{\sigma^2}\right).

This distribution has a more convenient form when we rewrite it in terms of the precision
λ = 1/σ². Then, we see that the conjugate prior for λ is a gamma distribution:

    p(\lambda) \propto \lambda^{\alpha - 1} e^{-\beta\lambda}.

The corresponding prior over σ² is known as an inverse gamma distribution. When
both μ and λ are unknown, we get what is called a Gaussian-Gamma prior:

    p(\mu, \lambda) = N\!\left(\mu;\ a, (b\lambda)^{-1}\right)\, \mathrm{Gamma}(\lambda;\ \alpha, \beta).

Analogous results hold for the multivariate Gaussian case. The conjugate prior for μ
with known J is a multivariate Gaussian. The conjugate prior for J with known μ
is a matrix analogue of the gamma distribution called the Wishart distribution. When
both μ and J are unknown, the conjugate prior is a Gaussian-Wishart distribution.
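
As a small example of the first of these conjugacy results, the sketch below (hyperparameters and data are illustrative) performs the standard posterior update for the mean of a Gaussian with known variance; the posterior over the mean is again Gaussian.

    # Conjugate update for the mean of a Gaussian with known variance sigma2:
    # a N(m0, v0) prior on mu yields a Gaussian posterior after K observations.
    import numpy as np

    def posterior_mean_update(x, sigma2, m0, v0):
        K = len(x)
        v_post = 1.0 / (1.0 / v0 + K / sigma2)           # posterior variance
        m_post = v_post * (m0 / v0 + x.sum() / sigma2)   # posterior mean
        return m_post, v_post

    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=1.0, size=50)          # sigma2 = 1 assumed known
    print(posterior_mean_update(x, sigma2=1.0, m0=0.0, v0=10.0))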

24.2 Linear Exponential Families

We've seen three different examples of maximum likelihood estimation which led to
similar-looking expectation matching criteria: CPDs of discrete DAGs, potentials in
undirected graphs, and Gaussian distributions. These three examples are all special
cases of a very general class of probabilistic models called exponential families. A
family of distributions is an exponential family if it can be written in the form

    p(x; \theta) = \frac{1}{Z(\theta)} \exp\!\left( \sum_{i=1}^{k} \theta_i f_i(x) \right),

where x is an N-dimensional vector and θ is a k-dimensional vector. The functions f_i
are called features, or sufficient statistics, because they are sufficient for estimating the
parameters. When the family of distributions is written in this form, the parameters θ
are known as the natural parameters.
Let's consider some examples.
1. A multinomial distribution can be written as an exponential family with features
f_{a_0}(x) = 1_{x = a_0}, and the natural parameters are θ_{a_0} = ln p(a_0).
2. In an undirected graphical model, the distribution can be written as:

    p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)
    = \frac{1}{Z} \exp\!\left( \sum_{C \in \mathcal{C}} \ln \psi_C(x_C) \right)
    = \frac{1}{Z} \exp\!\left( \sum_{C \in \mathcal{C}} \sum_{x'_C \in \mathcal{X}^{|C|}} \ln \psi_C(x'_C)\, \mathbf{1}_{x_C = x'_C} \right).

This is an exponential family representation where the sufficient statistics correspond
to indicator variables for each clique and each joint assignment to the
variables in that clique:

    f_{C, x'_C}(x) = \mathbf{1}_{x_C = x'_C},

and the natural parameters θ_{C, x'_C} correspond to the log potentials ln ψ_C(x'_C).
Observe that this is the same parameterization of undirected graphical models
which we used to derive the tree-reweighted belief propagation algorithm in our
discussion of variational inference.
3. If the variables x = (x_1, ..., x_N) are jointly Gaussian, the joint PDF is given by

    p(x) \propto \exp\!\left( -\frac{1}{2} \sum_{i,j} J_{ij} x_i x_j + \sum_i h_i x_i \right).

This is an exponential family with sufficient statistics

    f(x) = \left(x_1, \ldots, x_N,\ x_1^2,\ x_1 x_2,\ \ldots,\ x_N^2\right)^T

and natural parameters

    \theta = \left(h_1, h_2, \ldots, h_N,\ -\tfrac{1}{2} J_{11},\ -\tfrac{1}{2} J_{12},\ \ldots,\ -\tfrac{1}{2} J_{NN}\right)^T.

4. We might be tempted to conclude from this that every family of distributions is
an exponential family. However, in fact, not all families are. As a simple example,
even the family of Laplacian distributions with scale parameter 1,

    p(x; \mu) = \frac{1}{Z} e^{-|x - \mu|},

is not an exponential family.


24.2.1 ML Estimation in Linear Exponential Families

Suppose we are given i.i.d. samples D = {x_1, ..., x_K} from a discrete exponential family
p(x; θ) = (1/Z(θ)) exp(θ^T f(x)). As usual, we compute the gradient of the log-likelihood:

    \frac{\partial}{\partial \theta_i} \frac{1}{K} \ell(\theta; \mathcal{D}) = \frac{1}{K} \sum_{k=1}^{K} \frac{\partial}{\partial \theta_i} \theta^T f(x_k) - \frac{\partial}{\partial \theta_i} \ln Z(\theta)
    = \frac{1}{K} \sum_{k=1}^{K} f_i(x_k) - \frac{\partial}{\partial \theta_i} \ln \sum_x \exp\!\left(\theta^T f(x)\right)
    = \hat{E}_{\mathcal{D}}[f_i(x)] - E_{\theta}[f_i(x)].

The derivation in the continuous case is identical, except that the partition function
expands to an integral rather than a sum. This shows that in any exponential family,
the ML parameter estimates correspond to moment matching, i.e., they match the
empirical expectations of the sufficient statistics:

    \hat{E}_{\mathcal{D}}[f_i(x)] = E_{\hat{\theta}}[f_i(x)].
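
A small sketch of this moment-matching characterization (not from the notes): fit the natural parameters of a toy discrete exponential family by gradient ascent, where each step adds the gap between the empirical and model expected sufficient statistics. The support, feature map, and data are illustrative.

    # Gradient ascent on the average log-likelihood of p(x; theta) ∝ exp(theta^T f(x))
    # over a small explicit support; the gradient is E_D[f(x)] - E_theta[f(x)].
    import numpy as np

    support = np.array([0, 1, 2, 3])                     # values x can take
    F = np.stack([support, support ** 2], axis=1)        # features f(x) = (x, x^2)

    def model_expected_features(theta):
        logits = F @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return p @ F                                     # E_theta[f(x)]

    def fit(data, steps=5000, lr=0.05):
        emp = F[data].mean(axis=0)                       # empirical E_D[f(x)]
        theta = np.zeros(2)
        for _ in range(steps):
            theta += lr * (emp - model_expected_features(theta))
        return theta

    rng = np.random.default_rng(0)
    data = rng.choice(len(support), size=500, p=[0.1, 0.2, 0.3, 0.4])
    theta_hat = fit(data)
    print(model_expected_features(theta_hat), F[data].mean(axis=0))   # approximately equal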
Interestingly, in our discussion of Gaussians, we found the ML estimates by taking
derivatives with respect to the information form parameters, and we wound up with
ML solutions in terms of the covariance form. Our derivation here shows that this
phenomenon applies to all exponential family models, not just Gaussians. We can
summarize these analogous representations in a table:

                                  natural parameters    expected sufficient statistics
    Gaussian distribution         information form      covariance form
    multinomial distribution      log odds              probability table
    undirected graphical model    log potentials        clique marginals
24.2.2 Maximum Entropy Interpretation

In the last section, we started with a parametric form of a distribution, maximized the
data likelihood, and wound up with a constraint on the expected sufficient statistics.
Interestingly, we can arrive at the same solution from the opposite direction: starting
with constraints on the expected sufficient statistics, we choose a distribution which
maximizes the entropy subject to those constraints, and it turns out to have that same
parametric form. Intuitively, the entropy of a distribution is a measure of how spread
out, or uncertain, it is. If all we know about some distribution is the expectations
of certain statistics f_i(x), it would make sense to choose the distribution with as
little commitment as possible, i.e., the one that is most uncertain, subject to those
constraints. This suggests maximizing the entropy subject to the moment constraints:

    \max_p\ H(p)
    \text{subject to} \quad E_p[f_i(x)] = \hat{E}_{\mathcal{D}}[f_i(x)] \ \text{ for all } i, \qquad \sum_x p(x) = 1.

For simplicity, assume p is a discrete distribution. To solve this optimization
problem, we write out the Lagrangian:

    L(p) = H(p) + \sum_{i=1}^{k} \lambda_i \left( E_p[f_i(x)] - \hat{E}_{\mathcal{D}}[f_i(x)] \right) + \gamma \left( \sum_x p(x) - 1 \right)
    = -\sum_x p(x) \ln p(x) + \sum_{i=1}^{k} \lambda_i \left( \sum_x p(x) f_i(x) - \hat{E}_{\mathcal{D}}[f_i(x)] \right) + \gamma \left( \sum_x p(x) - 1 \right).

We take the derivative with respect to p(x):

    \frac{\partial L}{\partial p(x)} = -\ln p(x) - 1 + \sum_{i=1}^{k} \lambda_i f_i(x) + \gamma.

Setting this to zero, we find that

    p(x) \propto \exp\!\left( \sum_{i=1}^{k} \lambda_i f_i(x) \right).

This is simply the definition of an exponential family with sufficient statistics f_i.


This shows a striking and philosophically interesting equivalence between exponential
families and the principle of maximum entropy.
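
As a toy illustration of this equivalence (the finite support and the single mean constraint below are made up), the maximum-entropy distribution subject to a constraint on E[x] has the exponential-family form p(x) ∝ exp(λx), and the natural parameter λ can be found with a one-dimensional root find on the moment constraint.

    # Maximum entropy on {0, 1, 2, 3} subject to E[x] = 2.2: the solution is
    # p(x) = exp(lam * x) / Z, with lam chosen so that the moment constraint holds.
    import numpy as np
    from scipy.optimize import brentq

    support = np.arange(4)
    target_mean = 2.2

    def mean_under(lam):
        w = np.exp(lam * support)
        return (w / w.sum()) @ support

    lam = brentq(lambda l: mean_under(l) - target_mean, -50.0, 50.0)
    p = np.exp(lam * support)
    p /= p.sum()
    print(lam, p, p @ support)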

MIT OpenCourseWare
http://ocw.mit.edu

6.438 Algorithms for Inference


Fall 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
