
Conditional Expectation and Martingales

Palash Sarkar
Applied Statistics Unit
Indian Statistical Institute
203, B.T. Road, Kolkata
INDIA 700 108
e-mail: palash@isical.ac.in
1 Introduction
This is a set of very hurriedly compiled notes. It has not been proofread and is likely to contain some
errors. These, though, should be minor and may be cleared up by referring to the relevant texts. The
texts that have been used in the preparation of the notes are Feller; Grimmett and Stirzaker; Goswami
and Rao; David Williams; and Mitzenmacher and Upfal.
The purpose behind the notes is to provide an idea of conditional expectations and martingales while
keeping within the ambit of discrete probability. In the future, these notes may be expanded into a more
refined set of notes which will also include the more general definitions of conditional expectations and
martingales.
These notes have been prepared to supplement two one-and-a-half-hour lectures on discrete probability
that were delivered at a workshop on combinatorics at the Applied Statistics Unit of the Indian
Statistical Institute on 17 and 18 February, 2011.
2 A Recapitulation of Discrete Probability
The sample space is a countable set {x_1, x_2, x_3, . . .} with a non-negative number p_i associated to x_i such that

    p_1 + p_2 + p_3 + ··· = 1.

This will also be written as p_i = p(x_i). The x_i's are called points and subsets of the sample space are called
events. If A is an event, then its probability is defined as

    Pr[A] = ∑_{x∈A} p(x).
If A_1 and A_2 are disjoint events, i.e., A_1 ∩ A_2 = ∅, then it is easy to see that

    Pr[A_1 ∪ A_2] = Pr[A_1] + Pr[A_2].

For two arbitrary events (i.e., not necessarily disjoint) A_1 and A_2, the following result is also quite easy to
verify.

    Pr[A_1 ∪ A_2] = Pr[A_1] + Pr[A_2] − Pr[A_1 ∩ A_2]
                  ≤ Pr[A_1] + Pr[A_2].

The first row (i.e., the equality) extends using the principle of inclusion and exclusion. On the other hand,
the second row (i.e., the inequality) extends directly by induction to give the following bound, which is
called the union bound.

    Pr[A_1 ∪ A_2 ∪ ··· ∪ A_k] ≤ Pr[A_1] + Pr[A_2] + ··· + Pr[A_k].
Two events A and B are said to be independent if

    Pr[A ∩ B] = Pr[A] Pr[B].

Given n events A_1, A_2, . . . , A_n, these are said to be k-wise independent if, for any 1 ≤ j ≤ k and
{i_1, . . . , i_j} ⊆ {1, . . . , n},

    Pr[A_{i_1} ∩ ··· ∩ A_{i_j}] = Pr[A_{i_1}] ··· Pr[A_{i_j}].

If k = 2, then the events are said to be pairwise independent, while if k = n, then the events are said to
be mutually independent. Intuitively, as k increases from 2 to n, the amount of randomness in the
sequence A_1, . . . , A_n increases. In many cases, it is of interest to investigate how much can be achieved
with pairwise independence.
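To see the gap between pairwise and mutual independence concretely, the following small Python sketch (an illustration, not part of the original notes) checks the standard example: two independent fair bits, with A_1 = {first bit is 1}, A_2 = {second bit is 1} and A_3 = {XOR of the bits is 1}. The three events are pairwise independent but not mutually independent.

    from itertools import product

    # Sample space: two independent fair bits; every point has probability 1/4.
    points = list(product([0, 1], repeat=2))
    prob = {p: 0.25 for p in points}

    A1 = {p for p in points if p[0] == 1}          # first bit is 1
    A2 = {p for p in points if p[1] == 1}          # second bit is 1
    A3 = {p for p in points if p[0] ^ p[1] == 1}   # XOR of the two bits is 1

    def Pr(event):
        return sum(prob[p] for p in event)

    # Every pair of events multiplies correctly ...
    for E, F in [(A1, A2), (A1, A3), (A2, A3)]:
        assert abs(Pr(E & F) - Pr(E) * Pr(F)) < 1e-12

    # ... but the triple does not: Pr[A1 and A2 and A3] = 0, while the product is 1/8.
    print(Pr(A1 & A2 & A3), Pr(A1) * Pr(A2) * Pr(A3))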
The notion of independence is of great importance. In fact, it is this multiplicative feature which, combined
with additivity over disjoint events, gives probability theory many of its surprising results.
Let A be an event with Pr[A] > 0. Let B be another event. The probability of B conditioned on A (also
stated as the conditional probability of B given A) is denoted by Pr[B|A] and is defined in the following
manner.

    Pr[B|A] = Pr[A ∩ B] / Pr[A].

For the sake of convenience, A ∩ B is also written as AB. If A and B are independent events, then
Pr[B|A] = Pr[B]. (Note that Pr[A] has to be positive for Pr[B|A] to be defined.)
The chain rule for conditional probabilities is the following relation between two events A and B.

    Pr[A ∩ B] = Pr[B|A] Pr[A].

If Pr[A] > 0, then this follows from the definition of conditional probability. If, on the other hand, Pr[A] = 0,
then A is a null event (for the discrete case A = ∅, but, for the more general definition of probability, A
may be non-empty and still have Pr[A] = 0) and A ∩ B is a subset of A which is also a null event, so that
Pr[A ∩ B] = 0. So, again both sides are equal. This relation easily generalises to more than two events.
    Pr[A_1 ∩ A_2 ∩ ··· ∩ A_n] = Pr[A_1] Pr[A_2|A_1] Pr[A_3|A_2 ∩ A_1] ··· Pr[A_n|A_{n−1} ∩ ··· ∩ A_1].
A simple relation which is used very often is the following. Let A be an event and B be an event with
positive probability; write B^c for the complement of B.

    Pr[A] = Pr[A ∩ (B ∪ B^c)]
          = Pr[(A ∩ B) ∪ (A ∩ B^c)]
          = Pr[A ∩ B] + Pr[A ∩ B^c]
          = Pr[A|B] Pr[B] + Pr[A|B^c] Pr[B^c].

Using the fact that values of probabilities are between 0 and 1, the above relation is often used to obtain
bounds on Pr[A].

    Pr[A|B] Pr[B] ≤ Pr[A] ≤ Pr[A|B] + Pr[B^c].

The crux in using this relation is to choose the event B suitably.
A random variable X is a function from a sample space to the reals. Given a real a, the probability that
X takes the value a is defined to be the probability of the event {x : X(x) = a}. This is written as

    Pr[X = a] = Pr[{x : X(x) = a}].

Think of X = a as the event {x : X(x) = a}, which is a subset of the sample space. This will help in
understanding much of the description of random variables. In the following, when we talk of two (or more)
random variables, all of them will be defined over the same sample space.
Since we are working with discrete sample spaces, the set of possible values that a random variable may assume
is also countable. So, suppose that X takes the values a_1, a_2, a_3, . . .. Then

    ∑_{i≥1} Pr[X = a_i] = 1.
Let X and Y be two random variables taking values a_1, a_2, a_3, . . . and b_1, b_2, b_3, . . . respectively. Then
the joint distribution of X and Y is defined as follows.

    Pr[X = a_i, Y = b_j] = Pr[{x : X(x) = a_i and Y(x) = b_j}].

Suppose p(a_i, b_j) is the joint distribution of random variables X and Y. Then the marginal distributions
of X and Y are obtained as follows.

    f(a_i) = Pr[X = a_i] = p(a_i, b_1) + p(a_i, b_2) + p(a_i, b_3) + ···
    g(b_j) = Pr[Y = b_j] = p(a_1, b_j) + p(a_2, b_j) + p(a_3, b_j) + ···
Let X and Y be random variables. They are said to be independent if

    Pr[X = a, Y = b] = Pr[X = a] Pr[Y = b]

for every possible value a and b that X and Y can take. Extension to more than two variables is
straightforward. Random variables X_1, X_2, . . . , X_n are said to be k-wise independent if for every choice of
{i_1, . . . , i_k} ⊆ {1, . . . , n},

    Pr[X_{i_1} = a_1, . . . , X_{i_k} = a_k] = Pr[X_{i_1} = a_1] ··· Pr[X_{i_k} = a_k]

for every choice of values a_1, . . . , a_k that the random variables X_{i_1}, . . . , X_{i_k} can take. As in the case of events, if
k = 2, then the random variables are said to be pairwise independent, whereas if k = n, then the variables
are said to be mutually independent.
If X is a random variable and g is a function from reals to reals, then g(X) is a random variable. The
probability that g(X) takes a value b is

    Pr[g(X) = b] = Pr[{x : g(X(x)) = b}].

If X is a random variable over a discrete sample space, taking the values a_1, a_2, . . ., then the expectation
of X is denoted by E[X] and is defined as

    E[X] = ∑_{i≥1} a_i Pr[X = a_i]

provided the series converges absolutely.
The following basic result is easy to prove.

Lemma 1. If X is a random variable taking values a_1, a_2, a_3, . . . and g is a function from reals to reals,
then

    E[g(X)] = ∑_{i≥1} g(a_i) Pr[X = a_i].

Setting g(x) = ax for a constant a, the above result shows that E[aX] = aE[X]. Similarly, setting g(x) to
a constant b, we get

    E[b] = E[g(X)] = ∑_{i≥1} g(a_i) Pr[X = a_i] = b ∑_{i≥1} Pr[X = a_i] = b.
Expectation distributes over a sum of random variables, a feature which is called the linearity of expectation.
This is stated in the following result.

Theorem 1. Let X_1, X_2, . . . , X_k be random variables with finite expectations. Then

    E[X_1 + X_2 + ··· + X_k] = E[X_1] + E[X_2] + ··· + E[X_k].
Proof: It is sufficient to prove the result for k = 2. Let X and Y be two random variables having p(a_i, b_j)
as the joint distribution and f(a_i) and g(b_j) as the respective marginal distributions. Then

    E[X] = ∑_{i≥1} a_i f(a_i) = ∑_{i≥1} a_i ∑_{j≥1} p(a_i, b_j) = ∑_{i≥1, j≥1} a_i p(a_i, b_j).

Similarly, E[Y] = ∑_{i≥1, j≥1} b_j p(a_i, b_j) and so

    E[X] + E[Y] = ∑_{i≥1, j≥1} (a_i + b_j) p(a_i, b_j) = E[X + Y].

The rearrangements of the sums are possible as the series are absolutely convergent.
It would have been nice if expectation also distributed over the product of two random variables. That,
however, is not true in general; the product rule is guaranteed only when the two random variables are independent.
(A distributive relation over arbitrary random variables would perhaps be too nice and would probably
have made the theory uninteresting.)

Theorem 2. Let X and Y be independent random variables. Then

    E[XY] = E[X] E[Y].
Proof: As in the above result, let p(a_i, b_j) be the joint distribution and f(a_i) and g(b_j) be the respective
marginal distributions of X and Y. By the independence of X and Y, we have p(a_i, b_j) = f(a_i) g(b_j).

    E[XY] = ∑_{i≥1} ∑_{j≥1} a_i b_j p(a_i, b_j)
          = ∑_{i≥1} ∑_{j≥1} a_i b_j f(a_i) g(b_j)
          = (∑_{i≥1} a_i f(a_i)) (∑_{j≥1} b_j g(b_j))
          = E[X] E[Y].

As before, the rearrangement of terms is allowed due to the absolute convergence of the series.
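The role of independence in Theorem 2 can be seen on a tiny finite example. In the following Python sketch (an illustration, not part of the original notes; the joint distributions are chosen for convenience), X and Y are fair ±1 variables. When they are independent, E[XY] = E[X]E[Y]; when Y is forced to equal X, the product rule fails even though the marginal distributions are unchanged.

    def product_expectation(joint):
        # joint maps outcomes (x, y) to probabilities that sum to 1.
        return sum(p * x * y for (x, y), p in joint.items())

    # Independent fair +/-1 variables: mass 1/4 on each of the four points.
    indep = {(x, y): 0.25 for x in (-1, 1) for y in (-1, 1)}

    # Fully dependent: Y = X, mass 1/2 on (1, 1) and (-1, -1).
    dep = {(1, 1): 0.5, (-1, -1): 0.5}

    print(product_expectation(indep))  # 0.0  = E[X]E[Y], since E[X] = E[Y] = 0
    print(product_expectation(dep))    # 1.0 != E[X]E[Y] = 0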
The r-th moment of a random variable X is defined to be the expectation of X^r, if it exists, i.e., the
r-th moment is

    E[X^r] = ∑_{i≥1} a_i^r f(a_i)

provided the series converges absolutely.
Since |a|^{r−1} ≤ |a|^r + 1, if E[X^r] exists, then so does E[X^{r−1}]. Let μ = E[X]. Then the r-th central
moment of X is defined to be E[(X − μ)^r].
Observe that (a − μ)^2 ≤ 2(a^2 + μ^2) and so the second central moment of X exists whenever the second
moment of X exists.

    E[(X − μ)^2] = E[X^2 − 2μX + μ^2]
                 = E[X^2] − 2μE[X] + μ^2
                 = E[X^2] − μ^2.
The second central moment of X is called its variance Var(X) and the positive square root of the variance
is called its standard deviation. The variance is a measure of the spread of X. It is a sum over i of the terms
(a_i − μ)^2 f(a_i). Each of these terms is non-negative. So, if the variance is small, then each of these terms must
be small. But then either (a_i − μ)^2 is small, implying that a_i is close to μ, or, if (a_i − μ)^2 is large, then f(a_i) is small.
In other words, the probability that X takes values away from its mean is small.
For two random variables X and Y with expectations μ_x and μ_y, the covariance Cov(X, Y) is defined
to be the expectation of (X − μ_x)(Y − μ_y), whenever it exists. Note that |a_i b_j| ≤ (a_i^2 + b_j^2)/2 and so E[XY]
exists whenever E[X^2] and E[Y^2] exist. The following relation is easy to obtain using the linearity of
expectation.

    E[(X − μ_x)(Y − μ_y)] = E[XY] − μ_x μ_y.

As a result, Cov(X, Y) equals 0 if X and Y are independent. The converse, however, is not true, i.e., it is
possible that Cov(X, Y) = 0 but X and Y are not independent.
Theorem 3. If X_1, X_2, . . . , X_n are random variables, then

    Var(X_1 + X_2 + ··· + X_n) = ∑_{i=1}^{n} Var(X_i) + 2 ∑_{1≤i<j≤n} Cov(X_i, X_j).

Consequently, if X_1, X_2, . . . , X_n are pairwise independent random variables, then

    Var(X_1 + X_2 + ··· + X_n) = ∑_{i=1}^{n} Var(X_i).
Proof: Let μ_i = E[X_i].

    Var(X_1 + X_2 + ··· + X_n) = E[(X_1 + X_2 + ··· + X_n − (μ_1 + μ_2 + ··· + μ_n))^2]
                               = E[((X_1 − μ_1) + (X_2 − μ_2) + ··· + (X_n − μ_n))^2]
                               = E[∑_{i=1}^{n} (X_i − μ_i)^2 + 2 ∑_{1≤i<j≤n} (X_i − μ_i)(X_j − μ_j)].

Using the linearity of expectation, we get the first result. The second statement follows from the fact that
pairwise independence ensures that the covariances are zero.
The collection (Pr[X = a_1], Pr[X = a_2], Pr[X = a_3], . . .) is called the probability distribution of the
random variable X. Next we define some useful distributions of random variables.

Bernoulli. Here X can take two values 0 and 1 (interpreted as failure F and success S) with
probabilities q and p respectively, where p + q = 1. It is easy to verify that E[X] = p and Var(X) = pq.
Binomial. Let X_1, . . . , X_n be independent Bernoulli distributed random variables with probability of success
p (and probability of failure q) and define X = X_1 + X_2 + ··· + X_n. Then X is said to follow the
binomial distribution. X can take values in the set {0, . . . , n} and

    Pr[X = i] = C(n, i) p^i q^{n−i},

where C(n, i) denotes the binomial coefficient n!/(i!(n−i)!). Using linearity of expectation, it can be shown that E[X] = np and that Var(X) = npq.

Poisson. Let X be a random variable which can take any non-negative integer value and

    Pr[X = i] = e^{−λ} λ^i / i!.

Then X is said to follow the Poisson distribution with parameter λ. It is not too difficult to directly work
out that E[X] = Var(X) = λ.
Geometric. Let X be a random variable which can take any non-negative integer value and let p and q
be such that p + q = 1. Suppose Pr[X = i] = q^i p. Then X is said to follow the geometric distribution. This
can be interpreted as an unlimited sequence of independent Bernoulli trials, with X denoting the number of
failures before the first success. It can be shown that E[X] = q/p.
Let A be an event and let I_A be the random variable which takes the value 1 on the points of A and the value 0 elsewhere,
so that Pr[I_A = 1] = Pr[A]. Then I_A is called the indicator random variable of the event A. Clearly, I_{A^c} = 1 − I_A and, for two events
A and B, I_{A∩B} = I_A I_B.

    I_{A∪B} = 1 − I_{(A∪B)^c} = 1 − I_{A^c ∩ B^c} = 1 − I_{A^c} I_{B^c} = 1 − (1 − I_A)(1 − I_B)
            = I_A + I_B − I_A I_B = I_A + I_B − I_{A∩B}.

Since an indicator variable can take only the two values 0 and 1, its expectation is equal to the probability
that it takes the value 1.

    Pr[A ∪ B] = Pr[I_{A∪B} = 1] = E[I_{A∪B}]
              = E[I_A + I_B − I_{A∩B}]
              = E[I_A] + E[I_B] − E[I_{A∩B}]
              = Pr[A] + Pr[B] − Pr[A ∩ B].
This technique extends to more than two events. If A_1, . . . , A_n are events, then

    I_{A_1 ∪ ··· ∪ A_n} = 1 − I_{(A_1 ∪ ··· ∪ A_n)^c} = 1 − I_{A_1^c ∩ ··· ∩ A_n^c} = 1 − I_{A_1^c} ··· I_{A_n^c}
                        = 1 − (1 − I_{A_1})(1 − I_{A_2}) ··· (1 − I_{A_n}).

Now multiply out; use I_{A_{i_1} ∩ ··· ∩ A_{i_r}} = I_{A_{i_1}} ··· I_{A_{i_r}}, take expectations and use Pr[A] = E[I_A] to obtain the
principle of inclusion and exclusion.
3 Conditional Expectation
Let X and Y be random variables such that E[X] is finite. Then

    E[X|Y = y] = ∑_x x Pr[X = x|Y = y] = ψ(y).

In other words, the quantity E[X|Y = y] is a function ψ(y) of y. The conditional expectation of X given
Y is defined to be ψ(Y) and is written as ψ(Y) = E[X|Y]. So, the conditional expectation of X given Y is
a random variable which is a function of the random variable Y.
Proposition 1. E[E[X|Y]] = E[X].

Proof:

    E[E[X|Y]] = E[ψ(Y)] = ∑_y ψ(y) Pr[Y = y]
              = ∑_y ∑_x x Pr[X = x|Y = y] Pr[Y = y]
              = ∑_x x ∑_y Pr[X = x|Y = y] Pr[Y = y]
              = ∑_x x ∑_y Pr[X = x, Y = y]
              = ∑_x x Pr[X = x]
              = E[X]. □
Proposition 2. If X has finite expectation and if g is a function such that Xg(Y) also has finite expectation,
then E[Xg(Y)|Y] = E[X|Y] g(Y).

Proof: Let ψ_1(Y) = E[Xg(Y)|Y] and ψ_2(Y) = E[X|Y] g(Y). We have to show that for each y, ψ_1(y) = ψ_2(y).

    ψ_1(y) = E[Xg(Y)|Y = y]
           = ∑_x x g(y) Pr[X = x|Y = y]
           = g(y) ∑_x x Pr[X = x|Y = y]
           = g(y) E[X|Y = y]
           = ψ_2(y). □

If Y is a constant, i.e., Pr[Y = a] = 1 for some a, then E[X|Y = a] = ∑_x x Pr[X = x|Y = a] =
∑_x x Pr[X = x] = E[X] and so, in this case, E[X|Y] = E[X].
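Proposition 1 is easy to check numerically. The following Python sketch (an illustration under assumed distributions, not part of the original notes) takes Y uniform on {1, 2, 3} and, given Y = y, lets X be Binomial(y, 1/2), so that ψ(y) = E[X|Y = y] = y/2. Averaging ψ(Y) over the samples of Y agrees with the direct empirical estimate of E[X].

    import random

    random.seed(0)
    N = 200_000

    # Y uniform on {1, 2, 3}; given Y = y, X ~ Binomial(y, 1/2), so E[X | Y = y] = y / 2.
    ys = [random.choice([1, 2, 3]) for _ in range(N)]
    xs = [sum(random.random() < 0.5 for _ in range(y)) for y in ys]

    inner_then_outer = sum(y / 2 for y in ys) / N   # empirical E[psi(Y)]
    direct = sum(xs) / N                            # empirical E[X]

    print(inner_then_outer, direct)                 # both close to 1.0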
Proposition 3. E[(X − g(Y))^2] ≥ E[(X − E[X|Y])^2] for any pair of random variables X and Y such that
X^2 and g(Y)^2 have finite expectations.

Proof: Let ψ(Y) = E[X|Y].

    (X − g(Y))^2 = (X − ψ(Y) + ψ(Y) − g(Y))^2
                 = (X − ψ(Y))^2 + (ψ(Y) − g(Y))^2 + 2(X − ψ(Y))(ψ(Y) − g(Y)).

Now,

    E[(X − ψ(Y))(ψ(Y) − g(Y))] = ∑_{x,y} (x − ψ(y))(ψ(y) − g(y)) Pr[X = x, Y = y]
                               = ∑_y (ψ(y) − g(y)) ∑_x (x − ψ(y)) Pr[X = x, Y = y]
                               = ∑_y (ψ(y) − g(y)) Pr[Y = y] ∑_x (x − ψ(y)) Pr[X = x|Y = y].

The inner sum can be simplified as follows.

    ∑_x (x − ψ(y)) Pr[X = x|Y = y] = ∑_x x Pr[X = x|Y = y] − ψ(y) ∑_x Pr[X = x|Y = y]
                                   = ψ(y) − ψ(y)
                                   = 0.

So,

    E[(X − g(Y))^2] = E[(X − ψ(Y))^2] + E[(ψ(Y) − g(Y))^2]
                    ≥ E[(X − ψ(Y))^2]. □

If Y is a constant such that g(Y) = b, then E[(X − b)^2] ≥ E[(X − E[X])^2] = Var(X). This gives us the
result that the mean squared error is minimised about the expectation.
An interpretation of the above result is the following. Suppose we observe the random variable Y. From
this observation, we would like to form an opinion of the random variable X. This opinion is in the form
of a prediction function of Y. The above proposition tells us that the best predictor for X from Y is the
conditional expectation of X given Y. In fact, this can itself be used to obtain a definition of the conditional
expectation and proves to be useful in other parts of the theory.
Proposition 4. For any function g such that g(X) has finite expectation,

    E[g(X)|Y = y] = ∑_x g(x) Pr[X = x|Y = y].

Proof: Let Z = g(X). Then

    E[Z|Y = y] = ∑_z z Pr[Z = z|Y = y]
               = ∑_z z ∑_{x : g(x) = z} Pr[X = x|Y = y]
               = ∑_x g(x) Pr[X = x|Y = y]. □
Proposition 5. |E[X|Y]| ≤ E[|X| | Y].

Proof: We use g(x) = |x| in the previous proposition.

    |E[X|Y = y]| = |∑_x x Pr[X = x|Y = y]|
                 ≤ ∑_x |x| Pr[X = x|Y = y]
                 = E[|X| | Y = y]. □
Proposition 6. E[E[X|Y, Z]|Y] = E[X|Y].

Proof: Let ψ(Y, Z) = E[X|Y, Z]. Then

    E[ψ(Y, Z)|Y = y] = ∑_z ψ(y, z) Pr[Z = z|Y = y]
                     = ∑_z (∑_x x Pr[X = x|Y = y, Z = z]) Pr[Z = z|Y = y]
                     = ∑_z ∑_x x (Pr[X = x, Y = y, Z = z] / Pr[Y = y, Z = z]) · (Pr[Y = y, Z = z] / Pr[Y = y])
                     = ∑_z ∑_x x Pr[X = x, Z = z|Y = y]
                     = ∑_x x ∑_z Pr[X = x, Z = z|Y = y]
                     = ∑_x x Pr[X = x|Y = y]
                     = E[X|Y = y].

From this the result follows. □

Let Z = X_n and Y = (X_1, . . . , X_{n−1}). Then

    E[E[X|X_1, . . . , X_n]|X_1, . . . , X_{n−1}] = E[X|X_1, . . . , X_{n−1}].
Proposition 7. E[E[g(X, Y)|Z, W]|Z] = E[g(X, Y)|Z].

Proof: Let U = g(X, Y) and apply the previous result. □

More generally, if U = g(X_1, . . . , X_m), then

    E[E[U|Y_1, . . . , Y_n]|Y_1, . . . , Y_{n−1}] = E[U|Y_1, . . . , Y_{n−1}].
Next, we consider a situation where conditional expectation plays a major role. Suppose a fair coin is
tossed a Poisson number of times. We would like to find the conditional expectation of the time of
occurrence of the first head, given the total number of heads.
Let N ~ Poisson(λ). Suppose a coin is tossed N times; let X be the number of heads and T be the
time of occurrence of the first head. (In case there are no heads, T is defined to be the number of tosses
plus one, i.e., T = N + 1 if X = 0.) If N = 0, then X = 0 and so T = 1. We wish to find E[T|X = x] for
each x ≥ 0.
The plan is to first compute E[T|N, X] and then to compute E[T|X] as E[T|X] = E[E[T|N, X]|X].
For 0 ≤ x ≤ n, let f(n, x) = E[T|N = n, X = x]. Then f(n, x) is the expected waiting time for the first
head given that the total number of tosses is n and the number of heads is x. A recurrence for f(n, x) is
obtained as follows.
For 1 ≤ x ≤ n, f(n, x) = α + β, where
1. α = E[T | x heads in n tosses, the first toss is a head] · Pr[first toss is a head | N = n, X = x];
2. β = E[T | x heads in n tosses, the first toss is a tail] · Pr[first toss is a tail | N = n, X = x].
We have

    α = 1 · C(n−1, x−1)/C(n, x) = x/n

and

    β = (1 + f(n−1, x)) · C(n−1, x)/C(n, x) = ((n − x)/n) (1 + f(n−1, x)).

This gives

    f(n, x) = 1 + ((n − x)/n) f(n−1, x).

Since f(x, x) = 1, we obtain by induction on n that for n ≥ x, f(n, x) = (n + 1)/(x + 1). So, E[T|X, N] =
(N + 1)/(X + 1). We need to compute the conditional expectation of this given X = x, and for that we
need the conditional distribution of N given X = x.
    Pr[N = n, X = x] = e^{−λ} (λ^n/n!) C(n, x) (1/2^n).

    Pr[X = x] = ∑_{n≥x} e^{−λ} (λ^n/n!) C(n, x) (1/2^n)
              = (e^{−λ}/x!) ∑_{m≥0} (λ/2)^{m+x} (1/m!)
              = (e^{−λ}/x!) (λ/2)^x ∑_{m≥0} (λ/2)^m (1/m!)
              = (e^{−λ/2}/x!) (λ/2)^x.

Now Pr[N = n|X = x] = Pr[N = n, X = x]/Pr[X = x] and simplification gives

    Pr[N = n|X = x] = e^{−λ/2} (λ/2)^{n−x} / (n − x)!.
So for x ≥ 1,

    E[T|X = x] = E[(N + 1)/(X + 1) | X = x] = (x + λ/2 + 1)/(x + 1) = 1 + λ/(2(x + 1)).

So, E[T|X] = 1 + λ/(2(X + 1)).
Given X = 0, T = 1 + N and E[N] = λ; but E[T|X = 0] = 1 + λ/2, since conditioned on X = 0 the mean of N is only λ/2.
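The formula E[T|X] = 1 + λ/(2(X + 1)) can be checked by simulation. The Python sketch below (not part of the original notes; λ = 4 is an arbitrary choice) repeatedly draws N ~ Poisson(λ), tosses N fair coins, records the waiting time T for the first head (with T = N + 1 when there is no head), and compares the empirical mean of T within each value of X against 1 + λ/(2(x + 1)).

    import math, random
    from collections import defaultdict

    random.seed(1)
    lam = 4.0
    RUNS = 200_000

    def poisson(lam):
        # Knuth's method: count how many uniforms can be multiplied before
        # the running product drops below e^{-lam}.
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= L:
                return k
            k += 1

    tot, cnt = defaultdict(float), defaultdict(int)
    for _ in range(RUNS):
        n = poisson(lam)
        tosses = [random.random() < 0.5 for _ in range(n)]
        x = sum(tosses)
        t = tosses.index(True) + 1 if x > 0 else n + 1   # first head, or N + 1
        tot[x] += t
        cnt[x] += 1

    for x in range(5):
        print(x, round(tot[x] / cnt[x], 3), round(1 + lam / (2 * (x + 1)), 3))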
4 Martingales
A sequence of random variables S_1, S_2, . . . is a martingale with respect to another sequence of random
variables X_1, X_2, . . . if for all n ≥ 1 the following two conditions hold.
1. E[|S_n|] < ∞.
2. E[S_{n+1}|X_1, . . . , X_n] = S_n.
If S_n = X_n for n ≥ 1, then the sequence is a martingale with respect to itself.
E[S_{n+1}|X_1, . . . , X_n] is a function ψ(X_1, . . . , X_n) and so the relation E[S_{n+1}|X_1, . . . , X_n] = S_n is
meaningless unless S_n itself is a function of X_1, . . . , X_n. A specified sequence itself may not be a martingale.
But it is often possible to find a function ψ such that S_n = ψ(X_n) is a martingale.
The Martingale. The following gambling strategy is called a martingale. A gambler lays a bet of Rs. 1 on
the first game. Every time he loses, he doubles his earlier stake. If he first wins on the T-th bet, his stake on that
bet is 2^{T−1}, so he leaves with a profit of 2^{T−1} − (1 + 2 + ··· + 2^{T−2}) = 1.
Let Y_n be the accumulated gain after the n-th play (losses are negative). We have Y_0 = 0 and |Y_n| ≤
1 + 2 + ··· + 2^{n−1} = 2^n − 1. Also, Y_{n+1} = Y_n if the gambler has stopped by time n + 1. Otherwise,
Y_{n+1} = Y_n − 2^n with probability 1/2, or Y_{n+1} = Y_n + 2^n with probability 1/2. So E[Y_{n+1}|Y_1, . . . , Y_n] = Y_n,
which shows that {Y_n} is a martingale with respect to itself.
Example 1. Let X_1, X_2, . . . be a sequence of integer valued random variables and let S_0 = 0, S_1 = S_0 + X_1,
. . ., S_n = X_n + S_{n−1} be the partial sums.

    E[S_{n+1}|S_1 = s_1, . . . , S_n = s_n]
      = ∑_{s_{n+1}} s_{n+1} Pr[S_{n+1} = s_{n+1}|S_1 = s_1, . . . , S_n = s_n]
      = ∑_{x_{n+1}} (s_n + x_{n+1}) Pr[S_n = s_n, X_{n+1} = x_{n+1}|S_1 = s_1, . . . , S_n = s_n]
      = ∑_{x_{n+1}} (s_n + x_{n+1}) Pr[X_{n+1} = x_{n+1}|S_1 = s_1, . . . , S_n = s_n]
      = s_n + ∑_{x_{n+1}} x_{n+1} Pr[X_{n+1} = x_{n+1}|S_1 = s_1, . . . , S_n = s_n].

So, E[S_{n+1}|S_1, . . . , S_n] = S_n + E[X_{n+1}|S_1, . . . , S_n].
1. If E[X_{n+1}|S_1, . . . , S_n] = 0, then {S_n} is a martingale.
2. If X_n = ε_n Y_n, where ε_n is a random variable taking the values ±1 with probability 1/2 each and ε_n is
independent of all other random variables, then

    E[X_{n+1}|S_1 = s_1, . . . , S_n = s_n] = ∑_{x_{n+1}} x_{n+1} Pr[X_{n+1} = x_{n+1}|S_1 = s_1, . . . , S_n = s_n]
      = ∑_{y_{n+1}} ( (y_{n+1}/2) Pr[Y_{n+1} = y_{n+1}|S_1 = s_1, . . . , S_n = s_n]
                    − (y_{n+1}/2) Pr[Y_{n+1} = y_{n+1}|S_1 = s_1, . . . , S_n = s_n] )
      = 0.

This captures the idea of a gambler's gain in a fair game.
Example 2. Let X_1, X_2, . . . be independent random variables with zero means and let S_n = X_1 + ··· + X_n.

    E[S_{n+1}|X_1, . . . , X_n] = E[S_n + X_{n+1}|X_1, . . . , X_n]
                                = E[S_n|X_1, . . . , X_n] + E[X_{n+1}|X_1, . . . , X_n]
                                = S_n + 0.

The last equality follows from a simple calculation.
Example 3. Let X_0, X_1, . . . be a discrete time Markov chain with transition matrix P = [p_{i,j}] and a
countable state space S. Suppose ψ : S → IR is a bounded function which satisfies the following. For all i ∈ S,

    ∑_{j∈S} p_{i,j} ψ(j) = ψ(i).

Let S_n = ψ(X_n). Then

    E[S_{n+1}|X_1, . . . , X_n] = E[ψ(X_{n+1})|X_1, . . . , X_n]
                                = E[ψ(X_{n+1})|X_n]
                                = ∑_{j∈S} p_{X_n, j} ψ(j)
                                = ψ(X_n)
                                = S_n.
Example 4. Let X_1, X_2, . . . be independent variables with zero means and finite variances. Let S_n =
∑_{i=1}^{n} X_i. Define T_n = S_n^2.

    E[T_{n+1}|X_1, . . . , X_n] = E[S_n^2 + 2S_n X_{n+1} + X_{n+1}^2|X_1, . . . , X_n]
                                = T_n + 2E[X_{n+1}] E[S_n|X_1, . . . , X_n] + E[X_{n+1}^2]
                                = T_n + E[X_{n+1}^2] ≥ T_n.

So {T_n} is not a martingale. It is a sub-martingale. If the inequality had been ≤, then it would have
been a super-martingale.
Example 5. Suppose Y_n = S_n^2 − ∑_{i=1}^{n} σ_i^2, where σ_i is the standard deviation of X_i. Then {Y_n} is a
martingale with respect to {X_n}. If each X_i takes the values ±1, then Y_n = S_n^2 − n is a martingale.
Example 6. Polya's urn scheme. An urn contains b black and r red balls. A ball is drawn at random, its
colour noted, and it is replaced along with an additional ball of the same colour. This process is repeated.
Note that, at each stage, the number of balls increases by one, so after n trials the urn contains b + r + n balls.
Let X_n be the proportion of red balls after n trials, with X_0 = r/(b + r). A computation shows that
E[X_n|X_0 = x_0, . . . , X_{n−1} = x_{n−1}] = x_{n−1} and so E[X_n|X_0, . . . , X_{n−1}] = X_{n−1}.
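A quick simulation makes the martingale property of Polya's urn plausible. The Python sketch below (illustrative only; b = 2, r = 1 and 50 trials are arbitrary choices) averages the proportion of red balls X_n over many independent runs: the average stays at X_0 = r/(b + r) for every n, which is exactly what E[X_n] = E[X_{n−1}] = ··· = X_0 demands.

    import random

    random.seed(3)
    b, r = 2, 1          # initial numbers of black and red balls
    TRIALS = 50
    RUNS = 100_000

    avg = [0.0] * (TRIALS + 1)
    for _ in range(RUNS):
        black, red = b, r
        for n in range(TRIALS + 1):
            avg[n] += red / (black + red)
            if n == TRIALS:
                break
            # Draw a ball at random and add another ball of the same colour.
            if random.random() < red / (black + red):
                red += 1
            else:
                black += 1

    print([round(a / RUNS, 3) for a in avg[::10]])   # all close to 1/3 = r/(b + r)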
Example 7. Let {Y_n} be an independent identically distributed sequence taking the values ±1 with probability
1/2 each. Set S_n = ∑_{i=1}^{n} Y_i. For any θ ∈ (0, 1), the sequence {X_n} defined by X_0 = 1 and

    X_n = 2^n θ^{(n+S_n)/2} (1 − θ)^{(n−S_n)/2}

is a martingale with respect to {Y_n}. Indeed,

    E[X_n|Y_1, . . . , Y_{n−1}] = 2^n E[θ^{(n+S_n)/2} (1 − θ)^{(n−S_n)/2}|Y_1, . . . , Y_{n−1}]
      = 2^n θ^{n/2} (1 − θ)^{n/2} E[θ^{(S_{n−1}+Y_n)/2} (1 − θ)^{−(S_{n−1}+Y_n)/2}|Y_1, . . . , Y_{n−1}].

Since S_{n−1} is determined by Y_1, . . . , Y_{n−1} while Y_n is independent of them,

    E[θ^{(S_{n−1}+Y_n)/2} (1 − θ)^{−(S_{n−1}+Y_n)/2}|Y_1, . . . , Y_{n−1}]
      = θ^{S_{n−1}/2} (1 − θ)^{−S_{n−1}/2} E[θ^{Y_n/2} (1 − θ)^{−Y_n/2}]
      = θ^{S_{n−1}/2} (1 − θ)^{−S_{n−1}/2} · (1/2)(√(θ/(1 − θ)) + √((1 − θ)/θ))
      = θ^{S_{n−1}/2} (1 − θ)^{−S_{n−1}/2} / (2√(θ(1 − θ))).

Substituting this back gives

    E[X_n|Y_1, . . . , Y_{n−1}] = 2^{n−1} θ^{(n−1+S_{n−1})/2} (1 − θ)^{(n−1−S_{n−1})/2} = X_{n−1}.

Note that, for any function ψ,

    E[ψ(S_{n−1})|Y_1 = y_1, . . . , Y_{n−1} = y_{n−1}] = ∑_{s_{n−1}} ψ(s_{n−1}) Pr[S_{n−1} = s_{n−1}|Y_1 = y_1, . . . , Y_{n−1} = y_{n−1}]
                                                      = ψ(y_1 + ··· + y_{n−1})

and so E[ψ(S_{n−1})|Y_1, . . . , Y_{n−1}] = ψ(S_{n−1}); this is what justifies pulling the factor involving S_{n−1} out of
the conditional expectation above.
Doob Martingale. Let X_0, X_1, . . . , X_n be a sequence of random variables and let Y be a random variable
with E[|Y|] < ∞. Let Z_i = E[Y|X_0, . . . , X_i] for i = 0, . . . , n. Then

    E[Z_{i+1}|X_0, . . . , X_i] = E[E[Y|X_0, . . . , X_{i+1}]|X_0, . . . , X_i]
                                = E[Y|X_0, . . . , X_i]
                                = Z_i.

So, {Z_n} is a martingale with respect to {X_n}. In most applications, we start the Doob martingale with
Z_0 = E[Y], which corresponds to Z_0 being a trivial random variable that is independent of Y.
The interpretation of the Doob martingale is the following. We want to estimate Y, which is a function
of X_1, . . . , X_n. The Z_i's are refined estimates that gradually incorporate more of the information. If Y is fully determined
by X_1, . . . , X_n, then Z_n = Y.
4.1 A Branching Process Example
Let X be a random variable taking non-negative integer values and assume that Pr[X = 0] > 0. The probability
generating function of X is defined to be

    f(θ) = E[θ^X] = ∑_{k≥0} θ^k Pr[X = k].

Taking derivatives,

    f′(θ) = E[X θ^{X−1}] = ∑_k k θ^{k−1} Pr[X = k]

and μ = E[X] = f′(1) = ∑_k k Pr[X = k].
Suppose X_r^{(m)} is a doubly infinite sequence of independent random variables, each of which is distributed
according to the distribution of X. The idea is that X_r^{(n+1)} represents the number of children (who will be
in the (n+1)-th generation) of the r-th animal (if there is one) in the n-th generation. Let Z_0 = 1 and

    Z_{n+1} = X_1^{(n+1)} + ··· + X_{Z_n}^{(n+1)}.

Then Z_n is the size of the n-th generation. Let f_n(θ) = E[θ^{Z_n}] be the probability generating function of
Z_n.
Proposition 8. f_{n+1}(θ) = f_n(f(θ)). Consequently, f_n is the n-fold composition of f.

Proof: Let U = θ^{Z_{n+1}} and V = Z_n. We compute E[U] as E[E[U|V]], i.e., we use the basic tower property
of conditional expectation.

    E[θ^{Z_{n+1}}|Z_n = k] = E[θ^{X_1^{(n+1)} + ··· + X_k^{(n+1)}}|Z_n = k].

But Z_n is independent of the variables X_1^{(n+1)}, . . . , X_k^{(n+1)} and so the conditional expectation is equal to
the unconditional expectation.

    E[θ^{Z_{n+1}}|Z_n = k] = E[θ^{X_1^{(n+1)}} ··· θ^{X_k^{(n+1)}}]
                           = E[θ^{X_1^{(n+1)}}] ··· E[θ^{X_k^{(n+1)}}]
                           = f(θ)^k.

The last but one equality follows from the multiplicative property of expectation for independent variables,
and the last equality follows from the fact that all the X's have the same distribution as X. This
shows that E[θ^{Z_{n+1}}|Z_n] = f(θ)^{Z_n}. Now,

    E[θ^{Z_{n+1}}] = E[E[θ^{Z_{n+1}}|Z_n]] = E[f(θ)^{Z_n}].

By definition, E[θ^{Z_n}] = f_n(θ) and so E[f(θ)^{Z_n}] = f_n(f(θ)). □
Let π_n = Pr[Z_n = 0], i.e., π_n is the probability that the population has vanished by the n-th stage. Then
π_n = f_n(0) and so

    π_{n+1} = f_{n+1}(0) = f(f_n(0)) = f(π_n).

Let π be the limit of π_n as n goes to infinity. Then

    π = f(π).

Theorem 4. If E[X] > 1, then the extinction probability π is the unique root of the equation π = f(π)
which lies strictly between 0 and 1. If E[X] ≤ 1, then π = 1.
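The relation π_{n+1} = f(π_n) also gives a practical way of computing the extinction probability: start from π_0 = Pr[Z_0 = 0] = 0 and iterate f. The Python sketch below (not part of the original notes) does this for a Poisson(μ) offspring distribution, whose generating function is f(θ) = e^{μ(θ−1)}; for μ ≤ 1 the iteration tends to 1, and for μ > 1 it converges to the root of π = f(π) in (0, 1).

    import math

    def extinction_probability(mu, iterations=500):
        # Offspring distribution Poisson(mu): f(theta) = exp(mu * (theta - 1)).
        f = lambda theta: math.exp(mu * (theta - 1.0))
        pi = 0.0                      # pi_0 = Pr[Z_0 = 0]
        for _ in range(iterations):
            pi = f(pi)                # pi_{n+1} = f(pi_n)
        return pi

    for mu in (0.8, 1.0, 1.5, 2.0):
        print(mu, round(extinction_probability(mu), 4))
    # mu <= 1 drives the iterate towards 1; for mu = 2 the limit is about 0.2032.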
Recall that Z_{n+1} = X_1^{(n+1)} + ··· + X_{Z_n}^{(n+1)}, where the X_i^{(n+1)} are independent of the values Z_1, Z_2, . . . , Z_n. Then
it is clear that

    Pr[Z_{n+1} = j|Z_0 = i_0, . . . , Z_n = i_n] = Pr[Z_{n+1} = j|Z_n = i_n].

This shows that the sequence {Z_n} forms a Markov chain.

    E[Z_{n+1}|Z_0 = i_0, . . . , Z_n = i_n] = ∑_j j Pr[Z_{n+1} = j|Z_n = i_n] = E[Z_{n+1}|Z_n = i_n].

This in turn says that E[Z_{n+1}|Z_0, . . . , Z_n] = E[Z_{n+1}|Z_n].
Since each of the animals in the n-th generation gives rise to μ children on average, it is intuitively obvious that
E[Z_{n+1}|Z_n] = μZ_n. This is confirmed by differentiating both sides of E[θ^{Z_{n+1}}|Z_n] = f(θ)^{Z_n} with respect
to θ and setting θ = 1.
Define M_n = Z_n/μ^n, n ≥ 0. Then

    E[M_{n+1}|Z_0, Z_1, . . . , Z_n] = E[Z_{n+1}|Z_n]/μ^{n+1} = μZ_n/μ^{n+1} = M_n.

So, {M_n} is a martingale with respect to the sequence {Z_n}. In other words, this says that given the history
of Z up to stage n, the next value M_{n+1} of M is on average what it is now. The notion of being constant on
average conveys much more information than the correct but less informative statement E[M_n] = 1.
5 Stopping Times
Suppose that Z_0, Z_1, . . . , Z_n is a (finite) martingale with respect to X_0, X_1, . . . , X_n. Assume that the Z_i's are
the winnings of a gambler in a fair game and that the gambler had decided (before the start of the game)
to quit after n games. The following result tells us that the expected win of the gambler does not change.

Lemma 2. If the sequence Z_0, Z_1, . . . , Z_n is a martingale with respect to X_0, X_1, . . . , X_n, then E[Z_n] = E[Z_0].

Proof: From the martingale property, we have Z_i = E[Z_{i+1}|X_0, . . . , X_i]. Taking expectations on both sides,
we get

    E[Z_i] = E[E[Z_{i+1}|X_0, . . . , X_i]] = E[Z_{i+1}].

Repeating this argument gives the result. □
Suppose now that the number of games that the gambler decides to play is not fixed at the beginning.
Instead, the gambler decides to quit mid-way depending on the outcomes of the games played so far. The
time at which the gambler chooses to quit is then called a stopping time.
Formally, a non-negative, integer-valued random variable T is a stopping time for the sequence {Z_n} if
the event T = n depends only on the values of the random variables Z_0, . . . , Z_n. If Z_1, Z_2, . . . are independent,
then T is a stopping time if the event T = n is independent of Z_{n+1}, Z_{n+2}, . . ..
Examples of stopping times: the gambler decides to stop after winning three times in succession,
or after winning a certain amount of money, et cetera. On the other hand, let T be the last time that the
gambler wins five times in a row. Then T is not a stopping time, since the last such time cannot be
determined without reference to the future.
Theorem 5. If Z_0, Z_1, . . . is a martingale with respect to X_1, X_2, . . . and if T is a stopping time for
X_1, X_2, . . ., then

    E[Z_T] = E[Z_0]

whenever one of the following conditions holds.
1. The Z_i are bounded, i.e., |Z_i| ≤ c for some constant c and for all i ≥ 0.
2. T is bounded.
3. E[T] < ∞ and there is a constant c such that E[|Z_{i+1} − Z_i| | X_1, . . . , X_i] < c.
Consider a sequence of independent, fair games. In each round, a player wins a dollar with probability
1/2 or loses a dollar with probability 1/2. Let Z_0 = 0 and let X_i be the amount won on the i-th game.
Also, let Z_i be the amount of winnings after i games. Suppose that the player quits the game when she
either loses l_1 dollars or wins l_2 dollars. What is the probability that the player wins l_2 dollars before losing
l_1 dollars?
Let T be the first time the player has either won l_2 or lost l_1. Then T is a stopping time for X_1, X_2, . . ..
The sequence Z_0, Z_1, . . . is a martingale and, since the values of the Z_i up to time T are clearly bounded, we can apply
the martingale stopping theorem. So, E[Z_T] = 0. Let q be the probability that the gambler quits playing
after winning l_2 dollars. Then

    E[Z_T] = l_2 q − l_1 (1 − q) = 0.

This shows that q = l_1/(l_1 + l_2). So, the probability q is obtained using the martingale stopping theorem.
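The value q = l_1/(l_1 + l_2) is easy to confirm by simulation. The following Python sketch (illustrative only; l_1 = 3 and l_2 = 7 are arbitrary) plays the fair game until the winnings hit either +l_2 or −l_1 and records how often the upper barrier is reached first.

    import random

    random.seed(4)
    l1, l2 = 3, 7        # quit after losing l1 dollars or winning l2 dollars
    RUNS = 200_000

    wins = 0
    for _ in range(RUNS):
        z = 0
        while -l1 < z < l2:
            z += 1 if random.random() < 0.5 else -1
        if z == l2:
            wins += 1

    print(wins / RUNS, l1 / (l1 + l2))   # both close to 0.3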
Theorem 6 (Wald's equation). Let X_1, X_2, . . . be nonnegative, independent, identically distributed random
variables with the same distribution as X. Let T be a stopping time for this sequence. If T and X have bounded
expectations, then

    E[∑_{i=1}^{T} X_i] = E[T] E[X].

There are different proofs of the equality that do not require the random variables X_1, X_2, . . . to be
nonnegative.
Proof: For i ≥ 1, let Z_i = ∑_{j=1}^{i} (X_j − E[X]). The sequence Z_1, Z_2, . . . is a martingale with respect to
X_1, X_2, . . . and E[Z_1] = 0. It is given that T has bounded expectation and

    E[|Z_{i+1} − Z_i| | X_1, . . . , X_i] = E[|X_{i+1} − E[X]|] ≤ 2E[X].

So, applying the martingale stopping theorem, we get E[Z_T] = E[Z_1] = 0 and so

    0 = E[Z_T] = E[∑_{j=1}^{T} (X_j − E[X])] = E[(∑_{j=1}^{T} X_j) − T E[X]] = E[∑_{j=1}^{T} X_j] − E[T] E[X]. □
Consider a gambling game in which a player first rolls a die. If the outcome is X, then X more dice are
rolled and the gain Z is the sum of the outcomes of these X dice. What is the expected gain of the gambler?
For 1 ≤ i ≤ X, let Y_i be the outcome of the i-th of these dice. Then E[Z] = E[∑_{i=1}^{X} Y_i]. By definition, X is a
stopping time for the sequence Y_1, Y_2, . . . and so by Wald's equation,

    E[Z] = E[X] E[Y_i] = (7/2)^2 = 49/4.
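A simulation of this two-stage dice game matches Wald's equation. The Python sketch below (illustrative only) rolls one die to get X, then X further dice, and averages the total gain Z; the result is close to E[X]E[Y] = (7/2)^2 = 12.25.

    import random

    random.seed(5)
    RUNS = 300_000

    total = 0
    for _ in range(RUNS):
        x = random.randint(1, 6)                              # the first roll
        total += sum(random.randint(1, 6) for _ in range(x))  # sum of X more rolls

    print(total / RUNS)   # close to 49/4 = 12.25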
6 Martingale Tail Inequalities
Theorem 7 (Azuma-Hoeffding Inequality). Let X_0, . . . , X_n be a martingale such that |X_k − X_{k−1}| ≤
c_k. Then, for all t ≥ 0 and any λ > 0,

    Pr[|X_t − X_0| ≥ λ] ≤ 2 exp(−λ^2 / (2 ∑_{k=1}^{t} c_k^2)).
Proof: Define the so-called martingale difference sequence Y_i = X_i − X_{i−1}, i = 1, . . . , t. Note that |Y_i| ≤ c_i
and, since X_0, X_1, . . . is a martingale,

    E[Y_i|X_0, X_1, . . . , X_{i−1}] = E[X_i − X_{i−1}|X_0, X_1, . . . , X_{i−1}]
                                     = E[X_i|X_0, X_1, . . . , X_{i−1}] − X_{i−1}
                                     = 0.
Consider E[e^{αY_i}|X_0, X_1, . . . , X_{i−1}] for α > 0. Write

    Y_i = −c_i (1 − Y_i/c_i)/2 + c_i (1 + Y_i/c_i)/2.

From the convexity of e^{αy}, it follows that

    e^{αY_i} ≤ ((1 − Y_i/c_i)/2) e^{−αc_i} + ((1 + Y_i/c_i)/2) e^{αc_i}
             = (e^{αc_i} + e^{−αc_i})/2 + (Y_i/(2c_i)) (e^{αc_i} − e^{−αc_i}).

Since E[Y_i|X_0, X_1, . . . , X_{i−1}] = 0, we have

    E[e^{αY_i}|X_0, X_1, . . . , X_{i−1}] ≤ E[(e^{αc_i} + e^{−αc_i})/2 + (Y_i/(2c_i)) (e^{αc_i} − e^{−αc_i}) | X_0, X_1, . . . , X_{i−1}]
                                          = (e^{αc_i} + e^{−αc_i})/2
                                          ≤ e^{(αc_i)^2/2}.

The last inequality follows using the Taylor series expansion of e^x.
Using

    E[e^{α(X_t − X_0)}] = E[e^{α(X_{t−1} − X_0)} e^{α(X_t − X_{t−1})}],

we can write

    E[e^{α(X_t − X_0)}|X_0, . . . , X_{t−1}] = E[e^{α(X_{t−1} − X_0)} e^{α(X_t − X_{t−1})}|X_0, . . . , X_{t−1}]
                                            = e^{α(X_{t−1} − X_0)} E[e^{α(X_t − X_{t−1})}|X_0, . . . , X_{t−1}]
                                            = e^{α(X_{t−1} − X_0)} E[e^{αY_t}|X_0, . . . , X_{t−1}]
                                            ≤ e^{α(X_{t−1} − X_0)} e^{(αc_t)^2/2}.
Taking expectations and iterating, we obtain the following.

    E[e^{α(X_t − X_0)}] = E[E[e^{α(X_t − X_0)}|X_0, . . . , X_{t−1}]]
                        ≤ E[e^{α(X_{t−1} − X_0)}] e^{(αc_t)^2/2}
                        ≤ ···
                        ≤ exp((α^2/2) ∑_{i=1}^{t} c_i^2).
Hence,

    Pr[X_t − X_0 ≥ λ] = Pr[e^{α(X_t − X_0)} ≥ e^{αλ}]
                      ≤ E[e^{α(X_t − X_0)}] e^{−αλ}
                      ≤ exp((α^2/2) ∑_{k=1}^{t} c_k^2 − αλ)
                      ≤ exp(−λ^2 / (2 ∑_{k=1}^{t} c_k^2)).

The last inequality comes from choosing α = λ/(∑_{k=1}^{t} c_k^2). A similar argument gives the bound for Pr[X_t −
X_0 ≤ −λ]; it can be seen by replacing X_i with −X_i everywhere.
Corollary 1. Let X_0, X_1, . . . be a martingale such that for all k ≥ 1, |X_k − X_{k−1}| ≤ c. Then for all t ≥ 1
and λ > 0,

    Pr[|X_t − X_0| ≥ λc√t] ≤ 2e^{−λ^2/2}.
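For a concrete check of the corollary, take the simple ±1 random walk, which is a martingale with c = 1. The Python sketch below (illustrative only; t = 400 and λ = 2 are arbitrary) estimates Pr[|X_t − X_0| ≥ λ√t] by simulation and compares it with the bound 2e^{−λ^2/2}; the empirical probability is much smaller, as expected, since the bound is not tight.

    import math, random

    random.seed(6)
    t, lam = 400, 2.0
    RUNS = 20_000
    threshold = lam * math.sqrt(t)     # c = 1 for the +/-1 walk

    exceed = 0
    for _ in range(RUNS):
        pos = sum(1 if random.random() < 0.5 else -1 for _ in range(t))
        if abs(pos) >= threshold:
            exceed += 1

    print(exceed / RUNS)               # empirical tail probability, roughly 0.05
    print(2 * math.exp(-lam**2 / 2))   # Azuma-Hoeffding bound, about 0.271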
A more general form of the Azuma-Hoeffding inequality is given below and yields slightly tighter bounds
in applications.

Theorem 8. Let X_0, X_1, . . . , X_n be a martingale such that

    B_k ≤ X_k − X_{k−1} ≤ B_k + d_k

for some constants d_k and for some random variables B_k that may be functions of X_0, X_1, . . . , X_{k−1}. Then,
for all t ≥ 0 and any λ > 0,

    Pr[|X_t − X_0| ≥ λ] ≤ 2 exp(−2λ^2 / ∑_{k=1}^{t} d_k^2).
A general formalization. A function f(x_1, . . . , x_n) satisfies the Lipschitz condition with bound c if for any x_1, . . . , x_n
and any y_i,

    |f(x_1, . . . , x_{i−1}, x_i, x_{i+1}, . . . , x_n) − f(x_1, . . . , x_{i−1}, y_i, x_{i+1}, . . . , x_n)| ≤ c.

In other words, changing any single component of the input changes the output by at most c.
Let Z_0 = E[f(X_1, . . . , X_n)] and Z_k = E[f(X_1, . . . , X_n)|X_1, . . . , X_k]. Then {Z_n} is a Doob martingale. If
the X_k are independent random variables, then it can be shown that there exist random variables B_k such
that

    B_k ≤ Z_k − Z_{k−1} ≤ B_k + c.

(For a proof of this, see Mitzenmacher and Upfal.) It is necessary that X_1, X_2, . . . be independent; if they
are not, then the relation may not hold. The advantage of this fact is that the gap between the lower and
upper bounds on Z_k − Z_{k−1} is at most c and so the Azuma-Hoeffding inequality (Theorem 8) applies.
7 Applications of the Azuma-Hoeffding Inequality
Pattern matching. Let X = (X_1, . . . , X_n) be a sequence of characters chosen independently and
uniformly at random from an alphabet of size s. Let B = (b_1, . . . , b_k) be a fixed string of length k.
Let F be the number of occurrences of B in X. Using linearity of expectation, it is easy to see that
E[F] = (n − k + 1)(1/s)^k.
Let Z_0 = E[F] and, for 1 ≤ i ≤ n, let Z_i = E[F|X_1, . . . , X_i]. The sequence Z_0, . . . , Z_n is a Doob
martingale and Z_n = F. Since each character in the string can participate in no more than k possible
matches, it follows that |Z_{i+1} − Z_i| ≤ k. In other words, the value of X_{i+1} can affect the value of F by at
most k in either direction. So,

    |E[F|X_1, . . . , X_{i+1}] − E[F|X_1, . . . , X_i]| = |Z_{i+1} − Z_i| ≤ k.

By the Azuma-Hoeffding bound,

    Pr[|F − E[F]| ≥ λ] ≤ 2e^{−λ^2/(2nk^2)}.
From the corollary,

    Pr[|F − E[F]| ≥ λk√n] ≤ 2e^{−λ^2/2}.

Slightly better bounds can be obtained by using the more general framework. Let F = f(X_1, . . . , X_n).
Then changing the value of any one input can change the value of F by at most k and so the function satisfies
the Lipschitz condition. The stronger version of the Azuma-Hoeffding bound can now be applied to obtain

    Pr[|F − E[F]| ≥ λ] ≤ 2e^{−2λ^2/(nk^2)}.

This improves the value in the exponent by a factor of 4.
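The concentration of the match count F can also be observed directly. The Python sketch below (illustrative only; alphabet size s = 4, pattern length k = 3 and n = 1000 are arbitrary choices) samples random strings, counts occurrences of a fixed pattern, and reports how far the counts stray from E[F] = (n − k + 1)/s^k; the observed deviations are tiny compared with the worst-case range of F.

    import random

    random.seed(7)
    s, k, n = 4, 3, 1000
    pattern = (0, 1, 2)                      # a fixed pattern of length k
    RUNS = 5_000
    expected = (n - k + 1) / s**k            # = 997/64, about 15.58

    max_dev, total = 0.0, 0.0
    for _ in range(RUNS):
        x = [random.randrange(s) for _ in range(n)]
        f = sum(1 for i in range(n - k + 1) if tuple(x[i:i + k]) == pattern)
        total += f
        max_dev = max(max_dev, abs(f - expected))

    print(total / RUNS)   # close to the expected count
    print(max_dev)        # largest observed deviation; far smaller than n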
Balls and bins. Suppose m balls are thrown independently and uniformly at random into n bins. Let X_i
be the random variable representing the bin into which the i-th ball falls. Let F be the number of empty
bins after the m balls are thrown. Then the sequence Z_i = E[F|X_1, . . . , X_i] is a Doob martingale.
The claim is that the function F = f(X_1, X_2, . . . , X_m) satisfies the Lipschitz condition with bound 1.
Consider how placing the i-th ball can change the value of F. If the i-th ball falls into an otherwise empty
bin, then changing the value of X_i to a non-empty bin increases the value of F by one; similarly, if the
i-th ball falls into a non-empty bin, then changing the value of X_i so that the ball falls into an otherwise
empty bin decreases the value of F by one. In all other cases, changing X_i leaves F unchanged. So, using
the Azuma-Hoeffding inequality (Theorem 8), we obtain

    Pr[|F − E[F]| ≥ λ] ≤ 2e^{−2λ^2/m}.

Note that E[F] = n(1 − 1/n)^m, but it was possible to obtain a concentration bound for F without using
E[F]. In fact, in many cases, it is possible to obtain a concentration bound for a random variable without
knowing its expectation.
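Finally, the balls-and-bins bound can be seen in a simulation as well. The Python sketch below (illustrative only; m = 2000 balls, n = 1000 bins and the threshold λ = 60 are arbitrary choices) throws the balls at random, counts empty bins, and compares the observed fluctuations around n(1 − 1/n)^m with the tail bound 2e^{−2λ^2/m}.

    import math, random

    random.seed(8)
    m, n = 2000, 1000
    RUNS = 5_000
    expected = n * (1 - 1 / n) ** m      # about 135.2 empty bins on average
    lam = 60.0                           # deviation threshold for the bound

    exceed, total = 0, 0.0
    for _ in range(RUNS):
        occupied = set(random.randrange(n) for _ in range(m))
        empty = n - len(occupied)
        total += empty
        if abs(empty - expected) >= lam:
            exceed += 1

    print(total / RUNS, expected)            # empirical mean vs n(1 - 1/n)^m
    print(exceed / RUNS)                     # empirical tail probability (essentially 0 here)
    print(2 * math.exp(-2 * lam**2 / m))     # Azuma-Hoeffding bound, about 0.055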