
NOTES ON A COURSE ON PROBABILITY THEORY:

LAWS OF LARGE NUMBERS


MATTEO MARSILI, ICTP, MARSILI@ICTP.IT

This material can be found in Feller Chapters VIII, IX and X and in [2] chapter VI. Here
I give a more concise discussion of these issues. You can go back to Feller and Gnedenko
for a more extended discussion.
Question: what are the limit laws of probability for events or random variables which involve $N$ events $A_1, \dots, A_N$ or random variables $X_1, \dots, X_N$ as $N \to \infty$?
1. 0-1 laws

There are combinations of many events that occur almost surely (a.s.)¹. 0-1 laws concern events $E_N$ which depend on $N$ events or random variables, and state that, under some conditions, $P(E_N) \to 0$ or $1$ as $N \to \infty$. For example, one can show (using a theorem by Kolmogorov) that if $X_1, \dots, X_n, \dots$ is a sequence of independent random variables, then the event $A = \{\sum_i f(X_i) \text{ converges}\}$ has probability either zero or one. Here we only want to focus on one result:
1.1. Borel-Cantelli lemma. Let $A_1, A_2, \dots$ be an infinite sequence of events. If
$$\sum_{j=1}^{\infty} P(A_j) < +\infty$$
then almost surely at most a finite number of the events occur. This means that the probability of the event
$$E = \bigcap_{j=1}^{\infty} A_{k_j}$$
is zero, for any subsequence $k_j$ of integers. In other words²
$$P(A_j \text{ i.o.}) = 0.$$
Proof: for all $N > 0$
$$E \subseteq \bigcup_{j=N}^{\infty} A_j$$
but then
$$P(E) \le P\Big(\bigcup_{j=N}^{\infty} A_j\Big) \le \sum_{j=N}^{\infty} P(A_j) \to 0$$
as $N \to \infty$.
¹An event occurs almost surely (a.s.) if its probability is equal to one. "Almost" refers to the fact that the complement of this event need not be the empty set.
²Events in an infinite sequence $A_n$, $n = 1, 2, 3, \dots$ are said to occur infinitely often (i.o.) if $A_n$ occurs for an infinite subsequence of indices $n$.
Note that no independence is needed.
Example: returns to the origin of a random walk. Let $S_n = \sum_{i=1}^{n} X_i$ with the $X_i$ independent and $P(X_i = +1) = p = 1 - P(X_i = -1)$. Then
$$P\{A_n\} = P\{S_{2n} = 0\} = \binom{2n}{n}[p(1-p)]^n \simeq \frac{1}{\sqrt{\pi n}}\, e^{-a(p)\,n}, \qquad a(p) = -\log[4p(1-p)].$$
Then for all $p \ne 1/2$, $a(p) > 0$ and the BC lemma applies, i.e. a biased random walker returns to the origin only a finite number of times. In order to deal with the case $p = 1/2$ we need a converse of BC.
1.2. The converse of the Borel-Cantelli lemma. If the $A_k$ are independent and
$$\sum_{j=1}^{\infty} P(A_j) = \infty$$
then almost surely an infinite number of the events occur. This means that there exists a subsequence $k_j$ of indices such that the event
$$E = \bigcap_{j=1}^{\infty} A_{k_j}$$
has $P(E) = 1$. In other words
$$P(A_j \text{ i.o.}) = 1.$$
Proof: for some $n$,
$$E^c = \bigcap_{k=n}^{\infty} A_k^c,$$
since $A_j$ does not occur i.o. if none of the $A_k$ occurs for $k > n$. Hence
$$P(E^c) = \prod_{k=n}^{\infty} P(A_k^c) = \prod_{k=n}^{\infty}[1 - P(A_k)] \le \exp\Big\{-\sum_{k=n}^{\infty} P(A_k)\Big\} = 0.$$
c.v.d.
Example: $A_k = \{SSS \text{ at trials } 3k-2,\, 3k-1,\, 3k \text{ in Bernoulli trials}\}$. Or $A_k = \{X_k = X_k^{(\mathrm{Hamlet})}\}$, where $X_k^{(\mathrm{Hamlet})}$ is the $k$-th bit in the binary representation of Shakespeare's Hamlet.
Random walks: Bernoulli trials with $p = 1/2$,
$$S_n = \sum_{k=1}^{n} X_k, \qquad P\{X_k = \pm 1\} = \frac{1}{2}.$$
Take $A_n = \{S_n = 0\}$: what can we say? Now
$$P(A_n) \sim 1/\sqrt{n},$$
hence the series diverges, but the $A_n$ are not independent.


Take a $d$-dimensional random walk
$$S_n^{(a)} = \sum_{k=1}^{n} X_k^{(a)}, \qquad a = 1, \dots, d,$$
with the $X_k^{(a)}$ i.i.d. with distribution $P\{X_k^{(a)} = \pm 1\} = \frac{1}{2}$. Then $(S_n^{(1)}, \dots, S_n^{(d)})$ is a point on a $d$-dimensional hypercubic lattice. Consider the return to the origin in $n$ steps,
$$A_n^{(d)} = \{S_n^{(1)} = 0, \dots, S_n^{(d)} = 0\}.$$
Then
$$P(A_n^{(d)}) \sim n^{-d/2}.$$
For $d > 2$ the random walk returns to the origin at most a finite number of times, i.e. it is transient. For $d \le 2$ the random walk is recurrent, i.e. $A_n$ occurs i.o., but that's harder to show.
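As a quick numerical illustration (a sketch of mine, not part of the notes), the Python snippet below counts returns to the origin for a one-dimensional walk. For $p \ne 1/2$ the count saturates as the horizon grows, consistent with the Borel-Cantelli argument, while for $p = 1/2$ it keeps growing; the horizons, the seed and the function name `count_returns` are arbitrary choices.

```python
import random

def count_returns(p, n_steps, seed=0):
    """Count how many times a +/-1 random walk with P(step = +1) = p hits the origin."""
    rng = random.Random(seed)
    position, returns = 0, 0
    for _ in range(n_steps):
        position += 1 if rng.random() < p else -1
        if position == 0:
            returns += 1
    return returns

for p in (0.5, 0.55, 0.7):
    # returns counted over increasingly long horizons
    print(p, [count_returns(p, n, seed=1) for n in (10**3, 10**4, 10**5)])
```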
Refinement of the BC lemma: we can relax the independence assumption and replace it by the milder condition
$$\lim_{n \to \infty} \frac{\sum_{1 \le i < j \le n} P(A_i \cap A_j)}{\big[\sum_{1 \le i \le n} P(A_i)\big]^2} = \frac{1}{2}$$
and the result still holds.

2. The Law of Large Numbers


The law of large numbers is a statement about the convergence of the sample mean of a sequence $X_1, \dots, X_n, \dots$ of independent random variables to a constant (the expected value $\mu$):
$$\frac{1}{n}\sum_{i=1}^{n} X_i \to \mu.$$
In physics this manifests itself, for example, in the fact that thermodynamic quantities such as the free energy of disordered materials attain values which are sample independent. Indeed the energy density can be considered to be a sum of many local terms, which may fluctuate because e.g. of impurities. The quantities which satisfy laws of large numbers in physics are called self-averaging.
Typically, the Law of Large Numbers is used when we want to estimate expected values of random variables,
(1)  $$E[f(X)] = \sum_{\omega \in \Omega} p_\omega f[X(\omega)],$$
and the way we do it is to take $T$ samples $\omega_t$, $t = 1, \dots, T$, that in the best of all possible worlds can be thought of as independent draws from the distribution $p_\omega$. Then we compute the sample mean and argue that
(2)  $$\frac{1}{T}\sum_{t=1}^{T} f[X(\omega_t)] \approx E[f(X)].$$
This is generically what goes under the name of the Law of Large Numbers.
Now, at first sight, this can't be true! After all, the number of independent samples $T$ cannot be very large, whereas the sample space $\Omega$ can consist of a very large number of elements, and at worst it can even be an infinite or a continuous set!


There are simple cases where you can imagine $|\Omega|$ to be astronomically large. The classical example in physics is a gas, where in principle $\omega$ should specify the position and velocity of each and every particle. How can a handful of measurements $\omega_1, \dots, \omega_T$ allow us to compute expected values? So, if Eq. (2) is true, something quite peculiar must happen!
Indeed, the peculiar thing which happens is that particular combinations of sampled random variables are not very random: their probability distribution concentrates on a small neighbourhood of a single point.
2.1. Different types of limit. [From C. W. Gardiner, Handbook of Stochastic Methods, Springer-Verlag, 1985; Chap. 2]
Before proceeding, it's important to state clearly what we mean by limit. Given a sequence of random variables $X_n(\omega)$, there are different senses in which the statement $X_n(\omega) \to X(\omega)$ can be interpreted, because we're talking about convergence of functions.

Almost certain convergence: For fixed $\omega$, the statement $X_n(\omega) \to X(\omega)$ reduces to convergence of a sequence of numbers, so it is well defined. If this happens for all $\omega$ in a set $\bar\Omega$ with $P\{\bar\Omega\} = 1$, we say that $X_n(\omega) \to X(\omega)$ a.s.

Convergence in mean square: If
$$\lim_{n \to \infty} E\big[(X_n - X)^2\big] = 0$$
then we say that $X_n(\omega) \to X(\omega)$ in mean square.

Convergence in probability: If for all $\epsilon > 0$
$$\lim_{n \to \infty} P\{|X_n - X| > \epsilon\} = 0$$
then we say that $X_n(\omega) \to X(\omega)$ in probability. By Chebyshev's inequality it is easy to show that convergence in mean square implies convergence in probability. Almost certain convergence also implies convergence in probability.

Convergence in distribution: If for all continuous and bounded functions $f(x)$
$$\lim_{n \to \infty} E[f(X_n)] = E[f(X)]$$
then $X_n(\omega) \to X(\omega)$ in distribution. Note that this is equivalent to
$$\lim_{n \to \infty} \int dx\,[p_n(x) - p(x)]\,f(x) = 0 \qquad \forall f(x),$$
i.e. the distribution of $X_n$ converges to that of $X$. Mean square convergence and convergence in probability imply convergence in distribution.
2.2. The weak law of large numbers (WLLN). A sequence $X_1, \dots, X_n, \dots$ of independent and identically distributed (i.i.d.) random variables with $E[X_i] = \mu$ satisfies the WLLN if, for all $a > 0$,
$$\lim_{n \to \infty} P\left\{\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \ge a\right\} = 0.$$
If the variance $V[X_i] = E[(X_i - \mu)^2] = \sigma^2$ is finite, then the WLLN holds.


Proof:
$$V[X] = \int dx\, p(x)(x - \mu)^2 \ \ge\ \int_{|x-\mu| \ge a} dx\, p(x)(x - \mu)^2 \ \ge\ a^2\, P\{|x - \mu| \ge a\},$$
then (Chebyshev inequality)
$$P\{|x - \mu| \ge a\} \le \frac{V[X]}{a^2}.$$
Now
$$V\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{V[X_i]}{n},$$
so that
$$P\left\{\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \ge a\right\} \le \frac{V[X_i]}{n a^2} \to 0$$
as $n \to \infty$. Of course the same proof shows that the WLLN holds whenever the variables $X_i$ are uncorrelated, or e.g. when the correlation $E[(X_i - \mu)(X_j - \mu)]$ decays fast enough with $|i - j|$, as for example when $i$ is a time index.
Actually Khintchine showed that the WLLN applies for i.i.d. variables as long as $|E[X_i]| < +\infty$ (see [2]).
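As an illustration (mine, not part of the notes), the sketch below compares the empirical probability of a deviation $|\bar X_n - \mu| \ge a$ with the Chebyshev bound $V[X_i]/(na^2)$ for uniform random variables; both vanish as $n$ grows, the empirical one much faster than the bound. The sample sizes and the threshold $a$ are arbitrary.

```python
import random

def deviation_prob(n, a, trials=2000, seed=0):
    """Empirical P{|sample mean - 1/2| >= a} for n i.i.d. Uniform(0,1) variables."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) >= a:
            hits += 1
    return hits / trials

a, var = 0.05, 1.0 / 12.0   # variance of Uniform(0,1) is 1/12
for n in (10, 100, 1000):
    print(n, deviation_prob(n, a), "Chebyshev bound:", var / (n * a * a))
```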
2.3. The strong law of large numbers (SLLN). A sequence $X_1, \dots, X_n, \dots$ of independent random variables with $E[X_i] = \mu$ satisfies the SLLN if, for all $a, \delta > 0$, there is an $N(a, \delta)$ such that
$$P\left\{\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| < a, \ \forall n > N(a, \delta)\right\} \ge 1 - \delta.$$
The SLLN states the almost certain convergence of the sample mean to the expected value, whereas the WLLN states convergence in probability. Naively, while the WLLN concerns the limit of the probability of deviations of the sample mean from the average, the SLLN is a statement on the probability of the limit of the deviations, i.e. it says that deviations larger than $a$ can be made as rare as one wishes by taking $n$ large enough.
Sufficient conditions for the SLLN are (see [2]):
- $X_i$ i.i.d. and $E[X_i] = \mu < +\infty$ (Kolmogorov);
- $E[X_i] = \mu_i < +\infty$ and $V[X_i] = \sigma_i^2 < +\infty$, with $\sum_k \frac{\sigma_k^2}{k^2} < +\infty$.


Example: convergence of the frequency $k/n \to p$ in Bernoulli trials. The SLLN, in this case, is equivalent to saying that for all $\epsilon > 0$, the probability that the event
$$A_n = \left\{\left|\frac{S_n}{n} - p\right| > \epsilon\right\}, \qquad S_n = \sum_{i=1}^{n} X_i, \qquad X_i = \begin{cases} 1 & \text{w. p. } p \\ 0 & \text{w. p. } 1-p\end{cases}$$
occurs infinitely often is zero.
We can prove something even stronger in a simple way. Take
$$B_n = \left\{\frac{|S_n - np|}{\sqrt{np(1-p)}} > \sqrt{2a\log n}\right\}.$$
Using the De Moivre-Laplace approximation of the binomial distribution, we find that
$$P(B_n) = \sqrt{\frac{2}{\pi}}\int_{\sqrt{2a\log n}}^{\infty} dx\, e^{-x^2/2} < \frac{1}{\sqrt{\pi a \log n}}\, n^{-a}.$$
Then for $a > 1$
$$\sum_n P(B_n) < +\infty$$
and the Borel-Cantelli lemma ensures that the SLLN holds (note indeed that $A_n \subseteq B_n$ for $n$ large enough). This result gives quite precise bounds on the excursions of the frequency around the probability.
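A numerical sketch (mine, with arbitrary parameters): along a single long Bernoulli trajectory one can track the largest value of $|S_n - np|/\sqrt{2np(1-p)\log n}$; any $a$ larger than the square of this maximum would have bounded the whole excursion, in line with the estimate above.

```python
import math, random

def max_excursion_ratio(p, n_max, seed=0):
    """Max over n of |S_n - n p| / sqrt(2 n p (1 - p) log n) along one trajectory."""
    rng = random.Random(seed)
    s, worst = 0, 0.0
    for n in range(1, n_max + 1):
        s += 1 if rng.random() < p else 0
        if n >= 3:  # skip the first steps, where log n is too small
            ratio = abs(s - n * p) / math.sqrt(2 * n * p * (1 - p) * math.log(n))
            worst = max(worst, ratio)
    return worst

w = max_excursion_ratio(0.3, 10**5, seed=2)
print(w, "-> any a >", w**2, "bounds this trajectory")
```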
3. Shannon's theorem and the Asymptotic Equipartition Property
Shannon's theorem is one particularly important application of the law of large numbers. Let $C = \{c_1, \dots, c_L\}$ be a message written in an alphabet of $N$ symbols, $c_i = 0, 1, \dots, N-1$. Alice wants to send this message to Bob in the most efficient way. One simple way is to compute
$$K(C) = \sum_{i=1}^{L} c_i N^{i-1}$$
and send the binary representation of this number. This makes $L\log_2 N$ bits, i.e. $\log_2 N$ bits per character. Can Alice do better?
If Alice is sending a message in English (and we imagine Bob knows it) a sequence such as $\dots QBATTX\dots$ will be very unlikely. Bob is not expecting such sequences. Likewise, imagine that $L \gg 1$ and that characters are drawn independently, but not all characters are equally likely, i.e.
$$p_c = P\{c_i = c\}, \qquad \forall i.$$
This distribution is known to both Alice and Bob. What is the optimal coding $O(C)$, i.e. the one using as few bits as possible?
Theorem (Shannon): If the $c_i$ are i.i.d. with distribution $p_c$, then there is a coding function $O(C)$ such that a.s. a message needs at most
$$H_2[p] = -\sum_c p_c \log_2 p_c$$
bits per character to be transmitted, for $L \to \infty$.


Note: if $p_c = \delta_{c,k}$ then $H_2[p] = 0$. Of course no message is needed. If instead $p_c = 1/N$ then the theorem says that the coding above is optimal.
The knowledge of $p_c$ allows Alice and Bob to gain
$$I[p] = \log_2 N - H_2[p]$$
bits per character. Then one can say this is the information content of $p_c$.
Proof (idea): For any message $C$, let
$$P(C) = \prod_{i=1}^{L} p_{c_i}.$$
Let $O(C)$ be the rank of the messages according to their probability $P(C)$. In other words, $O(C)$ is such that for any two messages $C_1, C_2$, $O(C_1) < O(C_2)$ iff $P(C_1) > P(C_2)$, so small numbers $O(C)$ correspond to highly probable messages [if $P(C_1) = P(C_2)$ take $O(C_1) < O(C_2)$ if $K(C_1) < K(C_2)$].
For any function $f_c$, let $\langle f \rangle = \sum_c p_c f_c$ and
$$\bar f(C) = \frac{1}{L}\sum_{i=1}^{L} f_{c_i}.$$
Then the Weak Law of Large Numbers (WLLN) implies that, for any $\epsilon > 0$, $P\{|\bar f(C) - \langle f\rangle| > \epsilon\} \to 0$ as $L \to \infty$. Now take $f_c = -\log p_c$; then $\bar f(C) = -L^{-1}\log P(C)$ and $\langle f\rangle = H[p]$. The WLLN implies that, a.s.,
$$e^{-L(H[p]+\epsilon)} < P(C) < e^{-L(H[p]-\epsilon)},$$
i.e. all messages will typically have the same probability, $P(C) \simeq e^{-LH[p]}$. The number of these messages is just the inverse of this probability, i.e. there are $\sim e^{LH[p]}$ of them. This means that the largest integer I need to encode such messages is of order $O(C) \sim e^{LH[p]}$. In order to do this I need at most $H_2[p]$ bits per character. It is easy to see that one cannot do better.
In brief, what Shannon's theorem states is that, for $L$ large, the sequences which we're likely to observe are those with a probability $P(C) \sim e^{-LH[p]}$, and that their number is $\sim e^{LH[p]}$. We shall call these the typical sequences. All non-typical sequences occur with a probability which is exponentially small (in $L$) with respect to typical ones, whereas typical sequences all have (roughly) the same probability. This is a property called the Asymptotic Equipartition Property.
A more detailed treatment of this issue is given in [1], chapter 3, which is mandatory reading.
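To see the concentration at work numerically (an illustration of mine, with a biased two-letter alphabet standing in for $p_c$), the sketch below draws messages and checks that $-\frac{1}{L}\log_2 P(C)$ clusters around $H_2[p]$ as $L$ grows.

```python
import math, random

p = {0: 0.9, 1: 0.1}                      # assumed character distribution p_c
H2 = -sum(q * math.log2(q) for q in p.values())

def minus_log2_prob_per_char(L, seed):
    """Draw a message of length L and return -(1/L) log2 P(C)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(L):
        c = 0 if rng.random() < p[0] else 1
        total -= math.log2(p[c])
    return total / L

print("H2[p] =", round(H2, 4))
for L in (10, 100, 10000):
    print(L, [round(minus_log2_prob_per_char(L, s), 3) for s in range(5)])
```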
Exercise: A sports newspaper gives every Friday the probabilities of the outcomes (1, X or 2) of the 13 football matches in the Italian league that are in the schedina, a popular betting scheme in Italy. Take the example where the probabilities of the three different outcomes are 50%, 30% and 20%, in any possible order. So the forecast may look like this:
      1      X      2
     50%    30%    20%
     30%    50%    20%
     20%    50%    30%
     50%    30%    20%
     ...    ...    ...
     30%    50%    20%
     30%    50%    20%
     50%    20%    30%
The simplest schedina consists of a forecast $(\omega_1, \dots, \omega_{13})$, with $\omega_i \in \{1, X, 2\}$.
How would you know whether a particular schedina is a typical one? Do you expect people will play a typical schedina?
3.1. A Lottery. A further example to gain intuition on typical sequences is the following. Imagine a lottery where, if I invest 1 euro, I get 2 euros with probability 1/2 and $0 < q < 1/2$ euros otherwise. If the invested capital is $W_0$, after the game I get, on average,
$$E[W_1] = E[X_1] W_0 = (1 + q/2) W_0, \qquad X_1 = \begin{cases} 2 & \text{w. p. } 1/2 \\ q & \text{w. p. } 1/2\end{cases}$$
For any $q > 0$ this seems like a convenient game. Indeed, if the game is repeated $n$ times, with $X_i$, $i = 1, \dots, n$, i.i.d. random variables distributed as $X_1$, and I invest $W_0$ euros each time, then the LLN ensures that I will get a positive gain per bet, equal to $q/2$ times $W_0$.
Now the gain is clearly proportional to $W_0$, so the more one invests the higher the gain. In particular, the best thing to do is to invest all the accumulated capital at each time. In this way, the capital after $n$ bets will be $W_n = X_n \cdots X_1 W_0$ and I expect my capital to increase as
$$E[W_n] = (1 + q/2)^n W_0,$$
i.e. exponentially. Great!
However, it is easy to show that if $0 < q < 1/2$ then, for all $a > 0$,
$$P\{W_n > a\} \to 0 \quad \text{as } n \to \infty.$$
Indeed
$$\frac{1}{n}\log(W_n/W_0) = \frac{1}{n}\sum_{i=1}^{n}\log X_i \to E[\log X_i] = \frac{1}{2}\log(2q)$$
almost surely, as $n \to \infty$, by the SLLN. This means that, almost surely,
$$W_n \simeq W_0\, e^{-cn}, \qquad c = \frac{1}{2}|\log(2q)| > 0,$$
i.e. the capital will typically vanish exponentially. Here typical means with very high probability, which tends to one as $n \to \infty$. The discrepancy of this result with the behavior of $E[W_n]$ becomes evident if one takes some care in evaluating the expected value:
(3)  $$E[W_n] = W_0\sum_{k=0}^{n}\binom{n}{k}\, 2^{-n}\big[2^k q^{n-k}\big]$$
(4)  $$\simeq W_0\, n\int_0^1 dx\, e^{n f(x)},$$
where in the first line we compute the expected value by averaging over the number $k$ of lucky outcomes (the term in $[\dots]$ is the corresponding gain), and then we use Stirling's formula for the binomial and change the sum over $k$ into an integral over $x = k/n$; the function $f(x)$ is
$$f(x) = -x\log x - (1-x)\log(1-x) + (1-x)\log(q/2).$$
The integral can be performed with the saddle point method, which means that it is dominated by the sequences of outcomes where the fraction $x^*$ of lucky outcomes, derived by imposing $f'(x^*) = 0$, is
$$x^* = \frac{1}{1 + q/2}.$$
Sequences of this type occur only with exponentially small probability, but when they occur they yield an exponentially large gain, which dominates the calculation of the expected value. By the Asymptotic Equipartition Property, instead, we're guaranteed to see almost surely a typical sequence with a frequency $x = 1/2$, and this yields a gain $W_n^{\rm typical} \simeq (2q)^{n/2} W_0$, which vanishes as $n \to \infty$, because $2q < 1$.
Problem: The above is a bit frustrating, since a game which seems profitable leads us to bankruptcy. Maybe one should not invest all the capital. Consider the strategy of investing a fraction $\lambda \in [0, 1]$ of the capital $W_n$ each time. What is the best value of $\lambda$?
For more on this issue, and for its relation to finance, see the discussion in the paper http://arxiv.org/abs/0902.2965
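As a numerical hint (a sketch of mine, not a solution): if a fraction $\lambda$ is invested each round, the capital is multiplied by $1 - \lambda + \lambda X_i$, so the typical growth rate per bet is $E[\log(1 - \lambda + \lambda X)]$, which can simply be tabulated for a few values of $\lambda$; the value $q = 0.2$ below is an arbitrary choice.

```python
import math

q = 0.2  # assumed payout of the unlucky outcome, 0 < q < 1/2

def growth_rate(lam, q):
    """Typical (almost sure) growth rate of log-capital per bet when a fraction lam is invested."""
    win = 1 - lam + 2 * lam    # capital multiplier on a lucky outcome
    lose = 1 - lam + q * lam   # capital multiplier on an unlucky outcome
    return 0.5 * math.log(win) + 0.5 * math.log(lose)

for lam in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(lam, round(growth_rate(lam, q), 5))
# lam = 1 recovers the negative rate log(2q)/2 of the text, while for this q
# a small intermediate fraction already gives a positive typical growth rate.
```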
4. Information
Mandatory reading: Chapters 2 and 3 of [1], pages 279-287 of [9].
Suggested: the rest of Chapters 4.1 and 4.2 in [9].

Shannon's theorem gives a precise connection between probability and information. It states that the entropy of a distribution quantifies its information content. There is a more direct way to see this.
Take a specific example: Alice (A) asks Bob (B) a question. She anticipates that the answer $\omega \in \Omega$ can be one of $n = |\Omega|$ possible ones. One way to quantify the information content of Bob's answer is to count the number of binary questions (yes/no) that A needs to pose to B in order to know the answer $\omega$.


This can be measured in bits. Take for example the case $\Omega = \{a, b, c, d\}$. Then A may ask a first question:
Q1: is $\omega \in \{a, b\}$ or not?
and depending on the answer, A may ask:
Q2: if $\omega \in \{a, b\}$, is $\omega = a$ or not? Otherwise, if $\omega \notin \{a, b\}$, is $\omega = c$ or not?
The answers to these two questions reveal the correct outcome $\omega$. Hence the information is $H = 2$ bits.
Now imagine that A knows that the answers are not all equally likely, i.e. she knows the prior probability $p_\omega$ of the answers. Can she do better? Imagine that in the previous case $p_a = 1/2$, $p_b = 1/4$ and $p_c = p_d = 1/8$. Then A can modify her questions as follows:
Q1: is $\omega = a$ or not?
Only if $\omega \ne a$ will A need to pose a further question. Then she may ask:
Q2: is $\omega = b$ or not?
Only if the answer is no will she need to ask:
Q3: is $\omega = c$ or not?
The number of questions A expects to be needed is
(5)  $$H = p_a\cdot 1 + p_b\cdot 2 + p_c\cdot 3 + p_d\cdot 3 = \frac{7}{4}$$
which is less than 2. So the knowledge of $p_\omega$ carries some information on the answer, which can be quantified (in 1/4 of a bit, which is what A gains).
Indeed this is the optimal way to elicit the answer, given $p_\omega$, and it is also the optimal way to encode it, i.e. it is the optimal representation. For imagine that A and B have to decide on a code to transform the possible answers into bits, which are then sent over a binary channel. Then the simplest choice would be
$$a \to 00; \quad b \to 01; \quad c \to 10; \quad d \to 11,$$
which requires 2 bits. But, if $p_\omega$ is as above and they know it, A and B can also decide to use a different code:
$$a \to 0; \quad b \to 10; \quad c \to 110; \quad d \to 111.$$
In this case they expect that the number of bits needed will be 7/4 on average.
The fact that these two apparently different problems (eliciting and coding information) are the same is evident from the fact that A's choice of the type of questions is in practice deciding a binary coding (yes $\to$ 1, no $\to$ 0) of the possible answers.
Indeed, the optimal way to elicit the answer, or equivalently the optimal way to encode it, is given by the entropy
(6)  $$H[p] = E_p[\log_2 1/p] = -\sum_\omega p_\omega \log_2 p_\omega$$
of the distribution. For a derivation of this result, see [9]. Note indeed that, in the example above, we used bit strings of length exactly equal to $\log_2 1/p_\omega$.
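A small check in Python (mine; the code table is the one above): the expected length of the prefix code equals $\sum_\omega p_\omega \log_2(1/p_\omega) = 7/4$ bits, i.e. the entropy of the distribution.

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}  # prefix-free code from the text

expected_length = sum(p[w] * len(code[w]) for w in p)
entropy = -sum(q * math.log2(q) for q in p.values())
print(expected_length, entropy)  # both equal 1.75

# decoding is unambiguous because no codeword is a prefix of another
message = "".join(code[w] for w in "abacad")
print(message)
```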


The entropy quantifies how surprising B's reply can be for A. Indeed if both A and B know that $p_\omega = \delta_{\omega,a}$, for example, then B's reply is not surprising at all. Actually A doesn't even need to ask, so no bits need to be exchanged. As (you will see) in statistical mechanics, the entropy quantifies the uncertainty. In this case, it quantifies the uncertainty of Alice about Bob's answer before she hears it. After she hears the answer, she knows that one answer occurs with probability one and the others with probability zero, i.e. $H = 0$. So $H$ measures how much Alice has decreased her degree of uncertainty.
The entropy above is a functional of the distribution $p_\omega$. It can also be defined for a random variable $X(\omega)$ taking discrete values $x \in \mathcal{X}$ as
$$H[X] = -\sum_{x \in \mathcal{X}} p_x \log p_x$$
where $p_x = P\{X(\omega) = x\}$. We'll use the latter definition, which reduces to the former for $X = \omega$.
The entropy is minimal when $X$ is a constant and maximal when $X$ is maximally uncertain, $p_x = 1/|\mathcal{X}|$. Accordingly $0 \le H[X] \le \log|\mathcal{X}|$.
The entropy can be generalized to any number of random variables $X_1, \dots, X_n$ in a straightforward fashion. Likewise, we can define the conditional entropy $H(X|Y)$ as the entropy of the conditional distribution $p(x|y)$, averaged over $y$. The laws of conditional probability imply that $H(X|Y) = H(X, Y) - H(Y)$ (check!).

4.1. Relative entropy. Imagine now that A has a wrong estimate $q_\omega$ of the probability $p_\omega$ of B's answers $\omega$. How much does this impact the efficiency of the questions she's going to ask?
Given $q$, A is going to effectively encode B's answers in such a way that answer $\omega$ will require $\log_2 1/q_\omega$ bits³, so the number of questions she needs to ask, on average, is
$$E_p[\log_2 1/q] = -\sum_\omega p_\omega \log_2 q_\omega.$$
The difference between this and the most efficient representation, which requires $H[p]$ bits, is
$$D_{KL}(p\|q) = \sum_\omega p_\omega \log_2\frac{p_\omega}{q_\omega},$$
which is known as the Kullback-Leibler divergence. It tells us how much we pay, in terms of bits, for our error in the estimate of the probabilities. In this sense, $D_{KL}$ tells us how far we are from the truth, which is why it is often regarded as a distance, though it is not symmetric and it does not satisfy the triangle inequality.
Though it is not evident a priori, $D_{KL}(p\|q) \ge 0$ and it vanishes only for $q = p$. The way to prove it (do it!) is to use the concavity of the logarithm, $\log x \le x - 1$, in the definition of $D_{KL}$.
³This argument assumes that the length of messages can be continuous, whereas it is discrete! Chapter 5 of [1] shows how the derivation can be carried out more precisely, and how optimal data compression that achieves the Shannon bound of $H[p]$ bits per answer can be achieved in long messages. We keep our discussion at a non-rigorous level for the moment, to illustrate the main concepts.
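A numerical illustration of this coding interpretation (mine, with made-up distributions $p$ and $q$): the extra cost of coding with the wrong distribution, $E_p[\log_2 1/q] - H[p]$, is exactly $D_{KL}(p\|q)$.

```python
import math

p = [0.5, 0.25, 0.125, 0.125]   # true distribution of the answers
q = [0.25, 0.25, 0.25, 0.25]    # Alice's wrong estimate

H_p = -sum(pi * math.log2(pi) for pi in p)
cost_with_q = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
D_kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
print(H_p, cost_with_q, D_kl)   # cost_with_q - H_p equals D_kl (here 0.25 bits)
```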


4.2. Mutual information. Imagine you have two random variables $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ with joint distribution $p(x, y)$ and marginals $p(x)$ and $p(y)$⁴. One way to quantify their mutual dependence is to compute how much information is lost by assuming that they are independent. This is given by
(7)  $$I(X, Y) = D_{KL}[p(x, y)\|p(x)p(y)]$$
(8)  $$= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y)\log_2\frac{p(x, y)}{p(x)p(y)}$$
(9)  $$= H(X) + H(Y) - H(X, Y)$$
and it is called the mutual information between $X$ and $Y$. The last equality, which follows from simple algebra, together with the positivity of $D_{KL}$, implies that $H(X, Y) \le H(X) + H(Y)$.
In order to illustrate the meaning of $I$, consider the following problem. We are interested in estimating a random variable $X$ of which at present we know the distribution $p(x)$, and the corresponding entropy $H(X)$, which quantifies our state of uncertainty about $X$. You can think of $X$ as a parameter of our theory of the world (or as the theory itself, if you allow $X$ to be multi-dimensional).
Now we have the possibility to perform an experiment, i.e. to measure a random variable $Y$, of which we know, before doing the experiment, the distribution. We also know the joint distribution $p(x, y)$ of the two variables. How much information can I expect the experiment to convey on $X$? The reduction in the uncertainty is given by
$$H(X) - H(X|Y) = I(X, Y)$$
as can be shown by a direct calculation. So the mutual information tells us how much we learn, on average, about $X$ if we know $Y$. Note that $I(X, Y) = I(Y, X) = H(Y) - H(Y|X)$. This means that experiments cannot tell us more about theories than the information our theories contain about experiments.
Another important point is that knowledge of $Y$ reduces the uncertainty on $X$ a priori, since $H(X|Y) \le H(X)$, but a posteriori this might not be the case! Take for example the joint distribution $p(x, y)$:

   Y\X     1      2
    1      0     3/4
    2     1/8    1/8

Then $H(X) = 0.544$ and $H(X|Y) = 0.25$ bits, but $H(X|Y = 2) = 1$ bit.
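These numbers can be checked with a few lines of Python (my sketch, using the table above with entries indexed as (y, x)):

```python
import math

joint = {(1, 1): 0.0, (1, 2): 0.75,
         (2, 1): 0.125, (2, 2): 0.125}   # keys are (y, x)

def H(dist):
    """Entropy in bits of a dict of probabilities (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

p_x = {x: sum(p for (y, xx), p in joint.items() if xx == x) for x in (1, 2)}
p_y = {y: sum(p for (yy, xx), p in joint.items() if yy == y) for y in (1, 2)}

H_X = H(p_x)
H_X_given_y = {y: H({x: joint[(y, x)] / p_y[y] for x in (1, 2)}) for y in (1, 2)}
H_X_given_Y = sum(p_y[y] * H_X_given_y[y] for y in (1, 2))

print(round(H_X, 3), round(H_X_given_Y, 3), round(H_X_given_y[2], 3))
# -> 0.544 0.25 1.0, as claimed in the text
```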
Exercise: Problem 77 of [9]. Let there be $n + 1$ boxes labeled $\omega = 0, 1, \dots, n$. One of the boxes contains a prize, the others are empty. The probability $p_0$ that the prize is in box $\omega = 0$ is much larger than the probability that it is in any other box, $p_0 \gg p_\omega$, and we take $p_{\omega>0} = (1 - p_0)/n$ for simplicity. Note that, if $n$ is large, $p_0$ can nonetheless also be small.
There are two possible options:
1) open the box $\omega = 0$;
2) open simultaneously all boxes $\omega > n/2$ (imagine $n$ is even).
⁴The abuse of notation of the symbol $p(\cdot)$ follows the notation of [1]. It should be understood that $p(x)$ and $p(y)$ are in principle different functions of their arguments.


Which one is the most convenient? Which one conveys the most information on where the prize actually is?
Draw a plot of the threshold $p_0$ for which 1 and 2 are equivalent, according to the two criteria.
This is a toy model for a situation where a phenomenon can be explained by alternative theories, one of which is the prevailing one, whereas the others are very unlikely but are many. The two options correspond to two possible experiments, one that tries to refute or confirm the prevailing theory, the other that can exclude half of the unlikely ones.
Check that even if $p_0 = 0.99$ it might be more informative to exclude unlikely theories if $n > 270$.
Contrast this result with the problem of finding the experiment that is most likely to find the box with the prize.
Hint: formulate the problem in terms of a random variable $X(\omega) = \omega$ that encodes the state of the system and two random variables $Y(\omega)$ and $Z(\omega)$, so that the result of the two experiments is equivalent to knowing the value of either $Y$ or $Z$.
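A possible numerical sketch for the last check (mine; it assumes that experiment 1 reveals only whether the prize is in box 0, while experiment 2 reveals the exact box if the prize is among the opened ones and "not found" otherwise, which is one reading of the hint). Since both observations are deterministic functions of $X$, their mutual information with $X$ is just their entropy.

```python
import math

def h(probs):
    """Entropy in bits of a list of probabilities (zeros skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_open_box0(p0):
    # outcome: prize found in box 0, or not
    return h([p0, 1 - p0])

def info_open_upper_half(p0, n):
    # outcome: the exact box among the n/2 opened ones, or "not found"
    p_each = (1 - p0) / n
    return h([p_each] * (n // 2) + [1 - (n // 2) * p_each])

p0 = 0.99
for n in (100, 270, 272, 1000):
    print(n, round(info_open_box0(p0), 4), round(info_open_upper_half(p0, n), 4))
```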
5. Entropy for continuous variables
(See Chapter 8 of [1]. In the following, $\log$ means $\log_2$.)
The generalization of the concept of entropy to continuous variables is problematic. Indeed, imagine that Alice asks Bob about the price $X$ of his car. Even if she knows the a priori pdf $p(x)$ of $X$, how many binary questions does she need to pose? Yet, even if the car retailer were really charging any price $X \in \mathbb{R}$, Alice may be happy to know $X$ to a pre-assigned precision $\Delta$. So imagine that Alice quantizes the random variable $X$ into the random variable $X^\Delta$ that takes values $x_i$, for all integers $i = 0, 1, 2, \dots$, defined by
(10)  $$p(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} dx\, p(x),$$
where we define $p(x_i)\Delta = P\{X^\Delta = x_i\}$. She can now give a precise estimate of the information content of Bob's answer, which is $H[X^\Delta]$. Note that
(11)  $$H[X^\Delta] = -\sum_i p(x_i)\Delta\,\log[p(x_i)\Delta]$$
(12)  $$\simeq -\sum_i \int_{i\Delta}^{(i+1)\Delta} dx\, p(x)\log p(x_i)\; -\; \log\Delta$$
(13)  $$\simeq h[X] - \log\Delta,$$
(14)  $$h[X] = -\int dx\, p(x)\log p(x),$$
where the approximation gets more and more precise as $\Delta \to 0$. Here $h[X]$ is called the differential entropy and its meaning is that $h[X] - \log\Delta$ is the amount of information needed to specify $X$ to a precision $\Delta$. Notice that $h[X]$ need not be positive. For example, a uniform random variable $X \in [0, a]$ has $h[X] = \log a$, which is negative if $a < 1$.


If $a = 1/8$ and you want to determine $X$ up to the $n$-th binary digit (i.e. $\Delta = 2^{-n}$), you will need $n - 3$ bits, because the first three bits will be zero anyhow.
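A quick numerical check (my sketch) of the relation $H[X^\Delta] \simeq h[X] - \log\Delta$ for a standard Gaussian, whose differential entropy is $\frac{1}{2}\log_2(2\pi e)$:

```python
import math

def gaussian_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def quantized_entropy(delta, x_max=10.0):
    """Entropy in bits of a standard Gaussian quantized on bins of width delta."""
    H, i = 0.0, -int(x_max / delta)
    while i * delta < x_max:
        p = gaussian_pdf((i + 0.5) * delta) * delta   # bin probability ~ pdf(midpoint) * delta
        if p > 0:
            H -= p * math.log2(p)
        i += 1
    return H

h_exact = 0.5 * math.log2(2 * math.pi * math.e)
for delta in (1.0, 0.1, 0.01):
    print(delta, round(quantized_entropy(delta), 4), round(h_exact - math.log2(delta), 4))
```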
One property of the entropy that we used is that $H[X]$ does not actually depend on what values $X$ takes. It only depends on the values of the probabilities $p_x = P\{X = x\}$. In particular, if we perform a bijective transformation $X \to Y = f(X)$, i.e. such that to every possible value of $X$ there corresponds one and only one value of $Y$, then $H[X] = H[Y]$.
This is not true for the differential entropy, because if $Y = f(X)$, even when $f(x)$ is monotonic, and hence to every $X$ there corresponds one and only one $Y$, the pdf transforms as $p_Y(y) = p_X(x)/|f'(x)|_{x = f^{-1}(y)}$. Therefore
(15)  $$h[Y] = h[X] + E[\log|f'(X)|];$$
in other words, the differential entropy is not reparametrization invariant. A simple application of this is that, if $a$ is a constant, then $h[X + a] = h[X]$ and $h[aX] = h[X] + \log|a|$.
Exercise: compute the differential entropy for a Gaussian with mean $\mu$ and variance $\sigma^2$, for an exponential distribution $p(x) = ae^{-ax}$, $a, x > 0$, and for a multi-dimensional Gaussian with mean $\vec\mu$ and covariance $\mathrm{Cov}[X_i, X_j] = A_{i,j}$.
The Kullback-Leibler divergence (or relative entropy) generalizes to continuous variables as
(16)  $$D_{KL}[p\|q] = \int dx\, p(x)\log\frac{p(x)}{q(x)}.$$
Contrary to the differential entropy, the relative entropy is reparametrization invariant: if $p$ and $q$ represent two possible distributions for the random variable $X$, their divergence remains the same if one changes parametrization, $Y = f(X)$. As for discrete variables, it is easy to see that $D_{KL}[p\|q] \ge 0$, with equality holding only if $p = q$ (apart from sets of measure zero).
In the same way, one can define the mutual information $I[X, Y]$ between continuous variables as
(17)  $$I[X, Y] = D_{KL}[p(x, y)\|p(x)p(y)], \qquad p(x) = \int dy\, p(x, y), \quad p(y) = \int dx\, p(x, y).$$
This implies that $I[X, Y] \ge 0$, with equality if and only if $X$ and $Y$ are independent.
5.1. Asymptotic Equipartition Property (AEP) for continuous distributions. The AEP generalizes to continuous variables as well. Let $X_1, \dots, X_n, \dots$ be a sequence of i.i.d. random variables with pdf $p(x)$ and finite $E[X_i] = \mu$. Just as for discrete variables, the law of large numbers implies that
(18)  $$-\frac{1}{n}\log p(X_1, \dots, X_n) \to h[X] \quad \text{a.s.},$$
which means that a.s. all sequences $X_1, \dots, X_n$ have the same probability.
Exactly as in the previous case, for any $\epsilon > 0$ we can define $\epsilon$-typical sequences as
(19)  $$A_\epsilon^{(n)} = \left\{(X_1, \dots, X_n) : \left|-\frac{1}{n}\log p(X_1, \dots, X_n) - h[X]\right| < \epsilon\right\}$$

and again, by definition, $P\{(X_1, \dots, X_n) \in A_\epsilon^{(n)}\} > 1 - \epsilon$ if $n$ is large enough. The only difference is that we cannot count the sequences. Rather, for any $A \subseteq \mathbb{R}^n$ we can define the volume
(20)  $$V(A) = \int_A dx_1 \cdots dx_n$$
and we can show that the volume of $\epsilon$-typical sequences is
$$V(A_\epsilon^{(n)}) \approx 2^{n h[X]}.$$
In some cases this does not say much. For example, if $X \in [0, a]$ is uniform, then $V(A_\epsilon^{(n)}) \approx a^n$, which is the volume of all sequences in $[0, a]^n$.⁵ This is useful, for example, for Gaussian variables, as it says that all sequences of $n$ i.i.d. Gaussian variables with zero mean and variance $\sigma^2$ will (almost surely) be in a volume $\approx (2\pi e\sigma^2)^{n/2}$ around the origin.
6. Distributions of maximal entropy
Now that we have a quantitative notion of information, we can address the problem of finding distributions that are consistent with a given state of knowledge. Consider the case of a discrete random variable $X \in \Omega$ for which we assume the expectations of some functions $f_k(X)$ are known to take given values,
(21)  $$E[f_k(X)] = \sum_{x \in \Omega} p_x f_k(x) = F_k, \qquad k = 1, \dots, K.$$
Then the distribution that encodes this, and only this, information is given by the one that maximizes the entropy, subject to these constraints. This implies we have to solve the problem
(22)  $$\max_{p, \lambda}\left\{-\sum_x p_x\log p_x + \sum_k \lambda_k\Big[\sum_x p_x f_k(x) - F_k\Big] + \lambda_0\Big[\sum_x p_x - 1\Big]\right\}.$$
The solution is
(23)  $$p_x = \frac{1}{Z(\lambda)}\, e^{\sum_k \lambda_k f_k(x)},$$
where $Z(\lambda)$ ensures normalization and the values of $\lambda$ should be adjusted in such a way that Eqs. (21) are satisfied, i.e.
(24)  $$F_k = \frac{\partial\log Z}{\partial\lambda_k}, \qquad k = 0, 1, \dots, K,$$
where we set $F_0 = 1$.
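A minimal numerical sketch (mine) of how the multiplier can be fixed in practice, for a single constraint: a die ($x = 1, \dots, 6$) with a prescribed mean $E[X] = F_1$; the sign conventions follow Eqs. (23)-(24) above, and the target value 4.5 is an arbitrary choice.

```python
import math

xs = list(range(1, 7))   # outcomes of a die
F1 = 4.5                 # prescribed mean E[X]

def mean_at(lam):
    """Mean of the maximum-entropy distribution p_x proportional to exp(lam * x)."""
    weights = [math.exp(lam * x) for x in xs]
    Z = sum(weights)
    return sum(x * w for x, w in zip(xs, weights)) / Z

# the mean is increasing in lam, so a simple bisection fixes the multiplier
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_at(mid) < F1:
        lo = mid
    else:
        hi = mid

lam = (lo + hi) / 2
Z = sum(math.exp(lam * x) for x in xs)
print(round(lam, 4), [round(math.exp(lam * x) / Z, 4) for x in xs])
```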
There is a precise sense in which this is the correct choice for the probability of $X$, as explained in [6]: if you take $M \gg 1$ samples from this distribution, by the law of large numbers you expect that the sample averages of the functions $f_k(X)$ will be very close to the expected values $F_k$.
⁵This is not surprising, given that, as you'll see below, the uniform distribution is the one of maximal entropy for variables $X \in [0, a]$.


In other words, if $m_x$ is the number of times that the outcome $x$ is observed, then
(25)  $$\bar f_k = \frac{1}{M}\sum_x m_x f_k(x) = F_k, \qquad k = 1, \dots, K,$$
and among the samples with this property, there are
$$W(\vec m) = \frac{M!}{\prod_x m_x!}$$
ways to get a particular set of frequencies $m_x/M$. Those with frequencies $m_x/M \simeq p_x$ close to the probability of Eq. (23) outnumber the others. This is easily shown by Stirling's approximation⁶.
This discussion generalizes in a straightforward manner to continuous variables $X$ with pdf $p(x)$. You can see for example that the distribution of maximal entropy for $X \in [0, \infty)$ with $E[X] = \mu$ is the exponential, and the distribution of maximal entropy for $X \in \mathbb{R}$ with $E[X] = \mu$ and $V[X] = \sigma^2$ is the Gaussian.
The only problem is that Eq. (23) may not be normalizable. For example, what is the distribution of maximal entropy for $X \in \mathbb{R}$ with $E[X] = \mu$? And what is the distribution of maximal entropy with $E[X] = \mu$, $V[X] = \sigma^2$ and $E[X^3]$ fixed? See [1] Chapter 12.
6.1. Sufficient statistics. (See [1], Section 2.10.)
Consider the typical problem in science: given a system, you can do experiments and derive a series $\vec X = (X_1, \dots, X_n)$ of (independent) measurements of some of its properties. On the other hand, you can work out a theory, which in general takes the form of a generative model for the measurements $\vec X$, given some parameter $\theta$. You'd like to infer the value of $\theta$ from $\vec X$. The way to do that is to derive a particular combination $T(\vec X)$ of the variables which provides information about $\theta$. The situation is that $\vec X$ is generated from $\theta$ and $T(\vec X)$ from $\vec X$, i.e. $\theta \to \vec X \to T(\vec X)$. This means that, conditional on $\vec X$, $T$ and $\theta$ are independent. By the data processing inequality, $T(\vec X)$ cannot provide more information than the information that $\vec X$ contains about $\theta$, i.e.
$$I(T(\vec X), \theta) \le I(\vec X, \theta).$$
$T(\vec X)$ is called a sufficient statistic for $\theta$ if this holds as an equality: $I(T(\vec X), \theta) = I(\vec X, \theta)$. Equivalently, $T(\vec X)$ is a sufficient statistic for $\theta$ if $\theta \to T(\vec X) \to \vec X$, i.e. if, conditional on $T$, $\theta$ and $\vec X$ are independent.
As a simple example, in the case of Bernoulli trials, the number of successes in $n$ trials is a sufficient statistic for $p$. Indeed, the probability $P\{\vec X | k\}$ of any string $\vec X$ of $n$ trials with $k$ successes is the same, independent of $p$.
⁶The inverse problem is non-trivial and very interesting. Imagine you are given a sample $\vec m$ from an unknown distribution. How can you find what are the quantities $f_k(X)$ that you should constrain in the maximum entropy distribution? Apparently, if the sample average $\bar f_\ell$ of a quantity that has not been included in the constraints differs substantially from the expected value it takes on the distribution of Eq. (23), then $f_\ell(x)$ should be added to the constraints.


For any prior distribution of $p$, it is easy to find that $I(\vec X, p) = I(k, p)$ by a direct calculation. Notice that the entropy of the distribution of $\vec X$ is
$$H(\vec X) = n H(p), \qquad H(p) = -p\log p - (1-p)\log(1-p),$$
proportional to $n$, whereas the entropy of $k$ is bounded above by $\log(n+1)$⁷. Therefore most of the information in the sample $\vec X$ is irrelevant, and only a tiny fraction contains relevant information on the generative model (in this case).
When the distribution of the data $\vec X$ depends on $\theta$ only through some combination $T(\vec X)$, i.e. $p(\vec X|\theta) = F[T(\vec X), \theta]$, then it is easy to see that $I(\vec X, \theta) = I(T(\vec X), \theta)$. Therefore $T(\vec X)$ is a sufficient statistic for $\theta$.
It is not true, in general, that any parametric distribution $p(x|\theta)$ admits a sufficient statistic for $\theta$. In general it is not true that the information on the generative model can be condensed into a few empirical averages.
The distributions of maximal entropy discussed above are special in that the probability of a sample $(x_1, \dots, x_n)$,
$$p(x_1, \dots, x_n|\vec\lambda) \propto \exp\left\{n\sum_k \lambda_k\,\frac{1}{n}\sum_{i=1}^{n} f_k(x_i)\right\},$$
depends on the data only through the empirical averages of the $f_k(x)$. Therefore these averages contain all the information that is needed to identify the parameters of the generative model. All other information in the sample is noise.
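A small numerical check (my sketch) of the Bernoulli example: with a discretized prior on $p$ (a uniform weight on a few grid values, an arbitrary choice), the mutual informations $I(\vec X, p)$ and $I(k, p)$ coincide.

```python
import math
from itertools import product

n = 5
ps = [0.1, 0.3, 0.5, 0.7, 0.9]           # grid of values of p with uniform prior weight
prior = [1.0 / len(ps)] * len(ps)

def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), pr in joint.items():
        pa[a] = pa.get(a, 0.0) + pr
        pb[b] = pb.get(b, 0.0) + pr
    return sum(pr * math.log2(pr / (pa[a] * pb[b]))
               for (a, b), pr in joint.items() if pr > 0)

joint_xp, joint_kp = {}, {}
for w, p in zip(prior, ps):
    for x in product((0, 1), repeat=n):
        k = sum(x)
        pr = w * p ** k * (1 - p) ** (n - k)
        joint_xp[(x, p)] = pr
        joint_kp[(k, p)] = joint_kp.get((k, p), 0.0) + pr

print(mutual_information(joint_xp), mutual_information(joint_kp))  # the two values coincide
```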
References
[1] Thomas M. Cover, Joy A. Thomas, Elements of Information Theory (J. Wiley & Sons, 2006).
[2] B. V. Gnedenko, Theory of Probability (CRC Press, 1998).
[3] W. Feller, An Introduction to Probability Theory and its Applications (J. Wiley & Sons, 1968).
[4] J. Galambos, Introductory Probability Theory (Dekker, 1984).
[5] E. T. Jaynes, Probability Theory: the Logic of Science (Cambridge U. Press, 2003).
[6] IEEE Transactions on Systems Science and Cybernetics, 4, 227 (1968). http://users.ictp.it/~marsili/teaching
[7] Marinari, Parisi, Trattatello di probabilità (on my web page).
[8] C. W. Gardiner, Handbook of Stochastic Methods (Springer-Verlag, 1985).
[9] W. Bialek, Biophysics: Searching for Principles (Princeton University Press, 2012).

⁷For $n$ large, $H(k) \simeq \log[2\pi e\, n p(1-p)]/2$.
