This material can be found in Feller Chapters VIII, IX and X and in [2] chapter VI. Here
I give a more concise discussion of these issues. You can go back to Feller and Gnedenko
for a more extended discussion.
Question: what are the limit laws of probability for events or random variables which involve $N$ events $A_1, \ldots, A_N$ or random variables $X_1, \ldots, X_N$ as $N \to \infty$?

1. 0–1 laws

There are combinations of many events that occur almost surely (a.s.)^1. 0–1 laws concern events $E_N$ which depend on $N$ events or random variables, and state that, under some conditions, $P(E_N) \to 0$ or $1$ as $N \to \infty$. For example, one can show (using a theorem by Kolmogorov) that if $X_1, \ldots, X_n, \ldots$ is a sequence of independent random variables, then the event $A = \{\sum_i f(X_i) \text{ converges}\}$ has probability either zero or one. Here we only want to focus on one result:
1.1. Borel–Cantelli lemma. Let $A_1, A_2, \ldots$ be an infinite sequence of events. If
$$\sum_{j=1}^{\infty} P(A_j) < +\infty$$
then almost surely at most a finite number of the events occur. This means that the event
$$E = \bigcap_{j=1}^{\infty} A_{k_j}$$
has probability zero for any subsequence $k_j$ of integers. In other words^2
$$P(A_j \ \text{i.o.}) = 0$$
Proof: $\forall N > 0$
$$E \subseteq \bigcup_{j=N}^{\infty} A_j$$
but then
$$P(E) \le P\Big(\bigcup_{j=N}^{\infty} A_j\Big) \le \sum_{j=N}^{\infty} P(A_j) \to 0$$
^1 An event occurs almost surely (a.s.) if its probability is equal to one. "Almost" refers to the fact that the complement of this event need not be the empty set.
^2 Events in an infinite sequence $A_n$, $n = 1, 2, 3, \ldots$ are said to occur infinitely often (i.o.) if $A_n$ occurs for an infinite subsequence of indices $n$.
as $N \to \infty$. Note that no independence is needed.
Example: returns to the origin of a random walk. Let $S_n = \sum_{i=1}^n X_i$ with $P(X_i = +1) = p = 1 - P(X_i = -1)$, the $X_i$ independent. Then
$$P(A_n) = P(S_{2n} = 0) = \binom{2n}{n} [p(1-p)]^n \simeq \frac{1}{\sqrt{\pi n}}\, e^{-a(p) n}, \qquad a(p) = -\log[4 p(1-p)]$$
Then for all $p \ne 1/2$, $a(p) > 0$ and the BC lemma applies, i.e. a biased random walker returns to the origin only a finite number of times. In order to deal with the case $p = 1/2$ we need a converse of BC.
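The two regimes can be checked numerically (this sketch is an illustration, not part of the notes): the partial sums of $\sum_n P(S_{2n} = 0)$ stay finite for $p \ne 1/2$ and grow without bound for $p = 1/2$. The binomial coefficient is evaluated in log space to avoid overflow.

```python
from math import lgamma, log, exp

def p_return(n, p):
    """P(S_{2n} = 0) = C(2n, n) [p(1-p)]^n, computed in log space."""
    return exp(lgamma(2 * n + 1) - 2 * lgamma(n + 1) + n * log(p * (1 - p)))

# Partial sums of sum_n P(S_{2n} = 0): finite for p != 1/2 (so BC applies
# and returns occur finitely often a.s.), divergent for p = 1/2.
s_biased = sum(p_return(n, 0.3) for n in range(1, 2001))
s_fair = sum(p_return(n, 0.5) for n in range(1, 2001))
print(f"p = 0.3: {s_biased:.3f}   p = 0.5: {s_fair:.3f}")
```

For $p = 0.3$ the series converges (its exact value is $1/\sqrt{1 - 4p(1-p)} - 1 = 1.5$), while for $p = 1/2$ the partial sum grows like $2\sqrt{N/\pi}$.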
1.2. The converse of the Borel–Cantelli lemma. If the $A_k$ are independent and
$$\sum_{j=1}^{\infty} P(A_j) = +\infty$$
then almost surely an infinite number of the events occur. This means there exists a subsequence $k_j$ of indices such that the event
$$E = \bigcap_{j=1}^{\infty} A_{k_j}$$
has $P(E) = 1$. In other words
$$P(A_j \ \text{i.o.}) = 1$$
Proof: $A_j$ does not occur i.o. if and only if, for some $n$, none of the $A_k$ occurs for $k \ge n$, i.e.
$$E^c = \bigcap_{k=n}^{\infty} A_k^c$$
Hence
$$P(E^c) = \prod_{k=n}^{\infty} P(A_k^c) = \prod_{k=n}^{\infty} [1 - P(A_k)] \le \exp\Big\{-\sum_{k=n}^{\infty} P(A_k)\Big\} = 0$$
c.v.d.
Example: $A_k = \{SSS \text{ at trials } 3k-2,\, 3k-1,\, 3k \text{ of a sequence of Bernoulli trials}\}$. Or $A_k = \{X_k = X_k^{\rm (Hamlet)}\}$, where $X_k^{\rm (Hamlet)}$ is the $k$-th bit in the binary representation of Shakespeare's Hamlet.
Random walks: for a Bernoulli trial with $p = 1/2$,
$$S_n = \sum_{k=1}^n X_k, \qquad P\{X_k = \pm 1\} = \frac{1}{2}, \qquad P(A_n) \simeq 1/\sqrt{n}$$
In $d$ dimensions, let
$$A_n^{(d)} = \{S_n^{(1)} = 0, \ldots, S_n^{(d)} = 0\}, \qquad P(A_n^{(d)}) \simeq n^{-d/2}.$$
For $d > 2$ the random walk returns to the origin at most a finite number of times, i.e. it is transient. For $d \le 2$ the random walk is recurrent, i.e. $A_n$ occurs i.o., but that's harder to show.
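The $n^{-d/2}$ scaling above can be illustrated numerically. As a toy model (an assumption of this sketch, not the notes' setup), take a $d$-dimensional walk whose coordinates move as independent symmetric one-dimensional walks, so that $P(A_{2n}^{(d)}) = P_1(S_{2n} = 0)^d \simeq (\pi n)^{-d/2}$; the expected number of returns then diverges for $d \le 2$ and stays finite for $d = 3$.

```python
from math import lgamma, log, exp

def log_p0_1d(n):
    """log P(S_{2n} = 0) for the symmetric one-dimensional walk."""
    return lgamma(2 * n + 1) - 2 * lgamma(n + 1) - 2 * n * log(2)

# Expected number of returns up to time 10000 for each dimension d:
# grows like sqrt(N) for d = 1, like log N for d = 2, converges for d = 3.
sums = {d: sum(exp(d * log_p0_1d(n)) for n in range(1, 5001)) for d in (1, 2, 3)}
for d, s in sums.items():
    print(f"d = {d}: expected returns: {s:.3f}")
```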
Refinement of the BC lemma: we can relax the independence assumption and replace it by the milder one
$$\lim_{N \to \infty} \frac{\sum_{1 \le i < j \le N} P(A_i \cap A_j)}{\big[\sum_{1 \le i \le N} P(A_i)\big]^2} = \frac{1}{2}$$
In physics this manifests itself, for example, in the fact that thermodynamic quantities such as the free energy of disordered materials attain values which are sample independent. Indeed the energy density can be considered to be a sum of many local terms, which may fluctuate e.g. because of impurities. The quantities which satisfy laws of large numbers in physics are called self-averaging.
Typically, the Law of Large Numbers is used when we want to estimate expected values of random variables
(1)
$$E[f(X)] = \sum_{\omega \in \Omega} p_\omega f[X(\omega)]$$
and the way we do it is to take $T$ samples $\omega_t$, $t = 1, \ldots, T$, that in the best of all possible worlds can be thought of as independent draws from the distribution $p_\omega$. Then we compute the mean and argue that
(2)
$$\frac{1}{T} \sum_{t=1}^{T} f[X(\omega_t)] \simeq E[f(X)].$$
This is generically what goes under the name of Law of Large Numbers.
Now, at first sight, this can't be true! After all, the number of independent samples $T$ cannot be very large, whereas the sample space $\Omega$ can consist of a very large number of elements; at worst it can even be an infinite or a continuous set!
There are simple cases where you can imagine $|\Omega|$ to be astronomically large. The classical example in physics is a gas, where in principle $\omega$ should specify the position and velocity of each and every particle. How can a handful of measurements $\omega_1, \ldots, \omega_T$ allow us to compute expected values? So, if Eq. (2) is true, something quite peculiar must happen!
Indeed, the peculiar thing which happens is that particular combinations of sampled random variables are not very random: their probability distribution concentrates on a small neighbourhood of a single point.
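A minimal sketch of Eq. (2) in action (the die example is ours, chosen only for illustration): the sample mean concentrates around the exact expected value as $T$ grows.

```python
import random

random.seed(0)

def mc_expectation(f, sampler, T):
    """Estimate E[f(X)] as in Eq. (2), from T independent samples."""
    return sum(f(sampler()) for _ in range(T)) / T

# A fair die: E[X] = 3.5 exactly; the sample mean concentrates around it.
die = lambda: random.randint(1, 6)
estimates = {T: mc_expectation(lambda x: x, die, T) for T in (10, 1000, 100000)}
print(estimates)
```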
2.1. Different types of limit. [From C. W. Gardiner, Handbook of Stochastic Methods (Springer-Verlag, 1985), Chap. 2]
Before proceeding, it's important to state clearly what we mean by limit. Given a sequence of random variables $X_n(\omega)$, there are different senses in which the statement $X_n(\omega) \to X(\omega)$ can be interpreted, because we're talking about convergence of functions.
Almost certain convergence: for fixed $\omega$, the statement $X_n(\omega) \to X(\omega)$ reduces to the ordinary convergence of a sequence of numbers. $X_n \to X$ almost certainly if
$$P\{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\} = 1$$
Convergence in probability: $X_n \to X$ in probability if, for all $\epsilon > 0$,
$$\lim_{n \to \infty} P\{|X_n - X| > \epsilon\} = 0$$
Mean square convergence: $X_n \to X$ in mean square if
$$\lim_{n \to \infty} E[(X_n - X)^2] = 0$$
2.2. The weak law of large numbers (WLLN). If $X_1, \ldots, X_n$ are independent random variables with $E[X_i] = \mu$ and $V[X_i] = \sigma^2 < +\infty$, then the sample mean $\frac{1}{n}\sum_{i=1}^n X_i$ converges to $\mu$ in probability.
Proof: first note Chebyshev's inequality,
$$V[X] = \int dx\, p(x)(x - \mu)^2 \ge \int_{|x - \mu| \ge a} dx\, p(x)(x - \mu)^2 \ge a^2 P\{|X - \mu| \ge a\}$$
so that
$$P\{|X - \mu| \ge a\} \le \frac{V[X]}{a^2}$$
By independence,
$$V\Big[\frac{1}{n}\sum_{i=1}^n X_i\Big] = \frac{1}{n^2}\sum_{i=1}^n V[X_i] = \frac{\sigma^2}{n}$$
hence
$$P\Big\{\Big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big| \ge a\Big\} \le \frac{\sigma^2}{n a^2} \to 0$$
as $n \to \infty$. Of course the same proof shows that the WLLN holds whenever the variables $X_i$ are uncorrelated, or e.g. when the correlation $E[(X_i - \mu)(X_j - \mu)]$ decays fast enough with $|i - j|$, as for example when $i$ is a time index.
Actually Khinchine showed that the WLLN applies for i.i.d. variables as long as $|E[X_i]| < +\infty$ (see [2]).
2.3. The strong law of large numbers (SLLN). A sequence $X_1, \ldots, X_n, \ldots$ of independent random variables with $E[X_i] = \mu$ satisfies the SLLN if, for all $a, \delta > 0$, there is an $N(a, \delta)$ such that
$$P\Big\{\Big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big| < a, \ \forall n > N(a, \delta)\Big\} \ge 1 - \delta$$
The SLLN states the almost certain convergence of the mean to the expected value, whereas the WLLN states the convergence in probability. Naively, while the WLLN concerns the limit of the probability of deviations of the sample mean from the average, the SLLN is a statement on the probability of the limit of the deviations, i.e. it says that deviations larger than $a$ can be made as rare as one wishes by taking $n$ large enough.
Sufficient conditions for the SLLN are (see [2]):
- $X_i$ i.i.d. and $E[X_i] = \mu < +\infty$;
- $E[X_i] = \mu_i < +\infty$ and $V[X_i] = \sigma_i^2 < +\infty$ with $\sum_k \sigma_k^2 / k^2 < +\infty$ (Kolmogorov).
Example: Bernoulli trials. Let
$$A_n = \Big\{\Big|\frac{S_n}{n} - p\Big| > \epsilon\Big\}, \qquad S_n = \sum_{i=1}^n X_i, \qquad X_i = \begin{cases} 1 & \text{w.p. } p \\ 0 & \text{w.p. } 1 - p \end{cases}$$
Let $B_n$ be the analogous event with $\epsilon$ replaced by $\epsilon_n = \sigma\sqrt{2 a \log n / n}$, where $\sigma^2 = p(1-p)$. By the central limit theorem,
$$P(B_n) \simeq \int_{|x| > \sqrt{2 a \log n}} \frac{dx}{\sqrt{2\pi}}\, e^{-x^2/2} < e^{-a \log n} = n^{-a}$$
Then for $a > 1$
$$\sum_n P(B_n) < +\infty$$
and the Borel–Cantelli lemma ensures us that the SLLN holds (note indeed that $A_n \subseteq B_n$ for $n$ large enough). This result gives quite precise bounds on the excursions of the frequency around the probability.
3. Shannon's theorem and the Asymptotic Equipartition Property

Shannon's theorem is one particularly important application of the law of large numbers. Let $C = (c_1, \ldots, c_L)$ be a message written in an alphabet of $N$ symbols, $c_i \in \{0, 1, \ldots, N-1\}$. Alice wants to send this message to Bob in the most efficient way. One simple way is to compute
$$K(C) = \sum_{i=1}^{L} c_i N^{i-1}$$
and send the binary representation of this number. This takes $L \log_2 N$ bits, i.e. $\log_2 N$ bits per character. Can Alice do better?
If Alice is sending a message in English (and we imagine Bob knows it), a sequence such as $\ldots QBATTX \ldots$ will be very unlikely. Bob is not expecting such sequences. Likewise, imagine that $L \gg 1$ and that characters are drawn independently, but not all characters are equally likely, i.e.
$$p_c = P\{c_i = c\}, \qquad \forall i$$
This distribution is known to both Alice and Bob. What is the optimal coding $O(C)$, i.e. the one using as few bits as possible?
Theorem (Shannon): If the $c_i$ are i.i.d. with distribution $p_c$, then there is a coding function $O(C)$ such that a.s. a message needs at most
$$H_2[p] = -\sum_c p_c \log_2 p_c$$
bits per character. Then one can say that $H_2[p]$ is the information content of $p_c$.
Proof (idea): For any message $C$, let
$$P(C) = \prod_{i=1}^{L} p_{c_i}$$
Let $O(C)$ be the rank of the messages according to their probability $P(C)$. In other words, $O(C)$ is such that for any two messages $C_1, C_2$, $O(C_1) < O(C_2)$ iff $P(C_1) > P(C_2)$, so small numbers $O(C)$ correspond to highly probable messages [if $P(C_1) = P(C_2)$ take $O(C_1) < O(C_2)$ if $K(C_1) < K(C_2)$].
For any function $f_c$, let $\langle f \rangle = \sum_c p_c f_c$ and
$$\bar f(C) = \frac{1}{L} \sum_{i=1}^{L} f_{c_i}$$
Then the Weak Law of Large Numbers (WLLN) implies that, for any $\epsilon > 0$, $P\{|\bar f(C) - \langle f \rangle| > \epsilon\} \to 0$ as $L \to \infty$. Now take $f_c = -\log p_c$; then $\bar f(C) = -L^{-1} \log P(C)$ and $\langle f \rangle = H[p]$. The WLLN implies that, a.s.,
$$e^{-L(H[p] + \epsilon)} \le P(C) \le e^{-L(H[p] - \epsilon)}$$
i.e. that all messages will typically have the same probability, $P(C) \simeq e^{-L H[p]}$. The number of these messages is just the inverse of this probability, i.e. there are $\simeq e^{L H[p]}$ of them. This means that the largest integer I need to encode such messages is of order $O(C) \simeq e^{L H[p]}$. In order to do this I need less than $H_2[p]$ bits per character. It is easy to see that one cannot do better.
In brief, what Shannon's theorem states is that, for $L$ large, the sequences which we're likely to observe are those with a probability $P(C) \simeq e^{-L H[p]}$, and that their number is $\simeq e^{L H[p]}$. We shall call these the typical sequences. All non-typical sequences will occur with a probability which is exponentially small (in $L$) with respect to typical ones, whereas typical sequences all have the same probability. This property is called the Asymptotic Equipartition Property.
A more detailed treatment of this issue is given in [1], chapter 3, which is mandatory reading.
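The concentration behind the AEP can be observed directly. In this sketch (the four-letter alphabet and its probabilities are our own illustrative choice) we draw random messages and check that $-\frac{1}{L}\log_2 P(C)$ concentrates around $H_2[p]$ as $L$ grows:

```python
import random
from math import log2

random.seed(1)
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H2 = -sum(q * log2(q) for q in p.values())  # 1.75 bits per character

def empirical_rate(L):
    """-(1/L) log2 P(C) for a random message C of length L."""
    C = random.choices(list(p), weights=list(p.values()), k=L)
    return -sum(log2(p[c]) for c in C) / L

rates = {L: empirical_rate(L) for L in (10, 100, 10000)}
print(H2, rates)
```

The fluctuations of the empirical rate around $H_2[p]$ shrink like $1/\sqrt{L}$, which is the WLLN at work.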
Exercise: A sports newspaper gives every Friday the probabilities of the outcomes (1, X or 2) of the 13 football matches in the Italian league that are in the schedina, a popular betting scheme in Italy. Take the example where the probabilities of the three different outcomes are 50%, 30% and 20% in any possible order. So the forecast may look like this:

1     X     2
50%   30%   20%
30%   50%   20%
20%   50%   30%
50%   30%   20%
...
Example: a gamble. Consider a game in which, at each round, a bet of $W_0$ is either doubled or multiplied by $q < 1$, each with probability $1/2$:
$$E[W_1] = E[X_1] W_0 = (1 + q/2) W_0, \qquad X_1 = \begin{cases} 2 & \text{w.p. } 1/2 \\ q & \text{w.p. } 1/2 \end{cases}$$
For any $q > 0$ this seems like a convenient game. Indeed, if the game is repeated $n$ times, with $X_i$, $i = 1, \ldots, n$, being i.i.d. random variables distributed as $X_1$, and I invest $W_0$ euros each time, then the LLN ensures that I will get a positive gain per bet, equal to $q/2$ times $W_0$.
Now the gain is clearly proportional to $W_0$, so the more one invests the higher the gain. In particular, the best thing to do is to invest all the capital accumulated, at each time. In this way, the capital after $n$ bets will be $W_n = X_n \cdots X_1 W_0$ and I expect my capital to increase as
$$E[W_n] = (1 + q/2)^n W_0$$
i.e. exponentially. Great!
However, it is easy to show that if $0 < q < 1/2$ then, for all $a > 0$,
$$P\{W_n > a\} \to 0 \quad \text{as } n \to \infty$$
Indeed
$$\frac{1}{n} \log(W_n / W_0) = \frac{1}{n} \sum_{i=1}^{n} \log X_i \to E[\log X_i] = \frac{1}{2} \log(2q)$$
so that, typically,
$$W_n \simeq W_0\, e^{-c n / 2}, \qquad c = |\log(2q)| > 0$$
i.e. the capital will typically vanish exponentially. Here typical means with very high probability, which tends to one as $n \to \infty$. The discrepancy of this result with the behavior of $E[W_n]$ becomes evident if one takes some care in evaluating the expected value
(3)
$$E[W_n] = \sum_{k=0}^{n} \binom{n}{k} 2^{-n} \big[2^k q^{n-k}\big] W_0$$
(4)
$$\simeq n W_0 \int_0^1 dx\, e^{n f(x)}$$
where in the first line we compute the expected value by averaging over the number $k$ of lucky outcomes (the term in $[\ldots]$ is the corresponding gain), and then we use Stirling's formula for the binomial and change the sum over $k$ into an integral over $x = k/n$; the function $f(x)$ is
$$f(x) = -x \log x - (1-x) \log(1-x) + (1-x) \log(q/2).$$
The integral can be performed with the saddle point method, which means that it is dominated by the sequences of outcomes where the fraction $x^*$ of lucky outcomes, derived by imposing $f'(x^*) = 0$, is
$$x^* = \frac{1}{1 + q/2}.$$
Sequences of this type will occur only with exponentially small probability, but when they occur they yield an exponentially large gain, which dominates the calculation of the expected value. By the Asymptotic Equipartition Property, instead, we're guaranteed to see almost surely a typical sequence with a frequency $x = 1/2$, and this yields a gain $W_n^{\rm typical} \simeq (2q)^{n/2} W_0$, which vanishes as $n \to \infty$, because $2q < 1$.
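The gap between the exploding mean and the vanishing typical capital is easy to see in simulation (the value $q = 0.3$ and the run counts are our illustrative choices):

```python
import random

random.seed(2)
q, n, runs = 0.3, 200, 1000

def play():
    """Reinvest all capital for n rounds: double or multiply by q, each w.p. 1/2."""
    W = 1.0
    for _ in range(n):
        W *= 2.0 if random.random() < 0.5 else q
    return W

final = sorted(play() for _ in range(runs))
print("E[W_n] formula:", (1 + q / 2) ** n)  # grows exponentially
print("median run    :", final[runs // 2])  # ~ (2q)^(n/2), collapses to zero
```

The mean is dominated by exponentially rare runs with a fraction $x^* > 1/2$ of lucky outcomes; essentially every simulated trajectory goes broke.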
Problem: The above is a bit frustrating, since a game which seems profitable leads us to bankruptcy. Maybe one should not invest all the capital. Consider the strategy of investing a fraction $\ell \in [0, 1]$ of the capital $W_n$ each time. What is the best value of $\ell$?
For more on this issue, and for its relation to finance, see the discussion in the paper http://arxiv.org/abs/0902.2965
4. Information

Mandatory reading: Chapters 2 and 3 of [1], pages 279–287 of [9].
Suggested: the rest of Chapters 4.1 and 4.2 in [9].
to pose to B in order to know the answer $\omega$. This can be measured in bits. Take for example the case $\Omega = \{a, b, c, d\}$. Then A may ask a first question
Q1: is $\omega \in \{a, b\}$ or not?
and depending on the answer, A may ask
Q2: if $\omega \in \{a, b\}$, is $\omega = a$ or not? Otherwise, if $\omega \notin \{a, b\}$, is $\omega = c$ or not?
The answers to these two questions reveal the correct outcome $\omega$. Hence the information is $H = 2$ bits.
Now imagine that A knows that the answers are not all equally likely, i.e. she knows the prior probability $p_\omega$ of the answers. Can she do better? Imagine that in the previous case $p_a = 1/2$, $p_b = 1/4$ and $p_c = p_d = 1/8$. Then A can modify her questions as follows:
Q1: is $\omega = a$ or not?
Only if $\omega \ne a$ will A need to pose a further question. Then she may ask:
Q2: is $\omega = b$ or not?
Only if the result is no, she will need to ask
Q3: is $\omega = c$ or not?
The number of questions A expects to be needed is
(5)
$$H = p_a \cdot 1 + p_b \cdot 2 + p_c \cdot 3 + p_d \cdot 3 = \frac{7}{4}$$
which is less than 2. So the knowledge of $p_\omega$ carries some information on the answer, which can be quantified (in $1/4$ of a bit, which is what A gains).
Indeed this is the optimal way to elicit the answer, given $p_\omega$, and it is also the optimal way to encode it, i.e. it is the optimal representation. For imagine that A and B have to decide on a code to transform the possible answers into bits, which are then sent over a binary channel. Then the simplest choice would be:
$$a \to 00; \quad b \to 01; \quad c \to 10; \quad d \to 11$$
This will require 2 bits. But, if $p_\omega$ is as above and they know it, A and B can also decide to use a different code:
$$a \to 0; \quad b \to 10; \quad c \to 110; \quad d \to 111$$
In this case they expect that the number of bits which are needed will be $7/4$ on average.
The fact that these two apparently different problems (eliciting and coding information) are the same is evident from the fact that A's choice of the type of questions is in practice deciding a binary coding (yes $\to 1$, no $\to 0$) of the possible answers.
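The comparison between the two codes is a two-line computation:

```python
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
fixed = {"a": "00", "b": "01", "c": "10", "d": "11"}
prefix = {"a": "0", "b": "10", "c": "110", "d": "111"}

def expected_length(code):
    """Average number of bits per answer under the prior p."""
    return sum(p[w] * len(code[w]) for w in p)

print(expected_length(fixed))   # 2.0 bits
print(expected_length(prefix))  # 1.75 bits, i.e. 7/4 as in Eq. (5)
```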
Indeed, the optimal way to elicit the answer, or equivalently the optimal way to encode it, is given by the entropy
(6)
$$H[p] = E_p[\log_2 1/p] = -\sum_\omega p_\omega \log_2 p_\omega$$
of the distribution. For a derivation of this result, see [9]. Note indeed that, in the example above, we used bit strings of length exactly equal to $\log_2 1/p_\omega$.
The entropy quantifies how much B's reply can be surprising for A. Indeed if both A and B know that $p_\omega = \delta_{\omega, a}$, for example, then B's reply is not surprising at all. Actually A doesn't even need to ask, so no bit needs to be exchanged. As (you will see) in statistical mechanics, the entropy quantifies the uncertainty. In this case, it quantifies the uncertainty of Alice about Bob's answer before she hears the answer. After she hears the answer, she knows that one answer occurs with probability one and the others with probability zero, i.e. $H = 0$. Then $H$ measures how much Alice has decreased her degree of uncertainty.
The entropy above is a functional of the distribution $p_\omega$. It can also be defined for a random variable $X(\omega)$ taking discrete values $x \in \mathcal{X}$ as
$$H[X] = -\sum_{x \in \mathcal{X}} p_x \log p_x$$
where $p_x = P\{X(\omega) = x\}$. We'll use the latter definition, which reduces to the former for $X = \omega$.
The entropy is minimal when $X$ is a constant and is maximal when $X$ is maximally uncertain: $p_x = 1/|\mathcal{X}|$. Accordingly $0 \le H[X] \le \log |\mathcal{X}|$.
The entropy can be generalized to any number of random variables $X_1, \ldots, X_n$ in a straightforward fashion. Likewise, we can define conditional entropies $H(X|Y)$ as the entropy of the conditional distribution $p(x|y)$, averaged over $y$. The laws of conditional probability imply that $H(X|Y) = H(X, Y) - H(Y)$ (check!).
4.1. Relative entropy. Imagine now that A has a wrong estimate $q_\omega$ of the probability $p_\omega$ of B's answers $\omega$. How much does this impact the efficiency of the questions she's going to ask?
Given $q$, A is going to effectively encode B's answers in such a way that answer $\omega$ will require $\log_2 1/q_\omega$ bits^3, so the number of questions she needs to ask, on average, is
$$E_p[\log_2 1/q] = -\sum_\omega p_\omega \log_2 q_\omega$$
The difference between this and the most efficient representation, which requires $H[p]$ bits, is
$$D_{KL}(p||q) = \sum_\omega p_\omega \log_2 \frac{p_\omega}{q_\omega}$$
which is known as the Kullback–Leibler divergence. It tells us how much we pay, in terms of bits, for our error in the estimate of the probabilities. In this sense, $D_{KL}$ tells us how far we are from the truth, which is why it is often considered as a distance, though it is not symmetric and it does not satisfy the triangle inequality.
Though it is not evident a priori, $D_{KL}(p||q) \ge 0$ and it vanishes only for $q = p$. The way to prove it (do it!) is to use the concavity of the logarithm, $\log x \le x - 1$, in the definition of $D_{KL}$.
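Reusing the four-answer example from above, the extra cost of coding with a wrong (here: uniform) estimate $q$ is exactly $D_{KL}(p||q)$:

```python
from math import log2

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
q = {w: 0.25 for w in p}  # A's wrong, uniform estimate

H_p = -sum(p[w] * log2(p[w]) for w in p)      # optimal cost: H[p] = 1.75 bits
cost = -sum(p[w] * log2(q[w]) for w in p)     # cost of coding with q: 2 bits
D = sum(p[w] * log2(p[w] / q[w]) for w in p)  # Kullback-Leibler divergence
print(H_p, cost, D)  # cost - H_p equals D = 0.25 bits
```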
^3 This argument assumes that the length of messages can be continuous, whereas it is discrete! Chapter 5 of [1] shows how the derivation can be carried out more precisely, and how optimal data compression that achieves the Shannon bound of $H[p]$ bits per answer can be achieved in long messages. We keep our discussion at a non-rigorous level for the moment, to illustrate the main concepts.
4.2. Mutual information. Imagine you have two random variables $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ with joint distribution $p(x, y)$ and marginals $p(x)$ and $p(y)$^4. One way to quantify their mutual dependence is to compute how much information is lost by assuming that they are independent. This is given by
(7)
$$I(X, Y) = D_{KL}[p(x, y) || p(x) p(y)]$$
(8)
$$= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$$
(9)
$$= H(X) + H(Y) - H(X, Y)$$
and it is called the mutual information between $X$ and $Y$. The last equality, which follows from simple algebra, together with the positivity of $D_{KL}$, implies that $H(X, Y) \le H(X) + H(Y)$.
In order to illustrate the meaning of $I$, consider the following problem. We are interested in estimating a random variable $X$ of which at present we know the distribution $p(x)$, and the corresponding entropy $H(X)$ which quantifies our state of uncertainty about $X$. You can think of $X$ as a parameter of our theory of the world (or as the theory itself, if you allow $X$ to be multi-dimensional).
Now we have the possibility to perform an experiment, i.e. to measure a random variable $Y$, of which we know, before doing the experiment, its distribution. We also know the joint distribution $p(x, y)$ of the two variables. How much information can I expect the experiment will convey on $X$? The reduction in the uncertainty is given by
$$H(X) - H(X|Y) = I(X, Y)$$
as can be shown by a direct calculation. So the mutual information tells us how much we learn, on average, about $X$ if we know $Y$. Note that $I(X, Y) = I(Y, X) = H(Y) - H(Y|X)$. This means that experiments cannot tell us more about theories than how much information our theories contain about experiments.
Another important point is that knowledge of $Y$ reduces a priori the uncertainty on $X$, since $H(X|Y) \le H(X)$, but a posteriori this might not be the case! Take for example:
$Y \backslash X$    1      2
1                   0     3/4
2                  1/8    1/8

Then $H(X) = 0.544$ and $H(X|Y) = 0.25$ bits, but $H(X|Y = 2) = 1$ bit.
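These numbers follow directly from the table (a quick check, with probabilities read off the joint distribution above):

```python
from math import log2

def H(probs):
    """Entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

HX = H([0.125, 0.875])                          # marginal of X: P(X=1)=1/8
HXgY = 0.75 * H([0, 1]) + 0.25 * H([0.5, 0.5])  # average over Y
HXgY2 = H([0.5, 0.5])                           # conditioning on Y = 2
print(round(HX, 3), HXgY, HXgY2)  # 0.544 0.25 1.0
```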
Exercise: Problem 77 of [9]. Let there be $n + 1$ boxes labeled $\omega = 0, 1, \ldots, n$. One of the boxes contains a prize, the others are empty. The probability $p_0$ that the prize is in $\omega = 0$ is much larger than the probability that it is in any other box: $p_0 \gg p_\omega$, and we take $p_{\omega > 0} = (1 - p_0)/n$ for simplicity. Note that, if $n$ is large, $p_0$ can nonetheless also be small. There are two possible options:
1) open the box $\omega = 0$
^4 The abuse of notation of the symbol $p(\cdot)$ follows the notation of [1]. It should be understood that $p(x)$ and $p(y)$ are in principle different functions of their arguments.
where we define $p(x_i) = P\{X = x_i\}$. She can now give a precise estimate of the information content of Bob's answer, which is $H[X_\Delta]$. Note that
(11)
$$H[X_\Delta] = -\sum_i p(x_i) \Delta \log[p(x_i) \Delta]$$
(12)
$$= -\sum_i \int_{i\Delta}^{(i+1)\Delta} dx\, p(x_i) \log p(x_i) - \log \Delta$$
(13)
$$\simeq -\int dx\, p(x) \log p(x) - \log \Delta$$
(14)
$$= h[X] - \log \Delta$$
where the approximation gets more and more precise as $\Delta \to 0$. Here $h[X]$ is called the differential entropy, and its meaning is that $h[X] - \log \Delta$ is the amount of information needed to specify $X$ to a precision $\Delta$. Notice that $h[X]$ need not be positive. For example, a uniform random variable $X \in [0, a]$ has $h[X] = \log a$, which is negative if $a < 1$. If $a = 1/8$ and you want to determine $X$ up to the $n$-th binary digit (i.e. $\Delta = 2^{-n}$), you will need $n - 3$ bits, because the first three bits will be zero anyhow.
One property of the entropy that we used is that $H[X]$ does not actually depend on what values $X$ takes. It only depends on the values of the probabilities $p_x = P\{X = x\}$. In particular, if we perform a bijective transformation $X \to Y = f(X)$, i.e. such that to every possible value of $X$ there corresponds one and only one value of $Y$, then $H[X] = H[Y]$.
This is not true for the differential entropy, because if $Y = f(X)$, even when $f(x)$ is monotonic, and hence to every $X$ there corresponds one and only one $Y$, the pdf transforms as $p_Y(y) = p_X(x)/|f'(x)|_{x = f^{-1}(y)}$. Therefore
(15)
$$h[Y] = h[X] + E[\log |f'(X)|]$$
In other words, the differential entropy is not reparametrization invariant. A simple application of this is that, if $a$ is a constant, then $h[X + a] = h[X]$ and $h[aX] = h[X] + \log |a|$.
Exercise: compute the differential entropy for a Gaussian with mean $\mu$ and variance $\sigma^2$, for an exponential distribution $p(x) = a e^{-a x}$, $a, x > 0$, and for a multi-dimensional Gaussian with mean $\vec\mu$ and covariance ${\rm Cov}[X_i, X_j] = A_{i,j}$.
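As a numerical sanity check for the Gaussian case (not a substitute for the analytic exercise), one can estimate $h[X] = -E[\log p(X)]$ by Monte Carlo and compare with the closed form $\frac{1}{2}\log(2\pi e \sigma^2)$, in nats:

```python
import random
from math import log, pi, e

random.seed(3)
sigma = 2.0

def logp(x):
    """Log density of a centered Gaussian with standard deviation sigma."""
    return -0.5 * log(2 * pi * sigma ** 2) - x ** 2 / (2 * sigma ** 2)

samples = [random.gauss(0.0, sigma) for _ in range(200000)]
h_mc = -sum(logp(x) for x in samples) / len(samples)  # h[X] = -E[log p(X)]
h_exact = 0.5 * log(2 * pi * e * sigma ** 2)
print(h_mc, h_exact)
```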
The Kullback–Leibler divergence (or relative entropy) generalizes to continuous variables as
(16)
$$D_{KL}[p||q] = \int dx\, p(x) \log \frac{p(x)}{q(x)}$$
Contrary to the differential entropy, the KL divergence is reparametrization invariant: if $p$ and $q$ represent two possible distributions for the random variable $X$, their divergence remains the same if one changes parametrization $Y = f(X)$. As for discrete variables, it is easy to see that $D_{KL}[p||q] \ge 0$, with equality holding only if $p = q$ (apart from sets of measure zero).
In the same way, one can define the mutual information $I[X, Y]$ between continuous variables as
(17)
$$I[X, Y] = D_{KL}[p(x, y) || p(x) p(y)], \qquad p(x) = \int dy\, p(x, y), \quad p(y) = \int dx\, p(x, y).$$
This implies that $I[X, Y] \ge 0$.
(19)
$$A_\epsilon^{(n)} = \Big\{X_1, \ldots, X_n : \Big|{-\frac{1}{n} \log p(X_1, \ldots, X_n)} - h[X]\Big| < \epsilon\Big\}$$
In some cases this does not say much. For example if $X \in [0, a]$ is uniform, then $V(A_\epsilon^{(n)}) \simeq a^n$, which is the volume of all sequences in $[0, a]^n$^5. This is useful, for example, for Gaussian variables, as it says that all sequences of $n$ i.i.d. Gaussian variables with zero mean will be in a volume $(2\pi e \sigma^2)^{n/2}$ around the origin.
6. Distributions of maximal entropy

Now that we have a quantitative notion of information, we can address the problem of finding distributions that are consistent with a given state of knowledge. Consider the case of a discrete random variable $X \in \Omega$ for which we assume the expectations of some functions $f_k(X)$ are known to take some values
(21)
$$E[f_k(X)] = \sum_{x \in \Omega} p_x f_k(x) = F_k, \qquad k = 1, \ldots, K.$$
Then the distribution that encodes this, and only this, information is the one that maximizes the entropy subject to these constraints. This implies we have to solve the problem:
(22)
$$\max_{p, \lambda} \Big\{-\sum_x p_x \log p_x + \sum_k \lambda_k \Big[\sum_x p_x f_k(x) - F_k\Big] + \lambda_0 \Big[\sum_x p_x - 1\Big]\Big\}$$
The solution is
(23)
$$p_x = \frac{1}{Z(\lambda)}\, e^{\sum_k \lambda_k f_k(x)}, \qquad F_k = \frac{\partial \log Z}{\partial \lambda_k}, \qquad k = 0, 1, \ldots, K$$
where we set $F_0 = 1$.
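Eq. (23) can be put to work numerically. In this sketch (the six-letter alphabet and the target mean are hypothetical choices of ours) we impose a single constraint $E[X] = F_1$ and solve $F_1 = \partial \log Z / \partial \lambda$ for $\lambda$ by bisection, which works because this derivative is increasing in $\lambda$:

```python
from math import exp

xs = range(6)  # X in {0, ..., 5}, a toy alphabet
F1 = 1.2       # imposed value of E[X]

def mean_at(lam):
    """E[X] under p_x proportional to exp(lam * x), i.e. d log Z / d lam."""
    Z = sum(exp(lam * x) for x in xs)
    return sum(x * exp(lam * x) for x in xs) / Z

# mean_at is increasing in lam (its derivative is the variance of X),
# so solve mean_at(lam) = F1 by bisection.
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if mean_at(mid) < F1 else (lo, mid)
lam = (lo + hi) / 2
print(lam, mean_at(lam))  # the fitted mean matches F1
```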
There is a precise sense in which this is the correct choice for the probability of $X$, as explained in [6]: if you take $M \gg 1$ samples from this distribution, by the law of large numbers you expect that the sample averages of the functions $f_k(X)$ will be very close to
^5 This is not surprising, given that, as you'll see below, the uniform distribution is the one of maximal entropy for variables $X \in [0, a]$.
the expected values $F_k$. In other words, if $m_x$ is the number of times that the outcome $x$ is observed, then
(25)
$$\bar f_k = \frac{1}{M} \sum_x m_x f_k(x) = F_k, \qquad k = 1, \ldots, K$$
and among the samples with this property, there are
$$W(\vec m) = \frac{M!}{\prod_x m_x!}$$
$$I(T(\vec X), \theta) \le I(\vec X, \theta).$$
unknown distribution. How can you find the quantities $f_k(X)$ that you should constrain in the maximum entropy distribution? Apparently, if the sample average $\bar f_\ell$ of a quantity that has not been included in the constraints differs substantially from the expected value it takes on the distribution Eq. (23), then $f_\ell(x)$ should be added to the constraints.
$k$ successes has the same probability, which is independent of $p$. For any prior distribution $\pi(p)$ of $p$, it is easy to find that $I(\vec X, p) = I(k, p)$ by a direct calculation. Notice that the entropy of the distribution of $\vec X$ is
$$H(\vec X) = n H(p), \qquad H(p) = -p \log p - (1-p) \log(1-p)$$
proportional to $n$, whereas the entropy of $k$ is bounded above by $\log(n+1)$^7. Therefore most of the information in the sample $\vec X$ is irrelevant, and only a tiny fraction contains relevant information on the generative model (in this case).
When the distribution of the data $\vec X$ depends on $\theta$ only through some combination $T(\vec X)$, i.e. $p(\vec X | \theta) = F[T(\vec X), \theta]$, then it is easy to see that $I(\vec X, \theta) = I(T(\vec X), \theta)$. Therefore $T(\vec X)$ is a sufficient statistic for $\theta$.
It is not true, in general, that any parametric distribution $p(x|\theta)$ admits a sufficient statistic for $\theta$. In general it is not true that the information on the generative model can be condensed into a few empirical averages.
The distributions of maximal entropy discussed above are special in that the probability of a sample $(x_1, \ldots, x_n)$,
$$p(x_1, \ldots, x_n | \vec\lambda) \propto \exp\Big\{n \sum_k \lambda_k \frac{1}{n} \sum_{i=1}^n f_k(x_i)\Big\}$$
depends on the data only through the empirical averages of the $f_k(x)$. Therefore these averages contain all the information that is needed to identify the parameters of the generative model. All other information in the sample is noise.
References

[1] Thomas M. Cover, Joy A. Thomas, Elements of Information Theory (J. Wiley & Sons, 2006).
[2] B. V. Gnedenko, Theory of Probability (CRC Press, 1998).
[3] W. Feller, An Introduction to Probability Theory and its Applications (J. Wiley & Sons, 1968).
[4] J. Galambos, Introductory Probability Theory (Dekker, 1984).
[5] E. T. Jaynes, Probability Theory: the Logic of Science (Cambridge U. Press, 2003).
[6] IEEE Transactions on Systems Science and Cybernetics, 4, 227 (1968). See also http://users.ictp.it/~marsili/teaching.
[7] Marinari, Parisi, Trattatello di probabilità (on my web page).
[8] C. W. Gardiner, Handbook of Stochastic Methods (Springer-Verlag, 1985).
[9] W. Bialek, Biophysics: Searching for Principles (Princeton University Press, 2012).