Fall 2016
Sinho Chewi
Contents
1 Probability Theory 6
1.1 Probability Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Fundamental Probability Facts . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Generalized Inclusion-Exclusion Principle . . . . . . . . . . . . . . . . 8
1.3 Discrete Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Uniform Sample Space . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Bayes Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.1 Correlated Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5 Markov Chains 58
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Transition of Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 Markov Chain Computations . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.1 Hitting Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.2 Probability of S′ Before S . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4 Long-Term Behavior of Markov Chains . . . . . . . . . . . . . . . . . . . . . 65
5.4.1 Classification of States . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.2 Invariant Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4.3 Convergence of Distribution . . . . . . . . . . . . . . . . . . . . . . . 68
5.4.4 Convergence Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4.5 Balance Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.6 Invariant Distribution & Hitting Times . . . . . . . . . . . . . . . . . 71
6 Continuous Probability I 73
6.1 Continuous Probability: A New Intuition . . . . . . . . . . . . . . . . . . . . 73
6.1.1 Differentiate the CDF . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1.2 The Differential Method . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Continuous Analogues of Discrete Results . . . . . . . . . . . . . . . . . . . 76
6.2.1 Tail Sum Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3 Important Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . 78
6.3.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3.2 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.3 The Minimum & Maximum of Exponentials . . . . . . . . . . . . . . 81
7 Continuous Probability II 83
7.1 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1.1 Law of Total Probability . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1.2 Conditional Density . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Functions of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2.1 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2.2 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.2.3 Ratios of Random Variables . . . . . . . . . . . . . . . . . . . . . . . 86
7.3 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.3.1 Integrating the Normal Distribution . . . . . . . . . . . . . . . . . . . 87
7.3.2 Mean and Variance of the Normal Distribution . . . . . . . . . . . . . 88
7.3.3 Sums of Independent Normal Random Variables . . . . . . . . . . . . 89
7.3.4 Tail Probability Bounds . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.4 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.4.1 Confidence Intervals Revisited . . . . . . . . . . . . . . . . . . . . . . 93
7.4.2 de Moivre-Laplace Approximation . . . . . . . . . . . . . . . . . . . . 94
7.5 Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.6 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.6.1 Flipping Coins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8 Moment-Generating Functions 99
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.1.1 Sums of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . 100
8.1.2 Distribution of Random Sums . . . . . . . . . . . . . . . . . . . . . . 100
8.1.3 Cumulants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.2 Common MGFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.3 Geometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.4 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.5 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3 Probability Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3.1 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9 Convergence 105
9.1 Convergence in Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.2 Almost Sure Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.2.1 Borel-Cantelli Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . 106
9.2.2 Strong Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . 106
9.3 Convergence of Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.4 CLT Proof Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.4.1 Characteristic Functions . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.4.2 Proof Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11 Martingales 120
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
11.2 Martingale Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11.3 Stopping Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
11.4 Martingale Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
References 147
Bibliography 148
Chapter 1
Probability Theory
It turns out that a wide variety of phenomena can be modeled effectively by the following
mathematical objects:
- a set of all possible outcomes of interest, which we call Ω (also known as the probability space or sample space),
- subsets of the probability space, which we call events (sometimes denoted F),
- a function that assigns values to sets, which will be our probability measure P.

Again, with less formality: Ω contains all possible outcomes, and events are sets of possibilities. We discuss events instead of individual outcomes because we are often interested in
more than one outcome at a time. (When we discuss continuous probability, we will also
see that it is not sufficient to only talk about individual outcomes.) Finally, our probability
measure is our way of assigning likelihoods to the events, with higher numbers corresponding
to more likely outcomes.
The question we are primarily interested in right now is: what properties do we want our
probability measure P to satisfy? Let's start with boundary conditions: it is natural to say that the likelihood of nothing happening at all is 0, which we write as:

P(∅) = 0    (1.1)
CHAPTER 1. PROBABILITY THEORY 7
The second condition is by convention: it is extremely useful to restrict the probability values
to lie in the range [0, 1]. This condition can be written as:
P(Ω) = 1    (1.2)
An important consequence of this choice is that we may interpret the probability value of
an event as the long-term proportion that the event occurs. As a first example, consider
the archetypal fair coin toss. Saying that the probability that the coin comes up heads is
1/2 can be interpreted as follows: if we flipped the coin from now until forever (an infinite
number of times), then we would expect a fraction 1/2 of the flips to come up heads. This
is the frequentist view of statistics. Briefly, the other major view of statistics is known as
the subjectivist view, in which the probability value is interpreted as the degree of your be-
lief in the outcome. Regardless of which view you adopt, the laws of probability are the same.
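The frequentist interpretation lends itself to a quick simulation. The sketch below (ours, not part of the original notes; the function name and seed are arbitrary choices) estimates the long-run fraction of heads in repeated fair coin flips:

```python
import random

def empirical_frequency(num_flips, seed=0):
    """Flip a fair coin num_flips times; return the observed fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# As the number of flips grows, the observed fraction settles near 1/2.
for n in (100, 10_000, 1_000_000):
    print(n, empirical_frequency(n))
```

Of course, no finite simulation proves convergence; the precise sense in which the fraction approaches 1/2 is the Law of Large Numbers, treated later.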
Finally, the last probability axiom is countable additivity, which is stated as follows: if for each 1 ≤ i < ∞ we have an event Ai, such that the events Ai are pairwise disjoint (i.e. no two events have an outcome in common), then the probabilities of the events must add, in the following sense:

P(⋃_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai)    (1.3)
In other words, likelihoods add for disjoint events. Perhaps it is not so clear right now why
countable additivity is so fundamental that we make it an axiom, but without this axiom we
would not have much of a theory at all. Before we move on, we can verify a quick consequence of countable additivity: if A1 and A2 are disjoint events, then P(A1 ∪ A2) = P(A1) + P(A2). This is known as finite additivity, and it follows from countable additivity and (1.1) by taking A3 = A4 = ⋯ = ∅. As a check, prove to yourself via induction that if A1, …, An are disjoint events, then finite additivity implies that P(A1 ∪ ⋯ ∪ An) = ∑_{i=1}^n P(Ai). (As a side-remark: observe that countable additivity does not follow from finite additivity alone.)
Instead of boring you with the details, take a look at the picture:

[Venn diagram: two overlapping events A and B]

The proof of this fact involves splitting A ∪ B into disjoint events and applying finite additivity again. As you can see, proofs of these fundamental facts all have a similar flavor: split the event into disjoint events, so that we can apply countable additivity. If we take the inclusion-exclusion principle and drop the last term on the right, then we have a simple inequality: P(A ∪ B) ≤ P(A) + P(B). With induction, we can extend this to the union bound:

P(⋃_{i=1}^n Ai) ≤ ∑_{i=1}^n P(Ai)
Finally, we will prove one more, just to illustrate the concept. Suppose we have events A1 ⊆ A2 ⊆ ⋯ ⊆ A, such that ⋃_{i=1}^∞ Ai = A. We want to show that P(An) → P(A). Define A′i by A′1 = A1 and A′i = Ai \ Ai−1 for i ≥ 2. Then, we can notice the following facts: for all n, ⋃_{i=1}^n A′i = An, and ⋃_{i=1}^∞ A′i = A. Furthermore, the A′i are disjoint, so applying countable additivity here yields

P(A) = ∑_{i=1}^∞ P(A′i) = lim_{n→∞} ∑_{i=1}^n P(A′i) = lim_{n→∞} P(An).    (1.7)
The inner summation ranges over all k-tuples of indices (i1, …, ik) satisfying the condition 1 ≤ i1 < ⋯ < ik ≤ n. In words, the generalized inclusion-exclusion principle prescribes the following rule for calculating P(A1 ∪ ⋯ ∪ An): first, sum each of the individual probabilities P(Ai). Next, subtract the probabilities of all pairwise intersections of events, P(Ai ∩ Aj). Then, add the probabilities of all 3-wise intersections of events, P(Ai ∩ Aj ∩ Ak), and continue with the same pattern (with alternating signs).
One can prove the generalized inclusion-exclusion principle with a tedious proof by induction
on n, the number of events. Instead, we will try to arrive at a better understanding of the
formula and develop a more elegant proof.
First, study the diagram for the case n = 2, which reveals why the inclusion-exclusion principle is necessary: if we sum P(A) and P(B), then we are double-counting the intersection P(A ∩ B). To avoid dealing with overlapping events, we can write A ∪ B as the disjoint union of three events: A ∩ Bᶜ, A ∩ B, and Aᶜ ∩ B. We can write these events as length-2 binary strings: the first digit is 1 if A is included, and the second digit is 1 if B is included. Then, the three disjoint events correspond to the strings 10, 11, and 01, respectively. We have included all length-2 binary strings which contain at least one 1 (the string of all 0s corresponds to Aᶜ ∩ Bᶜ, which is clearly not contained in A ∪ B).
Consider a length-n binary string with exactly m 1s, where m ≥ 1. Call this event B. Let us examine (1.8) and see how many times we count P(B). If we consider one event at a time, then B is contained in exactly m of the events Ai, so P(B) appears m times with a positive sign. If we consider pairwise intersections of events, then B is contained in exactly \binom{m}{2} of the intersections Ai1 ∩ Ai2, and here P(B) appears with a negative sign. Continuing on, if we consider k-wise intersections of events, then P(B) appears \binom{m}{k} times, with the sign (−1)^{k+1}. Therefore, the total number of times that P(B) is counted is

∑_{k=1}^m (−1)^{k+1} \binom{m}{k} = 1 − ∑_{k=0}^m (−1)^k \binom{m}{k} = 1 − (1 − 1)^m = 1,

where the last step uses the binomial theorem. Hence each outcome in the union is counted exactly once.
What we are saying here is that the probability of the event A is the sum of the probabilities
of each outcome in A. Perhaps this is intuitively clear and we are being overly fussy about
the details, but attention to detail will be quite important later in the course.
P(A) = |A| / |Ω|    (1.10)

To compute the probability of the event A, we count the number of ways in which A is achieved, and divide by the total number of elements in Ω.
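As a concrete illustration of (1.10), the probability of an event in a small uniform sample space can be computed by direct enumeration. The two-dice example below is our own sketch; the variable names are arbitrary:

```python
from fractions import Fraction
from itertools import product

# Uniform sample space: all 36 ordered outcomes of rolling two fair dice.
omega = list(product(range(1, 7), repeat=2))

# Event A: the two dice sum to 7.
A = [w for w in omega if w[0] + w[1] == 7]

# P(A) = |A| / |Omega|
p_A = Fraction(len(A), len(omega))
print(p_A)  # 1/6
```

There are 6 favorable outcomes out of 36 equally likely ones, so the count-and-divide recipe gives 1/6.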
The utility of this statement is that P(B) may be more difficult to compute than P(B | Ai),
because for the latter, we have more information: we know that both B and Ai have occurred,
and the knowledge of Ai may aid us in the computation of P (B). This line of reasoning is
also useful for calculating probabilities associated with sequential processes, which can be
represented in a tree structure. Each node of the tree represents a decision, with the differ-
ent branches from a node representing the different possible outcomes from the decision. To
fully analyze such situations, we will need another tool called conditional probability.
Conditional probability is the idea that events can affect each other; for example, observing
that there are more clouds in the sky today informs us that there is a greater chance for
rain. How do we formalize this idea mathematically? Suppose that P represents our beliefs
in the likelihoods of various events (adopting the subjectivist view for a moment), and then
we observe an event A. In a sense, P is outdated information; P is a probability law that
was formulated before we learned about A, and knowing A might change our beliefs. What
we need is a new probability law, which we will call the conditional probability law given the
event A. Since we have a new probability law, we may call it by a different name, such as
PA , but this is not standard. Instead, we write P (B | A), read as the probability of B given
A, to mean the probability assigned to the event B under the new law.
It is important to recognize that the new law is a true probability law, i.e. it must satisfy the
three probability axioms. In addition, we would like the following property: P (A | A) = 1.
This is equivalent to saying that knowing the occurrence of A, we are certain that A has
indeed occurred. These statements are intended to motivate the following definition:
P(B | A) = P(A ∩ B) / P(A)    (1.12)
The interpretation of this equation is that the conditional probability law is formed from
the previous probability law in two stages: first, we restrict our attention only to the events
which are subsets of A; second, we normalize the probability, that is, we divide by P (A) to
ensure that the probability of A itself under this new law is now 1.
The expression above can be written in a more symmetric way, by noting that

P(A | B)P(B) = P(A ∩ B) = P(B | A)P(A).    (1.13)
Often, we can write down conditional probability laws quite easily in a cause-and-effect
fashion, such as the probability that a patient has a fever given that the patient has the
flu. Then, the equation P(A ∩ B) = P(A)P(B | A) states that we can directly apply our cause-and-effect knowledge, by multiplying the unconditional probability P(A) with the conditional probability P(B | A), to obtain the probability that both A and B occur. The rule P(A ∩ B) = P(A)P(B | A) is often known as the chain rule of probability.
Example 1.1 (Birthday Problem I). Suppose that we throw n balls into m bins. What
is the smallest n such that the probability of a collision (two balls landing in the same
bin) is at least 1/2?
When m = 365 and we view the balls as people and the bins as birthdays, the
question becomes: how many people do you need to gather in a room before the proba-
bility that two people share a birthday is at least 1/2? This is commonly known as the
birthday problem.
Let Ai denote the event that ball i does not collide with balls 1, …, i − 1. We have
P (A1 ) = 1 since the first ball cannot produce a collision. We would like to compute
P(A1 ∩ ⋯ ∩ An), the probability that we do not have any collisions. To do so, we will
use the chain rule of probability,
P(A1 ∩ ⋯ ∩ An) = P(A1) ∏_{i=2}^n P(Ai | A1 ∩ ⋯ ∩ Ai−1).    (1.14)
CHAPTER 1. PROBABILITY THEORY 12
Here, P(Ai | A1 ∩ ⋯ ∩ Ai−1) is the conditional probability that ball i does not collide with the first i − 1 balls, given that none of the first i − 1 balls produced a collision. If none of the first i − 1 balls produced a collision, then they must each be in separate bins; therefore, P(Ai | A1 ∩ ⋯ ∩ Ai−1) is the probability of ball i landing in one of the other m − (i − 1) bins, which is (m − i + 1)/m.
P(A1 ∩ ⋯ ∩ An) = (m/m) · ((m − 1)/m) ⋯ ((m − n + 1)/m) = m! / (mⁿ (m − n)!)
Although we have obtained an exact answer, many times it is worth the effort to inves-
tigate if we can approximate the solution in order to better understand its properties.
We rewrite the desired probability in the following way:
P(A1 ∩ ⋯ ∩ An) = 1 · (1 − 1/m) ⋯ (1 − (n − 1)/m) = ∏_{i=0}^{n−1} (1 − i/m)
Next, we can use the approximation e^{−x} ≈ 1 − x to replace the factor 1 − i/m with e^{−i/m}. (1 − x is the first-order series expansion of e^{−x}.)
P(A1 ∩ ⋯ ∩ An) ≈ ∏_{i=0}^{n−1} e^{−i/m} = e^{−∑_{i=0}^{n−1} i/m}
Evaluating the summation, we have ∑_{i=0}^{n−1} i = n(n − 1)/2 ≈ n²/2, so

P(A1 ∩ ⋯ ∩ An) ≈ e^{−n²/(2m)}.
Setting P(A1 ∩ ⋯ ∩ An) = 1/2, we find that n² ≈ (2 ln 2)m, or n ≈ √((2 ln 2)m). If we ignore constants, then in the language of computational complexity, we have m ∈ O(n²).
Here, the balls are inputs and the bins are the number of buckets in our hash
table. What we have found is that if we want the probability of collision between
two inputs to be less than 1/2 (in other words, for distinct inputs i and j, we want P(h(i) ≠ h(j)) ≥ 1/2), then we want the size of our hash table, m, to be roughly n².
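The exact product and the exponential approximation derived above are easy to compare numerically. The following sketch (ours, not from the original notes) evaluates both for m = 365:

```python
import math

def p_no_collision(n, m):
    """Exact P(no collision): product of (m - i)/m for i = 0, ..., n-1."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p

def approx_no_collision(n, m):
    """The approximation exp(-n^2 / (2m)) derived in the example."""
    return math.exp(-n * n / (2 * m))

m = 365
n = math.ceil(math.sqrt(2 * math.log(2) * m))  # the threshold sqrt((2 ln 2) m)
print(n, p_no_collision(n, m), approx_no_collision(n, m))
```

For m = 365 this gives n = 23, the classical birthday answer: with 23 people the exact probability of no shared birthday drops just below 1/2, and the approximation tracks it closely.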
Example 1.2 (Birthday Problem II). Using the union bound, one can obtain the results
of Example 1.1 more easily. Since there are n balls, there are \binom{n}{2} = n(n − 1)/2 different pairs of balls. Let Bi denote the event that the ith pair of balls collides, and write B = ⋃_{i=1}^{n(n−1)/2} Bi as the event of a collision. By the union bound,

P(B) ≤ ∑_{i=1}^{n(n−1)/2} P(Bi) = (n(n − 1)/2) · (1/m) ≈ n²/(2m).
To calculate P (Bi ), we observed that regardless of where the first ball lands, the prob-
ability that the second ball in the pair lands in the same bin as the first ball is 1/m.
Again, if we set P(B) = 1/2, we find that m ∈ O(n²).
Although the union bound may seem crude, it is usually simple to apply and often yields
good results.
We use the Law of Total Probability, (1.11), along with the definition of conditional proba-
bility, to write the following derivation:
P(A | B) = P(A ∩ B) / P(B)
         = P(A ∩ B) / (P(A ∩ B) + P(Aᶜ ∩ B))
         = P(B | A)P(A) / (P(B | A)P(A) + P(B | Aᶜ)P(Aᶜ))
In one sense, we have derived a way to relate P (A | B) to P (B | A), so we have a computa-
tional tool: whenever we have the wrong conditional probabilities for a problem we face,
use Bayes Rule to turn the conditional probabilities around. More generally, if A1 , . . . , An
partition the sample space, then:
P(Ak | B) = P(B | Ak)P(Ak) / ∑_{i=1}^n P(B | Ai)P(Ai),    1 ≤ k ≤ n    (1.15)
The derivation of Bayes Rule was not so difficult, so one may wonder why such a big deal is
made out of the formula. Let us examine another way to view Bayes Rule: given multiple
possible explanations for an observation we have just made, we can discern the most likely
explanation. This is called probabilistic inference, and it concerns problems such as patient
diagnosis, or determining whether a blood test result is a false positive.
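As a concrete instance of such an inference, consider a diagnostic test. All numbers below are hypothetical, chosen only to illustrate the mechanics of Bayes' Rule; the variable names are ours:

```python
# Hypothetical numbers for illustration:
# 1% of patients have the disease; the test detects it with probability 0.95,
# and falsely flags a healthy patient with probability 0.10.
p_d = 0.01          # P(disease)
p_pos_d = 0.95      # P(positive | disease)
p_pos_h = 0.10      # P(positive | healthy)

# Law of Total Probability: P(positive).
p_pos = p_pos_d * p_d + p_pos_h * (1 - p_d)

# Bayes' Rule: P(disease | positive).
p_d_pos = p_pos_d * p_d / p_pos
print(p_d_pos)  # roughly 0.088
```

Even with a fairly accurate test, a positive result here means less than a 9% chance of disease: because the disease is rare, most positives are false positives. This is exactly the "turn the conditional probabilities around" computation.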
1.5 Independence
Speaking of information, what if we observe an event A, but knowing that A occurs tells
us exactly nothing about whether B will occur? In other words, our belief that B occurs is
unchanged after the observation of A. In this case, we say that A and B are independent:
P (B | A) = P (B). Again, this equation may not look symmetric, but we can write the
condition for independence as follows:
P(A ∩ B) = P(A)P(B)    (1.16)
Independence is a truly subtle concept. If we view the independence of A and B as the
information about B you receive from observing A (or vice versa), then we should have that
Ac and B are also independent, and A and B c should also be independent, and so forth.
Pinning down the concept of independence as information requires a more formal setting
than we have described so far, but for now, concepts such as knowledge and information
provide a good intuitive guide. For example, we will make statements such as: if I know the
outcome of one coin toss, I still have no clue what the next coin toss will be, so we say that
coin tosses are independent of each other.
In the above example, you may have noticed that the independence of coin tosses requires
talking about the independence of multiple events, and it is not fully clear what we mean.
In fact, there are different forms of independence: we can say that the events A1 , . . . , An are
pairwise independent if every pair of events is independent. A stronger statement is the
statement of mutual independence, which is tricky to formalize. Let us give it a try: if we
have two sets of events, such that the sets are disjoint, then mutual independence requires
that any combination of events from the first set is independent of any combination of events from the second set. For example, A1 ∩ A3 should be independent of A2 ∩ A5 ∩ A6, and so forth. An equivalent way of stating this condition is:
P(⋂_{i=1}^n A′i) = ∏_{i=1}^n P(A′i)    (1.17)
where each A′i is allowed to be either Ai or Ω. The reason we allow A′i to be Ω is because we want to allow combinations such as A1 ∩ A2 (combinations which do not use every Ai), which we can write as A1 ∩ A2 ∩ Ω ∩ ⋯ ∩ Ω. This is just a technical detail though. Importantly,
you should recognize that mutual independence is stronger than pairwise independence. In
the case of coin tosses, we would like the stronger condition of mutual independence to hold,
because we want to talk about sets of coin tosses as well.
How about an infinite set of events Ai, maybe even uncountably many events? The answer is that we say the Ai are independent if any finite subcollection of the Ai is independent.
Toss two fair coins, and let H1 be the event that the first coin is heads, H2 the event that the second coin is heads, and S the event that the two tosses produced the same outcome (two heads or two tails).
We can check that the three events are pairwise independent. Conditioned on Hi ,
P (S | Hi ) is the probability that the other coin also lands heads, which is 1/2. Hence,
P (S | Hi ) = P (S) = 1/2, so Hi and S are independent. Clearly, H1 and H2 are indepen-
dent, so we have pairwise independence.
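Brute-force enumeration confirms the pairwise independence, and also shows that the triple product fails (H1 and H2 together determine S), so the three events are not mutually independent. The enumeration below is our own sketch:

```python
from fractions import Fraction

# Sample space: two fair coin tosses, four equally likely outcomes.
omega = [(a, b) for a in "HT" for b in "HT"]

def P(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

H1 = lambda w: w[0] == "H"   # first toss heads
H2 = lambda w: w[1] == "H"   # second toss heads
S  = lambda w: w[0] == w[1]  # both tosses agree

# Every pair is independent...
assert P(lambda w: H1(w) and S(w)) == P(H1) * P(S)
assert P(lambda w: H2(w) and S(w)) == P(H2) * P(S)
assert P(lambda w: H1(w) and H2(w)) == P(H1) * P(H2)
# ...but the triple intersection has probability 1/4, not 1/8,
# so mutual independence fails.
assert P(lambda w: H1(w) and H2(w) and S(w)) != P(H1) * P(H2) * P(S)
```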
Example 1.4. One might ask if P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3) alone is sufficient for mutual independence. The answer is no.
Consider the probability space of a fair die, = {1, 2, 3, 4, 5, 6}, and define the following
events: A1 = A2 = {1, 2, 3} and A3 = {3, 4, 5, 6}. We have

P(A1 ∩ A2 ∩ A3) = P({3}) = 1/6 = P(A1)P(A2)P(A3),
but clearly A1 and A2 are not independent. Therefore, when checking for mutual inde-
pendence of events Ai , one must check that (1.17) holds for every subset of the Ai .
Intuitively, if P(A ∩ B) is large relative to P(A)P(B), then events A and B tend to occur together; another way to see this is that P(A ∩ B) > P(A)P(B) implies that P(A | B) > P(A) and P(B | A) > P(B). Hence, observing one of A or B increases the likelihood of observing the other. If P(A ∩ B) is small relative to P(A)P(B), then observing one of A or B will decrease the likelihood of observing the other; an extreme example is when A and B are mutually exclusive (i.e. disjoint).
Chapter 2

Discrete Random Variables
Random variables are the tool of choice for modeling probability spaces for which the out-
comes are associated with a numerical value of interest. For instance, stock market prices
and the daily temperature are two examples of random numbers in our lives. Each random
variable has an associated distribution which describes its behavior; we will introduce com-
mon families of discrete distributions. We will see that the expectation of a random variable
is a useful summary statistic of its distribution which satisfies a crucial property: linearity.
X(HH) = 2
X(HT ) = 1
X(T H) = 1
X(T T ) = 0
CHAPTER 2. DISCRETE RANDOM VARIABLES 17
(Here, X represents the number of heads that we see in the two coin tosses.) What is the
probability that X takes on a particular value? We can write:
P (X = 0) = P ({T T }) = 1/4
P (X = 1) = P ({HT, T H}) = 1/2
P (X = 2) = P ({HH}) = 1/4
We can reasonably say that X assigns the probability 1/4 to the real number 0, the proba-
bility 1/2 to the real number 1, and the probability 1/4 to the real number 2. In this way, we
see that X induces a probability measure on the real line, which we call the distribution
of X. The distribution of X satisfies the probability axioms; for example, (1.2) applied to
the distribution of X is the equation:
∑_x P(X = x) = 1    (2.1)
In the example above, we were explicit about the probability space in order to show that
X is a function from to R. However, the utility of random variables is that we can often
forget about the underlying probability space and focus our attention on the distribution
of X. For a discrete random variable X, we can specify the distribution of X simply by
giving the probabilities P (X = x) for all x in the range of X, without reference to the
original probability space . Indeed, that is what we will proceed to do from here onwards.
There are many common distributions (described in detail below), which have special names.
We use the symbol ∼ to denote that a random variable has a known distribution, e.g. X ∼ Bin(n, p) indicates that the distribution of X is the Bin(n, p) distribution.
P(X = x) = P(X ≤ x) − P(X ≤ x − 1)    (2.2)
Similarly:
P(Y = y) = ∑_x P(X = x, Y = y)    (2.5)
The joint distribution contains all of the information about X and Y . From the joint dis-
tribution, we can recover the marginal distributions of X and Y . The converse is not true:
the marginal distributions are usually not sufficient to recover the joint distribution. The
reason for this is that the joint distribution captures information about the dependence of
X and Y . This leads us to formulate the following definition:
Definition 2.2. We say that two discrete random variables X and Y are independent if

P(X = x, Y = y) = P(X = x)P(Y = y) for all x and y.
Notice the utility of independence: if X and Y are independent, then we can write their joint
probability as a product of their marginal probabilities (sometimes, we say that the joint
distribution factors). Not only does independence allow us to write down the joint distribu-
tion using only information about the marginal distributions, it also simplifies calculations.
All of the results in this section generalize easily to multiple random variables.
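The point that the joint distribution determines the marginals, but not conversely, can be made concrete with two small joints sharing the same marginals. The sketch below is ours; the names and the particular tables are illustrative choices:

```python
from fractions import Fraction

h, q = Fraction(1, 2), Fraction(1, 4)

# Two joint distributions for (X, Y), with X, Y in {0, 1}.
joint_indep = {(0, 0): q, (0, 1): q, (1, 0): q, (1, 1): q}  # X, Y independent
joint_equal = {(0, 0): h, (0, 1): 0, (1, 0): 0, (1, 1): h}  # X = Y always

def marginals(joint):
    """Sum the joint over the other coordinate, as in (2.5)."""
    px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
    py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}
    return px, py

# Both joints produce the same (uniform) marginals...
assert marginals(joint_indep) == marginals(joint_equal)
# ...so the marginals alone cannot recover the joint distribution.
assert joint_indep != joint_equal
```

The second joint encodes complete dependence (Y is always equal to X), yet its marginals are indistinguishable from the independent case: the dependence information lives only in the joint.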
2.2 Expectation
Knowing the full probability distribution gives us a lot of information, but sometimes it is
helpful to have a summary of the distribution.
Definition 2.3. The expectation (or expected value) of a discrete random variable
X is defined to be the real number:
E[X] = ∑_ω X(ω)P(ω) = ∑_x x P(X = x)    (2.7)
Technical Remark: The expectation is not always well-defined. To avoid these issues,
throughout the course we will assume E[|X|] < ∞.
as N → ∞ (in a certain sense that we will formalize later). Therefore, E[X] is the long-run
average of an experiment in which you measure the value of X.
Often, the expectation values are easier to work with than the full probability distributions
because they satisfy nice properties. In particular, they satisfy linearity:
1. E[X + Y] = E[X] + E[Y]
2. E[aX + c] = aE[X] + c

Proof:

1. E[X + Y] = ∑_ω (X(ω) + Y(ω))P(ω) = ∑_ω X(ω)P(ω) + ∑_ω Y(ω)P(ω) = E[X] + E[Y]

2. E[aX + c] = ∑_ω (aX(ω) + c)P(ω) = a ∑_ω X(ω)P(ω) + c ∑_ω P(ω) = aE[X] + c
Important: Notice that we did not assume that X and Y are independent. We will use
these properties repeatedly to solve complicated problems.
The definition can be extended easily to functions of multiple random variables using the
joint distribution:
E[f(X1, …, Xn)] = ∑_{x1,…,xn} f(x1, …, xn) P(X1 = x1, …, Xn = xn)    (2.9)
Next, we prove an important fact about the expectation of independent random variables: if X and Y are independent, then E[XY] = E[X]E[Y].

Proof.
E[XY] = ∑_{x,y} xy P(X = x, Y = y) = ∑_{x,y} xy P(X = x)P(Y = y)
      = (∑_x x P(X = x)) (∑_y y P(Y = y)) = E[X]E[Y]
The definition of independent random variables was used in the first line of the proof.
It is crucial to remember that the theorem does not hold true when X and Y are not
independent!
Theorem 2.6 (Tail Sum Formula). Let X be a random variable that only takes on values in ℕ. Then

E[X] = ∑_{x=1}^∞ P(X ≥ x).
The formula is known as the tail sum formula because we compute the expectation by
summing over the tail probabilities of the distribution.
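The tail sum formula is easy to sanity-check on a small ℕ-valued distribution. The following check is ours, using X uniform on {1, 2, 3, 4}:

```python
from fractions import Fraction

# X uniform on {1, 2, 3, 4}.
pmf = {x: Fraction(1, 4) for x in (1, 2, 3, 4)}

# Direct definition of expectation: sum of x * P(X = x).
expectation = sum(x * p for x, p in pmf.items())

# Tail sum formula: E[X] = sum over x >= 1 of P(X >= x).
tail_sum = sum(sum(p for y, p in pmf.items() if y >= x) for x in range(1, 5))

assert expectation == tail_sum == Fraction(5, 2)
```

Both computations give 5/2: the tails are P(X ≥ 1) = 1, P(X ≥ 2) = 3/4, P(X ≥ 3) = 1/2, P(X ≥ 4) = 1/4, which sum to the expectation.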
where to evaluate the sum, we have used the triangular number identity:

∑_{k=1}^n k = n(n + 1)/2    (2.11)
Example 2.7. Suppose you have a box with n chocolates, one of which is a special
dark chocolate. You pick out chocolates randomly, one at a time, without putting them
back into the box, until you find the special dark chocolate. Let X denote the number of
chocolates that you remove from the box. What is the distribution and expectation of X?
As a consequence, if each person picks out just one chocolate from the box at random,
it does not matter whether you are the first, the last, or the kth person to pick out a
chocolate: the probability that you will end up with the special dark chocolate is always
the same, 1/n.
The distribution of X is

P(X = x) = 1 − p if x = 0,  p if x = 1,  and 0 otherwise.

A concise way to describe the distribution is P(X = x) = (1 − p)^{1−x} p^x for x ∈ {0, 1}. The expectation of the Ber(p) distribution is

E[X] = 0 · P(X = 0) + 1 · P(X = 1) = 0 · (1 − p) + 1 · p = p.
A quick example: the number of heads in one fair coin flip follows the Ber(1/2) distribution.
This is quite a difficult sum to calculate! (Try it yourself and see if you can make any
progress.) To make our work simpler, we will instead make a connection between the binomial
distribution and the Bernoulli distribution we defined earlier. Let Xi be the indicator random
variable for the event that trial i is a success. (In the language of the previous section,
Xi = 1_{Ai}, where Ai is the event that trial i is a success.) The key insight lies in observing that

X = X1 + ⋯ + Xn.
Each indicator variable Xi is 1 or 0 depending on whether trial i is a success, so if we sum
up all of the indicator variables, then we obtain the total number of successes in all n trials.
Therefore, we compute

E[X] = E[X1 + ⋯ + Xn] = E[X1] + ⋯ + E[Xn].
Notice that in the last line, we used linearity of expectation. Now we can see why linearity
of expectation is so powerful: combined with indicator variables, it allows us to break up
the expectation of a complicated random variable into the sum of the expectations of simple
random variables. Using our result from the previous section on indicator random variables,
E[Xi ] = P (trial i is a success) = p.
Each term in the sum is simply p, and there are n such terms, so therefore
E[X] = np.
The result should make intuitive sense: if you are conducting n trials, and the probability of
success is p, then you expect a fraction p of the trials to be successes, which is saying that
you expect np total successes. The expectation matches our intuition.
By the way, the random variables X_i are an example of i.i.d. random variables, a term that comes up very frequently (so we might as well define it now): i.i.d. stands for independent and identically distributed. Indeed, since the trials are mutually independent by assumption, the variables X_i are independent, although we did not need this fact to compute the expectation. Linearity of expectation is powerful: it holds even when the variables are not independent! Also, the X_i variables are identically distributed, which means they all have the same probability distribution: X_i \sim Ber(p).
A strategy now emerges for tackling complicated expected value questions: when computing
E[X], try to see if you can break down X into the sum of indicator random variables. Then,
computing the expectation becomes much easier because you can take advantage of linearity.
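To make the indicator decomposition concrete, here is a small simulation sketch in Python (the function name and the particular parameters are my own illustration) that estimates E[X] for a binomial random variable by summing indicators, and compares the estimate with the prediction np:

```python
import random

def simulate_binomial_mean(n, p, trials=100_000, seed=0):
    """Estimate E[X] for X ~ Bin(n, p) by summing indicator variables.

    Each trial builds X = X_1 + ... + X_n, where X_i is the indicator
    that trial i is a success (a Bernoulli(p) random variable).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # X is a sum of n indicator random variables.
        total += sum(1 for _ in range(n) if rng.random() < p)
    return total / trials

# Linearity of expectation predicts E[X] = n * p = 3.0 here.
estimate = simulate_binomial_mean(n=10, p=0.3)
```

With 100,000 trials the estimate should land very close to np = 3.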
CHAPTER 2. DISCRETE RANDOM VARIABLES 24
When working with the geometric distribution, it is often easier to work with the tail probabilities P(X > x). In order for X > x to hold, the first x trials must all be failures; hence,
\[
P(X > x) = (1 - p)^x.
\]
Note that the tail probability is related to the CDF in the following way:
P(X > x) = 1 - P(X \le x).
The clever way to find the expectation of the geometric distribution uses a method known as
the renewal method. E[X] is the expected number of trials until the first success. Suppose
we carry out the first trial, and one of two outcomes occurs. With probability p, we obtain
a success and we are done (it only took 1 trial until success). With probability 1 p, we
obtain a failure, and we are right back where we started. In the latter case, how many trials
do we expect until our first success? The answer is 1 + E[X]: we have already used one trial,
and we expect E[X] more trials since nothing has changed from our original situation (the
geometric distribution is memoryless). Hence,
\[
E[X] = p \cdot 1 + (1 - p)(1 + E[X]),
\]
and solving this equation for E[X] yields E[X] = 1/p.
Here is a more computational way to obtain the formula. We want to evaluate the sum
\[
E[X] = \sum_{x=1}^{\infty} x (1 - p)^{x-1} p = p \sum_{x=0}^{\infty} (x + 1)(1 - p)^x.
\]
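The renewal equation E[X] = p + (1 - p)(1 + E[X]) solves to E[X] = 1/p, and a quick simulation sketch (helper names are my own) confirms it numerically:

```python
import random

def sample_geometric(p, rng):
    """Number of independent Bernoulli(p) trials until the first success."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials

def estimate_geometric_mean(p, n=200_000, seed=0):
    """Average of many geometric samples; should approach 1/p."""
    rng = random.Random(seed)
    return sum(sample_geometric(p, rng) for _ in range(n)) / n

# The renewal equation E[X] = p*1 + (1-p)*(1+E[X]) solves to E[X] = 1/p = 4.
mean_estimate = estimate_geometric_mean(0.25)
```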
We can show that the minimum of independent geometric random variables is geometric:
Proof. We can use the tail probabilities to simplify the derivation. The minimum of X and Y is greater than z if and only if both X and Y are greater than z. Hence, by independence,
\[
P(\min(X, Y) > z) = P(X > z) P(Y > z) = (1 - p)^z (1 - q)^z = (1 - (p + q - pq))^z.
\]
We recognize the last expression as the tail probability of a geometric random variable with parameter p + q - pq.
Another way of thinking about the above result is that on each trial, we have a success from
X with probability p and a success from Y with probability q. By the inclusion-exclusion
rule, the probability that we have a success from either X or Y is p + q - pq, and the trials
are independent from each other. Hence, we meet the description for a geometric distribution.
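A short simulation sketch (helper names are my own) illustrates the result: the empirical tail of min(X, Y) should match the geometric tail with parameter p + q - pq.

```python
import random

def sample_geometric(p, rng):
    """Number of independent Bernoulli(p) trials until the first success."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials

def tail_of_min(p, q, z, n=100_000, seed=0):
    """Estimate P(min(X, Y) > z) for independent X ~ Geom(p), Y ~ Geom(q)."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n)
        if min(sample_geometric(p, rng), sample_geometric(q, rng)) > z
    )
    return hits / n

p, q, z = 0.2, 0.3, 3
empirical = tail_of_min(p, q, z)
# Theory: min(X, Y) ~ Geom(p + q - p*q), so the tail is (1 - (p+q-p*q))**z.
predicted = (1 - (p + q - p * q)) ** z
```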
Example 2.9 (Coupon Collector's Problem). There are n different coupons that you
would like to collect. Every time you buy an item from the store, you receive a random
coupon (each of the n coupons is equally likely to appear). What is the expected number
of items you must buy before you collect every coupon?
Let Ti be the number of items it requires to collect the ith new coupon. In other words,
starting from when you have seen i - 1 distinct coupons, T_i represents the additional number of items you must purchase before you see a coupon you have not seen before. T, the total time to collect all coupons, is T = \sum_{i=1}^{n} T_i.
Once we have collected i - 1 coupons, there are n - i + 1 coupons we have not seen yet, so the probability that the next item we buy comes with a coupon we have not seen is (n - i + 1)/n. If we regard each object bought as an independent trial, then we see that T_i \sim Geom(p), where p = (n - i + 1)/n. By linearity of expectation,
\[
E[T] = \sum_{i=1}^{n} E[T_i] = \sum_{i=1}^{n} \frac{n}{n - i + 1} = n \sum_{i=1}^{n} \frac{1}{i} = n H_n,
\]
where H_n is the nth harmonic number. A useful approximation is H_n \approx \ln n + \gamma, where the Euler-Mascheroni constant \gamma has the numerical value 0.577. Using this approximation, E[T] \approx n(\ln n + \gamma).
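The prediction E[T] = n H_n is easy to check by simulation; the sketch below (function names are my own) averages the collection time over many runs for n = 10, where n H_n ≈ 29.29:

```python
import random

def coupon_collector_time(n, rng):
    """Number of purchases needed to see all n coupon types."""
    seen = set()
    purchases = 0
    while len(seen) < n:
        purchases += 1
        seen.add(rng.randrange(n))  # each coupon equally likely
    return purchases

def average_collection_time(n, trials=20_000, seed=0):
    rng = random.Random(seed)
    return sum(coupon_collector_time(n, rng) for _ in range(trials)) / trials

n = 10
harmonic = sum(1 / i for i in range(1, n + 1))  # H_n
empirical = average_collection_time(n)          # theory: n * H_n
```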
Proof.
Intuitively, the theorem says: suppose you have already tried flipping a coin s times, without
success. The probability that it takes you at least t more coin flips until your first success is
the same as the probability that your friend picks up a coin and it takes him/her at least t
coin flips. Moral of the story: the geometric distribution does not care how many times you
have already flipped the coin, because it is memoryless.
Consider the probability that we require x trials: we have k successes and x - k failures, and the probability of any sequence of k successes and x - k failures is p^k (1 - p)^{x-k}. Now, a counting argument gives the number of such sequences: the last success must occur on the xth trial, and there are \binom{x-1}{k-1} ways to distribute the k - 1 remaining successes among the other trials. Hence, the negative binomial distribution is
\[
P(X = x) = \binom{x-1}{k-1} p^k (1 - p)^{x-k}, \qquad x = k, k + 1, \ldots
\]
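As a quick sanity check of the formula, the probabilities should sum to 1 over x = k, k+1, …. A sketch using Python's math.comb (the cutoff of 400 is an arbitrary truncation chosen for the numerical check):

```python
from math import comb

def negative_binomial_pmf(x, k, p):
    """P(X = x): probability that the kth success occurs on trial x."""
    if x < k:
        return 0.0
    return comb(x - 1, k - 1) * p**k * (1 - p) ** (x - k)

# The pmf should sum to 1 over x = k, k+1, ... (truncated at a large cutoff).
k, p = 3, 0.4
total = sum(negative_binomial_pmf(x, k, p) for x in range(k, 400))
```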
(Random tidbit: In many cases, the power series is taken as the definition of e^x. The power series converges everywhere.)
The expectation of the Poisson distribution is, as we would expect, \lambda. But let's prove it:
\[
E[X] = \sum_{x=0}^{\infty} x \frac{e^{-\lambda} \lambda^x}{x!} = \sum_{x=1}^{\infty} x \frac{e^{-\lambda} \lambda^x}{x!} = \lambda e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.
\]
X + Y \sim \text{Pois}(\lambda + \mu).
Proof. We will compute the distribution of X + Y and show that it is Poisson (using
independence).
\[
P(X + Y = z) = \sum_{j=0}^{z} P(X = j, Y = z - j) = \sum_{j=0}^{z} \frac{e^{-\lambda} \lambda^j}{j!} \cdot \frac{e^{-\mu} \mu^{z-j}}{(z-j)!}
\]
\[
= \frac{e^{-(\lambda+\mu)}}{z!} \sum_{j=0}^{z} \frac{z!}{j!\,(z-j)!} \lambda^j \mu^{z-j} = \frac{e^{-(\lambda+\mu)}}{z!} \sum_{j=0}^{z} \binom{z}{j} \lambda^j \mu^{z-j}
\]
\[
= \frac{e^{-(\lambda+\mu)} (\lambda + \mu)^z}{z!} = P(\text{Pois}(\lambda + \mu) = z).
\]
In the last line, we have used the Binomial Theorem.
Theorem 2.12 (Poisson Splitting). Suppose that X \sim Pois(\lambda) and that conditioned on X = x, Y follows the Bin(x, p) distribution. Then Y \sim Pois(\lambda p).
Proof. We use the definition of conditional probability and show that Y has the correct
distribution. Notice that the sum starts from x = y because X Y .
\[
P(Y = y) = \sum_{x=y}^{\infty} P(X = x, Y = y) = \sum_{x=y}^{\infty} P(X = x) P(Y = y \mid X = x)
\]
\[
= \sum_{x=y}^{\infty} \frac{e^{-\lambda} \lambda^x}{x!} \binom{x}{y} p^y (1 - p)^{x-y} = e^{-\lambda} \sum_{x=y}^{\infty} \frac{\lambda^x}{x!} \cdot \frac{x!}{y!\,(x-y)!} p^y (1 - p)^{x-y}
\]
\[
= \frac{e^{-\lambda} (\lambda p)^y}{y!} \sum_{x=y}^{\infty} \frac{(\lambda(1 - p))^{x-y}}{(x-y)!} = \frac{e^{-\lambda} (\lambda p)^y}{y!} e^{\lambda(1-p)} = \frac{e^{-\lambda p} (\lambda p)^y}{y!} = P(\text{Pois}(\lambda p) = y).
\]
As an example: suppose that the number of calls that a calling center receives per hour is distributed according to a Poisson distribution with mean \lambda. Furthermore, suppose that each call that the calling center receives is independently a telemarketer with probability p (therefore, the distribution of telemarketing calls is binomial, conditioned on the number of calls received). Then, the number of telemarketing calls that the calling center receives per hour (unconditional) follows a Poisson distribution with mean \lambda p. This property of the Poisson
distribution is also rather intuitive because it says that if a Poisson random variable is thinned
out such that only a fraction p remains, then the resulting distribution remains Poisson.
However, take time to appreciate that the Poisson splitting property is not immediately
obvious without the mathematical proof.
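A simulation of the call-center example (a sketch; the inversion sampler and the function names are my own) makes the thinning property tangible: the thinned count should have mean λp.

```python
import random
from math import exp

def sample_poisson(lam, rng):
    """Sample Pois(lam) by inverting the CDF (fine for moderate lam)."""
    u, x, prob, cdf = rng.random(), 0, exp(-lam), exp(-lam)
    while u > cdf:
        x += 1
        prob *= lam / x
        cdf += prob
    return x

def thinned_mean(lam, p, trials=100_000, seed=0):
    """Keep each Poisson arrival independently with probability p."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        calls = sample_poisson(lam, rng)
        total += sum(1 for _ in range(calls) if rng.random() < p)
    return total / trials

# Poisson splitting predicts the thinned count is Pois(lam*p), mean 1.5 here.
estimate = thinned_mean(lam=5.0, p=0.3)
```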
Chapter 3
Variance & Inequalities
Previously, we have discussed the expectation of a random variable, which is a measure of the
center of the probability distribution. Today, we discuss the variance, which is a measure of
the spread of the distribution. Variance, in a sense, is a measure of how unpredictable your
results are. We will also cover some important inequalities for bounding tail probabilities.
3.1 Variance
Suppose your friend offers you a choice: you can either accept $1 immediately, or you can
enter in a raffle in which you have a 1/100 chance of winning a $100 payoff. The expected
value of each of these deals is simply $1, but clearly the offers are very different in nature! We
need another measure of the probability distribution that will capture the idea of variability
or risk in a distribution. We are now interested in how often a random variable takes on values close to its mean.
Perhaps we could study the quantity X - E[X] (the difference between what we expected and what we actually measured), but we quickly notice a problem:
\[
E[X - E[X]] = E[X] - E[X] = 0.
\]
The expectation of this quantity is always 0, no matter the distribution! Every random variable (except the constant random variable) can take on values above or below the mean by definition, so studying the average of the differences is not interesting. To address this problem, we could study |X - E[X]| (thereby making all differences positive), but in practice, it becomes much harder to analytically solve problems using this quantity. Instead, we will study a quantity known as the variance of the probability distribution:
\[
\operatorname{var}(X) := E[(X - E[X])^2].
\]
CHAPTER 3. VARIANCE & INEQUALITIES 31
The standard deviation \sigma_X := \sqrt{\operatorname{var}(X)} is useful because it has the same units as X, allowing for easier comparison. As an example, if X represents the height of an individual, then \sigma_X would have units of meters, while var(X) has units of meters^2.
Proof. We use linearity of expectation (note that E[E[X]] = E[X] since E[X] is just a constant):
\[
\operatorname{var}(X) = E[(X - E[X])^2] = E[X^2 - 2 X E[X] + E[X]^2] = E[X^2] - 2 E[X]^2 + E[X]^2 = E[X^2] - E[X]^2.
\]
This formula will be extremely useful to us throughout the course, so please memorize it!
Observe that adding a constant does not change the variance. Intuitively, adding a constant shifts the distribution to the left or right, but does not affect its shape (and therefore its spread). On the other hand, scaling X by a constant c scales the variance by c^2: var(cX) = c^2 var(X).
Proof. The corollary follows immediately from taking the square root of the result in
Theorem 3.3.
We saw that linearity of expectation was an extremely powerful tool for computing expec-
tation values. We would like to have a similar property hold for variance, but additivity of
variance does not hold in general. However, we have the following useful theorem:
Theorem 3.5 (Variance of Sums of Random Variables). Let X and Y be random variables. Then:
\[
\operatorname{var}(X + Y) = \operatorname{var}(X) + \operatorname{var}(Y) + 2(E[XY] - E[X]E[Y]).
\]
We will reveal the importance of the term E[XY] - E[X]E[Y] later in the course. For now,
we are more interested in the corollary:
\[
\operatorname{var}(X) = E[X^2] - E[X]^2 = \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} = \frac{n^2 - 1}{12}.
\]
Observe that p(1 - p) is maximized when p = 1/2, where it attains a maximum value of 1/4 (to convince yourself, draw the parabola). We will make use of this observation later.
X = X_1 + \cdots + X_n.
Hence,
\[
\operatorname{var}(X) = \operatorname{var}(X_1) + \cdots + \operatorname{var}(X_n),
\]
where we have used the independence of the indicator random variables to apply Corollary 3.6. Since each indicator random variable follows the Ber(p) distribution, var(X_i) = p(1 - p), and therefore var(X) = np(1 - p).
Let X be written as the sum of identically distributed indicators, which are not assumed to
be independent:
X = 1_{A_1} + \cdots + 1_{A_n}.
We first note that the expectation is easy, thanks to linearity of expectation (which holds
regardless of whether the indicator random variables are independent or not):
\[
E[X] = \sum_{i=1}^{n} E[1_{A_i}] = \sum_{i=1}^{n} P(A_i). \tag{3.12}
\]
Using the fact that the indicator variables are identically distributed, E[X] = n P(A_1).
1. There are like-terms, such as 1_{A_1}^2 and 1_{A_3}^2. However, we know from the properties of indicators that 1_{A_i}^2 = 1_{A_i}. There are n of these terms in total:
\[
\sum_{i=1}^{n} 1_{A_i}^2 = \sum_{i=1}^{n} 1_{A_i} = 1_{A_1} + \cdots + 1_{A_n} = X. \tag{3.14}
\]
2. Then, there are cross-terms, such as 1_{A_2} 1_{A_4} and 1_{A_1} 1_{A_2}. There are n^2 total terms in the square, and n of those terms are like-terms, which leaves n^2 - n = n(n - 1) cross-terms. We usually write the sum:
\[
\sum_{i \neq j} 1_{A_i} 1_{A_j} = 1_{A_1} 1_{A_2} + \cdots + 1_{A_{n-1}} 1_{A_n}.
\]
We can discover more about the cross-terms by examining their meaning. Consider the term 1_{A_i} 1_{A_j}: it is the product of two indicators. Each indicator is either 0 or 1; therefore, their product is also 0 or 1, which suggests that the product is also an indicator! The product is 1 if and only if each indicator is 1, which in the language of probability is expressed as the occurrence of both A_i and A_j, i.e. the event A_i \cap A_j. We have arrived at a crucial fact: the product of two indicators 1_{A_i} and 1_{A_j} is itself an indicator for the event A_i \cap A_j. Therefore, we can rewrite the sum:
\[
\sum_{i \neq j} 1_{A_i} 1_{A_j} = \sum_{i \neq j} 1_{A_i \cap A_j}. \tag{3.16}
\]
(For simplicity, we made the assumption that all of the intersection probabilities P(A_i \cap A_j) are the same.) Finally, the variance is E[X^2] - E[X]^2, or:
\[
\operatorname{var}(X) = n P(A_1) + n(n-1) P(A_1 \cap A_2) - n^2 P(A_1)^2.
\]
Although the resulting formula looks rather complicated, it is a remarkably powerful demon-
stration of the techniques we have developed so far. The path we have taken is an amusing
one: when the indicators are not independent, additivity of variance fails to hold, so the tool
we ended up relying on was... linearity of expectation and indicators!
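As an illustration of this technique (under the exchangeability assumption above, i.e. all P(A_i) equal and all P(A_i ∩ A_j) equal), consider the number of hearts in a five-card hand dealt without replacement, a setup of my own choosing; the indicators "card i is a heart" are identically distributed but not independent:

```python
import random

def hearts_variance_formula(n=5, deck=52, hearts=13):
    """var(X) via var(X) = n*P(A1) + n*(n-1)*P(A1 and A2) - n^2 * P(A1)^2,
    where X is the number of hearts among n cards dealt without replacement."""
    p1 = hearts / deck
    p2 = hearts * (hearts - 1) / (deck * (deck - 1))
    return n * p1 + n * (n - 1) * p2 - n**2 * p1**2

def hearts_variance_simulated(n=5, trials=100_000, seed=0):
    """Empirical variance of the number of hearts in a random n-card hand."""
    rng = random.Random(seed)
    deck = ["H"] * 13 + ["O"] * 39
    counts = [sum(1 for c in rng.sample(deck, n) if c == "H")
              for _ in range(trials)]
    mean = sum(counts) / trials
    return sum((c - mean) ** 2 for c in counts) / trials

formula = hearts_variance_formula()      # approximately 0.864
simulated = hearts_variance_simulated()
```

Note the formula gives less than the independent-indicator value np(1 - p) = 0.9375 here, because drawing a heart makes further hearts slightly less likely.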
We must compute
\[
E[X^2] = p \sum_{x=1}^{\infty} x^2 (1 - p)^{x-1}.
\]
Setting x = 1 - p in the series identity obtained by differentiating the geometric series twice yields
\[
\sum_{k=1}^{\infty} k^2 (1 - p)^{k-1} = \frac{2 - p}{p^3}.
\]
Finally, we obtain
\[
E[X^2] = p \cdot \frac{2 - p}{p^3} = \frac{2 - p}{p^2}.
\]
The variance is computed to be
\[
\operatorname{var}(X) = \frac{2 - p}{p^2} - \frac{1}{p^2} = \frac{1 - p}{p^2}.
\]
3.3 Inequalities
Often, probability distributions can be difficult to compute exactly, so we will cover a few
important bounds.
\[
P(X \ge a) \le \frac{E[f(X)]}{f(a)} \tag{3.22}
\]
Proof. Let 1_{X \ge a} be the indicator that X \ge a and define h(X) = f(a) 1_{X \ge a}. We claim that h(X) \le f(X) always: if X \ge a, then h(X) = f(a) \le f(X) since f is non-decreasing; otherwise, h(X) = 0 \le f(X). Then, we have
\[
f(a) P(X \ge a) = E[h(X)] \le E[f(X)].
\]
Example 3.9. Here is a quick example showing that Markov's inequality is tight. Let X take on the value a with probability p, and 0 otherwise. Then E[X] = ap and Markov's inequality gives P(X \ge a) \le p, which holds with equality.
Theorem 3.10 (Chebyshev's Inequality). Let X be a random variable and a > 0. Then:
\[
P(|X - E[X]| \ge a) \le \frac{\operatorname{var}(X)}{a^2} \tag{3.24}
\]
In particular, taking a = k\sigma_X,
\[
P(|X - E[X]| \ge k\sigma_X) \le \frac{1}{k^2} \tag{3.25}
\]
Notably, Chebyshev's Inequality justifies why we call the variance a measure of the spread
of the distribution. The probability that X lies more than k standard deviations away from
the mean is bounded by 1/k 2 , which is to say that a larger standard deviation means X is
more likely to be found away from its mean, while a low standard deviation means X will
remain fairly close to its mean.
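A quick numerical illustration (the setup is my own): for the number of heads in 20 fair coin flips, the empirical probability of deviating from the mean by at least k = 2 standard deviations must sit below Chebyshev's bound 1/k² = 0.25, and in fact sits well below it, since Chebyshev is often loose.

```python
import random

def empirical_tail(k, trials=100_000, seed=0):
    """P(|X - E[X]| >= k*sigma) for X = number of heads in 20 fair flips."""
    rng = random.Random(seed)
    n, p = 20, 0.5
    mean = n * p                       # 10
    sigma = (n * p * (1 - p)) ** 0.5   # sqrt(5)
    hits = sum(
        1 for _ in range(trials)
        if abs(sum(rng.random() < p for _ in range(n)) - mean) >= k * sigma
    )
    return hits / trials

k = 2.0
tail = empirical_tail(k)  # Chebyshev guarantees tail <= 1/k**2 = 0.25
```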
Proof. Consider
\[
g(x) = E[(x - X)^2] = x^2 - 2 E[X]\, x + E[X^2].
\]
Observe that the above expression is a quadratic in the variable x which is always non-negative. Visualize the parabola: the parabola is always non-negative, which means it has either 0 or 1 root. This is equivalent to the discriminant condition b^2 \le 4ac, or 4 E[X]^2 \le 4 E[X^2], i.e. E[X]^2 \le E[X^2].
measurements X_i will be close to the true parameter \mu.) We can use linearity of expectation to quickly check that this holds:
\[
E[\bar{X}] = \frac{1}{n} (E[X_1] + \cdots + E[X_n]) = \frac{1}{n} \cdot n\mu = \mu.
\]
We therefore call \bar{X} an unbiased estimator of \mu.
The next question to ask is: on average, we expect \bar{X} to estimate \mu. But for a given experiment, how close to \mu do we expect \bar{X} to be? How long will it take for \bar{X} to converge to its mean, \mu? These are questions that involve the variance of the distribution. First, let us compute
\[
\operatorname{var}(\bar{X}) = \frac{1}{n^2} (\operatorname{var}(X_1) + \cdots + \operatorname{var}(X_n)) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}. \tag{3.28}
\]
(Notice that the answer is not \sigma^2!) In calculating our answer, we have used the scaling property of the variance and the assumption of independence. The dependence of \sigma_{\bar{X}}, the standard deviation of \bar{X}, is
\[
\sigma_{\bar{X}} \propto \frac{1}{\sqrt{n}}, \tag{3.29}
\]
a dependence that is well worth remembering. In particular, this result states that the more samples we collect, the smaller our standard deviation becomes! This result is what allows the scientific method to work: without it, gathering more samples would not make us any more certain of our results. We next state and prove a famous result, which shows that \bar{X} converges to its expected value after enough samples are drawn.
Theorem 3.13 (Weak Law of Large Numbers). For all \varepsilon > 0, in the limit as n \to \infty,
\[
P(|\bar{X} - \mu| \ge \varepsilon) \to 0. \tag{3.30}
\]
\[
P(\mu \in (\bar{X} - a, \bar{X} + a)) \ge 0.95 \tag{3.31}
\]
By Chebyshev's inequality, this is guaranteed if
\[
1 - \frac{\sigma^2}{n a^2} = 0.95 \implies a = \sigma \sqrt{\frac{1}{0.05\,n}} = \frac{4.472\,\sigma}{\sqrt{n}}.
\]
Therefore, our confidence interval is (\bar{X} - 4.472\sigma/\sqrt{n},\ \bar{X} + 4.472\sigma/\sqrt{n}).
Example 3.15. As a specialization of the previous example, we will consider the prob-
lem of estimating the bias of a coin. A coin is biased with P (H) = p, and our goal is to
find a 95% confidence interval for p. How can we construct a confidence interval here?
Flip the coin n times and let X_i be the indicator that the ith flip came up heads. Observe that in this case, \mu = E[X_i] = p and \sigma^2 = var(X_i) = p(1 - p), since we are in the setting of X_i \sim Ber(p). Now, we can apply our bound from the previous example:
\[
a = \sqrt{\frac{p(1-p)}{0.05\,n}} \le \frac{1}{2} \sqrt{\frac{1}{0.05\,n}} = \frac{2.236}{\sqrt{n}},
\]
where we have used p(1 - p) \le 1/4. Hence, our 95% confidence interval for the bias p is (\bar{X} - 2.236/\sqrt{n},\ \bar{X} + 2.236/\sqrt{n}), where \bar{X} is the observed fraction of heads.
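Because Chebyshev's inequality is conservative, the interval p̂ ± 2.236/√n should contain the true bias in well over 95% of experiments. A simulation sketch (function names and parameter values are my own):

```python
import random

def coverage_of_interval(p, n, trials=2_000, seed=0):
    """Fraction of experiments whose interval p_hat +/- 2.236/sqrt(n)
    contains the true bias p."""
    rng = random.Random(seed)
    half_width = 2.236 / n**0.5
    covered = 0
    for _ in range(trials):
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        if p_hat - half_width <= p <= p_hat + half_width:
            covered += 1
    return covered / trials

# The Chebyshev guarantee is a worst-case bound, so the realized
# coverage is typically far above the promised 95%.
coverage = coverage_of_interval(p=0.6, n=100)
```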
\[
P(X \ge x) \le \frac{E[e^{\theta X}]}{e^{\theta x}} \tag{3.32}
\]
What value of \theta should we choose? The obvious answer is: the best possible one! In other words, we can optimize over the values of \theta in search of the best possible bound. This is known as a Chernoff bound and it can be used to prove that the tail probabilities of certain distributions decay exponentially fast. Let's take a look at an example:
Example 3.16. We will bound the tail probability of a Poisson distribution. Consider X \sim Pois(\lambda). First, we compute
\[
E[e^{\theta X}] = \sum_{x=0}^{\infty} e^{\theta x} \frac{e^{-\lambda} \lambda^x}{x!} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda e^{\theta})^x}{x!} = e^{-\lambda} e^{\lambda e^{\theta}} = e^{\lambda(e^{\theta} - 1)}.
\]
The bound becomes P(X \ge x) \le e^{\lambda(e^{\theta} - 1) - \theta x}; optimizing over \theta (take \theta = \ln(x/\lambda), which is valid when x > \lambda) yields
\[
P(X \ge x) \le \left(\frac{\lambda}{x}\right)^x e^{x(1 - \lambda/x)} = \left(\frac{\lambda}{x}\right)^x e^{x - \lambda}, \qquad \text{for } x > \lambda.
\]
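Comparing the Chernoff bound with the exact Poisson tail (a sketch; the parameter values λ = 4 and x = 12 are my own choice) shows the bound is valid, though not tight:

```python
from math import exp, factorial

def poisson_tail(lam, x):
    """Exact P(X >= x) for X ~ Pois(lam), x a non-negative integer."""
    return 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(x))

def chernoff_bound(lam, x):
    """Optimized Chernoff bound (lam/x)**x * exp(x - lam), valid for x > lam."""
    return (lam / x) ** x * exp(x - lam)

lam, x = 4.0, 12
exact = poisson_tail(lam, x)
bound = chernoff_bound(lam, x)
```

Both quantities are tiny (a 12-or-more outcome is six standard deviations above the mean 4), and the bound correctly dominates the exact tail.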
A fundamental question in statistics is: given a set of data points {(Xi , Yi )}, how can we
estimate the value of Y as a function of X? First, we will discuss the covariance and corre-
lation of two random variables, and use these quantities to derive the best linear estimator
of Y , known as the LLSE. Then, we will define the conditional expectation, and proceed
to derive the best general estimator of Y , known as the MMSE. The notion of conditional
expectation will turn out to be an immensely useful concept with further applications.
4.1 Covariance
We have already discussed the case of independent random variables, but many of the vari-
ables in real life are dependent upon each other, such as the height and weight of an indi-
vidual. Now we will consider how to quantify the dependence of random variables, starting
with the definition of covariance.
Definition 4.1. The covariance of two random variables X and Y is defined as:
\[
\operatorname{cov}(X, Y) := E[(X - E[X])(Y - E[Y])].
\]
The covariance is the expected product of the deviations of the two variables from their respective means. Suppose that whenever X is larger than its mean, Y is also larger than its mean;
then, the covariance will be positive, and we say the variables are positively correlated. On
the other hand, if whenever X is larger than its mean, Y is smaller than its mean, then
the covariance is negative and we say that the variables are negatively correlated. In other
words: positive correlation means that X and Y tend to fluctuate in the same direction, while
negative correlation means that X and Y tend to fluctuate in opposite directions.
CHAPTER 4. REGRESSION & CONDITIONAL EXPECTATION 44
Let us connect the correlation of events with the correlation of random variables. Let 1A
and 1B be the indicators for events A and B respectively. Note that E[1A 1B ] = P (A B),
E[1A ] = P (A), and E[1B ] = P (B).
1. If A and B are positively correlated, then E[1_A 1_B] > E[1_A] E[1_B], so cov(1_A, 1_B) > 0.
2. If A and B are uncorrelated, then E[1_A 1_B] = E[1_A] E[1_B], so cov(1_A, 1_B) = 0.
3. If A and B are negatively correlated, then E[1_A 1_B] < E[1_A] E[1_B], so cov(1_A, 1_B) < 0.
The three cases above demonstrate that correlation of events corresponds to correlation of
the corresponding indicator random variables.
Just as we had a computational formula for variance, we have a computational formula for
covariance.
Theorem 4.2 (Computational Formula for Covariance). Let X and Y be random vari-
ables. Then:
cov(X, Y ) = E[XY ] E[X]E[Y ] (4.2)
Example 4.4. Pick a point uniformly at random from {(1, 0), (0, 1), (-1, 0), (0, -1)}. Let X be the x-coordinate and Y be the y-coordinate of the point that we choose. Observe that XY = 0 always, so E[XY] = 0. By symmetry about the origin, we have E[X] = E[Y] = 0. Hence, cov(X, Y) = E[XY] - E[X]E[Y] = 0. However, X and Y are not independent: for instance, P(X = 1, Y = 1) = 0, while P(X = 1) P(Y = 1) = 1/16. So zero covariance does not imply independence.
CHAPTER 4. REGRESSION & CONDITIONAL EXPECTATION 45
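The four-point example can be checked mechanically (a short sketch; the helper names are my own): the covariance vanishes even though the joint probabilities do not factor.

```python
# The four equally likely points from Example 4.4.
points = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def expectation(f):
    """Expectation of f(x, y) under the uniform distribution on the points."""
    return sum(f(x, y) for x, y in points) / len(points)

e_x = expectation(lambda x, y: x)
e_y = expectation(lambda x, y: y)
e_xy = expectation(lambda x, y: x * y)
covariance = e_xy - e_x * e_y   # 0: X and Y are uncorrelated

# Yet X and Y are dependent: P(X=1, Y=1) = 0 while P(X=1)*P(Y=1) = 1/16.
p_x1 = sum(1 for x, y in points if x == 1) / len(points)
p_y1 = sum(1 for x, y in points if y == 1) / len(points)
p_both = sum(1 for x, y in points if x == 1 and y == 1) / len(points)
```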
Next, we will show how the covariance and variance are related: cov(X, X) = var(X). Corollary 4.5 is not very useful for calculating var(X), but it shows that variance can be seen as a special case of covariance.
Corollary 4.6 (Variance of Sums of Random Variables). Let X and Y be random vari-
ables. Then:
var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y ) (4.4)
Proof. The majority of the work for this proof was completed in the previous set of notes, in which we found
\[
\operatorname{var}(X + Y) = \operatorname{var}(X) + \operatorname{var}(Y) + 2(E[XY] - E[X]E[Y]) = \operatorname{var}(X) + \operatorname{var}(Y) + 2\operatorname{cov}(X, Y).
\]
The utility of the last result is that it holds true for any random variables, even ones that
are not independent.
Proof. The proofs are straightforward using linearity of expectation. To prove (4.5),
\[
X^* := \frac{X - E[X]}{\sigma_X} \tag{4.8}
\]
is called the standard form of X. In statistics books, X^* is also called the z-score of X.
\[
E[X^*] = 0 \tag{4.9}
\]
Indeed,
\[
E[X^*] = \frac{E[X] - E[X]}{\sigma_X} = 0.
\]
Next, since E[X^*] = 0, then var(X^*) = E[(X^*)^2] - E[X^*]^2 = E[(X^*)^2]. Using the properties of variance,
\[
\operatorname{var}(X^*) = \operatorname{var}\!\left(\frac{X - E[X]}{\sigma_X}\right) = \frac{\operatorname{var}(X)}{\sigma_X^2} = \frac{\operatorname{var}(X)}{\operatorname{var}(X)} = 1.
\]
Standardizing the random variable X is equivalent to shifting the distribution so that its mean is 0, and scaling the distribution so that its standard deviation is 1. The random variable X^* is rather convenient because it is dimensionless: for example, if X and Y represent measurements of the temperature of a system in degrees Fahrenheit and degrees Celsius respectively, then X^* = Y^*.
4.1.3 Correlation
We next take a slight detour in order to define the correlation of two random variables, which appears frequently in statistics. Although we will not use correlation extensively, the exposition presented here should allow you to interpret the meaning of correlation when you encounter it in journal articles.
Definition 4.10 (Correlation). The correlation of two random variables X and Y is:
\[
\operatorname{Corr}(X, Y) := \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{4.11}
\]
Theorem 4.11 (Covariance & Correlation). The correlation of two random variables X and Y is:
\[
\operatorname{Corr}(X, Y) = \operatorname{cov}(X^*, Y^*) = E[X^* Y^*] \tag{4.12}
\]
(Remember that the covariance of a constant with a random variable is 0.) The second equality follows easily because cov(X^*, Y^*) = E[X^* Y^*] - E[X^*]E[Y^*] and we have E[X^*] = E[Y^*] = 0.
The result states that we can view the correlation as a standardized version of the covariance.
As a result, we can also prove a result about the possible values of the correlation:
Proof. From the above proof, Corr(X, Y) = \pm 1 if and only if 0 = E[(X^* \mp Y^*)^2], which can only happen if X^* \mp Y^* = 0. This implies that Y^* = \pm X^*, which is true if and only if Y = aX + b for constants a, b \in \mathbb{R}. (What are the constants a and b?)
Now, we can see that correlation is a useful measure of the degree of linear dependence between two variables X and Y. If Corr(X, Y) is 1 or -1, then X and Y are perfectly linearly correlated, i.e. a plot of Y versus X would be a straight line. The closer the correlation is to \pm 1, the more closely the data resembles a straight-line relationship. If X and Y are independent,
then the correlation is 0 (but the converse is not true). As a final remark, the square of
the correlation coefficient is called the coefficient of determination (usually denoted R2 ).
The coefficient of determination appears frequently next to best-fit lines on scatter plots as
a measure of how well the best-fit line fits the data.
4.2 LLSE
We will immediately apply our development of the covariance to the problem of finding the
best linear predictor of Y given X. We begin by presenting the main result, and then proceed
to prove that the result satisfies the properties we desire.¹
¹The material for the sections on regression relies heavily on Professor Walrand's notes, although I have inserted my own interpretation of the material wherever appropriate.
Definition 4.14 (Least Linear Squares Estimate). Let X and Y be random variables. The least linear squares estimate (LLSE) of Y given X is defined as:
\[
L(Y \mid X) := E[Y] + \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} (X - E[X]) \tag{4.14}
\]
Proof. The proofs are actually relatively straightforward using linearity. Proof of (4.15):
\[
E[Y - L(Y \mid X)] = E\!\left[ Y - E[Y] - \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} (X - E[X]) \right] = E[Y] - E[Y] - \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} (E[X] - E[X]) = 0.
\]
Proof of (4.16):
\[
E[(Y - L(Y \mid X)) X] = E\!\left[ X \left( Y - E[Y] - \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} (X - E[X]) \right) \right]
\]
\[
= E[XY] - E[X]E[Y] - \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} (E[X^2] - E[X]^2) = \operatorname{cov}(X, Y) - \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} \operatorname{var}(X) = 0.
\]
If you have not studied linear algebra, then the rest of the section can be safely skipped. Linear algebra is not necessary to understand the properties of the LLSE, although linear algebra certainly enriches the theory of linear regression. We discuss linear algebra concepts solely to motivate the Projection Property.
Given a probability space \Omega, the space of random variables over \Omega is a vector space (that is, random variables satisfy the vector space axioms). Indeed, we have already introduced how to add and scalar-multiply random variables. Specifically, since random variables are functions, the vector space of random variables is the vector space of functions X : \Omega \to \mathbb{R}, which is also called the free space of \mathbb{R} over the set \Omega (denoted \mathbb{R}\langle\Omega\rangle).
In other words, when we estimate Y, L(Y \mid X) has the lowest mean squared error among all linear functions of X.
Proof. According to the Projection Property, we can combine (4.15) and (4.16) to obtain E[(Y - L(Y \mid X))(aX + b)] = 0 for any constants a and b. In the second line of the proof, the term E[(Y - L(Y \mid X))(L(Y \mid X) - aX - b)] vanishes by the Projection Property because L(Y \mid X) - aX - b is itself a linear function of X. The quantity above represents the mean squared error of aX + b as a predictor of Y, so we seek to minimize this quantity. Since the expectation of a non-negative random variable is always non-negative, the mean squared error is minimized when aX + b = L(Y \mid X).
The least-squares line is used everywhere as a visual summary of a trend. Now you have
seen the theory behind this powerful tool!
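As a numerical illustration (the synthetic data below is my own construction), the sample analogue of L(Y | X) recovers the underlying linear relationship, and the projection property holds for the sample residuals:

```python
import random

rng = random.Random(0)

# Synthetic data: Y depends on X linearly (slope 2, intercept 1) plus noise.
xs = [rng.uniform(0, 10) for _ in range(50_000)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 3.0) for x in xs]

def mean(v):
    return sum(v) / len(v)

mx, my = mean(xs), mean(ys)
cov_xy = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
var_x = mean([(x - mx) ** 2 for x in xs])

# Empirical LLSE: L(Y|X) = E[Y] + cov(X,Y)/var(X) * (X - E[X]).
slope = cov_xy / var_x   # should recover roughly 2.0
residuals = [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]
```

The sample versions of (4.15) and (4.16) hold exactly by construction: the residuals average to zero and are uncorrelated with X.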
In other words, E[X | Y = y] is the expectation of X with respect to the probability distribu-
tion of X conditioned on Y = y. It is important to stress that E[X | Y = y] is just a real
number, just like any other expectation.
Notice that for every possible value of Y , we can assign a real number E[X | Y = y]. In
other words, this is a function of y:
f : \mathbb{R} \to \mathbb{R} \quad \text{given by} \quad f(y) = E[X \mid Y = y].
Hence, let us define E[X \mid Y] to be a function of Y in the following manner: E[X \mid Y] := f(Y).
This point cannot be stressed enough: E[X | Y ] is a random variable! Although the con-
ditional expectation may seem mysterious at first, there is an easy rule for writing down
E[X | Y ]. Let us consider an example for concreteness.
²The material presented in this section appears slightly differently from the presentation in Professor Walrand's lecture notes, but the lecture notes are still the source for these discussion notes.
Example 4.18. Suppose that we roll a die N times, where N is some non-negative
integer-valued random variable. Let X be the sum of the dice rolls. Conditioned on
N = 1 (that is, we roll one die), then the expected value is 7/2, i.e. E[X | N = 1] = 7/2.
Similarly, conditioned on N = 2, we roll two dice, so E[X | N = 2] = 7 (the expected
sum of two dice is 7). In general, conditioned on N = n, we roll n dice and we have
E[X | N = n] = 7n/2.
If N = n, then the random variable E[X | N ] has the value E[X | N = n] = 7n/2; hence,
we can write E[X | N ] = 7N/2 (which is a function of N , following the discussion above).
At first, it may appear that in going from the expression for E[X \mid N = n] to E[X \mid N], we merely replaced the n with N. Of course, there is more going on than just a simple substitution (E[X \mid Y] is a random variable and E[X \mid Y = y] is just a real number), but substituting y \mapsto Y is exactly the procedure for writing down E[X \mid Y]. Don't worry if conditional expectation is difficult to grasp at first. Mastery of this concept requires practice, and we will soon see how to apply conditional expectation to the problem of prediction.
Proof. As noted above, we compute E[E[X | Y ]] with respect to the probability distri-
bution of Y .
\[
E[E[X \mid Y]] = \sum_{y} E[X \mid Y = y] P(Y = y) = \sum_{y} \left( \sum_{x} x P(X = x \mid Y = y) \right) P(Y = y)
\]
\[
= \sum_{y} \sum_{x} x P(X = x, Y = y) = \sum_{x} x \left( \sum_{y} P(X = x, Y = y) \right) = \sum_{x} x P(X = x) = E[X].
\]
What a marvelous proof! Once you understand this proof, you will have understood most of the concepts we have covered so far. In the first line, we use the definition of the expectation of a function of Y, i.e. E[f(Y)] = \sum_y f(y) P(Y = y); then, we use the definition of E[X \mid Y = y], which was given in the previous section. In the second line, we use our knowledge of conditional probability; then, we recall that summing over all possible values of Y in the joint distribution P(X = x, Y = y) yields the marginal distribution P(X = x). Finally, in the last line, we come back to the definition of E[X].
Example 4.20. Let us return to Example 4.18, where we found that E[X \mid N] = 7N/2. For concreteness, suppose that N \sim Geom(p). We have
\[
E[X] = E[E[X \mid N]] = \frac{7}{2} E[N] = \frac{7}{2p}.
\]
Observe that we have found the expectation of a sum of a random number of random
variables, a task that would be far more difficult without the tool of conditioning.
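A simulation of this example (a sketch; helper names are my own) checks E[X] = 7/(2p) for a geometric number of dice rolls:

```python
import random

def sample_geometric(p, rng):
    """Number of independent Bernoulli(p) trials until the first success."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials

def mean_dice_sum(p, experiments=100_000, seed=0):
    """Roll N ~ Geom(p) dice and sum the faces; average over many runs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(experiments):
        n = sample_geometric(p, rng)
        total += sum(rng.randint(1, 6) for _ in range(n))
    return total / experiments

# Iterated expectation predicts E[X] = (7/2) * E[N] = 7/(2p) = 7.0 here.
estimate = mean_dice_sum(p=0.5)
```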
Example 4.21 (Random Walk). For a second example, let us consider a drunk individual walking along the number line. At the start, the drunkard is at the origin (X_0 = 0); at successive time steps, the drunkard is equally likely to take a step forward (+1) or backward (-1). In other words, given that the drunkard is at position k at time step n, at time step n + 1 the drunkard will be equally likely to be found at either position k + 1 or k - 1.
Denoting the drunkard's position at time step n as X_n, we first compute E[X_n]. We can write X_{n+1} = X_n + 1_{+} - 1_{-}, where 1_{+} is the indicator that the drunkard takes a step forward and 1_{-} is the indicator that the drunkard takes a step backward. Taking the expectation of both sides yields E[X_{n+1}] = E[X_n] + 1/2 - 1/2 = E[X_n] (remembering that the probability of moving forward, which is the same as the probability of moving backward, is 1/2). We have found that the expected position of the drunkard is the same for all n, so the expected position of the drunkard at any later time is the same as the starting position: E[X_n] = E[X_0] = 0.
Next, we compute the second moment. Conditioned on X_n = k,
\[
P(X_{n+1}^2 = (k+1)^2 \mid X_n = k) = P(X_{n+1}^2 = (k-1)^2 \mid X_n = k) = \frac{1}{2}.
\]
Taking the expectation yields
\[
E[X_{n+1}^2 \mid X_n = k] = \frac{1}{2} (k+1)^2 + \frac{1}{2} (k-1)^2 = k^2 + 1.
\]
Substituting X_n for k, we have
\[
E[X_{n+1}^2 \mid X_n] = X_n^2 + 1.
\]
Taking expectations of both sides (by the law of iterated expectation) gives E[X_{n+1}^2] = E[X_n^2] + 1, and therefore E[X_n^2] = n.
Example 4.22 (Galton-Watson Branching Process I). Let us consider a simple model
of population growth. Suppose we start with a population of X0 = N people, and let Xn
be the number of people in the population on the nth time step. Furthermore, assume:
2. At each time step, independently of the rest of the population, each individual is expected to leave behind \mu offspring.
The challenge of this problem is that the number of people in the population at any given
time step is a random variable, so we have a random number of individuals who repro-
duce randomly. However, we have already seen in Example 4.18 a systematic method for
dealing with this kind of randomness! First, we condition on the number of individuals in the population at time n - 1, and we compute the population size at time n.
Conditioned on X_{n-1} = m, we have m people, who are each expected to leave behind \mu offspring. Therefore, E[X_n \mid X_{n-1} = m] = \mu m and we have E[X_n \mid X_{n-1}] = \mu X_{n-1}. Then, we can use the law of iterated expectation:
\[
E[X_n] = E[E[X_n \mid X_{n-1}]] = \mu E[X_{n-1}].
\]
We have obtained a recursive solution to the problem: at each time step, we expect the population size to be multiplied by \mu. Therefore,
\[
E[X_n] = \mu^n E[X_0] = \mu^n N.
\]
We can see that if \mu < 1, then the expected population size tends to 0, whereas if \mu > 1, the expected population size grows exponentially fast.
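A simulation sketch of the branching process (the offspring distribution and function names are my own illustration, chosen so that μ = 1) checks that E[X_n] = μⁿ N:

```python
import random

def branching_mean(offspring, generations, start, trials=20_000, seed=0):
    """Average population size after several generations of a branching process.

    offspring: function rng -> number of children for one individual.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        population = start
        for _ in range(generations):
            # Every individual independently leaves behind a random brood.
            population = sum(offspring(rng) for _ in range(population))
        total += population
    return total / trials

# Offspring: 0, 1, or 2 children with equal probability, so mu = 1.0
# and E[X_n] = mu**n * N stays at N = 10 on average.
offspring = lambda rng: rng.randrange(3)
avg = branching_mean(offspring, generations=5, start=10)
```

Note that the average stays near N even though many individual runs die out: the critical case μ = 1 keeps the mean constant while the variance grows.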
4.5 MMSE
4.5.1 Orthogonality Property
Before we apply the conditional expectation to the problem of prediction, we first note a
few important properties of conditional expectation. The first is that the conditional ex-
pectation is linear, which is a crucial property that carries over from ordinary expectations.
For example, E[X + Y | Z] = E[X | Z] + E[Y | Z]. The justification for this, briefly, is that
E[X + Y | Z = z] = E[X | Z = z] + E[Y | Z = z].
The second important property is: let f (X) be any function of X and g(Y ) any function
of Y . Then the conditional expectation E[f (X)g(Y ) | Y ] = g(Y )E[f (X) | Y ]. The intuitive
idea behind this property is that when we condition on Y , then any function of Y is treated
as a constant and can be moved outside of the expectation by linearity. In other words,
E[f(X)g(Y) | Y = y] = E[f(X)g(y) | Y = y] = g(y) E[f(X) | Y = y]. Then, to obtain
E[f(X)g(Y) | Y], we perform our usual procedure of substituting y ↦ Y.
Let us now prove the analogue of the Projection Property for E[Y | X].
Theorem 4.23 (Orthogonality Property). Let X and Y be random variables, and let
φ(X) be any function of X. Then we have:

    E[(Y − E[Y | X]) φ(X)] = 0.

Proof. We first calculate E[(Y − E[Y | X]) φ(X) | X] using the properties of conditional
expectation:

    E[(Y − E[Y | X]) φ(X) | X] = φ(X) E[Y − E[Y | X] | X] = φ(X)(E[Y | X] − E[Y | X]) = 0.

Taking expectations of both sides and applying the law of total expectation yields
E[(Y − E[Y | X]) φ(X)] = 0.
In our proof, we used the useful trick of conditioning on a variable first in order to use the law
of total expectation. Compare with the Projection Property: the Orthogonality Property is
stronger in that φ(X) is allowed to be any function of X, whereas the Projection Property
was proven only for linear functions of X.
Compared to the task of finding the best linear estimator of Y given X, finding the best
general estimator of Y given X seems to be an even more difficult task. However, the MMSE
will simply turn out to be our new friend, the conditional expectation. In fact, the proof is
virtually the same as the proof for the LLSE.
Theorem 4.25 (MMSE). Let X and Y be random variables. Then the MMSE of Y
given X is E[Y | X], i.e. for any function g(x),

    E[(Y − E[Y | X])²] ≤ E[(Y − g(X))²].

Proof. Let Ŷ = E[Y | X] for simplicity of notation. We have, using the Orthogonality
Property,

    E[(Y − g(X))²] = E[(Y − Ŷ + Ŷ − g(X))²]
                   = E[(Y − Ŷ)²] + 2 E[(Y − Ŷ)(Ŷ − g(X))] + E[(Ŷ − g(X))²]
                   ≥ E[(Y − Ŷ)²].

The cross term E[(Y − E[Y | X])(E[Y | X] − g(X))] vanishes by the Orthogonality Property,
since E[Y | X] − g(X) is just another function of X.
We have come a long way, and the answer is surprisingly intuitive: the best estimator of
Y given X (in the mean squared error sense) is simply the expected value of Y given X!
Theorem 4.27 (Law of Total Variance). Let X and Y be random variables. Then:

    var(Y) = E[var(Y | X)] + var(E[Y | X]).
The formula above is the analogue of the Law of Iterated Expectation for variance.
Example 4.29 (Galton-Watson Branching Process II). Let us revisit Example 4.22 with
the additional assumption that the number of offspring has variance σ². Using our new
tools, we can now compute var(X_n).
As we computed above, E[X_n | X_{n-1}] = μ X_{n-1}. Given that there are m individuals in the
population, the variance of the population at the next time step is mσ², since the variances
of independent random variables add (recall that we assume that the individuals' offspring
counts are independent). Hence, var(X_n | X_{n-1}) = σ² X_{n-1}. Now, we may apply (4.25):

    var(X_n) = E[var(X_n | X_{n-1})] + var(E[X_n | X_{n-1}]) = σ² μ^{n-1} N + μ² var(X_{n-1}).
For μ ≠ 1, the solution to the recurrence is obtained by finding a pattern after a few
iterations:

    var(X_n) = σ² N μ^{n-1} (1 + μ + ⋯ + μ^{n-1}) = σ² N μ^{n-1} (1 − μ^n)/(1 − μ).
Markov Chains
Markov chains are an important class of probabilistic models that lend themselves favorably
to analysis. First, we describe how to calculate useful properties of Markov chains: expected
hitting times and absorption probabilities. Then, we discuss the limiting behavior of Markov
chains and classify their long-term behavior.
5.1 Introduction
Markov chains are an important probabilistic method of modeling processes over time. Con-
cretely, consider a sequence of random variables X1 , X2 , X3 , . . . . (Note that we are only
considering discrete-time Markov chains.) We say that Xn represents the state of a system
at time n, and we are interested in the behavior of the system. In general, a system could
depend on the entire history of the system until that point; that is, the distribution of X_n
could depend on X_1, …, X_{n-1}. It is very difficult to analyze such systems, however, so we
make a powerful simplifying assumption:

    P(X_n = x_n | X_1 = x_1, …, X_{n-1} = x_{n-1}) = P(X_n = x_n | X_{n-1} = x_{n-1})     (5.1)
We call this assumption the Markov property. In words, it states that the distribution
of Xn only depends on the time immediately before, not on any of the previous times. A
common way to describe the Markov property is: the future depends on the past only through
the present. In many situations, the Markov property is a very reasonable assumption to
make. Even when the validity of the assumption is questionable, the Markov property may
still be necessary to make calculations tractable and analysis possible.
One can deduce an extended version of the Markov property. We rewrite the probability
P(X_{n+1} = x_{n+1}, …, X_{n+m} = x_{n+m} | X_1 = x_1, …, X_n = x_n) using the Chain Rule to obtain
∏_{k=1}^m P(X_{n+k} = x_{n+k} | X_1 = x_1, …, X_{n+k-1} = x_{n+k-1}). We can forget the first n − 1 states,
so we have ∏_{k=1}^m P(X_{n+k} = x_{n+k} | X_n = x_n, …, X_{n+k-1} = x_{n+k-1}). Finally, another appli-
cation of the Chain Rule yields P(X_{n+1} = x_{n+1}, …, X_{n+m} = x_{n+m} | X_n = x_n), which shows
that the first n − 1 time steps are not relevant once we know the nth time step.
Let us be explicit in what a Markov chain entails. A Markov chain consists of:
CHAPTER 5. MARKOV CHAINS 59
A set of states: we will usually take the state space to be K = {1, …, N}.
The initial distribution π_0, where π_0(i) is the probability that the chain starts in state i.
The probabilities of transitioning between states: for every pair of states i and j, we
must specify P(i, j), the probability of moving from state i to state j. Once we are in
state i, we know that we must transition to some state at the next time step (even if
we transition back to state i), so we require ∑_{j=1}^N P(i, j) = 1.
From the three components of a Markov chain, we can define a sequence of random variables
{X_n : n ∈ ℕ} such that

    P(X_0 = i) = π_0(i)     (5.2)

and

    P(X_n = j | X_{n-1} = i) = P(i, j).     (5.3)
Notice that (5.3) uses the Markov property discussed above. The interpretation of the ran-
dom variables is that the system transitions from state to state at every time step, and Xn
represents the state of the system at time n.
There are N² transition probabilities P(i, j), and we often organize them into a matrix P
such that the (i, j) entry of the matrix is P(i, j). We call P the transition probability
matrix. The condition ∑_{j=1}^N P(i, j) = 1 means that each row of P must sum to 1.
(Remark: There is another convention in which the columns of the transition matrix must
sum to 1. We will stick to the convention that the rows sum to 1, but do not be surprised if
you encounter different notation in the literature.)
Instead of writing the cumbersome notation P(X_n = i) for the distribution of X_n, we will
use the notation π_n to mean the distribution of X_n: π_n(i) = P(X_n = i).
Suppose that there are currently i white marbles in the first box. Then, the probability
that the number of white marbles will decrease to i − 1 is i/k (the probability of choosing
a white marble from the first box), multiplied by i/k (the probability of choosing a black
marble from the second box). By a similar argument, the probability that the number
of white marbles will increase to i + 1 is (k − i)²/k², so our transition probabilities are

    P(i, j) = i²/k²,          j = i − 1,
              2i(k − i)/k²,   j = i,
              (k − i)²/k²,    j = i + 1.
Example 5.2 (Random Walk I). A random walk is an example of a Markov chain
in which the state space K is infinite. Take K = ℤ and let P(i, i − 1) = 1 − p and
P(i, i + 1) = p for all i. If p = 1/2, we say the Markov chain is a symmetric random
walk.

(Diagram: each state i transitions to i + 1 with probability p and to i − 1 with
probability 1 − p.)
    P(i, j) = P(Y_n = j − i + 1)

Along with P(0, j) = P(1, j), we have specified the transition probabilities.
Actually, the equation above has exactly the same form as matrix multiplication. If we write
π_n as a row vector and P as a matrix, then we have found:

    π_n = π_{n-1} P     (5.4)

Iterating this, we can obtain the distribution at time n starting from the initial distribution:

    π_n = π_0 P^n     (5.5)
Passing from one time step to the next is simply multiplication by the transition matrix.
In practice, it is extremely easy for computers to carry out this matrix multiplication, so
Markov chain models are often very tractable. In theory, we can use the power of linear
algebra in order to analyze a probabilistic situation. Hopefully, you can begin to see why
Markov chains are such a useful family of models.
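In code, the relation π_n = π_0 P^n is one line of matrix algebra. A sketch with a hypothetical two-state chain (not an example from the text):

```python
import numpy as np

# Rows of P sum to 1 (the convention used in these notes).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi0 = np.array([1.0, 0.0])   # start in the first state with certainty

# Distribution after 50 steps: pi_n = pi_0 P^n.
pi_n = pi0 @ np.linalg.matrix_power(P, 50)

# For this chain the invariant distribution is (5/6, 1/6), and pi_n
# is already very close to it after 50 steps.
```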
We can write down the first-step equations in order to compute β(i). They are:

    β(i) = 1 + ∑_{j=1}^N P(i, j) β(j)     (5.6)
The first-step equations say: in order to reach s′, we must first take a step (that is the +1
in the equations above). After we take a step, we transition to some state j, and once
we are in state j, we expect to take β(j) more steps until we reach s′. Of course, we must
weight each possibility by the probability of transitioning to the state j, and we sum over all
states to account for all possibilities.
When we carry out this procedure, we end up with N equations, one for each state. Each
equation is linear in the β(i), so we can solve the system just as we solve any other linear
system (either through repeated substitution or Gaussian elimination). Once we solve the
system, we have β(s), which is the expected time to reach s′ from s.
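As a sketch of this procedure, the first-step equations can be solved with one call to a linear solver: collect the non-target states, keep only their rows and columns of P, and solve the resulting system. The chain below is a hypothetical three-state example:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.0, 0.0, 1.0]])   # target state 2 (made absorbing)
others = [0, 1]                   # non-target states

# beta(i) = 1 + sum_j P(i, j) beta(j), with beta = 0 at the target,
# becomes (I - Q) beta = 1 where Q restricts P to the non-target states.
Q = P[np.ix_(others, others)]
beta_others = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))

beta = np.zeros(len(P))
beta[others] = beta_others        # beta at the target state stays 0
# For this chain, beta(0) = beta(1) = 5 steps on average.
```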
Recall that a geometric random variable represents the number of flips of a biased coin
until we see heads. Therefore, we will define a two-state Markov chain with states
K = {1, 2}, where state 1 represents the state "we are still flipping the coin" and state
2 represents the terminal state (we are done flipping the coin). Then, we can see that
the expected time to reach state 2 from state 1 is precisely the expectation of X. We
write down the first-step equation

    β(1) = 1 + (1 − p) β(1),

since with probability 1 − p, we see tails and we have to keep flipping, and with probability
p, we see heads and we move to the terminal state. Now, we can solve the equation for
β(1) and we obtain pβ(1) = 1, or β(1) = 1/p.
We can generalize the above method. Suppose that each state i gives you a reward R(i) for
visiting that state, and we want to find the expected reward that we accumulate before we
reach any of the states in a set S′. Observe that the first-step equations above correspond
to the special case where each state gives a reward of 1 and the goal is the set S′ = {s′}. In
general, define β(i) to be the expected sum of rewards that we accumulate until we reach a
state in S′, starting from state i. Then β(i) = R(i) for any i ∈ S′, and for the other states:

    β(i) = R(i) + ∑_{j=1}^N P(i, j) β(j)     (5.7)
Here, we are saying that the accumulated rewards consist of the reward we obtain now, plus
the reward that we expect at state j (weighted by the probability of transitioning to j).
Caution: Do not memorize the equations in this section! Instead, carefully examine their
meaning and try to understand why they solve the problem we are trying to solve. Make
these equations part of your intuition.
These equations are similar to the first-step equations in that they look ahead one time
step. For each possible state j, weighted by the probability of transitioning to j, we sum
up the probability of reaching S before S′ starting at j. Solving these equations amounts to
solving another system of linear equations.
Example 5.5 (Gambler's Ruin). Consider a model of a gambler who starts with a dol-
lars and wishes to leave the casino with b dollars. The set of states is K = {0, …, b},
representing the amount of money that the gambler possesses. The gambler wins a bet
with probability p, and all bets are for one dollar. What is the probability of {b} before
{0}, that is, what is the probability that the gambler will succeed before going bankrupt?
Denote by α(i) the probability of reaching b before 0, starting with i dollars. We have
α(0) = 0 and α(b) = 1. For i ∉ {0, b}, we have the equations

    α(i) = p α(i + 1) + (1 − p) α(i − 1).
First, we assume that p = 1/2 (the odds are even). We search for a solution of the form
α(i) = c_1 i + c_0. From α(0) = 0, we see that c_0 = 0, and from α(b) = 1, we see that
c_1 = 1/b. Hence, our result is α(i) = i/b. To verify that α(i) represents a valid solution,
we plug the result into the difference equation:

    i/b = (1/2) · (i − 1)/b + (1/2) · (i + 1)/b.

Hence, we have found the correct solution: the probability of success increases linearly
with the amount of money with which we start.
Now, suppose that p ≠ 1/2. We search for a solution of the form α(i) = c_1 ρ^i + c_0.
Plugging in the form of the solution, we obtain:

    p ρ² − ρ + 1 − p = 0,

whose roots are ρ = 1 and ρ = (1 − p)/p. Using the boundary conditions α(0) = 0 and
α(b) = 1, we obtain

    α(i) = (1 − ρ^i)/(1 − ρ^b).

If p < 1/2, then ρ > 1, and this spells bad news for the gambler. (You should try
plugging in various values of ρ, i, and b above to obtain numerical results.)
Now, we can repeat the entire argument above, switching p and q = 1 − p (which amounts
to replacing ρ with 1/ρ) and i with b − i. This corresponds to computing the probability of
reaching 0 before b starting from state i, which we denote as β(i). We have the results

    β(i) = (b − i)/b,                       p = 1/2,
           (1 − ρ^{-(b-i)})/(1 − ρ^{-b}),   p ≠ 1/2.

It can be seen that α(i) + β(i) = 1, which means that we are guaranteed to hit b before
0 or 0 before b. It is impossible for the gambler to be stuck at the casino forever.
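The closed form for α(i) can be checked against a direct numerical solve of the first-step equations; a sketch (the values of b and p are arbitrary):

```python
import numpy as np

def ruin_alpha_closed(i, b, p):
    """alpha(i) = (1 - rho^i)/(1 - rho^b) with rho = (1-p)/p, or i/b when p = 1/2."""
    if p == 0.5:
        return i / b
    rho = (1 - p) / p
    return (1 - rho**i) / (1 - rho**b)

def ruin_alpha_solve(b, p):
    """Solve alpha(i) = p*alpha(i+1) + (1-p)*alpha(i-1) with alpha(0)=0, alpha(b)=1."""
    A = np.zeros((b + 1, b + 1))
    rhs = np.zeros(b + 1)
    A[0, 0] = 1.0                 # boundary: alpha(0) = 0
    A[b, b] = 1.0                 # boundary: alpha(b) = 1
    rhs[b] = 1.0
    for i in range(1, b):
        A[i, i] = 1.0
        A[i, i + 1] = -p
        A[i, i - 1] = -(1 - p)
    return np.linalg.solve(A, rhs)

alpha = ruin_alpha_solve(b=10, p=0.4)
# alpha[i] matches the closed form for every i.
```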
Example 5.6 (Gambler's Ruin II). In Example 5.5, we found that α(i), the probability
of reaching b before 0, starting from state i, is

    α(i) = i/b,                     p = 1/2,
           (1 − ρ^i)/(1 − ρ^b),     p ≠ 1/2,

where ρ = (1 − p)/p. Suppose now that p > 1/2, so that ρ < 1, and let the gambler's
goal tend to infinity, b → ∞. Then α(i) → 1 − ρ^i, the probability that the gambler
never goes bankrupt, and this probability tends to 1 as i → ∞, since ρ^i → 0.
Example 5.7. There is a simpler way of computing the probability of one event before
another which is conceptually interesting. As a concrete example, suppose that we have
two machines which fail with probabilities p_1 and p_2 respectively. When either one of
the machines fails, the entire factory shuts down. What is the probability that the first
machine has failed when the factory shuts down?
The question is really asking: given that the factory shuts down at some time step, what
is the probability that the first machine has failed? We can apply the law of conditional
probability here. Let M_i denote the event that machine i fails on the specified time step.

    P(M_1 | M_1 ∪ M_2) = P(M_1)/P(M_1 ∪ M_2) = p_1/(p_1 + p_2 − p_1 p_2)

(Apply the inclusion-exclusion rule.) See if you can obtain the same answer using a
Markov chain model.
The other possibility is a recurrent state, which is the opposite of the situation described
above. Suppose that we start from state i, and regardless of the next state, there is always
a non-zero probability that we will return to state i. In this case, since a Markov chain runs
forever, we must necessarily return to state i infinitely many times, which means we visit
state i infinitely often during the course of our Markov chain.
We now formalize our observations. Let f_n(i, j) denote the probability that a system starting
at state i first visits state j at time n. Let f(i, j) = ∑_{n=1}^∞ f_n(i, j) denote the probability
that a system starting at state i eventually visits state j.
Here, "i.o." means "infinitely often." The following theorem requires knowledge of the
Borel-Cantelli Lemma, Lemma 9.2.
Theorem 5.9 (Classification of States). For a state i in a Markov chain, there are two
possibilities:

1. ∑_{n=1}^∞ P^n(i, i) < ∞, P(X_n = i i.o. | X_0 = i) = 0, and i is transient.
2. ∑_{n=1}^∞ P^n(i, i) = ∞, P(X_n = i i.o. | X_0 = i) = 1, and i is recurrent.
Proof. The probability that, starting from state i, we return to state i at least k times
is f(i, i)^k. Letting k → ∞, we see that

    P(X_n = i i.o. | X_0 = i) = 0,  if f(i, i) < 1,
                                1,  if f(i, i) = 1.     (5.9)

Suppose that ∑_{n=1}^∞ P^n(i, i) < ∞. By the Borel-Cantelli Lemma (Lemma 9.2), we have
P(X_n = i i.o. | X_0 = i) = 0.
We sum over n:

    ∑_{n=1}^j P^n(i, i) = ∑_{n=1}^j ∑_{m=0}^{n-1} f_{n-m}(i, i) P^m(i, i)
                        = ∑_{m=0}^{j-1} P^m(i, i) ∑_{n=m+1}^j f_{n-m}(i, i)
                        = ∑_{m=0}^{j-1} P^m(i, i) ∑_{n=1}^{j-m} f_n(i, i)
                        ≤ f(i, i) + f(i, i) ∑_{m=1}^j P^m(i, i),

where we used the bound ∑_{n=1}^{j-m} f_n(i, i) ≤ f(i, i).
Remark: The classification of states does not require the Markov chain to be finite.
Definition 5.10. A Markov chain is irreducible if from any state we can reach any
other state.
Irreducibility has consequences for the distribution of a Markov chain. We say that a distri-
bution π is an invariant distribution if:

    π = πP     (5.10)
Observe that if at any time n, π_n = π, then the distribution does not change at the next time
step, which implies that the distribution will forever be the same at every time step from
that point onwards (by induction, of course). If we start off with the invariant distribution
(π_0 = π), then π_n = π for all n.
Theorem 5.11. A finite, irreducible Markov chain has a unique invariant distribution.
Of course, this has the hidden implication that a Markov chain that is not irreducible need
not have a unique invariant distribution. Consider, for example, P = I (the identity matrix).
Then any distribution is an invariant distribution.
It turns out that the invariant distribution tells us more. Let v_n(i) be the number of times
that the state i is visited in the first n time steps. (We can also write v_n(i) = ∑_{j=0}^{n-1} 1{X_j = i}.)
Then the fraction of time spent in state i is v_n(i)/n, and we have the following theorem:

Theorem 5.12. Suppose we have a finite, irreducible Markov chain with invariant dis-
tribution π. Then, for all states i,

    v_n(i)/n → π(i)

as n → ∞.
Thus, the invariant distribution also tells us the long-term fraction of time that we will spend
in each state.
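A quick empirical check of Theorem 5.12, on a hypothetical irreducible, aperiodic three-state chain: compute π as the left eigenvector of P for eigenvalue 1, then compare it to the visit fractions of a long simulated trajectory.

```python
import numpy as np
import random

P = np.array([[0.2, 0.8, 0.0],
              [0.3, 0.3, 0.4],
              [0.5, 0.0, 0.5]])

# Invariant distribution: left eigenvector of P for eigenvalue 1, normalized.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

# Simulate the chain and record the fraction of time spent in each state.
rng = random.Random(0)
cum = P.cumsum(axis=1)            # row-wise cumulative probabilities
n_steps = 100_000
counts = [0, 0, 0]
state = 0
for _ in range(n_steps):
    counts[state] += 1
    state = int(np.searchsorted(cum[state], rng.random()))
fractions = [c / n_steps for c in counts]
# fractions should be close to pi.
```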
Consider a Markov chain which, with probability 1, transitions from 1 to 2 and from 2 to
1. Clearly, the distribution will not tend toward a limit in this case, because π_n(1) and π_n(2)
will swap at every time step. In fact, as long as we can rule out this case, we are guaranteed
convergence to a limiting distribution.
Define the period of a state i to be the largest integer d(i) ≥ 1 (if such a number exists)
such that the number of time steps it takes to return to state i is necessarily a multiple of
d(i). We say that a Markov chain is aperiodic if the period of every state is 1. Intuitively,
aperiodicity is related to the tendency of the Markov chain to mix its states. To give an idea
of how aperiodicity is used, we can prove an important lemma using a little number theory.
Lemma 5.13. Let S be a set of positive integers which is closed under addition, such
that gcd(S) = 1. There exists an integer n_0 such that for all integers n ≥ n_0, n ∈ S.
Proof. If gcd(S) = 1, there is a sequence of positive integers (a_n) in S with greatest common
divisor 1. Let d_n = gcd{a_1, …, a_n}. Note that d_n is non-increasing with n and d_n ≥ 1,
so there exists some m such that d_m = 1. Therefore, we have found a finite subset
{a_1, …, a_m} ⊆ S with gcd{a_1, …, a_m} = 1. We can find integers c_1, …, c_m such that

    c_1 a_1 + ⋯ + c_m a_m = 1.

By relabeling the a_n, we can assume that the coefficients c_1, …, c_k are positive and
c_{k+1}, …, c_m are negative. Then

    (c_1 a_1 + ⋯ + c_k a_k) − (−c_{k+1} a_{k+1} − ⋯ − c_m a_m) = 1,

and writing b_1 for the first parenthesized sum and b_2 for the second, we have b_1, b_2 ∈ S
(by closure under addition) with b_1 − b_2 = 1.
Lemma 5.14. For i, j K for an irreducible, aperiodic Markov chain, there exists a
positive integer n0 such that P n (i, j) > 0 for all n n0 .
Proof. Let S be the set {n > 0 : P^n(j, j) > 0}. If n_1, n_2 ∈ S, then n_1 + n_2 ∈ S since
P^{n_1+n_2}(j, j) ≥ P^{n_1}(j, j) P^{n_2}(j, j) > 0, so S is closed under addition. Since the Markov
chain is aperiodic, gcd(S) = 1. Now, Lemma 5.13 gives an integer n_0 such that for all
n ≥ n_0, P^n(j, j) > 0.
Since the Markov chain is irreducible, there is some n_1 such that P^{n_1}(i, j) > 0. Take
n_0′ = n_0 + n_1. For any n ≥ n_0′, observe that n − n_1 ≥ n_0, so from the result above,
P^n(i, j) ≥ P^{n_1}(i, j) P^{n-n_1}(j, j) > 0, which completes the proof.
The lemma says that if we continue to raise the transition matrix to successively higher pow-
ers, then eventually we will reach a transition matrix with all positive entries. A transition
matrix with this property is known as a regular transition matrix; we have just proven
that the transition matrix for an irreducible, aperiodic Markov chain is regular.
Theorem 5.15. For a finite, irreducible Markov chain, the period of every state is the
same.
Furthermore:
Theorem 5.16. For a finite, irreducible, and aperiodic Markov chain, the distribution
converges to the limiting distribution:

    π_n → π as n → ∞, that is, π_n(i) → π(i) for all i as n → ∞.
Proof. First, assume that P(i, j) > 0 for all i and j. Write δ = min_{i,j∈K} P(i, j) > 0.
Since 1 = ∑_{j=1}^N P(i, j) ≥ Nδ, we must have 0 ≤ 1 − Nδ < 1. We will show that
ρ = 1 − Nδ satisfies |P^n(i, j) − π(j)| ≤ ρ^n.
Fix j and write m_n(j) = min_{i∈K} P^n(i, j) and M_n(j) = max_{i∈K} P^n(i, j). Let i_1 and i_2
be such that P^{n+1}(i_1, j) = m_{n+1}(j) and P^{n+1}(i_2, j) = M_{n+1}(j). Divide the states into a
disjoint union K = K_1 ∪ K_2, where K_1 consists of the states k for which P(i_1, k) ≥ P(i_2, k).
The first step is to bound ∑_{k∈K_1} (P(i_1, k) − P(i_2, k)). Since every entry of a row is at
least δ, the worst case is when P(i_2, k) = δ and P(i_1, k) is as large as possible. Since the
rows must sum to 1, the largest P(i_1, k) can be is 1 − (N − 1)δ (think of having δ in the
other N − 1 entries). Therefore, the sum is bounded by

    ∑_{k∈K_1} (P(i_1, k) − P(i_2, k)) ≤ 1 − (N − 1)δ − δ = 1 − Nδ = ρ.

Therefore, M_n(j) − m_n(j) ≤ ρ^n, which shrinks to 0 as n → ∞. This implies that for each
i, P^n(i, j) converges to π(j), and the rate of convergence satisfies |P^n(i, j) − π(j)| ≤ ρ^n.
When not every P(i, j) is positive, we can always find a finite integer m such that
P^m(i, j) > 0 for all i and j (this follows from Lemma 5.14). Applying the above case to
P^m, we see that we have exponential convergence in this case as well.
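The bound |P^n(i, j) − π(j)| ≤ ρ^n from the proof is easy to check numerically on a chain with all-positive entries (a hypothetical example):

```python
import numpy as np

P = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
N = 3
delta = P.min()          # smallest transition probability, here 0.2
rho = 1 - N * delta      # contraction rate from the proof, here 0.4

# Invariant distribution via the left eigenvector for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

# Verify |P^n(i,j) - pi(j)| <= rho^n for a range of n.
ok = all(
    np.abs(np.linalg.matrix_power(P, n) - pi).max() <= rho**n + 1e-12
    for n in range(1, 15)
)
```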
Example 5.18 (Random Walk on a Circle). Consider i.i.d. random variables ξ_n which
take on values in {0, …, N − 1}, with distribution P(ξ_n = i) = p_i. (Of course, we must
have ∑_{i=0}^{N-1} p_i = 1.) Let X_n = (∑_{i=0}^n ξ_i) mod N. Then X_n describes a random walk on a
circle. It is possible to model the X_n as a Markov chain.
One can check that the uniform distribution π(i) = 1/N is invariant for this chain (its
transition matrix is doubly stochastic). Since the distributions of finite, irreducible,
and aperiodic Markov chains converge, we have shown that the distribution of X_n con-
verges to π as n → ∞, which is to say that if the transition probabilities are posi-
tive, a random walk on a circle will always converge to the uniform distribution over
{0, …, N − 1}.
Define the first return time τ(i) = min{n ≥ 1 : X_n = i}, the first time after time 0 at
which the chain visits state i. Informally, if the distribution converges to π, then π(i)
represents the probability that X_n will be in state i for large n. Therefore, we may expect
the return time to behave like a geometric random variable, that is, we may expect
E[τ(i)] = 1/π(i). Indeed, we have the following result (which does not require the Markov
chain to be finite):

Theorem 5.19. Suppose that state i is recurrent and lim_{n→∞} P^n(i, i) = p. Then we
have p = E[τ(i)]^{-1}, where 1/∞ is interpreted as 0.
Proof. Fix n and suppose that we start at state i. Since state i is recurrent, the proba-
bility that we return to state i eventually is 1. We decompose this event in the following
way: either we have returned to state i at time n, or the last time that we return to
state i is time k, and then it takes longer than n − k steps before we reach state i again.

    1 = P^n(i, i) + ∑_{k=0}^{n-1} P^k(i, i) P(τ(i) > n − k) = ∑_{k=0}^n P^{n-k}(i, i) P(τ(i) > k)     (5.11)

In order to cover the case that p = 0, we must prove that p > 0 if and only if E[τ(i)] < ∞.
Suppose p > 0. Keeping only the terms with k ≤ j in (5.11),

    1 ≥ P^{n-j}(i, i) P(τ(i) > j) + ⋯ + P^n(i, i) P(τ(i) > 0).

Letting n → ∞, since lim_{n→∞} P^n(i, i) = p, we have 1 ≥ p ∑_{k=0}^j P(τ(i) > k), which puts
a bound on the partial sums: ∑_{k=0}^j P(τ(i) > k) ≤ p^{-1}. Since the RHS is a constant, we
let j → ∞ and obtain

    E[τ(i)] = ∑_{k=0}^∞ P(τ(i) > k) ≤ 1/p < ∞.
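The relation E[τ(i)] = 1/π(i) can also be checked by simulation; a sketch with a hypothetical two-state chain whose invariant distribution is easy to compute by hand:

```python
import random

# P(0,0)=0.5, P(0,1)=0.5, P(1,0)=0.25, P(1,1)=0.75.
# Solving pi = pi P gives pi = (1/3, 2/3), so the mean
# return time to state 0 should be 1/pi(0) = 3.
P = [[0.5, 0.5],
     [0.25, 0.75]]

def step(state, rng):
    # Move to state 0 with probability P[state][0], otherwise to state 1.
    return 0 if rng.random() < P[state][0] else 1

rng = random.Random(1)
trials = 20_000
total = 0
for _ in range(trials):
    state, steps = step(0, rng), 1
    while state != 0:
        state, steps = step(state, rng), steps + 1
    total += steps
mean_return = total / trials
# mean_return should be close to 3.
```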
Continuous Probability I
Our study of discrete random variables has allowed us to model coin flips and dice rolls, but
often we would like to study random variables that take on a continuous range of values (i.e.
an uncountable number of values). At first, understanding continuous random variables will
require a conceptual leap, but most of the results from discrete probability carry over into
their continuous analogues, with sums replaced by integrals.
Therefore, let us begin with a few definitions. It is natural if you find that you cannot inter-
pret these definitions immediately, since our intuition from discrete probability will require
some updates. Over the course of working with continuous probability, you will start to
build a new intuition.
The key to resolving the issues raised above is to recall our original definition of a probabil-
ity measure: a probability measure assigns real numbers to sets (not individual outcomes).
Even in this new setting, as long as we can consistently assign probability values to sets,
then probability theory will continue to work as before. One way of assigning consistent
probabilities to sets is through a density function, which will be the focus of our studies.
The density function of a continuous random variable X (also known as the probability
density function, or PDF) is a function f_X satisfying the following conditions:
CHAPTER 6. CONTINUOUS PROBABILITY I 74
1. f_X is non-negative: for all x ∈ ℝ, f_X(x) ≥ 0.
2. f_X is normalized: ∫_{-∞}^∞ f_X(x) dx = 1.
Compare the normalization condition here to the normalization condition in discrete prob-
ability, which says

    ∑_x P(X = x) = 1.
We can interpret the continuous normalization condition to mean that the probability that
X ∈ ℝ is 1. Similarly, we can define the probability that X lies in some interval [a, b] as:

    P(X ∈ [a, b]) := ∫_a^b f_X(x) dx     (6.2)
Remark: It does not matter whether I write the interval as [a, b] (including the endpoints)
or (a, b) (excluding the endpoints). The endpoints themselves do not contribute to the prob-
ability, since the probability of a single point is 0 (as discussed above).
The probability of an interval in R is interpreted as the area under the density function above
the interval. Similarly, we can calculate the probability of the union of disjoint intervals by
adding together the probabilities of each interval.
When we discuss continuous probability, it is also extremely useful to use the cumulative
distribution function of X, or the CDF, defined as:

    F_X(x) := P(X ≤ x) = ∫_{-∞}^x f_X(x′) dx′     (6.3)
Remark: Once again, it makes no difference whether I write P(X ≤ x) or P(X < x), since
we have that P(X = x) = 0. From now on, I will be sloppy and use the two interchangeably.
To obtain the PDF from the CDF, we use the Fundamental Theorem of Calculus:

    f_X(x) = (d/dx) F_X(x)     (6.4)
We have given an interpretation of the area under the density function f_X(x) as a probability.
The natural question is: what is the interpretation of fX (x) itself? In the next two sections,
we present an interpretation of fX (x) and introduce two ways of computing distributions in
continuous probability.
The first method is to simply work with the CDF and to obtain f_R(r) by differentiating
F_R(r). Since F_R(r) := P(R < r), and we have chosen a point uniformly at random inside
of the circle, the probability we are looking for is the ratio of the area of the inner circle
(which has radius r) to the area of the whole circle (which has radius 1):

    F_R(r) = πr²/(π · 1²) = r², so f_R(r) = F_R′(r) = 2r.
To briefly motivate the procedure, let us consider F_R(r + dr). Using a Taylor expansion,

    F_R(r + dr) = F_R(r) + (d/dr) F_R(r) dr + O((dr)²),

where the notation O((dr)²) includes terms of order (dr)² or higher. Recalling that
F_R(r) = P(R < r) and that the derivative of the CDF is the density function,

    P(R ∈ (r, r + dr)) = F_R(r + dr) − F_R(r) = f_R(r) dr + O((dr)²).
Let us immediately apply the formula we have derived to the motivating example. From the
CDF, we have that the probability of picking a point with R < r + dr is (r + dr)², and the
probability of picking a point with R < r is r². Therefore,

    P(R ∈ (r, r + dr)) = (r + dr)² − r² = 2r dr + (dr)².

The expression above must equal f_R(r) dr + O((dr)²). Therefore, by looking at the term
which is proportional to dr, we can identify f_R(r) = 2r and we obtain the same answer!
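The claim F_R(r) = r² can be checked empirically by sampling uniform points in the disk, here by rejection from the enclosing square; a sketch:

```python
import random
import math

rng = random.Random(0)

def sample_radius(rng):
    """Radius of a uniformly random point in the unit disk (rejection sampling)."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return math.hypot(x, y)

n = 100_000
radii = [sample_radius(rng) for _ in range(n)]
# Empirical P(R <= 0.5) should be close to F_R(0.5) = 0.25.
emp_cdf_half = sum(r <= 0.5 for r in radii) / n
```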
Initially, the differential method seems to require more calculations and is less familiar
than working with the CDF. However, through this discussion we have obtained an
interpretation for the density function:

    P(X ∈ (x, x + dx)) = f_X(x) dx + O((dx)²).

In words: the probability that the random variable X is found in the interval (x, x + dx)
is proportional to both the length of the interval dx and the density function evaluated in
the interval. The interpretation can be rephrased in the following way: the density function
f_X(x) is the probability per unit length near x.
The basic procedure for obtaining the density function directly is:
1. Use the information given in the problem to find P(X ∈ (x, x + dx)).
2. Identify f_X(x) as the coefficient of dx, dropping terms of order (dx)² and higher.
We can continue to use the formula var(X) = E[X²] − E[X]² to obtain the variance of a
continuous random variable. Since integrals are linear, linearity of expectation still holds.
The joint distribution of two continuous random variables X and Y is fX,Y (x, y). The
joint distribution represents everything there is to know about the two random variables.
The joint distribution must satisfy the normalization condition:

    ∫∫_{ℝ²} f_{X,Y}(x, y) dx dy = 1     (6.8)
We say that X and Y are independent if and only if the joint density factorizes:

    f_{X,Y}(x, y) = f_X(x) f_Y(y)     (6.9)
To obtain the marginal distribution of X from the joint distribution, integrate out the
unnecessary variables (in this case, we integrate out Y):

    f_X(x) = ∫_{-∞}^∞ f_{X,Y}(x, y) dy     (6.10)
The joint distribution can be extended easily to multiple random variables X_1, …, X_n. The
joint density satisfies the normalization condition:

    ∫_{ℝⁿ} f_{X_1,…,X_n}(x_1, …, x_n) dx_1 ⋯ dx_n = 1     (6.11)
In other words, to find the probability that (X, Y ) is in a region J, we integrate the joint
density over the region J. As in multivariable calculus, it is often immensely helpful to draw
the region of integration (J) before actually computing the integral.
Theorem 6.1 (Continuous Tail Sum Formula). Let X be a non-negative random vari-
able. Then:

    E[X] = ∫_0^∞ (1 − F_X(x)) dx     (6.14)
Proof.

    E[X] = ∫_0^∞ x f_X(x) dx = ∫_0^∞ ∫_0^x dt f_X(x) dx = ∫_0^∞ ∫_t^∞ f_X(x) dx dt
         = ∫_0^∞ P(X > t) dt = ∫_0^∞ (1 − F_X(t)) dt.
The proof is quite similar to the discrete case. Interchanging the order of integration is
justified by Fubini's Theorem because the integrand is non-negative.
We use the Tail Sum Formula: if x < c, then P(min(c, X) > x) = P(X > x) = 1 − F_X(x),
and if x ≥ c, then P(min(c, X) > x) = 0. Hence

    E[min(c, X)] = ∫_0^∞ P(min(c, X) > x) dx = ∫_0^c (1 − F_X(x)) dx.
In this example, X is partly discrete and partly continuous. We say that X is an example
of a mixed random variable.
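As a numeric illustration of the tail-sum computation, take a hypothetical X ~ Expo(λ), for which the truncated integral has the closed form (1 − e^{-λc})/λ:

```python
import math

lam, c = 2.0, 1.5

# Midpoint Riemann sum of the tail integral: E[min(c, X)] = int_0^c e^{-lam*x} dx.
steps = 200_000
dx = c / steps
integral = sum(math.exp(-lam * (k + 0.5) * dx) for k in range(steps)) * dx

# Closed form of the same integral.
closed_form = (1 - math.exp(-lam * c)) / lam
```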
Similarly, suppose that X and Y are i.i.d. Unif(0, 1) random variables. Since they are
independent, the joint distribution is simply the product of their respective density functions:

    f_{X,Y}(x, y) = f_X(x) f_Y(y) = 1 for (x, y) ∈ [0, 1]².
The uniform distribution is especially simple because the density is constant. Suppose we
want to find the probability that (X, Y ) will lie in a region J R2 . We can integrate the
joint density to obtain:
    P((X, Y) ∈ J) = ∫∫_J f_{X,Y}(x, y) dx dy = ∫∫_J dx dy = Area(J)
The procedure for computing probabilities is therefore very simple. To find the probability
of an event involving two i.i.d. Unif(0, 1) random variables X and Y , draw the unit square,
and shade in the region in the square which corresponds to the given event. The area of
the shaded region is the desired probability. As a result, many questions involving uniform
distributions have very geometrical solutions.
To compute the expectation, we can simply observe that the distribution is symmetric about
x = 1/2, so we can immediately write down 1/2 as the expectation. To make the argument
slightly more formal, consider the random variable Y = 1 − X. Observe that Y and X have
identical distributions (but they are not independent). Therefore, E[X] = E[Y] = E[1 − X],
or E[X] = 1 − E[X] using linearity. Hence,

    E[X] = 1/2.     (6.17)
One question you may have is: why are X and Y identically distributed? To answer that
question in a rigorous way, we will need the change of variables formula in order to show that
X and Y have the same density function (and hence the same distribution). For now, accept
the intuition! (Alternatively, carry out the integral E[X] = ∫_0^1 x dx, but that's boring.)
Theorem 6.3. Let T be a random variable satisfying the Memoryless Property. Then
T has the density function

    f_T(t) = λe^{-λt}, t > 0,     (6.19)

where λ > 0 is a parameter. We say that T follows the exponential distribution, or
T ~ Expo(λ).
Proof. We search for a differentiable CDF which satisfies the Memoryless Property, which
will uniquely specify the density. Write G_T(t) = P(T > t) for the survival function. Then

    f_T(t) dt = P(T ∈ (t, t + dt)) = P(T > t, T < t + dt) = P(T < t + dt | T > t) P(T > t)
              = (1 − P(T > t + dt | T > t)) P(T > t) = (1 − P(T > dt)) P(T > t)
              = (1 − G_T(dt)) G_T(t).

Notice the use of the Memoryless Property. Consider the initial condition of G_T(t): we
know that G_T(0) = P(T > 0) = 1. Additionally, since dt is a small quantity, we can take
the Taylor expansion of G_T(dt) about t = 0 and drop terms of order (dt)² and higher:

    f_T(t) dt ≈ (1 − (G_T(0) + G_T′(0) dt)) G_T(t) = −G_T′(0) G_T(t) dt,

where we used G_T(0) = 1.
Recall that

    (d/dt) G_T(t) = (d/dt)(1 − F_T(t)) = −f_T(t).

We obtain a differential equation for G_T(t):

    (d/dt) G_T(t) = G_T′(0) G_T(t)

The solutions to the differential equation are all of the form G_T(t) = ce^{-λt}, where c ∈ ℝ
is a constant and we wrote λ = −G_T′(0). Using the initial condition G_T(0) = 1, we see
that c = 1, so that G_T(t) = e^{-λt}. The CDF is

    F_T(t) = 1 − G_T(t) = 1 − e^{-λt}.
The density is easily obtained by differentiating. Note that the condition λ > 0 arises
because if λ < 0, then the density function does not satisfy the condition of being
non-negative. When λ = 0, the density is f_T(t) = 0 and cannot normalize to 1, so λ = 0 is
also not a valid choice. Through this derivation, we have also shown the uniqueness of
the exponential distribution.
In the lecture notes and slides, it was shown that the exponential distribution has the
Memoryless Property. Here, we have shown that a density with the Memoryless Property is
exponential. Hence, we have proven both directions: a continuous probability distribution is
exponential if and only if it satisfies the Memoryless Property! This is a remarkable result,
and it is also true for the geometric distribution in the discrete case (although we have not
proved it). The Memoryless Property should be thought of as the defining characteristic of
the exponential distribution.
We will proceed to compute the basic properties of the exponential distribution. First, check
for yourself that the exponential distribution is properly normalized:

∫_0^∞ λe^{−λt} dt = 1.    (6.21)
We claim that the normalization condition itself is already most of the work necessary to
compute the mean, E[T] = 1/λ.
Example 6.4. For the exponential distribution, we can also calculate the median, that
is, the t such that P(T ≤ t) = 1/2 = P(T ≥ t). We have 1/2 = e^{−λt}, so ln 2 = λt,
and we find that the median is (ln 2)/λ. We can see that, compared to the mean, there is
an additional ln 2 factor. In the subject of chemistry, this is also known as the half-life
of a radioactive substance.
Proof. The easiest way to prove this is to once again consider the survival function
G_T(t) = P(T > t).
Example 6.6. Suppose that we have n i.i.d. Expo(1) random variables. What is the
expectation of the maximum?
View the exponential random variables as representing the lifetimes of n light bulbs. The
expectation of the maximum is the expected time for all n light bulbs to die. This is
simply the expected time until the first light bulb dies, plus the expected time it takes for
the remaining n − 1 light bulbs to die. We can compute each of these quantities separately.
The first light bulb to die is the minimum of n Expo(1) random variables. By Theorem 6.5,
the minimum of the light bulbs is Expo(n), with mean 1/n. Let S_n be the time for all n
light bulbs to die. Once the first light bulb dies, we wait for n − 1 light bulbs to die. By the
Memoryless Property, however, this is the same as if we had started with n − 1 i.i.d. Expo(1)
random variables and asked for the maximum. In other words, E[S_n] = 1/n + E[S_{n−1}].
By solving this recurrence, we obtain

E[S_n] = Σ_{k=1}^n 1/k = H_n ≈ ln n + γ.
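As a quick sanity check, here is a Monte Carlo sketch of this result (the helper name `mean_max_expo`, the seed, and the trial count are our choices, not from the text):

```python
import math
import random

random.seed(0)

def mean_max_expo(n, trials=20000):
    """Monte Carlo estimate of E[max of n i.i.d. Expo(1) random variables]."""
    total = 0.0
    for _ in range(trials):
        total += max(random.expovariate(1.0) for _ in range(n))
    return total / trials

n = 10
harmonic = sum(1.0 / k for k in range(1, n + 1))  # H_n = E[S_n]
estimate = mean_max_expo(n)
print(estimate, harmonic)  # estimate should be close to H_10 ≈ 2.929
```

The estimate agrees with H_n to within Monte Carlo error, and H_10 ≈ ln 10 + γ ≈ 2.88, consistent with the asymptotic above.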
For the next example, imagine a store with two service lines. Two customers are already
waiting, one in each of the two lines; call these customers A and B respectively. A third
customer, C, arrives, and waits for the first available teller. Assume that the two lines have
service times which are exponentially distributed, with parameters λ₁ and λ₂ respectively.

Suppose first that λ₁ = λ₂ = λ, and consider the time at which the first customer leaves the
store. The other customer who is still waiting in line, by the Memoryless Property, has a
service time distributed as a brand-new Expo(λ) random variable. Customer C enters the
service line which is now free, so the service time for customer C is also Expo(λ). By
symmetry, the probability that customer C will leave first (among the two customers still in
the store) is 1/2.
Chapter 7
Continuous Probability II
We continue our study of continuous random variables with more advanced tools: conditional
probability, change of variables, convolution. We then introduce more distributions,
including the all-important normal distribution (with the associated Central Limit Theorem).
Example 7.1. Suppose X Unif[0, 1] and Y Unif[0, X]. That is, conditioned on
X = x, Y has a Unif[0, x] distribution. What is P (Y > 1/2)?
P(Y > 1/2 | X = x) = (x − 1/2)/x = 1 − 1/(2x),  x > 1/2.

We integrate over values of x > 1/2.

P(Y > 1/2) = ∫ P(Y > 1/2 | X = x) f_X(x) dx = ∫_{1/2}^1 (1 − 1/(2x)) dx
           = [x − (1/2) ln x]_{x=1/2}^{x=1} = (1 − ln 2)/2
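A minimal simulation sketch of this two-stage experiment (the seed and sample count are ours):

```python
import math
import random

random.seed(1)

trials = 200000
hits = 0
for _ in range(trials):
    x = random.random()          # X ~ Unif[0, 1]
    y = random.uniform(0.0, x)   # conditioned on X = x, Y ~ Unif[0, x]
    if y > 0.5:
        hits += 1

estimate = hits / trials
exact = (1 - math.log(2)) / 2    # the answer computed above, ≈ 0.1534
print(estimate, exact)
```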
Proof. The method I prefer to use is to first manipulate the CDF and then differentiate.
First, we consider the case when g is strictly increasing.
My advice is to not bother remembering the change of variables formula. Instead, remember
the basic outline of the proof: write down the CDF of Y , then write the expression in terms
of the CDF of X, and then differentiate.
Why did we assume that the function g was one-to-one? Only out of convenience: the
condition that g is one-to-one is not necessary for change of variables to work, although the
change of variables formula is somewhat more complicated:
f_Y(y) = Σ_{x : g(x)=y} f_X(x) |h′(y)|    (7.7)
The idea is that since g is no longer one-to-one, there may be many values of x such that
g(x) = y, so we must sum up over all x such that g(x) = y. Can we define change of variables
in the discrete case? Actually, the discrete case is rather easy.
P(Y = y) = Σ_{x : g(x)=y} P(X = x)    (7.8)
If you think about the above equation, you will realize that we have been using the formula
all along without knowing it.
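For instance, here is formula (7.8) in action on a small hypothetical pmf with g(x) = x² (the example is ours, not from the text):

```python
from collections import defaultdict

# Hypothetical example: X uniform on {-2, -1, 0, 1, 2}, and Y = g(X) = X^2.
p_X = {x: 1 / 5 for x in (-2, -1, 0, 1, 2)}

def g(x):
    return x * x

# Discrete change of variables: sum P(X = x) over all x with g(x) = y.
p_Y = defaultdict(float)
for x, p in p_X.items():
    p_Y[g(x)] += p

print(dict(p_Y))  # {4: 0.4, 1: 0.4, 0: 0.2}
```

Since g is not one-to-one here, the probabilities of x = ±1 and x = ±2 merge, exactly as the sum over {x : g(x) = y} prescribes.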
We will use change of variables in the section about the normal distribution.
7.2.2 Convolution
Here, we explain how to compute the density of Z := X + Y .
P(Z ∈ (z, z + dz)) = f_Z(z) dz

Note that the event Z = X + Y ∈ (z, z + dz) is equivalent to the event that
Y ∈ (z − x, z − x + dz), where x ranges over all possible values of X. Hence, we can find
P(Z ∈ (z, z + dz)) by integrating over the joint density of X and Y.

f_Z(z) dz = ∫ ∫_{z−x}^{z−x+dz} f_{X,Y}(x, y) dy dx
Since dz is an infinitesimal length, we can assume that f_{X,Y}(x, y) does not change over the
interval of integration for y. The inner integral therefore equals the value of the function,
f_{X,Y}(x, z − x), multiplied by the length of the interval, dz.
f_Z(z) dz = (∫ f_{X,Y}(x, z − x) dx) dz

Therefore, we can identify:

f_Z(z) = ∫ f_{X,Y}(x, z − x) dx    (7.9)
A similar argument gives the density of the ratio Z := X/Y. We assume that dz is small so
that f_{X,Y}(x, y) is effectively constant over the inner integral.

f_Z(z) dz = ∫_{−∞}^0 f_{X,Y}(yz, y)(−y dz) dy + ∫_0^∞ f_{X,Y}(yz, y)(y dz) dy
          = (∫ |y| f_{X,Y}(yz, y) dy) dz

Therefore, we obtain:

f_Z(z) = ∫ |y| f_{X,Y}(yz, y) dy    (7.11)
Example 7.3. Let X and Y be i.i.d. Expo(). What is the density of Z = X/Y ?
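Evaluating (7.11) with f_{X,Y}(x, y) = λe^{−λx} · λe^{−λy} gives f_Z(z) = 1/(1 + z)² for z > 0, independent of λ. A numerical sketch of that computation (the quadrature step size and cutoff are our choices):

```python
import math

lam = 2.0  # rate λ; the final density turns out not to depend on it

def f_Z(z, h=1e-4, ymax=20.0):
    """Midpoint-rule evaluation of f_Z(z) = ∫ |y| f_X(yz) f_Y(y) dy for Z = X/Y."""
    total = 0.0
    y = h / 2
    while y < ymax:
        total += y * (lam * math.exp(-lam * y * z)) * (lam * math.exp(-lam * y)) * h
        y += h
    return total

for z in (0.5, 1.0, 3.0):
    print(z, f_Z(z), 1.0 / (1.0 + z) ** 2)  # numeric vs closed form
```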
In fact, we will solve a slightly more general integral by considering e^{−αx²/2} instead. (We
can always set α = 1 at the end of our computations, after all.) The reason for doing so may
not be clear, but it will actually save us some time later on.
As I mentioned above, e^{−x²/2} cannot be integrated normally, so we will need to use another
trick (hopefully you find all of these tricks somewhat interesting). The trick here is to
consider the square of the integral:
I² = (∫_ℝ e^{−x²/2} dx)(∫_ℝ e^{−y²/2} dy) = ∫_ℝ ∫_ℝ e^{−(x²+y²)/2} dx dy
Notice that the integral depends only on the quantity x² + y², which you may recognize as
the square of the distance from the origin. (The variables x and y were chosen suggestively
to bring to mind the picture of integration on the plane ℝ².) Therefore, it is natural to
change to polar coordinates with the substitutions

x² + y² = r²,    (7.13)
dx dy = r dr dθ.    (7.14)
(Don't forget the extra factor of r that arises due to the Jacobian. For more information,
consult a multivariable calculus textbook which develops the theory of integration under
change of coordinates.)
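A quick numerical check of the polar-coordinate result I² = 2π (a sketch; the grid parameters are our choices):

```python
import math

# Midpoint-rule approximation of I = ∫_R e^{-x^2/2} dx, truncated to [-10, 10],
# where the neglected tails are smaller than e^{-50}.
h = 0.001
I = sum(math.exp(-(((k + 0.5) * h) ** 2) / 2) for k in range(-10000, 10000)) * h

print(I ** 2, 2 * math.pi)  # both ≈ 6.2832
```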
E[X] = 0.
var(X) = 1.
Hopefully, it should be clear by now why we need the α: it is simply a tool that we use to
integrate more easily, and then we discard it after we finish the actual integration. In any
case, we have verified that the mean and variance are indeed 0 and 1 respectively.
We now apply the change of variables technique to the standard normal distribution, both as
an illustration of the technique, and also to obtain the general form of the normal distribution.
Consider the function g(x) = μ + σx (with μ ∈ ℝ and σ > 0). Let Y = g(X). We proceed
to find the density of Y. First, note that the inverse function h = g^{−1} is

h(y) = (y − μ)/σ

and

|h′(y)| = 1/σ.

Then we have that

f_Y(y) = f_X(h(y)) |h′(y)| = (1/σ) f_X((y − μ)/σ).
Plugging (y − μ)/σ into the standard normal density yields

f_Y(y) = (1/(σ√(2π))) e^{−(y−μ)²/(2σ²)}.    (7.17)
What are the mean and variance of Y? Recall that Y = μ + σX. Using the basic properties
of linearity and scaling,

E[Y] = μ,
var(Y) = σ².
Let X, Y be i.i.d. standard normal random variables. Since they are independent, their joint
density is the product of the individual densities.
f_{X,Y}(x, y) = f_X(x) f_Y(y) = (1/√(2π)) e^{−x²/2} · (1/√(2π)) e^{−y²/2} = (1/(2π)) e^{−(x²+y²)/2}
Observe that the joint density only depends on x² + y² = r², which is the square of the
distance from the origin. (In fact, we already noticed this property when we were integrating
the Gaussian.) In polar coordinates, the density only has a dependence on r, not on θ, which
exhibits an important geometric property: the joint density is rotationally symmetric, that
is, the Gaussian looks exactly the same if you rotate your coordinate axes. How can we
utilize this geometric property to prove our result?
Since the joint density is invariant under rotations, changing variables from x ↦ x′ and
y ↦ y′ should not affect the value of the integral. Hence,

F_Z(z) = P(Z < z) = ∫∫_J f_{X,Y}(x′, y′) dx′ dy′ = (∫_{−∞}^{z/√2} f_X(x′) dx′)(∫_ℝ f_Y(y′) dy′)
       = ∫_{−∞}^{z/√2} f_X(x′) dx′ = P(X < z/√2) = P(√2 X < z).
We see that Z has the same distribution as √2 X, where X ∼ N(0, 1). From our previous
section, we saw that scaling a standard normal random variable by √2 yields the N(0, 2)
distribution. However, X ∼ N(0, 1) and Y ∼ N(0, 1), and we have just found that
Z = X + Y ∼ N(0, 2). Could summing up independent normal random variables really
be as easy as adding their parameters?
Lemma 7.4 (Rotational Invariance of the Gaussian). Let X, Y be i.i.d. standard normal
random variables and let α, β ∈ [0, 1] be such that α² + β² = 1. Then αX + βY ∼ N(0, 1).
Proof. We simply extend the previous argument to an arbitrary rotation. By assumption,
we can write

sin(θ) = β,
cos(θ) = α,

and P(αX + βY < z) is the integral of the joint density over the region

J = {(x, y) : x cos(θ) + y sin(θ) < z} ⊆ ℝ².

Under the change of coordinates

x = x′ cos(θ) − y′ sin(θ),
y = x′ sin(θ) + y′ cos(θ),

the region becomes J = {(x′, y′) : x′ < z} ⊆ ℝ², and the argument proceeds as before.
Theorem 7.5. Let X_1, …, X_n be independent with X_i ∼ N(μ_i, σ_i²). Then

X := X_1 + ⋯ + X_n ∼ N(Σ_{i=1}^n μ_i, Σ_{i=1}^n σ_i²).    (7.18)
Proof. We will prove the result for the sum of two independent normal random variables.
The general result follows as a quick exercise in induction. Let Z_i = (X_i − μ_i)/σ_i be the
standardized form of X_i. Then Z_i ∼ N(0, 1). Additionally, observe that

Z := √(σ_1²/(σ_1² + σ_2²)) · (X_1 − μ_1)/σ_1 + √(σ_2²/(σ_1² + σ_2²)) · (X_2 − μ_2)/σ_2
   = (X_1 + X_2 − (μ_1 + μ_2))/√(σ_1² + σ_2²).

The two coefficients have squares summing to 1, so by Lemma 7.4, Z ∼ N(0, 1). Rearranging,
X_1 + X_2 = μ_1 + μ_2 + √(σ_1² + σ_2²) Z, so

X_1 + X_2 ∼ N(μ_1 + μ_2, σ_1² + σ_2²).
The theorem is beautiful, so do not misuse it! The most common mistake that students
make is that they forget the important rule: variances add, standard deviations do not.
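A simulation sketch of that rule (seed and sample size are ours): with σ₁ = 3 and σ₂ = 4, the standard deviation of the sum is √(9 + 16) = 5, not 3 + 4 = 7.

```python
import random
import statistics

random.seed(2)

sigma1, sigma2 = 3.0, 4.0
samples = [random.gauss(0, sigma1) + random.gauss(0, sigma2) for _ in range(200000)]

sd = statistics.pstdev(samples)
print(sd)  # ≈ sqrt(3^2 + 4^2) = 5, not 3 + 4 = 7
```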
Proposition 7.6.

(x^{−1} − x^{−3}) e^{−x²/2} ≤ ∫_x^∞ e^{−z²/2} dz ≤ x^{−1} e^{−x²/2}    (7.19)
Proof. For the left inequality, we evaluate the following integral by parts.

∫_x^∞ (1 − 3z^{−4}) e^{−z²/2} dz = ∫_x^∞ e^{−z²/2} dz − ∫_x^∞ 3z^{−4} e^{−z²/2} dz
  = ∫_x^∞ e^{−z²/2} dz + [z^{−3} e^{−z²/2}]_{z=x}^∞ + ∫_x^∞ z^{−2} e^{−z²/2} dz
  = ∫_x^∞ e^{−z²/2} dz − x^{−3} e^{−x²/2} + [−z^{−1} e^{−z²/2}]_{z=x}^∞ − ∫_x^∞ e^{−z²/2} dz
  = (x^{−1} − x^{−3}) e^{−x²/2}

Since 1 − 3z^{−4} ≤ 1, the left-hand side is at most ∫_x^∞ e^{−z²/2} dz, which gives the lower
bound.
We have shown that, asymptotically, the tail probability of the Gaussian distribution goes
as P(Z > z) ∼ (2π)^{−1/2} z^{−1} e^{−z²/2} as z → ∞.
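A numerical sanity check of Proposition 7.6 (a sketch; the integration routine and test points are our choices):

```python
import math

def upper_tail(x, h=1e-4, cutoff=10.0):
    """Midpoint-rule value of ∫_x^∞ e^{-z^2/2} dz (truncated at x + cutoff)."""
    total, z = 0.0, x + h / 2
    while z < x + cutoff:
        total += math.exp(-z * z / 2) * h
        z += h
    return total

for x in (1.0, 2.0, 4.0):
    tail = upper_tail(x)
    lower = (1 / x - 1 / x**3) * math.exp(-x * x / 2)
    upper = (1 / x) * math.exp(-x * x / 2)
    assert lower <= tail <= upper
    print(x, lower, tail, upper)
```

Notice how quickly the two bounds pinch together as x grows, which is exactly the asymptotic statement above.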
First, let us make the statement precise. Suppose (X_i) is a sequence of i.i.d. random variables
with mean μ and finite moments. Define

X̄_n := (1/n) Σ_{i=1}^n X_i.

Then as n → ∞, √n (X̄_n − μ) converges in distribution to the normal distribution.
Remember that the Weak Law of Large Numbers tells us that as we increase the number
of samples, the sample mean converges in probability to the expected value. The Central
Limit Theorem is stronger: it states that the distribution of the (centered and scaled) sample
mean also converges to a particular distribution, and it is our good friend the normal
distribution! The power of the theorem is that we can start from any distribution at all,
discrete or continuous, yet the sample mean will still converge to a single distribution. In
fact, there are even versions of the Central Limit Theorem with weaker assumptions. Of all
of the theorems in mathematics, the Central Limit Theorem is one of my favorites.
The Central Limit Theorem is the basis for much of modern statistics. When statisticians
carry out hypothesis testing or construct confidence intervals, they do not care that they do
not know the exact distribution of their data. They simply collect enough samples (30 is
usually sufficient) until their sampling distributions are roughly normally distributed.
We have not answered questions such as what does it mean for a distribution to converge?
and why is the Central Limit Theorem true? The former question requires a more rigorous
study of distributions. The latter question will only be answered partially below.
Example 7.7. Suppose that the X_i are i.i.d. with mean μ and standard deviation σ. Let X̄
be the sample average as before. The CLT tells us that Z_n = (X̄ − μ)/(σ/√n) → N(0, 1)
as n → ∞. Now, view the X_i as observations, and we will construct a 95% confidence
interval for μ. First, we relate the desired probability to Z_n, and then make use of the
normal approximation.
For X ∼ Bin(n, p),

P(a ≤ X ≤ b) = P(a − np ≤ X − np ≤ b − np)
  = P((a − np)/√(np(1 − p)) ≤ (X − np)/√(np(1 − p)) ≤ (b − np)/√(np(1 − p)))
  ≈ Φ((b − np)/√(np(1 − p))) − Φ((a − np)/√(np(1 − p))).
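For a concrete comparison with X ∼ Bin(100, 1/2) (a sketch; here Φ is computed via the error function, and the interval [45, 55] is our choice):

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 100, 0.5
a, b = 45, 55

exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(a, b + 1))
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
approx = phi((b - mu) / sigma) - phi((a - mu) / sigma)

print(exact, approx)  # ≈ 0.729 vs ≈ 0.683; a continuity correction narrows the gap
```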
1. k − 1 points must lie in the interval (−∞, x), which has probability (F_X(x))^{k−1}.
2. One point must lie in the interval (x, x + dx), which has probability f_X(x) dx.
3. n − k points must lie in the interval (x + dx, ∞), which has probability (1 − F_X(x))^{n−k}.
In addition, there are n ways to choose which of the n points lies in the interval (x, x + dx),
and out of the remaining n − 1 points, there are C(n−1, k−1) ways to choose which points lie
in the interval (−∞, x). Hence, we have

f_{X^{(k)}}(x) dx = P(X^{(k)} ∈ (x, x + dx)) = n C(n−1, k−1) f_X(x)(F_X(x))^{k−1}(1 − F_X(x))^{n−k} dx,

so our result is:

f_{X^{(k)}}(x) = n C(n−1, k−1) f_X(x)(F_X(x))^{k−1}(1 − F_X(x))^{n−k}    (7.21)
In the special case when the X_i are i.i.d. Unif[0, 1] random variables, we have f_X(x) = 1 and
F_X(x) = x for 0 < x < 1, so the density of X^{(k)} is

f_{X^{(k)}}(x) = (n!/((k − 1)!(n − k)!)) x^{k−1}(1 − x)^{n−k},  0 < x < 1.    (7.22)
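A simulation sketch of (7.22) (seed and sample counts are ours): the kth order statistic of n uniforms has mean k/(n + 1), the mean of the β(k, n − k + 1) distribution identified below.

```python
import random

random.seed(3)

n, k = 9, 3
trials = 100000
total = 0.0
for _ in range(trials):
    sample = sorted(random.random() for _ in range(n))
    total += sample[k - 1]     # the kth smallest of n i.i.d. Unif[0, 1] points

estimate = total / trials
print(estimate, k / (n + 1))   # the density above has mean k/(n+1) = 0.3
```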
where B(r, s) is a normalizing constant. First, we observe that if X^{(k)} is the kth smallest
point out of n i.i.d. Unif[0, 1] random variables (see the previous section), then
X^{(k)} ∼ β(k, n − k + 1). This provides an interpretation for the parameters of the β(r, s)
distribution: there are k points less than or equal to X^{(k)}, and n − k + 1 points greater
than or equal to X^{(k)}. By setting n = r + s − 1 and k = r in (7.22), we can obtain an
expression for B(r, s) for integer values of r and s:

1/B(r, s) = (r + s − 1)!/((r − 1)!(s − 1)!)    (7.24)
Remember that B(r, s) is a normalizing constant for the density (7.23), which implies

B(r, s) = ∫_0^1 x^{r−1}(1 − x)^{s−1} dx = ((r − 1)!(s − 1)!)/(r + s − 1)!.    (7.25)

Remembering (7.25) can actually be useful in solving integrals quickly.
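For example, (7.25) evaluates ∫_0^1 x³(1 − x)² dx instantly as 3! · 2!/6! = 1/60. A numerical sketch confirming it (the quadrature is ours):

```python
import math

def beta_integral(r, s, steps=200000):
    """Midpoint-rule value of ∫_0^1 x^{r-1} (1-x)^{s-1} dx."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** (r - 1) * (1 - x) ** (s - 1) * h
    return total

r, s = 4, 3
numeric = beta_integral(r, s)
closed = math.factorial(r - 1) * math.factorial(s - 1) / math.factorial(r + s - 1)
print(numeric, closed)  # both ≈ 1/60 ≈ 0.016667
```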
so we have

var(X) = r(r + 1)/((r + s)(r + s + 1)) − r²/(r + s)² = rs/((r + s)²(r + s + 1)).
Of course, the answer is Bayes' Rule, but with continuous random variables. To be more
clear, let X ∼ β(r, s) be a random variable which represents our prior belief about the bias
of the coin, and let A be the event that we flip h heads and t tails. What is the conditional
distribution of X after observing A, f_{X|A}(x)?

f_{X|A}(x) dx = f_X(x) P(A | X = x) dx / P(A),

f_{X|A}(x) = f_X(x) P(A | X = x) / P(A).
We know that f_X(x) = (B(r, s))^{−1} x^{r−1}(1 − x)^{s−1} for 0 < x < 1. P(A | X = x) is the
probability of flipping h heads and t tails if the bias of our coin is x, which is the binomial
distribution with h + t trials and probability of success x. Therefore,

P(A | X = x) = ((h + t)!/(h! t!)) x^h (1 − x)^t.
To obtain P(A), we should integrate P(A | X = x) against the density of X:

P(A) = ∫ P(A | X = x) f_X(x) dx = ∫_0^1 ((h + t)!/(h! t!)) x^h (1 − x)^t (1/B(r, s)) x^{r−1}(1 − x)^{s−1} dx
     = ((h + t)!/(h! t!)) (1/B(r, s)) ∫_0^1 x^{r+h−1}(1 − x)^{s+t−1} dx = ((h + t)!/(h! t!)) (B(r + h, s + t)/B(r, s))
Putting these pieces together, we obtain

f_{X|A}(x) = (B(r + h, s + t))^{−1} x^{r+h−1}(1 − x)^{s+t−1},  0 < x < 1,

that is, X conditioned on A has the β(r + h, s + t) distribution.
We stated that the expectation of the β(r, s) distribution is r/(r + s), which would be our
best guess for the bias of the coin if we had observed r heads and s tails in r + s coin flips.
The result above states that if we observe h more heads and t more tails, then we now have
the β(r + h, s + t) distribution, which is consistent with the following interpretation: the first
parameter represents how many heads we have seen, and the second parameter represents
how many tails we have seen. The amazing part is that if we start with a belief according to
the beta distribution, then we never have to use another distribution: repeated applications
of Bayes' Rule will always yield another beta distribution.
Furthermore, our result almost provides an algorithm for estimating the bias of a coin: start
with a beta distribution, and keep flipping coins. If we observe heads, we increment the first
parameter of the beta distribution. If we observe tails, we increment the second parameter
of the beta distribution. Of course, this is an extremely cheap computation, which makes
the beta distribution a convenient family of distributions for the estimation problem.
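A sketch of that algorithm (the function and variable names are ours, not from the text):

```python
import random

random.seed(4)

def update(r, s, flips):
    """Bayes update for a β(r, s) prior: heads increments r, tails increments s."""
    for heads in flips:
        if heads:
            r += 1
        else:
            s += 1
    return r, s

true_bias = 0.7
flips = [random.random() < true_bias for _ in range(5000)]
r, s = update(1, 1, flips)      # β(1, 1) is the uniform prior

posterior_mean = r / (r + s)
print(posterior_mean)           # should be close to the true bias 0.7
```

Each flip costs a single increment, which is the "extremely cheap computation" described above.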
This leads to a minor problem, which is: how should we initialize the parameters of the
beta distribution? Regardless of the choice of the initial parameters, after enough flips, the
beta distribution will concentrate around the true bias of the coin. It is true that our choice
of initial parameters influences how long it takes for our distribution to converge. Instead
of worrying about this problem, however, we may view it as another advantage of the beta
distribution: by choosing the initial parameters, we can incorporate our prior belief into
the algorithm. Here are some examples to illustrate this point:
If we take an ordinary coin, we may be reasonably confident that the coin is fair, so we can
start with a β(500, 500) distribution. The fact that r = s reflects the fact that we think
the coin is fair, and our choice of the number 500 represents the strength of our belief: we
believe strongly in the fairness of the coin, so we initialize the beta distribution with 500
observations of heads and 500 observations of tails. On the other hand, if the coin was given
to you by a friend, perhaps you believe that your friend is not malicious (so you still believe
that the coin is fair), but you do not trust the coin as much as you would an ordinary coin.
In this case, you may decide to initialize the algorithm with a β(20, 20) distribution. Finally,
suppose that the coin was given to you by a gambler, and you suspect that the coin may be
loaded (so that it is more likely to come up heads). In this case, you may encode your belief
with a β(40, 20) distribution (your suspicions amount to the same information as having
observed 40 heads and 20 tails prior to flipping the coin).
Example 7.8. Now, let us approach the coin-flipping problem from a different perspective.
We flip a coin n times and we are looking for the distribution of the number of heads.
However, we do not know the bias of the coin, so we assume a uniform prior over the bias
of the coin. Let X ∼ Bin(n, Y) be the number of heads, where Y ∼ Unif[0, 1]. What is the
distribution of X?
P(X = x) = ∫ P(X = x | Y = y) f_Y(y) dy = ∫_0^1 C(n, x) y^x (1 − y)^{n−x} dy
         = (n!/(x!(n − x)!)) · (x!(n − x)!/(n + 1)!) = 1/(n + 1),

so we have found that X ∼ Unif{0, …, n}, which agrees with intuition. If your prior
belief about the bias is uniform, then the number of heads is also uniform.
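A numerical check that each value of X is indeed equally likely (a sketch; the quadrature is ours):

```python
import math

n = 6

def p_heads(x, steps=100000):
    """P(X = x) = ∫_0^1 C(n, x) y^x (1 - y)^{n-x} dy, by the midpoint rule."""
    h = 1.0 / steps
    c = math.comb(n, x)
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        total += c * y**x * (1 - y) ** (n - x) * h
    return total

probs = [p_heads(x) for x in range(n + 1)]
print(probs)  # each entry ≈ 1/(n + 1) = 1/7
```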
Chapter 8
Moment-Generating Functions
Here, we introduce transforms, which provide powerful conceptual and computational tools
for understanding the distributions of random variables.
8.1 Introduction
Definition 8.1. Let X be a random variable. The moment-generating function of
X is:

M_X(θ) = E[e^{θX}]    (8.1)
In other words, the MGF is the expectation of the random variable e^{θX}. We will return to
the computation of the MGF shortly. Our immediate goal is to explain the meaning behind
the name "moment-generating function".
Theorem 8.2.

E[X^k] = M_X^{(k)}(0)    (8.2)

Proof. If X is discrete, differentiating M_X(θ) = Σ_x e^{θx} P(X = x) a total of k times gives
M_X^{(k)}(θ) = Σ_x x^k e^{θx} P(X = x). Setting θ = 0, we obtain

M_X^{(k)}(0) = Σ_x x^k P(X = x) = E[X^k].
If X is continuous, the same proof goes through with sums replaced by integrals.
M_X^{(k)}(θ) = (d^k/dθ^k) ∫ e^{θx} f_X(x) dx = ∫ x^k e^{θx} f_X(x) dx

M_X^{(k)}(0) = ∫ x^k f_X(x) dx = E[X^k]
Technical Remark: In the proof above, we interchanged the order of differentiation and
summation/integration. This is not always valid; there are conditions that must be met in
order to justify this step. However, for the purpose of this exposition, we will assume that
all random variables under consideration are well-behaved so that the above proof holds.
f(x) = f(0) + (f′(0)/1!) x + (f″(0)/2!) x² + ⋯

The above theorem implies that the expansion of the MGF is (note that M_X(0) = 1)

M_X(θ) = 1 + (E[X]/1!) θ + (E[X²]/2!) θ² + ⋯
Therefore, we can see that the MGF is indeed a generating function, where the kth coefficient
is given by E[X k ]/k!. The values E[X k ] are often known as the moments of a distribution,
which explains the name.
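A quick numerical illustration (a sketch using the Bernoulli(p) MGF M(θ) = 1 − p + pe^θ, which is derived later in this chapter; the step size h is our choice): finite differences at θ = 0 recover the moments E[X^k] = p.

```python
import math

p = 0.3
M = lambda th: 1 - p + p * math.exp(th)   # Bernoulli(p) MGF

h = 1e-5
first = (M(h) - M(-h)) / (2 * h)             # central difference ≈ M'(0) = E[X]
second = (M(h) - 2 * M(0) + M(-h)) / h**2    # ≈ M''(0) = E[X^2]

print(first, second)  # both ≈ p = 0.3
```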
Proof.
Compare the above expression to the formula for the MGF of N, M_N(θ) = E[e^{θN}]. We have
the following result: M_{S_N}(θ) is found by evaluating M_N at the point ln M_{X_i}(θ).
Alternatively, take M_N(θ) and replace every occurrence of e^θ with M_{X_i}(θ).

M_{S_N}(θ) = M_N(θ′) |_{θ′ = ln M_{X_i}(θ)}    (8.4)
8.1.3 Cumulants
Definition 8.4. The cumulant-generating function of a random variable X is:

C_X(θ) = log M_X(θ)    (8.5)
It seems like all we did was take the logarithm of the MGF. Why is this useful?
1. C_X(0) = 0.
2. C_X′(0) = E[X].
3. C_X″(0) = var(X).

C_X′(θ) = (d/dθ) log M_X(θ) = M_X′(θ)/M_X(θ)
The cumulant-generating function is often useful because taking the logarithm before taking
the derivative often makes calculations simpler. Intuitively, logarithms convert products into
sums, and we would much rather differentiate sums rather than products.
For X ∼ Bernoulli(p),

M_X(θ) = (1 − p) e^{θ·0} + p e^{θ·1} = 1 − p + p e^θ.    (8.6)

For X ∼ Bin(n, p),

M_X(θ) = (1 − p + p e^θ)^n,    (8.7)
C_X(θ) = n log(1 − p + p e^θ).    (8.8)
Evaluating the sum using the formula for a geometric series, we have

M_X(θ) = p e^θ/(1 − (1 − p) e^θ),    (8.9)
C_X(θ) = θ + log p − log(1 − (1 − p) e^θ).    (8.10)
Let us compute the mean and variance of the Poisson distribution. Taking the logarithm,
we have C_X(θ) = λ(e^θ − 1). Differentiating, we obtain

E[X] = C_X′(0) = λe^θ |_{θ=0} = λ

and

var(X) = C_X″(0) = λe^θ |_{θ=0} = λ.
Theorem 8.6. Suppose that X and Y are independent random variables with distributions
X ∼ Pois(λ) and Y ∼ Pois(μ). Then X + Y ∼ Pois(λ + μ).

Proof. Observe:

M_{X+Y}(θ) = M_X(θ) M_Y(θ) = e^{λ(e^θ − 1)} e^{μ(e^θ − 1)} = e^{(λ+μ)(e^θ − 1)}
Actually, if you were especially observant, you will have noticed that we cheated in the
last line of the above proof. After all, we did not answer the question: does the moment-
generating function completely determine the distribution? Is there potentially another
probability distribution besides Pois( + ) which happens to have the same MGF? The
answer is no, which is to say that the MGF completely specifies the probability distribution.
We use (8.4). With X_i ∼ Expo(λ), we have M_{X_i}(θ) = λ/(λ − θ), so

M_{S_N}(θ) = p e^{θ′}/(1 − (1 − p) e^{θ′}) |_{e^{θ′} = M_{X_i}(θ)} = p λ(λ − θ)^{−1}/(1 − (1 − p) λ(λ − θ)^{−1})
           = pλ/((λ − θ) − (1 − p)λ) = pλ/(pλ − θ),

which is the MGF of the Expo(pλ) distribution.
ψ_X(z) := E[z^X]    (8.14)

Just like the MGF, the probability generating function (PGF) of the sum of i.i.d. random
variables is the product of the PGFs. Moreover, the PGF recovers the distribution:

P(X = k) = (1/k!) (d^k/dz^k) ψ_X(z) |_{z=0}.
Chapter 9

Convergence
Here, we take up the subject of convergence of random variables. In the context of probability
theory, convergence is a subtle concept because there are numerous modes of convergence:
convergence in probability, convergence in distribution, almost sure convergence, convergence
in Lp , and more. In addition to discussing the relationships between the various modes of
convergence, we will introduce two of the most famous theorems in probability theory: the
Strong Law of Large Numbers and the Central Limit Theorem.
for every ε > 0. When the X_n are i.i.d. with finite second moment, the WLLN (Theorem 3.13)
assures us that the average (1/n) Σ_{i=1}^n X_i converges to their common mean E[X_i]
in probability.
In words: the set of outcomes such that Xn converges to X has probability 1. The meaning
is rather subtle and powerful. In practice, almost sure convergence guarantees that you can
never be unlucky enough to see a sequence Xn that does not converge.
Proof. Since X_n → X a.s., we know that P(X_n ↛ X) = 0. Fix ε > 0 and define
A_n = {∃ m ≥ n : |X_m − X| ≥ ε}. Write A for ∩_{n=1}^∞ A_n, so A_1 ⊇ A_2 ⊇ ⋯ ⊇ A. By
monotonicity, P(A_n) → P(A). Since A is the event that |X_n − X| ≥ ε infinitely many
times, A ⊆ {X_n ↛ X}, so

P(|X_n − X| ≥ ε) ≤ P(A_n) → P(A) ≤ P(X_n ↛ X) = 0

as n → ∞. Hence, X_n → X in probability.
P
Corollary 9.3. Suppose that for all > 0, n=1 P (|Xn | ) < . Then Xn 0 a.s.
Proof. Apply Lemma 9.2 to An = {|Xn | }. The probability that |Xn | i.o. is 0,
which means that |Xn | finitely many times a.s. Since this holds true for all > 0,
we must have Xn 0 a.s.
A small remark: suppose that P(|X_n| ≥ 1) = 1/n. Then X_n → 0 in probability since
1/n → 0 as n → ∞, but the sum Σ_{n=1}^∞ 1/n diverges, which means that this is too weak
to apply Corollary 9.3. We will have to be a little more clever!
Theorem 9.4 (Strong Law of Large Numbers). Let X_i be i.i.d. with E[|X_i|] < ∞. Then
(1/n) Σ_{i=1}^n X_i → E[X_i] a.s.

Notice that the only assumption we made is that E[|X_i|] < ∞. In particular, we do not
require var(X_i) to be finite.
The SLLN, as stated above, is complicated to prove. However, we can prove the SLLN under
stronger assumptions more easily:

Theorem 9.5 (4th Moment SLLN). Let X_i be i.i.d. with E[X_i⁴] < ∞. Write S_n for
Σ_{i=1}^n X_i. Then (1/n) S_n → E[X_i] a.s.
Proof. By considering X_i − E[X_i] instead of X_i, we can assume that the X_i are zero-mean.
The first step is a calculation of the fourth moment:

E[S_n⁴] = Σ_{i=1}^n Σ_{j=1}^n Σ_{k=1}^n Σ_{l=1}^n E[X_i X_j X_k X_l]

Although this expression looks complicated, observe that only terms of the form E[X_i⁴]
and E[X_i² X_j²] survive; all other terms vanish because of independence and the zero-mean
assumption. There are n terms of the form E[X_i⁴] and 3n(n − 1) terms of the form
E[X_i² X_j²]. By the Cauchy-Schwarz Inequality (3.26),

E[X_i² X_j²] ≤ √(E[X_i⁴] E[X_j⁴]) = E[X_i⁴],

so a bound on the fourth moment is E[S_n⁴] ≤ 3n² E[X_i⁴]. Now, we can apply Markov's
Inequality (3.22) with the function f(x) = x⁴ to obtain

P(|S_n| ≥ nε) = P(S_n⁴ ≥ n⁴ε⁴) ≤ E[S_n⁴]/(n⁴ε⁴) ≤ 3E[X_i⁴]/(n²ε⁴).
Summing across n, we obtain

Σ_{n=1}^∞ P(|S_n/n| ≥ ε) ≤ (3E[X_i⁴]/ε⁴) Σ_{n=1}^∞ 1/n² < ∞,

so by Corollary 9.3, S_n/n → 0 a.s.
Example 9.6. Let M_n denote the maximum of n i.i.d. Expo(λ) random variables.
Intuitively, the maximum of an increasing number of exponential random variables will grow
without bound, so we should not expect M_n to converge in distribution. However, if we
define X_n = M_n − λ^{−1} log n, then we will see that X_n converges in distribution. To verify
this, we check

P(X_n ≤ x) = P(M_n ≤ x + λ^{−1} log n) = (1 − e^{−λx − log n})^n = (1 − e^{−λx}/n)^n → exp(−e^{−λx}).
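A simulation sketch of this limit for λ = 1 (n, the trial count, and the test point x are our choices): the limiting CDF exp(−e^{−x}) is the Gumbel distribution.

```python
import math
import random

random.seed(5)

n = 200
trials = 10000
x = 0.5

hits = 0
for _ in range(trials):
    m = max(random.expovariate(1.0) for _ in range(n))   # M_n for λ = 1
    if m - math.log(n) <= x:
        hits += 1

estimate = hits / trials
limit = math.exp(-math.exp(-x))   # Gumbel CDF at x, ≈ 0.545
print(estimate, limit)
```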
Perhaps you may recognize the characteristic function of a probability density as the Fourier
transform of the probability distribution. The characteristic function has many nice
properties, and the one that we will use in particular is that if the characteristic functions of
a sequence of random variables converge to a single characteristic function φ_X, then the
sequence of random variables converges in distribution to X. The result we are quoting is
commonly known as Lévy's Continuity Theorem.
We can make the theorem plausible by making a few non-rigorous appeals to intuition: the
characteristic function is another representation of the probability distribution, just like the
CDF or the survival function. From the characteristic function, it is possible to use the
inverse Fourier transform to recover the density function. The characteristic function, in
a sense, captures all of the information in a probability distribution, so convergence of the
characteristic functions is a reasonable stand-in for convergence of the distributions
themselves.
−x²/2 + itx = −(1/2)(x − it)² − t²/2
We have that

φ_X(t) = (1/√(2π)) ∫_ℝ e^{itx} e^{−x²/2} dx = (1/√(2π)) ∫_ℝ e^{−x²/2 + itx} dx
       = e^{−t²/2} (1/√(2π)) ∫_ℝ e^{−(x−it)²/2} dx.
You may expect the integral to come out to be √(2π), since the integral looks like the integral
over ℝ of the density of a normal distribution with mean it and variance 1. The integral is
indeed √(2π), although some facts from complex analysis are necessary to establish this
rigorously. (The rigorous argument would say: the integral ∫_ℝ e^{−(x−a)²/2} dx depends
analytically on the parameter a and equals the value √(2π) for all real a. By the uniqueness
principle, the integral will also equal √(2π) for complex values of the parameter a.)
Theorem 9.8. Let X_1, …, X_n be i.i.d. random variables with X being their sum. Then:

φ_X(t) = (φ_{X_1}(t))^n

Proof.
Theorem 9.9 (Central Limit Theorem). Let (X_n) be a sequence of i.i.d. random variables
with mean 0 and variance 1, and let Z_n = (X_1 + ⋯ + X_n)/√n. Then Z_n converges in
distribution to N(0, 1).
In the last line, we have carried out several steps at once. We used the fact that E[X_i] = 0
and that E[X_i²] = var(X_i) = 1. Also, since we will be taking the limit as n → ∞, we
dropped terms which were of order 1/n² or higher. What we are left with is

lim_{n→∞} φ_{Z_n}(t) = lim_{n→∞} (1 − t²/(2n))^n.
Recalling that

e^x = lim_{n→∞} (1 + x/n)^n,    (9.5)

we can see that

lim_{n→∞} φ_{Z_n}(t) = e^{−t²/2}.
Chapter 10

Stochastic Processes
If the interarrival times are exponentially distributed, then why is the process named after
the Poisson distribution? It is a famous theorem that the number of customers which arrive
in a certain interval of time has the Poisson distribution:
Theorem 10.1 (Poisson Process). In a Poisson process with rate λ, the number of
customers which arrive in the first t seconds has the Pois(λt) distribution.

Remark: The probability that we have no customers in the first t seconds is e^{−λt} (using
the CDF of the exponential distribution), and e^{−λt} = P(Pois(λt) = 0), so the theorem is
certainly true for this special case.
Proof. Let N (t) represent the number of customers who arrive in the first t seconds. Our
goal is to compute P (N (t) = k).
If there are exactly k arrivals in the first t + dt seconds, this could arise from two
possibilities:

There are exactly k − 1 arrivals in the interval (0, t) and one arrival in the interval
(t, t + dt). The former has probability P(N(t) = k − 1) and the latter has probability
λ dt (conditioned on the kth exponential surviving t seconds, by the Memoryless
Property the probability that it will die in the next dt seconds is λ dt).
There are k arrivals in the interval (0, t) and no arrivals in the interval (t, t + dt).
The former has probability P(N(t) = k) and the latter has probability 1 − λ dt.
Here, we are regarding the interval (t, t + dt) as infinitesimally small, so there is no chance
of multiple arrivals. Putting together the two pieces, we have

(P(N(t + dt) = k) − P(N(t) = k))/dt = λ(P(N(t) = k − 1) − P(N(t) = k)).
Fixing k and viewing f(t) = P(N(t) = k) as a function of t, we recognize the LHS as
the derivative w.r.t. t, so we have obtained a differential equation for P(N(t) = k).

(d/dt) P(N(t) = k) = λ(P(N(t) = k − 1) − P(N(t) = k))

To complete the proof, we need only show that the Pois(λt) distribution satisfies this
equation.
f_{T_n}(t) dt = P(T_n ∈ (t, t + dt)) = (e^{−λt}(λt)^{n−1}/(n − 1)!) · λ dt

f_{T_n}(t) = λ^n t^{n−1} e^{−λt}/(n − 1)!,  t > 0    (10.1)
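A simulation sketch of Theorem 10.1 (the parameters, seed, and helper code are ours): generate Expo(λ) interarrival times, count arrivals before time t, and compare one value of the empirical distribution with the Pois(λt) pmf.

```python
import math
import random

random.seed(6)

lam, t = 2.0, 3.0
trials = 50000

counts = {}
for _ in range(trials):
    elapsed, arrivals = 0.0, 0
    while True:
        elapsed += random.expovariate(lam)   # Expo(λ) interarrival time
        if elapsed > t:
            break
        arrivals += 1
    counts[arrivals] = counts.get(arrivals, 0) + 1

k = 5
estimate = counts[k] / trials
poisson = math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)
print(estimate, poisson)  # both ≈ 0.161
```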
Initially, this function does not seem to resemble the factorial function at all, so let us first
see that it satisfies some of the familiar properties of the factorial function. Using integration
by parts, we observe:
Γ(α) = [−x^{α−1} e^{−x}]_{x=0}^∞ + (α − 1) ∫_0^∞ x^{α−2} e^{−x} dx = (α − 1)Γ(α − 1)    (10.5)
Using (10.5) and (10.6), we can see that Γ(2) = 1, Γ(3) = 2, Γ(4) = 6, and in general, for n
a positive integer,

Γ(n) = (n − 1)!,  n ∈ ℕ    (10.7)

Therefore, the gamma function is indeed a continuous, shifted analog of the factorial function.
Let us now define the Gamma distribution of shape α and rate λ in the following way:
if X ~ Γ(α, λ), then the density of X is
\[ f_X(x) = \frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}, \qquad x > 0 \tag{10.8} \]
We must check that the density is properly normalized:
\[ \int_0^\infty \frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}\,dx = \frac{1}{\Gamma(\alpha)} \int_0^\infty (\lambda x)^{\alpha-1} e^{-\lambda x}\,d(\lambda x) = \frac{\Gamma(\alpha)}{\Gamma(\alpha)} = 1 \]
Now, we proceed to calculate the MGF (for θ < λ).
\[ M_X(\theta) = \int_0^\infty e^{\theta x} \frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}\,dx = \Bigl( \frac{\lambda}{\lambda - \theta} \Bigr)^{\alpha} \int_0^\infty \frac{(\lambda - \theta)^\alpha x^{\alpha-1} e^{-(\lambda - \theta) x}}{\Gamma(\alpha)}\,dx \]
However, we recognize the integral on the right as the integral of a gamma density with
shape α and rate λ − θ, so it must integrate to 1. Hence, we have found
\[ M_X(\theta) = \Bigl( \frac{\lambda}{\lambda - \theta} \Bigr)^{\alpha} \tag{10.9} \]
and
\[ C_X(\theta) = \alpha \bigl( \log(\lambda) - \log(\lambda - \theta) \bigr). \tag{10.10} \]
We can now compute the moments of the distribution:
\[ E[X] = C_X'(0) = \frac{\alpha}{\lambda - \theta} \Big|_{\theta = 0} = \frac{\alpha}{\lambda} \tag{10.11} \]
\[ \operatorname{var}(X) = C_X''(0) = \frac{\alpha}{(\lambda - \theta)^2} \Big|_{\theta = 0} = \frac{\alpha}{\lambda^2} \tag{10.12} \]
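As a numerical sanity check on (10.11) and (10.12) (the notes contain no code; the α and λ below are arbitrary choices), we can integrate the density with a midpoint Riemann sum:

```python
import math

def gamma_density(x, alpha, lam):
    """Density (10.8) of the Gamma(alpha, lam) distribution, x > 0."""
    return lam**alpha * x**(alpha - 1) * math.exp(-lam * x) / math.gamma(alpha)

alpha, lam = 3.5, 2.0
dx = 1e-3
total = mean = second = 0.0
# Midpoint Riemann sum over (0, 40); the tail beyond 40 is negligible here.
for i in range(40000):
    x = (i + 0.5) * dx
    p = gamma_density(x, alpha, lam) * dx
    total += p
    mean += x * p
    second += x * x * p
var = second - mean**2
```

The computed mean and variance should agree with α/λ = 1.75 and α/λ² = 0.875.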
N(s, t) represents the number of arrivals in the interval (s, t). Therefore, we know that
N(s, t) ~ Pois(λ(t − s)).
Example 10.3. What is the probability that we see no arrivals by time s and at most
k arrivals by time t?
By independence of the increments N(0, s) and N(s, t),
\[ P(N(0,s) = 0,\, N(0,t) \le k) = e^{-\lambda s} \sum_{j=0}^{k} \frac{e^{-\lambda (t-s)} (\lambda (t-s))^j}{j!}. \]
Example 10.4. What is the probability that it takes more than t seconds for k arrivals?
This is equivalent to the probability that in the first t seconds, we have at most k − 1
arrivals.
\[ P(N(0,t) \le k-1) = \sum_{j=0}^{k-1} \frac{e^{-\lambda t} (\lambda t)^j}{j!} \]
One can also integrate the Erlang distribution, because this is equivalent to P (Tk > t).
Example 10.5. What is the probability that the kth arrival occurs in the interval (s, t)?
One way to write down the answer is the integral of the Erlang distribution of order k:
\[ P(s < T_k < t) = \int_s^t \frac{\lambda^k x^{k-1} e^{-\lambda x}}{(k-1)!}\,dx \]
Example 10.6. What is the probability that the kth arrival happens within t seconds
of the first arrival?
The difference between the first and the kth arrival has the Erlang(k − 1, λ) distribution.
\[ P(T_{k-1} \le t) = \int_0^t \frac{\lambda^{k-1} x^{k-2} e^{-\lambda x}}{(k-2)!}\,dx \]
Alternatively, this is the complement of the event that the kth arrival takes longer than
t seconds after the first arrival, in which case the number of arrivals in (T_1, T_1 + t) must
be at most k − 2.
\[ P(T_{k-1} \le t) = 1 - P(T_{k-1} > t) = 1 - P(N(0,t) \le k-2) = 1 - \sum_{j=0}^{k-2} \frac{e^{-\lambda t} (\lambda t)^j}{j!} \]
Example 10.7. Suppose s < t and we have k arrivals by time t. What is the probability
that we have j arrivals by time s?
Each of the k arrivals has probability s/t of landing in the interval (0, s) and can be
regarded as an independent Bernoulli trial. Hence, we have
\[ P(N(0,s) = j \mid N(0,t) = k) = \binom{k}{j} \Bigl( \frac{s}{t} \Bigr)^j \Bigl( \frac{t-s}{t} \Bigr)^{k-j}. \]
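This conditional binomial distribution follows from independent increments, and a short sketch (Python; the values of λ, s, t, and k are arbitrary) can confirm it exactly by computing P(N(0,s) = j) P(N(s,t) = k − j) / P(N(0,t) = k) from the Poisson pmf:

```python
import math

def pois(mu, n):
    """P(Pois(mu) = n)."""
    return math.exp(-mu) * mu**n / math.factorial(n)

lam, s, t, k = 2.0, 1.0, 3.0, 5
max_err = 0.0
for j in range(k + 1):
    # Conditional probability computed directly via independent increments.
    direct = pois(lam * s, j) * pois(lam * (t - s), k - j) / pois(lam * t, k)
    # Binomial formula with success probability s/t.
    binom = math.comb(k, j) * (s / t)**j * ((t - s) / t)**(k - j)
    max_err = max(max_err, abs(direct - binom))
```

Note that λ cancels out of the direct computation, exactly as the binomial answer predicts.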
Example 10.8. Given that we have seen k arrivals by time t, what is the probability
that the first arrival occurs before 1 second has passed?
Note that P(T_1 ≤ 1 | N(0,t) = k) is the complement of the probability that in the first
second, we see no arrivals. Each of the k arrivals is uniformly distributed over the interval
(0, t), so the probability that an arrival lands in the interval (1, t) is (t − 1)/t = 1 − 1/t.
The probability for all k arrivals to land after 1 second is (1 − 1/t)^k, so
\[ P(T_1 \le 1 \mid N(0,t) = k) = 1 - P(N(0,1) = 0 \mid N(0,t) = k) = 1 - \Bigl( 1 - \frac{1}{t} \Bigr)^k. \]
Example 10.9. Given that the first arrival takes longer than s seconds to arrive, what
is the probability that the kth arrival takes longer than t seconds?
By the Memoryless Property, once we condition on the first arrival surviving at least s
seconds, we can treat s as the beginning of our process. Hence,
\[ P(T_k > t \mid T_1 > s) = P(T_k > t - s). \]
Example 10.10. Given that in the first s seconds, there is one arrival, what is the
probability that the kth arrival takes longer than t seconds?
The first arrival appears in (0, s), and we know that the second exponential survives
until time s. Conditioned on this fact, the Memoryless Property tells us that we need
only find the probability that the remaining k − 1 arrivals take more than t − s additional
seconds:
\[ P(T_k > t \mid N(0,s) = 1) = P(T_{k-1} > t - s). \]
Example 10.11. Suppose that there are m arrivals in the first s seconds and n arrivals
in the first t seconds, where n ≥ m and t > s. What is the probability that the kth
arrival (where k > n) takes longer than τ seconds, for τ > t?
The key idea is that knowing there are m arrivals in the first s seconds gives no further
information if you already know that there are n arrivals in t seconds (because t is the
later time). Hence, we need only find the probability that k − n arrivals take more than
τ − t additional seconds:
\[ P(T_k > \tau \mid N(0,s) = m,\, N(0,t) = n) = P(T_{k-n} > \tau - t). \]
Example 10.12. Suppose that the jth arrival occurs at time s. What is the probability
that the kth arrival (for k > j) takes longer than t seconds?
There are k − j additional arrivals which must take at least t − s more seconds:
\[ P(T_k > t \mid T_j = s) = P(T_{k-j} > t - s). \]
Example 10.13. Suppose that the jth arrival happens before time s. What is the
probability that the kth arrival happens after time t, where k > j and t > s?
Apply the definition of conditional probability. Note that P (Tj < s) can be written as
the probability that in the first s seconds, we have at least j arrivals.
\[ P(T_k > t \mid T_j < s) = \frac{P(T_k > t,\, N(0,s) \ge j)}{P(N(0,s) \ge j)} = \frac{\sum_{n=j}^{\infty} P(T_k > t,\, N(0,s) = n)}{\sum_{n=j}^{\infty} P(N(0,s) = n)} \]
\[ = \frac{\sum_{n=j}^{k-1} P(T_k > t \mid N(0,s) = n)\, P(N(0,s) = n)}{\sum_{n=j}^{\infty} P(N(0,s) = n)} = \frac{\sum_{n=j}^{k-1} P(T_{k-n} > t - s)\, P(N(0,s) = n)}{\sum_{n=j}^{\infty} P(N(0,s) = n)} \]
(The numerator sum stops at n = k − 1 because if N(0, s) ≥ k, then T_k ≤ s < t.)
Example 10.14. Suppose that the nth arrival appears at time t. What is the probability
that the jth arrival (for j ≤ n) appears before time s?
If the nth arrival lands exactly at time t, there are n − 1 arrivals in the interval (0, t).
Each of these arrivals has the distribution Unif(0, t). In particular, the jth arrival must
be the jth order statistic. Recall that if X_1, …, X_n are i.i.d. and X^{(k)} denotes the kth
smallest point, then
\[ f_{X^{(k)}}(x) = n \binom{n-1}{k-1} f_X(x)\, (F_X(x))^{k-1} (1 - F_X(x))^{n-k}. \]
Example 10.15. Suppose that there is one arrival in (0, s) and the time of the kth
arrival is t, for t > s + 1. What is the expected number of arrivals between s and s + 1?
Since the time of the kth arrival is t, we have k − 1 arrivals in (0, t), and one arrival in
(0, s). Therefore, we have k − 2 arrivals in (s, t), so each of the k − 2 arrivals has probability
1/(t − s) of landing in the interval (s, s + 1). Since this is a binomial distribution, we
have
\[ E[N(s, s+1) \mid N(0,s) = 1,\, T_k = t] = \frac{k-2}{t-s}. \]
Example 10.16. For j < k, what is the joint density of T_j and T_k?
Consider P(T_j ∈ (x, x + dx), T_k ∈ (y, y + dy)). We need j − 1 points in (0, x), one point in
(x, x + dx) (with probability λ dx), k − j − 1 points in (x, y), and one point in (y, y + dy)
(with probability λ dy). Hence,
\[ f_{T_j, T_k}(x, y) = \lambda^k e^{-\lambda y}\, \frac{x^{j-1}}{(j-1)!}\, \frac{(y-x)^{k-j-1}}{(k-j-1)!}, \qquad 0 < x < y < \infty. \]
Example 10.17. Suppose that T has the density f_T(t) = t e^{−t} for t > 0, and let
X ~ Unif[0, T]. What is the joint density of X and T − X?
Observe that T has the density of the second arrival time of a Poisson process with rate
1. Conditioned on T = t, X is uniform in the interval [0, t], so X represents the first
arrival time. Therefore, we should expect X and Y = T − X to be independent Expo(1)
random variables. We can verify this idea with direct calculation. For 0 < x < t,
\[ f_{X,T}(x, t) = f_{X \mid T}(x \mid t)\, f_T(t) = \frac{1}{t}\, t e^{-t} = e^{-t}. \]
Changing variables from (X, T) to (X, Y) = (X, T − X), which has Jacobian determinant 1, gives
\[ f_{X,Y}(x, y) = f_{X,T}(x, x + y) = e^{-(x+y)}, \qquad x, y > 0. \]
We have recovered the joint density of i.i.d. Expo(1) random variables. Here, we see that
a Poisson process interpretation can sometimes lead to intuitive answers.
Chapter 11
Martingales
11.1 Introduction
Definition 11.1. A sequence of random variables {X_n : n ∈ ℕ} is a martingale if
E[|X_n|] < ∞ and the sequence satisfies
\[ E[X_{n+1} \mid X_1, \ldots, X_n] = X_n. \tag{11.1} \]
The gambling interpretation is that X_n is your fortune at time n, and you are making a series
of bets. The martingale property says that, conditioned on your fortune being X_1, …, X_n
in the first n time steps, your expected fortune at time n + 1 is X_n; that is, you don't expect
to win or lose any money on the (n + 1)th bet. A concise way of describing a martingale is:
a martingale describes a fair game. Note that the martingale property implies
\[ E[X_{n+1}] = E\bigl[ E[X_{n+1} \mid X_1, \ldots, X_n] \bigr] = E[X_n], \]
so that E[X_n] = E[X_{n−1}] = ⋯ = E[X_0], i.e. your expected fortune is always the same.
Let us see a few examples of martingales before we prove results about them.
\[ E[S_{n+1} \mid X_1, \ldots, X_n] = E[S_n \mid X_1, \ldots, X_n] + \underbrace{E[X_{n+1}]}_{=0} = S_n \]
First, we split S_{n+1} into S_n + X_{n+1} and apply linearity of expectation. Since S_n is a
function of X_1, …, X_n, then E[S_n | X_1, …, X_n] = S_n. (Think about the MMSE property:
the best function of X_1, …, X_n to estimate S_n is simply S_n itself.) Since X_{n+1} is
independent of X_1, …, X_n, then E[X_{n+1} | X_1, …, X_n] = E[X_{n+1}] = 0.
\[ E[P_{n+1} \mid X_1, \ldots, X_n] = P_n \underbrace{E[X_{n+1}]}_{=1} = P_n \]
Here, the interpretation is that your fortune is multiplied by X_n at time n, and E[X_n] = 1
ensures that the bets are still fair.
\[ E[X_{n+1} - X_n \mid X_1, \ldots, X_n] = 0. \tag{11.2} \]
Saying that Yn is a martingale means that Yn is also a fair game. Under the new betting
system, no matter how inventive, you stand to do no better on average than your original
Xn ! Therefore, we can interpret the theorem as saying: you cannot cheat a fair game.
Think of T as the time at which you stop betting. The condition that 1(T =n) is a function of
X1 , . . . , Xn is interpreted as: your decision to stop betting after the nth bet can only depend
on information up until the nth bet, and not on the future.
Proof. Sufficient:
\[ 1_{(T = n)} = 1_{(T \le n)} - 1_{(T \le n-1)}, \]
and both 1_{(T ≤ n)} and 1_{(T ≤ n−1)} are functions of X_1, …, X_n by assumption, so 1_{(T = n)} is a
function of X_1, …, X_n.
Example 11.7. Fix a constant c and let T = min{n ∈ ℕ : X_n ≥ c} be the first time
that X_n is at least c. Note that {T > n} is the event that none of X_1, …, X_n reach c,
so we see that 1_{(T ≤ n)} = 1 − 1_{(T > n)} is a function of X_1, …, X_n, and T is a stopping time.
Can we start with a martingale and profit by choosing a judicious stopping time? Quit
while you're ahead, in a nutshell. The next result says no:
Theorem 11.8. Suppose that Xn is a martingale and let T be a stopping time. Define
Yn to be the stopped process, that is Yn = Xmin(T,n) (so Yn = Xn until T , and then
Yn = XT afterwards). Then Yn is a martingale.
Proof. Define H_n = 1_{(n ≤ T)}. This reflects the following betting strategy: bet one dollar at
the start, continue until time T, and then bet no more afterwards. We claim that H_n
is predictable: note that the event {n ≤ T} is the complement of {n > T} = {T ≤ n − 1},
so H_n = 1_{(n ≤ T)} = 1 − 1_{(T ≤ n−1)}, which is a function of X_1, …, X_{n−1}. Next, we observe
that
\[ Y_n = Y_0 + \sum_{i=1}^n (Y_i - Y_{i-1}) = X_0 + \sum_{i=1}^n (X_i - X_{i-1}) 1_{(i \le T)} = X_0 + \sum_{i=1}^n H_i (X_i - X_{i-1}), \]
so Y_n is the martingale transform of X_n by the predictable sequence H_n, and hence a
martingale.
Example 11.9 (Polyas Urn). Suppose we have an urn with one black ball and one
white ball. We repeatedly pick a ball from the urn, and put back in its place two balls
of the same color. Let Xn be the fraction of black balls in the urn at the nth draw.
We verify that X_n is a martingale. Suppose that there are x black balls at time n, so
X_n = x/(n + 1). With probability x/(n + 1), we choose a black ball, and the new fraction
of black balls is (x + 1)/(n + 2). With probability 1 − x/(n + 1), we choose a white ball,
and the new fraction of black balls is x/(n + 2). Hence,
\[ E[X_{n+1} \mid X_n] = \frac{x}{n+1} \cdot \frac{x+1}{n+2} + \Bigl( 1 - \frac{x}{n+1} \Bigr) \frac{x}{n+2} = \frac{x}{n+1} = X_n. \]
Let B_n be the number of black balls at time n. We compute the distribution of B_n as
follows: first, we compute the probability that the next k draws are all black, and then
the remaining n − k draws are all white. The probability is
\[ \frac{1}{2} \cdot \frac{2}{3} \cdots \frac{k}{k+1} \cdot \frac{1}{k+2} \cdot \frac{2}{k+3} \cdots \frac{n-k}{n+1} = \frac{k!\,(n-k)!}{(n+1)!}. \]
If we want P(B_n = k + 1), then we must draw k black balls as above, but there are \(\binom{n}{k}\)
different sequences of k black and n − k white draws achieving the same outcome, each with
the same probability. Hence,
\[ P(B_n = k+1) = \frac{n!}{k!\,(n-k)!} \cdot \frac{k!\,(n-k)!}{(n+1)!} = \frac{1}{n+1}. \]
We see that B_n − 1 ~ Unif{0, …, n}: the number of black balls is uniform over its n + 1
possible values, so the fraction of black balls takes each of its n + 1 possible values with
probability (n + 1)^{−1}. As n → ∞, the distribution of X_n converges to X ~ Unif[0, 1].
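This uniformity can be confirmed exactly by propagating the urn's distribution draw by draw with rational arithmetic; below is a small illustrative sketch (the notes contain no code, and the function name is my own choice):

```python
from fractions import Fraction

def polya_distribution(n):
    """Exact distribution of the number of black balls after n draws,
    starting from one black and one white ball."""
    # dist maps (number of black balls) -> probability; the urn starts with 2 balls.
    dist = {1: Fraction(1)}
    total = 2
    for _ in range(n):
        new = {}
        for b, p in dist.items():
            # Draw black with probability b/total: black count increases by one.
            new[b + 1] = new.get(b + 1, Fraction(0)) + p * Fraction(b, total)
            # Otherwise a white ball is drawn and the black count stays the same.
            new[b] = new.get(b, Fraction(0)) + p * Fraction(total - b, total)
        dist = new
        total += 1
    return dist

n = 10
dist = polya_distribution(n)
# After n draws the black count should be uniform on {1, ..., n+1}.
uniform = all(p == Fraction(1, n + 1) for p in dist.values())
```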
Chapter 12
Random Vectors & Linear Algebra
12.1 Introduction
Let X and Y be vector-valued random variables in ℝ^m and ℝ^n respectively (in particular,
it is not necessary to assume that m = n). We denote by E[X] the vector whose ith entry
is the expectation of the ith entry of the random variable X. We define the covariance
matrix of X and Y, Σ_{X,Y}, to be
\[ \Sigma_{X,Y} := E\bigl[ (X - E[X])(Y - E[Y])^T \bigr], \tag{12.1} \]
and write Σ_X for Σ_{X,X}. The (i, j) entry of Σ_{X,Y} is E[X_i Y_j] − E[X_i] E[Y_j] = cov(X_i, Y_j),
where X_i is the random variable for the ith coordinate of X and Y_j is the random variable
for the jth coordinate of Y. Hence, the entries of Σ_{X,Y} are the pairwise covariances between
the entries of X and the entries of Y.
Theorem 12.1 (Linearity of Expectation). Suppose f is a linear function of X, that is,
\[ f(X) = \sum_{i=1}^m \sum_{j=1}^n \alpha_{i,j} X_{i,j}. \]
Then f(E[X]) = E[f(X)].
Proof.
\[ E[f(X)] = \sum_{i=1}^m \sum_{j=1}^n \alpha_{i,j} E[X_{i,j}] = f(E[X]) \]
The result also works for matrix-valued functions F(X), by applying Theorem 12.1 to each
entry of F. For some examples of linear functions, consider:
X + Y (the (i, j) entry is Xi,j + Yi,j ).
AX (left-multiplication by a constant matrix; the (i, j) entry is \(\sum_{k=1}^m A_{i,k} X_{k,j}\)).
XA (right-multiplication by a constant matrix; the (i, j) entry is \(\sum_{k=1}^n X_{i,k} A_{k,j}\)).
Next, we prove the scaling property for variance: Σ_{AX} = A Σ_X A^T. Recall that (AX)^T = X^T A^T.
Proof. We follow the definition of Σ_{AX}, using the fact that the transpose is a linear
operator:
\[ \Sigma_{AX} = E\bigl[ (AX - E[AX])(AX - E[AX])^T \bigr] = A\, E\bigl[ (X - E[X])(X - E[X])^T \bigr] A^T = A \Sigma_X A^T. \]
A square matrix Q is orthogonal if Q^T Q = I. In other words, Q^{−1} = Q^T.
An important fact is that symmetric matrices can be diagonalized with orthogonal eigenvec-
tors. This is a key result in linear algebra, known as the spectral theorem for real, symmetric
matrices. We state and prove the result below:
Let λ be a (possibly complex) eigenvalue of A with eigenvector u:
\[ Au = \lambda u. \tag{12.2} \]
Taking complex conjugates (A is real), we obtain
\[ A\bar{u} = \bar{\lambda} \bar{u}. \tag{12.3} \]
Then
\[ \bar{\lambda}\, u^T \bar{u} = u^T A \bar{u} = (Au)^T \bar{u} = \lambda\, u^T \bar{u}, \]
so 0 = (λ − λ̄) u^T ū. Since u^T ū > 0, we conclude that λ = λ̄, i.e. every eigenvalue of A is real.
Let λ_1 and u_1 be an eigenvalue and the corresponding normalized eigenvector. Let V_{n−1}
denote the subspace of vectors which are orthogonal to u_1, and take any u ∈ V_{n−1}. Then
\[ (Au)^T u_1 = u^T A u_1 = \lambda_1 u^T u_1 = 0, \]
so A maps V_{n−1} into itself, and we may repeat the argument on V_{n−1} to extract the
remaining eigenvectors.
Fix a vector x. Using the alternate form Σ_X = E[(X − E[X])(X − E[X])^T], we have
\[ x^T \Sigma_X x = E\bigl[ \bigl( x^T (X - E[X]) \bigr)^2 \bigr] \ge 0, \]
so Σ_X is positive semi-definite.
Proof. One can see that A is symmetric because A_{2,2} is symmetric and A_{2,1} = A_{1,2}^T. To
show that A is positive semi-definite, let x̃ ∈ ℝ^{n−1} be a vector. Define
\[ x := \begin{pmatrix} -A_{1,2} \tilde{x} / a_{1,1} \\ \tilde{x} \end{pmatrix}. \]
Definition 12.11. Let X and Y be random vectors such that Σ_X is non-singular, i.e.
Σ_X is invertible. Then
\[ L(Y \mid X) := E[Y] + \Sigma_{Y,X} \Sigma_X^{-1} (X - E[X]) \tag{12.6} \]
where 1 is the vector of ones. The first equation is evident because E[Y − L(Y | X)] = 0,
and the second follows from
\[ \Sigma_{Y - L(Y \mid X),\, X}\, A^T = \bigl( \Sigma_{Y,X} - \Sigma_{Y,X} \Sigma_X^{-1} \Sigma_X \bigr) A^T = 0. \]
Since the trace is a linear operator, we have tr(E[A]) = E[tr(A)]. Applying this to the
matrix (Y − L(Y | X))^T (L(Y | X) − AX − b) (which we abbreviate as Z^T W) and the
cyclic property of the trace operator, tr(AB) = tr(BA), we find that the cross term
vanishes. The first term corresponds to irreducible error, and the last term is clearly
minimized when AX + b = L(Y | X).
\[ y_i = \beta_0 + x_i^T \beta \tag{12.11} \]
Assume for simplicity that we have normalized all of the features so that they are zero-mean,
that is, \(\sum_{i=1}^N x_i = 0\). Then, observe that cov(Y, X_j) (where X_j is the jth feature and
Y is the response) is E[Y X_j] = N^{−1} \(\sum_{i=1}^N y_i x_{i,j}\) = N^{−1} (y^T X)_j, so N Σ_{Y,X} = y^T X. Similarly,
Σ_X = N^{−1} X^T X, so Σ_X^{−1} = N (X^T X)^{−1}. However, (12.6) and (12.11) have opposite conventions regarding the
weight vector, so we must take the transpose again. This yields the closed-form solution
\[ \hat{\beta} = (X^T X)^{-1} X^T y. \tag{12.12} \]
\[ \frac{d}{dx} f(x) = \begin{pmatrix} \frac{d}{dx_1} f(x) \\ \vdots \\ \frac{d}{dx_n} f(x) \end{pmatrix} \tag{12.13} \]
The same principle applies for differentiation w.r.t. a matrix X: the result is a matrix of the
same size as X, whose (i, j) entry is the derivative of f (X) w.r.t. the entry Xi,j . How about
the derivative of a vector y w.r.t. a vector x? The result is a matrix, whose (i, j) entry is
the derivative of yi w.r.t. xj .
In principle, computations can be carried out using the above definitions, but it is helpful
to have a toolbox of common matrix calculus identities in order to speed up computations.
since all but the ith term in the summation vanish. Since the derivative of x^T y w.r.t. x is
the vector whose ith entry is y_i, we see that
\[ \frac{d}{dx} x^T y = y. \tag{12.14} \]
(Compare the above equation with the derivative of f(x) = cx.) Using the symmetry of the
inner product, we also have
\[ \frac{d}{dy} x^T y = \frac{d}{dy} y^T x = x. \tag{12.15} \]
Now, suppose that both x and y are functions of z. If we take the derivative w.r.t. z, the
ith entry is
\[ \frac{d}{dz_i} x^T y = \frac{d}{dz_i} \sum_{k=1}^n x_k y_k = \sum_{k=1}^n \Bigl( \frac{dx_k}{dz_i} y_k + x_k \frac{dy_k}{dz_i} \Bigr) = \Bigl( \Bigl( \frac{dx}{dz} \Bigr)^T y + \Bigl( \frac{dy}{dz} \Bigr)^T x \Bigr)_i, \]
where (dx/dz)_{k,i} = dx_k/dz_i.
Suppose that we have collected data {(x_1, y_1), …, (x_N, y_N)}. Our goal is to produce a
prediction f̂(x_0) at a target point x_0, with the intention that f̂(x_0) should be close to the
true value y_0. Local regression is based on the idea that when we make our prediction f̂(x_0),
we should pay more attention to the training points (x_i, y_i) which are near x_0. In other
words, the closer that x_i lies to x_0, the more weight we should place on the training point
(x_i, y_i).
One example of a kernel is the k-nearest neighbors kernel, which places a weight of 1 on
the k closest points to x_0, and a weight of 0 on the other points. Other times, we would like a
smoother kernel (a smoother kernel yields a smoother fit). Many kernels are of the form
\[ K_\lambda(x_0, x) = D\Bigl( \frac{|x - x_0|}{\lambda} \Bigr), \tag{12.20} \]
where D(w) is a function that determines the shape of the kernel. For example,
\[ D(w) = \begin{cases} (3/4)(1 - w^2), & |w| \le 1, \\ 0, & |w| > 1, \end{cases} \tag{12.21} \]
is the Epanechnikov quadratic kernel, and
\[ D(w) = \begin{cases} (1 - |w|^3)^3, & |w| \le 1, \\ 0, & |w| > 1, \end{cases} \tag{12.22} \]
is the tri-cube kernel. The tri-cube kernel is differentiable at w = 1, since its derivative
vanishes there. Another option is the Gaussian kernel D(w) = φ(w), where φ is the
standard Gaussian density, but for computational purposes, we typically favor kernels with
bounded support, to avoid keeping negligible contributions from distant points.
The goal is to find parameters β_0 and β_1 that solve the minimization problem
\[ (\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^N K_\lambda(x_0, x_i)\, (y_i - \beta_0 - \beta_1 x_i)^2, \tag{12.23} \]
which is a least-squares criterion, weighted by the kernel function. Note that the parameters
β̂_0 and β̂_1 depend on the choice of target point x_0, which means they are not suitable for
predictions at other target points.
Let us rewrite (12.23) in matrix notation. Let W denote a diagonal matrix whose (i, i) entry
is K_λ(x_0, x_i). We denote by b_i the column vector with entries 1 and x_i, and denote
by β the column vector with entries β_0 and β_1. Then, we can write β_0 + β_1 x_i as b_i^T β. If we
denote by B the N × 2 matrix whose ith row is b_i^T, then (12.23) is
\[ \hat{\beta} = \arg\min_{\beta}\, (y - B\beta)^T W (y - B\beta), \tag{12.24} \]
where y is the N × 1 vector whose ith entry is y_i. We write y − Bβ = z and compute the
derivative w.r.t. β using (12.19) (note that W is symmetric):
\[ \frac{d}{d\beta_i} z^T W z = \sum_{k=1}^N \frac{\partial z^T W z}{\partial z_k} \frac{\partial z_k}{\partial \beta_i} = \sum_{k=1}^N 2 z_k W_{k,k} (-B_{k,i}) \]
Setting the derivative to zero gives B^T W (y − Bβ) = 0, so β̂ = (B^T W B)^{−1} B^T W y.
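To make the weighted least-squares solution concrete, here is one possible sketch in Python (the notes contain no code; the function name, the Epanechnikov kernel choice, the bandwidth, and the test data are my own illustrative choices). It solves the 2 × 2 weighted normal equations B^T W B β = B^T W y directly:

```python
def epanechnikov(w):
    """Epanechnikov quadratic kernel shape D(w) from (12.21)."""
    return 0.75 * (1 - w * w) if abs(w) <= 1 else 0.0

def local_linear_fit(xs, ys, x0, bandwidth):
    """Local linear regression prediction at x0: solve the 2x2 weighted
    normal equations B^T W B beta = B^T W y by explicit inversion."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        k = epanechnikov((x - x0) / bandwidth)
        s0 += k            # sum of weights        (B^T W B)_{0,0}
        s1 += k * x        # weighted sum of x     (B^T W B)_{0,1}
        s2 += k * x * x    # weighted sum of x^2   (B^T W B)_{1,1}
        t0 += k * y        # (B^T W y)_0
        t1 += k * x * y    # (B^T W y)_1
    det = s0 * s2 - s1 * s1
    b0 = (s2 * t0 - s1 * t1) / det
    b1 = (s0 * t1 - s1 * t0) / det
    return b0 + b1 * x0    # prediction at the target point

# On exactly linear data the weighted fit recovers the underlying line.
xs = [i / 10 for i in range(21)]        # 0.0, 0.1, ..., 2.0
ys = [2 * x + 1 for x in xs]
pred = local_linear_fit(xs, ys, 1.0, 0.5)
```

Since the data lie exactly on y = 2x + 1, the weighted residuals are zero and the prediction at x_0 = 1 recovers 3.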
Define the Jacobian matrix J_T(x) to be the n × n matrix whose (i, j) entry is ∂T_i(x)/∂x_j. Intuitively,
the Jacobian at a point x describes how the change of variable T behaves in an
infinitesimally small neighborhood of x, and the determinant of the Jacobian describes the
local change in volume. (The sign of the determinant does not matter in this context; a
negative sign indicates that the orientation of the axes has flipped.)
Although the full details of the theorem require a more formal study of integration, we
present the theorem and sketch the proof below.
Proof Sketch. Due to key results in integration theory, it suffices to check (12.27) in the
case when A is a rectangle. The idea is that any invertible linear mapping T can be
represented as the composition of three different types of elementary operations:
We must verify that in each of the three cases, the determinant and the change in volume
of the rectangle agree.
1. Permutation of the coordinates does not change the volume of the rectangle. Since
the determinant is multiplied by ±1 in this case, |det(T)| is not affected.
2. The second case corresponds to translating one of the dimensions of the rectangle,
so the volume does not change. The determinant is also unchanged.
3. The third case corresponds to multiplying one of the side lengths of the rectangle
by c, where multiplication by a negative number produces a reflection. Therefore,
the volume is multiplied by |c| and the determinant is multiplied by c.
Proof Sketch. Again, due to key results in integration theory, it suffices to check (12.28)
in the case when f = 1_{T(A)}, the indicator function for the set T(A), where A is a rectangle
lying inside U. Then, (12.28) reduces to
\[ \int_A |\det(J_T(x))|\,dx = \int_{T(A)} dx. \]
The idea is to break up A into finitely many rectangles which are small enough so that
T (x) can be locally approximated (around the point x0 ) by T (x0 ) + JT (x0 )(x x0 ).
However, this is an invertible linear transformation, so the change in volume of a small
rectangle is given by Lemma 12.14. Specifically, if we partition A into the rectangles
A_1, …, A_k with x_i ∈ A_i, then
\[ \operatorname{vol}(T(A)) = \sum_{i=1}^k \operatorname{vol}(T(A_i)) \approx \sum_{i=1}^k |\det(J_T(x_i))| \operatorname{vol}(A_i) \approx \int_A |\det(J_T(x))|\,dx. \]
Technical Remark: The proof sketches of Lemma 12.14 and Theorem 12.15 both begin
with the words "it suffices to check…", but there are different reasons behind each statement.
In Lemma 12.14, the rectangles form a generating class, that is, other sets can be formed
out of intersections and unions of rectangles, and a result about uniqueness of measures
concludes the proof. In Theorem 12.15, one can extend the proof from indicator functions to
simple functions (linear combinations of indicator functions) by linearity, and then to non-negative
functions by taking an increasing sequence of simple functions f_n ↑ f and using the
Monotone Convergence Theorem. (The latter is a very common style of proof: prove the
statement for indicators, and extend the argument to general functions.)
Since each coordinate is a standard Gaussian, the expected value of each coordinate is 0:
\[ E[Z] = 0 \tag{12.30} \]
In place of the variance, we have the variance-covariance matrix Σ = Σ_{X,X}. (See (12.1).)
By the assumption of independence, for i ≠ j, we have cov(X_i, X_j) = 0, and for i = j, we
have cov(X_i, X_i) = var(X_i) = 1. Hence, the variance-covariance matrix is
\[ \Sigma = I, \tag{12.31} \]
From Theorem 12.15, we can compute the density by taking the standard density (12.29),
replacing z by A^{−1}(x − μ), and dividing by |det(A)|. In differential notation,
\[ f_X(x)\,dx = P(X \in dx) = P(Z \in dz) = \varphi(z)\,dz, \]
where
\[ \Bigl| \frac{dx}{dz} \Bigr| = |\det(A)|, \]
or equivalently,
\[ f_X(x) = \frac{1}{(2\pi)^{n/2}\, |\det(A)|} \exp\Bigl( -\frac{1}{2}\, (A^{-1}(x - \mu))^T (A^{-1}(x - \mu)) \Bigr). \]
Now, since det(Σ) = det(AA^T) = det(A) det(A^T) = det(A)², we can replace |det(A)| by
det(Σ)^{1/2}. Also, we can simplify the argument of the exponential as
\[ (x - \mu)^T (A^{-1})^T A^{-1} (x - \mu) = (x - \mu)^T (A A^T)^{-1} (x - \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu), \]
which gives the density of the general multivariate Gaussian:
\[ f_X(x) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}} \exp\Bigl( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Bigr) \tag{12.32} \]
By the Cholesky Decomposition (Theorem 12.10), we can also start with any positive definite
matrix Σ and decompose it into AA^T. (Note: here, we want Σ to be positive definite, that
is, the inequality x^T Σ x > 0 is strict for any x ≠ 0. This guarantees that A has strictly
positive diagonal entries, i.e. A is non-singular.)
In the general case, when X has the multivariate Gaussian distribution and Σ_X is diagonal
(so the pairwise covariances are all 0), the components of X are independent.
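Indeed, for a diagonal Σ, formula (12.32) factors into a product of one-dimensional Gaussian densities; the sketch below (Python; the point x, mean μ, and variances are arbitrary choices, and the notes contain no code) confirms this numerically:

```python
import math

def mvn_density_diag(x, mu, variances):
    """Multivariate Gaussian density (12.32) when Sigma is diagonal."""
    n = len(x)
    det = 1.0
    quad = 0.0
    for xi, mi, vi in zip(x, mu, variances):
        det *= vi                          # det(Sigma) = product of variances
        quad += (xi - mi) ** 2 / vi        # (x - mu)^T Sigma^{-1} (x - mu)
    return math.exp(-quad / 2) / ((2 * math.pi) ** (n / 2) * math.sqrt(det))

def normal_density(x, mu, var):
    """One-dimensional N(mu, var) density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x = [0.3, -1.2, 2.0]
mu = [0.0, 1.0, -0.5]
variances = [1.0, 2.0, 0.5]
joint = mvn_density_diag(x, mu, variances)
product_of_marginals = 1.0
for xi, mi, vi in zip(x, mu, variances):
    product_of_marginals *= normal_density(xi, mi, vi)
```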
Example 12.16. Pick N points (assume that N is even) uniformly at random inside
a p-dimensional unit ball. What is the median distance from the origin to the closest
point, d(p, N)?
Let X_i denote the distance from the origin of point i and X = min_i X_i. For a single
point, the CDF of X_i should be r^p for 0 < r < 1, by the following argument: the volume
of a p-dimensional ball of radius r is c r^p (for some constant c) and the volume of the unit
ball is c · 1^p = c (with the same constant c), so the probability of lying within radius r
is c r^p / c = r^p. Then, for 0 < x < 1, P(X > x) = P(X_i > x)^N = (1 − x^p)^N. We seek the
value of x such that P(X > x) = 1/2, so d(p, N) = (1 − 2^{−1/N})^{1/p}.
The result does not scale favorably with large p; there are very few points located near the
origin and most points lie near the boundary of the unit ball.
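The formula is easy to evaluate; the small sketch below (Python; the function name and parameter values are my own choices) shows how quickly the median nearest-point distance approaches the boundary as p grows:

```python
def median_nearest_distance(p, N):
    """d(p, N) = (1 - 2^(-1/N))^(1/p): the median distance from the origin
    to the closest of N uniform points in the p-dimensional unit ball."""
    return (1 - 2 ** (-1 / N)) ** (1 / p)

# Even with 500 points, in 10 dimensions the closest point is typically
# more than halfway to the boundary of the unit ball.
d2 = median_nearest_distance(2, 500)
d10 = median_nearest_distance(10, 500)
```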
Collectively, the adverse conditions of high-dimensional statistics are termed the curse of
dimensionality.
Chapter 13
Statistical Estimators & Hypothesis Testing
Here, we explore the most direct application of probability theory: the study of how to draw
inferences about the population given sampled data.
The answer can be found using the change of variables formula. There is a minor difficulty
in that f(x) = x² is not a one-to-one function, but as long as we remember to account for
this, the derivation is relatively straightforward.
\[ F_{Z^2}(z) = P(Z^2 \le z) = P(-\sqrt{z} \le Z \le \sqrt{z}) = \Phi(\sqrt{z}) - \Phi(-\sqrt{z}) = 1 - 2\Phi(-\sqrt{z}) \]
Although this distribution may seem contrived, observe that f_{Z²}(z) = (2π)^{−1/2} z^{−1/2} e^{−z/2}
is actually the gamma distribution with shape 1/2 and rate 1/2, i.e. Z² ~ Γ(1/2, 1/2).
Incidentally, this tells us that (1/2)^{1/2} Γ(1/2)^{−1} = (2π)^{−1/2}, so Γ(1/2) = √π. We can find
the mean and variance using the formulae we derived for the gamma distribution:
\[ E[Z^2] = 1 \tag{13.2} \]
\[ \operatorname{var}(Z^2) = 2 \tag{13.3} \]
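Both facts are easy to confirm numerically with the standard library's gamma function (an illustrative Python check; the notes themselves contain no code):

```python
import math

# Gamma(1/2) = sqrt(pi), as deduced from the chi-square density above.
gamma_half = math.gamma(0.5)

# Mean and variance of Z^2 ~ Gamma(1/2, 1/2) via (10.11) and (10.12):
# alpha/lam and alpha/lam^2.
alpha, lam = 0.5, 0.5
mean_z2 = alpha / lam       # should be 1
var_z2 = alpha / lam**2     # should be 2
```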
\[ E[X] = n, \tag{13.5} \]
\[ \operatorname{var}(X) = 2n, \tag{13.6} \]
\[ M_X(\theta) = (1 - 2\theta)^{-n/2}. \tag{13.7} \]
Also as a consequence of the definition, if X ~ χ²_m and Y ~ χ²_n are independent, then X + Y ~ χ²_{m+n}. Since the
chi-square distribution with one degree of freedom is the Γ(1/2, 1/2) distribution and the
sum of independent gamma random variables with a common rate is also gamma, we see that χ²_n = Γ(n/2, 1/2).
Using the formula for the density of the gamma distribution,
\[ f_X(x) = \frac{x^{n/2 - 1} e^{-x/2}}{2^{n/2}\, \Gamma(n/2)}, \qquad x > 0. \tag{13.8} \]
13.1.3 t Distribution
Let Z ~ N(0, 1) and X ~ χ²_n be independent. The distribution of T = Z/√(X/n) is known
as the t distribution with n degrees of freedom, denoted T ~ t_n.
Differentiating, we obtain
\[ f_{\sqrt{X}}(x) = 2x f_X(x^2) = 2x \cdot \frac{x^{n-2} e^{-x^2/2}}{2^{n/2} \Gamma(n/2)} = \frac{x^{n-1} e^{-x^2/2}}{2^{n/2 - 1} \Gamma(n/2)}, \qquad x > 0. \]
Next, we use (7.12) to obtain the density of Z/√X. Here, the equation simplifies further
since √X ≥ 0.
\[ f_{Z/\sqrt{X}}(z) = \int_0^\infty x f_Z(xz) f_{\sqrt{X}}(x)\,dx = \int_0^\infty x \frac{1}{\sqrt{2\pi}} e^{-x^2 z^2/2} \cdot \frac{x^{n-1} e^{-x^2/2}}{2^{n/2 - 1} \Gamma(n/2)}\,dx \]
\[ = \frac{1}{\sqrt{\pi}\, 2^{(n-1)/2}\, \Gamma(n/2)} \int_0^\infty x^n e^{-x^2 (z^2 + 1)/2}\,dx \]
Write (z² + 1)/2 = λ and use the change of variables ρ = λx² (so that x^n = (ρ/λ)^{n/2} and
dx = (2√(λρ))^{−1} dρ). The integral we obtain has a familiar form:
\[ f_{Z/\sqrt{X}}(z) = \frac{1}{\sqrt{\pi}\, 2^{(n+1)/2}\, \Gamma(n/2)\, \lambda^{(n+1)/2}} \int_0^\infty \rho^{(n-1)/2} e^{-\rho}\,d\rho \]
This is precisely the integral of an un-normalized gamma density (10.8) with shape
(n + 1)/2 and rate 1, so by the normalization condition, the integral equals Γ((n + 1)/2).
Substituting λ = (z² + 1)/2, so that λ^{(n+1)/2} = 2^{−(n+1)/2} (1 + z²)^{(n+1)/2}, we see that
\[ f_{Z/\sqrt{X}}(z) = \frac{\Gamma((n+1)/2)}{\sqrt{\pi}\, \Gamma(n/2)} (1 + z^2)^{-(n+1)/2}. \]
Finally, we make the change of variables T = √n (Z/√X), which amounts to replacing z by
t/√n and dividing by √n in the expression above. Our final result is
\[ f_T(t) = \frac{\Gamma((n+1)/2)}{\Gamma(n/2) \sqrt{n\pi}} \Bigl( 1 + \frac{t^2}{n} \Bigr)^{-(n+1)/2}. \tag{13.9} \]
We can observe that the density is symmetric in t, which implies (for n > 1, when the mean exists)
\[ E[T] = 0. \tag{13.10} \]
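We can also confirm numerically that (13.9) is a properly normalized, symmetric density (an illustrative Python sketch; n = 5 and the integration range are arbitrary choices):

```python
import math

def t_density(t, n):
    """Density (13.9) of the t distribution with n degrees of freedom."""
    c = math.gamma((n + 1) / 2) / (math.gamma(n / 2) * math.sqrt(n * math.pi))
    return c * (1 + t * t / n) ** (-(n + 1) / 2)

n, dt, half_width = 5, 1e-3, 60.0
steps = int(2 * half_width / dt)
total = 0.0
mean = 0.0
for i in range(steps):
    t = -half_width + (i + 0.5) * dt    # midpoint rule
    p = t_density(t, n) * dt
    total += p
    mean += t * p
```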
Although unbiased estimators are desirable, sometimes we are content with estimators that
are unbiased in the long-run:
The reason why consistency is stronger than the property of being asymptotically unbiased
is that bias relates to the expectation of the estimator, while consistency enforces the addi-
tional requirement that the variance goes to 0.
Example 13.4. Suppose that we are trying to measure the mean μ of a quantity, and
we collect measurements x_1, …, x_n. An example of an unbiased estimator is simply the
first element of the sample, x_1. However, x_1 is not consistent, because it is not the case
that x_1 → μ in probability.
Having taken the time to interpret linear regression as a projection onto a subspace, the rest
is easier than you may expect. The predictions ŷ are the projection of y onto the (p + 1)-dimensional
feature space, which means that y − ŷ (the residuals) is the projection of y onto
the orthogonal complement of the feature space, R, which has N − (p + 1) dimensions. Using
(13.11), we observe that the projection of Xβ onto R is 0 (this is a manifestation of (12.7)),
so we are left with the projection of ε onto R (call this P_R ε). Since the projection of an
N-dimensional Gaussian onto N − p − 1 dimensions is an (N − p − 1)-dimensional Gaussian,
we see that P_R ε ~ N(0, σ² I).
Observe that RSS(β̂) is the squared norm of the residuals, i.e. RSS(β̂) = ‖y − ŷ‖²₂. Since
each of the N − p − 1 components of y − ŷ is a Gaussian with mean 0 and variance σ², we
see that the squared norm represents the sum of N − p − 1 independent random variables,
each the square of a Gaussian. This should sound familiar, and we will make the
connection explicit:
\[ \frac{1}{\sigma^2} \mathrm{RSS}(\hat{\beta}) \sim \chi^2_{N - p - 1} \tag{13.13} \]
(The σ² factor is to convert the Gaussians into standard Gaussians, of course.) Taking
expectations, we find that
\[ E[\mathrm{RSS}(\hat{\beta})] = \sigma^2 (N - p - 1), \]
so that an unbiased estimator σ̂² for σ² is:
\[ \hat{\sigma}^2 := \frac{1}{N - p - 1} \sum_{i=1}^N (y_i - \hat{y}_i)^2 \tag{13.14} \]
Specializing to the case where p = 0, the linear regression is simply ȳ, the average of the
response values. This suggests that if we collect N samples y_1, …, y_N, assumed to be drawn
from a N(μ, σ²) distribution, then an unbiased estimator for σ² is the sample variance:
\[ \hat{\sigma}^2 := \frac{1}{N - 1} \sum_{i=1}^N (y_i - \bar{y})^2 \tag{13.15} \]
Note that, contrary to what you might have expected, \(N^{-1} \sum_{i=1}^N (y_i - \bar{y})^2\) is not an unbiased
estimator for σ².
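The claim about the 1/(N − 1) versus 1/N normalization can be checked exactly on a toy distribution; the sketch below (Python; the fair-coin distribution and n = 3 are my own choices) enumerates all equally likely outcomes of three fair coin flips (so σ² = 1/4) and computes the exact expectation of both estimators:

```python
from itertools import product
from fractions import Fraction

# y_i are i.i.d. fair coin flips, so mu = 1/2 and sigma^2 = 1/4.  Enumerate
# all 2^n equally likely outcomes and average both variance estimators.
n = 3
e_unbiased = Fraction(0)    # estimator with denominator n - 1
e_biased = Fraction(0)      # estimator with denominator n
for ys in product([0, 1], repeat=n):
    prob = Fraction(1, 2**n)
    ybar = Fraction(sum(ys), n)
    ss = sum((y - ybar) ** 2 for y in ys)   # sum of squared deviations
    e_unbiased += prob * ss / (n - 1)
    e_biased += prob * ss / n
```

The exact arithmetic shows the (n − 1)-denominator estimator averaging to σ² = 1/4 exactly, while the n-denominator estimator averages to the smaller value (n − 1)σ²/n.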
Although the expression seems complicated, the numerator has the standard normal distribution,
and from the discussion in the previous section, \(\sum_{i=1}^n z_i^2\) has the χ²_{n−1} distribution.
In symbols, z ~ N(0, 1)/√(χ²_{n−1}/(n − 1)), which means that z ~ t_{n−1}.
If we knew σ somehow, then we could use the statistic √n (ȳ − μ)/σ, and this would
have the N(0, 1) distribution. However, since we do not have information about σ, we must
estimate σ with σ̂, and as a result, our statistic instead has the t_{n−1} distribution. Compared
to the normal distribution, the t distribution places slightly more area on its tails, reflecting
a greater degree of uncertainty in our estimate. More importantly, knowing the distribution
of z allows us to construct confidence intervals as a measure of quantifying our uncertainty.
where P_θ(X) is the probability of drawing the samples X under the parameters θ. In the
second equality, we have made the common assumption that the data points are i.i.d., so
P_θ(X) decomposes into the product of the likelihoods for each point.
Often, for both computational and analytical ease, we prefer to work with the logarithm of
the probability. (Maximizing log P_θ(X) occurs at the same value of θ as maximizing
P_θ(X).) In this case, we seek
\[ \hat{\theta} = \arg\max_{\theta} \sum_{i=1}^N \log P_\theta(x_i). \tag{13.18} \]
Equivalently, since (13.18) is a sum over the data points, we can think of (13.18) as
maximizing the expectation
\[ \hat{\theta} = \arg\max_{\theta} E[\log P_\theta(x)], \tag{13.19} \]
where the expectation is taken w.r.t. the empirical distribution of the sampled data.
Chapter 14
Randomized Algorithms
14.1 EM Algorithm
The EM algorithm (short for Expectation and Maximization, the two stages of the algo-
rithm) is a popular algorithm for fitting probabilistic models to data when there are hidden
or latent variables that are not observed as part of the data. The two stages proceed as
follows: in the Expectation step, we produce tentative assignments to the hidden variables;
in the Maximization step, we use the assignments of the hidden variables to fit the other
parameters. We will use a mixture of Gaussians model to illustrate the idea.
μ̂_0 is the sample average of the data points which were generated from N(μ_0, σ_0²), and μ̂_1 is
the sample average of the data points which were generated from N(μ_1, σ_1²). Similarly, the
variance estimates σ̂_0² and σ̂_1² are the sample variances of the same two groups of points.
Now, let us turn to the problem of finding the latent assignments Z_i. Although we do not know Z_i, we can replace
Z_i with its expectation, conditioned on the estimated parameters. Let φ_{μ,σ²} denote the
density function of a N(μ, σ²) random variable. We have
\[ \alpha_i := E[Z_i \mid x_i] = P(Z_i = 1 \mid x_i) = \frac{P(Z_i = 1)\, \varphi_{\mu_1, \sigma_1^2}(x_i)}{P(Z_i = 0)\, \varphi_{\mu_0, \sigma_0^2}(x_i) + P(Z_i = 1)\, \varphi_{\mu_1, \sigma_1^2}(x_i)}, \tag{14.3} \]
1. Initialize μ̂_0, σ̂_0², μ̂_1, σ̂_1², and π̂.
\[ \hat{\pi} = \frac{1}{N} \sum_{i=1}^N \alpha_i. \tag{14.7} \]
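To make the two stages concrete, here is one possible end-to-end sketch in Python (the notes contain no code; the initialization scheme, the synthetic data, and all names such as em_two_gaussians are my own illustrative choices, not prescribed by the text):

```python
import math

def normal_density(x, mu, var):
    """Density of N(mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, steps=50):
    """EM for a two-component Gaussian mixture (a sketch, not a robust
    implementation: no convergence test, no variance floor)."""
    mu0, mu1 = min(xs), max(xs)     # crude but workable initialization
    var0 = var1 = 1.0
    pi1 = 0.5
    for _ in range(steps):
        # E-step: posterior responsibility alpha_i = P(Z_i = 1 | x_i), eq. (14.3).
        alphas = []
        for x in xs:
            w0 = (1 - pi1) * normal_density(x, mu0, var0)
            w1 = pi1 * normal_density(x, mu1, var1)
            alphas.append(w1 / (w0 + w1))
        # M-step: responsibility-weighted means, variances, and mixing weight.
        a1 = sum(alphas)
        a0 = len(xs) - a1
        mu0 = sum((1 - a) * x for a, x in zip(alphas, xs)) / a0
        mu1 = sum(a * x for a, x in zip(alphas, xs)) / a1
        var0 = sum((1 - a) * (x - mu0) ** 2 for a, x in zip(alphas, xs)) / a0
        var1 = sum(a * (x - mu1) ** 2 for a, x in zip(alphas, xs)) / a1
        pi1 = a1 / len(xs)          # eq. (14.7)
    return mu0, mu1, pi1

# Two well-separated clusters around 0 and 10.
data = [-0.3, 0.1, 0.2, -0.1, 0.4, 9.8, 10.1, 9.9, 10.3, 10.0]
mu0, mu1, pi1 = em_two_gaussians(data)
```

On well-separated data like this, the responsibilities quickly saturate near 0 and 1 and the estimated means settle near the two cluster centers.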
References
Markov Chains: The Bernoulli-Laplace model of diffusion is from [1]. The queueing model
is from [7]. The gambler's ruin analysis is mainly taken from [1]. The proofs of the theorems
are mostly from [1], as well as the random walk on a circle.
Continuous Probability I: The queueing example with exponential service times is from
[7]. The bounds on the tail probability of the Gaussian are from [2].
Moment-Generating Functions: The use of cumulants was taken from [1]. The proba-
bility generating functions are from [7].
Random Vectors & Linear Algebra: The notation for the non-Bayesian multivariate
regression is from [4]. The discussion on kernel regression is also heavily inspired by [4].
The multivariate change of variables theorem is a simplified treatment of the corresponding
theorem in [1]. The discussion on the curse of dimensionality is from [4].
Statistical Estimators & Hypothesis Testing: The discussion of the χ²_n and t_n distributions
was inspired by the exposition in [6]. The example of an inconsistent estimator is
taken from [3]. The distribution of the linear regression coefficients follows the discussion in
[4], with added details.
Bibliography
[1] Patrick Billingsley. Probability & Measure. New York, New York: John Wiley & Sons,
Inc., 1995.
[2] Rick Durrett. Probability: Theory & Examples. New York, New York: Cambridge Uni-
versity Press, 2010.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.
deeplearningbook.org. MIT Press, 2016.
[4] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical
Learning: Data Mining, Inference, & Prediction. Springer, 2008.
[5] Jim Pitman. Probability. New York, New York: Springer-Verlag New York, Inc., 1997.
[6] John A. Rice. Mathematical Statistics & Data Analysis. Belmont, California: Thomson
Higher Education, 2007.
[7] Howard M. Taylor and Samuel Karlin. An Introduction to Stochastic Modeling. San
Diego, California: Academic Press, 1998.