
Probability Notes

Fall 2016

Sinho Chewi
Contents

1 Probability Theory 6
1.1 Probability Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Fundamental Probability Facts . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Generalized Inclusion-Exclusion Principle . . . . . . . . . . . . . . . . 8
1.3 Discrete Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Uniform Sample Space . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Bayes' Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.1 Correlated Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Discrete Random Variables 16


2.1 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Combining Random Variables . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 The Distribution of a Random Variable . . . . . . . . . . . . . . . . . 16
2.1.3 Multiple Random Variables . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Tail Sum Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Discrete Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.3 Indicator Random Variables . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.4 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.5 Geometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.6 Memoryless Property . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.7 Negative Binomial Distribution . . . . . . . . . . . . . . . . . . . . . 26
2.3.8 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.9 Sums of Poisson Random Variables . . . . . . . . . . . . . . . . . . . 28
2.3.10 Poisson Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 Variance & Inequalities 30


3.1 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.1 The Computational Formula . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 Properties of the Variance . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Probability Distributions Revisited . . . . . . . . . . . . . . . . . . . . . . . 33


3.2.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 33


3.2.2 Bernoulli Distribution & Indicator Random Variables . . . . . . . . . 33
3.2.3 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.4 Computing the Variance of Dependent Indicators . . . . . . . . . . . 34
3.2.5 Geometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.6 Negative Binomial Distribution . . . . . . . . . . . . . . . . . . . . . 36
3.2.7 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Markov's Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.2 Chebyshev's Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.3 Cauchy-Schwarz Inequality . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Weak Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6 Bonus: Chernoff Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4 Regression & Conditional Expectation 43


4.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Bilinearity of Covariance . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.2 Standardized Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.3 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 LLSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Projection Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Quadratic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 The Law of Iterated Expectation . . . . . . . . . . . . . . . . . . . . 52
4.5 MMSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5.1 Orthogonality Property . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5.2 Minimizing Mean Squared Error . . . . . . . . . . . . . . . . . . . . . 55
4.6 Conditional Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5 Markov Chains 58
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Transition of Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 Markov Chain Computations . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.1 Hitting Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.2 Probability of S′ Before S . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4 Long-Term Behavior of Markov Chains . . . . . . . . . . . . . . . . . . . . . 65
5.4.1 Classification of States . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.2 Invariant Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4.3 Convergence of Distribution . . . . . . . . . . . . . . . . . . . . . . . 68
5.4.4 Convergence Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4.5 Balance Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.6 Invariant Distribution & Hitting Times . . . . . . . . . . . . . . . . . 71

6 Continuous Probability I 73
6.1 Continuous Probability: A New Intuition . . . . . . . . . . . . . . . . . . . . 73
6.1.1 Differentiate the CDF . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1.2 The Differential Method . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Continuous Analogues of Discrete Results . . . . . . . . . . . . . . . . . . . 76
6.2.1 Tail Sum Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3 Important Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . 78
6.3.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3.2 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.3 The Minimum & Maximum of Exponentials . . . . . . . . . . . . . . 81

7 Continuous Probability II 83
7.1 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1.1 Law of Total Probability . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1.2 Conditional Density . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Functions of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2.1 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2.2 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.2.3 Ratios of Random Variables . . . . . . . . . . . . . . . . . . . . . . . 86
7.3 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.3.1 Integrating the Normal Distribution . . . . . . . . . . . . . . . . . . . 87
7.3.2 Mean and Variance of the Normal Distribution . . . . . . . . . . . . . 88
7.3.3 Sums of Independent Normal Random Variables . . . . . . . . . . . . 89
7.3.4 Tail Probability Bounds . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.4 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.4.1 Confidence Intervals Revisited . . . . . . . . . . . . . . . . . . . . . . 93
7.4.2 de Moivre-Laplace Approximation . . . . . . . . . . . . . . . . . . . . 94
7.5 Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.6 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.6.1 Flipping Coins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

8 Moment-Generating Functions 99
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.1.1 Sums of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . 100
8.1.2 Distribution of Random Sums . . . . . . . . . . . . . . . . . . . . . . 100
8.1.3 Cumulants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.2 Common MGFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.3 Geometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.4 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.2.5 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3 Probability Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3.1 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

9 Convergence 105
9.1 Convergence in Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.2 Almost Sure Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.2.1 Borel-Cantelli Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . 106
9.2.2 Strong Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . 106
9.3 Convergence of Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.4 CLT Proof Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.4.1 Characteristic Functions . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.4.2 Proof Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

10 Stochastic Processes 111


10.1 Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
10.1.1 Erlang Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
10.1.2 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
10.1.3 Poisson Process Examples . . . . . . . . . . . . . . . . . . . . . . . . 114

11 Martingales 120
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
11.2 Martingale Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11.3 Stopping Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
11.4 Martingale Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

12 Random Vectors & Linear Algebra 125


12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
12.2 Symmetric Matrices & Spectral Theorem . . . . . . . . . . . . . . . . . . . . 126
12.2.1 Positive Semi-Definite Matrices . . . . . . . . . . . . . . . . . . . . . 127
12.2.2 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 128
12.3 Multivariate Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
12.3.1 Non-Bayesian Multivariate Regression . . . . . . . . . . . . . . . . . 130
12.4 Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
12.4.1 Inner Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
12.4.2 Matrix-Vector Products . . . . . . . . . . . . . . . . . . . . . . . . . 131
12.4.3 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
12.5 Kernel Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
12.5.1 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
12.5.2 Local Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . 133
12.5.3 Local Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . 134
12.6 Multivariate Change of Variable . . . . . . . . . . . . . . . . . . . . . . . . . 134
12.7 Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . 136
12.7.1 Uncorrelated Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . 137
12.8 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

13 Statistical Estimators & Hypothesis Testing 139


13.1 Important Sampling Distributions . . . . . . . . . . . . . . . . . . . . . . . . 139
13.1.1 Squared Normal Distribution . . . . . . . . . . . . . . . . . . . . . . 139
13.1.2 Chi-Square Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 140
13.1.3 t Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
13.2 Statistical Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
13.2.1 Bias & Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
13.2.2 Standard Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
13.2.3 Sample Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
13.3 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . 144

14 Randomized Algorithms 145


14.1 EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
14.1.1 Mixture of Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

References 147

Bibliography 148
Chapter 1

Probability Theory

We introduce the axioms of probability and illustrate the fundamentals.

1.1 Probability Axioms


Not everything in life goes as smoothly as anticipated: packets are dropped during transmission, light bulbs burn out prematurely, and doctors must diagnose patients who display symptoms of illness. (By the way, we will discuss all of these situations using probabilistic models soon enough.) We have to face uncertainty in the outcomes of our actions, and we have to reason logically about incomplete information. Probability theory isn't always depressing though: we also use probability to gamble intelligently and estimate our chances of success at various tasks.

It turns out that a wide variety of phenomena can be modeled effectively by the following mathematical objects:
a set of all possible outcomes of interest, which we call Ω (also known as the probability space or sample space),
subsets of the probability space, which we call events (sometimes denoted F),
a function that assigns values to sets, which will be our probability measure P.
Again, with less formality: Ω contains all possible outcomes, and events are sets of possibilities. We discuss events instead of individual outcomes because we are often interested in more than one outcome at a time. (When we discuss continuous probability, we will also see that it is not sufficient to only talk about individual outcomes.) Finally, our probability measure is our way of assigning likelihoods to the events, with higher numbers corresponding to more likely outcomes.

The question we are primarily interested in right now is: what properties do we want our probability measure P to satisfy? Let's start with boundary conditions: it is natural to say that the likelihood of nothing happening at all is 0, which we write as:

P(∅) = 0    (1.1)


(Recall that ∅ is the empty set.)

The second condition is by convention: it is extremely useful to restrict the probability values
to lie in the range [0, 1]. This condition can be written as:

P(Ω) = 1    (1.2)

An important consequence of this choice is that we may interpret the probability value of
an event as the long-term proportion that the event occurs. As a first example, consider
the archetypal fair coin toss. Saying that the probability that the coin comes up heads is
1/2 can be interpreted as follows: if we flipped the coin from now until forever (an infinite
number of times), then we would expect a fraction 1/2 of the flips to come up heads. This
is the frequentist view of statistics. Briefly, the other major view of statistics is known as
the subjectivist view, in which the probability value is interpreted as the degree of your be-
lief in the outcome. Regardless of which view you adopt, the laws of probability are the same.

Finally, the last probability axiom is countable additivity, which is stated as follows: if for each 1 ≤ i < ∞ we have an event Ai, such that the events Ai are pairwise disjoint (i.e. no two events have an outcome in common), then the probabilities of the events must add, in the following sense:

P(⋃_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P(Ai)    (1.3)

In other words, likelihoods add for disjoint events. Perhaps it is not so clear right now why countable additivity is so fundamental that we make it an axiom, but without this axiom we would not have much of a theory at all. Before we move on, we can verify a quick consequence of countable additivity: if A1 and A2 are disjoint events, then P(A1 ∪ A2) = P(A1) + P(A2). This is known as finite additivity, and it follows from countable additivity and (1.1) by taking A3 = A4 = ⋯ = ∅. As a check, prove to yourself via induction that if A1, . . . , An are disjoint events, then finite additivity implies that P(A1 ∪ ⋯ ∪ An) = ∑_{i=1}^{n} P(Ai). (As a side remark: observe that countable additivity does not follow from finite additivity alone.)

1.2 Fundamental Probability Facts


From the axioms, we can derive a number of rules which are invaluable to our calculations. The first is that for any event A, we can write Ω = A ∪ A^c (where A^c is the complement of A). Applying finite additivity, we obtain 1 = P(Ω) = P(A ∪ A^c) = P(A) + P(A^c), which is also written as:

P(A^c) = 1 − P(A)    (1.4)

The next is monotonicity: if A ⊆ B, then P(A) ≤ P(B). For if A ⊆ B, then we can write B = A ∪ (B \ A), where A and B \ A are disjoint sets. Then, we have P(B) = P(A) + P(B \ A), and noting that probabilities are non-negative, we can drop the second term and turn it into an inequality.

The next is the inclusion-exclusion principle, which states:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)    (1.5)

Instead of boring you with the details, take a look at the picture:

[Venn diagram: two overlapping circles labeled A and B; the overlapping region is A ∩ B.]

The proof of this fact involves splitting A ∪ B into disjoint events and applying finite additivity again. As you can see, the proofs of these fundamental facts all have a similar flavor: split the event into disjoint events, so that we can apply countable additivity. If we take the inclusion-exclusion principle and drop the last term on the right, then we have a simple inequality: P(A ∪ B) ≤ P(A) + P(B). With induction, we can extend this to the union bound:

P(A1 ∪ ⋯ ∪ An) ≤ P(A1) + ⋯ + P(An)    (1.6)

It's a crude bound, but nevertheless useful.

Finally, we will prove one more, just to illustrate the concept. Suppose we have events A1 ⊆ A2 ⊆ ⋯ ⊆ A, such that ⋃_{i=1}^{∞} Ai = A. We want to show that P(An) → P(A). Define Ai' by A1' = A1 and Ai' = Ai \ A_{i−1} for i ≥ 2. Then, we can notice the following facts: for all n, ⋃_{i=1}^{n} Ai' = An, and ⋃_{i=1}^{∞} Ai' = A. Furthermore, the Ai' are disjoint, so applying countable additivity here yields

P(A) = lim_{n→∞} P(⋃_{i=1}^{n} Ai') = lim_{n→∞} ∑_{i=1}^{n} P(Ai') = lim_{n→∞} P(An).    (1.7)

We call this property continuity from below, although we will not discuss it further.

1.2.1 Generalized Inclusion-Exclusion Principle


From the inclusion-exclusion principle for two events, one can derive a general version for the union of n events:

P(A1 ∪ ⋯ ∪ An) = ∑_{k=1}^{n} (−1)^{k+1} ∑_{1 ≤ i1 < ⋯ < ik ≤ n} P(A_{i1} ∩ ⋯ ∩ A_{ik})    (1.8)

The inner summation ranges over all k-tuples of indices (i1, . . . , ik) satisfying the condition 1 ≤ i1 < ⋯ < ik ≤ n. In words, the generalized inclusion-exclusion principle prescribes the following rule for calculating P(A1 ∪ ⋯ ∪ An): First, sum each of the individual probabilities P(Ai). Next, subtract the probabilities for all pairwise intersections of events, P(Ai ∩ Aj). Then, add the probabilities for all 3-wise intersections of events P(Ai ∩ Aj ∩ Ak), and continue with the same pattern (with alternating signs).

One can prove the generalized inclusion-exclusion principle with a tedious proof by induction
on n, the number of events. Instead, we will try to arrive at a better understanding of the
formula and develop a more elegant proof.

First, study the diagram for the case of n = 2, which reveals why the inclusion-exclusion principle is necessary: if we sum P(A) and P(B), then we are double-counting the intersection P(A ∩ B). To avoid dealing with overlapping events, we can write A ∪ B as the disjoint union of three events A ∩ B^c, A ∩ B, and A^c ∩ B. We can write these events as length-2 binary strings: the first digit is 1 if A is included, and the second digit is 1 if B is included. Then, the three disjoint events correspond to the strings 10, 11, and 01, respectively. We have included all length-2 binary strings which contain at least one 1 (the string with all 0s corresponds to A^c ∩ B^c, which is clearly not contained in A ∪ B).

Generalizing, A1 ∪ ⋯ ∪ An can be written as the disjoint union of 2^n − 1 events, where each event is encoded as a length-n binary string. The ith digit is 1 if Ai is included in the event, and we include all length-n strings except the string of all 0s.

Consider a length-n binary string with exactly m 1s, where m ≥ 1. Call this event B. Let us examine (1.8) and see how many times we count P(B). If we consider one event at a time, then B is contained in exactly m of the events A_{i1}, so P(B) appears m times with a positive sign. If we consider pairwise intersections of events, then B is contained in exactly (m choose 2) of the events A_{i1} ∩ A_{i2}, and here P(B) appears with a negative sign. Continuing on, if we consider k-wise intersections of events, then P(B) appears (m choose k) times, with the sign (−1)^{k+1}. Therefore, the total number of times that P(B) is counted is

∑_{k=1}^{m} (−1)^{k+1} (m choose k) = 1 − ∑_{k=0}^{m} (−1)^k (m choose k).

However, by the binomial theorem, ∑_{k=0}^{m} (−1)^k (m choose k) = (1 − 1)^m = 0, so we see that P(B) is counted exactly one time! Since this applies to each of the length-n binary strings (except the string of all 0s), we see that the RHS of (1.8) sums the probabilities of each of the 2^n − 1 disjoint events which make up A1 ∪ ⋯ ∪ An exactly once, which establishes the formula.
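To make the counting argument concrete, here is a small Python sketch (a quick check on an arbitrary toy sample space, not a new result) that verifies (1.8) numerically: the left side is computed by enumerating the union directly, and the right side by the alternating sum over all k-wise intersections.

from itertools import combinations

# Toy uniform sample space and three overlapping events (arbitrary choices).
omega = set(range(12))
events = [set(range(0, 6)), set(range(4, 9)), {1, 5, 8, 10, 11}]

def prob(s):
    return len(s) / len(omega)  # uniform measure: P(A) = |A| / |Omega|

# Left side of (1.8): probability of the union.
lhs = prob(set().union(*events))

# Right side of (1.8): alternating sum over all k-wise intersections.
rhs = 0.0
n = len(events)
for k in range(1, n + 1):
    for subset in combinations(events, k):
        rhs += (-1) ** (k + 1) * prob(set.intersection(*subset))

print(lhs, rhs)  # both equal 11/12 here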

1.3 Discrete Probability


We now consider the case when Ω is at most countably infinite. Recall that the probability
measure P is a function which assigns real numbers to sets of outcomes; in general, this
statement is far more expressive than if we had simply said that P assigns real numbers to
individual outcomes. However, we will not need the full power of this statement yet. In the
case of discrete probability, the probability of any event is completely determined once we

specify the probability of each outcome. Let us make this explicit:


P(A) = ∑_{ω ∈ A} P({ω})    (1.9)

What we are saying here is that the probability of the event A is the sum of the probabilities
of each outcome in A. Perhaps this is intuitively clear and we are being overly fussy about
the details, but attention to detail will be quite important later in the course.

1.3.1 Uniform Sample Space


An important case is when every outcome ω is equally likely to occur, i.e. P(ω) = 1/|Ω|. We say that Ω is a uniform probability space. The advantage of having a uniform probability space is that we may employ methods of combinatorics to compute probabilities:

P(A) = |A| / |Ω|    (1.10)

To compute the probability of the event A, we count the number of ways in which A is achieved, and divide by the total number of elements in Ω.
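As a minimal illustration of (1.10) (the dice example is an arbitrary choice, not taken from the text), the probability that two fair dice sum to 7 is computed by counting outcomes in the uniform sample space of 36 ordered pairs.

from itertools import product
from fractions import Fraction

# Uniform sample space: all 36 ordered outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))

# Event A: the two dice sum to 7.
A = [w for w in omega if sum(w) == 7]

# P(A) = |A| / |Omega|
print(Fraction(len(A), len(omega)))  # 1/6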

1.4 Conditional Probability


First, we state the important Law of Total Probability, which allows us to break down the probability of one event into the probabilities of smaller events. Formally, suppose that the events A1, . . . , An partition the sample space, that is, they are disjoint and A1 ∪ ⋯ ∪ An = Ω. Then, to find the probability of an event B, we can write:

P(B) = ∑_{i=1}^{n} P(B ∩ Ai)    (1.11)

The utility of this statement is that P(B) may be more difficult to compute than P(B ∩ Ai),
because for the latter, we have more information: we know that both B and Ai have occurred,
and the knowledge of Ai may aid us in the computation of P (B). This line of reasoning is
also useful for calculating probabilities associated with sequential processes, which can be
represented in a tree structure. Each node of the tree represents a decision, with the differ-
ent branches from a node representing the different possible outcomes from the decision. To
fully analyze such situations, we will need another tool called conditional probability.

Conditional probability is the idea that events can affect each other; for example, observing
that there are more clouds in the sky today informs us that there is a greater chance for
rain. How do we formalize this idea mathematically? Suppose that P represents our beliefs
in the likelihoods of various events (adopting the subjectivist view for a moment), and then
we observe an event A. In a sense, P is outdated information; P is a probability law that
was formulated before we learned about A, and knowing A might change our beliefs. What

we need is a new probability law, which we will call the conditional probability law given the
event A. Since we have a new probability law, we may call it by a different name, such as
PA , but this is not standard. Instead, we write P (B | A), read as the probability of B given
A, to mean the probability assigned to the event B under the new law.

It is important to recognize that the new law is a true probability law, i.e. it must satisfy the three probability axioms. In addition, we would like the following property: P(A | A) = 1. This is equivalent to saying that knowing the occurrence of A, we are certain that A has indeed occurred. These statements are intended to motivate the following definition:

P(B | A) = P(A ∩ B) / P(A)    (1.12)

The interpretation of this equation is that the conditional probability law is formed from
the previous probability law in two stages: first, we restrict our attention only to the events
which are subsets of A; second, we normalize the probability, that is, we divide by P (A) to
ensure that the probability of A itself under this new law is now 1.

The expression above can be written in a more symmetric way, by noting that

P(A | B)P(B) = P(A ∩ B) = P(B | A)P(A).    (1.13)

Often, we can write down conditional probability laws quite easily in a cause-and-effect fashion, such as the probability that a patient has a fever given that the patient has the flu. Then, the equation P(A ∩ B) = P(A)P(B | A) states that we can directly apply our cause-and-effect knowledge, by multiplying the unconditional probability P(A) with the conditional probability P(B | A), to obtain the probability that both A and B occur. The rule P(A ∩ B) = P(A)P(B | A) is often known as the chain rule of probability.

Example 1.1 (Birthday Problem I). Suppose that we throw n balls into m bins. What
is the smallest n such that the probability of a collision (two balls landing in the same
bin) is at least 1/2?

When m = 365 and we view the balls as people and the bins as birthdays, the
question becomes: how many people do you need to gather in a room before the proba-
bility that two people share a birthday is at least 1/2? This is commonly known as the
birthday problem.

Let Ai denote the event that ball i does not collide with balls 1, . . . , i − 1. We have P(A1) = 1 since the first ball cannot produce a collision. We would like to compute P(A1 ∩ ⋯ ∩ An), the probability that we do not have any collisions. To do so, we will use the chain rule of probability,

P(A1 ∩ ⋯ ∩ An) = P(A1) ∏_{i=2}^{n} P(Ai | A1 ∩ ⋯ ∩ A_{i−1}).    (1.14)
Here, P(Ai | A1 ∩ ⋯ ∩ A_{i−1}) is the conditional probability that ball i does not collide with the first i − 1 balls, given that none of the first i − 1 balls produced a collision. If none of the first i − 1 balls produced a collision, then they must each be in separate bins; therefore, P(Ai | A1 ∩ ⋯ ∩ A_{i−1}) is the probability of ball i landing in one of the other m − (i − 1) bins, which is (m − i + 1)/m.

P(A1 ∩ ⋯ ∩ An) = (m/m) · ((m − 1)/m) ⋯ ((m − n + 1)/m) = m! / (m^n (m − n)!)

Although we have obtained an exact answer, many times it is worth the effort to investigate if we can approximate the solution in order to better understand its properties. We rewrite the desired probability in the following way:

P(A1 ∩ ⋯ ∩ An) = 1 · (1 − 1/m) ⋯ (1 − (n − 1)/m) = ∏_{i=0}^{n−1} (1 − i/m)

Next, we can use the approximation e^{−x} ≈ 1 − x to replace the factor 1 − i/m with e^{−i/m}. (1 − x is the first-order series expansion of e^{−x}.)

P(A1 ∩ ⋯ ∩ An) ≈ ∏_{i=0}^{n−1} e^{−i/m} = e^{−∑_{i=0}^{n−1} i/m}

Evaluating the summation, we have ∑_{i=0}^{n−1} i = n(n − 1)/2 ≈ n^2/2, so

P(A1 ∩ ⋯ ∩ An) ≈ e^{−n^2/(2m)}.

Setting P(A1 ∩ ⋯ ∩ An) = 1/2, we find that n^2 = (2 ln 2)m, or n = √((2 ln 2)m). If we ignore constants, then in the language of computational complexity, we have m ∈ O(n^2).

As another application, suppose that we are designing a hash function

h : {0, . . . , n − 1} → {0, . . . , m − 1}.

Here, the balls are inputs and the bins are the buckets in our hash table. What we have found is that if we want the probability of a collision between two inputs to be less than 1/2 (in other words, for distinct inputs i and j, we want P(h(i) ≠ h(j)) ≥ 1/2), then we want the size of our hash table, m, to be roughly n^2.
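As a sanity check on the approximation, here is a rough Monte Carlo sketch (the number of trials and the choice m = 365 are arbitrary): with m = 365 bins, n = ⌈√((2 ln 2)m)⌉ = 23 balls should give a collision probability near 1/2.

import math
import random

def collision_prob(n, m, trials=100_000):
    """Estimate the probability that n balls thrown into m bins collide."""
    hits = 0
    for _ in range(trials):
        bins = [random.randrange(m) for _ in range(n)]
        if len(set(bins)) < n:  # a repeated bin means a collision
            hits += 1
    return hits / trials

m = 365
n = math.ceil(math.sqrt(2 * math.log(2) * m))  # n = 23 when m = 365
print(n, collision_prob(n, m))                 # estimate is close to 0.5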

Example 1.2 (Birthday Problem II). Using the union bound, one can obtain the results of Example 1.1 more easily. Since there are n balls, there are (n choose 2) = n(n − 1)/2 different pairs of balls. Let Bi denote the event that the ith pair of balls collides, and write B = ⋃_{i=1}^{n(n−1)/2} Bi as the event of a collision. By the union bound,

P(B) ≤ ∑_{i=1}^{n(n−1)/2} P(Bi) = (n(n − 1)/2) · (1/m) ≈ n^2/(2m).

To calculate P(Bi), we observed that regardless of where the first ball lands, the probability that the second ball in the pair lands in the same bin as the first ball is 1/m. Again, if we set P(B) = 1/2, we find that m ∈ O(n^2).

Although the union bound may seem crude, it is usually simple to apply and often yields
good results.

1.4.1 Bayes' Law


Consider the above situation, where we have the conditional probability that a patient has a fever given that the patient has the flu. Actually, in reality, the situation is backwards: we can observe that the patient has a fever, but we are not yet sure of the diagnosis. In fact, there may be competing explanations for the phenomenon, ranging from pneumonia to infections and even cancer. How do we use probabilistic reasoning to make an informed diagnosis?

We use the Law of Total Probability, (1.11), along with the definition of conditional probability, to write the following derivation:

P(A | B) = P(A ∩ B) / P(B) = P(A ∩ B) / (P(A ∩ B) + P(A^c ∩ B)) = P(B | A)P(A) / (P(B | A)P(A) + P(B | A^c)P(A^c))

In one sense, we have derived a way to relate P(A | B) to P(B | A), so we have a computational tool: whenever we have the "wrong" conditional probabilities for a problem we face, use Bayes' Rule to turn the conditional probabilities around. More generally, if A1, . . . , An partition the sample space, then:

P(Ak | B) = P(B | Ak)P(Ak) / ∑_{i=1}^{n} P(B | Ai)P(Ai),    1 ≤ k ≤ n    (1.15)

The derivation of Bayes' Rule was not so difficult, so one may wonder why such a big deal is made out of the formula. Let us examine another way to view Bayes' Rule: given multiple possible explanations for an observation we have just made, we can discern the most likely explanation. This is called probabilistic inference, and it concerns problems such as patient diagnosis, or determining whether a blood test result is a false positive.
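For a concrete instance of such an inference (the numbers below are made up purely for illustration): suppose a disease has 1% prevalence, a test detects it with probability 0.99, and it returns a false positive with probability 0.05. Bayes' Rule turns P(positive | sick) into P(sick | positive).

# Hypothetical numbers, chosen only for illustration.
p_sick = 0.01               # prior P(sick)
p_pos_given_sick = 0.99     # P(+ | sick)
p_pos_given_healthy = 0.05  # P(+ | healthy), the false-positive rate

# Law of Total Probability for the denominator, then Bayes' Rule.
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos

print(round(p_sick_given_pos, 3))  # about 0.167: most positives are false positives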

Another interpretation is gained when we view conditional probability in the framework of taking an old probability law (outdated information) and producing a new probability law after the observation of an event. In this view, Bayes' Rule can be viewed as an update rule, which tells us the correct way to respond to new information. Sometimes, it is said that Bayes' Rule is the foundation for the field of artificial intelligence. After all, a rational agent is one that collects information and best figures out how to utilize the new information, a task for which Bayes' Rule excels.

1.5 Independence
Speaking of information, what if we observe an event A, but knowing that A occurs tells us exactly nothing about whether B will occur? In other words, our belief that B occurs is unchanged after the observation of A. In this case, we say that A and B are independent: P(B | A) = P(B). Again, this equation may not look symmetric, but we can write the condition for independence as follows:

P(A ∩ B) = P(A)P(B)    (1.16)
Independence is a truly subtle concept. If we view the independence of A and B as the
information about B you receive from observing A (or vice versa), then we should have that
A^c and B are also independent, and A and B^c should also be independent, and so forth.
Pinning down the concept of independence as information requires a more formal setting
than we have described so far, but for now, concepts such as knowledge and information
provide a good intuitive guide. For example, we will make statements such as: if I know the
outcome of one coin toss, I still have no clue what the next coin toss will be, so we say that
coin tosses are independent of each other.

In the above example, you may have noticed that the independence of coin tosses requires
talking about the independence of multiple events, and it is not fully clear what we mean.
In fact, there are different forms of independence: we can say that the events A1 , . . . , An are
pairwise independent if every pair of events is independent. A stronger statement is the
statement of mutual independence, which is tricky to formalize. Let us give it a try: if we
have two sets of events, such that the sets are disjoint, then mutual independence requires
that any combination of events from the first set is independent of any combination of events from the second set. For example, A1 ∩ A3 should be independent of A2 ∩ A5 ∩ A6, and so forth. An equivalent way of stating this condition is:
P(⋂_{i=1}^{n} Ai') = ∏_{i=1}^{n} P(Ai')    (1.17)

where each Ai' is allowed to be either Ai or Ω. The reason we allow Ai' to be Ω is because we want to allow combinations such as A1 ∩ A2 (combinations which do not use every Ai), which we can write as A1 ∩ A2 ∩ Ω ∩ ⋯ ∩ Ω. This is just a technical detail though. Importantly, you should recognize that mutual independence is stronger than pairwise independence. In the case of coin tosses, we would like the stronger condition of mutual independence to hold, because we want to talk about sets of coin tosses as well.

How about an infinite set of events Ai, maybe even uncountably many events? The answer is that we say the collection {Ai} is independent if any finite subcollection of the Ai is independent.

Example 1.3. Here is an example to demonstrate that mutual independence is indeed


stronger than pairwise independence. Consider two flips of a fair coin, and define the
following events: H1 is the event that the first coin is heads, H2 is the event that the
second coin is heads, and S is the event that the two tosses produced the same outcome
(two heads or two tails).

We can check that the three events are pairwise independent. Conditioned on Hi ,
P (S | Hi ) is the probability that the other coin also lands heads, which is 1/2. Hence,
P (S | Hi ) = P (S) = 1/2, so Hi and S are independent. Clearly, H1 and H2 are indepen-
dent, so we have pairwise independence.

However, we do not have mutual independence:

1/4 = P(H1 ∩ H2 ∩ S) ≠ P(H1)P(H2)P(S) = 1/8

Example 1.4. One might ask if P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3) alone is sufficient for mutual independence. The answer is no.

Consider the probability space of a fair die, Ω = {1, 2, 3, 4, 5, 6}, and define the following events: A1 = A2 = {1, 2, 3} and A3 = {3, 4, 5, 6}. We have

P(A1 ∩ A2 ∩ A3) = P({3}) = 1/6 = P(A1)P(A2)P(A3),

but clearly A1 and A2 are not independent. Therefore, when checking for mutual independence of events Ai, one must check that (1.17) holds for every subset of the Ai.

1.5.1 Correlated Events


We say that the events A and B are positively correlated if P(A ∩ B) > P(A)P(B). Similarly, we say that A and B are negatively correlated if P(A ∩ B) < P(A)P(B). Of course, if P(A ∩ B) = P(A)P(B), then A and B are independent.

Intuitively, if P(A ∩ B) is large relative to P(A)P(B), then events A and B tend to occur together; another way to see this is that P(A ∩ B) > P(A)P(B) implies that P(A | B) > P(A) and P(B | A) > P(B). Hence, observing one of A or B increases the likelihood of observing the other. If P(A ∩ B) is small relative to P(A)P(B), then observing one of A or B will decrease the likelihood of observing the other; an extreme example is when A and B are mutually exclusive (i.e. disjoint).
Chapter 2

Discrete Random Variables

Random variables are the tool of choice for modeling probability spaces for which the out-
comes are associated with a numerical value of interest. For instance, stock market prices
and the daily temperature are two examples of random numbers in our lives. Each random
variable has an associated distribution which describes its behavior; we will introduce com-
mon families of discrete distributions. We will see that the expectation of a random variable
is a useful summary statistic of its distribution which satisfies a crucial property: linearity.

2.1 Random Variables


Definition 2.1. A random variable is a function X : Ω → R that assigns a real number to every outcome in the probability space.

We typically denote random variables by capital letters.

2.1.1 Combining Random Variables


We define addition of random variables in the following way: the random variable X + Y is the random variable that maps ω to X(ω) + Y(ω). Similarly, the random variable XY is the random variable that maps ω to X(ω)Y(ω). More generally, let f : R^n → R be any function. Then f(X1, . . . , Xn) is defined to be the random variable that maps ω to f(X1(ω), . . . , Xn(ω)).

2.1.2 The Distribution of a Random Variable


Random variables assign probabilities to real numbers. To see how this is done, consider
the sample space of two tosses of a fair coin: Ω = {HH, HT, TH, TT}. Let us define the
random variable X in the following way:

X(HH) = 2
X(HT) = 1
X(TH) = 1
X(TT) = 0


(Here, X represents the number of heads that we see in the two coin tosses.) What is the
probability that X takes on a particular value? We can write:
P(X = 0) = P({TT}) = 1/4
P(X = 1) = P({HT, TH}) = 1/2
P(X = 2) = P({HH}) = 1/4
We can reasonably say that X assigns the probability 1/4 to the real number 0, the proba-
bility 1/2 to the real number 1, and the probability 1/4 to the real number 2. In this way, we
see that X induces a probability measure on the real line, which we call the distribution
of X. The distribution of X satisfies the probability axioms; for example, (1.2) applied to
the distribution of X is the equation:
X
P (X = x) = 1 (2.1)
x

where the sum ranges over all values of x in the range of X.

In the example above, we were explicit about the probability space in order to show that
X is a function from to R. However, the utility of random variables is that we can often
forget about the underlying probability space and focus our attention on the distribution
of X. For a discrete random variable X, we can specify the distribution of X simply by
giving the probabilities P (X = x) for all x in the range of X, without reference to the
original probability space . Indeed, that is what we will proceed to do from here onwards.

There are many common distributions (described in detail below), which have special names. We use the symbol ∼ to denote that a random variable has a known distribution, e.g. X ∼ Bin(n, p) indicates that the distribution of X is the Bin(n, p) distribution.

Equivalently, we can describe a probability distribution by its cumulative distribution


function, or CDF for short. The CDF is usually specified as a formula for P(X ≤ x).
The CDF contains exactly the same information as the original distribution. To see this
fact, observe that we can recover the probability distribution function (also known as the
PDF) from the CDF by the following formula:

P(X = x) = P(X ≤ x) − P(X ≤ x − 1)    (2.2)

(assuming X takes on integer values).

2.1.3 Multiple Random Variables


The joint distribution of two random variables X and Y is the probability distribution
P (X = x, Y = y) for all possible pairs of values (x, y). The joint distribution must satisfy
the normalization condition:
∑_x ∑_y P(X = x, Y = y) = 1    (2.3)

We can recover the distribution of X separately (known as the marginal distribution of


X) by summing over all possible values of Y :
P(X = x) = ∑_y P(X = x, Y = y)    (2.4)

Similarly:
P(Y = y) = ∑_x P(X = x, Y = y)    (2.5)

The joint distribution contains all of the information about X and Y . From the joint dis-
tribution, we can recover the marginal distributions of X and Y . The converse is not true:
the marginal distributions are usually not sufficient to recover the joint distribution. The
reason for this is that the joint distribution captures information about the dependence of
X and Y . This leads us to formulate the following definition:

Definition 2.2. We say that two discrete random variables are independent if

∀ x, y ∈ R:    P(X = x, Y = y) = P(X = x)P(Y = y).    (2.6)

Notice the utility of independence: if X and Y are independent, then we can write their joint
probability as a product of their marginal probabilities (sometimes, we say that the joint
distribution factors). Not only does independence allow us to write down the joint distribu-
tion using only information about the marginal distributions, it also simplifies calculations.
All of the results in this section generalize easily to multiple random variables.

2.2 Expectation
Knowing the full probability distribution gives us a lot of information, but sometimes it is
helpful to have a summary of the distribution.

Definition 2.3. The expectation (or expected value) of a discrete random variable
X is defined to be the real number:
E[X] = ∑_{ω∈Ω} X(ω)P(ω) = ∑_x x P(X = x)    (2.7)

Technical Remark: The expectation is not always well-defined. To avoid these issues,
throughout the course we will assume E[|X|] < ∞.

The interpretation of the expected value is as follows: pick N outcomes ω1, . . . , ωN from a probability distribution (we call this N trials of an experiment). For each trial, record the value of X(ωi). Then

E[X] ≈ (X(ω1) + ⋯ + X(ωN)) / N

as N → ∞ (in a certain sense that we will formalize later). Therefore, E[X] is the long-run
average of an experiment in which you measure the value of X.
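A tiny simulation illustrates this long-run-average interpretation (the fair die and the number of trials are arbitrary choices): the empirical mean of many die rolls is close to E[X] = 3.5.

import random

N = 100_000
rolls = [random.randint(1, 6) for _ in range(N)]  # N trials of the experiment
print(sum(rolls) / N)                             # close to E[X] = 3.5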

Often, the expectation values are easier to work with than the full probability distributions
because they satisfy nice properties. In particular, they satisfy linearity:

Theorem 2.4 (Linearity of Expectation). Suppose X, Y are random variables, a ∈ R is a constant, and c is the constant random variable (i.e. c(ω) = c). Then:

1. E[X + Y ] = E[X] + E[Y ]

2. E[aX + c] = aE[X] + c

Proof. The proof essentially follows from the linearity of summations.

1.
E[X + Y] = ∑_{ω∈Ω} (X(ω)P(ω) + Y(ω)P(ω)) = ∑_{ω∈Ω} X(ω)P(ω) + ∑_{ω∈Ω} Y(ω)P(ω)
         = E[X] + E[Y]

2.
E[aX + c] = ∑_{ω∈Ω} (aX(ω) + c(ω))P(ω) = a ∑_{ω∈Ω} X(ω)P(ω) + c ∑_{ω∈Ω} P(ω)
          = aE[X] + c

Important: Notice that we did not assume that X and Y are independent. We will use
these properties repeatedly to solve complicated problems.

In the previous section, we noted that if X is a random variable and f : R → R is a function, then f(X) is a random variable. The expectation of f(X) is defined as follows:

E[f(X)] = ∑_{ω∈Ω} f(X(ω))P(ω) = ∑_x f(x)P(X = x)    (2.8)

The definition can be extended easily to functions of multiple random variables using the joint distribution:

E[f(X1, . . . , Xn)] = ∑_{x1,...,xn} f(x1, . . . , xn) P(X1 = x1, . . . , Xn = xn)    (2.9)

Next, we prove an important fact about the expectation of independent random variables.

Theorem 2.5 (Expectation of Independent Random Variables). Let X and Y be independent random variables. Then the random variable XY satisfies

E[XY] = E[X]E[Y].    (2.10)

Proof.

E[XY] = ∑_{x,y} xy P(X = x, Y = y) = ∑_{x,y} xy P(X = x)P(Y = y)
      = (∑_x x P(X = x)) (∑_y y P(Y = y)) = E[X]E[Y]

The definition of independent random variables was used in the first line of the proof.
It is crucial to remember that the theorem does not hold true when X and Y are not
independent!

2.2.1 Tail Sum Formula


Next, we derive an important formula for computing the expectation of a random variable
that only takes on values in the natural numbers.

Theorem 2.6 (Tail Sum Formula). Let X be a random variable that only takes on values in N. Then

E[X] = ∑_{x=1}^{∞} P(X ≥ x).

Proof. We manipulate the formula for the expectation:

E[X] = ∑_{k=1}^{∞} k P(X = k) = ∑_{k=1}^{∞} ∑_{x=1}^{k} P(X = k)
     = ∑_{x=1}^{∞} ∑_{k=x}^{∞} P(X = k) = ∑_{x=1}^{∞} P(X ≥ x)

The formula is known as the tail sum formula because we compute the expectation by
summing over the tail probabilities of the distribution.
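For instance, for a fair six-sided die (a quick check added here, not a new result), the definition of expectation and the tail sum both give 7/2:

from fractions import Fraction

# Fair die: P(X = k) = 1/6 for k = 1, ..., 6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

expectation = sum(k * p for k, p in pmf.items())
tail_sum = sum(sum(p for k, p in pmf.items() if k >= x) for x in range(1, 7))

print(expectation, tail_sum)  # both equal 7/2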

2.3 Discrete Probability Distributions


We will now discuss the common discrete probability distributions.

2.3.1 Uniform Distribution


As a first example of a probability distribution, consider the uniform distribution over the set {1, . . . , n}, typically denoted as Unif{1, . . . , n}. The meaning of uniform is that each element of the set is equally likely to be chosen; therefore, the probability distribution is

P(X = x) = 1/n,    x ∈ {1, . . . , n}.

The expectation of the uniform distribution is calculated fairly easily from the definition:

E[X] = ∑_{x=1}^{n} x · (1/n) = (1/n) · n(n + 1)/2 = (n + 1)/2

where to evaluate the sum, we have used the triangular number identity:

∑_{k=1}^{n} k = n(n + 1)/2    (2.11)

Example 2.7. Suppose you have a box with n chocolates, one of which is a special
dark chocolate. You pick out chocolates randomly, one at a time, without putting them
back into the box, until you find the special dark chocolate. Let X denote the number of
chocolates that you remove from the box. What is the distribution and expectation of X?

By symmetry considerations, X ∼ Unif{1, . . . , n}. The argument proceeds as follows: since you are picking chocolates without replacement, we can imagine an equivalent situation in which you number the chocolates from 1 to n, pick a random permutation of 1, . . . , n, and draw the chocolates in the order specified by the permutation. A random permutation must assign the special chocolate to each of the n positions with equal probability, so X is uniform over {1, . . . , n}. As discussed above, the expectation is (n + 1)/2.

As a consequence, if each person picks out just one chocolate from the box at random,
it does not matter whether you are the first, the last, or the kth person to pick out a
chocolate: the probability that you will end up with the special dark chocolate is always
the same, 1/n.
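A short simulation of Example 2.7 (with the arbitrary choice n = 10) agrees with the symmetry argument: each draw position is equally likely to produce the special chocolate, and the average position is near (n + 1)/2 = 5.5.

import random
from collections import Counter

n, trials = 10, 100_000
positions = Counter()
for _ in range(trials):
    order = list(range(n))
    random.shuffle(order)                # a uniformly random drawing order
    positions[order.index(0) + 1] += 1   # chocolate 0 plays the special one

print({k: round(v / trials, 3) for k, v in sorted(positions.items())})  # each about 0.1
print(sum(k * v for k, v in positions.items()) / trials)                # about 5.5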

2.3.2 Bernoulli Distribution


The Bernoulli distribution with parameter p, denoted Ber(p), is a simple distribution that
describes the result of performing one experiment which succeeds with probability p. The
probability space is Ω = {Success, Failure} with P(Success) = p and P(Failure) = 1 − p. Define the random variable X as

X(ω) = 0 if ω = Failure,  and X(ω) = 1 if ω = Success.

The distribution of X is

P(X = x) = 1 − p for x = 0,  p for x = 1,  and 0 otherwise.

A concise way to describe the distribution is P(X = x) = (1 − p)^{1−x} p^x for x ∈ {0, 1}. The expectation of the Ber(p) distribution is

E[X] = 0 · P(X = 0) + 1 · P(X = 1) = 0 · (1 − p) + 1 · p = p.

A quick example: the number of heads in one fair coin flip follows the Ber(1/2) distribution.

2.3.3 Indicator Random Variables


Let A be an event. We define the indicator of A, denoted 1A, to be the random variable

1A(ω) = 0 if ω ∉ A,  and 1A(ω) = 1 if ω ∈ A.

Observe that 1A follows the Ber(p) distribution where p = P(A).

An important property of indicator random variables (and Bernoulli random variables) is that X = X^2 = X^k for any k ≥ 1. To see why this is true, note that X can only take on values in the set {0, 1}. Since 0^2 = 0 and 1^2 = 1, we have X(ω) = X^2(ω) for every ω. By induction, X = X^k for k ≥ 1. We will use this property when we discuss the variance of probability distributions.

The expectation of the indicator random variable is

E[1A] = P(A)    (2.12)

because it is a Bernoulli random variable with p = P(A).

2.3.4 Binomial Distribution


The binomial distribution with parameters n and p, abbreviated Bin(n, p), describes the number of successes when we conduct n independent trials, where each trial has a probability p of success. The binomial distribution is found by the following argument: the probability of any particular series of trials with k successes (and therefore n − k failures) is p^k (1 − p)^{n−k}. We need to multiply this expression by the number of ways to achieve k successes in n trials, which is (n choose k). Hence,

P(X = x) = (n choose x) p^x (1 − p)^{n−x},    x ∈ {0, . . . , n}.
Prove for yourself that the probabilities sum to 1, i.e.

∑_{x=0}^{n} (n choose x) p^x (1 − p)^{n−x} = 1.    (2.13)
Let us proceed to compute the expectation of this distribution. According to the formula,

E[X] = ∑_x x P(X = x) = ∑_{x=0}^{n} x (n choose x) p^x (1 − p)^{n−x}.

This is quite a difficult sum to calculate! (Try it yourself and see if you can make any
progress.) To make our work simpler, we will instead make a connection between the binomial
distribution and the Bernoulli distribution we defined earlier. Let Xi be the indicator random
variable for the event that trial i is a success. (In the language of the previous section,
Xi = 1Ai , where Ai is the event that trial i is a success.) The key insight lies in observing
X = X1 + ⋯ + Xn.
Each indicator variable Xi is 1 or 0 depending on whether trial i is a success, so if we sum
up all of the indicator variables, then we obtain the total number of successes in all n trials.
Therefore, compute
E[X] = E[X1 + ⋯ + Xn] = E[X1] + ⋯ + E[Xn].
Notice that in the last line, we used linearity of expectation. Now we can see why linearity
of expectation is so powerful: combined with indicator variables, it allows us to break up
the expectation of a complicated random variable into the sum of the expectations of simple
random variables. Using our result from the previous section on indicator random variables,
E[Xi ] = P (trial i is a success) = p.
Each term in the sum is simply p, and there are n such terms, so therefore
E[X] = np.
The result should make intuitive sense: if you are conducting n trials, and the probability of
success is p, then you expect a fraction p of the trials to be successes, which is saying that
you expect np total successes. The expectation matches our intuition.

By the way, the random variables Xi are an example of i.i.d. random variables, which is a
term that comes up very frequently (so we might as well define it now): i.i.d. stands for
independent and identically distributed. Indeed, since each trial is independent of each other
by assumption, the variables Xi are independent, although we did not need this fact to com-
pute the expectation. Linearity of expectation is powerful: it holds even when the variables
are not independent! Also, the Xi variables are identically distributed, which means they all
had the same probability distribution: Xi ∼ Ber(p).

A strategy now emerges for tackling complicated expected value questions: when computing
E[X], try to see if you can break down X into the sum of indicator random variables. Then,
computing the expectation becomes much easier because you can take advantage of linearity.
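As an illustration of this strategy, consider the expected number of fixed points of a uniformly random permutation of n items (an example added here for concreteness): each indicator "position i is a fixed point" has expectation 1/n, and although the indicators are dependent, linearity immediately gives an expectation of n · (1/n) = 1. A quick simulation agrees.

import random

def fixed_points(n):
    perm = list(range(n))
    random.shuffle(perm)
    # Sum of indicator variables: 1 if position i maps to itself, else 0.
    return sum(1 for i, x in enumerate(perm) if i == x)

n, trials = 8, 100_000
print(sum(fixed_points(n) for _ in range(trials)) / trials)  # close to 1.0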

2.3.5 Geometric Distribution


The geometric distribution with parameter p, abbreviated Geom(p), describes the number
of trials required to obtain the first success, assuming that each trial is independent and has
a probability of success p. If it takes exactly x trials to obtain the first success, there were
first x − 1 failures (each with probability 1 − p) and one success (with probability p). Hence, the distribution is

P(X = x) = (1 − p)^{x−1} p,    x ∈ Z+.

Prove for yourself that the probabilities sum to 1, i.e.

∑_{x=1}^{∞} (1 − p)^{x−1} p = 1.    (2.14)

When working with the geometric distribution, it is often easier to work with the tail prob-
abilities P (X > x). In order for X > x to hold, there must be at least x failures; hence,

P(X > x) = (1 − p)^x.

Note that the tail probability is related to the CDF in the following way:

P(X > x) = 1 − P(X ≤ x).

The clever way to find the expectation of the geometric distribution uses a method known as
the renewal method. E[X] is the expected number of trials until the first success. Suppose
we carry out the first trial, and one of two outcomes occurs. With probability p, we obtain
a success and we are done (it only took 1 trial until success). With probability 1 p, we
obtain a failure, and we are right back where we started. In the latter case, how many trials
do we expect until our first success? The answer is 1 + E[X]: we have already used one trial,
and we expect E[X] more trials since nothing has changed from our original situation (the
geometric distribution is memoryless). Hence,

E[X] = p · 1 + (1 − p) · (1 + E[X]).

Solving this equation yields

E[X] = 1/p,
which is also intuitive: if we have, say, a 1/100 chance of success on each trial, we would
naturally expect 100 trials until our first success. (Note: if the method above does not seem
rigorous to you, then worry not. We will revisit the method under the framework of condi-
tional expectation in the future.)

Here is a more computational way to obtain the formula. We want to evaluate the sum

E[X] = ∑_{x=1}^{∞} x(1 − p)^{x−1} p = p ∑_{x=0}^{∞} (x + 1)(1 − p)^x.
Recall the following identity (from calculus, and geometric series):

∑_{k=0}^{∞} x^k = 1/(1 − x)    (2.15)

Multiply both sides of the identity by x:

∑_{k=0}^{∞} x^{k+1} = x/(1 − x)

Differentiate both sides with respect to x:

∑_{k=0}^{∞} (k + 1)x^k = 1/(1 − x)^2

Set x = 1 − p to evaluate our original sum:

E[X] = p · 1/(1 − (1 − p))^2 = p · 1/p^2 = 1/p

A third way of finding the expectation is to use the Tail Sum Formula (Theorem 2.6).

We can show that the minimum of independent geometric random variables is geometric:

Theorem 2.8. Let X ∼ Geom(p) and Y ∼ Geom(q) be independent random variables. Then, min{X, Y} ∼ Geom(p + q − pq).

Proof. We can use the tail probabilities to simplify the derivation. If the minimum of X and Y is greater than z, then both X and Y are greater than z. Hence,

P(min{X, Y} > z) = P(X > z, Y > z) = P(X > z)P(Y > z) = (1 − p)^z (1 − q)^z
                 = (1 − (p + q − pq))^z.

We recognize the last expression as the tail probability of a geometric random variable with parameter p + q − pq.
Another way of thinking about the above result is that on each trial, we have a success from
X with probability p and a success from Y with probability q. By the inclusion-exclusion
rule, the probability that we have a success from either X or Y is p + q − pq, and the trials
are independent from each other. Hence, we meet the description for a geometric distribution.
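Here is a simulation sketch of Theorem 2.8 (sampling a geometric variable as the number of Bernoulli trials until the first success; the parameters p = 0.2 and q = 0.3 are arbitrary): the sample mean of min{X, Y} should be close to 1/(p + q − pq) ≈ 2.27.

import random

def geometric(p):
    """Number of independent Bernoulli(p) trials until the first success."""
    x = 1
    while random.random() >= p:
        x += 1
    return x

p, q, trials = 0.2, 0.3, 100_000
samples = [min(geometric(p), geometric(q)) for _ in range(trials)]
r = p + q - p * q                    # parameter predicted by Theorem 2.8
print(sum(samples) / trials, 1 / r)  # both about 2.27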

Example 2.9 (Coupon Collector's Problem). There are n different coupons that you
would like to collect. Every time you buy an item from the store, you receive a random
coupon (each of the n coupons is equally likely to appear). What is the expected number
of items you must buy before you collect every coupon?

Let Ti be the number of items it requires to collect the ith new coupon. In other words, starting from when you have seen i − 1 distinct coupons, Ti represents the additional number of items you must purchase before you see a coupon you have not seen before. T, the total time to collect all coupons, is T = ∑_{i=1}^{n} Ti.

Once we have collected i 1 coupons, there are n i + 1 coupons we have not seen yet,
so the probability that the next item we buy comes with a coupon we have not seen is
(n i + 1)/n. If we regard each object bought as an independent trial, then we see that
Ti Geom(p), where p = (n i + 1)/n. By linearity of expectation,
n n n
X X n X1
E[T ] = E[Ti ] = =n = nHn ,
i=1 i=1
ni+1 i=1
i

where Hn is the nth harmonic sum, Hn = ni=1 i1 . A good approximation to Hn is


P
ln n + , where is the Euler-Mascheroni constant, defined to be

:= lim (Hn ln n). (2.16)


n

has the numerical value 0.577. Using this approximation, E[T ] n(ln n + ).
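A short simulation (not part of the original notes) makes the nHn answer concrete; n = 50 and 20,000 trials are arbitrary illustrative choices.

import random

def collect_all(n):
    # Buy items until every one of the n coupons has appeared at least once.
    seen, purchases = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))   # each coupon equally likely
        purchases += 1
    return purchases

n, trials = 50, 20_000
average = sum(collect_all(n) for _ in range(trials)) / trials
harmonic = sum(1 / i for i in range(1, n + 1))
print(average, "vs", n * harmonic)   # both should be about 225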

2.3.6 Memoryless Property


An important property of the geometric distribution is that it is memoryless, which is to
say that a random variable following the geometric distribution only depends on its current
state and not on its past. To make this notion formal, we shall show:

Theorem 2.10 (Memoryless Property). The geometric distribution satisfies

P (X > s + t | X > s) = P (X > t).

Proof.

P (X > s + t | X > s) = P (X > s + t, X > s)/P (X > s) = P (X > s + t)/P (X > s)
= (1 − p)^{s+t}/(1 − p)^s = (1 − p)^t = P (X > t)

Intuitively, the theorem says: suppose you have already tried flipping a coin s times, without
success. The probability that it takes you at least t more coin flips until your first success is
the same as the probability that your friend picks up a coin and it takes him/her at least t
coin flips. Moral of the story: the geometric distribution does not care how many times you
have already flipped the coin, because it is memoryless.

2.3.7 Negative Binomial Distribution


We can consider a slight generalization of the geometric distribution: if we have independent
trials, with probability of success p, how many trials do we need until we obtain k successes?

Consider the probability that we require x trials: we have k successes and x − k failures,
and the probability of any sequence of k successes and x − k failures is p^k (1 − p)^{x−k}. Now, a
counting argument gives the number of such sequences: the last success must occur on the
xth trial, and there are (x − 1 choose k − 1) ways to distribute the k − 1 remaining successes among the
other trials. Hence, the negative binomial distribution is

P (X = x) = (x − 1 choose k − 1) p^k (1 − p)^{x−k},    x = k, k + 1, . . .

When k = 1, we recover the geometric distribution. To compute the expectation, we use


linearity once again: let Xi be the number of trials it takes to obtain the ith success, starting
after we have already observed i − 1 successes. Then X = Σ_{i=1}^k Xi, where Xi ∼ Geom(p).
Using the expectation of the geometric distribution,

E[X] = Σ_{i=1}^k E[Xi] = k/p.

2.3.8 Poisson Distribution


The Poisson distribution with parameter λ, abbreviated Pois(λ), is an approximation to the
binomial distribution under certain conditions: let the number of trials, n, approach infinity
and the probability of success per trial, p, approach 0, such that the mean E[X] = np remains
a fixed value λ. These assumptions also tell us when the approximation is reasonable: the
probability of success should be low, and the number of trials should be high, such that
the product np is roughly between 1 and 10. Under these conditions, x is typically small
compared to n. The probability distribution is

P (X = x) = (n choose x) p^x (1 − p)^{n−x} = (1/x!) · n!/(n − x)! · p^x (1 − p)^{n−x}
≈ (n^x p^x / x!) (1 − λ/n)^n ≈ λ^x e^{−λ} / x!,

where in the last line, we have used the identity


lim_{n→∞} (1 + x/n)^n = e^x.    (2.17)
The discussion above was meant to motivate the form of the Poisson distribution. We now define
the Poisson distribution:

P (X = x) = e^{−λ} λ^x / x!,    x ∈ ℕ.
Check for yourself that the probabilities sum to 1, i.e.

Σ_{x=0}^∞ e^{−λ} λ^x / x! = 1.    (2.18)

Remember to use the important identity



e^x = Σ_{k=0}^∞ x^k / k!.    (2.19)

(Random tidbit: In many cases, the power series is taken as the definition of ex . The power
series converges everywhere.)

The expectation of the Poisson distribution is, as we would expect, λ. But let's prove it:

E[X] = Σ_{x=0}^∞ x e^{−λ} λ^x / x! = Σ_{x=1}^∞ x e^{−λ} λ^x / x! = λ e^{−λ} Σ_{x=1}^∞ λ^{x−1}/(x − 1)! = λ e^{−λ} e^{λ} = λ

2.3.9 Sums of Poisson Random Variables


Here, we will prove an important fact about the sums of Poisson random variables.

Theorem 2.11 (Sums of Independent Poisson Random Variables). Let X ∼ Pois(λ)
and Y ∼ Pois(μ) be independent random variables. Then

X + Y ∼ Pois(λ + μ).

Proof. We will compute the distribution of X + Y and show that it is Poisson (using
independence).
P (X + Y = z) = Σ_{j=0}^z P (X = j, Y = z − j) = Σ_{j=0}^z (e^{−λ} λ^j / j!)(e^{−μ} μ^{z−j} / (z − j)!)
= (e^{−(λ+μ)} / z!) Σ_{j=0}^z z!/(j!(z − j)!) λ^j μ^{z−j} = (e^{−(λ+μ)} / z!) Σ_{j=0}^z (z choose j) λ^j μ^{z−j}
= e^{−(λ+μ)} (λ + μ)^z / z! = P (Pois(λ + μ) = z)
In the last line, we have used the Binomial Theorem.

Remarks: Linearity of expectation tells us that E[X + Y ] = E[X] + E[Y ] = λ + μ. This
theorem tells us something stronger, namely that the distribution of X + Y remains Poisson,
with parameter λ + μ. By induction, it follows that if Xi ∼ Pois(λi) for i ∈ {1, . . . , n}, then
Σ_{i=1}^n Xi ∼ Pois(Σ_{i=1}^n λi).
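A direct numerical check of Theorem 2.11 (not part of the original notes): convolve two Poisson pmfs and compare against the Pois(λ + μ) pmf; λ = 2 and μ = 3.5 are arbitrary.

from math import exp, factorial

def pois_pmf(lam, k):
    return exp(-lam) * lam**k / factorial(k)

lam, mu = 2.0, 3.5
for z in range(8):
    convolution = sum(pois_pmf(lam, j) * pois_pmf(mu, z - j) for j in range(z + 1))
    print(z, round(convolution, 6), round(pois_pmf(lam + mu, z), 6))  # equal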

2.3.10 Poisson Splitting


Here, we introduce another important property of the Poisson distribution known as the
Poisson splitting property, which will be proven first and motivated later.

Theorem 2.12 (Poisson Splitting). Suppose that X ∼ Pois(λ) and that conditioned on
X = x, Y follows the Bin(x, p) distribution. Then Y ∼ Pois(pλ).

Proof. We use the definition of conditional probability and show that Y has the correct
distribution. Notice that the sum starts from x = y because X ≥ Y.

P (Y = y) = Σ_{x=y}^∞ P (X = x, Y = y) = Σ_{x=y}^∞ P (X = x)P (Y = y | X = x)
= Σ_{x=y}^∞ (e^{−λ} λ^x / x!)(x choose y) p^y (1 − p)^{x−y} = e^{−λ} Σ_{x=y}^∞ (λ^x / x!) · x!/(y!(x − y)!) · p^y (1 − p)^{x−y}
= (e^{−λ} (λp)^y / y!) Σ_{x=y}^∞ (λ(1 − p))^{x−y} / (x − y)! = (e^{−λ} (λp)^y / y!) e^{λ(1−p)}
= e^{−λp} (λp)^y / y! = P (Pois(λp) = y)

As an example: suppose that the number of calls that a calling center receives per hour
is distributed according to a Poisson distribution with mean λ. Furthermore, suppose that
each call that the calling center receives is independently a telemarketer with probability p
(therefore, the distribution of telemarketing calls is binomial, conditioned on the number of
calls received). Then, the number of telemarketing calls that the calling center receives per hour
(unconditionally) follows a Poisson distribution with mean pλ. This property of the Poisson
distribution is also rather intuitive because it says that if a Poisson random variable is thinned
out such that only a fraction p remains, then the resulting distribution remains Poisson.
However, take time to appreciate that the Poisson splitting property is not immediately
obvious without the mathematical proof.
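A simulation sketch of the call-center example (not part of the original notes); λ = 10 and p = 0.3 are arbitrary illustrative values, and the Poisson sampler uses Knuth's multiplication method.

import math, random

def poisson(lam):
    # Knuth's algorithm: multiply uniforms until the product drops below e^(-lam).
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k - 1

lam, p, hours = 10.0, 0.3, 50_000
counts = []
for _ in range(hours):
    calls = poisson(lam)                                   # total calls this hour
    counts.append(sum(1 for _ in range(calls) if random.random() < p))

mean = sum(counts) / hours
frac_zero = sum(1 for c in counts if c == 0) / hours
print(mean, "should be close to", p * lam)                 # about 3
print(frac_zero, "should be close to", math.exp(-p * lam)) # about 0.0498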
Chapter 3

Variance & Inequalities

Previously, we have discussed the expectation of a random variable, which is a measure of the
center of the probability distribution. Today, we discuss the variance, which is a measure of
the spread of the distribution. Variance, in a sense, is a measure of how unpredictable your
results are. We will also cover some important inequalities for bounding tail probabilities.

3.1 Variance
Suppose your friend offers you a choice: you can either accept $1 immediately, or you can
enter in a raffle in which you have a 1/100 chance of winning a $100 payoff. The expected
value of each of these deals is simply $1, but clearly the offers are very different in nature! We
need another measure of the probability distribution that will capture the idea of variability
or risk in a distribution. We are now interested in how often a random variable will
take on values close to the mean.

Perhaps we could study the quantity X − E[X] (the difference between what we expected
and what we actually measured), but we quickly notice a problem:

E[X − E[X]] = E[X] − E[X] = 0    (3.1)

The expectation of this quantity is always 0, no matter the distribution! Every random
variable (except the constant random variable) can take on values above or below the mean
by definition, so studying the average of the differences is not interesting. To address this
problem, we could study |X − E[X]| (thereby making all differences positive), but in prac-
tice, it becomes much harder to analytically solve problems using this quantity. Instead, we
will study a quantity known as the variance of the probability distribution:

Definition 3.1. The variance of a probability distribution is:

var(X) = E[(X − E[X])²]    (3.2)

Remark: We often denote the mean of the probability distribution as μ, and the variance
as σ². We call σ = √var(X) the standard deviation of X. The standard deviation is


useful because it has the same units as X, allowing for easier comparison. As an example, if
X represents the height of an individual, then X would have units of meters, while var(X)
has units of meters2 .

3.1.1 The Computational Formula


The explicit formula for variance is:
var(X) = Σ_x (x − E[X])² P (X = x)    (3.3)

In practice, however, we tend to use the following formula to calculate variance:

Theorem 3.2 (Computational Formula for Variance). The variance of X is:

var(X) = E[X²] − E[X]²    (3.4)

Proof. We use linearity of expectation (note that E[E[X]] = E[X] since E[X] is just a
constant):

E[(X − E[X])²] = E[X² − 2XE[X] + E[X]²] = E[X²] − 2E[X]E[X] + E[X]²
= E[X²] − E[X]²

This formula will be extremely useful to us throughout the course, so please memorize it!
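A minimal Python sketch (not part of the original notes) showing that the definition and the computational formula agree; the pmf here is a small made-up example.

pmf = {0: 0.2, 1: 0.5, 3: 0.3}   # X takes value x with probability pmf[x]

mean = sum(x * p for x, p in pmf.items())
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())
second_moment = sum(x ** 2 * p for x, p in pmf.items())
var_computational = second_moment - mean ** 2
print(var_definition, var_computational)   # identical (up to rounding)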

3.1.2 Properties of the Variance


We next examine some useful properties of the variance.

Theorem 3.3 (Properties of the Variance). Let X be a random variable, a R be a


constant, and c be the constant random variable. Then:

var(aX + c) = a2 var(X) (3.5)

Proof. We use the computational formula:

var(aX + c) = E[(aX + c)²] − E[aX + c]² = E[a²X² + 2acX + c²] − (aE[X] + c)²
= a²E[X²] + 2acE[X] + c² − a²E[X]² − 2acE[X] − c²
= a²(E[X²] − E[X]²) = a² var(X)

Observe that adding a constant does not change the variance. Intuitively, adding a constant
shifts the distribution to the left or right, but does not affect its shape (and therefore its
spread). On the other hand, scaling by a constant does scale the variance.
CHAPTER 3. VARIANCE & INEQUALITIES 32

Corollary 3.4 (Scaling of the Standard Deviation). Let X be a random variable, a R


be a constant, and c be the constant random variable. Then:

σ_{aX+c} = |a| σ_X    (3.6)

Proof. The corollary follows immediately from taking the square root of the result in
Theorem 3.3.

We saw that linearity of expectation was an extremely powerful tool for computing expec-
tation values. We would like to have a similar property hold for variance, but additivity of
variance does not hold in general. However, we have the following useful theorem:

Theorem 3.5 (Variance of Sums of Random Variables). Let X and Y be random vari-
ables. Then:

var(X + Y ) = var(X) + var(Y ) + 2(E[XY ] − E[X]E[Y ])    (3.7)

Proof. As always, we start with the computational formula for variance.

var(X + Y ) = E[(X + Y )²] − E[X + Y ]² = E[X² + 2XY + Y²] − (E[X] + E[Y ])²
= E[X²] + 2E[XY ] + E[Y²] − E[X]² − 2E[X]E[Y ] − E[Y ]²
= (E[X²] − E[X]²) + (E[Y²] − E[Y ]²) + 2(E[XY ] − E[X]E[Y ])
= var(X) + var(Y ) + 2(E[XY ] − E[X]E[Y ])

We will reveal the importance of the term E[XY ] E[X]E[Y ] later in the course. For now,
we are more interested in the corollary:

Corollary 3.6 (Variance of Independent Random Variables). Let X and Y be indepen-


dent. Then:
var(X + Y ) = var(X) + var(Y ) (3.8)

Proof. When X and Y are independent, E[XY ] − E[X]E[Y ] = 0 according to Theo-
rem 2.5.

The general case is proven by induction. If X1 , . . . , Xn are pairwise independent random


variables, then
var(X1 + · · · + Xn) = var(X1) + · · · + var(Xn).    (3.9)
Remark: Observe that the assumption of pairwise independence can be replaced by the
condition that X1 , . . . , Xn are pairwise uncorrelated, which means E[Xi Xj ] = E[Xi ]E[Xj ]
for every pair (i, j).

3.2 Probability Distributions Revisited


We will revisit the probability distributions introduced last time and proceed to calculate
their variances.

3.2.1 Uniform Distribution


Recall that if X ∼ Unif{1, . . . , n},

E[X] = (n + 1)/2.

We compute:

E[X²] = Σ_{x=1}^n x² · (1/n) = (1/n) Σ_{x=1}^n x² = (1/n) · n(n + 1)(2n + 1)/6 = (n + 1)(2n + 1)/6,

where we have used the identity (verified using induction)

Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6.    (3.10)

The variance is calculated to be (after a little algebra):

var(X) = E[X²] − E[X]² = (n + 1)(2n + 1)/6 − (n + 1)²/4 = (n² − 1)/12

3.2.2 Bernoulli Distribution & Indicator Random Variables


Recall that if X ∼ Ber(p),

E[X] = p.

Additionally, recall that Bernoulli random variables satisfy the important property X² = X.
Hence, we have

E[X²] = E[X] = p.

The variance of a Bernoulli random variable is

var(X) = E[X²] − E[X]² = p − p² = p(1 − p).

In the special case of indicator random variables, p = P (A) and:

var(1_A) = P (A)(1 − P (A))    (3.11)

Observe that p(1 − p) is maximized when p = 1/2, and it attains a maximum value of 1/4
(to convince yourself, draw the parabola). We will make use of this observation later.

3.2.3 Binomial Distribution


Recall that if X ∼ Bin(n, p), X is the sum of i.i.d. Ber(p) random variables:

X = X1 + · · · + Xn

Hence,

var(X) = var(X1 + · · · + Xn) = var(X1) + · · · + var(Xn),

where we have used the independence of the indicator random variables to apply Corol-
lary 3.6. Since each indicator random variable follows the Ber(p) distribution,

var(X) = n var(Xi) = np(1 − p).

3.2.4 Computing the Variance of Dependent Indicators


Unlike when we computed the mean of the binomial distribution (which did not require any
assumptions except that the binomial distribution could be written as the sum of indica-
tors), the calculation of the variance of the binomial distribution relied on a crucial fact:
the indicator variables were independent. In this section, we outline a general method for
computing the variance of random variables which can be written as the sum of indicators,
even when the indicators are not independent.

Let X be written as the sum of identically distributed indicators, which are not assumed to
be independent:
X = 1_{A_1} + · · · + 1_{A_n}
We first note that the expectation is easy, thanks to linearity of expectation (which holds
regardless of whether the indicator random variables are independent or not):
E[X] = Σ_{i=1}^n E[1_{A_i}] = Σ_{i=1}^n P (Ai)    (3.12)

Using the fact that the indicator variables are identically distributed:

E[X] = nP (Ai ) (3.13)

Next, we compute E[X 2 ]:


E[X²] = E[(1_{A_1} + · · · + 1_{A_n})²]

Observe that the square (1_{A_1} + · · · + 1_{A_n})² has two types of terms:

1. There are like-terms, such as 1²_{A_1} and 1²_{A_3}. However, we know from the properties of
indicators that 1²_{A_i} = 1_{A_i}. There are n of these terms in total:

Σ_{i=1}^n 1²_{A_i} = Σ_{i=1}^n 1_{A_i} = 1_{A_1} + · · · + 1_{A_n} = X    (3.14)

2. Then, there are cross-terms, such as 1_{A_2}1_{A_4} and 1_{A_1}1_{A_2}. There are n² total terms in the
square, and n of those terms are like-terms, which leaves n² − n = n(n − 1) cross-terms.
We usually write the sum:

Σ_{i≠j} 1_{A_i}1_{A_j} = 1_{A_1}1_{A_2} + · · · + 1_{A_{n−1}}1_{A_n}

We can discover more about the cross-terms by examining their meaning. Consider the term
1Ai 1Aj : it is the product of two indicators. Each indicator 1Ai is either 0 or 1; therefore, their
product is also 0 or 1, which suggests that the product is also an indicator! The product is
1 if and only if each indicator is 1, which in the language of probability is expressed as

P (1_{A_i}1_{A_j} = 1) = P (1_{A_i} = 1, 1_{A_j} = 1) = P (Ai ∩ Aj).    (3.15)

We have arrived at a crucial fact: the product of two indicators 1Ai and 1Aj is itself an
indicator for the event Ai ∩ Aj. Therefore, we can rewrite the sum:

Σ_{i≠j} 1_{A_i}1_{A_j} = Σ_{i≠j} 1_{A_i ∩ A_j}    (3.16)

Putting it together, we have that


X² = (Σ_{i=1}^n 1_{A_i})² = X + Σ_{i≠j} 1_{A_i ∩ A_j}.    (3.17)

The expectation of the square is


E[X²] = E[X] + Σ_{i≠j} E[1_{A_i ∩ A_j}] = nP (Ai) + n(n − 1)P (Ai ∩ Aj).    (3.18)

(For simplicity, we made the assumption that all of the intersection probabilities P (Ai ∩ Aj)
are the same.) Finally, the variance is E[X²] − E[X]², or:

var(X) = nP (Ai) + n(n − 1)P (Ai ∩ Aj) − n²(P (Ai))²    (3.19)

Although the resulting formula looks rather complicated, it is a remarkably powerful demon-
stration of the techniques we have developed so far. The path we have taken is an amusing
one: when the indicators are not independent, additivity of variance fails to hold, so the tool
we ended up relying on was... linearity of expectation and indicators!
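To make formula (3.19) concrete, here is a hedged Python sketch (not part of the original notes) for a hypothetical scenario: deal m cards from a standard 52-card deck without replacement and let X be the number of hearts, with Ai the event that the ith card dealt is a heart.

import random

m = 5
p_single = 13 / 52                      # P(A_i)
p_pair = (13 * 12) / (52 * 51)          # P(A_i ∩ A_j), i != j
var_formula = m * p_single + m * (m - 1) * p_pair - (m * p_single) ** 2

deck = [1] * 13 + [0] * 39              # 1 = heart, 0 = anything else
samples = [sum(random.sample(deck, m)) for _ in range(100_000)]
mean = sum(samples) / len(samples)
var_mc = sum((x - mean) ** 2 for x in samples) / len(samples)
print(var_formula, var_mc)              # both about 0.864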

3.2.5 Geometric Distribution


Next, we compute the variance of the geometric distribution. (It will mostly be an exercise
in manipulating complicated series, but we include it for completeness.)

Recall that if X ∼ Geom(p),

E[X] = 1/p.

We must compute

E[X²] = p Σ_{x=1}^∞ x²(1 − p)^{x−1}.

Recall the formula


Σ_{k=1}^∞ kx^{k−1} = 1/(1 − x)².    (3.20)
Shifting the index k by 1 yields the equivalent formula

Σ_{k=0}^∞ (k + 1)x^k = 1/(1 − x)².

Differentiating this equation with respect to x yields



Σ_{k=1}^∞ k(k + 1)x^{k−1} = 2/(1 − x)³.    (3.21)

Subtracting (3.20) from (3.21) yields



Σ_{k=1}^∞ k²x^{k−1} = 2/(1 − x)³ − 1/(1 − x)² = (1 + x)/(1 − x)³.

Setting x = 1 − p yields

Σ_{k=1}^∞ k²(1 − p)^{k−1} = (2 − p)/p³.

Finally, we obtain

E[X²] = p · (2 − p)/p³ = (2 − p)/p².

The variance is computed to be

var(X) = (2 − p)/p² − 1/p² = (1 − p)/p².

3.2.6 Negative Binomial Distribution


Recall that the negative binomial distribution is the number of trials until the kth success,
where each trial is independent with probability of success p. To calculate the expectation,
we wrote X = Σ_{i=1}^k Xi, where Xi is the number of trials it takes to observe the ith success,
starting from i − 1 successes. Each Xi is an independent Geom(p) random variable, so we can
write the variance of X as the sum of the variances:

var(X) = Σ_{i=1}^k var(Xi) = k(1 − p)/p²

3.2.7 Poisson Distribution


Once again, computing the variance of the Poisson distribution will be an exercise in ma-
nipulating complicated sums (with one clever trick). The result, however, is extremely useful.

Recall that if X ∼ Pois(λ),

E[X] = λ.

We will proceed to calculate E[X(X − 1)] instead.

E[X(X − 1)] = Σ_{x=2}^∞ x(x − 1) e^{−λ} λ^x / x! = λ² e^{−λ} Σ_{x=2}^∞ λ^{x−2}/(x − 2)! = λ² e^{−λ} Σ_{x=0}^∞ λ^x/x! = λ² e^{−λ} e^{λ} = λ²

Hence, by linearity of expectation,

var(X) = E[X²] − E[X]² = E[X(X − 1)] + E[X] − E[X]² = λ² + λ − λ² = λ.

We have arrived at what may be considered a surprising result:

var(X) = λ

3.3 Inequalities
Often, probability distributions can be difficult to compute exactly, so we will cover a few
important bounds.

3.3.1 Markov's Inequality


The following inequality is quite flexible because it can be applied with any increasing func-
tion f . The only information we require is E[f (X)].

Theorem 3.7 (Markov's Inequality). Let X be a random variable, f be an increasing,
non-negative function, and a ∈ R such that f (a) ≠ 0. Then:

P (X ≥ a) ≤ E[f (X)]/f (a)    (3.22)

Proof. Let 1_{X≥a} be the indicator that X ≥ a and define h(X) = f (a)1_{X≥a}. We claim
that h(X) ≤ f (X) always:

1. If X < a, then 1_{X≥a} = 0, so h(X) = 0 ≤ f (X) (since f is non-negative).

2. If X ≥ a, then h(X) = f (a) ≤ f (X) (since f is increasing).



Then, we have

E[f (X)] ≥ E[h(X)] = E[f (a)1_{X≥a}] = f (a)E[1_{X≥a}] = f (a)P (X ≥ a).

Corollary 3.8 (Weak Markov's Inequality). Let X be a non-negative random variable
and a > 0. Then
P (X ≥ a) ≤ E[X]/a.    (3.23)

Proof. The proof is immediate because f (x) = x is an increasing function.

Example 3.9. Here is a quick example showing that Markov's inequality is tight. Let X
take on the value a with probability p, and 0 otherwise. Then E[X] = ap and Markov's
inequality gives P (X ≥ a) ≤ p, which is tight.

3.3.2 Chebyshev's Inequality


We will use Markovs Inequality to derive another inequality which uses the variance to
bound the probability distribution. Chebyshevs Inequality is useful for deriving confidence
intervals and estimating sample sizes.

Theorem 3.10 (Chebyshev's Inequality). Let X be a random variable and a > 0. Then:

P (|X − E[X]| ≥ a) ≤ var(X)/a²    (3.24)

Proof. Let Y = |X − E[X]| and f (y) = y². Since f is increasing on the non-negative
values that Y takes, apply Markov's Inequality:

P (Y ≥ a) ≤ E[Y²]/a²

Note, however, that E[Y²] = E[|X − E[X]|²] = E[(X − E[X])²] = var(X). This com-
pletes the proof.

Corollary 3.11 (Another Look at Chebyshev's Inequality). Let X be a random variable
with standard deviation σ and k > 0. Then:

P (|X − E[X]| ≥ kσ) ≤ 1/k²    (3.25)

Proof. Set a = kσ in Chebyshev's Inequality.



Notably, Chebyshev's Inequality justifies why we call the variance a measure of the spread
of the distribution. The probability that X lies more than k standard deviations away from
the mean is bounded by 1/k², which is to say that a larger standard deviation means X is
more likely to be found away from its mean, while a low standard deviation means X will
remain fairly close to its mean.

3.3.3 Cauchy-Schwarz Inequality


The next inequality is also quite useful in certain situations, and the proof is really cute.

Theorem 3.12 (Cauchy-Schwarz Inequality). We have

|E[XY ]| ≤ √(E[X²]E[Y²])    (3.26)

provided that E[X²] < ∞ and E[Y²] < ∞.

Proof. Consider

0 ≤ E[(X + xY )²] = E[Y²] x² + 2E[XY ] x + E[X²],

with coefficients a = E[Y²] > 0, b = 2E[XY ], and c = E[X²].

Observe that the above expression is a quadratic in the variable x, which is
always non-negative. Visualize the parabola: the parabola is always non-negative, which
means it has either 0 or 1 root. This is equivalent to the condition b² ≤ 4ac, or

4E[XY ]² ≤ 4E[X²]E[Y²].

Dividing by 4 and taking square roots yields the desired inequality.

3.4 Weak Law of Large Numbers


Now, we can justify why the expectation is called the long-run average of a sequence of
values. Suppose that X1 , . . . , Xn are i.i.d. random variables, which we can think of as
successive measurements of a true variable X. The idea is that X is some quantity which
we wish to measure, and X follows some probability distribution with unknown parameters:
mean μ and variance σ². Each random variable Xi is a measurement of X, which is to say
that each Xi follows the same probability distribution as X. In particular, this means that
each Xi also has mean μ and variance σ². We are interested in the average of the samples
we collect:

X̄ := (X1 + · · · + Xn)/n    (3.27)
What is the expectation of X̄? Since we would like to measure the parameter μ, we are
hoping that E[X̄] = μ. (In other words, we are hoping that the average of our successive

measurements Xi will be close to the true parameter μ.) We can use linearity of expectation
to quickly check that this holds:

E[X̄] = (1/n)(E[X1] + · · · + E[Xn]) = (1/n) · nμ = μ

We therefore call X̄ an unbiased estimator of μ.

The next question to ask is: on average, we expect X̄ to estimate μ. But for a given
experiment, how close to μ do we expect X̄ to be? How long will it take for X̄ to converge
to its mean, μ? These are questions that involve the variance of the distribution. First, let
us compute

var(X̄) = (1/n²)(var(X1) + · · · + var(Xn)) = (1/n²) · nσ² = σ²/n.    (3.28)

(Notice that the answer is not σ²!) In calculating our answer, we have used the scaling
property of the variance and the assumption of independence. The dependence of σ_X̄, the
standard deviation of X̄, is

σ_X̄ ∝ 1/√n,    (3.29)

a dependence that is well-worth remembering. In particular, this result states that the more
samples we collect, the smaller our standard deviation becomes! This result is what allows
the scientific method to work: without it, gathering more samples would not make us any
more certain of our results. We next state and prove a famous result, which shows that X̄
converges to its expected value after enough samples are drawn.

Theorem 3.13 (Weak Law of Large Numbers). For all ε > 0, in the limit as n → ∞,

P (|X̄ − μ| ≥ ε) → 0.    (3.30)

Proof. We will use Chebyshev's Inequality:

P (|X̄ − μ| ≥ ε) ≤ var(X̄)/ε²

Filling in what we know about var(X̄), we have

P (|X̄ − μ| ≥ ε) ≤ σ²/(nε²),

which tends to 0 as n → ∞.
Intuitively, the theorem is asking: what is the probability that X̄ is ε-far away from μ? The
answer is: as n → ∞, the probability becomes 0. Increasing the number of samples decreases
the probability that the sample average will be far from the true average. When the WLLN
holds, we say that X̄ converges in probability to μ.

3.5 Confidence Intervals


Example 3.14. We will describe how Chebyshev's Inequality can be used to derive
a confidence interval. Let X1, . . . , Xn be i.i.d. with mean μ and variance σ². In the
notation of the preceding section, we will prove

P (μ ∈ (X̄ − a, X̄ + a)) ≥ 0.95    (3.31)

where a > 0 is a constant. The interpretation of the statement is that we collect n
samples and compute the sample average X̄. With 95% confidence, we believe that the
true mean μ lies in the interval (X̄ − a, X̄ + a).

We use Chebyshev's Inequality:

P (μ ∈ (X̄ − a, X̄ + a)) = P (|X̄ − μ| < a) = 1 − P (|X̄ − μ| ≥ a)
≥ 1 − var(X̄)/a² = 1 − σ²/(na²)

If we set our confidence level to be 0.95, then we can solve for a.

1 − σ²/(na²) = 0.95  ⟹  a = σ √(1/(0.05n))

Therefore, our confidence interval is (X̄ − 4.472σ/√n, X̄ + 4.472σ/√n).

Example 3.15. As a specialization of the previous example, we will consider the prob-
lem of estimating the bias of a coin. A coin is biased with P (H) = p, and our goal is to
find a 95% confidence interval for p. How can we construct a confidence interval here?

Flip the coin n times and let Xi be the indicator that the ith flip came up heads. Observe
that in this case, μ = E[Xi] = p and σ² = var(Xi) = p(1 − p), since we are in the setting
of Xi ∼ Ber(p). Now, we can apply our bound from the previous example:

a = √(p(1 − p)/(0.05n)) ≤ (1/2) √(1/(0.05n)) = 2.236/√n,

where we have used p(1 − p) ≤ 1/4. Hence, our 95% confidence interval for the bias p is
(p̂ − 2.236/√n, p̂ + 2.236/√n), where p̂ = X̄ is the observed fraction of heads.
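A minimal Python sketch (not part of the original notes) of this interval in action; the true bias 0.6 and n = 10,000 flips are arbitrary illustrative choices.

import random

p_true, n = 0.6, 10_000
heads = sum(1 for _ in range(n) if random.random() < p_true)
p_hat = heads / n
half_width = 2.236 / n ** 0.5
print((p_hat - half_width, p_hat + half_width))   # should contain 0.6 (half-width ~0.022)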

3.6 Bonus: Chernoff Bounds


Consider what happens when we apply Markov's Inequality to the increasing non-negative
function f (x) = e^{θx} (for θ > 0). We obtain the statement:

P (X ≥ x) ≤ E[e^{θX}]/e^{θx}    (3.32)

What value of θ should we choose? The obvious answer is: the best possible one! In other
words, we can optimize over the values of θ in search of the best possible bound. This is
known as a Chernoff bound and it can be used to prove that the tail probabilities of certain
distributions decay exponentially fast. Let's take a look at an example:

Example 3.16. We will bound the tail probability of a Poisson distribution. Consider
X ∼ Pois(λ). First, we compute

E[e^{θX}] = Σ_{x=0}^∞ e^{θx} e^{−λ} λ^x / x! = e^{−λ} Σ_{x=0}^∞ (λe^θ)^x / x! = e^{−λ} e^{λe^θ} = e^{λ(e^θ − 1)}.

The Chernoff bound gives

P (X ≥ x) ≤ e^{λ(e^θ − 1)} / e^{θx} = e^{−λ} e^{λe^θ − θx}.    (3.33)

To optimize this bound, we differentiate with respect to θ and we set the result to 0.

0 = (d/dθ) e^{−λ} e^{λe^θ − θx} = e^{−λ} e^{λe^θ − θx} (λe^θ − x).

We can see that the optimal value of θ is given by e^θ = x/λ. Plug this into (3.33) to
obtain the bound

P (X ≥ x) ≤ e^{−λ} e^{x − x ln(x/λ)} = exp(−λ − x ln(x/λ) + x).

Consider the special case where λ = 1 and x > 1. Then

P (X ≥ x) ≤ x^{−x} e^{x − 1},

which does indeed decrease exponentially fast.
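A quick Python sketch (not part of the original notes) comparing the optimized Chernoff bound against the exact Poisson tail; λ = 1 is the special case discussed above.

from math import exp, log, factorial

lam = 1.0
for x in range(2, 8):
    exact = 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(x))
    chernoff = exp(-lam - x * log(x / lam) + x)
    print(x, exact, chernoff)   # the bound always sits above the exact tail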


Chapter 4

Regression & Conditional Expectation

A fundamental question in statistics is: given a set of data points {(Xi , Yi )}, how can we
estimate the value of Y as a function of X? First, we will discuss the covariance and corre-
lation of two random variables, and use these quantities to derive the best linear estimator
of Y , known as the LLSE. Then, we will define the conditional expectation, and proceed
to derive the best general estimator of Y , known as the MMSE. The notion of conditional
expectation will turn out to be an immensely useful concept with further applications.

4.1 Covariance
We have already discussed the case of independent random variables, but many of the vari-
ables in real life are dependent upon each other, such as the height and weight of an indi-
vidual. Now we will consider how to quantify the dependence of random variables, starting
with the definition of covariance.

Definition 4.1. The covariance of two random variables X and Y is defined as:

cov(X, Y ) := E[(X − E[X])(Y − E[Y ])]    (4.1)

The covariance is the expected product of the deviations of the two variables from their respective
means. Suppose that whenever X is larger than its mean, Y is also larger than its mean;
then, the covariance will be positive, and we say the variables are positively correlated. On
the other hand, if whenever X is larger than its mean, Y is smaller than its mean, then
the covariance is negative and we say that the variables are negatively correlated. In other
words: positive correlation means that X and Y tend to fluctuate in the same direction, while
negative correlation means that X and Y tend to fluctuate in opposite directions.

Recall that we said that two events A and B are


1. positively correlated if P (A ∩ B) > P (A)P (B),
2. independent if P (A ∩ B) = P (A)P (B),
3. negatively correlated if P (A ∩ B) < P (A)P (B).


Let us connect the correlation of events with the correlation of random variables. Let 1A
and 1_B be the indicators for events A and B respectively. Note that E[1_A 1_B] = P (A ∩ B),
E[1A ] = P (A), and E[1B ] = P (B).

1. If A and B are positively correlated, then E[1A 1B ] > E[1A ]E[1B ], so cov(1A , 1B ) > 0.

2. If A and B are independent, then E[1A 1B ] = E[1A ]E[1B ], so cov(1A , 1B ) = 0. (Caution:


In this case, 1_A and 1_B are independent, but in general, cov(X, Y ) = 0 only means that X
and Y are uncorrelated, which is weaker than independence. We will see an example
illustrating this below.)

3. If A and B are negatively correlated, then E[1A 1B ] < E[1A ]E[1B ], so cov(1A , 1B ) < 0.

The three cases above demonstrate that correlation of events corresponds to correlation of
the corresponding indicator random variables.

Just as we had a computational formula for variance, we have a computational formula for
covariance.

Theorem 4.2 (Computational Formula for Covariance). Let X and Y be random vari-
ables. Then:
cov(X, Y ) = E[XY ] − E[X]E[Y ]    (4.2)

Proof. We take the definition of covariance and expand it out:

cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
= E[XY − XE[Y ] − E[X]Y + E[X]E[Y ]]
= E[XY ] − E[X]E[Y ] − E[X]E[Y ] + E[X]E[Y ]
= E[XY ] − E[X]E[Y ]

Corollary 4.3 (Covariance of Independent Random Variables). Let X and Y be inde-


pendent random variables. Then we have cov(X, Y ) = 0.

Proof. By the assumption of independence, we have E[XY ] = E[X]E[Y ]; the corollary


follows immediately.

Example 4.4. Pick a point uniformly randomly from {(1, 0), (0, 1), (−1, 0), (0, −1)}.
Let X be the x-coordinate and Y be the y-coordinate of the point that we choose.
Observe that XY = 0 always, so E[XY ] = 0. By symmetry about the origin, we have
E[X] = E[Y ] = 0. Hence, cov(X, Y ) = E[XY ] − E[X]E[Y ] = 0. However, X and Y are

not independent. In particular,

0 = P (X = 0, Y = 0) ≠ P (X = 0)P (Y = 0) = 1/4.
Important: Corollary 4.3 shows that independence of X and Y implies cov(X, Y ) = 0.
Example 4.4 shows that the converse is not true, i.e. cov(X, Y ) = 0 is not sufficient to say
that X and Y are independent. Study the above example carefully!

Next, we will show how the covariance and variance are related.

Corollary 4.5 (Covariance & Variance). Let X be a random variable. Then:

var(X) = cov(X, X) (4.3)

Proof. The proof is straightforward.

cov(X, X) = E[X · X] − E[X]E[X] = E[X²] − E[X]² = var(X)

Corollary 4.5 is not very useful for calculating var(X), but it shows that variance can be
seen as a special case of covariance.

Corollary 4.6 (Variance of Sums of Random Variables). Let X and Y be random vari-
ables. Then:
var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y ) (4.4)

Proof. The majority of the work for this proof was completed in the previous set of
notes, in which we found

var(X + Y ) = var(X) + var(Y ) + 2(E[XY ] − E[X]E[Y ]).

We can now recognize E[XY ] − E[X]E[Y ] as cov(X, Y ).

The utility of the last result is that it holds true for any random variables, even ones that
are not independent.

4.1.1 Bilinearity of Covariance


Next, we show that the covariance is bilinear, that is, linear in each of its arguments.

Theorem 4.7 (Bilinearity of Covariance). Suppose Xi , Yi are random variables, and


a R is a constant. Then the following properties hold:

cov(X1 + X2 , Y ) = cov(X1 , Y ) + cov(X2 , Y ) (4.5)


CHAPTER 4. REGRESSION & CONDITIONAL EXPECTATION 46

cov(X, Y1 + Y2 ) = cov(X, Y1 ) + cov(X, Y2 ) (4.6)


cov(aX, Y ) = a cov(X, Y ) = cov(X, aY ) (4.7)

Proof. The proofs are straightforward using linearity of expectation. To prove (4.5),

cov(X1 + X2, Y ) = E[(X1 + X2)Y ] − E[X1 + X2]E[Y ]
= E[X1Y ] − E[X1]E[Y ] + E[X2Y ] − E[X2]E[Y ]
= cov(X1, Y ) + cov(X2, Y ).

Similarly, to prove (4.6),

cov(X, Y1 + Y2) = E[X(Y1 + Y2)] − E[X]E[Y1 + Y2]
= E[XY1] − E[X]E[Y1] + E[XY2] − E[X]E[Y2]
= cov(X, Y1) + cov(X, Y2).

Finally, to prove (4.7),

cov(aX, Y ) = E[aXY ] − E[aX]E[Y ] = a(E[XY ] − E[X]E[Y ]) = a cov(X, Y ),
cov(X, aY ) = E[XaY ] − E[X]E[aY ] = a(E[XY ] − E[X]E[Y ]) = a cov(X, Y ).

4.1.2 Standardized Variables


Sometimes, we would like to write random variables in a standard form for easier computa-
tions. Standard form means zero mean and unit variance.

Definition 4.8. Let X be a non-constant random variable. Then

X* := (X − E[X])/σ_X    (4.8)

is called the standard form of X. In statistics books, X* is also called the z-score of
X.

Next, we explain why X* is called the standard form.

Theorem 4.9 (Mean & Variance of Standardized Random Variables). Let X* be the
standard form of X. Then the mean and variance of X* are:

E[X*] = 0    (4.9)

var(X*) = E[(X*)²] = 1    (4.10)


CHAPTER 4. REGRESSION & CONDITIONAL EXPECTATION 47

Proof. First, we prove that the mean is 0 using linearity of expectation.

E[X*] = (E[X] − E[X])/σ_X = 0

Next, since E[X*] = 0, then var(X*) = E[(X*)²] − E[X*]² = E[(X*)²]. Using the
properties of variance,

var(X*) = var((X − E[X])/σ_X) = var(X)/σ_X² = var(X)/var(X) = 1.

Standardizing the random variable X is equivalent to shifting the distribution so that its
mean is 0, and scaling the distribution so that its standard deviation is 1. The random
variable X* is rather convenient because it is dimensionless: for example, if X and Y repre-
sent measurements of the temperature of a system in degrees Fahrenheit and degrees Celsius
respectively, then X* = Y*.

4.1.3 Correlation
We next take a slight detour in order to define the correlation of two random variables, which
appears frequently in statistics. Although we will not use correlation extensively, the expo-
sition to correlation presented here should allow you to interpret the meaning of correlation
in journals.

Definition 4.10 (Correlation). The correlation of two random variables X and Y is:

Corr(X, Y ) := cov(X, Y )/(σ_X σ_Y)    (4.11)

The correlation is often denoted by ρ or r, and is sometimes referred to as Pearson's
correlation coefficient.
The next result provides an interpretation of the correlation.

Theorem 4.11 (Covariance & Correlation). The correlation of two random variables
X and Y is:
Corr(X, Y ) = cov(X*, Y*) = E[X*Y*]    (4.12)

Proof. We calculate the covariance of X* and Y* using the properties of covariance.

cov(X*, Y*) = cov((X − E[X])/σ_X, (Y − E[Y ])/σ_Y) = cov(X, Y )/(σ_X σ_Y)

(Remember that the covariance of a constant with a random variable is 0.) The sec-
ond equality follows easily because cov(X*, Y*) = E[X*Y*] − E[X*]E[Y*] and we have

E[X*] = E[Y*] = 0.

The result states that we can view the correlation as a standardized version of the covariance.
As a result, we can also prove a result about the possible values of the correlation:

Theorem 4.12 (Magnitude of the Correlation). If X, Y are non-constant random vari-


ables, then:
−1 ≤ Corr(X, Y ) ≤ 1    (4.13)

Proof. The expectation of the random variable (X* ∓ Y*)² is non-negative; hence

0 ≤ E[(X* ∓ Y*)²] = E[(X*)²] ∓ 2E[X*Y*] + E[(Y*)²] = 2 ∓ 2E[X*Y*].

We have the inequality

±Corr(X, Y ) = ±E[X*Y*] ≤ 1

and the result follows by considering the two cases.

Corollary 4.13 (Correlations of ±1). Let X and Y be non-constant random variables
and suppose that Corr(X, Y ) = 1 or Corr(X, Y ) = −1. Then Y is a linear function of
X.

Proof. From the above proof, Corr(X, Y ) = ±1 if and only if 0 = E[(X* ∓ Y*)²], which
can only happen if X* ∓ Y* = 0. This implies that Y* = ±X*, which is true if and only
if Y = aX + b for constants a, b ∈ R. (What are the constants a and b?)

Now, we can see that correlation is a useful measure of the degree of linear dependence
between two variables X and Y. If Corr(X, Y ) is 1 or −1, then X and Y are perfectly linearly
correlated, i.e. a plot of Y versus X would be a straight line. The closer the correlation is to
±1, the closer the data resembles a straight-line relationship. If X and Y are independent,
then the correlation is 0 (but the converse is not true). As a final remark, the square of
the correlation coefficient is called the coefficient of determination (usually denoted R²).
The coefficient of determination appears frequently next to best-fit lines on scatter plots as
a measure of how well the best-fit line fits the data.

4.2 LLSE
We will immediately apply our development of the covariance to the problem of finding the
best linear predictor of Y given X. We begin by presenting the main result, and then proceed
to prove that the result satisfies the properties we desire. 1

¹ The material for the sections on regression relies heavily on Professor Walrand's notes, although I have
inserted my own interpretation of the material wherever appropriate.

Definition 4.14 (Least Linear Squares Estimate). Let X and Y be random variables.
The least linear squares estimate (LLSE) of Y given X is defined as:

L(Y | X) := E[Y ] + (cov(X, Y )/var(X)) (X − E[X])    (4.14)

Observe that the LLSE is a random variable: in fact, it is a function of X.
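A hedged Python sketch (not part of the original notes): plugging sample means, sample covariance, and sample variance into (4.14) recovers the familiar least-squares regression line. The data here are hypothetical, generated as Y = 2X + 1 plus noise.

import random

xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)

slope = cov_xy / var_x                 # cov(X, Y)/var(X)
intercept = mean_y - slope * mean_x    # E[Y] - slope * E[X]
print(slope, intercept)                # close to 2 and 1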

4.2.1 Projection Property


Theorem 4.15 (Projection Property of LLSE). The LLSE satisfies:

E[Y − L(Y | X)] = 0    (4.15)

E[(Y − L(Y | X))X] = 0    (4.16)

Proof. The proofs are actually relatively straightforward using linearity. Proof of (4.15):

E[Y − L(Y | X)] = E[Y − E[Y ] − (cov(X, Y )/var(X))(X − E[X])]
= E[Y ] − E[Y ] − (cov(X, Y )/var(X))(E[X] − E[X]) = 0

Proof of (4.16):

E[(Y − L(Y | X))X] = E[X(Y − E[Y ] − (cov(X, Y )/var(X))(X − E[X]))]
= E[XY ] − E[X]E[Y ] − (cov(X, Y )/var(X))(E[X²] − E[X]²)
= cov(X, Y ) − (cov(X, Y )/var(X)) var(X) = 0

If you have not studied linear algebra, then the rest of the section can be safely skipped.
Linear algebra is not necessary to understand the properties of the LLSE, although linear
algebra certainly enriches the theory of linear regression. We discuss linear algebra concepts
solely to motivate the Projection Property.

Given a probability space Ω, the space of random variables over Ω is a vector space (that
is, random variables satisfy the vector space axioms). Indeed, we have already introduced
how to add and scalar-multiply random variables. Specifically, since random variables are
functions, the vector space of random variables is the vector space of functions X : Ω → R,
which is also called the free space of R over the set Ω (denoted R⟨Ω⟩).
CHAPTER 4. REGRESSION & CONDITIONAL EXPECTATION 50

The map ⟨·, ·⟩ : V × V → R given by ⟨X, Y⟩ := E[XY ] is an inner product, which makes
V into an inner product space (of course, the axioms of an inner product space must also
be verified). Therefore, we can say that two random variables X and Y are orthogonal
if E[XY ] = 0. According to the Projection Property, we see that the random variable
Y − L(Y | X) is orthogonal to both the constant random variable 1 and the random variable
X. Hence, Y − L(Y | X) is orthogonal to the plane L(X) := span{1, X}. Note that L(X)
is the subspace of V containing all linear functions aX + b of X, and L(Y | X) ∈ L(X).
Geometrically, we have that the projection of Y onto L(X) is L(Y | X), which is to say that
L(Y | X) is, in a sense, the closest linear function of X to Y. We will make this notion more
precise in the next section.

4.2.2 Linear Regression


Next, we show that the Projection Property implies that L(Y | X) is the best linear predictor
of Y , in a least-squares sense.

Theorem 4.16 (Least-Squares Property). Define L(X) := {aX + b : a, b ∈ R} to be the
set of linear functions of X. Then L(Y | X) has the property that for any aX + b ∈ L(X),

E[(Y − L(Y | X))²] ≤ E[(Y − aX − b)²].    (4.17)

In other words, when we estimate Y, L(Y | X) has the lowest mean squared error out of
any other linear function of X.

Proof. According to the Projection Property, we can combine (4.15) and (4.16) to obtain

E[(Y − L(Y | X))(aX + b)] = 0

for any aX + b ∈ L(X). Let Ŷ := L(Y | X) to simplify the notation. We calculate:

E[(Y − aX − b)²] = E[((Y − Ŷ) + (Ŷ − aX − b))²]
= E[(Y − Ŷ)²] + 2E[(Y − Ŷ)(Ŷ − aX − b)] + E[(Ŷ − aX − b)²]
= E[(Y − Ŷ)²] + E[(Ŷ − aX − b)²]

In the second line of the proof, the term E[(Y − L(Y | X))(L(Y | X) − aX − b)] vanishes
by the Projection Property because L(Y | X) − aX − b ∈ L(X). The quantity above
represents the mean squared error of aX + b as a predictor of Y, so we seek to minimize
this quantity. Since the expectation of a non-negative random variable is always non-
negative, the mean squared error is minimized when aX + b = L(Y | X).

The least-squares line is used everywhere as a visual summary of a trend. Now you have
seen the theory behind this powerful tool!

4.3 Quadratic Regression


We can extend the ideas of the previous section to find the best quadratic estimator of Y
given X. A quadratic function of X has the form f (X) = aX² + bX + c. Our goal is to
make Y − f (X) orthogonal to any other quadratic function of X, that is, we want f (X) to
be the projection of Y onto Q(X) (where Q(X) is the space of quadratic functions of X).

To achieve this, we want Y − f (X) to be orthogonal to 1, X, and X². We have the equations

0 = E[Y − aX² − bX − c],    (4.18)
0 = E[(Y − aX² − bX − c)X],    (4.19)
0 = E[(Y − aX² − bX − c)X²].    (4.20)
We can solve these equations for a, b, and c in order to find the best quadratic function of
X as an estimator of Y . We will not work through the details, but hopefully you can see
how the general procedure goes.

4.4 Conditional Expectation


Next, we will search for an even more powerful predictor of Y given X.² We define the
conditional expectation E[X | Y = y] using the following formula:

E[X | Y = y] = Σ_x xP (X = x | Y = y)    (4.21)

In other words, E[X | Y = y] is the expectation of X with respect to the probability distribu-
tion of X conditioned on Y = y. It is important to stress that E[X | Y = y] is just a real
number, just like any other expectation.

Notice that for every possible value of Y , we can assign a real number E[X | Y = y]. In
other words, this is a function of y:
f : R → R given by f (y) = E[X | Y = y]
Hence, let us define E[X | Y ] to be a function of Y in the following manner:

Definition 4.17 (Conditional Expectation). Let X and Y be random variables. Then


E[X | Y ] is also a random variable, called the conditional expectation of X given Y ,
which has the value E[X | Y = y] with probability P (Y = y). Observe that E[X | Y ] is
a function of Y , i.e. E[X | Y ] = f (Y ).

This point cannot be stressed enough: E[X | Y ] is a random variable! Although the con-
ditional expectation may seem mysterious at first, there is an easy rule for writing down
E[X | Y ]. Let us consider an example for concreteness.

² The material presented in this section appears slightly differently from the presentation in Professor
Walrand's lecture notes, but the lecture notes are still the source for these discussion notes.
CHAPTER 4. REGRESSION & CONDITIONAL EXPECTATION 52

Example 4.18. Suppose that we roll a die N times, where N is some non-negative
integer-valued random variable. Let X be the sum of the dice rolls. Conditioned on
N = 1 (that is, we roll one die), then the expected value is 7/2, i.e. E[X | N = 1] = 7/2.
Similarly, conditioned on N = 2, we roll two dice, so E[X | N = 2] = 7 (the expected
sum of two dice is 7). In general, conditioned on N = n, we roll n dice and we have
E[X | N = n] = 7n/2.

If N = n, then the random variable E[X | N ] has the value E[X | N = n] = 7n/2; hence,
we can write E[X | N ] = 7N/2 (which is a function of N , following the discussion above).
At first, it may appear that in going from the expression for E[X | N = n] to E[X | N ],
we merely replaced the n with N . Of course, there is more going on than just a simple
substitution (E[X | Y ] is a random variable and E[X | Y = y] is just a real number),
but substituting y ↦ Y is exactly the procedure for writing down E[X | Y ]. Don't worry if
conditional expectation is difficult to grasp at first. Mastery of this concept requires practice,
and we will soon see how to apply conditional expectation to the problem of prediction.

4.4.1 The Law of Iterated Expectation


First, we prove an amazing fact about conditional expectation. We noted that E[X | Y ] is a
random variable, and of course, we are always interested in the expectation values of random
variables. Hence, we can ask the question: what is the expectation of E[X | Y ]? To answer
this question, recall that E[X | Y ] is a function of Y , that is, E[X | Y ] = f (Y ). Then, to
calculate the expectation of E[X | Y ], we see that we should compute the expectation with
respect to the probability distribution of Y . Once you understand this point, the following
proof is straightforward in its details.

Theorem 4.19 (Law of Iterated Expectation). Let X and Y be random variables.

E[E[X | Y ]] = E[X] (4.22)

Proof. As noted above, we compute E[E[X | Y ]] with respect to the probability distri-
bution of Y .
E[E[X | Y ]] = Σ_y E[X | Y = y]P (Y = y) = Σ_y (Σ_x xP (X = x | Y = y)) P (Y = y)
= Σ_y Σ_x xP (X = x, Y = y) = Σ_x x (Σ_y P (X = x, Y = y))
= Σ_x xP (X = x) = E[X]

What a marvelous proof! Once you understand this proof, you will have understood most
of the concepts we have covered so far. In the first line, we use the definition of the ex-
pectation of a function of Y , i.e. E[f (Y )] = Σ_y f (y)P (Y = y); then, we use the definition

of E[X | Y = y], which was given in the previous section. In the second line, we use our
knowledge of conditional probability; then, we recall that summing over all possible values
of Y in the joint distribution P (X = x, Y = y) yields the marginal distribution P (X = x).
Finally, in the last line, we come back to the definition of E[X].

Example 4.20. Let us return to Example 4.18, where we found that E[X | N ] = 7N/2.
For concreteness, suppose that N ∼ Geom(p). We have

E[X] = E[E[X | N ]] = (7/2)E[N ] = 7/(2p).
Observe that we have found the expectation of a sum of a random number of random
variables, a task that would be far more difficult without the tool of conditioning.
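A simulation sketch of Example 4.20 (not part of the original notes); p = 0.2 is an arbitrary choice.

import random

def geometric(p):
    n = 1
    while random.random() >= p:
        n += 1
    return n

p, trials = 0.2, 100_000
total = 0
for _ in range(trials):
    n = geometric(p)                                    # N ~ Geom(p)
    total += sum(random.randint(1, 6) for _ in range(n))  # sum of N die rolls
print(total / trials, "should be close to", 7 / (2 * p))  # about 17.5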

Example 4.21 (Random Walk). For a second example, let us consider a drunk indi-
vidual walking along the number line. At the start, the drunkard starts at the origin
(X0 = 0); at successive time steps, the drunkard is equally likely to take a step forward
(+1) or backward (−1). In other words, given that the drunkard is at position k at
time step n, at time step n + 1 the drunkard will be equally likely to be found at either
position k + 1 or k − 1.

Denoting the drunkard's position at time step n as Xn, we first compute E[Xn]. We can
write X_{n+1} = Xn + 1_+ − 1_−, where 1_+ is the indicator that the drunkard takes a step
forward and 1_− is the indicator that the drunkard takes a step backward. Taking the
expectation of both sides yields E[X_{n+1}] = E[Xn] + 1/2 − 1/2 = E[Xn] (remembering
that the probability of moving forward, which is the same as the probability of moving
backward, is 1/2). We have found that the expected position of the drunkard is always
the same for all n, so the expected position of the drunkard at any later time is the same
as the starting position: E[Xn ] = E[X0 ] = 0.

Next, we calculate E[Xn²]. Conditioned on the drunkard being located at position k at
time step n, the drunkard will be equally likely to be found at position k + 1 or
k − 1. Hence, we have

P (X²_{n+1} = (k + 1)² | Xn = k) = P (X²_{n+1} = (k − 1)² | Xn = k) = 1/2.

Taking the expectation yields

E[X²_{n+1} | Xn = k] = (1/2)(k + 1)² + (1/2)(k − 1)² = k² + 1.

Substituting Xn for k, we have

E[X²_{n+1} | Xn ] = Xn² + 1.

Using the law of total expectation,

E[X²_{n+1}] = E[E[X²_{n+1} | Xn ]] = E[Xn²] + 1.

E[X0²] = 0 provides the initial condition, so E[Xn²] = n. In particular, we have explicitly
shown that var(Xn) = E[Xn²] − E[Xn]² = n, so the standard deviation of the drunkard's
walk is √n.

Example 4.22 (Galton-Watson Branching Process I). Let us consider a simple model
of population growth. Suppose we start with a population of X0 = N people, and let Xn
be the number of people in the population on the nth time step. Furthermore, assume:

1. Each person survives for only one time step.

2. At each time step, independently of the rest of the population, each individual is
expected to leave behind μ offspring.

What do we expect the population to look like at time n?

The challenge of this problem is that the number of people in the population at any given
time step is a random variable, so we have a random number of individuals who repro-
duce randomly. However, we have already seen in Example 4.18 a systematic method for
dealing with this kind of randomness! First, we condition on the number of individuals
in the population at time n − 1, and we compute the population size at time n.

Conditioned on X_{n−1} = m, we have m people, who are each expected to leave behind
μ offspring. Therefore, E[Xn | X_{n−1} = m] = μm and we have E[Xn | X_{n−1}] = μX_{n−1}.
Then, we can use the law of iterated expectation:

E[Xn ] = E[E[Xn | X_{n−1}]] = μE[X_{n−1}]

We have obtained a recursive solution to the problem: at each time step, we expect the
population size to be multiplied by μ. Therefore,

E[Xn ] = μⁿ E[X0 ] = μⁿ N.

We can see that if μ < 1, then the expected population size tends to 0, whereas if μ > 1,
the expected population size grows exponentially fast.

4.5 MMSE
4.5.1 Orthogonality Property
Before we apply the conditional expectation to the problem of prediction, we first note a
few important properties of conditional expectation. The first is that the conditional ex-
pectation is linear, which is a crucial property that carries over from ordinary expectations.

For example, E[X + Y | Z] = E[X | Z] + E[Y | Z]. The justification for this, briefly, is that
E[X + Y | Z = z] = E[X | Z = z] + E[Y | Z = z].

The second important property is: let f (X) be any function of X and g(Y ) any function
of Y . Then the conditional expectation E[f (X)g(Y ) | Y ] = g(Y )E[f (X) | Y ]. The intuitive
idea behind this property is that when we condition on Y , then any function of Y is treated
as a constant and can be moved outside of the expectation by linearity. In other words,
E[f (X)g(Y ) | Y = y] = E[f (X)g(y) | Y = y] = g(y)E[f (X) | Y = y]. Then, to obtain
E[f (X)g(Y ) | Y ], we perform our usual procedure of substituting y ↦ Y.

Let us now prove the analogue of the Projection Property for E[Y | X].

Theorem 4.23 (Orthogonality Property). Let X and Y be random variables, and let
φ(X) be any function of X. Then we have:

E[(Y − E[Y | X])φ(X)] = 0    (4.23)

Proof. We first calculate E[(Y − E[Y | X])φ(X) | X] using the properties of conditional
expectation.

E[(Y − E[Y | X])φ(X) | X] = φ(X)E[Y − E[Y | X] | X]
= φ(X)(E[Y | X] − E[E[Y | X] | X])
= φ(X)(E[Y | X] − E[Y | X]) = 0

Note: Why is E[E[Y | X] | X] = E[Y | X]? We have that E[Y | X] is a function of X,
and conditioned on the value of X, E[Y | X] is essentially a constant. Now observe that
Theorem 4.19 gives us

E[(Y − E[Y | X])φ(X)] = E[E[(Y − E[Y | X])φ(X) | X]] = 0.

In our proof, we used the useful trick of conditioning on a variable first in order to use the law
of total expectation. Compare with the Projection Property: the Orthogonality Property is
stronger in that φ(X) is allowed to be any function of X, whereas the Projection Property
was proven for linear functions of X.

4.5.2 Minimizing Mean Squared Error


Definition 4.24 (Minimum Mean Square Error). The minimum mean square error
(MMSE) estimator of Y given X is the random variable f (X) which minimizes the mean
squared error, i.e. for any function g(x),

E[(Y − f (X))²] ≤ E[(Y − g(X))²].

Compared to the task of finding the best linear estimator of Y given X, finding the best
general estimator of Y given X seems to be an even more difficult task. However, the MMSE

will simply turn out to be our new friend, the conditional expectation. In fact, the proof is
virtually the same as the proof for the LLSE.

Theorem 4.25 (MMSE). Let X and Y be random variables. Then the MMSE of Y
given X is E[Y | X], i.e. for any function g(x),

E[(Y − E[Y | X])²] ≤ E[(Y − g(X))²].    (4.24)

Proof. Let Ŷ = E[Y | X] for simplicity of notation. We have, using the Orthogonality
Property,

E[(Y − g(X))²] = E[((Y − Ŷ) + (Ŷ − g(X)))²]
= E[(Y − Ŷ)²] + 2E[(Y − Ŷ)(Ŷ − g(X))] + E[(Ŷ − g(X))²]
= E[(Y − Ŷ)²] + E[(Ŷ − g(X))²].

The term E[(Y − E[Y | X])(E[Y | X] − g(X))] vanishes by the Orthogonality Property
since E[Y | X] − g(X) is just another function of X.

We have come a long way, and the answer is surprisingly intuitive. We have just found the
best estimator of Y given X (in the mean squared error sense) is simply the expected value
of Y given X!

4.6 Conditional Variance


Definition 4.26. Let X and Y be random variables. We define var(X | Y = y) to be
the variance of the conditional probability distribution P (X = x | Y = y). Furthermore,
the conditional variance var(X | Y ) is defined to be the random variable that takes on
the value var(X | Y = y) with probability P (Y = y). Note that var(X | Y ) is a function
of Y , analogously to E[X | Y ].

Theorem 4.27 (Law of Total Variance). Let X and Y be random variables. Then:

var(X) = E[var(X | Y )] + var(E[X | Y ]) (4.25)

Proof. First, we use the computational formula for the variance.

var(X) = E[X²] − E[X]²

We calculate each term by the law of total expectation.

var(X) = E[E[X² | Y ]] − E[E[X | Y ]]²
= E[E[X² | Y ]] − E[E[X | Y ]²] + E[E[X | Y ]²] − E[E[X | Y ]]²
= E[E[X² | Y ] − E[X | Y ]²] + var(E[X | Y ])
= E[var(X | Y )] + var(E[X | Y ])

The formula above is the analogue of the Law of Iterated Expectation for variance.

Example 4.28 (Variance of a Random Sum). Suppose X = X1 + + XN , where N


is a random variable. What is var(X)?

We compute E[X | N ] = N E[Xi ] and var(X | N ) = N var(Xi ). Then,

var(X) = E[var(X | N )] + var(E[X | N ]) = E[N ] var(Xi ) + var(N )(E[Xi ])2 .

Example 4.29 (Galton-Watson Branching Process II). Let us revisit Example 4.22 with
the additional assumption that the number of offspring has variance σ². Using our new
tools, we can now compute var(Xn).

As we computed above, E[Xn | X_{n−1}] = μX_{n−1}. Given that there are m individuals in the
population, the variance of the population at the next time step is mσ² by linearity of
variance (recall that we assume that the individuals' offspring counts are independent). Hence,
var(Xn | X_{n−1}) = σ²X_{n−1}. Now, we may apply (4.25):

var(Xn) = E[σ²X_{n−1}] + var(μX_{n−1}) = σ²μ^{n−1}N + μ² var(X_{n−1})

First, suppose that μ = 1. Then, the recurrence simplifies to var(Xn) = σ²N + var(X_{n−1}),
which means that the variance increases linearly: var(Xn) = σ²Nn. (Here, we are as-
suming that the initial population is fixed, so var(X0) = 0.)

For μ ≠ 1, the solution to the recurrence is obtained by finding a pattern after a few
iterations:

var(Xn) = σ²μ^{n−1}N + μ² var(X_{n−1}) = σ²μ^{n−1}N + σ²μⁿN + μ⁴ var(X_{n−2})
= · · · = σ²μ^{n−1}N Σ_{k=0}^{n−1} μ^k = σ²μ^{n−1}N (1 − μⁿ)/(1 − μ)

We have used the formula for a finite geometric series.


Chapter 5

Markov Chains

Markov chains are an important class of probabilistic models that lend themselves favorably
to analysis. First, we describe how to calculate useful properties of Markov chains: expected
hitting times and absorption probabilities. Then, we discuss the limiting behavior of Markov
chains and classify their long-term behavior.

5.1 Introduction
Markov chains are an important probabilistic method of modeling processes over time. Con-
cretely, consider a sequence of random variables X1 , X2 , X3 , . . . . (Note that we are only
considering discrete-time Markov chains.) We say that Xn represents the state of a system
at time n, and we are interested in the behavior of the system. In general, a system could
depend on the entire history of the system until that point, that is, the distribution of Xn
could depend on X1, . . . , X_{n−1}. It is very difficult to analyze such systems, however, so we
make a powerful simplifying assumption:

P (Xn = xn | X1 = x1, . . . , X_{n−1} = x_{n−1}) = P (Xn = xn | X_{n−1} = x_{n−1})    (5.1)

We call this assumption the Markov property. In words, it states that the distribution
of Xn only depends on the time immediately before, not on any of the previous times. A
common way to describe the Markov property is: the future depends on the past only through
the present. In many situations, the Markov property is a very reasonable assumption to
make. Even when the validity of the assumption is questionable, the Markov property may
still be necessary to make calculations tractable and analysis possible.

One can deduce an extended version of the Markov property. We rewrite the probability
P(X_{n+1} = x_{n+1}, . . . , X_{n+m} = x_{n+m} | X1 = x1, . . . , Xn = xn) using the Chain Rule to obtain
Π_{k=1}^m P(X_{n+k} = x_{n+k} | X1 = x1, . . . , X_{n+k−1} = x_{n+k−1}). We can forget the first n − 1 states,
so we have Π_{k=1}^m P(X_{n+k} = x_{n+k} | Xn = xn, . . . , X_{n+k−1} = x_{n+k−1}). Finally, another appli-
cation of the Chain Rule yields P(X_{n+1} = x_{n+1}, . . . , X_{n+m} = x_{n+m} | Xn = xn), which shows
that the first n − 1 time steps are not relevant once we know the nth time step.

Let us be explicit in what a Markov chain entails. A Markov chain consists of:


• A set of states K = {1, . . . , N}, where N is the number of states. We consider only
  finite discrete-time Markov chains in this course, that is, Markov chains with a finite
  number of states.

• An initial probability distribution over the states, π_0.

• The probabilities of transitioning between states: for every pair of states i and j, we
  must specify P(i, j), the probability of moving from state i to state j. Once we are in
  state i, we know that we must transition to some state at the next time step (even if
  we transition back to state i), so we require Σ_{j=1}^N P(i, j) = 1.

From the three components of a Markov chain, we can define a sequence of random variables
{X_n : n ∈ N} such that
    P(X_0 = i) = π_0(i)                                                              (5.2)
and
    P(X_n = j | X_{n−1} = i) = P(i, j).                                              (5.3)
Notice that (5.3) uses the Markov property discussed above. The interpretation of the random
variables is that the system transitions from state to state at every time step, and X_n
represents the state of the system at time n.

There are N^2 transition probabilities P(i, j), and we often organize them into a matrix P
such that the (i, j) entry of the matrix is P(i, j). We call P the transition probability
matrix. The condition Σ_{j=1}^N P(i, j) = 1 means that each row of P must sum to 1.
(Remark: There is another convention in which the columns of the transition matrix must
sum to 1. We will stick to the convention that the rows sum to 1, but do not be surprised if
you encounter different notation in the literature.)

Instead of writing the cumbersome notation P(X_n = i) for the distribution of X_n, we will
use the notation π_n to mean the distribution of X_n : π_n(i) = P(X_n = i).

Example 5.1 (Bernoulli-Laplace Diffusion). As a simple model for diffusion, suppose
there are k white marbles and k black marbles in two boxes (k marbles in each box). At
each time step, we pick one marble from each box and exchange them. We can model
the situation as a Markov chain, where the states are K = {0, . . . , k}, representing the
number of white marbles in the first box.

Suppose that there are currently i white marbles in the first box. Then, the probability
that the number of white marbles will decrease to i − 1 is i/k (the probability of choosing
a white marble from the first box), multiplied by i/k (the probability of choosing a black
marble from the second box). By a similar argument, the probability that the number
of white marbles will increase to i + 1 is (k − i)^2/k^2, so our transition probabilities are

    P(i, j) = i^2/k^2,           j = i − 1,
              2i(k − i)/k^2,     j = i,
              (k − i)^2/k^2,     j = i + 1.

Example 5.2 (Random Walk I). A random walk is an example of a Markov chain
in which the state space K is infinite. Take K = Z and let P(i, i − 1) = 1 − p and
P(i, i + 1) = p for all i. If p = 1/2, we say the Markov chain is a symmetric random
walk.

(Diagram: the walk on Z, where each state i transitions to i + 1 with probability p and
to i − 1 with probability 1 − p.)

Example 5.3 (Queueing). As another example of an infinite Markov chain, consider a
model for queueing in which X_n represents the number of customers in the queue at time
n. At time n, one customer is removed from the queue, and Y_n new customers arrive,
where the Y_n are i.i.d. and take values in N. The state space is K = N, representing the
number of customers in the queue. For i ≥ 1, the number of customers is reduced to
i − 1, and then the probability of transitioning to j is the probability that j − (i − 1)
customers arrive.

    P(i, j) = P(Y_n = j − i + 1)

Along with P(0, j) = P(1, j), we have specified the transition probabilities.

5.2 Transition of Distribution


The first question we can ask is: given a transition matrix P , how can we compute the
distribution at time n (π_n)? Here is where the Markov property shines: we can easily
compute π_n if we know the distribution at one time step prior. Therefore, suppose we are
given π_{n−1}. We can compute

    π_n(j) = Σ_{i=1}^N P(X_{n−1} = i, X_n = j) = Σ_{i=1}^N P(X_{n−1} = i) P(X_n = j | X_{n−1} = i)
           = Σ_{i=1}^N π_{n−1}(i) P(i, j).

Actually, the equation above has exactly the same form as matrix multiplication. If we write
π_n as a row vector and P as a matrix, then we have found:

    π_n = π_{n−1} P                                                                  (5.4)

Iterating this, we can obtain the distribution at time n starting from the initial distribution.

    π_n = π_0 P^n                                                                    (5.5)

Passing from one time step to the next is simply multiplication by the transition matrix.
In practice, it is extremely easy for computers to carry out this matrix multiplication, so
Markov chain models are often very tractable. In theory, we can use the power of linear
algebra in order to analyze a probabilistic situation. Hopefully, you can begin to see why
Markov chains are such a useful family of models.
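To make this concrete, here is a short Python sketch (my own illustration, not from the notes) that propagates a distribution through a small transition matrix; the particular two-state matrix is an arbitrary example.

import numpy as np

# Rows sum to 1: row i gives the distribution over next states from state i.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi0 = np.array([1.0, 0.0])                    # start in state 0 with probability 1

pi_n = pi0 @ np.linalg.matrix_power(P, 20)    # pi_n = pi_0 P^n
print(pi_n)                                   # approaches the invariant distribution [0.8, 0.2]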

5.3 Markov Chain Computations


5.3.1 Hitting Time
Suppose that we start at a state s ∈ K and we want to find the expected time until we reach
another state s′. We can compute the hitting time in the following way: define β(i) to be
the expected time to reach state s′, starting at state i. Then, β(s′) = 0 by definition and
the quantity we are interested in is β(s).

We can write down the first-step equations in order to compute β(i). They are:

    β(i) = 1 + Σ_{j=1}^N P(i, j) β(j)                                                (5.6)

The first-step equations say: in order to reach s′, we must first take a step (that is the +1
in the equations above). After we take a step, we transition to some other state j, and once
we are in state j, we expect to take β(j) more steps until we reach s′. Of course, we must
weight each possibility by the probability of transitioning to the state j, and we sum over all
states to account for all possibilities.

When we carry out this procedure, we end up with N equations, one for each state. Each
equation is linear in the β(i), so we can solve the system just as we solve any other linear
system (either through repeated substitution or Gaussian elimination). Once we solve the
system, we have β(s), which is the expected time to reach s′ from s.
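For a concrete illustration (a sketch of the general recipe, with an arbitrary 3-state chain of my own choosing), the first-step equations can be handed to a single linear solve: restricting to the non-target states, β = 1 + Qβ, where Q is the transition matrix among those states, so (I − Q)β = 1.

import numpy as np

# Arbitrary 3-state chain; we want the expected time to hit state 2 starting from state 0.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])       # state 2 made absorbing for convenience

rest = [0, 1]                          # non-target states
Q = P[np.ix_(rest, rest)]              # transitions among non-target states

# First-step equations: beta = 1 + Q beta  =>  (I - Q) beta = 1.
beta = np.linalg.solve(np.eye(len(rest)) - Q, np.ones(len(rest)))
print("expected hitting time from state 0:", beta[0])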

Example 5.4. Suppose X ∼ Geom(p). Previously, we have computed E[X] = 1/p.
Here, we will show how to model X as a Markov chain and compute the expectation.

Recall that a geometric random variable represents the number of flips of a biased coin
until we see heads. Therefore, we will define a two-state Markov chain with states
K = {1, 2}, where state 1 represents the state of "we are still flipping the coin" and state
2 represents the terminal state (we are done flipping the coin). Then, we can see that
the expected time to reach state 2 from state 1 is precisely the expectation of X. We
have β(2) = 0, and

    β(1) = 1 + (1 − p)β(1) + pβ(2) = 1 + (1 − p)β(1),

since with probability 1 − p, we see tails and we have to keep flipping, and with probability
p, we see heads and we move to the terminal state. Now, we can solve the equation for
β(1) and we obtain pβ(1) = 1, or β(1) = 1/p.

We can generalize the above method. Suppose that each state i gives you a reward R(i) for
visiting that state, and we want to find the expected reward that we accumulate before we
reach any of the states in a set S′. Observe that the first-step equations above correspond
to the special case where each state gives a reward of 1 and the goal is the set S′ = {s′}. In
general, define β(i) to be the expected sum of rewards that we accumulate until we reach a
state in S′, starting from state i. Then β(i) = R(i) for any i ∈ S′, and for the other states:

    β(i) = R(i) + Σ_{j=1}^N P(i, j) β(j)                                             (5.7)

Here, we are saying that the accumulated rewards consist of the reward we obtain now, plus
the reward that we expect at state j (weighted by the probability of transitioning to j).

Caution: Do not memorize the equations in this section! Instead, carefully examine their
meaning and try to understand why they solve the problem we are trying to solve. Make
these equations part of your intuition.

5.3.2 Probability of S′ Before S

We have found how to compute the expected time to reach state s′ from state s. Now,
consider the problem of reaching S′ before S, where S and S′ are sets of states. Denote by
α(i) the probability of reaching S′ before S, starting from state i. Then, α(i) = 1 for all
i ∈ S′ and α(i) = 0 for all i ∈ S. We can write down another linear system of equations:

    α(i) = Σ_{j=1}^N P(i, j) α(j)                                                    (5.8)

These equations are similar to the first-step equations in that they look ahead one time
step. For each possible state j, weighted by the probability of transitioning to j, we sum
up the probability of reaching S′ before S starting at j. Solving these equations amounts to
solving another system of linear equations.

Example 5.5 (Gambler's Ruin). Consider a model of a gambler who starts with a dollars
and wishes to leave the casino with b dollars. The set of states is K = {0, . . . , b},
representing the amount of money that the gambler possesses. The gambler wins a bet
with probability p, and all bets are for one dollar. What is the probability of {b} before
{0}, that is, what is the probability that the gambler will succeed before going bankrupt?

Denote by α(i) the probability of reaching b before 0, starting with i dollars. We have
α(0) = 0 and α(b) = 1. For i ∉ {0, b}, we have the equations

    α(i) = (1 − p)α(i − 1) + pα(i + 1).

The equation above is known as a difference equation or a linear recurrence relation.
We solve these equations by judicious guessing.

First, we assume that p = 1/2 (the odds are even). We search for a solution of the form
α(i) = c_1 i + c_0 . From α(0) = 0, we see that c_0 = 0, and from α(b) = 1, we see that
c_1 = b^{−1}. Hence, our result is α(i) = i/b. To verify that α(i) represents a valid solution,
we plug the result into the difference equation:

    i/b = (1/2)((i − 1)/b + (i + 1)/b)

Hence, we have found the correct solution: the probability of success increases linearly
with the amount of money with which we start.

Now, suppose that p ≠ 1/2. We search for a solution of the form α(i) = c_1 ρ^i + c_0 .
Plugging in the form of the solution, we obtain:

    c_1 ρ^i + c_0 = (1 − p)(c_1 ρ^{i−1} + c_0) + p(c_1 ρ^{i+1} + c_0)

After cancelling terms and simplifying, we obtain the equation

    pρ^2 − ρ + 1 − p = 0.

The equation is quadratic in ρ, and it is solved by choosing the solution corresponding to
ρ ≠ 1. (One can verify that the solution ρ = 1 does not satisfy α(0) = 0 and α(b) = 1.)

    ρ = (1 − √(1 − 4p(1 − p)))/(2p) = (1 − (2p − 1))/(2p) = (1 − p)/p

Setting α(0) = 0, we have c_0 + c_1 = 0. Setting α(b) = 1, we have c_0 + c_1 ρ^b = 1. We solve
to find c_0 = −c_1 = (1 − ρ^b)^{−1}, from which we extract our desired solution:

    α(i) = (1 − ρ^i)/(1 − ρ^b)

If p < 1/2, then ρ > 1, and this spells bad news for the gambler. (You should try
plugging in various values of ρ, i, and b above to obtain numerical results.)

Now, we can repeat the entire argument above, switching p and q := 1 − p (which amounts to
replacing ρ with 1/ρ) and replacing i with b − i. This corresponds to computing the probability of
reaching 0 before b starting from state i, which we denote as γ(i). We have the results

    γ(i) = (b − i)/b,                            p = 1/2,
    γ(i) = (1 − ρ^{−(b−i)})/(1 − ρ^{−b}),        p ≠ 1/2.

It can be seen that α(i) + γ(i) = 1, which means that we are guaranteed to hit b before
0 or 0 before b. It is impossible for the gambler to be stuck at the casino forever.
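As a quick numerical sanity check (my own sketch, not part of the notes), we can compare the closed-form α(i) with the solution of the first-step system for a small instance; the values of p and b are arbitrary.

import numpy as np

p, b = 0.45, 10
rho = (1 - p) / p

# Solve alpha(i) = (1-p) alpha(i-1) + p alpha(i+1) for i = 1, ..., b-1,
# with boundary conditions alpha(0) = 0 and alpha(b) = 1.
A = np.zeros((b + 1, b + 1))
rhs = np.zeros(b + 1)
A[0, 0], A[b, b], rhs[b] = 1.0, 1.0, 1.0
for i in range(1, b):
    A[i, i - 1], A[i, i], A[i, i + 1] = 1 - p, -1.0, p

alpha = np.linalg.solve(A, rhs)
closed_form = (1 - rho ** np.arange(b + 1)) / (1 - rho ** b)
print(np.max(np.abs(alpha - closed_form)))    # essentially zero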

Example 5.6 (Gambler's Ruin II). In Example 5.5, we found that α(i), the probability
of reaching b before 0, starting from state i, is

    α(i) = i/b,                          p = 1/2,
    α(i) = (1 − ρ^i)/(1 − ρ^b),          p ≠ 1/2,

where ρ = (1 − p)/p. Equivalently, we can think of α(i) as the probability of winning
b − i additional dollars before losing i dollars. Here, money is measured relative to i, the
amount of money that was brought into the casino. By replacing b with b + i, we obtain

    α(i) = i/(b + i),                        p = 1/2,
    α(i) = (1 − ρ^i)/(1 − ρ^{b+i}),          p ≠ 1/2,

the probability of winning b dollars before losing i dollars. Think of α(i) as the probability
that, starting with no money, we reach b before −i. Now, let i → ∞: we are allowing
the gambler to have infinite capital (so the gambler is allowed to incur any amount of
debt), and we are interested in the probability that the gambler will ever profit b dollars.

Case 1. p > 1/2: Here, ρ < 1, so ρ^i → 0 and α(i) → 1.

Case 2. p = 1/2: The ratio i/(b + i) also tends to 1 as i → ∞.

Case 3. p < 1/2: Since ρ > 1, we can rearrange the formula to

    α(i) = (1 − ρ^i)/(1 − ρ^{b+i}) = (ρ^{−i} − 1)/(ρ^{−i} − ρ^b) → 1/ρ^b = (p/(1 − p))^b

as i → ∞, since ρ^{−i} → 0.

In summary, a gambler with infinite capital is guaranteed to profit b dollars if p ≥ 1/2;
otherwise, the probability is (p/(1 − p))^b.

Example 5.7. There is a simpler way of computing the probability of one event before
another which is conceptually interesting. As a concrete example, suppose that we have
two machines which fail independently, with probabilities p_1 and p_2 respectively, on each
time step. When either one of the machines fails, the entire factory shuts down. What is
the probability that the first machine has failed when the factory shuts down?

The question is really asking: given that the factory shuts down at some time step, what
is the probability that the first machine has failed? We can apply the law of conditional
probability here. Let M_i denote the event that machine i fails on the specified time step.

    P(M_1 | M_1 ∪ M_2) = P(M_1)/P(M_1 ∪ M_2) = p_1/(p_1 + p_2 − p_1 p_2)

(Apply the inclusion-exclusion rule.) See if you can obtain the same answer using a
Markov chain model.

5.4 Long-Term Behavior of Markov Chains


Now that we have seen how to solve some problems with Markov chains, let us analyze their
long-term behavior.

5.4.1 Classification of States


First, we observe that there are only two types of states: transient states and recurrent
states. Transient states are states from which it is possible to leave and never return. For
example, suppose that the state i has a non-zero probability of transitioning to state j, but
once we are in state j, then there is no way to return to state i. Since a Markov chain runs
forever, eventually we must transition from i to j, and for all time steps afterwards, we will
never return to i. Therefore, we must have π_n(i) → 0 as n → ∞.

The other possibility is a recurrent state, which is the opposite of the situation described
above. Suppose that we start from state i, and regardless of the next state, there is always
a non-zero probability that we will return to state i. In this case, since a Markov chain runs
forever, we must necessarily return to state i infinitely many times, which means we visit
state i infinitely often during the course of our Markov chain.

We now formalize our observations. Let f_n(i, j) denote the probability that a system starting
at state i first visits the state j at time n. Let f(i, j) = Σ_{n=1}^∞ f_n(i, j) denote the probability
that a system starting at state i eventually visits state j.

Definition 5.8. In a Markov chain, a state i is recurrent if f (i, i) = 1 and transient


otherwise.

Here, i.o. means infinitely often. The following theorem requires knowledge of the
Borel-Cantelli Lemma, Lemma 9.2.

Theorem 5.9 (Classification of States). For a state i in a Markov chain, there are two
possibilities:

• Σ_{n=1}^∞ P^n(i, i) < ∞, P(X_n = i i.o. | X_0 = i) = 0, and i is transient.

• Σ_{n=1}^∞ P^n(i, i) = ∞, P(X_n = i i.o. | X_0 = i) = 1, and i is recurrent.

Proof. The probability that, starting from state i, we return to state i at least k times
is f(i, i)^k . Letting k → ∞, we see that

    P(X_n = i i.o. | X_0 = i) = { 0,  f(i, i) < 1,
                                  1,  f(i, i) = 1.                                   (5.9)

Suppose that Σ_{n=1}^∞ P^n(i, i) < ∞. By the Borel-Cantelli Lemma (Lemma 9.2), we have
P(X_n = i i.o. | X_0 = i) = 0.

Suppose that f(i, i) < 1; our goal is to bound Σ_{n=1}^∞ P^n(i, i). Therefore, we calculate
P^n(i, i) by the following argument: if we start at state i and end up at state i at time n,
then suppose that the first time we returned to state i was at time n − m. The probability
that the first return to i occurred at time n − m is f_{n−m}(i, i), and starting at state i, the
probability that we are again at state i after the remaining m steps is P^m(i, i). Summing
over m, we obtain

    P^n(i, i) = Σ_{m=0}^{n−1} f_{n−m}(i, i) P^m(i, i).

We sum over n:

    Σ_{n=1}^j P^n(i, i) = Σ_{n=1}^j Σ_{m=0}^{n−1} f_{n−m}(i, i) P^m(i, i) = Σ_{m=0}^{j−1} P^m(i, i) Σ_{n=m+1}^j f_{n−m}(i, i)
                        = Σ_{m=0}^{j−1} P^m(i, i) Σ_{n=1}^{j−m} f_n(i, i) ≤ f(i, i) + f(i, i) Σ_{m=1}^j P^m(i, i),

where the inner sum over the f_n(i, i) is at most f(i, i). Rearranging this, we find

    Σ_{n=1}^j P^n(i, i) ≤ f(i, i)/(1 − f(i, i)) < ∞,

since f(i, i) < 1. The RHS is a constant that does not depend on j, so letting j → ∞,
we see that Σ_{n=1}^∞ P^n(i, i) < ∞. This completes the chain of equivalences.

Remark: The classification of states does not require the Markov chain to be finite.

5.4.2 Invariant Distribution


If a Markov chain has at least one transient state, then it can be decomposed into transient
states and recurrent states. Essentially, a Markov chain must eventually leave its transient
states and become trapped in the recurrent states. On the other hand, if the Markov chain
consists of a single class of recurrent states, then it cannot be decomposed any further, so we
say that the Markov chain is irreducible. We have the following definition:

Definition 5.10. A Markov chain is irreducible if from any state we can reach any
other state.

Irreducibility has consequences for the distribution of a Markov chain. We say that a distri-
bution π is an invariant distribution if:

    π = πP                                                                           (5.10)

Observe that if at any time n, π_n = π, then the distribution does not change at the next time
step, which implies that the distribution will forever be the same at every time step from
that point onwards (by induction, of course). If we start off with the invariant distribution
(π_0 = π), then π_n = π for all n.

Theorem 5.11. A finite, irreducible Markov chain has a unique invariant distribution.
Of course, this has the hidden implication that a Markov chain that is not irreducible need
not have a unique invariant distribution. Consider, for example, P = I (the identity matrix).
Then any distribution is an invariant distribution.

It turns out that the invariant distribution tells us more. Let v_n(i) be the number of times
that the state i is visited in the first n time steps. (We can also write v_n(i) = Σ_{j=0}^{n−1} 1(X_j = i).)
Then the fraction of time spent in state i is v_n(i)/n, and we have the following theorem:

Theorem 5.12. Suppose we have a finite, irreducible Markov chain with invariant dis-
tribution π. Then, for all states i,

    v_n(i)/n → π(i)

as n → ∞.

Thus, the invariant distribution also tells us the long-term fraction of time that we will spend
in each state.
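The following short simulation (an illustrative sketch of my own, reusing the arbitrary two-state chain from before) compares the empirical visit fractions with the invariant distribution.

import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
steps = 100_000

state, visits = 0, np.zeros(2)
for _ in range(steps):
    visits[state] += 1
    state = rng.choice(2, p=P[state])    # take one step of the chain

print("visit fractions:", visits / steps)   # should be close to [0.8, 0.2]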

5.4.3 Convergence of Distribution


Let us examine the above sentence a bit more carefully. If P = I, then π_n = π_0 always.
This means that in the long-run, the behavior of the Markov chain depends heavily on the
initial distribution. Is this always the case? Are there Markov chains for which we can make
statements about the limiting distribution, regardless of the initial distribution? The answer
is yes, and the relevant property here is aperiodicity.

Consider a Markov chain which, with probability 1, transitions from 1 to 2 and from 2 to
1. Clearly, the distribution will not tend toward a limit in this case, because π_n(1) and π_n(2)
will swap at every time step. In fact, as long as we can rule out this case, we are guaranteed
convergence to a limiting distribution.

Define the period of a state i to be the largest integer d(i) ≥ 1 (if such a number exists)
such that the number of time steps it takes to return to state i is necessarily a multiple of
d(i). We say that a Markov chain is aperiodic if the period of every state is 1. Intuitively,
aperiodicity is related to the tendency of the Markov chain to mix its states. To give an idea
of how aperiodicity is used, we can prove an important lemma using a little number theory.

Lemma 5.13. Let S be a set of positive integers which is closed under addition, such
that gcd(S) = 1. There exists an integer n_0 such that for all integers n ≥ n_0 , n ∈ S.

Proof. If gcd(S) = 1, there is a sequence of elements of S with greatest common
divisor 1, say (a_n). Let d_n = gcd{a_1, . . . , a_n}. Note that d_n is non-increasing with n and d_n ≥ 1,
so there exists some m such that d_m = 1. Therefore, we have found a finite subset
{a_1, . . . , a_m} ⊆ S with gcd{a_1, . . . , a_m} = 1. We can find integers c_1, . . . , c_m such that

    c_1 a_1 + ··· + c_m a_m = 1.

By relabeling the a_n, we can assume that the coefficients c_1, . . . , c_k are positive and
c_{k+1}, . . . , c_m are negative. Write

    b_1 := c_1 a_1 + ··· + c_k a_k ,    b_2 := −(c_{k+1} a_{k+1} + ··· + c_m a_m ),

so that b_1 − b_2 = 1. Since S is closed under addition, we have found b_1, b_2 ∈ S with
b_1 − b_2 = 1.

Take n_0 = b_2(b_2 − 1). For any n ≥ n_0, we can write n = q b_2 + r, where 0 ≤ r ≤ b_2 − 1
and q ≥ b_2 − 1. Then, q − r ≥ 0 and

    n = q b_2 + r = q b_2 + r(b_1 − b_2) = (q − r) b_2 + r b_1 ,

which is in S because S is closed under addition.

Lemma 5.14. For states i, j ∈ K of an irreducible, aperiodic Markov chain, there exists a
positive integer n_0 such that P^n(i, j) > 0 for all n ≥ n_0 .

Proof. Let S be the set {n > 0 : P^n(j, j) > 0}. If n_1, n_2 ∈ S, then n_1 + n_2 ∈ S since
P^{n_1 + n_2}(j, j) ≥ P^{n_1}(j, j) P^{n_2}(j, j) > 0, so S is closed under addition. Since the Markov
chain is aperiodic, gcd(S) = 1. Now, Lemma 5.13 gives an integer n_0′ such that for all
n ≥ n_0′, P^n(j, j) > 0.

Since the Markov chain is irreducible, there is some n_1 such that P^{n_1}(i, j) > 0. Take
n_0 = n_0′ + n_1 . For any n ≥ n_0 , observe that n − n_1 ≥ n_0′, so from the result above,
P^n(i, j) ≥ P^{n_1}(i, j) P^{n − n_1}(j, j) > 0, which completes the proof.

The lemma says that if we continue to raise the transition matrix to successively higher pow-
ers, then eventually we will reach a transition matrix with all positive entries. A transition
matrix with this property is known as a regular transition matrix; we have just proven
that the transition matrix for an irreducible, aperiodic Markov chain is regular.

Irreducibility also yields the following result:

Theorem 5.15. For a finite, irreducible Markov chain, the period of every state is the
same.

Furthermore:

Theorem 5.16. For a finite, irreducible, and aperiodic Markov chain, the distribution
converges to the invariant distribution π:

    π_n → π as n → ∞.

In other words, for each state i,

    π_n(i) → π(i) as n → ∞.

5.4.4 Convergence Rate


In fact, we can go beyond Theorem 5.16 and make a statement about the rate of convergence.

Theorem 5.17 (Exponential Convergence). For a finite, irreducible, and aperiodic
Markov chain, there exist C ≥ 0 and 0 ≤ α < 1 such that |P^n(i, j) − π(j)| ≤ C α^n ,
for all j.

Proof. First, assume that P(i, j) > 0 for all i and j. Write ε = min_{i,j∈K} P(i, j) > 0.
Since 1 = Σ_{j=1}^N P(i, j) ≥ N ε, we must have 0 ≤ 1 − N ε < 1. We will show that
α = 1 − N ε satisfies |P^n(i, j) − π(j)| ≤ α^n .

Fix j and write m_n(j) = min_{i∈K} P^n(i, j) and M_n(j) = max_{i∈K} P^n(i, j). Let i_1 and i_2
be such that P^{n+1}(i_1, j) = M_{n+1}(j) and P^{n+1}(i_2, j) = m_{n+1}(j). Divide the states into a
disjoint union K = K_1 ∪ K_2 , where K_1 consists of the states k for which P(i_1, k) ≥ P(i_2, k).

The first step is to bound Σ_{k∈K_1} (P(i_1, k) − P(i_2, k)). Since every entry of a row is at
least ε, the worst case is when P(i_2, k) = ε and P(i_1, k) is as large as possible. Since the
rows must sum to 1, the largest P(i_1, k) can be is 1 − (N − 1)ε (think of having ε in the
other N − 1 entries). Therefore, the sum is bounded by

    Σ_{k∈K_1} (P(i_1, k) − P(i_2, k)) ≤ 1 − (N − 1)ε − ε = 1 − N ε = α.

The second step is to calculate M_{n+1}(j) − m_{n+1}(j).

    M_{n+1}(j) − m_{n+1}(j) = P^{n+1}(i_1, j) − P^{n+1}(i_2, j) = Σ_{k∈K} (P(i_1, k) − P(i_2, k)) P^n(k, j)

Break up the summation and examine each part individually. On K_1 the terms are non-negative, so

    Σ_{k∈K_1} (P(i_1, k) − P(i_2, k)) P^n(k, j) ≤ M_n(j) Σ_{k∈K_1} (P(i_1, k) − P(i_2, k)).

On K_2 the terms are negative, so

    Σ_{k∈K_2} (P(i_1, k) − P(i_2, k)) P^n(k, j) ≤ m_n(j) Σ_{k∈K_2} (P(i_1, k) − P(i_2, k))
                                               = −m_n(j) Σ_{k∈K_1} (P(i_1, k) − P(i_2, k)),

where the last equality holds because Σ_{k∈K} (P(i_1, k) − P(i_2, k)) = 0. Hence, using the
bound in step one,

    M_{n+1}(j) − m_{n+1}(j) ≤ (M_n(j) − m_n(j)) Σ_{k∈K_1} (P(i_1, k) − P(i_2, k)) ≤ α (M_n(j) − m_n(j)).

Therefore, M_n(j) − m_n(j) ≤ α^n , which shrinks to 0 as n → ∞. This implies that for each
i, P^n(i, j) converges to π(j), and the rate of convergence satisfies |P^n(i, j) − π(j)| ≤ α^n .

When not every P(i, j) is positive, we can always find a finite integer m such that
P^m(i, j) > 0 for all i and j (this follows from Lemma 5.14). Applying the above case to
P^m , we see that we have exponential convergence in this case as well.

5.4.5 Balance Equations


Now that we see the importance of the invariant distribution, we can turn towards cal-
culation. Recall that the invariant distribution satisfies π = πP , which is equivalent to
π(P − I) = 0. This is a matrix-vector equation, so it actually forms a linear system of N
equations in the unknowns π(i). Using some linear algebra knowledge, it can be shown that
one of the equations is redundant, that is, linearly dependent on the rest of the equations
(since 1 is an eigenvalue of P , the matrix P − I does not have full rank). Therefore, we
augment the system of equations with the normalization condition Σ_{i=1}^N π(i) = 1, which
uniquely specifies the invariant distribution. The system of equations π(P − I) = 0 is called
the balance equations.
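Numerically, one convenient way to implement this (a sketch, using an arbitrary example matrix of my own) is to replace one of the redundant balance equations with the normalization condition.

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
N = P.shape[0]

# Balance equations: pi (P - I) = 0, i.e. (P - I)^T pi^T = 0.
A = (P - np.eye(N)).T
A[-1, :] = 1.0                      # replace the last (redundant) equation with sum(pi) = 1
b = np.zeros(N)
b[-1] = 1.0

pi = np.linalg.solve(A, b)
print(pi, pi @ P)                    # pi and pi P agree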

Example 5.18 (Random Walk on a Circle). Consider i.i.d. random variables ξ_n which
take on values in {0, . . . , N − 1}, with distribution P(ξ_n = i) = p_i . (Of course, we must
have Σ_{i=0}^{N−1} p_i = 1.) Let X_n = (Σ_{i=0}^n ξ_i) mod N . Then X_n describes a random walk on a
circle. It is possible to model the X_n as a Markov chain.

Starting from 0, the probability of transitioning to state i is p_i . Starting from state 1,
the probability of transitioning to state i is p_{i−1} (where the index i − 1 is also taken
modulo N ). Generalizing, we see that P(i, j) = p_{j−i} , where j − i is interpreted modulo
N . The transition matrix has a special property: each row is a distinct permutation of
(p_0, . . . , p_{N−1}). In particular, the columns sum to 1: Σ_{i=0}^{N−1} P(i, j) = 1, for all j. If we
assume that each p_i > 0, then the Markov chain is irreducible and aperiodic, and we can
conjecture a uniform stationary distribution. Indeed, if π(i) = N^{−1} for each i,

    Σ_{i=0}^{N−1} π(i) P(i, j) = (1/N) Σ_{i=0}^{N−1} P(i, j) = 1/N = π(j),

so π = (N^{−1}, . . . , N^{−1}) is the unique stationary distribution. Since finite, irreducible,
and aperiodic Markov chains converge, we have shown that the distribution of X_n con-
verges to π as n → ∞, which is to say that if the transition probabilities are positive,
a random walk on a circle will always converge to the uniform distribution over
{0, . . . , N − 1}.

5.4.6 Invariant Distribution & Hitting Times


Let τ(i) denote the time it takes, starting at state i, to return to state i again for the first
time. Then, we have E[τ(i)] = Σ_{n=1}^∞ n f_n(i, i). There is an essential connection between
the ith entry of the invariant distribution and E[τ(i)], the expected time to return to state
i. Informally, if the distribution converges to π, then π(i) represents the probability that
X_n will be in state i for large n. Therefore, we may expect the expected return time to
behave like that of a geometric distribution, that is, we may expect E[τ(i)] = 1/π(i). Indeed, we
have the following result (which does not require the Markov chain to be finite):

Theorem 5.19. Suppose that state i is recurrent and lim_{n→∞} P^n(i, i) = p. Then we
have p = E[τ(i)]^{−1} , where 1/∞ is interpreted as 0.

Proof. Fix n and suppose that we start at state i. Since state i is recurrent, the proba-
bility that we return to state i eventually is 1. We decompose this event in the following
way: either we are at state i at time n, or the last time that we visit state i before time n
is time k, and then it takes longer than n − k steps before we reach state i again.

    1 = P^n(i, i) + Σ_{k=0}^{n−1} P^k(i, i) P(τ(i) > n − k) = Σ_{k=0}^n P^{n−k}(i, i) P(τ(i) > k)       (5.11)

In order to cover the case that p = 0, we must prove that p > 0 if and only if E[τ(i)] < ∞.

Suppose that p > 0. Fix j < n. Then,

    1 ≥ P^{n−j}(i, i) P(τ(i) > j) + ··· + P^n(i, i) P(τ(i) > 0).

Letting n → ∞, since lim_{n→∞} P^n(i, i) = p, we have 1 ≥ p Σ_{k=0}^j P(τ(i) > k), which puts
a bound on the partial sums: Σ_{k=0}^j P(τ(i) > k) ≤ p^{−1}. Since the RHS is a constant, we
let j → ∞ and obtain

    E[τ(i)] = Σ_{k=0}^∞ P(τ(i) > k) ≤ 1/p < ∞.

Now, suppose that E[τ(i)] < ∞. Letting n → ∞ in (5.11), we have

    1 = Σ_{k=0}^∞ p P(τ(i) > k) = p E[τ(i)],

which shows that p = E[τ(i)]^{−1} > 0.


Chapter 6

Continuous Probability I

Our study of discrete random variables has allowed us to model coin flips and dice rolls, but
often we would like to study random variables that take on a continuous range of values (i.e.
an uncountable number of values). At first, understanding continuous random variables will
require a conceptual leap, but most of the results from discrete probability carry over into
their continuous analogues, with sums replaced by integrals.

6.1 Continuous Probability: A New Intuition


Pick a random number in the interval [0, 1] R. What is the probability that we pick
exactly the number 2/3? There are uncountably many real numbers in the interval [0, 1], so
it seems overwhelmingly unlikely that we pick any particular number. We must find that
the probability of choosing exactly 2/3 is 0. In fact, for any real number
a ∈ [0, 1], the probability of choosing a must also be 0. But we are guaranteed to choose
some number in [0, 1], so how is it possible that whatever number we chose was chosen
with probability 0? Furthermore, if I consider the probability of choosing a number less than
1/2, then intuitively, we would like to say the probability is 1/2. Does this not imply that
when we add up a bunch of zero probabilities, we manage to get a non-zero probability?
Clearly, our theory of discrete probability breaks down.

Therefore, let us begin with a few definitions. It is natural if you find that you cannot inter-
pret these definitions immediately, since our intuition from discrete probability will require
some updates. Over the course of working with continuous probability, you will start to
build a new intuition.

The key to resolving the issues raised above is to recall our original definition of a probabil-
ity measure: a probability measure assigns real numbers to sets (not individual outcomes).
Even in this new setting, as long as we can consistently assign probability values to sets,
then probability theory will continue to work as before. One way of assigning consistent
probabilities to sets is through a density function, which will be the focus of our studies.

The density function of a continuous random variable X (also known as the probability


density function, or PDF), is a real-valued function fX (x) such that

1. f_X is non-negative: for all x ∈ R, f_X(x) ≥ 0.

2. f_X is normalized, which is to say that f_X satisfies:

    ∫_R f_X(x) dx = 1                                                                (6.1)

Compare the normalization condition here to the normalization condition in discrete prob-
ability, which says

    Σ_x P(X = x) = 1.

We can interpret the continuous normalization condition to mean that the probability that
X ∈ R is 1. Similarly, we can define the probability that X lies in some interval [a, b] as:

    P(X ∈ [a, b]) := ∫_a^b f_X(x) dx                                                 (6.2)

Remark: It does not matter whether I write the interval as [a, b] (including the endpoints)
or (a, b) (excluding the endpoints). The endpoints themselves do not contribute to the prob-
ability, since the probability of a single point is 0 (as discussed above).

The probability of an interval in R is interpreted as the area under the density function above
the interval. Similarly, we can calculate the probability of the union of disjoint intervals by
adding together the probabilities of each interval.

When we discuss continuous probability, it is also extremely useful to use the cumulative
distribution function of X, or the CDF, defined as:

    F_X(x) := P(X ≤ x) = ∫_{−∞}^x f_X(x′) dx′                                        (6.3)

Remark: Once again, it makes no difference whether I write P(X ≤ x) or P(X < x), since
we have that P(X = x) = 0. From now on, I will be sloppy and use the two interchangeably.

To obtain the PDF from the CDF, we use the Fundamental Theorem of Calculus:

    f_X(x) = (d/dx) F_X(x)                                                           (6.4)

We have given an interpretation of the area under the density function f_X(x) as a probability.
The natural question is: what is the interpretation of fX (x) itself? In the next two sections,
we present an interpretation of fX (x) and introduce two ways of computing distributions in
continuous probability.

6.1.1 Differentiate the CDF


To motivate the discussion, we will consider the following example: suppose you pick a ran-
dom point within a circle of radius 1. Let R be the random variable which denotes the radius
of the chosen point (i.e. the distance away from the center of the circle). What is fR (r)?

The first method is to simply work with the CDF and to obtain f_R(r) by differentiating
F_R(r). Since F_R(r) := P(R < r), and we have chosen a point uniformly randomly inside
of the circle, the probability we are looking for is the ratio of the area of the inner circle
(which has radius r) to the area of the total circle (which has radius 1).

    P(R < r) = (area inside a circle of radius r)/(total area inside circle) = πr^2/π = r^2

Differentiating quickly yields f_R(r) = 2r for 0 < r < 1. Often, differentiating the CDF is a
fast way of finding the density function.
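As an aside (my own illustration, not from the notes), this is easy to check by simulation: sample points uniformly in the unit disk by rejection from the enclosing square and compare the empirical CDF of the radius with F_R(r) = r^2.

import numpy as np

rng = np.random.default_rng(3)
pts = rng.uniform(-1, 1, size=(400_000, 2))
r = np.linalg.norm(pts, axis=1)
r = r[r <= 1]                       # rejection sampling: keep only points inside the unit disk

# Compare the empirical CDF of R at a few radii with F_R(r) = r^2.
for radius in (0.25, 0.5, 0.75):
    print(radius, (r <= radius).mean(), radius**2)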

6.1.2 The Differential Method


The second method works with the density function directly, and therefore involves ma-
nipulation of differential elements (such as dr). If you have taken physics courses before,
then you may already be familiar with the method. If you do not feel comfortable with the
manipulations in this section, you can always differentiate the CDF instead.

To briefly motivate the procedure, let us consider F_R(r + dr). Using a Taylor expansion,

    F_R(r + dr) = F_R(r) + (d/dr) F_R(r) dr + O((dr)^2),

where the notation O((dr)^2) includes terms of order (dr)^2 or higher. Recalling that we have
F_R(r) = P(R < r) and the derivative of the CDF is the density function,

    P(R < r + dr) − P(R < r) = f_R(r) dr + O((dr)^2).

Let us immediately apply the formula we have derived to the motivating example. From the
CDF, we have that the probability of picking a point with R < r + dr is (r + dr)^2, and the
probability of picking a point with R < r is r^2 . Therefore,

    P(R < r + dr) − P(R < r) = (r + dr)^2 − r^2
                             = r^2 + 2r dr + (dr)^2 − r^2
                             = 2r dr + (dr)^2 .

The expression above must equal f_R(r) dr + O((dr)^2). Therefore, by looking at the term
which is proportional to dr, we can identify f_R(r) = 2r and we obtain the same answer!

Initially, the differential method seems to require more calculations and is not as familiar
as working with the CDF instead. However, through this discussion we have obtained an
interpretation for the density function:

    P(X < x + dx) − P(X < x) = P(X ∈ (x, x + dx)) = f_X(x) dx                        (6.5)

In words: the probability that the random variable X is found in the interval (x, x + dx)
is proportional to both the length of the interval dx and the density function evaluated in
the interval. The interpretation can be rephrased in the following way: the density function
f_X(x) is the probability per unit length near x.

The basic procedure for obtaining the density function directly is:
1. Use the information given in the problem to find P(X ∈ (x, x + dx)).

2. Drop terms of order (dx)2 or higher.

3. Identify the term multiplied by dx as the density function fX (x).

6.2 Continuous Analogues of Discrete Results


Now, we will return to familiar ground by re-introducing the results from discrete probability
in the continuous case. In most cases, replacing summations with integration over R will
work as intended.

The expectation of a continuous random variable X is:

    E[X] := ∫_R x f_X(x) dx                                                          (6.6)

Similarly, the expectation of a function of X is:

    E[g(X)] := ∫_R g(x) f_X(x) dx                                                    (6.7)

We can continue to use the formula var(X) = E[X^2] − E[X]^2 to obtain the variance of a
continuous random variable. Since integrals are linear, linearity of expectation still holds.

The joint distribution of two continuous random variables X and Y is fX,Y (x, y). The
joint distribution represents everything there is to know about the two random variables.
The joint distribution must satisfy the normalization condition:

    ∫∫ f_{X,Y}(x, y) dx dy = 1                                                       (6.8)

We say that X and Y are independent if and only if the joint density factorizes:

fX,Y (x, y) = fX (x)fY (y) (6.9)

To obtain the marginal distribution of X from the joint distribution, integrate out the
unnecessary variables (in this case, we integrate out Y ):

    f_X(x) = ∫ f_{X,Y}(x, y) dy                                                      (6.10)

The joint distribution can be extended easily to multiple random variables X_1, . . . , X_n . The
joint density satisfies the normalization condition:

    ∫_{R^n} f_{X_1,...,X_n}(x_1, . . . , x_n) dx_1 ··· dx_n = 1                      (6.11)

How do we use the joint distribution to compute probabilities? Consider a region J ⊆ R^2 .
The probability that (X, Y ) ∈ J is:

    P((X, Y ) ∈ J) := ∫_J f_{X,Y}(x, y) dx dy                                        (6.12)

In other words, to find the probability that (X, Y ) is in a region J, we integrate the joint
density over the region J. As in multivariable calculus, it is often immensely helpful to draw
the region of integration (J) before actually computing the integral.

To calculate the expectation of a function of many random variables, we compute:

    E[g(X_1, . . . , X_n)] = ∫_{R^n} g(x_1, . . . , x_n) f_{X_1,...,X_n}(x_1, . . . , x_n) dx_1 ··· dx_n    (6.13)

6.2.1 Tail Sum Formula


There is a continuous analogue of the tail sum formula.

Theorem 6.1 (Continuous Tail Sum Formula). Let X be a non-negative random vari-
able. Then:

    E[X] = ∫_0^∞ (1 − F_X(x)) dx                                                     (6.14)

Proof.

    E[X] = ∫_0^∞ x f_X(x) dx = ∫_0^∞ ∫_0^x f_X(x) dt dx = ∫_0^∞ ∫_t^∞ f_X(x) dx dt
         = ∫_0^∞ P(X > t) dt = ∫_0^∞ (1 − F_X(t)) dt

The proof is quite similar to the discrete case. Interchanging the order of integration is
justified by Fubini's Theorem because the integrand is non-negative.

Example 6.2. Suppose that c > 0 and X ≥ 0. What is E[min(c, X)]?

We use the Tail Sum Formula: if x < c, then P(min(c, X) > x) = 1 − F_X(x), and if
x ≥ c, then P(min(c, X) > x) = 0.

    E[min(c, X)] = ∫_0^∞ P(min(c, X) > x) dx = ∫_0^c (1 − F_X(x)) dx

In this example, min(c, X) is partly discrete and partly continuous. We say that min(c, X)
is an example of a mixed random variable.
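For a quick numerical check (my own sketch, not from the notes), take X ∼ Expo(1), for which 1 − F_X(x) = e^{−x}, and compare a simulation of E[min(c, X)] with the tail-sum integral ∫_0^c e^{−x} dx = 1 − e^{−c}; the value of c is arbitrary.

import numpy as np

rng = np.random.default_rng(4)
c = 1.5
X = rng.exponential(1.0, size=500_000)
print("simulated:", np.minimum(c, X).mean(), " tail-sum integral:", 1 - np.exp(-c))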

6.3 Important Continuous Distributions


6.3.1 Uniform Distribution
The first distribution we will look at in detail is the Unif(0, 1) distribution, which means
that X is chosen uniformly randomly in the interval (0, 1). The property of being uniform
means that the probability of choosing a number within an interval should only depend on
the length of the interval. We can produce this by requiring the density function to equal a
constant c in the interval (0, 1). Of course, since the density must be normalized to 1, then
c = 1 and the density function is

    f_X(x) = 1,     0 < x < 1.                                                       (6.15)

The CDF is found by integrating:

    F_X(x) = 0,     x < 0,
             x,     0 < x < 1,
             1,     x > 1.                                                           (6.16)

Similarly, suppose that X and Y are i.i.d. Unif(0, 1) random variables. Since they are
independent, the joint distribution is simply the product of their respective density functions:

fX,Y (x, y) = fX (x)fY (y) = 1, 0 < x < 1, 0 < y < 1

The uniform distribution is especially simple because the density is constant. Suppose we
want to find the probability that (X, Y ) will lie in a region J inside the unit square. We can
integrate the joint density to obtain:

    P((X, Y ) ∈ J) = ∫_J f_{X,Y}(x, y) dx dy = ∫_J dx dy = Area(J)

The procedure for computing probabilities is therefore very simple. To find the probability
of an event involving two i.i.d. Unif(0, 1) random variables X and Y , draw the unit square,

and shade in the region in the square which corresponds to the given event. The area of
the shaded region is the desired probability. As a result, many questions involving uniform
distributions have very geometrical solutions.

To compute the expectation, we can simply observe that the distribution is symmetric about
x = 1/2, so we can immediately write down 1/2 as the expectation. To make the argument
slightly more formal, consider the random variable Y = 1 − X. Observe that Y and X have
identical distributions (but they are not independent). Therefore, E[X] = E[Y ] = E[1 − X],
or E[X] = 1 − E[X] using linearity. Hence,

    E[X] = 1/2.                                                                      (6.17)

One question you may have is: why are X and Y identically distributed? To answer that
question in a rigorous way, we will need the change of variables formula in order to show that
X and Y have the same density function (and hence the same distribution). For now, accept
the intuition! (Alternatively, carry out the integral E[X] = ∫_0^1 x dx, but that's boring.)

We compute variance in the standard way:

    E[X^2] = ∫_0^1 x^2 dx = [x^3/3]_0^1 = 1/3

Then, use var(X) = E[X^2] − E[X]^2 = 1/3 − 1/4 to show that

    var(X) = 1/12.                                                                   (6.18)

6.3.2 Exponential Distribution


We will search for a continuous analogue of the geometric distribution. Recall that the geo-
metric distribution satisfied the Memoryless Property: P (X > s + t | X > s) = P (X > t).
In one sense, the Memoryless Property defines the geometric distribution, and we can look
for a continuous distribution that satisfies the same property.

Theorem 6.3. Let T be a random variable satisfying the Memoryless Property. Then
T has the density function

    f_T(t) = λ e^{−λt},   t > 0,                                                     (6.19)

where λ > 0 is a parameter. We say that T follows the exponential distribution, or
T ∼ Expo(λ).

Proof. We search for a differentiable CDF which satisfies the Memoryless Property, which
will uniquely specify the density.

Let G_T(t) := 1 − F_T(t) be the survival function of T . Then

    f_T(t) dt = P(T ∈ (t, t + dt)) = P(T > t, T < t + dt) = P(T < t + dt | T > t) P(T > t)
              = (1 − P(T > t + dt | T > t)) P(T > t) = (1 − P(T > dt)) P(T > t)
              = (1 − G_T(dt)) G_T(t).

Notice the use of the Memoryless Property. Consider the initial condition of G_T(t): we
know that G_T(0) = P(T > 0) = 1. Additionally, since dt is a small quantity, we can take
the Taylor expansion of G_T(dt) about t = 0 and drop terms of order (dt)^2 and higher.

    f_T(t) dt ≈ (1 − (G_T(0) + G_T′(0) dt)) G_T(t) = −G_T′(0) G_T(t) dt

Recall that

    (d/dt) G_T(t) = (d/dt)(1 − F_T(t)) = −f_T(t).

We obtain a differential equation for G_T(t):

    (d/dt) G_T(t) = G_T′(0) G_T(t)

The solutions to the differential equation are all of the form G_T(t) = c e^{−λt}, where c ∈ R
is a constant and we wrote −λ = G_T′(0). Using the initial condition G_T(0) = 1, we see
that c = 1, so that G_T(t) = e^{−λt}. The CDF is

    F_T(t) = 1 − e^{−λt},   t > 0                                                    (6.20)

The density is easily obtained by differentiating. Note that the condition λ > 0 arises
because if λ < 0, then the density function does not satisfy the condition of being non-
negative. When λ = 0, the density is f_T(t) = 0 and cannot normalize to 1, so λ = 0 is
also not a valid choice. Through this derivation, we have also shown the uniqueness of
the exponential distribution.
In the lecture notes and slides, it was shown that the exponential distribution has the Mem-
oryless Property. Here, we have shown that a density with the Memoryless Property is
exponential. Hence, we have proven both directions: a continuous probability distribution is
exponential if and only if it satisfies the Memoryless Property! This is a remarkable result,
and it is also true for the geometric distribution in the discrete case (although we have not
proved it). The Memoryless Property should be thought of as the defining characteristic of
the exponential distribution.

We will proceed to compute the basic properties of the exponential distribution. First, check
for yourself that the exponential distribution is properly normalized:

    ∫_0^∞ λ e^{−λt} dt = 1                                                           (6.21)

We claim that the normalization condition itself is already most of the work necessary to
find the expectation of T . Normally, we would need to calculate

    E[T] = ∫_0^∞ t λ e^{−λt} dt,

which can be solved using integration by parts. Instead, here is a trick:

    E[T] = ∫_0^∞ t λ e^{−λt} dt = −λ ∫_0^∞ (d/dλ) e^{−λt} dt = −λ (d/dλ) ∫_0^∞ e^{−λt} dt = −λ (d/dλ)(1/λ) = 1/λ

Therefore,

    E[T] = 1/λ.                                                                      (6.22)

Similarly, the variance is computed by finding

    E[T^2] = ∫_0^∞ t^2 λ e^{−λt} dt = λ ∫_0^∞ (d^2/dλ^2) e^{−λt} dt = λ (d^2/dλ^2) ∫_0^∞ e^{−λt} dt = λ (d^2/dλ^2)(1/λ) = λ (2/λ^3) = 2/λ^2 .

Hence, the variance is var(T) = 2/λ^2 − 1/λ^2 , or

    var(T) = 1/λ^2 .                                                                 (6.23)

Example 6.4. For the exponential distribution, we can also calculate the median, that
is, the t such that P(T ≥ t) = 1/2 = P(T ≤ t). We have 1/2 = e^{−λt}, so ln 2 = λt,
and we find that the median is (ln 2)/λ. We can see that, compared to the mean, there is
an additional factor of ln 2. In the subject of chemistry, this median is also known as the
half-life of a radioactive substance.
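These formulas are easy to spot-check numerically (a quick sketch of my own, not in the original notes; the value of λ is arbitrary):

import numpy as np

rng = np.random.default_rng(5)
lam = 2.0
T = rng.exponential(1 / lam, size=500_000)
print(T.mean(), 1 / lam)                   # mean 1/lambda
print(T.var(), 1 / lam**2)                 # variance 1/lambda^2
print(np.median(T), np.log(2) / lam)       # median (ln 2)/lambda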

6.3.3 The Minimum & Maximum of Exponentials


Theorem 6.5. Let T_1, . . . , T_n be independent exponential random variables with param-
eters λ_1, . . . , λ_n respectively. Then the minimum of the random variables is also expo-
nentially distributed:

    min{T_1, . . . , T_n} ∼ Expo(λ_1 + ··· + λ_n)                                    (6.24)

Proof. The easiest way to prove this is to once again consider the survival function
G_T(t) = P(T > t).

    P(min{T_1, . . . , T_n} > t) = P(T_1 > t, . . . , T_n > t) = P(T_1 > t) ··· P(T_n > t)
                                 = e^{−λ_1 t} ··· e^{−λ_n t} = e^{−(λ_1 + ··· + λ_n) t}

We have the survival function of an exponential distribution with parameter λ_1 + ··· + λ_n .


Example 6.6. Suppose that we have n i.i.d. Expo(1) random variables. What is the
expectation of the maximum?

View the exponential random variables as representing the lifetimes of n light bulbs. The
expectation of the maximum is the expected time for all n light bulbs to die. This is
simply the expected time until the first light bulb dies, plus the expected time it takes for
the remaining n1 light bulbs to die. We can compute each of these quantities separately.

The first light bulb to die is the minimum of n Expo(1) random variables. By Theo-
rem 6.5, the minimum of the light bulbs is Expo(n), with mean 1/n. Let S_n be the
time for all n light bulbs to die. Once the first light bulb dies, we wait for n − 1 light
bulbs to die. By the Memoryless Property, however, this is the same as if we had started
with n − 1 i.i.d. Expo(1) random variables and asked for the maximum. In other words,
E[S_n] = 1/n + E[S_{n−1}]. By solving this recurrence, we obtain

    E[S_n] = Σ_{k=1}^n 1/k = H_n ≈ ln n + γ,

where γ is the Euler-Mascheroni constant.
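A quick simulation (my own sketch, not from the notes) confirms the harmonic-number formula; n and the number of trials are arbitrary.

import numpy as np

rng = np.random.default_rng(6)
n, trials = 10, 200_000
T = rng.exponential(1.0, size=(trials, n))   # trials x n table of Expo(1) lifetimes
H_n = sum(1 / k for k in range(1, n + 1))
print("simulated E[max]:", T.max(axis=1).mean(), " H_n:", H_n)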

For the next example, imagine a store with two service lines. Two customers are already
waiting, one in each of the two lines; call these customers A and B respectively. A third
customer, C, arrives, and waits for the first available teller. Assume that the two lines have
service times which are exponentially distributed, with parameters λ_1 and λ_2 respectively.

Example 6.7. Assume that λ_1 = λ_2 = λ. What is the probability that customer C is
the last to leave the store?

Consider the time at which the first customer leaves the store. The other customer who
is still waiting in line, by the Memoryless Property, has a service time distributed as a
brand-new Expo(λ) distribution. Customer C enters the service line which is now free,
so the service time for customer C is also Expo(λ). By symmetry, the probability that
customer C will be the last to leave is 1/2.
Chapter 7

Continuous Probability II

We continue our study of continuous random variables with more advanced tools: conditional
probability, change of variables, convolution. We then introduce more distributions, includ-
ing the all-important normal distribution (with the associated Central Limit Theorem).

7.1 Conditional Probability


7.1.1 Law of Total Probability
Suppose we have an event A and we would like to calculate P(A) by using the Law of Total
Probability. Specifically, we would like to condition on X = x. In the discrete case, we
would compute P(A) = Σ_x P(A | X = x) P(X = x), but we cannot take a summation over
uncountably many values. The solution is to instead integrate over the density:

    P(A) = ∫ P(A | X = x) f_X(x) dx                                                  (7.1)

Example 7.1. Suppose X ∼ Unif[0, 1] and Y ∼ Unif[0, X]. That is, conditioned on
X = x, Y has a Unif[0, x] distribution. What is P(Y > 1/2)?

First, we compute P(Y > 1/2 | X = x): in a Unif[0, x] distribution, we have

    P(Y > 1/2 | X = x) = (x − 1/2)/x = 1 − 1/(2x),   x > 1/2.

We integrate over values of x > 1/2.

    P(Y > 1/2) = ∫ P(Y > 1/2 | X = x) f_X(x) dx = ∫_{1/2}^1 (1 − 1/(2x)) dx
               = [x − (1/2) ln x]_{x=1/2}^{x=1} = (1/2)(1 − ln 2)
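As a sanity check (my own sketch): sample X uniformly, then Y uniformly on [0, X], and estimate P(Y > 1/2).

import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=1_000_000)
Y = rng.uniform(0, X)                     # Y | X = x  ~  Unif[0, x]
print((Y > 0.5).mean(), 0.5 * (1 - np.log(2)))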


7.1.2 Conditional Density


The conditional density of Y given X = x is:

    f_{Y|X}(y | x) := f_{X,Y}(x, y) / f_X(x)                                         (7.2)

Note that the equation is superficially very similar to the definition of conditional probability
for the discrete case. There are some subtleties, however. The function f_{Y|X}(y | x) is a
function of y only. However, to calculate the conditional density, we divide a function of x
and y by a function of x, so in most cases, our function f_{Y|X}(y | x) will actually involve the
variable x as well! What is going on here? The answer is that x is a parameter of the function,
not an argument. For each possible choice of x, we have a different conditional density,
and each conditional density is a function of y only. As an illustration, the normalization
condition for the conditional density is:

    ∫_R f_{Y|X}(y | x) dy = 1                                                        (7.3)

In fact, we can prove this fact quite easily.


Proof.

    ∫_R f_{Y|X}(y | x) dy = ∫_R f_{X,Y}(x, y)/f_X(x) dy = (1/f_X(x)) ∫_R f_{X,Y}(x, y) dy = f_X(x)/f_X(x) = 1

As another illustration, note that x is just a number, e.g. x = 0. Then we can write

    f_{Y|X}(y | 0) = f_{X,Y}(0, y) / f_X(0).

Now it should be very clear that f_{Y|X}(y | 0) is indeed a function of y only.

Similarly, the conditional CDF of Y given X = x is:

    F_{Y|X}(y | x) := P(Y ≤ y | X = x) = ∫_{−∞}^y f_{Y|X}(y′ | x) dy′                (7.4)

We can compute the probability of an event [a, b] by

    P(Y ∈ [a, b] | X = x) = F_{Y|X}(b | x) − F_{Y|X}(a | x) = ∫_a^b f_{Y|X}(y | x) dy.   (7.5)

7.2 Functions of Random Variables


7.2.1 Change of Variables
Often, we wish to find the density of a function of a random variable, such as the density of
X 2 (assuming that we already know fX (x)). The problem can be solved in a satisfying and
general way.

Theorem 7.2 (Change of Variables). Let X be a random variable, let g : R → R be
one-to-one, and let Y = g(X). Let h be the inverse of g (h exists since g is one-to-one). Then:

    f_Y(y) = f_X(h(y)) |h′(y)|                                                       (7.6)

Proof. The method I prefer to use is to first manipulate the CDF and then differentiate.
First, we consider the case when g is strictly increasing.

    F_Y(y) = P(Y < y) = P(g(X) < y) = P(X < h(y)) = F_X(h(y))

Differentiate both sides:

    f_Y(y) = f_X(h(y)) h′(y)

Since h is strictly increasing, then h′(y) > 0 and the change of variables equation holds.
Now consider the case in which g is strictly decreasing. Now that h is strictly decreasing,
applying h to both sides of the inequality also flips the direction of the inequality. Hence

    F_Y(y) = P(X > h(y)) = 1 − P(X < h(y)) = 1 − F_X(h(y)).

Differentiate both sides:

    f_Y(y) = −f_X(h(y)) h′(y)

Here, since h is strictly decreasing, then h′(y) < 0 and therefore |h′(y)| = −h′(y). Hence,
the change of variables formula still holds.

My advice is to not bother remembering the change of variables formula. Instead, remember
the basic outline of the proof: write down the CDF of Y , then write the expression in terms
of the CDF of X, and then differentiate.

Why did we assume that the function g was one-to-one? Only out of convenience: the
condition that g is one-to-one is not necessary for change of variables to work, although the
change of variables formula is somewhat more complicated:

    f_Y(y) = Σ_{x : g(x) = y} f_X(x) / |g′(x)|                                       (7.7)

The idea is that since g is no longer one-to-one, there may be many values of x such that
g(x) = y, so we must sum up over all x such that g(x) = y. Can we define change of variables
in the discrete case? Actually, the discrete case is rather easy.

    P(Y = y) = Σ_{x : g(x) = y} P(X = x)                                             (7.8)

If you think about the above equation, you will realize that we have been using the formula
all along without knowing it.

We will use change of variables in the section about the normal distribution.

7.2.2 Convolution

Here, we explain how to compute the density of Z := X + Y .

    P(Z ∈ (z, z + dz)) = f_Z(z) dz

Note that X + Y = Z ∈ (z, z + dz) is equivalent to the event that Y ∈ (z − x, z − x + dz),
where x ranges over all values of X. Hence, we can find P(Z ∈ (z, z + dz)) by integrating over
the joint density of X and Y .

    f_Z(z) dz = ∫_{−∞}^∞ ∫_{z−x}^{z−x+dz} f_{X,Y}(x, y) dy dx

Since dz is an infinitesimal length, we can assume that f_{X,Y}(x, y) does not change over the
interval of integration for y. The inner integral therefore equals the value of the function,
f_{X,Y}(x, z − x), multiplied by the length of the interval, dz.

    f_Z(z) dz = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dz dx

Therefore, we can identify:

    f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx                                           (7.9)

When X and Y are independent, the formula simplifies to:

    f_Z(z) = ∫_{−∞}^∞ f_X(x) f_Y(z − x) dx                                           (7.10)

This is known as the convolution formula.
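As an illustration (a sketch of the idea, not from the notes), we can approximate the convolution integral on a grid for two i.i.d. Unif(0, 1) random variables and compare it with a simulation of X + Y ; the result is the familiar triangular density on (0, 2).

import numpy as np

rng = np.random.default_rng(8)

# Monte Carlo estimate of the density of Z = X + Y at a few points.
Z = rng.uniform(size=1_000_000) + rng.uniform(size=1_000_000)

def f_Z(z, grid=np.linspace(0, 1, 2001)):
    # Riemann approximation of integral over (0,1) of f_Y(z - x) dx, since f_X = 1 on (0, 1).
    return ((z - grid > 0) & (z - grid < 1)).mean()

for z in (0.5, 1.0, 1.5):
    width = 0.02
    empirical = ((Z > z - width / 2) & (Z < z + width / 2)).mean() / width
    print(z, empirical, f_Z(z))        # triangular density: roughly 0.5, 1.0, 0.5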

7.2.3 Ratios of Random Variables


Next, we will compute the density of Z := X/Y . The event X/Y = Z ∈ (z, z + dz) is equivalent to
X ∈ (yz, yz + y dz) when y > 0, and X ∈ (yz + y dz, yz) when y < 0 (since y < 0 implies
that yz + y dz < yz). Hence, we split up the integral over the joint density into two pieces:

    f_Z(z) dz = ∫_{−∞}^0 ∫_{yz+y dz}^{yz} f_{X,Y}(x, y) dx dy + ∫_0^∞ ∫_{yz}^{yz+y dz} f_{X,Y}(x, y) dx dy

We assume that dz is small so that f_{X,Y}(x, y) is effectively constant over the inner integral.

    f_Z(z) dz = ∫_{−∞}^0 f_{X,Y}(yz, y)(−y dz) dy + ∫_0^∞ f_{X,Y}(yz, y)(y dz) dy = ∫_{−∞}^∞ |y| f_{X,Y}(yz, y) dy dz

Therefore, we obtain:

    f_Z(z) = ∫_{−∞}^∞ |y| f_{X,Y}(yz, y) dy                                          (7.11)

In the special case when X and Y are independent, we have:

    f_Z(z) = ∫_{−∞}^∞ |y| f_X(yz) f_Y(y) dy                                          (7.12)


Example 7.3. Let X and Y be i.i.d. Expo(λ). What is the density of Z = X/Y ?

We carry out the integral, but note that y ranges from 0 to ∞.

    f_Z(z) = ∫_0^∞ y λe^{−λyz} λe^{−λy} dy = (λ/(z + 1)) ∫_0^∞ y λ(z + 1) e^{−λ(z+1)y} dy

We recognize the integral as the expectation of an Expo(λ(z + 1)) distribution, which is
(λ(z + 1))^{−1}. Hence,

    f_Z(z) = 1/(1 + z)^2 ,   z > 0.
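A quick check by simulation (my own sketch): compare the empirical CDF of Z = X/Y with ∫_0^z (1 + t)^{−2} dt = z/(1 + z); note the answer does not depend on λ.

import numpy as np

rng = np.random.default_rng(9)
lam = 3.0
X, Y = rng.exponential(1 / lam, size=(2, 1_000_000))
Z = X / Y
for z in (0.5, 1.0, 2.0):
    print(z, (Z <= z).mean(), z / (1 + z))    # CDF of Z is z/(1+z)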

7.3 Normal Distribution


We turn our attention to one of the most important probability distributions: the standard
normal distribution (also called a Gaussian), which we denote N(0, 1). (The first parameter
is the mean, and the second parameter is the variance.) The density will be proportional to
e^{−x^2/2} (defined over all of R), which has no known elementary antiderivative. Therefore, we
devote the next section to integrating this function.

7.3.1 Integrating the Normal Distribution


Let us find the normalization constant, that is, we seek c ∈ R such that

    c ∫_R e^{−x^2/2} dx = 1.

In fact, we will solve a slightly more general integral by considering e^{−αx^2/2} instead. (We can
always set α = 1 at the end of our computations, after all.) The reason for doing so may
not be clear, but it will actually save us some time later on.

As I mentioned above, e^{−x^2/2} cannot be integrated normally, so we will need to use another
trick (hopefully you find all of these tricks somewhat interesting). The trick here is to
consider the square of the integral:

    I^2 = (∫_R e^{−αx^2/2} dx)(∫_R e^{−αy^2/2} dy) = ∫∫ e^{−α(x^2+y^2)/2} dx dy

Notice that the integral depends only on the quantity x^2 + y^2, which you may recognize as
the square of the distance from the origin. (The variables x and y were chosen suggestively
to bring to mind the picture of integration on the plane R^2.) Therefore, it is natural to
change to polar coordinates with the substitutions

    x^2 + y^2 = r^2 ,                                                                (7.13)
    dx dy = r dr dθ.                                                                 (7.14)

(Don't forget the extra factor of r that arises due to the Jacobian. For more information,
consult a multivariable calculus textbook which develops the theory of integration under
change of coordinates.)

In polar coordinates, the integral can now be evaluated.

    I^2 = ∫_0^{2π} ∫_0^∞ e^{−αr^2/2} r dr dθ = ∫_0^{2π} dθ ∫_0^∞ r e^{−αr^2/2} dr = 2π [−(1/α) e^{−αr^2/2}]_0^∞ = 2π/α

Set α = 1, so that I^2 = 2π. We obtain the surprising result that I = √(2π). The normalized
density is

    f_X(x) = (1/√(2π)) e^{−x^2/2} .                                                  (7.15)

As noted before, there is no elementary antiderivative of the density function, so we cannot
write down the CDF in terms of familiar functions. The CDF of the standard normal
distribution is often denoted:

    Φ(x) := ∫_{−∞}^x (1/√(2π)) e^{−(x′)^2/2} dx′                                     (7.16)

7.3.2 Mean and Variance of the Normal Distribution


We hope to find that the mean and variance of the N(0, 1) distribution are 0 and 1, as we claimed. Here, we verify that this is the case. First, notice that the density f_X(x) depends only on x², so interchanging x ↦ −x leaves the density unchanged. The density is therefore symmetric about x = 0, and we write down

E[X] = 0.

The variance is slightly trickier.


$$E[X^2] = \int_{-\infty}^{\infty} x^2\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^2 e^{-x^2/2}\,dx$$
Although the integral smells like integration by parts, recall how we managed to avoid
integration by parts in a similar integral when we computed the mean and variance of the
exponential distribution. We would like to apply a similar trick here, so let us instead consider the integral of e^{−βx²/2} (with the intention of setting β = 1 at the end of our computations).

$$E[X^2] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^2 e^{-\beta x^2/2}\,dx = \frac{1}{\sqrt{2\pi}}\,(-2)\frac{d}{d\beta}\int_{-\infty}^{\infty} e^{-\beta x^2/2}\,dx = -\frac{2}{\sqrt{2\pi}}\frac{d}{d\beta}\sqrt{\frac{2\pi}{\beta}} = -2\left(-\frac{1}{2}\beta^{-3/2}\right) = \beta^{-3/2}$$

Again, set β = 1, and since E[X]² = 0,

var(X) = 1.

Hopefully, it should be clear by now why we need the β: it is simply a tool that we use to
integrate more easily, and then we discard it after we finish the actual integration. In any
case, we have verified that the mean and variance are indeed 0 and 1 respectively.

We now apply the change of variables technique to the standard normal distribution, both as
an illustration of the technique, and also to obtain the general form of the normal distribution.
Consider the function g(x) = μ + σx (with μ ∈ ℝ and σ > 0). Let Y = g(X). We proceed to find the density of Y. First, note that the inverse function h = g⁻¹ is

$$h(y) = \frac{y - \mu}{\sigma}$$

and

$$|h'(y)| = \frac{1}{\sigma}.$$

Then we have that

$$f_Y(y) = f_X(h(y))\,|h'(y)| = \frac{1}{\sigma}\,f_X\!\left(\frac{y - \mu}{\sigma}\right).$$

Plugging (y − μ)/σ into the standard normal density yields

$$f_Y(y) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(y-\mu)^2/(2\sigma^2)}. \qquad (7.17)$$

What are the mean and variance of Y? Recall that Y = μ + σX. Using the basic properties of linearity and scaling,

E[Y] = μ,
var(Y) = σ².

Y is a normal random variable with mean μ and variance σ², which is denoted Y ∼ N(μ, σ²).


We have seen that any normal distribution is obtained from the standard normal distribution by the following two-step procedure: scale by σ and shift by μ.
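This scale-and-shift recipe is also how one would sample from a general normal distribution given standard normal draws; the sketch below (with illustrative values μ = 3, σ = 2 that are not from the notes) is only a numerical check of the change of variables.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0                 # illustrative parameters
z = rng.standard_normal(100_000)     # Z ~ N(0, 1)
y = mu + sigma * z                   # Y = mu + sigma * Z ~ N(mu, sigma^2)

print(y.mean(), y.var())             # approximately mu and sigma^2
```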

7.3.3 Sums of Independent Normal Random Variables


In this section, we prove the crucial fact that the sum of independent normal random vari-
ables is also normally distributed. The theorem is breathtaking, but let us precede the
theorem with some discussion about the joint density of two normal random variables.

Let X, Y be i.i.d. standard normal random variables. Since they are independent, their joint
density is the product of the individual densities.
$$f_{X,Y}(x, y) = f_X(x)\,f_Y(y) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\cdot\frac{1}{\sqrt{2\pi}}\,e^{-y^2/2} = \frac{1}{2\pi}\,e^{-(x^2+y^2)/2}$$

Observe that the joint density only depends on x² + y² = r², which is the square of the distance from the origin. (In fact, we already noticed this property when we were integrating the Gaussian.) In polar coordinates, the density only has a dependence on r, not on θ, which exhibits an important geometric property: the joint density is rotationally symmetric; that is, the Gaussian looks exactly the same if you rotate your coordinate axes. How can we utilize this geometric property to prove our result?

Let Z = X + Y . Consider FZ (z) = P (Z < z). We can write

FZ (z) = P (Z < z) = P (X + Y < z).

To compute this probability, we integrate the joint density


$$F_Z(z) = \iint_J f_{X,Y}(x, y)\,dx\,dy$$

where J is the region

$$J := \{(x, y) : x + y < z\} \subseteq \mathbb{R}^2.$$
We know that x + y = z is a line in ℝ². Perhaps we can align our coordinate axes with the line x + y = z, and the integral will be simplified. Consider the coordinate system

$$x = \frac{1}{\sqrt{2}}x' - \frac{1}{\sqrt{2}}y', \qquad y = \frac{1}{\sqrt{2}}x' + \frac{1}{\sqrt{2}}y'.$$

Under these coordinates, the region J is

$$J = \{(x', y') : \sqrt{2}\,x' < z\} \subseteq \mathbb{R}^2.$$

Since the joint density is invariant under rotations, changing variables from x ↦ x' and y ↦ y' should not affect the value of the integral. Hence,

$$F_Z(z) = P(Z < z) = \iint_J f_{X,Y}(x', y')\,dx'\,dy' = \int_{-\infty}^{z/\sqrt{2}} f_X(x')\,dx'\int_{-\infty}^{\infty} f_Y(y')\,dy' = \int_{-\infty}^{z/\sqrt{2}} f_X(x')\,dx' = P(X < z/\sqrt{2}) = P(\sqrt{2}\,X < z).$$

We see that Z has the same distribution as √2 X, where X ∼ N(0, 1). From our previous section, we saw that scaling a standard normal random variable by √2 yields the N(0, 2) distribution. However, X ∼ N(0, 1) and Y ∼ N(0, 1), and we have just found that Z = X + Y ∼ N(0, 2). Could summing up independent normal random variables really be as easy as adding their parameters?

Lemma 7.4 (Rotational Invariance of the Gaussian). Let X, Y be i.i.d. standard normal random variables and let α, β ∈ [0, 1] satisfy α² + β² = 1. Then αX + βY ∼ N(0, 1).

Proof. We simply extend the previous argument to an arbitrary rotation. By assumption, we can write down

sin(θ) = β,
cos(θ) = α

for some θ ∈ [0, π/2]. We can write the CDF of Z = αX + βY as

$$F_Z(z) = P(Z < z) = P(\alpha X + \beta Y < z) = \iint_J f_{X,Y}(x, y)\,dx\,dy$$

where

$$J = \{(x, y) : x\cos(\theta) + y\sin(\theta) < z\} \subseteq \mathbb{R}^2.$$

Under the change of coordinates

$$x = x'\cos(\theta) - y'\sin(\theta), \qquad y = x'\sin(\theta) + y'\cos(\theta),$$

the area of integration becomes

$$J = \{(x', y') : x' < z\} \subseteq \mathbb{R}^2.$$

Hence we conclude that

$$F_Z(z) = P(Z < z) = \iint_J f_{X,Y}(x', y')\,dx'\,dy' = \int_{-\infty}^{z} f_X(x')\,dx'\int_{-\infty}^{\infty} f_Y(y')\,dy' = \int_{-\infty}^{z} f_X(x')\,dx' = P(X < z).$$

Z has the same distribution as X, and X ∼ N(0, 1), so we're done.

Theorem 7.5 (Sums of Independent Normal Random Variables). Let X₁, ..., Xₙ be independent random variables with Xᵢ ∼ N(μᵢ, σᵢ²). Then:

$$X := X_1 + \cdots + X_n \sim N\!\left(\sum_{i=1}^{n}\mu_i,\ \sum_{i=1}^{n}\sigma_i^2\right) \qquad (7.18)$$

Proof. We will prove the result for the sum of two independent normal random variables. The general result follows as a quick exercise in induction. Let Zᵢ = (Xᵢ − μᵢ)/σᵢ be the standardized form of Xᵢ. Then Zᵢ ∼ N(0, 1). Additionally, observe that

$$Z = \sqrt{\frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}}\,\frac{X_1 - \mu_1}{\sigma_1} + \sqrt{\frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}}\,\frac{X_2 - \mu_2}{\sigma_2} = \frac{X_1 + X_2 - (\mu_1 + \mu_2)}{\sqrt{\sigma_1^2 + \sigma_2^2}}.$$

Apply Lemma 7.4 to Z to obtain that Z ∼ N(0, 1). Finally, since

$$X = X_1 + X_2 = \mu_1 + \mu_2 + \sqrt{\sigma_1^2 + \sigma_2^2}\,Z,$$

it follows from change of variables that

$$X \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2).$$

The theorem is beautiful, so do not misuse it! The most common mistake that students
make is that they forget the important rule: variances add, standard deviations do not.
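A quick simulation (with illustrative parameters that are not from the notes) shows the rule in action: the variances add, so the standard deviation of the sum is the square root of the summed variances, not the sum of the standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(1.0, 2.0, 100_000)    # N(1, 4)
x2 = rng.normal(-2.0, 3.0, 100_000)   # N(-2, 9)
s = x1 + x2

print(s.mean())   # approximately 1 + (-2) = -1
print(s.var())    # approximately 4 + 9 = 13, so std ~ sqrt(13), not 2 + 3
```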

7.3.4 Tail Probability Bounds


In this section, we find bounds on the tail probability, because the CDF of the Gaussian is
not always easy to use.

Proposition 7.6.

$$(x^{-1} - x^{-3})\,e^{-x^2/2} \le \int_x^{\infty} e^{-z^2/2}\,dz \le x^{-1}e^{-x^2/2} \qquad (7.19)$$

Proof. For the left inequality, we integrate the following integral by parts.

$$\begin{aligned}
\int_x^{\infty}(1 - 3z^{-4})\,e^{-z^2/2}\,dz &= \int_x^{\infty} e^{-z^2/2}\,dz + \int_x^{\infty}(-3z^{-4})\,e^{-z^2/2}\,dz \\
&= \int_x^{\infty} e^{-z^2/2}\,dz + \left[z^{-3}e^{-z^2/2}\right]_{z=x}^{\infty} + \int_x^{\infty} z^{-2}e^{-z^2/2}\,dz \\
&= \int_x^{\infty} e^{-z^2/2}\,dz - x^{-3}e^{-x^2/2} - \left[z^{-1}e^{-z^2/2}\right]_{z=x}^{\infty} - \int_x^{\infty} e^{-z^2/2}\,dz \\
&= (x^{-1} - x^{-3})\,e^{-x^2/2}
\end{aligned}$$

Therefore, one has

$$(x^{-1} - x^{-3})\,e^{-x^2/2} = \int_x^{\infty}(1 - 3z^{-4})\,e^{-z^2/2}\,dz \le \int_x^{\infty} e^{-z^2/2}\,dz.$$

For the right inequality, change variables to z = x + y.

$$\int_x^{\infty} e^{-z^2/2}\,dz = \int_0^{\infty} e^{-(x+y)^2/2}\,dy = e^{-x^2/2}\int_0^{\infty} e^{-xy}\,\underbrace{e^{-y^2/2}}_{\le 1}\,dy \le e^{-x^2/2}\int_0^{\infty} e^{-xy}\,dy = e^{-x^2/2}\left[-\frac{1}{x}e^{-xy}\right]_{y=0}^{\infty} = x^{-1}e^{-x^2/2}$$

We have shown that, asymptotically, the tail probability of the Gaussian distribution behaves as P(Z > z) ∼ (2π)^{−1/2} z^{−1} e^{−z²/2} as z → ∞.
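To get a feel for how tight the bounds in (7.19) are, one can evaluate all three expressions numerically; the sketch below uses SciPy's standard normal survival function for the middle term and is purely illustrative.

```python
import numpy as np
from scipy.stats import norm

for x in [1.0, 2.0, 3.0, 4.0]:
    lower = (1/x - 1/x**3) * np.exp(-x**2 / 2)
    exact = np.sqrt(2 * np.pi) * norm.sf(x)   # integral of e^{-z^2/2} from x to infinity
    upper = (1/x) * np.exp(-x**2 / 2)
    print(x, lower, exact, upper)
```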

7.4 Central Limit Theorem


Why do we care so much about the normal distribution? It certainly has nice properties, but
the exponential distribution is also easy to work with and models many real-life situations
very well (e.g. particle decay). The normal distribution, however, goes even farther: I claim
that it allows us to model any situation whatsoever. In what sense can such a bold statement
possibly be true?

First, let us make the statement precise. Suppose (Xᵢ) is a sequence of i.i.d. random variables with mean μ and finite moments. Define

$$\bar{X}_n := \frac{1}{n}\sum_{i=1}^{n} X_i.$$

Then as n → ∞, √n(X̄ₙ − μ) converges in distribution to a normal distribution.

Remember that the Weak Law of Large Numbers tells us that as we increase the number of samples, the sample mean converges in probability to the expected value. The Central Limit Theorem is stronger: it states that the distribution of the sample mean also converges to
a particular distribution, and it is our good friend the normal distribution! The power of
the theorem is that we can start from any distribution at all, discrete or continuous, yet the
sample mean will still converge to a single distribution. In fact, there are even versions of the
Central Limit Theorem with weaker assumptions. Of all of the theorems in mathematics,
the Central Limit Theorem is one of my favorites.

The Central Limit Theorem is the basis for much of modern statistics. When statisticians
carry out hypothesis testing or construct confidence intervals, they do not care that they do
not know the exact distribution of their data. They simply collect enough samples (30 is
usually sufficient) until their sampling distributions are roughly normally distributed.

We have not answered questions such as "what does it mean for a distribution to converge?" and "why is the Central Limit Theorem true?" The former question requires a more rigorous
study of distributions. The latter question will only be answered partially below.

7.4.1 Confidence Intervals Revisited


Using the CLT, we can obtain much tighter confidence intervals than Chebyshev's Inequality.

Example 7.7. Suppose that the Xᵢ are i.i.d. with mean μ and standard deviation σ. Let X̄ be the sample average as before. The CLT tells us that Zₙ = (X̄ − μ)/(σ/√n) → N(0, 1) as n → ∞. Now, view the Xᵢ as observations, and we will construct a 95% confidence interval for μ. First, we relate the desired probability to Zₙ, and then make use of the fact that Zₙ ∼ N(0, 1) (approximately).


 
$$P(\mu \in (\bar{X} - a, \bar{X} + a)) = P(|\bar{X} - \mu| \le a) = P\!\left(\frac{|\bar{X} - \mu|}{\sigma/\sqrt{n}} \le \frac{a}{\sigma/\sqrt{n}}\right) = P\!\left(|Z_n| \le \frac{a}{\sigma/\sqrt{n}}\right) \approx \int_{-a/(\sigma/\sqrt{n})}^{a/(\sigma/\sqrt{n})}\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz$$

We set the above probability to 0.95. Although there is no closed-form expression for the probability, we can plug it into a calculator, which tells us that a/(σ/√n) ≈ 1.96. Therefore, a ≈ 1.96σ/√n, and our confidence interval is (X̄ − 1.96σ/√n, X̄ + 1.96σ/√n).
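A minimal sketch of this computation, assuming (for illustration only) that σ is known; the data are drawn from an exponential distribution precisely to emphasize that the CLT does not care about the underlying shape.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
sigma = 1.0                        # true std of Expo(1), assumed known for illustration
data = rng.exponential(1.0, n)     # true mean is 1

xbar = data.mean()
half_width = 1.96 * sigma / np.sqrt(n)
print((xbar - half_width, xbar + half_width))   # covers 1 in roughly 95% of repetitions
```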

7.4.2 de Moivre-Laplace Approximation


Recall that the binomial distribution is the sum of i.i.d. Bernoulli trials, so we may apply the CLT. We can approximate the binomial distribution fairly well by the normal distribution, especially when n is large. Suppose that X ∼ Bin(n, p). A reasonable approach to approximating the probability P(a ≤ X ≤ b) would be

$$P(a \le X \le b) = P(a - np \le X - np \le b - np) = P\!\left(\frac{a - np}{\sqrt{np(1-p)}} \le \frac{X - np}{\sqrt{np(1-p)}} \le \frac{b - np}{\sqrt{np(1-p)}}\right) \approx \Phi\!\left(\frac{b - np}{\sqrt{np(1-p)}}\right) - \Phi\!\left(\frac{a - np}{\sqrt{np(1-p)}}\right).$$

Under this approximation, P(X = x) = P(x ≤ X ≤ x) ≈ 0, which is clearly not correct. The problem lies in our attempt to approximate a discrete distribution with a continuous distribution. It turns out that a better approximation is formed by treating the point P(X = x) as an interval of width 1 in the continuous distribution. In other words, we should treat the point {x} as the interval [x − 1/2, x + 1/2] when we apply our approximation. The correction amounts to

$$P(a \le X \le b) \approx \Phi\!\left(\frac{b - np + 1/2}{\sqrt{np(1-p)}}\right) - \Phi\!\left(\frac{a - np - 1/2}{\sqrt{np(1-p)}}\right). \qquad (7.20)$$

This is commonly known as the de Moivre-Laplace approximation to the binomial


distribution. The correction factor of 1/2 becomes negligible when n is large.
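The effect of the continuity correction can be checked numerically; the sketch below compares the exact binomial probability with both normal approximations using SciPy and illustrative values of n, p, a, b.

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 50, 0.3
a, b = 10, 20
sd = np.sqrt(n * p * (1 - p))

exact = binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)            # P(a <= X <= b)
plain = norm.cdf((b - n*p) / sd) - norm.cdf((a - n*p) / sd)     # no correction
corrected = norm.cdf((b - n*p + 0.5) / sd) - norm.cdf((a - n*p - 0.5) / sd)
print(exact, plain, corrected)
```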

7.5 Order Statistics


Suppose that X₁, ..., Xₙ are i.i.d. random variables with common density f_X(x) and CDF F_X(x). Let X^{(k)} be the kth smallest of X₁, ..., Xₙ, so that X^{(1)} is the minimum of the points and X^{(n)} is the maximum of the points. We can derive the density of X^{(k)}: recall that f_{X^{(k)}}(x) has the interpretation P(X^{(k)} ∈ (x, x + dx)) ≈ f_{X^{(k)}}(x) dx. In order for the kth smallest point to lie in the interval (x, x + dx), we must have the following:

1. k − 1 points must lie in the interval (−∞, x), which has probability (F_X(x))^{k−1}.

2. One point must lie in the interval (x, x + dx), which has probability f_X(x) dx.

3. n − k points must lie in the interval (x + dx, ∞), which has probability (1 − F_X(x))^{n−k}.

In addition, there are n ways to choose which of the n points lies in the interval (x, x + dx), and out of the remaining n − 1 points, there are $\binom{n-1}{k-1}$ ways to choose which points lie in the interval (−∞, x). Hence, we have

$$f_{X^{(k)}}(x)\,dx = P(X^{(k)} \in (x, x + dx)) = n\binom{n-1}{k-1} f_X(x)\,(F_X(x))^{k-1}(1 - F_X(x))^{n-k}\,dx,$$

so our result is:

$$f_{X^{(k)}}(x) = n\binom{n-1}{k-1} f_X(x)\,(F_X(x))^{k-1}(1 - F_X(x))^{n-k} \qquad (7.21)$$
In the special case when the Xᵢ are i.i.d. Unif[0, 1] random variables, we have f_X(x) = 1 and F_X(x) = x for 0 < x < 1, so the density of X^{(k)} is

$$f_{X^{(k)}}(x) = \frac{n!}{(k-1)!\,(n-k)!}\,x^{k-1}(1 - x)^{n-k}, \qquad 0 < x < 1. \qquad (7.22)$$
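A simulation can confirm (7.22). Below, the kth smallest of n uniforms is compared with the mean of the density in (7.22), which works out to k/(n + 1) (this is the mean of the beta distribution discussed in the next section); n and k are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 3
samples = rng.uniform(size=(100_000, n))
kth = np.sort(samples, axis=1)[:, k - 1]   # kth smallest in each row

print(kth.mean(), k / (n + 1))             # (7.22) is Beta(k, n-k+1), whose mean is k/(n+1)
```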

7.6 Beta Distribution


We define the beta distribution by specifying the density function. Let β(r, s) denote the beta distribution with parameters r and s (we will see the significance of the parameters soon). If X ∼ β(r, s), then the density of X is

$$f_X(x) = \frac{1}{B(r, s)}\,x^{r-1}(1 - x)^{s-1}, \qquad 0 < x < 1, \qquad (7.23)$$

where B(r, s) is a normalizing constant. First, we observe that if X^{(k)} is the kth smallest point out of n i.i.d. Unif[0, 1] random variables (see the previous section), then X^{(k)} ∼ β(k, n − k + 1). This provides an interpretation for the parameters of the β(r, s) distribution: there are k points less than or equal to X^{(k)}, and n − k + 1 points greater than or equal to X^{(k)}. By setting n = r + s − 1 and k = r in (7.22), we can obtain an expression for B(r, s) for integer values of r and s:

$$\frac{1}{B(r, s)} = \frac{(r + s - 1)!}{(r - 1)!\,(s - 1)!} \qquad (7.24)$$

Remember that B(r, s) is a normalizing constant for the density (7.23), which implies

$$B(r, s) = \int_0^1 x^{r-1}(1 - x)^{s-1}\,dx = \frac{(r - 1)!\,(s - 1)!}{(r + s - 1)!}. \qquad (7.25)$$
Remembering (7.25) can actually be useful in solving integrals quickly.

We can solve for the moments of the beta distribution:

$$E[X] = \int_0^1 x\,\frac{1}{B(r, s)}\,x^{r-1}(1 - x)^{s-1}\,dx = \frac{B(r + 1, s)}{B(r, s)} = \frac{(r + s - 1)!}{(r - 1)!\,(s - 1)!}\cdot\frac{r!\,(s - 1)!}{(r + s)!} = \frac{r}{r + s},$$

$$E[X^2] = \int_0^1 x^2\,\frac{1}{B(r, s)}\,x^{r-1}(1 - x)^{s-1}\,dx = \frac{B(r + 2, s)}{B(r, s)} = \frac{(r + s - 1)!}{(r - 1)!\,(s - 1)!}\cdot\frac{(r + 1)!\,(s - 1)!}{(r + s + 1)!} = \frac{r(r + 1)}{(r + s)(r + s + 1)},$$

so we have

$$\operatorname{var}(X) = \frac{r(r + 1)}{(r + s)(r + s + 1)} - \frac{r^2}{(r + s)^2} = \frac{rs}{(r + s)^2(r + s + 1)}.$$

7.6.1 Flipping Coins


Note that the beta distribution is supported on (0, 1), so there is a natural interpretation of the beta distribution as a distribution over probability values. Suppose that we have a biased coin, but we do not know what the bias of the coin is. Assume that we have a prior distribution over the bias of the coin: X ∼ β(r, s). Since the expectation of the beta distribution is r/(r + s), this means that we believe the bias of the coin is close to r/(r + s), but we do not have enough information to say the true bias with any certainty. We decide to flip the coin more times in order to get a better idea of the bias, and we obtain h heads and t tails. The question is: what is our posterior distribution of X, that is, how should we update our belief about the bias of the coin?

Of course, the answer is Bayes' Rule, but with continuous random variables. To be more clear, let X ∼ β(r, s) be a random variable which represents our prior belief about the bias of the coin, and let A be the event that we flip h heads and t tails. What is the conditional distribution of X after observing A, f_{X | A}(x)?

Bayes' Rule tells us that the answer is

$$P(X \in (x, x + dx) \mid A) = \frac{P(X \in (x, x + dx) \cap A)}{P(A)} = \frac{P(X \in (x, x + dx))\,P(A \mid X = x)}{P(A)}.$$

When we use the density interpretation, we have

$$f_{X \mid A}(x)\,dx = \frac{f_X(x)\,P(A \mid X = x)\,dx}{P(A)},$$

which gives the equation

$$f_{X \mid A}(x) = \frac{f_X(x)\,P(A \mid X = x)}{P(A)}.$$

We know that f_X(x) = (B(r, s))⁻¹ x^{r−1}(1 − x)^{s−1} for 0 < x < 1. P(A | X = x) is the probability of flipping h heads and t tails if the bias of our coin is x, which is given by the binomial distribution with h + t trials and probability of success x. Therefore,

$$P(A \mid X = x) = \frac{(h + t)!}{h!\,t!}\,x^h(1 - x)^t.$$

To obtain P(A), we should integrate P(A | X = x) against the density of X:

$$P(A) = \int P(A \mid X = x)\,f_X(x)\,dx = \int_0^1\frac{(h + t)!}{h!\,t!}\,x^h(1 - x)^t\,\frac{1}{B(r, s)}\,x^{r-1}(1 - x)^{s-1}\,dx = \frac{(h + t)!}{h!\,t!}\,\frac{1}{B(r, s)}\int_0^1 x^{r+h-1}(1 - x)^{s+t-1}\,dx = \frac{(h + t)!}{h!\,t!}\,\frac{B(r + h, s + t)}{B(r, s)}$$
Putting these pieces together, we obtain

$$f_{X \mid A}(x) = \frac{1}{B(r, s)}\,x^{r-1}(1 - x)^{s-1}\cdot\frac{(h + t)!\,x^h(1 - x)^t}{h!\,t!}\cdot\frac{h!\,t!\,B(r, s)}{(h + t)!\,B(r + h, s + t)} = \frac{1}{B(r + h, s + t)}\,x^{r+h-1}(1 - x)^{s+t-1}, \qquad 0 < x < 1.$$

We have found that the posterior distribution is the β(r + h, s + t) distribution! We now turn to the interpretation of this key result.

We stated that the expectation of the β(r, s) distribution is r/(r + s), which would be our best guess for the bias of the coin if we had observed r heads and s tails in r + s coin flips. The result above states that if we observe h more heads and t more tails, then we now have the β(r + h, s + t) distribution, which is consistent with the following interpretation: the first parameter represents how many heads we have seen, and the second parameter represents how many tails we have seen. The amazing part is that if we start with a belief according to the beta distribution, then we never have to use another distribution: repeated applications of Bayes' Rule will always yield another beta distribution.

Furthermore, our result almost provides an algorithm for estimating the bias of a coin: start
with a beta distribution, and keep flipping coins. If we observe heads, we increment the first
parameter of the beta distribution. If we observe tails, we increment the second parameter
of the beta distribution. Of course, this is an extremely cheap computation, which makes
the beta distribution a convenient family of distributions for the estimation problem.

This leads to a minor problem, which is: how should we initialize the parameters of the
beta distribution? Regardless of the choice of the initial parameters, after enough flips, the
beta distribution will converge to the correct probability. It is true that our choice of initial
parameters certainly influences how long it takes for our distribution to converge. Instead
of worrying about this problem, however, we may view it as another advantage of the beta
distribution: by choosing the initial parameters, we can incorporate our prior belief into
the algorithm. Here are some examples to illustrate this point:

If we take an ordinary coin, we may be reasonably confident that the coin is fair, so we can start with a β(500, 500) distribution. The fact that r = s reflects the fact that we think the coin is fair, and our choice of the number 500 represents the strength of our belief: we believe strongly in the fairness of the coin, so we initialize the beta distribution with 500 observations of heads and 500 observations of tails. On the other hand, if the coin was given to you by a friend, perhaps you believe that your friend is not malicious (so you still believe that the coin is fair), but you do not trust the coin as much as you would an ordinary coin. In this case, you may decide to initialize the algorithm with a β(20, 20) distribution. Finally, suppose that the coin was given to you by a gambler, and you suspect that the coin may be loaded (so that it is more likely to come up heads). In this case, you may encode your belief with a β(40, 20) distribution (your suspicions amount to the same information as having observed 40 heads and 20 tails prior to flipping the coin).
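The update rule is simple enough to express in a few lines of code. The sketch below uses a hypothetical helper `update` (not part of the notes) that maintains the (r, s) parameters as coin flips arrive; the prior β(20, 20) and the true bias 0.7 are illustrative choices.

```python
import numpy as np

def update(r, s, flips):
    """Posterior beta parameters after observing a sequence of flips (1 = heads, 0 = tails)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return r + heads, s + tails

rng = np.random.default_rng(0)
true_bias = 0.7
flips = (rng.uniform(size=1000) < true_bias).astype(int)

r, s = update(20, 20, flips)   # prior beta(20, 20): "probably fair"
print(r / (r + s))             # posterior mean, close to 0.7 after many flips
```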

Example 7.8. Now, let us approach the coin-flipping problem from a different perspective. We flip a coin n times and we are looking for the distribution of the number of heads. However, we do not know the bias of the coin, so we assume a uniform prior over the bias of the coin. Let X ∼ Bin(n, Y) be the number of heads, where Y ∼ Unif[0, 1]. What is the distribution of X?

$$P(X = x) = \int P(X = x \mid Y = y)\,f_Y(y)\,dy = \int_0^1\binom{n}{x}\,y^x(1 - y)^{n-x}\,dy$$

We can solve the integral by using (7.25). We obtain

$$P(X = x) = \frac{n!}{x!\,(n - x)!}\cdot\frac{x!\,(n - x)!}{(n + 1)!} = \frac{1}{n + 1},$$

so we have found that X ∼ Unif{0, ..., n}, which agrees with intuition. If your prior belief about the bias is uniform, then the number of heads is also uniform.
Chapter 8

Moment-Generating Functions

Here, we introduce transforms, which provide powerful conceptual and computational tools
for understanding the distributions of random variables.

8.1 Introduction

Definition 8.1. Let X be a random variable. The moment-generating function of X is:

$$M_X(\theta) = E[e^{\theta X}] \qquad (8.1)$$

In other words, the MGF is the expectation of the random variable e^{θX}. We will return to the computation of the MGF shortly. Our immediate goal is to explain the meaning behind the name "moment-generating function."

Theorem 8.2. Let X be a random variable with MGF M_X(θ). Then:

$$E[X^k] = M_X^{(k)}(0) \qquad (8.2)$$

Proof. We can carry out the computation directly. If X is discrete,

$$M_X^{(k)}(\theta) = \frac{d^k}{d\theta^k}\sum_{x=0}^{\infty} e^{\theta x}\,P(X = x) = \sum_{x=0}^{\infty} x^k e^{\theta x}\,P(X = x).$$

Setting θ = 0, we obtain

$$M_X^{(k)}(0) = \sum_{x=0}^{\infty} x^k\,P(X = x) = E[X^k].$$


If X is continuous, the same proof goes through with sums replaced by integrals.

$$M_X^{(k)}(\theta) = \frac{d^k}{d\theta^k}\int e^{\theta x}\,f_X(x)\,dx = \int x^k e^{\theta x}\,f_X(x)\,dx$$

$$M_X^{(k)}(0) = \int x^k\,f_X(x)\,dx = E[X^k]$$

Technical Remark: In the proof above, we interchanged the order of differentiation and
summation/integration. This is not always valid; there are conditions that must be met in
order to justify this step. However, for the purpose of this exposition, we will assume that
all random variables under consideration are well-behaved so that the above proof holds.

Recall that the Taylor series of a function is given by

$$f(x) = f(0) + \frac{f'(0)}{1!}x + \frac{f''(0)}{2!}x^2 + \cdots$$

The above theorem implies that the expansion of the MGF is (note that M_X(0) = 1)

$$M_X(\theta) = 1 + \frac{E[X]}{1!}\theta + \frac{E[X^2]}{2!}\theta^2 + \cdots$$

Therefore, we can see that the MGF is indeed a generating function, where the kth coefficient is given by E[X^k]/k!. The values E[X^k] are often known as the moments of a distribution, which explains the name.
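The moment-generating property can also be checked symbolically. The sketch below (not from the notes) differentiates the MGF of an Expo(λ) distribution, derived later in (8.12), and evaluates at θ = 0 to recover the mean and variance.

```python
import sympy as sp

theta, lam = sp.symbols('theta lambda', positive=True)
M = lam / (lam - theta)                          # MGF of Expo(lambda), see (8.12) below

first = sp.diff(M, theta).subs(theta, 0)         # E[X]
second = sp.diff(M, theta, 2).subs(theta, 0)     # E[X^2]
print(sp.simplify(first))                         # 1/lambda
print(sp.simplify(second - first**2))             # variance: 1/lambda^2
```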

8.1.1 Sums of Random Variables


Here is one of the reasons why the MGF is so useful for computation.

Theorem 8.3. Let X and Y be independent random variables. Then:

$$M_{X+Y}(\theta) = M_X(\theta)\,M_Y(\theta) \qquad (8.3)$$

Proof.

$$M_{X+Y}(\theta) = E[e^{\theta(X+Y)}] = E[e^{\theta X}e^{\theta Y}] = E[e^{\theta X}]\,E[e^{\theta Y}] = M_X(\theta)\,M_Y(\theta)$$

8.1.2 Distribution of Random Sums


MGFs can be used to compute the distribution of S_N = X₁ + ⋯ + X_N, where N is also a random variable. We can use the Law of Iterated Expectation to compute the MGF of S_N. Assume that the Xᵢ and N are all independent.

$$E[e^{\theta S_N}] = E[E[e^{\theta S_N}\mid N]] = E[e^{\theta X_1}\cdots e^{\theta X_N}] = E[M_{X_i}(\theta)^N] = E[e^{(\ln M_{X_i}(\theta))N}]$$



Compare the above expression to the formula for the MGF of N, M_N(θ) = E[e^{θN}]. We have the following result: M_{S_N}(θ) is found by evaluating M_N at ln M_{X_i}(θ). Alternatively, take M_N(θ) and replace every occurrence of e^θ with M_{X_i}(θ).

$$M_{S_N}(\theta) = M_N(\theta')\Big|_{\theta' = \ln M_{X_i}(\theta)} \qquad (8.4)$$

8.1.3 Cumulants
Definition 8.4. The cumulant-generating function of a random variable X is:

$$C_X(\theta) = \log M_X(\theta) \qquad (8.5)$$

It seems like all we did was take the logarithm of the MGF. Why is this useful?

Theorem 8.5. Let X be a random variable with cumulant-generating function C_X(θ). We have the following facts:

1. C_X(0) = 0.

2. C_X'(0) = E[X].

3. C_X''(0) = var(X).

Proof. 1. Since M_X(0) = 1, then C_X(0) = log M_X(0) = log 1 = 0.

2. We compute C_X'(θ):

$$C_X'(\theta) = \frac{d}{d\theta}\log M_X(\theta) = \frac{M_X'(\theta)}{M_X(\theta)}$$

Plugging in θ = 0, and using M_X'(0) = E[X], we have C_X'(0) = E[X].

3. We compute C_X''(θ):

$$C_X''(\theta) = \frac{d}{d\theta}\frac{M_X'(\theta)}{M_X(\theta)} = \frac{M_X''(\theta)\,M_X(\theta) - (M_X'(\theta))^2}{(M_X(\theta))^2}$$

Now, we use M_X''(0) = E[X²] to obtain C_X''(0) = E[X²] − E[X]² = var(X).

The cumulant-generating function is often useful because taking the logarithm before taking
the derivative often makes calculations simpler. Intuitively, logarithms convert products into
sums, and we would much rather differentiate sums rather than products.

8.2 Common MGFs


8.2.1 Bernoulli Distribution
Let X ∼ Ber(p). The MGF is computed directly:

$$M_X(\theta) = (1 - p)e^{0} + p e^{\theta} = 1 - p + p e^{\theta} \qquad (8.6)$$

8.2.2 Binomial Distribution


Let X ∼ Bin(n, p). Since X is the sum of n i.i.d. Ber(p) random variables, the MGF is

$$M_X(\theta) = (1 - p + p e^{\theta})^n, \qquad (8.7)$$
$$C_X(\theta) = n\log(1 - p + p e^{\theta}). \qquad (8.8)$$

8.2.3 Geometric Distribution


Suppose X ∼ Geom(p). By direct computation,

$$M_X(\theta) = \sum_{x=1}^{\infty} e^{\theta x}\,p(1 - p)^{x-1} = p e^{\theta}\sum_{x=1}^{\infty}\left((1 - p)e^{\theta}\right)^{x-1}.$$

Evaluating the sum using the formula for a geometric series, we have

$$M_X(\theta) = \frac{p e^{\theta}}{1 - (1 - p)e^{\theta}}, \qquad (8.9)$$
$$C_X(\theta) = \theta + \log p - \log(1 - (1 - p)e^{\theta}). \qquad (8.10)$$

8.2.4 Poisson Distribution


Let X ∼ Pois(λ). Then

$$M_X(\theta) = E[e^{\theta X}] = \sum_{x=0}^{\infty} e^{\theta x}\,\frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda}\sum_{x=0}^{\infty}\frac{(\lambda e^{\theta})^x}{x!} = e^{-\lambda}e^{\lambda e^{\theta}} = e^{\lambda(e^{\theta} - 1)}. \qquad (8.11)$$

Let us compute the mean and variance of the Poisson distribution. Taking the logarithm, we have C_X(θ) = λ(e^θ − 1). Differentiating, we obtain

$$E[X] = C_X'(0) = \lambda e^{\theta}\Big|_{\theta=0} = \lambda$$

and

$$\operatorname{var}(X) = C_X''(0) = \lambda e^{\theta}\Big|_{\theta=0} = \lambda.$$

Previously, we proved that if X ∼ Pois(λ) and Y ∼ Pois(μ), then X + Y ∼ Pois(λ + μ). We will prove this fact again in a much simpler way using the MGF.

Theorem 8.6. Suppose that X and Y are independent random variables with distributions X ∼ Pois(λ) and Y ∼ Pois(μ). Then X + Y ∼ Pois(λ + μ).

Proof. Observe:

$$M_{X+Y}(\theta) = M_X(\theta)\,M_Y(\theta) = e^{\lambda(e^{\theta}-1)}\,e^{\mu(e^{\theta}-1)} = e^{(\lambda+\mu)(e^{\theta}-1)}$$

This is precisely the MGF of a Pois(λ + μ) distribution.

Actually, if you were especially observant, you will have noticed that we cheated in the
last line of the above proof. After all, we did not answer the question: does the moment-
generating function completely determine the distribution? Is there potentially another
probability distribution besides Pois( + ) which happens to have the same MGF? The
answer is no, which is to say that the MGF completely specifies the probability distribution.

8.2.5 Exponential Distribution


Suppose X ∼ Expo(λ). For θ < λ, we have

$$M_X(\theta) = \int_0^{\infty} e^{\theta x}\,\lambda e^{-\lambda x}\,dx = \lambda\int_0^{\infty} e^{-(\lambda - \theta)x}\,dx = \frac{\lambda}{\lambda - \theta} \qquad (8.12)$$

and

$$C_X(\theta) = \log\lambda - \log(\lambda - \theta). \qquad (8.13)$$

Example 8.7. What is the distribution of X₁ + ⋯ + X_N, where Xᵢ ∼ Expo(λ) and N ∼ Geom(p) are independent?

We use (8.4).

$$M_{S_N}(\theta) = \left.\frac{p e^{\theta'}}{1 - (1 - p)e^{\theta'}}\right|_{e^{\theta'} = M_{X_i}(\theta)} = \frac{p\lambda(\lambda - \theta)^{-1}}{1 - (1 - p)\lambda(\lambda - \theta)^{-1}} = \frac{p\lambda}{\lambda - \theta - (1 - p)\lambda} = \frac{\lambda p}{\lambda p - \theta}$$

We recognize this as the MGF of an Expo(λp) distribution, so S_N ∼ Expo(λp). We will see an intuitive reason for this connection when we study the Poisson process.
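The conclusion of Example 8.7 is easy to check by simulation; the sketch below sums a Geom(p) number of Expo(λ) variables and compares the sample mean with 1/(λp). The values of λ and p are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p = 2.0, 0.25
trials = 20_000

n = rng.geometric(p, size=trials)                              # N ~ Geom(p), support {1, 2, ...}
sums = np.array([rng.exponential(1 / lam, k).sum() for k in n])

print(sums.mean(), 1 / (lam * p))    # Expo(lam * p) has mean 1/(lam * p)
```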

8.3 Probability Generating Functions


A closely related transform is the probability generating function.

Definition 8.8. Let X be a non-negative integer-valued random variable. The probability generating function associated with X is:

$$\psi_X(\zeta) := E[\zeta^X] \qquad (8.14)$$

Just like the MGF, the probability generating function (PGF) of the sum of i.i.d. random variables is the product of the PGFs:

$$\psi_{X_1+\cdots+X_n}(\zeta) = E[\zeta^{X_1+\cdots+X_n}] = E[\zeta^{X_1}]\cdots E[\zeta^{X_n}] = \psi_{X_1}(\zeta)\cdots\psi_{X_n}(\zeta) \qquad (8.15)$$

We can recover the probability distribution by differentiation. Since

$$\frac{d^k}{d\zeta^k}\psi_X(\zeta) = \frac{d^k}{d\zeta^k}\sum_{n=0}^{\infty}\zeta^n\,P(X = n) = \sum_{n=k}^{\infty}\frac{n!}{(n - k)!}\,\zeta^{n-k}\,P(X = n),$$

we can evaluate the expression at ζ = 0 and divide by k! to obtain

$$P(X = k) = \frac{1}{k!}\left.\frac{d^k}{d\zeta^k}\psi_X(\zeta)\right|_{\zeta=0}.$$

We can compute the moments as well:

$$\left.\frac{d}{d\zeta}\psi_X(\zeta)\right|_{\zeta=1} = \left.\sum_{n=1}^{\infty} n\zeta^{n-1}\,P(X = n)\right|_{\zeta=1} = \sum_{n=1}^{\infty} n\,P(X = n) = E[X]$$

$$\left.\frac{d^2}{d\zeta^2}\psi_X(\zeta)\right|_{\zeta=1} = \left.\sum_{n=2}^{\infty} n(n-1)\zeta^{n-2}\,P(X = n)\right|_{\zeta=1} = \sum_{n=2}^{\infty} n(n-1)\,P(X = n) = E[X(X - 1)]$$

8.3.1 Poisson Distribution


Let us compute the PGF of X ∼ Pois(λ).

$$\psi_X(\zeta) = \sum_{x=0}^{\infty}\zeta^x\,\frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda}\sum_{x=0}^{\infty}\frac{(\lambda\zeta)^x}{x!} = e^{-\lambda}e^{\lambda\zeta} = e^{\lambda(\zeta - 1)} \qquad (8.16)$$

Again, we can compute the mean and variance.

$$E[X] = \left.\frac{d}{d\zeta}e^{\lambda(\zeta-1)}\right|_{\zeta=1} = \left.\lambda e^{\lambda(\zeta-1)}\right|_{\zeta=1} = \lambda$$

$$E[X(X-1)] = \left.\frac{d^2}{d\zeta^2}e^{\lambda(\zeta-1)}\right|_{\zeta=1} = \left.\lambda^2 e^{\lambda(\zeta-1)}\right|_{\zeta=1} = \lambda^2$$

Therefore, the variance is

$$\operatorname{var}(X) = E[X(X-1)] + E[X] - E[X]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$


Chapter 9

Convergence

Here, we take up the subject of convergence of random variables. In the context of probability
theory, convergence is a subtle concept because there are numerous modes of convergence:
convergence in probability, convergence in distribution, almost sure convergence, convergence
in Lp , and more. In addition to discussing the relationships between the various modes of
convergence, we will introduce two of the most famous theorems in probability theory: the
Strong Law of Large Numbers and the Central Limit Theorem.

9.1 Convergence in Probability


We have already met the first notion of convergence: a sequence of random variables Xₙ is said to converge in probability to X if

$$\lim_{n\to\infty} P(|X_n - X| \ge \varepsilon) = 0, \qquad (9.1)$$

for every ε > 0. When the Xₙ are i.i.d. with finite second moment, the WLLN (Theorem 3.13) assures us that the average $\frac{1}{n}\sum_{i=1}^{n} X_i$ converges to the common mean E[Xᵢ] in probability.

9.2 Almost Sure Convergence


A question you may have is: why was the Weak Law of Large Numbers "weak"? Is there a stronger law of large numbers? The answer is yes: there is the notion of almost sure convergence, or convergence with probability 1. We say that Xₙ → X almost surely (abbreviated a.s.) if P(limₙ→∞ Xₙ = X) = 1. Recall that P is a probability measure that assigns probabilities to sets. The set that we are considering here is

$$P\!\left(\lim_{n\to\infty} X_n = X\right) = P\!\left(\left\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right\}\right).$$

In words: the set of outcomes ω such that Xₙ(ω) converges to X(ω) has probability 1. The meaning is rather subtle and powerful. In practice, almost sure convergence guarantees that you can never be unlucky enough to see a sequence Xₙ that does not converge.


Let us see why a.s. convergence is stronger than convergence in probability.

Theorem 9.1. Suppose Xₙ → X a.s. Then Xₙ → X in probability.

Proof. Since Xₙ → X a.s., we know that P(Xₙ ↛ X) = 0. Fix ε > 0 and define Aₙ = {∃ m ≥ n : |Xₘ − X| ≥ ε}. Write A for ⋂ₙ₌₁^∞ Aₙ, so A₁ ⊇ A₂ ⊇ ⋯ ⊇ A. By monotonicity, P(Aₙ) → P(A). Since A is the event that |Xₙ − X| ≥ ε infinitely many times, A ⊆ {Xₙ ↛ X}, so

$$P(|X_n - X| \ge \varepsilon) \le P(A_n) \to P(A) \le P(X_n \not\to X) = 0$$

as n → ∞. Hence, Xₙ → X in probability.

In fact, convergence in probability is strictly weaker than a.s. convergence.

9.2.1 Borel-Cantelli Lemma


The following all-important lemma provides the basic tool for proving a.s. convergence. First, a little bit of notation: "Aₙ i.o." means "Aₙ infinitely often," that is, the event that infinitely many of the Aₙ occur.

Lemma 9.2 (First Borel-Cantelli Lemma). Let Aₙ be events with $\sum_{n=1}^{\infty} P(A_n) < \infty$. Then P(Aₙ i.o.) = 0.

Proof. Let 1_{Aₙ} be the indicator for Aₙ. Then, the event {Aₙ i.o.} is the same as the event $\{\sum_{n=1}^{\infty} 1_{A_n} = \infty\}$. However, the assumption $\sum_{n=1}^{\infty} P(A_n) < \infty$ implies that $E[\sum_{n=1}^{\infty} 1_{A_n}] < \infty$, which means that $P(\sum_{n=1}^{\infty} 1_{A_n} = \infty) = 0$.

Corollary 9.3. Suppose that for all ε > 0, $\sum_{n=1}^{\infty} P(|X_n| \ge \varepsilon) < \infty$. Then Xₙ → 0 a.s.

Proof. Apply Lemma 9.2 to Aₙ = {|Xₙ| ≥ ε}. The probability that |Xₙ| ≥ ε i.o. is 0, which means that |Xₙ| ≥ ε only finitely many times a.s. Since this holds true for all ε > 0, we must have Xₙ → 0 a.s.
A small remark: suppose that P(|Xₙ| ≥ ε) ≈ n⁻¹. Then Xₙ → 0 in probability since n⁻¹ → 0 as n → ∞, but the sum $\sum_{n=1}^{\infty} n^{-1}$ diverges, which means that this bound is too weak to apply Corollary 9.3. We will have to be a little more clever!

9.2.2 Strong Law of Large Numbers


We now have all of the tools necessary to state the SLLN.

Theorem 9.4 (Strong Law of Large Numbers). Let Xᵢ be i.i.d. with E[|Xᵢ|] < ∞. Then $\frac{1}{n}\sum_{i=1}^{n} X_i \to E[X_i]$ a.s.

Notice that the only assumption we made is that E[|Xᵢ|] < ∞. In particular, we do not require var(Xᵢ) to be finite.

The SLLN, as stated above, is complicated to prove. However, we can prove the SLLN more easily under a stronger moment assumption:

Theorem 9.5 (4th Moment SLLN). Let Xᵢ be i.i.d. with E[Xᵢ⁴] < ∞. Write Sₙ for $\sum_{i=1}^{n} X_i$. Then n⁻¹Sₙ → E[Xᵢ] a.s.

Proof. By considering Xᵢ − E[Xᵢ] instead of Xᵢ, we can assume that the Xᵢ are zero-mean. The first step is a calculation of the fourth moment:

$$E[S_n^4] = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{l=1}^{n} E[X_i X_j X_k X_l]$$

Although this expression looks complicated, observe that only terms of the form E[Xᵢ⁴] and E[Xᵢ²Xⱼ²] survive; all other terms vanish because of independence and the zero-mean assumption. There are n terms of the form E[Xᵢ⁴] and 3n(n − 1) terms of the form E[Xᵢ²Xⱼ²]. By the Cauchy-Schwarz Inequality (3.26),

$$E[X_i^2 X_j^2] \le \sqrt{E[X_i^4]\,E[X_j^4]} = E[X_i^4],$$

so a bound on the fourth moment is E[Sₙ⁴] ≤ 3n²E[Xᵢ⁴]. Now, we can apply Markov's Inequality (3.22) with the function f(x) = x⁴ to obtain

$$P(|S_n| \ge n\varepsilon) = P(S_n^4 \ge n^4\varepsilon^4) \le \frac{E[S_n^4]}{n^4\varepsilon^4} \le \frac{3E[X_i^4]}{n^2\varepsilon^4}.$$

Summing across n, we obtain

$$\sum_{n=1}^{\infty} P\!\left(\left|\frac{S_n}{n}\right| \ge \varepsilon\right) \le \frac{3E[X_i^4]}{\varepsilon^4}\sum_{n=1}^{\infty}\frac{1}{n^2} < \infty,$$

so an application of Corollary 9.3 shows that n⁻¹Sₙ → 0 a.s.

9.3 Convergence in Distribution


Let F_{Xₙ} be a sequence of distribution functions, so that F_{Xₙ}(x) = P(Xₙ ≤ x). We say that F_{Xₙ} converges in distribution to F_X (where F_X is also a distribution function) if limₙ→∞ F_{Xₙ}(x) = F_X(x) for every point of continuity x.

Example 9.6. Let Mₙ denote the maximum of n i.i.d. Expo(λ) random variables. Intuitively, taking the maximum of an increasing number of exponential random variables will grow without bound, so we should not expect Mₙ to converge in distribution. However, if we define Xₙ = Mₙ − λ⁻¹ log n, then we will see that Xₙ converges in distribution. To verify this, we check

$$P(X_n \le x) = P(M_n \le x + \lambda^{-1}\log n) = F_{M_n}(x + \lambda^{-1}\log n) = \left(1 - \exp(-\lambda x - \log n)\right)^n = \left(1 - \frac{\exp(-\lambda x)}{n}\right)^n \to \exp(-\exp(-\lambda x))$$

as n → ∞. Hence, Xₙ converges in distribution to F_X(x) = exp(−exp(−λx)).
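A simulation makes the limit concrete: after centering by λ⁻¹ log n, the maximum of n exponentials settles into the limiting CDF exp(−exp(−λx)). The parameters below are illustrative, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, trials = 1.5, 2_000, 5_000
maxima = rng.exponential(1 / lam, size=(trials, n)).max(axis=1)
x_n = maxima - np.log(n) / lam                      # centered maxima

for x in [-1.0, 0.0, 1.0, 2.0]:
    print(x, np.mean(x_n <= x), np.exp(-np.exp(-lam * x)))
```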

9.4 CLT Proof Sketch


Our approach for providing justification for the Central Limit Theorem will be to ignore
the finer points of convergence and analysis in favor of relying on important, yet plausible,
results. If the following treatment of the Central Limit Theorem is unsatisfactory to you,
then I urge you to take additional probability and statistics courses (as well as real analysis,
as the mathematical background for measure-theoretic probability).

9.4.1 Characteristic Functions


One problem with the moment-generating function is that it does not always exist, because
the expectation may not exist. The characteristic function applies more generally to these
situations.

Definition 9.7. The characteristic function of a probability distribution for a random variable X is defined to be:

$$\varphi_X(t) := E[e^{itX}] \qquad (9.2)$$

Perhaps you may recognize the characteristic function of a probability density as the Fourier transform of the probability distribution. The characteristic function has many nice properties, and the one that we will use in particular is that if the characteristic functions of a sequence of random variables converge to a single characteristic function φ_X, then the sequence of random variables converges in distribution to X. The result we are quoting is commonly known as Lévy's Continuity Theorem.

We can make the theorem plausible by making a few non-rigorous appeals to intuition: the
characteristic function is another representation of the probability distribution, just like the
CDF or the survival function. From the characteristic function, it is possible to use the
inverse Fourier transform to recover the density function. The characteristic function, in
a sense, captures all of the information in a probability distribution, so convergence of the
characteristic functions should be equivalent to convergence of the distributions themselves.

We attempt to compute the characteristic function of the standard normal distribution. We make use of the following fact:

$$-\frac{x^2}{2} + itx = -\frac{1}{2}(x - it)^2 - \frac{t^2}{2}$$

We have that

$$\varphi_X(t) = \int_{\mathbb{R}} e^{itx}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-x^2/2 + itx}\,dx = \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\int_{\mathbb{R}} e^{-(x - it)^2/2}\,dx.$$

You may expect the integral to come out to be √(2π), since the integrand looks like the density of a normal distribution with mean it and variance 1. The integral is indeed √(2π), although some facts from complex analysis are necessary to establish this rigorously. (The rigorous argument would say: the integral $\int_{\mathbb{R}} e^{-(x-a)^2/2}\,dx$ depends analytically on the parameter a and equals the value √(2π) for all real a. By the uniqueness principle, the integral will also equal √(2π) for complex values of the parameter a.)

In any case, we have that the characteristic function is

$$\varphi_X(t) = e^{-t^2/2}. \qquad (9.3)$$

A quick result about characteristic functions:

Theorem 9.8. Let X₁, ..., Xₙ be i.i.d. random variables with X being their sum. Then:

$$\varphi_X(t) = (\varphi_{X_i}(t))^n \qquad (9.4)$$

Proof.

$$\varphi_X(t) = E[e^{itX}] = E[e^{it(X_1+\cdots+X_n)}] = E[e^{itX_1}\cdots e^{itX_n}] = E[e^{itX_1}]\cdots E[e^{itX_n}] = \varphi_{X_1}(t)\cdots\varphi_{X_n}(t) = (\varphi_{X_i}(t))^n$$

The proof above uses the i.i.d. assumption.

9.4.2 Proof Sketch


We will give a proof sketch of the Central Limit Theorem in which the Xᵢ variables are i.i.d. with mean 0 and variance 1. Our plan of attack is to show that the characteristic functions converge to e^{−t²/2}, which is the characteristic function of the N(0, 1) distribution.

Theorem 9.9 (Central Limit Theorem). Let (Xₙ) be a sequence of i.i.d. random variables with mean 0 and variance 1. Define

$$Z_n := \frac{1}{1/\sqrt{n}}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i.$$

Then (Zₙ) is a sequence of random variables which converges in distribution to the N(0, 1) distribution.

Proof. Recall that if

$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}$$

where E[Xᵢ] = μ and var(Xᵢ) = σ², then E[X̄ₙ] = μ and var(X̄ₙ) = var(Xᵢ)/n. Here, we have μ = 0 and σ = 1, so X̄ₙ has mean 0 and variance 1/n. Hence, the variable Zₙ can be thought of as the standardized version of X̄ₙ (standardization is a property we desire if we want to converge to the standard normal distribution). Using the theorem we proved for characteristic functions,

$$\varphi_{Z_n}(t) = (\varphi_{X_i/\sqrt{n}}(t))^n = (E[e^{itX_i/\sqrt{n}}])^n.$$

Let us calculate the last quantity by expanding eˣ as a Taylor series:

$$\varphi_{Z_n}(t) = \left(E\!\left[1 + \frac{it}{\sqrt{n}}X_i - \frac{t^2}{2n}X_i^2 + \cdots\right]\right)^n = \left(1 + \frac{it}{\sqrt{n}}E[X_i] - \frac{t^2}{2n}E[X_i^2] + \cdots\right)^n \approx \left(1 - \frac{t^2}{2n}\right)^n$$

In the last line, we have carried out several steps at once. We used the fact that E[Xᵢ] = 0 and that E[Xᵢ²] = var(Xᵢ) = 1. Also, since we will be taking the limit as n → ∞, we dropped terms which were of order 1/n² or higher. What we are left with is

$$\lim_{n\to\infty}\varphi_{Z_n}(t) = \lim_{n\to\infty}\left(1 - \frac{t^2}{2n}\right)^n.$$

Recalling that

$$e^x = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n, \qquad (9.5)$$

we can see that

$$\lim_{n\to\infty}\varphi_{Z_n}(t) = e^{-t^2/2}.$$

We have our desired result!


Chapter 10

Stochastic Processes

Currently, this chapter is a draft and is therefore incomplete. For now, I am


collecting interesting examples of the Poisson process.

10.1 Poisson Process


Suppose you are operating a store, awaiting the irregular arrivals of customers. Let Xₙ be the random variable which represents the amount of time it takes for the nth customer to arrive, after the (n − 1)th customer has arrived. The Xₙ are known as interarrival times. If the interarrival times are i.i.d. Expo(λ), then this is known as the Poisson process with rate λ.

If the interarrival times are exponentially distributed, then why is the process named after the Poisson distribution? It is a famous theorem that the number of customers which arrive in a certain interval of time has the Poisson distribution:

Theorem 10.1 (Poisson Process). In a Poisson process with rate λ, the number of customers which arrive in the first t seconds has the Pois(λt) distribution.

Remark: The probability that we have no customers in the first t seconds is e^{−λt} (using the CDF of the exponential distribution), and e^{−λt} = P(Pois(λt) = 0), so the theorem is certainly true for this special case.
Proof. Let N(t) represent the number of customers who arrive in the first t seconds. Our goal is to compute P(N(t) = k).

If there are exactly k arrivals in the first t + dt seconds, this could arise from two possibilities:

1. There are exactly k − 1 arrivals in the interval (0, t) and one arrival in the interval (t, t + dt). The former has probability P(N(t) = k − 1) and the latter has probability λ dt (conditioned on the kth exponential surviving t seconds, by the Memoryless Property the probability it will die in the next dt seconds is λ dt).

2. There are k arrivals in the interval (0, t) and no arrivals in the interval (t, t + dt). The former has probability P(N(t) = k) and the latter has probability 1 − λ dt.

Here, we are regarding the interval (t, t + dt) as infinitesimally small, so there is no chance of multiple arrivals. Putting together the two pieces, we have

$$P(N(t + dt) = k) = P(N(t) = k - 1)\,\lambda\,dt + P(N(t) = k)(1 - \lambda\,dt).$$

Rearranging this equation, we obtain

$$\frac{P(N(t + dt) = k) - P(N(t) = k)}{dt} = \lambda\bigl(P(N(t) = k - 1) - P(N(t) = k)\bigr).$$

Fixing k and viewing f(t) = P(N(t) = k) as a function of t, we recognize the LHS as the derivative w.r.t. t, so we have obtained a differential equation for P(N(t) = k).

$$\frac{d}{dt}P(N(t) = k) = \lambda\bigl(P(N(t) = k - 1) - P(N(t) = k)\bigr)$$

To complete the proof, we need only show that the Pois(λt) distribution satisfies this equation.

$$\frac{d}{dt}\frac{e^{-\lambda t}(\lambda t)^k}{k!} = -\lambda\frac{e^{-\lambda t}(\lambda t)^k}{k!} + \lambda k\frac{e^{-\lambda t}(\lambda t)^{k-1}}{k!} = \lambda\left(\frac{e^{-\lambda t}(\lambda t)^{k-1}}{(k-1)!} - \frac{e^{-\lambda t}(\lambda t)^k}{k!}\right)$$

10.1.1 Erlang Distribution


We next turn our attention to Tₙ, the time of the nth arrival. Note that $T_n = \sum_{k=1}^{n} X_k$, so Tₙ is the sum of n i.i.d. Expo(λ) random variables. Theorem 10.1 provides a straightforward method of deriving the density of Tₙ. P(Tₙ ∈ (t, t + dt)) is the probability that the nth arrival falls in the interval (t, t + dt), which is the probability that there are exactly n − 1 arrivals in the first t seconds and one arrival in the interval (t, t + dt). The former has probability P(Pois(λt) = n − 1), while the latter has probability λ dt.

$$f_{T_n}(t)\,dt = P(T_n \in (t, t + dt)) = \frac{e^{-\lambda t}(\lambda t)^{n-1}}{(n - 1)!}\,\lambda\,dt$$

Therefore, we can identify the density of Tₙ as:

$$f_{T_n}(t) = \frac{\lambda^n t^{n-1}e^{-\lambda t}}{(n - 1)!}, \qquad t > 0 \qquad (10.1)$$

This is known as the Erlang distribution of order n. By linearity of expectation and variance, we have

$$E[T_n] = \frac{n}{\lambda}, \qquad (10.2)$$
$$\operatorname{var}(T_n) = \frac{n}{\lambda^2}. \qquad (10.3)$$


10.1.2 Gamma Distribution


We take a quick detour to introduce a distribution which is closely related to the Erlang distribution. What if we took the Erlang distribution and, instead of confining n to be integer-valued, we replaced n with a continuously varying parameter α? Of course, we would quickly run into the problem that α! is not defined when α is not an integer, but we can replace the use of the factorial with the gamma function, which is the continuous interpolation of the factorial function. The gamma function is defined as:

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx \qquad (10.4)$$

Initially, this function does not seem to resemble the factorial function at all, so let us first see that it satisfies some of the familiar properties of the factorial function. Using integration by parts, we observe:

$$\Gamma(\alpha) = \left[-x^{\alpha-1}e^{-x}\right]_{x=0}^{\infty} + (\alpha - 1)\int_0^{\infty} x^{\alpha-2}e^{-x}\,dx = (\alpha - 1)\,\Gamma(\alpha - 1) \qquad (10.5)$$

Of course, we can evaluate

$$\Gamma(1) = \int_0^{\infty} e^{-x}\,dx = \left[-e^{-x}\right]_{x=0}^{\infty} = 1. \qquad (10.6)$$

Using (10.5) and (10.6), we can see that Γ(2) = 1, Γ(3) = 2, Γ(4) = 6, and in general, for n a positive integer,

$$\Gamma(n) = (n - 1)!\,, \qquad n \in \mathbb{N} \qquad (10.7)$$
Therefore, the gamma function is indeed a continuous, shifted analog of the factorial function. Let us now define the Gamma distribution of shape α and rate λ in the following way: if X ∼ Γ(α, λ), then the density of X is

$$f_X(x) = \frac{\lambda^{\alpha} x^{\alpha-1}e^{-\lambda x}}{\Gamma(\alpha)}, \qquad x > 0 \qquad (10.8)$$

We must check that the density is properly normalized:

$$\int_0^{\infty}\frac{\lambda^{\alpha} x^{\alpha-1}e^{-\lambda x}}{\Gamma(\alpha)}\,dx = \frac{1}{\Gamma(\alpha)}\int_0^{\infty}(\lambda x)^{\alpha-1}e^{-\lambda x}\,d(\lambda x) = \frac{\Gamma(\alpha)}{\Gamma(\alpha)} = 1$$
Now, we proceed to calculate the MGF.

$$M_X(\theta) = \int_0^{\infty} e^{\theta x}\,\frac{\lambda^{\alpha} x^{\alpha-1}e^{-\lambda x}}{\Gamma(\alpha)}\,dx = \frac{\lambda^{\alpha}}{(\lambda - \theta)^{\alpha}}\int_0^{\infty}\frac{(\lambda - \theta)^{\alpha} x^{\alpha-1}e^{-(\lambda - \theta)x}}{\Gamma(\alpha)}\,dx$$

However, we recognize the integral on the right as the integral of a gamma distribution with shape α and rate λ − θ, so it must integrate to 1. Hence, we have found

$$M_X(\theta) = \left(\frac{\lambda}{\lambda - \theta}\right)^{\alpha} \qquad (10.9)$$

and

$$C_X(\theta) = \alpha\bigl(\log(\lambda) - \log(\lambda - \theta)\bigr). \qquad (10.10)$$

We can now compute the moments of the distribution:

$$E[X] = C_X'(0) = \left.\frac{\alpha}{\lambda - \theta}\right|_{\theta=0} = \frac{\alpha}{\lambda} \qquad (10.11)$$

$$\operatorname{var}(X) = C_X''(0) = \left.\frac{\alpha}{(\lambda - \theta)^2}\right|_{\theta=0} = \frac{\alpha}{\lambda^2} \qquad (10.12)$$

Theorem 10.2. Suppose that X ∼ Γ(α, λ) and Y ∼ Γ(β, λ) are independent. Then X + Y ∼ Γ(α + β, λ).

Proof. The proof is easy using MGFs.

$$M_{X+Y}(\theta) = M_X(\theta)\,M_Y(\theta) = \left(\frac{\lambda}{\lambda - \theta}\right)^{\alpha}\left(\frac{\lambda}{\lambda - \theta}\right)^{\beta} = \left(\frac{\lambda}{\lambda - \theta}\right)^{\alpha+\beta}$$

We recognize the RHS as the MGF of a Γ(α + β, λ) distribution.

10.1.3 Poisson Process Examples


Consider a Poisson process with rate λ. We will use the following notation:

N(s, t) represents the number of arrivals in the interval (s, t). Therefore, we know that N(s, t) ∼ Pois(λ(t − s)).

Tₖ represents the time until k arrivals, so Tₖ ∼ Erlang(k, λ).

Example 10.3. What is the probability that we see no arrivals by time s and at most k arrivals by time t?

The desired probability is P(N(0, s) = 0, N(s, t) ≤ k), which by independence of increments is:

$$P(N(0, s) = 0, N(s, t) \le k) = P(N(0, s) = 0)\,P(N(s, t) \le k) = e^{-\lambda s}\sum_{j=0}^{k}\frac{e^{-\lambda(t-s)}(\lambda(t - s))^j}{j!} = e^{-\lambda t}\sum_{j=0}^{k}\frac{(\lambda(t - s))^j}{j!}$$

Example 10.4. What is the probability that it takes more than t seconds for k arrivals?

This is equivalent to the probability that in the first t seconds, we have at most k − 1 arrivals.

$$P(N(0, t) \le k - 1) = e^{-\lambda t}\sum_{j=0}^{k-1}\frac{(\lambda t)^j}{j!}$$

One can also integrate the Erlang distribution, because this is equivalent to P(Tₖ > t).

Example 10.5. What is the probability that the kth arrival occurs in the interval (s, t)?

One way to write down the answer is the integral of the Erlang distribution of order k:

$$P(s < T_k < t) = \int_s^t\frac{\lambda^k x^{k-1}e^{-\lambda x}}{(k - 1)!}\,dx$$

Another way is to write the probability as

$$P(N(0, t) \ge k) - P(N(0, s) \ge k) = \sum_{j=k}^{\infty}\frac{e^{-\lambda t}(\lambda t)^j}{j!} - \sum_{j=k}^{\infty}\frac{e^{-\lambda s}(\lambda s)^j}{j!}.$$

Example 10.6. What is the probability that the kth arrival happens within t seconds of the first arrival?

The difference between the first and the kth arrival has the Erlang(k − 1, λ) distribution.

$$P(T_{k-1} \le t) = \int_0^t\frac{\lambda^{k-1}x^{k-2}e^{-\lambda x}}{(k - 2)!}\,dx$$

Alternatively, this is the complement of the event that the kth arrival takes longer than t seconds after the first arrival, in which case the number of arrivals in (T₁, T₁ + t) must be at most k − 2.

$$P(T_{k-1} \le t) = 1 - P(T_{k-1} > t) = 1 - P(N(0, t) \le k - 2) = 1 - \sum_{j=0}^{k-2}\frac{e^{-\lambda t}(\lambda t)^j}{j!}$$

Example 10.7. Suppose s < t and we have k arrivals by time t. What is the probability that we have j arrivals by time s?

Each of the k arrivals has probability s/t of landing in the interval (0, s) and can be regarded as an independent Bernoulli trial. Hence, we have

$$P(N(0, s) = j \mid N(0, t) = k) = \binom{k}{j}\left(\frac{s}{t}\right)^j\left(\frac{t - s}{t}\right)^{k-j}.$$

Example 10.8. Given that we have seen k arrivals by time t, what is the probability that the first arrival occurs before 1 second has passed?

Note that P(T₁ ≤ 1 | N(0, t) = k) is the complement of the probability that in the first second, we see no arrivals. Each of the k arrivals is uniformly distributed over the interval (0, t), so the probability that an arrival lands in the interval (1, t) is (t − 1)/t = 1 − 1/t. The probability for all k arrivals to land after 1 second is (1 − 1/t)ᵏ, so

$$P(T_1 \le 1 \mid N(0, t) = k) = 1 - P(N(0, 1) = 0 \mid N(0, t) = k) = 1 - \left(1 - \frac{1}{t}\right)^k.$$

Example 10.9. Given that the first arrival takes longer than s seconds to arrive, what is the probability that the kth arrival takes longer than t seconds?

By the Memoryless Property, once we condition on the first arrival surviving at least s seconds, we can treat s as the beginning of our process. Hence,

$$P(T_k > t \mid T_1 > s) = P(T_k > t - s).$$

This is easy to compute.

Example 10.10. Given that in the first s seconds there is one arrival, what is the probability that the kth arrival takes longer than t seconds?

The first arrival appears in (0, s), and we know that the second exponential survives until time s. Conditioned on this fact, the Memoryless Property tells us that we need only find the probability that k − 1 arrivals take more than t − s additional seconds.

$$P(T_k > t \mid N(0, s) = 1) = P(T_{k-1} > t - s)$$

Example 10.11. Suppose that there are m arrivals in the first s seconds and n arrivals in the first t seconds, where n ≥ m and t > s. What is the probability that the kth arrival (where k > n) takes longer than τ seconds?

The key idea is that knowing there are m arrivals in the first s seconds tells you no more information if you already know that there are n arrivals in t seconds (because t is the later time). Hence, we need only find the probability that k − n arrivals take longer than τ − t seconds.

$$P(T_k > \tau \mid N(0, s) = m, N(0, t) = n) = P(T_{k-n} > \tau - t)$$

Example 10.12. Suppose that the jth arrival occurs at time s. What is the probability that the kth arrival (for k > j) takes longer than t seconds?

There are k − j additional arrivals which must take at least t − s more seconds.

$$P(T_k > t \mid T_j = s) = P(T_{k-j} > t - s)$$

Example 10.13. Suppose that the jth arrival happens before time s. What is the probability that the kth arrival happens after time t, where k > j and t > s?

Apply the definition of conditional probability. Note that P(Tⱼ < s) can be written as the probability that in the first s seconds, we have at least j arrivals.

$$P(T_k > t \mid T_j < s) = \frac{P(T_k > t, N(0, s) \ge j)}{P(N(0, s) \ge j)} = \frac{\sum_{n=j}^{\infty} P(T_k > t, N(0, s) = n)}{\sum_{n=j}^{\infty} P(N(0, s) = n)} = \frac{\sum_{n=j}^{k-1} P(T_k > t \mid N(0, s) = n)\,P(N(0, s) = n)}{\sum_{n=j}^{\infty} P(N(0, s) = n)} = \frac{\sum_{n=j}^{k-1} P(T_{k-n} > t - s)\,P(N(0, s) = n)}{\sum_{n=j}^{\infty} P(N(0, s) = n)}$$

Example 10.14. Suppose that the nth arrival appears at time t. What is the probability that the jth arrival (for j ≤ n) appears before time s?

If the nth arrival lands exactly at time t, there are n − 1 arrivals in the interval (0, t). Each of these arrivals has the distribution Unif(0, t). In particular, the jth arrival must be the jth order statistic. Recall that if X₁, ..., Xₙ are i.i.d. and X^{(k)} denotes the kth smallest point, then

$$f_{X^{(k)}}(x) = n\binom{n-1}{k-1} f_X(x)\,(F_X(x))^{k-1}(1 - F_X(x))^{n-k}.$$

Applying this to the case of n − 1 i.i.d. Unif(0, t) random variables, we have

$$f_{T_j \mid T_n = t}(x) = (n - 1)\binom{n-2}{j-1}\frac{1}{t^{n-1}}\,x^{j-1}(t - x)^{n-j-1}, \qquad 0 < x < t.$$

Our answer can be found by integrating the density.

$$P(T_j < s \mid T_n = t) = \int_0^s (n - 1)\binom{n-2}{j-1}\frac{1}{t^{n-1}}\,x^{j-1}(t - x)^{n-j-1}\,dx$$

Of course, there is also a simple approach. If we regard each point as an independent trial, where success means landing in the interval (0, s), then we require at least j successes. Each success has probability s/t, so we are looking for

$$P(T_j < s \mid T_n = t) = \sum_{m=j}^{n-1}\binom{n-1}{m}\left(\frac{s}{t}\right)^m\left(\frac{t - s}{t}\right)^{n-1-m}.$$

Example 10.15. Suppose that there is one arrival in (0, s) and the time of the kth arrival is t, for t > s + 1. What is the expected number of arrivals between s and s + 1?

Since the time of the kth arrival is t, we have k − 1 arrivals in (0, t), and one arrival in (0, s). Therefore, we have k − 2 arrivals in (s, t), so each of the k − 2 arrivals has probability 1/(t − s) of landing in the interval (s, s + 1). Since this is a binomial distribution, we have

$$E[N(s, s + 1) \mid N(0, s) = 1, T_k = t] = \frac{k - 2}{t - s}.$$

Example 10.16. What is the joint density of (Tⱼ, Tₖ), where j < k?

Consider P(Tⱼ ∈ (x, x + dx), Tₖ ∈ (y, y + dy)). We need j − 1 points in (0, x), one point in (x, x + dx) (with probability λ dx), k − j − 1 points in (x, y), and one point in (y, y + dy) (with probability λ dy). Hence,

$$f_{T_j, T_k}(x, y)\,dx\,dy = P(T_j \in (x, x + dx), T_k \in (y, y + dy)) = \lambda^2\,P(N(0, x) = j - 1)\,P(N(x, y) = k - j - 1)\,dx\,dy = \lambda^2\,\frac{e^{-\lambda x}(\lambda x)^{j-1}}{(j - 1)!}\cdot\frac{e^{-\lambda(y-x)}(\lambda(y - x))^{k-j-1}}{(k - j - 1)!}\,dx\,dy,$$

so

$$f_{T_j, T_k}(x, y) = \lambda^k e^{-\lambda y}\,\frac{x^{j-1}(y - x)^{k-j-1}}{(j - 1)!\,(k - j - 1)!}, \qquad 0 < x < y < \infty.$$

Example 10.17. Suppose that T has the density f_T(t) = te^{−t} for t > 0, and let X ∼ Unif[0, T]. What is the joint density of X and T − X?

Observe that T has the density of the second arrival time of a Poisson process with rate 1. Conditioned on T = t, X is uniform in the interval [0, t], so X represents the first arrival time. Therefore, we should expect X and Y = T − X to be independent Expo(1) random variables. We can verify this idea with direct calculation:

$$f_{X,Y}(x, y) = \int_0^{\infty} f_{X,Y \mid T}(x, y \mid t)\,f_T(t)\,dt = \int_{x+y}^{\infty}\frac{1}{t}\,te^{-t}\,dt = \int_{x+y}^{\infty} e^{-t}\,dt = e^{-(x+y)}, \qquad x, y > 0$$

We have recovered the joint density of i.i.d. Expo(1) random variables. Here, we see that
a Poisson process interpretation can sometimes lead to intuitive answers.
Chapter 11

Martingales

An important application of conditional expectation is the subject of martingales, which


are often described as models of a fair game. Martingales have a natural gambling inter-
pretation, but their applications range far beyond games and wagers. Due to the numerous
useful properties of martingales, many problems can be solved by first interpreting the system
as a martingale, and then applying notable martingale results.

11.1 Introduction
Definition 11.1. A sequence of random variables {Xₙ : n ∈ ℕ} is a martingale if E[|Xₙ|] < ∞ and the sequence satisfies

$$E[X_{n+1} \mid X_1, \ldots, X_n] = X_n. \qquad (11.1)$$

The gambling interpretation is that Xₙ is your fortune at time n, and you are making a series of bets. The martingale property says that, conditioned on your fortune being X₁, ..., Xₙ in the first n time steps, your expected fortune at time n + 1 is Xₙ, that is, you don't expect to win or lose any money on the (n + 1)th bet. A concise way of describing a martingale is: a martingale describes a fair game. Note that the martingale property implies

$$E[X_{n+1}] = E[E[X_{n+1} \mid X_1, \ldots, X_n]] = E[X_n],$$

so that E[Xₙ] = E[Xₙ₋₁] = ⋯ = E[X₀], i.e. your expected fortune is always the same.

Another observation is that for m > n, we have E[Xm | X1 , . . . , Xn ] = Xn .

If we replace the equality in (11.1) with ≥, then we have a submartingale. If we replace the equality with ≤, then we have a supermartingale. Observe that a martingale is both a submartingale and a supermartingale.

Let us see a few examples of martingales before we prove results about them.


Example 11.2. Suppose that Xᵢ are independent random variables with E[Xᵢ] = 0. Then the sequence $S_n = \sum_{i=1}^{n} X_i$ is a martingale. We verify the martingale property:

$$E[\underbrace{S_{n+1}}_{S_n + X_{n+1}} \mid X_1, \ldots, X_n] = E[S_n \mid X_1, \ldots, X_n] + E[X_{n+1} \mid X_1, \ldots, X_n] = E[S_n \mid X_1, \ldots, X_n] + \underbrace{E[X_{n+1}]}_{=0} = S_n$$

First, we split Sₙ₊₁ into Sₙ + Xₙ₊₁ and apply linearity of expectation. Since Sₙ is a function of X₁, ..., Xₙ, then E[Sₙ | X₁, ..., Xₙ] = Sₙ. (Think about the MMSE property: the best function of X₁, ..., Xₙ to estimate Sₙ is simply Sₙ itself.) Since Xₙ₊₁ is independent of X₁, ..., Xₙ, then E[Xₙ₊₁ | X₁, ..., Xₙ] = E[Xₙ₊₁] = 0.

Intuitively, this example reinforces our interpretation of martingales as fair games: Xₙ is the bet you make at time n, and if we assume that each bet is fair (E[Xₙ] = 0), then your fortune (the sum of your winnings so far) is a martingale.

Example 11.3. Suppose that Xᵢ are independent random variables with E[Xᵢ] = 1. Then the sequence $P_n = \prod_{i=1}^{n} X_i$ is a martingale. We verify the martingale property:

$$E[\underbrace{P_{n+1}}_{P_n X_{n+1}} \mid X_1, \ldots, X_n] = E[P_n X_{n+1} \mid X_1, \ldots, X_n] = P_n\,E[X_{n+1} \mid X_1, \ldots, X_n] = P_n\,\underbrace{E[X_{n+1}]}_{=1} = P_n$$

Since Pₙ is a function of X₁, ..., Xₙ, conditioning on X₁, ..., Xₙ effectively turns Pₙ into a constant that can be moved outside of the conditional expectation. Again, Xₙ₊₁ is independent of X₁, ..., Xₙ, so E[Xₙ₊₁ | X₁, ..., Xₙ] = E[Xₙ₊₁] = 1.

Here, the interpretation is that your fortune is multiplied by Xₙ at time n, and E[Xₙ] = 1 ensures that the bets are still fair.

11.2 Martingale Transforms


Suppose we have a martingale Xₙ. The martingale represents your fortune at time n, so Xₙ − Xₙ₋₁ represents how much money you gained on the nth bet. Although we cannot control the outcome of the wagers, perhaps we can profit by varying the amount of money that we wager. Formally, we say that a sequence of random variables Hₙ is predictable if Hₙ is a function of X₁, ..., Xₙ₋₁ (which is to say that Hₙ can only depend on the information you have before making the nth bet). We interpret Hₙ to be the amount of money you wager on the nth bet. Originally, we gained Xₙ − Xₙ₋₁ on the nth bet, and now we stand to gain Yₙ − Yₙ₋₁ = Hₙ(Xₙ − Xₙ₋₁) on the nth bet, where Yₙ is a sequence of random variables representing the fortune at time n under this new betting system.

Notation: We write Y = H · X and call Y a martingale transform or a discrete-time stochastic integral.

Theorem 11.4 (Martingale Transforms). Let Xₙ be a martingale and Hₙ be predictable. Then Y = H · X is also a martingale.

Additionally, if Hₙ ≥ 0 and Xₙ is a submartingale or supermartingale, then Y = H · X is also a submartingale or supermartingale respectively.

Proof. First, we observe that the martingale property is equivalent to

$$E[X_{n+1} - X_n \mid X_1, \ldots, X_n] = 0. \qquad (11.2)$$

We verify this version of the martingale property.

$$E[Y_{n+1} - Y_n \mid Y_1, \ldots, Y_n] = E[H_{n+1}(X_{n+1} - X_n) \mid Y_1, \ldots, Y_n] = H_{n+1}\,E[X_{n+1} - X_n \mid Y_1, \ldots, Y_n] = 0,$$

since Hₙ₊₁ is a function of Y₁, ..., Yₙ and Xₙ is a martingale.

Now, suppose that Hₙ ≥ 0. If Xₙ is a submartingale,

$$E[Y_{n+1} - Y_n \mid Y_1, \ldots, Y_n] = H_{n+1}\,E[X_{n+1} - X_n \mid Y_1, \ldots, Y_n] \ge 0,$$

so Yₙ is a submartingale. If Xₙ is a supermartingale, then

$$E[Y_{n+1} - Y_n \mid Y_1, \ldots, Y_n] = H_{n+1}\,E[X_{n+1} - X_n \mid Y_1, \ldots, Y_n] \le 0,$$

so Yₙ is a supermartingale. In both of these cases, Hₙ ≥ 0 is necessary to ensure that the direction of the inequality does not flip.

Saying that Yn is a martingale means that Yn is also a fair game. Under the new betting
system, no matter how inventive, you stand to do no better on average than your original
Xn ! Therefore, we can interpret the theorem as saying: you cannot cheat a fair game.

11.3 Stopping Times


Definition 11.5. A stopping time is a random variable T which takes on values in N,
such that 1(T =n) is a function of X1 , . . . , Xn for every n.

Think of T as the time at which you stop betting. The condition that 1(T =n) is a function of
X1 , . . . , Xn is interpreted as: your decision to stop betting after the nth bet can only depend
on information up until the nth bet, and not on the future.

Proposition 11.6. In order to show that T is a stopping time, it is equivalent to show
that 1(T ≤ n) is a function of X1 , . . . , Xn for every n.

Proof. Sufficient:

1(T = n) = 1(T ≤ n) − 1(T ≤ n−1)

and both 1(T ≤ n) and 1(T ≤ n−1) are functions of X1 , . . . , Xn by assumption, so 1(T = n) is a
function of X1 , . . . , Xn .

Necessary: Suppose that 1(T = n) is a function of X1 , . . . , Xn for all n. Then

1(T ≤ n) = Σ_{i=0}^n 1(T = i)

and each 1(T = i) is a function of X1 , . . . , Xn , so 1(T ≤ n) is a function of X1 , . . . , Xn .

Example 11.7. Fix a constant c and let T = min {n ∈ N : Xn ≥ c} be the first time
that Xn is at least c. Note that {T > n} is the event that none of X1 , . . . , Xn is at least c,
so we see that

1(T > n) = 1(X1 < c, . . . , Xn < c)

is a function of X1 , . . . , Xn , for every n. Since {T > n} is the complement of {T ≤ n},
then 1(T ≤ n) = 1 − 1(T > n) is also a function of X1 , . . . , Xn , so T is a stopping time.

Can we start with a martingale and profit by choosing a judicious stopping time? Quit
while you're ahead, in a nutshell. The next result says no:

Theorem 11.8. Suppose that Xn is a martingale and let T be a stopping time. Define
Yn to be the stopped process, that is Yn = Xmin(T,n) (so Yn = Xn until T , and then
Yn = XT afterwards). Then Yn is a martingale.

Proof. Define Hn = 1(n ≤ T ) . This reflects the following betting strategy: bet one dollar at
the start, and continue until time T , and then bet no more afterwards. We claim that Hn
is predictable: note that the event {n ≤ T } is the complement of {n > T } = {T ≤ n − 1},
so Hn = 1(n ≤ T ) = 1 − 1(T ≤ n−1) , which is a function of X1 , . . . , Xn−1 . Next, we observe
that Yn = Σ_{i=1}^n (Yi − Yi−1 ) = Σ_{i=1}^n (Xi − Xi−1 ) 1(i ≤ T ) = Σ_{i=1}^n Hi (Xi − Xi−1 ), so Yn is
indeed a martingale transform of Xn . Finally, we apply Theorem 11.4 to Y = H · X
and we obtain that Y is a martingale as well.

11.4 Martingale Convergence


One of the most important properties of martingales is that, under appropriate conditions,
martingales converge to a limit X∞ . Here, we give a simple example:

Example 11.9 (Pólya's Urn). Suppose we have an urn with one black ball and one
white ball. We repeatedly pick a ball from the urn, and put back in its place two balls
of the same color. Let Xn be the fraction of black balls in the urn at the nth draw.
We verify that Xn is a martingale. Suppose that there are x black balls at time n, so
Xn = x/(n + 1). With probability x/(n + 1), we choose a black ball, and the new fraction
of black balls is (x + 1)/(n + 2). With probability 1 − x/(n + 1), we choose a white ball,
and the new fraction of black balls is x/(n + 2). Hence,

E[Xn+1 | Xn ] = (x/(n+1)) · ((x+1)/(n+2)) + ((n − x + 1)/(n+1)) · (x/(n+2)) = x/(n+1) = Xn .

Let Bn be the number of black balls at time n. We compute the distribution of Bn as
follows: first, we compute the probability that the next k draws are all black, and then
the remaining n − k draws are all white. The probability is

(1/2) · (2/3) · · · (k/(k+1)) · (1/(k+2)) · (2/(k+3)) · · · ((n−k)/(n+1)) = k! (n − k)! / (n + 1)!.

If we want P (Bn = k + 1), then we must draw k black balls as above, but there are
(n choose k) different sequences of k black and n − k white balls which achieve the same
outcome. Hence,

P (Bn = k + 1) = [n! / (k! (n − k)!)] · [k! (n − k)! / (n + 1)!] = 1/(n + 1).

We see that Bn is uniformly distributed over its n + 1 possible values, so the fraction of
black balls is uniformly distributed over the n + 1 equally spaced values it can take, each
with probability (n + 1)⁻¹. The limit is X∞ ∼ Unif[0, 1].
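The convergence can be checked empirically. Below is a minimal simulation sketch (Python with NumPy, not part of the original notes): it runs many independent copies of the urn and looks at the empirical distribution of the fraction of black balls after many draws, which should be roughly uniform on [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_fraction(num_draws):
    """Run one Polya urn (1 black, 1 white) for num_draws draws;
    return the final fraction of black balls."""
    black, total = 1, 2
    for _ in range(num_draws):
        if rng.random() < black / total:   # draw a black ball
            black += 1                     # return it together with another black
        total += 1                         # the urn gains one ball either way
    return black / total

samples = np.array([polya_fraction(500) for _ in range(2000)])
# The empirical distribution should be close to Unif[0, 1]:
print("mean  ~ 0.5  :", samples.mean())
print("var   ~ 1/12 :", samples.var())
```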
Chapter 12

Random Vectors & Linear Algebra

Here, we consider multi-dimensional analogues of the one-dimensional results we have already


discussed. Stating and proving the multi-dimensional results will require a study of linear
algebra, so we also provide a review of important linear algebra results.

12.1 Introduction
Let X and Y be vector-valued random variables in Rm and Rn respectively (in particular,
it is not necessary to assume that m = n). We denote by E[X] the vector whose ith entry
is the expectation of the ith entry of the random variable X. We define the covariance
matrix of X and Y, ΣX,Y , to be

ΣX,Y := E[XYT ] − E[X]E[Y]T (12.1)

and write ΣX for ΣX,X . The (i, j) entry of ΣX,Y is E[Xi Yj ] − E[Xi ]E[Yj ] = cov(Xi , Yj ),
where Xi is the random variable for the ith coordinate of X and Yj is the random variable
for the jth coordinate of Y. Hence, the entries of ΣX,Y are the pairwise covariances between
the entries of X and the entries of Y.

In the following theorem, suppose that X is a random m × n matrix.

Theorem 12.1 (Linearity of Expectation). Suppose f is a linear function of X, that is,
f (X) = Σ_{i=1}^m Σ_{j=1}^n ai,j Xi,j for constants ai,j . Then f (E[X]) = E[f (X)].

Proof.

E[f (X)] = Σ_{i=1}^m Σ_{j=1}^n ai,j E[Xi,j ] = f (E[X])

The result also works for matrix-valued functions F(X), by applying Theorem 12.1 to each
entry of F. For some examples of linear functions, consider:

X + Y (the (i, j) entry is Xi,j + Yi,j ).

AX (left-multiplication by a constant matrix; the (i, j) entry is Σ_{k=1}^m Ai,k Xk,j ).

XA (right-multiplication by a constant matrix; the (i, j) entry is Σ_{k=1}^n Xi,k Ak,j ).

XT (the (i, j) entry is Xj,i ).

tr(X) = Σ_{i=1}^n Xi,i .
Next, we prove the scaling property for variance. Recall that (AX)T = XT AT .

Theorem 12.2 (Scaling of Variance). Let X be a random matrix and A be a constant
matrix. Then ΣAX = A ΣX AT .

Proof. We follow the definition of ΣAX , using the fact that the transpose is a linear
operator.

ΣAX = E[AX(AX)T ] − E[AX]E[AX]T = E[AXXT AT ] − E[AX]E[XT AT ]
    = A E[XXT ] AT − A E[X] E[XT ] AT = A (E[XXT ] − E[X]E[XT ]) AT = A ΣX AT
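The scaling identity is easy to check empirically (a sketch using NumPy, not part of the original notes; the particular covariance matrix and the matrix A below are arbitrary illustrative choices): compare the sample covariance of AX against A ΣX AT.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample a random 3-dimensional X with a fixed covariance, and a constant A.
n_samples = 200_000
X = rng.multivariate_normal(mean=[0.0, 1.0, 2.0],
                            cov=[[2.0, 0.5, 0.0],
                                 [0.5, 1.0, 0.3],
                                 [0.0, 0.3, 1.5]],
                            size=n_samples)           # shape (n_samples, 3)
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])                      # constant 2x3 matrix

cov_X  = np.cov(X, rowvar=False)                      # empirical Sigma_X
cov_AX = np.cov(X @ A.T, rowvar=False)                # empirical Sigma_AX

print(np.allclose(cov_AX, A @ cov_X @ A.T, atol=0.05))  # True, up to sampling noise
```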

12.2 Symmetric Matrices & Spectral Theorem


Recall the following definitions from linear algebra.

Definition 12.3. A matrix A is symmetric if A = AT .

Definition 12.4. A matrix Q is orthogonal if QT Q = QQT = I.

In other words, Q⁻¹ = QT .

An important fact is that symmetric matrices can be diagonalized with orthogonal eigenvec-
tors. This is a key result in linear algebra, known as the spectral theorem for real, symmetric
matrices. We state and prove the result below:

Theorem 12.5. If λ is an eigenvalue of a symmetric matrix A, then λ is real.

Proof. Denote by λ̄ the complex conjugate of λ, and similarly for ū. Let u be an
eigenvector corresponding to λ. From the definition of an eigenvalue, we have

Au = λu. (12.2)

Taking the complex conjugate, we also have

Aū = λ̄ū, (12.3)

since A is real. Left-multiplying (12.2) by ūT and (12.3) by uT and subtracting, we have

ūT Au − uT Aū = (λ − λ̄) ūT u.

The LHS is 0, since ūT Au = (ūT Au)T = uT Aū, by the symmetry of A. Since ūT u > 0,
we must have λ = λ̄.

Theorem 12.6 (Spectral Theorem). Let A be an n × n real, symmetric matrix. Then
A admits a decomposition A = UΛUT , where U is an orthogonal matrix containing the
eigenvectors of A and Λ is a diagonal matrix containing the eigenvalues of A.

Proof. We proceed by induction on n. The case of n = 1 is trivial, so assume that the
result holds for (n − 1) × (n − 1) real, symmetric matrices.

Let λ1 and u1 be an eigenvalue and the corresponding normalized eigenvector. Let Vn−1
denote the subspace of vectors which are orthogonal to u1 , and take any u ∈ Vn−1 . Then

(Au)T u1 = uT Au1 = λ1 uT u1 = 0,

so Au is orthogonal to u1 as well. Therefore, if we denote by Ã the operator A restricted
to Vn−1 , then Ã is an (n−1) × (n−1) real, symmetric matrix. By the inductive hypothesis,
Ã has n − 1 orthogonal eigenvectors, which are all orthogonal to u1 . Hence, A is
diagonalizable with orthogonal eigenvectors, which completes the proof.

12.2.1 Positive Semi-Definite Matrices


The following definition from linear algebra generalizes the notion of a nonnegative real number.

Definition 12.7. A matrix A is positive semi-definite if it is symmetric and for any
vector x, xT Ax ≥ 0.

The following result is the analogue of the fact that var(X) ≥ 0.

Theorem 12.8. If X is a random vector, then ΣX is positive semi-definite.

Proof. It is clear that ΣX is symmetric because cov(Xi , Xj ) = cov(Xj , Xi ).

Fix a vector u. Using the alternate form ΣX = E[(X − E[X])(X − E[X])T ], we have

uT ΣX u = E[uT (X − E[X]) (X − E[X])T u] = E[zT z] ≥ 0,

where z = (X − E[X])T u and zT z ≥ 0. Hence, ΣX is positive semi-definite.

12.2.2 Cholesky Decomposition


The following result provides a characterization of positive semi-definite matrices. For a
matrix A, we can write

A = [ a1,1   A1,2
      A2,1   A2,2 ].   (12.4)

Lemma 12.9. Suppose A is a positive semi-definite matrix, written as (12.4). Then
a1,1 ≥ 0 and

Ã := A2,2 − (1/a1,1) A2,1 A1,2   (12.5)

is positive semi-definite.

Proof. One can see that Ã is symmetric because A2,2 is symmetric and A2,1 = A1,2T . To
show that Ã is positive semi-definite, let x̃ ∈ R^{n−1} be a vector. Define

x := [ −A1,2 x̃/a1,1
        x̃ ].

Since A is positive semi-definite, one has

0 ≤ xT Ax = [ −A1,2 x̃/a1,1   x̃T ] [ a1,1   A1,2    [ −A1,2 x̃/a1,1
                                      A2,1   A2,2 ]    x̃ ]
          = [ −A1,2 x̃/a1,1   x̃T ] [ 0
                                      Ã x̃ ] = x̃T Ã x̃.
One then uses Lemma 12.9 to prove:

Theorem 12.10 (Cholesky Decomposition). A is positive semi-definite if and only if


there exists a lower triangular matrix L such that A = LLT .
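As a quick illustration (a sketch assuming NumPy, not part of the original notes), `np.linalg.cholesky` recovers such a lower triangular factor for a positive definite matrix; the recursion in Lemma 12.9 is essentially what an implementation peels off one column at a time via the Schur complement.

```python
import numpy as np

# A positive definite matrix (for example, a covariance matrix).
A = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

L = np.linalg.cholesky(A)        # lower triangular factor
print(np.allclose(L @ L.T, A))   # True: A = L L^T

# First step of the recursion from Lemma 12.9: peel off the (1,1) entry.
a11, A12, A21, A22 = A[0, 0], A[0:1, 1:], A[1:, 0:1], A[1:, 1:]
A_tilde = A22 - (A21 @ A12) / a11          # Schur complement, again PSD
print(np.linalg.eigvalsh(A_tilde) >= 0)    # all True
```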

12.3 Multivariate Regression


The ideas in this section are only slight generalizations of the one-dimensional regression we
have already considered. By analogy with the one-dimensional case, we write:

Definition 12.11. Let X and Y be random vectors such that ΣX is non-singular, i.e.
ΣX is invertible. Then

L(Y | X) := E[Y] + ΣY,X ΣX⁻¹ (X − E[X]). (12.6)

Observe that L(Y | X) should be an m × 1 random vector, and indeed, in the product
ΣY,X ΣX⁻¹ (X − E[X]) the factors have dimensions m × n, n × n, and n × 1 respectively,
so the dimensions are correct.

Theorem 12.12 (Projection Property). For any linear function AX + b of X (here,
A and b are constant),

E[(Y − L(Y | X))(AX + b)T ] = 0. (12.7)

Proof. We need only show that the following hold:

E[(Y − L(Y | X)) bT ] = 0 (12.8)
E[(Y − L(Y | X)) (AX)T ] = 0 (12.9)

The first equation is evident because E[Y − L(Y | X)] = 0, and the second follows from

E[(Y − L(Y | X))(AX)T ] = E[YXT − E[Y]XT − ΣY,X ΣX⁻¹ (XXT − E[X]XT )] AT
                        = (ΣY,X − ΣY,X ΣX⁻¹ ΣX ) AT = 0.

In linear algebra terms, Y − L(Y | X) is orthogonal to the (n + 1)-dimensional subspace
generated by 1, X1 , . . . , Xn , where 1 denotes the vector of ones.

Theorem 12.13 (Multivariate LLSE). For any constants A and b:

E[‖Y − L(Y | X)‖₂²] ≤ E[‖Y − AX − b‖₂²] (12.10)

Proof. We use the standard decomposition.

E[‖Y − AX − b‖₂²] = E[‖Y − L(Y | X) + L(Y | X) − AX − b‖₂²]

Since the trace is a linear operator, we have tr(E[A]) = E[tr(A)]. Applying this to the
matrix (Y − L(Y | X))T (L(Y | X) − AX − b) (which we abbreviate as ZT W) and the
cyclic property of the trace operator, tr(AB) = tr(BA), we find that

E[ZT W] = E[WT Z] = E[tr(WT Z)] = E[tr(ZWT )] = tr(E[ZWT ])
        = tr(E[(Y − L(Y | X))(L(Y | X) − AX − b)T ]) = 0

by (12.7). Therefore, the cross-term vanishes and we have

E[‖Y − AX − b‖₂²] = E[‖Y − L(Y | X)‖₂²] + E[‖L(Y | X) − AX − b‖₂²].

The first term corresponds to irreducible error, and the last term is clearly minimized
when AX + b = L(Y | X).

12.3.1 Non-Bayesian Multivariate Regression


Switching notation now, suppose that we gather data points (x1 , y1 ), . . . , (xN , yN ), where
each xi is a vector of p features and yi is a scalar quantity that we wish to predict. Let X be
the N × p matrix such that the ith row is xi , and let y be the N × 1 column vector whose
ith entry is yi . We seek to solve the regression problem, that is, find β0 , β1 , . . . , βp (denote
by β the column vector containing β1 , . . . , βp ) so that if we predict

ŷi = β0 + xiT β (12.11)

then we minimize the sum of squared residuals Σ_{i=1}^N (yi − ŷi )². This is the non-Bayesian
formulation. Often, we like to append the entry 1 to the beginning of each xi , making each
xi into a (p + 1)-dimensional vector. This has the effect of absorbing β0 into β, that is,
we can simply write ŷi = xiT β. Therefore, we ignore β0 for the moment, and remark that
there is a simple closed-form solution for β̂.

Assume for simplicity that we have normalized all of the features so that they are zero-
mean, that is, Σ_{i=1}^N xi = 0. Then, observe that cov(Y, Xj ) (where Xj is the jth feature and
Y is the response) is E[Y Xj ] = N⁻¹ Σ_{i=1}^N yi xi,j = N⁻¹ (yT X)j , so N ΣY,X = yT X. Similarly,
N⁻¹ ΣX⁻¹ = (XT X)⁻¹. However, (12.6) and (12.11) have opposite conventions regarding the
weight vector, so we must take the transpose again. This yields the closed-form solution

β̂ = (XT X)⁻¹ XT y. (12.12)
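A quick numerical sanity check of (12.12) follows (a sketch using NumPy; the data, noise level, and variable names are illustrative assumptions, not part of the notes).

```python
import numpy as np

rng = np.random.default_rng(2)

N, p = 500, 3
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                      # center the features
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y via a linear solve
print(beta_hat)                                # close to beta_true
```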

12.4 Matrix Calculus


This section is devoted to computing derivatives w.r.t. matrices. Consider a function f (x),
where f (x) is a scalar and x is an n × 1 vector. Then the derivative of f (x) w.r.t. x is also
a vector, whose ith entry is the derivative of f (x) w.r.t. xi :

d/dx f (x) = ( d/dx1 f (x), . . . , d/dxn f (x) )T (12.13)

The same principle applies for differentiation w.r.t. a matrix X: the result is a matrix of the
same size as X, whose (i, j) entry is the derivative of f (X) w.r.t. the entry Xi,j . How about
the derivative of a vector y w.r.t. a vector x? The result is a matrix, whose (i, j) entry is
the derivative of yi w.r.t. xj .

In principle, computations can be carried out using the above definitions, but it is helpful
to have a toolbox of common matrix calculus identities in order to speed up computations.

12.4.1 Inner Products


First, we consider an inner product of the form xT y, where x and y are n × 1 vectors. If we
consider the derivative w.r.t. xi , then we have

d/dxi xT y = d/dxi Σ_{k=1}^n xk yk = yi ,

since all but the ith term in the summation vanish. Since the derivative of xT y w.r.t. x is
the vector whose ith entry is yi , we see that

d/dx xT y = y. (12.14)

(Compare the above equation with the derivative of f (x) = cx.) Using the symmetry of the
inner product, we also have

d/dy xT y = d/dy yT x = x. (12.15)

Now, suppose that both x and y are functions of z. If we take the derivative w.r.t. z, the
ith entry is

d/dzi xT y = d/dzi Σ_{k=1}^n xk yk = Σ_{k=1}^n (dxk/dzi · yk + xk · dyk/dzi ) = Σ_{k=1}^n (yk (dx/dz)k,i + xk (dy/dz)k,i ),

which gives the formula

d/dz xT y = xT dy/dz + yT dx/dz. (12.16)
This is the product rule of matrix calculus.

12.4.2 Matrix-Vector Products


Now, consider the derivative of Ax with respect to x. The (i, j) entry is

d/dxj (Ax)i = d/dxj Σ_{k=1}^n Ai,k xk = Ai,j ,

which gives

d/dx Ax = A. (12.17)

12.4.3 Quadratic Forms


A quadratic form is an expression of the form xT Ax. We can write this as xT y, where
y = Ax. Then, applying the product rule (12.16) and rule (12.17), we have

d/dx xT Ax = xT d/dx (Ax) + (Ax)T d/dx x = xT A + (Ax)T I = xT (A + AT ). (12.18)

When A is symmetric, we immediately obtain

d/dx xT Ax = 2xT A. (12.19)
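These identities are easy to sanity-check numerically (a sketch assuming NumPy, not part of the notes): compare (12.18) against a central finite-difference approximation of the gradient of xT Ax.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))            # a general (not necessarily symmetric) matrix
x = rng.normal(size=n)

analytic = x @ (A + A.T)               # entries of the gradient x^T (A + A^T) from (12.18)

eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(n)                  # perturb one coordinate at a time
])

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```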

12.5 Kernel Regression


12.5.1 Kernels
Although linear regression is a particularly elegant and straightforward theory, it is also a
restrictive model because it imposes a globally linear structure onto the data. Sometimes,
we would prefer to have a locally linear fit, which is the subject of local linear regression.

Suppose that we have collected data {(x1 , y1 ), . . . , (xN , yN )}. Our goal is to predict f(x0 ) at
a target point x0 , with the intention that f(x0 ) should be close to the true value y0 . Local
regression is based on the idea that when we make our prediction f(x0 ), we should pay more
attention to the training points (xi , yi ) which are near x0 . In other words, the closer that xi
lies to x0 , the more weight we should place on the training point (xi , yi ).

A kernel function is a method of assigning these weights. Formally, a kernel function
Kλ(x0 , x) is a function which gives the weight for point x; λ is a scale parameter and often
controls the width of the kernel. A larger width for the kernel means that training points
which are further away from x0 are given more consideration, whereas a smaller width re-
stricts our focus to only the points which are in the immediate vicinity of x0 .

One example of a kernel is the k-nearest neighbors kernel, which places a weight of 1 on
the k closest points to x0 , and a weight of 0 for other points. Other times, we would like a
smoother kernel (a smoother kernel yields a smoother fit). Many kernels are of the form
 
Kλ(x0 , x) = D(|x − x0 |/λ), (12.20)

where D(w) is a function that determines the shape of the kernel. For example,

D(w) = (3/4)(1 − w²) for |w| ≤ 1, and D(w) = 0 for |w| > 1, (12.21)

is the Epanechnikov quadratic kernel, and

D(w) = (1 − |w|³)³ for |w| ≤ 1, and D(w) = 0 for |w| > 1, (12.22)

is the tri-cube kernel. The tri-cube kernel is differentiable at w = 1, since the derivative
vanishes at 1. Another option is the Gaussian kernel D(w) = φ(w), where φ is the
standard Gaussian density, but for computational purposes, we typically favor kernels with
bounded support to avoid keeping negligible contributions from distant points.

12.5.2 Local Linear Regression


One point should be emphasized: kernel regression is not meant to produce a best-fit curve
which yields predictions for all values of x. Instead, kernel regression is designed to make a
prediction at a specific point x0 .

The goal is to find parameters β0 and β1 that solve the minimization problem

(β̂0 , β̂1 ) = arg min over β0 , β1 of Σ_{i=1}^N Kλ(x0 , xi ) (yi − β0 − β1 xi )², (12.23)

which is a least-squares criterion, weighted by the kernel function. Note that the parameters
β̂0 and β̂1 depend on the choice of target point x0 , which means they are not suitable for
predictions at other target points.

Let us rewrite (12.23) in matrix notation. Let W denote a diagonal matrix whose (i, i) entry
is Kλ(x0 , xi ). We denote by bi the column vector with entries 1 and xi , and by β the
column vector with entries β0 and β1 , so that we can write β0 + β1 xi as biT β. If we
denote by B the N × 2 matrix whose ith row is biT , then (12.23) is

β̂ = arg min over β of (y − Bβ)T W(y − Bβ), (12.24)

where y is the N × 1 vector whose ith entry is yi . We write y − Bβ = z and compute the
derivative w.r.t. β using (12.19) (note that W is symmetric):

d/dβi zT Wz = Σk ∂(zT Wz)/∂zk · ∂zk /∂βi = Σk 2 zk Wk,k (−Bk,i )

In matrix notation, we have

d/dβ (y − Bβ)T W(y − Bβ) = −2 (y − Bβ)T WB.

Setting the above equation to 0 and taking the transpose, we find

BT WB β̂ = BT Wy,

which yields the solution

β̂ = (BT WB)⁻¹ BT Wy. (12.25)
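A compact sketch of local linear regression at a single target point, using the tri-cube kernel (illustrative Python/NumPy code, not part of the notes; the data-generating function and bandwidth are arbitrary assumptions):

```python
import numpy as np

def tricube(w):
    """Tri-cube kernel shape D(w)."""
    w = np.abs(w)
    return np.where(w <= 1, (1 - w**3)**3, 0.0)

def local_linear_fit(x, y, x0, lam):
    """Local linear prediction at x0 with bandwidth lam, following (12.25)."""
    weights = tricube((x - x0) / lam)               # K_lambda(x0, x_i)
    B = np.column_stack([np.ones_like(x), x])       # rows b_i^T = (1, x_i)
    W = np.diag(weights)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)   # (B^T W B)^{-1} B^T W y
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)
print(local_linear_fit(x, y, x0=5.0, lam=1.5))      # roughly sin(5.0)
```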

12.5.3 Local Polynomial Regression


We mention in passing that (12.23) can be extended to polynomial regression, by minimizing
the criterion

(β̂0 , . . . , β̂p ) = arg min over β0 , . . . , βp of Σ_{i=1}^N Kλ(x0 , xi ) (yi − Σ_{j=0}^p βj xi^j )². (12.26)

12.6 Multivariate Change of Variable


Here, we discuss multivariate change of variables in a more formal context. Consider a contin-
uously differentiable mapping T : U ⊆ Rn → Rn . T has the form T (x) = (t1 (x), . . . , tn (x)).
Continuously differentiable means that the partial derivatives ∂xj ti all exist and are con-
tinuous on U (∂xj is the partial derivative operator with respect to the jth coordinate).

Define the Jacobian matrix JT (x) to be the n × n matrix whose (i, j) entry is ∂xj ti . In-
tuitively, the Jacobian at a point x describes how the change of variable T behaves in an
infinitesimally small neighborhood of x, and the determinant of the Jacobian describes the
local change in volume. (The sign of the determinant does not matter in this context; a
negative sign indicates that the orientation of the axes has flipped.)

Although the full details of the theorem require a more formal study of integration, we
present the theorem and sketch the proof below.

Lemma 12.14. Let T : Rn → Rn be an invertible linear mapping. Then for A ⊆ Rn ,

∫ over T(A) of dx = |det(T )| ∫ over A of dx. (12.27)

Proof Sketch. Due to key results in integration theory, it suffices to check (12.27) in the
case when A is a rectangle. The idea is that any invertible linear mapping T can be
represented as the composition of three different types of elementary operations:

1. Permute the coordinates of x.

2. Add one coordinate of x to another: (x1 , . . . , xn ) ↦ (x1 + x2 , x2 , . . . , xn ).

3. Multiply one of the coordinates of x by a scalar c ≠ 0.

We must verify that in each of the three cases, the determinant and the change in volume
of the rectangle agree.

1. Permutation of the coordinates does not change the volume of the rectangle. Since
the determinant is multiplied by ±1 in this case, |det(T )| is not affected.

2. The second case is a shear: each cross-section of the rectangle is merely translated,
so the volume does not change. The determinant is also unchanged.

3. The third case corresponds to multiplying one of the side lengths of the rectangle
by c, where multiplication by a negative number produces a reflection. Therefore,
the volume is multiplied by |c| and the determinant is multiplied by c.

Since (12.27) holds when A is a rectangle, (12.27) holds for all A.

Theorem 12.15 (Change of Variable). Let T : U ⊆ Rn → Rn be a continuously differ-
entiable, one-to-one mapping such that det(JT (x)) ≠ 0 for all x. Let f ≥ 0. Then

∫ over U of f (T (x)) |det(JT (x))| dx = ∫ over T(U) of f (x) dx. (12.28)

Proof Sketch. Again, due to key results in integration theory, it suffices to check (12.28)
in the case when f = 1T(A) , the indicator function for the set T (A), where A is a rectangle
lying inside U . Then, (12.28) reduces to

∫ over A of |det(JT (x))| dx = ∫ over T(A) of dx.

The idea is to break up A into finitely many rectangles which are small enough so that
T (x) can be locally approximated (around the point x0 ) by T (x0 ) + JT (x0 )(x − x0 ).
This is an affine map with invertible linear part, so the change in volume of a small
rectangle is given by Lemma 12.14 (the translation does not affect volume). Specifically,
if we partition A into the rectangles A1 , . . . , Ak with xi ∈ Ai , then

vol(T (A)) = Σ_{i=1}^k vol(T (Ai )) ≈ Σ_{i=1}^k |det(JT (xi ))| vol(Ai ) ≈ ∫ over A of |det(JT (x))| dx.

The tricky part lies in the details.

Technical Remark: The proof sketches of Lemma 12.14 and Theorem 12.15 both begin
with the words it suffices to check. . . , but there are different reasons behind each statement.
In Lemma 12.14, the rectangles form a generating class, that is, other sets can be formed
out of intersections and unions of rectangles, and a result about uniqueness of measures
concludes the proof. In Theorem 12.15, one can extend the proof from indicator functions to
simple functions (linear combinations of indicator functions) by linearity; and then to non-
negative functions by taking an increasing sequence of simple functions fn ↑ f and using the
Monotone Convergence Theorem. (The latter is a very common style of proof: prove the
statement for indicators, and extend the argument to general functions.)

12.7 Multivariate Normal Distribution


Suppose that Z1 , . . . , Zn are i.i.d. N (0, 1) random variables. Then Z = (Z1 , . . . , Zn ) is a
random vector which takes on values in Rn . We say that Z has the standard multivariate
normal distribution. By the assumption of independence, the density function for Z is
the product of n copies of the standard normal density:

φ(z) = (2π)^{−n/2} exp(−(1/2) Σ_{i=1}^n zi²) = (2π)^{−n/2} exp(−(1/2) zT z) (12.29)

Since each coordinate is a standard Gaussian, the expected value of each coordinate is 0:

E[Z] = 0 (12.30)

In place of the variance, we have the variance-covariance matrix Σ = ΣZ,Z . (See (12.1).)
By the assumption of independence, for i ≠ j, we have cov(Zi , Zj ) = 0, and for i = j, we
have cov(Zi , Zi ) = var(Zi ) = 1. Hence, the variance-covariance matrix is

Σ = I, (12.31)

where I is the n × n identity matrix.

Let Z ∼ N (0, I) be a standard multivariate Gaussian. To obtain the general case,
let X = μ + AZ, where A is non-singular. By the properties of expectation and variance,
E[X] = μ and Σ = AAT . We say that X ∼ N (μ, Σ) has the general multivariate Gaussian
distribution.

From Theorem 12.15, we can compute the density by taking the standard density (12.29),
replacing z by A⁻¹(x − μ) and dividing by the Jacobian determinant det(A). In differential
notation,

fX (x) dx = fX (μ + Az) (dx/dz) dz = φ(z) dz,

where

dx/dz = det(A)

(see (12.17), applied to the map z ↦ μ + Az). Replace z by A⁻¹(x − μ) to obtain

fX (x) det(A) = φ(A⁻¹(x − μ)),

or equivalently,

fX (x) = (1 / ((2π)^{n/2} det(A))) exp(−(1/2) (A⁻¹(x − μ))T (A⁻¹(x − μ))).

Now, since det(Σ) = det(AAT ) = det(A) det(AT ) = det(A)², we can replace det(A) by
det(Σ)^{1/2}. Also, we can simplify the argument of the exponential as

(x − μ)T (A⁻¹)T A⁻¹ (x − μ) = (x − μ)T (AAT )⁻¹ (x − μ) = (x − μ)T Σ⁻¹ (x − μ),

which gives the density of the general multivariate Gaussian:

fX (x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2) (x − μ)T Σ⁻¹ (x − μ)) (12.32)

By the Cholesky decomposition (Theorem 12.10), we can also start with any positive definite
matrix Σ and decompose it into AAT . (Note: here, we want Σ to be positive definite, that
is, the inequality xT Σx > 0 is strict for any x ≠ 0. This guarantees that A has strictly
positive diagonal entries, i.e. A is non-singular.)
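This construction is also how multivariate Gaussian samples are typically generated in practice. A sketch (using NumPy; the particular μ and Σ are arbitrary illustrative choices): factor Σ = AAT with a Cholesky decomposition and set X = μ + AZ.

```python
import numpy as np

rng = np.random.default_rng(5)

mu    = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # positive definite covariance

A = np.linalg.cholesky(Sigma)             # Sigma = A A^T
Z = rng.standard_normal(size=(100_000, 2))
X = mu + Z @ A.T                          # each row is one sample of X = mu + A Z

print(X.mean(axis=0))                     # close to mu
print(np.cov(X, rowvar=False))            # close to Sigma
```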

12.7.1 Uncorrelated Gaussians


Recall from Example 4.4 that uncorrelated random variables are not necessarily independent.
For the case of jointly Gaussian random variables, however, the statement does hold: if X
and Y are uncorrelated, then they are independent. To see this, observe that for jointly
Gaussian X and Y with cov(X, Y ) = 0,

( x − μX   y − μY ) [ σX²   0      ⁻¹  ( x − μX
                       0    σY² ]        y − μY ) = (x − μX)²/σX² + (y − μY)²/σY²,

so that

fX,Y (x, y) = (1 / (2π σX σY)) exp(−(1/2) [ (x − μX)²/σX² + (y − μY)²/σY² ]). (12.33)

We can clearly see that the joint density factors into the product of the marginal densities
of X and Y :

fX,Y (x, y) = (1/(√(2π) σX)) e^{−(x − μX)²/(2σX²)} · (1/(√(2π) σY)) e^{−(y − μY)²/(2σY²)}

Hence, X and Y are independent.

In the general case, when X has the multivariate Gaussian distribution and ΣX is diagonal
(so the pairwise covariances are all 0), then the components of X are independent.

12.8 Curse of Dimensionality


It is worth noting that many tasks, such as regression, are usually far more difficult to
perform in high numbers of dimensions, i.e. when there are many different features in the
training data. The number of possible configurations grows exponentially with the number of
dimensions, so any reasonable training set will contain examples which are sparsely located
in the high-dimensional space. Furthermore, most data points tend to lie far away from
the origin. We give an example to illustrate the idea:

Example 12.16. Pick N points (assume that N is even) uniformly at random inside
a p-dimensional unit ball. What is the median distance from the origin to the closest of
the N points, d(p, N )?

Let Xi denote the distance from the origin of point i and X = mini Xi . For a single
point, the CDF of Xi is P (Xi ≤ r) = r^p for 0 < r < 1, by the following argument: the
volume of a p-dimensional ball of radius r is c r^p (for some constant c) and the volume
of the unit ball is c · 1^p = c (with the same constant c), so the probability of lying within
radius r is c r^p / c = r^p . Then, for 0 < x < 1, P (X > x) = P (Xi > x)^N = (1 − x^p)^N .
We seek the value of x such that P (X > x) = 1/2, so d(p, N ) = (1 − 2^{−1/N})^{1/p}.

The result does not scale favorably with large p; there are very few points located near the
origin and most points lie near the boundary of the unit ball.

Collectively, the adverse conditions of high-dimensional statistics are termed the curse of
dimensionality.
Chapter 13

Statistical Estimators & Hypothesis


Testing

Here, we explore the most direct application of probability theory: the study of how to draw
inferences about the population given sampled data.

13.1 Important Sampling Distributions


In order to study sampling theory, we will require knowledge of more distributions, most of
which are derived from the normal distribution.

13.1.1 Squared Normal Distribution


Suppose Z N (0, 1). What is the density of Z 2 ?

The answer can be found using the change of variables formula. There is a minor difficulty
in that f (x) = x2 is not a one-to-one function, but as long as we remember to account for
this, the derivation is relatively straightforward.

FZ² (z) = P (Z² ≤ z) = P (−√z ≤ Z ≤ √z) = Φ(√z) − Φ(−√z) = 1 − 2Φ(−√z)

Differentiating the CDF, we obtain

fZ² (z) = φ(√z)/√z = (1/√(2πz)) e^{−z/2}, z > 0 (13.1)

Although this distribution may seem contrived, observe that fZ² (z) = (2π)^{−1/2} z^{−1/2} e^{−z/2}
is actually the gamma distribution with shape 1/2 and rate 1/2, i.e. Z² ∼ Γ(1/2, 1/2).
Incidentally, this tells us that (1/2)^{1/2} Γ(1/2)⁻¹ = (2π)^{−1/2}, so Γ(1/2) = √π. We can find
the mean and variance using the formulae we derived for the gamma distribution:

E[Z²] = 1 (13.2)
var(Z²) = 2 (13.3)


The MGF is

MZ² (t) = ( (1/2) / ((1/2) − t) )^{1/2} = (1 − 2t)^{−1/2}. (13.4)

13.1.2 Chi-Square Distribution


Define the chi-square distribution with n degrees of freedom to be the sum of n i.i.d.
random variables, each with the squared normal distribution. We denote the chi-square
distribution by X 2n . As a consequence of the definition and the results derived above,
we quickly obtain

E[X] = n, (13.5)
var(X) = 2n, (13.6)
MX () = (1 2)n/2 . (13.7)

Also as a consequence of the definition, if X 2m and Y 2n , then X+Y 2m+n . Since the
chi-square distribution with one degree of freedom has the (1/2, 1/2) distribution and the
sum of independent gamma random variables is also gamma, we see that 2n = (n/2, 1/2).
Using the formula for the density of the gamma distribution,

xn/21 ex/2
fX (x) = n/2 , x > 0. (13.8)
2 (n/2)
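A quick simulation check of the mean and variance (a sketch with NumPy, not part of the notes): sum n squared standard normals and compare against n and 2n.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 7
samples = (rng.standard_normal(size=(200_000, n)) ** 2).sum(axis=1)  # chi^2_n draws

print(samples.mean())   # close to n = 7
print(samples.var())    # close to 2n = 14
```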

13.1.3 t Distribution
Let Z ∼ N (0, 1) and X ∼ χ²_n be independent. The distribution of T = Z/√(X/n) is known
as the t distribution with n degrees of freedom, denoted T ∼ t_n .

Computing the density is tedious, but straightforward. We proceed in three steps. First, we
compute the density of √X:

P (√X < x) = P (X < x²) = FX (x²).

Differentiating, we obtain

f√X (x) = 2x fX (x²) = 2x · x^{n−2} e^{−x²/2} / (2^{n/2} Γ(n/2)) = x^{n−1} e^{−x²/2} / (2^{n/2 − 1} Γ(n/2)), x > 0.

Next, we use (7.12) to obtain the density of Z/√X. Here, the equation simplifies further
since √X ≥ 0.

f_{Z/√X}(z) = ∫_0^∞ x fZ (xz) f√X (x) dx = ∫_0^∞ x · (2π)^{−1/2} e^{−x²z²/2} · x^{n−1} e^{−x²/2} / (2^{n/2 − 1} Γ(n/2)) dx
            = (1 / (2^{(n−1)/2} √π Γ(n/2))) ∫_0^∞ x^n e^{−x²(z²+1)/2} dx

Write (z² + 1)/2 = λ and use the change of variables ξ = x² (so that x^n = ξ^{n/2} and
dx = (2√ξ)⁻¹ dξ). The integral we obtain has a familiar form:

f_{Z/√X}(z) = (1 / (2^{(n+1)/2} √π Γ(n/2))) ∫_0^∞ ξ^{(n−1)/2} e^{−λξ} dξ.

This is precisely the integral of an un-normalized gamma density with rate λ and shape
α = (n + 1)/2. Using (10.1) and the normalization condition, we see that

f_{Z/√X}(z) = (Γ((n + 1)/2) / (√π Γ(n/2))) (1 + z²)^{−(n+1)/2}.

Finally, we make the change of variables T = √n (Z/√X), which amounts to replacing z by
t/√n and dividing by √n in the expression above. Our final result is

fT (t) = (Γ((n + 1)/2) / (Γ(n/2) √(nπ))) (1 + t²/n)^{−(n+1)/2}. (13.9)
We can observe that the density is symmetric in t, which (provided n > 1, so that the mean exists) implies

E[T ] = 0. (13.10)

13.2 Statistical Estimators


13.2.1 Bias & Consistency
Before delving into examples of statistical estimators, we will first introduce important prop-
erties that we would like our statistical estimators to have. Let θ be the parameter that we
wish to estimate, and let θ̂n be a statistical estimator for θ, that is, θ̂n is a random variable
computed from the sample, which is supposed to produce a reasonable estimate of the true
parameter θ. Note that θ̂n is indexed by n, which represents the sample size. Generally
speaking, larger samples (larger values of n) produce better statistical estimators.

We would like our estimators to produce correct estimates on average:

Definition 13.1. A statistical estimator θ̂n for a parameter θ is unbiased if E[θ̂n ] = θ.

Although unbiased estimators are desirable, sometimes we are content with estimators that
are unbiased in the long run:

Definition 13.2. A statistical estimator θ̂n for a parameter θ is asymptotically un-
biased if lim as n → ∞ of E[θ̂n ] = θ.

In other words, each θ̂n may be biased, but the bias shrinks as n → ∞.

A stronger notion is that of statistical consistency, which we define below:



Definition 13.3. A statistical estimator θ̂n for a parameter θ is consistent if θ̂n → θ
in probability.

The reason why consistency is stronger than the property of being asymptotically unbiased
is that bias concerns only the expectation of the estimator, while consistency requires the
estimator itself to concentrate around θ (for instance, it suffices that both the bias and the
variance tend to 0).

Example 13.4. Suppose that we are trying to measure the mean μ of a quantity, and
we collect measurements x1 , . . . , xn . An example of an unbiased estimator is simply the
first element of the sample, x1 . However, x1 is not consistent, because it is not the case
that x1 → μ in probability.

13.2.2 Standard Error


Return now to the setting and notation of 12.3.1. Suppose also that the true data distribution
is of the form

y = Xβ + ε, (13.11)

so that y has a linear dependence on X, with an additive error term ε. We will assume that
ε follows the multivariate Gaussian distribution, where the different εi are uncorrelated:
ε ∼ N (0, σ² I). Suppose that we have already incorporated the bias term into the xi , so
that X is an N × (p + 1) design matrix. Can we find an unbiased estimator for σ²?

Having taken the time to interpret linear regression as a projection onto a subspace, the rest
is easier than you may expect. The predictions ŷ are the projection of y onto the (p + 1)-
dimensional feature space, which means that y − ŷ (the residuals) is the projection of y onto
the orthogonal complement of the feature space, R, which has N − (p + 1) dimensions. Using
(13.11), we observe that the projection of Xβ onto R is 0 (this is a manifestation of (12.7)),
so we are left with the projection of ε onto R (call this PR ε). Since the projection of an
N -dimensional Gaussian onto N − p − 1 dimensions is an (N − p − 1)-dimensional Gaussian,
we see that PR ε ∼ N (0, σ² I).

Define the RSS, the residual sum of squares, to be

RSS(β̂) := Σ_{i=1}^N (yi − ŷi )². (13.12)

Observe that RSS(β̂) is the squared norm of the residuals, i.e. RSS(β̂) = ‖y − ŷ‖₂². Since
each of the N − p − 1 components of y − ŷ (expressed in an orthonormal basis for R) is a
Gaussian with mean 0 and variance σ², the squared norm is the sum of N − p − 1 independent
random variables, each the square of a Gaussian. This should sound familiar, and we will
make the connection explicit:

(1/σ²) RSS(β̂) ∼ χ²_{N − p − 1} (13.13)


(The σ⁻² factor is to convert the Gaussians into standard Gaussians, of course.) Taking
expectations, we find that

E[RSS(β̂)] = σ² (N − p − 1),

so that an unbiased estimator σ̂² for σ² is:

σ̂² := (1/(N − p − 1)) Σ_{i=1}^N (yi − ŷi )² (13.14)

Specializing to the case where p = 0, the linear regression prediction is simply ȳ, the average
of the response values. This suggests that if we collect N samples y1 , . . . , yN , assumed to be
drawn from a N (μ, σ²) distribution, then an unbiased estimator for σ² is the sample variance:

σ̂² := (1/(N − 1)) Σ_{i=1}^N (yi − ȳ)² (13.15)

Note that, contrary to what you might have expected, N⁻¹ Σ_{i=1}^N (yi − ȳ)² is not an unbiased
estimator for σ².
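A small simulation sketch of (13.15) (illustrative only; the particular μ, σ², and sample size are arbitrary): dividing by N − 1 gives an unbiased estimate of σ², while dividing by N is biased low.

```python
import numpy as np

rng = np.random.default_rng(7)
sigma2, N, trials = 4.0, 10, 100_000

y = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=(trials, N))
dev2 = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

print((dev2 / (N - 1)).mean())   # close to sigma2 = 4.0
print((dev2 / N).mean())         # close to sigma2 * (N-1)/N = 3.6 (biased low)
```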

13.2.3 Sample Mean


Assume that Y ∼ N (μ, σ²) and we collect i.i.d. samples y1 , . . . , yn . We can estimate μ using
the sample mean, ȳ = n⁻¹ Σ_{i=1}^n yi , and we know that ȳ is an unbiased estimator: E[ȳ] = μ.
How can we determine if we are satisfied with our estimator?

Let us standardize our data and work with the quantity

z := (1/√n) Σ_{i=1}^n (yi − μ)/σ̂, (13.16)

where σ̂ is defined in (13.15). Let zi = (yi − ȳ)/σ. We can rewrite z as

z = (1/√n) Σ_{i=1}^n (yi − μ)/σ̂ = [ n^{−1/2} Σ_{i=1}^n (yi − μ)/σ ] / √( Σ_{i=1}^n zi² / (n − 1) ).

Although the expression seems complicated, the numerator has the standard normal distri-
bution, and from the discussion in the previous section, Σ_{i=1}^n zi² has the χ²_{n−1} distribution.
In symbols, z ∼ N (0, 1)/√(χ²_{n−1}/(n − 1)), which means that z ∼ t_{n−1}.

If we knew σ somehow, then we could use the quantity (1/√n) Σ_{i=1}^n (yi − μ)/σ, and this would
have the N (0, 1) distribution. However, since we do not have information about σ, we must
estimate σ with σ̂, and as a result, our statistic instead has the t_{n−1} distribution. Compared
to the normal distribution, the t distribution places slightly more area on its tails, reflecting
a greater degree of uncertainty in our estimate. More importantly, knowing the distribution
of z allows us to construct confidence intervals as a measure of quantifying our uncertainty.
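A simulation sketch of the statistic in (13.16) (illustrative, not from the notes; the particular μ, σ, and n are arbitrary): its sample quantiles match the t distribution with n − 1 degrees of freedom rather than the normal.

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, n, trials = 2.0, 3.0, 6, 200_000

y = rng.normal(mu, sigma, size=(trials, n))
ybar = y.mean(axis=1)
sighat = y.std(axis=1, ddof=1)                  # sample standard deviation, (n-1) denominator
z = np.sqrt(n) * (ybar - mu) / sighat           # the statistic in (13.16)

print(np.quantile(z, 0.975))   # ~2.57, the t_{5} quantile, not the normal quantile ~1.96
```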

13.3 Maximum Likelihood Estimation


Now that we have seen some examples of statistical estimators, we will introduce a principled
method of estimating parameters known as maximum likelihood estimation (MLE).
Suppose we have a model with unknown parameters θ that we would like to estimate, and
we have gathered data X = {x1 , . . . , xN }. The principle of maximum likelihood states that
we should use the values for θ which maximize the likelihood of the data, that is,

θ̂ = arg max over θ of Pθ (X) = arg max over θ of ∏_{i=1}^N Pθ (xi ), (13.17)

where Pθ (X) is the probability of drawing the samples X under the parameters θ. In the
second equality, we have made the common assumption that the data points are i.i.d., so
Pθ (X) decomposes into the product of the likelihoods for each point.

Often, for both computational and analytical ease, we prefer to work with the logarithm of
the probability. (Maximizing log Pθ (X) occurs at the same value of θ as maximizing
Pθ (X).) In this case, we seek

θ̂ = arg max over θ of Σ_{i=1}^N log Pθ (xi ). (13.18)

Equivalently, since (13.18) is a sum over the data points, we can think of (13.18) as
maximizing the expectation

θ̂ = arg max over θ of E[log Pθ (x)], (13.19)

where the expectation is taken w.r.t. the empirical distribution of the sampled data.
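As a concrete sketch (illustrative, not from the notes, and using SciPy's generic optimizer as an assumption), the MLE for the mean and standard deviation of a Gaussian can be found by maximizing the log-likelihood numerically; the answer matches the closed-form sample mean and the (1/N) sample standard deviation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
data = rng.normal(loc=1.5, scale=2.0, size=1000)

def neg_log_likelihood(params):
    # Negative Gaussian log-likelihood (constants dropped); sigma is
    # parameterized by its logarithm to keep it positive.
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum((data - mu) ** 2) / sigma**2 + data.size * log_sigma

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)        # ~ data.mean(), ~ data.std() (the 1/N version)
```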
Chapter 14

Randomized Algorithms

Randomness plays a crucial role in a number of important algorithms. There


are two types of randomized algorithms: a Las Vegas algorithm is guaranteed to output
a correct solution, but the amount of resources it consumes is random (for example, its time
complexity may be random, so it may occasionally require a large amount of time before
terminating). A Monte Carlo algorithm is an algorithm which returns a solution, but
the solution is not guaranteed to be optimal; here, the randomness pertains to the quality
of the solution, as measured by some appropriately defined criteria.

14.1 EM Algorithm
The EM algorithm (short for Expectation and Maximization, the two stages of the algo-
rithm) is a popular algorithm for fitting probabilistic models to data when there are hidden
or latent variables that are not observed as part of the data. The two stages proceed as
follows: in the Expectation step, we produce tentative assignments to the hidden variables;
in the Maximization step, we use the assignments of the hidden variables to fit the other
parameters. We will use a mixture of Gaussians model to illustrate the idea.

14.1.1 Mixture of Gaussians


The mixture of Gaussians model is a popular and flexible model which assumes that the
data is drawn from one of several Gaussian distributions. The model works by specifying
K different Gaussian distributions, where the kth Gaussian is N (μk , σk²), and a probability
distribution over {1, . . . , K} which specifies the probability of choosing each Gaussian. The
mixture of Gaussians model is a generative model, which means that it explicitly pre-
scribes a recipe for generating samples: first, choose an integer in {1, . . . , K} according to
the class distribution, and then sample from the corresponding Gaussian distribution.

We will focus on the case K = 2, so we have two Gaussian distributions: N (μ0 , σ0²)
and N (μ1 , σ1²). Let π be the probability of choosing N (μ1 , σ1²). Suppose we have data
{x1 , . . . , xN }. Our goal is to find suitable estimates for the parameters (five in total: μ0 , σ0²,
μ1 , σ1², and π) so that the log-likelihood of the model is maximized. Direct maximization

of the log-likelihood is difficult, so instead we consider what we would do if we knew, for


each data point xi , the Gaussian from which xi was generated. Let i be 0 if xi came from
N (0 , 02 ), and 1 otherwise. The best estimates for
0 and
1 are
PN PN
i=1 (1 i )xi i xi

0 = PN , 1 = Pi=1
N
. (14.1)
i=1 (1 i ) i=1 i

0 is the sample average of the data points which were generated from N (0 , 02 ), and
1 is
the sample average of the data points which were generated from N (1 , 1 ). Similarly, we
2

can estimate 02 and 12 with the quantities


PN 2
PN
(1 i )(x i
0 ) 1 )2
i (xi
02 = i=1PN
, 12 = i=1PN
. (14.2)
i=1 (1 i ) i=1 i

Now, let us turn to the problem of finding i . Although we do not know i , we can replace
i with its expectation, conditioned on the estimated parameters. Let ,2 denote the
density function of a N (, 2 ) random variable. We have
P (i = 1)1 ,12 (xi )
i := E[i | xi ] = P (i = 1 | xi ) = , (14.3)
P (i = 0)0 ,02 (xi ) + P (i = 1)1 ,12 (xi )

where P (i = 1) = . The EM algorithm proceeds by alternating two key steps: first, we


use (14.3) to assign responsibilities to each data point, i.e. the likelihood that the data
point came from N (1 , 12 ); second, we use the responsibilities to fit the parameters of the
Gaussians. The full algorithm is presented below.

1. Initialize μ̂0 , σ̂0², μ̂1 , σ̂1², and π̂.

2. Repeat until convergence:

   (a) Expectation. Calculate the responsibility for each data point, i = 1, . . . , N :

       γi = π̂ φμ̂1,σ̂1² (xi ) / [ (1 − π̂) φμ̂0,σ̂0² (xi ) + π̂ φμ̂1,σ̂1² (xi ) ] (14.4)

   (b) Maximization. Estimate the parameters.

       μ̂0 = Σ_{i=1}^N (1 − γi ) xi / Σ_{i=1}^N (1 − γi ),   μ̂1 = Σ_{i=1}^N γi xi / Σ_{i=1}^N γi , (14.5)

       σ̂0² = Σ_{i=1}^N (1 − γi )(xi − μ̂0 )² / Σ_{i=1}^N (1 − γi ),   σ̂1² = Σ_{i=1}^N γi (xi − μ̂1 )² / Σ_{i=1}^N γi , (14.6)

       π̂ = (1/N) Σ_{i=1}^N γi . (14.7)
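A compact sketch of this two-component EM loop (illustrative Python/NumPy code, not part of the notes; the initialization scheme and the simulated data below are arbitrary assumptions):

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, num_iters=200):
    # Crude initialization.
    mu0, mu1 = x.min(), x.max()
    var0 = var1 = x.var()
    pi = 0.5
    for _ in range(num_iters):
        # Expectation: responsibilities, as in (14.4).
        p1 = pi * normal_pdf(x, mu1, var1)
        p0 = (1 - pi) * normal_pdf(x, mu0, var0)
        gamma = p1 / (p0 + p1)
        # Maximization: parameter updates (14.5)-(14.7).
        mu0 = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
        mu1 = np.sum(gamma * x) / np.sum(gamma)
        var0 = np.sum((1 - gamma) * (x - mu0) ** 2) / np.sum(1 - gamma)
        var1 = np.sum(gamma * (x - mu1) ** 2) / np.sum(gamma)
        pi = gamma.mean()
    return mu0, var0, mu1, var1, pi

rng = np.random.default_rng(10)
x = np.concatenate([rng.normal(-2, 1, 700), rng.normal(3, 0.5, 300)])
print(em_two_gaussians(x))   # roughly (-2, 1, 3, 0.25, 0.3)
```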
References

Regression & Conditional Expectation: The discussion on the correlation coefficient


was influenced by [5].

Markov Chains: The Bernoulli-Laplace model of diffusion is from [1]. The queueing model
is from [7]. The gambler's ruin analysis is mainly taken from [1]. The proofs of the theorems
are mostly from [1], as well as the random walk on a circle.

Continuous Probability I: The queueing example with exponential service times is from
[7]. The bounds on the tail probability of the Gaussian are from [2].

Moment-Generating Functions: The use of cumulants was taken from [1]. The proba-
bility generating functions are from [7].

Convergence: The section on convergence in distribution was inspired by the treatment in


[1].

Martingales: The example of Pólya's urn was taken from [2].

Random Vectors & Linear Algebra: The notation for the non-Bayesian multivariate
regression is from [4]. The discussion on kernel regression is also heavily inspired by [4].
The multivariate change of variables theorem is a simplified treatment of the corresponding
theorem in [1]. The discussion on the curse of dimensionality is from [4].

Statistical Estimators & Hypothesis Testing: The discussion of the χ²_n and t_n distri-
butions was inspired by the exposition in [6]. The example of an inconsistent estimator is
taken from [3]. The distribution of the linear regression coefficients follows the discussion in
[4], with added details.

Randomized Algorithms: The discussion of the EM algorithm is from [4].

Bibliography

[1] Patrick Billingsley. Probability & Measure. New York, New York: John Wiley & Sons,
Inc., 1995.
[2] Rick Durrett. Probability: Theory & Examples. New York, New York: Cambridge Uni-
versity Press, 2010.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.
deeplearningbook.org. MIT Press, 2016.
[4] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical
Learning: Data Mining, Inference, & Prediction. Springer, 2008.
[5] Jim Pitman. Probability. New York, New York: Springer-Verlag New York, Inc., 1997.
[6] John A. Rice. Mathematical Statistics & Data Analysis. Belmont, California: Thomson
Higher Education, 2007.
[7] Howard M. Taylor and Samuel Karlin. An Introduction to Stochastic Modeling. San
Diego, California: Academic Press, 1998.
