
ST2334 Probability and Statistics

A/P Ajay Jasra

Office 06-18, S16
E-Mail: staja@nus.edu.sg
Department of Statistics and Applied Probability
National University of Singapore
6 Science Drive 2, Singapore 117546
Contents

1 Introduction to Probability
  1.1 Introduction
  1.2 Probability Triples
    1.2.1 Sample Space
    1.2.2 σ-Fields
    1.2.3 Probability
  1.3 Conditional Probability
    1.3.1 Bayes' Theorem
    1.3.2 Theorem of Total Probability
  1.4 Independence

2 Random Variables and their Distributions
  2.1 Introduction
  2.2 Random Variables and Distribution Functions
    2.2.1 Random Variables
    2.2.2 Distribution Functions
  2.3 Discrete Random Variables
    2.3.1 Probability Mass Functions
    2.3.2 Independence
    2.3.3 Expectation
    2.3.4 Dependence
    2.3.5 Conditional Distributions and Expectations
  2.4 Continuous Random Variables
    2.4.1 Probability Density Functions
    2.4.2 Independence
    2.4.3 Expectation
    2.4.4 Dependence
    2.4.5 Conditional Distributions and Expectations
    2.4.6 Functions of Random Variables
  2.5 Convergence of Random Variables
    2.5.1 Convergence in Distribution
    2.5.2 Convergence in Probability
    2.5.3 Central Limit Theorem

3 Introduction to Statistics
  3.1 Introduction
  3.2 Maximum Likelihood Estimation
    3.2.1 Introduction
    3.2.2 The Method
    3.2.3 Examples of Computing the MLE
  3.3 Hypothesis Testing
    3.3.1 Introduction
    3.3.2 Constructing Test Statistics
  3.4 Bayesian Inference
    3.4.1 Introduction
    3.4.2 Bayesian Estimation
    3.4.3 Examples

4 Miscellaneous Results
  4.1 Set Theory
  4.2 Summation
  4.3 Exponential Function
  4.4 Taylor Series
  4.5 Integration Methods
  4.6 Distributions
Course Information
Lecture times: This course is held from January 2013 to May 2013. Lecture times are 0800-1000 Wednesday and 1200-1400 Thursday at LT 27. The notes are available on my website: http://www.stat.nus.edu.sg/~staja/. I do not use IVLE.

Office Hour: My office hour is at 1600 on Thursday in 06-18, Department of Statistics and Applied Probability. I am not available at other times, unless there is a time-table clash.

Assessed Coursework: During the course there will be two assignments which make up 40% of the final grade (equal weighting). The dates when these assessments will be handed out will be provided at least two weeks beforehand, and you are given two weeks to complete each assessment. The assessments are to be handed to me in person or at the statistics office, S16, level 7. Due to the number of students on this course I do not accept assessments via e-mail (unless there are extreme circumstances, which would need to be verified by the department beforehand). There is NO mid-term examination.

Exam: There is a 2 hour final exam of 4 questions on 4th May, 1-3pm. No calculators are allowed and the examination is closed-book. You will be given a formula sheet to assist you in the examination (Tables 4.1 and 4.2).

Problem Sheets: There are also 10 non-assessed problem sheets available on my website. Typed solutions will be available on my website. In some weeks, we will discuss the solutions of the assessments after the deadline.

Course Details: These notes are not sufficient to replace lectures. In particular, many examples and clarifications are given during the class. This course investigates the following concepts: basic concepts of probability, conditional probability, independence, random variables, joint and marginal distributions, mean and variance, some common probability distributions, sampling distributions, estimation and hypothesis testing based on a normal population. There are three chapters of course content, the first covering the foundations of probability. We then move on to random variables and their distributions, which features the main content of the course and is a necessary prerequisite for further work in statistical modeling and randomized algorithms. The third chapter is a basic introduction to statistics and gives some basic notions which would be used in statistical modeling. The final chapter of the notes provides some mathematical background for the course, which you are strongly advised to read in the first week. You are expected to know everything that is in this chapter. In particular, the notion of double summation and integration is used routinely in this course and you should spend some time recalling these ideas.

References: The recommended reference for this course is [1] (Chapters 1-5). For a non-technical introduction, the book [2] provides an entertaining and intuitive look at probability.
Chapter 1

Introduction to Probability

1.1 Introduction

This Chapter provides a basic introduction to the notions that underlie the basics of probability. The level of the notes is below that required for complete mathematical rigor, but does provide a technical development of these ideas. Essentially, the basic ideas of probability theory start with a probability triple: an event-space (e.g. in flipping a coin, a head or tail), sets of events (e.g. tail, or tail and head) and a way to compare the likelihood of events by a probability distribution (the chance of obtaining a tail). Moving on from these notions, we consider the probability of events, given other events are known to have occurred (conditional probability) (the probability a coin lands tails, given we know it has two heads). Some events have probabilities which decouple in a special way and are called independent (for example, flipping a fair coin twice, the outcome of the first flip may not influence that of the second).

The structure of this Chapter is as follows: In Section 1.2, we discuss the idea of a probability triple; in Section 1.3 conditional probability is introduced and in Section 1.4 we discuss independence.
1.2 Probability Triples

As mentioned in the introduction, probability begins with a triple:

- A sample space (possible outcomes)
- A collection of sets of outcomes (a σ-field)
- A way to compare the likelihood of events (a probability measure)

We will slowly introduce and analyze these concepts.

1.2.1 Sample Space

The basic notion we begin with is that of an experiment with a collection of possible outcomes, for which we cannot (usually) determine exactly what will happen. For example, if we flip a coin, or if we watch a football game and so on; in general we do not know for certain what will happen. The idea of a sample space is as follows.

Definition 1.2.1. The set of all possible outcomes of an experiment is called the sample space and is denoted by $\Omega$.

Example 1.2.2. Consider rolling a six-sided fair die, once. There are 6 possible outcomes, and thus: $\Omega = \{1, 2, \dots, 6\}$. We may be interested (for example, for betting purposes) in the following events:

1. we roll a 1
2. we roll an even number
3. we roll an even number, which is less than 3
4. we roll an odd number

One thing that we immediately realize is that each of the events in the above example is a subset of $\Omega$; that is, each event (1)-(4) corresponds to:

1. $\{1\}$
2. $\{2, 4, 6\}$
3. $\{2\}$
4. $\{1, 3, 5\}$

As a result, we think of events as subsets of the sample space $\Omega$. Such events can be constructed by unions, intersections and complements of other events (we will formally explain why, below). For example, letting $A = \{2, 4, 6\}$ in case (2) above, we immediately yield that the event (4) is $A^c$. Similarly, letting $A$ be the event in (2) and $B$ be the event of rolling a number less than 4, we obtain that event (3) is $A \cap B$. For those of you that have forgotten basic set notations and terminologies, see Table 1.1.¹ In particular, we will think of $\Omega$ as the certain event (we will roll a 1, 2, ..., 6) and its complement $\Omega^c = \emptyset$ as the impossible event (we must roll something).

| Notation | Set Terminology | Probability Terminology |
| --- | --- | --- |
| $\Omega$ | Collection of objects | Sample space |
| $\omega \in \Omega$ | Member of $\Omega$ | Elementary event |
| $A \subseteq \Omega$ | Subset of $\Omega$ | Event that an outcome in $A$ occurs |
| $A^c$ | Complement of $A$ | Event that no outcome in $A$ occurs |
| $A \cap B$ | Intersection | Both $A$ and $B$ |
| $A \cup B$ | Union | $A$ or $B$ |
| $A \setminus B$ | Difference | $A$ but not $B$ |
| $A \triangle B$ | Symmetric difference | $A$ or $B$, but not both |
| $A \subseteq B$ | Inclusion | If $A$ then $B$ |
| $\emptyset$ | Empty set | Impossible event |

Table 1.1: Terminology of Set and Probability Theory

¹ For those of you that have forgotten set theory, Section 4.1 has a refresher.
1.2.2 σ-Fields

Now we have a notion of an event; in particular, events are subsets of $\Omega$. A particular question of interest is then: are all subsets of $\Omega$ events? The answer actually turns out to be no, but the technical reasons for this lie far outside the scope of this course. We will content ourselves to use a particular collection of sets $\mathcal{F}$ (a set of sets) of $\Omega$ which contains all the events that we can make probability statements about. This collection of sets is a σ-field.

Definition 1.2.3. A collection of sets $\mathcal{F}$ of $\Omega$ is called a σ-field if it satisfies the following conditions:

1. $\emptyset \in \mathcal{F}$
2. If $A_1, A_2, \dots \in \mathcal{F}$ then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$
3. If $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$

It can be shown that σ-fields are closed under countable intersections (i.e. if $A_1, A_2, \dots \in \mathcal{F}$ then $\bigcap_{i=1}^{\infty} A_i \in \mathcal{F}$). Whilst it may seem quite abstract, it will turn out that, for this course, all the sets we consider will lie in the σ-field $\mathcal{F}$.

Example 1.2.4. Consider flipping a single coin; letting $H$ denote a head and $T$ a tail, then: $\Omega = \{H, T\}$ and $\mathcal{F} = \{\emptyset, \Omega, \{H\}, \{T\}\}$.

Thus, to summarize, so far $(\Omega, \mathcal{F})$ is a sample space and a σ-field. The former is the collection of all possible outcomes and the latter is a collection of sets on $\Omega$ (the events) that follow Definition 1.2.3. Our next objective is to define a way of assigning a likelihood to each event.
Exercise 1.2.5. You are given De Morgan's Laws:
$$\Big(\bigcup_i A_i\Big)^c = \bigcap_i A_i^c \qquad \Big(\bigcap_i A_i\Big)^c = \bigcup_i A_i^c.$$
Show that if $A, B \in \mathcal{F}$, then $A \cap B$ and $A \setminus B$ are in $\mathcal{F}$ (just use the rules of σ-fields).
1.2.3 Probability

We now introduce a way to assign a likelihood to an event, via probability. One possible interpretation is the following: suppose my experiment is repeated many times; then the probability of any event is the limit of the ratio of the number of times the event occurs over the number of repetitions. We note that this is not the only interpretation of probability, but we do not diverge into a discussion of the philosophy of probability. Formally, we introduce the notion of a probability measure:

Definition 1.2.6. A probability measure $P$ on $(\Omega, \mathcal{F})$ is a function $P: \mathcal{F} \to [0, 1]$ which satisfies:

1. $P(\Omega) = 1$ and $P(\emptyset) = 0$
2. For $A_1, A_2, \dots$ disjoint ($i \neq j$, $A_i \cap A_j = \emptyset$) members of $\mathcal{F}$,
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$

The triple $(\Omega, \mathcal{F}, P)$, comprising a set $\Omega$, a σ-field $\mathcal{F}$ of subsets of $\Omega$ and a probability measure $P$, is called a probability space (or probability triple).

The idea is to associate the probability space with an experiment:

Example 1.2.7. Consider flipping a coin as in Example 1.2.4, with $\Omega = \{H, T\}$ and $\mathcal{F} = \{\emptyset, \Omega, \{H\}, \{T\}\}$. Then we can set:
$$P(\{H\}) = p \qquad P(\{T\}) = 1 - p \qquad p \in [0, 1].$$
If $p = 1/2$ we say that the coin is fair.

Example 1.2.8. Consider rolling a 6-sided die; then $\Omega = \{1, \dots, 6\}$ and $\mathcal{F} = \{0, 1\}^{\Omega}$ (the power set of $\Omega$, i.e. the set of all subsets of $\Omega$). Define a probability measure on $(\Omega, \mathcal{F})$ as:
$$P(A) = \sum_{i \in A} p_i \qquad A \in \mathcal{F}$$
where, for $i \in \{1, \dots, 6\}$, $0 \le p_i \le 1$ and $\sum_{i=1}^{6} p_i = 1$. If $p_i = 1/6$, $i \in \{1, \dots, 6\}$, then:
$$P(A) = \frac{\mathrm{Card}(A)}{6}$$
where $\mathrm{Card}(A)$ is the cardinality of $A$ (the number of elements in the set $A$).

We now consider a sequence of results on probability spaces.
Lemma 1.2.9. We have the following properties on a probability space $(\Omega, \mathcal{F}, P)$:

1. For any $A \in \mathcal{F}$, $P(A^c) = 1 - P(A)$.
2. For any $A, B \in \mathcal{F}$, if $A \subseteq B$ then $P(B) = P(A) + P(B \setminus A) \ge P(A)$.
3. For any $A, B \in \mathcal{F}$, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
4. (inclusion-exclusion formula) For any $A_1, \dots, A_n \in \mathcal{F}$:
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P\Big(\bigcap_{i=1}^{n} A_i\Big)$$
where, for example, $\sum_{i<j} = \sum_{j=1}^{n}\sum_{i=1}^{j-1}$, etc.

Proof. For (1), $A \cup A^c = \Omega$, so $1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c)$ (as $A \cap A^c = \emptyset$). Thus, one concludes by rearranging.

For (2), as $B = A \cup (B \setminus A)$ and $A \cap (B \setminus A) = \emptyset$ (recall Exercise 1.2.5), $P(B) = P(A) + P(B \setminus A)$. The inequality is clear from the definition of a probability.

For (3), $A \cup B = A \cup (B \setminus A)$, which is a disjoint union, and $B \setminus A = B \setminus (A \cap B)$, hence
$$P(A \cup B) = P(A) + P(B \setminus A) = P(A) + P(B \setminus (A \cap B)).$$
Now $(A \cap B) \subseteq B$, so by (2)
$$P(B \setminus (A \cap B)) = P(B) - P(A \cap B)$$
and thus
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
(4) can be proved by induction and is left as an exercise.

To conclude the Section, we introduce two terms:

- An event $A \in \mathcal{F}$ is null if $P(A) = 0$.
- An event $A \in \mathcal{F}$ occurs almost surely if $P(A) = 1$.

Null events are not impossible. We will see this when considering random variables which take values in the real line.
1.3 Conditional Probability

We now move on to the notion of conditional probability. It captures the probability of an event, given that another event is known to have occurred. For example, what is the probability that it rains today, given that it rained yesterday? We formalize the notion of conditional probability:

Definition 1.3.1. Consider a probability space $(\Omega, \mathcal{F}, P)$ and let $A, B \in \mathcal{F}$ with $P(B) > 0$. Then the conditional probability that $A$ occurs given that $B$ occurs is defined to be:
$$P(A|B) := \frac{P(A \cap B)}{P(B)}.$$

Example 1.3.2. A family has two children of different ages. What is the probability that both children are boys, given that at least one is a boy? The older and younger child may each be a boy or a girl, so the sample space is:
$$\Omega = \{GG, BB, GB, BG\}$$
and we assume that all outcomes are equally likely: $P(GG) = P(BB) = P(GB) = P(BG) = 1/4$ (the uniform distribution). We are interested in:
$$P(BB|GB \cup BB \cup BG) = \frac{P(BB \cap (GB \cup BB \cup BG))}{P(GB \cup BB \cup BG)} = \frac{P(BB)}{P(GB \cup BB \cup BG)} = \frac{1/4}{3/4} = \frac{1}{3}.$$
One can also ask the question: what is the probability that in a family of two children, where the younger of the two is a boy, both are boys? We want:
$$P(BB|BG \cup BB) = \frac{P(BB \cap (BG \cup BB))}{P(BG \cup BB)} = \frac{P(BB)}{P(BG \cup BB)} = \frac{1/4}{1/2} = \frac{1}{2}.$$
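Both answers are counter-intuitive enough that an empirical check is reassuring. Below is a minimal simulation sketch (not part of the original notes; it assumes NumPy and uses the relative-frequency interpretation of Section 1.2.3) that estimates the two conditional probabilities.

```python
import numpy as np

# Simulate many two-child families; 0 = girl, 1 = boy, each equally likely.
# Column 0 is one child, column 1 is the younger child.
rng = np.random.default_rng(0)
kids = rng.integers(0, 2, size=(10**6, 2))

both_boys = (kids[:, 0] == 1) & (kids[:, 1] == 1)
at_least_one_boy = (kids[:, 0] == 1) | (kids[:, 1] == 1)
younger_is_boy = kids[:, 1] == 1

# P(BB | at least one boy): restrict to families with at least one boy.
print(both_boys[at_least_one_boy].mean())  # approx 1/3
# P(BB | younger is a boy): restrict to families whose younger child is a boy.
print(both_boys[younger_is_boy].mean())    # approx 1/2
```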
1.3.1 Bayes' Theorem

An important result in conditional probability is Bayes' theorem.

Theorem 1.3.3. Consider a probability space $(\Omega, \mathcal{F}, P)$ and let $A, B \in \mathcal{F}$ with $P(A), P(B) > 0$. Then we have:
$$P(B|A) = \frac{P(A|B)P(B)}{P(A)}.$$

Proof. By definition
$$P(B|A) = \frac{P(A \cap B)}{P(A)}.$$
As $P(B) > 0$, $P(A \cap B) = P(A|B)P(B)$ and hence:
$$P(B|A) = \frac{P(A|B)P(B)}{P(A)}.$$
1.3.2 Theorem of Total Probability

We will use Bayes' Theorem after introducing the following result (the Theorem of Total Probability). We begin with a preliminary definition:

Definition 1.3.4. A family of sets $B_1, \dots, B_n$ is called a partition of $\Omega$ if:
$$B_i \cap B_j = \emptyset \ (i \neq j) \qquad \text{and} \qquad \bigcup_{i=1}^{n} B_i = \Omega.$$

Lemma 1.3.5. Consider a probability space $(\Omega, \mathcal{F}, P)$ and, for any fixed $n \ge 2$, let $B_1, \dots, B_n \in \mathcal{F}$ be a partition of $\Omega$, with $P(B_i) > 0$, $i \in \{1, \dots, n\}$. Then, for any $A \in \mathcal{F}$:
$$P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i).$$

Proof. We give the proof for $n = 2$, the other cases being very similar. We have $A = (A \cap B_1) \cup (A \cap B_2)$ (recall $B_1 \cap B_2 = \emptyset$, and $B_2 = B_1^c$), so
$$P(A) = P(A \cap B_1) + P(A \cap B_2) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2).$$
Example 1.3.6. Consider two factories which manufacture a product. If the product comes from factory I, it is defective with probability 1/5 and if it is from factory II, it is defective with probability 1/20. It is twice as likely that a product comes from factory I. What is the probability that a given product operates properly (i.e. is not defective)? Let $A$ denote the event that the product operates properly and $B$ denote the event that the product is made in factory I. Then:
$$P(A) = P(A|B)P(B) + P(A|B^c)P(B^c) = \frac{4}{5}\cdot\frac{2}{3} + \frac{19}{20}\cdot\frac{1}{3} = \frac{51}{60}.$$
If a product is defective, what is the probability that it came from factory I? This is $P(B|A^c)$; by Bayes' Theorem
$$P(B|A^c) = \frac{P(A^c|B)P(B)}{P(A^c)} = \frac{(1/5)(2/3)}{9/60} = \frac{8}{9}.$$
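The arithmetic above can be reproduced mechanically. A small sketch (illustrative only, not part of the original notes; it uses Python's standard `fractions` module) that applies the Theorem of Total Probability and Bayes' Theorem with exact fractions:

```python
from fractions import Fraction

p_B = Fraction(2, 3)              # product from factory I
p_def_given_B = Fraction(1, 5)    # defective | factory I
p_def_given_Bc = Fraction(1, 20)  # defective | factory II

# Theorem of Total Probability for A = "operates properly":
p_A = (1 - p_def_given_B) * p_B + (1 - p_def_given_Bc) * (1 - p_B)
print(p_A)  # 17/20, i.e. 51/60

# Bayes' Theorem for P(B | defective):
p_B_given_def = p_def_given_B * p_B / (1 - p_A)
print(p_B_given_def)  # 8/9
```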
1.4 Independence

The idea of independence is roughly that the probability of an event $A$ is unaffected by the occurrence of a (non-null) event $B$; that is, $P(A) = P(A|B)$, when $P(B) > 0$. More formally:

Definition 1.4.1. Consider a probability space $(\Omega, \mathcal{F}, P)$ and let $A, B \in \mathcal{F}$. $A$ and $B$ are independent if
$$P(A \cap B) = P(A)P(B).$$
More generally, a family of $\mathcal{F}$-sets $A_1, \dots, A_n$ ($\infty > n \ge 2$) are independent if
$$P\Big(\bigcap_{i=1}^{n} A_i\Big) = \prod_{i=1}^{n} P(A_i).$$

Two important related concepts follow. The first is pairwise independence:
$$P(A_i \cap A_j) = P(A_i)P(A_j) \qquad i \neq j.$$
This does not necessarily mean that $A_1, \dots, A_n$ are independent events. The second is conditional independence: given an event $C \in \mathcal{F}$ with $P(C) > 0$, events $A, B \in \mathcal{F}$ are conditionally independent if
$$P(A \cap B|C) = P(A|C)P(B|C).$$
This can be extended to families of sets $A_i$ given $C$.

Example 1.4.2. Consider a pack of playing cards, and suppose one chooses a card completely at random (i.e. no card counting etc.). One can assume that the probability of choosing a suit (e.g. spade) is independent of its rank. So, for example:
$$P(\text{spade} \cap \text{king}) = P(\text{spade})\,P(\text{king}) = \frac{13}{52}\cdot\frac{4}{52} = \frac{1}{52}.$$
Chapter 2

Random Variables and their Distributions

2.1 Introduction

Throughout the Chapter we assume that there is a probability space $(\Omega, \mathcal{F}, P)$, but its presence is minimized in the discussion. This is the main Chapter of the course and we focus upon random variables and their distributions (Section 2.2). In particular, a random variable is (essentially) just a map from $\Omega$ to some subset of the real line. Once we are there, we introduce notions of probability through distribution functions. Sometimes the random variables take values in a countable space (possibly countably infinite) and we call such random variables discrete (Section 2.3). The probabilities of such random variables are linked to probability mass functions and we use this concept to revisit independence and conditional probability. We also consider expectation (the theoretical average) and conditional expectation for discrete random variables. These ideas are revisited when the random variables are continuous (Section 2.4). The Chapter is concluded when we discuss the convergence of random variables (Section 2.5) and, in particular, the famous central limit theorem.

2.2 Random Variables and Distribution Functions

2.2.1 Random Variables

In general, we are often concerned not with an experiment itself, but with an associated consequence of the outcome. For example, a gambler is often interested in his or her profit or loss, rather than in the result that occurs. To deal with this issue, we introduce the notion of a random variable:

Definition 2.2.1. A random variable is a function $X: \Omega \to \mathbb{R}$ such that for each $x \in \mathbb{R}$, $\{\omega \in \Omega : X(\omega) \le x\} \in \mathcal{F}$. Such a function is said to be $\mathcal{F}$-measurable.

For the purpose of this course, the technical constraint of $\mathcal{F}$-measurability can be ignored; you can content yourself to think of $X(\omega)$ as a mapping from the sample space to the real line and continue onwards. In general we omit the argument $\omega$ and just write $X$, with possible realized values (numbers) written in lower-case $x$. The distinction between a random variable $X$ and its realized value is very important and you should maintain this convention. To move from the random variable back to our probability measure $P$, we write events $\{\omega \in \Omega : X(\omega) \le x\}$ as $\{X \le x\}$, and hence we write $P(\{X \le x\})$ as $P(X \le x)$ to mean $P(\{\omega \in \Omega : X(\omega) \le x\})$ (recall $\{\omega \in \Omega : X(\omega) \le x\} \in \mathcal{F}$, and so this makes sense). This leads us to the notion of a distribution function:

2.2.2 Distribution Functions

Definition 2.2.2. The distribution function of a random variable $X$ is the function $F: \mathbb{R} \to [0, 1]$ given by $F(x) = P(X \le x)$, $x \in \mathbb{R}$.
Example 2.2.3. Consider flipping a fair coin twice; then $\Omega = \{HH, TT, HT, TH\}$, with $\mathcal{F} = \{0, 1\}^{\Omega}$ and $P(HH) = P(TT) = P(HT) = P(TH) = 1/4$. Define a random variable $X$ as the number of heads, so
$$X(HH) = 2 \qquad X(HT) = X(TH) = 1 \qquad X(TT) = 0.$$
The associated distribution function is:
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ \tfrac{1}{4} & \text{if } 0 \le x < 1 \\ \tfrac{3}{4} & \text{if } 1 \le x < 2 \\ 1 & \text{if } x \ge 2. \end{cases}$$

A distribution function has the following properties, which we state without proof; see [1] for the proof. The lemma characterizes a distribution function: a function $F$ is a distribution function for some random variable if and only if it satisfies the three properties in the following lemma.

Lemma 2.2.4. A distribution function $F$ has the following properties:

1. $\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to \infty} F(x) = 1$,
2. If $x < y$ then $F(x) \le F(y)$,
3. $F$ is right continuous: $\lim_{\epsilon \downarrow 0} F(x + \epsilon) = F(x)$ for each $x \in \mathbb{R}$.

Example 2.2.5. Consider flipping a coin as in Example 1.2.4, which lands heads with probability $p \in (0, 1)$. Let $X: \Omega \to \mathbb{R}$ be given by
$$X(H) = 1 \qquad X(T) = 0.$$
The associated distribution function is
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - p & \text{if } 0 \le x < 1 \\ 1 & \text{if } x \ge 1. \end{cases}$$
$X$ is said to have a Bernoulli distribution.

We end the Section with another lemma, whose proof can be found in [1].

Lemma 2.2.6. A distribution function $F$ of a random variable $X$ has the following properties:

1. $P(X > x) = 1 - F(x)$ for any fixed $x \in \mathbb{R}$,
2. $P(x < X \le y) = F(y) - F(x)$ for any fixed $x, y \in \mathbb{R}$ with $x < y$,
3. $P(X = x) = F(x) - \lim_{y \uparrow x} F(y)$ for any fixed $x \in \mathbb{R}$.
2.3 Discrete Random Variables

2.3.1 Probability Mass Functions

We now move on to an important class of random variables called discrete random variables.

Definition 2.3.1. A random variable is said to be discrete if it takes values in some countable subset $\mathsf{X} = \{x_1, x_2, \dots\}$ of $\mathbb{R}$.

A discrete random variable takes values only at countably many points and hence its distribution function is a jump function, in that it jumps between different values at those points. An important concept is the probability mass function (PMF):

Definition 2.3.2. The probability mass function of a discrete random variable $X$ is the function $f: \mathsf{X} \to [0, 1]$ defined by $f(x) = P(X = x)$.

Remark 2.3.3. We generally use sans-serif notation $\mathsf{X}$, $\mathsf{Z}$ to denote the supports of a random variable; that is, the range of values for which its PMF (or PDF, as will be defined much later on) is (potentially) non-zero. Note, however, that the PMF may be defined on, say, $\mathbb{Z}$ or $\mathbb{R}$, but is taken as (potentially) non-zero only at those points in $\mathsf{X}$; it is always zero outside $\mathsf{X}$. We call $\mathsf{X}$ the support of the random variable.

We remark that
$$F(x) = \sum_{i: x_i \le x} f(x_i) \qquad f(x) = F(x) - \lim_{y \uparrow x} F(y)$$
which provides an association between the distribution function and the PMF. We have the following result, whose proof follows easily from the above definitions and results.

Lemma 2.3.4. A PMF satisfies:

1. the set of $x$ such that $f(x) \neq 0$ is countable,
2. $\sum_{x \in \mathsf{X}} f(x) = 1$.

Example 2.3.5. A coin is flipped $n$ times and the probability of obtaining a head on any flip is $p \in (0, 1)$; $\Omega = \{H, T\}^n$. Let $X$ denote the number of heads, which takes values in the set $\mathsf{X} = \{0, 1, \dots, n\}$ and is thus a discrete random variable. Consider $x \in \mathsf{X}$: exactly $\binom{n}{x}$ points in $\Omega$ give us $x$ heads, and each such point occurs with probability $p^x (1-p)^{n-x}$; hence
$$f(x) = \binom{n}{x} p^x (1-p)^{n-x} \qquad x \in \mathsf{X}.$$
The random variable $X$ is said to have a Binomial distribution and this is denoted $X \sim \mathcal{B}(n, p)$. Note that
$$\binom{n}{x} = \frac{n!}{(n-x)!\,x!}$$
with $x! = x \cdot (x-1) \cdots 1$ and $0! = 1$.
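A minimal sketch (illustrative, not part of the original notes; standard library only) evaluating this PMF and confirming Lemma 2.3.4 2., i.e. that the probabilities sum to 1; the values $n = 10$, $p = 0.3$ are arbitrary choices:

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """f(x) = C(n, x) p^x (1 - p)^(n - x) for x in {0, ..., n}."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
print(sum(pmf))  # 1.0 (up to floating-point rounding)
```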
Example 2.3.6. Let $\lambda > 0$ be given. A random variable $X$ that takes values in $\mathsf{X} = \{0, 1, 2, \dots\}$ is said to have a Poisson distribution with parameter $\lambda$ (denoted $X \sim \mathcal{P}(\lambda)$) if its PMF is:
$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!} \qquad x \in \mathsf{X}.$$
2.3.2 Independence

We now extend the notion of independence of events to the domain of random variables. Recall that events $A$ and $B$ are independent if the joint probability is equal to the product ($A$ does not affect $B$).

Definition 2.3.7. Discrete random variables $X$ and $Y$ are independent if the events $\{X = x\}$ and $\{Y = y\}$ are independent for each $(x, y) \in \mathsf{X} \times \mathsf{Y}$.

To understand this idea, let $\mathsf{X} = \{x_1, x_2, \dots\}$, $\mathsf{Y} = \{y_1, y_2, \dots\}$; then $X$ and $Y$ are independent if and only if the events $A_i = \{X = x_i\}$, $B_j = \{Y = y_j\}$ are independent for every possible pair $i, j$.

Example 2.3.8. Consider flipping a coin once, which lands tails with probability $p \in (0, 1)$. Let $X$ be the number of heads seen and $Y$ the number of tails. Then:
$$P(X = Y = 1) = 0$$
and
$$P(X = 1)P(Y = 1) = (1 - p)p \neq 0$$
so $X$ and $Y$ cannot be independent.

A useful result (which we do not prove) that is worth noting is the following.

Theorem 2.3.9. If $X$ and $Y$ are independent random variables and $g: \mathsf{X} \to \mathbb{R}$, $h: \mathsf{Y} \to \mathbb{R}$, then the random variables $g(X)$ and $h(Y)$ are also independent.

In full generality¹, consider a sequence of discrete random variables $X_1, X_2, \dots, X_n$, $X_i \in \mathsf{X}_i$; they are said to be independent if the events $\{X_1 = x_1\}, \dots, \{X_n = x_n\}$ are independent for every possible $(x_1, \dots, x_n) \in \mathsf{X}_1 \times \cdots \times \mathsf{X}_n$. That is:
$$P(X_1 = x_1, \dots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i) \qquad \forall (x_1, \dots, x_n) \in \mathsf{X}_1 \times \cdots \times \mathsf{X}_n.$$

¹ This point will not make too much sense until we see Section 2.3.4.
2.3.3 Expectation

Throughout your statistical training, you may have encountered the notion of an average or mean value of data. In this Section we consider the idea of the average value of a random variable, which is called the expected value.

Definition 2.3.10. The expected value of a discrete random variable $X$ on $\mathsf{X}$, with PMF $f$, denoted $E[X]$, is defined as
$$E[X] := \sum_{x \in \mathsf{X}} x f(x)$$
whenever the sum on the R.H.S. is absolutely convergent.

Example 2.3.11. Recall the Poisson random variable from Example 2.3.6, $X \sim \mathcal{P}(\lambda)$. The expected value is:
$$E[X] = \sum_{x=0}^{\infty} x \frac{\lambda^x e^{-\lambda}}{x!} = \lambda e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} = \lambda e^{-\lambda} e^{\lambda} = \lambda$$
where we have used the Taylor series expansion of the exponential function in the penultimate step.
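Under the relative-frequency interpretation, a sample average of many Poisson draws should be close to $\lambda$. A quick sketch (illustrative only, NumPy assumed; $\lambda = 2.5$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.5
x = rng.poisson(lam, size=10**6)
print(x.mean())  # approx 2.5 = lambda
```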
We now look at the idea of an expectation of a function of a random variable:

Lemma 2.3.12. Given a discrete random variable $X$ on $\mathsf{X}$, with PMF $f$, and $g: \mathsf{X} \to \mathbb{R}$:
$$E[g(X)] = \sum_{x \in \mathsf{X}} g(x) f(x)$$
whenever the sum on the R.H.S. is absolutely convergent.

Example 2.3.13. Returning to Example 2.3.11, we have
$$E[X^2] = \sum_{x=0}^{\infty} x^2 \frac{\lambda^x e^{-\lambda}}{x!} = \lambda e^{-\lambda} \sum_{x=1}^{\infty} x \frac{\lambda^{x-1}}{(x-1)!} = \lambda e^{-\lambda} \frac{d}{d\lambda}\Big(\sum_{x=1}^{\infty} \frac{\lambda^x}{(x-1)!}\Big) = \lambda e^{-\lambda} \frac{d}{d\lambda}(\lambda e^{\lambda}) = \lambda^2 + \lambda$$
where we have assumed that it is legitimate to swap differentiation and summation (it turns out that this is true here).

An important concept is the moment generating function:

Definition 2.3.14. For a discrete random variable $X$ the moment generating function (MGF) is
$$M(t) = E[e^{Xt}] = \sum_{x \in \mathsf{X}} e^{xt} f(x) \qquad t \in \mathsf{T}$$
where $\mathsf{T}$ is the set of $t$ for which $\sum_{x \in \mathsf{X}} e^{xt} f(x) < \infty$.
Exercise 2.3.15. Show that
$$E[X] = M'(0) \qquad E[X^2] = M^{(2)}(0)$$
when the right-hand derivatives exist.

The moment generating function is thus a simple way to obtain moments, if it is simple to differentiate $M(t)$. Note also that it can be proven that the MGF uniquely characterizes a distribution.

Example 2.3.16. Let $X \sim \mathcal{P}(\lambda)$; then:
$$E[e^{Xt}] = \sum_{x=0}^{\infty} e^{xt} \frac{\lambda^x}{x!} e^{-\lambda} = \sum_{x=0}^{\infty} \frac{(\lambda e^t)^x}{x!} e^{-\lambda} = \exp\{\lambda(e^t - 1)\}.$$
Then
$$M'(0) = \lambda.$$
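To see Exercise 2.3.15 at work, one can differentiate this MGF numerically at $t = 0$ and compare with $\lambda$ and $\lambda^2 + \lambda$. A sketch using central finite differences (illustrative only; the step size h is a tuning choice, not from the notes):

```python
from math import exp

lam, h = 2.5, 1e-5

def M(t: float) -> float:
    # MGF of a Poisson(lambda) random variable, from Example 2.3.16.
    return exp(lam * (exp(t) - 1.0))

# Central finite differences for M'(0) and M''(0).
m1 = (M(h) - M(-h)) / (2 * h)
m2 = (M(h) - 2 * M(0.0) + M(-h)) / h**2
print(m1)  # approx lambda = 2.5
print(m2)  # approx lambda^2 + lambda = 8.75
```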
An important special case of functions of random variables is the following:

Definition 2.3.17. Given a discrete random variable $X$ on $\mathsf{X}$, with PMF $f$, and $k \in \mathbb{Z}^+$, the $k$-th moment of $X$ is
$$E[X^k]$$
and the $k$-th central moment of $X$ is
$$E[(X - E[X])^k].$$

Of particular importance is the idea of the variance:
$$\mathrm{Var}[X] := E[(X - E[X])^2].$$
Now, we have the following calculations:
$$E[(X - E[X])^2] = \sum_{x \in \mathsf{X}} (x - E[X])^2 f(x) = \sum_{x \in \mathsf{X}} x^2 f(x) - 2E[X] \sum_{x \in \mathsf{X}} x f(x) + E[X]^2 \sum_{x \in \mathsf{X}} f(x) = E[X^2] - E[X]^2.$$
Hence we have shown that
$$\mathrm{Var}[X] = E[X^2] - E[X]^2. \tag{2.3.1}$$
Note a very important point: for any absolutely convergent sum $\sum a_n$, if $a_n \ge 0$ for each $n$, then $\sum a_n \ge 0$. Now as $\mathrm{Var}[X] := E[(X - E[X])^2] = \sum_{x \in \mathsf{X}} (x - E[X])^2 f(x)$ and clearly
$$(x - E[X])^2 f(x) \ge 0 \qquad \forall x \in \mathsf{X},$$
it follows that variances cannot be negative.²

Example 2.3.18. Returning to Examples 2.3.11 and 2.3.13, when $X \sim \mathcal{P}(\lambda)$ we have shown that
$$E[X^2] = \lambda^2 + \lambda \qquad E[X] = \lambda.$$
Hence, on using (2.3.1):
$$\mathrm{Var}[X] = \lambda.$$
That is, for a Poisson random variable, the mean and variance are equal.

² This is very important and you will not be told again.
Exercise 2.3.19. Compute the mean and variance of a Binomial $\mathcal{B}(n, p)$ random variable. [Hint: consider the differentiation, w.r.t. $x$, of the identity
$$\sum_{k=0}^{n} \binom{n}{k} x^k = (1 + x)^n.]$$

We now state a Theorem whose proof we do not give. In general, it is simple to establish once the concept of a joint distribution has been introduced; we have not done this, so we simply state the result.

Theorem 2.3.20. The expectation operator has the following properties:

1. if $X \ge 0$, then $E[X] \ge 0$
2. if $a, b \in \mathbb{R}$ then $E[aX + bY] = aE[X] + bE[Y]$
3. if $X = c \in \mathbb{R}$ always, then $E[X] = c$.

An important result that we will use later on, and whose proof is omitted, is as follows.

Lemma 2.3.21. If $X$ and $Y$ are independent then $E[XY] = E[X]E[Y]$.

A notion of dependence (linear dependence) is correlation. This will be discussed in detail later on.

Definition 2.3.22. $X$ and $Y$ are uncorrelated if $E[XY] = E[X]E[Y]$.

It is important to remark that random variables that are independent are uncorrelated. However, uncorrelated variables are not necessarily independent; we will explore this idea later on.

We end the section with some useful properties of the variance operator.

Theorem 2.3.23. For random variables $X$ and $Y$:

1. For $a \in \mathbb{R}$, $\mathrm{Var}[aX] = a^2 \mathrm{Var}[X]$,
2. For $X, Y$ uncorrelated, $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.
Proof. For 1. we have:
$$\mathrm{Var}[aX] = E[(aX)^2] - E[aX]^2 = a^2 E[X^2] - a^2 E[X]^2 = a^2 \mathrm{Var}[X]$$
where we have used Theorem 2.3.20 2. in the second equality.

Now for 2. we have:
$$\begin{aligned}
\mathrm{Var}[X + Y] &= E[(X + Y - E[X + Y])^2] \\
&= E[X^2 + Y^2 + 2XY + E[X + Y]^2 - 2(X + Y)E[X + Y]] \\
&= E[X^2] + E[Y^2] + 2E[XY] - E[X + Y]^2 \\
&= E[X^2] + E[Y^2] + 2E[X]E[Y] - (E[X]^2 + E[Y]^2 + 2E[X]E[Y]) \\
&= E[X^2] - E[X]^2 + E[Y^2] - E[Y]^2 \\
&= \mathrm{Var}[X] + \mathrm{Var}[Y]
\end{aligned}$$
where we have repeatedly used Theorem 2.3.20 2. and, in the fourth line, that $X$ and $Y$ are uncorrelated, so $E[XY] = E[X]E[Y]$.
2.3.4 Dependence

As we saw in the previous Section, there is a need to define distributions on more than one random variable. We will do that in this Section. We start with the following definition (which one can easily generalize to a collection of $n \ge 1$ discrete random variables):

Definition 2.3.24. The joint distribution function $F: \mathbb{R}^2 \to [0, 1]$ of $X, Y$, where $X$ and $Y$ are discrete random variables, is given by
$$F(x, y) = P(X \le x \cap Y \le y).$$
Their joint mass function $f: \mathbb{R}^2 \to [0, 1]$ is given by
$$f(x, y) = P(X = x \cap Y = y).$$
|         | $y = -1$ | $y = 0$ | $y = 2$ | $f(x)$ |
| ---     | ---      | ---     | ---     | ---    |
| $x = 1$ | 1/18     | 3/18    | 2/18    | 6/18   |
| $x = 2$ | 2/18     | 0       | 3/18    | 5/18   |
| $x = 3$ | 0        | 4/18    | 3/18    | 7/18   |
| $f(y)$  | 3/18     | 7/18    | 8/18    |        |

Table 2.1: The joint mass function in Example 2.3.26. The row totals are the marginal mass function of $X$ (and sum to 1) and the column totals are the marginal mass function of $Y$.

In general, it may be that the random variables are defined on a space $\mathsf{Z}$ which may not be decomposable into a cartesian product $\mathsf{X} \times \mathsf{Y}$. In such scenarios we write the joint support as simply $\mathsf{Z}$ and omit $\mathsf{X}$ and $\mathsf{Y}$; in this scenario, we will use the notation $\sum_x$ or $\sum_y$ to denote sums over the supports induced by $\mathsf{Z}$. This concept will be clarified during a reading of the subsequent text.

Of particular importance are the marginal PMFs of $X$ and $Y$:
$$f(x) = \sum_y f(x, y) \qquad f(y) = \sum_x f(x, y).$$
Note that, clearly, $\sum_x f(x) = 1 = \sum_y f(y)$.
Example 2.3.25. Consider Theorem 2.3.20 2. and, for simplicity, suppose $\mathsf{Z} = \mathsf{X} \times \mathsf{Y}$. Now, we have
$$\begin{aligned}
E[aX + bY] &= \sum_{x \in \mathsf{X}} \sum_{y \in \mathsf{Y}} (ax + by) f(x, y) \\
&= \sum_{x \in \mathsf{X}} ax \sum_{y \in \mathsf{Y}} f(x, y) + \sum_{y \in \mathsf{Y}} by \sum_{x \in \mathsf{X}} f(x, y) \\
&= a \sum_{x \in \mathsf{X}} x f(x) + b \sum_{y \in \mathsf{Y}} y f(y) \\
&= aE[X] + bE[Y].
\end{aligned}$$

Example 2.3.26. Suppose $\mathsf{X} = \{1, 2, 3\}$ and $\mathsf{Y} = \{-1, 0, 2\}$; then an example of a joint PMF can be found in Table 2.1. From the table, we have:
$$E[XY] = \sum_{x \in \mathsf{X}} \sum_{y \in \mathsf{Y}} x y f(x, y) = \frac{29}{18}$$
(just sum the 9 values in the table, multiplying each time by the product of the associated $x$ and $y$). Similarly
$$E[X] = \frac{6}{18} + 2 \cdot \frac{5}{18} + 3 \cdot \frac{7}{18} = \frac{37}{18}$$
and
$$E[Y] = -1 \cdot \frac{3}{18} + 0 \cdot \frac{7}{18} + 2 \cdot \frac{8}{18} = \frac{13}{18}.$$
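The sums in this example (and the covariance computed in Example 2.3.32 below) are mechanical, so they are easy to check in code. A sketch (illustrative only, NumPy assumed) encoding Table 2.1:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([-1.0, 0.0, 2.0])
# f[i, j] = P(X = xs[i], Y = ys[j]), from Table 2.1.
f = np.array([[1, 3, 2],
              [2, 0, 3],
              [0, 4, 3]]) / 18.0

fx, fy = f.sum(axis=1), f.sum(axis=0)     # marginal PMFs
EX, EY = xs @ fx, ys @ fy                 # 37/18 and 13/18
EXY = xs @ f @ ys                         # 29/18
cov = EXY - EX * EY                       # 41/324
var_x = (xs**2) @ fx - EX**2              # 233/324
var_y = (ys**2) @ fy - EY**2              # 461/324
print(cov, cov / (var_x * var_y) ** 0.5)  # 41/324 and 41/sqrt(107413)
```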
We now formalize independence in a result which provides us a more direct way to ascertain whether two random variables $X$ and $Y$ are independent.

Lemma 2.3.27. The discrete random variables $X$ and $Y$ are independent if and only if
$$f(x, y) = f(x) f(y) \qquad \forall x, y \in \mathbb{R}.$$
More generally, $X$ and $Y$ are independent if and only if $f(x, y)$ factorizes into a product $g(x) h(y)$, with $g$ a function only of $x$ and $h$ a function only of $y$.

Example 2.3.28. Consider the joint PMF:
$$f(x, y) = \frac{\lambda^x e^{-\lambda}}{x!} \cdot \frac{\lambda^y e^{-\lambda}}{y!} \qquad (x, y) \in \mathsf{X} \times \mathsf{Y}, \quad \mathsf{X} = \mathsf{Y} = \{0, 1, \dots\}.$$
Now clearly
$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!} \quad x \in \mathsf{X} \qquad f(y) = \frac{\lambda^y e^{-\lambda}}{y!} \quad y \in \mathsf{Y}.$$
It is also clear, via Lemma 2.3.27, that the random variables $X$ and $Y$ are independent and identically distributed (and Poisson distributed).
As in the case of a single variable, we are interested in the expectation of a function of two random variables (strictly, in the proof of Theorem 2.3.23 we have already used the following result):

Lemma 2.3.29. $E[g(X, Y)] = \sum_{(x,y) \in \mathsf{X} \times \mathsf{Y}} g(x, y) f(x, y)$.

Recall that in the previous Section we mentioned a notion of dependence called correlation. In order to formally define this concept, we first introduce the covariance and then correlation.

Definition 2.3.30. The covariance between $X$ and $Y$ is
$$\mathrm{Cov}[X, Y] := E[(X - E[X])(Y - E[Y])].$$
The correlation between $X$ and $Y$ is
$$\rho(X, Y) = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\mathrm{Var}[Y]}}.$$

Exercise 2.3.31. Show that
$$E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y].$$
Thus, independent random variables have zero correlation.

Example 2.3.32. Returning to Example 2.3.26, we have:
$$\mathrm{Var}[X] = 233/324 \qquad \mathrm{Var}[Y] = 461/324$$
and
$$\mathrm{Cov}[X, Y] = 29/18 - (37/18)(13/18) = 41/324.$$
Thus
$$\rho(X, Y) = 41/\sqrt{107413}.$$

Some useful facts about correlations that we do not prove:

- $|\rho(X, Y)| \le 1$. Random variables with correlation 1 or -1 are said to be perfectly positively or negatively correlated.
- The correlation coefficient is 1 iff $Y$ increases linearly with $X$ and -1 iff $Y$ decreases linearly as $X$ increases.

When we consider continuous random variables, an example of uncorrelated random variables that are not independent will be given.
2.3.5 Conditional Distributions and Expectations

In Section 1.3 we discussed the idea of conditional probabilities associated to events. This idea can be extended to discrete-valued random variables:

Definition 2.3.33. The conditional distribution function of $Y$ given $X = x$, written $F_{Y|X}(\cdot|x)$, is defined by
$$F_{Y|X}(y|x) = P(Y \le y | X = x)$$
for any $x$ with $P(X = x) > 0$. The conditional PMF of $Y$ given $X = x$ is defined by
$$f(y|x) = P(Y = y | X = x)$$
when $x$ is such that $P(X = x) > 0$.

We remark that, in particular, the conditional mass function of $Y$ given $X = x$ is:
$$f(y|x) = \frac{f(x, y)}{f(x)} = \frac{f(x, y)}{\sum_y f(x, y)}.$$
Example 2.3.34. Suppose that $Y \sim \mathcal{P}(\lambda)$ and $X|Y = y \sim \mathcal{B}(y, p)$. Find the conditional probability mass function of $Y | X = x$. Note that the random variables lie in the space $\mathsf{Z} = \{(x, y) : y \in \{0, 1, \dots\},\ x \in \{0, 1, \dots, y\}\}$. We have
$$f(y|x) = \frac{f(x, y)}{f(x)} = \frac{f(x|y) f(y)}{f(x)} \qquad (x, y) \in \mathsf{Z}$$
which is a version of Bayes' Theorem for discrete random variables. Note that for $(x, y) \in \mathsf{Z}$ all the PMFs above are positive. Now for $x \in \{0, 1, \dots\}$
$$f(x) = \sum_{y=x}^{\infty} \binom{y}{x} p^x (1-p)^{y-x} \frac{\lambda^y e^{-\lambda}}{y!} = \frac{(\lambda p)^x e^{-\lambda}}{x!} \sum_{y=x}^{\infty} \frac{(\lambda(1-p))^{y-x}}{(y-x)!} = \frac{(\lambda p)^x}{x!} e^{-\lambda p}.$$
Thus we have, for $y \in \{x, x+1, \dots\}$:
$$f(y|x) = \frac{\binom{y}{x} p^x (1-p)^{y-x} \frac{\lambda^y e^{-\lambda}}{y!}}{\frac{(\lambda p)^x}{x!} e^{-\lambda p}}$$
which, after some algebra, becomes:
$$f(y|x) = \frac{(\lambda(1-p))^{y-x} e^{-\lambda(1-p)}}{(y-x)!} \qquad y \in \{x, x+1, \dots\}.$$
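The conditional PMF just derived says that, given $X = x$, the "excess" $Y - x$ is Poisson with parameter $\lambda(1-p)$. A simulation sketch of this (illustrative only, NumPy assumed; $\lambda = 4$, $p = 0.3$ and the conditioning value $x = 2$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, p, n = 4.0, 0.3, 10**6

y = rng.poisson(lam, size=n)       # Y ~ Poisson(lambda)
x = rng.binomial(y, p)             # X | Y = y ~ Binomial(y, p)

x0 = 2                             # condition on the event {X = 2}
cond = y[x == x0] - x0             # samples of (Y - x0) given X = x0
print(cond.mean(), lam * (1 - p))  # both approx 2.8
print(cond.var(), lam * (1 - p))   # Poisson: variance also approx 2.8
```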
Given the idea of a conditional distribution, we move on to the idea of a conditional expectation:

Definition 2.3.35. The conditional expectation of a random variable $Y$, given $X = x$, is
$$E[Y | X = x] = \sum_y y f(y|x)$$
given that the conditional PMF is well-defined. We generally write $E[Y|X]$ or $E[Y|x]$.

An important result associated to conditional expectations is as follows:

Theorem 2.3.36. The conditional expectation satisfies:
$$E[E[Y|X]] = E[Y]$$
assuming the expectations all exist.

Proof. We have
$$E[Y] = \sum_y y f(y) = \sum_{(x,y) \in \mathsf{Z}} y f(x, y) = \sum_{(x,y) \in \mathsf{Z}} y f(y|x) f(x) = \sum_x \Big[\sum_y y f(y|x)\Big] f(x) = E[E[Y|X]].$$
Example 2.3.37. Let us return to Example 2.3.34. Find $E[X|Y]$, $E[Y|X]$ and $E[X]$. From Exercise 2.3.19, you should have derived that if $Z \sim \mathcal{B}(n, p)$ then $E[Z] = np$. Then
$$E[X | Y = y] = yp.$$
Thus
$$E[X] = E[E[X|Y]] = pE[Y].$$
As $Y \sim \mathcal{P}(\lambda)$,
$$E[X] = \lambda p.$$
From Example 2.3.34,
$$f(y|x) = \frac{(\lambda(1-p))^{y-x} e^{-\lambda(1-p)}}{(y-x)!} \qquad y \in \{x, x+1, \dots\}.$$
Thus
$$E[Y | X = x] = \sum_{y=x}^{\infty} y \frac{(\lambda(1-p))^{y-x} e^{-\lambda(1-p)}}{(y-x)!}.$$
Setting $u = y - x$ in the summation, one yields
$$E[Y | X = x] = \sum_{u=0}^{\infty} (u + x) \frac{(\lambda(1-p))^u e^{-\lambda(1-p)}}{u!} = \sum_{u=0}^{\infty} u \frac{(\lambda(1-p))^u e^{-\lambda(1-p)}}{u!} + x = \lambda(1-p) + x.$$
Here, we have used the fact that $U \sim \mathcal{P}(\lambda(1-p))$ has mean $\lambda(1-p)$.
We end the Section with a result which can be very useful in practice.

Theorem 2.3.38. We have, for any $g: \mathbb{R} \to \mathbb{R}$,
$$E[E[Y|X] g(X)] = E[Y g(X)]$$
assuming the expectations all exist.

Proof. We have
$$E[Y g(X)] = \sum_{(x,y) \in \mathsf{Z}} y g(x) f(x, y) = \sum_{(x,y) \in \mathsf{Z}} y g(x) f(y|x) f(x) = \sum_x g(x) \Big[\sum_y y f(y|x)\Big] f(x) = E[E[Y|X] g(X)].$$
2.4 Continuous Random Variables

In the previous Section, we considered random variables which take values in a countable (finite or countably infinite) set. However, there are of course a wide variety of applications where one sees numerical outcomes of an experiment that can lie potentially anywhere on the real line. Thus we extend each of the concepts that we saw for discrete random variables to continuous ones. In a rather informal (and incorrect) manner, one can simply think of the idea of replacing summation with integration; of course things will be more challenging than this, but one should keep this idea in the back of one's mind.
2.4.1 Probability Density Functions

A random variable is said to be continuous if its distribution function $F(x) = P(X \le x)$ can be written as
$$F(x) = \int_{-\infty}^{x} f(u)\,du$$
where $f: \mathbb{R} \to [0, \infty)$, the R.H.S. is the usual Riemann integral and we will assume that the R.H.S. is differentiable.

Definition 2.4.1. The function $f$ is called the probability density function (PDF) of the continuous random variable $X$.

We note that, under our assumptions, $f(x) = F'(x)$.

Example 2.4.2. One of the most commonly used PDFs is the Gaussian or normal distribution:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{1}{2\sigma^2}(x - \mu)^2\Big\} \qquad x \in \mathsf{X} = \mathbb{R}$$
where $\mu \in \mathbb{R}$, $\sigma^2 > 0$. We use the notation $X \sim \mathcal{N}(\mu, \sigma^2)$.

One of the key points associated to PDFs is as follows: the numerical value $f(x)$ does not represent the probability that $X$ takes the value $x$. The technical explanation goes far beyond the mathematical level of this course, but perhaps an intuitive reason is simply that there are uncountably many points in $\mathsf{X}$ (so assigning a positive probability to each point is seemingly impossible). In general, one assigns probability to sets of non-zero width. For example, let $A = [a, b]$, $-\infty < a < b < \infty$; then one might expect:
$$P(X \in A) = \int_A f(x)\,dx.$$
Indeed this holds true, but we are deliberately vague about this. We give the following result, which is not proved and should be taken as true.

Lemma 2.4.3. If $X$ has a PDF $f$, then

1. $\int_{\mathsf{X}} f(x)\,dx = 1$.
2. $P(X = x) = 0$ for each $x \in \mathsf{X}$.
3. $P(X \in [a, b]) = \int_a^b f(x)\,dx$, $-\infty \le a < b \le \infty$.

Example 2.4.4. Returning to Example 2.4.2, we have
$$P(X \in \mathsf{X}) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{1}{2\sigma^2}(x - \mu)^2\Big\}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-u^2/2}\,du = 1.$$
Here, we have used the substitution $u = (x - \mu)/\sigma$ and the fact that
$$\int_{-\infty}^{\infty} e^{-u^2/2}\,du = \sqrt{2\pi}.$$

2.4.2 Independence

To define an idea of independence for continuous random variables, we cannot use the one for discrete random variables (recall Definition 2.3.7); the sets $\{X = x\}$ and $\{Y = y\}$ have zero probability and are hence trivially independent. Thus we use a new definition of independence:

Definition 2.4.5. The random variables $X$ and $Y$ are independent if
$$\{X \le x\} \quad \text{and} \quad \{Y \le y\}$$
are independent events for each $x, y \in \mathbb{R}$.

As for PMFs, one can consider independence through PDFs; however, as we are yet to define the notion of joint PDFs, we will leave this idea for now. We note that one can show, for any real-valued functions $h$ and $g$ (at least within some technical conditions which are assumed in this course), that if $X$ and $Y$ are independent, so are the random variables $h(X)$ and $g(Y)$. We do not prove why.
2.4.3 Expectation

As for discrete random variables, one can consider the idea of the average of a random variable. This is simply brought about by replacing summations with integrations:

Definition 2.4.6. The expectation of a continuous random variable $X$ with PDF $f$ is given by
$$E[X] = \int_{\mathsf{X}} x f(x)\,dx$$
whenever the integral exists.

Example 2.4.7. Consider the exponential density:
$$f(x) = \lambda e^{-\lambda x} \qquad x \in \mathsf{X} = [0, \infty), \quad \lambda > 0.$$
We use the notation $X \sim \mathcal{E}(\lambda)$. Then
$$E[X] = \int_0^{\infty} x \lambda e^{-\lambda x}\,dx = \big[-x e^{-\lambda x}\big]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = \Big[-\frac{1}{\lambda} e^{-\lambda x}\Big]_0^{\infty} = \frac{1}{\lambda}$$
where we have used integration by parts.
Example 2.4.8. Let us return to Example 2.4.2. Then
$$\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{1}{2\sigma^2}(x - \mu)^2\Big\}\,dx \\
&= \int_{-\infty}^{\infty} (\sigma u + \mu) \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du \\
&= \int_{-\infty}^{\infty} \sigma u \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du + \mu \\
&= \frac{\sigma}{\sqrt{2\pi}} \big[-e^{-u^2/2}\big]_{-\infty}^{\infty} + \mu \\
&= \mu.
\end{aligned}$$
Here, we have used the substitution $u = (x - \mu)/\sigma$ to go to the second line and the fact that
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du = 1$$
to go to the third.
Example 2.4.9. Consider the gamma density:
$$f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x} \qquad x \in \mathsf{X} = [0, \infty), \quad \alpha, \lambda > 0$$
where $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\,dt$. We use the notation $X \sim \mathcal{G}(\alpha, \lambda)$ and note that if $X \sim \mathcal{G}(1, \lambda)$ then $X \sim \mathcal{E}(\lambda)$. Now
$$E[X] = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha} e^{-\lambda x}\,dx = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \frac{1}{\lambda^{\alpha+1}} \int_0^{\infty} u^{\alpha} e^{-u}\,du = \frac{1}{\lambda\Gamma(\alpha)}\Gamma(\alpha + 1) = \frac{\alpha}{\lambda}$$
where we have used the substitution $u = \lambda x$ to go to the second equality and $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$ in the final one (from here on you may use that identity without proof).

We now state a useful technical result.

Theorem 2.4.10. If $X$ and $g(X)$ ($g: \mathbb{R} \to \mathbb{R}$) are continuous random variables, then
$$E[g(X)] = \int_{\mathsf{X}} g(x) f(x)\,dx.$$
We remark that all the extensions to expectations discussed in Section 2.3.3 carry over to the continuous case. In particular, Definitions 2.3.17 and 2.3.22 can be imported to the continuous case. In addition, the results Theorems 2.3.20 and 2.3.23 and Lemma 2.3.21 all extend. It is assumed that this is the case from here on.

Example 2.4.11. Let us return to Example 2.4.7:
$$E[X^2] = \int_0^{\infty} x^2 \lambda e^{-\lambda x}\,dx = \big[-x^2 e^{-\lambda x}\big]_0^{\infty} + \frac{2}{\lambda} \int_0^{\infty} x \lambda e^{-\lambda x}\,dx = \frac{2}{\lambda^2}.$$
Thus, for an exponential random variable:
$$\mathrm{Var}[X] = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$

Example 2.4.12. Let us return to Example 2.4.2. Then
$$\begin{aligned}
E[X^2] &= \int_{-\infty}^{\infty} x^2 \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{1}{2\sigma^2}(x - \mu)^2\Big\}\,dx \\
&= \int_{-\infty}^{\infty} (\sigma u + \mu)^2 \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du \\
&= \sigma^2 \int_{-\infty}^{\infty} u^2 \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du + \mu^2 + 2\sigma\mu \int_{-\infty}^{\infty} u \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \Big\{\big[-u e^{-u^2/2}\big]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-u^2/2}\,du\Big\} + \mu^2 \\
&= \frac{\sigma^2}{\sqrt{2\pi}}\sqrt{2\pi} + \mu^2 = \sigma^2 + \mu^2.
\end{aligned}$$
Here we have used $\int_{-\infty}^{\infty} u \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du = 0$ and $\int_{-\infty}^{\infty} e^{-u^2/2}\,du = \sqrt{2\pi}$. Thus, for a normal random variable:
$$\mathrm{Var}[X] = \sigma^2.$$
Example 2.4.13. Let us return to Example 2.4.9. Now
$$E[X^2] = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha+1} e^{-\lambda x}\,dx = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \frac{1}{\lambda^{\alpha+2}} \int_0^{\infty} u^{\alpha+1} e^{-u}\,du = \frac{1}{\lambda^2\Gamma(\alpha)}\Gamma(\alpha + 2) = \frac{\alpha(\alpha + 1)}{\lambda^2}$$
where we have used $\Gamma(\alpha + 2) = (\alpha + 1)\alpha\Gamma(\alpha)$. Thus, for a gamma random variable:
$$\mathrm{Var}[X] = \frac{\alpha}{\lambda^2}.$$

To end this section we consider an important concept: the MGF for continuous random variables.

Definition 2.4.14. For a continuous random variable $X$ the moment generating function (MGF) is
$$M(t) = E[e^{Xt}] = \int_{\mathsf{X}} e^{xt} f(x)\,dx \qquad t \in \mathsf{T}$$
where $\mathsf{T}$ is the set of $t$ for which $\int_{\mathsf{X}} e^{xt} f(x)\,dx < \infty$.

Exercise 2.4.15. Show that
$$E[X] = M'(0) \qquad E[X^2] = M^{(2)}(0)$$
when the right-hand derivatives exist.
As for discrete random variables, the moment generating function is a simple way to obtain moments, if it is simple to differentiate $M(t)$. Note also that it can be proven that the MGF uniquely characterizes a distribution.

Example 2.4.16. Suppose $X \sim \mathcal{E}(\lambda)$; then, supposing $\lambda > t$,
$$M(t) = \int_0^{\infty} \lambda e^{-x(\lambda - t)}\,dx = \Big[-\frac{\lambda}{\lambda - t} e^{-x(\lambda - t)}\Big]_0^{\infty} = \frac{\lambda}{\lambda - t}.$$
Clearly
$$M'(t) = \frac{\lambda}{(\lambda - t)^2}$$
and thus $E[X] = M'(0) = 1/\lambda$.

Example 2.4.17. Suppose $X \sim \mathcal{N}(\mu, \sigma^2)$; then
$$\begin{aligned}
M(t) &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \exp\Big\{-\frac{1}{2\sigma^2}(x - \mu)^2 + xt\Big\}\,dx \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \exp\Big\{-\frac{1}{2\sigma^2}\big[(x - (\mu + t\sigma^2))^2 - (\mu + t\sigma^2)^2 + \mu^2\big]\Big\}\,dx \\
&= \exp\Big\{\mu t + \frac{\sigma^2 t^2}{2}\Big\} \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \exp\Big\{-\frac{1}{2\sigma^2}(x - (\mu + t\sigma^2))^2\Big\}\,dx \\
&= \exp\Big\{\mu t + \frac{\sigma^2 t^2}{2}\Big\}
\end{aligned}$$
where we have used a change of variables $u = (x - (\mu + t\sigma^2))/\sigma$ to deal with the integral.
2.4.4 Dependence

Just as for discrete random variables, one can consider the idea of joint distributions for continuous random variables.

Definition 2.4.18. The joint distribution function of $X$ and $Y$ is the function $F: \mathbb{R}^2 \to [0, 1]$ given by
$$F(x, y) = P(X \le x, Y \le y).$$

Again, as for discrete random variables, one requires a PDF:

Definition 2.4.19. The random variables are jointly continuous with joint PDF $f: \mathbb{R}^2 \to [0, \infty)$ if
$$F(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\,du\,dv$$
for each $(x, y) \in \mathbb{R}^2$.

In this course, we will assume (generally) that
$$f(x, y) = \frac{\partial^2}{\partial x \partial y} F(x, y).$$
Note that $P(X = x, Y = y) = 0$ and, for sufficiently well-defined sets $A \subseteq \mathsf{Z} \subseteq \mathbb{R}^2$,
$$P((X, Y) \in A) = \int_A f(x, y)\,dx\,dy;$$
note that the R.H.S. is a double integral, but we will only write one integral sign in such contexts.

As for discrete random variables, we can introduce the idea of marginal PDFs. Here, we take a little longer to consider these ideas:
Definition 2.4.20. The marginal distribution functions of $X$ and $Y$ are
$$F(x) = \lim_{y \to \infty} F(x, y) \qquad F(y) = \lim_{x \to \infty} F(x, y).$$
As
$$F(x) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, y)\,dy\,du \qquad F(y) = \int_{-\infty}^{y} \int_{-\infty}^{\infty} f(x, u)\,dx\,du,$$
the marginal density functions of $X$ and $Y$ are
$$f(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \qquad f(y) = \int_{-\infty}^{\infty} f(x, y)\,dx.$$

We remark that expectation works in much the same way for joint distributions (as in Section 2.4.3) of continuous random variables:
$$E[g(X, Y)] = \int_{\mathsf{Z}} g(x, y) f(x, y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f(x, y)\,dx\,dy.$$
So, as before, if $g(x, y) = a h(x) + b k(y)$ then
$$E[g(X, Y)] = aE[h(X)] + bE[k(Y)].$$
We now turn to independence; we state the following result with no proof. If you are unconvinced, simply take the below as a definition.

Theorem 2.4.21. The random variables $X$ and $Y$ are independent if and only if
$$F(x, y) = F(x)F(y)$$
or, equivalently,
$$f(x, y) = f(x)f(y).$$

Example 2.4.22. Let $\mathsf{Z} = (\mathbb{R}^+)^2 = \mathsf{X} \times \mathsf{Y}$ and
$$f(x, y) = \lambda^2 e^{-\lambda(x+y)} \qquad (x, y) \in \mathsf{Z}, \quad \lambda > 0.$$
Then one has
$$f(x) = \lambda e^{-\lambda x} \quad x \in \mathsf{X} \qquad f(y) = \lambda e^{-\lambda y} \quad y \in \mathsf{Y}.$$
In addition, what is the probability that $X > Y$?
$$\begin{aligned}
P(X > Y) &= \int_0^{\infty} \int_0^x f(x, y)\,dy\,dx \\
&= \int_0^{\infty} \Big(\int_0^x \lambda e^{-\lambda y}\,dy\Big) \lambda e^{-\lambda x}\,dx \\
&= \int_0^{\infty} \big[-e^{-\lambda y}\big]_0^x \lambda e^{-\lambda x}\,dx \\
&= \int_0^{\infty} (1 - e^{-\lambda x}) \lambda e^{-\lambda x}\,dx \\
&= \Big[-e^{-\lambda x} + \frac{1}{2} e^{-2\lambda x}\Big]_0^{\infty} = 1 - \frac{1}{2} = \frac{1}{2}.
\end{aligned}$$
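A quick Monte Carlo check of $P(X > Y) = 1/2$ (which is also evident by the symmetry of the i.i.d. pair); a sketch only, NumPy assumed, with $\lambda = 1.7$ arbitrary. Note that NumPy parametrizes the exponential by the scale $1/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n = 1.7, 10**6
# Independent exponentials; NumPy uses the scale 1/lambda.
x = rng.exponential(scale=1.0 / lam, size=n)
y = rng.exponential(scale=1.0 / lam, size=n)
print((x > y).mean())  # approx 0.5
```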
Again, one has the idea of covariance and correlation for continuous-valued random variables:
$$\mathrm{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] = \int_{\mathsf{Z}} xy f(x, y)\,dx\,dy - \int x f(x)\,dx \int y f(y)\,dy$$
$$\rho(X, Y) = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\mathrm{Var}[Y]}}.$$
Example 2.4.23. Let $\mathsf{Z} = \mathbb{R}^2$ and define
$$f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\Big\{-\frac{1}{2(1 - \rho^2)}(x^2 + y^2 - 2\rho xy)\Big\} \qquad (x, y) \in \mathsf{Z}, \quad \rho \in (-1, 1).$$
Let us check that this is a PDF; clearly it is non-negative on $\mathsf{Z}$. So
$$\begin{aligned}
\int_{\mathsf{Z}} f(x, y)\,dx\,dy &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\Big\{-\frac{1}{2(1 - \rho^2)}(x^2 + y^2 - 2\rho xy)\Big\}\,dx\,dy \\
&= \frac{1}{2\pi\sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} \exp\Big\{-\frac{1}{2(1 - \rho^2)}(y^2 - \rho^2 y^2)\Big\} \int_{-\infty}^{\infty} \exp\Big\{-\frac{1}{2(1 - \rho^2)}(x - \rho y)^2\Big\}\,dx\,dy \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\Big\{-\frac{1}{2(1 - \rho^2)}(1 - \rho^2)y^2\Big\}\,dy = 1.
\end{aligned}$$
Here we have completed the square in the integral in $x$ and used the change of variable $u = (x - \rho y)/(1 - \rho^2)^{1/2}$. Thus $f(x, y)$ defines a joint PDF. Moreover, it is called the standard bivariate normal distribution. Now let us consider the marginal PDFs:
$$\begin{aligned}
f(x) &= \int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\Big\{-\frac{1}{2(1 - \rho^2)}(x^2 + y^2 - 2\rho xy)\Big\}\,dy \qquad x \in \mathbb{R} \\
&= \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\Big\{-\frac{1}{2(1 - \rho^2)}(x^2 - \rho^2 x^2)\Big\} \int_{-\infty}^{\infty} \exp\Big\{-\frac{1}{2(1 - \rho^2)}(y - \rho x)^2\Big\}\,dy \\
&= \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.
\end{aligned}$$
That is, $X \sim \mathcal{N}(0, 1)$. By similar arguments, $Y \sim \mathcal{N}(0, 1)$. Now, let us consider:
$$E[XY] = \frac{1}{2\pi\sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} y\, e^{-y^2/2} \int_{-\infty}^{\infty} x \exp\Big\{-\frac{1}{2(1 - \rho^2)}(x - \rho y)^2\Big\}\,dx\,dy.$$
Substituting $u = (x - \rho y)/(1 - \rho^2)^{1/2}$ in the inner integral gives
$$\int_{-\infty}^{\infty} x \exp\Big\{-\frac{1}{2(1 - \rho^2)}(x - \rho y)^2\Big\}\,dx = \int_{-\infty}^{\infty} \big[u(1 - \rho^2)^{1/2} + \rho y\big](1 - \rho^2)^{1/2} e^{-u^2/2}\,du = (1 - \rho^2)^{1/2}\,\rho y\,\sqrt{2\pi},$$
since $\int_{-\infty}^{\infty} u\, e^{-u^2/2}\,du = 0$. Hence
$$E[XY] = \frac{\rho}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y^2 e^{-y^2/2}\,dy = \rho.$$
Here we have used the results in Examples 2.4.8 and 2.4.12, namely that the first and second (raw) moments of $\mathcal{N}(\mu, \sigma^2)$ are $\mu$ and $\sigma^2 + \mu^2$. Therefore
$$\mathrm{Cov}[X, Y] = \rho \qquad \rho(X, Y) = \frac{\rho}{1} = \rho.$$
Note also that if $\rho = 0$ then:
$$f(x, y) = \frac{1}{2\pi} \exp\Big\{-\frac{1}{2}(x^2 + y^2)\Big\} \qquad (x, y) \in \mathsf{Z},$$
so as
$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \quad x \in \mathbb{R} \qquad f(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \quad y \in \mathbb{R},$$
we have $f(x, y) = f(x)f(y)$. That is, standard bivariate normal random variables are independent if and only if they are uncorrelated. Note that this does not always occur for other random variables.
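Example 2.4.37 below shows how to construct such a pair from two independent standard normals; taking that construction on trust for a moment, a simulation sketch (illustrative only, NumPy assumed, $\rho = 0.6$ arbitrary) confirms that the empirical correlation matches $\rho$:

```python
import numpy as np

rng = np.random.default_rng(4)
rho, n = 0.6, 10**6
x1, x2 = rng.normal(size=(2, n))
# Standard bivariate normal pair with correlation rho (cf. Example 2.4.37).
y1 = x1
y2 = rho * x1 + np.sqrt(1.0 - rho**2) * x2
print(np.corrcoef(y1, y2)[0, 1])  # approx 0.6 = rho
```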
We conclude this section with a mention of joint moment generating functions:

Definition 2.4.24. For continuous random variables $X$ and $Y$ the joint moment generating function is
$$M(t_1, t_2) = E[e^{Xt_1 + Yt_2}] = \int_{\mathsf{Z}} e^{xt_1 + yt_2} f(x, y)\,dx\,dy \qquad (t_1, t_2) \in \mathsf{T}$$
where $\mathsf{T}$ is the set of $(t_1, t_2)$ for which $\int_{\mathsf{Z}} e^{xt_1 + yt_2} f(x, y)\,dx\,dy < \infty$.

Exercise 2.4.25. Show that
$$E[X] = \frac{\partial}{\partial t_1} M(t_1, t_2)\Big|_{t_1 = t_2 = 0}, \quad E[Y] = \frac{\partial}{\partial t_2} M(t_1, t_2)\Big|_{t_1 = t_2 = 0}, \quad E[XY] = \frac{\partial^2}{\partial t_1 \partial t_2} M(t_1, t_2)\Big|_{t_1 = t_2 = 0}$$
when the derivatives exist.
2.4.5 Conditional Distributions and Expectations

The ideas of conditional distributions and expectations can be further extended to continuous random variables. However, we recall that the event $\{X = x\}$ has zero probability, so how can we condition on it? It turns out that we cannot give a rigorous answer to this question (i.e. at the level of mathematics in this course), so some faith must be extended towards the following definitions.

Definition 2.4.26. The conditional distribution function of $Y$ given $X = x$ is the function:
$$F(y|x) = \int_{-\infty}^{y} \frac{f(x, v)}{f(x)}\,dv$$
for any $x$ such that $f(x) > 0$.

Definition 2.4.27. The conditional density function of $F(y|x)$, written $f(y|x)$, is given by:
$$f(y|x) = \frac{f(x, y)}{f(x)}$$
for any $x$ such that $f(x) > 0$.

We remark that clearly
$$f(y|x) = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\,dy}.$$
Example 2.4.28. Suppose that $Y|X = x \sim \mathcal{N}(x, 1)$ and $X \sim \mathcal{N}(0, 1)$, $\mathsf{Z} = \mathbb{R}^2$. Find the PDFs $f(x, y)$ and $f(x|y)$. We have
$$f(x, y) = f(y|x) f(x),$$
so
$$f(x, y) = \frac{1}{2\pi} \exp\Big\{-\frac{1}{2}\big((y - x)^2 + x^2\big)\Big\} \qquad (x, y) \in \mathsf{Z}.$$
Now
$$f(x|y) = \frac{f(x, y)}{f(y)} = \frac{f(y|x) f(x)}{f(y)}$$
which is Bayes' theorem for continuous random variables. Now for $y \in \mathbb{R}$
$$\begin{aligned}
f(y) &= \int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\Big\{-\frac{1}{2}\big((y - x)^2 + x^2\big)\Big\}\,dx \\
&= \frac{1}{2\pi} \exp\{-y^2/2\} \int_{-\infty}^{\infty} \exp\Big\{-\frac{1}{2}(2x^2 - 2xy)\Big\}\,dx \\
&= \frac{1}{2\pi} \exp\{-y^2/2\} \int_{-\infty}^{\infty} \exp\big\{-\big[(x - y/2)^2 - y^2/4\big]\big\}\,dx \\
&= \frac{1}{2\pi} \exp\{-y^2/4\} \int_{-\infty}^{\infty} e^{-u^2/2} \frac{1}{\sqrt{2}}\,du \\
&= \frac{1}{2\pi} \exp\{-y^2/4\}\sqrt{\pi} = \frac{1}{\sqrt{4\pi}} \exp\{-y^2/4\}.
\end{aligned}$$
So $Y \sim \mathcal{N}(0, 2)$. Then:
$$f(x|y) = \frac{\frac{1}{2\pi} \exp\{-\frac{1}{2}((y - x)^2 + x^2)\}}{\frac{1}{\sqrt{4\pi}} \exp\{-y^2/4\}} \qquad x \in \mathbb{R}.$$
After some calculations (exercise) one obtains:
$$f(x|y) = \frac{1}{\sqrt{\pi}} \exp\Big\{-\Big(x - \frac{y}{2}\Big)^2\Big\} \qquad x \in \mathbb{R},$$
that is, $X|Y = y \sim \mathcal{N}(y/2, 1/2)$.
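A crude simulation check of this conclusion (a sketch only, NumPy assumed): draw from the joint model, keep the $x$-samples whose $y$ lands in a small window around a chosen $y_0$, and compare the conditional sample mean and variance with $y_0/2$ and $1/2$. The window width is a tuning choice, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10**7
x = rng.normal(0.0, 1.0, size=n)   # X ~ N(0, 1)
y = rng.normal(x, 1.0)             # Y | X = x ~ N(x, 1)

y0, eps = 1.0, 0.02                # condition on Y close to y0
sel = x[np.abs(y - y0) < eps]
print(sel.mean(), y0 / 2)          # approx 0.5
print(sel.var(), 0.5)              # approx 0.5
```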
Example 2.4.29. Let $X$ and $Y$ have joint PDF:
$$f(x, y) = \frac{1}{x} \qquad (x, y) \in \mathsf{Z} = \{(x, y) : 0 \le y \le x \le 1\}.$$
Then for $x \in [0, 1]$
$$f(x) = \int_0^x \frac{1}{x}\,dy = 1.$$
In addition, for $(x, y) \in \mathsf{Z}$:
$$f(y|x) = \frac{1/x}{1} = \frac{1}{x},$$
that is, if $x \in [0, 1]$, then conditional upon $x$, $Y \sim \mathcal{U}_{[0,x]}$ (the uniform distribution on $[0, x]$). Now, suppose we are interested in $P(X^2 + Y^2 \le 1 | X = x)$. Let $x \ge 0$, and define
$$A(x) := \{y \in \mathbb{R} : 0 \le y \le x,\ x^2 + y^2 \le 1\},$$
that is, $A(x) = [0, x \wedge (1 - x^2)^{1/2}]$, where $a \wedge b = \min\{a, b\}$. Thus ($x \in [0, 1]$)
$$P(X^2 + Y^2 \le 1 | X = x) = \int_{A(x)} f(y|x)\,dy = \int_0^{x \wedge (1 - x^2)^{1/2}} \frac{1}{x}\,dy = \frac{x \wedge (1 - x^2)^{1/2}}{x}.$$
To compute $P(X^2 + Y^2 \le 1)$, we note, by setting $A = \{(x, y) : (x, y) \in \mathsf{Z},\ x^2 + y^2 \le 1\}$, that we are interested in $P(A)$, which is:
$$P(X^2 + Y^2 \le 1) = \int_A f(x, y)\,dy\,dx = \int_0^1 \int_{A(x)} f(y|x)\,dy\,f(x)\,dx = \int_0^1 \frac{x \wedge (1 - x^2)^{1/2}}{x}\,dx.$$
Now $x^2 \le 1 - x^2$ iff $x \le 1/\sqrt{2}$, hence
$$P(X^2 + Y^2 \le 1) = \int_0^{1/\sqrt{2}} dx + \int_{1/\sqrt{2}}^1 \frac{(1 - x^2)^{1/2}}{x}\,dx. \tag{2.4.1}$$
Now we have, setting $x = \cos(\theta)$,
$$\int_{1/\sqrt{2}}^1 \frac{(1 - x^2)^{1/2}}{x}\,dx = \int_0^{\pi/4} \sin(\theta)\tan(\theta)\,d\theta = \Big[\log(|\sec(\theta) + \tan(\theta)|) - \sin(\theta)\Big]_0^{\pi/4} = \log(1 + \sqrt{2}) - \frac{1}{\sqrt{2}}$$
where we have used integration tables to obtain the integral.³ Thus, returning to (2.4.1):
$$P(X^2 + Y^2 \le 1) = \frac{1}{\sqrt{2}} + \log(1 + \sqrt{2}) - \frac{1}{\sqrt{2}} = \log(1 + \sqrt{2}).$$

³ You will not be expected to perform such an integration in the examination.
Given the previous discussion, the notion of a conditional expectation for continuous random variables is:
$$E[g(Y)|X = x] = \int g(y) f(y|x)\,dy.$$
As for discrete-valued random variables, we have the following result, which is again not proved.

Theorem 2.4.30. Consider jointly continuous random variables $X$ and $Y$ with $g, h: \mathbb{R} \to \mathbb{R}$ and $h(X), g(Y)$ continuous. Then:
$$E[h(X)g(Y)] = E[E[g(Y)|X]h(X)] = \int \Big(\int g(y) f(y|x)\,dy\Big) h(x) f(x)\,dx$$
whenever the expectations exist.

Example 2.4.31. Let us return to Example 2.4.28. Then
$$E[XY] = E[Y\,E[X|Y]] = E[Y(Y/2)] = \frac{1}{2} E[Y^2] = \frac{1}{2} \cdot 2 = 1.$$
Thus, as $\mathrm{Var}[X] = 1$ and $\mathrm{Var}[Y] = 2$,
$$\rho(X, Y) = \frac{1}{\sqrt{2}}.$$
Example 2.4.32. Suppose $Y|X = x \sim \mathcal{N}(x, \sigma_1^2)$ and $X \sim \mathcal{N}(\mu, \sigma_2^2)$. To find the marginal distribution of $Y$, one can use MGFs and conditional expectations. We have:
$$\begin{aligned}
M(t) = E[e^{Yt}] &= E[E[e^{Yt}|X]] \\
&= E[e^{Xt + (\sigma_1^2 t^2)/2}] \\
&= e^{(\sigma_1^2 t^2)/2}\,E[e^{Xt}] \\
&= e^{(\sigma_1^2 t^2)/2}\,e^{\mu t + (\sigma_2^2 t^2)/2} \\
&= \exp\{\mu t + (\sigma_1^2 + \sigma_2^2)t^2/2\}.
\end{aligned}$$
Here we have used Example 2.4.17 for the MGF of a normal distribution. Thus we can conclude that $Y \sim \mathcal{N}(\mu, \sigma_1^2 + \sigma_2^2)$.
2.4.6 Functions of Random Variables

Consider a random variable $X$ and $g: \mathbb{R} \to \mathbb{R}$ a continuous and invertible function (continuity can be weakened, but let us use it for now). Suppose $Y = g(X)$; what is the distribution of $Y$? We have:
$$F(y) = P(Y \le y) = P(g(X) \le y) = P(X \in g^{-1}((-\infty, y])) = \int_{g^{-1}((-\infty, y])} f(x)\,dx.$$
Note that for $A \subseteq \mathbb{R}$, $g^{-1}(A) = \{x \in \mathbb{R} : g(x) \in A\}$.

Example 2.4.33. Let $X \sim \mathcal{N}(0, 1)$ and set
$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du$$
to denote the standard normal CDF. Suppose $Y = X^2$; then for $y \ge 0$ (note $\mathsf{Y} = \mathbb{R}^+$)
$$P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}).$$
Now
$$P(-\sqrt{y} \le X \le \sqrt{y}) = P(X \le \sqrt{y}) - P(X \le -\sqrt{y}) = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}) = \Phi(\sqrt{y}) - [1 - \Phi(\sqrt{y})] = 2\Phi(\sqrt{y}) - 1.$$
Then
$$f(y) = \frac{d}{dy} F(y) = \frac{1}{\sqrt{y}}\,\Phi'(\sqrt{y}) = \frac{1}{\sqrt{2\pi y}}\,e^{-y/2} \qquad y \in \mathsf{Y},$$
that is, $Y \sim \mathcal{G}(1/2, 1/2)$. This is also called the chi-squared distribution on one degree of freedom. We remark that $Y$ and $X$ are clearly not independent (if $X$ changes, so does $Y$, and in a deterministic manner). Now
$$E[XY] = E[X^3] = 0.$$
In addition, $E[X] = 0$ and $E[Y] = 1$. So $X$ and $Y$ are uncorrelated, but they are not independent.
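An empirical illustration of this last point (a sketch, NumPy assumed, not part of the original notes): the sample correlation of $X$ and $Y = X^2$ is near zero, yet the conditional behaviour of $Y$ given $X$ exposes the dependence.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=10**6)
y = x**2

print(np.corrcoef(x, y)[0, 1])  # approx 0: uncorrelated
# But clearly dependent: E[Y | |X| > 2] is far from E[Y] = 1.
print(y.mean(), y[np.abs(x) > 2].mean())
```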
We now move on to a more general change of variable formula. Suppose $X_1$ and $X_2$ have a joint density on $\mathsf{Z}$ and we set
$$(Y_1, Y_2) = T(X_1, X_2)$$
where $T: \mathsf{Z} \to \mathsf{T}$ is differentiable and invertible and $\mathsf{T} \subseteq \mathbb{R}^2$. What is the joint density function of $(Y_1, Y_2)$ on $\mathsf{T}$? As $T$ is invertible, we set
$$X_1 = T_1^{-1}(Y_1, Y_2) \qquad X_2 = T_2^{-1}(Y_1, Y_2).$$
Then we define
$$J(y_1, y_2) := \begin{vmatrix} \frac{\partial T_1^{-1}}{\partial y_1} & \frac{\partial T_2^{-1}}{\partial y_1} \\ \frac{\partial T_1^{-1}}{\partial y_2} & \frac{\partial T_2^{-1}}{\partial y_2} \end{vmatrix} = \frac{\partial T_1^{-1}}{\partial y_1}\frac{\partial T_2^{-1}}{\partial y_2} - \frac{\partial T_2^{-1}}{\partial y_1}\frac{\partial T_1^{-1}}{\partial y_2}$$
to be the Jacobian of the transformation. Then we have the following result, which simply follows from the change of variable rule for integration:

Theorem 2.4.34. If $(X_1, X_2)$ have joint density $f(x_1, x_2)$ on $\mathsf{Z}$, then for $(Y_1, Y_2) = T(X_1, X_2)$, with $T$ as described above, the joint density of $(Y_1, Y_2)$, denoted $g$, is:
$$g(y_1, y_2) = f\big(T_1^{-1}(y_1, y_2), T_2^{-1}(y_1, y_2)\big)\,|J(y_1, y_2)| \qquad (y_1, y_2) \in \mathsf{T}.$$
Example 2.4.35. Suppose $\mathsf{Z} = (\mathbb{R}^+)^2$ and
$$f(x_1, x_2) = \lambda^2 e^{-\lambda(x_1 + x_2)} \qquad (x_1, x_2) \in \mathsf{Z}.$$
Let
$$(Y_1, Y_2) = (X_1 + X_2,\ X_1/X_2).$$
To find the joint density of $(Y_1, Y_2)$ we first note that $\mathsf{T} = \mathsf{Z}$ and that
$$(X_1, X_2) = \Big(\frac{Y_1 Y_2}{1 + Y_2},\ \frac{Y_1}{1 + Y_2}\Big).$$
Then the Jacobian is
$$J(y_1, y_2) = \begin{vmatrix} \frac{y_2}{1 + y_2} & \frac{1}{1 + y_2} \\ \frac{y_1}{(1 + y_2)^2} & -\frac{y_1}{(1 + y_2)^2} \end{vmatrix} = -\frac{y_1}{(1 + y_2)^2}.$$
Thus:
$$g(y_1, y_2) = \frac{y_1}{(1 + y_2)^2}\,\lambda^2 e^{-\lambda y_1} \qquad (y_1, y_2) \in \mathsf{T}.$$
One can check that indeed $Y_1$ and $Y_2$ are independent and that the marginal densities are:
$$g(y_1) = \lambda^2 y_1 e^{-\lambda y_1} \quad y_1 \in \mathbb{R}^+ \qquad g(y_2) = \frac{1}{(1 + y_2)^2} \quad y_2 \in \mathbb{R}^+.$$
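The factorization can be checked by simulation. Note that $Y_2$ has density $1/(1 + y_2)^2$, which has no finite mean, so checking a sample correlation is useless here; instead the sketch below (illustrative only, NumPy assumed, $\lambda = 1$ and the thresholds $a, b$ arbitrary) checks the product rule for events and the marginal CDF of $Y_2$:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10**6
x1 = rng.exponential(size=n)  # lambda = 1 (scale 1)
x2 = rng.exponential(size=n)
y1, y2 = x1 + x2, x1 / x2

# Independence check on events: P(Y1 <= a, Y2 <= b) = P(Y1 <= a) P(Y2 <= b).
a, b = 1.5, 2.0
lhs = ((y1 <= a) & (y2 <= b)).mean()
rhs = (y1 <= a).mean() * (y2 <= b).mean()
print(lhs, rhs)                       # the two estimates agree
# Marginal of Y2 has CDF b / (1 + b), from g(y2) = 1 / (1 + y2)^2.
print((y2 <= b).mean(), b / (1 + b))  # both approx 2/3
```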
Example 2.4.36. Suppose we have $(X_1, X_2)$ as in Example 2.4.35, except we have the mapping:
$$(Y_1, Y_2) = (X_1,\ X_1 + X_2).$$
Now, clearly $Y_2 \ge Y_1$, so this transformation induces the support for $(Y_1, Y_2)$:
$$\mathsf{T} = \{(y_1, y_2) : 0 \le y_1 \le y_2,\ y_2 \in \mathbb{R}^+\}.$$
Then
$$(X_1, X_2) = (Y_1,\ Y_2 - Y_1)$$
and clearly $|J(y_1, y_2)| = 1$; thus
$$g(y_1, y_2) = \lambda^2 e^{-\lambda y_2} \qquad (y_1, y_2) \in \mathsf{T}.$$
Example 2.4.37. Let $X_1$ and $X_2$ be independent $\mathcal{N}(0, 1)$ random variables ($\mathsf{Z} = \mathbb{R}^2$) and let
$$(Y_1, Y_2) = (X_1,\ \rho X_1 + \sqrt{1 - \rho^2}\,X_2) \qquad \rho \in (-1, 1).$$
Then $\mathsf{T} = \mathsf{Z}$ and
$$(X_1, X_2) = \Big(Y_1,\ \frac{Y_2 - \rho Y_1}{\sqrt{1 - \rho^2}}\Big).$$
Then the Jacobian is
$$J(y_1, y_2) = \begin{vmatrix} 1 & -\frac{\rho}{\sqrt{1 - \rho^2}} \\ 0 & \frac{1}{\sqrt{1 - \rho^2}} \end{vmatrix} = \frac{1}{\sqrt{1 - \rho^2}}.$$
Thus
$$\begin{aligned}
g(y_1, y_2) &= \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\Big\{-\frac{1}{2}\big(y_1^2 + (y_2 - \rho y_1)^2/(1 - \rho^2)\big)\Big\} \qquad (y_1, y_2) \in \mathsf{T} \\
&= \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\Big\{-\frac{1}{2(1 - \rho^2)}\big(y_1^2 + y_2^2 - 2\rho y_1 y_2\big)\Big\} \qquad (y_1, y_2) \in \mathsf{T}.
\end{aligned}$$
On inspection of Example 2.4.23, $(Y_1, Y_2)$ follow a standard bivariate normal distribution.
Distributions associated to the Normal Distribution
In the following Subsection we investigate some distributions that are related to the normal distribution. They appear frequently in hypothesis testing, which is something that we will investigate in the following Chapter. We note that, in a similar way to the way in which joint distributions are defined (Definition 2.4.19), one can easily extend to joint distributions of $n \ge 2$ variables (so that there is a joint distribution $F(x_1, \ldots, x_n)$ and density $f(x_1, \ldots, x_n)$).
We begin with the idea of the chi-square distribution on $n > 0$ degrees of freedom:
$$f(x) = \frac{1}{2^{n/2}\Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2}, \quad x \in \mathbb{X} = \mathbb{R}_+.$$
We write $X \sim \mathcal{X}^2_n$; you will notice that also $X \sim G(n/2, 1/2)$, so that $E[X] = n$. Note that from Table 4.2, we have that
$$M(t) = \frac{1}{(1 - 2t)^{n/2}}, \quad t < \frac{1}{2}.$$
We have the following result:
Proposition 2.4.38. Let $X_1, \ldots, X_n$ be independent and identically distributed (i.i.d.) standard normal random variables (i.e. $X_i \overset{\text{i.i.d.}}{\sim} N(0, 1)$). Let
$$Z = X_1^2 + \cdots + X_n^2.$$
Then $Z \sim \mathcal{X}^2_n$.
Proof. We will use moment generating functions. We have
$$M(t) = E[e^{Zt}] = E\big[e^{t\sum_{i=1}^n X_i^2}\big] = \prod_{i=1}^n E\big[e^{tX_i^2}\big] = E\big[e^{X_1^2 t}\big]^n.$$
Here we have used independence for the third equality and the fact that the random variables are identically distributed for the last equality. Now, for $t < \frac{1}{2}$,
$$E\big[e^{X_1^2 t}\big] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\Big\{ -\frac{1}{2}(1 - 2t)x_1^2 \Big\}\,dx_1 = \frac{1}{\sqrt{2\pi}\,(1 - 2t)^{1/2}} \int_{-\infty}^{\infty} \exp\Big\{ -\frac{1}{2}u^2 \Big\}\,du = \frac{1}{(1 - 2t)^{1/2}}.$$
Hence
$$M(t) = \frac{1}{(1 - 2t)^{n/2}}, \quad t < \frac{1}{2},$$
and we conclude the result.
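A quick simulation (ours, not from the notes; it assumes numpy and scipy) confirms that the construction in Proposition 2.4.38 behaves as claimed:

```python
# Sum of squares of n i.i.d. N(0,1) variables, compared against chi-squared on n d.o.f.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 5
z = (rng.standard_normal((100_000, n)) ** 2).sum(axis=1)

print(z.mean(), n)                            # E[Z] = n
print(stats.kstest(z, stats.chi2(df=n).cdf))  # p-value should typically be large
```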
The standard $t$-distribution on $n$ degrees of freedom is ($n > 0$):
$$p(x) = \frac{\Gamma([n+1]/2)}{\sqrt{\pi n}\,\Gamma(n/2)} \Big( 1 + \frac{x^2}{n} \Big)^{-(n+1)/2}, \quad x \in \mathbb{X} = \mathbb{R}.$$
We write $X \sim \mathcal{T}_n$. Then we have the following important result.
Proposition 2.4.39. Let $X \sim N(0, 1)$ and independently $Y \sim \mathcal{X}^2_n$, $n > 0$. Let
$$T = \frac{X}{\sqrt{Y/n}}.$$
Then $T \sim \mathcal{T}_n$.
Proof. We have
$$f(x, y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}\, \frac{1}{2^{n/2}\Gamma(n/2)}\, y^{n/2 - 1} e^{-\frac{1}{2}y}, \quad (x, y) \in \mathbb{Z} = \mathbb{R} \times \mathbb{R}_+.$$
We will use Theorem 2.4.34, with the transformation defined by
$$T = \frac{X}{\sqrt{Y/n}}, \qquad S = Y,$$
and then marginalize out $S$. The Jacobian of the transformation is:
$$J(t, s) := \begin{vmatrix} \sqrt{\frac{s}{n}} & 0 \\ \frac{t}{2\sqrt{sn}} & 1 \end{vmatrix} = \sqrt{\frac{s}{n}}.$$
Then
$$f(t, s) = \frac{1}{\sqrt{2\pi n}\, 2^{n/2}\Gamma(n/2)}\, s^{n/2 - 1/2} \exp\Big\{ -\frac{s}{2}\big[ t^2/n + 1 \big] \Big\}, \quad (t, s) \in \mathbb{R} \times \mathbb{R}_+.$$
Then we have, for $t \in \mathbb{R}$,
$$f(t) = \frac{1}{\sqrt{2\pi n}\, 2^{n/2}\Gamma(n/2)} \int_0^\infty s^{n/2 - 1/2} \exp\Big\{ -\frac{s}{2}\big[ t^2/n + 1 \big] \Big\}\,ds$$
$$= \frac{1}{\sqrt{2\pi n}\, 2^{n/2}\Gamma(n/2)} \int_0^\infty \frac{1}{\big( \frac{1}{2} + \frac{t^2}{2n} \big)^{\frac{n+1}{2}}}\, u^{n/2 - 1/2} e^{-u}\,du$$
$$= \frac{1}{\sqrt{2\pi n}\, 2^{n/2}\Gamma(n/2)}\, \frac{1}{\big( \frac{1}{2} + \frac{t^2}{2n} \big)^{\frac{n+1}{2}}}\, \Gamma([n+1]/2)$$
$$= \frac{\Gamma([n+1]/2)}{\sqrt{\pi n}\,\Gamma(n/2)} \Big( 1 + \frac{t^2}{n} \Big)^{-(n+1)/2}$$
and we conclude.
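As a numerical companion to Proposition 2.4.39 (our own sketch, assuming numpy and scipy), one can build $T$ directly from its definition and compare quantiles with the $\mathcal{T}_n$ distribution:

```python
# T = X / sqrt(Y/n) with X ~ N(0,1) and Y ~ chi-squared on n d.o.f., independent.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 7
t_sim = rng.standard_normal(200_000) / np.sqrt(rng.chisquare(n, 200_000) / n)

qs = [0.1, 0.5, 0.9, 0.975]
print(np.quantile(t_sim, qs))  # empirical quantiles
print(stats.t(df=n).ppf(qs))   # theoretical T_n quantiles (should be close)
```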
The last distribution that we consider is the standard $F$-distribution on $d_1, d_2 > 0$ degrees of freedom:
$$f(x) = \frac{1}{B\big(\frac{d_1}{2}, \frac{d_2}{2}\big)} \Big( \frac{d_1}{d_2} \Big)^{d_1/2} x^{d_1/2 - 1} \Big( 1 + \frac{d_1}{d_2}x \Big)^{-\frac{d_1 + d_2}{2}}, \quad x \in \mathbb{X} = \mathbb{R}_+$$
where
$$B\Big( \frac{d_1}{2}, \frac{d_2}{2} \Big) = \frac{\Gamma(\frac{d_1}{2})\Gamma(\frac{d_2}{2})}{\Gamma(\frac{d_1 + d_2}{2})}.$$
We write $X \sim \mathcal{F}_{(d_1, d_2)}$. We have the following result.
Proposition 2.4.40. Let $X \sim \mathcal{X}^2_{d_1}$ and independently $Y \sim \mathcal{X}^2_{d_2}$, $d_1, d_2 > 0$. Let
$$F = \frac{X/d_1}{Y/d_2}.$$
Then $F \sim \mathcal{F}_{(d_1, d_2)}$.
Proof. We have
$$f(x, y) = \frac{1}{2^{d_1/2}\Gamma(d_1/2)}\, x^{d_1/2 - 1} e^{-\frac{1}{2}x}\, \frac{1}{2^{d_2/2}\Gamma(d_2/2)}\, y^{d_2/2 - 1} e^{-\frac{1}{2}y}, \quad (x, y) \in \mathbb{Z} = \mathbb{R}_+ \times \mathbb{R}_+.$$
We will use Theorem 2.4.34, with the transformation defined by
$$T = \frac{X/d_1}{Y/d_2}, \qquad S = Y,$$
and then marginalize out $S$. The Jacobian of the transformation is:
$$J(t, s) := \begin{vmatrix} \frac{d_1 s}{d_2} & 0 \\ \frac{d_1 t}{d_2} & 1 \end{vmatrix} = \frac{d_1 s}{d_2}.$$
Then
$$f(t, s) = \frac{1}{2^{d_1/2 + d_2/2}\Gamma(d_1/2)\Gamma(d_2/2)}\, \frac{d_1 s}{d_2} \Big( \frac{d_1 s t}{d_2} \Big)^{d_1/2 - 1} e^{-\frac{d_1 s t}{2 d_2}}\, s^{d_2/2 - 1} e^{-s/2}, \quad (t, s) \in \mathbb{T} = (\mathbb{R}_+)^2.$$
To shorten the subsequent notation, let
$$g = \frac{1}{2^{d_1/2 + d_2/2}\Gamma(d_1/2)\Gamma(d_2/2)} \Big( \frac{d_1}{d_2} \Big)^{d_1/2}.$$
Then, for $t \in \mathbb{R}_+$,
$$f(t) = g\, t^{d_1/2 - 1} \int_0^\infty s^{\frac{d_1 + d_2}{2} - 1} \exp\Big\{ -s\Big( \frac{1}{2}\Big[ 1 + \frac{d_1 t}{d_2} \Big] \Big) \Big\}\,ds$$
$$= g\, t^{d_1/2 - 1} \int_0^\infty u^{(d_1 + d_2)/2 - 1} e^{-u} \Big( \frac{1}{2}\Big[ 1 + \frac{d_1 t}{d_2} \Big] \Big)^{-(d_1 + d_2)/2}\,du$$
$$= g\, t^{d_1/2 - 1} \Big( \frac{1}{2}\Big[ 1 + \frac{d_1 t}{d_2} \Big] \Big)^{-(d_1 + d_2)/2} \Gamma((d_1 + d_2)/2)$$
$$= g\, \Gamma((d_1 + d_2)/2)\, 2^{\frac{d_1 + d_2}{2}}\, t^{d_1/2 - 1} \Big( 1 + \frac{d_1}{d_2}t \Big)^{-\frac{d_1 + d_2}{2}}$$
$$= \frac{1}{B\big(\frac{d_1}{2}, \frac{d_2}{2}\big)} \Big( \frac{d_1}{d_2} \Big)^{d_1/2} t^{d_1/2 - 1} \Big( 1 + \frac{d_1}{d_2}t \Big)^{-\frac{d_1 + d_2}{2}}$$
and we conclude.
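Again the construction can be checked by simulation; a brief sketch of ours (assuming numpy and scipy), using the fact that $E[F] = d_2/(d_2 - 2)$ for $d_2 > 2$:

```python
# F = (X/d1)/(Y/d2) with independent chi-squared numerator and denominator.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
d1, d2 = 4, 10
f_sim = (rng.chisquare(d1, 300_000) / d1) / (rng.chisquare(d2, 300_000) / d2)

print(f_sim.mean(), d2 / (d2 - 2))             # both approximately 1.25
print(stats.kstest(f_sim, stats.f(d1, d2).cdf))  # consistent with F(d1, d2)
```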
2.5 Convergence of Random Variables
In the following Section we provide a brief introduction to the convergence of random variables and in particular two modes of convergence:
- Convergence in Distribution
- Convergence in Probability
These properties are rather important in probability and mathematical statistics and provide a way to justify many statistical and numerical procedures. The properties are rather loosely associated to the notion of a sequence of random variables $X_1, X_2, \ldots, X_n$ (there will typically be infinitely many of them) and we will be concerned with the idea of what happens to the distribution, or some functional, as $n \to \infty$. The second part of this Section will focus on perhaps the most important result in probability: the central limit theorem. This provides a characterization of the random variable
$$S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$$
with $X_1, \ldots, X_n$ independent and identically distributed with zero mean and unit variance. This particular result is very useful in hypothesis testing, which we shall see later on. Throughout, we will focus on continuous random variables, but one can extend this notion.
2.5.1 Convergence in Distribution
Consider a sequence of random variables $X_1, X_2, \ldots$, with associated distribution functions $F_1, F_2, \ldots$; we then have the following definition:
Definition 2.5.1. We say that the sequence of distribution functions $F_1, F_2, \ldots$ converges to a distribution function $F$ if $\lim_{n\to\infty} F_n(x) = F(x)$ at each point where $F$ is continuous. If $X$ has distribution function $F$, we say that $X_n$ converges in distribution to $X$ and write $X_n \overset{d}{\to} X$.
Example 2.5.2. Consider a sequence of random variables $X_n \in \mathbb{X}_n = [0, n]$, $n \ge 1$, with
$$F_n(x) = 1 - \Big( 1 - \frac{x}{n} \Big)^n, \quad x \in \mathbb{X}_n.$$
Then for any fixed $x \in \mathbb{R}_+$
$$\lim_{n\to\infty} F_n(x) = 1 - e^{-x}.$$
Now if $X \sim E(1)$:
$$F(x) = \int_0^x e^{-u}\,du = 1 - e^{-x}.$$
So $X_n \overset{d}{\to} X$, $X \sim E(1)$.
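The convergence in Example 2.5.2 is easy to see numerically; a small sketch of ours (assuming numpy) evaluates $F_n(x)$ at a fixed point:

```python
# F_n(x) = 1 - (1 - x/n)^n converges to the E(1) distribution function.
import numpy as np

x = 1.5
for n in [10, 100, 1_000, 10_000]:
    print(n, 1 - (1 - x / n) ** n)  # F_n(x)
print("limit", 1 - np.exp(-x))       # F(x) = 1 - e^{-x}
```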
Example 2.5.3. Consider a sequence of random variables $X_n \in \mathbb{X} = [0, 1]$, $n \ge 1$, with
$$F_n(x) = x - \frac{\sin(2n\pi x)}{2n\pi}.$$
Then for any fixed $x \in [0, 1]$
$$\lim_{n\to\infty} F_n(x) = x.$$
Now if $X \sim U_{[0,1]}$, for $x \in [0, 1]$:
$$F(x) = \int_0^x du = x.$$
So $X_n \overset{d}{\to} X$, $X \sim U_{[0,1]}$.
2.5.2 Convergence in Probability
We now consider an alternative notion of convergence for a sequence of random variables $X_1, X_2, \ldots$
Definition 2.5.4. We say that $X_n$ converges in probability to a constant $c \in \mathbb{R}$ (written $X_n \overset{P}{\to} c$) if for every $\epsilon > 0$
$$\lim_{n\to\infty} P(|X_n - c| > \epsilon) = 0.$$
We note that convergence in probability can be extended to convergence to a random variable $X$, but we do not do this, as it is beyond the level of this course. We have the following useful result, which we do not prove.
Theorem 2.5.5. Consider a sequence of random variables $X_1, X_2, \ldots$, with associated distribution functions $F_1, F_2, \ldots$. If there is a constant $c \in \mathbb{R}$ such that, for every $x \neq c$,
$$\lim_{n\to\infty} F_n(x) = \begin{cases} 0 & x < c \\ 1 & x > c \end{cases}$$
(that is, $F_n$ converges to the distribution function of the constant $c$), then $X_n \overset{P}{\to} c$.
Example 2.5.6. Consider a sequence of random variables $X_n \in \mathbb{X} = \mathbb{R}_+$, $n \ge 1$, with
$$F_n(x) = \Big( \frac{x}{1 + x} \Big)^{1/n}.$$
Then for any fixed $x \in \mathbb{X}$ with $x > 0$
$$\lim_{n\to\infty} F_n(x) = 1,$$
which is the distribution function of the constant 0 at every $x > 0$. Thus, by Theorem 2.5.5, $X_n \overset{P}{\to} 0$.
We finish the Section with a rather important result, which we again do not prove. It is called the weak law of large numbers:
Theorem 2.5.7. Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables with $E[|X_1|] < \infty$. Then
$$\frac{1}{n} \sum_{i=1}^n X_i \overset{P}{\to} E[X_1].$$
Note that this result extends to functions; i.e. if $g : \mathbb{R} \to \mathbb{R}$ with $E[|g(X_1)|] < \infty$ then
$$\frac{1}{n} \sum_{i=1}^n g(X_i) \overset{P}{\to} E[g(X_1)].$$
Example 2.5.8. Consider an integral
$$I = \int_{\mathbb{R}} g(x)\,dx$$
where we will suppose that $g(x) \neq 0$. Suppose we cannot calculate $I$. Consider any non-zero pdf $f(x)$ on $\mathbb{R}$. Then
$$I = \int_{\mathbb{R}} \frac{g(x)}{f(x)}\, f(x)\,dx$$
and suppose that
$$\int_{\mathbb{R}} \Big| \frac{g(x)}{f(x)} \Big|\, f(x)\,dx < \infty.$$
Then by the weak law of large numbers, if $X_1, \ldots, X_n$ are i.i.d. with pdf $f$ then
$$\frac{1}{n} \sum_{i=1}^n \frac{g(X_i)}{f(X_i)} \overset{P}{\to} I.$$
This provides a justification for a numerical method (called the Monte Carlo method) to approximate integrals.
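A minimal sketch of the Monte Carlo method (ours, not part of the notes; it assumes numpy). We take $g(x) = \cos(x)e^{-x^2/2}$, for which the exact answer $I = \sqrt{2\pi}\,e^{-1/2}$ is known, and sample from the standard normal pdf $f$:

```python
# Monte Carlo integration: I = int g(x) dx estimated by the average of g(X)/f(X).
import numpy as np

rng = np.random.default_rng(5)
g = lambda x: np.cos(x) * np.exp(-x**2 / 2)
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # N(0,1) pdf

x = rng.standard_normal(100_000)
print((g(x) / f(x)).mean())               # Monte Carlo estimate of I
print(np.sqrt(2 * np.pi) * np.exp(-0.5))  # exact value, approximately 1.5203
```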
2.5.3 Central Limit Theorem
We close the Section and Chapter with the central limit theorem (CLT), which we state without proof.
Theorem 2.5.9. Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables with $E[|X_1|] < \infty$ and $0 < \text{Var}[X_1] < \infty$. Then
$$\sqrt{n}\left( \frac{\frac{1}{n}\sum_{i=1}^n (X_i - E[X_1])}{\sqrt{\text{Var}[X_1]}} \right) \overset{d}{\to} Z$$
where $Z \sim N(0, 1)$.
The CLT asserts that the distribution of the summation $\sum_{i=1}^n X_i$, suitably normalized and centered, will converge to that of a normal distribution. An often used idea, via the CLT, is normal approximation. That is, for $n$ large
$$\sqrt{n}\left( \frac{\frac{1}{n}\sum_{i=1}^n (X_i - E[X_1])}{\sqrt{\text{Var}[X_1]}} \right) \overset{\text{approx.}}{\sim} N(0, 1)$$
where $\overset{\text{approx.}}{\sim}$ means approximately distributed as, so we have (recalling that if $Z \sim N(0, 1)$ and $X = \sigma Z + \mu$, then $X \sim N(\mu, \sigma^2)$):
$$\sum_{i=1}^n X_i \overset{\text{approx.}}{\sim} N\big(nE[X_1],\ n\text{Var}[X_1]\big).$$
Example 2.5.10. Let $X_1, \ldots, X_n$ be i.i.d. $G(a/n, b)$ random variables. Then the MGF of $Z_n = \sum_{i=1}^n X_i$ is, for $t < b$:
$$M(t) = E[e^{Z_n t}] = \prod_{i=1}^n E[e^{X_i t}] = E[e^{X_1 t}]^n = \left( \Big( \frac{b}{b - t} \Big)^{a/n} \right)^n = \Big( \frac{b}{b - t} \Big)^a$$
where we have used the i.i.d. property and the MGF of a Gamma random variable (see Table 4.2). Thus $Z_n \sim G(a, b)$ for any $n \ge 1$. However, from the CLT one can reasonably approximate $Z_n$, when $n$ is large, by a normal random variable with mean
$$n \cdot \frac{a}{bn} = \frac{a}{b}$$
and variance
$$n \cdot \frac{a}{nb^2} = \frac{a}{b^2}.$$
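The quality of this approximation is easy to inspect numerically. The sketch below is ours (assuming numpy and scipy); rather than simulating the sum, it uses the fact just shown that $Z_n \sim G(a, b)$ exactly:

```python
# Compare G(a, b) with its normal approximation N(a/b, a/b^2) for large a.
import numpy as np
from scipy import stats

a, b = 50.0, 2.0
z = stats.gamma(a, scale=1/b).rvs(size=200_000, random_state=6)

print(z.mean(), a / b)    # approximately 25
print(z.var(), a / b**2)  # approximately 12.5
# Upper 2.5% quantiles of the exact and approximating laws nearly agree:
print(stats.gamma(a, scale=1/b).ppf(0.975),
      stats.norm(a / b, np.sqrt(a / b**2)).ppf(0.975))
```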
Chapter 3
Introduction to Statistics
3.1 Introduction
In this final Chapter, we give a very brief introduction to statistical ideas. Here the notion is that one has observed data from some sampling distribution (the population distribution) and one wishes to infer what the properties of the population are on the basis of the observed samples. In particular, we will be interested in estimating the parameters of sampling distributions, via the maximum likelihood method (Section 3.2), as well as testing hypotheses about parameters (Section 3.3). We end the Chapter with an introduction to Bayesian statistics (Section 3.4), which is an alternative way to estimate parameters; it is more complex, but much richer than the MLE method.
3.2 Maximum Likelihood Estimation
3.2.1 Introduction
So far in this course, we have proceeded to discuss pmfs and pdfs such as $P(\lambda)$ or $N(\mu, \sigma^2)$, but we have not discussed what the parameters $\lambda$ or $(\mu, \sigma^2)$ might be. We discuss in this Section a particularly important way to estimate parameters, called maximum likelihood estimation (MLE).
The basic idea is this. Suppose one observes data $x_1, x_2, \ldots, x_n$ and we hypothesize that they follow some joint distribution $F_\theta(x_1, \ldots, x_n)$, $\theta \in \Theta$ ($\Theta$ is the parameter space, e.g. $\Theta = \mathbb{R}$). Then the idea of maximum likelihood estimation is to find the parameter which maximizes the joint pmf/pdf of the data, that is:
$$\hat{\theta}_n = \text{argmax}_{\theta \in \Theta}\, f_\theta(x_1, \ldots, x_n).$$
3.2.2 The Method
Throughout this section, we assume that $X_1, \ldots, X_n$ are mutually independent. So, we have $X_i \overset{\text{i.i.d.}}{\sim} f_\theta$, and the joint pmf/pdf is:
$$f_\theta(x_1, \ldots, x_n) = f_\theta(x_1)\, f_\theta(x_2) \cdots f_\theta(x_n) = \prod_{i=1}^n f_\theta(x_i).$$
We call $f_\theta(x_1, \ldots, x_n)$ the likelihood of the data. As maximizing a function is equivalent to maximizing a monotonic increasing transformation of the function, we often work with the log-likelihood:
$$l_\theta(x_1, \ldots, x_n) = \log\big( f_\theta(x_1, \ldots, x_n) \big) = \sum_{i=1}^n \log\big( f_\theta(x_i) \big).$$
If $\Theta$ is some continuous space (as it generally is for our examples) and $\theta = (\theta_1, \ldots, \theta_d)$, then we can compute the gradient vector:
$$\nabla l_\theta(x_1, \ldots, x_n) = \Big( \frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial \theta_1}, \ldots, \frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial \theta_d} \Big)$$
and we would like to solve, for $\theta$ (below $0$ is the $d$-dimensional vector of zeros)
$$\nabla l_\theta(x_1, \ldots, x_n) = 0. \quad (3.2.1)$$
The solution of this equation (assuming it exists) is a maximum if the Hessian matrix is negative definite:
$$H(\theta) := \begin{pmatrix} \frac{\partial^2 l_\theta(x_1, \ldots, x_n)}{\partial \theta_1^2} & \cdots & \frac{\partial^2 l_\theta(x_1, \ldots, x_n)}{\partial \theta_1 \partial \theta_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 l_\theta(x_1, \ldots, x_n)}{\partial \theta_d \partial \theta_1} & \cdots & \frac{\partial^2 l_\theta(x_1, \ldots, x_n)}{\partial \theta_d^2} \end{pmatrix}.$$
If the $d$ numbers $\gamma_1, \ldots, \gamma_d$ which solve $|\gamma I_d - H(\theta)| = 0$, with $I_d$ the $d \times d$ identity matrix (that is, the eigenvalues of $H(\theta)$), are all negative, then $\theta$ is a local maximum of $l_\theta(x_1, \ldots, x_n)$. If $d = 1$ then this just boils down to checking whether the second derivative of the log-likelihood is negative at the solution of (3.2.1).
Thus, in summary, the approach we employ is as follows:
1. Compute the likelihood $f_\theta(x_1, \ldots, x_n)$.
2. Compute the log-likelihood $l_\theta(x_1, \ldots, x_n)$ and its gradient vector $\nabla l_\theta(x_1, \ldots, x_n)$.
3. Solve $\nabla l_\theta(x_1, \ldots, x_n) = 0$ with respect to $\theta$; call this solution $\tilde{\theta}_n$ (we are assuming there is only one $\tilde{\theta}_n$).
4. If $H(\tilde{\theta}_n)$ is negative definite, then $\hat{\theta}_n = \tilde{\theta}_n$.
In general, point 3 may not be possible analytically (so, for example, one can use Newton's method); a numerical sketch follows below. However, you will not be asked to solve $\nabla l_\theta(x_1, \ldots, x_n) = 0$ unless there is an analytic solution.
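As an illustration of the numerical route (our own sketch, not examinable; it assumes numpy and scipy, where scipy's gamma distribution is parameterized by a shape $a$ and scale $1/b$): for i.i.d. $G(a, b)$ data the shape parameter has no closed-form MLE, so one minimizes the negative log-likelihood:

```python
# Numerical MLE for i.i.d. G(a, b) data via minimization of -l_theta.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(7)
x = rng.gamma(shape=3.0, scale=1/2.0, size=2_000)  # true (a, b) = (3, 2)

def neg_loglik(theta):
    a, b = theta
    if a <= 0 or b <= 0:
        return np.inf  # outside the parameter space
    return -np.sum(stats.gamma.logpdf(x, a, scale=1/b))

res = optimize.minimize(neg_loglik, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x)  # approximately (3, 2)
```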
3.2.3 Examples of Computing the MLE
Example 3.2.1. Let $X_1, \ldots, X_n$ be i.i.d. $P(\lambda)$ random variables. Let us compute the MLE of $\theta = \lambda$, given observations $x_1, \ldots, x_n$. First, we have
$$f_\theta(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{\lambda^{x_i}}{x_i!}\, e^{-\lambda} = e^{-\lambda n}\, \lambda^{\sum_{i=1}^n x_i}\, \frac{1}{\prod_{i=1}^n x_i!}.$$
Second, the log-likelihood is:
$$l_\theta(x_1, \ldots, x_n) = \log(f_\theta(x_1, \ldots, x_n)) = -\lambda n + \Big( \sum_{i=1}^n x_i \Big)\log(\lambda) - \log\Big( \prod_{i=1}^n x_i! \Big).$$
The gradient vector is a derivative:
$$\frac{dl_\theta(x_1, \ldots, x_n)}{d\lambda} = -n + \frac{1}{\lambda}\Big( \sum_{i=1}^n x_i \Big).$$
Thirdly,
$$-n + \frac{1}{\lambda}\Big( \sum_{i=1}^n x_i \Big) = 0$$
so
$$\tilde{\lambda}_n = \frac{1}{n} \sum_{i=1}^n x_i.$$
Fourthly,
$$\frac{d^2 l_\theta(x_1, \ldots, x_n)}{d\lambda^2} = -\frac{1}{\lambda^2}\Big( \sum_{i=1}^n x_i \Big) < 0$$
for any $\lambda > 0$ (assuming there is at least one $i$ such that $x_i > 0$). Thus, assuming there is at least one $i$ such that $x_i > 0$,
$$\hat{\lambda}_n = \frac{1}{n} \sum_{i=1}^n x_i.$$
Remark 3.2.2. As a theoretical justification of $\hat{\lambda}_n$ as in Example 3.2.1, we note that
$$\frac{1}{n} \sum_{i=1}^n X_i$$
will converge in probability (see Theorem 2.5.7) to $E[X_1] = \lambda$ if our assumptions hold true. That is, we recover the true parameter value; such a property is called consistency. We do not address this issue further.
Example 3.2.3. Let $X_1, \ldots, X_n$ be i.i.d. $E(\lambda)$ random variables. Let us compute the MLE of $\theta = \lambda$, given observations $x_1, \ldots, x_n$. First, we have
$$f_\theta(x_1, \ldots, x_n) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n \exp\Big\{ -\lambda \sum_{i=1}^n x_i \Big\}.$$
Second, the log-likelihood is:
$$l_\theta(x_1, \ldots, x_n) = \log(f_\theta(x_1, \ldots, x_n)) = n\log(\lambda) - \lambda \sum_{i=1}^n x_i.$$
The gradient vector is a derivative:
$$\frac{dl_\theta(x_1, \ldots, x_n)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i.$$
Thirdly,
$$\frac{n}{\lambda} - \sum_{i=1}^n x_i = 0$$
so
$$\tilde{\lambda}_n = \Big( \frac{1}{n} \sum_{i=1}^n x_i \Big)^{-1}.$$
Fourthly,
$$\frac{d^2 l_\theta(x_1, \ldots, x_n)}{d\lambda^2} = -\frac{n}{\lambda^2} < 0.$$
Thus
$$\hat{\lambda}_n = \Big( \frac{1}{n} \sum_{i=1}^n x_i \Big)^{-1}.$$
Example 3.2.4. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables. Let us compute the MLE of $\theta = (\mu, \sigma^2)$, given observations $x_1, \ldots, x_n$. First, we have
$$f_\theta(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2}(x_i - \mu)^2 \Big\} = \Big( \frac{1}{\sqrt{2\pi\sigma^2}} \Big)^n \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \Big\}.$$
Second, the log-likelihood is:
$$l_\theta(x_1, \ldots, x_n) = \log(f_\theta(x_1, \ldots, x_n)) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2.$$
The gradient vector is
$$\Big( \frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial \mu}, \frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial \sigma^2} \Big) = \Big( \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu),\ -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2 \Big).$$
Thirdly, we must solve the equations
$$\frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \quad (3.2.2)$$
$$-\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2 = 0 \quad (3.2.3)$$
simultaneously for $\mu$ and $\sigma^2$. Since (3.2.2) can be solved independently of (3.2.3), we have:
$$\tilde{\mu}_n = \frac{1}{n} \sum_{i=1}^n x_i.$$
Thus, substituting into (3.2.3), we have
$$\frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \tilde{\mu}_n)^2 = \frac{n}{2\sigma^2}$$
that is:
$$\tilde{\sigma}^2_n = \frac{1}{n} \sum_{i=1}^n (x_i - \tilde{\mu}_n)^2.$$
Fourthly, the Hessian matrix is:
$$H(\theta) = \begin{pmatrix} -\frac{n}{\sigma^2} & -\frac{1}{(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu) \\ -\frac{1}{(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu) & \frac{n}{2(\sigma^2)^2} - \frac{1}{(\sigma^2)^3} \sum_{i=1}^n (x_i - \mu)^2 \end{pmatrix}.$$
Note that when $\tilde{\theta}_n = (\tilde{\mu}_n, \tilde{\sigma}^2_n)$ the off-diagonal elements are exactly 0, so when solving $|\gamma I_2 - H(\tilde{\theta}_n)| = 0$ we simply need to show that the diagonal elements are negative. Clearly, for the first diagonal element,
$$-\frac{n}{\tilde{\sigma}^2_n} < 0$$
if $(x_i - \tilde{\mu}_n)^2 > 0$ for at least one $i$ (we do not allow the case $\tilde{\sigma}^2_n = 0$, in that we assume this doesn't occur). In addition
$$\frac{n}{2(\tilde{\sigma}^2_n)^2} - \frac{1}{(\tilde{\sigma}^2_n)^3} \sum_{i=1}^n (x_i - \tilde{\mu}_n)^2 = \frac{n}{2(\tilde{\sigma}^2_n)^2} - \frac{n}{(\tilde{\sigma}^2_n)^2} = -\frac{n}{2(\tilde{\sigma}^2_n)^2} < 0.$$
Thus
$$\hat{\theta}_n = \Bigg( \frac{1}{n} \sum_{i=1}^n x_i,\ \frac{1}{n} \sum_{i=1}^n \Big( x_i - \frac{1}{n} \sum_{j=1}^n x_j \Big)^2 \Bigg).$$
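A quick numerical check of this example (ours; it assumes numpy). Note that the MLE of $\sigma^2$ uses the divisor $n$, not $n - 1$:

```python
# The normal MLE is the sample mean and the biased (divisor n) sample variance.
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(loc=1.0, scale=2.0, size=10_000)  # mu = 1, sigma^2 = 4

print(x.mean())       # mu-hat, approximately 1
print(x.var(ddof=0))  # sigma^2-hat with divisor n (numpy's default), approximately 4
```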
3.3 Hypothesis Testing
3.3.1 Introduction
The theory of hypothesis testing is concerned with the problem of determining whether or not a statistical hypothesis, that is, a statement about the probability distribution of the data, is consistent with the available sample evidence. The particular hypothesis to be tested is called the null hypothesis and is denoted by $H_0$. The ultimate goal is to accept or reject $H_0$.
In addition to the null hypothesis $H_0$, one may also be interested in a particular set of deviations from $H_0$, called the alternative hypothesis and denoted by $H_1$. Usually, the null and the alternative hypotheses are not on an equal footing: $H_0$ is clearly specified and of intrinsic interest, whereas $H_1$ serves only to indicate what types of departure from $H_0$ are of interest.
A statistical test of a null hypothesis $H_0$ is typically based on three elements:
1. a statistic $T$, called a test statistic;
2. a partition of the possible values of $T$ into two distinct regions: the set $K$ of values of $T$ that are regarded as inconsistent with $H_0$, called the critical or rejection region of the test, and its complement $K^c$, called the non-rejection region;
3. a decision rule that rejects the null hypothesis $H_0$ as inconsistent with the data if the observed value of $T$ falls in the rejection region $K$, and does not reject $H_0$ if the observed value of $T$ belongs instead to $K^c$.
3.3.2 Constructing Test Statistics
In order to construct test statistics, we start with an important result, which we do not prove. Note that all of our results are associated to normally distributed data.
Theorem 3.3.1. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables. Define
$$\bar{X}_n := \frac{1}{n} \sum_{i=1}^n X_i, \qquad s_n^2 := \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
Then
1. $\bar{X}_n - \mu \sim N(0, \sigma^2/n)$.
2. $(n - 1)s_n^2/\sigma^2 \sim \mathcal{X}^2_{n-1}$.
3. $\bar{X}_n$ and $(n - 1)s_n^2/\sigma^2$ are independent random variables.
We note then, that by Proposition 2.4.39, it follows that
$$T(\mu) := \frac{(\bar{X}_n - \mu)/(\sigma/\sqrt{n})}{\sqrt{s_n^2/\sigma^2}} = \frac{\bar{X}_n - \mu}{\sqrt{s_n^2/n}} \sim \mathcal{T}_{n-1}.$$
Note that, as one might imagine, Proposition 2.4.38 is used to prove Theorem 3.3.1. Note that one can show that the $\mathcal{T}_{n-1}$ distribution has a symmetric pdf, around 0.
Now consider testing $H_0: \mu = \mu_0$ against the two-sided alternative $H_1: \mu \neq \mu_0$ (that is, we are testing whether the population mean, the true mean of the data, is a particular value). Now we know that $T(\mu) \sim \mathcal{T}_{n-1}$ when $\mu$ is the true mean; thus, if the null hypothesis is true, $T(\mu_0)$ should be a random variable that is consistent with a $\mathcal{T}_{n-1}$ random variable. To construct the rejection region of the test, we must choose a confidence level, typically 95%. Then we want $T(\mu_0)$ to lie in 95% of the probability. However, this still does not tell us what the rejection region is; this is informed by the alternative hypothesis $H_1: \mu \neq \mu_0$, which indicates that a value which is inconsistent with the $\mathcal{T}_{n-1}$ distribution lies in each tail of the distribution. Thus the procedure is:
1. Compute $T(\mu_0)$.
2. Decide upon your confidence level $100(1 - \alpha)\%$. This defines the rejection region.
3. Compute the $t$-values $(-t, t)$. These are the numbers in $\mathbb{R}$ such that the probability (under a $\mathcal{T}_{n-1}$ random variable) of exceeding $t$ is $\alpha/2$ and the probability of being less than $-t$ is $\alpha/2$.
4. If $-t < T(\mu_0) < t$ then we do not reject the null hypothesis; otherwise we reject the null hypothesis. (A code sketch of this procedure follows below.)
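A sketch of the two-sided $t$-test in code (ours, not part of the notes; it assumes numpy and scipy):

```python
# One-sample, two-sided t-test of H0: mu = mu_0 at the 95% confidence level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.normal(loc=0.2, scale=1.0, size=30)
mu0, alpha = 0.0, 0.05

T = (x.mean() - mu0) / np.sqrt(x.var(ddof=1) / len(x))  # T(mu_0)
t_crit = stats.t(df=len(x) - 1).ppf(1 - alpha / 2)       # the t-value
print(T, t_crit, "reject" if abs(T) > t_crit else "do not reject")
print(stats.ttest_1samp(x, mu0))  # the same decision via scipy's built-in routine
```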
A number of remarks are in order. First, a test can only disprove a null hypothesis. The fact
that we do not reject the null on the basis of the sample evidence does not mean that the hypothesis
is true.
Second, it is useful to distinguish between two types of errors that can be made:
- Type I error: Reject $H_0$ when it is true.
- Type II error: Do not reject $H_0$ when it is false.
In statistics, a Type I error is usually regarded as the most serious. The analogy is with the judiciary system, where condemning an innocent person is typically considered a much more serious problem than letting a guilty person go free.
A second test statistic we consider is for testing whether two samples have the same variance. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be independent sequences of random variables where $X_i \overset{\text{i.i.d.}}{\sim} N(\mu_X, \sigma_X^2)$ and $Y_i \overset{\text{i.i.d.}}{\sim} N(\mu_Y, \sigma_Y^2)$. Now we know from Theorem 3.3.1 that
$$(n - 1)s_{X,n}^2/\sigma_X^2 \sim \mathcal{X}^2_{n-1}$$
and independently:
$$(m - 1)s_{Y,m}^2/\sigma_Y^2 \sim \mathcal{X}^2_{m-1}.$$
Now suppose that we want to test $H_0: \sigma_X^2 = \sigma_Y^2$ against $H_1: \sigma_X^2 \neq \sigma_Y^2$. Now, if $H_0$ is true (so that $\sigma_Y^2 = \sigma_X^2$), by Theorem 3.3.1 and Proposition 2.4.40,
$$F(\sigma_X^2) := \frac{\big[(n - 1)s_{X,n}^2/\sigma_X^2\big]/(n - 1)}{\big[(m - 1)s_{Y,m}^2/\sigma_X^2\big]/(m - 1)} = \frac{s_{X,n}^2}{s_{Y,m}^2} \sim \mathcal{F}_{(n-1, m-1)}.$$
Thus, we perform the following procedure:
1. Compute $F(\sigma_X^2)$.
2. Decide upon your confidence level $100(1 - \alpha)\%$. This defines the rejection region.
3. Compute the $F$-values $(f_L, f_U)$. These are the numbers in $\mathbb{R}_+$ such that the probability (under an $\mathcal{F}_{(n-1, m-1)}$ random variable) of exceeding $f_U$ is $\alpha/2$ and the probability of being less than $f_L$ is $\alpha/2$.
4. If $f_L < F(\sigma_X^2) < f_U$ then we do not reject the null hypothesis; otherwise we reject the null hypothesis. (A code sketch follows below.)
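A sketch of the variance-ratio test in code (ours; it assumes numpy and scipy):

```python
# Two-sample F-test of H0: sigma_X^2 = sigma_Y^2 at the 95% confidence level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.normal(0.0, 1.0, size=25)  # n = 25
y = rng.normal(5.0, 1.0, size=30)  # m = 30; equal variances, different means
alpha = 0.05

F = x.var(ddof=1) / y.var(ddof=1)                          # F(sigma_X^2)
f_lo, f_hi = stats.f(24, 29).ppf([alpha / 2, 1 - alpha / 2])  # the F-values
print(F, (f_lo, f_hi), "reject" if not (f_lo < F < f_hi) else "do not reject")
```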
Remark 3.3.2. All of our results concern normal samples. However, one can use the CLT (Theorem 2.5.9) to extend these tests to non-normal data. We do not follow this idea in this course.
3.4 Bayesian Inference
3.4.1 Introduction
So far we have considered point estimation methods for the unknown parameter. In the following Section, we consider the idea of Bayesian inference for unknown parameters. Recall, from Examples 2.3.34 and 2.4.28, that there is a Bayes theorem for discrete and continuous random variables. Indeed, there is a version of Bayes theorem that can mix continuous and discrete random variables. Let $X$ and $Y$ be two jointly defined random variables such that $X$ or $Y$ are either continuous or discrete. Then Bayes theorem is:
$$f(x|y) = \frac{f(y|x)f(x)}{f(y)}$$
as long as the associated pmfs/pdfs are well defined. Throughout the section, we assume that the pmfs/pdfs are always positive on their supports.
3.4.2 Bayesian Estimation
Throughout, we will assume that we have a random sample $X_1, X_2, \ldots, X_n$ which are conditionally independent, given a parameter $\theta$. As we will see, in Bayesian statistics, the parameter $\theta$ is a random variable. So, we assume $X_i | \theta \overset{\text{i.i.d.}}{\sim} F(\cdot|\theta)$, where $F(\cdot|\theta)$ is assumed to be a distribution function for any $\theta \in \Theta$. Thus we have that the joint pmf/pdf of $X_1, \ldots, X_n | \theta$ is:
$$f(x_1, \ldots, x_n | \theta) = \prod_{i=1}^n f(x_i | \theta).$$
The main key behind Bayesian statistics is the choice of a prior probability distribution for the parameter $\theta$. That is, Bayesian statisticians specify a probability distribution on the parameter before the data are observed. This probability distribution is supposed to reflect the information one might have before seeing the observations. To make this idea concrete, consider the following example:
Example 3.4.1. Suppose that we will observe data $X_i | \lambda \overset{\text{i.i.d.}}{\sim} E(\lambda)$. Then one has to construct a probability distribution on $\mathbb{R}_+$. A possible candidate is $\lambda \sim G(a, b)$. If one has some prior beliefs about the mean and variance of $\lambda$ (say $E[\lambda] = 10$, $\text{Var}[\lambda] = 10$) then one can determine what $a$ and $b$ are (here $a/b = 10$ and $a/b^2 = 10$ give $a = 10$, $b = 1$).
Throughout, we will write the prior pmf/pdf as $\pi(\theta)$.
Now, the way in which Bayesian inference works is to update the prior beliefs on $\theta$ via the posterior pmf/pdf. That is, in the light of the data, the distributional properties of the prior are updated. This is achieved by Bayes theorem; the posterior pmf/pdf is:
$$\pi(\theta | x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_n | \theta)\,\pi(\theta)}{f(x_1, \ldots, x_n)}$$
where
$$f(x_1, \ldots, x_n) = \int_\Theta f(x_1, \ldots, x_n | \theta)\,\pi(\theta)\,d\theta$$
if $\theta$ is continuous and, if $\theta$ is discrete:
$$f(x_1, \ldots, x_n) = \sum_{\theta \in \Theta} f(x_1, \ldots, x_n | \theta)\,\pi(\theta).$$
For a Bayesian statistician, the posterior is the final answer, in that all statistical inference should be associated to the posterior. For example, if one is interested in estimating $\theta$ then one can use the posterior mean:
$$E[\theta | x_1, \ldots, x_n] = \int_\Theta \theta\, \pi(\theta | x_1, \ldots, x_n)\,d\theta.$$
In addition, consider $\theta \in \Theta \subseteq \mathbb{R}$ (or indeed any univariate component of $\theta$); then we can compute a confidence interval, which is called a credible interval in Bayesian statistics. The highest 95%-posterior-credible (HPC) interval is the shortest region $[\underline{\theta}, \bar{\theta}]$ such that
$$\int_{\underline{\theta}}^{\bar{\theta}} \pi(\theta | x_1, \ldots, x_n)\,d\theta = 0.95.$$
By shortest, we mean the $[\underline{\theta}, \bar{\theta}]$ such that $|\bar{\theta} - \underline{\theta}|$ is smallest.
It should be clear by now that the posterior distribution is much richer than the MLE, in the sense that one now has a whole distribution which reflects the parameter, instead of a point estimate. What also might be apparent now is the fact that the posterior is perhaps difficult to calculate. For example, the posterior mean will require you to compute two integrations over $\Theta$ (in general) and this may not be easy to calculate in practice. In addition, it could be difficult, analytically, to calculate a HPC. As a result, most Bayesian inference is done via numerical methods, based on Monte Carlo (see Example 2.5.8); we do not mention this further.
3.4.3 Examples
Despite the fact that Bayesian inference is very challenging, there are still many examples where one can do analytical calculations.
Example 3.4.2. Let us consider Example 3.4.1. Here we have that
$$f(x_1, \ldots, x_n) = \int_0^\infty \lambda^n \exp\Big\{ -\lambda \sum_{i=1}^n x_i \Big\}\, \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}\,d\lambda$$
$$= \frac{b^a}{\Gamma(a)} \int_0^\infty \lambda^{n+a-1} \exp\Big\{ -\lambda\Big[ \sum_{i=1}^n x_i + b \Big] \Big\}\,d\lambda$$
$$= \frac{b^a}{\Gamma(a)} \Big( \frac{1}{\sum_{i=1}^n x_i + b} \Big)^{n+a} \int_0^\infty u^{n+a-1} e^{-u}\,du$$
$$= \frac{b^a}{\Gamma(a)} \Big( \frac{1}{\sum_{i=1}^n x_i + b} \Big)^{n+a} \Gamma(n + a).$$
So, as:
$$f(x_1, \ldots, x_n | \lambda)\,\pi(\lambda) = \frac{b^a}{\Gamma(a)}\, \lambda^{n+a-1} \exp\Big\{ -\lambda\Big[ \sum_{i=1}^n x_i + b \Big] \Big\},$$
we have:
$$\pi(\lambda | x_1, \ldots, x_n) = \frac{\lambda^{n+a-1} \exp\{ -\lambda[\sum_{i=1}^n x_i + b] \}}{\Big( \frac{1}{\sum_{i=1}^n x_i + b} \Big)^{n+a} \Gamma(n + a)}$$
i.e.
$$\lambda | x_1, \ldots, x_n \sim G\Big( n + a,\ b + \sum_{i=1}^n x_i \Big).$$
Thus, the posterior distribution on $\lambda$ is in the same family as the prior, except with updated parameters, reflecting the data. So, for example:
$$E[\lambda | x_1, \ldots, x_n] = \frac{n + a}{b + \sum_{i=1}^n x_i}.$$
In comparison to the MLE in Example 3.2.3, we see that the posterior mean and MLE correspond as $a, b \to 0$.
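Because of conjugacy, no numerical integration is needed here; a short sketch (ours, assuming numpy and scipy) performs the update and reports the posterior mean together with an equal-tailed 95% credible interval (a simpler alternative to the HPC interval):

```python
# Gamma-Exponential conjugate update: posterior is G(n + a, b + sum x_i).
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
lam_true, a, b = 2.0, 1.0, 1.0
x = rng.exponential(1 / lam_true, size=200)

a_post, b_post = a + len(x), b + x.sum()
print(a_post / b_post)  # posterior mean, close to lam_true for this much data
print(stats.gamma(a_post, scale=1/b_post).ppf([0.025, 0.975]))  # 95% credible interval
```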
Example 3.4.3. Let $X_i | \lambda \overset{\text{i.i.d.}}{\sim} P(\lambda)$, $i \in \{1, \ldots, n\}$. Suppose the prior on $\lambda$ is $G(a, b)$. Then
$$f(x_1, \ldots, x_n) = \int_0^\infty \lambda^{\sum_{i=1}^n x_i} \exp\{-n\lambda\}\, \frac{1}{\prod_{i=1}^n x_i!}\, \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}\,d\lambda$$
$$= \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!} \int_0^\infty \lambda^{\sum_{i=1}^n x_i + a - 1} \exp\{ -\lambda[n + b] \}\,d\lambda$$
$$= \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!} \Big( \frac{1}{n + b} \Big)^{\sum_{i=1}^n x_i + a} \int_0^\infty u^{\sum_{i=1}^n x_i + a - 1} e^{-u}\,du$$
$$= \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!} \Big( \frac{1}{n + b} \Big)^{\sum_{i=1}^n x_i + a} \Gamma\Big( \sum_{i=1}^n x_i + a \Big).$$
So, as:
$$f(x_1, \ldots, x_n | \lambda)\,\pi(\lambda) = \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!}\, \lambda^{\sum_{i=1}^n x_i + a - 1} \exp\{ -\lambda[n + b] \},$$
we have:
$$\pi(\lambda | x_1, \ldots, x_n) = \frac{\lambda^{\sum_{i=1}^n x_i + a - 1} \exp\{ -\lambda[n + b] \}}{\Big( \frac{1}{n + b} \Big)^{\sum_{i=1}^n x_i + a} \Gamma\big( \sum_{i=1}^n x_i + a \big)}$$
i.e.
$$\lambda | x_1, \ldots, x_n \sim G\Big( \sum_{i=1}^n x_i + a,\ n + b \Big).$$
Thus, the posterior distribution on $\lambda$ is in the same family as the prior, except with updated parameters, reflecting the data. So, for example:
$$E[\lambda | x_1, \ldots, x_n] = \frac{\sum_{i=1}^n x_i + a}{n + b}.$$
In comparison to the MLE in Example 3.2.1, we see that the posterior mean and MLE correspond as $a, b \to 0$.
Example 3.4.4. Let $X_i | \mu \overset{\text{i.i.d.}}{\sim} N(\mu, 1)$, $i \in \{1, \ldots, n\}$. Suppose the prior on $\mu$ is $N(\xi, \kappa)$. Then
$$f(x_1, \ldots, x_n) = \int_{\mathbb{R}} \Big( \frac{1}{\sqrt{2\pi}} \Big)^n \exp\Big\{ -\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 \Big\}\, \frac{1}{\sqrt{2\pi\kappa}} \exp\Big\{ -\frac{1}{2\kappa}(\mu - \xi)^2 \Big\}\,d\mu.$$
To compute the integral, let us first manipulate the exponent inside the integral:
$$-\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{1}{2\kappa}(\mu - \xi)^2 = -\frac{1}{2}\Big[ \mu^2\Big( n + \frac{1}{\kappa} \Big) - 2\mu\Big( \frac{\xi}{\kappa} + \sum_{i=1}^n x_i \Big) + \frac{\xi^2}{\kappa} + \sum_{i=1}^n x_i^2 \Big].$$
Let $c(\xi, \kappa, x_1, \ldots, x_n) = -\frac{1}{2}\big( \frac{\xi^2}{\kappa} + \sum_{i=1}^n x_i^2 \big)$; then
$$-\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{1}{2\kappa}(\mu - \xi)^2 = -\frac{1}{2}\Big[ \mu^2\Big( n + \frac{1}{\kappa} \Big) - 2\mu\Big( \frac{\xi}{\kappa} + \sum_{i=1}^n x_i \Big) \Big] + c(\xi, \kappa, x_1, \ldots, x_n)$$
$$= -\frac{1}{2}\Big( n + \frac{1}{\kappa} \Big)\Bigg[ \Big( \mu - \frac{\frac{\xi}{\kappa} + \sum_{i=1}^n x_i}{n + \frac{1}{\kappa}} \Big)^2 - \Big( \frac{\frac{\xi}{\kappa} + \sum_{i=1}^n x_i}{n + \frac{1}{\kappa}} \Big)^2 \Bigg] + c(\xi, \kappa, x_1, \ldots, x_n).$$
Let $c'(\xi, \kappa, x_1, \ldots, x_n) = c(\xi, \kappa, x_1, \ldots, x_n) + \frac{1}{2}\big( n + \frac{1}{\kappa} \big)\Big( \frac{\frac{\xi}{\kappa} + \sum_{i=1}^n x_i}{n + \frac{1}{\kappa}} \Big)^2$; then we have
$$f(x_1, \ldots, x_n) = \Big( \frac{1}{\sqrt{2\pi}} \Big)^n \frac{1}{\sqrt{2\pi\kappa}} \exp\{ c'(\xi, \kappa, x_1, \ldots, x_n) \} \int_{\mathbb{R}} \exp\Big\{ -\frac{1}{2}\Big( n + \frac{1}{\kappa} \Big)\Big( \mu - \frac{\frac{\xi}{\kappa} + \sum_{i=1}^n x_i}{n + \frac{1}{\kappa}} \Big)^2 \Big\}\,d\mu$$
$$= \Big( \frac{1}{\sqrt{2\pi}} \Big)^n \frac{1}{\sqrt{2\pi\kappa}} \exp\{ c'(\xi, \kappa, x_1, \ldots, x_n) \}\, \sqrt{2\pi}\,\Big( n + \frac{1}{\kappa} \Big)^{-1/2}.$$
So, as:
$$f(x_1, \ldots, x_n | \mu)\,\pi(\mu) = \Big( \frac{1}{\sqrt{2\pi}} \Big)^n \frac{1}{\sqrt{2\pi\kappa}} \exp\{ c'(\xi, \kappa, x_1, \ldots, x_n) \} \exp\Big\{ -\frac{1}{2}\Big( n + \frac{1}{\kappa} \Big)\Big( \mu - \frac{\frac{\xi}{\kappa} + \sum_{i=1}^n x_i}{n + \frac{1}{\kappa}} \Big)^2 \Big\},$$
we have
$$\pi(\mu | x_1, \ldots, x_n) = \frac{\exp\Big\{ -\frac{1}{2}\big( n + \frac{1}{\kappa} \big)\Big( \mu - \frac{\frac{\xi}{\kappa} + \sum_{i=1}^n x_i}{n + \frac{1}{\kappa}} \Big)^2 \Big\}}{\sqrt{2\pi}\,\big( n + \frac{1}{\kappa} \big)^{-1/2}}$$
i.e.
$$\mu | x_1, \ldots, x_n \sim N\Bigg( \frac{\frac{\xi}{\kappa} + \sum_{i=1}^n x_i}{n + \frac{1}{\kappa}},\ \Big( n + \frac{1}{\kappa} \Big)^{-1} \Bigg).$$
Thus, the posterior distribution on $\mu$ is in the same family as the prior, except with updated parameters, reflecting the data. So, for example:
$$E[\mu | x_1, \ldots, x_n] = \frac{\frac{\xi}{\kappa} + \sum_{i=1}^n x_i}{n + \frac{1}{\kappa}}.$$
In comparison to the MLE in Example 3.2.4, we see that the posterior mean and MLE correspond as $\kappa \to \infty$, for any fixed $\xi \in \mathbb{R}$.
We note that in all of our examples the posterior is in the same family as the prior. This is not by chance; for every member of the exponential family with parameter $\theta$, it is often possible to find a prior which obeys this former property. Such priors are called conjugate priors.
Chapter 4
Miscellaneous Results
In the following Chapter we quote some results of use in the course. We do not cover this material in lectures; it is there for your convenience and revision. The following Sections give general facts which should be taken as true. You can easily find the theory behind each of these results in any undergraduate text in maths or statistics.
4.1 Set Theory
Recall that $A \cup B$ means all the elements in $A$ or all the elements of $B$; e.g. $A = \{1, 2\}$, $B = \{3, 4\}$, then $A \cup B = \{1, 2, 3, 4\}$. In addition, $A \cap B$ means all the elements in $A$ and $B$; e.g. $A = \{1, 2, 3\}$, $B = \{3, 4\}$, then $A \cap B = \{3\}$. Finally, recall $A^c$ is the complement of the set $A$, that is, everything in $\Omega$ that is not in $A$; e.g. $\Omega = \{1, 2, 3\}$ and $A = \{1, 2\}$, then $A^c = \{3\}$.
Let $A, B, C$ be sets in some state-space $\Omega$. Then the following hold true:
$$A \cup B = B \cup A \qquad A \cap B = B \cap A \qquad \text{COMMUTATIVITY}$$
$$A \cup (B \cup C) = (A \cup B) \cup (A \cup C) \qquad A \cap (B \cap C) = (A \cap B) \cap (A \cap C) \qquad \text{DISTRIBUTIVITY}.$$
Note that the second rule holds when mixing intersection and union, e.g.:
$$A \cup (B \cap C) = (A \cup B) \cap (A \cup C).$$
We also have De Morgan's laws:
$$(A \cup B)^c = A^c \cap B^c \qquad (A \cap B)^c = A^c \cup B^c.$$
A neat trick for probability:
$$A = (A \cap B) \cup (A \cap B^c).$$
4.2 Summation
$$\frac{1}{1 - z} = \sum_{k=0}^{\infty} z^k, \quad |z| < 1 \qquad \text{GEOMETRIC}$$
$$e^z = \sum_{k=0}^{\infty} \frac{z^k}{k!}, \quad z \in \mathbb{R} \qquad \text{EXPONENTIAL}$$
$$(1 + z)^n = \sum_{k=0}^{n} \binom{n}{k} z^k, \quad n \in \mathbb{Z}_+ \qquad \text{BINOMIAL}$$
$$\frac{1}{(1 - z)^{n+1}} = \sum_{k=0}^{\infty} \binom{n + k}{k} z^k, \quad n \in \mathbb{Z}_+,\ |z| < 1 \qquad \text{NEGATIVE BINOMIAL}$$
$$\log(1 - z) = -\sum_{k=1}^{\infty} \frac{z^k}{k}, \quad |z| < 1 \qquad \text{LOGARITHMIC}$$
$$\log(1 + z) = \sum_{k=1}^{\infty} (-1)^{k+1} \frac{z^k}{k}, \quad |z| < 1 \qquad \text{LOGARITHMIC}$$
Note that, supposing $|z| < 1$ (in fact any $z \neq 1$ suffices), we have
$$\sum_{k=j}^{n} z^k = z^j\, \frac{1 - z^{n-j+1}}{1 - z}.$$
4.3 Exponential Function
For $x \in \mathbb{R}_+$:
$$\lim_{n\to\infty} \Big( 1 + \frac{x}{n} \Big)^n = e^x \qquad \lim_{n\to\infty} \Big( 1 - \frac{x}{n} \Big)^n = e^{-x}.$$
4.4 Taylor Series
Under assumptions that are in effect in this course, for infinitely differentiable $f : \mathbb{R} \to \mathbb{R}$:
$$f(x) = \sum_{k=0}^{\infty} \frac{(x - c)^k}{k!}\, f^{(k)}(c), \quad c \in \mathbb{R}.$$
4.5 Integration Methods
Some ideas for integration:
$$\frac{d}{dx}\big[ f(g(x)) \big] = g'(x)\, f'(g(x)) \qquad \Longrightarrow \qquad \int g'(x)\, f'(g(x))\,dx = f(g(x)) + C.$$
The methods of integration by parts and integration by substitution you should have covered prior to your university education.
Consider the following integration trick. Suppose you want to compute
$$\int g(x)\,dx$$
where $g$ is a positive function. Suppose that there is a PDF $f(x)$ such that
$$f(x) = c\,g(x)$$
where you know $c$. Then
$$\int g(x)\,dx = \int \frac{1}{c}\, f(x)\,dx = \frac{1}{c}.$$
Recall that
$$I = \int_{-\infty}^{\infty} e^{-u^2/2}\,du = \sqrt{2\pi}.$$
This can be established by the fact that
$$I^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-u^2/2 - v^2/2}\,du\,dv$$
and then changing to polar co-ordinates.
4.6 Distributions
A table of discrete distributions can be found in Table 4.1 and continuous distributions in Table 4.2. Recall that
$$\Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\,dt, \quad a > 0.$$
Note that one can show $\Gamma(a + 1) = a\Gamma(a)$ and, for $a \in \mathbb{Z}_+$, $\Gamma(a) = (a - 1)!$. In addition, for $a, b > 0$,
$$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}.$$
One can also show that
$$B(a, b) = \int_0^1 u^{a-1}(1 - u)^{b-1}\,du.$$
| | Support $\mathbb{X}$ | Par. | PMF | CDF | $E[X]$ | $\text{Var}[X]$ | MGF |
|---|---|---|---|---|---|---|---|
| $B(1, p)$ | $\{0, 1\}$ | $p \in (0,1)$ | $p^x(1-p)^{1-x}$ | — | $p$ | $p(1-p)$ | $(1-p) + pe^t$ |
| $B(n, p)$ | $\{0, 1, \ldots, n\}$ | $p \in (0,1)$, $n \in \mathbb{Z}_+$ | $\binom{n}{x}p^x(1-p)^{n-x}$ | — | $np$ | $np(1-p)$ | $((1-p) + pe^t)^n$ |
| $P(\lambda)$ | $\{0, 1, 2, \ldots\}$ | $\lambda \in \mathbb{R}_+$ | $\frac{\lambda^x e^{-\lambda}}{x!}$ | — | $\lambda$ | $\lambda$ | $\exp\{\lambda(e^t - 1)\}$ |
| $Ge(p)$ | $\{1, 2, \ldots\}$ | $p \in (0,1)$ | $(1-p)^{x-1}p$ | $1 - q^x$ | $1/p$ | $(1-p)/p^2$ | $\frac{pe^t}{1 - e^t(1-p)}$ |
| $Ne(n, p)$ | $\{n, n+1, \ldots\}$ | $p \in (0,1)$, $n \in \mathbb{Z}_+$ | $\binom{x-1}{n-1}(1-p)^{x-n}p^n$ | — | $n/p$ | $n(1-p)/p^2$ | $\big(\frac{pe^t}{1 - e^t(1-p)}\big)^n$ |

Table 4.1: Table of Discrete Distributions. Note that $q = 1 - p$.
| | Support $\mathbb{X}$ | Par. | PDF | CDF | $E[X]$ | $\text{Var}[X]$ | MGF |
|---|---|---|---|---|---|---|---|
| $U_{[a,b]}$ | $[a, b]$ | $-\infty < a < b < \infty$ | $\frac{1}{b-a}$ | $\frac{x-a}{b-a}$ | $(a+b)/2$ | $(b-a)^2/12$ | $\frac{e^{bt} - e^{at}}{t(b-a)}$ |
| $E(\lambda)$ | $\mathbb{R}_+$ | $\lambda \in \mathbb{R}_+$ | $\lambda e^{-\lambda x}$ | $1 - e^{-\lambda x}$ | $1/\lambda$ | $1/\lambda^2$ | $\frac{\lambda}{\lambda - t}$ |
| $G(a, b)$ | $\mathbb{R}_+$ | $a, b \in \mathbb{R}_+$ | $\frac{b^a}{\Gamma(a)}x^{a-1}e^{-bx}$ | — | $a/b$ | $a/b^2$ | $\big(\frac{b}{b-t}\big)^a$ |
| $N(a, b)$ | $\mathbb{R}$ | $(a, b) \in \mathbb{R} \times \mathbb{R}_+$ | $\frac{1}{\sqrt{2\pi b}}e^{-\frac{1}{2b}(x-a)^2}$ | — | $a$ | $b$ | $e^{at + bt^2/2}$ |
| $Be(a, b)$ | $[0, 1]$ | $(a, b) \in (\mathbb{R}_+)^2$ | $B(a,b)^{-1}x^{a-1}(1-x)^{b-1}$ | — | $a/(a+b)$ | $\frac{ab}{(a+b)^2(a+b+1)}$ | — |

Table 4.2: Table of Continuous Distributions. Note that $B(a, b)^{-1} = \Gamma(a+b)/[\Gamma(a)\Gamma(b)]$.