
1 — A SINGLE RANDOM VARIABLE

Questions involving probability abound in Computer Science:


• What is the probability of the PWF world falling over next week?
• What is the probability of one packet colliding with another in a network?
• What is the probability of an undergraduate not turning up for a lecture?
When addressing such questions there are often added complications: the question may
be ill posed or the answer may vary with time. Which undergraduate? What lecture? Is
the probability of turning up different on Saturdays?
Let’s start with something which appears easy to reason about. . .

Introduction — Throwing a die


Consider an experiment or trial which consists of throwing a mathematically ideal die.
Such a die is often called a fair die or an unbiased die. Common sense suggests that:
• The outcome of a single throw cannot be predicted.
• The outcome will necessarily be a random integer in the range 1 to 6.
• The six possible outcomes are equiprobable, each having a probability of 1/6.

Without further qualification, serious probabilists would regard this collection of assertions,
especially the second, as almost meaningless. Just what is a random integer? Giving proper
mathematical rigour to the foundations of probability theory is quite a taxing task.
To illustrate the difficulty, consider probability in a frequency sense. Thus a probability
of 1/6 means that, over a long run, one expects to throw a 5 (say) on one-sixth of the
occasions that the die is thrown. If the actual proportion of 5s after n throws is p5(n) it
would be nice to say:

    lim_{n→∞} p5(n) = 1/6

Unfortunately this is utterly bogus mathematics! This is simply not a proper use of the
idea of a limit.
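Although the limit statement is not legitimate mathematics, the frequency idea itself is easy to
explore empirically. The short Python sketch below (purely illustrative; the seed and the throw
counts are arbitrary choices) throws a simulated fair die repeatedly and prints the running
proportion of 5s:

    import random

    random.seed(1)                        # fixed seed so the run is repeatable
    fives = 0
    for n in range(1, 60001):
        if random.randint(1, 6) == 5:     # one throw of a simulated fair die
            fives += 1
        if n in (60, 600, 6000, 60000):
            print(n, fives / n)           # running proportion of 5s after n throws

The proportion tends to settle near 1/6, but nothing guarantees convergence in the strict
sense of a limit.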
Suppose you frequently play a board game which requires you to ‘throw a 6 to start’. Each
time you play the game you write down how many throws it took to get your 6. One day
you ponder what the long-term average ought to be. . .
You reason that each of the six possible outcomes should occur with equal frequency in
each of the first six throws. You reason further that the required 6 could be in any position
from first to sixth in these first six throws and so, on average, you will get the first 6 at
throw 3½.
This, too, is nonsense reasoning but you will have to wait a while before the correct answer
is derived. Meantime, note a health warning: relying on intuition can seriously damage
your understanding of Probability!

– 1.1 –
Relationship to Set Theory — Sample Space — Sample Point
This course will concentrate on solving problems. It is not proposed to venture very far
into the foundations of probability theory but some formal discussion is unavoidable.
It is mathematically sound to represent the collection of possible outcomes of an experiment
as a set. Call this set Ω. In the case of a die:
Ω = {1, 2, 3, 4, 5, 6}

The set Ω is referred to as the sample space associated with the experiment. Each member
of the set is a sample point.

Events
There are times when you are concerned not so much with the value which results from a
particular throw but with whether this value is in some specific subset of Ω. For example,
you might bet on the result being an even number, when the subset of interest is {2, 4, 6}.
Call this subset E:
E = {2, 4, 6}
A set such as E is called an event. This term is often used in a slightly unnatural way: for
example, you can talk of ‘betting on some event’. This clearly doesn’t mean that you hope
the outcome of a throw is a subset like {2, 4, 6} since the outcome can be only a single
value. The intended meaning is that you hope the outcome is a member of the specified
subset.

Two Special Events


Any subset of Ω could be deemed an event. Two special cases are the empty set φ = {}
and the full set Ω itself.
The outcome of a throw cannot be a member of the empty set so a bet on φ is certain to
lose. The outcome of a throw is necessarily a member of Ω so a bet on Ω is certain to win.

Elementary Events
Any event which corresponds to a single sample point is also clearly special. Such an event
is called an elementary event. For example, the elementary event {5} corresponds to the
single sample point 5. Note that {5} is a subset of Ω whereas 5 is a member of Ω.

Random Variables — The {X = r} Notation


A most useful abstraction is the idea of a random variable. A naı̈ve view of a random
variable is that, rather like a variable in a programming language, it has a name and a
value and whereas the name stays unchanged the value changes.
It is common to use the name X when referring to a single random variable. Suppose X
corresponds to the outcome of throwing a die, so the value of X after a particular throw
might be 5. It is easy to arrange for a spreadsheet cell to simulate such behaviour. At a
given moment the appearance of the cell might be:
X
5

– 1.2 –
It is more usual to write {X = 5} and this is an alternative way of representing the
elementary event {5}. In both cases the curly-brackets emphasise that a set is being
represented and not an element.
The notation {X = 5} can be generalised to {X = r}. In the context of a die, r would
have one of the values 1, 2, 3, 4, 5 or 6. Consider X to be the name of the variable and
r to be its value.

Random Variables — Discrete versus Continuous


When X is associated with the outcome of throwing a die, it is an example of a discrete
random variable, one which can take on only a countable number of different values.
Probability theory extends to continuous random variables too. For example, the outcome
of throwing a dart at the centre of a dart board can be recorded as the distance of the
landing point from the centre. Subject to certain practical constraints, this distance can
be any positive real number. Continuous random variables will be left until later.

Combining Events
In discrete examples, any event can be represented as a set and ordinary set notation can
be employed. Thus the combined event {X = 2} ∪ {X = 4} ∪ {X = 6} is equivalent to
the event {2, 4, 6} and if E = {2, 4, 6} then the complementary event Ē = {1, 3, 5}.
Events are said to be exclusive if the associated sets are disjoint. Thus {1, 4} and {2, 6}
are exclusive events.
Events are said to be exhaustive if the union of the associated sets incorporates every
possible sample point. Thus {1, 2, 3, 4} and {3, 4, 5, 6} are exhaustive.
Note that the two events E and Ē are exclusive and exhaustive as are Ω and φ.

The Capital-P Notation


As shorthand for writing ‘the probability of. . .’ it is customary to write capital-P followed
by a pair of round brackets. The brackets enclose the event whose probability is of interest.
Capital-P is best introduced by means of some examples:
P({2, 4, 6}) means The probability of the event {2, 4, 6}.
P(E) means The probability of the event E.
P({5}) means The probability of the elementary event {5}.
P({X = r}) means The probability of the random variable X being r.
As clarified earlier, ‘the probability of the event E’ really means ‘the probability that the
outcome is a member of E’.
If the probability is known, P(some event) can appear on the left-hand side of an equation:
P({2, 4, 6}) = 1/2   means   The probability of the event {2, 4, 6} is 1/2.
P(E) = 1/2   means   The probability of the event E is 1/2.
P({5}) = 1/6   means   The probability of the elementary event {5} is 1/6.
P({X = 5}) = 1/6   means   The probability of the random variable X being 5 is 1/6.
Common sense suggests that P({2, 4, 6}) = 1/2 and this is shown formally on page 1.5.

– 1.3 –
Axioms of Probability
Taking Ω as sample space and A as an event (a subset of Ω), three assertions can be taken
as axioms:
I P(A) ≥ 0. The probability of any event is non-negative.
II P(Ω) = 1. The probability of certainty is 1.
III If A1, A2, A3 . . . are exclusive (disjoint) events (such that Ai ∩ Aj = φ whenever i ≠ j)
then P(A1 ∪ A2 ∪ A3 ∪ . . .) = P(A1) + P(A2) + P(A3) + . . .

Axiom III is known as the Addition Rule of Probability.

The Empty Set Theorem — P(φ) = 0
The result P(φ) = 0, although self-evident (the probability of an impossible event is 0), is
nevertheless a theorem whose proof follows from the axioms. . .
Given that Ω = Ω ∪ φ, it follows that:

P(Ω) = P(Ω ∪ φ)

From axiom II, the left-hand side P(Ω) = 1 and, from axiom III, the right-hand side
P(Ω ∪ φ) = P(Ω) + P(φ) since Ω and φ are exclusive. Accordingly, P(Ω) + P(φ) = 1 which,
given that P(Ω)=1, immediately ensures that P(φ) = 0.

The Summation of Elementary Events Theorem


If event A is the set of sample points {s1 , s2 , . . . , sn } then:
    P(A) = Σ_{i=1}^{n} P({si})

In words: the probability of event A is the sum of the probabilities of the elementary events
whose union is A.
Any non-empty event is the union of certain elementary events and elementary events are
necessarily exclusive. In particular A = {s1 } ∪ {s2 } ∪ . . . ∪ {sn } so the theorem is just a
special case of axiom III.

The Complementary Event Theorem


Let event Ā be the complement of event A (strictly the complement with respect to Ω).
The complementary event theorem states that:

P(A) = 1 − P(Ā)

Given that A ∪ Ā = Ω it follows that P(A ∪ Ā) = P(Ω) = 1.


Now A and Ā are exclusive, so P(A ∪ Ā) = P(A) + P(Ā) by axiom III.
Accordingly, P(A) + P(Ā) = 1. This proves the theorem.

– 1.4 –
Observations on the Axioms and Theorems
Being axioms, the assertions are not subject to proof but they are clearly reasonable.
Taking probability in the frequency sense, it is clear that the proportion of fives in a long
run of throws of a die cannot be negative. It is equally clear that throwing some value in
the range one to six is certain.
It seems simply common sense that the probability of an impossible event is 0 but taking
the probability of certainty to be 1 is merely convenient. It would be possible to take it
as axiomatic that the probability of certainty were 100 but this would lead to tiresome
factors of 100 appearing in calculations.
Event E = {2, 4, 6} is the union of the three elementary events {2}, {4} and {6}. The
probability of each of these elementary events is 1/6 so the sum of the probabilities is 1/2.
Thus the demonstration that P(E) = 1/2 stems directly from the summation of elementary
events theorem.

Further use of Sets — Event Space


When the sample space is as simple as Ω = {1, 2, 3, 4, 5, 6}, the power set (the set of all
possible subsets) of Ω contains every possible event. The power set is an example of an
event space associated with a sample space. (For a fuller understanding of event space
see exercise 9.)
Consider the two particular events from the event space:

E = {2, 4, 6} and S = {1, 4}

With reference to throwing a die, event E is the outcome of throwing an even number and
event S is the outcome of throwing a perfect square.
With the exception of the event φ, any event is the union of one or more elementary events.
In the case of a fair die, the six elementary events are equiprobable (the probability is 1/6
in each case).
Noting these points and the summation of elementary events theorem:

    P(E) = P({2}) + P({4}) + P({6}) = 3/6

    P(S) = P({1}) + P({4}) = 2/6

    P(E ∩ S) = P({4}) = 1/6

    P(E ∪ S) = P({1, 2, 4, 6}) = P({1}) + P({2}) + P({4}) + P({6}) = 4/6

– 1.5 –
A Venn Diagram Illustration
A Venn diagram neatly illustrates the foregoing results:

[Venn diagram: within Ω = {1, 2, 3, 4, 5, 6}, circle E contains 2, 4 and 6 and circle S contains 1 and 4; the circles overlap at 4; the sample points 3 and 5 lie outside both circles.]

Clearly E ∩ S is the elementary event {4} and the probability of this event is 1/6.
By contrast E ∪ S is the union of four elementary events so its probability is 4/6.

The number of members in a set A is the cardinality of the set and is written as |A|. With
a die the elementary events are equiprobable, so it is easy to see that the probability of
any event A ⊆ Ω is:
    P(A) = |A| / |Ω|

With a die |Ω| = 6. In the examples above, |E| = 3, |S| = 2, |E ∩ S| = 1 and |E ∪ S| = 4
so the probabilities of these events are 3/6, 2/6, 1/6 and 4/6 respectively.
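For equiprobable elementary events the cardinality formula lends itself to a mechanical check.
The following Python sketch (illustrative only; the set names follow the examples above) derives
the four probabilities from set sizes:

    from fractions import Fraction

    omega = {1, 2, 3, 4, 5, 6}           # sample space for a fair die
    E = {2, 4, 6}                        # 'even number thrown'
    S = {1, 4}                           # 'perfect square thrown'

    def prob(event):
        # P(A) = |A| / |Omega| when the elementary events are equiprobable
        return Fraction(len(event & omega), len(omega))

    print(prob(E), prob(S), prob(E & S), prob(E | S))   # 1/2 1/3 1/6 2/3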

Probability as Area
In simple cases, the areas of the different regions in a Venn diagram representation can be
made to reflect probabilities. In the figure, the circle representing event Ω should be taken
to have unit area to reflect that P(Ω) = 1 (certainty) by axiom II. The area of the circle
representing event E is constrained to be one half to reflect that P(E) = 1/2 and the area
of the circle representing event S is constrained to be one third because P(S) = 1/3.
In principle, there could be further subdivisions which gave each elementary event an area
of one sixth.

What if the Elementary Events are not Equiprobable?


The axioms of probability apply generally and there is no requirement for the elementary
events to be equiprobable. With a weighted die, for example, there may be a different
probability of throwing each of the six values. The probability of an event such as {2, 4, 6}
would still be the sum of the probabilities of the elementary events but this would not
necessarily be 1/2.
Clearly additional information is required for a full specification of the properties of a given
die. A simple approach is to present the probability of each elementary event and, from
these values, the probabilities of any event in the power set of events can be computed.

– 1.6 –
Probability Space
From any (straightforward) sample space, the associated power set gives an event space.
When this information is augmented by the probabilities of each individual event the whole
is referred to as the associated probability space.
Clearly there is redundant information in a fully-specified probability space since specifying
the probabilities of the elementary events alone provides sufficient information to calculate
the probabilities of all the other possible events.
Sometimes just the probabilities of certain non-elementary events may be provided and, in
appropriate circumstances, it is possible to derive either the probabilities of the associated
elementary events or the probabilities of other non-elementary events. (See exercise 6.)

Representing the Probabilities of Elementary Events


Even when the elementary events are not equiprobable, the areas of the different regions in
a Venn diagram representation can still be made to correspond to probabilities in simple
cases.
In more ambitious cases there are alternative ways of specifying the separate probabilities
of each elementary event. . .
For each r ∈ Ω the value of P({X = r}) must be specified. One way to achieve this is to
draw up a table. Consider a rather carefully constructed die which is biased so that the
probability of throwing r is proportional to r. An appropriate table of probabilities of the
elementary events is:

                        r →
    X              1      2      3      4      5      6
    P({X = r})    1/21   2/21   3/21   4/21   5/21   6/21

It is well worth developing an instinctive reaction to meeting such a table: check that the
sum of the probabilities of the elementary events is unity. This always has to be the case
since P(Ω) = 1.
P({X = r}) is a function of r and can be described in ordinary mathematical notation:

    P({X = r}) = r/21,  if r ∈ N ∧ 1 ≤ r ≤ 6
                 0,     otherwise

Such a function is sometimes known as a density function but this term is more commonly
used for continuous random variables.

– 1.7 –
The function can be represented graphically:

[Graph: P({X = r}) plotted against r for r = 0 to 6; the bars rise linearly from 0/21 at r = 0 to 6/21 at r = 6.]

A pie chart is another useful way of representing the different probabilities:

[Pie chart: Ω divided into six slices labelled 1 to 6, the slice for r having area r/21.]

The pie as a whole has unit area and the area of each slice is equal to the probability of
the elementary event which labels the slice.
The pie chart also suggests a way of constructing a practical device which will come up with
the required probabilities. The chart could be regarded rather like the card on a magnetic
compass but it should be fitted with a needle which isn’t magnetised. An experiment
consists of spinning the needle and allowing it to come to rest. The label on the slice at
which the head end of the needle then points is assigned to the random variable X.
It would be very hard to construct a die which is biased in the corresponding way!
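A software spinner is easy to arrange, though. The Python sketch below (illustrative; the seed
and the number of spins are arbitrary) samples outcomes with weights proportional to r and
compares the observed proportions with r/21:

    import random
    from collections import Counter

    random.seed(0)
    values = [1, 2, 3, 4, 5, 6]
    weights = [1, 2, 3, 4, 5, 6]                 # P({X = r}) proportional to r

    throws = random.choices(values, weights=weights, k=210000)
    counts = Counter(throws)
    for r in values:
        # the expected proportion is r/21; the observed proportion should be close
        print(r, counts[r] / len(throws), r / 21)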

The Summation of Elementary Events Theorem revisited


Suppose that, using the biased die, the events E and S are again:

E = {2, 4, 6} and S = {1, 4}

The examples given on page 1.5 can be reworked using the new probabilities for the
elementary events:
    P(E) = P({2}) + P({4}) + P({6}) = 12/21

    P(S) = P({1}) + P({4}) = 5/21

– 1.8 –
    P(E ∩ S) = P({4}) = 4/21

    P(E ∪ S) = P({1, 2, 4, 6}) = P({1}) + P({2}) + P({4}) + P({6}) = 13/21
Given the particular biased die and the specified events E and S, the four results just
obtained happen to be related as follows:

    13/21 = 12/21 + 5/21 − 4/21

Accordingly, in these very particular circumstances:

P(E ∪ S) = P(E) + P(S) − P(E ∩ S)

This can be shown to be true generally. . .

The Inclusion-Exclusion Theorem


Suppose A and B are two arbitrary events in some sample space. Consider the following
Venn diagram in which area corresponds to probability:

[Venn diagram: two overlapping circles A and B; the shaded overlap is A ∩ B.]

The area of A ∪ B includes the area of A and the area of B but is not the sum of the areas
of A and B since this includes the shaded region representing A ∩ B twice. In consequence
the sum of the areas of A and B is greater than the area of A ∪ B by an amount equal to
A ∩ B. Translating areas into probabilities gives:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

The sum of the terms P(A) and P(B) on the right-hand side includes the shaded region
twice and the third term excludes the shaded region once. The relationship is, accordingly,
known as the inclusion-exclusion theorem and applies generally. See exercises 10 and 11.
If A and B are exclusive events A ∩ B = φ and the theorem conforms with axiom III. . .

If A ∩ B = φ then P(A ∪ B) = P(A) + P(B)
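The inclusion-exclusion theorem can also be checked numerically for the biased die and the
events E and S used above. A Python sketch (illustrative; it reuses the r/21 probabilities from
page 1.7):

    from fractions import Fraction

    p = {r: Fraction(r, 21) for r in range(1, 7)}    # biased die: P({r}) = r/21
    E = {2, 4, 6}
    S = {1, 4}

    def prob(event):
        # summation of elementary events theorem
        return sum(p[r] for r in event)

    lhs = prob(E | S)
    rhs = prob(E) + prob(S) - prob(E & S)
    print(lhs, rhs, lhs == rhs)                      # 13/21 13/21 True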

– 1.9 –
Conditional Probability
The notation of which P({X = 5}) is an example readily extends to forms such as:

    P({X ≠ 4})   or   P({X > 4})

From the complementary event theorem it quickly follows that:

    P({X ≠ 4}) = 1 − P({X = 4})

The second example, P({X > 4}), is known as a tail probability. In this case the upper
tail-end values 5 and 6 are referred to and assuming the discussion is still about a die:

P({X > 4}) = P({5, 6})

A more important notion, that of conditional probability, arises when asking a question of the
form ‘What is the probability that a perfect square has been thrown given that an even
number has been thrown?’

The notation | is used to mean ‘given’ and the question just posed is more formally written:

    P({1, 4} | {2, 4, 6})

In general: P(B | A) means The probability of event B given event A.

By consideration of probability as area, it is easy to show that:

    P(B | A) = P(B ∩ A) / P(A)

Refer to the Venn diagram on the previous page. Given event A, any sample point has to
be in the circle representing A. For event B to occur, given event A, the sample point has
to be in the shaded region.

The probability on the left-hand side, P(B | A), is therefore the ratio of the area of the
shaded region to the area of the circle representing A. This is the right-hand side.
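For a fair die, the conditional probability P(S | E), the probability of a perfect square given an
even number, can be found by enumeration. A Python sketch (illustrative; E and S are the events
defined on page 1.5):

    from fractions import Fraction

    omega = {1, 2, 3, 4, 5, 6}
    E = {2, 4, 6}                    # even number thrown
    S = {1, 4}                       # perfect square thrown

    def prob(event):
        return Fraction(len(event), len(omega))   # fair die: equiprobable points

    # P(S | E) = P(S ∩ E) / P(E)
    print(prob(S & E) / prob(E))     # 1/3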

Mapping
The outcome of a throw of a die is necessarily a number. This makes it possible to make
informal assertions such as ‘on average the outcome is 3½’.
Very often an outcome is not numerical, for example, ‘the horse that wins the 3 o’clock
at Newmarket’. The P({X = r}) notation extends readily to non-numerical values for r.
Consider a particular case. . .
Suppose a bag contains six balls, one red, two white and three blue and, on picking out
a ball at random, each of the six balls is equally likely to be selected. One can readily
imagine betting on the colour of the ball that is picked out.

– 1.10 –
To analyse this example, the sample space Ω = {red, white, blue} and for each r ∈ Ω (that
is for each of the three elementary events) the value of P(X = r) can be tabulated:
                        r →
    X              red    white    blue
    P({X = r})     1/6     2/6      3/6

Clearly the probabilities can equally be represented by a graph or by a pie chart. Questions
like what is the probability of picking out a ball that is either white or blue can now readily
be answered.
With non-numerical values it is normally meaningless to ask about the average outcome.
This is not something to worry about immediately but many techniques in probability
theory assume that any outcome is a number. In particular, many formulae in probability
theory involve summing over r and, with discrete random variables, it is usually assumed
that each r ∈ N.
There is a simple way to accommodate the requirement for r to be a number. This is to
map the possible outcomes onto the set of numbers {0, 1, 2, . . .}. The mapping is often
arbitrary and in the present case assigning red=0, white=1 and blue=2 is as good as any
other mapping. The table would then become:
                        r →
    X               0      1      2
    P({X = r})     1/6    2/6    3/6

Note that r ranges from 0 upwards (and not 1 upwards). Using 0 as the starting value of
r is assumed in many probability formulae so it is unfortunate that the values indicated
on a conventional die run from 1 to 6 and not 0 to 5.
It would be altogether eccentric to insist that the faces of any die used in this course carried
the values 0 to 5 and an alternative ploy will sometimes be used. It will be assumed that
the experiment of throwing a die can yield seven different elementary events. These are
the members of the set {0, 1, 2, 3, 4, 5, 6} with the added proviso that P({X = 0}) = 0.

Glossary
A number of technical terms have been introduced. Those which relate to probability (as
distinct from set theory) are, in order of first appearance:
experiment       sample space          exclusive
trial            sample point          exhaustive
fair             event                 event space
unbiased         elementary event      probability space
outcome          random variable       density function
equiprobable     discrete              tail probability
frequency        continuous            conditional probability
Check that you understand each of them.

– 1.11 –
Exercises — I
Whenever possible (and appropriate) work in fractions and not in decimals. Show the
probability of the event {2, 3, 5, 6} when a fair die is thrown as 2/3 and not as 0.6667 or
some such.
1. An insurance company is interested in the age distribution of couples. Let A be the
event ‘husband is older than 40’, B be the event ‘husband is older than wife’ and C be
the event ‘wife is older than 40’. Describe the following events in plain English:
(a) A ∩ B ∩ C (b) A − A ∩ B (c) A ∩ B̄ ∩ C

2. Given sample space Ω = {1, 2, 3, 4, 5, 6}, write down:


(a) Two events which are exclusive but not exhaustive.
(b) Two events which are exhaustive but not exclusive.
(c) Two events which are neither exclusive nor exhaustive.

3. Consider a random variable X whose value represents the outcome of throwing the
biased die described on pages 1.7 and 1.8. Determine the following probabilities:
(a) P({X = 1} ∪ {X = 2}).
(b) P({X < 3}).
(c) P({X 6 2}).

4. Among the digits 1, 2, 3, 4, 5 first one is chosen and then a second selection is made
among the four remaining digits. List the 20 possible elementary events. Assume
the 20 are equiprobable then find the probability of the following events:
(a) An odd digit is selected first time.
(b) An odd digit is selected second time.
(c) An odd digit is selected both times.

5. The four digits 1, 2, 3, 4 are arranged in some random order. List the 24 possible
elementary events. Assume the 24 are equiprobable. Let Ai be the event ‘digit i
appears in its natural place’ (note i ∈ {1, 2, 3, 4}). Verify the formula:

    P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2)

6. [From Grimmett and Welsh] Let A, B and C be three events such that:
    P(A) = 5/10        P(B) = 7/10        P(C) = 6/10
    P(A ∩ B) = 3/10    P(B ∩ C) = 4/10    P(C ∩ A) = 2/10
    P(A ∩ B ∩ C) = 1/10

By drawing a Venn diagram or otherwise, find the probability that exactly two of the
events A, B and C occur.

– 1.12 –
7. A card is selected at random from a normal deck of 52 playing cards. The elementary
events can be represented by the set {1♠, . . . , K♣} and these events may be mapped
onto the set {1, . . . , 52}. Assume that the 52 elementary events are equiprobable.
Let A be the elementary event 4♦, B be the event ‘any ace’ and C be the event ‘any
diamond’. Determine the following probabilities:

    P(A)   P(Ā)   P(B)   P(C)   P(B ∪ C)   P(B ∩ C)   P(A | B)   P(B | C)
8. You overhear a couple talking about a visit to an exclusive girls-only school where they
have recently been entertained as prospective parents. You gather that they have two
children (and no more are expected). What is the probability that both children are
girls?
You may assume:
(a) Boys and girls are born equiprobably.
(b) The sexes of the different children of the same parents are independent.
The first assumption is only slightly erroneous (in Britain the ratio, at birth, of boys
to girls is about 515:485). The second assumption is the subject of debate but it is
very widely accepted.

9. Grimmett and Welsh define an event space associated with a sample space Ω as any set
of subsets of Ω that satisfies certain constraints. For E to be an event space associated
with sample space Ω, the constraints are:

(i ) E must be non-empty.
(ii ) If A ∈ E then Ā ∈ E.
(iii ) If A1 , A2 , A3 . . . ∈ E then A1 ∪ A2 ∪ A3 ∪ . . . ∈ E.

Taking Ω = {1, 2, 3, 4, 5, 6}, consider the smallest associated event space E that
includes the event {1}. Clearly, {2, 3, 4, 5, 6} ∈ E by (ii ). Further, by (iii ),
{1, 2, 3, 4, 5, 6} = Ω ∈ E. By (ii ) again, φ ∈ E, so E = {φ, {1}, {2, 3, 4, 5, 6}, Ω}.
By a similar argument, the smallest event space E that includes the arbitrary event
A is E = {φ, A, Ā, Ω}. One notes that φ and Ω are necessarily members of any event
space associated with sample space Ω.
An event space E is said to be ‘closed under the operations of taking complements
and countable unions’ and as a special case the power set of any sample space is an
event space.
Given sample space Ω = {1, 2, 3, 4, 5, 6}, determine the smallest possible event space
E in each of the following cases:
(a) The events {1} and {2} are members of E.
(b) The events {1, 2} and {3, 4} are members of E.
(c) The events {1, 2} and {1, 3} are members of E.
(d ) The events A and B are members of E.

– 1.13 –
10. If A1 and A2 are two arbitrary events in some sample space, the inclusion-exclusion
theorem, discussed on page 1.9, can be expressed as:

P(A1 ∪ A2 ) = P(A1 ) + P(A2 ) − P(A1 ∩ A2 )

By using a three-set Venn diagram or otherwise, show that for three arbitrary events A 1 ,
A2 and A3 in some sample space the inclusion-exclusion theorem can be expressed as:

P(A1 ∪ A2 ∪ A3 ) = P(A1 ) + P(A2 ) + P(A3 )


− P(A1 ∩ A2 ) − P(A1 ∩ A3 ) − P(A2 ∩ A3 )
+ P(A1 ∩ A2 ∩ A3 )

11. By considering the effect of introducing a fourth event A4 , state the inclusion-exclusion
theorem as it applies to four arbitrary events A1 , A2 , A3 and A4 in the same sample space:

P(A1 ∪ A2 ∪ A3 ∪ A4 ) = ?

– 1.14 –
2 — TWO OR MORE RANDOM VARIABLES

Many problems in probability theory relate to two or more random variables. You might
have equipment which is controlled by two computers and you are interested in knowing
the probability of both computers failing at once. An important consideration is whether
or not the computers fail independently. A hardware failure in one computer is unlikely
to provoke a failure in the other but a power cut might cause the simultaneous failure of
both computers.
Let’s start by considering two dice. . .

Two Random Variables


When two dice are thrown, you can either think about a single outcome which is one of
the 36 possibilities 1,1 to 6,6 or you can think of a separate outcome for each die. It is
more usual to take the second view and, in any analysis, use two random variables, each
of which can take one of the 6 possibilities 1 to 6.

The Multiplication Theorem of Probability


What is the probability of throwing two 6s? If both dice are fair, the answer is 1/36. This
result is an application of the multiplication theorem of probability which, in the present
case, says that the overall probability is the product of the individual probabilities. The
probability of the first die showing 6 is 1/6 and the probability of the second die showing 6
is also 1/6 and the product 1/6 × 1/6 = 1/36.

Independence and the Multiplication Theorem — I


The multiplication theorem requires the two throwings to be independent. Informally this
means that the outcome from one die has no influence on the outcome of the other. It
seems most unlikely that one die could affect the other but it is easy to imagine a closely
analogous case where independence couldn’t be so readily assured.
Consider a two-drum fruit machine where, instead of sporting fruit icons, each drum has
six segments numbered 1 to 6. If the machine were well set up the two drums would be
independent and behave like two dice but if grit gets into the works independence may
be compromised. In an extreme case the two drums might stick together in such a way
that they always showed the same number; the probability of obtaining two 6s now would
simply be 16 .

Independence — Formal Definition


Two events A and B are said to be independent if and only if:

    P(B | A) = P(B)

In simple terms, knowing that event A has occurred makes it neither more likely nor less
likely that event B has occurred. In particular, knowing that the first die shows 6 makes
it neither more nor less likely that the second die shows 6.

– 2.1 –
Independence can be an important consideration even when there is just a single random
variable. Suppose a card is selected from a pack of cards. The probability of the event ‘an
ace has been drawn’ is 1/13 and the probability that ‘an ace has been drawn given that a
diamond has been drawn’ is also 1/13. Algebraically:

    P(ace | diamond) = P(ace)

‘Drawing an ace’ and ‘drawing a diamond’ are independent events. By contrast, ‘drawing
the ace of diamonds’ and ‘drawing a diamond’ are not independent events:
    P(1♦ | diamond) ≠ P(1♦)   since   1/13 ≠ 1/52

Knowing that you have a diamond makes it more likely that you have the ace of diamonds.

Independence and the Multiplication Theorem — II


If events A and B are independent then, by definition:

    P(B | A) = P(B)

Hence:

    P(B ∩ A) / P(A) = P(B)

Noting that set intersection is commutative:

    P(A ∩ B) = P(A) . P(B)   so   P(A ∩ B) / P(B) = P(A)   and   P(A | B) = P(A)

Unsurprisingly, independence (like set intersection) is commutative.


Note that P(A∩B) = P(A) . P(B) is the multiplication theorem. Consider two illustrations:

    P(diamond ∩ ace) = P(diamond) . P(ace) = 1/4 × 1/13 = 1/52

and:

    P(first die 6 ∩ second die 6) = P(first die 6) . P(second die 6) = 1/6 × 1/6 = 1/36

The relationship P(A ∩ B) = P(A) . P(B) will not, of course, be satisfied if A and B are
not independent:

    P(1♦ ∩ diamond) ≠ P(1♦) . P(diamond)   since   1/52 ≠ 1/52 × 1/4

– 2.2 –
Distinguishability
Again considering two dice, what is the probability of throwing a 1 and a 5? Here there
is scope for debate. Do you consider 5 and 1 to be the same as 1 and 5? In probability
theory you ask whether or not the two dice are distinguishable. If the dice are of different
colours or are thrown one into a left-hand bucket and the other into a right-hand bucket,
then the dice are distinguishable. If there really is no way of telling which is which (or you
don’t care) then they are deemed indistinguishable.
If the two dice are distinguishable then clearly the probability of throwing a 1 and a 5 is
just the same as throwing two 6s and the answer is 1/36.
If the dice are indistinguishable then the separate probabilities for 1,5 and 5,1 can be added
and the overall probability is 1/18. Spelt out, the probability is:

    1/6 × 1/6 + 1/6 × 1/6
This simple expression uses the multiplication theorem twice and Axiom III (the addition
rule) once. The addition rule is applicable since the outcomes 1,5 and 5,1 are exclusive.
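The same answer drops out of brute-force enumeration of the 36 ordered outcomes. A Python
sketch (illustrative only):

    from fractions import Fraction
    from itertools import product

    # every ordered outcome (first die, second die) is equiprobable: 1/36
    outcomes = list(product(range(1, 7), repeat=2))

    hits = [o for o in outcomes if sorted(o) == [1, 5]]   # 1,5 in either order
    print(len(hits), Fraction(len(hits), len(outcomes)))  # 2 1/18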

A Table for two Random Variables


When there are two random variables it is common to use the names X and Y and the
values r and s respectively. One way of tabulating the probabilities is:

                     Y   s →
                0     1     2     3     4     5     6
         0      0     0     0     0     0     0     0       0/6
         1      0    1/36  1/36  1/36  1/36  1/36  1/36     1/6
   X     2      0    1/36  1/36  1/36  1/36  1/36  1/36     1/6
   r     3      0    1/36  1/36  1/36  1/36  1/36  1/36     1/6     P({X = r})
   ↓     4      0    1/36  1/36  1/36  1/36  1/36  1/36     1/6     ↓
         5      0    1/36  1/36  1/36  1/36  1/36  1/36     1/6
         6      0    1/36  1/36  1/36  1/36  1/36  1/36     1/6

               0/6   1/6   1/6   1/6   1/6   1/6   1/6

                                             P({Y = s}) →

For convenience, r and s are arranged to start from 0 rather than 1 (see page 1.11) and
the probabilities in the first row (when r = 0) and first column (when s = 0) of the array
are all zero. The 36 other entries are all, of course, 1/36.
Against the right-hand edge there are seven marginal sums. Each value is the sum of the
seven probabilities in the associated row. All the rows, except the first, sum to 1/6. Against
the bottom edge there are, likewise, seven marginal sums for the columns.

– 2.3 –
Extending the Notation P({X = r})
With a single die, there are 6 elementary events (or 7 if the contrived special case of
throwing a zero is counted). {X = r} corresponds to an elementary event and P({X = r})
is the probability of that event.
With two dice, there are 36 elementary events (or 49 if the contrived special cases are
counted). {X = r} still corresponds to an event but no longer to an elementary event.
An elementary event now is the intersection of two events such as {X = 4} and {Y = 2}
written as {X = 4} ∩ {Y = 2}.
Going the other way, the event {X = 4} is the union of the seven elementary events
{X = 4} ∩ {Y = 0}, {X = 4} ∩ {Y = 1}, up to {X = 4} ∩ {Y = 6}. This may be written:
    {X = 4} = ∪_{s=0}^{6} ({X = 4} ∩ {Y = s})

The P({X = r}) notation can readily be extended to accommodate the new circumstances:
P({X = 4} ∩ {Y = 2}) means The probability of throwing a 4 and a 2.
P({X = r} ∩ {Y = s}) means The probability of throwing an r and an s.
By the addition rule, the probability of {X = 4} is the sum of the probabilities of the
elementary events whose union is {X = 4}:
    P({X = 4}) = Σ_{s=0}^{6} P({X = 4} ∩ {Y = s})

This is one of the marginal sums in the tabular representation on the previous page.
The notation is rather cumbersome and it is common to use P(X = r) for P({X = r})
and. . .
P(X = r, Y = s) or even pr,s for P({X = r} ∩ {Y = s})

The above summation is then more concisely expressed as:


    P(X = 4) = Σ_s p4,s

Notice that ‘summing over s’ is taken to mean summing over all valid s starting from zero.
The seven marginal sums at the right-hand edge can be expressed:
    P(X = r) = Σ_s pr,s   where r = 0, 1, . . . , 6

The Double-Sigma Notation


The sum of the probabilities of all 36 (or 49) elementary events has to be one and this sum
is written:

    Σ_r Σ_s pr,s = 1

– 2.4 –
The double-sigma notation is used rather as nested FOR-loops are in a programming
language. Thus:
    Σ_r Σ_s pr,s = Σ_r (pr,0 + pr,1 + pr,2 + pr,3 + pr,4 + pr,5 + pr,6)

The expression could be expanded further by writing the bracketed item on the right-hand
side 7 times over and replacing r by 0 in the first, by 1 in the second, and so on.
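The analogy with nested FOR-loops can be made literal. The Python sketch below (illustrative;
it rebuilds the two-dice table from the previous page) sums pr,s over both indices and confirms
that the total is 1:

    from fractions import Fraction

    # p[r][s] for two fair dice, with r and s running from 0 to 6;
    # row 0 and column 0 hold the contrived zero-probability entries
    p = [[Fraction(0) if r == 0 or s == 0 else Fraction(1, 36)
          for s in range(7)] for r in range(7)]

    total = Fraction(0)
    for r in range(7):          # outer loop corresponds to the outer sigma
        for s in range(7):      # inner loop corresponds to the inner sigma
            total += p[r][s]
    print(total)                # 1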

Independence of Random Variables — I


In the earlier discussion of independence the term was applied to a pair of events. The
term can be extended to the idea of a pair of random variables being independent.
The two random variables X and Y are said to be independent if and only if:

P(X = r, Y = s) = P(X = r) . P(Y = s) for all r, s

A simple inspection of the probabilities of the elementary events in the table shows this
to be true. The probability shown in every cell is the product of the marginal sum at the
end of the row and the marginal sum at the bottom of the column. In most cases, the
product is 1/6 × 1/6 but for cells in the first row or first column at least one of the terms in
the multiplication is zero.

Conditional Probability
Consider just the Y = 2 column extracted from the tabular representation of the 49
elementary events:
              Y = 2
        0      0/36
        1      1/36
        2      1/36
        3      1/36
    X = 4      1/36   ← highlighted entry
        5      1/36
        6      1/36

The highlighted item shows the probability of the elementary event {X = 4} ∩ {Y = 2}.
The probability is, of course, 1/36.
Now consider a different question. What is the probability of this elementary event if we
know that Y = 2 ? This extra knowledge means that any elementary event has to be in
the Y = 2 column and the required probability can be expressed:
    P({X = 4} | {Y = 2}) = (1/36) / (0/36 + 1/36 + 1/36 + 1/36 + 1/36 + 1/36 + 1/36)

– 2.5 –
This is the ratio of the probability of the elementary event of interest to the sum of the
probabilities of all the elementary events in the column.
This can be written more generally:
    P({X = 4} | {Y = 2}) = P({X = 4} ∩ {Y = 2}) / [Σ_r P({X = r} ∩ {Y = 2})] = p4,2 / [Σ_r pr,2]

The summation is down the Y = 2 column.

Bayes’s Theorem
There is nothing special about 2 and 4 of course and the most general form of the previous
expression is:
    P({X = r} | {Y = s}) = P({X = r} ∩ {Y = s}) / [Σ_k P({X = k} ∩ {Y = s})] = pr,s / [Σ_k pk,s]        (2.1)

Note that k is used in the summations to avoid using r in two ways.


This version leads naturally to a consideration of a most important theorem due to Bayes.
The problem addressed by Bayes’s Theorem is this:

    Can you derive P(A | B) from P(B | A)?

First recall that if A and B are two events:


    P(B | A) = P(B ∩ A) / P(A)
Noting that the set intersection operator is commutative:

    P(A ∩ B) = P(B | A) . P(A)

Given this, (2.1) can be rewritten:



    P({X = r} | {Y = s}) = P({Y = s} | {X = r}) . P({X = r}) / [Σ_k P({Y = s} | {X = k}) . P({X = k})]

Now let A be the event {X = r}, let B be the event {Y = s} and let Ak be the event
{X = k} then substitute:

    P(A | B) = P(B | A) . P(A) / [Σ_k P(B | Ak) . P(Ak)]        (2.2)

This relationship will be the version of Bayes’s Theorem used on this course.
Bayes’s Theorem has many applications and a conspicuously successful use has been in
speech processing.

– 2.6 –
Illustration of Bayes’s Theorem
Suppose the children in a school are classified in two ways, boys/girls and fair-haired/dark-
haired. The following information is available:

Two-thirds of the children are boys of whom half are fair-haired


One-third of the children are girls of whom three-quarters are fair-haired

You are asked to determine the probability that a fair-haired child is a boy.
Notice what information is missing. You don’t know what proportion of children are fair-
haired never mind what proportion of those are boys. To set up a 2×2 table of probabilities
would require undertaking a fair amount of arithmetic. It is in cases like this (or more
ambitious versions of this) that Bayes’s Theorem can be useful.
First consider what information is available:
    P(boy) = 2/3    P(fair | boy) = 1/2    P(girl) = 1/3    P(fair | girl) = 3/4

The information required is P(boy | fair).
These are exactly the circumstances for making use of Bayes’s Theorem; you want to derive
P(A | B) from P(B | A).
The next step is to set up two random variables: let X represent boy/girl and Y represent
fair/dark. Each random variable can take two values and to make use of the summation
in Bayes’s Theorem the two values in each case need to be 0 and 1. Map as follows:

    Let X = 0 ↦ boy   and let X = 1 ↦ girl

    Let Y = 0 ↦ fair   and let Y = 1 ↦ dark

The supplied information now translates into:


    P(X = 0) = 2/3    P({Y = 0} | {X = 0}) = 1/2    P(X = 1) = 1/3    P({Y = 0} | {X = 1}) = 3/4

The information required translates into P({X = 0} | {Y = 0}).
By Bayes’s Theorem:

    P({X = 0} | {Y = 0}) = P({Y = 0} | {X = 0}) . P({X = 0}) / [Σ_k P({Y = 0} | {X = k}) . P({X = k})]

Substituting the known values:


    P({X = 0} | {Y = 0}) = (1/2 × 2/3) / (1/2 × 2/3 + 3/4 × 1/3) = (1/3) / (1/3 + 1/4) = 4/7

This is the required answer.
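The same arithmetic can be packaged as a small Python function implementing (2.2). This is
only a sketch of the calculation just performed; the function and dictionary names are invented
for the illustration:

    from fractions import Fraction

    def posterior(prior, likelihood, target):
        # Bayes's Theorem (2.2): P(A|B) = P(B|A).P(A) / sum_k P(B|A_k).P(A_k)
        numerator = likelihood[target] * prior[target]
        denominator = sum(likelihood[k] * prior[k] for k in prior)
        return numerator / denominator

    prior = {'boy': Fraction(2, 3), 'girl': Fraction(1, 3)}          # P(X = k)
    p_fair_given = {'boy': Fraction(1, 2), 'girl': Fraction(3, 4)}   # P(fair | X = k)

    print(posterior(prior, p_fair_given, 'boy'))                     # 4/7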

– 2.7 –
Tree Diagrams
One way to represent the information given in the problem about the children is to use a
tree:

    P(boy) = 2/3
        P(fair | boy) = 1/2      P(fair | boy) . P(boy) = P(fair ∩ boy) = 1/2 × 2/3 = 1/3
        P(dark | boy) = 1/2      P(dark | boy) . P(boy) = P(dark ∩ boy) = 1/2 × 2/3 = 1/3

    P(girl) = 1/3
        P(fair | girl) = 3/4     P(fair | girl) . P(girl) = P(fair ∩ girl) = 3/4 × 1/3 = 1/4
        P(dark | girl) = 1/4     P(dark | girl) . P(girl) = P(dark ∩ girl) = 1/4 × 1/3 = 1/12

The tree contains the four known values and the calculations to the right of the tree show
four derived values. These are the probabilities of the four elementary events and they can
be presented in a 2 × 2 table:

               fair    dark
    boy        1/3     1/3       2/3
    girl       1/4     1/12      1/3
              7/12     5/12

Once the probabilities of the elementary events are known and entered in the table, the
marginal sums can be calculated. The two values at the right-hand edge which correspond
to P(boy) and P(girl) were given in the problem and it is worth checking to see that the
calculated values are the same. It goes without saying that these should total 1.
The marginal sums at the bottom edge correspond to P(fair) and P(dark). These values
also total 1. They are not given in the problem and have not been computed before.
Setting up the table in this way provides an alternative to using Bayes’s Theorem to solve
the original problem. The two probabilities in the left-hand column at once show that:
    P(boy | fair) = (1/3) / (1/3 + 1/4) = (1/3) / (7/12) = 4/7

Setting up the table is hard work and it is really very much easier to use Bayes’s Theorem.

– 2.8 –
Independence of Random Variables — II
In the two dice example, the two random variables were independent because:

P(X = r, Y = s) = P(X = r) . P(Y = s) for all r, s

In the problem about the children, simple inspection shows that this identity fails in all
four cases:

    1/3 ≠ 2/3 × 7/12     1/3 ≠ 2/3 × 5/12     1/4 ≠ 1/3 × 7/12     1/12 ≠ 1/3 × 5/12

Quite clearly the two random variables in this example are not independent.
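The failure of independence can also be confirmed mechanically. A Python sketch (illustrative;
the joint probabilities are those tabulated above):

    from fractions import Fraction

    # joint probabilities P(X = x, Y = y) for the school example
    joint = {('boy', 'fair'): Fraction(1, 3),  ('boy', 'dark'): Fraction(1, 3),
             ('girl', 'fair'): Fraction(1, 4), ('girl', 'dark'): Fraction(1, 12)}

    # marginal sums
    p_x = {x: sum(v for (a, _), v in joint.items() if a == x) for x in ('boy', 'girl')}
    p_y = {y: sum(v for (_, b), v in joint.items() if b == y) for y in ('fair', 'dark')}

    # independence would require joint == product of marginals in every cell
    for (x, y), v in joint.items():
        print(x, y, v == p_x[x] * p_y[y])      # False in all four cases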

The Inverse Tree


The first branching in the tree representation on the previous page indicates the division
into boys and girls. Now that the table has been constructed, the inverse tree can be
constructed. In this the first branching indicates the division into fair and dark:

    P(fair) = 7/12
        P(boy | fair) = 4/7
        P(girl | fair) = 3/7

    P(dark) = 5/12
        P(boy | dark) = 4/5
        P(girl | dark) = 1/5


The value of P(boy | fair) was determined via Bayes’s Theorem earlier and from the table
more recently. The other values in the inverse tree are derived from the table.

Event Trees
There are many occasions in probability theory when a tree is a good way of presenting
information. A second common form of tree is the event tree.
Consider a family in which there are two children. In how many ways can the two sexes
occur? Trivially, the first child may be a boy or a girl and, in each case, the second child
might also be a boy or a girl. This gives rise to four cases (events) which can be represented
as BB, BG, GB and GG.

– 2.9 –
Suppose the probability of a child being a boy is p and the probability of a child being a
girl is q where p + q = 1. In Britain, p ≈ 0.515 and q ≈ 0.485. Assuming the sex of the
second child is independent of the sex of the first child, the multiplication theorem can be
used and P(BB) = p².
The probabilities of the other events are P(BG) = pq, P(GB) = qp and P(GG) = q².
This information can be represented in an event tree:

    P(B) = p
        P(BB) = p²
        P(BG) = pq

    P(G) = q
        P(GB) = qp
        P(GG) = q²

The two events BG and GB are clearly distinguishable but have equal probabilities given
that pq = qp.
The sum of the probabilities associated with two children is:
    p² + pq + qp + q² = p² + 2pq + q² = (p + q)² = 1
Once again, remember that it is good practice to check that the total probability is 1.
The coefficients of the terms in the second expression are 1, 2 and 1 and these are the
numbers of ways of having two boys, one of each, and two girls respectively. They are the
values in row two of Pascal’s Triangle and are also known as Binomial coefficients.

Combinatorial Numbers — Pascal’s Triangle


Here are the first five rows of Pascal’s Triangle shown in two ways:
            1                              0C0
          1   1                         1C0   1C1
        1   2   1                    2C0   2C1   2C2
      1   3   3   1               3C0   3C1   3C2   3C3
    1   4   6   4   1          4C0   4C1   4C2   4C3   4C4

The rows are numbered from 0 and the values in row two are 1, 2 and 1 as noted in the
previous section. In row n there are n + 1 values which are numbered 0 to n.

– 2.10 –
In the version on the right, a given value is identified by n Cr where n is the row number
and r is the position in the row. n Cr is the number of ways in which there can be r girls
in a family of n children.
The symmetry of the values in Pascal’s Triangle means that, in general, n Cr = n Cn−r .
For example, 4 C0 = 4 C4 and 4 C1 = 4 C3 . In words: the number of ways there can be
n − r girls in n children, n Cn−r , is the same as the number of ways there can be r boys
in n children which is exactly the same as the number of ways there can be r girls in n
children, n Cr .
As already noted, with two children there are 2 ways in which one may be a girl, BG and
GB, and 2 C1 = 2. Pascal’s Triangle shows that with four children there are 6 ways in which
two may be girls, 4 C2 = 6. The values in Pascal’s triangle are examples of combinatorial
numbers and the C stands for Combinations.
Pascal’s Theorem states that:
    n+1Cr+1 = nCr + nCr+1

Informally, each value is the sum of the two nearest values in the row above. This doesn’t
apply to the edge values which are all 1.
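Pascal’s Theorem translates directly into a way of generating the triangle row by row. A Python
sketch (illustrative only):

    def next_row(row):
        # Pascal's Theorem: each interior value is the sum of the two nearest
        # values in the row above; the edge values are always 1
        return [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]

    row = [1]                  # row 0
    for n in range(5):
        print(n, row)          # rows 0 to 4, ending with [1, 4, 6, 4, 1]
        row = next_row(row)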

Grid or Lattice Representation


Another useful form of representation when analysing problems in probability theory is to
use a grid or lattice. Here is an example which can be used to determine combinations of
boys and girls:
[Grid: columns labelled girls 0 to 4 (→), rows labelled boys 0 to 4 (↓); five blobs lie on the diagonal where boys + girls = 4, and one right-down-down-right path from the top left-hand corner to the middle blob is marked.]

The five blobs constitute the sample space of elementary events when there are four children
and boy-girl order is not relevant. The five blobs represent 4 boys and 0 girls, 3 boys and
1 girl and so on.
This representation can be used to determine combination values. For example, suppose
you wish to determine 4 C2 , the number of ways that there can be two girls in four children.
You simply count the number of ways of getting from the top left-hand corner to the
appropriate blob using only rightward and downward steps.
In the present case, the middle blob of the five represents two girls in four children and
one path from the top left-hand corner to this blob is marked. This one goes right-down-
down-right. Check that you can find the other five routes.
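Counting such routes can be automated. The Python sketch below (illustrative; the function name
is invented) counts the rightward and downward paths and confirms that 4C2 = 6:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def paths(rights, downs):
        # routes from the top left-hand corner using only rightward and downward steps
        if rights == 0 or downs == 0:
            return 1
        return paths(rights - 1, downs) + paths(rights, downs - 1)

    # the middle blob (two girls in four children) is 2 steps right and 2 steps down
    print(paths(2, 2))    # 6, in agreement with 4C2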

– 2.11 –
The Binomial Theorem
The Binomial Theorem is essentially the expansion of (x + y)^n:

    (x + y)^n = (n 0) x^n + (n 1) x^(n−1) y + (n 2) x^(n−2) y² + · · · + (n r) x^(n−r) y^r + · · · + y^n

The item (n r), written with n above r inside large round brackets and pronounced ‘n choose r’, is
a Binomial coefficient and has the same value as nCr, a combination. The two forms can be
expressed using factorials:

    (n r) = nCr = n! / (r! (n − r)!)

Glossary
The following technical terms have been introduced:
independent         Pascal’s Triangle        combination
distinguishable     Binomial coefficient     grid
marginal sum        combinatorial number     lattice
event tree
Exercises — II
Again work in fractions for all the following problems which have a numerical content.
Treat, for example, 99% as 99/100 and not as 0.99.
1. A bag contains one ball which, equiprobably, may be black or white. A white ball is
added to the bag which is then shaken. One ball is retrieved at random and found to
be white. What is the probability that the other ball is white?
2. A shopkeeper obtains light bulbs from three suppliers, A1 , A2 and A3 and has noted
the following information about the bulbs:
• 30% come from supplier A1 and, of these, 1% don’t work.
• 45% come from supplier A2 and, of these, 1% don’t work.
• 25% come from supplier A3 and, of these, 2% don’t work.

For a bulb selected at random, consider two random variables S (the Supplier which
may be A1 , A2 or A3 ) and D (the Dudness which may be G or B for Good or Bad).
Note that the probability of a randomly selected light bulb from Supplier A1 being
Bad can be expressed as P(D = B | S = A1) = 1/100.

Tasks:
(a) Express the information provided in Tree Representation. From the root of the
tree there should be three branches (one for each supplier). Label these branches
with probabilities 30/100, 45/100 and 25/100 respectively. From each node (there is one at
the end of each branch) draw two more branches; these should be labelled with
the appropriate conditional probabilities of Dudness.

– 2.12 –
(b) From the tree representation, derive a Tabular Representation. The table should
have three rows (one for each Supplier) and two columns (headed G and B for
Dudness). Each of the six cells in the table should show the probability of a
randomly selected bulb having come from the relevant Supplier and having the
relevant Dudness.
30 45 25
(c) Mark in the row totals. These should be 100 , 100 and 100 .

(d ) Mark in the column totals. Note that the total of the B column shows the overall
probability of a randomly selected bulb being Bad.
(e) Given that a randomly selected bulb is Bad, what is the probability that it came
from Supplier A3 ?
(f ) From the tabular representation derive the alternative tree representation. From
the root of this tree there should be two branches (one for Good and one for Bad)
labelled with the two appropriate probabilities (the Bad branch should be labelled
with the probability determined in Task d ). From each node there should be three
branches (one for each Supplier) each labelled with the appropriate conditional
probability.
(g) Redraw the tabular representation, but replace the numerical values in the six
cells by expressions of the form P(B | A1) × P(A1) (the first part of this being an
abbreviation of the expression given in the preamble).
(h) Use the revised tabular representation to illustrate Bayes’s Theorem.

3. One person in 1000 is known to suffer from Nerd’s Syndrome (a pathological inability
to resist playing computer games). A standard test is such that 99% of those who
suffer from Nerd’s Syndrome show a positive result. The same test also (falsely) shows
positive with 2% of non-sufferers.
A person selected at random is tested and found positive. What is the probability
that this person actually suffers from Nerd’s Syndrome?

4. A coin is tossed until the first time the same result appears twice in succession. To
every possible outcome requiring n tosses attribute equal probability. Describe the
sample space. Find the probability of the following events: (a) the experiment ends
before the sixth toss, (b) an even number of tosses is required.

5. What is the probability that two throws with three dice show the same configuration
if (a) the dice are distinguishable and (b) they are not.

6. From first principles, prove Pascal’s Theorem that n+1 Cr+1 = n Cr + n Cr+1 . Do not
make use of the representation which uses factorials. Instead reason as follows:
Assume that n+1 Cr+1 really is the number of ways of having r + 1 girls in a family of
n + 1 children and consider, separately, the number of these ways where the last child
is a girl and the number of these ways where the last child is a boy.

– 2.13 –
7. [From Part IA of the Natural Sciences Tripos, 1996 (Elementary Mathematics for
Biologists)] Alice is observing the White Knight and the Red Knight, who are fighting
for the privilege of taking her prisoner. They have agreed to observe the Rules of
Battle, which are as follows:
In each round, the White Knight hits the Red Knight with probability 2/10, and the
Red Knight hits the White Knight with probability 3/10. It is not possible for both to
score a hit in the same round. If the Red Knight is hit, he falls off his horse with
probability 6/10; if the White Knight is hit, he falls off his horse with probability 8/10.
A Knight who scores a hit never falls off his horse but if the Knights miss one another
then either may fall off his horse from surprise; they do so independently, the Red
Knight with probability 4/10, and the White Knight with probability 6/10.
If they both fall off in the same round the battle ends; they shake hands and walk
away (forgetting about Alice).
In all other cases the Knights proceed to a further round.
(a) What is the probability that the Red Knight falls off his horse in the first round?
(b) What is the probability that both fall off in the first round?
(c) What is the probability that the battle ends after exactly three rounds?
8. Four married couples go to a wife-swapping party. The names of the wives are thrown
into a hat, shuffled, and then distributed randomly one name to each husband. Each
man goes home with the woman allocated to him. What is the probability that no
husband goes home with his own wife?
9. Suppose n couples go to a wife-swapping party organised as in the previous question.
What now is the probability that no husband goes home with his own wife? It is
suggested that you proceed as follows. . .
Suppose Ai is the event ‘Man i goes home with his own wife’.
Then A1 ∪ A2 . . . ∪ An is the event ‘At least one man goes home with his own wife’.
The required probability (that no man goes home with his own wife) can then be
expressed as:
Probability = 1 − P(A1 ∪ A2 . . . ∪ An )
This requires recourse to the inclusion-exclusion theorem and, in the four-couple case
of the previous question, one may exploit the solution to Exercise I, question 11, to
obtain:
    Probability = 1                                     1 term, value 1
                 − [P(A1) + P(A2) + . . .]              4 terms, each 1/4
                 + [P(A1 ∩ A2) + . . .]                 6 terms, each 1/4 × 1/3
                 − [P(A1 ∩ A2 ∩ A3) + . . .]            4 terms, each 1/4 × 1/3 × 1/2
                 + [P(A1 ∩ A2 ∩ A3 ∩ A4)]               1 term, value 1/4 × 1/3 × 1/2 × 1/1

Evaluate this expression and confirm that the result agrees with your answer to the
four-couple case. It should now be easy to write down the general expression for the
n-couple case which you should then prove by induction.

– 2.14 –
3 — DISCRETE DISTRIBUTIONS

It is always helpful when solving a problem to be able to relate it to a problem whose


solution is already known and understood. In probability theory, many problems turn out
to be special cases of standard examples. The most common standard examples are the
well-known distributions.

Discrete Distributions
In simple terms, a distribution is an indexed set of probabilities whose sum is 1.
For the moment, discussion will be restricted to cases where there is a single discrete
random variable X whose value r runs from zero upwards and serves as the index. It is
possible to think of r running from 0 to ∞ where in most cases the indexed probability is
zero.
A distribution may be expressed by a table or a function or graphically. Consider the
distribution associated with a fair die.
A tabular representation of the distribution is:

                       r →
    X             0     1     2     3     4     5     6
    P(X = r)     0/6   1/6   1/6   1/6   1/6   1/6   1/6

A functional representation of the distribution is:

    P(X = r) = 1/6,  if r ∈ N ∧ 1 ≤ r ≤ 6
               0,    otherwise

A graphical representation of the distribution is:


[Graph: P(X = r) plotted against r for r = 0 to 6; the bars are at height 1/6 for r = 1 to 6 and at 0 for r = 0.]
Of the three representations, only the function makes it pedantically clear that unless
1 ≤ r ≤ 6 the probability is zero.

– 3.1 –
The Uniform Distribution
When all the non-zero probabilities are the same and are indexed by a contiguous sequence
of values of r, the distribution is said to be a Uniform distribution.
The behaviour of a fair die is an example of a Uniform distribution. All six non-zero
probabilities are the same and the index r for these probabilities has the contiguous values
1, 2, 3, 4, 5 and 6.
One can imagine a fair die with a different number of faces. Consider a fair tetrahedral
die whose faces happen to be numbered 5, 6, 7 and 8. The four probabilities are all 1/4.
There is actually a family of distributions and the description:
Uniform(m, n)
is used to refer to the general case; m and n are the start and stop values of r and are
called the parameters of the distribution.
A random variable whose value represents the outcome of throwing an ordinary fair die is
said to be ‘distributed Uniform(1,6)’. If the value represents the outcome of throwing the
curious tetrahedral die the random variable is distributed Uniform(5,8).
In the general case there are n − m + 1 values for the index and, given the equiprobable
nature of the Uniform distribution, the probabilities are all 1/(n − m + 1). The functional
representation of the general case is:
    P(X = r) = 1/(n − m + 1),  if r ∈ N ∧ m ≤ r ≤ n
               0,              otherwise

It is good practice always to check that the probabilities in a distribution sum to 1. With
the general Uniform distribution the check is straightforward:
    Σ_{r=m}^{n} 1/(n − m + 1) = (n − m + 1)/(n − m + 1) = 1
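The Uniform(m, n) probability function and the sum check are easily expressed in code. A
Python sketch (illustrative; the function name is invented):

    from fractions import Fraction

    def uniform_pmf(m, n):
        # P(X = r) = 1/(n - m + 1) for m <= r <= n, and 0 otherwise
        def p(r):
            return Fraction(1, n - m + 1) if m <= r <= n else Fraction(0)
        return p

    p = uniform_pmf(5, 8)                    # the tetrahedral die numbered 5 to 8
    print(p(5), p(9))                        # 1/4 0
    print(sum(p(r) for r in range(5, 9)))    # 1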

The Triangular Distribution


Many standard distributions will be discussed but one which has already been noted but
not until now given a name is the Triangular distribution. The functional representation
of an example of this distribution was given on page 1.7 as:
    P(X = r) = r/21,  if r ∈ N ∧ 1 ≤ r ≤ 6
               0,     otherwise

As with all discrete distributions it satisfies the informal requirement of being a set of
indexed probabilities whose sum is 1:
    Σ_{r=0}^{6} r/21 = (1 + 2 + 3 + 4 + 5 + 6)/21 = 1

– 3.2 –
The Binomial Distribution
If the probability of a boy is p and the probability of a girl is q (where p + q = 1) it has
already been shown that the four probabilities for two children sum to 1 as:

    p² + pq + qp + q² = 1

The four terms are the probabilities of BB, BG, GB and GG respectively. Since pq = qp
the four terms can conveniently be reduced to three:

   p² + 2pq + q² = 1

Using Binomial coefficients, this can be written as:


     
   (2 choose 0) p²q⁰ + (2 choose 1) p¹q¹ + (2 choose 2) p⁰q² = 1

The middle term is the probability of one boy and one girl without regard to order.
With four children the equivalent sum is:

   p⁴ + 4p³q + 6p²q² + 4pq³ + q⁴ = 1

As with the previous example, the term which represents all boys is first and the term
which represents all girls is last. Reversing the order gives:

   q⁴ + 4pq³ + 6p²q² + 4p³q + p⁴ = 1

Using Binomial coefficients, this can be written as:


         
   (4 choose 0) p⁰q⁴ + (4 choose 1) p¹q³ + (4 choose 2) p²q² + (4 choose 3) p³q¹ + (4 choose 4) p⁴q⁰ = 1

The five terms are, respectively, the probabilities of having 0, 1, 2, 3 and 4 boys in a family
of four children without regard to order.
Using the random variable X to refer to the number of boys:

   P(X = r) = (4 choose r) p^r q^(4−r),  if r ∈ N ∧ 0 ≤ r ≤ 4
            = 0,                         otherwise

This is an indexed set of probabilities whose sum is 1 and so is a distribution. It is an
example of the Binomial distribution as it applies to 4 children. The term (4 choose r) p^r q^(4−r)
begins with (4 choose r) (the number of ways of there being r boys in 4 children) and this
is multiplied by p^r (the probability of having r boys) and q^(4−r) (the probability that the
remaining 4 − r children are girls).

– 3.3 –
As a distribution it is not completely specified until a value is given for p (and hence q) as
well as saying how many children there are.
Taking p = 0.515 and q = 0.485, and for once not using fractions, the probabilities may
be tabulated thus:

          r →
   X          0       1       2       3       4
   P(X = r)   0.055   0.235   0.374   0.265   0.071

It is easy to check that the five values sum to 1. Notice also that when there are four
children of the same sex, the probability that they are all boys is noticeably greater than
the probability that they are all girls.
A graphical representation of the distribution is a bar chart of P(X = r) against r for
r = 0 to 4, with the tallest bar (just under 0.40) at r = 2.

As with the Uniform distribution, the Binomial distribution is a family of distributions,
indeed a family of families. The description
Binomial(n, p)

is used to refer to the general case; n and p are the parameters. In the example just
considered, the random variable X is said to be distributed Binomial(4, 0.515).
The (general) Binomial distribution applies to many circumstances where there is a finite
number of entities each of which may be one of two possibilities. A random variable
whose value represents the number of heads which appear when 4 fair coins are tossed is
distributed Binomial(4, 1/2). If you have 4 machines each of which has a 1% probability of
failing in a given time interval, the appropriate distribution is Binomial(4, 1/100).
In general, where a random variable X is distributed Binomial(n, p), the probability
P(X = r) is:

   P(X = r) = (n choose r) p^r q^(n−r),  if r ∈ N ∧ 0 ≤ r ≤ n
            = 0,                         otherwise

– 3.4 –
The sum of these n + 1 probabilities is:

   (n choose 0) p⁰q^n + (n choose 1) p¹q^(n−1) + (n choose 2) p²q^(n−2) + ···
                      + (n choose r) p^r q^(n−r) + ··· + (n choose n) p^n q⁰

It is immediately clear from the Binomial theorem that the sum is 1 since the expression
can be rewritten:

   Σ_{r=0}^{n} (n choose r) p^r q^(n−r) = (q + p)^n = 1

Note that (q + p)^n is shown (in preference to (p + q)^n) since, reading from left to right,
the terms in its expansion are normally written with ascending powers of p (compare with
the expansion of (x + y)^n on page 2.12). The key point is that the general case satisfies
the informal requirement of having a set of indexed probabilities whose sum is 1.
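The probabilities of the general case are easily computed. Here is a hedged sketch in ML,
assuming real arithmetic is acceptable; the names choose and binomialPMF are illustrative
and not part of the notes.

   (* n choose r, computed as a real via the product formula *)
   fun choose (n, r) =
       if r = 0 then 1.0
       else Real.fromInt (n - r + 1) / Real.fromInt r * choose (n, r - 1);

   (* P(X = r) for a random variable distributed Binomial(n, p) *)
   fun binomialPMF (n, p) r =
       choose (n, r) * Math.pow (p, Real.fromInt r)
                     * Math.pow (1.0 - p, Real.fromInt (n - r));

   (* Example: four children with p = 0.515 gives about 0.374 for r = 2 *)
   val pTwoBoys = binomialPMF (4, 0.515) 2;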

A Point to Ponder
In the context of children, the particular term (n choose r) p^r q^(n−r) is the probability of there
being r boys and n − r girls in a family of n children. The coefficient (n choose r) is the number
of ways in which n children may divide as r boys and n − r girls and this coefficient multiplies
the probability of one such case.
With 4 children the probability of there being 1 boy (and 3 girls) is (4 choose 1) p¹q³ = 4pq³.
This is really the sum:

   pqqq + qpqq + qqpq + qqqp = 4p¹q³

The multiplication theorem holds for each term because the boy-girl events are independent
and the addition rule holds overall because the B+3G events are mutually exclusive. Since
each of the four separate B+3G events has the same probability, the addition amounts to
multiplying the probability of one of them by 4.
The previous paragraph is worth pondering. Are the boy-girl events really independent?
If a couple have three girls would you really put the same odds on the next child being a
boy as you would if they were expecting their first baby? Is the value of p in qqqp really
the same as the p in pqqq? Demographic experts generally agree that it is.

The Trinomial Distribution


The Binomial distribution applies when considering entities which have two states, boy-
girl, heads-tails, working-broken and so on.
There are circumstances when three states are appropriate. For example a bicycle has
three principal states: Parked, Ridden or Pushed and traffic lights can be Red, Green or
Changing. There was a brief period when ternary computers were thought worth exploring:
voltages would have been positive, zero or negative.
There are many three-state examples in genetics. It would be fanciful to imagine that
children came in three sexes but, if both parents have blood group AB, then each offspring
will necessarily have blood group AA, AB or BB and there are known probabilities for
each.

– 3.5 –
One can reasonably ask the probability of such parents with four children having two
children AA, one AB and one BB. Problems involving entities which have three states
lead naturally to a discussion of the Trinomial distribution.
Before proceeding, consider a summary of the case of a family of four children and the
two-state boy-girl analysis:
• There are two salient probabilities p and q; these are the probabilities of a child being
a boy or a girl respectively. Necessarily p + q = 1.
• If order is taken into account, there are 2⁴ = 16 ways of having four children being
GGGG, GGGB, . . ., BBBB.
• If order is not taken into account, there are 4 + 1 = 5 ways of having four children
which can be listed as 0 boys, 1 boy, . . ., 4 boys.
• The probability of there being r boys is:

   (4 choose r) p^r q^(4−r) = 4!/(r! (4 − r)!) · p^r q^(4−r)

Suppose the three blood groups are labelled a, b and c and are regarded as three sexes:
• There are three salient probabilities; call these pa , pb and pc and note that necessarily
pa + pb + pc = 1.
• If order is taken into account, there are 3⁴ = 81 ways of having four children because
each may be one of three possibilities.
• If order is not taken into account, the number of ways of having four children turns
out to be 15 and these ways are most easily presented in a triangular array:

aaaa
aaab aaac
aabb aabc aacc
abbb abbc abcc accc
bbbb bbbc bbcc bccc cccc

• There is no direct parallel to ‘the probability of there being r boys’ because of the
complication introduced by having three possibilities. . .
Consider a rewrite of the expression for the probability of there being r boys in the Binomial
case:
   4!/(r! (4 − r)!) · p^r q^(4−r)  =  4!/(ra! rb!) · pa^ra pb^rb
The boy-girl probabilities p and q have been replaced by pa and pb and the boy-girl numbers
r and 4 − r have been replaced by ra and rb . Clearly pa + pb = 1 and ra + rb = 4.

– 3.6 –
This latter expression generalises to the Trinomial case:
   4!/(ra! rb! rc!) · pa^ra pb^rb pc^rc        where pa + pb + pc = 1 and ra + rb + rc = 4

Here, within the total of four children, ra , rb and rc are the numbers with blood groups
AA, AB and BB respectively. The only possibilities for ra , rb and rc are 3 permutations of
(0,0,4), 6 permutations of (0,1,3), 3 permutations of (0,2,2) and 3 permutations of (1,1,2)
making a total of 15 possibilities.
These 15 possibilities can be plugged into the expression to give the 15 probabilities:

   pa⁴
   4pa³ pb       4pa³ pc
   6pa² pb²      12pa² pb pc     6pa² pc²
   4pa pb³       12pa pb² pc     12pa pb pc²    4pa pc³
   pb⁴           4pb³ pc         6pb² pc²       4pb pc³       pc⁴

By way of illustration, take the middle term in the middle row. This is the probability
that two of the children have blood group AA, one is blood group AB and one is blood
group BB. Thus ra = 2, rb = 1 and rc = 1. So:
   4!/(ra! rb! rc!) · pa^ra pb^rb pc^rc  =  4!/(2! 1! 1!) · pa² pb pc  =  12 pa² pb pc

Note that the sum of the coefficients is 81, accounting for the 81 possibilities if order is
important. [Equivalently, the sum of the coefficients in q⁴ + 4pq³ + 6p²q² + 4p³q + p⁴ is 16,
accounting for the 16 possibilities in the binomial case if order is important.]
The sum of the 81 probabilities turns out to be:

   (pa + pb + pc)⁴ = 1        given that pa + pb + pc = 1

It is not difficult to expand this fourth power by hand and verify that the 15 terms which
result correspond to those in the triangle.
It is an essential requirement of any distribution that the overall total probability is 1 and
the Trinomial distribution satisfies this. A difference from the Uniform, Triangular and
Binomial distributions is that the constituent probabilities of the Trinomial distribution
are not indexed in a linear way.
There is nothing special about 4 as the number of children and the general expression for
the Trinomial distribution is:
   n!/(ra! rb! rc!) · pa^ra pb^rb pc^rc        where pa + pb + pc = 1 and ra + rb + rc = n

Given n children, this is the probability that ra are of blood group AA, rb are of blood
group AB and rc are of blood group BB.

– 3.7 –
The Multinomial Distribution
If entities have k states then the Multinomial distribution may apply. The expression that
should be noted is:
   n!/(r1! r2! . . . rk!) · p1^r1 p2^r2 . . . pk^rk        where p1 + p2 + ··· + pk = 1 and r1 + r2 + ··· + rk = n
Given n entities, this is the probability that r1 are in state 1, r2 are in state 2 and so on
up to rk being in state k.
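This probability is easy to compute directly. Here is a hedged sketch in ML, assuming the
counts and probabilities are supplied as lists; the names factorial and multinomialProb are
illustrative and not part of the notes.

   (* n! as a real *)
   fun factorial 0 = 1.0
     | factorial n = Real.fromInt n * factorial (n - 1);

   (* rs = [r1,...,rk], ps = [p1,...,pk]; returns n!/(r1!...rk!) p1^r1 ... pk^rk *)
   fun multinomialProb (rs, ps) =
       let val n     = foldl op+ 0 rs
           val coeff = factorial n / foldl op* 1.0 (map factorial rs)
           val terms = ListPair.map (fn (p, r) => Math.pow (p, Real.fromInt r)) (ps, rs)
       in  coeff * foldl op* 1.0 terms  end;

   (* Example: the Trinomial term 12 pa^2 pb pc with pa = pb = pc = 1/3 *)
   val example = multinomialProb ([2, 1, 1], [1.0/3.0, 1.0/3.0, 1.0/3.0]);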

Expectation or Mean
If you repeatedly throw a fair die you would intuitively expect the long-term average of
the values shown to be 3 1/2. On this occasion, intuition provides the right answer but a
more formal approach is merited.
The terms expectation (usually denoted by the letter E) and mean (usually denoted by µ)
are used to describe the long-term average. The mean may be calculated by thinking of
weights and moments to determine a centre of gravity.
Including the contrived zero, the values which can result from throwing a die are 0, 1, 2,
3, 4, 5 and 6. Imagine marking these values off at unit intervals along a light beam and at
each of the seven positions placing a weight whose mass is proportional to the associated
probability:
   [Figure: a light beam marked at unit intervals 0 to 6, carrying weights of mass
   0/6, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6 at the seven positions, with a pivot at distance µ.]
The figure shows such an arrangement with little squares representing the weights. The
leftmost weight has mass zero and so is not shown. A pivot has been placed at distance
µ along the beam and it is at once clear that its position would not leave the beam in
balance.
To achieve balance, consider the net clockwise moment about the pivot. The required
value of µ has to be such that the net moment is zero. Accordingly, µ must satisfy:
   (0 − µ)·0/6 + (1 − µ)·1/6 + (2 − µ)·1/6 + (3 − µ)·1/6 + (4 − µ)·1/6 + (5 − µ)·1/6 + (6 − µ)·1/6 = 0
Consider the last term on the left, (6 − µ)·1/6. The value 6 − µ is the distance of the
rightmost weight from the pivot and this is multiplied by the mass of the weight (equal to
the probability). The same consideration applies to the other terms but notice that if a
weight is to the left of the pivot, the distance (as 2 − µ for example) is negative, correctly
implying that the moment is anti-clockwise.
Rearrange the equation:

   µ(0/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6) = 0·0/6 + 1·1/6 + 2·1/6 + 3·1/6 + 4·1/6 + 5·1/6 + 6·1/6

– 3.8 –
The item in brackets is the sum of the probabilities and this, as always, is 1. Accordingly:
   µ = 0·0/6 + 1·1/6 + 2·1/6 + 3·1/6 + 4·1/6 + 5·1/6 + 6·1/6 = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 7/2

The value 7/2 or 3 1/2 comes as no surprise as the long-term average outcome of throwing an
ordinary fair die.
The expression for µ can be rewritten:

   µ = 0·P(X = 0) + 1·P(X = 1) + 2·P(X = 2) + ··· + 6·P(X = 6)    or    µ = Σ_{r=0}^{6} r·P(X = r)

The analysis applies to any distribution which is an indexed set of probabilities whose
sum is 1. The general formula for the expectation or mean of a single random variable is
written as:

   E(X) = µ = Σ_r r·P(X = r)                                                          (3.1)

The item E(X) is pronounced ‘the expectation of X’. The sum over r is left open-ended
but this is taken to refer to the range which is appropriate.
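Formula (3.1) is a one-line computation once a distribution is held as data. Here is a
minimal sketch in ML, assuming a distribution is represented as a list of (r, P(X = r))
pairs; the name expectation is illustrative and not part of the notes.

   (* E(X) = sum of r.P(X = r) over the distribution *)
   fun expectation dist =
       foldl (fn ((r, p), acc) => acc + Real.fromInt r * p) 0.0 dist;

   val die  = List.tabulate (6, fn i => (i + 1, 1.0 / 6.0));
   val mean = expectation die;      (* 3.5 *)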

Glossary
The following technical terms have been introduced:
distribution Triangular distribution Multinomial distribution
Uniform distribution Binomial distribution expectation
parameter Trinomial distribution mean
Exercises — III
Work in fractions whenever possible.
1. If the probability of hitting a target is 2/5 and five shots are fired, what is the probability
that the target will be hit at least twice? What is the conditional probability that the
target will be hit at least twice, assuming that at least one hit is scored?
2. A supermarket has 20 check-outs: 5 have A-type cash registers and 15 have B-type.
The A-type has a probability a of breaking down during the first hour of trading and
the B-type has a probability b. The supervisor arrives at the end of this time and
learns that one register has broken down. Determine the probability that the broken
register is (a) A-type and (b) B-type.
3. 12 dice are thrown. What is the probability that each face appears twice?

4. Given Pascal’s theorem expressed as (n+1 choose r+1) = (n choose r) + (n choose r+1),
   prove that Σ_{r=0}^{n} (n choose r) = 2^n
5. Prove the Binomial Theorem (page 2.12).
Hint: it may be helpful to assume that the expansion holds for (x + y)^n and to consider
the effect of multiplication by one more (x + y).

– 3.9 –
6. Using (3.1), determine the expectation of the Triangular distribution:
   P(X = r) = r/21,  if r ∈ N ∧ 1 ≤ r ≤ 6
            = 0,     otherwise

7. [From Part IA of the Mathematical Tripos, 1973] A princess is equally likely to sleep
on anything from six to a dozen mattresses of the softest down, and beneath the lowest
of these on just half the nights of the year is placed a pea. Being a young lady of
refined sensibility her sleep is invariably disturbed by the presence of a pea beneath
a mere six mattresses; with seven however a pea may pass unnoticed in one case out
of ten, with eight it may escape detection in two cases out of ten, and so on, so that
with the full twelve mattresses she slumbers on notwithstanding the offending pea as
often as six times in ten. One morning, on being wakened by Bayes, her maid, she
announces delightedly that she has spent the most tranquil of nights.
What is the expected number of mattresses upon which she slept?
8. The answers to the following questions may be expressed as decimals:
(a) What is the probability of obtaining at least one six when six dice are thrown?
(b) What is the probability of obtaining at least two sixes when 12 dice are thrown?
(c) What is the probability of obtaining at least three sixes when 18 dice are thrown?
9. A report of a possible breakthrough in the treatment of Nerd’s Syndrome described
a preliminary trial of a new drug. The drug was administered to 10 sufferers all of
whom were immediately cured of the affliction. Tragically, later trials showed that
10% of patients treated with this drug die from an unfortunate side-effect.
Suppose that n patients take part in a trial and that p is the probability that a trial
participant suffers the fatal side-effect. Let S be the probability that at least one of
the n patients dies. [Thus S is the probability that the trial reveals the side-effect.]
Now consider the following (where, again, probabilities may be expressed as decimals):
(a) In the preliminary trial, p = 1/10 and n = 10. What is the probability that none
of the 10 patients dies (as was the case in the reported trial)?
(b) Again taking p = 1/10, what is the minimum value of n needed to be 90% sure that
the trial reveals the side-effect? [Thus what is the minimum value of n needed to
ensure that S > 90/100?]
(c) The value 90% is sometimes called the confidence. With p = 1/10, what is the
minimum value of n needed to ensure a confidence of 99%? [That is S > 99/100?]
(d ) For arbitrary (but known) p and arbitrary (but known) confidence C, what is the
minimum value of n needed to ensure that S > C?
(e) In real life there is a well-known heart drug for which the probability of suffering
a serious side-effect is 10^−5. The risk of using the drug is deemed acceptable
because there is a very much greater probability that an untreated patient will
die. How large a trial would be needed to be 99% confident that the trial reveals
the side-effect?

– 3.10 –
4 — MEANS AND VARIANCES

The term expectation (or mean) has so far been confined to a single random variable.
Related to expectation is the term variance and this will be discussed before two important
new distributions are introduced. Expectation and variance will then be applied to cases
where there are two or more random variables.

Derived Random Variables


This course follows the common practice of using X and r to refer to the name and value
of a single random variable. The most-frequently used example of a random variable has
been the outcome of throwing a die.
If two people are betting on the outcome of throwing a die and wish to make matters (ever
so slightly) more interesting they might decide to bet on some function of the outcome
instead of the outcome itself. They might perhaps bet on the square of the outcome, the
sine of the outcome, seven more than the outcome, two to the power of the outcome and
so on.
Such a decision does not affect the number of possible outcomes or the values of the
outcomes themselves. What is affected is the values the people bet on and, of much
greater interest, the long-term average of the values bet on.
Any function of some random variable X is itself a random variable; call it Y where:

Y = f (X)

Subject to certain common-sense constraints f is an arbitrary function and Y is called a
derived random variable. If f is like the square function it will lead to integer values. If
it is like the sine function it will not and various issues arise which will not be addressed
here.

Generalised Expectation
Fortunately the expectation of some function f (X) of a random variable X is a trivial
generalisation of (3.1):

   E(f(X)) = Σ_r f(r)·P(X = r)                                                        (4.1)

This can be derived in exactly the same way as the simple expectation by considering the
position of the centre of gravity of a light beam supporting weights. Suppose the bets are
on the square of the outcome of throwing a die. The weights are just the same but they
are placed at positions 0, 1, 4, 9, 16, 25 and 36 along the beam.
The analysis gives rise to the expectation:
   E(X²) = 0·0/6 + 1·1/6 + 4·1/6 + 9·1/6 + 16·1/6 + 25·1/6 + 36·1/6 = 91/6 = 15 1/6
Many functions of random variables will be used in this course but the square function
and one closely related to the square are undoubtedly the most important.
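Formula (4.1) differs from (3.1) only in applying f to each value before weighting. Here is a
minimal sketch in ML, under the same list-of-pairs representation as before; the name
expectationOf is illustrative and not part of the notes.

   (* E(f(X)) = sum of f(r).P(X = r) over the distribution *)
   fun expectationOf f dist =
       foldl (fn ((r, p), acc) => acc + f r * p) 0.0 dist;

   val die    = List.tabulate (6, fn i => (i + 1, 1.0 / 6.0));
   val meanSq = expectationOf (fn r => Real.fromInt (r * r)) die;   (* 91/6, about 15.17 *)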

– 4.1 –
Variance
Suppose that, instead of simply squaring the outcome of each throw, the players first
subtract the mean (3 1/2 of course) and then square. The expectation of the values bet on
now is:

   E((X − 3 1/2)²) = Σ_r (r − 3 1/2)²·P(X = r)

This value is known as the variance (usually denoted by the letter V) or mean squared
deviation from the mean, usually denoted by σ². In general:

   Variance = σ² = V(X) = E((X − µ)²) = Σ_r (r − µ)²·P(X = r)                         (4.2)

The item V(X) is pronounced ‘the variance of X’. The sum over r again refers to the
range which is appropriate.
The variance gives a measure of the spread of values from the mean. Here is a table
showing how the variance of the outcomes of throwing a die may be calculated:

   r     r − µ    (r − µ)²    P(X = r)    (r − µ)²·P(X = r)

   0     −7/2     49/4        0/6         0
   1     −5/2     25/4        1/6         25/24
   2     −3/2      9/4        1/6          9/24
   3     −1/2      1/4        1/6          1/24
   4      1/2      1/4        1/6          1/24
   5      3/2      9/4        1/6          9/24
   6      5/2     25/4        1/6         25/24

The sum of the entries in the rightmost column is 70/24 or 35/12.

Standard Deviation
If expectation is related to the idea of a centre of gravity, the variance is related to the
idea of moment of inertia.
To produce a measure of the spread of values from the mean which has the same dimension
as values themselves it is common to use the square root of the variance and this is known
as the standard deviation (denoted by σ, which is √(σ²) of course):

   Standard Deviation = σ = √Variance

In the case of the die:

   Standard Deviation = σ = √(70/24) ≈ 1.71
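Both quantities follow mechanically from a distribution. A hedged sketch in ML, reusing
the list-of-pairs representation; the names variance and stdDev are illustrative and not
part of the notes.

   (* E(f(X)) over a distribution given as (r, P(X = r)) pairs *)
   fun expectationOf f dist =
       foldl (fn ((r, p), acc) => acc + f r * p) 0.0 dist;

   (* V(X) computed directly as E((X - mu)^2), as in (4.2) *)
   fun variance dist =
       let val mu = expectationOf Real.fromInt dist
       in  expectationOf (fn r => let val d = Real.fromInt r - mu in d * d end) dist
       end;

   fun stdDev dist = Math.sqrt (variance dist);

   val die = (0, 0.0) :: List.tabulate (6, fn i => (i + 1, 1.0 / 6.0));
   val v   = variance die;      (* 35/12, about 2.917 *)
   val s   = stdDev die;        (* about 1.71 *)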

– 4.2 –
The Geometric Distribution
A very common two-state entity to which the Binomial distribution is applied is any
experiment or trial whose outcome is regarded as a success or a failure with no other
possibility.
A special case of this is waiting for a No. 9 Bus. Each bus that stops at your stop is
regarded as a trial. If the bus is a No. 9 the trial counts as a success and any bus which
isn’t a No. 9 is regarded as a failure.
This is a perfectly ordinary example of the Binomial distribution if you simply note success
or failure for each of n buses. If you are hoping to catch a No. 9 Bus the circumstances are
different. As soon as a No. 9 Bus comes to your stop you get on board and the sequence
of trials abruptly terminates!
Let p be the probability of a single trial being a success and X be the random variable
which represents the number of failures before the first success (at which point matters
conclude). The probability of having to wait r trials before the first success is:

   P(X = r) = (1 − p)^r p

In this, 1 − p is the probability of a trial being a failure and the probability of r such trials
is (1 − p)^r. There has then to be a success and this explains the final p.
Here is an indexed set of probabilities. The sum happens to be 1 but that has yet to be
demonstrated. This is known as the Geometric distribution.
Note that the range of r runs indefinitely upwards from zero. When r = 0 there are no
failures before the first success. As r increases you have to sustain more and more failures
before the first success and, in principle, there is no limit to the number of failures.
The sum of the probabilities is therefore a summation to infinity. The summation is valid
because (1 − p) < 1:

   Σ_{r=0}^{∞} P(X = r) = (1 − p)⁰p + (1 − p)¹p + (1 − p)²p + ···
                        = ((1 − p)⁰ + (1 − p)¹ + (1 − p)² + ···) p = p/(1 − (1 − p)) = 1

The Geometric distribution satisfies the informal requirement of being an indexed set of
probabilities whose sum is 1. As with other distributions, the Geometric distribution is a
family of distributions but this one has only one parameter. The description:
Geometric(p)

is used to refer to the general case.

– 4.3 –
Note that the sum to some lesser limit, k say, represents the probability of having no more
than k failures before the first success:

   Σ_{r=0}^{k} P(X = r) = (1 − p)⁰p + (1 − p)¹p + ··· + (1 − p)^k p
                        = ((1 − p)⁰ + (1 − p)¹ + ··· + (1 − p)^k) p
                        = (1 − (1 − p)^(k+1))/(1 − (1 − p)) · p = 1 − (1 − p)^(k+1)
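The closed form and the direct summation are easy to compare numerically. A hedged
sketch in ML; the names geometricPMF, atMostKBySum and atMostKClosed (and the
illustrative value p = 1/4) are not part of the notes.

   (* P(X = r) for Geometric(p): r failures then one success *)
   fun geometricPMF p r = Math.pow (1.0 - p, Real.fromInt r) * p;

   (* probability of no more than k failures, by direct summation *)
   fun atMostKBySum p k = foldl op+ 0.0 (List.tabulate (k + 1, geometricPMF p));

   (* the same probability via the closed form 1 - (1 - p)^(k+1) *)
   fun atMostKClosed p k = 1.0 - Math.pow (1.0 - p, Real.fromInt (k + 1));

   (* Example: with p = 1/4 and k = 4 both give about 0.7627 *)
   val bySum    = atMostKBySum 0.25 4;
   val byClosed = atMostKClosed 0.25 4;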
The Poisson Distribution
Suppose that over a long period of time a town has recorded an average of λ murders a
year. Sometimes there is a run of good years in which there are no murders at all and then
there may be a couple of bad years with several. How does one estimate the probability
of there being exactly two murders next year?
This problem introduces the Poisson distribution. The analysis begins by taking λ as the
expectation, the expected (or average) number of murders in a year.
Divide the year into n equal intervals (365 might be a sensible value for n) and assume
that in any given interval the number of murders is zero or one. Given this assumption,
the probability of there being a murder in any given interval is λ/n.
Having chosen n intervals in a full year, let X be the random variable which represents
the number of murders in these n intervals. Subject to the assumption, the Binomial
distribution gives the probability of there being r murders in the n intervals.
The assumption is important. If there were sometimes two murders in an interval the
Trinomial distribution would be required. To ensure that the assumption is valid, n must
be made sufficiently large that the intervals are sufficiently small for the possibility of two
or more murders to be ignored.
Noting the assumption, the probability of there being r murders in the n intervals is:

   P(X = r) = (n choose r) (λ/n)^r (1 − λ/n)^(n−r)

            = n!/(r! (n − r)!) · (λ/n)^r (1 − λ/n)^(n−r)

            = [n(n − 1) . . . (n − r + 1)(n − r) . . . 1]/(r! (n − r)!) · λ^r/n^r · (1 − λ/n)^(n−r)
Rearrange the expression. First, eliminate (n − r)! from the numerator and the denominator
of the term on the left and note that this leaves exactly r terms in the numerator. Then
exchange the r! in the denominator with the n^r in the denominator of the term in the
middle. Expand n^r into r separate ns and expand (1 − λ/n)^(n−r) as well:

   (n/n)·((n − 1)/n) ··· ((n − r + 1)/n) · λ^r/r! · (1 − (n − r)/1!·(λ/n) + (n − r)(n − r − 1)/2!·(λ/n)² − ···)

– 4.4 –
Next, let n tend to infinity. This will reinforce the validity of the assumption made earlier
and lead to a simplification of the expression. Given finite r, all the terms in the left-hand
pair of brackets tend to 1. The terms in the right-hand pair of brackets can be simplified
too; thus (n − r)/n and (n − r)(n − r − 1)/n² and so on all tend to 1. The expression
reduces to:

   λ^r/r! · (1 − λ/1! + λ²/2! − λ³/3! + ···) = λ^r/r! · e^(−λ)

The conclusion of this analysis is that the probability of there being r murders is:

   P(X = r) = λ^r/r! · e^(−λ)

This is an indexed set of probabilities and its sum is readily shown to be 1:

   Σ_{r=0}^{∞} λ^r/r! · e^(−λ) = (1 + λ/1! + λ²/2! + λ³/3! + ···) e^(−λ) = e^λ · e^(−λ) = 1

As with other distributions, the Poisson distribution is a family of distributions but, like
the Geometric distribution, it has only one parameter. The description:
Poisson(λ)

is used to refer to the general case.
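The probabilities themselves are straightforward to evaluate. A hedged sketch in ML; the
names factorial and poissonPMF are illustrative and not part of the notes.

   (* r! as a real *)
   fun factorial 0 = 1.0
     | factorial n = Real.fromInt n * factorial (n - 1);

   (* P(X = r) for Poisson(lambda) *)
   fun poissonPMF lambda r =
       Math.pow (lambda, Real.fromInt r) / factorial r * Math.exp (~lambda);

   (* Example: the first five probabilities of Poisson(0.7); compare the table
      in the Prussian Cavalry illustration which follows *)
   val table = List.tabulate (5, poissonPMF 0.7);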


The expectation, E(X), is readily calculated in the case of the Poisson distribution:

   E(X) = Σ_{r=0}^{∞} r·P(X = r) = Σ_{r=0}^{∞} r·λ^r/r!·e^(−λ) = Σ_{r=1}^{∞} r·λ^r/r!·e^(−λ)

When r = 0 the term is zero so it is in order to begin the sum from r = 1. This means
r/r! can be treated as 1/(r − 1)! and hence:

   E(X) = Σ_{r=1}^{∞} λ·λ^(r−1)/(r − 1)!·e^(−λ) = λ Σ_{r=1}^{∞} λ^(r−1)/(r − 1)!·e^(−λ)
        = λ (1/0! + λ/1! + λ²/2! + ···) e^(−λ) = λ·e^λ·e^(−λ)

The e^λ and e^(−λ) cancel so:

   E(X) = λ

This result should have been obvious all along. The analysis of the Poisson distribution
began by taking λ as the expectation (in the illustration this was the expected number of
murders in a year). It is comforting to see that the result derived now is not in conflict
with that original assumption.

– 4.5 –
The Prussian Cavalry Corps
An especially well-known illustration of the Poisson distribution was given by M.G. Bulmer.
This example relates to 14 corps of Prussian cavalry. Over a period of 20 years, that is
280 corps-years, the number of deaths of cavalry officers from horse kicks was monitored.
The mean number of such deaths per corps-year was 0.7 and this gives a value for λ.
Given Poisson(0.7), P(X = r) = 0.7^r/r! · e^(−0.7), and a table of the probabilities for values of r
from 0 to 4 is:

          r →
   X          0       1       2       3       4
   P(X = r)   0.496   0.348   0.122   0.028   0.005

These values turn out to be very close to those determined from the observed data.
The five values sum to 0.999, slightly less than 1. With the Poisson distribution, r runs to
infinity but probabilities for other than very small values of r are normally insignificant.
In none of the 280 corps-years were there more than 4 deaths from horse kicks.

Three Rules for Expectation


I If a is a constant E(a) = a
II If a is a constant and X is a random variable E(aX) = aE(X)
III If X and Y are random variables E(X + Y ) = E(X) + E(Y )

These rules require formal proofs. First recall the definition of the expectation of an
arbitrary function of a random variable (4.1):

   E(f(X)) = Σ_r f(r)·P(X = r)

In the first rule, the function is simply f (X) = a and does not involve the random variable
at all. Accordingly:

   E(a) = Σ_r a·P(X = r) = a Σ_r P(X = r) = a·1 = a

This proves the first rule. Note that it is assumed that the sum of the probabilities is 1.
Note also that it is in order to take a constant outside a Σ sign.
In the second rule, the function is f(X) = aX and:

   E(aX) = Σ_r ar·P(X = r) = a Σ_r r·P(X = r) = aE(X)

– 4.6 –
The proof of the third rule is as follows:

   E(X + Y) = Σ_r Σ_s (r + s)·P(X = r, Y = s) = Σ_r Σ_s (r + s)·p_{r,s}

            = Σ_r r (Σ_s p_{r,s}) + Σ_s s (Σ_r p_{r,s})

            = Σ_r r·P(X = r) + Σ_s s·P(Y = s) = E(X) + E(Y)

Note that Σ_s p_{r,s} and Σ_r p_{r,s} are marginal sums (see page 2.3) being row and column
totals respectively in an array representation of the probabilities of the elementary events
associated with two random variables. These totals are P(X = r) and P(Y = s).

Application to Variance
The variance of a random variable X was defined in (4.2) as:

   Variance = V(X) = E((X − µ)²)
The three rules for expectation can be used to derive an alternative expression for variance:

   V(X) = E((X − µ)²) = E(X² − 2µX + µ²)
        = E(X²) − E(2µX) + E(µ²)
        = E(X²) − 2µE(X) + µ²
        = E(X²) − 2µ² + µ²
        = E(X²) − µ²
        = E(X²) − (E(X))²                                                             (4.3)

This expression for variance will be much used. The result can be expressed in words as:
Variance is the Expectation of the Square minus the Square of the Expectation

Warning
This expression for the variance often substantially reduces the amount of algebra required
when analysing problems but there is an element of bad news. Although this expression
is mathematically identical to E((X − µ)²) there is a practical difference which Computer
Scientists should readily appreciate. . .
The expression E(X²) − (E(X))² may well involve taking a small difference between two
large numbers, something best avoided if you are mindful of rounding errors. If you use a
computer to process real data, stick to the expression E((X − µ)²).
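The point is easy to see on a machine. A hedged sketch in ML, assuming the data are a
list of 64-bit reals; the names mean, varTwoPass and varOnePass are illustrative and not
part of the notes.

   fun mean xs = foldl op+ 0.0 xs / Real.fromInt (length xs);

   (* the two-pass form E((X - mu)^2) *)
   fun varTwoPass xs =
       let val mu = mean xs
       in  mean (map (fn x => (x - mu) * (x - mu)) xs)  end;

   (* the one-pass form E(X^2) - (E(X))^2 *)
   fun varOnePass xs =
       let val mu = mean xs
       in  mean (map (fn x => x * x) xs) - mu * mu  end;

   (* Three readings near 10^8 whose true variance is 2/3: the two-pass form
      returns it accurately; the one-pass form may lose most of its precision *)
   val xs   = [1.0E8, 1.0E8 + 1.0, 1.0E8 + 2.0];
   val good = varTwoPass xs;
   val bad  = varOnePass xs;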

– 4.7 –
Three Rules for Variance
I If a is a constant V(a) = 0
II If a is a constant and X is a random variable V(aX) = a2 V(X)
III If X and Y are independent random variables V(X + Y ) = V(X) + V(Y )
Using the form in (4.3), both rule I and rule II are proved trivially.

   V(a) = E(a²) − (E(a))² = a² − a² = 0

Likewise:

   V(aX) = E(a²X²) − (E(aX))² = a²E(X²) − a²(E(X))² = a²V(X)

Rule III is rather more cumbersome to prove. Begin thus:

   V(X + Y) = E((X + Y)²) − (E(X + Y))²

            = E(X²) + 2E(XY) + E(Y²) − (E(X))² − 2E(X)·E(Y) − (E(Y))²

Now V(X) = E(X²) − (E(X))² and V(Y) = E(Y²) − (E(Y))² so:

   V(X + Y) = V(X) + V(Y) + 2[E(XY) − E(X)·E(Y)]

This is the general expression for the variance of the sum of two random variables whether
or not they are independent. If they are independent, the item in square brackets, known
as the covariance of X and Y , turns out to be zero. This will be demonstrated shortly.

Covariance — I
Covariance is usually denoted by the letter W and there is an alternative expression for
W(X, Y ) which is directly analogous to that for the variance V(X) derived as (4.3):

   Covariance = W(X, Y) = E(XY) − E(X)·E(Y) = E((X − µ_X)·(Y − µ_Y))

See exercise 11. Notice that covariance may be negative whereas variance can never be.

Lemma
The determination of covariance requires some further consideration of the double-sigma
notation. Suppose you wish to sum some function of r and s over r and s. If the function
can be reduced to the product of two functions f (r) and g(s) such that f does not depend
on s and g does not depend on r then:

   Σ_r Σ_s f(r)·g(s) = (Σ_r f(r)) (Σ_s g(s))

– 4.8 –
An informal proof can be derived by consideration of the left-hand side:

   Σ_r Σ_s f(r)·g(s) = f(0)·g(0) + f(0)·g(1) + f(0)·g(2) + ···
                     + f(1)·g(0) + f(1)·g(1) + f(1)·g(2) + ···
                     + f(2)·g(0) + f(2)·g(1) + f(2)·g(2) + ···
                       ...
                     = f(0)·[g(0) + g(1) + g(2) + ···]
                     + f(1)·[g(0) + g(1) + g(2) + ···]
                     + f(2)·[g(0) + g(1) + g(2) + ···]
                       ...
                     = [f(0) + f(1) + f(2) + ···] [g(0) + g(1) + g(2) + ···]
                     = (Σ_r f(r)) (Σ_s g(s))

This is the right-hand side and the informal proof is concluded.

Covariance — II
The determination of covariance requires the evaluation of E(XY):

   E(XY) = Σ_r Σ_s r·s·P(X = r, Y = s)

Now if X and Y are independent, P(X = r, Y = s) = P(X = r)·P(Y = s), when:

   Σ_r Σ_s r·s·P(X = r, Y = s) = Σ_r Σ_s r·s·P(X = r)·P(Y = s)

The function after the double-sigma sign can be separated into the product of two terms,
the first of which does not depend on s and the second of which does not depend on r.
Hence, by the lemma:

   E(XY) = (Σ_r r·P(X = r)) (Σ_s s·P(Y = s)) = E(X)·E(Y)

Accordingly, if X and Y are independent, the covariance of X and Y :

W(X, Y ) = E(XY ) − E(X) . E(Y ) = 0

and the variance of the sum of X and Y :

V(X + Y ) = V(X) + V(Y )

– 4.9 –
Illustration
Various formulae for the expectation, variance and covariance have been presented and
these formulae can now be applied to an illustrative case.
Recall the class of children who were classified boy-girl and fair-dark. Suppose that the
information supplied consists only of a 2 × 2 table of the four elementary events:

              fair    dark
   boy        1/3     1/3
   girl       1/4     1/12

A full analysis is presented on the following page. This takes the form of a completed and
annotated proforma which can be used for analysing the data in any two-dimensional table
of elementary events.
There are five exercises later which should all be carried out using this proforma as a guide.
Note the following steps:

• First choose X and Y as the names of the two random variables. Let these have values
r and s which can each be 0, 1, . . . Appropriate, and obvious, mappings have to be
made to convert non-numerical events such as fair and dark into integers.

• Copy the table of elementary events and ornament it with X, Y , r and s and the
chosen mappings. Compute the marginal sums and check that the row sums and
column sums each total 1.

• Tabulate the probabilities P(X = r) and P(Y = s) separately and, using the values in
the tables, compute the expectation, the expectation of the square, and the variance
of X and Y .

• Set up the big table in the middle of the page. There will be one line of entries for
each elementary event. The third column is simply a transcription of the probabilities
of the elementary events and their sum should be 1. The totals of the fourth, fifth
and sixth columns should give E(X + Y), E((X + Y)²) and E(XY). Check that
E(X + Y ) = E(X) + E(Y ) (these latter values being computed in the previous step).

• Compute the variance of the sum V(X + Y ) and check whether or not the result is
the same as V(X) + V(Y ).

• Compute the covariance and add twice this to V(X) + V(Y ) and check that the result
is the same as the value of V(X + Y ) computed in the previous step.

• Finally check each probability in the original table against the product of the two
relevant marginal sums. If any probability is different from the relevant product, the
two random variables are not independent.

– 4.10 –
A Full Analysis of a Pair of Random Variables
First copy the table of elementary events and compute the marginal sums. . .

                      Y
                      s →
                      0        1
   r       0          1/3      1/3       2/3
   X
   ↓       1          1/4      1/12      1/3

                      7/12     5/12

Tabulate the probabilities P(X = r) and P(Y = s) separately and for each variable
compute the expectation, the expectation of the square, and the variance. . .
   r    P(X = r)      E(X)  = 1/3           s    P(Y = s)      E(Y)  = 5/12
   0    2/3           E(X²) = 1/3           0    7/12          E(Y²) = 5/12
   1    1/3           V(X)  = 2/9           1    5/12          V(Y)  = 35/144

For each elementary event tabulate r, s, pr,s , (r + s) pr,s , (r + s)2 pr,s and (r.s) pr,s and
then determine the sums of the four rightmost columns. . .

   r    s    p_{r,s}    (r + s) p_{r,s}    (r + s)² p_{r,s}    (r·s) p_{r,s}

   0    0    1/3        0                  0                   0
   0    1    1/3        1/3                1/3                 0
   1    0    1/4        1/4                1/4                 0
   1    1    1/12       1/6                1/3                 1/12

             1          3/4                11/12               1/12
          Σ Σ p_{r,s}   E(X + Y)           E((X + Y)²)         E(XY)

Now compute various values related to the pair X and Y . . .

   Variance of the sum (i):    V(X + Y) = E((X + Y)²) − (E(X + Y))² = 11/12 − 9/16 = 17/48

   Covariance:                 W(X, Y) = E(XY) − E(X)·E(Y) = 1/12 − 1/3 × 5/12 = −1/18

   Variance of the sum (ii):   V(X + Y) = V(X) + V(Y) + 2 W(X, Y) = 2/9 + 35/144 − 1/9 = 17/48

Finally check for independence; P(X = r, Y = s) versus P(X = r)·P(Y = s) . . .

   1/3 ≠ 2/3 × 7/12      1/3 ≠ 2/3 × 5/12      1/4 ≠ 1/3 × 7/12      1/12 ≠ 1/3 × 5/12
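Every number on the proforma can be recomputed mechanically. A hedged sketch in ML,
assuming the joint distribution is held as a list of (r, s, p) triples; the names expect and
joint are illustrative and not part of the notes.

   (* E(f(X,Y)) over a joint distribution of (r, s, p) triples *)
   fun expect f joint =
       foldl (fn ((r, s, p), acc) => acc + f (r, s) * p) 0.0 joint;

   val joint = [(0, 0, 1.0/3.0), (0, 1, 1.0/3.0),
                (1, 0, 1.0/4.0), (1, 1, 1.0/12.0)];

   val eX   = expect (fn (r, _) => Real.fromInt r) joint;          (* 1/3   *)
   val eY   = expect (fn (_, s) => Real.fromInt s) joint;          (* 5/12  *)
   val eXY  = expect (fn (r, s) => Real.fromInt (r * s)) joint;    (* 1/12  *)
   val vX   = expect (fn (r, _) => Real.fromInt (r * r)) joint - eX * eX;
   val vY   = expect (fn (_, s) => Real.fromInt (s * s)) joint - eY * eY;
   val w    = eXY - eX * eY;                                       (* -1/18 *)
   val vSum = vX + vY + 2.0 * w;                                   (* 17/48 *)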

– 4.11 –
Glossary
The following technical terms have been introduced:
variance standard deviation Poisson distribution
derived random variable Geometric distribution covariance

Exercises — IV
Work in fractions.

1. Determine the variance of the Triangular distribution:

( r
, if r ∈ N ∧ 1 6 r 6 6
P(X = r) = 21
0, otherwise

2. A random variable X is distributed Geometric(p) so P(X = r) = (1 − p)^r p. Show
   that P(X > m + n | X > m) = P(X > n) for m, n = 0, 1, 2, . . . [Thus X has the ‘lack
of memory property’ since, given that X − m > 0, the distribution of X − m is the
same as the original distribution of X.]

3. As r takes the values 0, 1, 2, . . . show that the probabilities of the Poisson distribution
P(X = r) = λ^r e^(−λ)/r! initially increase monotonically then decrease monotonically.
Additionally show that they reach their greatest value when r is the largest integer
not exceeding λ.

4. A computer printout of n pages contains on average λ misprints per page. Estimate


the probability that at least one page will contain more than k misprints.
5. Two random variables X and Y are independently distributed with means µx and µy
and variances σx² and σy² respectively. Derive expressions for the mean and variance
of XY .

6. Suppose that the data given in the 2 × 2 table on page 4.10 were replaced by the
following values for the probabilities of the elementary events:

              fair    dark
   boy        4/15    2/15
   girl       2/5     1/5

Complete a proforma like that shown on page 4.11 but based on these new values.

– 4.12 –
7. Suppose the hair colour is now classified in three ways and the results for a rather
extraordinary class are recorded in the following 2 × 3 table:

              fair    red     dark
   boy        1/3     0       1/3
   girl       0       1/3     0

Again complete a proforma. Note that s will now range 0, 1 and 2 and there will be
an extra line of entries in the table headed s and P(Y = s). There will be two extra
lines in the main table.
8. Suppose that the data are now as in the following 7 × 7 table which is appropriate
for two fair dice. Both r and s will now range from 0 to 6 and there will be seven
lines of entries in the two short tables. In principle there will be 49 lines of entries in
the main table but it is not necessary to fill this in. Only the four totals beneath the
table are required and it will not be hard to see how to derive these by some obvious
short cuts:
        0     1      2      3      4      5      6
   0    0     0      0      0      0      0      0
   1    0     1/36   1/36   1/36   1/36   1/36   1/36
   2    0     1/36   1/36   1/36   1/36   1/36   1/36
   3    0     1/36   1/36   1/36   1/36   1/36   1/36
   4    0     1/36   1/36   1/36   1/36   1/36   1/36
   5    0     1/36   1/36   1/36   1/36   1/36   1/36
   6    0     1/36   1/36   1/36   1/36   1/36   1/36

9. Complete a similar analysis using the data in the following table where the two dice
always show the same value:

        0     1      2      3      4      5      6
   0    0     0      0      0      0      0      0
   1    0     1/6    0      0      0      0      0
   2    0     0      1/6    0      0      0      0
   3    0     0      0      1/6    0      0      0
   4    0     0      0      0      1/6    0      0
   5    0     0      0      0      0      1/6    0
   6    0     0      0      0      0      0      1/6

– 4.13 –
10. To conclude this series of dice questions, use the data in the following table where the
sum of the values shown by the two dice is always seven:

        0     1      2      3      4      5      6
   0    0     0      0      0      0      0      0
   1    0     0      0      0      0      0      1/6
   2    0     0      0      0      0      1/6    0
   3    0     0      0      0      1/6    0      0
   4    0     0      0      1/6    0      0      0
   5    0     0      1/6    0      0      0      0
   6    0     1/6    0      0      0      0      0

11. By analogy with the method used to demonstrate that E((X − µ)²) = E(X²) − (E(X))²
    (see (4.3) above) show that:

       E((X − µ_X)·(Y − µ_Y)) = E(XY) − E(X)·E(Y)

    where µ_X = E(X) and µ_Y = E(Y).

– 4.14 –
5 — CORRELATION

The covariance of two random variables gives some measure of their independence. A
second way of assessing the measure of independence will be discussed shortly but first the
expectation and variance of the Binomial distribution will be determined. The calculations
turn out to be surprisingly tedious.

Expectation of the Binomial Distribution — I


The Binomial distribution can be defined as:

   P(X = r) = (n choose r) p^r q^(n−r)        where p + q = 1 and 0 ≤ r ≤ n

The expectation is:

   E(X) = Σ_{r=0}^{n} r·P(X = r) = Σ_{r=0}^{n} r·(n choose r) p^r q^(n−r)

When r = 0 the term is zero so it is in order to begin the sum from r = 1:

   E(X) = Σ_{r=1}^{n} r · n!/(r! (n − r)!) · p^r q^(n−r)

        = Σ_{r=1}^{n} n!/((r − 1)! (n − r)!) · p^r q^(n−r)

        = np Σ_{r=1}^{n} (n − 1)!/((r − 1)! (n − r)!) · p^(r−1) q^(n−r)

Next, let s = r − 1 which, of course, means replacing r by s + 1. Noting that the sum
is currently from r = 1 to r = n, the replacement will mean s running from s = 0 to
s = n − 1:
   E(X) = np Σ_{s=0}^{n−1} (n − 1)!/(s! (n − s − 1)!) · p^s q^(n−s−1)

Then, let m = n − 1:

   E(X) = np Σ_{s=0}^{m} m!/(s! (m − s)!) · p^s q^(m−s)

Finally note that by the Binomial theorem,

   (q + p)^m = Σ_{s=0}^{m} m!/(s! (m − s)!) · p^s q^(m−s)

Given that p + q = 1 the summation itself is 1 and hence:

   E(X) = np

– 5.1 –
Expectation of the Binomial Distribution — II
This seems a horribly tedious way of determining the expectation. Fortunately there is a
quicker way.
Consider the most trivial Binomial distribution where a random variable is distributed
Binomial(1, p). A tabular representation of this trivial case is:

          r →
   X          0    1
   P(X = r)   q    p

The expectation is:

   E(X) = Σ_{r=0}^{1} r·P(X = r) = 0·q + 1·p = p

Now consider n such random variables X1, X2, . . . , Xn. Such variables are said to be
Independent and Identically Distributed, commonly simply called IID. The expectation of
their sum is:

   E(X1 + X2 + ··· + Xn) = E(X1) + E(X2) + ··· + E(Xn) = p + p + ··· + p = np

Since the variables are Identically Distributed they each have the same expectation and
the sum of these expectations is simply n times the expectation of the original variable.
Taking n variables which are each distributed Binomial(1, p) is exactly equivalent to having
an overall distribution Binomial(n, p). For example, in a family of four children, each child
is distributed Binomial(1, p) and the overall distribution is Binomial(4, p).
This is clearly a quicker way of determining the expectation of the Binomial distribution
and an extension of this technique leads to the variance. . .

Variance of the Binomial Distribution


The random variable which is distributed Binomial(1, p) has expectation E(X) = p. The
expectation of the square is:

   E(X²) = Σ_{r=0}^{1} r²·P(X = r) = 0²·q + 1²·p = p

The variance can now be computed:

   V(X) = E(X²) − (E(X))² = p − p² = p(1 − p) = pq

Again consider n such random variables X1 , X2 , . . . , Xn . Given that they are Independent
as well as Identically Distributed the variance of their sum is the sum of their variances:

V(X1 + X2 + · · · + Xn ) = V(X1 ) + V(X2 ) + · · · + V(Xn ) = pq + pq + · · · + pq = npq

– 5.2 –
Binomial Summary
A random variable X which is distributed Binomial(n, p) is such that the probability:

   P(X = r) = (n choose r) p^r q^(n−r)        where p + q = 1 and 0 ≤ r ≤ n

The expectation and variance of the random variable are:

   E(X) = np        and        V(X) = npq

Correlation Coefficient
The correlation coefficient comes from scaling the covariance and is often referred to by
the letter R:
   Correlation Coefficient = R = W(X, Y) / √(V(X)·V(Y))

In general −1 ≤ R ≤ +1.
R = −1 means complete negative correlation.
R =  0 means zero correlation but not necessarily independence.
R = +1 means complete positive correlation.
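Given the covariance and the two variances, R is a one-line computation. A minimal sketch
in ML; the name correlation is illustrative and not part of the notes.

   (* R = W(X,Y) / sqrt(V(X).V(Y)) *)
   fun correlation (w, vX, vY) = w / Math.sqrt (vX * vY);

   (* Example: the proforma of page 4.11 has W = -1/18, V(X) = 2/9 and
      V(Y) = 35/144, giving R of about -0.24 *)
   val r = correlation (~1.0/18.0, 2.0/9.0, 35.0/144.0);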

Example — Two Ordinary Dice


Let X and Y be two random variables associated with the throws of two ordinary dice. In
both cases the mean is 7/2 and the variance is 35/12 (see page 4.2). In summary:

   E(X) = E(Y) = 7/2        V(X) = V(Y) = 35/12
To determine the covariance W(X, Y) it is first necessary to evaluate E(XY):

   E(XY) = Σ_{r=0}^{6} Σ_{s=0}^{6} r·s·P(X = r, Y = s)

In the case of two fair dice, the probability is always 1/36 when r, s are non-zero so:

   E(XY) = Σ_{r=1}^{6} Σ_{s=1}^{6} r·s·(1/36)

By the lemma on page 4.8 the summations can be separated:

   E(XY) = (1/36) (Σ_{r=1}^{6} r) (Σ_{s=1}^{6} s) = (21 × 21)/36 = 49/4

The covariance and correlation can now be determined:


   W(X, Y) = E(XY) − E(X)·E(Y) = 49/4 − 7/2 × 7/2 = 0        and so        R = 0

– 5.3 –
Example — Two Linked Dice
The result for two ordinary dice should have been obvious from the outset. The dice are
independent ensuring E(XY ) = E(X) . E(Y ). Independence implies zero covariance and
hence zero correlation.
By contrast, suppose X and Y are two random variables associated with two dice which
behave as two linked drums on a broken fruit machine; both dice always show the same
result. Each random variable viewed alone has the same expectation and variance as
before:
   E(X) = E(Y) = 7/2        V(X) = V(Y) = 35/12
To determine the covariance W(X, Y) first evaluate E(XY):

   E(XY) = Σ_{r=0}^{6} Σ_{s=0}^{6} r·s·P(X = r, Y = s)

There are only six non-zero probabilities, all 1/6, and hence only six terms to sum:

   E(XY) = (1/6)(1·1 + 2·2 + 3·3 + 4·4 + 5·5 + 6·6) = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6
The covariance can now be determined:
   W(X, Y) = E(XY) − E(X)·E(Y) = 91/6 − 7/2 × 7/2 = 91/6 − 49/4 = 35/12
Accordingly, the correlation coefficient is:

   R = W(X, Y)/√(V(X)·V(Y)) = (35/12)/√(35/12 · 35/12) = 1

This is complete positive correlation.

Example — Two Reverse-Linked Dice


Suppose the two dice behave so as to give only the (equiprobable) pairs 16, 25, 34, 43, 52
and 61. The analysis is as before except that the sum now is:

   E(XY) = (1/6)(1·6 + 2·5 + 3·4 + 4·3 + 5·2 + 6·1) = (6 + 10 + 12 + 12 + 10 + 6)/6 = 56/6

   W(X, Y) = E(XY) − E(X)·E(Y) = 56/6 − 49/4 = −35/12
Accordingly, the correlation coefficient is:

   R = W(X, Y)/√(V(X)·V(Y)) = (−35/12)/√(35/12 · 35/12) = −1

This is complete negative correlation.

– 5.4 –
Example — A Curious Special Case
Suppose the two random variables X and Y give rise to the set of elementary events whose
probabilities are as shown in the following 2 × 3 table:

                      Y
                      s →
                      0       1       2
   r       0          1/3     0       1/3       2/3
   X
   ↓       1          0       1/3     0         1/3

                      1/3     1/3     1/3

The expectations and variances of X and Y are easily computed as:

   E(X) = 1/3      E(Y) = 1      V(X) = 2/9      V(Y) = 2/3

Evaluate E(XY):

   E(XY) = Σ_{r=0}^{1} Σ_{s=0}^{2} r·s·P(X = r, Y = s)

There are six terms:

   E(XY) = (0·0)·1/3 + (0·1)·0 + (0·2)·1/3 + (1·0)·0 + (1·1)·1/3 + (1·2)·0 = 1/3

Evaluate the covariance:

   W(X, Y) = E(XY) − E(X)·E(Y) = 1/3 − 1/3 × 1 = 0

The covariance is zero so the correlation is zero too. This result though is not obvious
from the start because the random variables are clearly not independent. By inspection,
each elementary event has probability 0 or 1/3 but the six products P(X = r)·P(Y = s)
are all either 2/9 or 1/9.

Footnote about Correlation


If the variables are independent the correlation is zero but. . .
If the correlation is zero the variables are not necessarily independent.

– 5.5 –
One Random Variable versus Two
Given a single random variable X which has some value r (r ∈ N), three entities are of
immediate interest:
I P(X = r) The associated set of probabilities
II E(X) The expectation
III V(X) The variance
If X now appears in conjunction with a second random variable Y which has some value
s (s ∈ N), three further entities may be of interest:
I P(X = r, Y = s) The set of probabilities of the composite elementary events
II E(X + Y ) The expectation of the sum
III V(X + Y ) The variance of the sum
The only reason for singling out the expectation and variance of the sum (rather than, say,
the product) of X and Y is that the formulae for deriving these values have been discussed
at length.
Concentrating on the sum for a moment, there is one entry which is conspicuous by its
absence from the list. This is the probability of the sum. Before such an entry can be
added, some meaning has to be given to the concept. If you throw two dice and obtain
two values r and s you can trivially note their sum. This is a derived random variable.
It is then perfectly reasonable to ask what the probability is of this sum being, say, 7. One
might express the probability as P(X + Y = t) where t is the sum from some particular
outcome.
Here are some preliminary observations:
• Determining the expectation of the sum, E(X + Y ), is easy. This is simply the sum of
the expectations. It is not necessary for X and Y to be independent and they don’t
have to be identically distributed.
• Determining the variance of the sum, V(X + Y ), is not quite so easy unless X and
Y satisfy the restriction that they are independent. The variance of the sum is then the
sum of the variances. The variables do not have to be identically distributed.
• Determining the probability of the sum, P(X +Y = t), is difficult and will be discussed
further. Certainly no progress can be made by simply adding probabilities whatever
restrictions are imposed.

The Two-Dice Example


The following table shows the probabilities associated with a fair die:

          r →
   X          0     1     2     3     4     5     6
   P(X = r)   0    1/6   1/6   1/6   1/6   1/6   1/6

– 5.6 –
Here X is the random variable and P(X = r) is the probability of throwing value r. Take
a second fair die with an associated random variable Y ; the probabilities P(Y = s) are as
for X.
Assuming independence, the 36 pairs which do not involve zero are all equiprobable with
the probability for each being 1/36.
A simple approach to investigating the probabilities of the different sums of the two dice
scores is to consider each possible sum in turn. It is impossible to score a sum of 0 or 1
since at least one die would have to show zero and P(X = 0) = P(Y = 0) = 0. The
minimum sum is 2, which is obtained by throwing two 1s.
So far, P(X + Y = 0) = 0, P(X + Y = 1) = 0 and P(X + Y = 2) = 1/36 (note that there is
only one way of throwing two 1s).
A sum of 3 can be obtained in two ways, 1 + 2 and 2 + 1, and so P(X + Y = 3) = 2/36. It is
straightforward to consider the other possible sums and obtain the following table:

          t →
   X + Y          0   1    2      3      4      5      6      7      8      9      10     11     12
   P(X + Y = t)   0   0   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36

Note that the most-probable sum is 7 since this may be obtained in six ways: 1 + 6, 2 + 5,
3 + 4, 4 + 3, 5 + 2 and 6 + 1.
This duly achieves the goal of determining the values of P(X + Y = t) but it has been
fairly hard work even in this simple case! This table provides an opportunity for a direct
illustration of the formulae E(X + Y ) = E(X) + E(Y ) and V(X + Y ) = V(X) + V(Y )
(when X and Y are independent). . .
   E(X + Y) = Σ_{t=0}^{12} t·P(X + Y = t)
            = (0·0 + 1·0 + 2·1 + 3·2 + 4·3 + 5·4 + 6·5 + 7·6 + 8·5 + 9·4 + 10·3 + 11·2 + 12·1)/36
            = 252/36 = 7 = E(X) + E(Y)    (= 7/2 + 7/2 = 7)

   E((X + Y)²) = Σ_{t=0}^{12} t²·P(X + Y = t)
               = (0·0 + 1·0 + 4·1 + 9·2 + 16·3 + 25·4 + 36·5 + 49·6
                  + 64·5 + 81·4 + 100·3 + 121·2 + 144·1)/36 = 1974/36 = 329/6

These two results are used for the variance:

   V(X + Y) = E((X + Y)²) − (E(X + Y))² = 329/6 − 49 = 35/6 = V(X) + V(Y)    (= 35/12 + 35/12 = 35/6)

– 5.7 –
A Polynomial with Probabilities as Coefficients
Determining the table of values of P(X + Y = t) is tedious and fortunately there is an
alternative approach. . .
Consider the following function G(η) which is a sixth-order polynomial whose coefficients
are the probabilities associated with a fair die:

   G(η) = 0η⁰ + (1/6)η¹ + (1/6)η² + (1/6)η³ + (1/6)η⁴ + (1/6)η⁵ + (1/6)η⁶

The coefficients could be taken straight from the table giving values of P(X = r) or
P(Y = s).
What is interesting is to note the outcome of multiplying this polynomial by itself:

   G(η)·G(η) = 0η⁰ + 0η¹ + (1/36)η² + (2/36)η³ + (3/36)η⁴ + (4/36)η⁵ + (5/36)η⁶ + (6/36)η⁷
                   + (5/36)η⁸ + (4/36)η⁹ + (3/36)η¹⁰ + (2/36)η¹¹ + (1/36)η¹²

The coefficients in this product polynomial are exactly those in the table of values for
P(X + Y = t).
There will be more to say about functions like G(η).
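Multiplying two such polynomials amounts to convolving their coefficient lists, which is a
small programming exercise in its own right. A hedged sketch in ML, assuming coefficients
are held as reals; the names convolve and die are illustrative and not part of the notes
(exercise 4 at the end of this chapter asks for an exact, fractional version).

   (* coefficient t of the product of two polynomials given as coefficient lists *)
   fun convolve (ps, qs) =
       let val n = length ps
           val m = length qs
           fun coeff t =
               let fun go r =
                       if r >= n then 0.0
                       else (if t - r >= 0 andalso t - r < m
                             then List.nth (ps, r) * List.nth (qs, t - r)
                             else 0.0) + go (r + 1)
               in  go 0  end
       in  List.tabulate (n + m - 1, coeff)  end;

   val die     = 0.0 :: List.tabulate (6, fn _ => 1.0 / 6.0);   (* coefficients of G(η) *)
   val twoDice = convolve (die, die);   (* [0, 0, 1/36, 2/36, ..., 2/36, 1/36] *)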

Glossary
The following technical terms have been introduced:
Independent and Identically Distributed, IID Correlation Coefficient

Exercises — V
Work in fractions.
1. The Problem of Points is the ‘founding problem’ of probability theory. It was discussed
by Pascal and Fermat in a famous correspondence in the summer of 1654. A and B
stake equal money on winning a simple coin-tossing game in which the winner of each
toss scores a point, the first to reach an agreed number of points winning the game.
However, the game is interrupted when A still needs two points to win, and B three
(Pascal’s actual example). In what proportion should the total stake be divided and
returned to the players? Display the sample space on a lattice diagram, or otherwise,
and find the probabilities, and hence the answer, by enumeration.
2. Determine the values of the probabilities P(X + Y = t) when (a) two dice are linked
so r = s and (b) two dice are reverse-linked so r + s = 7. Follow the analysis on
page 5.7 and directly evaluate the expectations of X + Y and (X + Y )2 then, from
these expectations, determine the variance of X + Y . Note that E(X + Y ) is, in both
cases, the same as for independent dice and that V(X + Y ), in both cases, is not.

– 5.8 –
3. Suppose X is a random variable whose value is the outcome of throwing a fair die
and suppose Y is a random variable whose value is the number of heads which show
when 4 fair coins are tossed. Tabulate the values of the probabilities P(X + Y = t)
and directly evaluate the expectations of X + Y and (X + Y )2 then, from these
expectations, determine the variance of X + Y . Check that E(X + Y ) = E(X) + E(Y )
and V(X + Y ) = V(X) + V(Y ).
4. Write an ML function dice of type int -> int list * int which is used thus:
- dice 1;
> ([0,1,1,1,1,1,1],6) : int list * int
-
- dice 2;
> ([0,0,1,2,3,4,5,6,5,4,3,2,1],36) : int list * int

The function takes an int argument n and evaluates the probabilities associated with
the sum of the scores when n fair dice are thrown. The result is a two-tuple whose first
component is an int list of the numerators of the probabilities and whose second
component is an int being the denominator common to all the probabilities.
Hint: it may be helpful to note that one way of generating the list in dice 2 is to
sum the following lists, element by element:
[0,0,0,0,0,0,0,1,1,1,1,1,1]
[0,0,0,0,0,0,1,1,1,1,1,1]
[0,0,0,0,0,1,1,1,1,1,1]
[0,0,0,0,1,1,1,1,1,1]
[0,0,0,1,1,1,1,1,1]
[0,0,1,1,1,1,1,1]

– 5.9 –
6 — PROBABILITY GENERATING FUNCTIONS

Certain derivations presented in this course have been somewhat heavy on algebra. For
example, determining the expectation of the Binomial distribution (page 5.1) turned out to
be fairly tiresome. Another example of hard work was determining the set of probabilities
associated with a sum, P(X + Y = t). Many of these tasks are greatly simplified by using
probability generating functions. . .

Probability Generating Functions — Introduction


A polynomial whose coefficients are the probabilities associated with the different outcomes
of throwing a fair die was given on page 5.8:

1 1 1 1 1 1
G(η) = 0η 0 + η 1 + η 2 + η 3 + η 4 + η 5 + η 6
6 6 6 6 6 6

The square of this polynomial is a higher-order polynomial whose coefficients are the
probabilities associated with the different sums which can occur when two dice are thrown.
There is nothing special about a fair die. The set of probabilities associated with almost
any discrete distribution can be used as the coefficients of G(η) whose general form is:

   G(η) = P(X = 0)η⁰ + P(X = 1)η¹ + P(X = 2)η² + P(X = 3)η³ + P(X = 4)η⁴ + ···        (6.1)

This is a power series which, for any particular distribution, is known as the associated
probability generating function. Commonly one uses the term generating function, without
the attribute probability, when the context is obviously probability. Generating functions
have interesting properties and can often greatly reduce the amount of hard work which is
involved in analysing a distribution.
The crucial point to notice, in the power series expansion of G(η), is that the coefficient
of η^r is the probability P(X = r).

Properties of Generating Functions


It is to be noted that G(0) = P(X = 0) and, rather more importantly, that:

   G(1) = P(X = 0) + P(X = 1) + P(X = 2) + ··· = Σ_r P(X = r) = 1

In the particular case of the generating function for the fair die:

   G(1) = 0 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1

– 6.1 –
Next, consider G(η) together with its first and second derivatives G′(η) and G″(η) (the
differentiation is with respect to η of course):

   G(η)  = P(X = 0)η⁰ + P(X = 1)η¹ + P(X = 2)η² + P(X = 3)η³ + P(X = 4)η⁴ + ···
   G′(η) = 1 P(X = 1)η⁰ + 2 P(X = 2)η¹ + 3 P(X = 3)η² + 4 P(X = 4)η³ + ···
   G″(η) = 2·1 P(X = 2)η⁰ + 3·2 P(X = 3)η¹ + 4·3 P(X = 4)η² + ···

Now, consider G(1), G′(1) and G″(1):

   G(1)  = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + ···
   G′(1) = 1 P(X = 1) + 2 P(X = 2) + 3 P(X = 3) + 4 P(X = 4) + ···
   G″(1) = 2·1 P(X = 2) + 3·2 P(X = 3) + 4·3 P(X = 4) + ···

At this stage, recall the general formula for the expectation of an arbitrary function of a
random variable:

   E(f(X)) = Σ_r f(r)·P(X = r)

Then, express G(1), G′(1) and G″(1) in sigma notation and derive the following results:

   G(1)  = Σ_r P(X = r) = 1
   G′(1) = Σ_r r·P(X = r) = E(X)                                                      (6.2)
   G″(1) = Σ_r r(r − 1)·P(X = r) = E(X(X − 1))

By differentiating the generating function one can directly obtain the expectation E(X).
By differentiating again one can obtain E(X(X − 1)). These two results lead to the rapid
derivation of V(X). Note:

   G″(1) + G′(1) − (G′(1))² = E(X(X − 1)) + E(X) − (E(X))²
                            = E(X²) − E(X) + E(X) − (E(X))²
                            = E(X²) − (E(X))²
                            = V(X)                                                    (6.3)

A generating function is particularly helpful when the probabilities, as coefficients, lead to
a power series which can be expressed in a simplified form. With many of the commonly-
used distributions, the probabilities do indeed lead to simple generating functions. Often
it is quite easy to determine the generating function by simple inspection. Some examples
will be considered. . .
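Results (6.2) and (6.3) translate directly into a short program. The following ML sketch
(not part of the original notes; the function name moments is invented for illustration)
takes the probabilities P(X = 0), P(X = 1), . . . as a real list and returns the pair
(E(X), V(X)) by accumulating G'(1) and G''(1):

    (* Sketch: E(X) and V(X) from a list of probabilities, using (6.2) and (6.3) *)
    fun moments ps =
        let
            (* accumulate G'(1) = sum of r.P(X=r) and G''(1) = sum of r(r-1).P(X=r) *)
            fun accum ([], _, g1, g2) = (g1, g2)
              | accum (p :: rest, r, g1, g2) =
                    accum (rest, r + 1, g1 + real r * p, g2 + real (r * (r - 1)) * p)
            val (g1, g2) = accum (ps, 0, 0.0, 0.0)
        in
            (g1, g2 + g1 - g1 * g1)                  (* (E(X), V(X)) *)
        end;

    (* Fair die: P(X = 0) = 0 and P(X = r) = 1/6 for r = 1 to 6 *)
    val fairDie = 0.0 :: List.tabulate (6, fn _ => 1.0 / 6.0);
    val (mean, variance) = moments fairDie;          (* 3.5 and 35/12 *)

Applied to the fair die this gives 7/2 and 35/12, the values which will be derived
analytically later in this chapter.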

– 6.2 –
The Binomial Distribution
The set of probabilities for the Binomial distribution can be defined as:
P(X = r) = (n choose r) p^r q^(n−r)    where r = 0, 1, . . . , n

Accordingly, from (6.1), the generating function is:

G(η) = (n choose 0) p^0 q^n η^0 + (n choose 1) p^1 q^(n−1) η^1 + (n choose 2) p^2 q^(n−2) η^2 + (n choose 3) p^3 q^(n−3) η^3 + · · ·

     = (n choose 0) (pη)^0 q^n + (n choose 1) (pη)^1 q^(n−1) + (n choose 2) (pη)^2 q^(n−2) + (n choose 3) (pη)^3 q^(n−3) + · · ·

This last expression is easily recognised as the expansion of (q + pη)^n so the generating
function and its first two derivatives are:

G(η)   = (q + pη)^n
G'(η)  = n(q + pη)^(n−1) p
G''(η) = n(n − 1)(q + pη)^(n−2) p^2

Accordingly:

G(1)   = (q + p)^n = 1^n = 1
G'(1)  = n(q + p)^(n−1) p = n.1^(n−1) p = np
G''(1) = n(n − 1)(q + p)^(n−2) p^2 = n(n − 1).1^(n−2) p^2 = n(n − 1)p^2

By (6.2), G'(1) = E(X) = np. This result was derived very much more tediously on
page 5.1.
The variance is derived from (6.3):

V(X) = G''(1) + G'(1) − [G'(1)]^2
     = n(n − 1)p^2 + np − (np)^2
     = n^2 p^2 − np^2 + np − n^2 p^2
     = np(1 − p)
     = npq

This result was derived on page 5.2.
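As a numerical cross-check (again a sketch, not part of the original notes), the
Binomial(n, p) probabilities can be generated explicitly and handed to the moments
function sketched earlier; the results agree with np and npq:

    (* Sketch: the Binomial(n, p) probabilities as a real list *)
    fun choose (n, r) =
        if r = 0 then 1.0
        else real (n - r + 1) / real r * choose (n, r - 1);

    fun binomProbs (n, p) =
        List.tabulate (n + 1, fn r =>
            choose (n, r) * Math.pow (p, real r) * Math.pow (1.0 - p, real (n - r)));

    val (m, v) = moments (binomProbs (8, 0.3));      (* 2.4 = np and 1.68 = npq *)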

– 6.3 –
The Geometric Distribution
The set of probabilities for the Geometric distribution can be defined as:

P(X = r) = q^r p    where r = 0, 1, . . .

Remember, this represents r successive failures (each of probability q) before a single


success (probability p). Note that r is unbounded; there can be an indefinite number of
failures before the first success.
From (6.1), the generating function is:

G(η) = q^0 pη^0 + q^1 pη^1 + q^2 pη^2 + q^3 pη^3 + · · ·
     = p[(qη)^0 + (qη)^1 + (qη)^2 + (qη)^3 + · · ·]

The item in brackets is easily recognised as an infinite geometric progression, the expansion
of (1 − qη)^−1, so the generating function and its first two derivatives are:

G(η)   = p(1 − qη)^−1
G'(η)  = p(1 − qη)^−2 q
G''(η) = 2p(1 − qη)^−3 q^2

Accordingly:
G(1)   = p(1 − q)^−1 = p.p^−1 = 1
G'(1)  = p(1 − q)^−2 q = (p/p^2) q = q/p
G''(1) = 2p(1 − q)^−3 q^2 = 2(p/p^3) q^2 = 2q^2/p^2

By (6.2), E(X) = q/p.

The variance is derived from (6.3):

V(X) = G''(1) + G'(1) − [G'(1)]^2
     = 2q^2/p^2 + q/p − q^2/p^2
     = (q/p)(q/p + 1)
     = (q/p)((q + p)/p)
     = q/p^2

Both the expectation and the variance of the Geometric distribution are difficult to derive
without using the generating function.

– 6.4 –
The Poisson Distribution
The set of probabilities for the Poisson distribution can be defined as:

P(X = r) = (λ^r/r!) e^−λ    where r = 0, 1, . . .

This was introduced as the probability of r murders in a year when the average over a long
period is λ murders in a year. As with the Geometric distribution r is unbounded; there
can, in principle, be an indefinite number of murders in a year.
From (6.1), the generating function is:

G(η) = (λ^0/0!) e^−λ η^0 + (λ^1/1!) e^−λ η^1 + (λ^2/2!) e^−λ η^2 + (λ^3/3!) e^−λ η^3 + · · ·

     = [(λη)^0/0! + (λη)^1/1! + (λη)^2/2! + (λη)^3/3! + · · ·] e^−λ

The item in brackets is easily recognised as an exponential series, the expansion of e^(λη),
so the generating function and its first two derivatives are:

G(η)   = e^(λη) e^−λ
G'(η)  = λ e^(λη) e^−λ
G''(η) = λ^2 e^(λη) e^−λ

Accordingly:
G(1)   = e^λ e^−λ = 1
G'(1)  = λ e^λ e^−λ = λ
G''(1) = λ^2 e^λ e^−λ = λ^2

By (6.2), E(X) = λ.

The variance is derived from (6.3):

V(X) = G''(1) + G'(1) − [G'(1)]^2
     = λ^2 + λ − λ^2
     = λ

The expectation of the Poisson distribution was derived without great difficulty on page 4.5
and it was pointed out that λ was the obvious result since the analysis of the Poisson
distribution began by taking λ as the expectation. The variance has not been derived
before and it is interesting to note that the variance and the expectation are the same.

– 6.5 –
The Uniform Distribution
In all the illustrations so far, a crucial step has been to recognise some expansion so that
the generating function can be reduced to a simple form. In the case of the distribution
for a fair die, the generating function cannot be simplified and little benefit accrues from
using it to determine the expectation and variance.
Nevertheless the generating function can be used and the following analysis is a final
illustration of the use of generating functions to derive the expectation and variance of a
distribution.
The generating function and its first two derivatives are:

G(η)   = 0η^0 + (1/6)η^1 + (1/6)η^2 + (1/6)η^3 + (1/6)η^4 + (1/6)η^5 + (1/6)η^6
G'(η)  = 1.(1/6)η^0 + 2.(1/6)η^1 + 3.(1/6)η^2 + 4.(1/6)η^3 + 5.(1/6)η^4 + 6.(1/6)η^5
G''(η) = 2.(1/6)η^0 + 6.(1/6)η^1 + 12.(1/6)η^2 + 20.(1/6)η^3 + 30.(1/6)η^4

Accordingly:
G(1)   = 0 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1
G'(1)  = (1 + 2 + 3 + 4 + 5 + 6)/6 = 7/2
G''(1) = (2 + 6 + 12 + 20 + 30)/6 = 70/6 = 35/3

By (6.2), E(X) = 7/2.

The variance is derived from (6.3):

V(X) = G''(1) + G'(1) − [G'(1)]^2
     = 35/3 + 7/2 − 49/4
     = (140 + 42 − 147)/12
     = 35/12

The expectation and variance for a fair die were derived on pages 3.9 and 4.2 and the
present calculations are almost identical. This is not a fruitful use of generating functions.
Nevertheless, the generating function for a fair die can be very well worth exploiting in a
quite different way. . .

– 6.6 –
The Generating Function as a Special Expectation
The general form of G(η), given in (6.1), is:

G(η) = P(X = 0)η^0 + P(X = 1)η^1 + P(X = 2)η^2 + P(X = 3)η^3 + P(X = 4)η^4 + · · ·

In sigma notation this can be rewritten:

G(η) = Σ_r P(X = r)η^r

Recall again the general formula for the expectation of an arbitrary function of a random
variable:

E[f(X)] = Σ_r f(r).P(X = r)

Taking f(X) = η^X as a function of X:

E[η^X] = Σ_r η^r.P(X = r)

This is G(η) and the identity:

G(η) = E[η^X]                                                      (6.4)

is the starting point for some important theory. . .

The Sum of two Independent Random Variables — A Theorem


Suppose X and Y are two independent random variables whose values are r and s where
r, s = 0, 1, 2, . . ..
Let GX (η) be the generating function associated with X and GY (η) be the generating
function associated with Y . From (6.4):
" # " #
GX (η) = E η X and GY (η) = E η Y

Next, consider the generating function associated with X + Y and call this GX+Y. Again
from (6.4):

GX+Y(η) = E[η^(X+Y)]

From the general formula for the expectation of a function of random variables:

GX+Y(η) = E[η^(X+Y)] = Σ_r Σ_s η^(r+s) P(X = r, Y = s) = Σ_r Σ_s η^r η^s P(X = r).P(Y = s)

Note that P(X = r, Y = s) may be split only because X and Y are independent.

– 6.7 –
The function after the double-sigma sign can be separated into the product of two terms,
the first of which does not depend on s and the second of which does not depend on r.
Hence:
GX+Y(η) = [Σ_r η^r.P(X = r)] [Σ_s η^s.P(Y = s)]
        = E[η^X].E[η^Y]
        = GX(η).GY(η)
In words: the generating function of the sum of two independent random variables is the
product of the separate generating functions of the random variables.
This is a most useful theorem.

Two Dice
The theorem GX+Y (η) = GX (η).GY (η) brings the analysis back to the starting point.
When the generating function for a fair die was introduced on page 5.8, it was shown
without explanation that squaring this generating function gives the generating function
for the sum of two dice.
The explanation is now clear. The two dice were identical so GX = GY = G and the
square, G2 , gave the generating function for X + Y . In summary:
GX(η) = GY(η) = G(η) = 0η^0 + (1/6)η^1 + (1/6)η^2 + (1/6)η^3 + (1/6)η^4 + (1/6)η^5 + (1/6)η^6

and:

GX+Y(η) = [G(η)]^2 = 0η^0 + 0η^1 + (1/36)η^2 + (2/36)η^3 + (3/36)η^4 + (4/36)η^5 + (5/36)η^6 + (6/36)η^7
                   + (5/36)η^8 + (4/36)η^9 + (3/36)η^10 + (2/36)η^11 + (1/36)η^12
Note that although both X and Y are distributed Uniform(1,6), the distribution of X + Y
is emphatically not uniform. The distribution of X + Y is triangular with a peak in the
middle.
If the number of dice is increased, inspection of G^2, G^3, G^4 and so on shows that the
shape of the distribution continues to change and tends to that of a symmetrical Binomial
distribution (one in which p = q = 1/2).
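In programming terms the theorem says that the probability list for X + Y is the
convolution of the two separate probability lists, that is the coefficient list of the
product polynomial. The following ML sketch (not part of the original notes; the names
are invented for illustration) reproduces the coefficients just listed:

    (* Sketch: coefficients of the product of two polynomials given as coefficient lists *)
    fun convolve (xs, ys) =
        let
            fun coeff k =
                let
                    fun term (i, x, acc) =
                        if k - i >= 0 andalso k - i < length ys
                        then acc + x * List.nth (ys, k - i)
                        else acc
                    fun walk (_, [], acc) = acc
                      | walk (i, x :: rest, acc) = walk (i + 1, rest, term (i, x, acc))
                in
                    walk (0, xs, 0.0)
                end
        in
            List.tabulate (length xs + length ys - 1, coeff)
        end;

    val die = 0.0 :: List.tabulate (6, fn _ => 1.0 / 6.0);
    val twoDice = convolve (die, die);
    (* element t of twoDice is P(X + Y = t): 0, 0, 1/36, 2/36, ..., 6/36, ..., 1/36 *)

Repeated application of convolve gives the coefficients of G^3, G^4 and so on.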

Two Binomial Distributions


On reflection, there has never been any reason to suppose that combining two Uniform
distributions (in the manner of throwing two dice and adding their scores) would produce
another Uniform distribution. By contrast, combining two Binomial distributions can
produce another Binomial distribution. . .
Imagine that you have a pile of identical biased coins, each of which has a probability p
of landing heads when tossed. Suppose you take 3 of these coins for yourself and hand 5
others to a friend.

– 6.8 –
You then toss your coins and count the number of heads, clearly a number in the range 0
to 3. Your friend (independently) tosses the other coins and again counts the number of
heads, a number in the range 0 to 5. You then add your head count to your friend’s head
count.
Let X be the random number whose value r is the number of heads you obtain and Y be
the random number whose value s is the number of heads your friend obtains. Clearly,
X is distributed Binomial(3, p) and Y is distributed Binomial(5, p). It will be shown that
X + Y is distributed Binomial(8, p).
Given that X and Y are Binomially distributed:
$ % $ %
3 r 3−r 5 s 5−s
P(X = r) = p q and P(Y = s) = p q
r s

where r is in the range 0 to 3 and s is in the range 0 to 5.


From page 6.3, the generating functions associated with X and Y are:

GX(η) = (q + pη)^3    and    GY(η) = (q + pη)^5

From the theorem:

GX+Y(η) = GX(η).GY(η)
        = (q + pη)^3 (q + pη)^5
        = (q + pη)^8

Now (q + pη)^8 is the generating function for a random variable which is distributed
Binomial(8, p). Accordingly, combining the two random variables X and Y , which are
distributed Binomial(3, p) and Binomial(5, p), produces a random variable X + Y which is
distributed Binomial(8, p).
On reflection, this result is hardly surprising. A third party watching you and your friend
could simply regard each trial as the tossing of 8 independent coins. The fact that the
tossing employs two people is neither here nor there. The total number of heads is clearly
distributed Binomial(8, p).
There is nothing special about the values 3 and 5 and the analysis readily generalises to
the case where you have n1 coins and your friend has n2 coins. The only restrictions are
that all n1 + n2 coins must be independent and identically distributed.
Each coin individually is, of course, distributed Binomial(1, p) and its generating function
is G(η) = (q + pη)^1. For an arbitrary n such coins the generating function is:

Gn(η) = [G(η)]^n = (q + pη)^n

This is the generating function of a random variable distributed Binomial(n, p).

Glossary
The following technical term has been introduced:
Probability Generating Function

– 6.9 –
Exercises — VI
1. A board game requires you to ‘throw a 6 to start’. How many throws do you expect
to have to make before the throw which delivers you the required 6?

2. Consider again the battling Knights of Exercises II, question 7. Supposing battling
Knights became a popular form of entertainment, promoters of such tournaments
would wish to inform spectators of the expected length of a battle. Taking a round
as the unit of measure, just what is the expected length of a battle?

3. If a random variable X is distributed Poisson(λ), show that:

   E[X(X − 1)(X − 2) . . . (X − k)] = λ^(k+1)

4. Use generating functions to find the distribution of X + Y , where X and Y are


independent random variables distributed Binomial(nX , pX ) and Binomial(nY , pY )
respectively, for the case where nX and nY are arbitrary positive integers (they may
or may not be equal) but pX = pY . In the alternative case, where nX = nY but
pX "= pY , find the mean and variance of X + Y . Is the distribution Binomial?
5. [From Part IA of the Computer Science Tripos, 1993] X and Y are independent
random variables having Poisson distributions with parameters α and β respectively.
By using probability generating functions, or otherwise, prove that X+Y has a Poisson
distribution and give its parameter.
Find the conditional distribution for X given that X + Y = n, and give its mean and
variance. Explain your result in words.

– 6.10 –
7 — DIFFERENCE EQUATIONS

Many problems in Probability give rise to difference equations. Difference equations relate
to differential equations as discrete mathematics relates to continuous mathematics.
Anyone who has made a study of differential equations will know that even supposedly
elementary examples can be hard to solve. By contrast, elementary difference equations
are relatively easy to deal with.
Aside from Probability, Computer Scientists take an interest in difference equations for a
number of reasons. For example, difference equations frequently arise when determining
the cost of an algorithm in big-O notation. Since difference equations are readily handled
by program, a standard approach to solving a nasty differential equation is to convert it
to an approximately equivalent difference equation.

Classification of Difference Equations


As with differential equations, one can refer to the order of a difference equation and note
whether it is linear or non-linear and whether it is homogeneous or inhomogeneous.
The present discussion will almost exclusively be confined to linear second order difference
equations both homogeneous and inhomogeneous.

Notation Convention
A trivial example stems from considering the sequence of odd numbers starting from 1.
The associated difference equation might be specified as:

f (n) = f (n − 1) + 2 given that f (1) = 1

In words: term n in the sequence is two more than term n−1. The proviso, f (1) = 1,
constitutes an initial condition. The first term in the sequence is 1.
A term like f (n) so strongly suggests a continuous function that many writers prefer to
use a subscript notation. The present difference equation would be presented as:

un = un−1 + 2 given that u1 = 1 (7.1)

This is the notation which will be used below. It is strongly implicit that n is an integer.
In simple cases, a difference equation gives rise to an associated auxiliary equation (first
explained in (7.2) overleaf). In the case of (7.1) the associated auxiliary equation is:

w^1 − 1 = 0

The highest power of the polynomial in w is 1 and, accordingly, (7.1) is said to be a first
order difference equation. If the constant term 2 were absent from (7.1), the equation
would be homogeneous but its presence makes it inhomogeneous.
Some standard techniques for solving elementary difference equations analytically will now
be presented. . .

– 7.1 –
Second Order Homogeneous Linear Difference Equation — I
To solve:
un = un−1 + un−2 given that u0 = 1 and u1 = 1

transfer all the terms to the left-hand side:

un − un−1 − un−2 = 0

The zero on the right-hand side signifies that this is a homogeneous difference equation.
Guess:
un = A w^n

so:
A w^n − A w^(n−1) − A w^(n−2) = 0

and:
w^2 − w − 1 = 0                                                    (7.2)

This is the auxiliary equation associated with the difference equation. Being a quadratic,
the auxiliary equation signifies that the difference equation is of second order.
The two roots are readily determined:
w1 = (1 + √5)/2    and    w2 = (1 − √5)/2

For any A1 substituting A1 w1^n for un in un − un−1 − un−2 yields zero.
For any A2 substituting A2 w2^n for un in un − un−1 − un−2 yields zero.

This suggests a general solution:

un = A1 w1^n + A2 w2^n

Check by substituting into un − un−1 − un−2 thus:

(A1 w1^n + A2 w2^n) − (A1 w1^(n−1) + A2 w2^(n−1)) − (A1 w1^(n−2) + A2 w2^(n−2))

This, rearranged, is:

A1 w1^(n−2) (w1^2 − w1 − 1) + A2 w2^(n−2) (w2^2 − w2 − 1)

and this is zero because both expressions in brackets are zero.

On substituting the values of w1 and w2 the general solution is:

un = A1 ((1 + √5)/2)^n + A2 ((1 − √5)/2)^n

– 7.2 –
but, by noting initial conditions, values for A1 and A2 can be obtained. . .
Note:
u0 = 1 so A1 + A2 = 1 and A2 = 1 − A1

Likewise:
u1 = 1  so  (A1(1 + √5) + (1 − A1)(1 − √5))/2 = 1
so:
A1(1 + √5) + 1 − √5 − A1(1 − √5) = 2

A1(1 + √5 − 1 + √5) = 2 − 1 + √5

A1(2√5) = 1 + √5

Hence:
A1 = (1 + √5)/(2√5)
and:
A2 = 1 − A1 = 1 − (1 + √5)/(2√5) = (2√5 − 1 − √5)/(2√5) = (−1 + √5)/(2√5) = −(1 − √5)/(2√5)

In consequence:
un = ((1 + √5)/(2√5)) ((1 + √5)/2)^n − ((1 − √5)/(2√5)) ((1 − √5)/2)^n

giving:
un = (1/√5) [((1 + √5)/2)^(n+1) − ((1 − √5)/2)^(n+1)]              (7.3)

as the final solution.



Observe that despite the √5s:

u0 = 1, u1 = 1, u2 = 2, u3 = 3, u4 = 5, etc.
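A numerical check of (7.3) is easily programmed. The following ML sketch (not part of
the original notes) evaluates the closed form and, for comparison, iterates the difference
equation directly:

    (* Sketch: the closed form (7.3) versus direct iteration of un = un-1 + un-2 *)
    fun closedForm n =
        let
            val r5 = Math.sqrt 5.0
            val w1 = (1.0 + r5) / 2.0
            val w2 = (1.0 - r5) / 2.0
        in
            (Math.pow (w1, real (n + 1)) - Math.pow (w2, real (n + 1))) / r5
        end;

    fun iterated 0 = 1                    (* naive recursion, fine for small n *)
      | iterated 1 = 1
      | iterated n = iterated (n - 1) + iterated (n - 2);

    (* map closedForm [0,1,2,3,4,5] gives 1.0, 1.0, 2.0, 3.0, 5.0, 8.0 up to rounding *)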

– 7.3 –
Second Order Homogeneous Linear Difference Equation — II
To solve:
un = p un+1 + q un−1 given that u0 = 0, ul = 1 and p+q =1

Transfer all the terms to the left-hand side:


p un+1 − un + q un−1 = 0

Guess:
un = A w^n
so:
pA w^(n+1) − A w^n + qA w^(n−1) = 0

pw^2 − w + q = 0

pw^2 − (p + q)w + q = 0

(w − 1)(pw − q) = 0

The two roots are:
w1 = 1    and    w2 = q/p

This suggests a general solution:

un = A1 (1)^n + A2 (q/p)^n    provided p ≠ q                       (7.4)

Check by substituting into p un+1 − un + q un−1 thus:

[pA1 (1)^(n+1) + pA2 (q/p)^(n+1)] − [A1 (1)^n + A2 (q/p)^n] + [qA1 (1)^(n−1) + qA2 (q/p)^(n−1)]

This, rearranged, is:

A1 [p − 1 + q] + A2 (q/p)^(n−1) [p(q/p)^2 − q/p + q]

which, given that p + q = 1, is:

A2 (q/p)^(n−1) [q^2/p − q/p + q] = A2 (q/p)^(n−1) [(q/p)(q − 1) + q] = A2 (q/p)^(n−1) [(q/p)(−p) + q] = 0

The next step is to determine values for A1 and A2 in the general solution. . .

– 7.4 –
The general solution was determined to be:

un = A1 (1)^n + A2 (q/p)^n    provided p ≠ q

Note:
u0 = 0  so  A1 + A2 = 0

Likewise:
ul = 1  so  A1 + A2 (q/p)^l = 1
so:
−A2 + A2 (q/p)^l = 1  and thus  1 = A2 [(q/p)^l − 1]  giving  A2 = 1/((q/p)^l − 1)

and:
A1 = −A2 = −1/((q/p)^l − 1)

In consequence:
un = −1/((q/p)^l − 1) + (q/p)^n/((q/p)^l − 1)

giving:
un = ((q/p)^n − 1)/((q/p)^l − 1)

as the final solution.

Observations about the solution:

First, u0 = 0 and ul = 1 as required.
Secondly, suppose 0 ≪ n ≪ l (e.g.: l = 57 and n = 41). . .

If q/p < 1 [when (q/p)^i → 0 for large i] the solution un → (0 − 1)/(0 − 1) → 1.

If q/p > 1 the solution un → [(q/p)^n/(q/p)^l].[(1 − (p/q)^n)/(1 − (p/q)^l)] → [1/(q/p)^(l−n)].[(1 − 0)/(1 − 0)] → 0

In simple terms, provided n is well clear of the extremes 0 and l, un will tend to 1 or to 0
depending on whether q < p or q > p. (It has been assumed that p ≠ q.)

– 7.5 –
What about the case p = q (as for an even coin)?
Recall that w1 = 1 and w2 = q/p so the case p = q implies twin roots, w1 = w2 = 1.
The general solution un = A1 w1^n + A2 w2^n would be un = A1 + A2 which is silly.
In such cases, try a different guess:

un = (A1 + A2 n)w^n    where w is the twin root

In the present case, try:

un = (A1 + A2 n) (1)^n                                             (7.5)
as the general solution.
Check by substituting into p un+1 − un + q un−1 thus:

p [A1 + A2 (n + 1)] − [A1 + A2 n] + q [A1 + A2 (n − 1)]

This, rearranged, is:


A1 [p − 1 + q] + A2 [pn + p − n + qn − q]

which, remembering that p + q = 1, is zero.

The next step is to determine values for A1 and A2 in the general solution whose revised
form is:
un = (A1 + A2 n) (1)n

Note:
u0 = 0 so A1 = 0

Likewise:
ul = 1  so  0 + A2 l = 1  giving  A2 = 1/l
In consequence:
un = 0 + (1/l) n
giving:
un = n/l
as the final solution when the special case p = q applies.

– 7.6 –
Second Order Inhomogeneous Linear Difference Equation
To solve:

vn = 1 + p vn+1 + q vn−1 given that v0 = vl = 0 and p+q =1

Transfer all the terms except the 1 to the left-hand side:

p vn+1 − vn + q vn−1 = −1

If the right-hand side were zero, this would be identical to the homogeneous equation just
discussed. The new equation is solved in two steps. First, deem the right-hand side to be
zero and solve as for the homogeneous case:
vn = A1 (1)^n + A2 (q/p)^n    provided p ≠ q

Then, augment this solution by some f (n) which has to be given further thought:
vn = A1 (1)^n + A2 (q/p)^n + f(n)

This augmented vn has to be such that when substituted into p vn+1 − vn + q vn−1 the
result is −1.
From previous experience with un, it is known that substituting A1 (1)^n + A2 (q/p)^n gives a
result of zero. In consequence, the property required of f (n) is that on substituting it into
p vn+1 − vn + q vn−1 the result must be −1.

In this course, it will always be possible to express f(n) as the quadratic a + bn + cn^2 with
only one of the constants a, b and c non-zero. In the present case try f(n) = k n and
therefore require:
p k(n + 1) − k n + q k(n − 1) = −1

so:
p k n + p k − k n + q k n − q k = −1

Hence (p − q) k = −1 so k = 1/(q − p) giving:

vn = A1 + A2 (q/p)^n + n/(q − p)                                   (7.6)

as the general solution appropriate to the inhomogeneous difference equation. It is left as


an exercise for the reader to determine values for A1 and A2 appropriate for the initial
conditions given.

– 7.7 –
What about the case p = q?
When p = q the equation:
p vn+1 − vn + q vn−1 = −1

can be solved in two steps as before. First, deem the right-hand side to be zero and solve
as for the homogeneous case:
vn = (A1 + A2 n)(1)n

Then, augment this solution by some f (n) which has to be given further thought:

vn = (A1 + A2 n)(1)n + f (n)

As before, this augmented vn has to be such that when substituted into p vn+1 −vn +q vn−1
the result is −1 but remember that p = q this time.
Again, from previous experience with un , it is known that substituting (A1 + A2 n)(1)n
gives a result of zero. Once more, the property required of f (n) is that on substituting it
into p vn+1 − vn + q vn−1 the result must be −1.
Since p = q, it is no use this time employing the previous approach which was to try
f(n) = k n and derive k = 1/(q − p). This is not a helpful value for k!
The appropriate approach now is to try f(n) = k n^2 and require:

p k(n + 1)^2 − k n^2 + q k(n − 1)^2 = −1

so:
p k n^2 + 2p k n + p k − k n^2 + q k n^2 − 2q k n + q k = −1

Hence (p + q) k = −1 so k = −1, giving:

vn = A1 + A2 n − n^2                                               (7.7)

as the general solution appropriate to the inhomogeneous difference equation when p = q.


Note that A1 + A2 n is the solution to the homogeneous equation when p = q and −n2 is
the required augmentation.
Given the initial conditions v0 = vl = 0, it is easy to determine that A1 = 0 and A2 = l
giving:
vn = n (l − n)
as the final solution when the special case p = q applies.

Glossary
The following technical terms have been introduced:
difference equations non-linear initial condition
order homogeneous auxiliary equation
linear inhomogeneous twin roots

– 7.8 –
Exercises — VII
When solving the inhomogeneous difference equations presented in problems 1, 6 and 7,
recall that the function f (n) can, in this course, always be expressed as the quadratic
a + bn + cn^2 with only one of the constants a, b and c non-zero. You have seen as examples
f(n) = kn and f(n) = kn^2 and you should be prepared to try f(n) = k on occasions.
1. Solve the linear first order inhomogeneous difference equations given as (7.1) from
first principles.

2. Noting the expression for un given in (7.3), check that the first six values really are
1, 1, 2, 3, 5, 8.

3. Determine the values of the constants A1 and A2 in (7.4) given u0 = 1 and ul = 0.


4. Determine the values of the constants A1 and A2 in (7.5) given u0 = 1 and ul = 0.

5. Determine the values of the constants A1 and A2 in (7.6) given v0 = vl = 0.

6. Solve the following inhomogeneous equation:

un+1 − un (1 − α − β) = α given that u0 = 0

Note that α and β are constants in the range 0 to 1.

7. Solve the following inhomogeneous equation in which p is some probability:

2un + (1 − 2p)un−1 = 1 given that u0 = 0

– 7.9 –
8 — STOCHASTIC PROCESSES

The word stochastic is derived from the Greek στοχαστικός, meaning ‘to aim at a target’.
Stochastic processes involve state which changes in a random way. A Markov process is a
particular kind of stochastic process. Using discrete time, the state of a Markov process at
time n + 1 depends only on its state at time n. The classic example of a stochastic process
is the random walk. . .

Random Walk
The simplest form of the random walk problem imagines a line marked out in unit steps
or paces from some origin:

−3 −2 −1 0 1 2 3

A person or other object starts at the origin and then makes a sequence of steps, some to
the right and some to the left, at random.
It is reasonable to think of a sequence of turns. At each turn a weighted coin is tossed and
if it lands heads one step is taken to the right and if it lands tails one step is taken to the
left.
In the analysis below assume:
• Probability of a left-step (tails) is q
where p + q = 1
• Probability of a right-step (heads) is p
Consider a walk which consists of a total of n steps or turns. Let X be a random variable
whose value, r, is the number of those n steps which are to the right.
Given a total of n steps, each of which has a probability p of being a right-step, the
probability of there being r right-steps is given by the Binomial distribution:
 
P(X = r) = (n choose r) p^r q^(n−r)                                (8.1)
Usually one is interested in the net displacement. Call this k measured in net steps to the
right of the origin. Clearly:
Net displacement to the right = Total right-steps − Total left-steps
Since the total number of left-steps is n − r the net displacement k can be expressed
algebraically:
k = r − (n − r) = 2r − n    hence    r = (n + k)/2
Rewriting (8.1):

P(X = (n + k)/2) = (n choose (n + k)/2) p^((n+k)/2) q^((n−k)/2)    (8.2)

An incomplete interpretation is the following: for fixed n, this is the probability that the
number of right-steps is such as to give a net displacement of k steps to the right.

– 8.1 –
Interpretation of the Probability
Since X is the number of right-steps, its value must be an integer. Therefore the probability
P(X = (n + k)/2) requires the term (n + k)/2 to be an integer. Accordingly n + k (and
hence n − k) are even in the expression for the probability:

P(X = (n + k)/2) = (n choose (n + k)/2) p^((n+k)/2) q^((n−k)/2)

Further, −n ≤ k ≤ n.
This accords with common sense. If the total number of steps is 2 the net displacement
must be one of the three possibilities: two steps to the left, back to the start, or two steps
to the right. These correspond to values of k = −2, 0, +2. Clearly it is impossible to get
more than two units away from the origin if you take only two steps and it is equally
impossible to end up exactly one unit from the origin if you take two steps.
The following table shows the probabilities associated with the different possible values of
k for n = 1, 2, 3, 4:

 n     k     P(net = k)

 1    −1     q
       1     p

 2    −2     q^2
       0     2pq
       2     p^2

 3    −3     q^3
      −1     3pq^2
       1     3p^2 q
       3     p^3

 4    −4     q^4
      −2     4pq^3
       0     6p^2 q^2
       2     4p^3 q
       4     p^4

For given n, P(net = k) is the probability that the net displacement is k units to the right
of the origin. For each n any missing value of k (such as k = 2 when n = 3) is impossible
and P(net = k) = 0.
Notice that for each n the tabulated probabilities total 1. Thus for n = 3 the sum of the
probabilities is q^3 + 3pq^2 + 3p^2 q + p^3 = (q + p)^3 = 1.
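Expression (8.2) is easy to evaluate by program. The following ML sketch (not part of
the original notes; the names are invented for illustration) returns P(net = k) for a walk
of n steps, giving 0 for the impossible values of k:

    (* Sketch: P(net displacement = k) after n steps, from (8.2) *)
    fun choose (n, r) =
        if r = 0 then 1.0
        else real (n - r + 1) / real r * choose (n, r - 1);

    fun pNet (n, p, k) =
        if (n + k) mod 2 <> 0 orelse k < ~n orelse k > n then 0.0
        else
            let val r = (n + k) div 2                (* number of right-steps *)
            in choose (n, r) * Math.pow (p, real r) * Math.pow (1.0 - p, real (n - r))
            end;

    (* pNet (4, 0.5, 0) = 6/16 = 0.375, matching the 6 p^2 q^2 entry for n = 4 *)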

– 8.2 –
Expected Displacement and Drift
Given that X is distributed Binomial(n, p), the expectation E(X) = np. This is also the
expectation E[(n + k)/2], so:

np = E(X) = E[(n + k)/2] = (n + E(k))/2

Hence:
E(k) = 2np − n = n(2p − 1) = n(2p − (p + q)) = n(p − q)
If p = q the expected displacement is zero but if p 6= q the expected displacement is
non-zero and the walk is not expected to end at the starting point. This phenomenon is
known as drift. The expected net displacement is proportional to the number of steps so
the longer the walk the greater the drift.
The term recurrent random walk is used to describe a random walk which is certain to
return to the starting point in a finite number of steps. In the present case, the random
walk is recurrent if and only if p = q = 1/2.
The term transient random walk is used to describe a random walk which has a non-zero
probability of never returning to the starting point. In the present case, the random walk
is transient if p 6= q.

Corollary
A footnote to the random walk analysis is to consider the probability of landing on the
origin at step n. Clearly n must be even and k = 0 so, from (8.2):
 
n 1 1

1
  1 p 2 n q 2 n , if n even
P X = 2n = n
 2
0, otherwise
Remember that X is the number of right-steps. When this is 21 n the number of right-steps
is obviously the same as the number of left-steps; thus P(X = 12 n) is exactly equivalent to
P(return to origin at step n).

The Gambler’s Ruin Problem


Many stochastic processes are disguised variants of the random walk problem. One of the
best-known variants is the Gambler’s Ruin problem. You suppose there are two gamblers,
A and B, and they each have a pile of pound coins:
A has initial capital of £n
B has initial capital of £(a − n)
Play then proceeds by a sequence of turns. At each turn:
Probability(A wins the turn) = p
where p + q = 1
Probability(B wins the turn) = q
At the end of each turn one pound is transferred from the loser’s pile of pound coins to
that of the winner. Note that the total capital is £a and that stays constant.
The game ends when one player is ruined and has no money left.

– 8.3 –
A diagrammatic representation of the game at the start is:

            A’s initial capital        partition        B’s initial capital
                                           ↓
    u0                    un−1    un    un+1                              ua
    |--------------------------------------|---------------------------------|
    0      1     . . .    n−1     n     n+1        . . .        a−1           a

The horizontal line is a units long and is marked off at unit intervals numbered 0, 1, 2, . . . , a.
Point n is marked as the partition. The n units of line to the left of the partition represent
A’s initial capital and the a − n units of line to the right of the partition represent B’s
initial capital.
At each turn the partition moves one place to the right if A wins (this outcome has
probability p) and one place to the left if B wins (this outcome has probability q).
Let un = probability that A ultimately wins starting from n
Let vn = probability that B ultimately wins starting from n

It is important to note that un + vn may be less than unity. In this kind of problem there
is often the possibility of there being no winner. There could be a non-zero probability of
the game going on for ever with the partition moving backwards and forwards but never
reaching 0 or a.
The only respectable way of tackling this problem is to determine u n and vn separately
and check whether or not they sum to 1.

Probability that A wins


First, extend the notation un so that, for example:
Let un+1 = probability that A ultimately wins starting from n + 1
Let un−1 = probability that A ultimately wins starting from n − 1

Consider the position of the partition after the first turn:


• The partition is at n+1 with probability p and from n+1 the probability of ultimately
winning is un+1 .
• The partition is at n−1 with probability q and from n−1 the probability of ultimately
winning is un−1 .

Now p and q necessarily sum to 1 (unlike un + vn about which the sum is still in doubt) so
after one turn the partition must be at n + 1 (with probability p) or at n − 1 (with
probability q), from where the probabilities of ultimately winning are un+1 and un−1
respectively, so:

un = p un+1 + q un−1                                               (8.3)

– 8.4 –
This is a homogeneous difference equation. In words:
The probability of A ultimately winning from n =
(probability of first turn landing on n + 1) × (probability of winning from n + 1) +
(probability of first turn landing on n − 1) × (probability of winning from n − 1)

The difference equation (8.3) holds for n = 1, 2, . . . , (a − 1) but since the game ends when
n = 0 or n = a the equation does not hold for u0 or ua . Either leads to an invalid
right-hand side.
Points 0 and a on a random walk are known as absorbing barriers. No walk can pass these
barriers.
The absorbing barriers lead to the boundary conditions:
u0 = 0 probability that A wins when B has all the capital
ua = 1 probability that A wins when B is out of capital

From (7.4), the general solution to the difference equation is:

un = A1 (1)^n + A2 (q/p)^n    provided p ≠ q                       (8.4)

Given the boundary conditions u0 = 0 and ua = 1:

A1 + A2 = 0    and    A1 + A2 (q/p)^a = 1

Solve for A1 and A2 and back substitute in (8.4) to give:

un = ((q/p)^n − 1)/((q/p)^a − 1)                                   (8.5)

Further discussion of this will be postponed until after a solution has been found for v n . . .

Probability that B wins


As with un , first extend the notation vn so that, for example:
Let vn+1 = probability that B ultimately wins starting from n + 1
Let vn−1 = probability that B ultimately wins starting from n − 1

The difference equation is set up in exactly the same way as for un and is now:

vn = p vn+1 + q vn−1

– 8.5 –
The absorbing barriers now lead to different boundary conditions:
v0 = 1 probability that B wins when A is out of capital
va = 0 probability that B wins when A has all the capital

The general solution has exactly the same form as (8.4):

vn = A1 (1)^n + A2 (q/p)^n    provided p ≠ q                       (8.6)

Given the boundary conditions v0 = 1 and va = 0:

A1 + A2 = 1    and    A1 + A2 (q/p)^a = 0

Solve for A1 and A2 and back substitute in (8.6) to give:

vn = ((q/p)^n − (q/p)^a)/(1 − (q/p)^a)                             (8.7)

Now that both un and vn are determined, their sum can be computed:

un + vn = ((q/p)^n − 1 + (q/p)^a − (q/p)^n)/((q/p)^a − 1) = 1

Happily the sum is unity so there really is a winner.
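A short ML sketch of (8.5) and (8.7) (not part of the original notes; the names are
invented for illustration) makes it easy to experiment with particular values of p, n
and a:

    (* Sketch: probabilities that A and B ultimately win, valid only for p <> q *)
    fun winA (p, n, a) =                             (* un of (8.5) *)
        let
            val rho = (1.0 - p) / p
        in
            (Math.pow (rho, real n) - 1.0) / (Math.pow (rho, real a) - 1.0)
        end;

    fun winB (p, n, a) =                             (* vn of (8.7) *)
        let
            val rho = (1.0 - p) / p
        in
            (Math.pow (rho, real n) - Math.pow (rho, real a)) / (1.0 - Math.pow (rho, real a))
        end;

    (* winA (0.51, 50, 100) is about 0.88 and winA (0.51, 50, 100) + winB (0.51, 50, 100) is 1.0 *)

Even the small bias p = 0.51 makes A the strong favourite when both players start with
50 units of a total capital of 100.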

Observation about the Solutions


If both players have a reasonable amount of capital to start with the outcome is very
sensitive to the sign of the difference between p and q . . .
The assumptions are that the total capital a is fairly large and that 0 ≪ n ≪ a (so that
the partition is initially not close to 0 or a).
If p > q then q/p < 1 and when q/p is raised to the powers n and a small values result. By
inspection of (8.5) and (8.7), un → 1 and vn → 0. A small rightward bias makes it very
likely that the partition will end up at the right-hand end.
If p < q then p/q < 1 and when p/q is raised to the powers n and a small values result. By
inspection of (8.5) and (8.7), un → 0 and vn → 1. A small leftward bias makes it very
likely that the partition will end up at the left-hand end.

The fair Case


If p = q the solutions found for un and vn are invalid because the constants A1 and A2 in
the general solutions (8.4) and (8.6) are not independent.

– 8.6 –
From (7.5), the appropriate general solution for un now is:

un = (A1 + A2 n) (1)n

Using the same boundary conditions u0 = 0 and ua = 1:

A1 = 0 and A2 a = 1

Solving and substituting gives:

un = n/a
The appropriate general solution for vn is likewise:

vn = (A1 + A2 n) (1)n

Using the boundary conditions v0 = 1 and va = 0:

A1 = 1 and 1 + A2 a = 0

Solving and substituting gives:

vn = (a − n)/a
Again, happily, un + vn = 1 and there really is a winner.
Notice that both solutions show that the probability of each player winning is equal to that
player’s share of the capital. Gamblers say that the probability of winning is proportional
to the initial share of the stake.

The Expected Length of a Game — I


Assume that the length of a game is finite and:
Let dn turns be the expected duration of play when starting from n

Extend this notation so that:


dn+1 turns is the expected duration of play when starting from n + 1
dn−1 turns is the expected duration of play when starting from n − 1

Consider the situation after one turn. The partition is either at n + 1 (and the probability
of this outcome is p) where there are expected to be dn+1 further turns or at n − 1 (and
the probability of this outcome is q) where there are expected to be dn−1 further turns.
Thus the expected number of turns from n is the very first turn plus either dn+1 more or
dn−1 more. This gives the inhomogeneous difference equation:

dn = 1 + p dn+1 + q dn−1 (8.8)

– 8.7 –
Notice the special case p = 1 and q = 0 when the expected duration from n is simply the
first turn plus dn+1 more.
The boundary conditions now are d0 = 0 and da = 0. No further turns are to be expected
if the partition has reached an absorbing barrier.
From (7.6), the general solution to the new difference equation is:
dn = A1 + A2 (q/p)^n + n/(q − p)    provided p ≠ q                 (8.9)

Given the boundary conditions d0 = 0 and da = 0:

A1 + A2 = 0    and    A1 + A2 (q/p)^a + a/(q − p) = 0

Solve for A1 and A2 and back substitute in (8.9) to give:

dn = n/(q − p) − (a/(q − p)).[(1 − (q/p)^n)/(1 − (q/p)^a)]

This is the expected duration of play when p 6= q.

The Expected Length of a Game — II


When p = q the general solution (8.9) is invalid. From (7.7), the appropriate solution is:

dn = A1 + A2 n − n^2                                               (8.10)

Given the boundary conditions d0 = 0 and da = 0:

A1 = 0    and    A2 a − a^2 = 0

Solve for A1 and A2 and back substitute in (8.10) to give:

dn = n(a − n)

This is a remarkable result. Suppose the total capital a is £1000 but player A starts with
just £1 (so n = 1) whereas player B starts with £999. Since the partition starts off just
one unit from the left-hand end, there is a probability of 1/2 that the game will be over at
the first turn. Nevertheless the expected duration of play is 1.(1000 − 1) or 999 turns.
Although the probability of player A winning is only 1/1000, a mere £1 investment provides
entertainment that is expected to last 999 turns!
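The two duration formulae are easily programmed. The following ML sketch (not part
of the original notes; the name duration is invented for illustration) covers both the
p ≠ q case and the p = q case and reproduces the 999 turns of the example just given:

    (* Sketch: expected duration of play starting from n, total capital a *)
    fun duration (p, n, a) =
        let
            val q = 1.0 - p
        in
            if Real.== (p, q) then real n * real (a - n)           (* dn = n(a - n) *)
            else
                let val rho = q / p in
                    real n / (q - p)
                    - (real a / (q - p))
                      * (1.0 - Math.pow (rho, real n)) / (1.0 - Math.pow (rho, real a))
                end
        end;

    (* duration (0.5, 1, 1000) = 999.0 *)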

Glossary
The following technical terms have been introduced:
stochastic processes recurrent random walk absorbing barrier
random walk transient random walk boundary conditions
drift partition stake

– 8.8 –
Exercises — VIII
1. Consider a variant of the Gambler’s Ruin problem. To decide the outcome of each
turn, the players are using a fat coin which sometimes lands on its edge. Such an
outcome is deemed a draw for the turn and no money changes hands. There are now
three probabilities relating to the outcome of each turn:
Probability(A wins the turn) = p
Probability(B wins the turn) = q
Probability(Turn is a draw) = r

Necessarily p + q + r = 1 and, in this question, assume that p ≠ q. It is possible that


p = r or that q = r but such coincidences turn out not to matter. It is not immediately
clear whether the introduction of turns that have no effect alters the probabilities that
A ultimately wins or that B ultimately wins but it seems likely that the duration of
play (measured in turns) will be increased.
Complete the following tasks:
(a) Modify equation (8.3) and determine the probability un that A ultimately wins
starting from n.
(b) In a similar manner determine the probability vn that B ultimately wins starting
from n.
(c) Could the two results have been predicted at the outset? In what circumstances
will the game never finish?
(d ) Modify equation (8.8) and determine dn the duration of play starting from n. If
the ratio p : q is kept constant while the value of r is increased steadily, does the
duration of play lengthen in a way that might have been predicted at the outset?

2. Suppose that at each turn the two players play one round of the paper-scissors-stone
game. The outcome may be a win for A, a win for B or a draw and money changes
hands only when there is clear win.

Complete the following tasks:


(a) Determine the probabilities p, q and r for a round of the paper-scissors-stone
game and note that p = q.
(b) Rework question 1 for the case p = q.
(c) Use the values of p, q and r appropriate for the paper-scissors-stone game in the
expressions derived for un , vn and dn . Could the results have been predicted at
the outset?

– 8.9 –
9 — CONTINUOUS DISTRIBUTIONS

A random variable whose value may fall anywhere in a range of values is a continuous
random variable and will be associated with some continuous distribution. Continuous
distributions are to discrete distributions as type real is to type int in ML.
Many formulae for discrete distributions can be adapted for continuous distributions. Very
often, little more is required than the translation of sigma signs into integral signs. The
main bad news is that there is no equivalent of probability generating functions.

Adapting the P(X = r) Notation


In discussions which involve a single discrete random variable, the notation P(X = r) has
been used. When required, mapping is employed to ensure that r ∈ N.
In discussing a single continuous random variable, X will again be used as the name
but x will be used instead of r for the value. In probability theory r strongly implies a
non-negative integer whereas x ∈ R and may range from −∞ to +∞.
There is at once a problem with the notation P(X = x) for the probability is zero for any
particular x. Even if x is constrained to be in some finite range, such as −1 to +1, there
are an infinite number of possible values for x.
Fortunately, many variants of the P(X = x) notation are still useful. For example:
P(X < 0.5)      P(−1 ≤ X < +1)      P(a ≤ X < b)

There is an obvious difficulty with a graphical representation of a continuous random


variable. A plot of P(X = x) against x serves no useful purpose! Nevertheless, graphical
representations are both possible and useful and here is a first attempt at representing a
continuous random variable X which is distributed Uniform(0,2):

[Figure: a first plot for X distributed Uniform(0,2): constant height 1/2 over the range
0 to 2 and zero outside this range; the vertical axis is unlabelled, the horizontal axis is x.]

It is not immediately clear what label should be attached to the vertical axis but this
representation has the right feel about it. The height of the plot is constant over the range
0 to 2 and is zero outside this range.
The constant height is 1/2 to ensure that the total area under the curve is 1 and this is the
clue to much of what follows. The idea of area corresponding to probability was introduced
on page 1.6 and with continuous random variables area is often the most convenient way
of representing probability.

– 9.1 –
Probability Density Functions
In the present case, the area under the curve between x = 1 and x = 1¼ is (1¼ − 1) × 1/2 = 1/8
so the probability P(1 ≤ X < 1¼) = 1/8.
In general, this calculation will be an integration and some consideration needs to be given
to the function to be integrated.
The function is called a probability density function or pdf. In the case of a single random
variable it is often named f (x) and this is the appropriate label for the vertical axis:

[Figure: the same plot with the vertical axis now labelled f(x): f(x) = 1/2 for 0 ≤ x < 2
and zero elsewhere.]

In the case of the random variable X which is distributed Uniform(0,2):

f(x) = 1/2,  if 0 ≤ x < 2
     = 0,    otherwise

The calculation just undertaken is more formally written:

P(1 ≤ X < 1¼) = ∫_1^(1¼) f(x) dx = ∫_1^(1¼) (1/2) dx = [x/2]_1^(5/4) = 5/8 − 1/2 = 1/8

In general:

P(a ≤ X < b) = ∫_a^b f(x) dx

Both common sense and the axioms of probability impose certain constraints that have to
be met by any probability density function:
   I   f(x) must be single valued for all x
  II   f(x) ≥ 0 for all x
 III   ∫_−∞^+∞ f(x) dx = 1

This last is sometimes expressed as:

∫_R f(x) dx = 1

Here R refers to the range of interest, where the probability density function is non-zero.
For the uniform distribution above, the range R is 0 to 2.

– 9.2 –
Continuous Distributions
Informally, a discrete distribution has been taken as almost any indexed set of probabilities
whose sum is 1. The index has always been r = 0, 1, 2, . . .
Equally informally, almost any function f (x) which satisfies the three constraints can be
used as a probability density function and will represent a continuous distribution.

Expectation
With discrete distributions, the general formula for the mean or expectation of a single
random variable X is:

µ = E(X) = Σ_r r.P(X = r)

This is the first example of a formula used with discrete distributions which can be readily
adapted for continuous distributions. The mean µ or expectation E(X) of a random
variable X whose probability density function is f(x) is:

µ = E(X) = ∫_R x.f(x) dx

The general form for the expectation of a function of a random variable adapts too but
since f is used as the name of the general probability density function some other name
has to be used for the function of the random variable. In the following, h is taken as some
function of the random variable X and the expectation:
E[h(X)] = ∫_R h(x).f(x) dx

In the particular case of the square of X, when h(X) = X^2:

E(X^2) = ∫_R x^2.f(x) dx

Variance
The definition of variance is exactly the same for continuous random variables as for
discrete random variables:
Variance = σ^2 = V(X) = E[(X − µ)^2] = E(X^2) − [E(X)]^2

Thus the variance can be determined by first evaluating E(X) and E(X^2). Alternatively,
(X − µ)^2 can be regarded as a special case of the function h and the variance can be
directly computed thus:

Variance = σ^2 = V(X) = E[(X − µ)^2] = ∫_R (x − µ)^2.f(x) dx

– 9.3 –
Illustration — Uniform(0,2)
Consider the random variable X which is distributed Uniform(0,2) and whose probability
density function is:

f(x) = 1/2,  if 0 ≤ x < 2
     = 0,    otherwise

The expectation E(X) is:

E(X) = ∫_0^2 x.(1/2) dx = [x^2/4]_0^2 = 1

The expectation E(X^2) is:

E(X^2) = ∫_0^2 x^2.(1/2) dx = [x^3/6]_0^2 = 4/3

The variance V(X) is:

V(X) = E(X^2) − [E(X)]^2 = 4/3 − 1^2 = 1/3
The General Uniform Distribution
In the general case, a random variable which is distributed Uniform(a, b) is uniformly
distributed over the range a to b. To ensure that the integral of the associated probability
density function f (x) over this range is 1 the function is defined as:
f(x) = 1/(b − a),  if a ≤ x < b
     = 0,          otherwise

This function can be represented graphically:

[Figure: f(x) = 1/(b − a) over the range a ≤ x < b and zero elsewhere; f(x) on the
vertical axis, x on the horizontal axis.]

By analogy with discrete distributions, the first check is that the integration over the
appropriate range is 1:

∫_a^b 1/(b − a) dx = [x/(b − a)]_a^b = 1

– 9.4 –
The expectation E(X) is:

E(X) = ∫_a^b x.(1/(b − a)) dx = (1/(b − a)).[x^2/2]_a^b = (1/(b − a)).((b^2 − a^2)/2) = (b + a)/2

This is simply confirming that the mean is halfway between a and b and this was seen
earlier with the distribution Uniform(0,2) where the mean was 1.
The expectation E(X^2) is:

E(X^2) = ∫_a^b x^2.(1/(b − a)) dx = (1/(b − a)).[x^3/3]_a^b = (1/(b − a)).((b^3 − a^3)/3) = (b^2 + ab + a^2)/3

The variance V(X) is:

V(X) = E(X^2) − [E(X)]^2 = (b^2 + ab + a^2)/3 − (b^2 + 2ab + a^2)/4

     = (4b^2 + 4ab + 4a^2 − 3b^2 − 6ab − 3a^2)/12 = (b^2 − 2ab + a^2)/12 = (b − a)^2/12

With the distribution Uniform(0,2), a = 0 and b = 2, giving the result 1/3 noted earlier.

Illustration — Roulette Wheel


Let X be the angle between some reference radius on a roulette wheel and some fixed
direction on the casino table. The angle X is a random variable which is distributed
Uniform(0, 2π).
Taking the values a = 0 and b = 2π, the expectation and variance are:

E(X) = 2π/2 = π    and    V(X) = (2π)^2/12 = π^2/3
Mode and Median
Informally, the mode of any distribution is the most probable value. This is the value for
which f (x) is a maximum. Clearly the Uniform distribution does not have a mode in any
useful sense.
Informally, the median of any distribution is the middle value. This is the value of x which
is such that the area under f (x) to the left of x is equal to the area under f (x) to the right
of x. If the value of the median is M then M must be such that:
∫_−∞^M f(x) dx = ∫_M^+∞ f(x) dx

In the case of the Uniform distribution, the median is the same as the mean since the
halfway point divides the area into two equal parts.

– 9.5 –
Probability Distribution Functions
Related to any probability density function f (x) there is an associated function F (x) which
is known as the probability distribution function. The relationship is:
F(x) = P(X < x) = ∫_−∞^x f(t) dt

The following figure shows the relationship diagrammatically. The function F (x) is the
area under the curve from the leftmost end of the region of the distribution (which may
be −∞) up to x:

[Figure: a density curve f(t) with the area under the curve up to the point t = x shaded;
this shaded area is F(x).]

Two points stem directly from the definition of a probability distribution function. First:

P(a ≤ X < b) = F(b) − F(a)

Secondly, given that F(x) is the integral of f(x), the derivative of F(x) must be f(x):

(d/dx) F(x) = f(x)

It is unfortunate that two important functions have the same initial letters. Some writers
distinguish the two thus:

pdf stands for probability density function


PDF stands for probability distribution function

Given such obvious scope for confusion, the abbreviations will not be used. Moreover, only
limited use will be made of probability distribution functions.

– 9.6 –
The Exponential Distribution
The Exponential distribution, sometimes known as the Negative Exponential distribution,
is related to the Geometric and Poisson discrete distributions.
The main design criterion for this distribution is to find, for some random variable X, a
probability density function which is such that:

P(X > x) = e^(−λx)

This is a tail probability whose value decreases exponentially as x increases.


In the context of the Poisson distribution, imagine that a town averages one murder a
year. The probability of the town having a run of ‘at least 10 years’ without a murder is
substantially less than the probability of lasting ‘at least one year’ without a murder.
The first step in determining the appropriate probability density function is to find the
probability distribution function. Given that P(X < x) + P(X > x) = 1:

P(X > x) = 1 − P(X < x) = 1 − F (x)

Hence:
F(x) = 1 − e^(−λx)

Differentiate with respect to x:

f(x) = λ.e^(−λx)

This is not quite suitable as a probability density function because the range has not been
specified. Clearly the range cannot start from −∞ for this would lead to an infinite area
under the curve. The appropriate formal specification of the probability density function
for the exponential distribution is:

f(x) = λ.e^(−λx),  if x ≥ 0
     = 0,          otherwise

It is simple to check that, without any need for scaling, the integration over the range 0
to ∞ is 1:

∫_0^∞ λ.e^(−λx) dx = [−e^(−λx)]_0^∞ = 1

A second check is to confirm that the probability density function satisfies the design
criterion that P(X > x) = e^(−λx):

P(X > x) = ∫_x^∞ λ.e^(−λt) dt = [−e^(−λt)]_x^∞ = e^(−λx)

– 9.7 –
A graphical representation of the Exponential distribution is:

[Figure: the decaying curve f(t) = λe^(−λt) with the area under the curve to the right of
t = x shaded.]

The probability P(X > x) corresponds to the shaded area. Quite clearly, the larger the
value of x the smaller this probability.
The expectation E(X) is:

E(X) = ∫_0^∞ xλ.e^(−λx) dx

Let t = λx so dx = (1/λ) dt. Then:

E(X) = (1/λ) ∫_0^∞ t.e^(−t) dt = (1/λ) [−(t + 1)e^(−t)]_0^∞ = 1/λ

The expectation E(X^2) is:

E(X^2) = ∫_0^∞ x^2 λ.e^(−λx) dx

Let t = λx so dx = (1/λ) dt. Then:

E(X^2) = ∫_0^∞ (t^2/λ).e^(−t) (1/λ) dt = (1/λ^2) ∫_0^∞ t^2.e^(−t) dt = (1/λ^2) [−(t^2 + 2t + 2)e^(−t)]_0^∞ = 2/λ^2

The variance V(X) is:

V(X) = E(X^2) − [E(X)]^2 = 2/λ^2 − (1/λ)^2 = 1/λ^2

In the case of the Poisson distribution both the expectation and the variance are λ.
In the case of the Exponential distribution the expectation is 1/λ but the variance is 1/λ^2.
An important consideration in the context of the Exponential distribution is that the time
you may expect to wait for a No. 9 bus does not depend on when you start waiting for it.
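The tail probability is simple to evaluate directly, and a few lines of ML (a sketch, not
part of the original notes) illustrate numerically that P(X > u + v | X > u) = P(X > v),
the 'lack of memory' property explored in the exercises below:

    (* Sketch: the tail probability P(X > x) = e^(-lambda x) *)
    fun tail (lambda, x) = Math.exp (~(lambda * x));

    val lambda = 1.0;                                (* one murder a year on average *)
    val lhs = tail (lambda, 10.0 + 3.0) / tail (lambda, 10.0);   (* P(X > 13 | X > 10) *)
    val rhs = tail (lambda, 3.0);                                (* P(X > 3)           *)
    (* both are e^(-3), about 0.0498 *)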

– 9.8 –
Glossary
The following technical terms have been introduced:
probability density function median mode
probability distribution function
Exercises — IX
1. The distribution of the angle α (to the vertical) at which meteorites strike the Earth
has probability density function:

f(α) = sin(2α)    where 0 ≤ α ≤ π/2

Find the expectation and variance of the distribution.


2. Find the expectation and variance of the double exponential distribution:

f(x) = (1/2) c e^(−c|x|)

3. If X has the exponential distribution show that:



P(X > u + v | X > u) = P(X > v)    for all u, v > 0

This is the ‘lack of memory property’ (c.f. Exercises IV, question 2).

– 9.9 –
10 — BIVARIATE DISTRIBUTIONS

After some discussion of the Normal distribution, consideration is given to handling two
continuous random variables.

The Normal Distribution


The probability density function f (x) associated with the general Normal distribution is:
f(x) = (1/√(2πσ^2)) e^(−(x−µ)^2/(2σ^2))                            (10.1)
The range of the Normal distribution is −∞ to +∞ and it will be shown that the total
area under the curve is 1. It will also be shown that µ is the mean and that σ 2 is the
variance. A graphical representation of the Normal distribution is:

[Figure: the bell-shaped Normal curve f(x), symmetrical about x = µ.]

It is immediately clear from (10.1) that f (x) is symmetrical about x = µ. This value of x
is marked. When µ = 0 the curve is symmetrical about the vertical axis. The value of σ 2
governs the width of the hump and it turns out that the inflexion points (one each side of
the peak) are at x = µ ± σ where σ = √(σ^2) is the standard deviation.
When integrating expressions which incorporate the probability density function, it is
essential to know the following standard result (a derivation is presented on page 10.10):
∫_−∞^+∞ e^(−x^2) dx = √π                                           (10.2)

This result makes it easy to check that the total area under the curve is 1:

Total Area = (1/√(2πσ^2)) ∫_−∞^+∞ e^(−(x−µ)^2/(2σ^2)) dx

Let t = (x − µ)/√(2σ^2) so dx = √(2σ^2) dt. Then:

Total Area = (1/√(2πσ^2)) ∫_−∞^+∞ e^(−t^2) √(2σ^2) dt = (1/√π) ∫_−∞^+∞ e^(−t^2) dt = (1/√π).√π = 1

– 10.1 –
From the symmetry of the probability density function (10.1) it is clear that the expectation
E(X) = µ but to give a further illustration of the use of the standard result (10.2) the
expectation is here evaluated by integration:
E(X) = (1/√(2πσ^2)) ∫_−∞^+∞ x.e^(−(x−µ)^2/(2σ^2)) dx

Let t = (x − µ)/√(2σ^2) so x = √(2σ^2) t + µ and dx = √(2σ^2) dt. Then:

E(X) = (1/√(2πσ^2)) ∫_−∞^+∞ (√(2σ^2) t + µ).e^(−t^2) √(2σ^2) dt

     = √(2σ^2/π) ∫_−∞^+∞ t.e^(−t^2) dt + (µ/√π) ∫_−∞^+∞ e^(−t^2) dt

     = √(2σ^2/π) [−(1/2) e^(−t^2)]_−∞^+∞ + (µ/√π).√π

     = √(2σ^2/π) [0 − 0] + µ = µ

The variance, for once, is conveniently evaluated from the formula V(X) = E[(X − µ)^2]:

V(X) = E[(X − µ)^2] = (1/√(2πσ^2)) ∫_−∞^+∞ (x − µ)^2.e^(−(x−µ)^2/(2σ^2)) dx

Let t = (x − µ)/√(2σ^2) so x − µ = √(2σ^2) t and dx = √(2σ^2) dt. Then:

V(X) = (1/√(2πσ^2)) ∫_−∞^+∞ 2σ^2 t^2.e^(−t^2) √(2σ^2) dt

     = −(σ^2/√π) ∫_−∞^+∞ t (d/dt)(e^(−t^2)) dt

The integrand has been put into a form ready for integration by parts:

V(X) = −(σ^2/√π) [t.e^(−t^2)]_−∞^+∞ + (σ^2/√π) ∫_−∞^+∞ e^(−t^2) dt

     = −(σ^2/√π) [0 − 0] + (σ^2/√π).√π

     = σ^2

– 10.2 –
Standard Form of the Normal Distribution
The general Normal distribution is described as:
Normal(µ, σ 2 ) or simply N(µ, σ 2 )

The smaller the variance σ 2 the narrower and taller the hump of the probability density
function.
A particularly unfortunate difficulty with the Normal distribution is that integrating the
probability density function between arbitrary limits is intractable. Thus, for arbitrary a
and b, it is impossible to evaluate:
P(a ≤ X < b) = (1/√(2πσ^2)) ∫_a^b e^(−(x−µ)^2/(2σ^2)) dx

The traditional way round this problem is to use tables though these days appropriate
facilities are built into any decent spreadsheet application.
Unsurprisingly, tables are not available for a huge range of possible (µ, σ 2 ) pairs but the
special case when the mean is zero and the variance is one, the distribution Normal(0,1),
is well documented. The probability distribution function for Normal(0,1) is often referred
to as Φ(x):

Φ(x) = (1/√(2π)) ∫_−∞^x e^(−t^2/2) dt
Tables of Φ(x) are readily available so, given a random variable X distributed Normal(0,1),
it is easy to determine the probability that X lies in the range a to b. This is:

P(a ≤ X < b) = Φ(b) − Φ(a)

It is now straightforward to see how to deal with the general case of a random variable X
distributed Normal(µ, σ 2 ). To determine the probability that X lies in the range a to b
first reconsider:

P(a ≤ X < b) = (1/√(2πσ^2)) ∫_a^b e^(−(x−µ)^2/(2σ^2)) dx

Let t = (x − µ)/σ so dx = σ dt. Then:

P(a ≤ X < b) = (1/√(2πσ^2)) ∫_{(a−µ)/σ}^{(b−µ)/σ} e^(−t^2/2) σ dt = (1/√(2π)) ∫_{(a−µ)/σ}^{(b−µ)/σ} e^(−t^2/2) dt

The integration is thereby transformed into that for the distribution Normal(0,1) and is
said to be in standard form. Noting the new limits:

P(a ≤ X < b) = Φ((b − µ)/σ) − Φ((a − µ)/σ)

– 10.3 –
The Central Limit Theorem
No course on probability or statistics is complete without at least a passing reference to
the Central Limit Theorem. This is an extraordinarily powerful theorem but only the most
token of introductory remarks will be made about it here.
Suppose X1 , X2 , . . . Xn are n independent and identically distributed random variables
each with mean µ and variance σ 2 . Consider a derived random variable Y which is the
mean of X1, X2, . . . Xn:

Y = (X1 + X2 + · · · + Xn)/n

The expectation of Y is:

E(Y) = (1/n) E(X1 + X2 + · · · + Xn) = (1/n) [E(X1) + E(X2) + · · · + E(Xn)] = µ

The variance of Y is:

V(Y) = (1/n^2) V(X1 + X2 + · · · + Xn) = (1/n^2) [V(X1) + V(X2) + · · · + V(Xn)] = σ^2/n
The expectation of the mean is the same whatever the value of n but the variance gets less
as n increases.
The Central Limit Theorem, in essence, addresses the distribution of the derived random
variable Y . The theorem notes that, as n increases indefinitely, the distribution of Y tends
to a Normal distribution whatever the distribution of the individual X i .

Bivariate Distributions — Reference Discrete Example


It has been noted that many formulae for continuous distributions parallel equivalent
formulae for discrete distributions. In introducing examples of two continuous random
variables it is useful to employ a reference example of two discrete random variables.
Consider two discrete random variables X and Y whose values are r and s respectively
and suppose that the probability of the event {X = r} ∩ {Y = s} is given by:

P(X = r, Y = s) = \begin{cases} \frac{r + s}{48}, & \text{if } 0 \leq r, s \leq 3 \\ 0, & \text{otherwise} \end{cases}

The probabilities may be tabulated thus:

                         Y   (s →)
                  s=0     s=1     s=2     s=3     P(X = r)
        r=0       0/48    1/48    2/48    3/48      6/48
  X     r=1       1/48    2/48    3/48    4/48     10/48
 (r ↓)  r=2       2/48    3/48    4/48    5/48     14/48
        r=3       3/48    4/48    5/48    6/48     18/48
     P(Y = s)     6/48   10/48   14/48   18/48

– 10.4 –
The following is a graphical representation of the same function:

[Figure: a 'pegboard' plot of P(X = r, Y = s) over the integer grid 0 ≤ r, s ≤ 3, with the X and Y axes each running from 0 to 3.]

The probabilities exist only for integer values of r and s, so the effect is that of a kind of
pegboard. Different lengths of peg correspond to different probabilities.
Axiom II requires the sum of all the entries in the table (which is the total length of the
pegs) to be 1.
The table also includes the marginal sums which separately tabulate the probabilities
P(X = r) and P(Y = s).
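The table is easily reproduced and checked by machine. The short Python sketch below is an illustrative addition, not part of the notes: it builds P(X = r, Y = s) = (r + s)/48, confirms that the entries sum to 1 and recomputes the marginal sums.

    from fractions import Fraction

    # Joint probabilities P(X = r, Y = s) = (r + s)/48 for 0 <= r, s <= 3
    p = {(r, s): Fraction(r + s, 48) for r in range(4) for s in range(4)}

    assert sum(p.values()) == 1     # Axiom II: the total of all entries is 1

    p_x = {r: sum(p[r, s] for s in range(4)) for r in range(4)}   # marginal P(X = r)
    p_y = {s: sum(p[r, s] for r in range(4)) for s in range(4)}   # marginal P(Y = s)

    print(p_x)   # 6/48, 10/48, 14/48, 18/48 (printed by Fraction in lowest terms)
    print(p_y)   # the same values, by symmetry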

Bivariate Distributions — Continuous Random Variables


When there are two continuous random variables, the equivalent of the two-dimensional
array is a region of the x–y (cartesian) plane. Above the plane, over the region of interest,
is a surface which represents the probability density function associated with a bivariate
distribution.
Suppose X and Y are two continuous random variables and that their values, x and y
respectively, are constrained to lie within some region R of the cartesian plane. The
associated probability density function has the general form fXY (x, y) and, regarding this
function as defining a surface over region R, axiom II requires:
\iint_R f_{XY}(x, y)\, dx\, dy = 1

This is equivalent to requiring that the volume under the surface is 1 and corresponds to
the requirement that the total length of the pegs in the pegboard is 1.

– 10.5 –
As an illustration, consider a continuous analogy of the reference example of two discrete
random variables.
Suppose X and Y are two continuous random variables and that their values x and y are
constrained to lie in the unit square 0 6 x, y < 1. Suppose further that the associated
bivariate probability density function is:
f_{XY}(x, y) = \begin{cases} x + y, & \text{if } 0 \leq x, y < 1 \\ 0, & \text{otherwise} \end{cases}
This probability density function can be regarded as defining a surface over the unit square.
Continuous functions cannot satisfactorily be tabulated but it is not difficult to depict a
graphical representation of fXY (x, y):

[Figure: the surface fXY (x, y) = x + y drawn over the unit square 0 ≤ x, y < 1.]

Note that when x = y = 1 the value of fXY (x, y) is 2, an impossible value for a probability
but a perfectly possible value for a probability density function. Probabilities correspond
to volumes now and it is easy to check that the total volume under the surface is 1. Given
that region R represents the specified unit square:
\iint_R f_{XY}(x, y)\, dx\, dy = \int_0^1\!\!\int_0^1 (x + y)\, dx\, dy = \int_0^1 \Big[\frac{x^2}{2} + yx\Big]_0^1 dy
 = \int_0^1 \Big(\frac{1}{2} + y\Big) dy = \Big[\frac{y}{2} + \frac{y^2}{2}\Big]_0^1 = 1
In the general case where the probability density function is fXY (x, y), the probability that
(x, y) lies in a particular sub-region Rs of R is given by:
P(X, Y \text{ lies in } R_s) = \iint_{R_s} f_{XY}(x, y)\, dx\, dy

– 10.6 –
In the example case, suppose that the sub-region Rs is defined by the half-unit square
0 ≤ x, y < ½:

P(X, Y \text{ lies in } R_s) = \int_0^{1/2}\!\!\int_0^{1/2} (x + y)\, dx\, dy = \int_0^{1/2} \Big[\frac{x^2}{2} + yx\Big]_0^{1/2} dy
 = \int_0^{1/2} \Big(\frac{1}{8} + \frac{y}{2}\Big) dy = \Big[\frac{y}{8} + \frac{y^2}{4}\Big]_0^{1/2} = \frac{1}{8}
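Both volumes are easy to confirm numerically. The sketch below is an illustrative addition: it approximates each double integral by a midpoint rule on an n × n grid (the grid size is an arbitrary choice).

    def double_integral(f, x_hi, y_hi, n=400):
        # Midpoint-rule approximation of the integral of f over [0, x_hi] x [0, y_hi]
        hx, hy = x_hi / n, y_hi / n
        total = 0.0
        for i in range(n):
            for j in range(n):
                total += f((i + 0.5) * hx, (j + 0.5) * hy)
        return total * hx * hy

    f = lambda x, y: x + y
    print(double_integral(f, 1.0, 1.0))   # close to 1, the total volume
    print(double_integral(f, 0.5, 0.5))   # close to 0.125, the 1/8 result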

In the reference discrete example (where the values of both random variables are confined
to the range 0 to 3) this result loosely corresponds to determining P(X, Y < 1½):

P(X, Y < 1\tfrac{1}{2}) = \sum_{r=0}^{1}\sum_{s=0}^{1} P(X = r, Y = s) = \frac{0}{48} + \frac{1}{48} + \frac{1}{48} + \frac{2}{48} = \frac{4}{48} = \frac{1}{12}

Why is the answer different?

The Equivalent of Marginal Sums


With two discrete random variables, the marginal sums P(X = r) and P(Y = s) are given
by the relationships:
P(X = r) = \sum_s P(X = r, Y = s) \quad\text{and}\quad P(Y = s) = \sum_r P(X = r, Y = s)

A pair of continuous random variables X and Y governed by a bivariate probability density
function fXY (x, y) will, separately, have associated probability density functions fX (x) and
fY (y). By analogy with the discrete case, these functions are given by the relationships:

f_X(x) = \int_{y_{min}}^{y_{max}} f_{XY}(x, y)\, dy \quad\text{and}\quad f_Y(y) = \int_{x_{min}}^{x_{max}} f_{XY}(x, y)\, dx

The limits of integration merit a moment’s thought. The region, R, over which f XY (x, y) is
defined is not, in general, a square. Accordingly, the valid range of y depends on x and, in
consequence, the limits ymin and ymax will depend on x. Likewise, the limits xmin and xmax
will depend on y.
In the example case, where R is the unit square, there is no difficulty about the limits:

f_X(x) = \int_0^1 (x + y)\, dy = \Big[xy + \frac{y^2}{2}\Big]_0^1 = x + \frac{1}{2}

and:

f_Y(y) = \int_0^1 (x + y)\, dx = \Big[\frac{x^2}{2} + yx\Big]_0^1 = y + \frac{1}{2}
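Marginal densities of this kind can also be obtained symbolically. A minimal sketch using the sympy library (an illustrative addition, assuming sympy is available):

    import sympy as sp

    x, y = sp.symbols('x y', nonnegative=True)
    f_xy = x + y                          # joint density over the unit square

    f_x = sp.integrate(f_xy, (y, 0, 1))   # marginal density of X
    f_y = sp.integrate(f_xy, (x, 0, 1))   # marginal density of Y

    print(f_x)                            # x + 1/2
    print(f_y)                            # y + 1/2
    print(sp.integrate(f_x, (x, 0, 1)))   # 1, as required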

– 10.7 –
The Equivalent of Marginal Sums and Independence
With two discrete random variables, the two sets of marginal sums each sum to 1. With
continuous random variables, integrating each of the functions f X (x) and fY (y) must
likewise yield 1. It is easy to verify that this requirement is satisfied in the present case:
\int_0^1 \Big(x + \frac{1}{2}\Big)\, dx = \Big[\frac{x^2}{2} + \frac{x}{2}\Big]_0^1 = 1 \quad\text{and}\quad \int_0^1 \Big(y + \frac{1}{2}\Big)\, dy = \Big[\frac{y^2}{2} + \frac{y}{2}\Big]_0^1 = 1

With two discrete random variables, marginal sums are used to test for independence. Two
random variables X and Y are said to be independent if:

P(X = r, Y = s) = P(X = r) . P(Y = s) for all r, s

By analogy, two continuous random variables X and Y are said to be independent if:

fXY (x, y) = fX (x) . fY (y)

In the present case:

f_{XY}(x, y) = x + y \quad\text{and}\quad f_X(x)\,.\,f_Y(y) = \Big(x + \frac{1}{2}\Big)\Big(y + \frac{1}{2}\Big) = xy + \frac{x + y}{2} + \frac{1}{4}

Clearly two continuous random variables X and Y whose probability density function is
x + y are not independent, but the function just derived can be dressed up as a bivariate
probability density function whose associated random variables are independent (it
factorises as (x + ½)(y + ½), a product of its own marginal densities):

f_{XY}(x, y) = \begin{cases} (4xy + 2x + 2y + 1)/4, & \text{if } 0 \leq x, y < 1 \\ 0, & \text{otherwise} \end{cases}
Illustration — The Uniform Distribution
Suppose X and Y are independent and that both are distributed Uniform(0,1). Their
associated probability density functions are:

f_X(x) = \begin{cases} 1, & \text{if } 0 \leq x < 1 \\ 0, & \text{otherwise} \end{cases} \quad\text{and}\quad f_Y(y) = \begin{cases} 1, & \text{if } 0 \leq y < 1 \\ 0, & \text{otherwise} \end{cases}

Given this trivial case, the product is clearly the bivariate probability density function:

f_{XY}(x, y) = \begin{cases} 1, & \text{if } 0 \leq x, y < 1 \\ 0, & \text{otherwise} \end{cases}

The surface defined by fXY (x, y), together with the unit square beneath it, encloses a unit cube, so it is readily seen that:

\iint_R f_{XY}(x, y)\, dx\, dy = 1

– 10.8 –
Illustration — The Normal Distribution
Suppose X and Y are independent and that both are distributed Normal(0,1). Their
associated probability density functions are:

f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2} \quad\text{and}\quad f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}

In this case the product leads to the probability density function:

f_{XY}(x, y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2} \,.\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2} = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x^2 + y^2)} \qquad (10.3)

The surface approximates that obtained by placing a large ball in the centre of a table and
then draping a tablecloth over the ball.

Glossary
The following technical terms have been introduced:
standard form bivariate distribution

Exercises — X
1. Find the points of inflexion of the probability density function associated with a
   random variable which is distributed Normal(µ, σ²). Hence find the points at which
   the tangents at the points of inflexion intersect the x-axis.

2. The Cauchy distribution has probability density function:

   f(x) = \frac{c}{1 + x^2} \quad\text{where } -\infty < x < +\infty

   Evaluate c. Find the probability distribution function F(x). Calculate P(0 ≤ x < 1).
3. Two continuous random variables X and Y have the following bivariate probability
   density function which is defined over the unit square:

   f_{XY}(x, y) = \begin{cases} (9 - 6x - 6y + 4xy)/4, & \text{if } 0 \leq x, y < 1 \\ 0, & \text{otherwise} \end{cases}

   (a) Given that R is the unit square, verify that \iint_R f_{XY}(x, y)\, dx\, dy = 1.
   (b) Determine fX (x) and fY (y).
   (c) Hence state whether or not the two random variables are independent.

– 10.9 –
ADDENDUM — AN IMPORTANT INTEGRATION
For arbitrary limits a and b it is impossible to evaluate \int_a^b e^{-x^2}\, dx analytically, but
evaluation is possible if a = −∞ and b = +∞; an informal analysis is presented below.
As a preliminary observation note that:
\int_0^\infty e^{-x^2}\, dx = \sqrt{\int_0^\infty e^{-x^2}\, dx \int_0^\infty e^{-y^2}\, dy\,} = \sqrt{\int_0^\infty\!\!\int_0^\infty e^{-(x^2 + y^2)}\, dx\, dy\,}

The item under the square root sign can be integrated by the substitution of two new
variables r and θ using the transformation functions:

x = r cos θ and y = r sin θ

Further discussion of integration by the substitution of two new variables will be given on
pages 12.1 and 12.2. In the present case, the integration is transformed thus:
\int_0^\infty\!\!\int_0^\infty e^{-(x^2 + y^2)}\, dx\, dy = \int_0^{\pi/2}\!\!\int_0^\infty e^{-r^2}\, r\, dr\, d\theta

The first pair of limits reflects integration over a square constituting the quadrant of the
cartesian plane in which x and y are both positive. The second pair of limits reflects
integration over a quarter circle in the same quadrant. Since the function being integrated
is strongly convergent as x and y or r increase, it is valid to equate the two integrations.
The right-hand integration is straightforward:
\int_0^{\pi/2}\!\!\int_0^\infty e^{-r^2}\, r\, dr\, d\theta = \int_0^{\pi/2} \Big[\frac{-e^{-r^2}}{2}\Big]_0^\infty d\theta = \int_0^{\pi/2} \frac{1}{2}\, d\theta = \Big[\frac{\theta}{2}\Big]_0^{\pi/2} = \frac{\pi}{4}

Accordingly:

\int_0^\infty e^{-x^2}\, dx = \sqrt{\frac{\pi}{4}} = \frac{\sqrt{\pi}}{2}

This leads, by symmetry, to the standard result quoted as (10.2):

\int_{-\infty}^{+\infty} e^{-x^2}\, dx = \sqrt{\pi}
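The result is easy to check numerically. The sketch below is an illustrative addition: it approximates the integral with a simple midpoint rule over a finite range; because e^{-x²} decays so quickly, truncating at ±10 loses essentially nothing.

    import math

    def gauss_integral(limit=10.0, n=100000):
        # Midpoint-rule approximation of the integral of exp(-x^2) over [-limit, +limit]
        h = 2 * limit / n
        return sum(math.exp(-((-limit + (i + 0.5) * h) ** 2)) for i in range(n)) * h

    print(gauss_integral())        # about 1.7724539
    print(math.sqrt(math.pi))      # 1.7724539 to the same accuracy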

– 10.10 –
11 — TRANSFORMING DENSITY FUNCTIONS

It can be expedient to use a transformation function to transform one probability density
function into another. As an introduction to this topic, it is helpful to recapitulate the
method of integration by substitution of a new variable.

Integration by Substitution of a new Variable


Imagine that a newcomer to integration comes across the following:
\int_0^{\sqrt{\pi/2}} 2x \cos x^2 \, dx

Assuming that the newcomer doesn’t notice that the integrand is the derivative of sin x²,
one way to proceed would be to substitute a new variable y for x²:

   Let y = x²
   Replace the limits x = 0 and x = \sqrt{\pi/2} by y = 0 and y = \pi/2
   Replace 2x \cos x^2 by 2\sqrt{y} \cos y
   Note that x = \sqrt{y} and hence \frac{dx}{dy} = \frac{1}{2\sqrt{y}}, so replace dx by \frac{dy}{2\sqrt{y}}

The original problem is thereby transformed into the following integration:

\int_0^{\pi/2} \cos y \, dy = \Big[\sin y\Big]_0^{\pi/2} = 1
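Both forms of the integral can be checked symbolically. A minimal sketch using the sympy library (an illustrative addition, assuming sympy is available):

    import sympy as sp

    x, y = sp.symbols('x y', positive=True)

    original    = sp.integrate(2*x*sp.cos(x**2), (x, 0, sp.sqrt(sp.pi/2)))
    substituted = sp.integrate(sp.cos(y), (y, 0, sp.pi/2))

    print(original, substituted)   # both evaluate to 1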

The General Case


It is instructive to develop the general case alongside the above example:

   General Case                                              Above Example

   \int_a^b f(x)\, dx                                        \int_0^{\sqrt{\pi/2}} 2x \cos x^2\, dx

   Choose a transformation function y(x)                     y(x) = x^2

   Note its inverse x(y)                                     x(y) = \sqrt{y}

   Replace the limits by y(a) and y(b)                       0 and \pi/2

   Replace f(x) by f(x(y))                                   2\sqrt{y} \cos y

   Replace dx by \frac{dx}{dy}\, dy                          \frac{1}{2\sqrt{y}}\, dy

   Result is \int_{y(a)}^{y(b)} f\big(x(y)\big) \frac{dx}{dy}\, dy          \int_0^{\pi/2} \cos y\, dy

– 11.1 –
Application to Probability Density Functions
The previous section informally leads to the general formula for integration by substitution
of a new variable:

\int_a^b f(x)\, dx = \int_{y(a)}^{y(b)} f\big(x(y)\big)\, \frac{dx}{dy}\, dy \qquad (11.1)

This formula has direct application to the process of transforming probability density
functions. . .
Suppose X is a random variable whose probability density function is f (x).
By definition:
P(a \leq X < b) = \int_a^b f(x)\, dx \qquad (11.2)

Any function of a random variable is itself a random variable and, if y is taken as some
transformation function, y(X) will be a derived random variable. Let Y = y(X).
Notice that if X = a the derived random variable Y = y(a) and if X = b, Y = y(b).
Moreover (subject to certain assumptions about y), if a ≤ X < b then y(a) ≤ Y < y(b)
and P(y(a) ≤ Y < y(b)) = P(a ≤ X < b). Hence, by (11.2) and (11.1):

P\big(y(a) \leq Y < y(b)\big) = P(a \leq X < b) = \int_a^b f(x)\, dx = \int_{y(a)}^{y(b)} f\big(x(y)\big)\, \frac{dx}{dy}\, dy \qquad (11.3)

Notice that the right-hand integrand f\big(x(y)\big)\, \frac{dx}{dy} is expressed wholly in terms of y.
Calling this integrand g(y):

P\big(y(a) \leq Y < y(b)\big) = \int_{y(a)}^{y(b)} g(y)\, dy

This demonstrates that g(y) is the probability density function associated with Y .
The transformation is illustrated by the following figures in which the function f (x) (on
the left) is transformed by y(x) (centre) into the new function g(y) (right):

[Figure: three panels — the density f(x) over a ≤ x < b, the transformation function y(x), and the derived density g(y) over y(a) ≤ y < y(b); the interval a to b on the x-axis maps to y(a) to y(b) on the y-axis.]

– 11.2 –
Observations and Constraints
The crucial step is (11.3). One imagines noting a sequence of values of a random variable
X and for each value in the range a to b using a transformation function y(x) to compute
a value for a derived random variable Y .
Given certain assumptions about y(x), the value of Y must be in the range y(a) to y(b)
and the probability of Y being in this range is clearly the same as the probability of X
being in the range a to b.
In summary: the shaded region in the right-hand figure has the same area as the shaded
region in the left-hand figure.
There are three important conditions that any probability density function f (x) has to
satisfy:
• f (x) must be single valued for all x
• f (x) ≥ 0 for all x
• \int_{-\infty}^{+\infty} f(x)\, dx = 1

Often the function usefully applies over some finite interval of x and is deemed to be
zero outside this interval. The function 2x cos x2 could be used in the specification of a
probability density function:

f(x) = \begin{cases} 2x \cos x^2, & \text{if } 0 \leq x < \sqrt{\pi/2} \\ 0, & \text{otherwise} \end{cases}

By inspection, f (x) is single valued and non-negative and, given the analysis on page 11.1,
the integral from −∞ to +∞ is one.
The constraints on the specification of a probability density function result in implicit
constraints on any transformation function y(x), most importantly:
• Throughout the useful range of x, both y(x) and its inverse x(y) must be defined and
must be single-valued.
• Throughout this range, \frac{dx}{dy} must be defined and either \frac{dx}{dy} \geq 0 throughout or \frac{dx}{dy} \leq 0 throughout.

If \frac{dx}{dy} were to change sign there would be values of x for which y(x) would be multivalued
(as would be the case if the graph of y(x) were an S-shaped curve).
A consequence of the constraints is that any practical transformation function y(x) must
either increase monotonically over the useful range of x (in which case for any a < b,
y(a) < y(b)) or decrease monotonically (in which case for any a < b, y(a) > y(b)).
Noting these constraints, it is customary for the relationship between a probability density
function f (x), the inverse x(y) of a transformation function, and the derived probability
density function g(y) to be written:

g(y) = f\big(x(y)\big)\, \left|\frac{dx}{dy}\right| \qquad (11.4)

– 11.3 –
Example I
Take a particular random variable X whose probability density function f (x) is:

f(x) = \begin{cases} \frac{x}{2}, & \text{if } 0 \leq x < 2 \\ 0, & \text{otherwise} \end{cases}

Suppose the transformation function y(x) is:

y(x) = 1 - \frac{\sqrt{4 - x^2}}{2}

Note that the useful part of the range of x is 0 to 2 and, over this range, y(x) increases
monotonically from 0 to 1.
Let Y = y(X), the derived random variable, and let g(y) be the probability density function
associated with Y . What is the function g(y)?
The problem is illustrated by the following figures:

[Figure: three panels — the triangular density f(x) = x/2 on 0 ≤ x < 2, the transformation function y(x), and the derived density g(y) on 0 ≤ y < 1.]

First, derive x(y), the inverse of the function y(x). Given:

y = 1 - \frac{\sqrt{4 - x^2}}{2}

4(y - 1)^2 = 4 - x^2

So:

4y^2 - 8y + 4 = 4 - x^2

x^2 = 4y(2 - y)

x = 2\sqrt{y(2 - y)}

– 11.4 –
Accordingly:

f\big(x(y)\big) = \sqrt{y(2 - y)} \quad\text{and}\quad \frac{dx}{dy} = \frac{2(1 - y)}{\sqrt{y(2 - y)}}

From (11.4):

g(y) = f\big(x(y)\big)\, \left|\frac{dx}{dy}\right| = 2(1 - y)

As illustrated in the figures, the function y(x) transforms one triangular distribution f (x)
into another g(y). The two triangles are opposite ways round and the transformation
function y(x) has to ensure that although low values of X are relatively rare, low values
of Y are common.
Expressing this informally: y(x) stays low for most of the range of x so that even when x
is well over one, the value of y is well under a half. This ensures that the transformation
shifts the bias appropriately.
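The result can be tested by simulation. The sketch below is an illustrative addition (the sample size and bin width are arbitrary): it draws values of X from f(x) = x/2 on [0, 2) by inverse-transform sampling (the distribution function is F(x) = x²/4, so X = 2√U for U distributed Uniform(0,1)), applies y(x), and tabulates a crude histogram of the resulting values of Y, whose counts should fall away roughly like 2(1 − y).

    import math
    import random

    def sample_x():
        # Inverse-transform sample from f(x) = x/2 on [0, 2): F(x) = x^2/4, so x = 2*sqrt(u)
        return 2.0 * math.sqrt(random.random())

    def y_of_x(x):
        return 1.0 - math.sqrt(4.0 - x * x) / 2.0

    ys = [y_of_x(sample_x()) for _ in range(100000)]

    bins = [0] * 10
    for y in ys:
        bins[int(y * 10)] += 1              # every y lies in [0, 1)

    for k, count in enumerate(bins):
        print(f"{k / 10:.1f}-{(k + 1) / 10:.1f}: {count}")   # counts decrease roughly linearly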

An Alternative Question
In the example, a probability density function and a transformation function were given
and the requirement was to determine what new probability density function results.
Suppose instead that two probability density functions are given and the requirement is
to find a function which transforms one into the other.
Take the particular functions used in the previous example and pose the question as follows.
Given:

f(x) = \begin{cases} \frac{x}{2}, & \text{if } 0 \leq x < 2 \\ 0, & \text{otherwise} \end{cases} \quad\text{and}\quad g(y) = \begin{cases} 2(1 - y), & \text{if } 0 \leq y < 1 \\ 0, & \text{otherwise} \end{cases}

determine the function y(x) which will transform f (x) into g(y).

From the relationship g(y) = f\big(x(y)\big)\, \frac{dx}{dy}:

2(1 - y) = \frac{x}{2}\, \frac{dx}{dy}

or:

x\, \frac{dx}{dy} = 4(1 - y)

This differential equation is readily solved and yields:

\frac{x^2}{2} = 4y - 2y^2 + c

Since X = 0 has to transform into Y = 0, the constant c = 0.

– 11.5 –
Continuing:

x^2 = 4(2y - y^2)

Hence the inverse function x(y) is:

x(y) = 2\sqrt{y(2 - y)}

A little more processing is required to determine y(x):

y^2 - 2y + 1 = 1 - \frac{x^2}{4}

Hence:

(y - 1)^2 = 1 - \frac{x^2}{4}

This leads to:

y = 1 \pm \sqrt{1 - \frac{x^2}{4}}

Choice of sign is important. Note, again, that X = 0 has to transform into Y = 0 and
hence the minus sign is appropriate. This gives the solution:

y(x) = 1 - \frac{\sqrt{4 - x^2}}{2}
Transforming a Uniform Distribution
It would be unusual to wish to transform a triangular distribution but there is a good
reason for wanting to be able to transform a uniform distribution into something else.
The generation of a uniform distribution by computer is a well-understood process and
a typical programming language will be supplied with a library procedure to generate a
random variable whose values are uniformly distributed.
All that remains to generate a random variable which is distributed differently is to use
an appropriate transformation function.
It is very common to start with a distribution which is Uniform(0,1) which is to say that
the probability density function f (x) is:

f(x) = \begin{cases} 1, & \text{if } 0 \leq x < 1 \\ 0, & \text{otherwise} \end{cases}

Over the useful range of x, the relationship g(y) = f\big(x(y)\big) \left|\frac{dx}{dy}\right| simplifies to:

g(y) = \left|\frac{dx}{dy}\right| \qquad (11.5)

– 11.6 –
Example II
Take a random variable X whose probability density function f (x) is Uniform(0,1) and
suppose that the transformation function y(x) is:

y(x) = -\frac{1}{\lambda} \ln x \qquad (\lambda > 0)

Note that the useful part of the range of x is 0 to 1 and, over this range, y(x) decreases
monotonically from ∞ to 0.
Let Y = y(X) and let g(y) be the probability density function associated with Y . What
is the function g(y)?
The problem is illustrated by the following figures (in which λ = 2):

[Figure: three panels — the Uniform(0,1) density f(x), the transformation function y(x) = −½ ln x, and the derived density g(y), which falls away exponentially for y ≥ 0.]

First, derive x(y), the inverse of the function y(x). Given:

y = -\frac{1}{\lambda} \ln x

it follows that:

x = e^{-\lambda y}

Accordingly:

\frac{dx}{dy} = -\lambda\, e^{-\lambda y}

Given that λ > 0 this derivative \frac{dx}{dy} is everywhere negative. From (11.5):

g(y) = \left|\frac{dx}{dy}\right| = \lambda\, e^{-\lambda y}

As illustrated in the figures, the function y(x) transforms the distribution f (x) which is
Uniform(0,1) into g(y) which is the exponential distribution.
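This is the standard recipe for generating exponentially distributed values from uniformly distributed ones. A minimal Python sketch (an illustrative addition; λ = 2 matches the figures above):

    import math
    import random

    def exponential(lam):
        # Transform a Uniform(0,1) value into an Exponential(lam) value via y = -(1/lam) ln x;
        # 1 - random() lies in (0, 1], which avoids taking the logarithm of zero
        return -math.log(1.0 - random.random()) / lam

    samples = [exponential(2.0) for _ in range(100000)]
    print(sum(samples) / len(samples))   # close to 1/lam = 0.5, the mean of the distribution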

– 11.7 –
Example III — Introduction
Suppose raindrops fall in a uniformly distributed way onto the surface of a circular pond
which has unit radius.

[Figure: a circular pond of unit radius; a raindrop lands at the point D, a distance x from the centre.]

Let X be a random variable whose value x is the distance of a raindrop (shown at D in


the figure) from the centre of the pond. What is the probability density function f (x)
associated with X?
Consider a narrow annular concentric strip of radius x and width δx. The area of this
strip is 2πx δx. The area of the pond as a whole is π·1². Hence:

P(x \leq X < x + \delta x) = \frac{2\pi x\, \delta x}{\pi \cdot 1^2}

The probability density function f (x) is therefore 2x or, more strictly:

f(x) = \begin{cases} 2x, & \text{if } 0 \leq x < 1 \\ 0, & \text{otherwise} \end{cases}

Note, as a check, that f (x) is single valued and non-negative and its integral from −∞ to
+∞ is one.
This is another triangular distribution and leads to the unsurprising result that more
raindrops fall close to the edge of the pond than fall close to the centre.

Example III — Transformation


The value of the random variable X described in the previous section corresponded to the
distance of a random raindrop from the centre of the circular pond.
Suppose one is interested in the square of the distance from the centre of the pond and
how this derived value is distributed.
To investigate this, take the random variable X and apply to it the transformation function
y(x) specified as:
y(x) = x^2

Note that the useful part of the range of x is 0 to 1 and, over this range, y(x) increases
monotonically from 0 to 1.
Let Y = y(X) and let g(y) be the probability density function associated with Y . What
is the function g(y)?

– 11.8 –
The problem is illustrated by the following figures:

[Figure: three panels — the triangular density f(x) = 2x on 0 ≤ x < 1, the transformation function y(x) = x², and the derived density g(y) on 0 ≤ y < 1.]

First, derive x(y), the inverse of the function y(x):

x(y) = \sqrt{y}

Accordingly:

f\big(x(y)\big) = 2\sqrt{y} \quad\text{and}\quad \frac{dx}{dy} = \frac{1}{2\sqrt{y}}

From (11.4):

g(y) = f\big(x(y)\big)\, \left|\frac{dx}{dy}\right| = 1
As illustrated in the figures, the function y(x) transforms the triangular distribution f (x)
into the distribution g(y) which is Uniform(0,1).
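A simulation makes both halves of the raindrop example concrete. The sketch below is an illustrative addition: it scatters raindrops uniformly over the unit disc by rejection from the enclosing square and then inspects the distances and the squared distances; the means should be close to 2/3 (for the triangular density 2x) and 1/2 (for Uniform(0,1)) respectively.

    import math
    import random

    def raindrop_distance():
        # Rejection sampling: pick points uniformly in the 2x2 square until one lands
        # inside the unit circle, then return its distance from the centre
        while True:
            w1 = random.uniform(-1.0, 1.0)
            w2 = random.uniform(-1.0, 1.0)
            if w1 * w1 + w2 * w2 < 1.0:
                return math.sqrt(w1 * w1 + w2 * w2)

    ds = [raindrop_distance() for _ in range(100000)]
    print(sum(ds) / len(ds))                    # about 2/3, the mean of the density 2x
    print(sum(d * d for d in ds) / len(ds))     # about 1/2, the mean of Uniform(0,1)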

Transforming a Uniform Distribution into a Normal Distribution


It would be very useful if there were an easy way of transforming a uniform distribution
into a normal distribution.
Suppose that X is a random variable whose distribution is Uniform(0,1) and Y is a random
variable whose distribution is Normal(0,1). The associated probability density functions
(f (x) and g(y) respectively) are:

f(x) = \begin{cases} 1, & \text{if } 0 \leq x < 1 \\ 0, & \text{otherwise} \end{cases} \quad\text{and}\quad g(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}

The goal is to determine a function y(x) which will transform f (x) into g(y). Given that
f (x) is Uniform(0,1), relationship (11.5) above leads to the differential equation:
\frac{dx}{dy} = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2} \qquad (11.6)
Unfortunately this differential equation is intractable: its solution is x = Φ(y), so the required y(x) is the inverse of Φ and cannot be written in terms of elementary functions.
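In practice the inverse of Φ is available ready-made: Exercise 1 below uses Excel's NORMSINV for exactly this purpose, and Python's standard library offers an equivalent. A hedged illustration (not part of the notes):

    import random
    from statistics import NormalDist

    inv_phi = NormalDist().inv_cdf      # the inverse of Phi, i.e. the required y(x)

    # Transform Uniform(0,1) samples into (approximately) Normal(0,1) samples;
    # random.random() lies in [0, 1) and a value of exactly 0, which inv_cdf rejects,
    # is vanishingly unlikely
    ys = [inv_phi(random.random()) for _ in range(100000)]

    mean = sum(ys) / len(ys)
    var = sum((y - mean) ** 2 for y in ys) / len(ys)
    print(round(mean, 3), round(var, 3))   # close to 0 and 1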

– 11.9 –
Glossary
The following technical term has been introduced:
transformation function

Exercises — XI
1. Although (11.6) cannot be solved analytically, it succumbs to numerical methods.
The required transformation function y(x) is incorporated into Excel as the built-in
function NORMSINV. Its use is illustrated in the Excel worksheet which is shown on the
facing page. Prepare a worksheet like this one.
The following steps are involved:
(a) First set up a column of 99 values running from 0.01 to 0.99 in steps of 0.01.
Only the first 51 of these values appear on the facing page (the worksheet runs
to a second page which is not shown). Head this column x as shown.
(b) Set up a second column headed y(x). Each value is the result of applying the
function NORMSINV to the corresponding value of x. Note that 0.0 and 1.0 are
deliberately omitted as values of x because y(0) = −∞ and y(1) = +∞. The
range of the Uniform distribution is 0 to 1 and this maps into the range of the
Normal distribution which is −∞ to +∞.
(c) Use the chart wizard to set up the plot of the transformation function: y(x)
against x over the range of values 0.01 to 0.99. The chart only hints at how
rapidly the function approaches −∞ and +∞ as x tends to zero or one.
(d ) Set up the column headed Range, whose 12 values run from -2.75 to 2.75. These
values constitute the Bin Range required by the Histogram tool in Excel. . .
(e) Check the Tools menu. If the Data Analysis command is not there, choose the
Add-Ins command and, via that, pick up the Analysis ToolPak. Now choose the
Data Analysis command and, via that, select the Histogram tool.
Specify the range containing the 99 values under the heading y(x) as the Input
Range and specify the range containing the 12 values under the heading Range
as the Bin Range. Select the Output Range option and, as the Output Range
itself, specify the single cell two places to the right of the cell with Range in it.
This will be the top left-hand cell of the table which the Histogram tool should
then produce along with the lower chart.
(f ) Tidy up the chart and add comments to the worksheet. The overall result should
have a neat appearance, roughly like that of the worksheet opposite.

Ideally the figures in the column headed Frequency should be half a row higher. The
value 19 would then more clearly indicate that it is the number of values found between
−0.25 and +0.25. The value 0 at the top is the number of values found less than −2.75
and the value 0 at the bottom is the number of values found more than +2.75 (hence
the word More). The numbers against the x-axis of the chart are also rather unhappily
placed.
Despite these minor shortcomings the table and chart strongly suggest that a Uniform
distribution has been transformed into a Normal distribution.

– 11.10 –
2. Replace the 99 values in the column headed x by =RAND(). The new values will
be distributed Uniform(0,1) and the values in the y(x) column will continue to be
distributed Normal(0,1) but they will no longer be sorted. Ignore or delete the upper
chart. Delete the table and the lower chart and invoke the Histogram tool again. The
histogram which results will not be quite so convincing as its predecessor but it should
not be very different.

3. Extend the two main columns so that instead of 99 pairs of values there are 1000.
Delete the table and chart and invoke the Histogram tool (remember to extend the
Input Range). Note that the results are again fairly convincing.

4. Rework the original worksheet (on the previous page) but replace the transformation
function NORMSINV by −½ ln x and invoke the Histogram tool once again. Check that
the results are in reasonable accordance with Example II on page 11.7.

5. Given the probability density functions:

   f(x) = \begin{cases} 1, & \text{if } 0 \leq x < 1 \\ 0, & \text{otherwise} \end{cases} \quad\text{and}\quad g(y) = \begin{cases} \frac{y}{2}, & \text{if } 0 \leq y < 2 \\ 0, & \text{otherwise} \end{cases}

determine the function y(x) which will transform f (x) into g(y).
Rework the worksheet again to illustrate the use of the derived transformation function
to transform the uniform distribution whose probability density function is f (x) into
the triangular distribution whose probability density function is g(y).

6. The triangular distribution obtained in the previous exercise is the same as that whose
probability density function was given as f (x) in Example I on page 11.4. By applying
the transformation function used in Example I to the values in the second column, set
up a third column whose values should be distributed in accordance with the triangular
distribution obtained in Example I. Use the Histogram tool to demonstrate this.

– 11.12 –
12 — TRANSFORMING BIVARIATE DENSITY FUNCTIONS

Having seen how to transform the probability density functions associated with a single
random variable, the next logical step is to see how to transform bivariate probability
density functions.

Integration with two Independent Variables


Consider f (x1 , x2 ), a function of two independent variables. Using cartesian coordinates,
f (x1 , x2 ) might be represented by a surface above the x1 –x2 plane; the value of f (x1 , x2 )
at any point (x1 , x2 ) corresponds to the height above the point.
With two independent variables, integration is normally expressed with a double-integral
sign and integration is over some range R, a specified area in the x 1 –x2 plane:
\iint_R f(x_1, x_2)\, dx_1\, dx_2
Such an integral represents a volume. If R is a circle then this integral corresponds to the
volume of a cylinder standing on R whose upper end is cut by the surface f (x 1 , x2 ).
The following integral gives the volume of a cone whose height is h and whose base is a
circle of radius a centred on the origin (0, 0), this circle being the region R:
\iint_R h\left(1 - \frac{\sqrt{x_1^2 + x_2^2}}{a}\right) dx_1\, dx_2

Note that at the centre of the circle \sqrt{x_1^2 + x_2^2} = 0 and the value of the integrand is h. At
the edge of the circle \sqrt{x_1^2 + x_2^2} = a and the value of the integrand is 0.
In principle, two integrations are carried out in turn:
4 \int_0^a \left[ \int_0^{\sqrt{a^2 - x_2^2}} h\left(1 - \frac{\sqrt{x_1^2 + x_2^2}}{a}\right) dx_1 \right] dx_2 \qquad (12.1)

In this rearrangement, integration is over one quadrant of the circle and the result is
multiplied by 4. For a given value of x2, integration is along a strip, one end of which is at
x_1 = 0 and the other end of which is at x_1 = \sqrt{a^2 - x_2^2}. This accounts for the limits on
the inner integration.
Already, a seemingly simple example of integration with two independent variables is
beginning to become uncomfortably hard!
There are better ways of determining the volume of a cone but by judicious substitution
of both independent variables even the present approach can be greatly simplified.

Integration by Substitution of two new Variables


The general formula for integration by substitution of a new variable was given as (11.1):
\int_a^b f(x)\, dx = \int_{y(a)}^{y(b)} f\big(x(y)\big)\, \frac{dx}{dy}\, dy
The transformation function is y(x) and its inverse is x(y).

– 12.1 –
The equivalent formula when there are two independent variables is:

\iint_{R_x} f(x_1, x_2)\, dx_1\, dx_2 = \iint_{R_y} f\big(x_1(y_1, y_2), x_2(y_1, y_2)\big) \left|\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)}\right| dy_1\, dy_2 \qquad (12.2)

There are two transformation functions, y1 (x1 , x2 ) and y2 (x1 , x2 ), and their inverses are
x1 (y1 , y2 ) and x2 (y1 , y2 ).
The regions Rx and Ry are identical subject to the first being specified in the x1 –x2 plane
and the second being specified in the y1 –y2 plane.
The item \frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} is called a Jacobian and is defined as:

\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[6pt] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix} = \frac{\partial x_1}{\partial y_1}\frac{\partial x_2}{\partial y_2} - \frac{\partial x_1}{\partial y_2}\frac{\partial x_2}{\partial y_1}

To simplify (12.1) above, use the transformation functions:

y_1 = \sqrt{x_1^2 + x_2^2} \quad\text{and}\quad y_2 = \tan^{-1}\!\left(\frac{x_2}{x_1}\right)

and their inverses:

x_1 = y_1 \cos y_2 \quad\text{and}\quad x_2 = y_1 \sin y_2

Note that:

\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \begin{vmatrix} \cos y_2 & -y_1 \sin y_2 \\ \sin y_2 & y_1 \cos y_2 \end{vmatrix} = y_1

Using (12.2), the integration in (12.1) becomes:

4 \int_0^{\pi/2} \left[ \int_0^a h\left(1 - \frac{y_1}{a}\right) y_1\, dy_1 \right] dy_2

This is, of course, simply a transformation from cartesian coordinates to polar coordinates.
In the first, a small element of area is δx1 .δx2 whereas in the second a small element of
area is δy1 .y1 δy2 .
Integration is again over one quadrant of the circle. The inner integration is along a radius
and, in the outer integration, this radius is swept through an angle of 90 ◦ .
Continuing:

4 \int_0^{\pi/2} h\left(\frac{a^2}{2} - \frac{a^3}{3a}\right) dy_2 = 4 \int_0^{\pi/2} h\,\frac{a^2}{6}\, dy_2 = 4h\,\frac{a^2}{6}\cdot\frac{\pi}{2} = \frac{\pi a^2 h}{3}

The result is the familiar formula for the volume of a cone.
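The cartesian form (12.1) is awkward by hand but trivial for a computer, which gives a useful cross-check on the result. A midpoint-rule sketch in Python (an illustrative addition; the grid size and the values of a and h are arbitrary):

    import math

    def cone_volume(a, h, n=500):
        # Midpoint-rule evaluation of the double integral in (12.1) over one quadrant,
        # multiplied by 4; grid points outside the circle contribute nothing
        step = a / n
        total = 0.0
        for i in range(n):
            x1 = (i + 0.5) * step
            for j in range(n):
                x2 = (j + 0.5) * step
                r = math.sqrt(x1 * x1 + x2 * x2)
                if r < a:
                    total += h * (1.0 - r / a)
        return 4.0 * total * step * step

    a, h = 2.0, 3.0
    print(cone_volume(a, h))              # close to the exact value
    print(math.pi * a * a * h / 3.0)      # pi a^2 h / 3 = 12.566...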

– 12.2 –
Application to Bivariate Probability Density Functions
Formula (12.2) has direct application to the process of transforming bivariate probability
density functions. . .
Suppose X1 and X2 are two random variables whose bivariate probability density function
is f (x1 , x2 ). It is common practice to represent a given pair of values of the two random
variables X1 and X2 as a point in the x1 –x2 plane.
By definition:
P(X_1, X_2 \text{ lies in a specified region } R_x) = \iint_{R_x} f(x_1, x_2)\, dx_1\, dx_2 \qquad (12.3)

Any function of a random variable (or indeed of two or more random variables) is itself a
random variable. If y1 and y2 are taken as transformation functions, both y1 (X1 , X2 ) and
y2 (X1 , X2 ) will be derived random variables. Let Y1 = y1 (X1 , X2 ) and Y2 = y2 (X1 , X2 ).
Take Ry as identical to the region Rx but specified in the y1 –y2 plane. Necessarily:

P(Y1 , Y2 lies in a specified region Ry ) = P(X1 , X2 lies in a specified region Rx )

From this and by (12.3) and (12.2):

P(Y_1, Y_2 \text{ lies in a specified region } R_y) = \iint_{R_y} f\big(x_1(y_1, y_2), x_2(y_1, y_2)\big) \left|\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)}\right| dy_1\, dy_2

Notice that the integrand is expressed wholly in terms of y1 and y2. Calling this integrand g(y1 , y2 ):

P(Y_1, Y_2 \text{ lies in a specified region } R_y) = \iint_{R_y} g(y_1, y_2)\, dy_1\, dy_2

This demonstrates that g(y1 , y2 ) is the probability density function associated with the
two random variables Y1 and Y2 .
The requirements for f and g to be single valued and non-negative are just as in the one-
variable case and it is customary for the relationship between a probability density function
f (x1 , x2 ), the inverses x1 (y1 , y2 ) and x2 (y1 , y2 ) of a pair of transformation functions, and
the derived probability density function g(y1 , y2 ) to be written:

g(y_1, y_2) = f\big(x_1(y_1, y_2), x_2(y_1, y_2)\big) \left|\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)}\right|

This is directly analogous to relationship (11.4) given for the transformation of a single
random variable into another.

– 12.3 –
Summary — Single Variable and Bivariate Transformations
In this section, a summary of the single variable case and a summary of the bivariate case
are presented together so that the correspondence between the two can readily be seen.

Transformation of a single random variable:

• Start with a random variable X.


• Assume the associated probability density function is f (x).
• Choose a transformation function y(x).
• Let the derived random variable be Y = y(X).
• Assume the associated probability density function is g(y).
• Assume the inverse of the transformation function is x(y).
• The relationship between f (x) and g(y) is:

  g(y) = f\big(x(y)\big) \left|\frac{dx}{dy}\right|

• As a special case, if f (x) corresponds to a uniform distribution, the relationship is:

  g(y) = \left|\frac{dx}{dy}\right|

Transformation of a pair of random variables:

• Start with two random variables X1 and X2 .


• Assume the associated bivariate probability density function is f (x 1 , x2 ).
• Choose two transformation functions y1 (x1 , x2 ) and y2 (x1 , x2 ).
• Let the derived random variables be Y1 = y1 (X1 , X2 ) and Y2 = y2 (X1 , X2 ).
• Assume the associated bivariate probability density function is g(y 1 , y2 ).
• Assume the inverses of the two transformation functions are x1 (y1 , y2 ) and x2 (y1 , y2 ).
• The relationship between f (x1 , x2 ) and g(y1 , y2 ) is:

  g(y_1, y_2) = f\big(x_1(y_1, y_2), x_2(y_1, y_2)\big) \left|\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)}\right|

• As a special case, if f (x1 , x2 ) corresponds to a uniform distribution, the relationship is:

  g(y_1, y_2) = \left|\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)}\right| \qquad (12.4)

– 12.4 –
Example — The Box–Muller Transformation
An earlier attempt to transform a uniform distribution into a normal distribution proved
unsuccessful. Fortunately the difficulties can be overcome by starting with the bivariate
equivalent of the uniform distribution.
Suppose X1 and X2 are two independent random variables each distributed Uniform(0,1).
Bringing these together leads to the following bivariate probability density function:

f(x_1, x_2) = \begin{cases} 1, & \text{if } 0 \leq x_1, x_2 < 1 \\ 0, & \text{otherwise} \end{cases}

Informally, the function f = 1 when (x1 , x2 ) lies in a unit square which has one corner at
the origin but f = 0 if (x1 , x2 ) lies outside this square.
This is the uniform distribution assumed in relationship (12.4).
Suppose that the transformation functions are:
y_1 = \sqrt{-2\ln(x_1)}\, \cos(2\pi x_2) \quad\text{and}\quad y_2 = \sqrt{-2\ln(x_1)}\, \sin(2\pi x_2) \qquad (12.5)

First, derive the inverse functions:

x_1 = e^{-\frac{1}{2}(y_1^2 + y_2^2)} \quad\text{and}\quad x_2 = \frac{1}{2\pi} \tan^{-1}\!\left(\frac{y_2}{y_1}\right)

Next, evaluate the Jacobian:

\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[6pt] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix} = \begin{vmatrix} -y_1\, e^{-\frac{1}{2}(y_1^2 + y_2^2)} & -y_2\, e^{-\frac{1}{2}(y_1^2 + y_2^2)} \\[6pt] \dfrac{1}{2\pi}\,\dfrac{-y_2/y_1^2}{1 + (y_2/y_1)^2} & \dfrac{1}{2\pi}\,\dfrac{1/y_1}{1 + (y_2/y_1)^2} \end{vmatrix} = -\frac{1}{2\pi}\, e^{-\frac{1}{2}(y_1^2 + y_2^2)}\, \frac{1 + (y_2/y_1)^2}{1 + (y_2/y_1)^2} = -\frac{1}{2\pi}\, e^{-\frac{1}{2}(y_1^2 + y_2^2)}

From (12.4):

g(y_1, y_2) = \left|\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)}\right| = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_1^2} \times \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_2^2}
Recall (10.3) and note that this bivariate probability density function corresponds to two
independent random variables Y1 and Y2 which are each distributed Normal(0,1).
It is now clear how to transform a uniform distribution into a normal distribution:
• Start with two independent random variables X1 and X2 which are each distributed
Uniform(0,1).
• From the transformation functions y1 and y2 derive two new random variables being
Y1 = y1 (X1 , X2 ) and Y2 = y2 (X1 , X2 ).
• The derived random variables will each independently be distributed Normal(0,1).
This process is known as the Box–Muller transformation.

– 12.5 –
Box–Muller Refinement
The following procedure, written in a hypothetical programming language, makes use of the
Box–Muller transformation; repeated calls of this procedure will return random numbers
which are distributed Normal(0,1):

PROCEDURE normal
X1 = uniform(0,1)
X2 = uniform(0,1)
Y1 = sqrt(-2*ln(X1))*cos(2*pi*X2)
Y2 = sqrt(-2*ln(X1))*sin(2*pi*X2)
RETURN Y1
END

It is assumed that uniform, sqrt, ln, cos, and sin are library procedures which have the
obvious effects. In particular, repeated calls of uniform(0,1) will return random numbers
which are distributed Uniform(0,1).
Mathematically the procedure is fine but it is not altogether satisfactory from a Computer
Science point of view. Most obviously, the value Y2 is computed but never used. It would
be better to arrange for the procedure to have two states. In one state, both Y1 and Y2
would be evaluated and the value of Y1 returned. The value of Y2 would be retained so
that it could be returned on the next call when the procedure would be in the alternate
state in which no evaluation would be necessary.
The procedure also makes two calls of sqrt, two calls of ln and one each of cos and sin.
All six of these calls are quite expensive in computer time and it is possible to be much
more efficient.
Instead of starting with the two random variables X1 and X2 which are each distributed
Uniform(0,1) a better approach is to begin with two other random variables W 1 and W2
whose values represent the (cartesian) coordinates of a point in a unit circle centred on
the origin. All points in the circle are equally likely, just as in the raindrops and pond
example discussed earlier.
Assuming the values of W1 and W2 are w1 and w2 respectively, the random variables X1
and X2 are then given values:

x_1 = w_1^2 + w_2^2 \quad\text{and}\quad x_2 = \frac{1}{2\pi} \tan^{-1}\!\left(\frac{w_2}{w_1}\right) \qquad (12.6)

It will be demonstrated shortly that these derived random variables X 1 and X2 are both
distributed Uniform(0,1) and can therefore be used as before.
At this stage, the introduction of the two random variables W1 and W2 hardly seems to
have led to an improvement but it will be shown that, by their use, the number of expensive
procedure calls can be greatly reduced.
To appreciate how this revised approach works and why it leads to greater efficiency, it is
necessary to revisit the circular pond. . .

– 12.6 –
In the figure below, the coordinates of the point D are shown as (w1 , w2 ), these being the
values of the random variables W1 and W2 :

[Figure: the unit circle; the point D at (w1 , w2 ) lies a distance r from the centre, and θ is the angle between the radius to D and the w1 -axis.]

From the figure, r is the distance of point D from the centre and r² = w1² + w2² but,
from (12.6), x1 = w1² + w2². Hence:

x_1 = r^2 \quad\text{or}\quad r = \sqrt{x_1} \qquad (12.7)

The value of the derived random variable X1 is therefore the square of the distance r
of D from the centre and, from the experience of the raindrops and pond example, it is
distributed Uniform(0,1).

From the figure, \theta = \tan^{-1}\!\left(\frac{w_2}{w_1}\right) but, from (12.6), x_2 = \frac{1}{2\pi} \tan^{-1}\!\left(\frac{w_2}{w_1}\right). Hence:

x_2 = \frac{\theta}{2\pi} \quad\text{or}\quad \theta = 2\pi x_2 \qquad (12.8)

Assuming a two-argument inverse-tangent function is used (such as ATAN2 in Excel), θ will


be uniformly distributed over the range 0 to 2π. This ensures that the value of the derived
random variable X2 is distributed Uniform(0,1).
It is now clear that both X1 and X2 are distributed Uniform(0,1).
From the figure and from (12.7) and (12.8):

w_1 = r\cos\theta = \sqrt{x_1}\, \cos(2\pi x_2) \quad\text{so}\quad \cos(2\pi x_2) = \frac{w_1}{\sqrt{x_1}}

and:

w_2 = r\sin\theta = \sqrt{x_1}\, \sin(2\pi x_2) \quad\text{so}\quad \sin(2\pi x_2) = \frac{w_2}{\sqrt{x_1}}

The transformation functions (12.5) can therefore be rewritten:

y_1 = w_1 \sqrt{\frac{-2\ln(x_1)}{x_1}} \quad\text{and}\quad y_2 = w_2 \sqrt{\frac{-2\ln(x_1)}{x_1}} \qquad (12.9)

– 12.7 –
The procedure written in the hypothetical programming language can now be modified to
accommodate the revised approach:

PROCEDURE normal
REPEAT
W1 = uniform(-1,+1)
W2 = uniform(-1,+1)
X1 = W1*W1+W2*W2
UNTIL X1<1
FACTOR = sqrt(-2*ln(X1)/X1)
Y1 = FACTOR*W1
Y2 = FACTOR*W2
RETURN Y1
END

The first two assignment statements in the REPEAT–UNTIL loop give values to the random
variables W1 and W2 but these values are each in the range −1 to +1. The coordinates
(w1 , w2 ) represent a point which is guaranteed to lie inside a 2 × 2 square centred on the
origin but is not guaranteed to lie inside the unit circle.
A preliminary value w12 + w22 is assigned to the derived random variable X1 ; this is the
square of the distance from the origin. This value is acceptable if it is less than one. If
not, the loop is repeated and new values are determined for W1 and W2 and the derived
random variable X1 .
The value assigned to the identifier FACTOR is the value of the factor common to both
expressions in (12.9). Multiplying this factor by w1 and w2 respectively provides values
for the derived random variables Y1 and Y2 which are both distributed Normal(0,1).
Notice that no value is computed for the derived random variable X 2 since x2 does not
feature in the expressions in (12.9).
This procedure is more efficient than its predecessor and makes only a single call of sqrt
and a single call of ln and there are no calls of cos or sin. Nevertheless, the procedure
still makes no use of Y2. A little extra programming could save the value of Y2 for the
next call of the procedure.
Another improvement would be to enhance the procedure so that it had two arguments
MEAN and STDEV and returned a value which is distributed Normal(MEAN,STDEV 2 ) instead
of Normal(0,1).
Nervous readers might be alarmed at what appears to be a negative argument for the sqrt
function. Remember that x1 is in the range 0 to 1 so ln(x1 ) is guaranteed to be negative
which ensures that −2 ln(x1 ) is positive.
There is a more serious cause for concern in that x1 , the argument of ln, could in principle
be zero. This possibility can be trapped by modifying the condition after UNTIL to 0<X1<1
so that x1 has to be strictly greater than zero.
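Gathering these suggestions together, a Python rendering of the refined procedure might look as follows. This is an illustrative sketch rather than the notes' own code: it caches the spare value so that both Y1 and Y2 are used, takes the two suggested arguments (here called mean and stdev), and insists on 0 < x1 < 1 as recommended.

    import math
    import random

    _spare = None   # the cached second value from the previous call, if any

    def normal(mean=0.0, stdev=1.0):
        # Polar form of the Box-Muller transformation, returning one
        # Normal(mean, stdev^2) value per call
        global _spare
        if _spare is not None:
            y = _spare
            _spare = None
            return mean + stdev * y
        while True:
            w1 = random.uniform(-1.0, 1.0)
            w2 = random.uniform(-1.0, 1.0)
            x1 = w1 * w1 + w2 * w2            # squared distance from the origin
            if 0.0 < x1 < 1.0:                # inside the unit circle, and ln(x1) is safe
                break
        factor = math.sqrt(-2.0 * math.log(x1) / x1)
        _spare = factor * w2                  # save Y2 for the next call
        return mean + stdev * factor * w1     # return Y1, scaled and shifted

    samples = [normal(10.0, 2.0) for _ in range(100000)]
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / len(samples)
    print(round(m, 2), round(v, 2))           # close to 10 and 4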

– 12.8 –
