
COURSE NOTES STATS 210 Statistical Theory

Department of Statistics University of Auckland

Contents

1. Probability . . . 3
   1.1 Introduction . . . 3
   1.2 Sample spaces . . . 3
   1.3 Events . . . 5
   1.4 Partitioning sets and events . . . 12
   1.5 Probability: a way of measuring sets . . . 14
   1.6 Probabilities of combined events . . . 17
   1.7 The Partition Theorem . . . 19
   1.8 Examples of basic probability calculations . . . 20
   1.9 Formal probability proofs: non-examinable . . . 22
   1.10 Conditional Probability . . . 24
   1.11 Examples of conditional probability and partitions . . . 30
   1.12 Bayes' Theorem: inverting conditional probabilities . . . 32
   1.13 Statistical Independence . . . 35
   1.14 Random Variables . . . 37
   1.15 Key Probability Results for Chapter 1 . . . 41
   1.16 Chains of events and probability trees: non-examinable . . . 43
   1.17 Equally likely outcomes and combinatorics: non-examinable . . . 46

2. Discrete Probability Distributions . . . 50
   2.1 Introduction . . . 50
   2.2 The probability function, fX(x) . . . 51
   2.3 Bernoulli trials . . . 53
   2.4 Example of the probability function: the Binomial Distribution . . . 54
   2.5 The cumulative distribution function, FX(x) . . . 58
   2.6 Hypothesis testing . . . 62
   2.7 Example: Presidents and deep-sea divers . . . 69
   2.8 Example: Birthdays and sports professionals . . . 76
   2.9 Likelihood and estimation . . . 79
   2.10 Random numbers and histograms . . . 91
   2.11 Expectation . . . 94
   2.12 Variable transformations . . . 100
   2.13 Variance . . . 106
   2.14 Mean and variance of the Binomial(n, p) distribution . . . 112

3. Modelling with Discrete Probability Distributions . . . 118
   3.1 Binomial distribution . . . 118
   3.2 Geometric distribution . . . 119
   3.3 Negative Binomial distribution . . . 124
   3.4 Hypergeometric distribution: sampling without replacement . . . 127
   3.5 Poisson distribution . . . 130
   3.6 Subjective modelling . . . 136

4. Continuous Random Variables . . . 138
   4.1 Introduction . . . 138
   4.2 The probability density function . . . 140
   4.3 The Exponential distribution . . . 151
   4.4 Likelihood and estimation for continuous random variables . . . 156
   4.5 Hypothesis tests . . . 158
   4.6 Expectation and variance . . . 160
   4.7 Exponential distribution mean and variance . . . 162
   4.8 The Uniform distribution . . . 166
   4.9 The Change of Variable Technique: finding the distribution of g(X) . . . 168
   4.10 Change of variable for non-monotone functions: non-examinable . . . 173
   4.11 The Gamma distribution . . . 175
   4.12 The Beta Distribution: non-examinable . . . 178

5. The Normal Distribution and the Central Limit Theorem . . . 179
   5.1 The Normal Distribution . . . 179
   5.2 The Central Limit Theorem (CLT) . . . 184

6. Wrapping Up . . . 191
   6.1 Estimators: the good, the bad, and the estimator PDF . . . 191
   6.2 Hypothesis tests: in search of a distribution . . . 196

Chapter 1: Probability
1.1 Introduction

Definition: A probability is a number between 0 and 1 representing how likely it is that an event will occur.

Probabilities can be:

1. Frequentist (based on frequencies),

e.g. probability = (number of times event occurs) / (number of opportunities for event to occur);

2. Subjective: probability represents a person's degree of belief that an event will occur,

e.g. I think there is an 80% chance it will rain today, written as P(rain) = 0.80.
Regardless of how we obtain probabilities, we always combine and manipulate them according to the same rules.

1.2 Sample spaces

Definition: A random experiment is an experiment whose outcome is not known until it is observed.

Definition: A sample space, Ω, is a set of outcomes of a random experiment. Every possible outcome must be listed once and only once.

Definition: A sample point is an element of the sample space. For example, if the sample space is Ω = {s1, s2, s3}, then each si is a sample point.

Examples:

Experiment: Toss a coin twice and observe the result.
Sample space: Ω = {HH, HT, TH, TT}.
An example of a sample point is: HT.

Experiment: Toss a coin twice and count the number of heads.
Sample space: Ω = {0, 1, 2}.

Experiment: Toss a coin twice and observe whether the two tosses are the same (e.g. HH or TT).
Sample space: Ω = {same, different}.

Discrete and continuous sample spaces

Definition: A sample space is finite if it has a finite number of elements.

Definition: A sample space is discrete if there are gaps between the different elements, or if the elements can be listed, even if an infinite list (e.g. 1, 2, 3, . . .).

In mathematical language, a sample space is discrete if it is finite or countable.

Definition: A sample space is continuous if there are no gaps between the elements, so the elements cannot be listed (e.g. the interval [0, 1]).

Examples:
Ω = {0, 1, 2, 3}  (discrete and finite)
Ω = {4.5, 4.6, 4.7}  (discrete, finite)
Ω = {0, 1, 2, 3, . . .}  (discrete, infinite)
Ω = {HH, HT, TH, TT}  (discrete, finite)
Ω = {[0, 90), [90, 360)}  (discrete, finite)
Ω = [0, 1] = {all numbers between 0 and 1 inclusive}  (continuous, infinite)
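The coin-toss sample spaces above can be built in a few lines of Python (a quick illustration, not part of the notes), representing Ω as a set of strings:

```python
from itertools import product

# Sample space for "toss a coin twice and observe the result":
# all ordered pairs of H and T.
omega = {"".join(p) for p in product("HT", repeat=2)}
print(sorted(omega))  # ['HH', 'HT', 'TH', 'TT']

# Derived sample space: count the number of heads in each outcome.
heads = {outcome.count("H") for outcome in omega}
print(sorted(heads))  # [0, 1, 2]

# Derived sample space: are the two tosses the same or different?
same_or_diff = {"same" if s[0] == s[1] else "different" for s in omega}
print(same_or_diff == {"same", "different"})  # True
```

Note how each "experiment" gives a different sample space, even though the physical activity (two coin tosses) is the same.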

1.3 Events

Suppose you are setting out to create a science of randomness. Somehow you need to harness the idea of randomness, which is all about the unknown, and express it in terms of mathematics. How would you do it?

So far, we have introduced the sample space, Ω, which lists all possible outcomes of a random experiment, and might seem unexciting.
Kolmogorov (1903-1987). One of the founders of probability theory.

However, Ω is a set. It lays the ground for a whole mathematical formulation of randomness, in terms of set theory. The next concept that you would need to formulate is that of something that happens at random, or an event. How would you express the idea of an event in terms of set theory?

Definition: An event is a subset of the sample space. That is, any collection of outcomes forms an event.

Example: Toss a coin twice. Sample space: Ω = {HH, HT, TH, TT}.

Let event A be the event that there is exactly one head. We write: A = "exactly one head". Then A = {HT, TH}.

A is a subset of Ω, as in the definition. We write A ⊆ Ω.

Definition: Event A occurs if we observe an outcome that is a member of the set A.

Note: Ω is a subset of itself, so Ω is an event. The empty set, ∅ = {}, is also a subset of Ω. This is called the null event, or the event with no outcomes.

Example: Experiment: throw 2 dice.
Sample space: Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), (2, 2), . . . , (2, 6), . . . , (6, 6)}.
Event A = "sum of two faces is 5" = {(1, 4), (2, 3), (3, 2), (4, 1)}.

Combining Events

Formulating random events in terms of sets gives us the power of set theory to describe all possible ways of combining or manipulating events. For example, we need to describe things like coincidences (events happening together), alternatives, opposites, and so on. We do this in the language of set theory.

Example: Suppose our random experiment is to pick a person in the class and see what form(s) of transport they used to get to campus today.

[Venn diagram: events Bus, Car, Train, Walk, Bike drawn within the sample space "People in class".]

This sort of diagram representing events in a sample space is called a Venn diagram.

1. Alternatives: the union "or" operator, ∪

We wish to describe an event that is composed of several different alternatives. For example, the event that you used a motor vehicle to get to campus is the event that your journey involved a car, or a bus, or both. To represent the set of journeys involving these alternatives, we shade all outcomes in Bus and all outcomes in Car.

[Venn diagram: all outcomes in Bus and all outcomes in Car shaded.]

Overall, we have shaded all outcomes in the UNION of Bus and Car. We write the event that you used a motor vehicle as the event Bus ∪ Car, read as "Bus UNION Car". The union operator, ∪, denotes Bus OR Car OR both.

Note: Be careful not to confuse Or and And. To shade the union of Bus and Car, we had to shade everything in Bus AND everything in Car. To remember whether union refers to Or or And, consider what an outcome needs to satisfy for the shaded event to occur. The answer is Bus, OR Car, OR both. NOT Bus AND Car.

Definition: Let A and B be events on the same sample space Ω: so A ⊆ Ω and B ⊆ Ω. The union of events A and B is written A ∪ B, and is given by

A ∪ B = {s : s ∈ A or s ∈ B or both}.

2. Concurrences and coincidences: the intersection "and" operator, ∩

The intersection is an event that occurs when two or more events ALL occur together. For example, consider the event that your journey today involved BOTH a car AND a train. To represent this event, we shade all outcomes in the OVERLAP of Car and Train.

[Venn diagram: the overlap of Car and Train shaded.]

We write the event that you used both car and train as Car ∩ Train, read as "Car INTERSECT Train". The intersection operator, ∩, denotes both Car AND Train together.

Definition: The intersection of events A and B is written A ∩ B and is given by

A ∩ B = {s : s ∈ A AND s ∈ B}.

3. Opposites: the complement or "not" operator

The complement of an event is the opposite of the event: everything EXCEPT the event. For example, consider the event that your journey today did NOT involve walking. To represent this event, we shade all outcomes in Ω except those in the event Walk.

[Venn diagram: everything in Ω outside the event Walk shaded.]

We write the event "not Walk" as the complement of the event Walk.

Definition: The complement of event A is written Ā and is given by Ā = {s : s ∉ A}.

Examples: Experiment: Pick a person in this class at random. Sample space: Ω = {all people in class}. Let event A = "person is male" and event B = "person travelled by bike today". Suppose I pick a male who did not travel by bike. Say whether the following events have occurred:

1) A: Yes.
2) B: No.
3) Ā: No.
4) B̄: Yes.
5) Ā ∪ B = {female or bike rider or both}: No.
6) A ∩ B̄ = {male and non-biker}: Yes.
7) A ∩ B = {male and bike rider}: No.
8) The complement of (A ∩ B) = everything outside A ∩ B. A ∩ B did not occur, so its complement did occur: Yes.

Question: What is the event Ω̄? Ω̄ = ∅.

Challenge: can you express A ∩ B using only unions and complements?
Answer: A ∩ B is the complement of (Ā ∪ B̄).
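These set operations translate directly into Python's set type (| for ∪, & for ∩, and subtraction from Ω for the complement), so answers like 1)–8) can be checked mechanically. The six-person class below is invented purely for illustration:

```python
# Hypothetical sample space: a class of six people.
omega = {"Ann", "Bob", "Cal", "Dee", "Eli", "Fay"}
A = {"Bob", "Cal", "Eli"}   # event: person is male
B = {"Ann", "Cal"}          # event: person travelled by bike today

def complement(event):
    """Complement of an event, relative to the sample space omega."""
    return omega - event

picked = "Bob"  # a male who did not travel by bike

print(picked in A)                      # True:  A occurred
print(picked in B)                      # False: B did not occur
print(picked in complement(A) | B)      # False: "female or bike rider" did not occur
print(picked in A & complement(B))      # True:  "male and non-biker" occurred
print(picked in complement(A & B))      # True:  everything outside A ∩ B occurred

# The challenge: A ∩ B expressed using only unions and complements.
assert A & B == complement(complement(A) | complement(B))
```

The final assertion is one of De Morgan's laws, which appear formally in property (iv) below.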

10

Limitations of Venn diagrams Venn diagrams are generally useful for up to 3 events, although they are not used to provide formal proofs. For more than 3 events, the diagram might not be able to represent all possible overlaps of events. (This was probably the case for our transport Venn diagram.) Example:
[Venn diagrams of three events A, B, and C: panels (a) and (b) show shaded combinations of A, B, and C.]

Properties of union, intersection, and complement

The following properties hold.

(i) Ω̄ = ∅ and ∅̄ = Ω. (The opposite of everything is nothing.)

(ii) For any event A, A ∪ Ā = Ω (everything is either A or not A), and A ∩ Ā = ∅ (nothing is both A and not A).

(iii) For any events A and B, A ∪ B = B ∪ A and A ∩ B = B ∩ A. (Commutative.)

(iv) (a) The complement of (A ∪ B) equals Ā ∩ B̄.
     (b) The complement of (A ∩ B) equals Ā ∪ B̄.

11

Distributive laws

We are familiar with the fact that multiplication is distributive over addition. This means that, if a, b, and c are any numbers, then a × (b + c) = a × b + a × c. However, addition is not distributive over multiplication: a + (b × c) ≠ (a + b) × (a + c).

For set union and set intersection, union is distributive over intersection, AND intersection is distributive over union. Thus, for any sets A, B, and C:

A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C),
and
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

[Venn diagrams illustrating the distributive laws.]

More generally, for several events A and B1, B2, . . . , Bn:

A ∪ (B1 ∩ B2 ∩ . . . ∩ Bn) = (A ∪ B1) ∩ (A ∪ B2) ∩ . . . ∩ (A ∪ Bn),
i.e. A ∪ (⋂_{i=1}^n Bi) = ⋂_{i=1}^n (A ∪ Bi),

and

A ∩ (B1 ∪ B2 ∪ . . . ∪ Bn) = (A ∩ B1) ∪ (A ∩ B2) ∪ . . . ∪ (A ∩ Bn),
i.e. A ∩ (⋃_{i=1}^n Bi) = ⋃_{i=1}^n (A ∩ Bi).
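These identities are easy to spot-check with Python sets (a sanity check with arbitrary example sets, not a proof):

```python
# Spot-check the distributive laws on arbitrary finite sets.
A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 5, 6, 7}

# Union is distributive over intersection:
assert A | (B & C) == (A | B) & (A | C)

# Intersection is distributive over union:
assert A & (B | C) == (A & B) | (A & C)

# The n-set versions, with three sets playing the role of B1, ..., Bn:
Bs = [{1, 5}, {2, 5, 6}, {5, 7}]
assert A | set.intersection(*Bs) == set.intersection(*[A | Bi for Bi in Bs])
assert A & set.union(*Bs) == set.union(*[A & Bi for Bi in Bs])
print("all distributive-law checks passed")
```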

12

1.4 Partitioning sets and events

The idea of a partition is fundamental in probability manipulations. Later in this chapter we will encounter the important Partition Theorem. For now, we give some background definitions.

Definition: Two events A and B are mutually exclusive, or disjoint, if A ∩ B = ∅.

This means events A and B cannot happen together. If A happens, it excludes B from happening, and vice-versa.

[Venn diagram: disjoint events A and B.]

Note: Does this mean that A and B are independent? No: quite the opposite. A EXCLUDES B from happening, so B depends strongly on whether or not A happens.

Definition: Any number of events A1, A2, . . . , Ak are mutually exclusive if every pair of the events is mutually exclusive: i.e. Ai ∩ Aj = ∅ for all i, j with i ≠ j.

[Venn diagram: pairwise disjoint events A1, A2, A3.]

Definition: A partition of the sample space Ω is a collection of mutually exclusive events whose union is Ω. That is, sets B1, B2, . . . , Bk form a partition of Ω if

Bi ∩ Bj = ∅ for all i, j with i ≠ j, and B1 ∪ B2 ∪ . . . ∪ Bk = Ω.

13

Examples: B1, B2, B3, B4 form a partition of Ω; B1, . . . , B5 partition Ω.

[Venn diagrams: Ω divided into B1, . . . , B4, and into B1, . . . , B5.]

Important: B and B̄ partition Ω for any event B.

Partitioning an event A

Any set or event A can be partitioned: it doesn't have to be Ω. If B1, . . . , Bk form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bk) form a partition of A.

[Venn diagram: A cut into pieces A ∩ B1, . . . , A ∩ B4 by the partition.]

We will see that this is very useful for finding the probability of event A. This is because it is often easier to find the probability of small chunks of A (the partitioned sections) than to find the whole probability of A at once. The partition idea shows us how to add the probabilities of these chunks together: see later.

14

1.5 Probability: a way of measuring sets

Remember that you are given the job of building the science of randomness. This means somehow measuring chance. If I sent you away to measure heights, the first thing you would ask is what you are supposed to be measuring the heights of. People? Trees? Mountains?

We have the same question when setting out to measure chance. Chance of what? The answer is sets. It was clever to formulate our notions of events and sample spaces in terms of sets: it gives us something to measure. Probability, the name that we give to our chance-measure, is a way of measuring sets.

You probably already have a good idea for a suitable way to measure the size of a set or event. Why not just count the number of elements in it? In fact, this is often what we do to measure probability (although counting the number of elements can be far from easy!) But there are circumstances where this is not appropriate. What happens, for example, if one set is far more likely than another, but they have the same number of elements? Should they have the same probability?

First set: {Springboks win}. Second set: {All Blacks win}. Both sets have just one element, but we need to give them different probabilities!

More problems arise when the sets are infinite or continuous. Should the intervals [3, 4] and [13, 14] have the same probability, just because they are the same length? Yes they should, if (say) our random experiment is to pick a random number on [0, 20]; but no they shouldn't (hopefully!) if our experiment was the time in years taken by a student to finish their degree.

15

Most of this course is about probability distributions. A probability distribution is a rule according to which probability is apportioned, or distributed, among the different sets in the sample space.

At its simplest, a probability distribution just lists every element in the sample space and allots it a probability between 0 and 1, such that the total sum of probabilities is 1. In the rugby example, we could use the following probability distribution: P(Springboks win) = 0.3, P(All Blacks win) = 0.7.

In general, we have the following definition for discrete sample spaces.

Discrete probability distributions

Definition: Let Ω = {s1, s2, . . .} be a discrete sample space. A discrete probability distribution on Ω is a set of real numbers {p1, p2, . . .} associated with the sample points {s1, s2, . . .} such that:

1. 0 ≤ pi ≤ 1 for all i;
2. Σ_i pi = 1.

pi is called the probability of the event that the outcome is si. We write: pi = P(si).

The rule for measuring the probability of any set, or event, A ⊆ Ω, is to sum the probabilities of the elements of A:

P(A) = Σ_{i ∈ A} pi.

E.g. if A = {s3, s5, s14}, then P(A) = p3 + p5 + p14.
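A discrete probability distribution is just a mapping from sample points to numbers satisfying the two conditions, and P(A) is a sum over the points of A. A short sketch in Python (the probabilities below are made up for illustration):

```python
import math

# A made-up discrete distribution on omega = {s1, ..., s5}.
p = {"s1": 0.1, "s2": 0.3, "s3": 0.25, "s4": 0.15, "s5": 0.2}

# Check the two defining conditions.
assert all(0 <= pi <= 1 for pi in p.values())
assert math.isclose(sum(p.values()), 1.0)

def prob(A):
    """P(A) = sum of p_i over the sample points in the event A."""
    return sum(p[s] for s in A)

print(round(prob({"s3", "s5"}), 10))  # p3 + p5 = 0.45
print(round(prob(set(p)), 10))        # P(omega) = 1.0
print(prob(set()))                    # P(empty event) = 0
```

Note that prob(set()) = 0 anticipates the result P(∅) = 0 proved in Section 1.9.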

16

Continuous probability distributions

On a continuous sample space Ω, e.g. Ω = [0, 1], we can not list all the elements and give them an individual probability. We will need more sophisticated methods detailed later in the course. However, the same principle applies. A continuous probability distribution is a rule under which we can calculate a probability between 0 and 1 for any set, or event, A ⊆ Ω.

Probability Axioms

For any sample space, discrete or continuous, all of probability theory is based on the following three definitions, or axioms.

Axiom 1: P(Ω) = 1.

Axiom 2: 0 ≤ P(A) ≤ 1 for all events A.

Axiom 3: If A1, A2, . . . , An are mutually exclusive events (no overlap), then
P(A1 ∪ A2 ∪ . . . ∪ An) = P(A1) + P(A2) + . . . + P(An).

Note: The axioms can never be proved: they are definitions. If our rule for measuring sets satisfies the three axioms, it is a valid probability distribution. The idea of the axioms is that ALL possible properties of probability should be derivable using ONLY these three axioms. To see how this works, see Section 1.9 (non-examinable). The definition of a discrete probability distribution on page 15 clearly satisfies the axioms. The challenge of defining a probability distribution on a continuous sample space is left till later.

Note: P(∅) = 0.

Note: Remember that an EVENT is a SET: an event is a subset of the sample space.

17

1.6 Probabilities of combined events

In Section 1.3 we discussed unions, intersections, and complements of events. We now look at the probabilities of these combinations. Everything below applies to events (sets) in either a discrete or a continuous sample space.

1. Probability of a union

Let A and B be events on a sample space Ω. There are two cases for the probability of the union A ∪ B:

1. A and B are mutually exclusive (no overlap): i.e. A ∩ B = ∅.
2. A and B are not mutually exclusive: A ∩ B ≠ ∅.

For Case 1, we get the probability of A ∪ B straight from Axiom 3:

If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).

For Case 2, we have the following formula. For ANY events A, B:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Note: The formula for Case 2 applies also to Case 1: just substitute P(A ∩ B) = P(∅) = 0.

For three or more events: e.g. for any A, B, and C,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

18

Explanation

For any events A and B, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

The formal proof of this formula is in Section 1.9 (non-examinable). To understand the formula, think of the Venn diagrams:

[Venn diagrams: A and B with their intersection shaded; A together with B \ (A ∩ B).]

When we add P(A) + P(B), we add the intersection twice. So we have to subtract the intersection once to get P(A ∪ B):

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Alternatively, think of A ∪ B as two disjoint sets: all of A, and the bits of B without the intersection. So P(A ∪ B) = P(A) + [P(B) − P(A ∩ B)].

2. Probability of an intersection

There is no easy formula for P(A ∩ B). We might be able to use statistical independence (Section 1.13). If A and B are not statistically independent, we often use conditional probability (Section 1.10).

3. Probability of a complement

P(Ā) = 1 − P(A). This is obvious, but a formal proof is given in Section 1.9.

19

1.7 The Partition Theorem

The Partition Theorem is one of the most useful tools for probability calculations. It is based on the fact that probabilities are often easier to calculate if we break down a set into smaller parts.

Recall that a partition of Ω is a collection of non-overlapping sets B1, . . . , Bm which together cover everything in Ω.

[Venn diagram: Ω partitioned into B1, B2, B3, B4.]

Also, if B1, . . . , Bm form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bm) form a partition of the set or event A.

[Venn diagram: A partitioned into A ∩ B1, . . . , A ∩ B4.]

The probability of event A is therefore the sum of its parts:

P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + P(A ∩ B4).

The Partition Theorem is a mathematical way of saying "the whole is the sum of its parts".

Theorem 1.7: The Partition Theorem. (Proof in Section 1.9.)

Let B1, . . . , Bm form a partition of Ω. Then for any event A,

P(A) = Σ_{i=1}^m P(A ∩ Bi).

Note: Recall the formal definition of a partition. Sets B1, B2, . . . , Bm form a partition of Ω if Bi ∩ Bj = ∅ for all i ≠ j, and ⋃_{i=1}^m Bi = Ω.

20

1.8 Examples of basic probability calculations

An Australian survey asked people what sort of car they would like if they could choose any car at all. 13% of respondents had children and chose a large car. 12% of respondents did not have children and chose a large car. 33% of respondents had children.

Find the probability that a respondent: (a) chose a large car; (b) either had children or chose a large car (or both).

First formulate events: Let C = "has children", C̄ = "no children", and L = "chooses large car".

Next write down all the information given:

P(C) = 0.33; P(C ∩ L) = 0.13; P(C̄ ∩ L) = 0.12.

(a) Asked for P(L).

P(L) = P(L ∩ C) + P(L ∩ C̄)   (Partition Theorem)
     = 0.13 + 0.12
     = 0.25.

P(chooses large car) = 0.25.

(b) Asked for P(L ∪ C).

P(L ∪ C) = P(L) + P(C) − P(L ∩ C)   (Section 1.6)
         = 0.25 + 0.33 − 0.13
         = 0.45.
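The survey working above can be transcribed into a few lines of Python as a quick numerical check:

```python
import math

# Information given in the survey example.
p_C = 0.33           # P(C): has children
p_C_and_L = 0.13     # P(C and L): children, chose large car
p_notC_and_L = 0.12  # P(not-C and L): no children, chose large car

# (a) Partition Theorem: C and its complement partition the sample
# space, so P(L) = P(L and C) + P(L and not-C).
p_L = p_C_and_L + p_notC_and_L
assert math.isclose(p_L, 0.25)

# (b) Probability of a union (Section 1.6):
# P(L or C) = P(L) + P(C) - P(L and C).
p_L_or_C = p_L + p_C - p_C_and_L
assert math.isclose(p_L_or_C, 0.45)

print(round(p_L, 10), round(p_L_or_C, 10))
```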

21

Example 2: Facebook statistics for New Zealand university students aged between 18 and 24 suggest that 22% are interested in music, while 34% are interested in sport.

Formulate events: M = "interested in music", S = "interested in sport".

Information given: P(M) = 0.22, P(S) = 0.34.

(a) What is P(M̄)?

P(M̄) = 1 − P(M) = 1 − 0.22 = 0.78.

(b) What is P(M ∪ S)?

We can not calculate P(M ∪ S) from the information given.

(c) Given the further information that 48% of the students are interested in neither music nor sport, find P(M ∪ S) and P(M ∩ S).

Information given: P(M̄ ∩ S̄) = 0.48. By De Morgan, M̄ ∩ S̄ is the complement of M ∪ S. Thus

P(M ∪ S) = 1 − P(M̄ ∩ S̄) = 1 − 0.48 = 0.52.

This is the probability that a student is interested in music, or sport, or both. Then

P(M ∩ S) = P(M) + P(S) − P(M ∪ S)   (Section 1.6)
         = 0.22 + 0.34 − 0.52
         = 0.04.

Only 4% of students are interested in both music and sport.

22

(d) Find the probability that a student is interested in music, but not sport.

P(M ∩ S̄) = P(M) − P(M ∩ S)   (Partition Theorem)
          = 0.22 − 0.04
          = 0.18.
(Partition Theorem)

1.9 Formal probability proofs: non-examinable

If you are a mathematician, you will be interested to see how properties of probability are proved formally. Only the Axioms, together with standard set-theoretic results, may be used.

Theorem: The probability measure P has the following properties.

(i) P(∅) = 0.

(ii) P(Ā) = 1 − P(A) for any event A.

(iii) (Partition Theorem.) If B1, B2, . . . , Bm form a partition of Ω, then for any event A,

P(A) = Σ_{i=1}^m P(A ∩ Bi).

(iv) P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any events A, B.

Proof:

(i) For any A, we have A = A ∪ ∅, and A ∩ ∅ = ∅ (mutually exclusive). So

P(A) = P(A ∪ ∅) = P(A) + P(∅)   (Axiom 3)

⟹ P(∅) = 0.

(ii) Ω = A ∪ Ā, and A ∩ Ā = ∅ (mutually exclusive). So

1 = P(Ω) = P(A ∪ Ā) = P(A) + P(Ā)   (Axiom 1 and Axiom 3)

⟹ P(Ā) = 1 − P(A).

23

(iii) Suppose B1, . . . , Bm are a partition of Ω: then Bi ∩ Bj = ∅ if i ≠ j, and ⋃_{i=1}^m Bi = Ω. Thus,

(A ∩ Bi) ∩ (A ∩ Bj) = A ∩ (Bi ∩ Bj) = A ∩ ∅ = ∅ for i ≠ j,

i.e. (A ∩ B1), . . . , (A ∩ Bm) are mutually exclusive also. So,

Σ_{i=1}^m P(A ∩ Bi) = P(⋃_{i=1}^m (A ∩ Bi))   (Axiom 3)
                    = P(A ∩ ⋃_{i=1}^m Bi)     (Distributive laws)
                    = P(A ∩ Ω)
                    = P(A).

(iv)

A ∪ B = (A ∩ Ω) ∪ (B ∩ Ω)                               (Set theory)
      = (A ∩ (B ∪ B̄)) ∪ (B ∩ (A ∪ Ā))                   (Set theory)
      = (A ∩ B) ∪ (A ∩ B̄) ∪ (B ∩ A) ∪ (B ∩ Ā)           (Distributive laws)
      = (A ∩ B) ∪ (A ∩ B̄) ∪ (Ā ∩ B).

These 3 events are mutually exclusive: e.g. (A ∩ B) ∩ (A ∩ B̄) = A ∩ (B ∩ B̄) = A ∩ ∅ = ∅, etc. So,

P(A ∪ B) = P(A ∩ B) + P(A ∩ B̄) + P(Ā ∩ B)              (Axiom 3)
         = P(A ∩ B) + [P(A) − P(A ∩ B)] + [P(B) − P(A ∩ B)]
                          (from (iii), using the partitions B, B̄ and A, Ā)
         = P(A) + P(B) − P(A ∩ B).

24

1.10 Conditional Probability

Conditioning is another of the fundamental tools of probability: probably the most fundamental tool. It is especially helpful for calculating the probabilities of intersections, such as P(A ∩ B), which themselves are critical for the useful Partition Theorem. Additionally, the whole field of stochastic processes (Stats 320 and 325) is based on the idea of conditional probability. What happens next in a process depends, or is conditional, on what has happened beforehand.

Dependent events

Suppose A and B are two events on the same sample space. There will often be dependence between A and B. This means that if we know that B has occurred, it changes our knowledge of the chance that A will occur.

Example: Toss a die once. Let event A = "get a 6". Let event B = "get 4 or better".

If the die is fair, then P(A) = 1/6 and P(B) = 1/2.

However, if we know that B has occurred, then there is an increased chance that A has occurred, because the result must now be 4, 5, or 6:

P(A occurs given that B has occurred) = 1/3.

We write P(A given B) = P(A | B) = 1/3.

Question: what would be P(B | A)?

P(B | A) = P(B occurs, given that A has occurred)
         = P(get 4 or better, given that we know we got a 6)
         = 1.
25

Conditioning as reducing the sample space

Sally wants to use Facebook to find a boyfriend at Uni. Her friend Kate tells her not to bother, because there are more women than men on Facebook. Here are the 2012 figures for Facebook users at the University of Auckland:

Relationship status    Male    Female    Total
Single                  700       560     1260
In a relationship       460       660     1120
Total                  1160      1220     2380

Before we go any further . . . do you agree with Kate?

No, because out of the SINGLE people on Facebook, there are a lot more men than women!

Conditioning is all about the sample space of interest. The table above shows the following sample space: Ω = {Facebook users at UoA}. But the sample space that should interest Sally is different: it is S = {members of Ω who are SINGLE}. Suppose we pick a person from those in the table. Define event M to be: M = {person is male}. Kate is referring to the following probability:

P(M) = (# Ms) / (total # in table) = 1160 / 2380 = 0.49.

Kate is correct that there are more women than men on Facebook, but she is using the wrong sample space so her answer is not relevant.

26

Now suppose we reduce our sample space from Ω = {everyone in the table} to S = {single people in the table}. Then

P(person is male, given that the person is single)
    = (# single males) / (# singles)
    = (# who are M and S) / (# who are S)
    = 700 / 1260
    = 0.56.

We write: P(M | S) = 0.56.

This is the probability that Sally is interested in, and she can rest assured that there are more single men than single women out there.

Example: Define event R that a person is in a relationship. What is the proportion of males among people in a relationship, P(M | R)?

P(M | R) = (# males in a relationship) / (# in a relationship)
         = (# who are M and R) / (# who are R)
         = 460 / 1120
         = 0.41.

27

We could follow the same working for any pair of events, A and B:

P(A | B) = (# who are A and B) / (# who are B)
         = [(# who are A and B) / (# in Ω)] / [(# who are B) / (# in Ω)]
         = P(A ∩ B) / P(B).

This is our definition of conditional probability:

Definition: Let A and B be two events. The conditional probability that event A occurs, given that event B has occurred, is written P(A | B), and is given by

P(A | B) = P(A ∩ B) / P(B).

Read P(A | B) as "probability of A, given B".

Note: P(A | B) gives P(A and B, from within the set of Bs only). P(A ∩ B) gives P(A and B, from the whole sample space Ω).

Note: Follow the reasoning above carefully. It is important to understand why the conditional probability is the probability of the intersection within the new sample space.

Conditioning on event B means changing the sample space to B.

Think of P(A | B) as the chance of getting an A, from the set of Bs only.
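The Facebook counts make this concrete: P(A | B) computed as a ratio of counts equals the ratio of probabilities P(A ∩ B)/P(B). A short Python check using the numbers from the table above:

```python
import math

total = 2380          # everyone in the table
n_M = 1160            # males
n_S = 1260            # singles
n_M_and_S = 700       # single males
n_R = 1120            # people in a relationship
n_M_and_R = 460       # males in a relationship

# Conditional probability as a ratio of counts ...
p_M_given_S = n_M_and_S / n_S
# ... equals the ratio of probabilities P(M and S) / P(S),
# because the (# in total) factors cancel.
assert math.isclose(p_M_given_S, (n_M_and_S / total) / (n_S / total))

print(round(n_M / total, 2))      # P(M)     = 0.49
print(round(p_M_given_S, 2))      # P(M | S) = 0.56
print(round(n_M_and_R / n_R, 2))  # P(M | R) = 0.41
```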

28

Note: In the Facebook example, we found that P(M | S) = 0.56, and P(M | R) = 0.41. This means that a single person on UoA Facebook is more likely to be male than female, but a person in a relationship is more likely to be female than male! Why the difference? Your guess is as good as mine, but I think it's because men in a relationship are too busy buying flowers for their girlfriends to have time to spend on Facebook.

The symbol P belongs to the sample space Ω

Recall the first of our probability axioms on page 16: P(Ω) = 1. This shows that the symbol P is defined with respect to Ω. That is, P BELONGS to the sample space Ω. If we change the sample space, we need to change the symbol P. This is what we do in conditional probability: to change the sample space from Ω to B, say, we change from the symbol P to the symbol P( · | B).

The symbol P( · | B) should behave exactly like the symbol P. For example:

P(C ∪ D) = P(C) + P(D) − P(C ∩ D),
so
P(C ∪ D | B) = P(C | B) + P(D | B) − P(C ∩ D | B).

Trick for checking conditional probability calculations: A useful trick for checking a conditional probability expression is to replace the conditioned set by Ω, and see whether the expression is still true. For example, is P(A | B) + P(Ā | B) = 1? Answer: Replace B by Ω: this gives

P(A | Ω) + P(Ā | Ω) = P(A) + P(Ā) = 1.

So, yes, P(A | B) + P(Ā | B) = 1 for any other sample space B.

29

Is P(A | B) + P(Ā | B̄) = 1?

Try to replace the conditioning set by Ω: we can't! There are two conditioning sets: B and B̄. The expression is NOT true. It doesn't make sense to try to add together probabilities from two different sample spaces.

The Multiplication Rule

For any events A and B,

P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

Proof: Immediate from the definitions:

P(A | B) = P(A ∩ B) / P(B)   ⟹   P(A ∩ B) = P(A | B)P(B),

and

P(B | A) = P(B ∩ A) / P(A)   ⟹   P(B ∩ A) = P(A ∩ B) = P(B | A)P(A).
New statement of the Partition Theorem

The Multiplication Rule gives us a new statement of the Partition Theorem: If B1, . . . , Bm partition Ω, then for any event A,

P(A) = Σ_{i=1}^m P(A ∩ Bi) = Σ_{i=1}^m P(A | Bi)P(Bi).

Both formulations of the Partition Theorem are very widely used, but especially the conditional formulation Σ_{i=1}^m P(A | Bi)P(Bi).

Warning: Be sure to use this new version of the Partition Theorem correctly:

it is

P(A) = P(A | B1)P(B1) + . . . + P(A | Bm )P(Bm),

NOT P(A) = P(A | B1) + . . . + P(A | Bm).


Conditional probability and Peter Pan

When Peter Pan was hungry but had nothing to eat, he would pretend to eat. (An excellent strategy, I have always found.) Conditional probability is the Peter Pan of Stats 210. When you don't know something that you need to know, pretend you know it. Conditioning on an event is like pretending that you know that the event has happened.

For example, if you know the probability of getting to work on time in different weather conditions, but you don't know what the weather will be like today, pretend you do and add up the different possibilities:

P(work on time) = P(work on time | fine) × P(fine) + P(work on time | wet) × P(wet).

1.11 Examples of conditional probability and partitions

Tom gets the bus to campus every day. The bus is on time with probability 0.6, and late with probability 0.4. The sample space can be written as Ω = {bus journeys}. We can formulate events as follows: T = "on time"; L = "late". From the information given, the events have probabilities:

P(T) = 0.6;  P(L) = 0.4.

(a) Do the events T and L form a partition of the sample space Ω? Explain why or why not.

Yes: they cover all possible journeys (probabilities sum to 1), and there is no overlap in the events by definition.

The buses are sometimes crowded and sometimes noisy, both of which are problems for Tom as he likes to use the bus journeys to do his Stats assignments. When the bus is on time, it is crowded with probability 0.5. When it is late, it is crowded with probability 0.7. The bus is noisy with probability 0.8 when it is crowded, and with probability 0.4 when it is not crowded. (b) Formulate events C and N corresponding to the bus being crowded and noisy. Do the events C and N form a partition of the sample space? Explain why or why not.

Let C = "crowded", N = "noisy". C and N do NOT form a partition of Ω. It is possible for the bus to be noisy when it is crowded, so there must be some overlap between C and N.

(c) Write down probability statements corresponding to the information given above. Your answer should involve two statements linking C with T and L, and two statements linking N with C. P(C | T ) = 0.5; P(C | L) = 0.7.

P(N | C) = 0.8;  P(N | C̄) = 0.4.

(d) Find the probability that the bus is crowded.

P(C) = P(C | T)P(T) + P(C | L)P(L) = 0.5 × 0.6 + 0.7 × 0.4 = 0.58.

(Partition Theorem)

(e) Find the probability that the bus is noisy.

P(N) = P(N | C)P(C) + P(N | C̄)P(C̄) = 0.8 × 0.58 + 0.4 × (1 − 0.58) = 0.632.

(Partition Theorem)
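Parts (d) and (e) can be reproduced numerically. The following Python sketch is not part of the original notes; the variable names are illustrative, and the probabilities are those given in the example.

```python
# Partition Theorem check for the bus example.
p_T = 0.6            # P(T): bus on time
p_L = 0.4            # P(L): bus late
p_C_given_T = 0.5    # P(C | T): crowded, given on time
p_C_given_L = 0.7    # P(C | L): crowded, given late
p_N_given_C = 0.8    # P(N | C): noisy, given crowded
p_N_given_notC = 0.4 # P(N | C-bar): noisy, given not crowded

# (d) P(C) = P(C|T)P(T) + P(C|L)P(L), since T and L partition the sample space.
p_C = p_C_given_T * p_T + p_C_given_L * p_L

# (e) P(N) = P(N|C)P(C) + P(N|C-bar)P(C-bar), since C and C-bar partition it too.
p_N = p_N_given_C * p_C + p_N_given_notC * (1 - p_C)

print(round(p_C, 4))  # 0.58
print(round(p_N, 4))  # 0.632
```

Note how the conditioning event in part (e) is C rather than T or L: we use whichever partition matches the conditional probabilities we actually know.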


1.12 Bayes Theorem: inverting conditional probabilities

Consider P(B ∩ A) = P(A ∩ B). Apply the multiplication rule to each side:

P(B | A)P(A) = P(A | B)P(B).

Thus

P(B | A) = P(A | B)P(B) / P(A).    (⋆)

This is the simplest form of Bayes' Theorem, named after Thomas Bayes (1702–61), English clergyman and founder of Bayesian Statistics. Bayes' Theorem allows us to invert the conditioning, i.e. to express P(B | A) in terms of P(A | B). This is very useful. For example, it might be easy to calculate P(later event | earlier event), but we might only observe the later event and wish to deduce the probability that the earlier event occurred: P(earlier event | later event).

Full statement of Bayes' Theorem:

Theorem 1.12: Let B1, B2, . . . , Bm form a partition of Ω. Then for any event A, and for any j = 1, . . . , m,

P(Bj | A) = P(A | Bj)P(Bj) / Σ_{i=1}^m P(A | Bi)P(Bi).    (Bayes' Theorem)

Proof:

Immediate from (⋆) (put B = Bj), and the Partition Rule, which gives P(A) = Σ_{i=1}^m P(A | Bi)P(Bi).


Special case of Bayes' Theorem when m = 2: use B and B̄ as the partition of Ω. Then

P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | B̄)P(B̄)].

Example: The case of the Perfidious Gardener. Mr Smith owns a hysterical rosebush. It will die with probability 1/2 if watered, and with probability 3/4 if not watered. Worse still, Smith employs a perfidious gardener who will fail to water the rosebush with probability 2/3. Smith returns from holiday to find the rosebush . . . DEAD!!! What is the probability that the gardener did not water it?

Solution: First step: formulate events.

Let: D = "rosebush dies";
W = "gardener waters rosebush";
W̄ = "gardener fails to water rosebush".

Second step: write down all information given.

P(D | W) = 1/2;  P(D | W̄) = 3/4;  P(W̄) = 2/3  (so P(W) = 1/3).

Third step: write down what we're looking for:

P(W̄ | D).

Fourth step: compare this to what we know. Need to invert the conditioning, so use Bayes' Theorem:

P(W̄ | D) = P(D | W̄)P(W̄) / [P(D | W̄)P(W̄) + P(D | W)P(W)]
= (3/4 × 2/3) / (3/4 × 2/3 + 1/2 × 1/3)
= 3/4.

So the gardener failed to water the rosebush with probability 3/4.
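The four-step calculation can be checked with exact arithmetic. This Python sketch is an addition to the notes; it uses the standard-library `Fraction` type so no rounding occurs.

```python
from fractions import Fraction

# Bayes' Theorem for the Perfidious Gardener, using exact fractions.
p_W_bar = Fraction(2, 3)          # P(W-bar): gardener fails to water
p_W = 1 - p_W_bar                 # P(W) = 1/3
p_D_given_W = Fraction(1, 2)      # P(D | W): dies if watered
p_D_given_W_bar = Fraction(3, 4)  # P(D | W-bar): dies if not watered

# P(W-bar | D) = P(D | W-bar)P(W-bar) / [P(D | W-bar)P(W-bar) + P(D | W)P(W)]
numerator = p_D_given_W_bar * p_W_bar
p_D = numerator + p_D_given_W * p_W   # Partition Theorem gives the denominator
p_W_bar_given_D = numerator / p_D

print(p_W_bar_given_D)  # 3/4
```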

Example: The case of the Defective Ketchup Bottle. Ketchup bottles are produced in 3 different factories, accounting for 50%, 30%, and 20% of the total output respectively. The percentage of bottles from the 3 factories that are defective is respectively 0.4%, 0.6%, and 1.2%. A statistics lecturer who eats only ketchup finds a defective bottle in her lunchbox. What is the probability that it came from Factory 1?

Solution:

1. Events: let Fi = "bottle comes from Factory i" (i = 1, 2, 3); let D = "bottle is defective".

2. Information given:
P(F1) = 0.5;  P(F2) = 0.3;  P(F3) = 0.2;
P(D | F1) = 0.004;  P(D | F2) = 0.006;  P(D | F3) = 0.012.

3. Looking for: P(F1 | D) (so need to invert the conditioning).

4. Bayes' Theorem:

P(F1 | D) = P(D | F1)P(F1) / [P(D | F1)P(F1) + P(D | F2)P(F2) + P(D | F3)P(F3)]
= (0.004 × 0.5) / (0.004 × 0.5 + 0.006 × 0.3 + 0.012 × 0.2)
= 0.002 / 0.0062
= 0.322.
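For a partition with more than two sets, the same pattern generalizes directly. A minimal Python sketch of this example (not part of the original notes; the dictionary layout is just one convenient encoding):

```python
# Bayes' Theorem with a three-set partition: the Defective Ketchup Bottle.
priors = {1: 0.5, 2: 0.3, 3: 0.2}              # P(F_i): share of output per factory
defect_rates = {1: 0.004, 2: 0.006, 3: 0.012}  # P(D | F_i)

# Denominator: P(D), by the Partition Theorem.
p_D = sum(defect_rates[i] * priors[i] for i in priors)

# P(F_1 | D) = P(D | F_1) P(F_1) / P(D)
p_F1_given_D = defect_rates[1] * priors[1] / p_D

print(round(p_D, 4))           # 0.0062
print(round(p_F1_given_D, 4))  # 0.3226
```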


1.13 Statistical Independence

Two events A and B are statistically independent if the occurrence of one does not affect the occurrence of the other.

This means P(A | B) = P(A) and P(B | A) = P(B).

Now P(A | B) = P(A ∩ B) / P(B),

so if P(A | B) = P(A) then P(A ∩ B) = P(A) × P(B).

We use this as our definition of statistical independence.

Definition: Events A and B are statistically independent if P(A ∩ B) = P(A)P(B).

For more than two events, we say:

Definition: Events A1, A2, . . . , An are mutually independent if P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2) . . . P(An), AND the same multiplication rule holds for every subcollection of the events too.

Eg. events A1, A2, A3, A4 are mutually independent if

i) P(Ai ∩ Aj) = P(Ai)P(Aj) for all i, j with i ≠ j; AND

ii) P(Ai ∩ Aj ∩ Ak) = P(Ai)P(Aj)P(Ak) for all i, j, k that are all different; AND

iii) P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1)P(A2)P(A3)P(A4).

Note: If events are physically independent, then they will also be statistically independent.


Statistical independence for calculating the probability of an intersection

In Section 1.6 we said that it is often hard to calculate P(A ∩ B). We usually have two choices.

1. IF A and B are statistically independent, then P(A ∩ B) = P(A) × P(B).

2. If A and B are not known to be statistically independent, we usually have to use conditional probability and the multiplication rule:

P(A ∩ B) = P(A | B)P(B). This still requires us to be able to calculate P(A | B).

Example: Toss a fair coin and a fair die together. The coin and die are physically independent.

Sample space: Ω = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6} — all 12 items are equally likely.

Let A = "heads" and B = "six".

Then P(A) = P({H1, H2, H3, H4, H5, H6}) = 6/12 = 1/2, and P(B) = P({H6, T6}) = 2/12 = 1/6.

Now P(A ∩ B) = P(Heads and 6) = P({H6}) = 1/12.

But P(A) × P(B) = 1/2 × 1/6 = 1/12 also.

So P(A ∩ B) = P(A)P(B) and thus A and B are statistically independent.
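Because the 12 outcomes are equally likely, this check can be done by brute-force enumeration. A short Python sketch (an addition to the notes, with illustrative names):

```python
from itertools import product

# Enumerate the 12 equally likely outcomes of the coin-and-die experiment.
omega = list(product("HT", [1, 2, 3, 4, 5, 6]))

A = [s for s in omega if s[0] == "H"]         # "heads"
B = [s for s in omega if s[1] == 6]           # "six"
AB = [s for s in omega if s in A and s in B]  # "heads AND six"

def p(event):
    # Equally likely outcomes: probability = (# outcomes in event) / (# in omega)
    return len(event) / len(omega)

print(p(A), p(B))            # 0.5 0.1666...
print(p(AB) == p(A) * p(B))  # True: A and B are statistically independent
```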


Pairwise independence does not imply mutual independence

Example: A jar contains 4 balls: one red, one white, one blue, and one red, white & blue. Draw one ball at random. Let A = "ball has red on it", B = "ball has white on it", C = "ball has blue on it".

Two balls satisfy A, so P(A) = 2/4 = 1/2. Likewise, P(B) = P(C) = 1/2.

Pairwise independent: Consider P(A ∩ B) = 1/4 (one of the 4 balls has both red and white on it).

But, P(A) × P(B) = 1/2 × 1/2 = 1/4, so P(A ∩ B) = P(A)P(B).

Likewise, P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C). So A, B and C are pairwise independent.

Mutually independent? Consider P(A ∩ B ∩ C) = 1/4 (one of the 4 balls), while

P(A)P(B)P(C) = 1/2 × 1/2 × 1/2 = 1/8 ≠ P(A ∩ B ∩ C).

So A, B and C are NOT mutually independent, despite being pairwise independent.
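The jar example can be verified exhaustively. The sketch below (not part of the original notes) represents each ball by its set of colours and checks every pairwise product, then the triple product:

```python
from fractions import Fraction
from itertools import combinations

# The four balls, described by the set of colours painted on each.
balls = [{"red"}, {"white"}, {"blue"}, {"red", "white", "blue"}]

def p(*colours):
    """Probability that a randomly drawn ball carries ALL the given colours."""
    hits = sum(1 for ball in balls if set(colours) <= ball)
    return Fraction(hits, len(balls))

# Pairwise independence: P(X and Y) = P(X)P(Y) for each pair of colours.
for x, y in combinations(["red", "white", "blue"], 2):
    assert p(x, y) == p(x) * p(y)

# But NOT mutual independence: P(A and B and C) differs from P(A)P(B)P(C).
print(p("red", "white", "blue"))          # 1/4
print(p("red") * p("white") * p("blue"))  # 1/8
```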

1.14 Random Variables

We have one more job to do in laying the foundations of our science of randomness. So far we have come up with the following ideas:

1. Things that happen are sets, also called events.

2. We measure chance by measuring sets, using a measure called probability.

Finally, what are the sets that we are measuring? It is a nuisance to have lots of different sample spaces:

Ω = {head, tail}; Ω = {same, different}; Ω = {Springboks, All Blacks}.

All of these sample spaces could be represented more concisely in terms of numbers: Ω = {0, 1}.

On the other hand, there are many random experiments that genuinely produce random numbers as their outcomes.

For example: the number of girls in a three-child family; the number of heads from 10 tosses of a coin; and so on.

When the outcome of a random experiment is a number, it enables us to quantify many new things of interest:

1. quantify the average value (e.g. the average number of heads we would get if we made 10 coin-tosses again and again);

2. quantify how much the outcomes tend to diverge from the average value;

3. quantify relationships between different random quantities (e.g. is the number of girls related to the hormone levels of the fathers?)

The list is endless. To give us a framework in which these investigations can take place, we give a special name to random experiments that produce numbers as their outcomes.

A random experiment whose possible outcomes are real numbers is called a random variable.

In fact, any random experiment can be made to have outcomes that are real numbers, simply by mapping the sample space Ω to a set of real numbers using a function.

For example: let the function X : Ω → R be given by X(Springboks win) = 0; X(All Blacks win) = 1.

This gives us our formal definition of a random variable:

Definition: A random variable (r.v.) is a function from a sample space Ω to the real numbers R. We write X : Ω → R.


Although this is the formal definition, the intuitive definition of a random variable is probably more useful. Intuitively, remember that a random variable equates to a random experiment whose outcomes are numbers:

A random variable produces random real numbers as the outcome of a random experiment.

Defining random variables serves the dual purposes of:

1. Describing many different sample spaces in the same terms: e.g. Ω = {0, 1} with P(1) = p and P(0) = 1 − p describes EVERY possible experiment with two outcomes.

2. Giving a name to a large class of random experiments that genuinely produce random numbers, and for which we want to develop general rules for finding averages, variances, relationships, and so on.

Example: Toss a coin 3 times. The sample space is

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

One example of a random variable is X : Ω → R such that, for sample point si, we have X(si) = # heads in outcome si. So X(HHH) = 3, X(THT) = 1, etc.

Another example is Y : Ω → R such that Y(si) = 1 if the 2nd toss is a head, 0 otherwise.

Then Y(HTH) = 0, Y(THH) = 1, Y(HHH) = 1, etc.


Probabilities for random variables

By convention, we use CAPITAL LETTERS for random variables (e.g. X), and lower case letters to represent the values that the random variable takes (e.g. x). For a sample space Ω and random variable X : Ω → R, and for a real number x,

P(X = x) = P(outcome s is such that X(s) = x) = P({s : X(s) = x}).


Example: toss a fair coin 3 times. All outcomes are equally likely: P(HHH) = P(HHT) = . . . = P(TTT) = 1/8.

Let X : Ω → R, such that X(s) = # heads in s. Then

P(X = 0) = P({TTT}) = 1/8.
P(X = 1) = P({HTT, THT, TTH}) = 3/8.
P(X = 2) = P({HHT, HTH, THH}) = 3/8.
P(X = 3) = P({HHH}) = 1/8.

Note that P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1.
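These probabilities can be built directly from the sample space by applying the function X to every outcome. A Python sketch (an addition to the notes, with illustrative names):

```python
from fractions import Fraction
from itertools import product

# Probabilities for X = # heads in 3 fair coin tosses, built from Omega.
omega = ["".join(s) for s in product("HT", repeat=3)]  # 8 equally likely outcomes

f_X = {}
for s in omega:
    x = s.count("H")                     # X(s) = number of heads in outcome s
    f_X[x] = f_X.get(x, 0) + Fraction(1, 8)

for x in sorted(f_X):
    print(x, f_X[x])      # 0 1/8, 1 3/8, 2 3/8, 3 1/8

print(sum(f_X.values()))  # 1
```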


Independent random variables

Random variables X and Y are independent if each does not affect the other.

Recall that two events A and B are independent if P(A ∩ B) = P(A)P(B). Similarly, random variables X and Y are defined to be independent if

P({X = x} ∩ {Y = y}) = P(X = x)P(Y = y)

for all possible values x and y.

We usually replace the cumbersome notation P({X = x} ∩ {Y = y}) by the simpler notation P(X = x, Y = y). From now on, we will use the following notations interchangeably:

P({X = x} ∩ {Y = y}) = P(X = x AND Y = y) = P(X = x, Y = y).

Thus X and Y are independent if and only if P(X = x, Y = y) = P(X = x)P(Y = y) for ALL possible values x, y.


1.15 Key Probability Results for Chapter 1


1. If A and B are mutually exclusive (i.e. A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

2. Conditional probability:

P(A | B) = P(A ∩ B) / P(B)   for any A, B.

Or:

P(A ∩ B) = P(A | B)P(B).

3. For any A, B, we can write

P(A | B) = P(B | A)P(A) / P(B).

This is a simplified version of Bayes' Theorem. It shows how to invert the conditioning, i.e. how to find P(A | B) when you know P(B | A).

4. Bayes' Theorem slightly more generalized: for any A, B,

P(A | B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | Ā)P(Ā)].

This works because A and Ā form a partition of the sample space.

5. Complete version of Bayes' Theorem: If sets A1, . . . , Am form a partition of the sample space, i.e. they do not overlap (mutually exclusive) and collectively cover all possible outcomes (their union is the sample space), then

P(Aj | B) = P(B | Aj)P(Aj) / [P(B | A1)P(A1) + . . . + P(B | Am)P(Am)]
= P(B | Aj)P(Aj) / Σ_{i=1}^m P(B | Ai)P(Ai).

6. Partition Theorem: if A1, . . . , Am form a partition of the sample space, then

P(B) = P(B ∩ A1) + P(B ∩ A2) + . . . + P(B ∩ Am).

This can also be written as:

P(B) = P(B | A1)P(A1) + P(B | A2)P(A2) + . . . + P(B | Am)P(Am).

These are both very useful formulations.

7. Chains of events: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1).

8. Statistical independence: if A and B are independent, then

P(A ∩ B) = P(A) × P(B), and P(A | B) = P(A), and P(B | A) = P(B).

9. Conditional probability: If P(B) > 0, then we can treat P( · | B) just like P: e.g.

if A1 and A2 are mutually exclusive, then P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B) (compare with P(A1 ∪ A2) = P(A1) + P(A2));

if A1, . . . , Am partition the sample space, then P(A1 | B) + P(A2 | B) + . . . + P(Am | B) = 1;

and P(Ā | B) = 1 − P(A | B) for any A.

(Note: it is not generally true that P(A | B̄) = 1 − P(A | B).)

The fact that P( · | B) is a valid probability measure is easily verified by checking that it satisfies Axioms 1, 2, and 3.

10. Unions: For any A, B, C,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B);

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

The second expression is obtained by writing P(A ∪ B ∪ C) = P(A ∪ (B ∪ C)) and applying the first expression to A and (B ∪ C), then applying it again to expand P(B ∪ C).


1.16 Chains of events and probability trees: non-examinable

The multiplication rule is very helpful for calculating probabilities when events happen in sequence.
Example: Two balls are drawn at random without replacement from a box containing 4 white and 2 red balls. Find the probability that: (a) they are both white, (b) the second ball is red.

Solution

Let event Wi = "ith ball is white" and Ri = "ith ball is red".

a) P(W1 ∩ W2) = P(W2 ∩ W1) = P(W2 | W1)P(W1).

Now P(W1) = 4/6 and P(W2 | W1) = 3/5.

So P(both white) = P(W1 ∩ W2) = (3/5) × (4/6) = 2/5.

b) Looking for P(2nd ball is red). To find this, we have to condition on what happened in the first draw.

Event "2nd ball is red" is actually event {W1R2, R1R2} = (W1 ∩ R2) ∪ (R1 ∩ R2). So

P(2nd ball is red) = P(W1 ∩ R2) + P(R1 ∩ R2)   (mutually exclusive)
= P(R2 | W1)P(W1) + P(R2 | R1)P(R1)
= (2/5) × (4/6) + (1/5) × (2/6)
= 1/3.

Probability trees

Probability trees are a graphical way of representing the multiplication rule.

[Probability tree: the first draw branches with P(W1) = 4/6 and P(R1) = 2/6; from W1, the second draw branches with P(W2 | W1) = 3/5 and P(R2 | W1) = 2/5; from R1, it branches with P(W2 | R1) = 4/5 and P(R2 | R1) = 1/5.]

Write conditional probabilities on the branches, and multiply to get the probability of an intersection: eg.

P(W1 ∩ W2) = (4/6) × (3/5), or P(R1 ∩ W2) = (2/6) × (4/5).

More than two events

To find P(A1 ∩ A2 ∩ A3) we can apply the multiplication rule successively:

P(A1 ∩ A2 ∩ A3) = P(A3 ∩ (A1 ∩ A2))
= P(A3 | A1 ∩ A2)P(A1 ∩ A2)   (multiplication rule)
= P(A3 | A1 ∩ A2)P(A2 | A1)P(A1)   (multiplication rule)

Remember as:

P(A1 ∩ A2 ∩ A3) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1).


On the probability tree, this corresponds to multiplying P(A1), P(A2 | A1), and P(A3 | A2 ∩ A1) along the branches to obtain P(A1 ∩ A2 ∩ A3).

In general, for n events A1, A2, . . . , An, we have

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2 | A1)P(A3 | A2 ∩ A1) . . . P(An | An−1 ∩ . . . ∩ A1).

Example: A box contains w white balls and r red balls. Draw 3 balls without replacement. What is the probability of getting the sequence white, red, white?

Answer:

P(W1 ∩ R2 ∩ W3) = P(W1)P(R2 | W1)P(W3 | R2 ∩ W1)
= [w/(w + r)] × [r/(w + r − 1)] × [(w − 1)/(w + r − 2)].


1.17 Equally likely outcomes and combinatorics: non-examinable

Sometimes, all the outcomes in a discrete finite sample space are equally likely. This makes it easy to calculate probabilities. If:

i) Ω = {s1, . . . , sk};

ii) each outcome si is equally likely, so p1 = p2 = . . . = pk = 1/k;

iii) event A = {s1, s2, . . . , sr} contains r possible outcomes,

then

P(A) = (# outcomes in A) / (# outcomes in Ω) = r/k.

Example: For a 3-child family, possible outcomes from oldest to youngest are:

Ω = {GGG, GGB, GBG, GBB, BGG, BGB, BBG, BBB} = {s1, s2, s3, s4, s5, s6, s7, s8}.

Let {p1, p2, . . . , p8} be a probability distribution on Ω. If every baby is equally likely to be a boy or a girl, then all of the 8 outcomes in Ω are equally likely, so p1 = p2 = . . . = p8 = 1/8.

Let event A be A = "oldest child is a girl". Then A = {GGG, GGB, GBG, GBB}. Event A contains 4 of the 8 equally likely outcomes, so event A occurs with probability P(A) = 4/8 = 1/2.

Counting equally likely outcomes

To count the number of equally likely outcomes in an event, we often need to use permutations or combinations. These give the number of ways of choosing r objects from n distinct objects. For example, if we wish to select 3 objects from n = 5 objects (a, b, c, d, e), we have choices abc, abd, abe, acd, ace, . . . .


1. Number of Permutations, nPr

The number of permutations, nPr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute different choices. That is, choice (a, b, c) counts separately from choice (b, a, c). Then

# permutations = nPr = n(n − 1)(n − 2) . . . (n − r + 1) = n! / (n − r)!

(n choices for the first object, (n − 1) choices for the second, etc.)

2. Number of Combinations, nCr = (n choose r)

The number of combinations, nCr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute the same choice. That is, choice (a, b, c) and choice (b, a, c) are the same. Then

# combinations = nCr = (n choose r) = nPr / r! = n! / ((n − r)! r!)

(because nPr counts each permutation r! times, and we only want to count it once: so divide nPr by r!)

Use the same rule on the numerator and the denominator

When P(A) = (# outcomes in A) / (# outcomes in Ω), we can often think about the problem either with different orderings constituting different choices, or with different orderings constituting the same choice. The critical thing is to use the same rule for both numerator and denominator.


Example: (a) Tom has five elderly great-aunts who live together in a tiny bungalow. They insist on each receiving separate Christmas cards, and threaten to disinherit Tom if he sends two of them the same picture. Tom has Christmas cards with 12 different designs. In how many different ways can he select 5 different designs from the 12 designs available?

Order of cards is not important, so use combinations. The number of ways of selecting 5 distinct designs from 12 is

12C5 = (12 choose 5) = 12! / ((12 − 5)! 5!) = 792.
b) The next year, Tom buys a pack of 40 Christmas cards, featuring 10 dierent pictures with 4 cards of each picture. He selects 5 cards at random to send to his great-aunts. What is the probability that at least two of the great-aunts receive the same picture?

Looking for P(at least 2 cards the same) = P(A) (say). Easiest to nd P(all 5 cards are dierent) = P(A). Number of outcomes in A is (# ways of selecting 5 dierent designs) = 40 36 32 28 24 . (40 choices for rst card; 36 for second, because the 4 cards with the rst design are excluded; etc. Note that order matters: e.g. we are counting choice 12345 separately from 23154.) Total number of outcomes is (total # ways of selecting 5 cards from 40) = 40 39 38 37 36 . (Note: order mattered above, so we need order to matter here too.) So
P(A) = 40 36 32 28 24 = 0.392. 40 39 38 37 36

Thus
P(A) = P(at least 2 cards are the same design) = 1 P(A) = 1 0.392 = 0.608.


Alternative solution if order does not matter on numerator and denominator (much harder method):

P(Ā) = [10C5 × 4^5] / 40C5.

This works because there are 10C5 ways of choosing 5 different designs from 10, and there are 4 choices of card within each of the 5 chosen groups. So the total number of ways of choosing 5 cards of different designs is 10C5 × 4^5. The total number of ways of choosing 5 cards from 40 is 40C5.

Exercise: Check that this gives the same answer for P(Ā) as before.

Note: Problems like these belong to the branch of mathematics called Combinatorics: the science of counting.
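Both counting methods can be checked numerically with the standard library's `comb` and `perm` functions. This Python sketch is an addition to the notes; it verifies that the ordered and unordered counts give the same probability, illustrating the "same rule on numerator and denominator" principle.

```python
from math import comb, perm

# Part (a): number of ways to choose 5 distinct designs from 12.
print(comb(12, 5))  # 792

# Part (b), order matters: 40*36*32*28*24 favourable sequences out of 40P5.
p_all_diff_ordered = (40 * 36 * 32 * 28 * 24) / perm(40, 5)

# Part (b), order doesn't matter: choose 5 of 10 designs, then 1 of 4 cards each.
p_all_diff_unordered = comb(10, 5) * 4**5 / comb(40, 5)

print(round(p_all_diff_ordered, 3))      # 0.392
print(round(p_all_diff_unordered, 3))    # 0.392  (same answer, as the Exercise claims)
print(round(1 - p_all_diff_ordered, 3))  # 0.608 = P(at least two the same)
```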

Chapter 2: Discrete Probability Distributions


2.1 Introduction

In the next two chapters we meet several important concepts:

1. Probability distributions, and the probability function fX(x): the probability function of a random variable lists the values the random variable can take, and their probabilities.

2. Hypothesis testing: I toss a coin ten times and get nine heads. How unlikely is that? Can we continue to believe that the coin is fair when it produces nine heads out of ten tosses?

3. Likelihood and estimation: what if we know that our random variable is (say) Binomial(5, p), for some p, but we don't know the value of p? We will see how to estimate the value of p using maximum likelihood estimation.

4. Expectation and variance of a random variable: the expectation of a random variable is the value it takes on average; the variance of a random variable measures how much the random variable varies about its average.

5. Change of variable procedures: calculating probabilities and expectations of g(X), where X is a random variable and g(X) is a function, e.g. g(X) = √X or g(X) = X^2.

6. Modelling: we have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or is there little variability? Does it sometimes give results much higher than average, but never give results much lower (long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.


2.2 The probability function, fX(x)

The probability function fX(x) lists all possible values of X, and gives a probability to each value.

Recall that a random variable, X, assigns a real number to every possible outcome of a random experiment. The random variable is discrete if the set of real values it can take is finite or countable, eg. {0, 1, 2, . . .}.

Random experiment: which car? Random variable: X. X gives numbers to the possible outcomes:

If he chooses. . . Ferrari, then X = 1; Porsche, then X = 2; MG, then X = 3.

Definition: The probability function, fX(x), for a discrete random variable X, is given by

fX(x) = P(X = x), for all possible outcomes x of X.

Example: Which car?

Outcome: Ferrari  Porsche  MG
x:        1        2        3
fX(x) = P(X = x):  1/6  1/6  4/6

We write: P(X = 1) = fX(1) = 1/6: the probability he makes choice 1 (a Ferrari) is 1/6.

We can also write the probability function as:

fX(x) = 1/6 if x = 1; 1/6 if x = 2; 4/6 if x = 3; 0 otherwise.

Example: Toss a fair coin once, and let X = number of heads. Then X = 0 with probability 0.5, and X = 1 with probability 0.5. The probability function of X is given by:

x:  0  1
fX(x) = P(X = x):  0.5  0.5

or

fX(x) = 0.5 if x = 0; 0.5 if x = 1; 0 otherwise.

We write (eg.) fX(0) = 0.5, fX(1) = 0.5, fX(7.5) = 0, etc.

fX(x) is just a list of probabilities.
Properties of the probability function

i) 0 ≤ fX(x) ≤ 1 for all x: probabilities are always between 0 and 1.

ii) Σx fX(x) = 1: probabilities add to 1 overall.

iii) P(X ∈ A) = Σ_{x ∈ A} fX(x).

e.g. in the car example, P(X ∈ {1, 2}) = P(X = 1 or 2) = P(X = 1) + P(X = 2) = 1/6 + 1/6 = 2/6.

This is the probability of choosing either a Ferrari or a Porsche.


2.3 Bernoulli trials

Many of the discrete random variables that we meet are based on counting the outcomes of a series of trials called Bernoulli trials. Jacques Bernoulli was a Swiss mathematician in the late 1600s. He and his brother Jean, who were bitter rivals, both studied mathematics secretly against their father's will. Their father wanted Jacques to be a theologist and Jean to be a merchant.

Definition: A random experiment is called a set of Bernoulli trials if it consists of several trials such that:

i) Each trial has only 2 possible outcomes, usually called "Success" and "Failure";

ii) The probability of success, p, remains constant for all trials;

iii) The trials are independent, ie. the event "success in trial i" does not depend on the outcome of any other trials.

Examples:

1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.

2) Repeated tossing of a fair die: "success" = "6", "failure" = "not 6". Each toss is a Bernoulli trial with P(success) = 1/6.

Definition: The random variable Y is called a Bernoulli random variable if it takes only 2 values, 0 and 1. The probability function is

fY(y) = p if y = 1; 1 − p if y = 0.

That is,

P(Y = 1) = P("success") = p,
P(Y = 0) = P("failure") = 1 − p.


2.4 Example of the probability function: the Binomial Distribution

The Binomial distribution counts the number of successes in a fixed number of Bernoulli trials.

Definition: Let X be the number of successes in n independent Bernoulli trials each with probability of success = p. Then X has the Binomial distribution with parameters n and p. We write X ∼ Bin(n, p), or X ∼ Binomial(n, p).

Thus X ∼ Bin(n, p) if X is the number of successes out of n independent trials, each of which has probability p of success.

Probability function

If X ∼ Binomial(n, p), then the probability function for X is

fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x)   for x = 0, 1, . . . , n.

Explanation: For X = x, we need an outcome with x successes and (n − x) failures. A single outcome with x successes and (n − x) failures has probability

p^x × (1 − p)^(n−x),

where: (1) the trial succeeds x times, each with probability p; and (2) it fails (n − x) times, each with probability (1 − p).


There are (n choose x) possible outcomes with x successes and (n − x) failures, because we must select x trials to be our successes, out of n trials in total. Thus,

P(# successes = x) = (# outcomes with x successes) × (prob. of each such outcome)
= (n choose x) p^x (1 − p)^(n−x).

Notes:

1. fX(x) = 0 if x ∉ {0, 1, 2, . . . , n}.

2. Check that Σ_{x=0}^n fX(x) = 1:

Σ_{x=0}^n fX(x) = Σ_{x=0}^n (n choose x) p^x (1 − p)^(n−x) = [p + (1 − p)]^n (Binomial Theorem) = 1^n = 1.

It is this connection with the Binomial Theorem that gives the Binomial Distribution its name.


Example 1: Let X ∼ Binomial(n = 4, p = 0.2). Write down the probability function of X.

x:  0  1  2  3  4
fX(x) = P(X = x):  0.4096  0.4096  0.1536  0.0256  0.0016

Example 2: Let X be the number of times I get a 6 out of 10 rolls of a fair die.

1. What is the distribution of X?

2. What is the probability that X ≥ 2?

1. X ∼ Binomial(n = 10, p = 1/6).

2. P(X ≥ 2) = 1 − P(X < 2)
= 1 − P(X = 0) − P(X = 1)
= 1 − (10 choose 0) (1/6)^0 (5/6)^10 − (10 choose 1) (1/6)^1 (5/6)^9
= 0.515.
Example 3: Let X be the number of girls in a three-child family. What is the distribution of X?

Assume: (i) each child is equally likely to be a boy or a girl; (ii) all children are independent of each other. Then X ∼ Binomial(n = 3, p = 0.5).


Shape of the Binomial distribution

The shape of the Binomial distribution depends upon the values of n and p. For small n, the distribution is almost symmetrical for values of p close to 0.5, but highly skewed for values of p close to 0 or 1. As n increases, the distribution becomes more and more symmetrical, and there is noticeable skew only if p is very close to 0 or 1. The probability functions for various values of n and p are shown below.

[Figure: probability functions for n = 10, p = 0.5; n = 10, p = 0.9; and n = 100, p = 0.9.]

Sum of independent Binomial random variables: If X and Y are independent, and X ∼ Binomial(n, p), Y ∼ Binomial(m, p), then X + Y ∼ Bin(n + m, p). This is because X counts the number of successes out of n trials, and Y counts the number of successes out of m trials: so overall, X + Y counts the total number of successes out of n + m trials. Note: X and Y must both share the same value of p.
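This fact can be checked numerically: for independent X and Y, P(X + Y = z) is the convolution Σx P(X = x)P(Y = z − x), and it should match the Bin(n + m, p) probability function. A Python sketch (an addition to the notes; n, m, p chosen arbitrarily):

```python
from math import comb

def pmf(x, n, p):
    # Binomial probability function f_X(x) for X ~ Bin(n, p)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, m, p = 3, 5, 0.4
for z in range(n + m + 1):
    # P(X + Y = z) = sum over x of P(X = x) P(Y = z - x), by independence.
    conv = sum(pmf(x, n, p) * pmf(z - x, m, p)
               for x in range(max(0, z - m), min(n, z) + 1))
    assert abs(conv - pmf(z, n + m, p)) < 1e-12

print("convolution matches Bin(n + m, p)")
```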


2.5 The cumulative distribution function, FX(x)

We have defined the probability function, fX(x), as fX(x) = P(X = x). The probability function tells us everything there is to know about X.

The cumulative distribution function, or just distribution function, written as FX(x), is an alternative function that also tells us everything there is to know about X.

Definition: The (cumulative) distribution function (c.d.f.) is

FX(x) = P(X ≤ x) for −∞ < x < ∞.

If you are asked to give the distribution of X, you could answer by giving either the distribution function, FX(x), or the probability function, fX(x). Each of these functions encapsulates all possible information about X.

The distribution function FX(x) as a probability sweeper

The cumulative distribution function, FX(x), sweeps up all the probability up to and including the point x.

[Figure: probability functions for X ∼ Bin(10, 0.5) and X ∼ Bin(10, 0.9), with FX(x) sweeping up the probability up to and including x.]

Example: Let X ∼ Binomial(2, 1/2). Then

x:  0  1  2
fX(x) = P(X = x):  1/4  1/2  1/4

FX(x) = P(X ≤ x) =
0 if x < 0;
0.25 if 0 ≤ x < 1;
0.25 + 0.5 = 0.75 if 1 ≤ x < 2;
0.25 + 0.5 + 0.25 = 1 if x ≥ 2.

[Figure: the probability function f(x) (bars of height 1/4, 1/2, 1/4 at x = 0, 1, 2) and the distribution function F(x) (a step function rising from 0 through 1/4 and 3/4 to 1).]

FX(x) gives the cumulative probability up to and including point x. So

FX(x) = Σ_{y ≤ x} fX(y).

Note that FX(x) is a step function: it jumps by amount fX(y) at every point y with positive probability.


Reading off probabilities from the distribution function

As well as using the probability function to find the distribution function, we can also use the distribution function to find probabilities:

fX(x) = P(X = x) = P(X ≤ x) − P(X ≤ x − 1) = FX(x) − FX(x − 1)   (if X takes integer values).

This is why the distribution function FX(x) contains as much information as the probability function, fX(x): we can use either one to find the other.

In general:

P(a < X ≤ b) = FX(b) − FX(a) if b > a.

Proof:

P(X ≤ b) = P(X ≤ a) + P(a < X ≤ b)

(the event {X ≤ b} is the disjoint union of {X ≤ a} and {a < X ≤ b}).

So

FX(b) = FX(a) + P(a < X ≤ b)
⟹ FX(b) − FX(a) = P(a < X ≤ b).

Warning: endpoints

Be careful of endpoints and the difference between ≤ and <. For example,

P(X < 10) = P(X ≤ 9) = FX(9).

Examples: Let X ∼ Binomial(100, 0.4). In terms of FX(x), what is:

1. P(X ≤ 30)?   FX(30).

2. P(X < 30)?   P(X ≤ 29) = FX(29).

3. P(X ≥ 56)?   1 − P(X < 56) = 1 − P(X ≤ 55) = 1 − FX(55).

4. P(X > 42)?   1 − P(X ≤ 42) = 1 − FX(42).

5. P(50 ≤ X ≤ 60)?   P(X ≤ 60) − P(X ≤ 49) = FX(60) − FX(49).

Properties of the distribution function

1) F(−∞) = P(X ≤ −∞) = 0, and F(+∞) = P(X ≤ +∞) = 1. (These are true because values are strictly between −∞ and ∞.)

2) FX(x) is a non-decreasing function of x: that is, if x1 < x2, then FX(x1) ≤ FX(x2).

3) P(a < X ≤ b) = FX(b) − FX(a) if b > a.

4) F is right-continuous: that is, lim_{h↓0} F(x + h) = F(x).

62

2.6 Hypothesis testing

You have probably come across the idea of hypothesis tests, p-values, and significance in other courses. Common hypothesis tests include t-tests and chi-squared tests. However, hypothesis tests can be conducted in much simpler circumstances than these. The concept of the hypothesis test is at its easiest to understand with the Binomial distribution in the following example. All other hypothesis tests throughout statistics are based on the same idea.

Example: Weird Coin?

I toss a coin 10 times and get 9 heads. How weird is that?

What is "weird"?

Getting 9 heads out of 10 tosses: we'll call this weird.
Getting 10 heads out of 10 tosses: even more weird!
Getting 8 heads out of 10 tosses: less weird.
Getting 1 head out of 10 tosses: same as getting 9 tails out of 10 tosses: just as weird as 9 heads if the coin is fair.
Getting 0 heads out of 10 tosses: same as getting 10 tails: more weird than 9 heads if the coin is fair.

Set of weird outcomes

If our coin is fair, the outcomes that are as weird or weirder than 9 heads are:

9 heads, 10 heads, 1 head, 0 heads.


So how weird is 9 heads or worse, if the coin is fair?

Define X = number of heads out of 10 tosses.
Distribution of X, if the coin is fair: X ~ Binomial(n = 10, p = 0.5).

63

Probability of observing something at least as weird as 9 heads, if the coin is fair:

We can add the probabilities of all the outcomes that are at least as weird as 9 heads out of 10 tosses, assuming that the coin is fair:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0),   where X ~ Binomial(10, 0.5).

[Figure: probability function P(X = x) for Binomial(n = 10, p = 0.5), x = 0, . . . , 10.]

For X ~ Binomial(10, 0.5), we have:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0)
  = (10 choose 9)(0.5)^9(0.5)^1 + (10 choose 10)(0.5)^10(0.5)^0 + (10 choose 1)(0.5)^1(0.5)^9 + (10 choose 0)(0.5)^0(0.5)^10
  = 0.00977 + 0.00098 + 0.00977 + 0.00098
  = 0.021.

Is this weird?

Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only expect to see something as extreme as 9 heads on about 2.1% of occasions.
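The 0.021 can be reproduced by summing the four "weird" outcomes directly. An illustrative Python sketch (equivalent to summing dbinom values in R):

```python
from math import comb

def pmf(x, n=10, p=0.5):
    # P(X = x) for X ~ Binomial(10, 0.5)
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Outcomes at least as weird as 9 heads: 9, 10, 1 and 0 heads
p_value = pmf(9) + pmf(10) + pmf(1) + pmf(0)
print(round(p_value, 3))  # prints 0.021
```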

64

Is the coin fair?

Obviously, we can't say. It might be: after all, on 2.1% of occasions that you toss a fair coin 10 times, you do get something as weird as 9 heads or more.

However, 2.1% is a small probability, so it is still very unusual for a fair coin to produce something as weird as what we've seen. If the coin really was fair, it would be very unusual to get 9 heads or more. We can deduce that, EITHER we have observed a very unusual event with a fair coin, OR the coin is not fair.

In fact, this gives us some evidence that the coin is not fair. The value 2.1% measures the strength of our evidence. The smaller this probability, the more evidence we have.


Formal hypothesis test We now formalize the procedure above. Think of the steps: We have a question that we want to answer: Is the coin fair? There are two alternatives:

1. The coin is fair. 2. The coin is not fair.

Our observed information is X, the number of heads out of 10 tosses. We write down the distribution of X if the coin is fair: X ~ Binomial(10, 0.5). We calculate the probability of observing something AT LEAST AS EXTREME as our observation, X = 9, if the coin is fair: prob = 0.021.

The probability is small (2.1%). We conclude that this is unlikely with a fair coin, so we have observed some evidence that the coin is NOT fair.

65

Null hypothesis and alternative hypothesis

We express the steps above as two competing hypotheses.

Null hypothesis: the first alternative, that the coin IS fair.

We expect to believe the null hypothesis unless we see convincing evidence that it is wrong.

Alternative hypothesis: the second alternative, that the coin is NOT fair.

In hypothesis testing, we often use this same formulation.

The null hypothesis is specific. It specifies an exact distribution for our observation: X ~ Binomial(10, 0.5).

The alternative hypothesis is general. It simply states that the null hypothesis is wrong. It does not say what the right answer is.

We use H0 and H1 to denote the null and alternative hypotheses respectively.

The null hypothesis is H0 : the coin is fair.
The alternative hypothesis is H1 : the coin is NOT fair.

More precisely, we write:

Number of heads, X ~ Binomial(10, p), and

H0 : p = 0.5
H1 : p ≠ 0.5.

Think of the null hypothesis as meaning "the default": the hypothesis we will accept unless we have a good reason not to.

66

p-values In the hypothesis-testing framework above, we always measure evidence AGAINST

the null hypothesis.


That is, we believe that our coin is fair unless we see convincing evidence otherwise. We measure the strength of evidence against H0 using the p-value. In the example above, the p-value was p = 0.021. A p-value of 0.021 represents quite strong evidence against the null hypothesis. It states that, if the null hypothesis is TRUE, we would only have a 2.1% chance

of observing something as extreme as 9 heads or 9 tails.


Many of us would see this as strong enough evidence to decide that the null hypothesis is not true. In general, the p-value is the probability of observing something AT LEAST AS EXTREME AS OUR OBSERVATION, if H0 is TRUE. This means that SMALL p-values represent STRONG evidence against H0.

Small p-values mean STRONG evidence.
Large p-values mean LITTLE evidence.

Note: Be careful not to confuse the term p-value, which is 0.021 in our example, with the Binomial probability p. Our hypothesis test is designed to test whether the Binomial probability is p = 0.5. To test this, we calculate the p-value of 0.021 as a measure of the strength of evidence against the hypothesis that p = 0.5.

67

Interpreting the hypothesis test

There are different schools of thought about how a p-value should be interpreted.

Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against H0.

Some people go further and use an accept/reject framework. Under this framework, the null hypothesis H0 should be rejected if the p-value is less than 0.05 (say), and accepted if the p-value is greater than 0.05.

In this course we use the strength-of-evidence interpretation. The p-value measures how far out our observation lies in the tails of the distribution specified by H0. We do not talk about accepting or rejecting H0. This decision should usually be taken in the context of other scientific information. However, as a rule of thumb, consider that p-values of 0.05 and less start to suggest that the null hypothesis is doubtful.

Statistical significance

You have probably encountered the idea of statistical significance in other courses. Statistical significance refers to the p-value.

The result of a hypothesis test is significant at the 5% level if the p-value is less than 0.05. This means that the chance of seeing what we did see (9 heads), or more, is less than 5% if the null hypothesis is true.

Saying the test is significant is a quick way of saying that there is evidence against the null hypothesis, usually at the 5% level.

68

In the coin example, we can say that our test of H0 : p = 0.5 against H1 : p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021, which is < 0.05.

It means: we have some evidence that p ≠ 0.5.
It does not mean: the difference between p and 0.5 is large, or the difference between p and 0.5 is important in practical terms.

Statistically significant means that we have evidence that there IS a difference. It says NOTHING about the SIZE, or the IMPORTANCE, of the difference.

"Substantial evidence of a difference", not "evidence of a substantial difference".

Beware! The p-value gives the probability of seeing something as weird as what we did see, if H0 is true. This means that 5% of the time, we will get a p-value < 0.05 WHEN H0 IS TRUE! Similarly, about once in every thousand tests, we will get a p-value < 0.001 when H0 is true!

A small p-value does NOT mean that H0 is definitely wrong.


One-sided and two-sided tests

The test above is a two-sided test. This means that we considered it just as weird to get 9 tails as 9 heads.

If we had a good reason, before tossing the coin, to believe that the binomial probability could only be p = 0.5 or p > 0.5, i.e. that it would be impossible to have p < 0.5, then we could conduct a one-sided test: H0 : p = 0.5 versus H1 : p > 0.5. This would have the effect of halving the resultant p-value.

69

2.7 Example: Presidents and deep-sea divers

Men in the class: would you like to have daughters? Then become a deep-sea diver, a fighter pilot, or a heavy smoker. Would you prefer sons? Easy! Just become a US president.

Numbers suggest that men in different professions tend to have more sons than daughters, or the reverse. Presidents have sons, fighter pilots have daughters. But is it real, or just chance? We can use hypothesis tests to decide.

The facts

The 44 US presidents from George Washington to Barack Obama have had a total of 153 children, comprising 88 sons and only 65 daughters: a sex ratio of 1.4 sons for every daughter.

Two studies of deep-sea divers revealed that the men had a total of 190 children, comprising 65 sons and 125 daughters: a sex ratio of 1.9 daughters for every son.

Could this happen by chance? Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters?

This is the same as the question in Section 2.6.

For the presidents: If I tossed a coin 153 times and got only 65 heads, could I continue to believe that the coin was fair?

For the divers: If I tossed a coin 190 times and got only 65 heads, could I continue to believe that the coin was fair?

70

Hypothesis test for the presidents

We set up the competing hypotheses as follows.

Let X be the number of daughters out of 153 presidential children. Then X ~ Binomial(153, p), where p is the probability that each child is a daughter.

Null hypothesis:    H0 : p = 0.5.
Alternative hypothesis:    H1 : p ≠ 0.5.

p-value: We need the probability of getting a result AT LEAST AS EXTREME as X = 65 daughters, if H0 is true and p really is 0.5.

Which results are at least as extreme as X = 65?

X = 0, 1, 2, . . . , 65, for even fewer daughters.

X = (153 − 65), . . . , 153, for too many daughters, because we would be just as surprised if we saw only 65 sons, i.e. (153 − 65) = 88 daughters.

Probabilities for X ~ Binomial(n = 153, p = 0.5):
[Figure: bar plot of the Binomial(153, 0.5) probability function over x = 0 to 153, with both tails (X ≤ 65 and X ≥ 88) lying far from the centre.]

71

Calculating the p-value

The p-value for the president problem is given by

P(X ≤ 65) + P(X ≥ 88),   where X ~ Binomial(153, 0.5).

In principle, we could calculate this as

P(X = 0) + P(X = 1) + . . . + P(X = 65) + P(X = 88) + . . . + P(X = 153)
  = (153 choose 0)(0.5)^0(0.5)^153 + (153 choose 1)(0.5)^1(0.5)^152 + . . .

This would take a lot of calculator time! Instead, we use a computer with a package such as R.

R command for the p-value

The R command for calculating the lower-tail p-value for the Binomial(n = 153, p = 0.5) distribution is pbinom(65, 153, 0.5). Typing this in R gives:

> pbinom(65, 153, 0.5)
[1] 0.03748079
[Figure: the Binomial(153, 0.5) probability function with the lower tail X ≤ 65 shaded.]

This gives us the lower-tail p-value only: P(X ≤ 65) = 0.0375. To get the overall p-value, we have two choices:

1. Multiply the lower-tail p-value by 2:   2 × 0.0375 = 0.0750. In R:

> 2 * pbinom(65, 153, 0.5)
[1] 0.07496158

72

This works because the upper-tail p-value, by definition, is always going to be the same as the lower-tail p-value. The upper tail gives us the probability of finding something equally surprising at the opposite end of the distribution.

2. Calculate the upper-tail p-value explicitly (only works for H0 : p = 0.5):

The upper-tail p-value is

P(X ≥ 88) = 1 − P(X < 88) = 1 − P(X ≤ 87) = 1 − pbinom(87, 153, 0.5). In R:

> 1 - pbinom(87, 153, 0.5)
[1] 0.03748079

The overall p-value is the sum of the lower-tail and the upper-tail p-values:

pbinom(65, 153, 0.5) + 1 - pbinom(87, 153, 0.5) = 0.0375 + 0.0375 = 0.0750.

(Same as before.)
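Both routes to the presidents' p-value can be mirrored outside R. An illustrative Python sketch, where the cdf helper plays the role of pbinom:

```python
from math import comb

def cdf(x, n, p):
    # F_X(x) = P(X <= x): the quantity R's pbinom(x, n, p) returns
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

lower = cdf(65, 153, 0.5)      # P(X <= 65), the lower-tail p-value
upper = 1 - cdf(87, 153, 0.5)  # P(X >= 88), the upper-tail p-value
p_value = lower + upper        # same as 2 * lower, by symmetry at p = 0.5
```

Both tails come out equal (0.0375 each), so the two methods agree.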

Note: The R command pbinom is equivalent to the cumulative distribution function for the Binomial distribution:

pbinom(65, 153, 0.5) = P(X ≤ 65) = FX (65)   for X ~ Binomial(153, 0.5).

The overall p-value in this example is 2 × FX (65).

Note: In the R command pbinom(65, 153, 0.5), the order in which you enter the numbers 65, 153, and 0.5 is important. If you enter them in a different order, you will get the wrong answer (or an error). An alternative is to use the longhand command pbinom(q=65, size=153, prob=0.5), in which case you can enter the terms in any order.

73

Summary: are presidents more likely to have sons?

Back to our hypothesis test. Recall that X was the number of daughters out of 153 presidential children, and X ~ Binomial(153, p), where p is the probability that each child is a daughter.

Null hypothesis:    H0 : p = 0.5.
Alternative hypothesis:    H1 : p ≠ 0.5.
p-value:    2 × FX (65) = 0.075.

What does this mean? The p-value of 0.075 means that, if the presidents really were as likely to have daughters as sons, there would only be a 7.5% chance of observing something as unusual as only 65 daughters out of the total 153 children.

This is slightly unusual, but not very unusual.

We conclude that there is no real evidence that presidents are more likely to have sons than daughters. The observations are compatible with the possibility that there is no difference.

Does this mean presidents are equally likely to have sons and daughters? No: the observations are also compatible with the possibility that there is a difference. We just don't have enough evidence either way.

Hypothesis test for the deep-sea divers

For the deep-sea divers, there were 190 children: 65 sons, and 125 daughters.

Let X be the number of sons out of 190 diver children. Then X ~ Binomial(190, p), where p is the probability that each child is a son.

Note: We could just as easily formulate our hypotheses in terms of daughters instead of sons. Because pbinom is defined as a lower-tail probability, however, it is usually easiest to formulate them in terms of the low result (sons).

74

Null hypothesis:    H0 : p = 0.5.
Alternative hypothesis:    H1 : p ≠ 0.5.

p-value: Probability of getting a result AT LEAST AS EXTREME as X = 65 sons, if H0 is true and p really is 0.5.

Results at least as extreme as X = 65 are:

X = 0, 1, 2, . . . , 65, for even fewer sons.

X = (190 − 65), . . . , 190, for the equally surprising result in the opposite direction (too many sons).


Probabilities for X ~ Binomial(n = 190, p = 0.5):

[Figure: bar plot of the Binomial(190, 0.5) probability function over x = 0 to 190, with both tails lying far from the centre.]

R command for the p-value

p-value = 2 × pbinom(65, 190, 0.5). Typing this in R gives:

> 2*pbinom(65, 190, 0.5)
[1] 1.603136e-05

This is 0.000016, or a little more than one chance in 100 thousand.
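The same number can be cross-checked without R; an illustrative Python sketch of the same two-sided calculation:

```python
from math import comb

def cdf(x, n, p):
    # P(X <= x) for X ~ Binomial(n, p), as in R's pbinom(x, n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

# Two-sided p-value for 65 sons out of 190 children under H0: p = 0.5
p_value = 2 * cdf(65, 190, 0.5)
```

The result agrees with R's 1.603136e-05.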

75

We conclude that it is extremely unlikely that this observation could have occurred by chance, if the deep-sea divers had equal probabilities of having sons and daughters.

We have very strong evidence that deep-sea divers are more likely to have daughters than sons. The data are not really compatible with H0.

What next?

p-values are often badly used in science and business. They are regularly treated as the end point of an analysis, after which no more work is needed. Many scientific journals insist that scientists quote a p-value with every set of results, and often only p-values less than 0.05 are regarded as interesting. The outcome is that some scientists do every analysis they can think of until they finally come up with a p-value of 0.05 or less.

A good statistician will recommend a different attitude. It is very rare in science for numbers and statistics to tell us the full story.

Results like the p-value should be regarded as an investigative starting point, rather than the final conclusion. Why is the p-value small? What possible mechanism could there be for producing this result?

If you were a medical statistician and you gave me a p-value, I would ask you for a mechanism. Don't accept that Drug A is better than Drug B only because the p-value says so: find a biochemist who can explain what Drug A does that Drug B doesn't. Don't accept that sun exposure is a cause of skin cancer on the basis of a p-value alone: find a mechanism by which skin is damaged by the sun.

Why might divers have daughters and presidents have sons?

Deep-sea divers are thought to have more daughters than sons because the underwater work at high atmospheric pressure lowers the level of the hormone testosterone in the men's blood, which is thought to make them more likely to conceive daughters. For the presidents, your guess is as good as mine . . .

2.8 Example: Birthdays and sports professionals

Have you ever wondered what makes a professional sports player? Talent? Dedication? Good coaching? Or is it just that they happen to have the right birthday. . . ?

The following text is taken from Malcolm Gladwell's book Outliers. It describes the play-by-play for the first goal scored in the 2007 finals of the Canadian ice hockey junior league for star players aged 17 to 19. The two teams are the Tigers and Giants. There's one slight difference . . . instead of the players' names, we're given their birthdays.

March 11 starts around one side of the Tigers' net, leaving the puck for his teammate January 4, who passes it to January 22, who flips it back to March 12, who shoots point-blank at the Tigers' goalie, April 27. April 27 blocks the shot, but it's rebounded by Giants' March 6. He shoots! Tigers defensemen February 9 and February 14 dive to block the puck while January 10 looks on helplessly. March 6 scores!

Notice anything funny? Here are some figures. There were 25 players in the Tigers squad, born between 1986 and 1990. Out of these 25 players, 14 of them were born in January, February, or March. Is it believable that this should happen by chance, or do we have evidence that there is a birthday effect in becoming a star ice hockey player?

Hypothesis test

Let X be the number of the 25 players who are born from January to March. We need to set up hypotheses of the following form:

Null hypothesis:    H0 : there is no birthday effect.
Alternative hypothesis:    H1 : there is a birthday effect.

77

What is the distribution of X under H0 and under H1?

Under H0, there is no birthday effect. So the probability that each player has a birthday in January to March is about 1/4 (3 months out of a possible 12 months). Thus the distribution of X under H0 is X ~ Binomial(25, 1/4).

Under H1, there is a birthday effect, so p ≠ 1/4.

Our formulation for the hypothesis test is therefore as follows.

Number of Jan-to-March players, X ~ Binomial(25, p).

Null hypothesis:    H0 : p = 0.25.
Alternative hypothesis:    H1 : p ≠ 0.25.

Our observation: The observed proportion of players born from January to March is 14/25 = 0.56. This is more than the 0.25 predicted by H0. Is it sufficiently greater than 0.25 to provide evidence against H0? Just using our intuition, we can make a guess, but we might be wrong. The answer also depends on the sample size (25 in this case). We need the p-value to measure the evidence properly.

p-value: Probability of getting a result AT LEAST AS EXTREME as X = 14 Jan-to-March players, if H0 is true and p really is 0.25.

Results at least as extreme as X = 14 are:

Upper tail: X = 14, 15, . . . , 25, for even more Jan-to-March players.
Lower tail: an equal probability in the opposite direction, for too few Jan-to-March players.

78

Note: We do not need to calculate the values corresponding to our lower-tail p-value. It is more complicated in this example than in Section 2.7, because we do not have Binomial probability p = 0.5. In fact, the lower-tail probability lies somewhere between 0 and 1 player, but it cannot be specified exactly. We get round this problem for calculating the p-value by just multiplying the upper-tail p-value by 2.

Probabilities for X ~ Binomial(n = 25, p = 0.25):

[Figure: bar plot of the Binomial(25, 0.25) probability function; the observation X = 14 lies far out in the upper tail.]

R command for the p-value

We need twice the UPPER-tail p-value:   p-value = 2 × (1 − pbinom(13, 25, 0.25)).

(Recall P(X ≥ 14) = 1 − P(X ≤ 13).) Typing this in R gives:

> 2*(1-pbinom(13, 25, 0.25))
[1] 0.001831663

This p-value is very small. It means that if there really was no birthday effect, we would expect to see results as unusual as 14 out of 25 Jan-to-March players less than 2 in 1000 times.


We conclude that we have strong evidence that there is a birthday effect in this ice hockey team. Something beyond ordinary chance seems to be going on. The data are barely compatible with H0.
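The birthday p-value can be cross-checked the same way; an illustrative Python sketch of 2 × (1 − FX (13)) for X ~ Binomial(25, 0.25):

```python
from math import comb

def cdf(x, n, p):
    # P(X <= x) for X ~ Binomial(n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

# Twice the upper-tail probability P(X >= 14) under H0: p = 0.25
p_value = 2 * (1 - cdf(13, 25, 0.25))
```

This matches R's 0.001831663, i.e. less than 2 in 1000.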

79

Why should there be a birthday effect?

These data are just one example of a much wider - and astonishingly strong - phenomenon. Professional sports players, not just in ice hockey but in soccer, baseball, and other sports, have strong birthday clustering. Why?

It's because these sports select talented players for age-class star teams at young ages, about 10 years old. In ice hockey, the cut-off date for age-class teams is January 1st. A 10-year-old born in December is competing against players who are nearly a year older, born in January, February, and March. The age difference makes a big difference in terms of size, speed, and physical coordination. Most of the talented players at this age are simply older and bigger. But there then follow years in which they get the best coaching and the most practice. By the time they reach 17, these players really are the best.

2.9 Likelihood and estimation

So far, the hypothesis tests have only told us whether the Binomial probability p might be, or probably isn't, equal to the value specified in the null hypothesis. They have told us nothing about the size, or potential importance, of the departure from H0.

For example, for the deep-sea divers, we found that it would be very unlikely to observe as many as 125 daughters out of 190 children if the chance of having a daughter really was p = 0.5.
But what does this say about the actual value of p? Remember the p-value for the test was 0.000016. Do you think that:

1. p could be as big as 0.8?

No idea! The p-value does not tell us.

2. p could be as close to 0.5 as, say, 0.51?

The test doesn't even tell us this much! If there was a huge sample size (number of children), we COULD get a p-value as small as 0.000016 even if the true probability was 0.51.

Common sense, however, gives us a hint. Because there were almost twice as many daughters as sons, my guess is that the probability of having a daughter is something close to p = 2/3. We need some way of formalizing this.

80

Estimation

The process of using observations to suggest a value for a parameter is called estimation. The value suggested is called the estimate of the parameter.

In the case of the deep-sea divers, we wish to estimate the probability p that the child of a diver is a daughter. The common-sense estimate to use is

p̂ = number of daughters / total number of children = 125/190 = 0.658.

However, there are many situations where our common sense fails us. For example, what would we do if we had a regression-model situation (see other courses) and wished to specify an alternative form for p, such as

p = α + β × (diver age)?

How would we estimate the unknown intercept α and slope β, given known information on diver age and number of daughters and sons?

We need a general framework for estimation that can be applied to any situation. The most useful and general method of obtaining parameter estimates is the method of maximum likelihood estimation.

Likelihood

Likelihood is one of the most important concepts in statistics. Return to the deep-sea diver example. X is the number of daughters out of 190 children. We know that X ~ Binomial(190, p), and we wish to estimate the value of p. The available data is the observed value of X: X = 125.

81

Suppose for a moment that p = 0.5. What is the probability of observing X = 125?

When X ~ Binomial(190, 0.5),

P(X = 125) = (190 choose 125) (0.5)^125 (1 − 0.5)^(190−125) = 3.97 × 10^−6.

Not very likely!!

What about p = 0.6? What would be the probability of observing X = 125 if p = 0.6?

When X ~ Binomial(190, 0.6),

P(X = 125) = (190 choose 125) (0.6)^125 (1 − 0.6)^(190−125) = 0.016.

This still looks quite unlikely, but it is almost 4000 times more likely than getting X = 125 when p = 0.5.

So far, we have discovered that it would be thousands of times more likely to observe X = 125 if p = 0.6 than it would be if p = 0.5. This suggests that p = 0.6 is a better estimate than p = 0.5.

You can probably see where this is heading. If p = 0.6 is a better estimate than p = 0.5, what if we move p even closer to our common-sense estimate of 0.658?

When X ~ Binomial(190, 0.658),

P(X = 125) = (190 choose 125) (0.658)^125 (1 − 0.658)^(190−125) = 0.061.

This is even more likely than for p = 0.6. So p = 0.658 is the best estimate yet.

82

Can we do any better? What happens if we increase p a little more, say to p = 0.7?

When X ~ Binomial(190, 0.7),

P(X = 125) = (190 choose 125) (0.7)^125 (1 − 0.7)^(190−125) = 0.028.

This has decreased from the result for p = 0.658, so our observation of 125 is LESS likely under p = 0.7 than under p = 0.658.
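The four spot-checks (p = 0.5, 0.6, 0.658, 0.7) can be reproduced in one loop. An illustrative Python sketch (in R this is dbinom(125, 190, p)):

```python
from math import comb

def likelihood(p, n=190, x=125):
    # L(p) = P(X = 125) when X ~ Binomial(190, p)
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The probability of the fixed observation X = 125 under each candidate p
L = {p: likelihood(p) for p in (0.5, 0.6, 0.658, 0.7)}
```

L rises from about 4 × 10^−6 at p = 0.5 to about 0.061 at p = 0.658, then falls again by p = 0.7.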
Overall, we can plot a graph showing how likely our observation of X = 125 is under each different value of p.

[Figure: P(X=125) when X ~ Bin(190, p), plotted against p from 0.50 to 0.80; the curve rises to a clear peak near p = 0.658 and falls away on either side.]

The graph reaches a clear maximum. This is a value of p at which the observation X = 125 is MORE LIKELY than at any other value of p. This maximum likelihood value of p is our maximum likelihood estimate. We can see that the maximum occurs somewhere close to our common-sense estimate of p = 0.658.

83

The likelihood function

Look at the graph we plotted overleaf:

Horizontal axis: the unknown parameter, p.
Vertical axis: the probability of our observation, X = 125, under this value of p.

This function is called the likelihood function. It is a function of the unknown parameter p. For our fixed observation X = 125, the likelihood function shows how LIKELY the observation 125 is for every different value of p.

The likelihood function is:

L(p) = P(X = 125) when X ~ Binomial(190, p)
     = (190 choose 125) p^125 (1 − p)^(190−125)
     = (190 choose 125) p^125 (1 − p)^65        for 0 < p < 1.

This function of p is the curve shown on the graph on page 82.

In general, if our observation were X = x rather than X = 125, the likelihood function is a function of p giving P(X = x) when X ~ Binomial(190, p). We write:

L(p ; x) = P(X = x) when X ~ Binomial(190, p)
         = (190 choose x) p^x (1 − p)^(190−x).

84

Difference between the likelihood function and the probability function

The likelihood function is a probability of x, but it is a FUNCTION of p. The likelihood gives the probability of a FIXED observation x, for every possible value of the parameter p.

Compare this with the probability function, which is the probability of every different value of x, for a FIXED value of p.

[Figure: two panels. Left: P(X=125) when X ~ Bin(190, p), plotted against p from 0.50 to 0.80. Right: P(X=x) when p = 0.6, plotted against x from 100 to 140.]

Likelihood function, L(p ; x). Function of p for fixed x. Gives P(X = x) as p changes. (x = 125 here, but could be anything.)

Probability function, fX (x). Function of x for fixed p. Gives P(X = x) as x changes. (p = 0.6 here, but could be anything.)

Maximizing the likelihood

We have decided that a sensible parameter estimate for p is the maximum likelihood estimate: the value of p at which the observation X = 125 is more likely than at any other value of p.

We can find the maximum likelihood estimate using calculus. The likelihood function is

L(p ; 125) = (190 choose 125) p^125 (1 − p)^65.

85

We wish to find the value of p that maximizes this expression. To find the maximizing value of p, differentiate the likelihood with respect to p (Product Rule):

dL/dp = (190 choose 125) [ 125 p^124 (1 − p)^65 + p^125 × 65 (1 − p)^64 × (−1) ]

      = (190 choose 125) p^124 (1 − p)^64 [ 125(1 − p) − 65p ]

      = (190 choose 125) p^124 (1 − p)^64 (125 − 190p).

The maximizing value of p occurs when dL/dp = 0. This gives:

(190 choose 125) p^124 (1 − p)^64 (125 − 190p) = 0

⇒ 125 − 190p = 0   ⇒   p = 125/190 = 0.658.
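The calculus answer can be confirmed numerically by evaluating L(p) on a fine grid of p values and locating the peak. An illustrative Python sketch:

```python
from math import comb

def likelihood(p, n=190, x=125):
    # L(p ; 125) = (190 choose 125) p^125 (1 - p)^65
    return comb(n, x) * p**x * (1 - p)**(n - x)

grid = [i / 1000 for i in range(1, 1000)]  # p = 0.001, 0.002, ..., 0.999
p_hat = max(grid, key=likelihood)          # grid point with the largest L(p)
```

The grid maximiser lands on the calculus answer 125/190 = 0.658, to grid accuracy.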

86

For the diver example, the maximum likelihood estimate of 125/190 is the same as the common-sense estimate (page 80):

p̂ = number of daughters / total number of children = 125/190.

This gives us confidence that the method of maximum likelihood is sensible.

The "hat" notation for an estimate

It is conventional to write the estimated value of a parameter with a "hat", like this: p̂. For example,

p̂ = 125/190.

The correct notation for the maximization is:

dL/dp |_{p = p̂} = 0   ⇒   p̂ = 125/190.

Summary of the maximum likelihood procedure

1. Write down the distribution of X in terms of the unknown parameter:
X ~ Binomial(190, p).

2. Write down the observed value of X:
Observed data: X = 125.

3. Write down the likelihood function for this observed value:
L(p ; 125) = P(X = 125) when X ~ Binomial(190, p)
           = (190 choose 125) p^125 (1 − p)^65   for 0 < p < 1.

87

4. Differentiate the likelihood with respect to the parameter, and set to 0 for the maximum:
dL/dp = (190 choose 125) p^124 (1 − p)^64 (125 − 190p) = 0, when p = p̂.
This is the Likelihood Equation.

5. Solve for p̂: From the graph, we can see that p = 0 and p = 1 are not maxima, so
p̂ = 125/190.
This is the maximum likelihood estimate (MLE) of p.

Verifying the maximum

Strictly speaking, when we find the maximum likelihood estimate using dL/dp |_{p = p̂} = 0, we should verify that the result is a maximum (rather than a minimum) by showing that

d²L/dp² |_{p = p̂} < 0.

In Stats 210, we will be relaxed about this. You will usually be told to assume that the MLE occurs in the interior of the parameter range. Where possible, it is always best to plot the likelihood function, as on page 82. This confirms that the maximum likelihood estimate exists and is unique. In particular, care must be taken when the parameter has a restricted range like 0 < p < 1 (see later).

88

Estimators

For the example above, we had observation X = 125, and the maximum likelihood estimate of p was

p̂ = 125/190.

It is clear that we could follow through the same working with any value of X, which we can write as X = x, and we would obtain

p̂ = x/190.

Exercise: Check this by maximizing the likelihood using x instead of 125.

This means that even before we have made our observation of X, we can provide a RULE for calculating the maximum likelihood estimate once X is observed:

Rule: Let X ~ Binomial(190, p). Whatever value of X we observe, the maximum likelihood estimate of p will be

p̂ = X/190.

Note that this expression is now a random variable: it depends on the random value of X. A random variable specifying how an estimate is calculated from an observation is called an estimator. In the example above, the maximum likelihood estimaTOR of p is

p̂ = X/190.

The maximum likelihood estimaTE of p, once we have observed that X = x, is

p̂ = x/190.

89

General maximum likelihood estimator for Binomial(n, p)

Take any situation in which our observation X has the distribution X ~ Binomial(n, p), where n is KNOWN and p is to be estimated. We make a single observation X = x. Follow the steps on page 86 to find the maximum likelihood estimator for p.

1. Write down the distribution of X in terms of the unknown parameter:
X ~ Binomial(n, p).   (n is known.)

2. Write down the observed value of X:
Observed data: X = x.

3. Write down the likelihood function for this observed value:
L(p ; x) = P(X = x) when X ~ Binomial(n, p)
         = (n choose x) p^x (1 − p)^(n−x)   for 0 < p < 1.

4. Differentiate the likelihood with respect to the parameter, and set to 0 for the maximum:
dL/dp = (n choose x) p^(x−1) (1 − p)^(n−x−1) (x − np) = 0, when p = p̂.   (Exercise)

5. Solve for p̂:
p̂ = x/n.

This is the maximum likelihood estimate of p.
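The general formula p̂ = x/n can be sanity-checked numerically for several (n, x) pairs by brute-force maximisation over a grid. An illustrative Python sketch:

```python
from math import comb

def likelihood(p, n, x):
    # L(p ; x) = (n choose x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

def mle_by_grid(n, x, steps=2000):
    # Brute-force argmax of L(p) over p = 1/steps, ..., (steps-1)/steps
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: likelihood(p, n, x))

# Each grid maximiser should sit next to the closed-form MLE x/n
cases = [(10, 9), (153, 65), (190, 125)]
```

For each case the grid maximiser agrees with x/n to within the grid spacing.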

90

The maximum likelihood estimator of p is

p̂ = X/n.

(Just replace the x in the MLE with an X, to convert from the estimate to the estimator.)
By deriving the general maximum likelihood estimator for any problem of this sort, we can plug in values of n and x to get an instant MLE for any Binomial(n, p) problem in which n is known.

Example: Recall the president problem in Section 2.7. Out of 153 children, 65 were daughters. Let p be the probability that a presidential child is a daughter. What is the maximum likelihood estimate of p?

Solution: Plug in the numbers n = 153, x = 65: the maximum likelihood estimate is

p̂ = x/n = 65/153 = 0.425.

Note: We showed in Section 2.7 that p was not significantly different from 0.5 in this example. However, the MLE of p is definitely different from 0.5.

This comes back to the meaning of significantly different in the statistical sense. Saying that p is not significantly different from 0.5 just means that we can't DISTINGUISH any difference between p and 0.5 from routine sampling variability.

We expect that p probably IS different from 0.5, just by a little. The maximum likelihood estimate gives us the best estimate of p.

Note: We have only considered the class of problems for which X ~ Binomial(n, p) and n is KNOWN. If n is not known, we have a harder problem: we have two parameters, and one of them (n) should only take discrete values 1, 2, 3, . . . . We will not consider problems of this type in Stats 210.
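The calculus above can be double-checked numerically. The sketch below is a Python illustration (the grid search and function names are our own, not part of the course's R material): it evaluates the Binomial likelihood for the president data over a fine grid of p values and confirms that the maximum sits at x/n.

```python
from math import comb

def binomial_likelihood(p, n, x):
    # L(p; x) = (n choose x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, x = 153, 65  # president example: 65 daughters out of 153 children

# Evaluate L over a fine grid of p values in (0, 1) and take the argmax.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=lambda p: binomial_likelihood(p, n, x))

print(round(p_hat, 3))  # agrees with x/n = 65/153 = 0.425 to grid accuracy
```

The grid resolution of 1/10000 is arbitrary; any fine grid recovers the same answer to three decimal places.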

91

2.10 Random numbers and histograms

We often wish to generate random numbers from a given distribution. Statistical packages like R have custom-made commands for doing this.

To generate (say) 100 random numbers from the Binomial(n = 190, p = 0.6) distribution in R, we use:

    rbinom(100, 190, 0.6)

or in long-hand,

    rbinom(n=100, size=190, prob=0.6)

Caution: the R inputs n and size are the opposite to what you might expect: n gives the required sample size, and size gives the Binomial parameter n!

Histograms

The usual graph used to visualise a set of random numbers is the histogram. The height of each bar of the histogram shows how many of the random numbers fall into the interval represented by the bar.

For example, if each histogram bar covers an interval of length 5, and if 24 of the random numbers fall between 105 and 110, then the height of the histogram bar for the interval (105, 110) would be 24.

Here are histograms from applying the command rbinom(100, 190, 0.6) three different times.
[Three histograms: frequency of x plotted against x from 80 to 140. Each graph shows 100 random numbers from the Binomial(n = 190, p = 0.6) distribution.]
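The same experiment can be reproduced outside R. Below is a minimal Python sketch (rbinom here is our own stand-in for R's rbinom, built from raw Bernoulli trials, with a seed fixed by us for reproducibility) that draws 100 values from Binomial(190, 0.6) and tallies the integer frequencies that the histogram bars display.

```python
import random
from collections import Counter

random.seed(1)  # fix the seed so the run is reproducible

def rbinom(size, n, p):
    # Each draw counts the successes in n independent Bernoulli(p) trials.
    return [sum(random.random() < p for _ in range(n)) for _ in range(size)]

sample = rbinom(100, 190, 0.6)  # like R's rbinom(100, 190, 0.6)
freq = Counter(sample)          # bar heights: how many draws hit each integer

print(min(sample), max(sample), sum(freq.values()))
```

The bar heights always sum to 100, because each of the 100 observations lands in exactly one bar.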

92

Note: The histograms above have been specially adjusted so that each histogram bar covers an interval of just one integer. For example, the height of the bar plotted at x = 109 shows how many of the 100 random numbers are equal to 109.
Usually, histogram bars would cover a larger interval, and the histogram would be smoother. For example, on the right is a histogram using the default settings in R, obtained from the command hist(rbinom(100, 190, 0.6)). Each histogram bar covers an interval of 5 integers.

[Figure: Histogram of rbinom(100, 190, 0.6) using R's default settings; x-axis (rbinom(100, 190, 0.6)) from 100 to 130, y-axis (Frequency) from 0 to 30.]

In all the histograms above, the sum of the heights of all the bars is 100, because there are 100 observations.

Histograms as the sample size increases

Histograms are useful because they show the approximate shape of the underlying probability function.

They are also useful for exploring the effect of increasing sample size. All the histograms below have bars covering an interval of 1 integer. They show how the histogram becomes smoother and less erratic as sample size increases. Eventually, with a large enough sample size, the histogram starts to look identical to the probability function.

Note: difference between a histogram and the probability function

The histogram plots OBSERVED FREQUENCIES of a set of random numbers. The probability function plots EXACT PROBABILITIES for the distribution.

The histogram should have the same shape as the probability function, especially as the sample size gets large.

93

Sample size 1000: rbinom(1000, 190, 0.6)

[Three histograms, each of 1000 random numbers: frequency of x against x from 80 to 140.]
Sample size 10,000: rbinom(10000, 190, 0.6)

[Three histograms, each of 10,000 random numbers: frequency of x against x from 80 to 140.]
Sample size 100,000: rbinom(100000, 190, 0.6)

[Three histograms, each of 100,000 random numbers: frequency of x against x from 80 to 140.]

The histograms become stable in shape and approach the shape of the probability function as sample size gets large.

Probability function for Binomial(190, 0.6):

[Plot of P(X = x) when X ~ Bin(190, 0.6), for x from 80 to 140.]

The probability function is fixed and exact.

2.11 Expectation

Given a random variable X that measures something, we often want to know: what is the average value of X?

For example, here are 30 random observations taken from the distribution X ~ Binomial(n = 190, p = 0.6):

R command: rbinom(30, 190, 0.6)

    116 116 117 122 111 112 114 120 112 102
    125 116  97 105 108 117 118 111 116 121
    107 113 120 114 114 124 116 118 119 120

The average, or mean, of the first ten values is:

    (116 + 116 + . . . + 112 + 102) / 10 = 114.2.

The mean of the first twenty values is:

    (116 + 116 + . . . + 116 + 121) / 20 = 113.8.

The mean of the first thirty values is:

    (116 + 116 + . . . + 119 + 120) / 30 = 114.7.

The answers all seem to be close to 114. What would happen if we took the average of hundreds of values?

100 values from Binomial(190, 0.6):
R command: mean(rbinom(100, 190, 0.6))
Result: 114.86

Note: You will get a different result every time you run this command.

95

1000 values from Binomial(190, 0.6):
R command: mean(rbinom(1000, 190, 0.6))
Result: 114.02

1 million values from Binomial(190, 0.6):
R command: mean(rbinom(1000000, 190, 0.6))
Result: 114.0001

The average seems to be converging to the value 114. The larger the sample size, the closer the average seems to get to 114. If we kept going for larger and larger sample sizes, we would keep getting answers closer and closer to 114.

This is because 114 is the DISTRIBUTION MEAN: the mean value that we would get if we were able to draw an infinite sample from the Binomial(190, 0.6) distribution.

This distribution mean is called the expectation, or expected value, of the Binomial(190, 0.6) distribution.

It is a FIXED property of the Binomial(190, 0.6) distribution. This means it is a fixed constant: there is nothing random about it.
It is a FIXED property of the Binomial(190, 0.6) distribution. This means it is a xed constant: there is nothing random about it.

Definition: The expected value, also called the expectation or mean, of a discrete random variable X, can be written as either E(X), or EX, or μ_X, and is given by

    μ_X = E(X) = Σ_x x f_X(x) = Σ_x x P(X = x).

The expected value is a measure of the centre, or average, of the set of values that X can take, weighted according to the probability of each value.

If we took a very large sample of random numbers from the distribution of X, their average would be approximately equal to μ_X.

96

Example: Let X ~ Binomial(n = 190, p = 0.6). What is E(X)?

    E(X) = Σ_{x=0}^{190} x P(X = x)
         = Σ_{x=0}^{190} x (190 choose x) (0.6)^x (0.4)^(190-x).

Although it is not obvious, the answer to this sum is n × p = 190 × 0.6 = 114. We will see why in Section 2.14.

Explanation of the formula for expectation

We will move away from the Binomial distribution for a moment, and use a simpler example. Let the random variable X be defined as:

    X = 1 with probability 0.9,
        -1 with probability 0.1.

X takes only the values 1 and -1. What is the average value of X?

Using (1 + (-1)) / 2 = 0 would not be useful, because it ignores the fact that usually X = 1, and only occasionally is X = -1.

Instead, think of observing X many times, say 100 times. Roughly 90 of these 100 times will have X = 1. Roughly 10 of these 100 times will have X = -1.

The average of the 100 values will be roughly

    (90 × 1 + 10 × (-1)) / 100 = 0.9 × 1 + 0.1 × (-1) = 0.8.

We could repeat this for any sample size.

97

As the sample gets large, the average of the sample will get ever closer to

    0.9 × 1 + 0.1 × (-1).

This is why the distribution mean is given by

    E(X) = P(X = 1) × 1 + P(X = -1) × (-1),

or in general,

    E(X) = Σ_x P(X = x) × x.

E(X) is a fixed constant giving the average value we would get from a large sample of X.
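This long-run-average reading of E(X) is easy to check by simulation. A minimal Python sketch (the seed and sample size are arbitrary choices of ours):

```python
import random

random.seed(42)

# X = 1 with probability 0.9, and X = -1 with probability 0.1.
def draw_x():
    return 1 if random.random() < 0.9 else -1

n = 100000
sample_mean = sum(draw_x() for _ in range(n)) / n

exact = 0.9 * 1 + 0.1 * (-1)  # E(X) = 0.8
print(sample_mean, exact)     # the sample mean is close to 0.8
```

With 100,000 draws, the sample mean typically lands within a few thousandths of the distribution mean 0.8.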

Linear property of expectation

Expectation is a linear operator:

Theorem 2.11: Let a and b be constants. Then

    E(aX + b) = a E(X) + b.

Proof: Immediate from the definition of expectation.

    E(aX + b) = Σ_x (ax + b) f_X(x)
              = a Σ_x x f_X(x) + b Σ_x f_X(x)
              = a E(X) + b × 1.
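The theorem is easy to verify on a small probability function. A quick Python check (the two-value distribution and the constants a, b are arbitrary choices of ours):

```python
# A small probability function: P(X = 1) = 0.9, P(X = -1) = 0.1.
pmf = {1: 0.9, -1: 0.1}

def expectation(pmf):
    # E(X) = sum over x of x * P(X = x)
    return sum(x * p for x, p in pmf.items())

a, b = 5.0, 2.0
# Transform the values, keep the probabilities: this is the pmf of aX + b.
lhs = expectation({a * x + b: p for x, p in pmf.items()})  # E(aX + b)
rhs = a * expectation(pmf) + b                             # a E(X) + b

print(lhs, rhs)  # equal up to floating-point rounding
```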

98

Example: finding expectation from the probability function

Example 1: Let X ~ Binomial(3, 0.2). Write down the probability function of X and find E(X).

We have:

    P(X = x) = (3 choose x) (0.2)^x (0.8)^(3-x)    for x = 0, 1, 2, 3.

    x                    0      1      2      3
    f_X(x) = P(X = x)  0.512  0.384  0.096  0.008

Then

    E(X) = Σ_{x=0}^{3} x f_X(x) = 0 × 0.512 + 1 × 0.384 + 2 × 0.096 + 3 × 0.008 = 0.6.
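The table and the expectation in Example 1 can be generated directly from the probability function. A small Python sketch:

```python
from math import comb

n, p = 3, 0.2
# Probability function of X ~ Binomial(3, 0.2):
f = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}
# f is {0: 0.512, 1: 0.384, 2: 0.096, 3: 0.008} up to rounding.

EX = sum(x * fx for x, fx in f.items())
print(EX)  # close to 0.6 = n * p
```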

Note: We have E(X) = 0.6 = 3 × 0.2 for X ~ Binomial(3, 0.2). We will prove in Section 2.14 that whenever X ~ Binomial(n, p), then E(X) = np.

Example 2: Let Y ~ Bernoulli(p) (Section 2.3). That is,

    Y = 1 with probability p,
        0 with probability 1 - p.

Find E(Y).

    y           0       1
    P(Y = y)  1 - p     p

    E(Y) = 0 × (1 - p) + 1 × p = p.

99

Expectation of a sum of random variables: E(X + Y)

For ANY random variables X1, X2, . . . , Xn,

    E(X1 + X2 + . . . + Xn) = E(X1) + E(X2) + . . . + E(Xn).

In particular, E(X + Y) = E(X) + E(Y) for ANY X and Y.

This result holds for any random variables X1, . . . , Xn. It does NOT require X1, . . . , Xn to be independent.

We can summarize this important result by saying:

The expectation of a sum is the sum of the expectations, ALWAYS.

The proof requires multivariate methods, to be studied in later courses.

Note: We can combine the result above with the linear property of expectation. For any constants a1, . . . , an, we have:

    E(a1 X1 + a2 X2 + . . . + an Xn) = a1 E(X1) + a2 E(X2) + . . . + an E(Xn).

Expectation of a product of random variables: E(XY)

There are two cases when finding the expectation of a product:

1. General case:

For general X and Y, E(XY) is NOT equal to E(X)E(Y).

We have to find E(XY) either using their joint probability function (see later), or using their covariance (see later).

2. Special case: when X and Y are INDEPENDENT:

When X and Y are INDEPENDENT,

    E(XY) = E(X)E(Y).

2.12 Variable transformations

We often wish to transform random variables through a function. For example, given the random variable X, possible transformations of X include:

    X², √X, 4X³, . . .

We often summarize all possible variable transformations by referring to Y = g(X) for some function g.

For discrete random variables, it is very easy to find the probability function for Y = g(X), given that the probability function for X is known. Simply change all the values and keep the probabilities the same.


Example 1: Let X ~ Binomial(3, 0.2), and let Y = X². Find the probability function of Y.

The probability function for X is:

    x           0      1      2      3
    P(X = x)  0.512  0.384  0.096  0.008

Thus the probability function for Y = X² is:

    y           0²     1²     2²     3²
    P(Y = y)  0.512  0.384  0.096  0.008

This is because Y takes the value 0² whenever X takes the value 0, and so on. Thus the probability that Y = 0² is the same as the probability that X = 0.

Overall, we would write the probability function of Y = X² as:

    y           0      1      4      9
    P(Y = y)  0.512  0.384  0.096  0.008

To transform a discrete random variable, transform the values and leave the probabilities alone.

Example 2: Mr Chance hires out giant helium balloons for advertising. His balloons come in three sizes: heights 2m, 3m, and 4m. 50% of Mr Chance's customers choose to hire the cheapest 2m balloon, while 30% hire the 3m balloon and 20% hire the 4m balloon.

The amount of helium gas in cubic metres required to fill the balloons is h³/2, where h is the height of the balloon. Find the probability function of Y, the amount of helium gas required for a randomly chosen customer.

Let X be the height of balloon ordered by a random customer. The probability function of X is:

    height, x (m)   2     3     4
    P(X = x)       0.5   0.3   0.2

Let Y be the amount of gas required: Y = X³/2. The probability function of Y is:

    gas, y (m³)    4    13.5    32
    P(Y = y)      0.5   0.3    0.2

Transform the values, and leave the probabilities alone.

Expected value of a transformed random variable

We can find the expectation of a transformed random variable just like any other random variable. For example, in Example 1 we had X ~ Binomial(3, 0.2), and Y = X².

The probability function for X is:

    x           0      1      2      3
    P(X = x)  0.512  0.384  0.096  0.008

and for Y = X²:

    y           0      1      4      9
    P(Y = y)  0.512  0.384  0.096  0.008

Thus the expectation of Y = X² is:

    E(Y) = E(X²) = 0 × 0.512 + 1 × 0.384 + 4 × 0.096 + 9 × 0.008 = 0.84.

Note: E(X²) is NOT the same as {E(X)}². Check that {E(X)}² = 0.36.

To make the calculation quicker, we could cut out the middle step of writing down the probability function of Y. Because we transform the values and keep the probabilities the same, we have:

    E(X²) = 0² × 0.512 + 1² × 0.384 + 2² × 0.096 + 3² × 0.008.

If we write g(X) = X², this becomes:

    E{g(X)} = E(X²) = g(0) × 0.512 + g(1) × 0.384 + g(2) × 0.096 + g(3) × 0.008.

Clearly the same arguments can be extended to any function g(X) and any discrete random variable X:

    E{g(X)} = Σ_x g(x) P(X = x).

Transform the values, and leave the probabilities alone.

Definition: For any function g and discrete random variable X, the expected value of g(X) is given by

    E{g(X)} = Σ_x g(x) P(X = x) = Σ_x g(x) f_X(x).
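The definition can be checked numerically for Example 1 above. The sketch below (Python, with helper names of our own) computes E(X²) by transforming the values and leaving the probabilities alone, and contrasts it with {E(X)}²:

```python
from math import comb

n, p = 3, 0.2
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

# E{g(X)} = sum of g(x) P(X = x), here with g(x) = x^2:
E_X2 = sum(x**2 * fx for x, fx in pmf.items())
# Compare with {E(X)}^2, which is a different quantity:
EX_sq = sum(x * fx for x, fx in pmf.items()) ** 2

print(E_X2, EX_sq)  # close to 0.84 and 0.36: not the same
```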

103

Example: Recall Mr Chance and his balloon-hire business from page 101. Let X be the height of balloon selected by a randomly chosen customer. The probability function of X is:

    height, x (m)   2     3     4
    P(X = x)       0.5   0.3   0.2

(a) What is the average amount of gas required per customer?

Gas required was X³/2 from page 101. Average gas per customer is E(X³/2).

    E(X³/2) = Σ_x (x³/2) P(X = x)
            = (2³/2) × 0.5 + (3³/2) × 0.3 + (4³/2) × 0.2
            = 12.45 m³ gas.

(b) Mr Chance charges $400h to hire a balloon of height h. What is his expected earning per customer?

Expected earning is E(400X).

    E(400X) = 400 E(X)    (expectation is linear)
            = 400 × (2 × 0.5 + 3 × 0.3 + 4 × 0.2)
            = 400 × 2.7
            = $1080 per customer.

(c) How much does Mr Chance expect to earn in total from his next 5 customers?

Let Z1, . . . , Z5 be the earnings from the next 5 customers. Each Zi has E(Zi) = 1080 by part (b). The total expected earning is

    E(Z1 + Z2 + . . . + Z5) = E(Z1) + E(Z2) + . . . + E(Z5) = 5 × 1080 = $5400.
104

Getting the expectation . . . wrong!

Suppose

    X = 3 with probability 3/4,
        8 with probability 1/4.

Then 3/4 of the time, X takes value 3, and 1/4 of the time, X takes value 8. So

    E(X) = (3/4) × 3 + (1/4) × 8.

(Add up the values times how often they occur.)

What about E(√X)?

    √X = √3 with probability 3/4,
         √8 with probability 1/4.

(Add up the values times how often they occur.)

    E(√X) = (3/4) × √3 + (1/4) × √8.

Common mistakes:

    i)   E(√X) = √(E X) = √((3/4) × 3 + (1/4) × 8)    Wrong!
    ii)  E(√X) = √((3/4) × 3) × √((1/4) × 8)          Wrong!
    iii) E(√X) = √((3/4) × 3) + √((1/4) × 8)          Wrong!

2.13 Variance

Example: Mrs Tractor runs the Rational Bank of Remuera. Every day she hopes to fill her cash machine with enough cash to see the well-heeled citizens of Remuera through the day. She knows that the expected amount of money withdrawn each day is $50,000. How much money should she load in the machine? $50,000?

No: $50,000 is the average, near the centre of the distribution. About half the time, the money required will be GREATER than the average.

How much money should Mrs Tractor put in the machine if she wants to be 99% certain that there will be enough for the day's transactions? Answer: it depends how much the amount withdrawn varies above and below its mean.

For questions like this, we need the study of variance. Variance is the average squared distance of a random variable from its own mean.

Definition: The variance of a random variable X is written as either Var(X) or σ²_X, and is given by

    σ²_X = Var(X) = E[(X - μ_X)²] = E[(X - E X)²].

Similarly, the variance of a function of X is

    Var(g(X)) = E[(g(X) - E(g(X)))²].

Note: The variance is the square of the standard deviation of X, so

    sd(X) = √Var(X) = √(σ²_X) = σ_X.

107

Variance as the average squared distance from the mean

The variance is a measure of how spread out are the values that X can take. It is the average squared distance between a value of X and the central (mean) value, μ_X.

[Diagram: possible values x1, . . . , x6 of X on a number line, with μ_X as the central value and distances such as x2 - μ_X and x4 - μ_X marked.]

    Var(X) = E[(X - μ_X)²]

(1) Take the distance from observed values of X to the central point, μ_X. Square it to balance positive and negative distances.

(2) Then take the average over all values X can take: i.e. if we observed X many times, find what would be the average squared distance between X and μ_X.
Note: The mean, μ_X, and the variance, σ²_X, of X are just numbers: there is nothing random or variable about them.

Example: Let

    X = 3 with probability 3/4,
        8 with probability 1/4.

Then

    E(X) = μ_X = (3/4) × 3 + (1/4) × 8 = 4.25,

    Var(X) = σ²_X = (3/4) × (3 - 4.25)² + (1/4) × (8 - 4.25)² = 4.6875.

When we observe X, we get either 3 or 8: this is random. But μ_X is fixed at 4.25, and σ²_X is fixed at 4.6875, regardless of the outcome of X.

108

For a discrete random variable,

    Var(X) = E[(X - μ_X)²] = Σ_x (x - μ_X)² f_X(x) = Σ_x (x - μ_X)² P(X = x).

This uses the definition of the expected value of a function of X:

    Var(X) = E(g(X))  where  g(X) = (X - μ_X)².


Theorem 2.13A: (important)

    Var(X) = E(X²) - (E X)² = E(X²) - μ²_X.

Proof:

    Var(X) = E[(X - μ_X)²]                 (by definition)
           = E[X² - 2 μ_X X + μ²_X]        (X² and X are random variables; μ_X is a constant)
           = E(X²) - 2 μ_X E(X) + μ²_X     (by Thm 2.11)
           = E(X²) - 2 μ²_X + μ²_X
           = E(X²) - μ²_X.

Note: E(X²) = Σ_x x² f_X(x) = Σ_x x² P(X = x). This is not the same as (E X)². e.g. let

    X = 3 with probability 0.75,
        8 with probability 0.25.

Then μ_X = E X = 4.25, so

    μ²_X = (E X)² = (4.25)² = 18.0625.

But

    E(X²) = 3² × (3/4) + 8² × (1/4) = 22.75.

Thus

    E(X²) ≠ (E X)² in general.
109

Theorem 2.13B: If a and b are constants and g(x) is a function, then

    i)  Var(aX + b) = a² Var(X).
    ii) Var(a g(X) + b) = a² Var{g(X)}.

Proof (part (i)):

    Var(aX + b) = E[{(aX + b) - E(aX + b)}²]
                = E[{aX + b - a E(X) - b}²]    (by Thm 2.11)
                = E[{aX - a E(X)}²]
                = E[a² {X - E(X)}²]
                = a² E[{X - E(X)}²]            (by Thm 2.11)
                = a² Var(X).

Part (ii) follows similarly.

Note: These are very different from the corresponding expressions for expectations (Theorem 2.11). Variances are more difficult to manipulate than expectations.
Example: finding expectation and variance from the probability function

Recall Mr Chance's balloons from page 101. The random variable Y is the amount of gas required by a randomly chosen customer. The probability function of Y is:

    gas, y (m³)    4    13.5    32
    P(Y = y)      0.5   0.3    0.2

Find Var(Y).

We know that E(Y) = μ_Y = 12.45 from page 103.

First method: use Var(Y) = E[(Y - μ_Y)²]:

    Var(Y) = (4 - 12.45)² × 0.5 + (13.5 - 12.45)² × 0.3 + (32 - 12.45)² × 0.2
           = 112.47.

Second method: use E(Y²) - μ²_Y (usually easier):

    E(Y²) = 4² × 0.5 + 13.5² × 0.3 + 32² × 0.2 = 267.475.

So

    Var(Y) = 267.475 - (12.45)² = 112.47,

as before.
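Both methods can be confirmed in a couple of lines. A Python sketch of the calculation (keeping full precision rather than the 2 dp quoted above):

```python
# Probability function of Y, the gas required (m^3), from the table above.
pmf_gas = {4: 0.5, 13.5: 0.3, 32: 0.2}

EY = sum(y * p for y, p in pmf_gas.items())  # 12.45

# First method: average squared distance from the mean.
method1 = sum((y - EY)**2 * p for y, p in pmf_gas.items())

# Second method: E(Y^2) minus the squared mean.
EY2 = sum(y**2 * p for y, p in pmf_gas.items())  # 267.475
method2 = EY2 - EY**2

print(method1, method2)  # both about 112.47 (112.4725 to full precision)
```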

Variance of a sum of random variables: Var(X + Y)

There are two cases when finding the variance of a sum:

1. General case: For general X and Y, Var(X + Y) is NOT equal to Var(X) + Var(Y). We have to find Var(X + Y) using their covariance (see later courses).

2. Special case: when X and Y are INDEPENDENT:

When X and Y are INDEPENDENT,

    Var(X + Y) = Var(X) + Var(Y).

111

Interlude: TRUE or FALSE?

Guess whether each of the following statements is true or false.

1. Toss a fair coin 10 times. The probability of getting 8 or more heads is less than 1%.

2. Toss a fair coin 200 times. The chance of getting a run of at least 6 heads or 6 tails in a row is less than 10%.

3. Consider a classroom with 30 pupils of age 5, and one teacher of age 50. The probability that the pupils all outlive the teacher is about 90%.

4. Open the Business Herald at the pages giving share prices, or open an atlas at the pages giving country areas or populations. Pick a column of figures.

    share        last sale
    A Barnett    143
    Advantage    23
    AFFCO        18
    Air NZ       52
    . . .        . . .

The figures are over 5 times more likely to begin with the digit 1 than with the digit 9.

Answers: 1. FALSE: it is 5.5%. 2. FALSE: it is 97%. 3. FALSE: in NZ the probability is about 50%. 4. TRUE: in fact they are 6.5 times more likely.

112

2.14 Mean and variance of the Binomial(n, p) distribution

Let X ~ Binomial(n, p). We have mentioned several times that E(X) = np. We now prove this and the additional result for Var(X).

If X ~ Binomial(n, p), then:

    E(X) = μ_X = np,
    Var(X) = σ²_X = np(1 - p).

We often write q = 1 - p, so Var(X) = npq.

Easy proof: X as a sum of Bernoulli random variables

If X ~ Binomial(n, p), then X is the number of successes out of n independent trials, each with P(success) = p. This means that we can write:

    X = Y1 + Y2 + . . . + Yn,

where each

    Yi = 1 with probability p,
         0 with probability 1 - p.

That is, Yi counts as a 1 if trial i is a success, and as a 0 if trial i is a failure. Overall, Y1 + . . . + Yn is the total number of successes out of n independent trials, which is the same as X.

Note: Each Yi is a Bernoulli(p) random variable (Section 2.3).

Now if X = Y1 + Y2 + . . . + Yn, and Y1, . . . , Yn are independent, then:

    E(X) = E(Y1) + E(Y2) + . . . + E(Yn)              (does NOT require independence),
    Var(X) = Var(Y1) + Var(Y2) + . . . + Var(Yn)      (DOES require independence).
113

The probability function of each Yi is:

    y            0      1
    P(Yi = y)  1 - p    p

Thus,

    E(Yi) = 0 × (1 - p) + 1 × p = p.

Also,

    E(Yi²) = 0² × (1 - p) + 1² × p = p.

So

    Var(Yi) = E(Yi²) - (E Yi)² = p - p² = p(1 - p).

Therefore:

    E(X) = E(Y1) + E(Y2) + . . . + E(Yn) = p + p + . . . + p = np.

And:

    Var(X) = Var(Y1) + Var(Y2) + . . . + Var(Yn) = np(1 - p).

Thus we have proved that E(X) = np and Var(X) = np(1 - p).
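The results E(X) = np and Var(X) = np(1 - p) can also be confirmed by brute force, summing over the whole probability function. A Python check for n = 190, p = 0.6 (where np = 114 and npq = 45.6):

```python
from math import comb

n, p = 190, 0.6
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

mean = sum(x * f for x, f in pmf.items())
var  = sum(x**2 * f for x, f in pmf.items()) - mean**2

print(mean, var)  # close to np = 114 and np(1 - p) = 45.6
```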

114

Hard proof: for mathematicians (non-examinable)

We show below how the Binomial mean and variance formulae can be derived directly from the probability function.

    E(X) = Σ_{x=0}^{n} x f_X(x)
         = Σ_{x=0}^{n} x (n choose x) p^x (1 - p)^(n-x)
         = Σ_{x=0}^{n} x [n! / ((n - x)! x!)] p^x (1 - p)^(n-x).

But x/x! = 1/(x - 1)!, and also the first term x f_X(x) is 0 when x = 0. So, continuing,

    E(X) = Σ_{x=1}^{n} [n! / ((n - x)! (x - 1)!)] p^x (1 - p)^(n-x).

Next: make n's into (n - 1)'s, and x's into (x - 1)'s, wherever possible: e.g.

    n - x = (n - 1) - (x - 1),    n! = n(n - 1)!,    p^x = p × p^(x-1),  etc.

This gives:

    E(X) = Σ_{x=1}^{n} [n(n - 1)! / (((n - 1) - (x - 1))! (x - 1)!)] p × p^(x-1) (1 - p)^((n-1)-(x-1))
         = np Σ_{x=1}^{n} (n - 1 choose x - 1) p^(x-1) (1 - p)^((n-1)-(x-1)).

Here np is what we want; we need to show the remaining sum equals 1.

Finally we let y = x - 1 and m = n - 1. When x = 1, y = 0; and when x = n, y = n - 1 = m. So

    E(X) = np Σ_{y=0}^{m} (m choose y) p^y (1 - p)^(m-y)
         = np (p + (1 - p))^m        (Binomial Theorem)
         = np,  as required.
115

For Var(X), use the same ideas again. For E(X), we used x/x! = 1/(x - 1)!; so instead of finding E(X²), it will be easier to find

    E[X(X - 1)] = E(X²) - E(X),

because then we will be able to cancel: x(x - 1)/x! = 1/(x - 2)!.

Here goes:

    E[X(X - 1)] = Σ_{x=0}^{n} x(x - 1) (n choose x) p^x (1 - p)^(n-x)
                = Σ_{x=0}^{n} [x(x - 1) n(n - 1)(n - 2)! / (((n - 2) - (x - 2))! (x - 2)! x(x - 1))] p² p^(x-2) (1 - p)^((n-2)-(x-2)).

The first two terms (x = 0 and x = 1) are 0 due to the x(x - 1) in the numerator. Thus

    E[X(X - 1)] = p² n(n - 1) Σ_{x=2}^{n} (n - 2 choose x - 2) p^(x-2) (1 - p)^((n-2)-(x-2))
                = n(n - 1)p² Σ_{y=0}^{m} (m choose y) p^y (1 - p)^(m-y)    (with m = n - 2, y = x - 2)
                = n(n - 1)p²,

because the sum equals 1 by the Binomial Theorem.

So E[X(X - 1)] = n(n - 1)p². Thus

    Var(X) = E(X²) - (E(X))²
           = E(X²) - E(X) + E(X) - (E(X))²
           = E[X(X - 1)] + E(X) - (E(X))²
           = n(n - 1)p² + np - n²p²
           = np(1 - p).

Note the steps: take out x(x - 1), and replace n by (n - 2) and x by (x - 2) wherever possible.
116

Variance of the MLE for the Binomial p parameter

In Section 2.9 we derived the maximum likelihood estimator for the Binomial parameter p.

Reminder: Take any situation in which our observation X has the distribution X ~ Binomial(n, p), where n is KNOWN and p is to be estimated. Make a single observation X = x. The maximum likelihood estimator of p is

    p̂ = X / n.

In practice, estimates of parameters should always be accompanied by estimates of their variability. For example, in the deep-sea diver example introduced in Section 2.7, we estimated the probability that a diver has a daughter is

    p̂ = X / n = 125 / 190 = 0.658.

What is our margin of error on this estimate? Do we believe it is 0.658 ± 0.3 (say), in other words almost useless, or do we believe it is very precise, perhaps 0.658 ± 0.02?

We assess the usefulness of estimators using their variance. Given p̂ = X/n, we have:

    Var(p̂) = Var(X / n)
           = (1/n²) Var(X)
           = (1/n²) np(1 - p)      (for X ~ Binomial(n, p))
           = p(1 - p) / n.         (⋆)
117

In practice, however, we do not know the true value of p, so we cannot calculate the exact Var(p̂). Instead, we have to ESTIMATE Var(p̂) by replacing the unknown p in equation (⋆) by p̂. We call our estimated variance Var̂(p̂):

    Var̂(p̂) = p̂(1 - p̂) / n.

The standard error of p̂ is:

    se(p̂) = √Var̂(p̂).

We usually quote the margin of error associated with p̂ as:

    Margin of error = 1.96 × se(p̂) = 1.96 × √(p̂(1 - p̂) / n).

This result occurs because the Central Limit Theorem guarantees that p̂ will be approximately Normally distributed in large samples (large n). We will study the Central Limit Theorem in Chapter 5.

The expression p̂ ± 1.96 × se(p̂) gives an approximate 95% confidence interval for p under the Normal approximation.

Example: For the deep-sea diver example, with n = 190 and p̂ = 0.658:

    se(p̂) = √(0.658 × (1 - 0.658) / 190) = 0.034.

For our final answer, we should therefore quote:

    p̂ = 0.658 ± 1.96 × 0.034 = 0.658 ± 0.067,

or equivalently the approximate 95% confidence interval

    (0.591, 0.725).
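The arithmetic is easy to script. A Python sketch of the standard error and approximate 95% interval for the diver data (rounding conventions are ours; the notes round p̂ to 0.658 before computing, which can shift the third decimal place of the interval endpoints slightly):

```python
from math import sqrt

n, x = 190, 125
p_hat = x / n                                # 0.658 to 3 dp
se = sqrt(p_hat * (1 - p_hat) / n)           # estimated standard error
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)  # approximate 95% confidence interval

print(round(p_hat, 3), round(se, 3), tuple(round(v, 3) for v in ci))
```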

Chapter 3: Modelling with Discrete Probability Distributions

118

In Chapter 2 we introduced several fundamental ideas: hypothesis testing, likelihood, expectation, and variance. Each of these was illustrated by the Binomial distribution. We now introduce several other discrete distributions and discuss their properties and usage. First we revise Bernoulli trials and the Binomial distribution.

Bernoulli Trials

A set of Bernoulli trials is a series of trials such that:

i) each trial has only 2 possible outcomes: Success and Failure;
ii) the probability of success, p, is constant for all trials;
iii) the trials are independent.

Examples: 1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.

2) Having children: each child can be thought of as a Bernoulli trial with outcomes {girl, boy} and P(girl) = 0.5.

3.1 Binomial distribution

Description: X ~ Binomial(n, p) if X is the number of successes out of a fixed number n of Bernoulli trials, each with P(success) = p.

Probability function:

    f_X(x) = P(X = x) = (n choose x) p^x (1 - p)^(n-x)    for x = 0, 1, . . . , n.

Mean: E(X) = np.

Variance: Var(X) = np(1 - p).

Sum of independent Binomials: If X ~ Binomial(n, p) and Y ~ Binomial(m, p), and if X and Y are independent, and if X and Y both share the same parameter p, then

    X + Y ~ Binomial(n + m, p).

119

Shape: Usually symmetrical unless p is close to 0 or 1. Peaks at approximately np.

[Three probability function plots: n = 10, p = 0.5 (symmetrical); n = 10, p = 0.9 (skewed for p close to 1); n = 100, p = 0.9 (less skew for p = 0.9 if n is large).]
3.2 Geometric distribution

Like the Binomial distribution, the Geometric distribution is defined in terms of a sequence of Bernoulli trials.

The Binomial distribution counts the number of successes out of a fixed number of trials. The Geometric distribution counts the number of trials before the first success occurs. This means that the Geometric distribution counts the number of failures before the first success.

If every trial has probability p of success, we write: X ~ Geometric(p).

Examples:

1) X = number of boys before the first girl in a family: X ~ Geometric(p = 0.5).

2) Fish jumping up a waterfall. On every jump the fish has probability p of reaching the top. Let X be the number of failed jumps before the fish succeeds. Then X ~ Geometric(p).

120

Properties of the Geometric distribution

i) Description

X ~ Geometric(p) if X is the number of failures before the first success in a series of Bernoulli trials with P(success) = p.

ii) Probability function

For X ~ Geometric(p),

    f_X(x) = P(X = x) = (1 - p)^x p    for x = 0, 1, 2, . . .

Explanation: P(X = x) = (1 - p)^x × p: we need x failures, and the final trial must be a success.

Difference between Geometric and Binomial: For the Geometric distribution, the trials must always occur in the order F F . . . F S (x failures, then the success). For the Binomial distribution, failures and successes can occur in any order: e.g. F F . . . F S, F S F . . . F, S F . . . F, etc.

This is why the Geometric distribution has probability function

    P(x failures, 1 success) = (1 - p)^x p,

while the Binomial distribution has probability function

    P(x failures, 1 success) = (x + 1 choose x) (1 - p)^x p.

iii) Mean and variance

For X ~ Geometric(p),

    E(X) = (1 - p)/p = q/p,

    Var(X) = (1 - p)/p² = q/p².
121

iv) Sum of independent Geometric random variables

If X1, . . . , Xk are independent, and each Xi ~ Geometric(p), then

    X1 + . . . + Xk ~ Negative Binomial(k, p)    (see later).

v) Shape

Geometric probabilities are always greatest at x = 0. The distribution always has a long right tail (positive skew). The length of the tail depends on p. For small p, there could be many failures before the first success, so the tail is long. For large p, a success is likely to occur almost immediately, so the tail is short.

[Three probability function plots for x = 0 to 10: p = 0.3 (small p), p = 0.5 (moderate p), p = 0.9 (large p).]

vi) Likelihood

For any random variable, the likelihood function is just the probability function expressed as a function of the unknown parameter. If: X ~ Geometric(p); p is unknown; the observed value of X is x; then the likelihood function is:

    L(p ; x) = p(1 - p)^x    for 0 < p < 1.

Example: we observe a fish making 5 failed jumps before reaching the top of a waterfall. We wish to estimate the probability of success for each jump. Then

    L(p ; 5) = p(1 - p)^5    for 0 < p < 1.

Maximize L with respect to p to find the MLE, p̂.
122

For mathematicians: proof of Geometric mean and variance formulae (non-examinable)

We wish to prove that E(X) = (1 - p)/p and Var(X) = (1 - p)/p² when X ~ Geometric(p).

We use the following results:

    Σ_{x=1}^{∞} x q^(x-1) = 1/(1 - q)²          (for |q| < 1),        (3.1)

and

    Σ_{x=2}^{∞} x(x - 1) q^(x-2) = 2/(1 - q)³    (for |q| < 1).        (3.2)

Proof of (3.1) and (3.2): Consider the infinite sum of a geometric progression:

    Σ_{x=0}^{∞} q^x = 1/(1 - q)    (for |q| < 1).

Differentiate both sides with respect to q:

    d/dq [ Σ_{x=0}^{∞} q^x ] = d/dq [ 1/(1 - q) ]

    Σ_{x=0}^{∞} d/dq (q^x) = 1/(1 - q)²

    Σ_{x=1}^{∞} x q^(x-1) = 1/(1 - q)²,    as stated in (3.1).

Note that the lower limit of the summation becomes x = 1 because the term for x = 0 vanishes. The proof of (3.2) is obtained similarly, by differentiating both sides of (3.1) with respect to q (Exercise).
123

Now we can find E(X) and Var(X).

    E(X) = Σ_{x=0}^{∞} x P(X = x)
         = Σ_{x=0}^{∞} x p q^x                  (where q = 1 - p)
         = pq Σ_{x=1}^{∞} x q^(x-1)             (lower limit becomes x = 1 because the term at x = 0 is zero)
         = pq × 1/(1 - q)²                      (by equation (3.1))
         = pq × 1/p²                            (because 1 - q = p)
         = q/p,    as required.

For Var(X), we use

    Var(X) = E(X²) - (E X)² = E{X(X - 1)} + E(X) - (E X)².        (⋆)

Now

    E{X(X - 1)} = Σ_{x=0}^{∞} x(x - 1) P(X = x)
                = Σ_{x=0}^{∞} x(x - 1) p q^x        (where q = 1 - p)
                = pq² Σ_{x=2}^{∞} x(x - 1) q^(x-2)  (terms below x = 2 vanish)
                = pq² × 2/(1 - q)³                  (by equation (3.2))
                = 2q²/p².

Thus by (⋆),

    Var(X) = 2q²/p² + q/p - (q/p)² = q²/p² + q/p = q(q + p)/p² = q/p²,

as required, because q + p = 1.
124

3.3 Negative Binomial distribution

The Negative Binomial distribution is a generalised form of the Geometric distribution: the Geometric distribution counts the number of failures before the first success; the Negative Binomial distribution counts the number of failures before the kth success.

If every trial has probability p of success, we write: X ~ NegBin(k, p).

Examples:

1) X = number of boys before the second girl in a family: X ~ NegBin(k = 2, p = 0.5).

2) Tom needs to pass 24 papers to complete his degree. He passes each paper with probability p, independently of all other papers. Let X be the number of papers Tom fails in his degree. Then X ~ NegBin(24, p).

Properties of the Negative Binomial distribution

i) Description

X ~ NegBin(k, p) if X is the number of failures before the kth success in a series of Bernoulli trials with P(success) = p.

ii) Probability function

For X ~ NegBin(k, p),

    f_X(x) = P(X = x) = (k + x - 1 choose x) p^k (1 - p)^x    for x = 0, 1, 2, . . .

Explanation: For X = x, we need x failures and k successes.

- The trials stop when we reach the k'th success, so the last trial must be a success.
- This leaves x failures and k - 1 successes to occur in any order: a total of k - 1 + x trials.

For example, if x = 3 failures and k = 2 successes, we could have:

    FFFSS    FFSFS    FSFFS    SFFFS

So:

    P(X = x) = \binom{k+x-1}{x} \, p^k \, (1-p)^x,

where \binom{k+x-1}{x} counts the arrangements of (k-1) successes and x failures out of (k-1+x) trials, p^k is the probability of the k successes, and (1-p)^x is the probability of the x failures.

iii) Mean and variance

For X ~ NegBin(k, p),

    E(X) = \frac{k(1-p)}{p} = \frac{kq}{p},
    Var(X) = \frac{k(1-p)}{p^2} = \frac{kq}{p^2}.

These results can be proved from the fact that the Negative Binomial distribution is obtained as the sum of k independent Geometric random variables:

    X = Y_1 + ... + Y_k,   where the Y_i are independent and each Y_i ~ Geometric(p).

Then

    E(X) = k E(Y_i) = \frac{kq}{p},
    Var(X) = k Var(Y_i) = \frac{kq}{p^2}.

iv) Sum of independent Negative Binomial random variables

If X and Y are independent, and X ~ NegBin(k, p), Y ~ NegBin(m, p), with the same value of p, then

    X + Y ~ NegBin(k + m, p).
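These properties can be verified numerically from the probability function alone. A small Python sketch (helper names are our own, not from the notes), truncating the infinite sums:

```python
import math

def negbin_pmf(x, k, p):
    """P(X = x) for X ~ NegBin(k, p): x failures before the k'th success."""
    return math.comb(k + x - 1, x) * p ** k * (1 - p) ** x

k, p = 3, 0.5
q = 1 - p
pmf = [negbin_pmf(x, k, p) for x in range(500)]  # tail beyond 500 is negligible
total = sum(pmf)
mean = sum(x * v for x, v in enumerate(pmf))
var = sum(x * x * v for x, v in enumerate(pmf)) - mean ** 2
print(total)                # approximately 1
print(mean, k * q / p)      # both approximately 3.0, matching kq/p
print(var, k * q / p ** 2)  # both approximately 6.0, matching kq/p^2
```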

v) Shape

The Negative Binomial is flexible in shape. Below are the probability functions for various different values of k and p.

[Figure: probability functions for k = 3, p = 0.5; k = 3, p = 0.8; and k = 10, p = 0.5.]

vi) Likelihood

As always, the likelihood function is the probability function expressed as a function of the unknown parameters. If:

- X ~ NegBin(k, p);
- k is known;
- p is unknown;
- the observed value of X is x;

then the likelihood function is:

    L(p ; x) = \binom{k+x-1}{x} p^k (1-p)^x    for 0 < p < 1.

Example: Tom fails a total of 4 papers before finishing his degree. What is his pass probability for each paper?

X = # failed papers before 24 passed papers: X ~ NegBin(24, p).

Observation: X = 4 failed papers.

Likelihood:

    L(p ; 4) = \binom{24+4-1}{4} p^{24} (1-p)^4 = \binom{27}{4} p^{24} (1-p)^4    for 0 < p < 1.

Maximize L with respect to p to find the MLE, p̂.
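The maximization can be sketched numerically. Setting dL/dp = 0 gives the closed form p̂ = k/(k+x), which the grid search below reproduces (the code is our own illustration, not part of the notes):

```python
import math

def likelihood(p, k=24, x=4):
    """L(p ; x) = C(k+x-1, x) p^k (1-p)^x for Tom's NegBin model."""
    return math.comb(k + x - 1, x) * p ** k * (1 - p) ** x

# Grid search for the maximizing p in (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=likelihood)
print(p_hat)           # approximately 0.8571
print(24 / (24 + 4))   # the analytic MLE k/(k+x) = 6/7
```

So Tom's estimated pass probability per paper is about 0.86.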

3.4 Hypergeometric distribution: sampling without replacement

The hypergeometric distribution is used when we are sampling without replacement from a finite population.

i) Description

Suppose we have N objects:

- M of the N objects are special;
- the other N - M objects are not special.

We remove n objects at random without replacement. Let X = number of the n removed objects that are special. Then X ~ Hypergeometric(N, M, n).

Example: Ron has a box of Chocolate Frogs. There are 20 chocolate frogs in the box. Eight of them are dark chocolate, and twelve of them are white chocolate. Ron grabs a random handful of 5 chocolate frogs and stuffs them into his mouth when he thinks that no-one is looking. Let X be the number of dark chocolate frogs he picks. Then X ~ Hypergeometric(N = 20, M = 8, n = 5).

ii) Probability function

For X ~ Hypergeometric(N, M, n),

    f_X(x) = P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}

for x = max(0, n + M - N) to x = min(n, M).

Explanation: We need to choose x special objects and n - x other objects.

- Number of ways of selecting x special objects from the M available: \binom{M}{x}.
- Number of ways of selecting n - x other objects from the N - M available: \binom{N-M}{n-x}.
- Total number of ways of choosing x special objects and (n - x) other objects: \binom{M}{x}\binom{N-M}{n-x}.
- Overall number of ways of choosing n objects from N: \binom{N}{n}.

Thus:

    P(X = x) = \frac{\text{number of desired ways}}{\text{total number of ways}} = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}.

Note: We need 0 ≤ x ≤ M (number of special objects) and 0 ≤ n - x ≤ N - M (number of other objects). After some working, this gives us the stated constraint that x runs from max(0, n + M - N) to min(n, M).

Example: What is the probability that Ron selects 3 white and 2 dark chocolates?

X = # dark chocolates. There are N = 20 chocolates, including M = 8 dark chocolates. We need

    P(X = 2) = \frac{\binom{8}{2}\binom{12}{3}}{\binom{20}{5}} = \frac{28 \times 220}{15504} = 0.397.

iii) Mean and variance

For X ~ Hypergeometric(N, M, n),

    E(X) = np,
    Var(X) = np(1-p) \left( \frac{N-n}{N-1} \right),

where p = M/N.
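The probability function and the Ron example can be checked directly with Python's math.comb (a quick sketch, not part of the notes):

```python
import math

def hypergeom_pmf(x, N, M, n):
    """P(X = x) for X ~ Hypergeometric(N, M, n)."""
    return math.comb(M, x) * math.comb(N - M, n - x) / math.comb(N, n)

# Ron's chocolates: N = 20 frogs, M = 8 dark, a handful of n = 5.
p2 = hypergeom_pmf(2, 20, 8, 5)
print(round(p2, 3))    # 0.397

# The pmf sums to 1 over x = max(0, n+M-N), ..., min(n, M) = 0, ..., 5.
total = sum(hypergeom_pmf(x, 20, 8, 5) for x in range(0, 6))
print(total)           # 1.0
```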

iv) Shape

The Hypergeometric distribution is similar to the Binomial distribution when n/N is small, because removing n objects does not change the overall composition of the population very much when n/N is small. For n/N < 0.1 we often approximate the Hypergeometric(N, M, n) distribution by the Binomial(n, p = M/N) distribution.

[Figure: probability functions for Hypergeometric(30, 12, 10) and Binomial(10, 12/30).]

Note: The Hypergeometric distribution can be used for opinion polls, because these involve sampling without replacement from a finite population. The Binomial distribution is used when the population is sampled with replacement. As noted above, Hypergeometric(N, M, n) → Binomial(n, M/N) as N → ∞.

A note about distribution names

Discrete distributions often get their names from mathematical power series.

- Binomial probabilities sum to 1 because of the Binomial Theorem:

      \left( p + (1-p) \right)^n = <sum of Binomial probabilities> = 1.

- Negative Binomial probabilities sum to 1 by the Negative Binomial expansion, i.e. the Binomial expansion with a negative power, -k:

      p^k \left( 1 - (1-p) \right)^{-k} = <sum of NegBin probabilities> = 1.

- Geometric probabilities sum to 1 because they form a Geometric series:

      p \sum_{x=0}^{\infty} (1-p)^x = <sum of Geometric probabilities> = 1.

3.5 Poisson distribution

When is the next volcano due to erupt in Auckland? Any moment now, unfortunately! (Give or take 1000 years or so...)

A volcano could happen in Auckland this afternoon, or it might not happen for another 1000 years. Volcanoes are almost impossible to predict: they seem to happen completely at random. A distribution that counts the number of random events in a fixed space of time is the Poisson distribution.

- How many cars will cross the Harbour Bridge today? X ~ Poisson.
- How many road accidents will there be in NZ this year? X ~ Poisson.
- How many volcanoes will erupt over the next 1000 years? X ~ Poisson.

The Poisson distribution arose from a mathematical formulation called the Poisson process, published by Siméon-Denis Poisson in 1837.

Poisson process

The Poisson process counts the number of events occurring in a fixed time or space, when events occur independently and at a constant average rate.

Example: Let X be the number of road accidents in a year in New Zealand. Suppose that:

i) all accidents are independent of each other;
ii) accidents occur at a constant average rate of λ per year;
iii) accidents cannot occur simultaneously.

Then the number of accidents in a year, X, has the distribution X ~ Poisson(λ).

Number of accidents in one year

Let X be the number of accidents to occur in one year: X ~ Poisson(λ). The probability function for X ~ Poisson(λ) is

    P(X = x) = \frac{\lambda^x}{x!} e^{-\lambda}    for x = 0, 1, 2, ...

Number of accidents in t years

Let X_t be the number of accidents to occur in time t years. Then X_t ~ Poisson(λt), and

    P(X_t = x) = \frac{(\lambda t)^x}{x!} e^{-\lambda t}    for x = 0, 1, 2, ...

General definition of the Poisson process

Take any sequence of random events such that:

i) all events are independent;
ii) events occur at a constant average rate of λ per unit time;
iii) events cannot occur simultaneously.

Let X_t be the number of events to occur in time t. Then X_t ~ Poisson(λt), and

    P(X_t = x) = \frac{(\lambda t)^x}{x!} e^{-\lambda t}    for x = 0, 1, 2, ...

Note: For a Poisson process in space, let X_A = # events in an area of size A. Then X_A ~ Poisson(λA). Example: X_A = number of raisins in a volume A of currant bun.
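The scaling N_t ~ Poisson(λt) is easy to sanity-check numerically. A sketch with illustrative values λ = 2 events per year over t = 3 years (our own example, not from the notes):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam ** x / math.factorial(x) * math.exp(-lam)

lam, t = 2.0, 3.0  # illustrative values: rate 2 per year, over 3 years
pmf = [poisson_pmf(x, lam * t) for x in range(100)]  # tail beyond 100 is negligible
total = sum(pmf)
mean = sum(x * v for x, v in enumerate(pmf))
print(total)   # approximately 1
print(mean)    # approximately lam * t = 6
```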

Where does the Poisson formula come from? (Sketch idea, for mathematicians; non-examinable.)

The formal definition of the Poisson process is as follows.

Definition: The random variables {X_t : t > 0} form a Poisson process with rate λ if:

i) events occurring in any time interval are independent of those occurring in any other disjoint time interval;

ii) \lim_{\delta t \to 0} \frac{P(\text{exactly one event occurs in time interval } [t, t+\delta t])}{\delta t} = \lambda;

iii) \lim_{\delta t \to 0} \frac{P(\text{more than one event occurs in time interval } [t, t+\delta t])}{\delta t} = 0.

These conditions can be used to derive a partial differential equation on a function known as the probability generating function of X_t. The partial differential equation is solved to provide the form P(X_t = x) = \frac{(\lambda t)^x}{x!} e^{-\lambda t}.

Poisson distribution

The Poisson distribution is not just used in the context of the Poisson process. It is also used in many other situations, often as a subjective model (see Section 3.6). Its properties are as follows.

i) Probability function

For X ~ Poisson(λ),

    f_X(x) = P(X = x) = \frac{\lambda^x}{x!} e^{-\lambda}    for x = 0, 1, 2, ...

The parameter λ is called the rate of the Poisson distribution.

ii) Mean and variance

The mean and variance of the Poisson(λ) distribution are both λ:

    E(X) = Var(X) = λ    when X ~ Poisson(λ).

Notes:

1. It makes sense that E(X) = λ: by definition, λ is the average number of events per unit time in the Poisson process.

2. The variance of the Poisson distribution increases with the mean (in fact, variance = mean). This is often the case in real life: there is more uncertainty associated with larger numbers than with smaller numbers.

iii) Sum of independent Poisson random variables

If X and Y are independent, and X ~ Poisson(λ), Y ~ Poisson(μ), then

    X + Y ~ Poisson(λ + μ).

iv) Shape

The shape of the Poisson distribution depends upon the value of λ. For small λ, the distribution has positive (right) skew. As λ increases, the distribution becomes more and more symmetrical, until for large λ it has the familiar bell-shaped appearance. The probability functions for various λ are shown below.

[Figure: probability functions for λ = 1, λ = 3.5, and λ = 100.]
v) Likelihood and estimator variance

As always, the likelihood function is the probability function expressed as a function of the unknown parameters. If: X ~ Poisson(λ); λ is unknown; the observed value of X is x; then the likelihood function is:

    L(λ ; x) = \frac{\lambda^x}{x!} e^{-\lambda}    for 0 < λ < ∞.

Example: 28 babies were born in Mt Roskill yesterday. Estimate λ.

Let X = # babies born in a day in Mt Roskill. Assume that X ~ Poisson(λ).

Observation: X = 28 babies.

Likelihood:

    L(λ ; 28) = \frac{\lambda^{28}}{28!} e^{-\lambda}    for 0 < λ < ∞.

Maximize L with respect to λ to find the MLE, λ̂. We find that λ̂ = x = 28. Similarly, the maximum likelihood estimator of λ is λ̂ = X. Thus the estimator variance is:

    Var(λ̂) = Var(X) = λ,   because X ~ Poisson(λ).

Because we don't know λ, we have to estimate the variance: Var(λ̂) ≈ λ̂.

vi) R command for the p-value

If X ~ Poisson(λ), then the R command for P(X ≤ x) is ppois(x, lambda).

Proof of Poisson mean and variance formulae (non-examinable)

We wish to prove that E(X) = Var(X) = λ for X ~ Poisson(λ). For X ~ Poisson(λ), the probability function is

    f_X(x) = \frac{\lambda^x}{x!} e^{-\lambda}    for x = 0, 1, 2, ...

So

    E(X) = \sum_{x=0}^{\infty} x f_X(x) = \sum_{x=0}^{\infty} x \frac{\lambda^x}{x!} e^{-\lambda}
         = \sum_{x=1}^{\infty} \frac{\lambda^x}{(x-1)!} e^{-\lambda}               (the term for x = 0 is 0)
         = \lambda \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} e^{-\lambda}   (writing everything in terms of x - 1)
         = \lambda \sum_{y=0}^{\infty} \frac{\lambda^y}{y!} e^{-\lambda}           (putting y = x - 1)
         = \lambda,

because the sum is 1 (sum of Poisson probabilities). So E(X) = λ, as required.

For Var(X), we use:

    Var(X) = E(X^2) - (EX)^2 = E[X(X-1)] + E(X) - (EX)^2 = E[X(X-1)] + \lambda - \lambda^2.

But

    E[X(X-1)] = \sum_{x=0}^{\infty} x(x-1) \frac{\lambda^x}{x!} e^{-\lambda}
              = \sum_{x=2}^{\infty} \frac{\lambda^x}{(x-2)!} e^{-\lambda}               (terms for x = 0 and x = 1 are 0)
              = \lambda^2 \sum_{x=2}^{\infty} \frac{\lambda^{x-2}}{(x-2)!} e^{-\lambda} (writing everything in terms of x - 2)
              = \lambda^2 \sum_{y=0}^{\infty} \frac{\lambda^y}{y!} e^{-\lambda}         (putting y = x - 2)
              = \lambda^2.

So Var(X) = E[X(X-1)] + \lambda - \lambda^2 = \lambda^2 + \lambda - \lambda^2 = \lambda, as required.
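Both facts just proved, and the Mt Roskill MLE above, can be confirmed numerically. A Python sketch (the grid search stands in for the calculus; λ = 3.7 is an arbitrary illustrative rate):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam ** x / math.factorial(x) * math.exp(-lam)

# E(X) = Var(X) = lam, checked by truncated summation.
lam = 3.7
pmf = [poisson_pmf(x, lam) for x in range(100)]
mean = sum(x * v for x, v in enumerate(pmf))
var = sum(x * x * v for x, v in enumerate(pmf)) - mean ** 2
print(mean, var)   # both approximately 3.7

# Mt Roskill example: with x = 28 observed, L(lam ; 28) is maximized at lam = 28.
grid = [i / 100 for i in range(1, 10000)]
lam_hat = max(grid, key=lambda l: poisson_pmf(28, l))
print(lam_hat)     # 28.0
```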

3.6 Subjective modelling

Most of the distributions we have talked about in this chapter are exact models for the situation described. For example, the Binomial distribution describes exactly the distribution of the number of successes in n Bernoulli trials. However, there is often no exact model available. If so, we will use a subjective model.

In a subjective model, we pick a probability distribution to describe a situation just because it has properties that we think are appropriate to the situation, such as the right sort of symmetry or skew, or the right sort of relationship between variance and mean.

Example: Distribution of word lengths for English words.

Let X = number of letters in an English word chosen at random from the dictionary. If we plot the frequencies on a barplot, we see that the shape of the distribution is roughly Poisson.

[Figure: barplot of word lengths from 25109 English words, with the probabilities for X - 1 ~ Poisson(6.22) overlaid as points.]

The Poisson probabilities (with λ estimated by maximum likelihood) are plotted as points overlaying the barplot. We need to use X - 1 ~ Poisson because X cannot take the value 0. The fit of the Poisson distribution is quite good.

In this example we cannot say that the Poisson distribution represents the number of events in a fixed time or space: instead, it is being used as a subjective model for word length.

Can a Poisson distribution fit any data? The answer is no: in fact the Poisson distribution is very inflexible.

Here are stroke counts from 13061 Chinese characters. X is the number of strokes in a randomly chosen character. The best-fitting Poisson distribution (found by MLE) is overlaid. The fit of the Poisson distribution is awful.

[Figure: barplot of stroke counts from 13061 Chinese characters, with the best-fitting Poisson distribution overlaid.]

It turns out, however, that the Chinese stroke distribution is well-described by a Negative Binomial model. The best-fitting Negative Binomial distribution (found by MLE) is NegBin(k = 23.7, p = 0.64). The fit is very good. However, X does not represent the number of failures before the k'th success: the NegBin is a subjective model.

[Figure: barplot of stroke counts from 13061 Chinese characters, with the best-fitting Negative Binomial distribution overlaid.]

Chapter 4: Continuous Random Variables

4.1 Introduction

When Mozart performed his opera Die Entführung aus dem Serail, the Emperor Joseph II responded wryly, "Too many notes, Mozart!" In this chapter we meet a different problem: too many numbers!

We have met discrete random variables, for which we can list all the values and their probabilities, even if the list is infinite. E.g. for X ~ Geometric(p):

    x:         0    1     2      ...
    f_X(x):    p    pq    pq^2   ...

But suppose that X takes values in a continuous set, e.g. [0, ∞) or (0, 1). We can't even begin to list all the values that X can take. For example, how would you list all the numbers in the interval [0, 1]? The smallest number is 0, but what is the next smallest? 0.01? 0.0001? 0.0000000001? We just end up talking nonsense.

In fact, there are so many numbers in any continuous set that each of them must have probability 0. If there were a probability ε > 0 for all the numbers in a continuous set, however small, there simply wouldn't be enough probability to go round.

A continuous random variable takes values in a continuous interval (a, b). It describes a continuously varying quantity such as time or height. When X is continuous, P(X = x) = 0 for ALL x. The probability function is meaningless.

Although we cannot assign a probability to any value of X, we are able to assign probabilities to intervals: e.g. P(X = 1) = 0, but P(0.999 ≤ X ≤ 1.001) can be > 0. This means we should use the distribution function, F_X(x) = P(X ≤ x).

The cumulative distribution function, F_X(x)

Recall that for discrete random variables:

- F_X(x) = P(X ≤ x);
- F_X(x) is a step function: probability accumulates in discrete steps;
- P(a < X ≤ b) = P(X ∈ (a, b]) = F(b) - F(a).

For a continuous random variable:

- F_X(x) = P(X ≤ x);
- F_X(x) is a continuous function: probability accumulates continuously;
- as before, P(a < X ≤ b) = P(X ∈ (a, b]) = F(b) - F(a).

However, for a continuous random variable, P(X = a) = 0, so it makes no difference whether we say P(a < X ≤ b) or P(a ≤ X ≤ b). For a continuous random variable,

    P(a < X < b) = P(a ≤ X ≤ b) = F_X(b) - F_X(a).

This is not true for a discrete random variable: in fact, for a discrete random variable with values 0, 1, 2, ...,

    P(a < X < b) = P(a + 1 ≤ X ≤ b - 1) = F_X(b - 1) - F_X(a).

Endpoints are not important for continuous r.v.s. Endpoints are very important for discrete r.v.s.

4.2 The probability density function

Although the cumulative distribution function gives us an interval-based tool for dealing with continuous random variables, it is not very good at telling us what the distribution looks like. For this we use a different tool called the probability density function.

The probability density function (p.d.f.) is the best way to describe and recognise a continuous random variable. We use it all the time to calculate probabilities and to gain an intuitive feel for the shape and nature of the distribution. Using the p.d.f. is like recognising your friends by their faces. You can chat on the phone, write emails or send txts to each other all day, but you never really know a person until you've seen their face. Just like a cell-phone for keeping in touch, the cumulative distribution function is a tool for facilitating our interactions with the continuous random variable. However, we never really understand the random variable until we've seen its 'face': the probability density function. Surprisingly, it is quite difficult to describe exactly what the probability density function is. In this section we take some time to motivate and describe this fundamental idea.

All-time top-ten 100m sprint times

The histogram below shows the best 10 sprint times from the 168 all-time top male 100m sprinters. There are 1680 times in total, representing the top 10 times up to 2002 from each of the 168 sprinters. Out of interest, here are the summary statistics:

    Min.   1st Qu.  Median  Mean   3rd Qu.  Max.
    9.78   10.08    10.15   10.14  10.21    10.41

[Figure: histogram of the 1680 sprint times (frequency against time in seconds).]
We could plot this histogram using different time intervals:

[Figure: histograms of the sprint times with 0.1s, 0.05s, 0.02s, and 0.01s intervals.]

We see that each histogram has broadly the same shape, although the heights of the bars and the interval widths are different.

The histograms tell us the most intuitive thing we wish to know about the distribution: its shape:

- the most probable times are close to 10.2 seconds;
- the distribution of times has a long left tail (left skew);
- times below 10.0 seconds and above 10.3 seconds have low probability.
We could fit a curve over any of these histograms to show the desired shape, but the problem is that the histograms are not standardized: every time we change the interval width, the heights of the bars change. How can we derive a curve or function that captures the common shape of the histograms, but keeps a constant height? What should that height be?

The standardized histogram

We now focus on an idealized (smooth) version of the sprint times distribution, rather than using the exact 1680 sprint times observed. We are aiming to derive a curve, or function, that captures the shape of the histograms, but will keep the same height for any choice of histogram bar width.

First idea: plot the probabilities instead of the frequencies. The height of each histogram bar now represents the probability of getting an observation in that bar.

[Figure: probability histograms of the sprint times at three different bar widths.]

This doesn't work, because the height (probability) still depends upon the bar width. Wider bars have higher probabilities.

Second idea: plot the probabilities divided by bar width. The height of each histogram bar now represents the probability of getting an observation in that bar, divided by the width of the bar.

[Figure: probability / interval width histograms with 0.1s, 0.05s, 0.02s, and 0.01s intervals, with the same smooth curve overlaid on each.]

This seems to be exactly what we need! The same curve fits nicely over all the histograms and keeps the same height regardless of the bar width. These histograms are called standardized histograms. The nice-fitting curve is the probability density function. But... what is it?!

The probability density function

We have seen that there is a single curve that fits nicely over any standardized histogram from a given distribution. This curve is called the probability density function (p.d.f.). We will write the p.d.f. of a continuous random variable X as

    p.d.f. = f_X(x).

The p.d.f. f_X(x) is NOT the probability of x: for example, in the sprint times we can have f_X(x) = 4, so it is definitely NOT a probability. However, as the histogram bars of the standardized histogram get narrower, the bars get closer and closer to the p.d.f. curve. The p.d.f. is in fact the limit of the standardized histogram as the bar width approaches zero.

What is the height of the standardized histogram bar?

For an interval from x to x + t, the standardized histogram plots the probability of an observation falling between x and x + t, divided by the width of the interval, t. Thus the height of the standardized histogram bar over the interval from x to x + t is:

    \frac{\text{probability}}{\text{interval width}} = \frac{P(x \le X \le x+t)}{t} = \frac{F_X(x+t) - F_X(x)}{t},

where F_X(x) is the cumulative distribution function.

Now consider the limit as the histogram bar width t goes to 0: this limit is DEFINED TO BE the probability density function at x:

    f_X(x) = \lim_{t \to 0} \frac{F_X(x+t) - F_X(x)}{t}    by definition.

This expression should look familiar: it is the derivative of F_X(x).

The probability density function (p.d.f.) is therefore the function

    f_X(x) = F'_X(x).

It is defined to be a single, unchanging curve that describes the SHAPE of any histogram drawn from the distribution of X.

Formal definition of the probability density function

Definition: Let X be a continuous random variable with distribution function F_X(x). The probability density function (p.d.f.) of X is defined as

    f_X(x) = \frac{dF_X}{dx} = F'_X(x).

It gives:

- the RATE at which probability is accumulating at any given point, F'_X(x);
- the SHAPE of the distribution of X.

Using the probability density function to calculate probabilities

As well as showing us the shape of the distribution of X, the probability density function has another major use: it calculates probabilities by integration. Suppose we want to calculate P(a ≤ X ≤ b). We already know that:

    P(a ≤ X ≤ b) = F_X(b) - F_X(a).

But we also know that:

    \frac{dF_X}{dx} = f_X(x),   so   F_X(x) = \int f_X(x)\,dx    (without constants).

In fact:

    F_X(b) - F_X(a) = \int_a^b f_X(x)\,dx.

This is a very important result: let X be a continuous random variable with probability density function f_X(x). Then

    P(a \le X \le b) = P(X \in [a, b]) = \int_a^b f_X(x)\,dx.

This means that we can calculate probabilities by integrating the p.d.f.: P(a ≤ X ≤ b) is the AREA under the curve f_X(x) between a and b.

The total area under the p.d.f. curve is:

    \text{total area} = \int_{-\infty}^{\infty} f_X(x)\,dx = F_X(\infty) - F_X(-\infty) = 1 - 0 = 1.

This says that the total area under the p.d.f. curve is equal to the total probability that X takes a value between -∞ and +∞, which is 1.
Using the p.d.f. to calculate the distribution function, F_X(x)

Suppose we know the probability density function, f_X(x), and wish to calculate the distribution function, F_X(x). We use the following formula:

    F_X(x) = \int_{-\infty}^{x} f_X(u)\,du.

Proof:

    \int_{-\infty}^{x} f_X(u)\,du = F_X(x) - F_X(-\infty) = F_X(x) - 0 = F_X(x).

Using the dummy variable, u: writing F_X(x) = \int_{-\infty}^{x} f_X(u)\,du means: integrate f_X(u) as u ranges from -∞ to x.

Writing F_X(x) = \int_{-\infty}^{x} f_X(x)\,dx is WRONG and MEANINGLESS: it LOSES A MARK every time. In words, \int_{-\infty}^{x} f_X(x)\,dx means: integrate f_X(x) as x ranges from -∞ to x. It's nonsense! How can x range from -∞ to x?!
Why do we need f_X(x)? Why not stick with F_X(x)?

These graphs show F_X(x) and f_X(x) from the men's 100m sprint times (X is a random top-ten 100m sprint time).

[Figure: F(x) and f(x) plotted against x for the sprint times.]

Just using F_X(x) gives us very little intuition about the problem. For example, which is the region of highest probability? Using the p.d.f., f_X(x), we can see that it is about 10.1 to 10.2 seconds. Using the c.d.f., F_X(x), we would have to inspect the part of the curve with the steepest gradient: very difficult to see.

Example of calculations with the p.d.f.

Let

    f_X(x) = { k e^{-2x}   for 0 < x < ∞,
             { 0           otherwise.

(i) Find the constant k.
(ii) Find P(1 < X ≤ 3).
(iii) Find the cumulative distribution function, F_X(x), for all x.

(i) We need:

    \int_{-\infty}^{\infty} f_X(x)\,dx = 1
    \int_{-\infty}^{0} 0\,dx + \int_0^{\infty} k e^{-2x}\,dx = 1
    k \left[ \frac{e^{-2x}}{-2} \right]_0^{\infty} = 1
    -\frac{k}{2} (e^{-\infty} - e^0) = 1
    -\frac{k}{2} (0 - 1) = 1
    k = 2.

(ii)

    P(1 < X \le 3) = \int_1^3 f_X(x)\,dx
                   = \int_1^3 2 e^{-2x}\,dx
                   = \left[ \frac{2e^{-2x}}{-2} \right]_1^3
                   = -e^{-2 \times 3} + e^{-2 \times 1}
                   = 0.133.

(iii) For x > 0:

    F_X(x) = \int_{-\infty}^{x} f_X(u)\,du
           = \int_{-\infty}^{0} 0\,du + \int_0^{x} 2 e^{-2u}\,du
           = \left[ \frac{2e^{-2u}}{-2} \right]_0^{x}
           = -e^{-2x} + e^0
           = 1 - e^{-2x}.

When x ≤ 0:

    F_X(x) = \int_{-\infty}^{x} 0\,du = 0.

So overall,

    F_X(x) = { 0             for x ≤ 0,
             { 1 - e^{-2x}   for x > 0.
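The three answers can be double-checked by numerical integration. A minimal Python sketch (the midpoint-rule integrator is our own helper, not part of the notes):

```python
import math

def f(x):
    """The p.d.f. f_X(x) = 2 e^{-2x} for x > 0, with k = 2 from part (i)."""
    return 2 * math.exp(-2 * x) if x > 0 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule numerical integration of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(f, 0, 50)      # the tail beyond 50 is negligible
p13 = integrate(f, 1, 3)
print(total)                     # approximately 1, confirming k = 2
print(round(p13, 3))             # 0.133, matching e^{-2} - e^{-6}
print(1 - math.exp(-2 * 3))      # F_X(3) from the closed form 1 - e^{-2x}
```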

Total area under the p.d.f. curve is 1:

    \int_{-\infty}^{\infty} f_X(x)\,dx = 1.

The p.d.f. is NOT a probability: f_X(x) ≥ 0 always, but we do NOT require f_X(x) ≤ 1.

Calculating probabilities:

1. If you only need to calculate one probability P(a ≤ X ≤ b), integrate the p.d.f.:

    P(a \le X \le b) = \int_a^b f_X(x)\,dx.

2. If you will need to calculate several probabilities, it is easiest to find the distribution function, F_X(x):

    F_X(x) = \int_{-\infty}^{x} f_X(u)\,du.

Then use:

    P(a ≤ X ≤ b) = F_X(b) - F_X(a)    for any a, b.

Endpoints: DO NOT MATTER for continuous random variables:

    P(X ≤ a) = P(X < a)   and   P(X ≥ a) = P(X > a).

4.3 The Exponential distribution

When will the next volcano erupt in Auckland? We never quite answered this question in Chapter 3. The Poisson distribution was used to count the number of volcanoes that would occur in a fixed space of time. We have not said how long we have to wait for the next volcano: this is a continuous random variable.

Auckland Volcanoes

About 50 volcanic eruptions have occurred in Auckland over the last 100,000 years or so. The first two eruptions occurred in the Auckland Domain and Albert Park: right underneath us! The most recent, and biggest, eruption was Rangitoto, about 600 years ago. There have been about 20 eruptions in the last 20,000 years, which has led the Auckland Regional Council to assess current volcanic risk by assuming that volcanic eruptions in Auckland follow a Poisson process with rate λ = 1/1000 volcanoes per year. For background information, see: www.arc.govt.nz/environment/volcanoes-of-auckland/.

Distribution of the waiting time in the Poisson process

The length of time between events in the Poisson process is called the waiting time. To find the distribution of a continuous random variable, we often work with the cumulative distribution function, F_X(x). This is because F_X(x) = P(X ≤ x) gives us a probability, unlike the p.d.f. f_X(x). We are comfortable with handling and manipulating probabilities.

Suppose that {N_t : t > 0} forms a Poisson process with rate λ = 1/1000. N_t is the number of volcanoes to have occurred by time t, starting from now. We know that N_t ~ Poisson(λt), so

    P(N_t = n) = \frac{(\lambda t)^n}{n!} e^{-\lambda t}.
Let X be a continuous random variable giving the number of years waited before the next volcano, starting now. We will derive an expression for F_X(x).

(i) When x < 0:

    F_X(x) = P(X ≤ x) = P(less than 0 time before next volcano) = 0.

(ii) When x ≥ 0:

    F_X(x) = P(X ≤ x)
           = P(amount of time waited for next volcano is ≤ x)
           = P(there is at least one volcano between now and time x)
           = P(# volcanoes between now and time x is ≥ 1)
           = P(N_x ≥ 1)
           = 1 - P(N_x = 0)
           = 1 - \frac{(\lambda x)^0}{0!} e^{-\lambda x}
           = 1 - e^{-\lambda x}.

Overall:

    F_X(x) = P(X ≤ x) = { 1 - e^{-λx}   for x ≥ 0,
                        { 0             for x < 0.

The distribution of the waiting time X is called the Exponential distribution because of the exponential formula for F_X(x).

Example: What is the probability that there will be a volcanic eruption in Auckland within the next 50 years?

Put λ = 1/1000. We need P(X ≤ 50):

    P(X ≤ 50) = F_X(50) = 1 - e^{-50/1000} = 0.049.

There is about a 5% chance that there will be a volcanic eruption in Auckland over the next 50 years. This is the figure given by the Auckland Regional Council at the above web link (under Future Hazards).
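The calculation is one line in Python (a quick check of the figure above, not part of the notes):

```python
import math

def exponential_cdf(x, lam):
    """F_X(x) = 1 - e^{-lam x} for x >= 0, else 0."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

lam = 1 / 1000          # one volcano per 1000 years on average
p50 = exponential_cdf(50, lam)
print(round(p50, 3))    # 0.049: about a 5% chance within 50 years
```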

The Exponential Distribution

We have defined the Exponential(λ) distribution to be the distribution of the waiting time (time between events) in a Poisson process with rate λ. We write X ~ Exponential(λ), or X ~ Exp(λ). However, just like the Poisson distribution, the Exponential distribution has many other applications: it does not always have to arise from a Poisson process.

Let X ~ Exponential(λ). Note: λ > 0 always.

Distribution function:

    F_X(x) = P(X ≤ x) = { 1 - e^{-λx}   for x ≥ 0,
                        { 0             for x < 0.

Probability density function:

    f_X(x) = F'_X(x) = { λ e^{-λx}   for x ≥ 0,
                       { 0           for x < 0.

[Figure: the p.d.f. f_X(x) and the c.d.f. F_X(x) = P(X ≤ x).]

Link with the Poisson process

Let {N_t : t > 0} be a Poisson process with rate λ. Then:

- N_t is the number of events to occur by time t;
- N_t ~ Poisson(λt), so P(N_t = n) = \frac{(\lambda t)^n}{n!} e^{-\lambda t};
- define X to be either the time till the first event, or the time from now until the next event, or the time between any two events.

Then X ~ Exponential(λ). X is called the waiting time of the process.

Memorylessness

We have said that the waiting time of the Poisson process can be defined either as the time from the start to the first event, or the time from now until the next event, or the time between any two events. All of these quantities have the same distribution: X ~ Exponential(λ). The derivation of the Exponential distribution was valid for all of them, because events occur at a constant average rate in the Poisson process.

This property of the Exponential distribution is called memorylessness: the distribution of the time from now until the first event is the same as the distribution of the time from the start until the first event: the time from the start till now has been forgotten!

[Diagram: the time from the start to the first event splits into the time forgotten (start to now) plus the time from now to the first event.]

The Exponential distribution is famous for this memoryless property: it is the only memoryless continuous distribution. For volcanoes, memorylessness means that the 600 years we have waited since Rangitoto erupted have counted for nothing. The chance that we still have 1000 years to wait for the next eruption is the same today as it was 600 years ago when Rangitoto erupted.

Memorylessness applies to any Poisson process. It is not always a desirable property: you don't want a memoryless waiting time for your bus! The Exponential distribution is often used to model failure times of components: for example X ~ Exponential(λ) is the amount of time before a light bulb fails. In this case, memorylessness means that "old is as good as new" or, put another way, "new is as bad as old"! A memoryless light bulb is quite likely to fail almost immediately.

For private reading: proof of memorylessness

Let X ~ Exponential(λ) be the total time waited for an event. Let Y be the amount of extra time waited for the event, given that we have already waited time t (say). We wish to prove that Y has the same distribution as X, i.e. that the time t already waited has been forgotten. This means we need to prove that Y ~ Exponential(λ).

Proof: We will work with F_Y(y) and prove that it is equal to 1 - e^{-λy}. This proves that Y is Exponential(λ) like X.

First note that X = t + Y, because X is the total time waited, and Y is the time waited after time t. Also, we must condition on the event {X > t}, because we know that we have already waited time t. So P(Y ≤ y) = P(X ≤ t + y | X > t). Then:

    F_Y(y) = P(Y ≤ y) = P(X ≤ t + y | X > t)
           = \frac{P(X \le t+y \text{ AND } X > t)}{P(X > t)}    (definition of conditional probability)
           = \frac{P(t < X \le t+y)}{1 - P(X \le t)}
           = \frac{F_X(t+y) - F_X(t)}{1 - F_X(t)}
           = \frac{(1 - e^{-\lambda(t+y)}) - (1 - e^{-\lambda t})}{1 - (1 - e^{-\lambda t})}
           = \frac{e^{-\lambda t} - e^{-\lambda(t+y)}}{e^{-\lambda t}}
           = \frac{e^{-\lambda t}(1 - e^{-\lambda y})}{e^{-\lambda t}}
           = 1 - e^{-\lambda y}.

So Y ~ Exponential(λ) as required. Thus the conditional probability of waiting time y extra, given that we have already waited time t, is the same as the probability of waiting time y in total. The time t already waited is forgotten.
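The identity P(X ≤ t+y | X > t) = P(X ≤ y) can also be seen by simulation. A Monte Carlo sketch with arbitrary illustrative values λ = 0.5, t = 2, y = 1.5 (our own choices):

```python
import math
import random

random.seed(1)
lam, t, y = 0.5, 2.0, 1.5   # illustrative values
draws = [random.expovariate(lam) for _ in range(200_000)]

# Conditional probability P(X <= t + y | X > t), estimated from the draws.
survivors = [x for x in draws if x > t]
cond = sum(1 for x in survivors if x <= t + y) / len(survivors)
print(cond)                    # close to the unconditional value below
print(1 - math.exp(-lam * y))  # P(X <= y) = 1 - e^{-0.75}
```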

4.4 Likelihood and estimation for continuous random variables

For discrete random variables, we found the likelihood using the probability function, f_X(x) = P(X = x). For continuous random variables, we find the likelihood using the probability density function, f_X(x) = dF_X/dx. Although the notation f_X(x) means something different for continuous and discrete random variables, it is used in exactly the same way for likelihood and estimation.

Note: Both discrete and continuous r.v.s have the same definition for the cumulative distribution function: F_X(x) = P(X ≤ x).

Example: Exponential likelihood

Suppose that: X ~ Exponential(λ); λ is unknown; the observed value of X is x. Then the likelihood function is:

    L(λ ; x) = f_X(x) = λ e^{-λx}    for 0 < λ < ∞.

We estimate λ by setting dL/dλ = 0 to find the MLE, λ̂.

Two or more independent observations

Suppose that X_1, ..., X_n are continuous random variables such that:

- X_1, ..., X_n are INDEPENDENT;
- all the X_i's have the same p.d.f., f_X(x);

then the likelihood is

    L = f_X(x_1) f_X(x_2) ... f_X(x_n).


Example: Suppose that X_1, X_2, …, X_n are independent, and X_i ∼ Exponential(λ) for all i. Find the maximum likelihood estimate of λ.

[Likelihood graph shown for λ = 2 and n = 10; x_1, …, x_10 generated by the R command rexp(10, 2).]

Solution:

L(λ ; x_1, …, x_n) = ∏_{i=1}^{n} f_X(x_i) = ∏_{i=1}^{n} λe^{−λx_i} = λ^n e^{−λ ∑_{i=1}^{n} x_i}   for 0 < λ < ∞.

Define x̄ = (1/n) ∑_{i=1}^{n} x_i to be the sample mean of x_1, …, x_n. So ∑_{i=1}^{n} x_i = nx̄. Thus

L(λ ; x_1, …, x_n) = λ^n e^{−λnx̄}   for 0 < λ < ∞.

Solve dL/dλ = 0 to find the MLE of λ:

dL/dλ = nλ^{n−1} e^{−λnx̄} − λ^n · nx̄ · e^{−λnx̄} = 0
⇒ nλ^{n−1} e^{−λnx̄} (1 − λx̄) = 0
⇒ λ = 0, λ = ∞, or λ = 1/x̄.

The MLE of λ is λ̂ = 1/x̄.
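As a quick numerical sanity check of λ̂ = 1/x̄, the sketch below (in Python for illustration; the notes themselves use R) simulates a data set, evaluates the log-likelihood over a grid of λ values, and confirms that the grid maximiser sits next to 1/x̄. The data set and grid are invented for the example.

```python
import math
import random

# Simulated data: n = 200 draws from Exponential(rate = 2).
# (Illustrative values only, not the data set from the notes.)
random.seed(42)
xs = [random.expovariate(2.0) for _ in range(200)]
xbar = sum(xs) / len(xs)

def log_lik(lam, data):
    # log L(lambda; x_1..x_n) = n*log(lambda) - lambda * sum(x_i)
    return len(data) * math.log(lam) - lam * sum(data)

# Grid search over lambda in (0, 10]; the maximiser should be ~ 1/xbar.
grid = [0.01 * k for k in range(1, 1001)]
lam_hat = max(grid, key=lambda lam: log_lik(lam, xs))
print(lam_hat, 1 / xbar)  # the two values agree to within the grid spacing
```

The log-likelihood is concave in λ here, so the grid maximiser can only miss 1/x̄ by at most one grid step.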


4.5 Hypothesis tests

Hypothesis tests for continuous random variables are just like hypothesis tests for discrete random variables. The only difference is: endpoints matter for discrete random variables, but not for continuous random variables.

Example: discrete. Suppose H0: X ∼ Binomial(n = 10, p = 0.5), and we have observed the value x = 7. Then the upper-tail p-value is P(X ≥ 7) = 1 − P(X ≤ 6) = 1 − F_X(6).

Example: continuous. Suppose H0: X ∼ Exponential(2), and we have observed the value x = 7. Then the upper-tail p-value is P(X ≥ 7) = 1 − P(X ≤ 7) = 1 − F_X(7).

Other than this trap, the procedure for hypothesis testing is the same:
- Use H0 to specify the distribution of X completely, and offer a one-tailed or two-tailed alternative hypothesis H1.
- Make observation x.
- Find the one-tailed or two-tailed p-value as the probability of seeing an observation at least as weird as what we have seen, if H0 is true. That is, find the probability under the distribution specified by H0 of seeing an observation further out in the tails than the value x that we have seen.

Example with the Exponential distribution

A very very old person observes that the waiting time from Rangitoto to the next volcanic eruption in Auckland is 1500 years. Test the hypothesis that λ = 1/1000 against the one-sided alternative that λ < 1/1000.

Note: If λ < 1/1000, we would expect to see BIGGER values of X, NOT smaller. This is because X is the time between volcanoes, and λ is the rate at which volcanoes occur. A smaller value of λ means volcanoes occur less often, so the time X between them is BIGGER.


Hypotheses: Let X ∼ Exponential(λ).
H0: λ = 1/1000
H1: λ < 1/1000    (one-tailed test)

Observation: x = 1500 years.
Values weirder than x = 1500 years: all values BIGGER than x = 1500.
p-value: P(X ≥ 1500) when X ∼ Exponential(λ = 1/1000).

So
p-value = P(X ≥ 1500)
        = 1 − P(X ≤ 1500)
        = 1 − F_X(1500)   when X ∼ Exponential(λ = 1/1000)
        = 1 − (1 − e^{−1500/1000})
        = 0.223.

R command: 1-pexp(1500, 1/1000)

Interpretation: There is no evidence against H0. The observation x = 1500 years is consistent with the hypothesis that λ = 1/1000, i.e. that volcanoes erupt once every 1000 years on average.

[Graph: the p.d.f. f(x) of the Exponential(1/1000) distribution, plotted for x from 0 to 5000.]
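The p-value above can be reproduced in a couple of lines. This sketch uses Python's math module in place of the R command quoted in the notes:

```python
import math

# Upper-tail p-value for H0: lambda = 1/1000 with observation x = 1500:
# P(X >= 1500) = 1 - F_X(1500) = exp(-lambda * x) for the Exponential c.d.f.
lam = 1 / 1000
x = 1500
p_value = math.exp(-lam * x)
print(round(p_value, 3))  # 0.223, matching 1 - pexp(1500, 1/1000) in R
```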


4.6 Expectation and variance

Remember the expectation of a discrete random variable is the long-term average:

μ_X = E(X) = ∑_x x P(X = x) = ∑_x x f_X(x).

(For each value x, we add in the value and multiply by the proportion of times we would expect to see that value: P(X = x).)

For a continuous random variable, replace the probability function with the probability density function, and replace ∑ by ∫:

μ_X = E(X) = ∫_{−∞}^{∞} x f_X(x) dx,

where f_X(x) = F_X′(x) is the probability density function.

Note: There exists no concept of a probability function f_X(x) = P(X = x) for continuous random variables. In fact, if X is continuous, then P(X = x) = 0 for all x.

The idea behind expectation is the same for both discrete and continuous random variables. E(X) is:
- the long-term average of X;
- a sum of values multiplied by how common they are: ∑ x f(x) or ∫ x f(x) dx.

Expectation is also the balance point of f_X(x) for both continuous and discrete X. Imagine f_X(x) cut out of cardboard and balanced on a pencil.

Discrete:   E(X) = ∑_x x f_X(x);   E(g(X)) = ∑_x g(x) f_X(x).
Transform the values, leave the probabilities alone; f_X(x) = P(X = x).

Continuous:   E(X) = ∫_{−∞}^{∞} x f_X(x) dx;   E(g(X)) = ∫_{−∞}^{∞} g(x) f_X(x) dx.
Transform the values, leave the probability density alone; f_X(x) = F_X′(x) (p.d.f.).

Variance

If X is continuous, its variance is defined in exactly the same way as for a discrete random variable:

Var(X) = σ²_X = E[(X − μ_X)²] = E(X²) − μ²_X = E(X²) − (EX)².

For a continuous random variable, we can either compute the variance using

Var(X) = E[(X − μ_X)²] = ∫_{−∞}^{∞} (x − μ_X)² f_X(x) dx,

or

Var(X) = E(X²) − (EX)² = ∫_{−∞}^{∞} x² f_X(x) dx − (EX)².

The second expression is usually easier (although not always).


Properties of expectation and variance

All properties of expectation and variance are exactly the same for continuous and discrete random variables. For any random variables X, Y, and X_1, …, X_n, continuous or discrete, and for constants a and b:

E(aX + b) = aE(X) + b;
E(ag(X) + b) = aE(g(X)) + b;
E(X + Y) = E(X) + E(Y);
E(X_1 + … + X_n) = E(X_1) + … + E(X_n);
Var(aX + b) = a² Var(X);
Var(ag(X) + b) = a² Var(g(X)).

The following statements are generally true only when X and Y are INDEPENDENT:

E(XY) = E(X)E(Y) when X, Y independent;
Var(X + Y) = Var(X) + Var(Y) when X, Y independent.

4.7 Exponential distribution mean and variance

When X ∼ Exponential(λ):   E(X) = 1/λ   and   Var(X) = 1/λ².

Note: If X is the waiting time for a Poisson process with λ events per year (say), it makes sense that E(X) = 1/λ. For example, if λ = 4 events per hour, the average time waited between events is 1/4 hour.

Proof:

E(X) = ∫_{−∞}^{∞} x f_X(x) dx = ∫₀^∞ λx e^{−λx} dx.

Integration by parts: recall that ∫ u (dv/dx) dx = uv − ∫ v (du/dx) dx.
Let u = x, so du/dx = 1, and let dv/dx = λe^{−λx}, so v = −e^{−λx}. Then

E(X) = [uv]₀^∞ − ∫₀^∞ v (du/dx) dx
     = [−x e^{−λx}]₀^∞ + ∫₀^∞ e^{−λx} dx
     = 0 + [−(1/λ) e^{−λx}]₀^∞
     = (1/λ) e⁰
     = 1/λ.

So E(X) = 1/λ.

Variance: Var(X) = E(X²) − (EX)² = E(X²) − 1/λ².

Now E(X²) = ∫_{−∞}^{∞} x² f_X(x) dx = ∫₀^∞ λx² e^{−λx} dx.
Let u = x², so du/dx = 2x, and let dv/dx = λe^{−λx}, so v = −e^{−λx}. Then

E(X²) = [uv]₀^∞ − ∫₀^∞ v (du/dx) dx
      = [−x² e^{−λx}]₀^∞ + ∫₀^∞ 2x e^{−λx} dx
      = 0 + (2/λ) ∫₀^∞ λx e^{−λx} dx
      = (2/λ) E(X)
      = 2/λ².

So Var(X) = E(X²) − (EX)² = 2/λ² − 1/λ² = 1/λ².

Thus Var(X) = 1/λ².
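Both formulas can be checked by simulation. A minimal Monte Carlo sketch (Python for illustration; λ = 2 is an arbitrarily chosen rate):

```python
import random

# Monte Carlo check of E(X) = 1/lambda and Var(X) = 1/lambda^2
# for X ~ Exponential(lambda), with lambda = 2: expect mean 0.5, variance 0.25.
random.seed(0)
lam = 2.0
xs = [random.expovariate(lam) for _ in range(200_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)  # should be close to 0.5 and 0.25
```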


Interlude: Guess the Mean, Median, and Variance

For any distribution:
- the mean is the average that would be obtained if a large number of observations were drawn from the distribution;
- the median is the half-way point of the distribution: every observation has a 50-50 chance of being above the median or below the median;
- the variance is the average squared distance of an observation from the mean.

Given the probability density function of a distribution, we should be able to guess roughly the distribution mean, median, and variance... but it isn't easy! Have a go at the examples below. As a hint:
- the mean is the balance-point of the distribution. Imagine that the p.d.f. is made of cardboard and balanced on a rod. The mean is the point where the rod would have to be placed for the cardboard to balance.
- the median is the half-way point, so it divides the p.d.f. into two equal areas of 0.5 each.
- the variance is the average squared distance of observations from the mean; so to get a rough guess (not exact), it is easiest to guess an average distance from the mean and square it.
[Graph: a right-skewed p.d.f. f(x), plotted for x from 0 to 300. Guess the mean, median, and variance. (Answers overleaf.)]


Answers: median = 54.6; mean = 90.0; variance = 118² = 13924. [The same graph, with the median and mean marked on the x-axis.]

Notes: The mean is larger than the median. This always happens when the distribution has a long right tail (positive skew) like this one. The variance is huge... but when you look at the numbers along the horizontal axis, it is quite believable that the average squared distance of an observation from the mean is 118². Out of interest, the distribution shown is a Lognormal distribution.

Example 2: Try the same again with the example below. Answers are written below the graph.
[Graph: a decreasing, Exponential-shaped p.d.f. f(x), plotted for x from 0 to about 4.]

Answers: Median = 0.693; Mean = 1.0; Variance = 1.0.


4.8 The Uniform distribution

X has a Uniform distribution on the interval [a, b] if X is equally likely to fall anywhere in the interval [a, b]. We write X ∼ Uniform[a, b], or X ∼ U[a, b]. Equivalently, X ∼ Uniform(a, b), or X ∼ U(a, b).

Probability density function, f_X(x)

If X ∼ U[a, b], then
f_X(x) = 1/(b − a) if a ≤ x ≤ b, and f_X(x) = 0 otherwise.

[Graph: f_X(x) is flat at height 1/(b − a) between x = a and x = b.]

Distribution function, F_X(x)

F_X(x) = ∫_{−∞}^{x} f_Y(y) dy = ∫_a^x 1/(b − a) dy   if a ≤ x ≤ b
       = [y/(b − a)]_a^x = (x − a)/(b − a)   if a ≤ x ≤ b.

Thus
F_X(x) = 0 if x < a;
F_X(x) = (x − a)/(b − a) if a ≤ x ≤ b;
F_X(x) = 1 if x > b.

[Graph: F_X(x) rises linearly from 0 at x = a to 1 at x = b.]

Mean and variance: If X ∼ Uniform[a, b],

E(X) = (a + b)/2   and   Var(X) = (b − a)²/12.

Proof:

E(X) = ∫_a^b x f(x) dx = ∫_a^b x · 1/(b − a) dx = (1/(b − a)) [x²/2]_a^b
     = (b² − a²)/(2(b − a)) = (b − a)(b + a)/(2(b − a)) = (a + b)/2.

Var(X) = E[(X − μ_X)²] = ∫_a^b (x − μ_X)²/(b − a) dx = (1/(b − a)) [(x − μ_X)³/3]_a^b
       = ((b − μ_X)³ − (a − μ_X)³)/(3(b − a)).

But μ_X = EX = (a + b)/2, so b − μ_X = (b − a)/2 and a − μ_X = (a − b)/2. So

Var(X) = (1/(b − a)) · ((b − a)³ + (b − a)³)/(2³ · 3) = 2(b − a)³/(24(b − a)) = (b − a)²/12.

Example: let X ∼ Uniform[0, 1]. Then

f_X(x) = 1 if 0 ≤ x ≤ 1, and f_X(x) = 0 otherwise.

μ_X = E(X) = (0 + 1)/2 = 1/2 (half-way through the interval [0, 1]).
σ²_X = Var(X) = (1 − 0)²/12 = 1/12.
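The Uniform mean and variance formulas can also be checked by simulation; a small Monte Carlo sketch (Python for illustration, with arbitrarily chosen endpoints a = 3, b = 7):

```python
import random

# Monte Carlo check of E(X) = (a+b)/2 and Var(X) = (b-a)^2/12
# for X ~ Uniform[a, b]; a = 3, b = 7 are arbitrary illustrative values.
random.seed(0)
a, b = 3.0, 7.0
xs = [random.uniform(a, b) for _ in range(200_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)  # theory: (3+7)/2 = 5 and 16/12 = 1.333...
```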


4.9 The Change of Variable Technique: finding the distribution of g(X)

Let X be a continuous random variable. Suppose that:
- the p.d.f. of X, f_X(x), is known;
- the r.v. Y is defined as Y = g(X) for some function g;
- we wish to find the p.d.f. of Y.
We use the Change of Variable technique.

Example: Let X ∼ Uniform(0, 1), and let Y = −log(X). The p.d.f. of X is f_X(x) = 1 for 0 < x < 1. What is the p.d.f. of Y, f_Y(y)?

Change of variable technique for monotone functions

Suppose that g(X) is a monotone function R → R. This means that g is an increasing function, or g is a decreasing function. When g is monotone, it is invertible, or (1-1) (one-to-one). That is, for every y there is a unique x such that g(x) = y. This means that the inverse function, g⁻¹(y), is well-defined as a function for a certain range of y. When g : R → R, as it is here, then g can only be (1-1) if it is monotone.

[Graphs: y = g(x) = x² and its inverse x = g⁻¹(y) = √y, each plotted on [0, 1].]

Change of Variable formula

Let g : R → R be a monotone function and let Y = g(X). Then the p.d.f. of Y = g(X) is

f_Y(y) = f_X(g⁻¹(y)) |d g⁻¹(y)/dy|.

Easy way to remember: write y = y(x) (= g(x)) and x = x(y) (= g⁻¹(y)). Then

f_Y(y) = f_X(x(y)) |dx/dy|.

Working for change of variable questions
1) Show you have checked that g(x) is monotone over the required range.
2) Write y = y(x) for x in <range of x>, e.g. for a < x < b.
3) So x = x(y) for y in <range of y>: for y(a) < y(x) < y(b) if y is increasing; for y(a) > y(x) > y(b) if y is decreasing.
4) Then |dx/dy| = <expression involving y>.
5) So f_Y(y) = f_X(x(y)) |dx/dy| by the Change of Variable formula. Quote the range of values of y as part of the FINAL answer.

Refer back to the question to find f_X(x): you often have to deduce this from information like X ∼ Uniform(0, 1) or X ∼ Exponential(λ). Or it may be given explicitly.

Note: There should be no x's left in the answer! x(y) and |dx/dy| are expressions involving y only.

Example 1: Let X ∼ Uniform(0, 1), and let Y = −log(X). Find the p.d.f. of Y.

[Graph: y = −log(x) for 0 < x < 1.]

1) y(x) = −log(x) is monotone decreasing, so we can apply the Change of Variable formula.
2) Let y = y(x) = −log x for 0 < x < 1.
3) Then x = x(y) = e^{−y} for −log(0) > y > −log(1), i.e. 0 < y < ∞.
4) |dx/dy| = |d(e^{−y})/dy| = |−e^{−y}| = e^{−y} for 0 < y < ∞.
5) So f_Y(y) = f_X(x(y)) |dx/dy| = f_X(e^{−y}) e^{−y} for 0 < y < ∞.

But X ∼ Uniform(0, 1), so f_X(x) = 1 for 0 < x < 1, so f_X(e^{−y}) = 1 for 0 < y < ∞. Thus

f_Y(y) = f_X(e^{−y}) e^{−y} = e^{−y} for 0 < y < ∞. So Y ∼ Exponential(1).

Note: In change of variable questions, you lose a mark for:
1. not stating that g(x) is monotone over the required range of x;
2. not giving the range of y for which the result holds, as part of the FINAL answer (e.g. f_Y(y) = … for 0 < y < ∞).
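A simulation makes the conclusion Y ∼ Exponential(1) easy to believe: the sample mean and variance of −log(X) should both be close to 1. A small Python sketch (illustrative only):

```python
import math
import random

# Simulation check that Y = -log(X) with X ~ Uniform(0,1) behaves like
# Exponential(1): both the mean and the variance should be close to 1.
# (1 - random.random() lies in (0, 1], which keeps log() well-defined.)
random.seed(0)
ys = [-math.log(1.0 - random.random()) for _ in range(200_000)]
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
print(mean, var)  # both should be close to 1
```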


Example 2: Let X be a continuous random variable with p.d.f.

f_X(x) = (1/4)x³ for 0 < x < 2, and f_X(x) = 0 otherwise.

Let Y = 1/X. Find the probability density function of Y, f_Y(y).

The function y(x) = 1/x is monotone decreasing for 0 < x < 2, so we can apply the Change of Variable formula.

Let y = y(x) = 1/x for 0 < x < 2.
Then x = x(y) = 1/y for 1/0 > y > 1/2, i.e. 1/2 < y < ∞.
|dx/dy| = |−y⁻²| = 1/y² for 1/2 < y < ∞.

Change of variable formula:

f_Y(y) = f_X(x(y)) |dx/dy|
       = (1/4)(x(y))³ |dx/dy|
       = (1/4)(1/y)³ (1/y²)
       = 1/(4y⁵)   for 1/2 < y < ∞.

Thus
f_Y(y) = 1/(4y⁵) for 1/2 < y < ∞, and f_Y(y) = 0 otherwise.


For mathematicians: proof of the change of variable formula

Separate into cases where g is increasing and where g is decreasing.

i) g increasing

g is increasing if u < w ⟺ g(u) < g(w).   (⋆)

Note that putting u = g⁻¹(x) and w = g⁻¹(y), we obtain
g⁻¹(x) < g⁻¹(y) ⟺ g(g⁻¹(x)) < g(g⁻¹(y)) ⟺ x < y,
so g⁻¹ is also an increasing function.

Now
F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g⁻¹(y))   (put u = X, w = g⁻¹(y) in (⋆) to see this)
       = F_X(g⁻¹(y)).

So the p.d.f. of Y is

f_Y(y) = d F_Y(y)/dy = d F_X(g⁻¹(y))/dy = F_X′(g⁻¹(y)) · d(g⁻¹(y))/dy   (Chain Rule)
       = f_X(g⁻¹(y)) · d(g⁻¹(y))/dy.

Now g is increasing, so g⁻¹ is also increasing (by the above), so d(g⁻¹(y))/dy > 0, and thus f_Y(y) = f_X(g⁻¹(y)) |d(g⁻¹(y))/dy| as required.

ii) g decreasing, i.e. u < w ⟺ g(u) > g(w).   (⋆⋆)

(Putting u = g⁻¹(x) and w = g⁻¹(y) gives g⁻¹(x) > g⁻¹(y) ⟺ x < y, so g⁻¹ is also decreasing.)

F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≥ g⁻¹(y))   (put u = X, w = g⁻¹(y) in (⋆⋆))
       = 1 − F_X(g⁻¹(y)).

Thus the p.d.f. of Y is

f_Y(y) = d(1 − F_X(g⁻¹(y)))/dy = −f_X(g⁻¹(y)) · d(g⁻¹(y))/dy.

This time, g is decreasing, so g⁻¹ is also decreasing, and thus −d(g⁻¹(y))/dy = |d(g⁻¹(y))/dy|. So once again,

f_Y(y) = f_X(g⁻¹(y)) |d(g⁻¹(y))/dy|.

4.10 Change of variable for non-monotone functions: non-examinable

Suppose that Y = g(X) and g is not monotone. We wish to find the p.d.f. of Y. We can sometimes do this by using the distribution function directly.

Example: Let X have any distribution, with distribution function F_X(x). Let Y = X². Find the p.d.f. of Y.

Clearly, Y ≥ 0, so F_Y(y) = 0 if y < 0. For y ≥ 0:

F_Y(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = F_X(√y) − F_X(−√y).

So
F_Y(y) = 0 if y < 0;
F_Y(y) = F_X(√y) − F_X(−√y) if y ≥ 0.

So the p.d.f. of Y is

f_Y(y) = dF_Y/dy = d(F_X(√y))/dy − d(F_X(−√y))/dy
       = (1/2) y^{−1/2} F_X′(√y) + (1/2) y^{−1/2} F_X′(−√y)
       = (1/(2√y)) (f_X(√y) + f_X(−√y))   for y ≥ 0.

Thus
f_Y(y) = (1/(2√y)) (f_X(√y) + f_X(−√y)) for y ≥ 0, whenever Y = X².

Example: Let X ∼ Normal(0, 1). This is the familiar bell-shaped distribution (see later). The p.d.f. of X is

f_X(x) = (1/√(2π)) e^{−x²/2}.

Find the p.d.f. of Y = X².

By the result above, Y = X² has p.d.f.

f_Y(y) = (1/(2√y)) (1/√(2π)) (e^{−y/2} + e^{−y/2}) = (1/√(2π)) y^{−1/2} e^{−y/2}   for y ≥ 0.

This is in fact the Chi-squared distribution with ν = 1 degree of freedom. The Chi-squared distribution is a special case of the Gamma distribution (see next section). This example has shown that if X ∼ Normal(0, 1), then Y = X² ∼ Chi-squared(df = 1).
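A quick simulation check of this result: if X ∼ Normal(0, 1) then Y = X² should have mean 1 and variance 2, the moments of the Chi-squared(df = 1) distribution. Python sketch, illustrative only:

```python
import random

# Simulation check that Y = X^2 with X ~ Normal(0, 1) has
# E(Y) = 1 and Var(Y) = 2, matching Chi-squared(df = 1).
random.seed(0)
ys = [random.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
print(mean, var)  # close to 1 and 2
```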


4.11 The Gamma distribution

The Gamma(k, λ) distribution is a very flexible family of distributions. It is defined as the sum of k independent Exponential r.v.s:

if X₁, …, X_k ∼ Exponential(λ) and X₁, …, X_k are independent, then
X₁ + X₂ + … + X_k ∼ Gamma(k, λ).

Special case: when k = 1, Gamma(1, λ) = Exponential(λ) (the 'sum' of a single Exponential r.v.).

Probability density function, f_X(x)

For X ∼ Gamma(k, λ),
f_X(x) = (λ^k / Γ(k)) x^{k−1} e^{−λx} if x ≥ 0, and f_X(x) = 0 otherwise.

Here Γ(k), called the Gamma function of k, is a constant that ensures f_X(x) integrates to 1, i.e. ∫₀^∞ f_X(x) dx = 1. It is defined as

Γ(k) = ∫₀^∞ y^{k−1} e^{−y} dy.

When k is an integer, Γ(k) = (k − 1)!.

Mean and variance of the Gamma distribution:

For X ∼ Gamma(k, λ),   E(X) = k/λ   and   Var(X) = k/λ².

Relationship with the Chi-squared distribution

The Chi-squared distribution with ν degrees of freedom, χ²_ν, is a special case of the Gamma distribution: χ²_ν = Gamma(k = ν/2, λ = 1/2). So if Y ∼ χ²_ν, then E(Y) = k/λ = ν and Var(Y) = k/λ² = 2ν.


Gamma p.d.f.s

[Graphs: Gamma(k, λ) p.d.f.s for k = 1, k = 2, and k = 5.]

Notice: right skew (long right tail); flexibility in shape controlled by the 2 parameters.

Distribution function, F_X(x)

There is no closed form for the distribution function of the Gamma distribution. If X ∼ Gamma(k, λ), then F_X(x) can only be calculated by computer.

Proof that E(X) = k/λ and Var(X) = k/λ² (non-examinable)

EX = ∫₀^∞ x f_X(x) dx
   = ∫₀^∞ x (λ^k/Γ(k)) x^{k−1} e^{−λx} dx
   = (1/(λ Γ(k))) ∫₀^∞ (λx)^k e^{−λx} λ dx
   = (1/(λ Γ(k))) ∫₀^∞ y^k e^{−y} dy   (letting y = λx, dx/dy = 1/λ)
   = (1/(λ Γ(k))) Γ(k + 1)
   = (1/(λ Γ(k))) k Γ(k)   (property of the Gamma function)
   = k/λ.

Var(X) = E(X²) − (EX)²
       = ∫₀^∞ x² f_X(x) dx − k²/λ²
       = ∫₀^∞ (λ^k/Γ(k)) x^{k+1} e^{−λx} dx − k²/λ²
       = (1/(λ² Γ(k))) ∫₀^∞ y^{k+1} e^{−y} dy − k²/λ²   (where y = λx, dx = (1/λ) dy)
       = (1/(λ² Γ(k))) Γ(k + 2) − k²/λ²
       = (1/(λ² Γ(k))) (k + 1) k Γ(k) − k²/λ²
       = (k² + k − k²)/λ²
       = k/λ².

Gamma distribution arising from the Poisson process

Recall that the waiting time between events in a Poisson process with rate λ has the Exponential(λ) distribution. That is, if X_i = time waited between event i − 1 and event i, then X_i ∼ Exp(λ). The time waited from time 0 to the time of the kth event is X₁ + X₂ + … + X_k, the sum of k independent Exponential(λ) r.v.s. Thus the time waited until the kth event in a Poisson process with rate λ has the Gamma(k, λ) distribution.

Note: There are some similarities between the Exponential(λ) distribution and the (discrete) Geometric(p) distribution. Both distributions describe the 'waiting time' before an event. In the same way, the Gamma(k, λ) distribution is similar to the (discrete) Negative Binomial(k, p) distribution, as they both describe the 'waiting time' before the kth event.

4.12 The Beta Distribution: non-examinable

The Beta distribution has two parameters, α and β. We write X ∼ Beta(α, β).

P.d.f.: f(x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1} for 0 < x < 1, and f(x) = 0 otherwise.

The function B(α, β) is the Beta function and is defined by the integral

B(α, β) = ∫₀¹ x^{α−1} (1 − x)^{β−1} dx,   for α > 0, β > 0.

It can be shown that B(α, β) = Γ(α)Γ(β)/Γ(α + β).

Chapter 5: The Normal Distribution and the Central Limit Theorem


The Normal distribution is the familiar bell-shaped distribution. It is probably the most important distribution in statistics, mainly because of its link with the Central Limit Theorem, which states that any large sum of independent, identically distributed random variables is approximately Normal:

X₁ + X₂ + … + Xₙ ≈ Normal if X₁, …, Xₙ are i.i.d. and n is large.

Before studying the Central Limit Theorem, we look at the Normal distribution and some of its general properties.

5.1 The Normal Distribution

The Normal distribution has two parameters, the mean, μ, and the variance, σ². μ and σ² satisfy −∞ < μ < ∞, σ² > 0. We write X ∼ Normal(μ, σ²), or X ∼ N(μ, σ²).

Probability density function, f_X(x)

f_X(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}   for −∞ < x < ∞.

Distribution function, F_X(x)

There is no closed form for the distribution function of the Normal distribution. If X ∼ Normal(μ, σ²), then F_X(x) can only be calculated by computer.
R command: F_X(x) = pnorm(x, mean=μ, sd=sqrt(σ²)).

[Graphs: the Normal probability density function f_X(x) and distribution function F_X(x).]

Mean and Variance

For X ∼ Normal(μ, σ²),   E(X) = μ,   Var(X) = σ².

Linear transformations

If X ∼ Normal(μ, σ²), then for any constants a and b,

aX + b ∼ Normal(aμ + b, a²σ²).

In particular,

(X − μ)/σ ∼ Normal(0, 1).


Proof: Let a = 1/σ and b = −μ/σ. Let Z = aX + b = (X − μ)/σ. Then

Z ∼ Normal(aμ + b, a²σ²) = Normal(μ/σ − μ/σ, σ²/σ²) = Normal(0, 1).

Z ∼ Normal(0, 1) is called the standard Normal random variable.

General proof that aX + b ∼ Normal(aμ + b, a²σ²):

Let X ∼ Normal(μ, σ²), and let Y = aX + b. We wish to find the distribution of Y. Use the change of variable technique.

1) y(x) = ax + b is monotone, so we can apply the Change of Variable technique.
2) Let y = y(x) = ax + b for −∞ < x < ∞.
3) Then x = x(y) = (y − b)/a for −∞ < y < ∞.
4) |dx/dy| = |1/a| = 1/|a|.
5) So f_Y(y) = f_X(x(y)) |dx/dy| = f_X((y − b)/a) · (1/|a|).   (⋆)

But X ∼ Normal(μ, σ²), so f_X(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}. Thus

f_X((y − b)/a) = (1/√(2πσ²)) e^{−((y−b)/a − μ)²/(2σ²)} = (1/√(2πσ²)) e^{−(y − (aμ+b))²/(2a²σ²)}.

Returning to (⋆),

f_Y(y) = f_X((y − b)/a) · (1/|a|) = (1/√(2πa²σ²)) e^{−(y − (aμ+b))²/(2a²σ²)}   for −∞ < y < ∞.

But this is the p.d.f. of a Normal(aμ + b, a²σ²) random variable.

So, if X ∼ Normal(μ, σ²), then aX + b ∼ Normal(aμ + b, a²σ²).

Sums of Normal random variables

If X and Y are independent, and X ∼ Normal(μ₁, σ₁²), Y ∼ Normal(μ₂, σ₂²), then

X + Y ∼ Normal(μ₁ + μ₂, σ₁² + σ₂²).

More generally, if X₁, X₂, …, Xₙ are independent, and X_i ∼ Normal(μ_i, σ_i²) for i = 1, …, n, then

a₁X₁ + a₂X₂ + … + aₙXₙ ∼ Normal(a₁μ₁ + … + aₙμₙ, a₁²σ₁² + … + aₙ²σₙ²).

For mathematicians: properties of the Normal distribution

1. Proof that ∫_{−∞}^{∞} f_X(x) dx = 1.

The full proof that

∫_{−∞}^{∞} f_X(x) dx = ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} dx = 1

relies on the following result.
FACT: ∫_{−∞}^{∞} e^{−y²} dy = √π.
This result is non-trivial to prove. See Calculus courses for details. Using this result, the proof that ∫ f_X(x) dx = 1 follows by using the change of variable y = (x − μ)/(√2 σ) in the integral.

2. Proof that E(X) = μ.

E(X) = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{−∞}^{∞} x (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} dx.

Change variable of integration: let z = (x − μ)/σ; then x = σz + μ and dx/dz = σ. Thus

E(X) = ∫_{−∞}^{∞} (σz + μ) (1/√(2π)) e^{−z²/2} dz
     = σ ∫_{−∞}^{∞} (z/√(2π)) e^{−z²/2} dz + μ ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz.

The first integrand is an odd function of z (i.e. g(−z) = −g(z)), so it integrates to 0 over the range −∞ to ∞. The second integral is the p.d.f. of N(0, 1), which integrates to 1. Thus E(X) = σ · 0 + μ · 1 = μ.

3. Proof that Var(X) = σ².

Var(X) = E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)² (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} dx
       = σ² ∫_{−∞}^{∞} (z²/√(2π)) e^{−z²/2} dz   (putting z = (x − μ)/σ)
       = σ² ( [−(z/√(2π)) e^{−z²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz )   (integration by parts)
       = σ² {0 + 1}
       = σ².


5.2 The Central Limit Theorem (CLT)
also known as… the Piece of Cake Theorem.

The Central Limit Theorem (CLT) is one of the most fundamental results in statistics. In its simplest form, it states that if a large number of independent random variables are drawn from any distribution, then the distribution of their sum (or alternatively their sample average) always converges to the Normal distribution.

Theorem (The Central Limit Theorem):

Let X₁, …, Xₙ be independent r.v.s with mean μ and variance σ², from ANY distribution. For example, X_i ∼ Binomial(N, p) for each i, so μ = Np and σ² = Np(1 − p).

Then the sum Sₙ = X₁ + … + Xₙ = ∑ᵢ₌₁ⁿ X_i has a distribution that tends to Normal as n → ∞.

The mean of the Normal distribution is E(Sₙ) = ∑ᵢ₌₁ⁿ E(X_i) = nμ.
The variance of the Normal distribution is

Var(Sₙ) = Var(∑ᵢ₌₁ⁿ X_i) = ∑ᵢ₌₁ⁿ Var(X_i) = nσ²,   because X₁, …, Xₙ are independent.

So

Sₙ = X₁ + X₂ + … + Xₙ → Normal(nμ, nσ²) as n → ∞.
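A small simulation of the statement E(Sₙ) = nμ and Var(Sₙ) = nσ² (Python for illustration, with X_i ∼ Uniform(0, 1), so μ = 1/2 and σ² = 1/12):

```python
import random

# Simulate S_n = X_1 + ... + X_n for X_i ~ Uniform(0, 1), n = 30.
# Theory: E(S_n) = n*mu = 15 and Var(S_n) = n*sigma^2 = 30/12 = 2.5.
random.seed(0)
n = 30
sums = [sum(random.random() for _ in range(n)) for _ in range(50_000)]
mean = sum(sums) / len(sums)
var = sum((s - mean) ** 2 for s in sums) / len(sums)
print(mean, var)  # close to 15 and 2.5
```

A histogram of these 50 000 sums would look very close to the Normal(15, 2.5) curve, which is the point of the theorem.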


Notes:
1. This is a remarkable theorem, because the limit holds for any distribution of X₁, …, Xₙ.
2. A sufficient condition on X for the Central Limit Theorem to apply is that Var(X) is finite. Other versions of the Central Limit Theorem relax the conditions that X₁, …, Xₙ are independent and have the same distribution.
3. The speed of convergence of Sₙ to the Normal distribution depends upon the distribution of X. Skewed distributions converge more slowly than symmetric, Normal-like distributions. It is usually safe to assume that the Central Limit Theorem applies whenever n ≥ 30. It might apply for as little as n = 4.

The Central Limit Theorem in action: simulation studies

The following simulation study illustrates the Central Limit Theorem, making use of several of the techniques learnt in STATS 210. We will look particularly at how fast the distribution of Sₙ converges to the Normal distribution.

Example 1: Triangular distribution: f_X(x) = 2x for 0 < x < 1.

[Graph: f(x) = 2x on (0, 1).]

Find E(X) and Var(X):

μ = E(X) = ∫₀¹ x f_X(x) dx = ∫₀¹ 2x² dx = [2x³/3]₀¹ = 2/3.

σ² = Var(X) = E(X²) − {E(X)}² = ∫₀¹ x² f_X(x) dx − 4/9 = ∫₀¹ 2x³ dx − 4/9 = [2x⁴/4]₀¹ − 4/9 = 1/2 − 4/9 = 1/18.

Let Sₙ = X₁ + … + Xₙ where X₁, …, Xₙ are independent. Then

E(Sₙ) = E(X₁ + … + Xₙ) = nμ = 2n/3,
Var(Sₙ) = Var(X₁ + … + Xₙ) = nσ² = n/18   (by independence).

So Sₙ ≈ Normal(2n/3, n/18) for large n, by the Central Limit Theorem.

The graph shows histograms of 10 000 values of Sₙ = X₁ + … + Xₙ for n = 1, 2, 3, and 10. The Normal p.d.f. Normal(nμ, nσ²) = Normal(2n/3, n/18) is superimposed across the top. Even for n as low as 10, the Normal curve is a very good approximation.
[Figure: histograms of 10 000 simulated values of Sₙ for n = 1, 2, 3, and 10, with the Normal(2n/3, n/18) p.d.f. superimposed on each.]

Example 2: U-shaped distribution: f_X(x) = (3/2)x² for −1 < x < 1.

[Graph: f(x) = (3/2)x² on (−1, 1).]

We find that E(X) = μ = 0 and Var(X) = σ² = 3/5. (Exercise)

Let Sₙ = X₁ + … + Xₙ where X₁, …, Xₙ are independent. Then

E(Sₙ) = E(X₁ + … + Xₙ) = nμ = 0,
Var(Sₙ) = Var(X₁ + … + Xₙ) = nσ² = 3n/5   (by independence).

So Sₙ ≈ Normal(0, 3n/5) for large n, by the Central Limit Theorem.

Even with this highly non-Normal distribution for X, the Normal curve provides a good approximation to Sₙ = X₁ + … + Xₙ for n as small as 10.

[Figure: histograms of Sₙ for n = 1, 2, 3, and 10, with the Normal(0, 3n/5) p.d.f. superimposed on each.]

Normal approximation to the Binomial distribution, using the CLT

Let Y ∼ Binomial(n, p). We can think of Y as the sum of n Bernoulli random variables: Y = X₁ + X₂ + … + Xₙ, where

X_i = 1 if trial i is a success (probability p), and X_i = 0 otherwise (probability 1 − p).

So Y = X₁ + … + Xₙ, and each X_i has μ = E(X_i) = p and σ² = Var(X_i) = p(1 − p). Thus by the CLT,

Y = X₁ + X₂ + … + Xₙ → Normal(nμ, nσ²) = Normal(np, np(1 − p)).

Thus,

Bin(n, p) → Normal(np, np(1 − p))   (mean and variance of Bin(n, p))

as n → ∞ with p fixed.

The Binomial distribution is therefore well approximated by the Normal distribution when n is large, for any fixed value of p. The Normal distribution is also a good approximation to the Poisson(λ) distribution when λ is large:

Poisson(λ) ≈ Normal(λ, λ) when λ is large.

[Graphs: the Binomial(n = 100, p = 0.5) and Poisson(λ = 100) probability functions; both are closely matched by Normal curves.]
Why the 'Piece of Cake' Theorem?

The Central Limit Theorem makes whole realms of statistics into a piece of cake. After seeing a theorem this good, you deserve a piece of cake!

Example: Remember the margin of error for an opinion poll?

An opinion pollster wishes to estimate the level of support for Labour in an upcoming election. She interviews n people about their voting preferences. Let p be the true, unknown level of support for the Labour party in New Zealand. Let X be the number of the n people interviewed by the opinion pollster who plan to vote Labour. Then X ∼ Binomial(n, p).

At the end of Chapter 2, we said that the maximum likelihood estimator for p is p̂ = X/n.

In a large sample (large n), we now know that X ≈ Normal(np, npq), where q = 1 − p. So

p̂ = X/n ≈ Normal(p, pq/n)   (linear transformation of a Normal r.v.).

So

(p̂ − p)/√(pq/n) ≈ Normal(0, 1).

Now if Z ∼ Normal(0, 1), we find (using a computer) that the 95% central probability region of Z is from −1.96 to +1.96: P(−1.96 < Z < 1.96) = 0.95.
Check in R: pnorm(1.96, mean=0, sd=1) - pnorm(-1.96, mean=0, sd=1)

Putting Z = (p̂ − p)/√(pq/n), we obtain

P(−1.96 < (p̂ − p)/√(pq/n) < 1.96) ≈ 0.95.

Rearranging:

P(p̂ − 1.96√(pq/n) < p < p̂ + 1.96√(pq/n)) ≈ 0.95.

This enables us to form an estimated 95% confidence interval for the unknown parameter p:

estimated 95% confidence interval: p̂ − 1.96√(p̂(1 − p̂)/n)   to   p̂ + 1.96√(p̂(1 − p̂)/n).

The 95% confidence interval has RANDOM end-points, which depend on p̂. About 95% of the time, these random end-points will enclose the true unknown value p. Confidence intervals are extremely important for helping us to assess how useful our estimate is.

A narrow confidence interval suggests a useful estimate (low variance); a wide confidence interval suggests a poor estimate (high variance). Next time you see the newspapers quoting the margin of error on an opinion poll:
Remember: margin of error = 1.96√(p̂(1 − p̂)/n);
Think: Central Limit Theorem!
Have: a piece of cake.
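The margin-of-error formula is easy to evaluate. The sketch below (Python; the poll size and p̂ are invented illustrative numbers) reproduces the familiar 'plus or minus about 3%' for a poll of 1000 people:

```python
import math

# Margin of error for an opinion poll: 1.96 * sqrt(p_hat*(1-p_hat)/n).
# n = 1000 respondents and p_hat = 0.5 are illustrative values.
n = 1000
p_hat = 0.5
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)
print(round(margin, 3), ci)  # margin ~ 0.031, i.e. "plus or minus 3.1%"
```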


Using the Central Limit Theorem to find the distribution of the mean, X̄

Let X₁, …, Xₙ be independent, identically distributed with mean E(X_i) = μ and variance Var(X_i) = σ² for all i. The sample mean, X̄, is defined as:

X̄ = (X₁ + X₂ + … + Xₙ)/n.

So X̄ = Sₙ/n, where Sₙ = X₁ + … + Xₙ ≈ Normal(nμ, nσ²) by the CLT. Because X̄ is a scalar multiple of a Normal r.v. as n grows large, X̄ itself is approximately Normal for large n:

X̄ = (X₁ + X₂ + … + Xₙ)/n ≈ Normal(μ, σ²/n)   as n → ∞.

The following three statements of the Central Limit Theorem are equivalent:

X̄ = (X₁ + X₂ + … + Xₙ)/n ≈ Normal(μ, σ²/n) as n → ∞;
Sₙ = X₁ + X₂ + … + Xₙ ≈ Normal(nμ, nσ²) as n → ∞;
(Sₙ − nμ)/√(nσ²) = (X̄ − μ)/√(σ²/n) ≈ Normal(0, 1) as n → ∞.

The essential point to remember about the Central Limit Theorem is that large sums or sample means of independent random variables converge to a Normal distribution, whatever the distribution of the original r.v.s.

More general version of the CLT

A more general form of the CLT states that, if X₁, …, Xₙ are independent, and E(X_i) = μ_i, Var(X_i) = σ_i² (not necessarily all equal), then

Zₙ = (∑ᵢ₌₁ⁿ (X_i − μ_i)) / √(∑ᵢ₌₁ⁿ σ_i²) → Normal(0, 1) as n → ∞.

Other versions of the CLT relax the condition that X₁, …, Xₙ are independent.

Chapter 6: Wrapping Up
Probably the two major ideas of this course are: likelihood and estimation; hypothesis testing.


Most of the techniques that we have studied along the way are to help us with these two goals: expectation, variance, distributions, change of variable, and the Central Limit Theorem. Let's see how these different ideas all come together.

6.1 Estimators: the good, the bad, and the estimator p.d.f.

We have seen that an estimator is a capital letter replacing a small letter. What's the point of that?

Example: Let X ∼ Binomial(n, p) with known n and observed value X = x. The maximum likelihood estimate of p is p̂ = x/n. The maximum likelihood estimator of p is p̂ = X/n.

Example: Let X ∼ Exponential(λ) with observed value X = x. The maximum likelihood estimate of λ is λ̂ = 1/x. The maximum likelihood estimator of λ is λ̂ = 1/X.

Why are we interested in estimators? The answer is that estimators are random variables. This means they have distributions, means, and variances that tell us how well we can trust our single observation, or estimate, from this distribution.


Good and bad estimators

Suppose that X₁, X₂, …, Xₙ are independent, and X_i ∼ Exponential(λ) for all i. λ is unknown, and we wish to estimate it. In Chapter 4 we calculated the maximum likelihood estimator of λ:

λ̂ = 1/X̄ = n/(X₁ + X₂ + … + Xₙ).

Now λ̂ is a random variable with a distribution. For a given value of n, we can calculate the p.d.f. of λ̂. How?

We know that T = X₁ + … + Xₙ ∼ Gamma(n, λ) when X_i ∼ i.i.d. Exponential(λ). So we know the p.d.f. of T. Now λ̂ = n/T. So we can find the p.d.f. of λ̂ using the change of variable technique.

Here are the p.d.f.s of λ̂ for two different values of n:
Estimator 1: n = 100, i.e. 100 pieces of information about λ.
Estimator 2: n = 10, i.e. 10 pieces of information about λ.

[Graph: the p.d.f. of Estimator 1 is tightly peaked about the true λ (unknown); the p.d.f. of Estimator 2 is much more spread out.]

Clearly, the more information we have, the better. The p.d.f. for n = 100 is focused much more tightly about the true value λ (unknown) than the p.d.f. for n = 10.


It is important to recognise what we do and don't know in this situation:

What we don't know: the true λ; WHERE we are on the p.d.f. curve.
What we do know: the p.d.f. curve; that we are SOMEWHERE on that curve.

So we need an estimator such that EVERYWHERE on the estimator's p.d.f. curve is good!
[Figure: the p.d.f.s f(λ̂) of Estimator 1 and Estimator 2 again, with the true λ (unknown) marked.]

This is why we are so concerned with estimator variance. A good estimator has low estimator variance: everywhere on the estimator's p.d.f. curve is guaranteed to be good. A poor estimator has high estimator variance: some places on the estimator's p.d.f. curve may be good, while others may be very bad. Because we don't know where we are on the curve, we can't trust any estimate from this poor estimator. The estimator variance tells us how much the estimator can be trusted.

Note: We were lucky in this example to happen to know that T = X1 + ... + Xn ~ Gamma(n, λ) when Xi ~ i.i.d. Exponential(λ), so we could find the p.d.f. of our estimator λ̂ = n/T. We won't usually be so lucky: so what should we do?

Use the Central Limit Theorem!
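Before moving on, the effect of sample size on the estimator's spread can also be seen by simulation. This is an illustrative sketch only (the true λ = 2 and the number of replications are assumptions chosen for the demonstration, not values from the notes): it draws many Exponential samples and computes λ̂ = n/(X1 + ... + Xn) for each.

```python
import random
import statistics

# Illustrative simulation (assumed setup): true lambda = 2,
# with 5000 simulated samples for each sample size.
random.seed(1)
LAM_TRUE = 2.0
REPS = 5000

def simulate_mle(n):
    """Draw REPS samples of size n from Exponential(LAM_TRUE) and
    return the MLE lambda-hat = n / (X1 + ... + Xn) for each."""
    mles = []
    for _ in range(REPS):
        total = sum(random.expovariate(LAM_TRUE) for _ in range(n))
        mles.append(n / total)
    return mles

mles_10 = simulate_mle(10)    # Estimator 2: 10 pieces of information
mles_100 = simulate_mle(100)  # Estimator 1: 100 pieces of information

for n, mles in ((10, mles_10), (100, mles_100)):
    print(f"n = {n:3d}: mean of lambda-hat = {statistics.mean(mles):.3f}, "
          f"sd = {statistics.stdev(mles):.3f}")
```

The standard deviation of λ̂ for n = 100 comes out several times smaller than for n = 10, matching the much tighter p.d.f. curve in the figure above.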


Example: calculating the maximum likelihood estimator

The following question is in the same style as the exam questions.

Let X be a continuous random variable with probability density function

    fX(x) = 2(s − x)/s²   for 0 < x < s,
            0             otherwise.

Here, s is a parameter to be estimated, where s is the maximum value of X and s > 0.

(a) Show that E(X) = s/3.

    Use E(X) = ∫₀ˢ x fX(x) dx = (2/s²) ∫₀ˢ (sx − x²) dx.

(b) Show that E(X²) = s²/6.

    Use E(X²) = ∫₀ˢ x² fX(x) dx = (2/s²) ∫₀ˢ (sx² − x³) dx.

(c) Find Var(X).

    Use Var(X) = E(X²) − (EX)².  Answer: Var(X) = s²/18.

(d) Suppose that we make a single observation X = x. Write down the likelihood function, L(s; x), and state the range of values of s for which your answer is valid.

    L(s; x) = 2(s − x)/s²   for x < s < ∞.

(e) The likelihood graph for a particular value of x is shown here. Show that the maximum likelihood estimator of s is ŝ = 2X. You should refer to the graph in your answer.

[Figure: likelihood L(s; x) plotted against s, with axis ticks at s = 3 and s = 6 and likelihood values from 0.0 to 0.15.]


L(s; x) = 2s⁻²(s − x).

So

    dL/ds = 2[−2s⁻³(s − x) + s⁻²]
          = 2s⁻³[−2(s − x) + s]
          = (2/s³)(2x − s).

At the MLE, dL/ds = 0, so s = ∞ or s = 2x.

From the graph, we can see that s = ∞ is not the maximum. So ŝ = 2x. Thus the maximum likelihood estimator is

    ŝ = 2X.
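As a quick numerical check (illustrative only; the observation is taken as x = 3 to match the graph), we can evaluate the likelihood from part (d) on a fine grid of s values and confirm that it peaks at s = 2x:

```python
# Likelihood from part (d): L(s; x) = 2(s - x)/s^2 for s > x.
x = 3.0

def likelihood(s):
    return 2 * (s - x) / s**2

# Grid of s values in (x, 20]; the argmax should be near s = 2x = 6.
grid = [x + 0.001 * k for k in range(1, 17001)]
s_hat = max(grid, key=likelihood)
print(round(s_hat, 3))
```

The grid maximiser lands at s ≈ 6, agreeing with the calculus above.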

(f) Find the estimator variance, Var(ŝ), in terms of s. Hence find the estimated variance, V̂ar(ŝ), in terms of ŝ.

    Var(ŝ) = Var(2X)
           = 2² Var(X)
           = 4 × s²/18     (by (c))
           = 2s²/9.

So also:

    V̂ar(ŝ) = 2ŝ²/9.


(g) Suppose we make the single observation X = 3. Find the maximum likelihood estimate of s, and its estimated variance and standard error.

    ŝ = 2X = 2 × 3 = 6.
    V̂ar(ŝ) = 2ŝ²/9 = (2 × 6²)/9 = 8.
    se(ŝ) = √V̂ar(ŝ) = √8 = 2.83.

This means ŝ is a POOR estimator: the twice standard-error interval would be 6 − 2 × 2.83 to 6 + 2 × 2.83: that is, 0.34 to 11.66! Taking the twice standard error interval strictly applies only to the Normal distribution, but it is a useful rule of thumb to see how good the estimator is.

(h) Write a sentence in plain English to explain what the maximum likelihood estimate from part (g) represents.

The value ŝ = 6 is the value of s under which the observation X = 3 is more likely than it is at any other value of s.
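The arithmetic of parts (f) and (g) can be sketched in a few lines (a check only; the formulas come from the worked answers above):

```python
import math

x_obs = 3                      # the single observation X = 3
s_hat = 2 * x_obs              # MLE from part (e): s-hat = 2X
var_hat = 2 * s_hat**2 / 9     # estimated variance from part (f)
se = math.sqrt(var_hat)        # standard error = sqrt(estimated variance)
lower = s_hat - 2 * se         # rough "twice standard error" interval
upper = s_hat + 2 * se
print(s_hat, var_hat, round(se, 2), round(lower, 2), round(upper, 2))
# prints: 6 8.0 2.83 0.34 11.66
```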

6.2 Hypothesis tests: in search of a distribution

When we do a hypothesis test, we need a test statistic: some random variable with a distribution that we can specify exactly under H0 and that differs under H1. It is finding the distribution that is the difficult part.

Weird coin: is my coin fair? Let X be the number of heads out of 10 tosses. Then X ~ Binomial(10, p). We have an easy distribution and can do a hypothesis test.

Too many daughters? Do divers have more daughters than sons? Let X be the number of daughters out of 190 diver children. Then X ~ Binomial(190, p). Easy.
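For the coin example, the Binomial distribution makes the test directly computable. A minimal sketch (the observed count x = 9 and the convention of doubling the upper tail for a two-sided p-value are assumptions made for illustration, not taken from the notes):

```python
from math import comb

n, p0 = 10, 0.5          # 10 tosses; H0: the coin is fair (p = 0.5)
x = 9                    # hypothetical observation: 9 heads out of 10

# P(X >= x) under H0, summing Binomial(10, 0.5) probabilities.
upper_tail = sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
                 for k in range(x, n + 1))
p_value = min(1.0, 2 * upper_tail)   # simple two-sided convention
print(round(p_value, 4))             # prints: 0.0215
```

A p-value this small would lead us to doubt that the coin is fair.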


Too long between volcanoes? Let X be the length of time between volcanic eruptions. If we assume volcanoes occur as a Poisson process, then X ~ Exponential(λ). We have a simple distribution and test statistic (X): we can test the observed length of time between eruptions and see if this is a believable observation under a hypothesized value of λ.

More advanced tests

Most things in life are not as easy as the three examples above. Here are some observations. Do they come from a distribution (any distribution) with mean 0?

3.96 2.32 -1.81 -0.14 3.22 1.37 -0.17 1.85 0.61 -0.58 1.07 -0.52 0.40 1.54 -1.42 -0.85 0.51 1.66 1.48 1.54

Answer: yes, they are Normal(0, 4), but how can we tell? What about these? 3.3 -30.0 -8.1 8.1 -7.8 -9.0 3.4 -1.3 8.1 -13.7 12.6 -5.0 -9.6 -6.6 1.4 -5.6 -6.4 -11.8 2.5 9.0

Again, yes they do (Normal(0, 100) this time), but how can we tell? The unknown variance (4 versus 100) interferes, so that the second sample does not cluster about its mean of 0 at all. What test statistic should we use? If we don't know that our data are Normal, and we don't know their underlying variance, what can we use as our X to test whether μ = 0?

Answer: a clever person called W. S. Gosset (1876–1937) worked out an answer. He called himself only "Student", possibly because he (or his employers) wanted it to be kept secret that he was doing his statistical research as part of his employment at the Guinness Brewery. The test that Student developed is the familiar Student's t-test. It was originally developed to help Guinness decide how large a sample of people should be used in its beer tastings!


Student used the following test statistic for the unknown mean, μ:

    T = X̄ / √( Σᵢ(Xᵢ − X̄)² / (n(n − 1)) ),

where the sum runs over i = 1, ..., n. Under H0: μ = 0, the distribution of T is known: T has p.d.f.

    fT(t) = Γ(n/2) / [ Γ((n − 1)/2) √(π(n − 1)) ] × (1 + t²/(n − 1))^(−n/2)   for −∞ < t < ∞.

This is the Student's t-distribution with n − 1 degrees of freedom, derived as the ratio of a Normal random variable to the square root of an independent Chi-squared random variable divided by its degrees of freedom. If μ ≠ 0, observations of T will tend to lie out in the tails of this distribution.

The Student's t-test is exact when the distribution of the original data X1, ..., Xn is Normal. For other distributions, it is still approximately valid in large samples, by the Central Limit Theorem.

It looks difficult... It is! Most of the statistical tests in common use have deep (and sometimes quite impenetrable) theory behind them. As you can probably guess, Student did not derive the distribution above without a great deal of hard work. The result, however, is astonishing. With the help of our best friend the Central Limit Theorem, Student's T-statistic gives us a test for μ = 0 (or any other value) that can be used with any large enough sample.

The Chi-squared test for testing proportions in a contingency table also has a deep theory, but once researchers had derived the distribution of a suitable test statistic, the rest was easy. In the Chi-squared goodness-of-fit test, Pearson's chi-squared test statistic is shown to have a Chi-squared distribution under H0. It produces larger values under H1.

One interesting point to note is the pivotal role of the Central Limit Theorem in all of this. The Central Limit Theorem produces approximate Normal distributions. Normal random variables squared produce Chi-squared random variables. Normals divided by (square roots of) Chi-squareds produce t-distributed random variables. A ratio of two Chi-squared random variables produces an F-distributed random variable. None of this is coincidental: the Central Limit Theorem rocks!
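Student's T can be computed straight from the formula above. A small sketch (the data values and the null mean μ0 = 0 are made up for illustration; in practice the observed T would be compared against the t-distribution with n − 1 degrees of freedom):

```python
import math

def t_statistic(xs, mu0=0.0):
    """T = (xbar - mu0) / sqrt( sum((xi - xbar)^2) / (n(n-1)) )."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((xi - xbar) ** 2 for xi in xs)
    return (xbar - mu0) / math.sqrt(ss / (n * (n - 1)))

# Toy sample of n = 5 (made up): xbar = 3, sum of squared deviations = 10,
# so T = 3 / sqrt(10/20) = 3 * sqrt(2) = 4.2426...
t = t_statistic([1, 2, 3, 4, 5])
print(round(t, 4))  # prints: 4.2426
```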
