
INTRODUCTION TO STATISTICS

William J. Anderson
McGill University
Contents

1 Sampling Distributions
  1.1 The Basic Distributions
  1.2 Applications of these Distributions
  1.3 Order Statistics
2 Estimation
  2.1 Methods of Estimation
    2.1.1 Maximum Likelihood Estimation
    2.1.2 Method of Moments
    2.1.3 Bayesian Estimation
  2.2 Properties of Estimators
    2.2.1 Unbiasedness and Efficiency
    2.2.2 Consistency
    2.2.3 Sufficiency
  2.3 Minimum Variance Revisited
3 Confidence Intervals
4 Theory of Hypothesis Testing
  4.1 Introduction and Definitions
  4.2 How to Choose the Critical Region - Case of Simple Hypotheses
  4.3 How to Choose the Critical Region - Case of Composite Hypotheses
  4.4 Some Last Topics
    4.4.1 Large Sample Tests
    4.4.2 p-values
    4.4.3 Bayesian Tests of Hypotheses
    4.4.4 Relationship Between Tests and Confidence Sets
5 Hypothesis Testing: Applications
  5.1 The Bivariate Normal Distribution
  5.2 Correlation Analysis
  5.3 Normal Regression Analysis
6 Linear Models
  6.1 Regression
  6.2 Experimental Design
    6.2.1 The Completely Randomized Design
    6.2.2 Randomized Block Designs
7 Chi-Square Tests
  7.1 Tests Concerning k Independent Binomial Populations
  7.2 Chi-Square Test for the Parameters of a Multinomial Distribution
  7.3 Goodness of Fit Tests
  7.4 Contingency Tables
8 Non-Parametric Methods of Inference
  8.1 The Sign Test
  8.2 The Mann-Whitney, or U-Test
  8.3 Tests for Randomness Based on Runs
Chapter 1
Sampling Distributions
Reference: WMS 7th ed., chapter 7
1.1 The Basic Distributions.
Definition. A random variable X with density function given by
\[
f(x) = \begin{cases} \dfrac{1}{2^{n/2}\Gamma(n/2)}\, x^{(n/2)-1} e^{-x/2} & \text{if } x > 0,\\[4pt] 0 & \text{otherwise,} \end{cases}
\]
where $n \ge 1$ is an integer, is said to have a chi-square distribution with n degrees of freedom (df). Briefly, we write $X \sim \chi^2_n$.

The chi-square distribution is just a gamma distribution with $\alpha = n/2$ and $\beta = 2$. We therefore have
\[
E(X) = n, \qquad \operatorname{Var}(X) = 2n, \qquad M_X(t) = \frac{1}{(1-2t)^{n/2}}.
\]
Also, $(X - n)/\sqrt{2n}$ is approximately $N(0,1)$ for large n, by the CLT.
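As a quick numerical sketch (not part of the original notes; it assumes numpy and scipy are installed, and the choice of n is arbitrary), the following Python code checks these moments and the CLT approximation:

    import numpy as np
    from scipy import stats

    n = 8                                   # degrees of freedom (arbitrary choice)
    X = stats.chi2(n)
    print(X.mean(), X.var())                # prints n and 2n: 8.0, 16.0

    # CLT check: P((X - n)/sqrt(2n) <= 1) should be close to Phi(1) for large n
    big = stats.chi2(500)
    print(big.cdf(500 + np.sqrt(1000)), stats.norm.cdf(1.0))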


Proposition 1.1.1
(1) Let $X_1, X_2, \dots, X_m$ be independent chi-square random variables, with $n_1, n_2, \dots, n_m$ degrees of freedom, and let $X = X_1 + \dots + X_m$. Then X has the chi-square distribution with $n = n_1 + \dots + n_m$ d.f.
(2) Suppose that $X = X_1 + X_2$, where $X_1$ and $X_2$ are independent and X and $X_1$ have distributions $\chi^2_n$ and $\chi^2_{n_1}$ respectively, where $n_1 < n$. Then $X_2 \sim \chi^2_{n_2}$, where $n_2 = n - n_1$.

Proof. We have
\[
M_X(t) = M_{X_1}(t) M_{X_2}(t) \cdots M_{X_m}(t) = \frac{1}{(1-2t)^{n_1/2}} \cdots \frac{1}{(1-2t)^{n_m/2}} = \frac{1}{(1-2t)^{n/2}},
\]
so X is as claimed. The proof of (2) is similar.
Proposition 1.1.2 Let $X_1, X_2, \dots, X_n$ be i.i.d., each with distribution $N(0,1)$, and let $Y = X_1^2 + \dots + X_n^2$. Then $Y \sim \chi^2_n$.

Proof. If $Z \sim N(0,1)$, then
\[
\Pr\{Z^2 \le w\} = \Pr\{-\sqrt{w} \le Z \le \sqrt{w}\} = 2\Pr\{0 \le Z \le \sqrt{w}\} = \frac{2}{\sqrt{2\pi}} \int_0^{\sqrt{w}} e^{-u^2/2}\,du = \int_0^w \frac{v^{-1/2} e^{-v/2}}{\sqrt{2\pi}}\,dv,
\]
so that (because $\Gamma(1/2) = \sqrt{\pi}$) we have $Z^2 \sim \chi^2_1$. The result then follows from Proposition 1.1.1.
Definition. Given $\alpha$ with $0 < \alpha < 1$, we define $\chi^2_{\alpha,n}$ to be the unique number such that
\[
\Pr\{X > \chi^2_{\alpha,n}\} = \alpha,
\]
where X is a $\chi^2$ random variable with n degrees of freedom. $\chi^2_{\alpha,n}$ is called a critical value of the $\chi^2$-distribution.
Definition. A random variable T with density function given by
\[
f(t) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)}\left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}, \qquad -\infty < t < +\infty,
\]
where $n \ge 1$ is an integer, is said to have the t-distribution with n degrees of freedom. Briefly, we write $T \sim t_n$.

Remarks. The t density function is symmetric about 0 and very similar to the standard normal density function, except that it is lower in the middle and has fatter tails. In fact, it is easy to see that the density f(t) tends to the standard normal density as $n \to \infty$.
Proposition 1.1.3 Let $X \sim N(0,1)$ and $Y \sim \chi^2_n$ be independent. Then
\[
T = \frac{X}{\sqrt{Y/n}}
\]
has the t-distribution with n degrees of freedom.

Definition. Given $\alpha$ with $0 < \alpha < 1$, we define $t_{\alpha,n}$ to be the unique number such that
\[
\Pr\{T > t_{\alpha,n}\} = \alpha,
\]
where T is a t-random variable with n degrees of freedom. $t_{\alpha,n}$ is called a critical value of the t-distribution.
Definition. A random variable Y with density function given by
\[
g(y) = \begin{cases} \dfrac{\Gamma\!\left(\frac{n_1+n_2}{2}\right)}{\Gamma\!\left(\frac{n_1}{2}\right)\Gamma\!\left(\frac{n_2}{2}\right)}\left(\frac{n_1}{n_2}\right)^{n_1/2} y^{n_1/2-1}\left(1 + \frac{n_1}{n_2}\,y\right)^{-\frac{n_1+n_2}{2}} & \text{if } y > 0,\\[6pt] 0 & \text{if } y \le 0, \end{cases}
\]
where $n_1, n_2 \ge 1$ are integers, is said to have the F-distribution with $n_1, n_2$ degrees of freedom. Briefly, we write $Y \sim F_{n_1,n_2}$.
Proposition 1.1.4 Let $X_1 \sim \chi^2_{n_1}$ and $X_2 \sim \chi^2_{n_2}$ be independent. Then
\[
Y = \frac{X_1/n_1}{X_2/n_2}
\]
has the F-distribution with $n_1, n_2$ degrees of freedom.

Corollary 1.1.5 If $Y \sim F_{n_1,n_2}$, then $1/Y \sim F_{n_2,n_1}$.

Definition. Given $\alpha$ with $0 < \alpha < 1$, we define $F_{\alpha,n_1,n_2}$ to be the unique number such that
\[
\Pr\{Y > F_{\alpha,n_1,n_2}\} = \alpha,
\]
where Y is an F-random variable with $n_1, n_2$ degrees of freedom. $F_{\alpha,n_1,n_2}$ is called a critical value of the F-distribution.
Problem. Show that $F_{1-\alpha,n_2,n_1} = \dfrac{1}{F_{\alpha,n_1,n_2}}$.

Solution. We have
\[
\alpha = P[Y > F_{\alpha,n_1,n_2}] = P\!\left[\frac{1}{Y} < \frac{1}{F_{\alpha,n_1,n_2}}\right] = 1 - P\!\left[\frac{1}{Y} > \frac{1}{F_{\alpha,n_1,n_2}}\right],
\]
so $P\!\left[\frac{1}{Y} > \frac{1}{F_{\alpha,n_1,n_2}}\right] = 1 - \alpha$. This gives the required result, since $1/Y \sim F_{n_2,n_1}$.
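This identity is easy to check numerically. The following Python sketch (assuming scipy; note that scipy's ppf gives lower-tail quantiles, so the upper-tail critical value $F_{\alpha,n_1,n_2}$ is ppf(1 - alpha)):

    from scipy import stats

    alpha, n1, n2 = 0.05, 5, 10
    F_upper = stats.f.ppf(1 - alpha, n1, n2)     # F_{alpha, n1, n2}
    F_lower = stats.f.ppf(alpha, n2, n1)         # F_{1-alpha, n2, n1}
    print(F_lower, 1.0 / F_upper)                # the two numbers agree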
1.2 Applications of these Distributions.
The following lemma will be extremely useful throughout the course.
Lemma 1.2.1 Let $\{x_1, \dots, x_n\}$ and $\{y_1, \dots, y_n\}$ be two sets of n numbers (which may be the same), and let $\bar{x} = (x_1 + \dots + x_n)/n$ and $\bar{y} = (y_1 + \dots + y_n)/n$. Then
\[
\sum_{i=1}^n (x_i - c)(y_i - d) = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) + n(\bar{x} - c)(\bar{y} - d), \tag{1.1}
\]
for any numbers c and d. In the case where the two sets are the same, we have
\[
\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2, \tag{1.2}
\]
for any number $\mu$.
Proof. Using the fact that $\sum_{i=1}^n (x_i - \bar{x}) = 0$,
\begin{align*}
\sum_{i=1}^n (x_i - c)(y_i - d) &= \sum_{i=1}^n [(x_i - \bar{x}) + (\bar{x} - c)][(y_i - \bar{y}) + (\bar{y} - d)]\\
&= \sum_{i=1}^n [(x_i - \bar{x})(y_i - \bar{y}) + (x_i - \bar{x})(\bar{y} - d) + (\bar{x} - c)(y_i - \bar{y}) + (\bar{x} - c)(\bar{y} - d)]\\
&= \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) + (\bar{y} - d)\sum_{i=1}^n (x_i - \bar{x}) + (\bar{x} - c)\sum_{i=1}^n (y_i - \bar{y}) + n(\bar{x} - c)(\bar{y} - d)\\
&= \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) + n(\bar{x} - c)(\bar{y} - d).
\end{align*}
Definition. A set of independent random variables $X_1, X_2, \dots, X_n$, each having the distribution function F, is said to be a simple random sample from the distribution F. Let us denote the mean and variance of F by $\mu$ and $\sigma^2$, and define
\[
\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2
\]
to be the sample mean and sample variance, respectively.
Proposition 1.2.2
(1) $E(\bar{X}) = \mu$ and $\operatorname{Var}(\bar{X}) = \sigma^2/n$.
(2) For large n, $\bar{X}$ is approximately $N(\mu, \sigma^2/n)$.
(3) $E(s^2) = \sigma^2$.
(4) $\bar{X}$ and $X_i - \bar{X}$ are uncorrelated.
Proof. (1) is obvious and (2) follows from the CLT. For (3), we have from (1.2) that
\[
E(s^2) = \frac{1}{n-1} E\!\left[\sum_{i=1}^n (X_i - \mu)^2 - n(\bar{X} - \mu)^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^n E(X_i - \mu)^2 - nE(\bar{X} - \mu)^2\right] = \frac{1}{n-1}\left[n\sigma^2 - n\,\frac{\sigma^2}{n}\right] = \sigma^2.
\]
For (4), we have
\[
\operatorname{Cov}(\bar{X}, X_i - \bar{X}) = \operatorname{Cov}(\bar{X}, X_i) - \operatorname{Cov}(\bar{X}, \bar{X}) = \frac{1}{n}\sum_{j=1}^n \operatorname{Cov}(X_j, X_i) - \frac{\sigma^2}{n} = \frac{\operatorname{Cov}(X_i, X_i)}{n} - \frac{\sigma^2}{n} = 0.
\]
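A small simulation sketch (not from the notes; it assumes numpy, and the parameter values and seed are arbitrary) illustrating parts (3) and (4): the average of $s^2$ over many samples is close to $\sigma^2$, and the empirical covariance between $\bar{X}$ and $X_1 - \bar{X}$ is close to 0.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 5.0, 2.0, 10, 20000
    X = rng.normal(mu, sigma, size=(reps, n))
    xbar = X.mean(axis=1)
    s2 = X.var(axis=1, ddof=1)                  # sample variance with divisor n-1
    print(s2.mean())                            # close to sigma**2 = 4
    print(np.cov(xbar, X[:, 0] - xbar)[0, 1])   # close to 0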
Proposition 1.2.3 Let $X_1, X_2, \dots, X_n$ be a random sample from $N(\mu, \sigma^2)$. Then
(1) $\bar{X} \sim N(\mu, \sigma^2/n)$.
(2) $\bar{X}$ and $s^2$ are independent.
(3) $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$.
(4) the random variable
\[
\frac{\bar{X} - \mu}{s/\sqrt{n}}
\]
has the t-distribution with $n-1$ degrees of freedom.
Proof.
(1) $M_{\bar{X}}(t) = M_{X_1+\dots+X_n}(t/n) = \left(M_{X_1}(t/n)\right)^n = \left(e^{\mu t/n + \sigma^2 t^2/2n^2}\right)^n = e^{\mu t + \sigma^2 t^2/2n}$, which is the m.g.f. of $N(\mu, \sigma^2/n)$.
(2) It can be shown that $\bar{X}$ and $X_i - \bar{X}$ have a bivariate normal distribution. Since they are uncorrelated, they must also be independent. Then $\bar{X}$ and $(X_i - \bar{X})^2$ are also independent for each i, and so $\bar{X}$ and $\sum_{i=1}^n (X_i - \bar{X})^2$ are independent.
(3) From Lemma 1.2.1, we have
\[
\underbrace{\sum_{i=1}^n \left(\frac{X_i - \mu}{\sigma}\right)^2}_{\chi^2_n} = \frac{(n-1)s^2}{\sigma^2} + \underbrace{\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2}_{\chi^2_1},
\]
and so the result follows from Proposition 1.1.1.
(4) We can write
\[
\frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n-1)s^2}{\sigma^2}\Big/(n-1)}}.
\]
But this is of the form $X/\sqrt{Y/m}$ where $X \sim N(0,1)$ and $Y \sim \chi^2_m$, and so the result follows by Proposition 1.1.3.
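To illustrate parts (3) and (4), here is a simulation sketch (assuming numpy and scipy; seed and parameters are arbitrary). It compares a few empirical quantiles of $(n-1)s^2/\sigma^2$ and of the t-statistic with the theoretical $\chi^2_{n-1}$ and $t_{n-1}$ quantiles.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    mu, sigma, n, reps = 0.0, 3.0, 6, 50000
    X = rng.normal(mu, sigma, size=(reps, n))
    xbar, s = X.mean(axis=1), X.std(axis=1, ddof=1)
    W = (n - 1) * s**2 / sigma**2                # should be chi-square with n-1 d.f.
    T = (xbar - mu) / (s / np.sqrt(n))           # should be t with n-1 d.f.
    print(np.quantile(W, [0.5, 0.95]), stats.chi2(n - 1).ppf([0.5, 0.95]))
    print(np.quantile(T, [0.5, 0.95]), stats.t(n - 1).ppf([0.5, 0.95]))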
1.3 Order Statistics.
Definition. Let $X_1, \dots, X_n$ be a random sample from the continuous distribution F(x) with density f(x). For each outcome $\omega$, define
\begin{align*}
Y_1(\omega) &= \text{the smallest of } X_1(\omega), \dots, X_n(\omega),\\
Y_2(\omega) &= \text{the second smallest of } X_1(\omega), \dots, X_n(\omega),\\
Y_3(\omega) &= \text{the third smallest of } X_1(\omega), \dots, X_n(\omega),\\
&\ \,\vdots\\
Y_n(\omega) &= \text{the largest of } X_1(\omega), \dots, X_n(\omega).
\end{align*}
The random variables $Y_1, \dots, Y_n$ are called the order statistics. In particular, $Y_r$ is called the rth order statistic.
Example. Suppose $X_1(\omega) = 7$, $X_2(\omega) = 4$, $X_3(\omega) = 5$. Then $Y_1(\omega) = 4$, $Y_2(\omega) = 5$, $Y_3(\omega) = 7$.
Proposition 1.3.1 The rth order statistic $Y_r$ has density function given by
\[
g_r(y) = \frac{n!}{(r-1)!\,(n-r)!}\,[F(y)]^{r-1} f(y)\,[1 - F(y)]^{n-r}.
\]

Proof. For small $h > 0$, we have, using the trinomial distribution,
\begin{align*}
\Pr\{y < Y_r \le y + h\} &= \text{the probability that } r-1 \text{ sample values fall below } y, \text{ one falls in } (y, y+h],\\
&\qquad \text{and } n-r \text{ fall above } y+h\\
&= \frac{n!}{(r-1)!\,1!\,(n-r)!}\,[F(y)]^{r-1}\,[F(y+h) - F(y)]\,[1 - F(y+h)]^{n-r}.
\end{align*}
Dividing both sides by h and letting $h \to 0$ then gives the required result.
Definition. Given a random sample $X_1, \dots, X_n$ of size n, we define the sample median to be
\[
\tilde{X} = \begin{cases} Y_{m+1} & \text{if } n = 2m+1,\\[4pt] \dfrac{Y_m + Y_{m+1}}{2} & \text{if } n = 2m. \end{cases}
\]
Note that $\tilde{X}$ is the middle sample value if n is odd, or the mean of the two middle values if n is even.
Remark. From Proposition 1.3.1, the density function of the median for a sample of size $2m+1$ (the $(m+1)$st order statistic) is
\[
g_{m+1}(y) = \frac{(2m+1)!}{m!\,m!}\,[F(y)]^m f(y)\,[1 - F(y)]^m.
\]

Example. For the case of a sample of size $2m+1$ from the exponential distribution with mean $\theta$, the density function of the sample median is
\[
g_{m+1}(y) = \begin{cases} \dfrac{(2m+1)!}{m!\,m!}\left[1 - e^{-y/\theta}\right]^m \dfrac{1}{\theta}\, e^{-(m+1)y/\theta} & \text{if } y > 0,\\[4pt] 0 & \text{if } y \le 0. \end{cases}
\]
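The following Python sketch (assuming numpy and scipy; the choices m = 2 and theta = 1 are arbitrary) compares a histogram of simulated sample medians with this density:

    import numpy as np
    from scipy import special

    rng = np.random.default_rng(2)
    m, theta, reps = 2, 1.0, 100000
    samples = rng.exponential(theta, size=(reps, 2*m + 1))
    medians = np.median(samples, axis=1)

    def g(y):
        # density of the (m+1)st order statistic of 2m+1 exponential(theta) variables
        c = special.factorial(2*m + 1) / special.factorial(m)**2
        return c * (1 - np.exp(-y/theta))**m * np.exp(-(m + 1)*y/theta) / theta

    hist, edges = np.histogram(medians, bins=50, density=True)
    mid = 0.5 * (edges[:-1] + edges[1:])
    print(np.max(np.abs(hist - g(mid))))      # discrepancy is small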
Chapter 2
Estimation
Reference: WMS 7th ed., chapters 8,9
Statistical Inference. Let X be a random quantity whose distribution depends on a parameter $\theta$. We do not know the "true" value of $\theta$, only that $\theta$ belongs to the set $\Theta$, called the parameter set. Statistical inference is concerned with using the observed value x of X to obtain information about this true value of $\theta$. It takes basically two forms:
(1) Estimation, in which the observation x is used to come up with a plausible value for $\theta$ as the "true" value.
(2) Hypothesis Testing, in which x is used to decide between two hypotheses concerning the true value of $\theta$.

During most (but not all) of this course, the classical situation will prevail. That is, X will be a random vector $(X_1, X_2, \dots, X_n)$, where $X_1, X_2, \dots, X_n$ are independent and identically distributed number-valued random variables, each with distribution function $F_\theta(x)$. Such an X is called a simple random sample, or just a random sample, taken from the distribution $F_\theta(x)$. On the other hand, $\theta$ need not be a numerical parameter. For example, $\theta$ may be in the form of a vector, such as $(\mu, \sigma^2)$.
Definition. Let t(x) be a function of x which does not depend on $\theta$. The random variable t(X) (e.g. $t(X_1, X_2, \dots, X_n)$ in the case of a random sample) is called a statistic. A statistic used to estimate $\theta$ is called an estimator of $\theta$, and is generically denoted by $\hat{\theta}$. The value of the random variable $\hat{\theta}$ is called the estimate.

For example, given a random sample $X_1, X_2, \dots, X_n$ from a distribution with mean $\mu$ and variance $\sigma^2$,
\[
\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \qquad \text{and} \qquad \hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2
\]
are commonly used estimators of $\mu$ and $\sigma^2$.
In Sections 2.1 to 2.3 we study point estimation; interval estimation is taken up in Chapter 3.
2.1 Methods of Estimation.
The problem to be considered in this section is: given a parameter to be estimated, how do we derive a
suitable estimator of it? We will examine three methods of estimation: maximum likelihood estimation,
the method of moments, and Bayesian estimation. A fourth method, that of least squares, will be used later
on in the course.
2.1.1 Maximum Likelihood Estimation.
We shall suppose that the observation X is either discrete with probability function $f_\theta(x)$, or continuous with density function $f_\theta(x)$.

Definition. The function $L(\theta) = f_\theta(x)$ of $\theta$, where x is considered fixed, is called the likelihood function. The method of maximum likelihood consists of choosing as our estimate $\hat{\theta}$ that value of $\theta$ for which $L(\theta)$ is a maximum. In other words, $\hat{\theta}$ is a member of $\Theta$ such that
\[
L(\hat{\theta}) = \sup\{L(\theta) \mid \theta \in \Theta\}.
\]
That is, we take for $\hat{\theta}$ that value of $\theta$ which is most likely to have produced the observation x. Then $\hat{\theta}$ is called the maximum likelihood estimator (MLE) of $\theta$.

$\mathcal{L}(\theta) = \log L(\theta)$ is called the log likelihood function. It should be noted that since $\log x$ is a strictly increasing function of x, $\mathcal{L}(\theta)$ attains its maximum value at the same $\hat{\theta}$ as does $L(\theta)$.
Example. Suppose $\theta = 0$ or $1/2$, and $f_\theta(x)$ is given in the following table.

            theta = 0    theta = 1/2
  x = 1         0            .1
  x = 2         1            .9

If x = 1 is observed, then $\hat{\theta} = 1/2$. If x = 2 is observed, then $\hat{\theta} = 0$.
Problem. Suppose $f_\theta(x)$ is given in the following table. Find MLEs of $\theta$. This example shows that MLEs are not unique. (A later example involving the uniform distribution shows they may not even exist.)

            theta = 0    theta = 1/4    theta = 1/2
  x = 1        .1            .2              0
  x = 2        .4            .4             .3
  x = 3        .5            .4             .7
Example. A manufacturer of lightbulbs knows that the lifetime of his bulbs is a random variable with exponential density function
\[
g_\theta(t) = \begin{cases} \frac{1}{\theta} e^{-t/\theta} & \text{if } t > 0,\\ 0 & \text{if } t \le 0, \end{cases}
\]
where $\theta > 0$. He wants to estimate $\theta$.

(1) Suppose he selects at random n bulbs, sets them burning, and separately measures their lifetimes $T_1, T_2, \dots, T_n$. Then the observation is $X = (T_1, T_2, \dots, T_n)$, and is a random sample from the above density function. The likelihood function is then the joint density function of $T_1, \dots, T_n$, which because of independence is
\[
L(\theta) = f_\theta(t_1, t_2, \dots, t_n) = g_\theta(t_1)\cdots g_\theta(t_n) = \begin{cases} \frac{1}{\theta^n} e^{-n\bar{t}/\theta} & \text{if } t_1, \dots, t_n > 0,\\ 0 & \text{otherwise.} \end{cases}
\]
The log likelihood function is then $\mathcal{L}(\theta) = -n\log\theta - n\bar{t}/\theta$. Differentiating, we get
\[
\mathcal{L}'(\theta) = -\frac{n}{\theta} + \frac{n\bar{t}}{\theta^2} = \frac{n}{\theta}\left(\frac{\bar{t}}{\theta} - 1\right) \quad \begin{cases} > 0 & \text{if } \theta < \bar{t},\\ = 0 & \text{if } \theta = \bar{t},\\ < 0 & \text{if } \theta > \bar{t}, \end{cases}
\]
so that the MLE must be $\hat{\theta} = \bar{t}$.
(2) Consider the following alternative sampling scheme. He sets n bulbs burning, and then c units of time later observes the number x of bulbs which are still burning. We shall determine the MLE of $\theta$ in this case. The number X of bulbs still burning at time c is a binomial random variable with parameters n and $p = e^{-c/\theta}$, and therefore the likelihood function is
\[
L(\theta) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad \text{where } p = e^{-c/\theta}.
\]
The log likelihood function is $\mathcal{L}(\theta) = \log\binom{n}{x} + x\log p + (n-x)\log(1-p)$, so
\[
\mathcal{L}'(\theta) = \frac{x}{p}\,p'(\theta) - \frac{n-x}{1-p}\,p'(\theta) = p'(\theta)\left[\frac{x - np}{p(1-p)}\right].
\]
Since $p'(\theta) > 0$, $\mathcal{L}'(\theta) = 0$ when $p = x/n$. Since $p = e^{-c/\theta}$, solving for $\theta$ gives
\[
\hat{\theta} = \frac{-c}{\log(x/n)}.
\]
We leave it to the reader to verify that this critical value $\hat{\theta}$ actually gives a maximum.
(3) As a third alternative, suppose he sets n bulbs burning, and observes the time Y until the first bulb burns out. That is, if $T_1, \dots, T_n$ denote the lifetimes, then Y is just the first order statistic for the sample. Since
\[
P\{Y > y\} = P\{T_1 > y, \dots, T_n > y\} = P\{T_1 > y\}\cdots P\{T_n > y\} = e^{-ny/\theta}
\]
for $y > 0$, the likelihood function is
\[
L(\theta) = \begin{cases} \frac{n}{\theta} e^{-ny/\theta} & \text{if } y > 0,\\ 0 & \text{otherwise.} \end{cases}
\]
It is an easy matter to show that the MLE of $\theta$ is $\hat{\theta} = ny$.
Example. Given a random sample $X_1, \dots, X_n$ from the distribution $N(\mu, \sigma^2)$, find the maximum likelihood estimators of $\mu$ and $\sigma^2$.

Solution. Here we have $\theta = (\mu, \sigma^2)$. The likelihood function is the joint density function of $X_1, \dots, X_n$, and because of independence, we have
\[
L(\mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_1-\mu}{\sigma}\right)^2} \cdots \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_n-\mu}{\sigma}\right)^2} = \frac{1}{\sigma^n (2\pi)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2}.
\]
The log likelihood function is
\[
\mathcal{L}(\mu, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2,
\]
so
\[
\frac{\partial\mathcal{L}}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = \frac{n}{\sigma^2}(\bar{x} - \mu)
\]
and
\[
\frac{\partial\mathcal{L}}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu)^2 = \frac{n}{2\sigma^4}\left[\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2 - \sigma^2\right].
\]
Setting these partial derivatives equal to zero and solving simultaneously, we find that
\[
\hat{\mu} = \bar{x} \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.
\]
We shall leave it to the reader to check that $L(\mu, \sigma^2)$ actually achieves its maximum at these values, and that therefore these are the MLEs.
Remark. $z = f(x, y)$ has a maximum at $(x_0, y_0)$ if (1) $f_1 = f_2 = 0$ at $(x_0, y_0)$, (2) $f_{11}f_{22} - f_{12}f_{21} > 0$ at $(x_0, y_0)$, and (3) $f_{11} < 0$ at $(x_0, y_0)$.
Example. Given a random sample $X_1, \dots, X_n$ from the uniform density
\[
g_\theta(x) = \begin{cases} \frac{1}{\theta} & \text{if } 0 \le x \le \theta,\\ 0 & \text{otherwise,} \end{cases}
\]
find the MLE of $\theta$.

Solution. The likelihood function is
\[
L(\theta) = \begin{cases} \frac{1}{\theta^n} & \text{if } 0 \le x_1, x_2, \dots, x_n \le \theta,\\ 0 & \text{otherwise} \end{cases}
\;=\; \begin{cases} \frac{1}{\theta^n} & \text{if } 0 \le y_1 \le \dots \le y_n \le \theta,\\ 0 & \text{otherwise,} \end{cases}
\]
where $y_r$ denotes the rth order statistic. The figure below shows a sketch of the graph of $L(\theta)$ versus $\theta$, from which we see that the MLE of $\theta$ is
\[
\hat{\theta} = y_n = \max(x_1, \dots, x_n).
\]

[Figure: graph of $L(\theta)$ against $\theta$; $L(\theta) = 0$ for $\theta < y_n$ and $L(\theta) = 1/\theta^n$, decreasing, for $\theta \ge y_n$.]

Note that calculus would not work in this case since the function $L(\theta)$ is not differentiable at $y_n$. Also note that if the definition of $g_\theta(x)$ is changed to
\[
g_\theta(x) = \begin{cases} \frac{1}{\theta} & \text{if } 0 \le x < \theta,\\ 0 & \text{otherwise,} \end{cases}
\]
then $L(\theta)$ becomes
\[
L(\theta) = \begin{cases} \frac{1}{\theta^n} & \text{if } 0 \le y_1 \le \dots \le y_n < \theta,\\ 0 & \text{otherwise,} \end{cases}
\]
so that no MLE can exist.
Proposition 2.1.1 (Invariance Property for MLEs) If $\hat{\theta}$ is an MLE of $\theta$, and if $\tau$ is some function defined on $\Theta$, then $\tau(\hat{\theta})$ is an MLE of $\tau(\theta)$.
Proof. First suppose $\tau$ is a one-to-one function. Let $\eta = \tau(\theta)$ and $\hat{\eta} = \tau(\hat{\theta})$, and let $g_\eta(x)$ be the likelihood function of x using the parameter $\eta$ (so that $g_{\tau(\theta)}(x) = f_\theta(x)$). Then
\[
g_{\hat{\eta}}(x) = g_{\tau(\hat{\theta})}(x) = f_{\hat{\theta}}(x) = \sup_{\theta\in\Theta} f_\theta(x) = \sup_{\eta} g_\eta(x),
\]
so $\hat{\eta} = \tau(\hat{\theta})$ is an MLE of $\tau(\theta)$.

If $\tau$ is not one-to-one, it is not clear what is meant by the MLE of $\tau(\theta)$. We therefore proceed as follows: we define the induced likelihood function $L^*(\eta) = \sup_{\{\theta : \tau(\theta) = \eta\}} L(\theta)$, and say that $\hat{\eta}$ is an MLE of $\tau(\theta)$ if $\hat{\eta}$ is a maximum of $L^*(\eta)$. Then
\[
L^*(\tau(\hat{\theta})) = \sup_{\{\theta : \tau(\theta) = \tau(\hat{\theta})\}} L(\theta) = L(\hat{\theta}) = \sup_{\theta} L(\theta) = \sup_{\eta}\ \sup_{\{\theta : \tau(\theta) = \eta\}} L(\theta) = \sup_{\eta} L^*(\eta),
\]
so $\tau(\hat{\theta})$ is an MLE of $\tau(\theta)$.
2.1.2 Method of Moments.

This method is based on the strong law of large numbers: if $X_1, X_2, X_3, \dots$ is a sequence of i.i.d. random variables with finite kth moment $\mu_k = E(X^k)$, then
\[
\frac{1}{n}\sum_{i=1}^n X_i^k \to \mu_k \quad \text{as } n \to \infty.
\]
Thus if $X_1, \dots, X_n$ is a random sample and n is large, and if we put
\[
m_k = \frac{1}{n}\sum_{i=1}^n X_i^k,
\]
then we have $m_k \approx \mu_k$ for $k \ge 1$. The method of moments consists of solving as many of the equations
\[
m_k = \mu_k, \qquad k \ge 1,
\]
starting with the case $k = 1$, as are necessary to identify the unknown parameters.
Example. Suppose $X_1, \dots, X_n$ is a random sample from an exponential distribution with mean $\theta$. Then $\mu_1 = \theta$ and $m_1 = \bar{X}$, so the moment estimator is $\hat{\theta} = \bar{X}$.
Example. Suppose $X_1, \dots, X_n$ is a random sample from a gamma distribution with parameters $\alpha$ and $\beta$. Find moment estimators of $\alpha$ and $\beta$.

Solution. We have $\mu_1 = \alpha\beta$ and $\mu_2 = \alpha^2\beta^2 + \alpha\beta^2$. Hence $m_1 = \hat{\alpha}\hat{\beta}$ and $m_2 = \hat{\alpha}\hat{\beta}^2(1 + \hat{\alpha})$. Solving these two equations for $\hat{\alpha}$ and $\hat{\beta}$, we find
\[
\hat{\alpha} = \frac{m_1^2}{m_2 - m_1^2}, \qquad \hat{\beta} = \frac{m_2 - m_1^2}{m_1}.
\]
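A Python sketch of these moment estimators (assuming numpy; the data are simulated, so the numbers below are illustrative only):

    import numpy as np

    rng = np.random.default_rng(5)
    alpha_true, beta_true, n = 3.0, 2.0, 5000
    x = rng.gamma(shape=alpha_true, scale=beta_true, size=n)

    m1 = x.mean()
    m2 = np.mean(x**2)
    alpha_hat = m1**2 / (m2 - m1**2)
    beta_hat = (m2 - m1**2) / m1
    print(alpha_hat, beta_hat)                 # close to 3 and 2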
2.1.3 Bayesian Estimation.
Reference: WMS 7th ed., chapter 16
In Bayesian estimation, we assume that the parameter $\theta$ is actually a random variable with a distribution to be called the prior distribution. We also have in mind a certain loss function $L(\theta, \hat{\theta})$ which specifies the loss or penalty when the true parameter value is $\theta$ and our estimate of it is $\hat{\theta}$. Examples of possible loss functions are $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$ (the squared loss function), and $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$.

Definition. The Bayes estimator of $\theta$ is that estimator $\hat{\theta} = t(X)$ for which the mean loss $E(L(\theta, t(X)))$ is a minimum.

It can be shown that for the squared loss function, the Bayes estimator is given by $\hat{\theta} = t(X)$ where
\[
t(x) = E(\theta \mid X = x), \tag{2.1}
\]
where X is the observation. This is a result of a proposition given in the 357 supplement.
The conditional expectation in (2.1) is calculated as follows. Let $h(\theta)$ denote the prior density (or probability) function of $\theta$. The conditional density (or probability) function $f(x\mid\theta)$ of X given $\theta$ is the likelihood function $f_\theta(x)$. Then the conditional density (or probability) function of $\theta$ given $X = x$ is
\[
\pi(\theta \mid x) = \frac{f_\theta(x)\,h(\theta)}{g(x)}, \tag{2.2}
\]
where g(x) is the marginal density (probability) function of X. $\pi(\theta\mid x)$ is called the posterior density (probability) function of $\theta$. Finally, $E(\theta\mid X = x)$ is computed as the mean of the posterior density function.
Example 1. Estimate the parameter $\theta$ of a binomial distribution given the number X of successes in n trials, and given that the prior distribution of $\theta$ is a beta distribution with parameters $\alpha$ and $\beta$; that is,
\[
h(\theta) = \begin{cases} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} & \text{if } 0 < \theta < 1,\\ 0 & \text{otherwise.} \end{cases}
\]

Solution. We shall need the fact that the mean of such a beta density $h(\theta)$ is $\alpha/(\alpha+\beta)$. The likelihood function is
\[
f_\theta(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \qquad x = 0, 1, \dots, n,
\]
and the joint density function of X and $\theta$ is therefore
\[
f(x, \theta) = f_\theta(x)h(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\binom{n}{x}\theta^{\alpha+x-1}(1-\theta)^{\beta+n-x-1}.
\]
Integrating, we find the marginal probability function of X to be
\[
g(x) = \int_0^1 f(x,\theta)\,d\theta = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\binom{n}{x}\int_0^1 \theta^{\alpha+x-1}(1-\theta)^{\beta+n-x-1}\,d\theta = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\binom{n}{x}\frac{\Gamma(\alpha+x)\Gamma(\beta+n-x)}{\Gamma(\alpha+\beta+n)}.
\]
Hence the posterior density of $\theta$ is
\[
\pi(\theta\mid x) = \frac{f(x,\theta)}{g(x)} = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+x)\Gamma(\beta+n-x)}\,\theta^{\alpha+x-1}(1-\theta)^{\beta+n-x-1}, \qquad 0 < \theta < 1,
\]
which is another beta density, this time with parameters $\alpha^* = \alpha + x$ and $\beta^* = \beta + n - x$. The Bayes estimate is therefore the mean of this density, namely
\[
\hat{\theta} = E(\theta\mid X = x) = \frac{\alpha + x}{\alpha + \beta + n}.
\]
Notice how both the information in the prior and the observation have been used to estimate $\theta$. Also note that (2.2) can be written as
\[
\pi(\theta\mid x) = K f_\theta(x)\,h(\theta), \tag{2.3}
\]
where K is a normalization constant depending on x. The point is that the distribution on the right-hand side of (2.3) might be easily recognizable, and then $\pi(\theta\mid x)$ (or even $E(\theta\mid X = x)$) could perhaps be written down directly, without going through the fuss of determining g(x).
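A minimal Python sketch of Example 1 (assuming scipy; the prior parameters and data below are made up for illustration only):

    from scipy import stats

    alpha, beta, n, x = 2.0, 3.0, 20, 13          # prior Beta(2,3), 13 successes in 20 trials
    posterior = stats.beta(alpha + x, beta + n - x)
    bayes_estimate = (alpha + x) / (alpha + beta + n)
    print(bayes_estimate, posterior.mean())       # the same number, by the formula above
    print(posterior.interval(0.95))               # a central 95% posterior interval for theta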
Example 2. Estimate the parameter $\theta$ of a uniform distribution on $(0, \theta)$, on the basis of a single observation from that distribution, and given that the prior distribution of $\theta$ is the Pareto distribution with density function
\[
h(\theta) = \begin{cases} \dfrac{\alpha\,\theta_0^\alpha}{\theta^{\alpha+1}} & \text{if } \theta \ge \theta_0,\\[4pt] 0 & \text{if } \theta < \theta_0, \end{cases}
\]
where $\theta_0 > 0$ and $\alpha > 0$.

Solution. The mean of the Pareto law is
\[
\int_{-\infty}^{+\infty} \theta\,h(\theta)\,d\theta = \int_{\theta_0}^{\infty} \alpha\,\theta_0^\alpha\,\theta^{-\alpha}\,d\theta = \begin{cases} \alpha\theta_0/(\alpha - 1) & \text{if } \alpha > 1,\\ +\infty & \text{if } \alpha \le 1. \end{cases}
\]
From (2.2), we have
\[
\pi(\theta\mid x) = K f_\theta(x)\,h(\theta) = \begin{cases} K\,\dfrac{1}{\theta}\cdot\dfrac{\alpha\,\theta_0^\alpha}{\theta^{\alpha+1}} & \text{if } 0 < x < \theta \text{ and } \theta \ge \theta_0,\\[4pt] 0 & \text{otherwise.} \end{cases}
\]
Thus the posterior distribution is again Pareto, but with parameters $\max\{\theta_0, x\}$ and $\alpha + 1$. It follows that the Bayes estimator is
\[
E(\theta\mid X = x) = \frac{(\alpha+1)\max\{\theta_0, x\}}{\alpha}.
\]
Example 3. Estimate the mean $\mu$ of a normal distribution with known variance $\sigma^2$, on the basis of the mean $\bar{X}$ of a random sample of size n, and given that the prior distribution of $\mu$ is $N(\mu_0, \sigma_0^2)$.

Solution. After carrying out the details, we find that $\pi(\mu\mid\bar{x})$ is $N(\mu_1, \sigma_1^2)$, where
\[
\mu_1 = \frac{n\bar{x}\,\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2}, \qquad \frac{1}{\sigma_1^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}.
\]
Hence the Bayesian estimator of $\mu$ is
\[
E(\mu\mid\bar{X} = \bar{x}) = \mu_1 = \frac{n\bar{x}\,\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2}.
\]
Remark. The method of this section cannot be used to generate estimators in the "classical" situation where $\theta$ is not random.
2.2 Properties of Estimators.
We have seen in Section 2.1 ways of generating estimators. Now how do we choose the appropriate one to use? There are many criteria for goodness of estimators. In this section we will look at only four: unbiasedness, efficiency, consistency, and sufficiency.
2.2.1 Unbiasedness and Efficiency.

Definition. An estimator $\hat{\theta}$ of $\theta$ is unbiased if $E(\hat{\theta}) = \theta$. The bias of an estimator $\hat{\theta}$ is $B(\hat{\theta}) = E(\hat{\theta}) - \theta$. The mean square error of $\hat{\theta}$ is $MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2$. Observe that
\begin{align*}
MSE(\hat{\theta}) &= E\big[(\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)\big]^2\\
&= E\big[(\hat{\theta} - E(\hat{\theta}))^2 + 2(\hat{\theta} - E(\hat{\theta}))B(\hat{\theta}) + [B(\hat{\theta})]^2\big]\\
&= \operatorname{Var}(\hat{\theta}) + 2B(\hat{\theta})E[\hat{\theta} - E(\hat{\theta})] + [B(\hat{\theta})]^2,
\end{align*}
so that
\[
MSE(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + [B(\hat{\theta})]^2.
\]
The standard error of $\hat{\theta}$ is the standard deviation of $\hat{\theta}$.
Definition. Given two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of $\theta$, we say $\hat{\theta}_1$ is relatively more efficient than $\hat{\theta}_2$ if $E(\hat{\theta}_1 - \theta)^2 \le E(\hat{\theta}_2 - \theta)^2$. The ratio
\[
\frac{E(\hat{\theta}_2 - \theta)^2}{E(\hat{\theta}_1 - \theta)^2}
\]
is called the relative efficiency of $\hat{\theta}_1$ with respect to $\hat{\theta}_2$.

Observe that when $\hat{\theta}_1$ and $\hat{\theta}_2$ are unbiased for $\theta$, $\hat{\theta}_1$ is relatively more efficient than $\hat{\theta}_2$ if $\operatorname{Var}(\hat{\theta}_1) \le \operatorname{Var}(\hat{\theta}_2)$. Obviously unbiasedness and efficiency are two desirable properties of estimators. A good strategy for choosing an estimator among the many available might then be as follows. We agree to restrict ourselves to unbiased estimators, and then among all unbiased estimators of $\theta$, we choose the most efficient one. Such an estimator is called a minimum variance unbiased estimator (MVUE). We remark, though, that among all unbiased estimators of $\theta$, there may not be one with minimum variance. On the other hand, there may be more than one. The following theorem helps very much in verifying whether a given unbiased estimator is an MVUE.
Theorem 2.2.1 (The Cramér-Rao inequality) Let $\hat{\theta} = t(X)$ be an estimator of $\theta$, and let $f_\theta(x)$ denote the likelihood function. Then
\[
\operatorname{Var}(\hat{\theta}) \ge \frac{|\tau'(\theta)|^2}{E\!\left[\dfrac{\partial\log f_\theta(X)}{\partial\theta}\right]^2}, \tag{2.4}
\]
where $\tau(\theta) = E[t(X)] = \int t(x) f_\theta(x)\,dx$ is the mean of the estimator. ($\tau(\theta) = \theta$ if t(X) is unbiased.) When X is a random sample $X_1, X_2, \dots, X_n$ from a density or probability function $g_\theta(x)$, (2.4) becomes
\[
\operatorname{Var}(\hat{\theta}) \ge \frac{|\tau'(\theta)|^2}{nE\!\left[\dfrac{\partial\log g_\theta(X_1)}{\partial\theta}\right]^2}. \tag{2.5}
\]
If $\hat{\theta}$ is unbiased (so $\tau'(\theta) = 1$) and equality holds in (2.4) (or in (2.5)), then $\hat{\theta}$ is an MVUE.
Proof. We shall apply the inequality $|\operatorname{Cov}(U, V)|^2 \le \operatorname{Var}(U)\operatorname{Var}(V)$ to the random variables $U = \hat{\theta}$ and $V = \frac{\partial}{\partial\theta}\log f_\theta(X)$. Using the fact that
\[
\left[\frac{\partial}{\partial\theta}\log f_\theta(x)\right] f_\theta(x) = \frac{\partial}{\partial\theta} f_\theta(x),
\]
we have
\[
E(V) = E\!\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = \int \frac{\partial}{\partial\theta}\log f_\theta(x)\; f_\theta(x)\,dx = \int \frac{\partial}{\partial\theta} f_\theta(x)\,dx = \frac{\partial}{\partial\theta}\int f_\theta(x)\,dx = 0 \tag{2.6}
\]
and
\[
E(UV) = E\!\left(\hat{\theta}\,\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = \int t(x)\frac{\partial}{\partial\theta}\log f_\theta(x)\; f_\theta(x)\,dx = \int t(x)\frac{\partial}{\partial\theta} f_\theta(x)\,dx = \frac{\partial}{\partial\theta}\int t(x) f_\theta(x)\,dx = \tau'(\theta),
\]
so that $\operatorname{Cov}\!\left(\hat{\theta}, \frac{\partial}{\partial\theta}\log f_\theta(X)\right) = \tau'(\theta)$. Thus
\[
|\tau'(\theta)|^2 = \left|\operatorname{Cov}\!\left(\hat{\theta}, \tfrac{\partial}{\partial\theta}\log f_\theta(X)\right)\right|^2 \le \operatorname{Var}(\hat{\theta})\,\operatorname{Var}\!\left(\tfrac{\partial}{\partial\theta}\log f_\theta(X)\right) \tag{2.7}
\]
and therefore
\[
\operatorname{Var}(\hat{\theta}) \ge \frac{|\tau'(\theta)|^2}{\operatorname{Var}\!\left(\tfrac{\partial}{\partial\theta}\log f_\theta(X)\right)}. \tag{2.8}
\]
(2.4) follows from (2.8), by virtue of (2.6). If X is a random sample, then $f_\theta(x) = f_\theta(x_1, x_2, \dots, x_n) = g_\theta(x_1)\cdots g_\theta(x_n)$, so
\[
\frac{\partial\log f_\theta(x)}{\partial\theta} = \sum_{i=1}^n \frac{\partial\log g_\theta(x_i)}{\partial\theta}
\]
and then
\[
\operatorname{Var}\!\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right) = \sum_{i=1}^n \operatorname{Var}\!\left(\frac{\partial\log g_\theta(X_i)}{\partial\theta}\right) = n\operatorname{Var}\!\left(\frac{\partial\log g_\theta(X_1)}{\partial\theta}\right) = nE\!\left[\frac{\partial\log g_\theta(X_1)}{\partial\theta}\right]^2,
\]
where we used the fact that the variance of a sum of independent random variables is the sum of the variances. Hence (2.5) follows from (2.8) and (2.6) again.
Remarks.
(1) If equality holds in (2.7), $\hat{\theta}$ is said to be an efficient estimator of $\theta$.
(2) The quantity
\[
I(\theta) = E\!\left[\frac{\partial\log f_\theta(X)}{\partial\theta}\right]^2 = \operatorname{Var}\!\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)
\]
in the denominator of (2.4) is called the Fisher information of the observation X.
(3) Recall that equality holds in (2.7) if and only if $\hat{\theta} = t(X)$ and $\frac{\partial}{\partial\theta}\log f_\theta(X)$ are linearly related with probability 1; that is, if
\[
P\!\left[\frac{\partial}{\partial\theta}\log f_\theta(X) = a(\theta)t(X) + b(\theta)\right] = 1,
\]
which means the likelihood function must be of the form
\[
f_\theta(x) = e^{w(\theta)t(x) + B(\theta) + H(x)} = h(x)\,c(\theta)\,e^{w(\theta)t(x)},
\]
where $w(\theta) = \int a(\theta)\,d\theta$. Density or probability functions that have this form are said to be of exponential type. The exact definition of an exponential family is given at the end of Section 2.2.
Example. Let $X_1, \dots, X_n$ be a random sample from $N(\mu, \sigma^2)$. Show that the sample mean $\bar{X}$ is an MVUE of $\mu$.
Solution. We already know that $\bar{X}$ is unbiased for $\mu$. Hence we check for equality in (2.5). $g(x_1)$ is the usual normal density, so $\log g(x_1) = -\log(\sigma\sqrt{2\pi}) - (x_1 - \mu)^2/2\sigma^2$. Hence
\[
\frac{\partial}{\partial\mu}\log g(x_1) = \frac{1}{\sigma}\left(\frac{x_1 - \mu}{\sigma}\right),
\]
so that
\[
nE\!\left[\frac{\partial\log g(X_1)}{\partial\mu}\right]^2 = \frac{n}{\sigma^2}\,E\!\left(\frac{X_1 - \mu}{\sigma}\right)^2 = \frac{n}{\sigma^2}.
\]
On the other hand, $\operatorname{Var}(\bar{X}) = \sigma^2/n$, so in fact we do have equality in (2.5).
Example. In a sequence of n Bernoulli trials with probability $\theta$ of success, we observe the total number X of successes. Show that $\hat{\theta} = X/n$ is an MVUE for $\theta$.

Solution. Obviously $\hat{\theta}$ is unbiased. On the one hand we have $\operatorname{Var}(\hat{\theta}) = \theta(1-\theta)/n$. On the other hand,
\[
\frac{\partial}{\partial\theta}\log f_\theta(x) = \frac{\partial}{\partial\theta}\log\left[\binom{n}{x}\theta^x(1-\theta)^{n-x}\right] = \frac{\partial}{\partial\theta}\left[\log\binom{n}{x} + x\log\theta + (n-x)\log(1-\theta)\right] = \frac{x}{\theta} - \frac{n-x}{1-\theta} = \frac{x - n\theta}{\theta(1-\theta)},
\]
and so
\[
E\!\left[\frac{\partial\log f_\theta(X)}{\partial\theta}\right]^2 = E\!\left[\frac{X - n\theta}{\theta(1-\theta)}\right]^2 = \frac{n}{\theta(1-\theta)}.
\]
Hence we have equality in (2.4).
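A small simulation sketch of this calculation (assuming numpy; the values of theta and n are arbitrary): the empirical variance of $\hat{\theta} = X/n$ matches the Cramér-Rao bound $\theta(1-\theta)/n$, which is attained here.

    import numpy as np

    rng = np.random.default_rng(6)
    theta, n, reps = 0.3, 50, 200000
    X = rng.binomial(n, theta, size=reps)
    theta_hat = X / n
    print(theta_hat.var())                     # empirical variance of the estimator
    print(theta * (1 - theta) / n)             # Cramer-Rao bound, attained in this model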
2.2.2 Consistency.

Definition. Let $X_1, \dots, X_n$ be a random sample from $F_\theta$. The estimator $\hat{\theta} = t(X_1, \dots, X_n)$ is consistent for $\theta$ if $\hat{\theta} \to \theta$ in probability as $n \to \infty$; that is,
\[
\Pr\{|\hat{\theta} - \theta| > \epsilon\} \to 0 \quad \text{as } n \to \infty
\]
for every $\epsilon > 0$.
Theorem 2.2.2 $\hat{\theta}$ is consistent for $\theta$ if
(1) $\hat{\theta}$ is unbiased for $\theta$, and
(2) $\operatorname{Var}(\hat{\theta}) \to 0$ as $n \to \infty$.

Proof. By Chebyshev's inequality, we have
\[
\Pr\{|\hat{\theta} - \theta| > \epsilon\} = \Pr\{|\hat{\theta} - E(\hat{\theta})| > \epsilon\} \le \frac{\operatorname{Var}(\hat{\theta})}{\epsilon^2} \to 0 \quad \text{as } n \to \infty.
\]
Example. Show that for random samples from a normal distribution $N(\mu, \sigma^2)$, the sample variance $s^2$ is consistent for $\sigma^2$.

Solution. We have shown that $s^2$ is unbiased. Since $(n-1)s^2/\sigma^2$ has a chi-square distribution with $n-1$ degrees of freedom, it has variance $2(n-1)$, and so
\[
\operatorname{Var}(s^2) = \operatorname{Var}\!\left(\frac{\sigma^2}{n-1}\cdot\frac{(n-1)s^2}{\sigma^2}\right) = \frac{\sigma^4}{(n-1)^2}\cdot 2(n-1) = \frac{2\sigma^4}{n-1} \to 0
\]
as $n \to \infty$.
2.2.3 Sufficiency.

A good estimator $\hat{\theta}$ should utilize all the information about $\theta$ in the observation, as opposed to an estimator which does not. This motivates the following definition.

Definition. The statistic $\hat{\theta} = t(X)$ is said to be sufficient for $\theta$ if the conditional density (probability) function $f(x \mid t(X) = w)$ of X given t(X) does not depend on $\theta$ for any w for which it is well-defined.
Remarks.
(1) The interpretation of the definition is that $\hat{\theta}$ is sufficient if, once we know the value of $\hat{\theta}$, the remaining information in the observation says nothing about $\theta$; $\hat{\theta}$ contains in itself all the relevant information about $\theta$.
(2) For example, suppose we have data $x_1, \dots, x_n$ from a random sample from $N(\mu, \sigma^2)$. We would think that the sample mean $\bar{x}$ would contain all the information about $\mu$, and that we could throw away the data. In other words, $\bar{x}$ would seem to be sufficient for $\mu$.
(3) Let $\tau(\cdot)$ be a one-to-one function. If $\hat{\theta}$ is sufficient for $\theta$, then
\[
f(x \mid \tau(\hat{\theta}) = w) = f(x \mid \hat{\theta} = \tau^{-1}(w))
\]
does not depend on $\theta$, so $\tau(\hat{\theta})$ is also sufficient for $\theta$.
Example. Suppose we have a sequence of n Bernoulli trials, each with probability $\theta$ of resulting in success. Define $X_i = 1$ if the ith trial results in success, and 0 otherwise. Then $X_1, X_2, \dots, X_n$ is a random sample from a Bernoulli distribution. Let $Y = \sum_{i=1}^n X_i$ be the number of successes in the sample. Then
\begin{align*}
f(x_1, \dots, x_n \mid Y = y) &= \frac{\Pr\{X_1 = x_1, \dots, X_n = x_n,\ Y = y\}}{\Pr\{Y = y\}} = \frac{\Pr\{X_1 = x_1, \dots, X_n = x_n\}}{\Pr\{Y = y\}}\\
&= \frac{\Pr\{X_1 = x_1\}\cdots\Pr\{X_n = x_n\}}{\Pr\{Y = y\}} = \frac{\theta^y(1-\theta)^{n-y}}{\binom{n}{y}\theta^y(1-\theta)^{n-y}} = \frac{1}{\binom{n}{y}}
\end{align*}
if $\sum_{i=1}^n x_i = y$, and 0 otherwise. In either case, it does not depend on $\theta$, so Y is sufficient for $\theta$. By remark (3), $\hat{\theta} = Y/n$ is also sufficient for $\theta$.
Theorem 2.2.3 (Neyman-Pearson Factorization Criterion) $\hat{\theta} = t(X)$ is sufficient for $\theta$ if and only if the density (probability) function $f_\theta(x)$ of X can be factored as
\[
f_\theta(x) = g_\theta[t(x)]\,h(x), \tag{2.9}
\]
where $g_\theta[t(x)]$ depends on $\theta$ and on x only through t(x), and h(x) depends on x but not on $\theta$.
Proof. We give the proof only in the case where X is discrete.

Suppose t(X) is sufficient. Define $g_\theta[w] = P_\theta[t(X) = w]$. If $g_\theta[t(x)] > 0$, then $h(x) = \Pr\{X = x \mid t(X) = t(x)\}$ is well defined and does not depend on $\theta$, and we have
\[
f_\theta(x) = \Pr\{X = x\} = \Pr\{X = x,\ t(X) = t(x)\} = \Pr\{X = x \mid t(X) = t(x)\}\,\Pr\{t(X) = t(x)\} = h(x)\,g_\theta[t(x)],
\]
so (2.9) holds. If $g_\theta[t(x)] = 0$, then $f_\theta(x) = \Pr\{X = x\} \le P[t(X) = t(x)] = 0$, so $f_\theta(x) = 0$. Once again, (2.9) holds with both sides zero.

Conversely, suppose that the factorization in (2.9) holds, and let w be such that $f(x \mid t(X) = w)$ exists (i.e. $P[t(X) = w] > 0$). If $t(x) \ne w$, then
\[
f(x \mid t(X) = w) = \frac{P[X = x,\ t(X) = w]}{P[t(X) = w]} = 0,
\]
which does not depend on $\theta$. If $t(x) = w$, then
\[
P[t(X) = w] = \sum_{z\in t^{-1}(w)} P[X = z] = \sum_{z\in t^{-1}(w)} g_\theta[t(z)]h(z) = \sum_{z\in t^{-1}(w)} g_\theta[w]h(z) = g_\theta[w]\sum_{z\in t^{-1}(w)} h(z),
\]
and so
\[
f(x \mid t(X) = w) = \frac{P[X = x,\ t(X) = w]}{P[t(X) = w]} = \frac{P[X = x]}{P[t(X) = w]} = \frac{g_\theta[t(x)]h(x)}{g_\theta[w]\sum_{z\in t^{-1}(w)} h(z)} = \frac{h(x)}{\sum_{z\in t^{-1}(w)} h(z)},
\]
which does not depend on $\theta$.
Example. Suppose $X_1, \dots, X_n$ is a random sample from $N(\mu, \sigma^2)$.
(1) Suppose $\sigma^2$ is known. Show that $\bar{X}$ is sufficient for $\mu$.
(2) Suppose both $\mu$ and $\sigma^2$ are unknown. Show that $(\bar{X}, s^2)$ is sufficient for $\theta = (\mu, \sigma^2)$.
Solution. We use the identity
\[
\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2
\]
and the factorization theorem. We have
(1)
\[
f(x_1, \dots, x_n) = \frac{1}{(\sigma\sqrt{2\pi})^n}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2} = \frac{1}{(\sigma\sqrt{2\pi})^n}\, e^{-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \bar{x})^2} = g_\mu(\bar{x})\,h(x_1, \dots, x_n),
\]
and so $\bar{X}$ is sufficient for $\mu$.
(2)
\[
f(x_1, \dots, x_n) = \frac{1}{(\sigma\sqrt{2\pi})^n}\, e^{-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2 - \frac{(n-1)s^2}{2\sigma^2}} = g_{\mu,\sigma^2}(\bar{x}, s^2)\cdot 1,
\]
so $(\bar{X}, s^2)$ is sufficient for $(\mu, \sigma^2)$.
Remarks.
(1) Suppose $\hat{\theta} = t(X)$ is sufficient for $\theta$. If a unique maximum likelihood estimator of $\theta$ exists, it will be a function of $\hat{\theta}$. This follows from the factorization $f_\theta(x) = g_\theta[t(x)]\,h(x)$.
(2) Suppose that the likelihood function is expressible in the form
\[
f_\theta(x) = h(x)\,c(\theta)\,e^{w(\theta)t(x)}.
\]
Then the factorization criterion applies with
\[
g_\theta(t(x)) = c(\theta)\,e^{w(\theta)t(x)}.
\]
Hence t(X) is a sufficient statistic.
Examples.
(1) The exponential density
\[
f_\theta(x) = \begin{cases} \frac{1}{\theta} e^{-x/\theta} & \text{if } x > 0,\\ 0 & \text{otherwise,} \end{cases}
\]
where $\theta > 0$, is of exponential type. Here, we have
\[
h(x) = I_{(0,\infty)}(x), \qquad c(\theta) = \frac{1}{\theta}, \qquad w(\theta) = -\frac{1}{\theta}, \qquad t(x) = x.
\]
(2) The binomial probability function
\[
f_\theta(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x} = \binom{n}{x}(1-\theta)^n\, e^{x\log\left(\frac{\theta}{1-\theta}\right)}, \qquad x = 0, 1, \dots, n;\ 0 < \theta < 1,
\]
is of exponential type, where
\[
h(x) = \binom{n}{x}, \qquad c(\theta) = (1-\theta)^n, \qquad w(\theta) = \log\!\left(\frac{\theta}{1-\theta}\right), \qquad t(x) = x.
\]
Here is the general definition of an exponential family.

Definition. A family $\{f_\theta(x), \theta \in \Theta\}$ of density or probability functions is called an exponential family if it can be expressed as
\[
f_\theta(x) = h(x)\,c(\theta)\,e^{\sum_{i=1}^k w_i(\theta)t_i(x)},
\]
where $h(x) \ge 0$ and $c(\theta) \ge 0$ for all x and $\theta$; $t_1(x), \dots, t_k(x)$ are real valued functions of the observation x (which itself may be a point of some arbitrary space) which do not depend on $\theta$; and $w_1(\theta), \dots, w_k(\theta)$ are real valued functions of the parameter $\theta$ (where again $\Theta$ may be some arbitrary space) which do not depend on x.

Using the Neyman-Pearson factorization criterion, we see that the statistic $t(X) = (t_1(X), \dots, t_k(X))$ is sufficient for $\theta$.
Example. The normal distribution $N(\mu, \sigma^2)$ with density function
\[
f_{\mu,\sigma^2}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} = \frac{e^{-\mu^2/2\sigma^2}}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2 + x\mu/\sigma^2},
\]
where $\mu$ and $\sigma^2$ are unknown, is of exponential type, with
\[
h(x) \equiv 1, \qquad c(\mu, \sigma^2) = \frac{e^{-\mu^2/2\sigma^2}}{\sigma\sqrt{2\pi}}, \qquad w(\mu, \sigma^2) = \left(-\frac{1}{2\sigma^2},\ \frac{\mu}{\sigma^2}\right), \qquad t(x) = (x^2, x).
\]
2.3 Minimum Variance Revisited.
References: WMS 7th ed., section 9.5, and the appendix to this chapter.
Definition. Let X and Y be random variables and suppose $f_{Y|X}(y|x)$ is the conditional density (or conditional probability) function of Y given that $X = x$. Then if x is such that $f_{Y|X}(y|x)$ is defined (that is, if $f_X(x) > 0$), we define
\[
E(\phi(Y) \mid X = x) = \begin{cases} \displaystyle\int_{-\infty}^{\infty} \phi(y)\, f_{Y|X}(y|x)\,dy & \text{if Y is a continuous random variable;}\\[6pt] \displaystyle\sum_y \phi(y)\, f_{Y|X}(y|x) & \text{if Y is a discrete random variable;} \end{cases}
\]
to be the conditional expectation of $\phi(Y)$ given that $X = x$. Temporarily let $h(x) = E(\phi(Y) \mid X = x)$. Then we define $E(\phi(Y)\mid X) = h(X)$ to be the conditional expectation of $\phi(Y)$ given X.

Notice that one gets $E(\phi(Y)\mid X)$ by replacing x in the expression for $E(\phi(Y)\mid X = x)$ by X. Thus whereas $E(\phi(Y)\mid X = x)$ is a function of the numerical variable x, $E(\phi(Y)\mid X)$ is a random variable.
Proposition 2.3.1 Let X and Y be random variables and assume Y has finite mean and variance. Then $E[E(Y\mid X)] = E(Y)$ and $\operatorname{Var}(E(Y\mid X)) \le \operatorname{Var}(Y)$, with equality if and only if $P\{Y = E(Y\mid X)\} = 1$.
Theorem 2.3.2 (Rao-Blackwell) Suppose $T = t(X)$ is a sufficient statistic for $\theta$. Let $U = u(X)$ be an unbiased estimator of $\theta$, and define $\phi(w) = E(U\mid T = w)$. Then $\phi$ does not depend on $\theta$, so that $\phi(T)$ is a statistic. $\phi(T)$ is unbiased for $\theta$ and $\operatorname{Var}(\phi(T)) \le \operatorname{Var}(U)$. If U is MVUE, then $P\{U = \phi(T)\} = 1$.

Proof. By sufficiency, $f(x\mid T = w)$ does not depend on $\theta$, so $\phi(w) = \int u(x) f(x\mid T = w)\,dx$, being an expected value computed using $f(x\mid T = w)$, does not either. We have $\phi(T) = E(U\mid T)$, so $E[\phi(T)] = E[E(U\mid T)] = E(U) = \theta$. Finally, we have $\operatorname{Var}(\phi(T)) = \operatorname{Var}(E(U\mid T)) \le \operatorname{Var}(U)$ by the previous proposition. If U is minimum variance, we have equality.

Thus, assuming there exists a sufficient statistic, we can always reduce variance by conditioning on it, and an MVUE must be a function of the sufficient statistic.
Definition. A family $\{g_\theta(y), \theta\in\Theta\}$ of densities or probability functions is called complete if
\[
E_\theta h(Y) = 0 \quad \text{for all } \theta\in\Theta
\]
implies that $P_\theta\{h(Y) = 0\} = 1$ for all $\theta\in\Theta$. Here, $P_\theta$ is the probability corresponding to the density (or probability) function $g_\theta(y)$.
Example. Suppose that Y has the binomial distribution
\[
g_\theta(y) = \binom{n}{y}\theta^y(1-\theta)^{n-y}, \qquad y = 0, 1, \dots, n,
\]
where $0 \le \theta \le 1$. Then for $\theta\in[0, 1)$,
\[
E_\theta h(Y) = \sum_{y=0}^n h(y)\binom{n}{y}\theta^y(1-\theta)^{n-y} = (1-\theta)^n \sum_{y=0}^n h(y)\binom{n}{y}\left(\frac{\theta}{1-\theta}\right)^y.
\]
Putting $\rho = \theta/(1-\theta)$, we see that if $E_\theta h(Y) = 0$ for all $\theta\in[0,1)$, then
\[
\sum_{y=0}^n h(y)\binom{n}{y}\rho^y = 0
\]
for all $\rho > 0$, which implies that $h(y) = 0$ for all $y = 0, 1, \dots, n$. Hence the binomial family $\{g_\theta, \theta\in[0,1]\}$ is complete.
Definition. A statistic $Y = t(X)$ is complete if the family $\{g_\theta(y) \mid \theta\in\Theta\}$ of distributions of t(X) is complete.
Remark. Suppose that the statistic $T = t(X)$ is complete, and that $\phi_1(T)$ and $\phi_2(T)$ are unbiased for $\theta$. Then
\[
E_\theta[\phi_1(T) - \phi_2(T)] = E_\theta[\phi_1(T)] - E_\theta[\phi_2(T)] = \theta - \theta = 0
\]
for all $\theta$, so by completeness, $\phi_1(T) \equiv \phi_2(T)$ in the sense that $P_\theta\{\phi_1(T) = \phi_2(T)\} = 1$ for all $\theta$.
Proposition 2.3.3 (Lehmann-Scheffé) Suppose that T is a complete sufficient statistic for $\theta$, and that $\hat{\theta} = \phi(T)$ is unbiased for $\theta$. Then $\hat{\theta}$ is the unique MVUE of $\theta$.

Proof. T must be of the form $T = t(X)$. Suppose that U is any unbiased estimator of $\theta$, and define $\phi_2(w) = E(U\mid T = w)$. By the Rao-Blackwell theorem, $V = \phi_2(T)$ is an unbiased estimator of $\theta$ with $\operatorname{Var}(V) \le \operatorname{Var}(U)$. Moreover, by the previous remark, we must have $\hat{\theta} \equiv V$, and so $\operatorname{Var}(\hat{\theta}) = \operatorname{Var}(V) \le \operatorname{Var}(U)$. Hence $\hat{\theta}$ is an MVUE.

Suppose W is another MVUE. Then by the remark following the Rao-Blackwell theorem, W must be of the form $W = \psi(T)$ for some function $\psi(t)$. Once again, by the remark, we have $W \equiv \hat{\theta}$.
Example. Let $X_1, \dots, X_n$ be a random sample from the Bernoulli distribution with parameter $\theta$, and let $X = X_1 + \dots + X_n$. We know that X is sufficient for $\theta$, and also, from the above example, complete. Then $\bar{X} = X/n$, being unbiased for $\theta$, must be the unique MVUE.
Example. Let $X_1, \dots, X_n$ be a random sample from the exponential distribution with density
\[
f_\theta(x) = \begin{cases} \frac{1}{\theta} e^{-x/\theta} & \text{if } x > 0,\\ 0 & \text{otherwise.} \end{cases}
\]
Suppose we want to estimate $\theta^2$. We know that $\bar{X}$ is sufficient for $\theta$. It can also be shown that $\bar{X}$ is complete. We have
\[
E(\bar{X}^2) = \operatorname{Var}(\bar{X}) + (E\bar{X})^2 = \frac{\theta^2}{n} + \theta^2 = \left(\frac{n+1}{n}\right)\theta^2,
\]
so $\frac{n}{n+1}\bar{X}^2$ is the MVUE of $\theta^2$.
Remark. The Lehmann-Scheffé theorem is very useful for finding MVUEs. However, the verification that a given statistic is complete can be quite difficult. It so happens that members of an exponential family are complete as well as sufficient. Most (if not all) of the distributions in WMS are from exponential families. Hence their approach in setting a problem to find an MVUE is to require the reader to prove the statistic is sufficient and to find a function of it which is unbiased. Completeness is swept under the rug.
Chapter 3
Confidence Intervals
Reference: WMS 7th ed., chapter 8
In the previous chapter, we discussed point estimation. For example, the sample mean $\bar{X}$ is a point estimate of the population mean $\mu$. We do not expect $\bar{X}$ to coincide exactly with the true value of $\mu$, only to be close to it. It would be desirable to have some idea of just how close our estimate is to the true value, namely to be able to say that
\[
\bar{x} - \epsilon < \mu < \bar{x} + \epsilon \tag{3.1}
\]
for some $\epsilon > 0$. This is not possible, but we can say that (3.1) holds with a certain degree of confidence, which we shall now make clear.
Definition. Let $\theta$ be a parameter of a given distribution, and let $0 < \alpha < 1$. Let $t_1(X)$ and $t_2(X)$ be two statistics such that
(1) $t_1(X) \le t_2(X)$,
(2) $\Pr\{t_1(X) < \theta < t_2(X)\} = 1 - \alpha$.
Let $\theta_1$ and $\theta_2$ be values of $t_1(X)$ and $t_2(X)$ respectively. Then
\[
\theta_1 < \theta < \theta_2
\]
is called a $(1-\alpha)\times 100\%$ confidence interval (or interval estimate) for $\theta$.
We are now going to give several examples of standard confidence intervals. All of them will be derived by the use of a pivot.

Definition. A pivot is a random variable of the form $g(X, \theta)$ whose distribution does not depend on $\theta$, and where the function $g(x, \theta)$ is for each x a monotonic function of the parameter $\theta$.
Example 1. Random sample from $N(\mu, \sigma^2)$, $\sigma^2$ known. Confidence interval for $\mu$. We use the pivot
\[
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}
\]
which has the distribution N(0,1). We have
\[
1 - \alpha = \Pr\left\{-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right\} = \Pr\Big\{\underbrace{\bar{X} - z_{\alpha/2}\tfrac{\sigma}{\sqrt{n}}}_{t_1(X)} < \mu < \underbrace{\bar{X} + z_{\alpha/2}\tfrac{\sigma}{\sqrt{n}}}_{t_2(X)}\Big\}
\]
after some rearrangement, and so a $(1-\alpha)\times 100\%$ confidence interval for $\mu$ is
\[
\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \tag{3.2}
\]
Remarks. If n is large, say $n \ge 25$, we may invoke the CLT to find that the pivot has (approximately) the N(0,1) distribution regardless of the population distribution. Hence the confidence interval is valid for any population if n is large. Moreover, it is likely that $\sigma$ will be unknown. If n is large, we may replace $\sigma$ in (3.2) by s.
Length of C.I. Let
\[
LCL = \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \qquad UCL = \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}
\]
be the lower and upper confidence limits, respectively. Then
\[
L = UCL - LCL = 2z_{\alpha/2}\frac{\sigma}{\sqrt{n}}
\]
is the length of the confidence interval. Observe that
(1) L varies inversely as the square root of the sample size.
(2) the larger the degree $1-\alpha$ of confidence, the bigger L is.
The above formula for length can be inverted to give
\[
\sqrt{n} = 2z_{\alpha/2}\frac{\sigma}{L},
\]
allowing us to compute the sample size required to achieve a certain length of confidence interval. If you don't know what $\sigma$ is, you could use $\sigma \approx \text{Range}/4$.
Example. (8.71, p. 424) We want to estimate $\mu$ to within 2 with a 95% c.i., so we want n such that L = 4. From past experience, we can take $\sigma = 10$. Hence
\[
\sqrt{n} = 2 \times 1.96 \times \frac{10}{4} \approx 10,
\]
so $n \approx 100$.
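The same computation in Python (a sketch assuming scipy and numpy; the value xbar = 50 used for the interval itself is made up purely for illustration):

    import numpy as np
    from scipy import stats

    # required sample size for a 95% c.i. of length L = 4 when sigma = 10
    alpha, sigma, L = 0.05, 10.0, 4.0
    z = stats.norm.ppf(1 - alpha/2)            # z_{alpha/2} = 1.96
    print(np.ceil((2*z*sigma/L)**2))           # about 97; the notes round sqrt(n) up to 10, so n ~ 100

    # the interval (3.2) itself for an illustrative sample mean
    xbar, n = 50.0, 100
    half = z * sigma / np.sqrt(n)
    print(xbar - half, xbar + half)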
Interpretation of C.I. Suppose samples are repeatedly drawn and 90% confidence intervals constructed from them. In the long run, 90% of the c.i.'s so constructed will contain the true value of $\mu$.
Example 2. Random sample from $N(\mu, \sigma^2)$, $\sigma^2$ unknown. Confidence interval for $\mu$. We use the pivot
\[
\frac{\bar{X} - \mu}{s/\sqrt{n}}
\]
which has the t-distribution with $n-1$ d.f. We have
\[
1 - \alpha = \Pr\left\{-t_{\alpha/2,n-1} < \frac{\bar{X} - \mu}{s/\sqrt{n}} < t_{\alpha/2,n-1}\right\}
\]
and, just as in Example 1, a $(1-\alpha)\times 100\%$ confidence interval for $\mu$ is
\[
\bar{x} - t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}. \tag{3.3}
\]
Example 3. Random sample from $N(\mu, \sigma^2)$, $\mu$ unknown. Confidence interval for $\sigma^2$. We use the pivot
\[
\frac{(n-1)s^2}{\sigma^2}
\]
which has the $\chi^2$-distribution with $n-1$ d.f. We have
\[
1 - \alpha = \Pr\left\{\chi^2_{1-\alpha/2,n-1} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{\alpha/2,n-1}\right\}
\]
and, in the same way as above, a $(1-\alpha)\times 100\%$ confidence interval for $\sigma^2$ is
\[
\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}. \tag{3.4}
\]
Numerical Example. (8.114, 8.115, p. 439) Given a random sample 785, 805, 790, 793, 802 from $N(\mu, \sigma^2)$,
(1) find a 90% c.i. for $\mu$.
(2) find a 90% c.i. for $\sigma^2$.

Solution. We have n = 5, $\bar{x} = 795$, s = 8.34, $\alpha/2 = .05$.
(1) $t_{\alpha/2,n-1} = t_{.05,4} = 2.132$. Substituting into (3.3), the 90% c.i. for $\mu$ is
\[
795 \pm 2.132\,\frac{8.34}{\sqrt{5}} = 795 \pm 7.95, \quad \text{or } (787.05,\ 802.95).
\]
(2) $\chi^2_{.95,4} = 0.710721$ and $\chi^2_{.05,4} = 9.48773$. Substituting into (3.4), the 90% c.i. for $\sigma^2$ is
\[
\frac{4\times 8.34^2}{9.48773} \le \sigma^2 \le \frac{4\times 8.34^2}{0.710721}, \quad \text{that is,} \quad 29.30 \le \sigma^2 \le 391.15. \tag{3.5}
\]
Note also that a 90% c.i. for $\sigma$ is $5.41 \le \sigma \le 19.78$.
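Both intervals can be reproduced with the following Python sketch (assuming numpy and scipy):

    import numpy as np
    from scipy import stats

    x = np.array([785, 805, 790, 793, 802], dtype=float)
    n, xbar, s = len(x), x.mean(), x.std(ddof=1)
    alpha = 0.10

    t = stats.t.ppf(1 - alpha/2, n - 1)                     # t_{.05,4} = 2.132
    print(xbar - t*s/np.sqrt(n), xbar + t*s/np.sqrt(n))     # about (787.05, 802.95)

    lo = (n - 1)*s**2 / stats.chi2.ppf(1 - alpha/2, n - 1)  # divide by chi2_{.05,4} = 9.4877
    hi = (n - 1)*s**2 / stats.chi2.ppf(alpha/2, n - 1)      # divide by chi2_{.95,4} = 0.7107
    print(lo, hi)                                           # about (29.3, 391.2)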
Example 4. Large sample confidence interval for the parameter p of a binomial distribution. Let X be the number of successes in n trials, where n is large. We use the pivot
\[
\frac{X - np}{\sqrt{np(1-p)}},
\]
which because of the CLT has approximately the N(0,1) distribution. We have
\begin{align*}
1 - \alpha &= \Pr\left\{-z_{\alpha/2} < \frac{X - np}{\sqrt{np(1-p)}} < z_{\alpha/2}\right\} = \Pr\{(X - np)^2 < z^2_{\alpha/2}\,np(1-p)\}\\
&= \Pr\{(n^2 + nz^2_{\alpha/2})p^2 - (z^2_{\alpha/2}n + 2nX)p + X^2 < 0\} = \Pr\{p_1(X) < p < p_2(X)\},
\end{align*}
where $p_1(x)$ and $p_2(x)$ are the roots of the quadratic, namely
\[
p_1(x),\ p_2(x) = \frac{x + z^2_{\alpha/2}/2 \mp z_{\alpha/2}\sqrt{z^2_{\alpha/2}/4 + x - x^2/n}}{n + z^2_{\alpha/2}}. \tag{3.6}
\]
Thus our confidence interval is $p_1(x) < p < p_2(x)$. However, since n is large, we can divide the top and bottom of (3.6) by n to find the approximate $(1-\alpha)\times 100\%$ confidence interval for p
\[
\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \tag{3.7}
\]
where $\hat{p} = x/n$.
An Easier Derivation. We have
\begin{align*}
1 - \alpha &= \Pr\left\{-z_{\alpha/2} < \frac{X - np}{\sqrt{np(1-p)}} < z_{\alpha/2}\right\} = \Pr\{-z_{\alpha/2}\sqrt{np(1-p)} < X - np < z_{\alpha/2}\sqrt{np(1-p)}\}\\
&= \Pr\{-z_{\alpha/2}\sqrt{np(1-p)} < np - X < z_{\alpha/2}\sqrt{np(1-p)}\}\\
&= \Pr\{X - z_{\alpha/2}\sqrt{np(1-p)} < np < X + z_{\alpha/2}\sqrt{np(1-p)}\}\\
&= \Pr\left\{\frac{X}{n} - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < p < \frac{X}{n} + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right\},
\end{align*}
so a $(1-\alpha)\times 100\%$ c.i. for p is
\[
\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},
\]
where $\hat{p} = \dfrac{x}{n}$.
Example 5. Confidence interval for the difference $\mu_1 - \mu_2$ of means of two normal distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$. We shall assume we have independent random samples of sizes $n_1$ and $n_2$ from these distributions, and that $\sigma_1^2$ and $\sigma_2^2$ are known. The pivot is
\[
\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}
\]
and has the distribution N(0,1). In the usual way, we derive
\[
\bar{x}_1 - \bar{x}_2 - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < \bar{x}_1 - \bar{x}_2 + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.
\]
The same remarks apply here as in Example 1.
Example 6. Small sample confidence interval for the difference $\mu_1 - \mu_2$ of means of two normal distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$. We shall assume we have independent random samples of sizes $n_1$ and $n_2$ from these distributions, and that $\sigma_1^2$ and $\sigma_2^2$ are unknown. However, a technical assumption we shall have to make is that $\sigma_1^2 = \sigma_2^2 = \sigma^2$, say. The pivot is
\[
\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}},
\]
where
\[
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
\]
is the pooled sample variance. Note that
\[
\frac{(n_1 + n_2 - 2)s_p^2}{\sigma^2} = \frac{(n_1 - 1)s_1^2}{\sigma_1^2} + \frac{(n_2 - 1)s_2^2}{\sigma_2^2}
\]
and so has a $\chi^2$-distribution with $n_1 + n_2 - 2$ degrees of freedom. Using this information, our pivot can be written as
\[
\frac{\dfrac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}}}}{\sqrt{\dfrac{(n_1 + n_2 - 2)s_p^2}{\sigma^2}\Big/(n_1 + n_2 - 2)}}
\]
and so has the t-distribution with $n_1 + n_2 - 2$ degrees of freedom. The rest is as usual, and we find that a $(1-\alpha)\times 100\%$ confidence interval is
\[
\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,n_1+n_2-2}\,s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} < \mu_1 - \mu_2 < \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,n_1+n_2-2}\,s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. \tag{3.8}
\]
Numerical Example. (No. 8.120, p. 440)

                                 Method 1    Method 2
  No. of children in group          11          14
  Sample mean                       64          69
  Sample variance s^2               52          71

Solution. We have
\[
s_p^2 = \frac{(10)(52) + (13)(71)}{23} = 62.74, \qquad t_{.025,23} = 2.069,
\]
so substituting into (3.8), a 95% c.i. for $\mu_1 - \mu_2$ is
\[
64 - 69 \pm 2.069\sqrt{62.74\left(\frac{1}{11} + \frac{1}{14}\right)}, \quad \text{which is } -5 \pm 6.60.
\]
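A sketch of this computation in Python (assuming numpy and scipy):

    import numpy as np
    from scipy import stats

    n1, xbar1, s2_1 = 11, 64.0, 52.0
    n2, xbar2, s2_2 = 14, 69.0, 71.0
    alpha = 0.05

    sp2 = ((n1 - 1)*s2_1 + (n2 - 1)*s2_2) / (n1 + n2 - 2)   # pooled variance, 62.74
    t = stats.t.ppf(1 - alpha/2, n1 + n2 - 2)               # t_{.025,23} = 2.069
    half = t * np.sqrt(sp2 * (1/n1 + 1/n2))
    print(xbar1 - xbar2 - half, xbar1 - xbar2 + half)       # about -5 +/- 6.60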
Example 7. Confidence interval for the ratio $\sigma_1^2/\sigma_2^2$ of variances of two normal distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$. We shall assume we have independent random samples of sizes $n_1$ and $n_2$ from these distributions, and that $\mu_1$ and $\mu_2$ are unknown. The pivot is
\[
\frac{s_2^2/\sigma_2^2}{s_1^2/\sigma_1^2}
\]
which has the F-distribution with $n_2 - 1, n_1 - 1$ degrees of freedom. In the usual way, we find the confidence interval to be
\[
\frac{s_1^2}{s_2^2}\,F_{1-\alpha/2,\,n_2-1,\,n_1-1} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2}\,F_{\alpha/2,\,n_2-1,\,n_1-1}.
\]
Sometimes the fact that
\[
F_{1-\alpha/2,\,n_2-1,\,n_1-1} = \frac{1}{F_{\alpha/2,\,n_1-1,\,n_2-1}}
\]
is used in this c.i.
Example 8. Large sample confidence interval for the difference $p_1 - p_2$ of parameters of two independent binomial random variables. Let $X_1$ and $X_2$ be independent binomial random variables with parameters $n_1, p_1$ and $n_2, p_2$. Define $\hat{p}_1 = x_1/n_1$ and $\hat{p}_2 = x_2/n_2$. The pivot
\[
\frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}
\]
has distribution N(0,1), and we find
\[
1 - \alpha = \Pr\left\{\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} < p_1 - p_2 < \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\right\}.
\]
By estimating $p_1$ and $p_2$ under the square root by $\hat{p}_1$ and $\hat{p}_2$, we obtain the confidence interval
\[
\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} < p_1 - p_2 < \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}.
\]
Example 9. Confidence interval for the mean of an exponential distribution. Suppose we have a random sample $X_1, \dots, X_n$ from the exponential distribution with mean $\theta$. We take as pivot the random variable
\[
\frac{2}{\theta}\sum_{i=1}^n X_i,
\]
which has the $\chi^2$ distribution with 2n degrees of freedom (check its moment generating function). We therefore get the confidence interval
\[
\frac{2\sum_{i=1}^n x_i}{\chi^2_{\alpha/2,2n}} < \theta < \frac{2\sum_{i=1}^n x_i}{\chi^2_{1-\alpha/2,2n}}.
\]
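A sketch of this interval in Python (assuming numpy and scipy; the data are simulated, and the choice of theta_true, n and alpha is arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    theta_true, n, alpha = 4.0, 15, 0.05
    x = rng.exponential(theta_true, size=n)

    total = 2 * x.sum()
    lower = total / stats.chi2.ppf(1 - alpha/2, 2*n)   # chi2_{alpha/2,2n} is the upper critical value
    upper = total / stats.chi2.ppf(alpha/2, 2*n)       # chi2_{1-alpha/2,2n}
    print(lower, upper)                                # 95% confidence interval for theta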
Extra Notes - Large Sample Confidence Intervals. Here, we assume our estimator $\hat{\theta}$ is such that for large sample size n,
\[
\frac{\hat{\theta} - \theta}{\sigma_{\hat{\theta}}}
\]
has approximately the distribution N(0,1), where $\sigma_{\hat{\theta}}$ is the standard error (the standard deviation of $\hat{\theta}$). Then we can write
\begin{align*}
1 - \alpha &= P\!\left[-z_{\alpha/2} \le \frac{\hat{\theta} - \theta}{\sigma_{\hat{\theta}}} \le z_{\alpha/2}\right] = P[-z_{\alpha/2}\sigma_{\hat{\theta}} \le \hat{\theta} - \theta \le z_{\alpha/2}\sigma_{\hat{\theta}}]\\
&= P[\hat{\theta} - z_{\alpha/2}\sigma_{\hat{\theta}} \le \theta \le \hat{\theta} + z_{\alpha/2}\sigma_{\hat{\theta}}],
\end{align*}
resulting in the $(1-\alpha)\times 100\%$ c.i.
\[
\hat{\theta} - z_{\alpha/2}\sigma_{\hat{\theta}} \le \theta \le \hat{\theta} + z_{\alpha/2}\sigma_{\hat{\theta}}.
\]
Bayesian Credible Sets. The Bayesian analog of a classical confidence interval is called a credible set.

Definition. A $(1-\alpha)\times 100\%$ credible set for $\theta$ is a subset C of $\Theta$ such that
\[
1 - \alpha \le P(C\mid x) = \begin{cases} \displaystyle\int_C \pi(\theta\mid x)\,d\theta & \text{(continuous case),}\\[6pt] \displaystyle\sum_{\theta\in C} \pi(\theta\mid x) & \text{(discrete case).} \end{cases}
\]
Obviously, there can be many such sets. One usually looks for the one that has minimal length.
Chapter 4
Theory of Hypothesis Testing
Reference: WMS 7th ed., chapter 10
4.1 Introduction and Definitions.
We have an observation X from a distribution F belonging to a family $\mathcal{F}$ of distributions. Let $\mathcal{F}_0 \subset \mathcal{F}$, and let $\mathcal{F}_1 = \mathcal{F}\setminus\mathcal{F}_0$. Then based on X, we want to decide whether $F\in\mathcal{F}_0$ or $F\in\mathcal{F}_1$; that is, we wish to decide which of the hypotheses
\[
H_0: F\in\mathcal{F}_0 \qquad\qquad H_1: F\in\mathcal{F}_1
\]
is true. $H_0$ is called the null hypothesis, and $H_1$ the alternative hypothesis. If $\mathcal{F}$, $\mathcal{F}_0$, and $\mathcal{F}_1$ are parametrized as $\mathcal{F} = \{F_\theta, \theta\in\Theta\}$, $\mathcal{F}_0 = \{F_\theta, \theta\in\Theta_0\}$, and $\mathcal{F}_1 = \{F_\theta, \theta\in\Theta_1\}$, where $\Theta_0\cap\Theta_1 = \emptyset$ and $\Theta_0\cup\Theta_1 = \Theta$, these two hypotheses can be (and usually are) equivalently written as
\[
H_0: \theta\in\Theta_0 \qquad\qquad H_1: \theta\in\Theta_1.
\]
The only way to base a decision on X is to choose a subset $C \subset R_X$, called the critical region (or region of rejection) for the test, with the understanding that if $X\in C$, then $H_0$ will be rejected (and $H_1$ accepted), and if $X\notin C$, then $H_0$ will be accepted (and $H_1$ rejected).
Because X is random, it is possible that the value x of X might lie in C even though $H_0$ is true. This would cause us to erroneously reject $H_0$. With this in mind, let us enumerate the four possible things that can happen in a test, as shown in the following table.

                      H0 is true          H1 is true
  Accept H0       correct decision      type II error
  Accept H1        type I error       correct decision
When $H_0$ is true and the test procedure causes us to reject $H_0$ (that is, if the value x of X is in C), we have made an error of type I. When $H_1$ is true and we are led to accept $H_0$ (that is, if $X\notin C$), we have made an error of type II. Of great importance will be the probability
\[
\alpha = \Pr_{H_0}\{X\in C\}
\]
of making a type I error, and the probability
\[
\beta = \Pr_{H_1}\{X\notin C\}
\]
of making a type II error. Here, the notation $P_H\{\cdot\}$ means: calculate the probability of the event $\{\cdot\}$ assuming H is true. The number $1 - \beta$ will be called the power of the test.

If in a hypothesis $H: F\in\mathcal{C}$, the class $\mathcal{C}$ consists of a single distribution, then H is called a simple hypothesis. Otherwise, H is a composite hypothesis.
Example 1.1. Suppose an urn is filled with 7 marbles, of which $\theta$ are red and the rest are green. We want to test the hypotheses
\[
H_0: \theta = 3 \qquad\qquad H_1: \theta = 5.
\]
Both of these are simple hypotheses: each specifies a specific value of $\theta$. To carry out the test, we select a random sample of size three, without replacement. The rule will be: reject $H_0$ (and accept $H_1$) if at least two of the marbles in the sample are red; otherwise accept $H_0$ (and reject $H_1$).
Discussion. The sample space here is the set of $\binom{7}{3}$ triples of the form $R_1R_3G_1$, and the critical region is that part of it consisting of all triples with at least two R's. X is the particular triple in S that occurs. Let Y denote the number of red marbles in the sample. Y is determined from X and so is a statistic. Certainly Y is zero if $\theta$ is zero, and is three if $\theta$ is seven. Otherwise, the probability function of Y is given by
\[
\Pr\{Y = y\} = \frac{\binom{\theta}{y}\binom{7-\theta}{3-y}}{\binom{7}{3}}, \qquad y = 0, 1, \dots, \min(3, \theta).
\]
Let us calculate the probability of making type I and type II errors. The probability of a type I error is
\[
\alpha = \Pr_{H_0}\{X\in C\} = \Pr_{\theta=3}\{Y \ge 2\} = \frac{\binom{3}{2}\binom{4}{1}}{\binom{7}{3}} + \frac{\binom{3}{3}\binom{4}{0}}{\binom{7}{3}} = \frac{13}{35}
\]
and the probability of a type II error is
\[
\beta = \Pr_{H_1}\{X\notin C\} = \Pr_{\theta=5}\{Y < 2\} = \frac{\binom{5}{1}\binom{2}{2}}{\binom{7}{3}} = \frac{5}{35}.
\]
The power of the test is 30/35. Of course, a different rule would lead to different values for $\alpha$ and $\beta$. If we think that $\alpha$ is too high, we could change the rule to: reject $H_0$ if Y = 3. In that case, $\alpha$ would be 1/35, but $\beta$ would increase. To decrease both $\alpha$ and $\beta$ simultaneously, we would have to increase the sample size. Of course, when the sample size is 7, both $\alpha$ and $\beta$ are zero since no error can be made.
Example 1.2. Suppose we have a normal distribution $N(\mu, \sigma^2)$ with known variance $\sigma^2 = 625$, and we want to test the hypotheses
\[
H_0: \mu = 100 \qquad\qquad H_1: \mu = 70.
\]
We select a simple random sample of size 10 from this distribution. The rule is: reject $H_0$ if $\bar{X} < 80$.
Discussion. Here again, both hypotheses are simple. The observation is a simple random sample of size 10, the sample space S is the set of all ten-tuples $(x_1, \dots, x_{10})$ of real numbers, and the critical region is $C = \{(x_1, \dots, x_{10})\in S \mid \bar{x} < 80\}$. The probability of making a type I error is
\[
\alpha = \Pr_{H_0}\{X\in C\} = \Pr_{\mu=100}\{\bar{X} < 80\} = \Pr\left\{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < \frac{80 - 100}{25/\sqrt{10}}\right\} = \Pr\{Z < -2.53\} = 0.0057
\]
and the probability of a type II error is
\[
\beta = \Pr_{H_1}\{X\notin C\} = \Pr_{\mu=70}\{\bar{X} \ge 80\} = \Pr\left\{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \ge \frac{80 - 70}{25/\sqrt{10}}\right\} = \Pr\{Z \ge 1.26\} = 0.1040.
\]
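These two probabilities can be computed directly, as in the following Python sketch (assuming scipy and numpy):

    import numpy as np
    from scipy import stats

    sigma, n = 25.0, 10
    se = sigma / np.sqrt(n)
    alpha = stats.norm.cdf((80 - 100) / se)    # P(Xbar < 80 | mu = 100), about 0.0057
    beta = 1 - stats.norm.cdf((80 - 70) / se)  # P(Xbar >= 80 | mu = 70), about 0.10
                                               # (the notes round z to 1.26 and report 0.1040)
    print(alpha, beta)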
Example 1.3. Suppose we have the same urn problem as in Example 1.1, but this time we wish to test the composite hypotheses
\[
H_0: \theta \le 3 \qquad\qquad H_1: \theta > 3. \tag{4.1}
\]
As before, we select a sample of size 3, without replacement, and we reject $H_0$ if the number Y of red balls in the sample is at least two. What are the probabilities of type I and type II errors?

Since knowing that $H_0$ is true does not pinpoint a particular value of $\theta$, we must calculate an $\alpha$ value, to be denoted by $\alpha(\theta)$, for each value of $\theta$ with $\theta \le 3$; that is, for each value of $\theta$ assumed under $H_0$. Similarly, we must calculate a value $\beta(\theta)$ for each value of $\theta$ assumed under $H_1$.
Definition. The function
\[
P(\theta) = P_\theta(\text{reject } H_0) = \begin{cases} \alpha(\theta) & \text{if } \theta \text{ is assumed under } H_0,\\ 1 - \beta(\theta) & \text{if } \theta \text{ is assumed under } H_1, \end{cases}
\]
is called the power function of the test. If $\alpha(\theta) \le \alpha$ for all values of $\theta$ assumed under $H_0$, then the test is said to be of level $\alpha$. The number $\max\{\alpha(\theta) : \theta \text{ is assumed under } H_0\}$ is called the size of the test. (These definitions are as given in the graduate texts Casella and Berger, p. 385, and Shao, p. 126.)
Remarks. WMS use some different terminology.
(1) They would write the hypotheses in (4.1) as
\[
H_0: \theta = 3 \qquad\qquad H_1: \theta > 3,
\]
and the hypotheses in (4.2) as
\[
H_0: \mu = 100 \qquad\qquad H_1: \mu < 100.
\]
This is justified if, for example, $\max_{\mu\ge 100}\alpha(\mu) = \alpha(100)$.
(2) WMS write their null hypotheses as simple, say as in
\[
H_0: \theta = \theta_0 \qquad\qquad H_1: \theta < \theta_0.
\]
They then define $\alpha(\theta_0)$ to be the level of the test. Thus, they confuse size and level.
Let us now return to Example 1.3 and compute the power function for the test. We have
\[
P(\theta) = P_\theta(\text{reject } H_0) = \Pr_\theta\{Y = 2\} + \Pr_\theta\{Y = 3\},
\]
which can be computed from the probability function of Y given in Example 1.1. The results are in the following table:

  theta      0    1     2      3      4      5     6    7
  P(theta)   0    0   5/35   13/35  22/35  30/35   1    1
Example 1.4. Suppose, as in Example 1.2, we have a normal distribution $N(\mu, 625)$ and we wish to test the composite hypotheses
\[
H_0: \mu \ge 100 \qquad\qquad H_1: \mu < 100. \tag{4.2}
\]
As before, we select a simple random sample of size 10 from this distribution and reject $H_0$ if $\bar{X} < 80$. Find the power function for this test.
Solution. We have
\[
P(\mu) = \Pr_\mu\{\bar{X} < 80\} = \Pr\left\{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < \frac{80 - \mu}{25/\sqrt{10}}\right\} = \Pr\left\{Z < \frac{80 - \mu}{25/\sqrt{10}}\right\},
\]
where $Z \sim N(0,1)$. We get the following table for selected values of $\mu$:

  mu       50     60      70      75      80      85      90     100    110
  P(mu)     1   .9941   .8962   .7357   .5000   .2643   .1038   .0059    0

[Figure: graph of the power function P(mu) against mu, decreasing from 1 near mu = 50 to 0 near mu = 110.]
Example 1.5. For the test of Example 1.2, find a critical region of size $\alpha = .05$.
Solution. The critical region is of the form $\{\bar{X} \le k\}$, where we have to find k. We have
\[
.05 = \alpha = \Pr_{\mu=100}\{\bar{X} \le k\} = \Pr_{\mu=100}\left\{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < \frac{k - 100}{25/\sqrt{10}}\right\} = \Pr\left\{Z < \frac{k - 100}{25/\sqrt{10}}\right\},
\]
so
\[
\frac{k - 100}{25/\sqrt{10}} = -z_{.05} = -1.645.
\]
Solving, we find k = 87. Hence, in order that the level of the test be .05, the rule should be: reject $H_0$ if $\bar{x} \le 87$. In this case we also have
\[
\beta = \Pr_{\mu=70}\{\bar{X} > 87\} = \Pr\left\{Z > \frac{87 - 70}{25/\sqrt{10}}\right\} = \Pr\{Z > 2.15\} = .016.
\]
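The same calculation in Python (a sketch assuming scipy and numpy):

    import numpy as np
    from scipy import stats

    sigma, n, alpha = 25.0, 10, 0.05
    se = sigma / np.sqrt(n)
    k = 100 + stats.norm.ppf(alpha) * se       # 100 - 1.645*se, about 87
    beta = 1 - stats.norm.cdf((k - 70) / se)   # about 0.016
    print(k, beta)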
4.2 How to Choose the Critical Region - Case of Simple Hypotheses.
In the examples of Section 4.1, the critical region (or equivalently the rule) was chosen sensibly, but with no particular method. In this section, we will learn how to choose good critical regions when both hypotheses are simple.

We have noticed in the examples of Section 4.1 that different critical regions give different values of $\alpha$ and $\beta$, and that changing the critical region to decrease $\alpha$ has the effect of increasing $\beta$, and vice-versa. A good critical region should be one for which both $\alpha$ and $\beta$ are as small as possible, but it seems we cannot make both simultaneously small (unless the sample size is increased). Hence our method will be as follows:
(1) Decide beforehand on a suitable value of $\alpha$. Typical values are .01 or .05.
(2) Among all tests (i.e. critical regions) of level $\alpha$, choose one that has maximal power (i.e. minimal $\beta$). This is called a most powerful test (critical region) of level $\alpha$.

In the remainder of this section, $H_0$ and $H_1$ denote two simple hypotheses, and $f_0(x)$ and $f_1(x)$ denote the likelihood (i.e. density or probability) functions of the observation X under $H_0$ and $H_1$ respectively.
Example 2.1. Two hypotheses are to be tested on the basis of an observation X with range set {0, 1, 2, ..., 5, 6}. f_0(x) and f_1(x) are given in the following table.

x        0    1    2    3     4     5     6
f_0(x)   .2   .1   .1   .05   .25   .3    0
f_1(x)   .1   .1   .2   .2    .1    .15   .15

Find a most powerful critical region of level α = .3.
Solution. The critical regions of level .3 are as follows. In each case, the powers (1 − β's) for these are given in parentheses.
Size = .3 :  {0,1}(.2), {0,1,6}(.35), {0,2}(.3), {0,2,6}(.45), {3,4}(.3), {3,4,6}(.45), {5}(.15), {5,6}(.3)
Size = .25 : {0,3}(.3), {4}(.1), {4,6}(.25), {1,2,3}(.5)
Size = .2 :  {0}(.1), {0,6}(.25), {1,2}(.3)
Size = .15 : {1,3}(.3), {2,3}(.4)
Size = .1 :  {1}(.1), {2}(.2)
Size = .05 : {3}(.2).
Hence either {0,2,6} or {3,4,6} is a most powerful critical region of size .3. In both cases, the power is .45. The most powerful critical region of level .3 is {1,2,3}. It has size .25 and power .5.
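For a small discrete problem like this one, the search over critical regions can be done by brute force. The following sketch (not part of the original notes) enumerates all subsets of the range set, keeps those of level .3, and reports the most powerful one:

from itertools import combinations

f0 = {0: .2, 1: .1, 2: .1, 3: .05, 4: .25, 5: .3, 6: 0}
f1 = {0: .1, 1: .1, 2: .2, 3: .2, 4: .1, 5: .15, 6: .15}
candidates = []
for r in range(1, 8):
    for C in combinations(range(7), r):
        size = sum(f0[x] for x in C)
        power = sum(f1[x] for x in C)
        if size <= .3 + 1e-9:                      # level-.3 critical regions only
            candidates.append((round(power, 2), round(size, 2), C))
print(max(candidates))                             # -> (0.5, 0.25, (1, 2, 3))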
In general, there may be infinitely many critical regions of size α, so how do we find a most powerful one? We reason as follows. For an x ∈ C, in which case we are going to reject H_0, f_0(x) should be small, while f_1(x) should be relatively large. Hence x ∈ C should correspond to small values of f_0(x)/f_1(x). This is the thinking behind the Neyman-Pearson Lemma.

Theorem 4.2.1 (The Neyman-Pearson Lemma) Let C be a critical region of size α. If there exists a constant k > 0 such that
f_0(x) ≤ k f_1(x) for x ∈ C,     (4.3)
f_0(x) > k f_1(x) for x ∉ C,     (4.4)
then C is a most powerful critical region of level (i.e. at most size) α.
Proof. Let C be such a critical region, and let D be any other critical region of level α. Then
∫_{C∩D} f_0(x) dx + ∫_{C\D} f_0(x) dx = ∫_C f_0(x) dx = α ≥ ∫_D f_0(x) dx = ∫_{C∩D} f_0(x) dx + ∫_{D\C} f_0(x) dx,
so that
∫_{C\D} f_0(x) dx ≥ ∫_{D\C} f_0(x) dx.
Using this with (4.3) and (4.4), we then have
1 − β_C = ∫_C f_1(x) dx = ∫_{C∩D} f_1(x) dx + ∫_{C\D} f_1(x) dx
  ≥ ∫_{C∩D} f_1(x) dx + ∫_{C\D} f_0(x)/k dx
  ≥ ∫_{C∩D} f_1(x) dx + ∫_{D\C} f_0(x)/k dx
  ≥ ∫_{C∩D} f_1(x) dx + ∫_{D\C} f_1(x) dx = ∫_D f_1(x) dx = 1 − β_D.
The Neyman-Pearson lemma not only tells us when a critical region is most powerful; it also tells us exactly how to construct a most powerful critical region of size α. What we do is:
(1) Step 1. Define C = {x : f_0(x) ≤ k f_1(x)}, and then
(2) Step 2. Find k from the relation Pr_{H_0}{X ∈ C} = α.
Note that the test constructed has size α, but will be the most powerful test of level α.
Example 2.2. We have a normal distribution N(μ, σ²) with known variance σ². Based on a random sample X_1, ..., X_n, we wish to test
H_0: μ = μ_0
H_1: μ = μ_1
where μ_1 < μ_0. Find a most powerful critical region of size α.
Solution. As usual, the likelihood function is
f_μ(x) = f(x_1, ..., x_n) = (1/(σ√(2π)))^n exp{ −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² } = (1/(σ√(2π)))^n exp{ −(1/(2σ²)) [ Σ_{i=1}^n (x_i − x̄)² + n(μ − x̄)² ] },
and so
f_0(x)/f_1(x) = exp{ −(n/(2σ²)) [ (μ_0 − x̄)² − (μ_1 − x̄)² ] } = exp{ (n(μ_0 − μ_1)/(2σ²)) [ 2x̄ − (μ_0 + μ_1) ] }.
Hence
C = { f_0(x)/f_1(x) ≤ k } = { (n(μ_0 − μ_1)/(2σ²)) [ 2x̄ − (μ_0 + μ_1) ] ≤ log k } = { 2x̄ − (μ_0 + μ_1) ≤ k'' } = { x̄ ≤ k* },
where we used the fact that μ_1 < μ_0. This tells us the form of the critical region. Now we find k*. We have
α = Pr_{μ=μ_0}{ X̄ ≤ k* } = Pr_{μ=μ_0}{ (X̄ − μ_0)/(σ/√n) ≤ (k* − μ_0)/(σ/√n) } = Pr{ Z ≤ (k* − μ_0)/(σ/√n) },
which implies that
(k* − μ_0)/(σ/√n) = −z_α.
Solving, we find that k* = μ_0 − z_α σ/√n. The rule for the most powerful test is therefore: reject H_0 at level α if x̄ ≤ μ_0 − z_α σ/√n.
Alternate Version of the Test. We can also write
C = { (X̄ − μ_0)/(σ/√n) ≤ k* } = { Z ≤ k* },
and find k* from α = Pr{Z ≤ k*}. This gives k* = −z_α, and the test is: calculate
z = (x̄ − μ_0)/(σ/√n)
and reject H_0 at level α if z ≤ −z_α.
Remark. As in example 2.2, suppose we use the Neyman-Pearson method to find a critical region C of size α which is most powerful of level α for the hypotheses
H_0: μ = μ_0
H_1: μ = μ_1,
and that C is independent of μ_1 for μ_1 ∈ Ω_1 (so that C is most powerful of level α for testing this pair of hypotheses for any μ_1 ∈ Ω_1). Then C is uniformly most powerful of level α for the hypotheses
H_0: μ = μ_0
H_1: μ ∈ Ω_1.
That is, if P_C(μ) is the power function of C and P_D(μ) is the power function of any other critical region D of level α, then P_C(μ) ≥ P_D(μ) for all μ ∈ Ω_1. If moreover the null hypothesis is H_0: μ ∈ Ω_0 and α(μ) ≤ α for all μ ∈ Ω_0, then C is UMP for testing
H_0: μ ∈ Ω_0
H_1: μ ∈ Ω_1.
Rationale: For if D is another critical region for this test with P_{μ_0}[X ∈ D] ≤ α, and if μ* ∈ Ω_1, then C and D are both critical regions for testing
H_0: μ = μ_0
H_1: μ = μ*,
so the power of D (at μ*) is at most the power of C there. That is, P_C(μ*) ≥ P_D(μ*). This is true for all μ* ∈ Ω_1.
Example 2.2 again. The size α critical region x̄ ≤ μ_0 − z_α σ/√n obtained in example 2.2 is the same for any μ_1 < μ_0, so it is UMP of level α for testing
H_0: μ = μ_0
H_1: μ < μ_0,
or equivalently
H_0: μ ≥ μ_0
H_1: μ < μ_0.
Note: this method will not allow us to handle the two-sided test
H_0: μ = μ_0
H_1: μ ≠ μ_0.
That is handled in the next section.
Example 2.3. We observe a binomial random variable with parameter θ. Find a most powerful critical region of size α for testing
H_0: θ = θ_0
H_1: θ = θ_1
where θ_1 > θ_0.
Solution. We have f_θ(x) = C(n, x) θ^x (1 − θ)^{n−x} for x = 0, 1, ..., n, and so
C = { x : C(n, x) θ_0^x (1 − θ_0)^{n−x} ≤ k C(n, x) θ_1^x (1 − θ_1)^{n−x} } = { x : (θ_0/θ_1)^x ((1 − θ_0)/(1 − θ_1))^{n−x} ≤ k }
  = { x : x [ log(θ_0/θ_1) + log((1 − θ_1)/(1 − θ_0)) ] ≤ log k − n log((1 − θ_0)/(1 − θ_1)) }
  = { x : x ≥ [ log k − n log((1 − θ_0)/(1 − θ_1)) ] / [ log(θ_0/θ_1) + log((1 − θ_1)/(1 − θ_0)) ] }.
Note that the inequality changed direction because log(θ_0/θ_1) + log((1 − θ_1)/(1 − θ_0)) < 0, due to the assumption θ_1 > θ_0.
Thus C is of the form C = { x ≥ k* }, where k* is such that α = Pr_{θ=θ_0}{X ≥ k*}. There is no nice formula for k* in this case, and because X is discrete, we cannot always ensure a test of exactly size α. Hence we will define
k_α(θ) = the smallest value of k such that Pr_θ{X ≥ k} ≤ α.
The reason for choosing the smallest is to maximize the power. Then the test is: reject H_0 at level α if x ≥ k_α(θ_0). For a numerical example, suppose that n = 20, α = .05, and we want to test
H_0: θ = .3
H_1: θ = .5.
From binomial tables, or using a calculator, one finds that k_{.05}(.3) = 10. So the test is: reject H_0 if x ≥ 10.
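The critical value k_{.05}(.3) = 10 can be found directly from the binomial tail probabilities. A minimal sketch (not part of the original notes), assuming scipy is available:

from scipy.stats import binom

n, theta0, alpha = 20, 0.3, 0.05
# smallest k with P_{theta0}(X >= k) <= alpha; binom.sf(k-1, n, p) is P(X >= k)
k = min(k for k in range(n + 1) if binom.sf(k - 1, n, theta0) <= alpha)
print(k, round(binom.sf(k - 1, n, theta0), 4))   # k = 10, attained size about .048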
Example 2.4. We have a normal distribution N(μ, σ²) with known mean μ. Based on a random sample X_1, ..., X_n, we wish to test
H_0: σ² = σ_0²
H_1: σ² = σ_1²
where σ_1² > σ_0². Find a most powerful critical region of size α.
Solution. The likelihood function is
f_{σ²}(x) = f(x_1, ..., x_n) = (1/(2πσ²))^{n/2} exp{ −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² },
and so
f_{σ_0²}(x)/f_{σ_1²}(x) = (σ_1²/σ_0²)^{n/2} exp{ −(1/2)(1/σ_0² − 1/σ_1²) Σ_{i=1}^n (x_i − μ)² }.
After simplifying, we get C = { σ̂² ≥ k }, where σ̂² = (1/n) Σ_{i=1}^n (x_i − μ)². Step 2 becomes
α = P_{H_0}[X ∈ C] = P_{σ²=σ_0²}[ σ̂² ≥ k ] = P_{σ²=σ_0²}[ nσ̂²/σ_0² ≥ k* ],
where nσ̂²/σ_0² ~ χ²_n, implying that k* = χ²_{α,n}. Hence the test is: reject H_0 at level α if nσ̂²/σ_0² ≥ χ²_{α,n}.
4.3 How to Choose the Critical Region - Case of Composite Hypotheses.
In this section, we have an observation X from a distribution with parameter θ, and we want to test
H_0: θ ∈ Ω_0
H_1: θ ∈ Ω_1
where Ω is the set of parameter values θ, and Ω = Ω_0 ∪ Ω_1 disjoint. If C is a critical region for the test, then
P_C(θ) = Pr_θ(reject H_0) = Pr_θ(X ∈ C) = { α(θ) if θ ∈ Ω_0,   1 − β(θ) if θ ∈ Ω_1 }
is the power function corresponding to C.
Definition. Let C and D be two critical regions of level α (i.e. both size(C) and size(D) are less than or equal to α). We say that C is uniformly more powerful than D if P_C(θ) ≥ P_D(θ) for all θ ∈ Ω_1. A test (or critical region) of level α is uniformly most powerful if it is uniformly more powerful than any other test of level α.
As in the previous section, we decide beforehand on a value α for the level of the test. Then we search among all critical regions for one of size less than or equal to α which is uniformly most powerful. Unfortunately, such uniformly most powerful tests do not always exist. The following method of finding critical regions, called the likelihood ratio method, tends to give tests with excellent qualities.
The Likelihood Ratio Method. Let f_θ(x) denote the density or probability function of X. Then
λ(x) = max_{θ∈Ω_0} f_θ(x) / max_{θ∈Ω} f_θ(x)
is called the likelihood ratio statistic. We take C = { x : λ(x) ≤ k } as the critical region, where k is chosen so that max_{θ∈Ω_0} Pr_θ{X ∈ C} = α.
Remark. When the hypotheses are simple, we have Ω = {θ_0, θ_1} and Ω_0 = {θ_0}. It is easy to show that the C defined in the previous paragraph is the same as the C which results from the Neyman-Pearson lemma.
Example 3.1. Given a random sample X_1, ..., X_n from N(μ, σ²) where σ² is known, find the likelihood ratio test of size α for testing
H_0: μ = μ_0
H_1: μ ≠ μ_0.
Solution. Here we have Ω_0 = {μ_0}, Ω = ℝ, and
f_μ(x) = f(x_1, ..., x_n) = (1/(σ√(2π)))^n exp{ −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² }.
We have
max_{μ∈Ω_0} f_μ(x) = f_{μ_0}(x) = (1/(σ√(2π)))^n exp{ −(1/(2σ²)) Σ_{i=1}^n (x_i − μ_0)² },
max_{μ∈Ω} f_μ(x) = f_{x̄}(x) = (1/(σ√(2π)))^n exp{ −(1/(2σ²)) Σ_{i=1}^n (x_i − x̄)² },
and using our identity
Σ_{i=1}^n (x_i − μ)² = Σ_{i=1}^n (x_i − x̄)² + n(x̄ − μ)²,
we get
λ(x) = exp{ −n(x̄ − μ_0)²/(2σ²) }.
This gives
C = { λ ≤ k } = { n(x̄ − μ_0)² ≥ −2σ² log k } = { |x̄ − μ_0|/(σ/√n) ≥ k* }.
We find k* from
α = Pr_{μ=μ_0}{ |X̄ − μ_0|/(σ/√n) ≥ k* },
which gives k* = z_{α/2}. Hence the test is: calculate
z = (x̄ − μ_0)/(σ/√n)
and reject H_0 if |z| ≥ z_{α/2}.
Example 3.2. Same as example 3.1, but this time the hypotheses are
H_0: μ ≤ μ_0
H_1: μ > μ_0.
Solution. This time
λ(x) = max_{μ≤μ_0} f_μ(x) / max_{−∞<μ<∞} f_μ(x).
If we write
f_μ(x) = (1/(σ√(2π)))^n exp{ −(1/(2σ²)) [ Σ_{i=1}^n (x_i − x̄)² + n(μ − x̄)² ] },
we see that f_μ(x) has its maximum when (μ − x̄)² has its minimum. Subject to the condition μ ≤ μ_0, (μ − x̄)² has its minimum at
μ = { μ_0 if x̄ ≥ μ_0,   x̄ if x̄ ≤ μ_0 }.
Hence
λ(x) = { f_{μ_0}(x)/f_{x̄}(x) = exp{ −n(x̄ − μ_0)²/(2σ²) } if x̄ ≥ μ_0,   f_{x̄}(x)/f_{x̄}(x) = 1 if x̄ ≤ μ_0 },
and so, because k must be less than one in order to have size < 1,
C = { λ(x) ≤ k, x̄ ≥ μ_0 } ∪ { λ(x) ≤ k, x̄ < μ_0 } = { exp{ −n(x̄ − μ_0)²/(2σ²) } ≤ k, x̄ ≥ μ_0 }
  = { |x̄ − μ_0|/(σ/√n) ≥ k', x̄ ≥ μ_0 } = { (x̄ − μ_0)/(σ/√n) ≥ k*, x̄ ≥ μ_0 },
where k* ≥ 0. Now we have to determine k*, using the fact that the test is to have level α. Since, for μ ≤ μ_0,
Pr_μ{ (X̄ − μ_0)/(σ/√n) ≥ k* }
is increasing in μ, then
α = max_{μ≤μ_0} Pr_μ{ (X̄ − μ_0)/(σ/√n) ≥ k* } = Pr_{μ_0}{ (X̄ − μ_0)/(σ/√n) ≥ k* },
implying that k* = z_α.
Example 3.3. Given a random sample X_1, ..., X_n from N(μ, σ²) where σ² is unknown, find the likelihood ratio test of size α for testing
H_0: μ = μ_0
H_1: μ ≠ μ_0.
Solution. This time we have
λ(x) = max_{μ=μ_0, σ²>0} f_{μ,σ²}(x) / max_{−∞<μ<∞, σ²>0} f_{μ,σ²}(x) = f_{μ_0, σ̃²}(x) / f_{x̄, σ̂²}(x),
where
σ̃² = (1/n) Σ_{i=1}^n (x_i − μ_0)²,    σ̂² = (1/n) Σ_{i=1}^n (x_i − x̄)².
Thus,
λ(x) = ( σ̂²/σ̃² )^{n/2} = [ Σ_{i=1}^n (x_i − x̄)² / Σ_{i=1}^n (x_i − μ_0)² ]^{n/2} = [ 1 + n(x̄ − μ_0)²/Σ_{i=1}^n (x_i − x̄)² ]^{−n/2} = [ 1 + t²/(n − 1) ]^{−n/2},
where
t = (x̄ − μ_0)/(s/√n).
Hence the critical region is
C = { [1 + t²/(n − 1)]^{−n/2} ≤ k } = { [1 + t²/(n − 1)]^{n/2} ≥ k^{−1} } = { |t| ≥ k* }.
Since α = Pr_{μ=μ_0}{ |t| ≥ k* }, then k* = t_{α/2,n−1}. The test is:
reject H_0 at level α if | (X̄ − μ_0)/(s/√n) | ≥ t_{α/2,n−1}.
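This is the usual one-sample t test. A minimal sketch (not part of the original notes, with purely illustrative data) using scipy:

import numpy as np
from scipy import stats

x = np.array([10.2, 9.4, 11.1, 10.8, 9.9, 10.5])   # hypothetical observations
mu0 = 10.0
t, p = stats.ttest_1samp(x, mu0)                   # two-sided test of H0: mu = mu0
crit = stats.t.ppf(1 - 0.05 / 2, len(x) - 1)       # t_{alpha/2, n-1} for alpha = .05
print(round(t, 3), round(p, 3), round(crit, 3))    # reject H0 if |t| >= crit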
Example 3.4. Given a random sample X_1, ..., X_n from N(μ, σ²) where μ is unknown, find the likelihood ratio test of size α for testing
H_0: σ² = σ_0²
H_1: σ² ≠ σ_0².
Solution. This time we have
λ(x) = max_{−∞<μ<∞, σ²=σ_0²} f_{μ,σ²}(x) / max_{−∞<μ<∞, σ²>0} f_{μ,σ²}(x) = f_{x̄, σ_0²}(x) / f_{x̄, σ̂²}(x),
where
σ̂² = (1/n) Σ_{i=1}^n (x_i − x̄)².
Thus,
λ(x) = [ (1/(σ_0√(2π)))^n exp{ −(1/(2σ_0²)) Σ_{i=1}^n (x_i − x̄)² } ] / [ (1/(σ̂√(2π)))^n exp{ −(1/(2σ̂²)) Σ_{i=1}^n (x_i − x̄)² } ]
  = ( σ̂²/σ_0² )^{n/2} exp{ −(1/(2σ_0²)) Σ_{i=1}^n (x_i − x̄)² } e^{n/2}
  = [ (n − 1)s²/(nσ_0²) ]^{n/2} exp{ −(n − 1)s²/(2σ_0²) } e^{n/2},
where
s² = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)² = (n/(n − 1)) σ̂².
Let us write
Y = (n − 1)s²/σ_0² ~ χ²_{n−1} under H_0.
Then
λ(x) = Y^{n/2} e^{−Y/2} (e/n)^{n/2},
so the critical region is of the form
C = { Y^{n/2} e^{−Y/2} ≤ k } = { Y ≤ k_1 } ∪ { Y ≥ k_2 }
(the function y^{n/2}e^{−y/2} is small only when y is near 0 or y is large). Hence α = P_{H_0}(C) = P_{H_0}{Y ≤ k_1} + P_{H_0}{Y ≥ k_2}. We will choose k_1 and k_2 so that
P_{H_0}{Y ≤ k_1} = α/2 = P_{H_0}{Y ≥ k_2}.
This gives k_1 = χ²_{1−α/2,n−1} and k_2 = χ²_{α/2,n−1}. Thus the level-α test is: calculate
Y = (n − 1)s²/σ_0²
and reject H_0 if either Y ≤ χ²_{1−α/2,n−1} or Y ≥ χ²_{α/2,n−1}.
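The equal-tail version of this variance test is easy to code. A sketch (not part of the original notes); note that scipy's chi2.ppf(q, df) returns the lower q-quantile, so the notes' χ²_{α/2,n−1} corresponds to chi2.ppf(1 − α/2, n − 1):

import numpy as np
from scipy import stats

def var_test(x, sigma0_sq, alpha=0.05):
    # two-sided (equal-tail) test of H0: sigma^2 = sigma0^2 for a normal sample x
    n = len(x)
    Y = (n - 1) * np.var(x, ddof=1) / sigma0_sq       # (n-1)s^2 / sigma0^2
    lo = stats.chi2.ppf(alpha / 2, n - 1)             # chi^2_{1-alpha/2, n-1} in the notes' notation
    hi = stats.chi2.ppf(1 - alpha / 2, n - 1)         # chi^2_{alpha/2, n-1}
    return Y, (Y <= lo) or (Y >= hi)                  # statistic and reject/accept decision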
4.4 Some Last Topics.
4.4.1 Large Sample Tests.
We have a random sample X_1, ..., X_n from F_θ. Suppose that for n large (≥ 25), the statistic (θ̂ − θ)/σ_{θ̂}, where σ_{θ̂} is the standard error (standard deviation of θ̂), has approximately the distribution N(0, 1). Then to test
H_0: θ = θ_0
H_1: θ ≠ θ_0   (θ > θ_0)   (θ < θ_0)
at level α, calculate z = (θ̂ − θ_0)/σ_{θ̂} and reject H_0 if
|z| > z_{α/2}   (z > z_α)   (z < −z_α).
4.4.2 p-value.
(Also called p-level or observed level of significance.) The p-value of a test is that value of α for which the observed value of the test statistic is on the border between accepting and rejecting H_0.
4.4.3 Bayesian Tests of Hypotheses.
In Bayesian statistics, one chooses between
H_0: θ ∈ Ω_0
H_1: θ ∈ Ω_1
by calculating P(θ ∈ Ω_0 | x) and P(θ ∈ Ω_1 | x), and deciding accordingly.
4.4.4 Relationship Between Tests and Confidence Sets.
Let E denote the set of possible values of the observation X, and let Ω be the set of all values of the parameter θ.
Definition. A confidence set C(X) for θ is a subset of Ω consisting of parameter values which are consistent with the observation X. The function P_θ{θ ∈ C(X)} is called the coverage probability. The confidence coefficient of C(X) is 1 − α = inf_θ P_θ{θ ∈ C(X)}.
In the chapter on confidence intervals, we had Ω ⊆ ℝ, and C(x) was an interval of the form [θ_L(x), θ_U(x)], or (−∞, θ_U(x)], or [θ_L(x), +∞). In this case we have interval estimators, or confidence intervals.
The following proposition shows that every test statistic corresponds to a confidence set, and vice-versa.
Proposition 4.4.1 For each θ_0 ∈ Ω, let A(θ_0) be the acceptance region (i.e. complement of the rejection region) for a size α test of H_0: θ = θ_0 against H_1: θ ≠ θ_0. For each x ∈ E, define
C(x) = { θ ∈ Ω | x ∈ A(θ) }.
Then C(X) is a 1 − α confidence set for θ. Conversely, for each x ∈ E, let C(x) be a 1 − α confidence set for θ. For each θ ∈ Ω, define
A(θ) = { x ∈ E : θ ∈ C(x) }.
Then A(θ_0) is an acceptance region for a size α test of H_0: θ = θ_0 against H_1: θ ≠ θ_0.
Proof. Since θ ∈ C(x) ⟺ x ∈ A(θ), and P_θ{X ∈ A(θ)^c} = α, then
inf_θ P_θ{θ ∈ C(X)} = inf_θ P_θ{X ∈ A(θ)} = 1 − sup_θ P_θ{X ∈ A(θ)^c},
so if one side is 1 − α, then so is the other.
Remarks. There is no guarantee that the confidence set obtained by this method will be connected, for example an interval if Ω ⊆ ℝ. But in most one-dimensional cases, one-sided tests give one-sided intervals and two-sided tests give two-sided intervals.
Example. Given a random sample X_1, ..., X_n from N(μ, σ²) with σ² unknown, find a confidence interval of the form [μ_L(x), μ_U(x)] for μ.
Solution. We will invert the test for H_0: μ = μ_0 against H_1: μ ≠ μ_0. The acceptance region of size α for this test is A(μ_0) = { x : |x̄ − μ_0|/(s/√n) ≤ t_{α/2,n−1} }, so it follows that
C(x) = { μ : |x̄ − μ|/(s/√n) ≤ t_{α/2,n−1} } = { μ : x̄ − t_{α/2,n−1} s/√n ≤ μ ≤ x̄ + t_{α/2,n−1} s/√n }.
Hence μ_L(x) = x̄ − t_{α/2,n−1} s/√n and μ_U(x) = x̄ + t_{α/2,n−1} s/√n.
Chapter 5
Hypothesis Testing: Applications
5.1 The Bivariate Normal Distribution.
Reference: WMS 7th ed., section 5.10
Definition. The random variables X and Y have a bivariate normal distribution if their joint density function is
f(x, y) = [ 1/(2π σ_x σ_y √(1 − ρ²)) ] exp{ −1/(2(1 − ρ²)) [ ((x − μ_x)/σ_x)² − 2ρ ((x − μ_x)/σ_x)((y − μ_y)/σ_y) + ((y − μ_y)/σ_y)² ] },   −∞ < x, y < ∞.
Recall that the marginal distributions of X and Y are N(μ_x, σ_x²) and N(μ_y, σ_y²), and that ρ is the correlation coefficient between X and Y.
In general, the random vector X = (X_1, ..., X_n) has the multivariate normal distribution N(μ, Σ) with mean vector μ and covariance matrix Σ (symmetric and positive definite) if X has joint density function given by
f(x) = [ 1/((2π)^{n/2} √(det Σ)) ] exp{ −(1/2)(x − μ)^T Σ^{−1} (x − μ) },   x ∈ ℝ^n.
Proposition 5.1.1 Let (X_1, Y_1), ..., (X_n, Y_n) be a random sample of size n from the bivariate normal distribution. Then the maximum likelihood estimators of μ_x, μ_y, σ_x², σ_y², and ρ are
μ̂_x = X̄ = (1/n) Σ_{i=1}^n X_i,   μ̂_y = Ȳ = (1/n) Σ_{i=1}^n Y_i,   σ̂_x² = (1/n) Σ_{i=1}^n (X_i − X̄)²,   σ̂_y² = (1/n) Σ_{i=1}^n (Y_i − Ȳ)²,
and
r = ρ̂ = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / √[ Σ_{i=1}^n (X_i − X̄)² · Σ_{i=1}^n (Y_i − Ȳ)² ].
r is called the sample correlation coefficient.
Proof. Using (1.1) and (1.2), we can write
Σ_{i=1}^n (x_i − μ_x)² = n σ̂_x² + n(x̄ − μ_x)²,    Σ_{i=1}^n (y_i − μ_y)² = n σ̂_y² + n(ȳ − μ_y)²,
Σ_{i=1}^n (x_i − μ_x)(y_i − μ_y) = n ρ̂ σ̂_x σ̂_y + n(x̄ − μ_x)(ȳ − μ_y).
The likelihood function is then
L = L(μ_x, μ_y, σ_x², σ_y², ρ) = Π_{i=1}^n f(x_i, y_i)
  = exp{ −1/(2(1 − ρ²)) Σ_{i=1}^n [ ((x_i − μ_x)/σ_x)² − 2ρ ((x_i − μ_x)/σ_x)((y_i − μ_y)/σ_y) + ((y_i − μ_y)/σ_y)² ] } / [ (2π σ_x σ_y)^n (1 − ρ²)^{n/2} ]
  = exp{ −n/(2(1 − ρ²)) [ σ̂_x²/σ_x² + (x̄ − μ_x)²/σ_x² − 2ρ ( ρ̂ σ̂_x σ̂_y + (x̄ − μ_x)(ȳ − μ_y) )/(σ_x σ_y) + σ̂_y²/σ_y² + (ȳ − μ_y)²/σ_y² ] } / [ (2π σ_x σ_y)^n (1 − ρ²)^{n/2} ].
(Note that x̄, ȳ, σ̂_x², σ̂_y², and ρ̂ are jointly sufficient.) Then
log L = −n log 2π − (n/2) log[ σ_x² σ_y² (1 − ρ²) ] − n/(2(1 − ρ²)) [ σ̂_x²/σ_x² + (x̄ − μ_x)²/σ_x² − 2ρ ( ρ̂ σ̂_x σ̂_y + (x̄ − μ_x)(ȳ − μ_y) )/(σ_x σ_y) + σ̂_y²/σ_y² + (ȳ − μ_y)²/σ_y² ].
Differentiating with respect to μ_x, μ_y, σ_x², σ_y², and ρ, setting the results equal to zero, and solving the resulting five equations gives the required estimators.
The only thing new here is r. We are going to use r to make inferences about ρ.
Computational Formula for r.
r = S_xy / √(S_xx S_yy),
where S_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = Σ_{i=1}^n x_i y_i − n x̄ ȳ, and S_xx, S_yy are defined analogously.
5.2 Correlation Analysis.
Reference: WMS 7th ed., section 11.8
Maximum Likelihood Ratio Test for ρ.
Given a random sample (X_1, Y_1), ..., (X_n, Y_n) from a bivariate normal distribution N(μ_x, σ_x²; μ_y, σ_y²; ρ), let us find the maximum likelihood ratio test for
H_0: ρ = 0
H_1: ρ ≠ 0.
Solution. If ρ = 0, L becomes
L = exp{ −(n/2) [ ( σ̂_x² + (x̄ − μ_x)² )/σ_x² + ( σ̂_y² + (ȳ − μ_y)² )/σ_y² ] } / [ (2π)^n (σ_x² σ_y²)^{n/2} ].
The maximum likelihood estimators of μ_x, μ_y, σ_x², σ_y² are easily seen to be x̄, ȳ, σ̂_x², σ̂_y², and so the numerator of the maximum likelihood ratio statistic λ(x) is
L_num = e^{−n} / [ (2π)^n (σ̂_x² σ̂_y²)^{n/2} ].
In the unconstrained case, the MLEs of μ_x, μ_y, σ_x², σ_y², ρ are x̄, ȳ, σ̂_x², σ̂_y², ρ̂, and so the denominator of λ(x) is
L_den = e^{−n} / [ (2π)^n ( σ̂_x² σ̂_y² (1 − ρ̂²) )^{n/2} ].
It follows that
λ(x) = L_num / L_den = (1 − ρ̂²)^{n/2}.
The critical region is therefore
C = { (1 − ρ̂²)^{n/2} ≤ k } = { ρ̂² ≥ k* }.
Next, by defining
u_i = (x_i − μ_x)/σ_x,   v_i = (y_i − μ_y)/σ_y,   i = 1, ..., n,
we see that x_i − x̄ = σ_x(u_i − ū) and y_i − ȳ = σ_y(v_i − v̄), so that ρ̂ can be written as
ρ̂ = Σ_{i=1}^n (u_i − ū)(v_i − v̄) / √[ Σ_{i=1}^n (u_i − ū)² · Σ_{i=1}^n (v_i − v̄)² ].
Here, the (u_i, v_i)'s are values from the bivariate normal N(0, 0; 1, 1; ρ). It follows that the distribution of ρ̂ depends only on the parameter ρ. The question is: what is the distribution of ρ̂ when ρ = 0? The answer is that if ρ = 0, the statistic
t = ρ̂ √(n − 2) / √(1 − ρ̂²)     (5.1)
has a t-distribution with n − 2 degrees of freedom. Since |t| is an increasing function of |ρ̂|, the critical region for rejecting H_0: ρ = 0 can be written as C = {|t| ≥ k}, and the level-α test becomes: reject H_0 if |t| ≥ t_{α/2,n−2}. Also, t is an increasing function of ρ̂, and so we are led to the following.
Summary. To test
H_0: ρ = 0
H_1: ρ ≠ 0   (ρ > 0)   (ρ < 0)
at level α, we calculate t as in (5.1) and reject H_0 if
|t| ≥ t_{α/2,n−2}   (t > t_{α,n−2})   (t < −t_{α,n−2}).
Large Sample Inference for ρ.
It can be shown that for large n, the statistic (1/2) log[(1 + r)/(1 − r)] is approximately normally distributed with mean (1/2) log[(1 + ρ)/(1 − ρ)] and variance 1/(n − 3). More precisely,
z ≝ { (1/2) log[(1 + r)/(1 − r)] − (1/2) log[(1 + ρ)/(1 − ρ)] } / √(1/(n − 3)) → N(0, 1)
as n → ∞. We can use this fact to test more general hypotheses concerning ρ, and to construct confidence intervals for ρ.
Tests of Hypotheses. To test
H_0: ρ = ρ_0
H_1: ρ ≠ ρ_0   (ρ > ρ_0)   (ρ < ρ_0)
at level α, we calculate
z = { (1/2) log[(1 + r)/(1 − r)] − (1/2) log[(1 + ρ_0)/(1 − ρ_0)] } / √(1/(n − 3)),
and reject H_0 if
|z| ≥ z_{α/2}   (z > z_α)   (z < −z_α).
Note that (1/2) log[(1 + ρ)/(1 − ρ)] is an increasing function of ρ; this is the reason for the test.
Confidence Intervals. We have
−z_{α/2} < z = [ √(n − 3)/2 ] log{ [(1 + r)/(1 − r)] [(1 − ρ)/(1 + ρ)] } < z_{α/2}
with probability 1 − α. This gives
e^{−2z_{α/2}/√(n−3)} (1 − r)/(1 + r) < (1 − ρ)/(1 + ρ) < e^{2z_{α/2}/√(n−3)} (1 − r)/(1 + r)
with probability 1 − α. Isolating ρ in the centre, we find that a (1 − α) × 100% confidence interval for ρ is
[ 1 + r − (1 − r) e^{2z_{α/2}/√(n−3)} ] / [ 1 + r + (1 − r) e^{2z_{α/2}/√(n−3)} ] < ρ < [ 1 + r − (1 − r) e^{−2z_{α/2}/√(n−3)} ] / [ 1 + r + (1 − r) e^{−2z_{α/2}/√(n−3)} ].
Example. We have the following data for a secretary concerning
x = minutes to do a task in the morning,
y = minutes to do the task in the afternoon.
The data are as follows:

x   8.2   9.6   7.0   9.4   10.9   7.1   9.0   6.6   8.4   10.5
y   8.7   9.6   6.9   8.5   11.3   7.6   9.2   6.3   8.4   12.3

(1) Test the hypotheses
H_0: ρ = 0
H_1: ρ ≠ 0
at level .01.
(2) Find a 99% confidence interval for ρ.
Solution. We have n = 10, Σ x_i = 86.7, Σ y_i = 88.8, Σ x_i² = 771.35, Σ y_i² = 819.34, and Σ x_i y_i = 792.92. Using the computational formula for r, we get r = .936.
(1) We get
z = { (1/2) log[(1 + .936)/(1 − .936)] − (1/2) log[(1 + 0)/(1 − 0)] } / √(1/7) = 4.5,
which exceeds z_{α/2} = z_{.005} = 2.58. Hence we reject H_0 at level .01. There is a correlation between X and Y.
(2) For α = .01 and the formula given above, a 99% confidence interval for ρ is
.623 < ρ < .991.
Remark. As pointed out previously, we can also test the hypotheses
H_0: ρ = 0
H_1: ρ ≠ 0
using the statistic
t = √(n − 2) r / √(1 − r²),
which, under the hypothesis that ρ = 0, has the t-distribution with n − 2 degrees of freedom. For the data given in the example above, we get
t = √8 × .936 / √(1 − .936²) = 7.52.
Since t_{α/2,n−2} = t_{.005,8} = 3.355, we reject H_0 at level .01.
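The calculations of this example can be reproduced numerically. The following sketch (not part of the original notes) uses scipy; the confidence interval is computed with the arctanh/tanh form of the Fisher transform, which is algebraically the same as the exponential formula given above.

import numpy as np
from scipy import stats

x = np.array([8.2, 9.6, 7.0, 9.4, 10.9, 7.1, 9.0, 6.6, 8.4, 10.5])
y = np.array([8.7, 9.6, 6.9, 8.5, 11.3, 7.6, 9.2, 6.3, 8.4, 12.3])
n = len(x)
r, p = stats.pearsonr(x, y)                       # r about .936
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)        # about 7.5
z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - 3)   # Fisher z statistic, about 4.5
half = stats.norm.ppf(0.995) / np.sqrt(n - 3)     # for a 99% interval
lo, hi = np.tanh(np.arctanh(r) - half), np.tanh(np.arctanh(r) + half)
print(round(r, 3), round(t, 2), round(z, 2), (round(lo, 3), round(hi, 3)))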
5.3 Normal Regression Analysis.
Reference: WMS 7th ed., section 11.8
Let X and Y be random variables. Define e = Y − E(Y|X). Then E(e) = 0 and
Var(e) = Var(Y) − 2 Cov[Y, E(Y|X)] + Var(E(Y|X)) = Var(Y) − Var(E(Y|X));     (5.2)
moreover, since E(eX) = E[(Y − E(Y|X))X] = E[XY − X E(Y|X)] = E[XY − E(XY|X)] = 0, then Cov(e, X) = 0 and so e and X are uncorrelated.
Proposition 5.3.1 Suppose that X and Y have a bivariate normal distribution. Then we have the representation
Y = a + bX + e,     (5.3)
where e is independent of X and has the distribution N(0, σ²).
Proof. As previously shown, E(Y|X) = a + bX for certain constants a, b given by
a = μ_y − b μ_x,   b = ρ σ_y/σ_x.
Then X and e = Y − a − bX also have a bivariate normal distribution. Since they are uncorrelated, they are independent. Hence we have the representation in (5.3), where X and e are independent normal random variables. The distribution of e is N(0, σ²), where by (5.2) we have
σ² = Var(Y) − Var(a + bX) = σ_y² − b²σ_x² = σ_y²(1 − ρ²).
The maximum likelihood estimators of a and b are
b̂ = ρ̂ σ̂_y/σ̂_x = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²,   â = ȳ − b̂ x̄.     (5.4)
Remarks.
(1) The maximum likelihood estimators in (5.4) are the same as the least squares estimators in chapter 6 on linear models. However (in contrast to linear models, where the x_i's are fixed), the distributions of â and b̂ are either unknown or extremely complicated.
(2) Since b = ρ σ_y/σ_x, the test H_0: b = 0 is the same as the test H_0: ρ = 0.
Chapter 6
Linear Models
6.1 Regression.
Reference: WMS 7th ed., chapter 11
Consider the following laboratory experiment on Charles' Law. We have a closed vessel filled with a certain volume of gas. We heat the vessel and measure the corresponding pressures inside the vessel. We obtain the observations (x_1, y_1), ..., (x_n, y_n), where y_i is the pressure in the vessel corresponding to the temperature x_i. We plot the points (x_1, y_1), ..., (x_n, y_n) and find
[Figure: scatterplot of the points (x_1, y_1), ..., (x_n, y_n) in the (x, y)-plane, showing a roughly linear upward trend.]
This is called a scatterplot. The distribution of points on the scatterplot indicates that there probably is a linear relationship between the variables x and y of the form
y = β_0 + β_1 x.     (6.1)
However, the points do not exactly fall on a straight line, due to experimental error or perhaps some other factor. But if we do assume that the true relationship between x and y is of the form in (6.1), then the relationship between the x_i's and the y_i's is given by the model
y_i = β_0 + β_1 x_i + e_i,   i = 1, 2, ..., n     (6.2)
where the e_i's are the errors. Note that e_i is the vertical distance of the point (x_i, y_i) from the line y = β_0 + β_1 x.
[Figure: the line y = β_0 + β_1 x with a data point (x_i, y_i) and its vertical deviation e_i from the line.]
Our object in this chapter will be to use the data (x_1, y_1), ..., (x_n, y_n) to estimate the parameters β_0 and β_1.
Least Squares Estimation of β_0 and β_1. The least squares estimates β̂_0 and β̂_1 are those values of β_0 and β_1 for which the sum
S = Σ_{i=1}^n e_i² = Σ_{i=1}^n (y_i − β_0 − β_1 x_i)²
of the squares of the vertical deviations is a minimum. Differentiating S with respect to β_0 and β_1, we find
∂S/∂β_0 = −2 Σ_{i=1}^n (y_i − β_0 − β_1 x_i) = −2 [ Σ_{i=1}^n y_i − nβ_0 − β_1 Σ_{i=1}^n x_i ],
∂S/∂β_1 = −2 Σ_{i=1}^n x_i (y_i − β_0 − β_1 x_i) = −2 [ Σ_{i=1}^n x_i y_i − β_0 Σ_{i=1}^n x_i − β_1 Σ_{i=1}^n x_i² ].
Setting these derivatives equal to zero, we obtain
Σ_{i=1}^n y_i − nβ_0 − β_1 Σ_{i=1}^n x_i = 0,     (6.3)
Σ_{i=1}^n x_i y_i − β_0 Σ_{i=1}^n x_i − β_1 Σ_{i=1}^n x_i² = 0,     (6.4)
which are called the normal equations. Note that the first of the two equations can be expressed in the useful form
β̂_0 = ȳ − β̂_1 x̄.
The solutions are
β̂_0 = [ (Σ x_i²)(Σ y_i) − (Σ x_i)(Σ x_i y_i) ] / [ n Σ x_i² − (Σ x_i)² ],   β̂_1 = [ n Σ x_i y_i − (Σ x_i)(Σ y_i) ] / [ n Σ x_i² − (Σ x_i)² ] = S_xy/S_xx,
where
S_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = Σ_{i=1}^n x_i y_i − n x̄ ȳ,
and the least squares line is
ŷ = β̂_0 + β̂_1 x.
Example. An environmentalist is concerned about mercury emissions from a battery manufacturing plant in Sorel, Quebec, into the St. Lawrence river. She measures mercury concentration at several locations downriver from the plant. Her results are

x   1.7    2.3    2.4    2.6    2.8    3.3    3.7    4.1    4.5    7.9   8.3   9.8
y   32.8   23.6   26.9   21.8   22.7   19.5   19.3   19.9   13.5   8.4   8.2   5.8

where x is the distance downriver in kilometers and y is the mercury concentration in parts per million. Find the least squares prediction line ŷ = β̂_0 + β̂_1 x. At a point 4.3 kilometers downriver, what will be the predicted mercury concentration?
Solution. The scatterplot (as well as the least squares line to be determined) is shown in the figure below. We have n = 12 and
Σ x_i = 53.4,   Σ y_i = 222.4,   Σ x_i y_i = 764.2,   Σ x_i² = 317.52,   Σ y_i² = 4849.38,
so
x̄ = 4.45,   ȳ = 18.53,   S_xx = 79.89,   S_yy = 727.567,   S_xy = −225.48,
which gives
β̂_0 = 31.093,   β̂_1 = −2.822.
Thus the least squares prediction line is
ŷ = 31.093 − 2.822x.
For x = 4.3, the predicted concentration is
β̂_0 + β̂_1 x = 31.093 − (2.822 × 4.3) = 18.96.
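The least squares computation above can be checked with a few lines of code. A sketch (not part of the original notes), using only numpy:

import numpy as np

x = np.array([1.7, 2.3, 2.4, 2.6, 2.8, 3.3, 3.7, 4.1, 4.5, 7.9, 8.3, 9.8])
y = np.array([32.8, 23.6, 26.9, 21.8, 22.7, 19.5, 19.3, 19.9, 13.5, 8.4, 8.2, 5.8])
n = len(x)
Sxy = np.sum(x * y) - n * x.mean() * y.mean()
Sxx = np.sum(x**2) - n * x.mean()**2
b1 = Sxy / Sxx                    # about -2.822
b0 = y.mean() - b1 * x.mean()     # about 31.09
print(round(b0, 3), round(b1, 3), round(b0 + b1 * 4.3, 2))   # prediction at x = 4.3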
Remarks.
(1) Suppose we have a set of data whose scatterplot shows a curved rather than straight-line trend. Then we might want to fit a model of the form
y = β_0 + β_1 x + β_2 x² + e.     (6.5)
In this case, we have
S = Σ_{i=1}^n (y_i − β_0 − β_1 x_i − β_2 x_i²)²,
and as above we find β̂_0, β̂_1, and β̂_2 by deriving the three normal equations as above.
(2) A linear model is one, such as in (6.1) or (6.5), which is linear in the coefficients. The reason for treating linear models is that they lead to the linear normal equations, which are easy to solve. Non-linear models can lead to normal equations which cannot be solved exactly.
(3) However, it is sometimes possible to convert a non-linear model into a linear one by an appropriate transformation of the data. For example, suppose that a scatterplot indicates that the model
y = β_0 β_1^x + e
might be appropriate. Then we could estimate β_0 and β_1 by actually fitting the model
log y = log β_0 + x log β_1 + e.
Normal Regression Analysis. So far, we have not made any assumptions about the distributions involved, and both the x_i's and y_i's could be values of random variables. Now, in order to obtain confidence intervals and make tests of hypotheses about β_0 and β_1, we must make distributional assumptions. From this point on, we shall assume that the x_i's are fixed numbers and not the values of random variables, and that the x_i's and y_i's are related by the model
y_i = β_0 + β_1 x_i + e_i,   e_i ~ N(0, σ²),   i = 1, ..., n,
where the e_i's are uncorrelated. Notice that we have
y_i ~ N(β_0 + β_1 x_i, σ²),   i = 1, ..., n.
Remark. Now that there are distributions involved, we could use maximum likelihood estimation to estimate β_0 and β_1. The likelihood function is
L(β_0, β_1) = Π_{i=1}^n (1/(σ√(2π))) e^{−(y_i − β_0 − β_1 x_i)²/(2σ²)}.
By calculating ∂ log L/∂β_0 and ∂ log L/∂β_1, we find that the maximum likelihood estimates are the same as the least squares estimates.
Sampling Distributions of β̂_0, β̂_1, and β̂_0 + β̂_1 x. We can write
β̂_1 = S_xy/S_xx = Σ (x_i − x̄) y_i / S_xx = Σ [ (x_i − x̄)/S_xx ] y_i.
Thus β̂_1 is a linear combination of independent normals, and so is itself normal. Moreover, we have
E(β̂_1) = Σ [ (x_i − x̄)/S_xx ] E(y_i) = Σ (x_i − x̄)(β_0 + β_1 x_i)/S_xx = [ β_0 Σ (x_i − x̄) + β_1 Σ x_i(x_i − x̄) ] / S_xx = 0 + β_1 Σ (x_i − x̄)²/S_xx = β_1,
and
Var(β̂_1) = Σ [ (x_i − x̄)/S_xx ]² Var(y_i) = Σ (x_i − x̄)² σ²/S_xx² = σ²/S_xx.
In summary, then, we have
β̂_1 ~ N( β_1, σ²/S_xx ).
Similarly, we get
β̂_0 ~ N( β_0, σ² [ 1/n + x̄²/S_xx ] ).
Moreover, we can show that
Cov(β̂_0, β̂_1) = −σ² x̄ / S_xx,
and so the estimated line β̂_0 + β̂_1 x has distribution
β̂_0 + β̂_1 x ~ N( β_0 + β_1 x, σ² [ 1/n + (x − x̄)²/S_xx ] ).
Finally, let y = β_0 + β_1 x + e be the value resulting from a future measurement, and suppose we wish to predict y. Taking ŷ = β̂_0 + β̂_1 x as our prediction, we see that E(y − ŷ) = 0 and
Var(y − ŷ) = Var(y) + Var(ŷ) − 2 Cov(y, ŷ) = σ² + σ² [ 1/n + (x − x̄)²/S_xx ],
where we used the fact that Cov(y, ŷ) = 0 (since e is independent of e_1, ..., e_n). It follows that
y − ŷ ~ N( 0, σ² [ 1 + 1/n + (x − x̄)²/S_xx ] ).
The problem insofar as deriving tests and confidence intervals for β_0 and β_1 is that σ² is not likely to be known, and therefore must be estimated. Let us define
SS(Res) = least squares minimum = Σ_{i=1}^n (y_i − β̂_0 − β̂_1 x_i)²
(SS(Res) is called SSE in WMS). Then
SS(Res) = Σ_{i=1}^n [ (y_i − ȳ) + β̂_1 (x̄ − x_i) ]² = Σ_{i=1}^n (y_i − ȳ)² − 2β̂_1 Σ_{i=1}^n (y_i − ȳ)(x_i − x̄) + β̂_1² Σ_{i=1}^n (x_i − x̄)²,
giving
SS(Res) = S_yy − β̂_1 S_xy,
a useful computational formula for SS(Res). SS(Res)/σ² can be seen to have a chi-square distribution with n − 2 degrees of freedom, independent of β̂_0 and β̂_1, and therefore
σ̂² ≝ SS(Res)/(n − 2)
(called s² in WMS) is an unbiased estimator of σ². Then
t ≝ (β̂_1 − β_1) √S_xx / σ̂ = [ (β̂_1 − β_1)/√(σ²/S_xx) ] / √[ (SS(Res)/σ²)/(n − 2) ]
has a t-distribution with n − 2 degrees of freedom. Similar results hold for β̂_0, β̂_0 + β̂_1 x, and y − ŷ. Hence our tests and confidence intervals are based on the facts that
t ≝ (β̂_1 − β_1) √S_xx / σ̂,
t ≝ (β̂_0 − β_0) / [ σ̂ √(1/n + x̄²/S_xx) ],
t ≝ (β̂_0 + β̂_1 x − β_0 − β_1 x) / [ σ̂ √(1/n + (x − x̄)²/S_xx) ],
t ≝ (y − ŷ) / [ σ̂ √(1 + 1/n + (x − x̄)²/S_xx) ]
all have t-distributions with n − 2 degrees of freedom.
Example. Let us consider the data from our previous example.
(1) Suppose we want to test the hypotheses
H_0: β_1 = β
H_1: β_1 ≠ β
at level α. The method is: calculate
t = (β̂_1 − β) √S_xx / σ̂
and reject H_0 at level α if |t| > t_{α/2,n−2}.
In the numerical example, are the data sufficient to indicate that mercury concentration depends on distance? That is, let us test the above hypotheses with β = 0. We have β̂_1 = −2.822 and S_xx = 79.89. For SS(Res), we get SS(Res) = 727.567 − (−2.822)(−225.48) = 91.176, so σ̂² = 91.176/10 = 9.1176. Finally, we find t = −2.822 × √79.89/√9.1176 = −8.35. Since |t| > t_{.025,10} = 2.228, we reject H_0 at level .05.
(2) Suppose we want a 95% confidence interval for β_0. The method is as follows: we know that
−t_{α/2,n−2} < (β̂_0 − β_0) / [ σ̂ √(1/n + x̄²/S_xx) ] < t_{α/2,n−2}
with probability 1 − α. Unravelling and isolating β_0 in the usual way, we find that
β̂_0 − t_{α/2,n−2} σ̂ √(1/n + x̄²/S_xx) < β_0 < β̂_0 + t_{α/2,n−2} σ̂ √(1/n + x̄²/S_xx)
is a (1 − α) × 100% confidence interval for β_0. In our case, we get 31.093 ± 3.872, so that
27.22 < β_0 < 34.96
is a 95% confidence interval for β_0.
(3) A (1 − α) × 100% confidence interval for β_1 is
β̂_1 − t_{α/2,n−2} σ̂ √(1/S_xx) < β_1 < β̂_1 + t_{α/2,n−2} σ̂ √(1/S_xx).
(4) A (1 − α) × 100% confidence interval for the true line E(y) = β_0 + β_1 x is easily found to be
β̂_0 + β̂_1 x − t_{α/2,n−2} σ̂ √(1/n + (x − x̄)²/S_xx) < β_0 + β_1 x < β̂_0 + β̂_1 x + t_{α/2,n−2} σ̂ √(1/n + (x − x̄)²/S_xx).
(5) A (1 − α) × 100% confidence interval for a future value y = β_0 + β_1 x + e is easily found to be
ŷ − t_{α/2,n−2} σ̂ √(1 + 1/n + (x − x̄)²/S_xx) < y < ŷ + t_{α/2,n−2} σ̂ √(1 + 1/n + (x − x̄)²/S_xx),
where ŷ = β̂_0 + β̂_1 x. A numerical check of (1) and (2) for the mercury data is sketched below.
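The following sketch (not part of the original notes) reproduces the t statistic of (1) and the confidence interval of (2) for the mercury data, assuming numpy and scipy are available:

import numpy as np
from scipy import stats

x = np.array([1.7, 2.3, 2.4, 2.6, 2.8, 3.3, 3.7, 4.1, 4.5, 7.9, 8.3, 9.8])
y = np.array([32.8, 23.6, 26.9, 21.8, 22.7, 19.5, 19.3, 19.9, 13.5, 8.4, 8.2, 5.8])
n = len(x)
Sxx = np.sum(x**2) - n * x.mean()**2
Syy = np.sum(y**2) - n * y.mean()**2
Sxy = np.sum(x * y) - n * x.mean() * y.mean()
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
sigma2 = (Syy - b1 * Sxy) / (n - 2)           # SS(Res)/(n-2), about 9.12
t_b1 = b1 * np.sqrt(Sxx) / np.sqrt(sigma2)    # about -8.4; |t| > t_{.025,10} = 2.228
half = stats.t.ppf(0.975, n - 2) * np.sqrt(sigma2 * (1 / n + x.mean()**2 / Sxx))
print(round(t_b1, 2), (round(b0 - half, 2), round(b0 + half, 2)))   # CI roughly (27.2, 35.0)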
6.2 Experimental Design.
Reference: WMS 7th ed., chapters 12, 13
In this final section, we study experimental designs involving a single factor, also called one-way analysis of variance.
6.2.1 The Completely Randomized Design.
Suppose we want to test the effects of k different fertilizers (called treatments) on a certain type of wheat. To do this, we plant a total of N test plots with this wheat, of which n_1 are fertilized with fertilizer 1, n_2 with fertilizer 2, and so on (so that N = n_1 + n_2 + ... + n_k). We then measure the resulting yields for each plot. For example, if k = 3, a typical set of observations would be the following:

Fertilizer 1:   50  60  60  65  70  80  75  80  85  75
Fertilizer 2:   60  60  65  70  75  80  70
Fertilizer 3:   40  50  50  60  60  60  65  65
Let x_{ij} = the yield from the jth plot receiving fertilizer i (the jth observation on treatment i), j = 1, 2, ..., n_i; i = 1, ..., k. Then we may write x_{ij} = μ_i + e_{ij}, where μ_i represents the effect of the ith treatment, and e_{ij} is the random error for the jth plot receiving fertilizer i. e_{ij} is the sum total of effects due to uncontrolled factors, such as precipitation, soil fertility, and so on. Let us put
μ = (1/N) Σ_{i=1}^k n_i μ_i,   α_i = μ_i − μ,   i = 1, 2, ..., k.
Then our model becomes
x_{ij} = μ + α_i + e_{ij},   j = 1, ..., n_i,   i = 1, ..., k,
where
μ = the general effect,
α_i = the deviation from the general effect for the ith treatment,
and we notice that Σ_{i=1}^k n_i α_i = 0 (called a side condition). This is called the one-way classification model.
Estimation of Parameters. We shall begin by estimating the parameters μ, α_i by the method of least squares. We will take μ̂ and α̂_i to be the values of μ and α_i for which
SS = Σ_{i=1}^k Σ_{j=1}^{n_i} e_{ij}² = Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − μ − α_i)²
is a minimum. Differentiation gives
∂SS/∂μ = −2 Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − μ − α_i) = −2 ( x.. − Nμ − Σ_{i=1}^k n_i α_i ),
∂SS/∂α_i = −2 Σ_{j=1}^{n_i} (x_{ij} − μ − α_i) = −2 ( x_{i.} − n_i μ − n_i α_i ),
where
x_{i.} = Σ_{j=1}^{n_i} x_{ij}   and   x.. = Σ_{i=1}^k x_{i.} = Σ_{i=1}^k Σ_{j=1}^{n_i} x_{ij}.
We get the normal equations
x_{i.} − n_i μ − n_i α_i = 0,   i = 1, ..., k,
x.. − Nμ = 0,
from which
μ̂ = x../N,   α̂_i = x_{i.}/n_i − x../N.
The minimum sum of squares is
S = Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − μ̂ − α̂_i)² = Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − x_{i.}/n_i)².
Derivation of the Treatment Sum of Squares. We shall want to test the hypotheses
H_0: μ_1 = μ_2 = ... = μ_k (equivalently α_1 = ... = α_k = 0)
H_1: not H_0.
If H_0 is true, then the observations come from the model x_{ij} = μ + e_{ij}, j = 1, ..., n_i; i = 1, ..., k. The least squares estimate of μ is μ̂ = x../N, and comes from minimizing SS = Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − μ)². The minimum sum of squares in this case is
S_0 = Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − μ̂)² = Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − x../N)².
Now S_0 is the variability in the observations not explained by the parameter μ, while S is the variability in the observations not explained by μ, α_1, ..., α_k. Hence S_0 − S is the variability in the observations explained by the treatment effects α_1, ..., α_k. Note that
S_0 − S = Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − x../N)² − Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − x_{i.}/n_i)²
  = Σ_{i=1}^k Σ_{j=1}^{n_i} ( x_{i.}/n_i − x../N )( x_{ij} − x../N + x_{ij} − x_{i.}/n_i )
  = Σ_{i=1}^k ( x_{i.}/n_i − x../N ) ( x_{i.} − n_i x../N + x_{i.} − x_{i.} )
  = Σ_{i=1}^k n_i ( x_{i.}/n_i − x../N )².
It is customary to write
S_0 = SS(T) = total sum of squares = total variability in the observations,
S_0 − S = SS(Tr) = treatment sum of squares = variability due to treatments,
S = SS(E) = error sum of squares = variability unexplained by μ, α_1, ..., α_k.
Note from the expressions for SS(Tr) and SS(E) that SS(Tr) is the between-treatment variability, while SS(E) is the within-treatment variability.
Summary. SS(T) = SS(Tr) + SS(E), where
SS(T) = Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − x../N)² = Σ_{i=1}^k Σ_{j=1}^{n_i} x_{ij}² − x..²/N,
SS(Tr) = Σ_{i=1}^k n_i (x_{i.}/n_i − x../N)² = Σ_{i=1}^k x_{i.}²/n_i − x..²/N,
SS(E) = Σ_{i=1}^k Σ_{j=1}^{n_i} (x_{ij} − x_{i.}/n_i)² = Σ_{i=1}^k Σ_{j=1}^{n_i} x_{ij}² − Σ_{i=1}^k x_{i.}²/n_i.
If the null hypothesis H_0: α_1 = ... = α_k = 0 is true, then we expect SS(Tr) to be a small part of SS(T), and SS(E) a relatively large part. Hence, the ratio SS(Tr)/SS(E) should be small if H_0 is true and large if H_0 is false.
Now assume that e_{ij} ~ N(0, σ²) for all i, j, and are independent. Then if H_0 is true, x_{ij} ~ N(μ, σ²) and are independent; from the formulas for SS(T), SS(Tr), and SS(E), it would appear they have chi-square distributions. In fact, we have
SS(T)/σ² ~ χ² with N − 1 degrees of freedom,
SS(Tr)/σ² ~ χ² with k − 1 degrees of freedom,
SS(E)/σ² ~ χ² with N − k degrees of freedom,
and SS(Tr) and SS(E) are independent. Hence, under H_0,
F = [ (SS(Tr)/σ²)/(k − 1) ] / [ (SS(E)/σ²)/(N − k) ] = [ SS(Tr)/(k − 1) ] / [ SS(E)/(N − k) ]
has the F-distribution with k − 1, N − k degrees of freedom.
Summary. To test
H_0: α_1 = α_2 = ... = α_k = 0   (i.e. no treatment effects)
H_1: not H_0,
calculate
F = [ SS(Tr)/(k − 1) ] / [ SS(E)/(N − k) ]
and reject H_0 at level α if F ≥ F_{α,k−1,N−k}.
ANOVA Table. (ANOVA = Analysis of Variance.) The data and computations for a given problem are usually summarized in the form of an ANOVA table as follows.

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square               F
Treatments            k − 1                SS(Tr)           MS(Tr) = SS(Tr)/(k−1)     F_Tr = MS(Tr)/MS(E)
Errors                N − k                SS(E)            MS(E) = SS(E)/(N−k)
Total                 N − 1                SS(T)
Example 1. For the wheat and fertilizer example given at the beginning of this section, we obtain
x_{1.} = 700,   x_{2.} = 480,   x_{3.} = 450,   x.. = 1630,
n_1 = 10,   n_2 = 7,   n_3 = 8,   N = 25,
Σ_{i=1}^k Σ_{j=1}^{n_i} x_{ij}² = 109,200;   Σ_{i=1}^k x_{i.}²/n_i = 107,226.79.
Hence we get

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F
Treatments            2                    950.79           475.4         5.30
Errors                22                   1973.21          89.69
Total                 24                   2924

Since F_{.05,2,22} = 3.44, we reject H_0 at level .05.
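The same F ratio can be obtained directly from the raw yields. A sketch (not part of the original notes), using scipy's one-way ANOVA routine:

from scipy import stats

f1 = [50, 60, 60, 65, 70, 80, 75, 80, 85, 75]
f2 = [60, 60, 65, 70, 75, 80, 70]
f3 = [40, 50, 50, 60, 60, 60, 65, 65]
F, p = stats.f_oneway(f1, f2, f3)            # F about 5.30 with (2, 22) degrees of freedom
print(round(F, 2), round(p, 4), round(stats.f.ppf(0.95, 2, 22), 2))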
Randomization. The concept of randomization, which is part of the title of this section, has entered in an important but as yet unspecified way into the theory of this section.
Definition. A randomized design is one in which the plots (test units) are assigned randomly to the treatments. Complete randomization refers to assigning all the test units randomly to the treatments (as opposed to randomization within blocks, for example, in the next section).
The purpose of randomization is as follows. In most experiments, especially undesigned ones, there will be one or more extraneous factors, whose effects in this section have been considered a part of the e_{ij}'s. Randomization ensures that each treatment has an equal chance of being favoured or handicapped by an extraneous factor.
To make this clear, let us consider an example. Suppose we have three fertilizers to be applied to three plots each, as in the drawing below.
[Figure: nine plots arranged in three rows of terrain (low swampy ground, flat fertile ground, high rocky ground); within each row the plots are labelled 1, 2, 3, so that fertilizer i is applied to the plots in column i.]
Notice that we have introduced an extraneous factor, type of terrain, which may have been unnoticed by the experimenter. Suppose that fertilizer 1 was assigned to the three left plots, fertilizer 2 to the middle three plots, and fertilizer 3 to the right three plots. How could we be sure that any differences detected between fertilizers are not in fact due to the difference in type of terrain? A bias has been introduced into the experiment, and the factor "type of fertilizer" has been confounded by the factor "type of terrain". To protect ourselves from this bias, we use the device of randomization: we choose n_1 = 3 plots at random from the N = 9 plots and apply fertilizer 1 to these three; then n_2 = 3 plots are chosen from the remaining six, and fertilizer 2 is applied; finally, fertilizer 3 is applied to the remaining three plots.
The following is another example, which we will return to in the next section.
Example 2. A certain person can drive to work along four different routes, and the following are the numbers of minutes in which he timed himself on five different occasions for each route.

            Route 1   Route 2   Route 3   Route 4
Monday      22        25        26        26
Tuesday     26        27        29        28
Wednesday   25        28        33        27
Thursday    25        26        30        30
Friday      31        29        33        30

The ANOVA table is

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F
Treatments            3                    52.8             17.6          2.8
Errors                16                   100.4            6.28
Total                 19                   153.2

Since F_{.05,3,16} = 3.24, we do not reject H_0 at level .05.
6.2.2 Randomized Block Designs
Suppose an extraneous factor is present, such as type of terrain in the fertilizer example above. Then unless the plots are assigned as in the figure above (and we have just seen why such an assignment should be avoided by randomization), the within-treatment sum of squares SS(E) will contain variation due to the type of terrain. Since the test statistic F has SS(E) in its denominator, this may result in F being so small as to make the test inconclusive, even though there is a difference between fertilizers. To eliminate the effect of this extraneous factor, we adopt the randomized block design.
Definition. A randomized block design is a design in which the nk test units are partitioned into n blocks depending on the extraneous factor, and the k treatments are assigned so that every treatment is represented once in every block. Within blocks, the treatments should be assigned randomly.
For our fertilizer-terrain example, there are three blocks. The first block consists of the three plots in the low swampy terrain, the second of the three plots on flat fertile ground, and the third block of the three plots on high rocky ground. A possible assignment of plots to treatments is shown in the figure below.
[Figure: the same nine plots; within each terrain row the three fertilizers are assigned in a random order, here (1, 2, 3), (3, 1, 2), (2, 3, 1).]
Let us consider another example. In example 2 of the previous section, we were unable to reject the null hypothesis because SS(E) formed a large part of SS(T) relative to SS(Tr). Looking at the data, though, it appears that route 1 is certainly better than route 3, since the sample means are x̄_1 = 25.8 and x̄_3 = 30.2. Again recall that SS(E) is the within-treatment variability. Is it possible that SS(E) is being inflated by a second factor? We go back to the original person who took the observations and find that yes, the times were measured on different days of the week, as shown in the table. We must do a randomized block design with the weekdays as blocks.
Analysis. Our model becomes
x_{ij} = μ + α_i + β_j + e_{ij},   i = 1, ..., k;   j = 1, ..., n,
where
μ = the grand mean,
α_i = ith treatment effect, with Σ_{i=1}^k α_i = 0,
β_j = jth block effect, with Σ_{j=1}^n β_j = 0.
As before, we estimate μ, α_i, and β_j by the method of least squares. We have
SS = Σ_{i=1}^k Σ_{j=1}^n (x_{ij} − μ − α_i − β_j)²,
∂SS/∂μ = −2 Σ_{i=1}^k Σ_{j=1}^n (x_{ij} − μ − α_i − β_j) = −2 ( x.. − nkμ ),
∂SS/∂α_i = −2 Σ_{j=1}^n (x_{ij} − μ − α_i − β_j) = −2 ( x_{i.} − nμ − nα_i ),
∂SS/∂β_j = −2 Σ_{i=1}^k (x_{ij} − μ − α_i − β_j) = −2 ( x_{.j} − kμ − kβ_j ).
We obtain the equations
x.. − nkμ = 0,
x_{i.} − nμ − nα_i = 0,
x_{.j} − kμ − kβ_j = 0,
which give
μ̂ = x../(nk),   α̂_i = x_{i.}/n − x../(nk),   β̂_j = x_{.j}/k − x../(nk).
The minimum sum of squares is
SS_min = Σ_{i=1}^k Σ_{j=1}^n (x_{ij} − μ̂ − α̂_i − β̂_j)² = Σ_{i=1}^k Σ_{j=1}^n (x_{ij} − x_{i.}/n − x_{.j}/k + x../(kn))²,
and represents that part of the total variation SS(T) not explained by treatment effects or block effects. Hence the variability explained by block effects must be
Σ_{i=1}^k Σ_{j=1}^n (x_{ij} − x_{i.}/n)²  [the old SS(E)]  −  Σ_{i=1}^k Σ_{j=1}^n (x_{ij} − x_{i.}/n − x_{.j}/k + x../(kn))²  [the new SS(E)]
  = Σ_{i=1}^k Σ_{j=1}^n (x_{ij} − x_{i.}/n)² − Σ_{i=1}^k Σ_{j=1}^n [ (x_{ij} − x_{i.}/n) − (x_{.j}/k − x../(kn)) ]²
  = 2 Σ_{j=1}^n (x_{.j}/k − x../(kn)) Σ_{i=1}^k (x_{ij} − x_{i.}/n) − Σ_{i=1}^k Σ_{j=1}^n (x_{.j}/k − x../(kn))²
  = k Σ_{j=1}^n (x_{.j}/k − x../(kn))².
Hence we write
SS(T) = Σ_{i=1}^k Σ_{j=1}^n (x_{ij} − x../(kn))²,
SS(Tr) = n Σ_{i=1}^k (x_{i.}/n − x../(kn))²,
SS(Bl) = k Σ_{j=1}^n (x_{.j}/k − x../(kn))²,
SS(E) = Σ_{i=1}^k Σ_{j=1}^n (x_{ij} − x_{i.}/n − x_{.j}/k + x../(kn))²,
and we have SS(T) = SS(Tr) + SS(Bl) + SS(E). Now assume that e_{ij} ~ N(0, σ²) for all i, j.
Tests concerning treatment effects. Define
F_Tr = [ SS(Tr)/(k − 1) ] / [ SS(E)/((n − 1)(k − 1)) ].
Reject H_0: α_1 = ... = α_k = 0 and accept H_1: α_i ≠ 0 for some i if F_Tr ≥ F_{α,k−1,(n−1)(k−1)}.
Tests concerning block effects. Define
F_Bl = [ SS(Bl)/(n − 1) ] / [ SS(E)/((n − 1)(k − 1)) ].
Reject H_0: β_1 = ... = β_n = 0 and accept H_1: β_j ≠ 0 for some j if F_Bl ≥ F_{α,n−1,(n−1)(k−1)}.
ANOVA Table.

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square                     F
Treatments            k − 1                SS(Tr)           MS(Tr) = SS(Tr)/(k−1)           F_Tr = MS(Tr)/MS(E)
Blocks                n − 1                SS(Bl)           MS(Bl) = SS(Bl)/(n−1)           F_Bl = MS(Bl)/MS(E)
Errors                (n − 1)(k − 1)       SS(E)            MS(E) = SS(E)/((n−1)(k−1))
Total                 nk − 1               SS(T)
Example 3. Let us redo example 2 concerning the four routes. We get

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F
Treatments            3                    52.8             17.6          7.75
Blocks                4                    73.2             18.3          8.06
Errors                12                   27.2             2.27
Total                 19                   153.2

Since F_{.05,3,12} = 3.49 and F_{.05,4,12} = 3.26, we reject both hypotheses.
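The block analysis can be reproduced from the sum-of-squares formulas above. A sketch (not part of the original notes); rows of the array are the routes (treatments) and columns are the weekdays (blocks):

import numpy as np
from scipy import stats

x = np.array([[22, 26, 25, 25, 31],
              [25, 27, 28, 26, 29],
              [26, 29, 33, 30, 33],
              [26, 28, 27, 30, 30]], dtype=float)
k, n = x.shape
grand = x.mean()
SStr = n * np.sum((x.mean(axis=1) - grand) ** 2)                     # 52.8
SSbl = k * np.sum((x.mean(axis=0) - grand) ** 2)                     # 73.2
SSE = np.sum((x - x.mean(axis=1, keepdims=True)
                - x.mean(axis=0, keepdims=True) + grand) ** 2)       # 27.2
F_tr = (SStr / (k - 1)) / (SSE / ((n - 1) * (k - 1)))                # about 7.8
F_bl = (SSbl / (n - 1)) / (SSE / ((n - 1) * (k - 1)))                # about 8.1
print(round(F_tr, 2), round(F_bl, 2),
      round(stats.f.ppf(0.95, k - 1, (n - 1) * (k - 1)), 2),
      round(stats.f.ppf(0.95, n - 1, (n - 1) * (k - 1)), 2))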
Chapter 7
Chi-Square Tests
7.1 Tests Concerning k Independent Binomial Populations.
Suppose we have observations X_1, ..., X_k from k independent binomial distributions with parameters (n_1, p_1), ..., (n_k, p_k).
Case 1. Suppose we want to test
H_0: p_1 = p_{1,0}, ..., p_k = p_{k,0}
H_1: not H_0.
Background: Each X_i is a binomial random variable, so that if n_i is large enough, then
(X_i − n_i p_i) / √( n_i p_i (1 − p_i) )
is approximately N(0, 1), and then
(X_i − n_i p_i)² / ( n_i p_i (1 − p_i) )
will have approximately a χ²-distribution with 1 degree of freedom. By independence, if all n_i are large enough, then
Σ_{i=1}^k (X_i − n_i p_i)² / ( n_i p_i (1 − p_i) )
will have approximately a χ²-distribution with k degrees of freedom. By "large enough", we mean that n_i p_i ≥ 5 and n_i(1 − p_i) ≥ 5 for all i = 1, ..., k.
Test of Hypotheses: Let
χ² ≝ Σ_{i=1}^k (X_i − n_i p_{i,0})² / ( n_i p_{i,0} (1 − p_{i,0}) ).
Assume that n_i p_{i,0} ≥ 5 and n_i(1 − p_{i,0}) ≥ 5 for all i = 1, ..., k. If H_0 is true, we expect χ² to be small. Hence the α-level test is: reject H_0 if χ² ≥ χ²_{α,k}.
Example. Suppose we wish to test
H_0: p_1 = p_2 = p_3 = .3
H_1: not H_0
at level .05. The observations are
x_1 = 155, n_1 = 250;   x_2 = 118, n_2 = 200;   x_3 = 87, n_3 = 150.
We have
χ² = (155 − 250 × .3)²/(250 × .3 × .7) + (118 − 200 × .3)²/(200 × .3 × .7) + (87 − 150 × .3)²/(150 × .3 × .7) = 258.
Since χ²_{.05,3} = 7.815, we reject H_0 at level .05.
Case 2. However, perhaps it was the value .3 that was incorrect. Suppose we want to test
H_0: p_1 = ... = p_k
H_1: not H_0.
This is the more usual case. Since no common value of the p_i's is given under H_0, we estimate it by the pooled estimate
p̂ = (x_1 + ... + x_k) / (n_1 + ... + n_k).
Hence, setting
χ² ≝ Σ_{i=1}^k (X_i − n_i p̂)² / ( n_i p̂ (1 − p̂) ),
we reject H_0 at level α if χ² ≥ χ²_{α,k−1}. (The loss of a degree of freedom is because the common value is estimated by p̂.)
Example. Carry out the test
H_0: p_1 = p_2 = p_3
H_1: not H_0
at level .05, using the same observations as in the previous example.
Solution. We have
p̂ = (155 + 118 + 87)/(250 + 200 + 150) = .6,
and so
χ² = (155 − 250 × .6)²/(250 × .6 × .4) + (118 − 200 × .6)²/(200 × .6 × .4) + (87 − 150 × .6)²/(150 × .6 × .4) = .75.
Since χ²_{.05,2} = 5.991, we do not reject H_0 at level .05.
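Both cases are straightforward to compute. A sketch for Case 2 (not part of the original notes), assuming scipy is available:

from scipy import stats

x = [155, 118, 87]
n = [250, 200, 150]
p_hat = sum(x) / sum(n)                                    # pooled estimate, 0.6
chi2 = sum((xi - ni * p_hat) ** 2 / (ni * p_hat * (1 - p_hat))
           for xi, ni in zip(x, n))                        # 0.75
print(round(chi2, 2), round(stats.chi2.ppf(0.95, len(x) - 1), 3))   # compare with chi^2_{.05,2}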
7.2 Chi-Square Test for the Parameters of a Multinomial Distribution.
Reference: WMS 7th ed., chapter 14
Suppose that the random vector (X_1, ..., X_k) has the multinomial distribution
P{X_1 = x_1, ..., X_k = x_k} = { [ n!/(x_1! ··· x_k!) ] p_1^{x_1} ··· p_k^{x_k} if x_1 + ... + x_k = n,   0 otherwise },
with parameters n ≥ 1 and p_1, ..., p_k. Then
χ² = Σ_{i=1}^k (X_i − n p_i)² / (n p_i)     (7.1)
is called Pearson's chi-square statistic. It can be shown (see the appendix to this chapter) that, as n → ∞, the distribution of χ² tends to the chi-square distribution with k − 1 degrees of freedom. Thus, for large n, χ² has approximately a chi-square distribution with k − 1 degrees of freedom. The usual convention is that if n p_i ≥ 5 for all i = 1, ..., k, the approximation is considered good.
Remark. The reason that χ² has a chi-square distribution is as follows: under the conditions n p_i ≥ 5 for i = 1, ..., k, the random variables X_1, ..., X_k have (by the CLT) approximately a multivariate normal distribution. Since n p_i is the mean of X_i, then χ² is the sum of squares of normalized (almost) normal random variables, and should therefore have a chi-square distribution. We lose a degree of freedom because p_1 + ... + p_k = 1.
Chi-Square Tests. Suppose we want to test
H_0: p_1 = p_{1,0}, p_2 = p_{2,0}, ..., p_k = p_{k,0}
H_1: not H_0     (7.2)
where p_{1,0}, ..., p_{k,0} are given values and p_{1,0} + ... + p_{k,0} = 1.
Method: Assume that the conditions n p_{i,0} ≥ 5 are satisfied for each i = 1, ..., k. We calculate the value
χ² = Σ_{i=1}^k (X_i − n p_{i,0})² / (n p_{i,0}).
If H_0 is true, then X_i should be "close" to n p_{i,0} for each i, and χ² should be small. Hence the test is: reject H_0 at level α if χ² ≥ χ²_{α,k−1}.
Example. A group of rats, one by one, proceed down a ramp to one of five doors, with the following results:

Door                               1    2    3    4    5
Number of rats which choose door   23   36   31   30   30

Are the data sufficient to indicate that the rats show a preference for certain doors? That is, test the hypotheses
H_0: p_1 = p_2 = p_3 = p_4 = p_5 = 1/5
H_1: not H_0.
Use α = .01.
Solution. We have n = 150, so n p_{i,0} = 30 ≥ 5 for all i = 1, ..., 5. Since
χ² = (23 − 30)²/30 + (36 − 30)²/30 + (31 − 30)²/30 + (30 − 30)²/30 + (30 − 30)²/30 = 2.87,
and since χ²_{.01,4} = 13.277, we do not reject H_0 at level .01. No, the data are not sufficient.
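This multinomial test is available directly in scipy. A sketch (not part of the original notes); when no expected frequencies are supplied, chisquare assumes equal cell probabilities, which is exactly the null hypothesis here:

from scipy.stats import chisquare

observed = [23, 36, 31, 30, 30]
stat, p = chisquare(observed)          # expected frequencies default to 30 in each cell
print(round(stat, 2), round(p, 3))     # 2.87, p about .58, so do not reject H_0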
Other Applications. The chi-square statistic can be used to test hypotheses for populations other than the multinomial. Suppose we have a population, each of whose members can be of one of k categories S_1, S_2, ..., S_k. Let p_i be the (true) proportion of members of the population which are of category S_i, i = 1, 2, ..., k. Note that p_1 + ... + p_k = 1. Suppose we select a random sample of size n from this population. If an observation is of category i, we say that it falls into cell i. Let
n_i = the number of observations in the sample that fall into cell i,   i = 1, ..., k.
(Note: we are using n_i rather than X_i.) The numbers n_1, ..., n_k are called the observed cell frequencies. Suppose we wish to test the hypotheses in (7.2). Let us define
e_i = n p_{i,0},   i = 1, ..., k.
Note that for each i, e_i is the number of observations in the sample we would expect if H_0 were true. The numbers e_1, ..., e_k are therefore called the expected cell frequencies, and Σ_{i=1}^k e_i = n. With this notation, Pearson's chi-square statistic becomes
χ² = Σ_{i=1}^k (n_i − e_i)² / e_i.
If the sample size n is large enough that all expected cell frequencies e_i are at least 5, then χ² has approximately the chi-square distribution with k − 1 degrees of freedom. Also, we expect χ² to be small if H_0 is true. Hence the test is: reject H_0 at level α if χ² ≥ χ²_{α,k−1}.
It is traditional and useful to arrange the observed and expected cell frequencies into a table. In the above example, we got the following table of observed and expected cell frequencies.

Cell                      1    2    3    4    5
Observed Cell Frequency   23   36   31   30   30
Expected Cell Frequency   30   30   30   30   30

Remark. If in the process of using the multinomial test, there are t independent parameters which must be estimated from the sample data, the number of degrees of freedom of χ² drops to k − t − 1. Hence our test would become: calculate χ² and reject H_0 if χ² ≥ χ²_{α,k−t−1}. This will be useful to remember when we come to goodness of fit tests and contingency tables in the next two sections.
7.3 Goodness of Fit Tests.
We have a random sample X_1, ..., X_n from an unknown distribution F, and we want to test
H_0: F = F*     (7.3)
H_1: F ≠ F*     (7.4)
where F* is a given distribution function.
Method. Let X denote a random variable with distribution F. Partition the range set of X into s subsets called cells or categories, of which the ith will be denoted by C_i. We shall say that an observation x falls in the ith cell if x ∈ C_i. Let P* denote probabilities calculated under F*. Define
π_i = P(X ∈ C_i),   π_i* = P*(X ∈ C_i),   i = 1, ..., s.
Then carry out the multinomial test
H_0: π_i = π_i* for all i = 1, ..., s     (7.5)
H_1: not H_0.     (7.6)
Since H_0: F = F* being true implies that H_0: π_i = π_i*, i = 1, ..., s is true, rejecting H_0: π_i = π_i*, i = 1, ..., s causes us to reject H_0: F = F*.
Example 1. Let X = the number of cars sold per day by a certain car dealer, measured over a period of 100 days. We want to test
H_0: X is Poisson with parameter λ = 3.5
H_1: not H_0
at level α = .05. The observations are as follows.

Cell   Number of cars   n_i    e_i (λ = 3.5)   e_i (λ = 4.42)
0      0                1      3.02            1.20
1      1                4      10.57           5.32
2      2                11     18.50           11.76
3      3                16     21.58           17.32
4      4                26     18.88           19.14
5      5                21     13.22           16.92
6      6                12     7.71            12.46
7      7                5      3.85            7.87
8      {8, 9, ...}      4      2.67            8.00
                        n = Σ n_i = 100

The expected cell frequencies in the fourth column are computed from
e_i = n Pr{X = i} = 100 (3.5)^i e^{−3.5}/i!,   i = 0, 1, ..., 7,
e_8 = n Pr{X ≥ 8} = 100 − Σ_{i=0}^7 e_i.
Note that cell 8 is the interval {x ≥ 8}. But the observations imply that four observations were precisely 8 (needed to calculate λ̂ below). Also note that e_7 and e_8 are less than 5. Hence we must combine cells 7 and 8. The result is that a new cell 7 is created with n_7 = 9 and e_7 = 6.52. Similarly, we combine cells 0 and 1 into a new cell 1 with n_1 = 5 and e_1 = 13.59. For χ², we obtain
χ² = Σ_{i=1}^7 (n_i − e_i)²/e_i = 20.51,   χ²_{.05,6} = 12.5916,
so we reject H_0 at level .05.
Example 2. In the above example, the n_i's look as though they could come from a Poisson distribution, but with a mean larger than 3.5. Perhaps it was the value λ = 3.5 which was incorrect and inflated the value of χ². Let us use the same data and this time test
H_0: X is Poisson
H_1: not H_0.
In order to calculate the e_i's, we must supply an estimate of λ. For our estimator, we take the sample mean
λ̂ = x̄ = [ (1 × 0) + (4 × 1) + ... + (5 × 7) + (4 × 8) ] / 100 = 4.42,
computed from the original data. The expected cell frequencies are computed as before, but with 3.5 replaced by 4.42, and are given in the rightmost column in the above table. Once again, we have to combine cells 0 and 1 into a new combined cell 1, with n_1 = 5 and e_1 = 6.52. We find
χ² = Σ_{i=1}^8 (n_i − e_i)²/e_i = 6.99,   χ²_{.05,6} = 12.5916
(note: s − t − 1 = 8 − 1 − 1 = 6), and so we do not reject H_0.
Can we accept H_0? There are really two parts to this question. First, we have really tested the hypotheses in (7.5) and (7.6), and we have no idea of the power of the test. Secondly, even if we could accept the null hypothesis in (7.5), that does not mean we can accept the null hypothesis in (7.3). Hence the answer is a double NO! It is customary to conclude in a weaker way, by saying
(1) a Poisson distribution with λ = 4.42 provides a good fit to the data (especially if χ² is small compared to χ²_{α,s−t−1}), or
(2) the data are not inconsistent with the assumption of a Poisson population with λ = 4.42.
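The calculation in Example 2 can be reproduced as follows. This sketch is not part of the original notes; it takes λ = 4.42 as given (the value used there) and combines the low-count cells exactly as the notes do, assuming numpy and scipy are available:

import numpy as np
from scipy import stats

counts = np.array([1, 4, 11, 16, 26, 21, 12, 5, 4])      # cells 0, 1, ..., 7 and {8, 9, ...}
lam = 4.42
e = 100 * stats.poisson.pmf(np.arange(9), lam)
e[8] = 100 - e[:8].sum()                                 # cell 8 is the tail {x >= 8}
obs = np.concatenate(([counts[0] + counts[1]], counts[2:]))   # combine cells 0 and 1
exp = np.concatenate(([e[0] + e[1]], e[2:]))
chi2 = np.sum((obs - exp) ** 2 / exp)                    # about 7.0
print(round(chi2, 2), round(stats.chi2.ppf(0.95, len(obs) - 1 - 1), 2))  # vs chi^2_{.05,6}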
Example 3. We have the following data concerning times X to burnout for a certain brand of battery.

Cell   No. of Hours   n_i   e_i
1      0-5            37    39.34
2      5-10           20    23.87
3      10-15          17    14.47
4      15-20          13    8.78
5      20-25          8     5.32
6      over 25        5     8.20
                      n = 100

We wish to test
H_0: X is exponential
H_1: not H_0
at level .05.
Solution. The e_i's are calculated from
e_1 = n Pr{0 < X ≤ 5} = 100 ∫_0^5 (1/θ) e^{−x/θ} dx,
  ⋮
e_5 = n Pr{20 < X ≤ 25} = 100 ∫_{20}^{25} (1/θ) e^{−x/θ} dx,
e_6 = n Pr{X > 25} = 100 ∫_{25}^∞ (1/θ) e^{−x/θ} dx,
or e_6 can be calculated from Σ_{i=1}^6 e_i = n. We use the sample mean θ̂ = x̄ = 10 of our original observations as our estimate of θ. The e_i's can now be calculated from the formulae above, with θ taken to be 10, and are shown in the above table. We obtain χ² = 5.84 and χ²_{.05,4} = 9.488. Hence we do not reject H_0. An exponential distribution with mean 10 gives a good fit to the data.
7.4. CONTINGENCY TABLES 75
7.4 Contingency Tables
Consider the following data in the form of a contingency table resulting from a random sample of 350
students.
Interest in
Mathematics
low avg. high
Ability in low 65 40 15
Statistics avg. 54 63 29
high 12 45 27
We want to test
H
0
: ability in statistics and interest in mathematics are independent
H
1
: not H
0
.
Theory. Suppose each observation from a population has two attributes, measurable by the variables X
and Y . We want to test
H
0
: X and Y are independent
H
1
: not H
0
.
In the above example, X = ability in statistics and Y = interest in mathematics (both are measurable by
some means). Let A
1
, . . . , A
r
be r categories for the variable X, and B
1
, . . . , B
c
be c categories for the
variable Y . These give rise to r c categories for the vector (X, Y ), as in the following table.
B
1
B
2
. . . . . . B
c
A
1
A
2
.
.
.
A
r
Let
ij
= Pr{X A
i
, Y B
j
}, i = 1, . . . , r; j = 1, . . . , c. Then

i
=
c

j=1

ij
= Pr{X A
i
}, i = 1, . . . , r,

j
=
r

i=1

ij
= Pr{Y B
j
}, j = 1, . . . , c.
We shall carry out the multinomial test of hypotheses
H_0: p_{ij} = p_{i.} p_{.j} for all i = 1, ..., r; j = 1, ..., c
H_1: not H_0.
Note that rejection of this H_0 also rejects H_0: X and Y are independent. For this multinomial test, the expected cell frequencies are given (as usual) by
e_{ij} = n p_{ij} = n p_{i.} p_{.j},
where p_{i.} and p_{.j} are unknown and will be estimated by
p̂_{i.} = n_{i.}/n,   p̂_{.j} = n_{.j}/n,   i = 1, ..., r; j = 1, ..., c,
where n_{i.} = Σ_{j=1}^c n_{ij} and n_{.j} = Σ_{i=1}^r n_{ij}. The expected cell frequencies then become
e_{ij} = n (n_{i.}/n)(n_{.j}/n) = n_{i.} n_{.j} / n.
In this process, we have estimated t = (r − 1) + (c − 1) parameters (since Σ_{i=1}^r p_{i.} = 1 = Σ_{j=1}^c p_{.j}). The test is therefore: calculate
χ² = Σ_{i=1}^r Σ_{j=1}^c (n_{ij} − e_{ij})² / e_{ij}
and reject H_0 at level α if χ² > χ²_{α,(r−1)(c−1)}. (Note: the number of degrees of freedom is s − t − 1 = rc − (r − 1) − (c − 1) − 1 = (r − 1)(c − 1).)
Example 1. For our student example, the expected cell frequency table is

                               Interest in Mathematics
                               low     avg.    high
    Ability in       low      44.91   50.74   24.34
    Statistics       avg.     54.65   61.74   29.62
                     high     31.44   35.52   17.04

and we get $\chi^2 = 35.26$. Since $\chi^2_{.01,4} = 13.277$, we reject $H_0$ at level .01.
Example 2. 1200 U.S. stores are classified according to type and location, with the following results:

    Observed Cell Frequencies
                        N     S     E     W
    Clothing stores    219   200   181   180
    Grocery stores      39    52    89    60
    Other               42    48    30    60

    Expected Cell Frequencies
                        N     S     E     W
    Clothing stores    195   195   195   195
    Grocery stores      60    60    60    60
    Other               45    45    45    45

We get $\chi^2 = 38.07$. Since $\chi^2_{.05,6} = 12.592$, we reject $H_0$ at level .05.
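Both examples can be checked with a few lines of Python that build the expected cells $\hat e_{ij} = n_{i\cdot}\, n_{\cdot j}/n$ from the observed counts; the sketch below uses the store data of Example 2 (variable names are ours).

# Observed counts: rows = clothing, grocery, other; columns = N, S, E, W
obs = [[219, 200, 181, 180],
       [ 39,  52,  89,  60],
       [ 42,  48,  30,  60]]
row_tot = [sum(row) for row in obs]          # n_i.
col_tot = [sum(col) for col in zip(*obs)]    # n_.j
n = sum(row_tot)                             # 1200

chi2 = sum((obs[i][j] - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
           for i in range(len(obs)) for j in range(len(obs[0])))
print(round(chi2, 2))   # 38.07 > chi^2_{.05,6} = 12.592, so reject H_0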
Chapter 8
Non-Parametric Methods of Inference
Reference: WMS 7th ed., chapter 15
In this chapter, we discuss non-parametric, or distribution-free, methods of testing hypotheses. They are
called non-parametric because we do not assume a knowledge of the form of the population distribution.
8.1 The Sign Test
Suppose we have a random sample of size n from an unknown continuous distribution with median m and mean $\mu$. We want to test
$H_0$: $m = m_0$
$H_1$: $m \ne m_0$.
Note that if the underlying distribution is known to be symmetrical, then this is also a test of
$H_0$: $\mu = \mu_0$
$H_1$: $\mu \ne \mu_0$.
Method: Consider a sample value $> m_0$ a "success" and assign it a +. Consider a sample value $< m_0$ a "failure" and assign it a -. Any sample values equal to $m_0$ are dropped from the sample. Let n be the sample size after any deletions, and let X = the number of +'s. If $H_0$ is true, X has the binomial distribution with parameters n and p = 0.5. Hence we will reject $H_0$ if $X \le k'_{\alpha/2}$ or if $X \ge k_{\alpha/2}$. Two useful points are:
- $k_{\alpha/2} = n - k'_{\alpha/2}$.
- If $n \ge 20$, then
\[
Z = \frac{X - n/2}{\sqrt{n/4}}
\]
has approximately the standard normal distribution, and we would reject $H_0$ if $|Z| > z_{\alpha/2}$.
Example 1. The following observations represent prices (in thousands of dollars) of a certain brand of automobile at various dealerships across Canada, and are known to come from a symmetrical distribution:

      -     +     -     -     +     -     -     -     -     -
    18.1  20.3  18.3  15.6  22.5  16.8  17.6  16.9  18.2  17.0
    19.3  16.5  19.5  18.6  20.0  18.8  19.1  17.5  18.5  18.0
      -     -     +     -     +     -     -     -     -     -
We want to test
$H_0$: $\mu = 19.4$
$H_1$: $\mu \ne 19.4$.
We get X = 4. Since $k_{.025} = 15$ and $k'_{.025} = 5$, we reject $H_0$ at level .05.
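The critical values $k'_{.025} = 5$ and $k_{.025} = 15$ can be found directly from the Binomial(20, 1/2) distribution. A Python sketch using only the standard library (the helper binom_cdf and variable names are ours):

from math import comb

prices = [18.1, 20.3, 18.3, 15.6, 22.5, 16.8, 17.6, 16.9, 18.2, 17.0,
          19.3, 16.5, 19.5, 18.6, 20.0, 18.8, 19.1, 17.5, 18.5, 18.0]
m0, alpha = 19.4, 0.05
kept = [x for x in prices if x != m0]        # drop any observations equal to m0
n = len(kept)                                # 20
X = sum(1 for x in kept if x > m0)           # number of +'s, here 4

def binom_cdf(k, n):
    # P(X <= k) when X is Binomial(n, 1/2)
    return sum(comb(n, j) for j in range(k + 1)) / 2 ** n

k_lower = max(k for k in range(n + 1) if binom_cdf(k, n) <= alpha / 2)   # k'_{.025} = 5
k_upper = n - k_lower                                                    # k_{.025} = 15
print(X, k_lower, k_upper)   # X = 4 <= 5, so reject H_0 at level .05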
The Sign Test for a Matched Pairs Experiment.
Suppose (X, Y) has joint probability (density) function f(x, y). We have data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from (X, Y), and we want to test
$H_0$: $f(x, y) \equiv f(y, x)$
$H_1$: not $H_0$.
We let W = X - Y. Under $H_0$, $P[X - Y < 0] = P[X - Y > 0]$, so we can carry out a sign test using W. Here is the theoretical justification:

Proposition 8.1.1 If the joint probability (or density) function of X and Y is symmetric, then $P[X - Y = z] = P[X - Y = -z]$.
Proof. We are assuming that f(x, y) = f(y, x) for all x, y. Then
\[
P[X - Y = z] = \sum_{y} P[X - Y = z, Y = y] = \sum_{y} P[X = y + z, Y = y] = \sum_{y} f(y + z, y)
= \sum_{y} f(y, y + z) = \sum_{w} f(w - z, w) = P[X - Y = -z].
\]
Example 2. The numbers of defective memory modules produced per day by two operators A and B are observed for 30 days. The observations are:

      +      +      +             +      -      +      +      +      +
    (2,5)  (6,8)  (5,6)  (9,9)  (6,10)  (7,4)  (1,5)  (4,8)  (3,6)  (5,6)

      +      -      +      +      +             +      +      -      +
   (6,10)  (8,6)  (2,5)  (1,7)  (5,9)  (3,3)  (6,7)  (4,9)  (4,2)  (5,7)

      +      +      +      -      +       +      +      -      +      +
    (1,2)  (7,9)  (4,6)  (7,5)  (6,8)  (7,10)  (2,4)  (5,4)  (6,8)  (3,5)

We want to test
$H_0$: A and B perform equally well
$H_1$: A is better than B
at level .01. We assign a + to a pair (x, y) with x < y, a - if x > y, or we delete the pair (x, y) if x = y. Then the number of +'s is W = 23 and the sample size is n = 28. We obtain
\[
z = \frac{23 - 28/2}{\sqrt{28/4}} = 3.40,
\]
and since $z_{.01} = 2.326$, we reject $H_0$ at level .01.
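The same computation in Python (the pairs are those listed above; variable names are ours):

from math import sqrt

pairs = [(2,5), (6,8), (5,6), (9,9), (6,10), (7,4), (1,5), (4,8), (3,6), (5,6),
         (6,10), (8,6), (2,5), (1,7), (5,9), (3,3), (6,7), (4,9), (4,2), (5,7),
         (1,2), (7,9), (4,6), (7,5), (6,8), (7,10), (2,4), (5,4), (6,8), (3,5)]

kept = [(x, y) for (x, y) in pairs if x != y]   # delete the tied pairs (9,9) and (3,3)
n = len(kept)                                   # 28
W = sum(1 for (x, y) in kept if x < y)          # number of +'s, here 23
z = (W - n / 2) / sqrt(n / 4)
print(n, W, round(z, 2))   # 28 23 3.4, and 3.4 > z_{.01} = 2.326, so reject H_0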
Remark. In the sign test, we have to assume f(x, y) = f(y, x). As the following example shows, it is not enough to assume that X and Y have the same distribution, as is done in Mendenhall.
Example. Suppose X and Y have joint probability function f(x, y) given by

              y = 1   y = 2   y = 3
    x = 1       0       a       b
    x = 2       b       0       a
    x = 3       a       b       0

where $a, b \ge 0$, $a \ne b$ and $a + b = 1/3$. Then X and Y have the same (uniform) distribution, but
\[
P[X - Y > 0] = P[X = 3, Y = 1] + P[X = 3, Y = 2] + P[X = 2, Y = 1] = a + 2b = \tfrac{1}{3} + b,
\]
\[
P[X - Y < 0] = P[X = 1, Y = 3] + P[X = 2, Y = 3] + P[X = 1, Y = 2] = 2a + b = a + \tfrac{1}{3}.
\]
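A quick numerical check of this counterexample, taking for instance a = 1/4 and b = 1/12 (any $a \ne b$ with $a + b = 1/3$ would do; the choice is ours):

from fractions import Fraction

a, b = Fraction(1, 4), Fraction(1, 12)
f = {(1, 1): 0, (1, 2): a, (1, 3): b,
     (2, 1): b, (2, 2): 0, (2, 3): a,
     (3, 1): a, (3, 2): b, (3, 3): 0}

marg_x = [sum(f[(x, y)] for y in (1, 2, 3)) for x in (1, 2, 3)]   # each equals 1/3
marg_y = [sum(f[(x, y)] for x in (1, 2, 3)) for y in (1, 2, 3)]   # each equals 1/3
p_pos = sum(p for (x, y), p in f.items() if x > y)                # P[X - Y > 0] = 5/12
p_neg = sum(p for (x, y), p in f.items() if x < y)                # P[X - Y < 0] = 7/12
print(marg_x, marg_y, p_pos, p_neg)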
8.2 The Mann-Whitney, or U-Test
We have samples $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$ from two densities f(x) and g(y) respectively, which are known to satisfy
\[
f(x) \equiv g(x - \Delta)
\]
for some number $\Delta$. That is, we are assuming that the two underlying distributions differ only in location along the horizontal axis. We want to test
$H_0$: $\Delta = 0$
$H_1$: $\Delta \ne 0$
(or $H_1$: $\Delta > 0$, i.e. f shifted to the right of g, or $H_1$: $\Delta < 0$, i.e. f shifted to the left of g).
Method: Combine the two samples, order the resulting combined sample, and assign ranks in order of increasing size. If there is a tie, assign to each of the tied observations the mean of the ranks which they jointly occupy. Define
\[
U_x = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - W_x,
\]
where $W_x$ = the sum of the ranks assigned to the values from the x-sample. Note that the minimum value of $W_x$ occurs when the first sample occupies the $n_1$ lowest ranks, and then $W_{x,\min} = 1 + 2 + \cdots + n_1 = n_1(n_1 + 1)/2$. The maximum value of $W_x$ occurs when the first sample occupies the highest $n_1$ ranks, and then $W_{x,\max} = (n_2 + 1) + (n_2 + 2) + \cdots + (n_2 + n_1 - 1) + (n_2 + n_1) = n_1 n_2 + n_1(n_1 + 1)/2$. Hence
\[
\frac{n_1(n_1 + 1)}{2} \le W_x \le n_1 n_2 + \frac{n_1(n_1 + 1)}{2}
\]
and so $0 \le U_x \le n_1 n_2$.
It may be shown that if $H_0$ is true, then the distribution of $U_x$ is symmetric about its middle value, and
\[
E(W_x) = \frac{n_1(n_1 + n_2 + 1)}{2}, \qquad \operatorname{Var}(W_x) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12},
\]
and so
\[
E(U_x) = \frac{n_1 n_2}{2}, \qquad \operatorname{Var}(U_x) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}.
\]
$U_y$ is defined in the same way. The above expressions are also true for $U_y$, provided $n_1$ and $n_2$ are interchanged. We also have
\[
U_x + U_y = n_1 n_2.
\]
Case 1. $n_1, n_2$ small. We use the exact distribution of $U_x$. We reject $H_0$ at level $\alpha$ if $U \le U_0$ (or $U_x \le U_0$, or $U_y \le U_0$), where $U = \min\{U_x, U_y\}$ and $U_0$ is such that $P\{U \le U_0\} = \alpha/2$ (two-tailed test), or $P\{U_x \le U_0\} = \alpha$ ($\Delta > 0$), or $P\{U_y \le U_0\} = \alpha$ ($\Delta < 0$).
Case 2. $n_1, n_2$ large (i.e. $n_1 > 8$, $n_2 > 8$). Then
\[
Z = \frac{U_x - E(U_x)}{\sqrt{\operatorname{Var}(U_x)}}
\]
has approximately a standard normal distribution. We would therefore reject $H_0$ at level $\alpha$ if $|Z| > z_{\alpha/2}$.
Example 3. Consider the following observations, representing the potencies of two samples of penicillin, one from company X, the other from company Y.

    Rank         10    14     6     9    12    13     4    17    11     7
    Company X   54.8  62.6  51.2  54.5  56.0  59.4  48.2  70.5  55.2  51.9
    Company Y   50.1  44.6  65.9  72.2  74.6  39.4  52.8  78.2  45.8  68.7
    Rank          5     2    15    18    19     1     8    20     3    16

Are these data sufficient to indicate that the distributions of potencies of penicillin from these two companies differ in location? Use $\alpha = .05$.
Solution. We get $n_1 = n_2 = 10$, $W_x = 103$ and so $U_x = 52$. Hence $U_y = 100 - 52 = 48$ and $U = 48$. From the tables, we find $U_0 = 23$, so we do not reject. Alternatively, using the normal approximation,
\[
Z = \frac{52 - 50}{\sqrt{175}} = 0.15.
\]
Since $z_{.025} = 1.96$, we do not reject $H_0$.
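A sketch of the ranking and the U statistics in Python (there are no ties in these data, so plain ranks suffice; variable names are ours):

from math import sqrt

x = [54.8, 62.6, 51.2, 54.5, 56.0, 59.4, 48.2, 70.5, 55.2, 51.9]   # company X
y = [50.1, 44.6, 65.9, 72.2, 74.6, 39.4, 52.8, 78.2, 45.8, 68.7]   # company Y

rank = {v: i + 1 for i, v in enumerate(sorted(x + y))}   # ranks 1..20 in the combined sample
n1, n2 = len(x), len(y)
Wx = sum(rank[v] for v in x)                             # 103
Ux = n1 * n2 + n1 * (n1 + 1) // 2 - Wx                   # 52
Uy = n1 * n2 - Ux                                        # 48

z = (Ux - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(Wx, Ux, Uy, round(z, 2))   # 103 52 48 0.15, and |0.15| < 1.96, so do not reject H_0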
Remarks. If $V_x$ = the number of observations in the x-sample which precede each observation in the y-sample, summed over the y-sample, then $V_x = U_x$. (For if $r_{y1}, \ldots, r_{y n_2}$ denote the ranks of the observations in the y-sample, arranged in increasing order, then $V_x = (r_{y1} - 1) + (r_{y2} - 2) + \cdots + (r_{y n_2} - n_2) = W_y - n_2(n_2 + 1)/2 = n_1 n_2 - U_y = U_x$.)
8.3 Tests for Randomness Based on Runs.
Consider the following example. A long line of trees is observed one by one, and each tree is classied as S
(healthy) or F (diseased). The result is
SSSFSSSSSFFFSSSSSSFFFFSSSSFFSSSSSSSSSSFFFFSS. (8.1)
We want to test
$H_0$: the ordering of the sample is random
$H_1$: not $H_0$.
Observe that the Ss and Fs seem to be clustered, perhaps because diseased trees tend to infect neighbouring
trees. On the other hand, this ordering may be due merely to chance. We want to test whether or not this
particular ordering of 30 Ss and 14 Fs is due to chance.
Remark. Note that the opposite case of SFSFSFFSF . . . where Ss and Fs virtually alternate is also
non-random.
Theory. Suppose a sequence such as in (8.1) is made up of m Ss and n Fs. A maximal subsequence
consisting of like symbols is called a run. In (8.1), the runs are SSS, F, SSSSS, FFF, SSSSSS, FFFF, SSSS,
FF, SSSSSSSSSS, FFFF, and SS, of lengths 3,1,5,3,6,4,4,2,10,4, and 2 respectively. Thus, there are 6 runs
of S and 5 runs of F, for a total of 11 runs.
In general, let R denote the number of runs of both types (so that R = 11 in the above example). Then the minimum possible value of R is $r_{\min} = 2$ and the maximum value is
\[
r_{\max} =
\begin{cases}
2m & \text{if } m = n,\\
2\min\{m, n\} + 1 & \text{if } m \ne n.
\end{cases}
\]
In order to compute the probability function of R, we shall begin by counting the number N(r) of ways that m Ss and n Fs can be ordered so that r runs result. Suppose first that r = 2k, so there are k runs of S and k runs of F.
- N(r) = the number of such orderings which start with S plus the number which start with F.
- The number of ways in which m Ss can form k runs is the same as the number of ways that k - 1 bars can be put in m - 1 spaces (a bar represents a run of Fs), and is therefore $\binom{m-1}{k-1}$.
- Each such way generates the number $\binom{n-1}{k-1}$ of ways in which n Fs can be put into k boxes, such that each box contains at least one F.
Combining the above three points, we find that
\[
N(r) = 2\binom{m-1}{k-1}\binom{n-1}{k-1}.
\]
When r = 2k + 1, we similarly find that
\[
N(r) = \binom{m-1}{k}\binom{n-1}{k-1} + \binom{m-1}{k-1}\binom{n-1}{k}.
\]
Now suppose that the sequence of m Ss and n Fs is generated at random, in the sense that all $\binom{m+n}{m}$ orderings are equally likely. Then we arrive at
\[
\Pr\{R = r\} =
\begin{cases}
\dfrac{2\binom{m-1}{k-1}\binom{n-1}{k-1}}{\binom{m+n}{m}} & \text{if } r = 2k,\\[2ex]
\dfrac{\binom{m-1}{k}\binom{n-1}{k-1} + \binom{m-1}{k-1}\binom{n-1}{k}}{\binom{m+n}{m}} & \text{if } r = 2k + 1.
\end{cases}
\tag{8.2}
\]
From this, we can calculate that
\[
E(R) = \frac{2mn}{m+n} + 1, \qquad \operatorname{Var}(R) = \frac{2mn(2mn - m - n)}{(m+n)^2 (m+n-1)}. \tag{8.3}
\]
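Formula (8.2) is easy to evaluate on a computer, and summing $r \cdot \Pr\{R = r\}$ over the possible values of r recovers the mean in (8.3). A Python sketch (the function name run_pmf is ours):

from math import comb

def run_pmf(m, n):
    # Pr{R = r} from (8.2), for r = 2, ..., 2*min(m, n) + 1 (the last probability is 0 when m = n)
    total = comb(m + n, m)
    pmf = {}
    for r in range(2, 2 * min(m, n) + 2):
        k = r // 2
        if r % 2 == 0:
            ways = 2 * comb(m - 1, k - 1) * comb(n - 1, k - 1)
        else:
            ways = comb(m - 1, k) * comb(n - 1, k - 1) + comb(m - 1, k - 1) * comb(n - 1, k)
        pmf[r] = ways / total
    return pmf

pmf = run_pmf(30, 14)
print(round(sum(pmf.values()), 6))                    # 1.0
print(round(sum(r * p for r, p in pmf.items()), 2))   # 20.09 = 2mn/(m+n) + 1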
Method: If m and n are both large ($m, n \ge 10$), then under $H_0$,
\[
Z = \frac{R - E(R)}{\sqrt{\operatorname{Var}(R)}}
\]
(where E(R) and Var(R) are given in (8.3)) has approximately the standard normal distribution, and so we would reject $H_0$ at level $\alpha$ if $|Z| > z_{\alpha/2}$. Otherwise (if m and n are not large), we must use the probability function of R given in (8.2) directly.
Example. For the example with the trees, we have m = 30, n = 14, and so E(R) = 20.1 and Var(R) = 8.03. We found R = 11, and therefore
\[
Z = \frac{11 - 20.1}{\sqrt{8.03}} = -3.21.
\]
Since $|Z| = 3.21 > z_{.025} = 1.96$, we reject $H_0$ at level .05.
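The run count and the normal approximation for the tree data can be verified with a few lines of Python (variable names are ours):

from math import sqrt

seq = "SSSFSSSSSFFFSSSSSSFFFFSSSSFFSSSSSSSSSSFFFFSS"
m, n = seq.count("S"), seq.count("F")                                # 30 and 14
R = 1 + sum(1 for i in range(1, len(seq)) if seq[i] != seq[i - 1])   # number of runs, 11

ER = 2 * m * n / (m + n) + 1
VarR = 2 * m * n * (2 * m * n - m - n) / ((m + n) ** 2 * (m + n - 1))
z = (R - ER) / sqrt(VarR)
print(R, round(ER, 1), round(VarR, 2), round(z, 2))   # 11 20.1 8.03 -3.21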