is a subset of the cartesian product
\[ A \times B := \{(a, b) : a \in A, \ b \in B\} \]
such that any $a \in A$ is the first component of exactly one ordered pair $(a, b)$. If $a$ is the first component of $(a, b)$, we write $b = f(a)$, called the value of $f$ at $a$.
2.1. BASIC NOTIONS 9
A sample space $\Omega$ is called discrete if it is finite or countably infinite, i.e. there exists an enumeration $\Omega = \{\omega_j\}_{j=1}^{\infty}$ for $\Omega$.
10 CHAPTER 2. DISCRETE PROBABILITY SPACES
Examples 2.1.2. The sample spaces in Example 2.1.1 (1) and (2) are discrete, but those in (3) and (4) are not.
Definition 2.1.2. A subset $A$ of a discrete sample space $\Omega$ is called an event. An event of the form $\{\omega\}$ with $\omega \in \Omega$ is called an elementary event.
Note that an event is a set $A$ consisting of certain outcomes $\omega \in \Omega$. When the experiment is practically (or mentally) carried out, it yields a certain sample (outcome) $\omega \in \Omega$. We say that the event $A$ takes place, or occurs, or is realized with this sample if $\omega \in A$.
Examples 2.1.3.
(1) Tossing a coin: The events are $\emptyset$, $\Omega$, $\{H\}$ and $\{T\}$.
(2) When throwing a die, the idea "obtain an even number" is represented by the event $A := \{2, 4, 6\} \subseteq \Omega = \{1, 2, 3, 4, 5, 6\}$.
Exercise 2.1.3. Write down all events for tossing a coin.
Given two events $A, B \subseteq \Omega$, we say that $A$ or $B$ occurs if $A \cup B$ occurs. Similarly, we say that $A$ and $B$ occur if $A \cap B$ occurs.
Exercise 2.1.4. When throwing a die, what is the event that the outcome is an even number and greater than two? What is the event that it is an even number or greater than two?
Now some events may be particularly interesting for us, and we might
want to somehow assign them a number telling how likely their occurrence
is.
Examples 2.1.4.
(1) Maybe $\{H\}$ is interesting when tossing a coin to make an otherwise difficult decision by chance.
(2) When throwing a die we might want to know how likely it is to obtain an even number. That would be the probability of $\{2, 4, 6\}$.
If the model is appropriate for the experiment, then the probability of an event $A$ should be close to the relative frequency (sometimes also called empirical probability) observed when performing the experiment repeatedly and in such a way that the different trials do not influence each other. If $n$ denotes the total number of trials, and $n_A$ the number of trials in which the event $A$ occurs, then the relative frequency is given by the ratio
\[ \frac{n_A}{n}. \]
Writing $P(A)$ for the probability of $A$ (which is yet to be defined at this point), the best we could hope for is to observe the limit relation
\[ \lim_{n \to \infty} \frac{n_A}{n} = P(A). \]
While the sample space is often dictated by the design of the experiment or by common sense, the question which probabilities to assign to an event really is a matter of modelling. The following notion already provides a complete mathematical model for many applications.
Definition 2.1.3. Let $\Omega = \{\omega_j\}_{j=1}^{\infty}$ be a discrete sample space. A probability measure $P$ on the discrete sample space $\Omega$ is a countable collection $\{p_j\}_{j=1}^{\infty}$ of numbers $p_j \geq 0$ with $\sum_{j=1}^{\infty} p_j = 1$. The probability of an event $A$ is given by
\[ P(A) := \sum_{j \geq 1 :\, \omega_j \in A} p_j. \]
We refer to $(\Omega, P)$ as a discrete probability space.
If $\Omega = \{\omega_j\}_{j=1}^{\infty}$, $\{p_j\}_{j=1}^{\infty}$ and $P$ are as in the definition, then obviously
\[ P(\{\omega_j\}) = p_j, \quad j = 1, 2, \ldots \]
are the probabilities of the elementary events $\{\omega_j\}$.
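In code, Definition 2.1.3 amounts to storing the weights $p_j$ and summing them over the outcomes of an event. A minimal sketch (the helper name `prob` and the dictionary layout are our own choices, not from the text):

```python
# A discrete probability measure as a mapping omega_j -> p_j with
# nonnegative weights summing to 1; P(A) sums the weights of outcomes in A.

def prob(p, A):
    """P(A) = sum of p_j over outcomes omega_j in A (Definition 2.1.3)."""
    return sum(weight for omega, weight in p.items() if omega in A)

# Fair die: p_1 = ... = p_6 = 1/6.
p = {omega: 1 / 6 for omega in range(1, 7)}
assert abs(sum(p.values()) - 1) < 1e-12  # weights sum to 1

even = {2, 4, 6}
print(prob(p, even))  # P({2, 4, 6}) = 1/2
```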
Remark 2.1.1. This is a definition in the style of classical probability. When dealing with uncountably infinite sample spaces (i.e. to describe experiments with a continuum of possible outcomes, such as determining the lifespan of an organism), it is usually impossible to come up with a reasonable way of assigning a probability to each subset $A \subseteq \Omega$. Later we will see how to fix this by regarding only a proper subset $\mathcal{F}$ of the power set $\mathcal{P}(\Omega)$ as the collection of events.
We list a couple of properties of probability measures.
Lemma 2.1.1. Let $(\Omega, P)$ be a discrete probability space.

(i) $P(\Omega) = 1$ and $P(\emptyset) = 0$.

(ii) If $A \subseteq B$ then $P(A) \leq P(B)$.

(iii) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

(iv) $P(A^c) = 1 - P(A)$.

(v) $P\left(\bigcup_{i=1}^{\infty} A_i\right) \leq \sum_{i=1}^{\infty} P(A_i)$.

(vi) If $A_1, A_2, A_3, \ldots$ are pairwise disjoint events (i.e. if $A_i \cap A_k = \emptyset$ whenever $i \neq k$) then
\[ P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i). \]
Proof. According to the definition we have $P(\Omega) = \sum_{j=1}^{\infty} p_j = 1$, because $\omega_j \in \Omega$ for all $j$, and $P(\emptyset) = 0$, because there is no $j$ such that $\omega_j \in \emptyset$. This is (i). If $A \subseteq B$ we have
\[ \sum_{j \geq 1 :\, \omega_j \in A} p_j \leq \sum_{j \geq 1 :\, \omega_j \in B} p_j, \]
which gives (ii). For (iii) note that
\[ \sum_{j \geq 1 :\, \omega_j \in A \cup B} p_j = \sum_{j \geq 1 :\, \omega_j \in A} p_j + \sum_{j \geq 1 :\, \omega_j \in B} p_j - \sum_{j \geq 1 :\, \omega_j \in A \cap B} p_j \]
(subtracting the last sum ensures each index $j$ with $\omega_j \in A \cup B$ appears exactly once on the right hand side of the equality). A special case of (iii) together with (i) gives (iv). Finally, we have
\[ \sum_{j \geq 1 :\, \omega_j \in \bigcup_{i=1}^{\infty} A_i} p_j \leq \sum_{i=1}^{\infty} \sum_{j \geq 1 :\, \omega_j \in A_i} p_j, \]
with equality if the $A_i$ are pairwise disjoint. This proves (v) and (vi).
Remark 2.1.2. Given a discrete probability space $(\Omega, P)$, we should think of the probability measure $P$ as a function $P : \mathcal{P}(\Omega) \to [0, 1]$ from the power set $\mathcal{P}(\Omega)$ of $\Omega$ into the unit interval $[0, 1]$. This function is normed by (i) and monotone by (ii), and the properties (v) and (vi) are called $\sigma$-subadditivity and $\sigma$-additivity, respectively.
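Properties (iii) and (iv) of Lemma 2.1.1 can be checked mechanically on a small example. The following sketch does so for the fair die with exact fractions; the events chosen are our own illustrative choice:

```python
from fractions import Fraction

# Check Lemma 2.1.1 (iii) and (iv) on the fair die, using exact arithmetic.
p = {omega: Fraction(1, 6) for omega in range(1, 7)}

def P(A):
    return sum(p[omega] for omega in A)

Omega = set(p)
A, B = {2, 4, 6}, {3, 4, 5, 6}             # "even" and "greater than two"
assert P(A | B) == P(A) + P(B) - P(A & B)  # (iii)
assert P(Omega - A) == 1 - P(A)            # (iv): P(A^c) = 1 - P(A)
print(P(A | B))  # P({2, 3, 4, 5, 6}) = 5/6
```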
Examples 2.1.5.
(1) Tossing a fair coin: $\Omega = \{H, T\}$, $P(\{H\}) = \frac{1}{2}$.

(2) Throwing a fair die: $\Omega = \{1, \ldots, 6\}$, $P(\{\omega\}) = p_1 = \ldots = p_6 = \frac{1}{6}$.

(3) Throwing an unfair (biased, marked) die: $\Omega = \{1, \ldots, 6\}$, $P(\{6\}) = p_6 = \frac{1}{2}$ and $P(\{\omega\}) = p_1 = \ldots = p_5 = \frac{1}{10}$ for $\omega \in \{1, 2, 3, 4, 5\}$. Of course there are also many other ways to make a die unfair. But whether it is fair or not should be reflected by the choice of the probability measure.

(4) Tossing two fair coins at once (or tossing one coin twice, but without any influence between the two trials): $\Omega = \{(H, H), (H, T), (T, H), (T, T)\}$ and $P(\{\omega\}) = \frac{1}{4}$, $\omega \in \Omega$.

(5) To open a safe one needs to know the correct code consisting of 4 digits, each between 0 and 9. We would like to know the probability of finding it by chance. In this case
\[ \Omega = \{(0, 0, 0, 0), (0, 0, 0, 1), \ldots, (9, 9, 9, 8), (9, 9, 9, 9)\} \]
has $10^4$ different elements, and common sense suggests that in a model for this problem each of these four-tuples should have the same probability, namely $10^{-4}$. Hence the probability to find the correct code is $10^{-4}$.

(6) The letters A, B, C and D are randomly arranged in a line, each order being equally likely. Then the probability to get the string ADCB is $\frac{1}{4!} = \frac{1}{24}$. The sample space for this example is made up of all possible permutations of these letters,
\[ \Omega = \{ABCD, ABDC, ACBD, \ldots, DCBA\}. \]

(7) Suppose we have limited space and want to invite 20 out of our 100 friends to a party. We cannot decide and do it randomly, but in a fair manner (we do not prefer any combination). Given a fixed choice of 20 friends, the probability that exactly these friends will be invited is $\binom{100}{20}^{-1}$. (This is a small number; maybe we had better make a more emotional, non-random choice ...) Here the sample space consists of all combinations of 20 out of 100 elements, but it would be tedious to write it down more explicitly. However, we know it has $\binom{100}{20}$ elements.

(8) In a group there are 10 women and 10 men. We would like to form a gender balanced team of four people. We do it in a fair manner and ask for the probability that some fixed, preferred team we have in mind will be the chosen one. The sample space could be thought of as consisting of elements $((m_1, m_2), (w_1, w_2))$, where $(m_1, m_2)$ is a randomly chosen combination of two out of ten men, and $(w_1, w_2)$ is a randomly chosen combination of two out of ten women. For each of these two choices there are $\binom{10}{2}$ possibilities, hence our sample space will have $\binom{10}{2}\binom{10}{2}$ elements. The probability to choose the preferred team is $\left(\binom{10}{2}\binom{10}{2}\right)^{-1}$.
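The combinatorial counts in the last two examples can be evaluated directly, for instance with Python's `math.comb`; the following sketch simply restates the computations above:

```python
import math

# Party example: probability that one fixed choice of 20 out of 100
# friends is the invited one.
p_party = 1 / math.comb(100, 20)

# Team example: one fixed team of 2 out of 10 men and 2 out of 10 women.
p_team = 1 / (math.comb(10, 2) * math.comb(10, 2))

print(math.comb(100, 20))  # number of possible guest lists
print(p_team)              # 1 / (45 * 45) = 1/2025
```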
Exercise 2.1.5. Why is it (strictly speaking) wrong to write $P(H)$ in Example 2.1.5 (1) or $P(1)$ in (2)?
Please be careful: some textbooks, encyclopedias or articles will nevertheless write these wrong expressions. This is often done to have a simplified (short-hand) notation, but with the agreement (often written somewhere, but sometimes silent) that it willingly replaces the mathematically correct expression.
Remark 2.1.3. The best way to solve modelling problems is: first determine the sample space, and afterwards come up with a reasonable probability measure on it. Once the sample space is written correctly, the subsequent parts of the problem become much simpler.
Exercise 2.1.6. To design a new product, 10 items have been produced; 8 of them are of high quality but 2 are defective. In a test we randomly pick two of these 10 items. No item is preferred, and we do not place a drawn item back. How likely is it that our random choice gives one high quality and one defective item?
Tue, Oct 15, 2013
Lemma 2.1.2. (Inclusion-exclusion principle)
Let $(\Omega, P)$ be a discrete probability space and let $A_1, A_2, \ldots, A_n$ be events. Then
\begin{align*}
P\left(\bigcup_{i=1}^{n} A_i\right) &= \sum_{i=1}^{n} P(A_i) - \sum_{1 \leq i_1 < i_2 \leq n} P(A_{i_1} \cap A_{i_2}) \\
&\quad + \sum_{1 \leq i_1 < i_2 < i_3 \leq n} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) - \ldots \\
&\quad + (-1)^{n-1} P(A_1 \cap \ldots \cap A_n).
\end{align*}
This generalizes Lemma 2.1.1 (iii). For $n = 3$ and three events $A, B, C$, Lemma 2.1.2 gives
\begin{align*}
P(A \cup B \cup C) &= P(A) + P(B) + P(C) \\
&\quad - P(A \cap B) - P(A \cap C) - P(B \cap C) \\
&\quad + P(A \cap B \cap C).
\end{align*}
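For concreteness, the $n = 3$ formula can be verified on the fair die; the three events below are our own illustrative choice:

```python
from fractions import Fraction

# Verify the n = 3 inclusion-exclusion formula on the fair die with
# A = even, B = at most three, C = multiples of three.
P = lambda S: Fraction(len(S), 6)   # uniform measure on {1, ..., 6}
A, B, C = {2, 4, 6}, {1, 2, 3}, {3, 6}

lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
assert lhs == rhs
print(lhs)  # P({1, 2, 3, 4, 6}) = 5/6
```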
Exercise 2.1.7. Draw the corresponding Venn diagram for n = 3.
Exercise 2.1.8. Review or study the basic idea of proof by induction.
Proof. We proceed by induction. For $n = 1$ there is nothing to prove, and for $n = 2$ the statement is known to be true by Lemma 2.1.1 (iii). We assume it is true for $n$; this is called the induction hypothesis. We use the induction hypothesis to prove the statement of the lemma for $n + 1$. If we manage to do so, it must be true for all natural numbers $n$ and all choices of events $A_1, \ldots, A_n$, as desired.
Given $n + 1$ events $A_1, \ldots, A_n, A_{n+1}$, we observe
\begin{align*}
P\left(\bigcup_{i=1}^{n+1} A_i\right) &= P\left(\left(\bigcup_{i=1}^{n} A_i\right) \cup A_{n+1}\right) \\
&= P\left(\bigcup_{i=1}^{n} A_i\right) + P(A_{n+1}) - P\left(\left(\bigcup_{i=1}^{n} A_i\right) \cap A_{n+1}\right)
\end{align*}
by Lemma 2.1.1 (iii). The distributivity rule of set operations tells us
\[ \left(\bigcup_{i=1}^{n} A_i\right) \cap A_{n+1} = \bigcup_{i=1}^{n} (A_i \cap A_{n+1}), \]
and by the induction hypothesis this event has probability
\begin{align*}
P\left(\bigcup_{i=1}^{n} (A_i \cap A_{n+1})\right) &= \sum_{i=1}^{n} P(A_i \cap A_{n+1}) - \sum_{1 \leq i_1 < i_2 \leq n} P(A_{i_1} \cap A_{i_2} \cap A_{n+1}) \\
&\quad + \sum_{1 \leq i_1 < i_2 < i_3 \leq n} P(A_{i_1} \cap A_{i_2} \cap A_{i_3} \cap A_{n+1}) - \ldots \\
&\quad + (-1)^{n-1} P(A_1 \cap \ldots \cap A_n \cap A_{n+1}).
\end{align*}
Using the induction hypothesis once more, now on $\bigcup_{i=1}^{n} A_i$, we obtain
\begin{align*}
P\left(\bigcup_{i=1}^{n+1} A_i\right) &= \sum_{i=1}^{n} P(A_i) + P(A_{n+1}) - \sum_{1 \leq i_1 < i_2 \leq n} P(A_{i_1} \cap A_{i_2}) - \sum_{i=1}^{n} P(A_i \cap A_{n+1}) \\
&\quad + \sum_{1 \leq i_1 < i_2 < i_3 \leq n} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) + \sum_{1 \leq i_1 < i_2 \leq n} P(A_{i_1} \cap A_{i_2} \cap A_{n+1}) \\
&\quad + \ldots \\
&\quad + (-1)^{n-1} P(A_1 \cap \ldots \cap A_n) + (-1)^{n-1} \sum_{1 \leq i_1 < \ldots < i_{n-1} \leq n} P(A_{i_1} \cap \ldots \cap A_{i_{n-1}} \cap A_{n+1}) \\
&\quad + (-1)^{n} P(A_1 \cap \ldots \cap A_n \cap A_{n+1}),
\end{align*}
which is
\begin{align*}
&\sum_{i=1}^{n+1} P(A_i) - \sum_{1 \leq i_1 < i_2 \leq n+1} P(A_{i_1} \cap A_{i_2}) \\
&\quad + \sum_{1 \leq i_1 < i_2 < i_3 \leq n+1} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) - \ldots + (-1)^{n} P(A_1 \cap \ldots \cap A_{n+1}),
\end{align*}
as desired.
To discuss the next example (which, by the way, is a classic) we need to talk about functions again. A function $f : A \to B$ is called surjective if each element of the target space is the value of some argument under $f$. A function $f : A \to B$ is called injective if different arguments have different values, and bijective if it is both injective and surjective.

Now suppose (this is the classic hat-check problem) that $n$ people leave their hats at the wardrobe and that, on leaving, each person takes back a hat at random, each of the $n!$ possible assignments (bijections between people and hats) being equally likely. Let $E_i$ be the event that person $i$ receives their own hat. For fixed indices $i_1 < \ldots < i_k$ there are $(n-k)!$ assignments returning these $k$ hats correctly, so $P(E_{i_1} \cap \ldots \cap E_{i_k}) = \frac{(n-k)!}{n!}$. Summing over all $\binom{n}{k}$ such choices of indices,
\[ \sum_{1 \leq i_1 < \ldots < i_k \leq n} P(E_{i_1} \cap \ldots \cap E_{i_k}) = \binom{n}{k} \frac{(n-k)!}{n!} = \frac{1}{k!}, \]
and by Lemma 2.1.2,
\[ P\left(\bigcup_{i=1}^{n} E_i\right) = 1 - \frac{1}{2!} + \frac{1}{3!} - \ldots + (-1)^{n+1} \frac{1}{n!} = \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k!}. \]
As $n$ tends to infinity, this tends to
\[ \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k!} = 1 - \frac{1}{e}, \]
which shows that for a large number $n$ of people the chance that not a single person gets their own hat back is approximately $\frac{1}{e}$.
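The inclusion-exclusion series can be checked against a brute-force count over all permutations for small $n$; the helper name below is our own:

```python
import math
from itertools import permutations

# Brute-force check for small n: count permutations with at least one
# fixed point and compare with the series sum_{k=1}^n (-1)^{k+1} / k!.
def p_some_fixed_point(n):
    count = sum(1 for sigma in permutations(range(n))
                if any(sigma[i] == i for i in range(n)))
    return count / math.factorial(n)

for n in (3, 5, 7):
    series = sum((-1) ** (k + 1) / math.factorial(k) for k in range(1, n + 1))
    assert abs(p_some_fixed_point(n) - series) < 1e-12

print(1 - 1 / math.e)  # the limiting value, roughly 0.632
```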
We finally record some continuity properties of probability measures. A sequence $(A_n)_{n=1}^{\infty}$ of events is called increasing if it is increasing in the set-theoretic sense,
\[ A_1 \subseteq A_2 \subseteq \ldots \subseteq A_n \subseteq A_{n+1} \subseteq \ldots, \]
and decreasing if it is decreasing in the set-theoretic sense,
\[ A_1 \supseteq A_2 \supseteq \ldots \supseteq A_n \supseteq A_{n+1} \supseteq \ldots \]
Lemma 2.1.3. Let $(\Omega, P)$ be a discrete probability space.

(i) If $(A_n)_{n=1}^{\infty}$ is an increasing sequence of events, then
\[ \lim_{n \to \infty} P(A_n) = P\left(\bigcup_{n=1}^{\infty} A_n\right). \]

(ii) If $(A_n)_{n=1}^{\infty}$ is a decreasing sequence of events, then
\[ \lim_{n \to \infty} P(A_n) = P\left(\bigcap_{n=1}^{\infty} A_n\right). \]
Proof. To see (i), set
\[ B_1 := A_1, \quad B_2 := A_2 \setminus B_1, \quad \ldots, \quad B_n := A_n \setminus \left(\bigcup_{i=1}^{n-1} B_i\right), \quad \ldots \]
Note that the $B_i$ are pairwise disjoint, $\bigcup_{i=1}^{\infty} B_i = \bigcup_{i=1}^{\infty} A_i$, and $\bigcup_{i=1}^{n} B_i = \bigcup_{i=1}^{n} A_i = A_n$ for all $n$. Then
\[ P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} P(B_i) = \lim_{n \to \infty} \sum_{i=1}^{n} P(B_i) = \lim_{n \to \infty} P\left(\bigcup_{i=1}^{n} B_i\right) = \lim_{n \to \infty} P(A_n). \]
Statement (ii) is left as an exercise; just use complements.
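Continuity from below can be illustrated numerically; the sketch below uses the (assumed, illustrative) discrete space $\Omega = \{1, 2, 3, \ldots\}$ with $p_j = 2^{-j}$ and the increasing events $A_n = \{1, \ldots, n\}$:

```python
# Continuity from below on Omega = {1, 2, ...} with p_j = 2^{-j}:
# for A_n = {1, ..., n} we get P(A_n) = 1 - 2^{-n}, increasing to
# P(Omega) = 1 as claimed by Lemma 2.1.3 (i).
def P_A(n):
    return sum(2.0 ** (-j) for j in range(1, n + 1))

values = [P_A(n) for n in (1, 5, 10, 20)]
assert all(values[i] < values[i + 1] for i in range(len(values) - 1))
print(values[-1])  # already very close to 1
```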
2.2 Conditional probability and independence
Sometimes we will only have partial information about an experiment; for instance, we may have to impose or assume certain conditions in order to discuss statistical features. If these conditions are nonrandom, we can somehow ignore them by choosing an adequate model. But sometimes these conditions themselves are random, i.e. varying with the outcome of the experiment, and we need an appropriate probabilistic model to reflect this.
Examples 2.2.1. Consider a group of one hundred adults; twenty are women and eighty are men, fifteen of the women are employed and twenty of the men are employed. We randomly select a person and find out whether she or he is employed. If employed, what is the probability that the selected person is a woman?

Without the information on employment, we would have selected a woman with probability $\frac{20}{100} = \frac{1}{5}$. Knowing the selected person is employed, we can pass to another sample space, now consisting of the thirty-five employed people, and obtain the probability $\frac{15}{35} = \frac{3}{7}$ of having selected a woman.
In general it is more flexible to keep the sample space fixed but to implement given information by conditioning.
Definition 2.2.1. Let $(\Omega, P)$ be a discrete probability space. Given two events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is defined as
\[ P(A|B) := \frac{P(A \cap B)}{P(B)}. \]
Examples 2.2.2. For the previous example, set
\[ A := \{\text{a woman is selected}\} \]
and
\[ B := \{\text{the selected person is employed}\} \]
to see $P(A|B) = \frac{3}{7}$.
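The same conditional probability can be obtained by direct counting with a uniform measure on the hundred people; the list encoding below is our own representation of the example:

```python
from fractions import Fraction

# Example 2.2.1 by direct counting: 20 women (15 employed) and
# 80 men (20 employed), each person equally likely to be selected.
people = ([("woman", "employed")] * 15 + [("woman", "unemployed")] * 5
          + [("man", "employed")] * 20 + [("man", "unemployed")] * 60)

def P(pred):
    return Fraction(sum(1 for x in people if pred(x)), len(people))

P_B = P(lambda x: x[1] == "employed")           # P(B)
P_AB = P(lambda x: x == ("woman", "employed"))  # P(A ∩ B)
print(P_AB / P_B)  # P(A|B) = 15/35 = 3/7
```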
The next lemma is (almost) obvious.

Lemma 2.2.1. Let $(\Omega, P)$ be a discrete probability space and $B$ an event with $P(B) > 0$. Then $P(\cdot|B)$ is again a probability measure.

We have $P(B|B) = P(\Omega|B) = 1$. Moreover, statements (i)-(vi) of Lemma 2.1.1 and the statement of Lemma 2.1.2 remain valid for $P(\cdot|B)$ in place of $P$.

More specific rules for conditional probability are the following.
Lemma 2.2.2. Let $(\Omega, P)$ be a discrete probability space.

(i) If $A$ and $B$ both are events with positive probability, then
\[ P(B|A) = \frac{P(A|B)P(B)}{P(A)}. \]

(ii) If $B_1, \ldots, B_n$ are pairwise disjoint events of positive probability such that $\Omega = \bigcup_{i=1}^{n} B_i$, then we have
\[ P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i) \]
for any event $A$.

(iii) If $B_1, B_2, \ldots$ are pairwise disjoint events of positive probability such that $\Omega = \bigcup_{i=1}^{\infty} B_i$, then we have
\[ P(A) = \sum_{i=1}^{\infty} P(A|B_i)P(B_i) \]
for any event $A$.

(iv) For any events $A_1, \ldots, A_n$ with $P\left(\bigcap_{i=1}^{n-1} A_i\right) > 0$ we have
\[ P\left(\bigcap_{i=1}^{n} A_i\right) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_2 \cap A_1) \cdots P\left(A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\right). \]
Exercise 2.2.1. Prove Lemma 2.2.2.
(ii) and (iii) are called the law of total probability, and (i) is referred to as Bayes' rule.
Examples 2.2.3. We consider a medical test. Patients are being tested for a disease. For a patient sick with the disease the test will be positive in 99% of all cases. In 2% of all cases a healthy patient is tested positive. Statistical data show that one out of a thousand patients really gets sick with the disease. We would like to know the probability that a tested patient indeed is sick. We write $S$ for the event {patient is sick}, $+$ for {patient is tested positive} and $-$ for {patient is tested negative}. We know that
\[ P(S) = 0.001, \quad P(+|S) = 0.99 \quad \text{and} \quad P(+|S^c) = 0.02. \]
We are looking for $P(S|+)$. By the law of total probability,
\[ P(+) = P(+|S)P(S) + P(+|S^c)P(S^c), \]
and therefore
\[ P(S|+) = \frac{P(S \cap +)}{P(+)} = \frac{P(+|S)P(S)}{P(+)} = \frac{0.99 \cdot 0.001}{0.99 \cdot 0.001 + 0.02 \cdot 0.999}, \]
which is approximately $\frac{1}{20}$. This suggests that while the test may give some indication and serve as a first diagnostic tool, it is too inaccurate to allow the conclusion of a diagnosis without performing further examinations.
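The computation above can be reproduced in a few lines; the variable names are our own:

```python
# Bayes' rule with the law of total probability for the medical test.
p_S = 0.001          # P(S), prevalence of the disease
p_pos_S = 0.99       # P(+|S), probability of a positive test when sick
p_pos_Sc = 0.02      # P(+|S^c), false positive rate

p_pos = p_pos_S * p_S + p_pos_Sc * (1 - p_S)   # P(+), total probability
p_S_pos = p_pos_S * p_S / p_pos                # P(S|+), Bayes' rule
print(p_S_pos)  # roughly 1/20
```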
Exercise 2.2.2. About 5% of all men and 1% of all women suffer from dichromatism. Suppose that 60% of a group are women. If a person is randomly selected from that group, what is the probability that this individual suffers from dichromatism?
In our experiment or observation it may happen that the relative frequency of the occurrence of some event $A$ is not affected by whether some other given event $B$ occurs or not. In other words, if $n$ is the total number of trials and $n_A$ denotes the number of trials such that $A$ occurs, we observe that the numbers
\[ \frac{n_A}{n} \quad \text{and} \quad \frac{n_{A \cap B}}{n_B} \]
are close to each other, at least for large $n$. A heuristic rearrangement of this relation is to say that the numbers
\[ \frac{n_{A \cap B}}{n} \quad \text{and} \quad \frac{n_A}{n} \cdot \frac{n_B}{n} \]
are close. In the mathematical model this is formalized by the notion of independence.
Definition 2.2.2. Let $(\Omega, P)$ be a discrete probability space. Two events $A$ and $B$ are called independent if
\[ P(A \cap B) = P(A)P(B). \]
Events $A_1, \ldots, A_n$ are called independent if
\[ P(A_{j_1} \cap \ldots \cap A_{j_k}) = \prod_{i=1}^{k} P(A_{j_i}) \]
for all distinct $j_1, \ldots, j_k \in \{1, \ldots, n\}$. The events $A_1, \ldots, A_n$ are called pairwise independent if for any two different $j, k \in \{1, \ldots, n\}$ the two events $A_j$ and $A_k$ are independent.
(At first glance independence looks like some algebraic relation. In a sense, this intuition is not all that wrong ... maybe at some point you will encounter or have encountered product measures, characteristic functions, group characters, etc.)
Examples 2.2.4. Three events $A$, $B$ and $C$ are independent if

1. $P(A \cap B \cap C) = P(A)P(B)P(C)$,

2. $P(A \cap B) = P(A)P(B)$,

3. $P(A \cap C) = P(A)P(C)$ and

4. $P(B \cap C) = P(B)P(C)$.
Obviously independence implies pairwise independence. In general the
converse is false.
Exercise 2.2.3. Give an example of three events $A$, $B$, $C$ that are pairwise independent but not independent.

Exercise 2.2.4. Verify that if two events $A$ and $B$ are independent, then also $A$ and $B^c$ are independent, and $A^c$ and $B^c$ are independent.
Whether in a mathematical model two events are independent or not is
a consequence of the choice of the probability measure.
Examples 2.2.5. We toss two coins, not necessarily fair. A suitable sample space is
\[ \Omega := \{H, T\}^2 = \{(\omega_1, \omega_2) : \omega_i \in \{H, T\}, \ i = 1, 2\}. \]
For a first model, let $p, p' \in (0, 1)$ and set
\[ P(\{(H, H)\}) = pp', \quad P(\{(T, H)\}) = (1 - p)p', \quad P(\{(H, T)\}) = p(1 - p'), \quad P(\{(T, T)\}) = (1 - p)(1 - p'). \tag{2.1} \]
Then the events $\{\omega_1 = H\}$ and $\{\omega_2 = H\}$ are independent under $P$ with
\[ P(\{\omega_1 = H\}) = p \quad \text{and} \quad P(\{\omega_2 = H\}) = p'. \tag{2.2} \]
Conversely, it is not difficult to see that if we require these events to be independent under a probability measure $P$ and to have (2.2), then $P$ has to be as in (2.1).
For a second model, let the probability to get heads in the first trial be $p \in (0, 1)$. If the first trial gives heads, then let the probability to get heads in the second trial be $p' \in (0, 1)$; if it gives tails, let the probability to get heads in the second trial be $q' \in (0, 1)$ with $q' \neq p'$. For a probability measure $Q$ on $\Omega$ satisfying these ramifications we must have
\[ Q(\{\omega_1 = H\}) = p, \quad Q(\{\omega_2 = H\} \mid \{\omega_1 = H\}) = p', \quad Q(\{\omega_2 = H\} \mid \{\omega_1 = T\}) = q'. \]
This yields
\[ Q(\{\omega_1 = H\} \cap \{\omega_2 = H\}) = Q(\{\omega_2 = H\} \mid \{\omega_1 = H\})\,Q(\{\omega_1 = H\}) = p'p. \]
But on the other hand,
\begin{align*}
Q(\{\omega_2 = H\}) &= Q(\{\omega_2 = H\} \mid \{\omega_1 = H\})\,Q(\{\omega_1 = H\}) \\
&\quad + Q(\{\omega_2 = H\} \mid \{\omega_1 = T\})\,Q(\{\omega_1 = T\}) \\
&= p'p + q'(1 - p),
\end{align*}
and therefore
\[ Q(\{\omega_1 = H\})\,Q(\{\omega_2 = H\}) = p(p'p + q'(1 - p)) \neq Q(\{\omega_1 = H\} \cap \{\omega_2 = H\}), \]
i.e. the events $\{\omega_1 = H\}$ and $\{\omega_2 = H\}$ are not independent under $Q$.
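The two models can be compared numerically; the parameter values below ($p = 0.5$, $p' = 0.7$, $q' = 0.3$) are illustrative choices, not taken from the text:

```python
# Compare the two models of Example 2.2.5 with concrete parameter values.
p, pp, qq = 0.5, 0.7, 0.3   # p, p', q' with q' != p'

# First model (2.1): joint probabilities are products.
P = {("H", "H"): p * pp, ("T", "H"): (1 - p) * pp,
     ("H", "T"): p * (1 - pp), ("T", "T"): (1 - p) * (1 - pp)}
P1H = P[("H", "H")] + P[("H", "T")]   # P({w1 = H})
P2H = P[("H", "H")] + P[("T", "H")]   # P({w2 = H})
assert abs(P[("H", "H")] - P1H * P2H) < 1e-12   # independent under P

# Second model: the conditional head probability depends on the first toss.
Q = {("H", "H"): p * pp, ("H", "T"): p * (1 - pp),
     ("T", "H"): (1 - p) * qq, ("T", "T"): (1 - p) * (1 - qq)}
Q1H = Q[("H", "H")] + Q[("H", "T")]
Q2H = Q[("H", "H")] + Q[("T", "H")]
print(Q[("H", "H")], Q1H * Q2H)  # the two numbers differ: not independent
```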
2.3 Discrete random variables
When performing and observing an experiment it is often useful to filter or rearrange information, or to change perspective. For instance, we might measure a temperature, viewed as the random outcome of the experiment, and want to calculate a reaction intensity that depends on the given temperature. Then this reaction intensity itself will be random. This, more or less, is the concept of a random variable, or more generally, a random element.
Definition 2.3.1. Let $(\Omega, P)$ be a discrete probability space and $E \neq \emptyset$. A function $X : \Omega \to E$ is called a random element with values in $E$. A function $X : \Omega \to \mathbb{R}$ is called a random variable.
A random variable is a random element with values in E = R.
Notation: We agree to write
\[ \{X \in B\} := \{\omega \in \Omega : X(\omega) \in B\} \]
for any random element $X$ with values in $E$ and any $B \subseteq E$, and in the special case $B = \{x\}$ also
\[ \{X = x\} := \{\omega \in \Omega : X(\omega) = x\}. \]
Similarly, we agree to write
\[ P(X \in B) := P(\{X \in B\}), \]
and in the special case $B = \{x\}$ also
\[ P(X = x) := P(\{X = x\}). \]
These abbreviations are customary.
Examples 2.3.1. Given an event $A \subseteq \Omega$, set
\[ \mathbf{1}_A(\omega) := \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \in A^c. \end{cases} \]
This defines a random variable $\mathbf{1}_A$ on $\Omega$, usually referred to as the indicator function of the event $A$. Note that $\{\mathbf{1}_A = 1\} = A$, $\{\mathbf{1}_A = 0\} = A^c$ and $\{\mathbf{1}_A = x\} = \emptyset$ for any $x \notin \{0, 1\}$.
Since $\Omega$ is discrete, any random element $X$ on $\Omega$ with values in a set $E$ can have at most a countable number of different values $x \in E$. If we enumerate these countably many different values by $\{x_j\}_{j=1}^{\infty}$, then the events $\{X = x_j\}$ are pairwise disjoint and $\Omega = \bigcup_{j=1}^{\infty} \{X = x_j\}$. If $X$ attains only finitely many different values, then of course the same is true with a finite number $n$ in place of $\infty$.
If $X$ is a random variable, we may rewrite it as
\[ X = \sum_{j=1}^{\infty} x_j \mathbf{1}_{\{X = x_j\}}, \]
where $\mathbf{1}_{\{X = x_j\}}$ is the indicator function of the event $\{X = x_j\}$.
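The representation $X = \sum_j x_j \mathbf{1}_{\{X = x_j\}}$ can be checked pointwise on a small example; the random variable below (parity of a die throw) is an illustrative choice of our own:

```python
# Indicator functions and the representation X = sum_j x_j 1_{X = x_j},
# checked on the die with X(omega) = omega mod 2 (parity).
Omega = {1, 2, 3, 4, 5, 6}
X = {omega: omega % 2 for omega in Omega}   # X takes values in {0, 1}

def indicator(A):
    return lambda omega: 1 if omega in A else 0

values = sorted(set(X.values()))
events = {x: {omega for omega in Omega if X[omega] == x} for x in values}

# Reconstruct X pointwise from its level sets {X = x_j}.
for omega in Omega:
    assert X[omega] == sum(x * indicator(events[x])(omega) for x in values)
print("representation verified")
```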
2.4 Bernoulli trials
Chapter 3
Absolutely continuous distributions

Chapter 4
Measure theory and Lebesgue integration

Chapter 5
Product spaces and independence

Chapter 6
Weak convergence and the Central Limit Theorem

Chapter 7
Conditioning

Chapter 8
Martingale theory