
Probability

Dr. Michael Hinz, Bielefeld University


WS 2013/2014, Mo 16-18 T2-233, Tue 16-18 T2-204
Contents
1 Introduction
2 Discrete probability spaces
  2.1 Basic notions
  2.2 Conditional probability and independence
  2.3 Discrete random variables
  2.4 Bernoulli trials
3 Absolutely continuous distributions
4 Measure theory and Lebesgue integration
5 Product spaces and independence
6 Weak convergence and the Central Limit Theorem
7 Conditioning
8 Martingale theory
Chapter 1
Introduction
Any form of intelligent life - and a human being in particular - extracts information from the sensorial input provided by the immediate environment. This information is observed and then processed further. Its evaluation (e.g. danger or reward) allows us to make decisions and to establish appropriate routines and reactions (e.g. flight or cooperation). The single most important mechanism in evaluating information is trial and error. But on a philosophical level, this is nothing but testing a hypothesis by an experiment:
An experiment is an orderly procedure carried out with the goal
of verifying, refuting or establishing the validity of a hypothesis.
(URL: http://en.wikipedia.org/wiki/Experiment)
A hypothesis is a proposed explanation for an observed phenomenon.
(URL: http://en.wikipedia.org/wiki/Hypotheses)
Sometimes the idea of trial and error - source of all intelligent behaviour - is shamelessly disregarded. This also happens often in the educational system, where the terrible habit of considering mistakes something bad is widespread. This is illogical, because without mistakes there cannot be any learning process, any acquisition of reliable knowledge! After having completed our learning process up to an adequate level we of course hope to have established precise and correct thinking.
Maybe an experiment has more to do with perception than with physical reality. This is an issue people discuss in constructivism. But in any case we should keep in mind that what exactly the experiment is, what its possible outcomes are, and what the hypothesis to be tested is will always be a matter of agreement.
A sort of trivial experiment is an experiment with a predetermined outcome. In this case it suffices to perform this experiment just once to decide whether the hypothesis is true or false. For example, we may throw a ball in the air and wait for things to happen. We assume our strength is finite and the mass of the ball is positive. According to classical mechanics it will sooner or later fall down. The hypothesis "the ball stays in the air forever" can be disproved at once. (Note however that we made silent and intrinsic assumptions: "sooner or later fall down" might suggest that we are standing on or above the surface of the earth, and that the earth will not cease to exist while the ball is in the air. In other words: We have to explain what we mean by common sense.)
In more interesting cases the experiment has several possible outcomes, i.e. its outcome is not predetermined. It is reasonable to assume that the collection of all possible outcomes of the experiment is known. For example, we may toss a coin (to be precise, an ideal coin that will always land on a face and never remain standing on its edge). The hypothesis "the coin always shows heads" can be disproved, but we may need to perform our experiment several times to do so.
The (theoretical) design of our experiment depends on the amount of information we can observe and on the sort of information we would like to extract. Suppose, for example, we record the temperature in a certain geographical location. Suppose that measurements are taken daily over the period of one year of 365 days and that we are close to the equator, so that seasons do not play a role. Assume that our thermometer only yields integer numbers. If we are interested in a single large set of data consisting of 365 numbers, then this set may be considered an outcome of the experiment, and no further discussion is needed. But this is interesting only if our hypothesis necessarily has to be phrased in terms of 365 numbers. If we would like to compare two locations, it may be more intelligent to ask for an average temperature, i.e. one single number. This amounts to the idea of having an experiment whose possible outcomes are integer numbers and which is performed repeatedly, 365 times. If we plot the absolute numbers of the occurrences of certain temperatures in a histogram, then we may try to conclude that they obey a certain distribution. In other words: We think of our experiment as an experiment with random outcome and ask for a probabilistic structure. This is filtering information: For instance, this idea does not tell us whether it was 20 °C today or yesterday.
When an experiment is performed repeatedly, two ideas arise:

We may want to record, evaluate and interpret the obtained data. For instance, we may want to look at a time series to see whether we can determine distributions or dependencies. Observing and interpreting are the objectives of statistics.

We may want to model the phenomenon and extract further information from the model: Once we know the physical system under observation shows typical statistical features, we can try to come up with a probabilistic model. This model allows further theoretical conclusions (independent of further physical observations) and may be used to forecast the future behaviour of the system. The model and the deduction of further information is the task of probability.
In this sense probability theory talks about models for experiments with several possible outcomes. We do not talk about reality, whatever reality is. Whether a probabilistic model is considered appropriate for a physical experiment is usually decided in subsequent steps of statistical testing or simply by common sense. But in any case this is not a question to be decided by mathematics alone.
Another interpretation of assigning probabilities to certain events is to think of it as a "measure of belief"; for instance, think of opinion polls. Luckily, this idea can be captured by exactly the same mathematical framework as used to model experiments, so we don't have to come up with yet another theory.
As far as we know, probability theory started in the sixteenth and seventeenth century (Cardano, Fermat, Pascal, Bernoulli, etc.; Bernoulli's Ars conjectandi appeared in 1713), mainly to analyze games of chance. Based on earlier results, a great body of work was established in the nineteenth century (Euler, Lagrange, Gauss, Laplace, etc.; Laplace's Théorie analytique des probabilités appeared in 1812). This is what we refer to as classical probability. It is already very useful and intuitive but still limited to modelling experiments that have at most a countable number of possible outcomes or can be described by probability measures that have densities (calculus-based methods, Riemann integration). We will follow this spirit in the first part of our lecture, and most concepts of probabilistic thinking can already be established at this level.
Later in the nineteenth century other areas of mathematics bloomed, for instance set theory (Cantor, Dedekind, etc.), while at the same time probability theory was nearly unknown to the general public. An encyclopedia from around 1890 says that mathematical probability is the ratio whose numerator is the number of favourable cases and whose denominator is the number of all possible cases. It also describes an urn model, says that an increasing number of experimental runs improves the quality of observations using empirical probabilities, refers to Bernoulli and Laplace, and relates probability theory to Gauss' least squares method. Finally it mentions a couple of contemporary textbooks on probability. But compared to what the encyclopedia tells about geometry, algebra or analysis, the article is extremely short.
Around the turn of the twentieth century mathematicians started to understand that it would be clever to base probability on set theory (Hausdorff's Grundzüge der Mengenlehre of 1914 was an important influence for probability theory), because then abstract and general models for experiments with an uncountable number of outcomes could be formulated and investigated. This dream came with the question of how to determine the probability of a set (called an event), and after an intense discussion people realized that determining the probability of an event should be more or less the same as measuring the length, area or volume of a subset of Euclidean space. This led to the concept of measure theory and to an axiomatization of probability theory based on measure theory (along with Lebesgue integration), developed in the early years of the last century (Hausdorff, von Mises, Wiener, Kolmogorov, etc.) and commonly referred to as modern probability. Usually its invention is attributed to Kolmogorov (Kolmogorov's Grundbegriffe der Wahrscheinlichkeitsrechnung appeared in 1933), but he himself remarked that these ideas had been around for some time. Modern probability is an incredibly flexible theory, and closely related developments in physics and economics (Bachelier, Einstein, Smoluchowski, Wiener, Birkhoff, Bose, Boltzmann, etc.) went in parallel. Because measure theory is a deep and somewhat abstract business close to the axiomatic roots of mathematics, we will only have a peek into some of its concepts.

In terms of its strength and flexibility, probability theory thus evolved in two, maybe even three steps: discrete theory, theory based on calculus, and modern theory based on measure theory. It still is a relatively young mathematical subject.
Chapter 2
Discrete probability spaces
Mon, Oct 14, 2013
In this chapter we have a look at key notions and ideas of probability with
a minimum of technicalities. This is possible for models that use discrete
probability spaces.
2.1 Basic notions
We use mathematical set theory to model an experiment.
There are different axiomatic systems mathematics can be based upon (mathematics is neither true nor false, as shown by Kurt Gödel around 1930, inspired by David Hilbert and Hans Hahn), and usually the notion "set" is explained within the chosen axiomatic system (for instance, the so-called Zermelo-Fraenkel axioms). An earlier and less rigorous definition (at that time the axiomatic method was not yet understood) was given by Georg Cantor in the late nineteenth century:

A set is a gathering together into a whole of definite, distinct objects of our perception or of our thought - which are called elements of the set.

For our purposes let us please agree to accept this definition of the notion "set".
Exercise 2.1.1. Study or review the customary notation and verbalization
of set theory (including power set), set relations and operations and related
notions, Venn diagrams, connection to logical operations.
The collection of all possible outcomes of an experiment is represented by a nonempty set Ω, called the sample space. Its elements ω ∈ Ω are called samples or outcomes.

Examples 2.1.1.
(1) Tossing a coin: A reasonable model is Ω = {H, T}, where H stands for heads and T for tails. The possible outcomes in this model are H and T.
(2) Throwing a die: Common sense suggests Ω = {1, 2, 3, 4, 5, 6}.
(3) Life span of a device or a living organism: A natural choice is Ω = [0, +∞); the possible outcomes are all nonnegative real numbers 0 ≤ t < +∞. The outcome 0 ∈ [0, +∞) means "broken from the beginning" or "born dead".
(4) Random choice of a number in [0, 1]: Of course Ω = [0, 1].
Exercise 2.1.2.
(1) Find an appropriate sample space for tossing a coin five times.
(2) Find an appropriate sample space for two people playing rock-paper-scissors.
A function (or map or mapping) f : Ω → Ω′ from a set Ω to a set Ω′ is a subset of the cartesian product

Ω × Ω′ := {(ω, ω′) : ω ∈ Ω, ω′ ∈ Ω′}

such that any ω ∈ Ω is the first component of exactly one ordered pair (ω, ω′). If ω is the first component of (ω, ω′), called the argument, we write f(ω) to denote the second component ω′, called the value.
A function f : Ω → Ω′ is called injective if f(ω₁) = f(ω₂) implies ω₁ = ω₂. (Injectivity forbids two different arguments to have the same value.)

A set Ω is called countable if there is an injective function from Ω to the natural numbers ℕ. Such an injective map is called an enumeration of our set.

Definition 2.1.1. A sample space Ω is called discrete if it is countable.

If Ω is discrete we can write Ω = {ω_j}_{j=1}^∞, i.e. there exists an enumeration for Ω.
Examples 2.1.2. The sample spaces in Examples 2.1.1 (1) and (2) are discrete, but those in (3) and (4) are not.

Definition 2.1.2. A subset A ⊆ Ω of a discrete sample space Ω is called an event. An event of the form {ω} with ω ∈ Ω is called an elementary event.

Note that an event is a set A consisting of certain outcomes ω ∈ Ω. When the experiment is practically (or mentally) carried out, it yields a certain sample (outcome) ω ∈ Ω. We say that the event A takes place or occurs or is realized with this sample if ω ∈ A.
Examples 2.1.3.
(1) Tossing a coin: The events are ∅, Ω, {H} and {T}.
(2) When throwing a die, the idea "to obtain an even number" is represented by the event A := {2, 4, 6} ⊆ {1, 2, 3, 4, 5, 6}.
Exercise 2.1.3. Write down all events for tossing a coin.
Given two events A, B ⊆ Ω, we say that A or B occurs if A ∪ B occurs. Similarly, we say that A and B occur if A ∩ B occurs.
Exercise 2.1.4. When throwing a die, what is the event that the outcome is an even number and greater than two? What is the event that it is an even number or greater than two?
Now some events may be particularly interesting for us, and we might
want to somehow assign them a number telling how likely their occurrence
is.
Examples 2.1.4.
(1) Maybe {H} is interesting when tossing a coin to make an otherwise difficult decision by chance.
(2) When throwing a die we might want to know how likely it is to obtain an even number. That would be the probability of {2, 4, 6}.
If the model is appropriate for the experiment, then the probability of an event A should be close to the relative frequency (sometimes also called empirical probability) observed when performing the experiment repeatedly and in such a way that the different trials do not influence each other. If n denotes the total number of trials and n_A the number of trials in which the event A occurs, then the relative frequency is given by the ratio n_A / n. Writing P(A) for the probability of A (which is yet to be defined at this point), the best we could hope for is to observe the limit relation

lim_{n→∞} n_A / n = P(A).
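To see the limit relation at work, here is a minimal simulation sketch (our addition, not part of the original notes): we toss a fair coin n times, so that the model assigns P(A) = 1/2 to the event A = "heads", and print the relative frequency n_A/n for growing n. The sample sizes and the use of Python's random module are arbitrary choices.

```python
import random

# Event A = "heads" when tossing a fair coin; in the model P(A) = 1/2.
for n in [10, 100, 1000, 10000, 100000]:
    n_A = sum(1 for _ in range(n) if random.random() < 0.5)  # count occurrences of A
    print(f"n = {n:6d}   n_A/n = {n_A / n:.4f}")
```

For large n the printed ratios cluster around 0.5, in line with the limit relation above.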
While the sample space is often dictated by the design of the experiment or by common sense, the question of which probabilities to assign to the events really is a matter of modelling. The following notion already provides a complete mathematical model for many applications.
Definition 2.1.3. Let Ω = {ω_j}_{j=1}^∞ be a discrete sample space. A probability measure P on the discrete sample space Ω is a countable collection {p_j}_{j=1}^∞ of numbers p_j ≥ 0 with ∑_{j=1}^∞ p_j = 1. The probability of an event A ⊆ Ω is given by

P(A) := ∑_{j≥1: ω_j∈A} p_j.

We refer to (Ω, P) as a discrete probability space.
If Ω = {ω_j}_{j=1}^∞, {p_j}_{j=1}^∞ and P are as in the definition, then obviously

P({ω_j}) = p_j,   j = 1, 2, ...,

are the probabilities of the elementary events {ω_j}.
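As an illustration (our own sketch, not from the notes), a discrete probability measure can be stored as a mapping from outcomes ω_j to weights p_j, and P(A) computed literally as the sum in Definition 2.1.3; the fair die anticipates Examples 2.1.5 (2).

```python
from fractions import Fraction

# A fair die: Omega = {1,...,6} with p_j = 1/6 for every outcome.
P = {omega: Fraction(1, 6) for omega in range(1, 7)}
assert sum(P.values()) == 1  # the weights must sum to 1

def prob(A):
    """P(A) := sum of the p_j over all outcomes omega_j in A."""
    return sum(P[omega] for omega in A)

print(prob({2, 4, 6}))  # probability of an even number: 1/2
```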
Remark 2.1.1. This is a definition in the style of classical probability. When dealing with uncountably infinite sample spaces (i.e. to describe experiments with a continuum of possible outcomes, such as determining the lifespan of an organism), it is usually impossible to come up with a reasonable way of assigning a probability to each subset A ⊆ Ω. Later we will see how to fix this by regarding only a proper subset F of the power set P(Ω) as the collection of events.
We list a couple of properties of probability measures.
Lemma 2.1.1. Let (Ω, P) be a discrete probability space.

(i) P(Ω) = 1 and P(∅) = 0.

(ii) If A ⊆ B then P(A) ≤ P(B).

(iii) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

(iv) P(A^c) = 1 − P(A).

(v) P(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ P(A_i).

(vi) If A_1, A_2, A_3, ... are pairwise disjoint events (i.e. if A_i ∩ A_k = ∅ whenever i ≠ k) then

P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).
Proof. According to the definition we have P(Ω) = ∑_{j=1}^∞ p_j = 1, because ω_j ∈ Ω for all j, and P(∅) = 0, because there is no j such that ω_j ∈ ∅. This is (i). If A ⊆ B we have

∑_{j≥1: ω_j∈A} p_j ≤ ∑_{j≥1: ω_j∈B} p_j,

which gives (ii). For (iii) note that

∑_{j≥1: ω_j∈A∪B} p_j = ∑_{j≥1: ω_j∈A} p_j + ∑_{j≥1: ω_j∈B} p_j − ∑_{j≥1: ω_j∈A∩B} p_j

(subtracting the last sum ensures that each index j with ω_j ∈ A ∪ B appears exactly once on the right hand side of the equality). A special case of (iii) together with (i) gives (iv). Finally, we have

∑_{j≥1: ω_j∈⋃_{i=1}^∞ A_i} p_j ≤ ∑_{i=1}^∞ ∑_{j≥1: ω_j∈A_i} p_j

with equality if the A_i are pairwise disjoint. This proves (v) and (vi).
Remark 2.1.2. Given a discrete probability space (Ω, P), we should try to think of the probability measure P as a function P : P(Ω) → [0, 1] from the power set P(Ω) of Ω into the unit interval [0, 1]. This function is normed by (i) and monotone by (ii), and the properties (v) and (vi) are called σ-subadditivity and σ-additivity, respectively.
Examples 2.1.5.

(1) Tossing a fair coin: Ω = {H, T}, P({H}) = 1/2.

(2) Throwing a fair die: Ω = {1, ..., 6}, P({ω}) = p_1 = ... = p_6 = 1/6.

(3) Throwing an unfair (biased, marked) die: Ω = {1, ..., 6}, P({6}) = p_6 = 1/2 and P({ω}) = p_1 = ... = p_5 = 1/10 for ω ∈ {1, 2, 3, 4, 5}. Of course there are also many other ways to make a die unfair. But whether it is fair or not should be reflected by the choice of the probability measure.

(4) Tossing two fair coins at once (or tossing one coin twice but without any influence between the two trials): Ω = {(H, H), (H, T), (T, H), (T, T)} and P({ω}) = 1/4, ω ∈ Ω.

(5) To open a safe one needs to know the correct code consisting of 4 digits, each between 0 and 9. We would like to know the probability of finding it by chance. In this case

Ω = {(0, 0, 0, 0), (0, 0, 0, 1), ..., (9, 9, 9, 8), (9, 9, 9, 9)}

has 10^4 different elements, and common sense suggests that in a model for this problem each of these four-tuples should have the same probability, namely 10^{-4}. Hence the probability to find the correct code is 10^{-4}.

(6) The letters A, B, C and D are randomly arranged in a line, each order being equally likely. Then the probability to get the string ADCB is 1/4! = 1/24. The probability space for this example is made up of all possible permutations of these letters,

Ω = {ABCD, ABDC, ACBD, ..., DCBA}.

(7) Suppose we have limited space and want to invite 20 out of our 100 friends to a party. We cannot decide, so we do it randomly, but in a fair manner (we do not prefer any combination). Given a fixed choice of 20 friends, the probability that exactly these friends will be invited is C(100, 20)^{-1}, where C(n, k) denotes the binomial coefficient "n choose k". (This is a small number; maybe we had better make a more emotional, non-random choice ...) Here the probability space consists of all combinations of 20 out of 100 elements, but it would be tedious to write it down more explicitly. However, we know it has C(100, 20) elements.
(8) In a group there are 10 women and 10 men. We would like to form a gender balanced team of four people. We do it in a fair manner and ask for the probability that some fixed, preferred team we have in mind will be the chosen one. The sample space could be thought of as consisting of elements ((m_1, m_2), (w_1, w_2)), where (m_1, m_2) is a randomly chosen combination of two out of ten men, and (w_1, w_2) is a randomly chosen combination of two out of ten women. For each of these two choices there are C(10, 2) possibilities, hence our sample space will have C(10, 2) · C(10, 2) elements. The probability to choose the preferred team is (C(10, 2) · C(10, 2))^{-1}.
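The counting in examples (5)-(8) is easy to check numerically. A short sketch (ours, not part of the notes), using math.comb for the binomial coefficients C(n, k):

```python
from math import comb, factorial

print(1 / 10**4)                        # (5) guessing the 4-digit safe code
print(1 / factorial(4))                 # (6) the string ADCB: 1/24
print(1 / comb(100, 20))                # (7) one fixed choice of 20 out of 100 friends
print(1 / (comb(10, 2) * comb(10, 2)))  # (8) one fixed gender-balanced team
```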
Exercise 2.1.5. Why is it (strictly speaking) wrong to write P(H) in Examples 2.1.5 (1) or P(1) in (2)?
Please be careful. Some textbooks, encyclopedias or articles will nevertheless write these wrong expressions. This is often done to have a simplified (short-hand) notation, but with the agreement (often written somewhere, but sometimes silent) that the shorthand willingly replaces the mathematically correct expression.
Remark 2.1.3. The best way to solve modelling problems is: First determine the sample space, and afterwards come up with a reasonable probability measure on it. Once the sample space is written down correctly, the subsequent parts of the problem become much simpler.
Exercise 2.1.6. To design a new product, 10 items have been produced; 8 of them are of high quality but 2 are defective. In a test we randomly pick two of these 10 items. No item is preferred, and we do not place a drawn item back. How likely is it that our random choice gives one high-quality and one defective item?
Tue, Oct 15, 2013
Lemma 2.1.2 (Inclusion-exclusion principle). Let (Ω, P) be a discrete probability space and let A_1, A_2, ..., A_n ⊆ Ω be events. Then

P(⋃_{i=1}^n A_i) = ∑_{i=1}^n P(A_i) − ∑_{1≤i_1<i_2≤n} P(A_{i_1} ∩ A_{i_2})
                 + ∑_{1≤i_1<i_2<i_3≤n} P(A_{i_1} ∩ A_{i_2} ∩ A_{i_3}) − ...
                 + (−1)^{n−1} P(A_1 ∩ ... ∩ A_n).

This generalizes Lemma 2.1.1 (iii). For n = 3 and three events A, B, C, Lemma 2.1.2 gives

P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
             − P(A ∩ B) − P(A ∩ C) − P(B ∩ C)
             + P(A ∩ B ∩ C).
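As a quick numerical sanity check of the n = 3 formula (our addition, not part of the notes), take the fair-die model from Examples 2.1.5 (2) and three concrete events; the particular events are arbitrary choices.

```python
from fractions import Fraction

P = lambda E: Fraction(len(E), 6)  # uniform measure on the fair die {1,...,6}

A, B, C = {2, 4, 6}, {3, 4, 5, 6}, {1, 2, 3}
lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
assert lhs == rhs  # inclusion-exclusion for n = 3
print(lhs)         # 1, since here A ∪ B ∪ C = {1,...,6}
```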
Exercise 2.1.7. Draw the corresponding Venn diagram for n = 3.
Exercise 2.1.8. Review or study the basic idea of proof by induction.
Proof. We proceed by induction. For n = 1 there is nothing to prove, and for n = 2 the statement is known to be true by Lemma 2.1.1 (iii). We assume it is true for n; this is called the induction hypothesis. We use the induction hypothesis to prove the statement of the lemma for n + 1. If we manage to do so, it must be true for all natural numbers n and all choices of events A_1, ..., A_n, as desired.
Given n + 1 events A_1, ..., A_n, A_{n+1}, we observe

P(⋃_{i=1}^{n+1} A_i) = P((⋃_{i=1}^n A_i) ∪ A_{n+1})
                     = P(⋃_{i=1}^n A_i) + P(A_{n+1}) − P((⋃_{i=1}^n A_i) ∩ A_{n+1})

by Lemma 2.1.1 (iii). The distributivity rule of set operations tells us that

(⋃_{i=1}^n A_i) ∩ A_{n+1} = ⋃_{i=1}^n (A_i ∩ A_{n+1}),

and by the induction hypothesis this event has probability

P(⋃_{i=1}^n (A_i ∩ A_{n+1})) = ∑_{i=1}^n P(A_i ∩ A_{n+1}) − ∑_{1≤i_1<i_2≤n} P(A_{i_1} ∩ A_{i_2} ∩ A_{n+1})
                             + ∑_{1≤i_1<i_2<i_3≤n} P(A_{i_1} ∩ A_{i_2} ∩ A_{i_3} ∩ A_{n+1}) − ...
                             + (−1)^{n−1} P(A_1 ∩ ... ∩ A_n ∩ A_{n+1}).
Using the induction hypothesis once more, now on ⋃_{i=1}^n A_i, we obtain

P(⋃_{i=1}^{n+1} A_i) = ∑_{i=1}^n P(A_i) + P(A_{n+1}) − ∑_{1≤i_1<i_2≤n} P(A_{i_1} ∩ A_{i_2}) − ∑_{i=1}^n P(A_i ∩ A_{n+1})
   + ∑_{1≤i_1<i_2<i_3≤n} P(A_{i_1} ∩ A_{i_2} ∩ A_{i_3}) + ∑_{1≤i_1<i_2≤n} P(A_{i_1} ∩ A_{i_2} ∩ A_{n+1})
   + ...
   + (−1)^{n−1} P(A_1 ∩ ... ∩ A_n) + (−1)^{n−1} ∑_{1≤i_1<...<i_{n−1}≤n} P(A_{i_1} ∩ ... ∩ A_{i_{n−1}} ∩ A_{n+1})
   + (−1)^n P(A_1 ∩ ... ∩ A_n ∩ A_{n+1}),

which is

∑_{i=1}^{n+1} P(A_i) − ∑_{1≤i_1<i_2≤n+1} P(A_{i_1} ∩ A_{i_2})
   + ∑_{1≤i_1<i_2<i_3≤n+1} P(A_{i_1} ∩ A_{i_2} ∩ A_{i_3}) − ...
   + (−1)^n P(A_1 ∩ ... ∩ A_{n+1}),

as desired.
To discuss the next example (which, by the way, is a classic) we need to talk about functions again. A function f : Ω → Ω′ from a set Ω into a set Ω′ is called surjective if for any ω′ ∈ Ω′ there is some ω ∈ Ω such that f(ω) = ω′ (each element of the target space is the value of some argument under f). A function f : Ω → Ω′ is called bijective or a bijection if it is both injective and surjective.

A bijection π : Ω → Ω of a finite set Ω onto itself is called a permutation.
Examples 2.1.6. Consider n people at a party, each of whom owns a hat. If their n hats are assigned to these n people at random, what is the probability that at least one person ends up with their own hat? Mathematically, the question can be rephrased by asking for the probability that a randomly chosen permutation π : {1, 2, ..., n} → {1, 2, ..., n} has a fixed point (i.e. some element i ∈ {1, 2, ..., n} with π(i) = i).

It makes sense to take the sample space Ω to be the space of all different permutations of {1, 2, ..., n}. It has |Ω| = n! elements.

If E_i := {π ∈ Ω : π(i) = i} denotes the event that i is a fixed point, then

P(E_{i_1} ∩ ... ∩ E_{i_k}) = (n − k)! / n!

for all 1 ≤ i_1 < i_2 < ... < i_k ≤ n. Therefore

∑_{1≤i_1<i_2<...<i_k≤n} P(E_{i_1} ∩ ... ∩ E_{i_k}) = C(n, k) (n − k)! / n! = 1/k!,

and by Lemma 2.1.2,

P(⋃_{i=1}^n E_i) = 1 − 1/2! + 1/3! − ... + (−1)^{n+1} 1/n! = ∑_{k=1}^n (−1)^{k+1} / k!.

As n tends to infinity, this tends to

∑_{k=1}^∞ (−1)^{k+1} / k! = 1 − 1/e,

which shows that for a large number n of people the chance not to have a single person getting their own hat is approximately 1/e.
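The alternating series converges very fast. The following sketch (ours, not part of the notes) verifies the formula by brute force over all permutations for small n and compares the exact value with the limit 1 − 1/e.

```python
from itertools import permutations
from math import factorial, e

def p_fixed_point(n):
    """P(at least one fixed point) = sum_{k=1}^{n} (-1)^(k+1) / k!."""
    return sum((-1) ** (k + 1) / factorial(k) for k in range(1, n + 1))

# Brute force for small n: count permutations of {0,...,n-1} with a fixed point.
for n in range(1, 7):
    count = sum(any(pi[i] == i for i in range(n)) for pi in permutations(range(n)))
    assert abs(count / factorial(n) - p_fixed_point(n)) < 1e-12

print(p_fixed_point(10), 1 - 1 / e)  # the two values already agree to many digits
```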
We finally record some continuity properties of probability measures. A sequence (A_n)_{n=1}^∞ of events is called increasing if it is increasing in the set-theoretic sense,

A_1 ⊆ A_2 ⊆ ... ⊆ A_n ⊆ A_{n+1} ⊆ ...,

and decreasing if it is decreasing in the set-theoretic sense,

A_1 ⊇ A_2 ⊇ ... ⊇ A_n ⊇ A_{n+1} ⊇ ...
Lemma 2.1.3. Let (Ω, P) be a discrete probability space.

(i) If (A_n)_{n=1}^∞ is an increasing sequence of events, then

lim_{n→∞} P(A_n) = P(⋃_{n=1}^∞ A_n).

(ii) If (A_n)_{n=1}^∞ is a decreasing sequence of events, then

lim_{n→∞} P(A_n) = P(⋂_{n=1}^∞ A_n).
Proof. To see (i), set

B_1 := A_1, B_2 := A_2 \ B_1, ..., B_n := A_n \ (⋃_{i=1}^{n−1} B_i), ...

Note that the B_i are pairwise disjoint, ⋃_{i=1}^∞ B_i = ⋃_{i=1}^∞ A_i, and ⋃_{i=1}^n B_i = ⋃_{i=1}^n A_i = A_n for all n. Then

P(⋃_{i=1}^∞ A_i) = P(⋃_{i=1}^∞ B_i) = ∑_{i=1}^∞ P(B_i) = lim_{n→∞} ∑_{i=1}^n P(B_i) = lim_{n→∞} P(⋃_{i=1}^n B_i) = lim_{n→∞} P(A_n).

Statement (ii) is left as an exercise; just use complements.
2.2 Conditional probability and independence
Sometimes we will only have partial information about an experiment; for instance, we may have to impose or assume certain conditions in order to discuss statistical features. If these conditions are nonrandom, we can somehow ignore them by choosing an adequate model. But sometimes these conditions themselves are random, i.e. varying with the outcome of the experiment, and we need an appropriate probabilistic model to reflect this.
Examples 2.2.1. Consider a group of one hundred adults; twenty are women and eighty are men, fifteen of the women are employed and twenty of the men are employed. We randomly select a person and find out whether she or he is employed. If employed, what is the probability that the selected person is a woman?

Without the information on employment, we would have selected a woman with probability 20/100 = 1/5. Knowing the selected person is employed, we can pass to another sample space, now consisting of the thirty-five employed people, and obtain the probability 15/35 = 3/7 to have selected a woman.
In general it is more flexible to keep the sample space fixed but to implement given information by conditioning.

Definition 2.2.1. Let (Ω, P) be a discrete probability space. Given two events A and B with P(B) > 0, the conditional probability of A given B is defined as

P(A|B) := P(A ∩ B) / P(B).
Examples 2.2.2. For the previous example, set

A := {a woman is selected}  and  B := {the selected person is employed}

to see P(A|B) = 3/7.
The next lemma is (almost) obvious.

Lemma 2.2.1. Let (Ω, P) be a discrete probability space and B an event with P(B) > 0. Then P(·|B) is again a probability measure.

We have P(B|B) = P(Ω|B) = 1. Moreover, statements (i)-(vi) of Lemma 2.1.1 and the statement of Lemma 2.1.2 remain valid with P(·|B) in place of P.

More specific rules for conditional probability are the following.
Lemma 2.2.2. Let (Ω, P) be a discrete probability space.

(i) If A and B both are events with positive probability, then

P(B|A) = P(A|B) P(B) / P(A).

(ii) If B_1, ..., B_n are pairwise disjoint events of positive probability such that Ω = ⋃_{i=1}^n B_i, then we have

P(A) = ∑_{i=1}^n P(A|B_i) P(B_i)

for any event A.

(iii) If B_1, B_2, ... are pairwise disjoint events of positive probability such that Ω = ⋃_{i=1}^∞ B_i, then we have

P(A) = ∑_{i=1}^∞ P(A|B_i) P(B_i)

for any event A.

(iv) For any events A_1, ..., A_n with P(⋂_{i=1}^{n−1} A_i) > 0 we have

P(⋂_{i=1}^n A_i) = P(A_1) P(A_2|A_1) P(A_3|A_2 ∩ A_1) ··· P(A_n | ⋂_{i=1}^{n−1} A_i).
Exercise 2.2.1. Prove Lemma 2.2.2.
(ii) and (iii) are called the law of total probability, and (i) is referred to as Bayes' rule.
Examples 2.2.3. We consider a medical test. Patients are being tested for a disease. For a patient sick with the disease the test will be positive in 99% of all cases. In 2% of all cases a healthy patient is tested positive. Statistical data show that one out of a thousand patients really gets sick with the disease. We would like to know the probability that a positively tested patient is indeed sick.

We write S for the event {patient is sick}, + for {patient is tested positive} and − for {patient is tested negative}. We know that

P(S) = 0.001, P(+|S) = 0.99 and P(+|S^c) = 0.02.

We are looking for P(S|+). By the law of total probability,

P(+) = P(+|S) P(S) + P(+|S^c) P(S^c),

and therefore

P(S|+) = P(S ∩ +) / P(+) = P(+|S) P(S) / P(+) = (0.99 · 0.001) / (0.99 · 0.001 + 0.02 · 0.999),

which is approximately 1/20. This suggests that while the test may give some indication and serve as a first diagnostic tool, it is too inaccurate to allow the conclusion of a diagnosis without performing further examinations.
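The arithmetic of this example in a few lines (our own check; the numbers are the ones given above):

```python
p_S = 0.001        # P(S): prevalence of the disease
p_pos_S = 0.99     # P(+|S): probability of a positive test for a sick patient
p_pos_Sc = 0.02    # P(+|S^c): probability of a positive test for a healthy patient

p_pos = p_pos_S * p_S + p_pos_Sc * (1 - p_S)  # law of total probability
p_S_pos = p_pos_S * p_S / p_pos               # Bayes' rule
print(p_S_pos)  # about 0.047, i.e. roughly 1/20
```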
Exercise 2.2.2. About 5% of all men and 1% of all women suffer from dichromatism. Suppose that 60% of a group are women. If a person is randomly selected from that group, what is the probability that this individual suffers from dichromatism?
In our experiment or observation it may happen that the relative frequency of the occurrence of some event A is not affected by whether some other given event B occurs or not. In other words, if n is the total number of trials and n_A denotes the number of trials such that A occurs, we observe that the numbers

n_A / n  and  n_{A∩B} / n_B

are close to each other, at least for large n. A heuristic rearrangement of this relation is to say that the numbers

n_{A∩B} / n  and  (n_A / n)(n_B / n)

are close. In the mathematical model this is formalized by the notion of independence.
Definition 2.2.2. Let (Ω, P) be a discrete probability space. Two events A and B are called independent if

P(A ∩ B) = P(A) P(B).

Events A_1, ..., A_n are called independent if

P(A_{j_1} ∩ ... ∩ A_{j_k}) = ∏_{i=1}^k P(A_{j_i})

for all distinct j_1, ..., j_k ∈ {1, ..., n}. The events A_1, ..., A_n are called pairwise independent if for any two different j, k ∈ {1, ..., n} the two events A_j and A_k are independent.
(At first glance independence looks like some algebraic relation. In a sense, this intuition is not all that wrong ... maybe at some point you will encounter or have encountered product measures, characteristic functions, group characters, etc.)
Examples 2.2.4. Three events A, B and C are independent if

1. P(A ∩ B ∩ C) = P(A) P(B) P(C),
2. P(A ∩ B) = P(A) P(B),
3. P(A ∩ C) = P(A) P(C) and
4. P(B ∩ C) = P(B) P(C).
Obviously independence implies pairwise independence. In general the
converse is false.
Exercise 2.2.3. Give an example of three events A, B, C that are pairwise
independent but not independent.
Exercise 2.2.4. Verify that if two events A and B are independent, then also A and B^c are independent, and A^c and B^c are independent.
Whether in a mathematical model two events are independent or not is
a consequence of the choice of the probability measure.
Examples 2.2.5. We toss two coins, not necessarily fair. A suitable sample space is

Ω := {H, T}² = {(ω_1, ω_2) : ω_i ∈ {H, T}, i = 1, 2}.

For a first model, let p, p′ ∈ (0, 1) and put

P({(H, H)}) = p p′,
P({(T, H)}) = (1 − p) p′,
P({(H, T)}) = p (1 − p′),
P({(T, T)}) = (1 − p)(1 − p′).   (2.1)

Then the events {ω_1 = H} and {ω_2 = H} are independent under P with

P({ω_1 = H}) = p  and  P({ω_2 = H}) = p′.   (2.2)

Conversely, it is not difficult to see that if we require these events to be independent under a probability measure P and to have (2.2), then P has to be as in (2.1).
For a second model, let the probability to get heads in the first trial be p ∈ (0, 1). If the first trial gives heads, then let the probability to get heads in the second trial be p′ ∈ (0, 1); otherwise let it be q′ ∈ (0, 1) with q′ ≠ p′. For a probability measure Q on Ω satisfying these requirements we must have

Q({ω_1 = H}) = p,
Q({ω_2 = H} | {ω_1 = H}) = p′,
Q({ω_2 = H} | {ω_1 = T}) = q′.

This yields

Q({ω_1 = H} ∩ {ω_2 = H}) = Q({ω_2 = H} | {ω_1 = H}) Q({ω_1 = H}) = p′ p.

But on the other hand,

Q({ω_2 = H}) = Q({ω_2 = H} | {ω_1 = H}) Q({ω_1 = H}) + Q({ω_2 = H} | {ω_1 = T}) Q({ω_1 = T})
             = p′ p + q′ (1 − p),

and therefore, since q′ ≠ p′,

Q({ω_1 = H}) Q({ω_2 = H}) = p (p′ p + q′ (1 − p)) ≠ p′ p = Q({ω_1 = H} ∩ {ω_2 = H}),

i.e. the events {ω_1 = H} and {ω_2 = H} are not independent under Q.
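Both models can be checked numerically. The sketch below (ours, not part of the notes) encodes P and Q as dictionaries on Ω = {H, T}² and tests the product formula; the parameter values p = 0.5, p′ = 0.7, q′ = 0.3 are arbitrary choices with q′ ≠ p′.

```python
p, p2, q2 = 0.5, 0.7, 0.3  # p, p', q' with q' != p'

# First model (2.1): the two coordinates are independent.
P = {('H', 'H'): p * p2,       ('T', 'H'): (1 - p) * p2,
     ('H', 'T'): p * (1 - p2), ('T', 'T'): (1 - p) * (1 - p2)}

# Second model: the second toss depends on the outcome of the first.
Q = {('H', 'H'): p * p2,       ('H', 'T'): p * (1 - p2),
     ('T', 'H'): (1 - p) * q2, ('T', 'T'): (1 - p) * (1 - q2)}

def pr(mu, event):
    """Probability of an event (a predicate on outcomes) under the measure mu."""
    return sum(w for omega, w in mu.items() if event(omega))

A = lambda omega: omega[0] == 'H'  # the event {omega_1 = H}
B = lambda omega: omega[1] == 'H'  # the event {omega_2 = H}

for name, mu in [('P', P), ('Q', Q)]:
    lhs = pr(mu, lambda omega: A(omega) and B(omega))
    rhs = pr(mu, A) * pr(mu, B)
    print(name, lhs, rhs)  # equal under P, different under Q
```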
2.3 Discrete random variables
When performing and observing an experiment it is often useful to filter or rearrange information or to change perspective. For instance, we might measure a temperature, viewed as a random outcome of the experiment, and want to calculate a reaction intensity that depends on the given temperature. Then this reaction intensity itself will be random. This, more or less, is the concept of a random variable, or, more generally, a random element.

Definition 2.3.1. Let (Ω, P) be a discrete probability space and E ≠ ∅ a set. A function X : Ω → E is called a random element with values in E. A function X : Ω → ℝ is called a random variable.
A random variable is thus a random element with values in E = ℝ.
Notation: We agree to write

{X ∈ B} := {ω ∈ Ω : X(ω) ∈ B}

for any random element X with values in E and any B ⊆ E, and in the special case B = {x} also

{X = x} := {ω ∈ Ω : X(ω) = x}.

Similarly, we agree to write

P(X ∈ B) := P({X ∈ B}),

and in the special case B = {x} also

P(X = x) := P({X = x}).

These abbreviations are customary.
Examples 2.3.1. Given an event A ⊆ Ω, set

1_A(ω) := 1 if ω ∈ A,  0 if ω ∈ A^c.

This defines a random variable 1_A on Ω, usually referred to as the indicator function of the event A. Note that {1_A = 1} = A, {1_A = 0} = A^c and {1_A = x} = ∅ for any x ∉ {0, 1}.
Since Ω is discrete, any random element X on Ω with values in a set E can have at most a countable number of different values x ∈ E. If we enumerate these countably many different values by {x_j}_{j=1}^∞, then the events {X = x_j} are pairwise disjoint and Ω = ⋃_{j=1}^∞ {X = x_j}. If X attains only finitely many different values then of course the same is true with a finite number n in place of ∞.
If X is a random variable, we may rewrite it as

X = ∑_{j=1}^∞ x_j 1_{X=x_j},

where 1_{X=x_j} is the indicator function of the event {X = x_j}.
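A small sketch (ours, not part of the notes) illustrating these notions on a concrete discrete space: the sum of two fair dice as a random variable X, the events {X = x}, and an indicator function. The two-dice setup is our own choice of example.

```python
from fractions import Fraction
from itertools import product

Omega = list(product(range(1, 7), repeat=2))     # two fair dice
p = {omega: Fraction(1, 36) for omega in Omega}  # uniform weights

X = lambda omega: omega[0] + omega[1]            # a random variable X : Omega -> R

def prob(event):
    """P of an event given as a predicate on outcomes."""
    return sum(p[omega] for omega in Omega if event(omega))

print(prob(lambda omega: X(omega) == 7))  # P(X = 7) = 1/6

# The indicator 1_A of the event A = {X = 7} is itself a random variable.
ind_A = lambda omega: 1 if X(omega) == 7 else 0
assert prob(lambda omega: ind_A(omega) == 1) == Fraction(1, 6)
```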
2.4 Bernoulli trials
Chapter 3
Absolutely continuous distributions
Chapter 4
Measure theory and Lebesgue integration
Chapter 5
Product spaces and independence
Chapter 6
Weak convergence and the Central Limit Theorem
Chapter 7
Conditioning
Chapter 8
Martingale theory
