
Handout No. 4
PHIL.015
March 28, 2016
RANDOM VARIABLES AND THEIR PROBABILITY DISTRIBUTIONS
1. Passage from Probabilistic to Statistical Reasoning
In this handout I introduce the basic concepts of random variables and their associated probability distribution functions. Roughly speaking, a random variable is a numerical-valued quantity whose value depends on the outcome of a probabilistic experiment. Its associated probability distribution specifies the probability that the variable will assume a given numerical value.
In applications of probability theory to probabilistic (random) experiments, the first step consists of specifying the experiment's correct probability model ⟨Ω, A, P⟩, with a suitable sample space Ω, event algebra A, and a probability measure P thereon. The next step involves the calculation of probability values of events that are of interest. In complex (composite) experiments several probability models may become necessary. When an experiment is performed, statistically oriented experimenters are often less interested in the specific outcome (or event) than in some aspect of the outcome that may be shared with other particular outcomes. For example, a statistician may be curious about the number of babies born in a certain hospital each day (which is of course not a fixed quantity, because it depends on many random factors that vary from one day to another) and not at all interested in the particular babies themselves. In statistical usage, such a quantity is called a random variable, doubtless because its value tends to vary from one outcome to the next (hence the term variable) and the outcome itself depends on chance (thus, the term random). The mathematical concept that captures these quantities is that of a real-valued function defined on the sample space Ω.
Specifically, given a probability model ⟨Ω, A, P⟩, a random variable is any function of the form X : Ω → ℝ that assigns to each outcome ω in Ω a unique numerical value X(ω) in the set ℝ of real numbers and satisfies the measurability condition

{ω ∈ Ω | X(ω) ≤ a} ∈ A

for all real numbers a in ℝ. This last condition is thrown in for theoretical reasons, because, strictly speaking, not all real-valued functions are considered to be random variables. In general, there are extremely complex functions (e.g., not summable or integrable, etc.) that simply resist mathematical tractability. Fortunately, we will not have to worry about any of these, so that for us a random variable on a probability model ⟨Ω, A, P⟩ is a real-valued function defined on Ω.¹ Generally, as is common in statistics, we shall denote random variables by capital letters X, Y, and Z from the end of the Latin alphabet.

¹ In what follows, we use the standard mathematical notation f : A → B to indicate that f is a function defined on set A (known as its domain), taking values in set B (known as its codomain or range). Please do not confuse this notation with logical conditionals P → Q.
In many finitary applications involving small sample spaces, a particular random variable X may be completely described by making a table of its values like the one shown below. (This is of course not practical for sample spaces with a large number of sample points. In such cases, a random variable is specified analytically, by providing a rule or a graph for X(ω).)
Specification of X : Ω → ℝ

  Sample points (in Ω)    Assignment    Values of X (in ℝ)
  ω1                      ↦             X(ω1)
  ω2                      ↦             X(ω2)
  ...                                   ...
  ωn                      ↦             X(ωn)
Example 1:
Consider an experiment in which three fair coins are tossed once. We already know that the probability model ⟨Ω8, A, P⟩ for this experiment is specified by the 8-element sample space

Ω8 = Ω2 × Ω2 × Ω2 = {TTT, TTH, THT, HTT, HHT, HTH, THH, HHH},

the event algebra A consisting of all subsets of Ω8, and the Laplacean probability measure P, defined by

P(A) = #{A} / #{Ω8}

for any event A in A.


As statisticians, we may only be interested in the random variable X : Ω8 → ℝ defined by the total number of heads in each trial. That is to say, the random variable X : Ω8 → ℝ is defined argumentwise by

X(TTT) = 0
X(TTH) = X(THT) = X(HTT) = 1
X(THH) = X(HTH) = X(HHT) = 2
X(HHH) = 3
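To make the construction concrete, here is a minimal Python sketch (ours, not part of the handout) that enumerates the sample space Ω8 and tabulates X as the number of heads; the names `sample_space` and `X` are our own.

```python
from itertools import product

# Sample space of three coin tosses: Omega_8 = {T, H} x {T, H} x {T, H}
sample_space = ["".join(triple) for triple in product("TH", repeat=3)]

# Random variable X: the number of heads in an outcome
def X(outcome):
    return outcome.count("H")

for omega in sample_space:
    print(omega, X(omega))   # e.g. TTT 0, TTH 1, ..., HHH 3
```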
Example 2:
Suppose a probabilistic experiment consists of rolling two fair dice once. Its associated probability model ⟨Ω36, A, P⟩ is given by the 36-element sample space

Ω36 = {11, 12, ..., 16, 21, 22, ..., 26, ..., 61, 62, ..., 66},

the usual event algebra A of all subsets of Ω36, and the Laplacean probability measure, defined by the fraction

P(A) = #{A} / #{Ω36}

for any event A in A. Note that each event A gives rise to a unique 2-valued random variable IA : Ω36 → ℝ, given by IA(ω) = 1 if ω is in A, and IA(ω) = 0 otherwise. Here IA is called the indicator random variable of A. Of course, I∅ = 0 at each sample point in Ω36. Likewise, it can be seen that IΩ36 = 1 for all sample points. Indicator random variables encode information about outcomes into numbers. We have seen that sentences in deductive logic and events in event algebras are qualitative entities. So far only probabilities were quantitative (numerical). The introduction of random variables turns probabilistic reasoning into a variant of quantitative reasoning.
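As a small illustration (our own sketch, not from the handout), an indicator random variable can be coded as a function that returns 1 on the event and 0 elsewhere; the event `A` below, "doubles", is just an example choice.

```python
# Sample space for two dice: pairs (i, j) with i, j in {1, ..., 6}
dice_space = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# Example event A: "doubles" (both dice show the same number)
A = {(i, j) for (i, j) in dice_space if i == j}

def indicator(event):
    """Return the indicator random variable I_event as a function on outcomes."""
    return lambda omega: 1 if omega in event else 0

I_A = indicator(A)
print(I_A((3, 3)), I_A((2, 5)))              # 1 0
print(sum(I_A(w) for w in dice_space) / 36)  # 6/36, the expectation of I_A, equals P(A)
```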
Returning to the topic of random variables, suppose we are interested only in the sum of outcomes in rolling two dice. For this we need to define a random variable Y : Ω36 → ℝ by setting Y(ij) =df i + j for all outcomes or (sample) pairs ij in Ω36 = Ω6 × Ω6. Clearly,

Y(11) = 2
Y(12) = Y(21) = 3
Y(13) = Y(22) = Y(31) = 4
Y(23) = Y(32) = Y(41) = Y(14) = 5
Y(33) = Y(42) = Y(24) = Y(15) = Y(51) = 6
Y(43) = Y(34) = Y(25) = Y(52) = Y(61) = Y(16) = 7
Y(44) = Y(53) = Y(35) = Y(26) = Y(62) = 8
Y(45) = Y(54) = Y(63) = Y(36) = 9
Y(64) = Y(46) = Y(55) = 10
Y(56) = Y(65) = 11
Y(66) = 12
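As a quick computational check (a sketch of ours), grouping the 36 outcomes by the value of Y(ij) = i + j reproduces the list above.

```python
from collections import defaultdict

# Group the 36 dice outcomes by the value of Y(ij) = i + j
groups = defaultdict(list)
for i in range(1, 7):
    for j in range(1, 7):
        groups[i + j].append(f"{i}{j}")

for value in sorted(groups):
    print(value, groups[value])   # 2 ['11'], 3 ['12', '21'], ..., 12 ['66']
```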
Example 3:
Suppose a probability experiment consists, once again, of rolling two fair dice once. We know that its associated probability model ⟨Ω36, A, P⟩ is given by the 36-element sample space

Ω36 = {11, 12, ..., 16, 21, 22, ..., 26, ..., 61, 62, ..., 66}

with the event algebra A of all subsets of Ω36 and the Laplacean probability measure

P(A) = #{A} / #{Ω36}

for any event A in A.

However, this time let the random variable Z : Ω36 → ℝ be specified by the maximum of the two numbers on the upturned faces. That is to say, let Z(ij) =df max{i, j} for all outcomes or (sample) pairs ij in Ω36 = Ω6 × Ω6. Clearly, we have

Z(11) = 1
Z(12) = Z(21) = Z(22) = 2
Z(13) = Z(31) = Z(23) = Z(32) = Z(33) = 3
Z(14) = Z(24) = Z(34) = Z(44) = Z(43) = Z(42) = Z(41) = 4
Z(15) = Z(25) = Z(35) = Z(45) = Z(55) = Z(54) = Z(53) = Z(52) = Z(51) = 5
Z(16) = Z(26) = Z(36) = Z(46) = Z(56) = Z(66) = Z(65) = Z(64) = Z(63) = Z(62) = Z(61) = 6

We see that the random variable Z takes only 6 values, from 1 to 6. It is not surprising that the measurable space ⟨Ω36, A⟩ carries many, many random variables, but only a few of them are of practical interest. For example, we can have a random variable W : Ω36 → ℝ that assigns to each outcome the same constant value, say 1, i.e., we have W(ij) = 1 for all outcomes ij. Needless to add, in applications this variable is not important at all.
2. Probability Distribution Functions, Expectations and Moments
Up to this point, random variables have been treated simply as functions on the sample space of a probability model. It is now time to take into account also the probability model's probability measure. Given a random variable X : Ω → ℝ on a general probability model ⟨Ω, A, P⟩, the real-valued function pX : X(Ω) → [0, 1], defined by

pX(xi) =df P(X = xi) = P({ω ∈ Ω | X(ω) = xi})

on the set

X(Ω) = {X(ω1), X(ω2), ..., X(ωn)}

of values xi of variable X, is called the probability distribution function or simply the pdf of X. Here P(X = xi) denotes the probability that random variable X takes value xi in a given random experiment.
We now return to Example 1, wherein we have specified X as the number of
heads in tossing three fair coins. From the associated probability model we find
that

pX(0) = P(X = 0) = P(TTT) = 1/8 = 0.125
pX(1) = P(X = 1) = P(TTH, HTT, THT) = 3/8 = 0.375
pX(2) = P(X = 2) = P(THH, HHT, HTH) = 3/8 = 0.375
pX(3) = P(X = 3) = P(HHH) = 1/8 = 0.125

Of course, the sum of probabilities of a probability distribution function pX(x) must be equal to 1. A probability distribution function (pdf) can also be shown graphically by labeling the x axis with the values of the random variable (belonging to the set of all values X(Ω8)) and letting the values on the y axis represent the probabilities of equational statements of the form X = 0, X = 1, X = 2, ..., about the random variable X. The graph for the pdf of X is shown below:
[Figure: bar graph of the pdf pX(x) = P(X = x) for x = 0, 1, 2, 3.]
As we will note later, this is a special kind of discrete probability distribution function, called a binomial distribution.
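The pdf values above can also be checked mechanically by counting outcomes. The following sketch (ours, with hypothetical variable names) does exactly that for the three-coin model.

```python
from itertools import product
from collections import Counter

# Laplacean model: each of the 8 outcomes has probability 1/8
sample_space = ["".join(t) for t in product("TH", repeat=3)]

# pdf of X = number of heads, computed by counting outcomes
counts = Counter(outcome.count("H") for outcome in sample_space)
p_X = {x: counts[x] / len(sample_space) for x in sorted(counts)}

print(p_X)                # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
print(sum(p_X.values()))  # 1.0, as required
```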
In probabilistic studies of a random variable X, statisticians are also interested in calculating the probability

FX(xi) =df P(X ≤ xi) = P(X = x1) + P(X = x2) + ... + P(X = xi)

for any 1 ≤ i ≤ n. Given a random variable X, the above-defined function FX : ℝ → [0, 1] is called the cumulative distribution function of X. The following theorems are easily verified:

THEOREM 1
(i) P(X > x) = 1 − P(X ≤ x).
(ii) P(x < X ≤ x′) = P(X ≤ x′) − P(X ≤ x) = FX(x′) − FX(x).
(iii) P(X = x) = P(X ≤ x) − P(X < x).
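Here is a small numerical illustration of the cdf, and of part (i) of Theorem 1, using the pdf of X from Example 1 (our sketch, not from the handout).

```python
# pdf of X from Example 1 (number of heads in three tosses)
p_X = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

def F_X(x):
    """Cumulative distribution function: P(X <= x)."""
    return sum(p for value, p in p_X.items() if value <= x)

for x in [0, 1, 2, 3]:
    # Check Theorem 1(i): P(X > x) = 1 - P(X <= x)
    p_greater = sum(p for value, p in p_X.items() if value > x)
    assert abs(p_greater - (1 - F_X(x))) < 1e-12
    print(x, F_X(x))   # 0 0.125, 1 0.5, 2 0.875, 3 1.0
```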
Suppose a coin is tossed many times and the number of heads is systematically
recorded. It is possible to predict ahead of time the average number of heads.
The mathematical tool for calculating the mean or average of a random
variable X is the so-called expectation functional E. More generally, in studying
the probability distribution of a random variable, it is often useful to be able
to summarize a given aspect of the distribution by means of a single number
that then serves to measure that aspect. Statisticians call such a number a
parameter of the distribution. In what follows we introduce two such parameters
of a probability distribution, namely the crucial expectation and variance.
Now we make the pertinent formal definition. Given a discrete random variable X : Ω → ℝ of a probability model ⟨Ω, A, P⟩, the expected value of X (or simply the expectation of X) with values X(Ω) = {x1, x2, ..., xn} is defined by the weighted average (or center of gravity)

μX = E(X) =df x1 · pX(x1) + x2 · pX(x2) + ... + xn · pX(xn).
We see that the expectation of a random variable is a type of average value
of that random variable. It is a value about which the possible values of the
random variable are scattered. Note that the expectation of X is determined
by pX .
To illustrate the idea behind expectation, let us go back to Example 1. The
expected value of random variable X (introduced in Example 1) is given by
E(X) = 0 · 1/8 + 1 · 3/8 + 2 · 3/8 + 3 · 1/8 = 1.5.
Although we think of 1.5 as an average value of X, it is clearly not a possible value of X at all in X(Ω8). It is, however, a value in a central location relative to the possible values 0, 1, 2 and 3 of X. The number E(X) is often called a measure of location. It is clear that it falls between the smallest value (namely 0) assumed by X and the largest value (namely 3) assumed by X. Thus, a knowledge of E(X) gives a rough idea of the possible size of the possible values of X. Finally, we might consider a beautiful analogy between the concept of expectation and the concept of center of gravity in mechanics. If we imagine masses of size pX(xi) being placed at points xi on the line (with i = 1, 2, ..., n), then E(X) is exactly what physicists call the center of gravity of this mass distribution.
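The weighted-average computation of E(X) is easy to mirror in code (a sketch of ours):

```python
# Expectation of a discrete random variable as a weighted average of its values
def expectation(pdf):
    return sum(x * p for x, p in pdf.items())

p_X = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}   # pdf of X from Example 1
print(expectation(p_X))                  # 1.5
```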
Suppose again that we are given a probability model ⟨Ω, A, P⟩ of a random experiment and a random variable X : Ω → ℝ thereon. Let g : ℝ → ℝ be any (measurable) real-valued function. Then the composite g(X) : Ω → ℝ is again a random variable, defined by (g(X))(ω) =df g(X(ω)) for all ω in Ω, with expectation

E(g(X)) =df g(x1) · pX(x1) + g(x2) · pX(x2) + ... + g(xn) · pX(xn).

For example, if g(x) = 2x + 5, then the composite random variable is g(X) = 2X + 5. Given two (or more) random variables X1 : Ω → ℝ and X2 : Ω → ℝ, their sum X1 + X2 : Ω → ℝ is again a random variable, defined coordinatewise by (X1 + X2)(ω) = X1(ω) + X2(ω) for all sample points ω in Ω. Of course, (X + c)(ω) = X(ω) + c, where c is a real constant. All other operations (subtraction, multiplication, division, etc.) on random variables are defined similarly, argumentwise. The power of reasoning in terms of random variables relies precisely on the algebraic richness of operations on random variables. For example, the so-called sample average, to be discussed later, is a random variable defined by

X̄n = (X1 + X2 + ... + Xn)/n.
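For instance, applying the expectation formula above to g(x) = 2x + 5 and the coin-toss variable X from Example 1 gives E(g(X)) = 2 · 1.5 + 5 = 8, which the following sketch (ours) confirms.

```python
# E(g(X)) = sum of g(x_i) * p_X(x_i), here for g(x) = 2x + 5
p_X = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

def g(x):
    return 2 * x + 5

print(sum(g(x) * p for x, p in p_X.items()))   # 8.0 = 2 * E(X) + 5
```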
Expectation by itself does not give much information about the random variable
of interest. Additional measures are needed. One such quantity is the variance
of a random variable. It measures the fluctuation or dispersion of X from its
expectation (or the spread of pX ).
Let μ =df E(X) be the expectation of random variable X, with the same conditions regarding X(Ω) as above. For technical reasons, the variance of X, denoted σ²X = Var(X), is defined by the quadratic distance

Var(X) = E((X − μ)²) = (x1 − μ)² · pX(x1) + (x2 − μ)² · pX(x2) + ... + (xn − μ)² · pX(xn).

Since Var(X) ≥ 0 and the units (inches, degrees Fahrenheit, etc.) of random variables are important, statisticians are more interested in the standard deviation σX of X, defined by the square root

σX = √(E((X − μ)²)).

# $ %
&2
2
Since X
= Var(X) = E X 2 E(X) , the variance of X from Example 1
is given by Var(X) = 3 94 = 43 = 0.75, and hence the standard deviation is
8

'
X = 34 = 0.866. We see that the dispersion is quite large; about one head in
each trial.
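The same numbers can be recovered by direct computation (our sketch), using either the defining formula or the shortcut Var(X) = E(X²) − (E(X))².

```python
from math import sqrt

p_X = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}     # pdf of X from Example 1
mu = sum(x * p for x, p in p_X.items())    # E(X) = 1.5

# Defining formula for the variance
var_def = sum((x - mu) ** 2 * p for x, p in p_X.items())
# Shortcut: E(X^2) - (E(X))^2
var_short = sum(x ** 2 * p for x, p in p_X.items()) - mu ** 2

print(var_def, var_short, sqrt(var_def))   # 0.75 0.75 0.8660254037844386
```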
The probability distribution function defined in Example 1 is a special case of
the so-called binomial distribution function, defined by
BSn(k) =df P(Sn = k) = C(n, k) · p^k · (1 − p)^(n−k),

where C(n, k) = n!/(k! (n − k)!) is the binomial coefficient (see Section 3). The sample space here is the n-fold product Ω2ⁿ = {0, 1} × {0, 1} × ... × {0, 1} of the 2-element or doubleton set {0, 1}, representing the outcomes of a coin toss performed n times, or of any other binary experiment executed independently n times, so that the entries of Ω2ⁿ are n-fold sequences of 0s and 1s (heads and tails, etc.). In addition, the function Sn : Ω2ⁿ → ℝ is the so-called success random variable on Ω2ⁿ that counts the number of 1s, heads (as in Example 1), or other success outcomes in each n-fold sequence of independent trials. Finally, 0 ≤ p ≤ 1 is a parameter capturing the probability of getting a 1, a head, or anything else of interest, which is set equal to 1/2 for a fair coin.

To summarize, Sn : Ω2ⁿ → ℝ is a random variable whose probability distribution function is BSn with parameter p. This means that Sn can take the values 0, 1, 2, ..., n, and the probability that Sn equals k is given by C(n, k) · p^k · (1 − p)^(n−k), where 0 ≤ k ≤ n. It is not hard to show that the expectation of Sn is E(Sn) = np and that its variance is Var(Sn) = np(1 − p).
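A minimal sketch of the binomial pdf (ours), using Python's math.comb: with n = 3 and p = 1/2 it reproduces the pdf of Example 1 as well as the moment formulas E(Sn) = np and Var(Sn) = np(1 − p).

```python
from math import comb

def binomial_pdf(n, p):
    """B_{S_n}(k) = C(n, k) * p**k * (1 - p)**(n - k) for k = 0, ..., n."""
    return {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

B = binomial_pdf(3, 0.5)
print(B)                                    # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

mean = sum(k * q for k, q in B.items())
var = sum(k**2 * q for k, q in B.items()) - mean**2
print(mean, var)                            # 1.5 0.75, i.e. np and np(1 - p)
```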
Returning to Example 2, from the associated probability model we find that the
probability distribution function pY of Y is given by the list of equations

pY(2) = P(Y = 2) = P(11) = 1/36
pY(3) = P(Y = 3) = P(12, 21) = 2/36
pY(4) = P(Y = 4) = P(13, 22, 31) = 3/36
pY(5) = P(Y = 5) = P(23, 32, 41, 14) = 4/36
pY(6) = P(Y = 6) = P(33, 42, 24, 15, 51) = 5/36
pY(7) = P(Y = 7) = P(43, 34, 25, 52, 16, 61) = 6/36
pY(8) = P(Y = 8) = P(44, 53, 35, 26, 62) = 5/36
pY(9) = P(Y = 9) = P(54, 45, 63, 36) = 4/36
pY(10) = P(Y = 10) = P(64, 46, 55) = 3/36
pY(11) = P(Y = 11) = P(56, 65) = 2/36
pY(12) = P(Y = 12) = P(66) = 1/36

It is easy to check that the sum of the values P(Y = xi) of probabilities is 1. Note that Y is not defined at sample points that fall outside the sample space Ω36. The graph of the probability distribution function pY is shown in the figure below:

[Figure: bar graph of the pdf pY(x) = P(Y = x) for x = 2, 3, ..., 12.]
It is not hard (but tedious) to show that E(Y) = 7. A simpler method is to calculate the expected value of the outcome of rolling a fair die and then apply the additivity property of the expectation functional.
Specifically, recall that the experiment specified by rolling a fair die has the obvious Laplacean probability model ⟨Ω6, A, P⟩ with a 6-element sample space Ω6. Now, let the function X1 : Ω6 → ℝ be the random variable defined by the number of dots on the die's upturned face in a given trial. Obviously, X1 takes six possible values, from 1 to 6. Using the definition of expectation of a random variable, it is easy to check that E(X1) = 7/2. In addition, by a related definition we have E(X1²) = 91/6 for the square random variable X1², important for calculating the variance of X1.
Because the expectation functional is linear, i.e., it satisfies the linearity conditions for any pair of random variables X1, X2 : Ω → ℝ, as recalled in the theorem below,

THEOREM 2
(i) E(X1 + X2) = E(X1) + E(X2);
(ii) E(aX1) = aE(X1);
(iii) E(−X1) = −E(X1);

and since the random variable Y in Example 2 is given by the sum Y = X1 + X2 of two random variables (characterizing the upturned faces of two dice), we have

E(Y) = E(X1 + X2) = E(X1) + E(X2) = 7/2 + 7/2 = 7.
Now, because Var(X1) = E(X1²) − (E(X1))², the variance of the random variable X1 (capturing the number on a die's upturned face) is Var(X1) = 91/6 − 49/4 = 35/12 = 2.9166....

Turning to the variance Var(Y) = Var(X1 + X2) of the random variable Y, defined by the sum of the numbers on the upturned faces of two fair dice that are rolled (discussed in Example 2), in view of the independence of X1 and X2 (rolling two dice is a statistically independent experiment), the additivity

Var(X1 + X2) = Var(X1) + Var(X2)

holds, so that Var(Y) = Var(X1) + Var(X2) = 35/12 + 35/12 = 35/6 = 5.8333....
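Both results can be verified by brute-force enumeration over the 36 equally likely outcomes (our sketch):

```python
# Enumerate Y(ij) = i + j over all 36 equally likely outcomes
values = [i + j for i in range(1, 7) for j in range(1, 7)]

E_Y = sum(values) / 36
Var_Y = sum(v**2 for v in values) / 36 - E_Y**2

print(E_Y, Var_Y)   # 7.0 and 5.833..., i.e. 35/6
```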

We now return to Example 3 to illuminate further the concepts of probability distribution functions (pdf) and cumulative distribution functions (cdf). Recall that the experiment consists of rolling two balanced dice, with the probability model ⟨Ω36, A, P⟩ given by the 36-element sample space

Ω36 = {11, 12, ..., 16, 21, 22, ..., 26, ..., 61, 62, ..., 66}

with the event algebra A of all subsets of Ω36 and the Laplacean probability measure

P(A) = #{A} / #{Ω36}

for any event A in A. The random variable Z : Ω36 → ℝ of interest is given by the maximum of the two numbers on the upturned faces, i.e., Z(ij) =df max{i, j} for all outcomes or (sample) pairs ij in Ω36 = Ω6 × Ω6. From the associated probability model we find that the probability distribution function pZ of Z is given by the list of equations
pZ(1) = P(Z = 1) = P(11) = 1/36
pZ(2) = P(Z = 2) = P(12, 22, 21) = 3/36
pZ(3) = P(Z = 3) = P(13, 23, 33, 32, 31) = 5/36
pZ(4) = P(Z = 4) = P(14, 24, 34, 44, 43, 42, 41) = 7/36
pZ(5) = P(Z = 5) = P(15, 25, 35, 45, 55, 54, 53, 52, 51) = 9/36
pZ(6) = P(Z = 6) = P(16, 26, 36, 46, 56, 66, 65, 64, 63, 62, 61) = 11/36

The graphical representation of pZ is shown below.


[Figure: bar graph of the pdf pZ(x) = P(Z = x) for x = 1, 2, ..., 6.]

The cumulative distribution function FZ, defined by

FZ(xi) =df P(Z ≤ xi) = P(Z = x1) + P(Z = x2) + ... + P(Z = xi)

for any 1 ≤ i ≤ 6, is specified by

FZ(x) =
    0        if x < 1
    1/36     if 1 ≤ x < 2
    4/36     if 2 ≤ x < 3
    9/36     if 3 ≤ x < 4
    16/36    if 4 ≤ x < 5
    25/36    if 5 ≤ x < 6
    1        if x ≥ 6
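The pdf and cdf of Z can likewise be computed by enumeration (our sketch); the cdf values k²/36 come out as expected.

```python
from collections import Counter

# Z(ij) = max(i, j) over the 36 equally likely outcomes
counts = Counter(max(i, j) for i in range(1, 7) for j in range(1, 7))
p_Z = {k: counts[k] / 36 for k in range(1, 7)}

# Cumulative distribution function F_Z(k) = P(Z <= k)
F_Z = {k: sum(p_Z[m] for m in range(1, k + 1)) for k in range(1, 7)}

print(p_Z)   # 1/36, 3/36, 5/36, 7/36, 9/36, 11/36 (as decimals)
print(F_Z)   # 1/36, 4/36, 9/36, 16/36, 25/36, 36/36 (as decimals)
```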

3. Counting Algorithms and Combinatorics


Combinatorial analysis deals with counting. To count the number of outcomes
of a finitary probability experiment, it is often useful to look for special patterns
and algorithms, supporting techniques of counting. The simplest rule is called
the Fundamental counting rule or Multiplication Principle: For a sequence of n
events in which the first event can occur in k1 ways and the second event can occur
in k2 ways, and the third event can occur in k3 ways, and so on, the total number
of ways the sequence of n events can occur is the product k1 · k2 · k3 · ... · kn.
The number of permutations (i.e., possible ordered arrangements) of n different objects is given by the descending product

n! =df n · (n − 1) · (n − 2) · ... · 2 · 1,

read "n factorial." For example, there are exactly 6 = 3! ordered arrangements of three events A, B, C, namely ABC, ACB, BAC, BCA, CBA, CAB. The point is that one can put any of n objects in the first place, but then only n − 1 are left for the second place, and so forth, until only one is left for the last place. By convention, we set 0! = 1, and of course 1! = 1.
A slightly more interesting case arises when some of the objects are identical or alike, as for example the letters I, S, and P in MISSISSIPPI. In this case, the number of permutations of n objects, where k1 are identical (or alike) of type or kind one, k2 are identical (or alike) of type or kind two, and so on, and km are identical of type m, with k1 + k2 + ... + km = n, is given by the fraction

n! / (k1! · k2! · ... · km!).

Question: How many different permutations can be made from the letters of the word MISSISSIPPI? Answer: Since n = 11 and there are 4 types in total (types M, S, I, and P), we can define k1 = 4 for type S, k2 = 4 for type I, k3 = 2 for type P, and k4 = 1 for type M. Therefore, in this example the number of permutations is 11! / (4! · 4! · 2! · 1!) = 34,650.
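The arithmetic can be delegated to the factorial function (a sketch of ours):

```python
from math import factorial

# Permutations of MISSISSIPPI: n = 11 letters, with 4 S's, 4 I's, 2 P's, 1 M
n_perms = factorial(11) // (factorial(4) * factorial(4) * factorial(2) * factorial(1))
print(n_perms)   # 34650
```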
The number of ordered arrangements of n objects using k (k ≤ n) objects at a time is given by the formula

n! / (n − k)!,

and is called the permutation of n distinct objects taking k objects at a time. For example, two letters from the set of three {A, B, C} can be arranged in exactly 3!/1! = 6 ways, namely AB, BA, AC, CA, BC, CB.
Finally, the most important algorithm is the combination rule: The number of ways of selecting k objects (k ≤ n) from a list of n distinct objects without regard to order is

C(n, k) =df n! / (k! · (n − k)!).

For example, two letters from the set of three {A, B, C} can be selected without regard to order in exactly 3! / (2! · 1!) = 3 ways, namely AB, BC, AC.

Another example: In a classroom there are 8 women and 5 men. A committee of 3 women and 2 men is to be formed for a project. How many different possibilities are there? Answer: There are C(8, 3) · C(5, 2) = 560 different ways to make the selection.
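The committee count is a product of two combinations, which math.comb computes directly (our sketch):

```python
from math import comb

# Choose 3 of 8 women and 2 of 5 men
print(comb(8, 3) * comb(5, 2))   # 56 * 10 = 560
```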
