
Before we give a definition of probability, let us examine the following concepts:

1. Random Experiment: An experiment is called a random experiment if its outcome cannot be predicted precisely. One out of a number of possible outcomes occurs when the experiment is performed.

2. Sample Space: The sample space S is the collection of all outcomes of a random experiment.
The elements of S are called sample points.
• A sample space may be finite, countably infinite or uncountable.
• A finite or countably infinite sample space is called a discrete sample space.
• An uncountable sample space is called a continuous sample space

[Fig.: a sample space S containing sample points s1, s2, ...]

3. Event: An event A is a subset of the sample space such that probability can be assigned to it.
Thus
• A ⊆ S.
• For a discrete sample space, all subsets are events.
• S is the certain event (sure to occur) and φ is the impossible event.
Consider the following examples.
Example 1 Tossing a fair coin –The possible outcomes are H (head) and T (tail). The
associated sample space is S = {H , T }. It is a finite sample space. The events associated with the
sample space S are: S ,{H },{ T } and φ .

Example 2 Throwing a fair die- The possible 6 outcomes are:

‘1’ ‘2’ ‘3’ ‘4’ ‘5’ ‘6’

The associated finite sample space is S = {'1', '2', '3', '4', '5', '6'}. Some events are
A = The event of getting an odd face={'1', '3', '5'}.
B = The event of getting a six={6}
And so on.

Example 3 Tossing a fair coin until a head is obtained


We may have to toss the coin any number of times before a head is obtained. Thus the possible
outcomes are: H, TH,TTH,TTTH,….. How many outcomes are there? The outcomes are countable
but infinite in number. The countably infinite sample space is S = {H , TH , TTH ,......}.
Example 4 Picking a real number at random between -1 and 1
The associated sample space is S = {s | s ∈ ℝ, −1 ≤ s ≤ 1} = [−1, 1]. Clearly S is a continuous sample
space.
The probability of an event A is a number P( A) assigned to the event . Let us see how we can
define probability.

1. Classical definition of probability ( Laplace 1812)


Consider a random experiment with a finite number of outcomes N . If all the outcomes of the
experiment are equally likely, the probability of an event A is defined by

P(A) = N_A / N
where
N_A = number of outcomes favourable to A.

Example 4 A fair die is rolled once. What is the probability of getting a '6'?
Here S = {'1', '2', '3', '4', '5', '6'} and A = {'6'}
∴ N = 6 and N_A = 1
∴ P(A) = 1/6

Example 5 A fair coin is tossed twice. What is the probability of getting two ‘heads’?

Here S = {HH, HT, TH, TT} and A = {HH}.

The total number of outcomes is 4 and all four outcomes are equally likely.
The only outcome favourable to A is HH.
∴ P(A) = 1/4

Discussion
• The classical definition is limited to a random experiment which has only a finite
number of outcomes. In many experiments like that in the above example, the sample
space is finite and each outcome may be assumed ‘equally likely.’ In such cases, the
counting method can be used to compute probabilities of events.

• Consider the experiment of tossing a fair coin until a 'head' appears. As we have discussed earlier, there are countably infinite outcomes. Can all these outcomes really be treated as equally likely?

• The notion of 'equally likely' is important here. Equally likely means equally probable; thus the definition presupposes that all outcomes occur with equal probability. In other words, the definition uses the very concept it sets out to define, and is therefore circular.

2. Relative-frequency based definition of probability (von Mises, 1919)

If an experiment is repeated n times under similar conditions and the event A occurs n_A times, then

P(A) = lim_{n→∞} n_A / n

This definition is also inadequate from the theoretical point of view.
• We cannot repeat an experiment an infinite number of times.
• How do we ascertain that the above ratio converges for all possible sequences of outcomes of the experiment?

Example Suppose a die is rolled 500 times. The following table shows the frequency of each face.

Face 1 2 3 4 5 6
Frequency 82 81 88 81 90 78
Relative frequency 0.164 0.162 0.176 0.162 0.18 0.156
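The short Python sketch below (an illustration added here, not part of the original notes; the function name relative_frequencies and the sample sizes are arbitrary choices) repeats the die-rolling experiment for increasingly large n and shows the relative frequencies settling near 1/6, which is the behaviour the relative-frequency definition appeals to.

import random
from collections import Counter

def relative_frequencies(n_rolls, seed=0):
    # Roll a fair die n_rolls times and return the relative frequency of each face.
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

for n in (500, 50000, 500000):
    print(n, {face: round(p, 4) for face, p in relative_frequencies(n).items()})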

3. Axiomatic definition of probability (Kolmogorov, 1933)

We have earlier defined an event as a subset of the sample space. Does each subset of the sample space form an event?
The answer is yes for a finite sample space. However, we may not be able to assign probability meaningfully to all the subsets of a continuous sample space. We have to eliminate those subsets. The concept of the sigma algebra is meaningful now.

Definition: Let S be a sample space and F a sigma field defined over it. Let P : F → ℜ be a mapping from the sigma-algebra F into the real line such that for each A ∈ F there exists a unique P(A) ∈ ℜ. Clearly P is a set function, and it is called a probability if it satisfies the following axioms:

1. P ( A) ≥ 0 for all A ∈ F
2. P ( S ) = 1
3. Countable additivity: If A_1, A_2, ... are pair-wise disjoint events, i.e. A_i ∩ A_j = φ for i ≠ j, then

P(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i)

Remark
• The triplet (S, F, P) is called the probability space.
• Any assignment of probability must satisfy the above three axioms.
• If A ∩ B = φ, then P(A ∪ B) = P(A) + P(B).
This is a special case of axiom 3 and, for a discrete sample space, this simpler version may be taken as axiom 3. We shall give a proof of this result below.
• The events A and B are called mutually exclusive if A ∩ B = φ.

Basic results of probability

From the above axioms we can establish the following basic results:

1. P(φ ) = 0
This is because,
S ∪φ = S
⇒ P( S ∪ φ ) = P( S )
⇒ P( S ) + P(φ ) = P ( S )
∴ P (φ ) = 0
2. P(A^c) = 1 − P(A), where A ∈ F

We have
A ∪ Ac = S
⇒ P( A ∪ Ac ) = P( S )
⇒ P( A) + P( Ac ) = 1 ∵ A ∩ Ac = φ
∴ P ( A) = 1 − P ( Ac )
3. If A, B ∈ F and A ∩ B = φ, then P(A ∪ B) = P(A) + P(B)

We have
A ∪ B = A ∪ B ∪ φ ∪ φ ∪ ...
∴ P(A ∪ B) = P(A) + P(B) + P(φ) + P(φ) + ...   (using axiom 3)
∴ P(A ∪ B) = P(A) + P(B)
4. If A, B ∈ F, P(A ∩ B^c) = P(A) − P(A ∩ B)

We have
(A ∩ B^c) ∪ (A ∩ B) = A
∴ P[(A ∩ B^c) ∪ (A ∩ B)] = P(A)
⇒ P(A ∩ B^c) + P(A ∩ B) = P(A)      (the two events are disjoint)
⇒ P(A ∩ B^c) = P(A) − P(A ∩ B)

We can similarly show that


P ( Ac ∩ B ) = P ( B ) − P ( A ∩ B )
5. If A, B ∈ F, P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

We have
A ∪ B = ( Ac ∩ B) ∪ ( A ∩ B) ∪ ( A ∩ B c )
∴ P( A ∪ B ) = P[( Ac ∩ B) ∪ ( A ∩ B) ∪ ( A ∩ B c )]
= P ( Ac ∩ B) + P( A ∩ B) + P( A ∩ B c )
=P ( B ) − P( A ∩ B) + P( A ∩ B) + P( A) − P( A ∩ B)
=P ( B) + P( A) − P( A ∩ B )
6. We can apply the properties of sets to establish the following result for
A, B, C ∈ F, P ( A ∪ B ∪ C ) = P ( A) + P ( B ) + P(C ) − P ( A ∩ B) − P( B ∩ C ) − P( A ∩ C ) + P( A ∩ B ∩ C )
The following generalization is known as the principle of inclusion-exclusion.

7. Principle of Inclusion-exclusion

Suppose A_1, A_2, ..., A_n ∈ F. Then

P(∪_{i=1}^n A_i) = ∑_{i=1}^n P(A_i) − ∑_{i&lt;j} P(A_i ∩ A_j) + ∑_{i&lt;j&lt;k} P(A_i ∩ A_j ∩ A_k) − ... + (−1)^(n+1) P(∩_{i=1}^n A_i)

Discussion
We require some rules to assign probabilities to some basic events in ℑ . For other events we can
compute the probabilities in terms of the probabilities of these basic events.

Probability assignment in a discrete sample space

Consider a finite sample space S = {s_1, s_2, ..., s_n}. Then the sigma algebra ℑ is defined by the power set of S. For any elementary event {s_i} ∈ ℑ, we can assign a probability P({s_i}) such that

∑_{i=1}^n P({s_i}) = 1

For any event A ∈ ℑ, we can define the probability

P(A) = ∑_{s_i ∈ A} P({s_i})

In a special case, when the outcomes are equi-probable, we can assign equal probability p to each
elementary event.
∴ ∑_{i=1}^n p = 1
⇒ p = 1/n
∴ P(A) = P(∪_{s_i ∈ A} {s_i}) = n(A) × (1/n) = n(A)/n

where n(A) is the number of outcomes in A.
Example Consider the experiment of rolling a fair die considered in example 2.
Suppose Ai , i = 1,.., 6 represent the elementary events. Thus A1 is the event of getting ‘1’, A2 is the
event of getting ’2’ and so on.
Since all six disjoint events are equiprobable and S = A_1 ∪ A_2 ∪ ... ∪ A_6, we get
P(A_1) = P(A_2) = ... = P(A_6) = 1/6
Suppose A is the event of getting an odd face. Then
A = A_1 ∪ A_3 ∪ A_5
∴ P(A) = P(A_1) + P(A_3) + P(A_5) = 3 × 1/6 = 1/2

Example Consider the experiment of tossing a fair coin until a head is obtained discussed in
Example 3. Here S = {H , TH , TTH ,......}. Let us call
s1 = H
s2 = TH
s3 = TTH
and so on. If we assign P({s_n}) = 1/2^n, then ∑_{s_n ∈ S} P({s_n}) = 1. Let A = {s_1, s_2, s_3} be the event of obtaining the head before the 4th toss. Then

P(A) = P({s_1}) + P({s_2}) + P({s_3}) = 1/2 + 1/2^2 + 1/2^3 = 7/8
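As a quick check of this assignment, here is a minimal Python sketch (the helper name p is ours) showing that the probabilities 1/2^n sum to 1 and that the event A indeed gets probability 7/8.

from fractions import Fraction

def p(n):
    # probability assigned to s_n = "first head appears on the n-th toss"
    return Fraction(1, 2 ** n)

print(float(sum(p(n) for n in range(1, 61))))   # partial sum, already ~1.0
print(sum(p(n) for n in range(1, 4)))           # P(A) = 7/8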

Probability assignment in a continuous space

Suppose the sample space S is continuous and un-countable. Such a sample space arises when the
outcomes of an experiment are numbers. For example, such sample space occurs when the
experiment consists in measuring the voltage, the current or the resistance. In such a case, the sigma
algebra consists of the Borel sets on the real line.

Suppose S = ℝ and f : ℝ → ℝ is a non-negative integrable function such that

∫_{−∞}^{∞} f(x) dx = 1

For any Borel set A,

P(A) = ∫_A f(x) dx

defines the probability on the Borel sigma-algebra B.
We can similarly define probability on the continuous spaces ℝ^2, ℝ^3, etc.
Example Suppose
f_X(x) = 1/(b − a) for x ∈ [a, b], and 0 otherwise.
Then for [a_1, b_1] ⊆ [a, b],
P([a_1, b_1]) = ∫_{a_1}^{b_1} f_X(x) dx = (b_1 − a_1)/(b − a)
Example Consider S = ℝ^2, the two-dimensional Euclidean plane. Let S_1 ⊆ ℝ^2 and let |S_1| denote the area of S_1. Define
f_X(x) = 1/|S_1| for x ∈ S_1, and 0 otherwise.
Then for any region A ⊆ S_1,
P(A) = |A|/|S_1|

This example interprets the geometrical definition of probability: the probability of A is the ratio of its area to the area of S_1.
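The geometric interpretation can be checked by simulation. The sketch below is our own illustration (the region chosen - a quarter disc inside the unit square - is arbitrary): it draws uniform points in S_1 = the unit square and estimates P(A) as the fraction landing in A, which should approach area(A)/area(S_1) = π/4 ≈ 0.785.

import random

def estimate_area_ratio(n_points, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()   # uniform point in the unit square
        if x * x + y * y <= 1.0:            # inside the quarter disc of radius 1
            hits += 1
    return hits / n_points

print(estimate_area_ratio(1000000))         # close to pi/4 ~ 0.7854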


Probability Using Counting Method:

In many applications we have to deal with a finite sample space S and the elementary
events formed by single elements of the set may be assumed equiprobable. In this case,
we can define the probability of the event A according to the classical definition
discussed earlier:

P(A) = n_A / n

where n_A = number of elements favourable to A and n is the total number of elements in the sample space S.

Thus calculation of probability involves finding the number of elements in the sample
space S and the event A. Combinatorial rules give us quick algebraic rules to find the
elements in S . We briefly outline some of these rules:

(1) Product rule: Suppose we have a set A with m distinct elements and a set B with n distinct elements, and A × B = {(a_i, b_j) | a_i ∈ A, b_j ∈ B}. Then A × B contains mn ordered pairs of elements. This is illustrated in the figure below for m = 5 and n = 4. In other words, if the element a can be chosen in m possible ways and the element b in n possible ways, then the ordered pair (a, b) can be chosen in mn possible ways.

B   (a1,b4) (a2,b4) (a3,b4) (a4,b4) (a5,b4)
    (a1,b3) (a2,b3) (a3,b3) (a4,b3) (a5,b3)
    (a1,b2) (a2,b2) (a3,b2) (a4,b2) (a5,b2)
    (a1,b1) (a2,b1) (a3,b1) (a4,b1) (a5,b1)

Fig. Illustration of the product rule

The above result can be generalized as follows:

The number of distinct k-tuples in

A_1 × A_2 × ... × A_k = {(a_1, a_2, ..., a_k) | a_1 ∈ A_1, a_2 ∈ A_2, ..., a_k ∈ A_k} is n_1 n_2 ... n_k, where n_i represents the number of distinct elements in A_i.
Example
A fair die is thrown twice. What is the probability that a 3 will appear at least once?
Solution:
The sample space corresponding to two throws of the die is illustrated in the following table. Clearly, the sample space has 6 × 6 = 36 elements by the product rule. The event corresponding to getting at least one 3 is highlighted and contains 11 elements.
Therefore, the required probability is 11/36.

          (1,6) (2,6) (3,6) (4,6) (5,6) (6,6)
          (1,5) (2,5) (3,5) (4,5) (5,5) (6,5)
Throw 2   (1,4) (2,4) (3,4) (4,4) (5,4) (6,4)
          (1,3) (2,3) (3,3) (4,3) (5,3) (6,3)
          (1,2) (2,2) (3,2) (4,2) (5,2) (6,2)
          (1,1) (2,1) (3,1) (4,1) (5,1) (6,1)
                         Throw 1
(2) Sampling with replacement and ordering:

Suppose we have to choose k objects from a set of n objects. Further, after each choice the chosen object is placed back in the set. In this case, the number of distinct ordered k-tuples = n × n × ... × n (k times) = n^k.

(3) Sampling without replacement:

Suppose we have to choose k objects from a set of n objects by picking one object
after another at random.

In this case the first object can be chosen from n objects, the second object from n − 1 objects, and so on. Therefore, by applying the product rule, the number of distinct ordered k-tuples in this case is
n × (n − 1) × ... × (n − k + 1) = n!/(n − k)!
The number n!/(n − k)! is called the permutation of n objects taking k at a time and is denoted by nPk. Thus
nPk = n!/(n − k)!

Clearly, nPn = n!

Example: Birthday problem - Given a class of students, what is the probability of two students in the class having the same birthday? Plot this probability vs. the number of people and be surprised! If the group has more than 365 people, the probability of two people in the group having the same birthday is obviously 1.

Let k be the number of students in the class.

Then the number of possible birthday assignments = 365 × 365 × ... × 365 (k times) = 365^k.
The number of cases in which all k students have different birthdays is 365Pk = 365 × 364 × ... × (365 − k + 1).
Therefore, the probability of a common birthday = 1 − 365Pk / 365^k.

Number of persons    Probability
2                    0.0027
10                   0.1169
15                   0.2529
20                   0.4114
25                   0.5687
40                   0.8912
50                   0.9704
60                   0.9941
80                   0.9999
100                  ≈ 1

The plot of probability vs. the number of students is shown in Fig. Observe the steep rise in the probability in the beginning. In fact, this probability for a group of 25 students is greater than 0.5, and from 60 students onward it is close to 1. This probability for 366 or more students is exactly one. The computation below reproduces these values.
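A few lines of Python (an illustrative sketch; the function name p_common_birthday is ours) evaluate 1 − 365Pk/365^k directly:

def p_common_birthday(k):
    # probability that at least two of k people share a birthday (365 equally likely days)
    p_all_different = 1.0
    for i in range(k):
        p_all_different *= (365 - i) / 365
    return 1.0 - p_all_different

for k in (2, 10, 15, 20, 25, 40, 50, 60, 80, 100):
    print(k, round(p_common_birthday(k), 4))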

(4) Sampling without replacement and without ordering

Let nCk be the number of ways in which k objects can be chosen out of a set of n objects when the ordering of the k chosen objects is not considered.

Note that k objects can be arranged among themselves in k! ways. Therefore, if ordering of the k objects is considered, the number of ways in which k objects can be chosen out of n objects is nCk k!. This is the case of sampling with ordering.

∴ nCk k! = nPk = n!/(n − k)!

∴ nCk = n!/(k!(n − k)!)

nCk is also called the binomial coefficient.
Example:

An urn contains 6 red balls, 5 green balls and 4 blue balls. 9 balls were picked at random
from the urn without replacement. What is the probability that out of the balls 4 are red, 3
are green and 2 are blue?

Solution:
9 balls can be picked from a population of 15 balls in 15C9 = 15!/(9!6!) ways.

Therefore the required probability is (6C4 × 5C3 × 4C2) / 15C9.
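For completeness, the same numbers can be evaluated with Python's math.comb (a small sketch, not part of the original solution):

from math import comb

favourable = comb(6, 4) * comb(5, 3) * comb(4, 2)   # 15 * 10 * 6 = 900
total = comb(15, 9)                                  # 5005
print(favourable / total)                            # ~0.1798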

(5) Arranging n objects into k specific groups


Suppose we want to partition a set of n distinct elements into k distinct subsets A_1, A_2, ..., A_k of sizes n_1, n_2, ..., n_k respectively, so that n = n_1 + n_2 + ... + n_k. Then the total number of distinct partitions is

n!/(n_1! n_2! ... n_k!)

This can be proved by noting that the resulting number of partitions is

C(n, n_1) × C(n − n_1, n_2) × ... × C(n − n_1 − n_2 − ... − n_{k−1}, n_k)
= [n!/(n_1!(n − n_1)!)] × [(n − n_1)!/(n_2!(n − n_1 − n_2)!)] × ... × [(n − n_1 − ... − n_{k−1})!/(n_k!(n − n_1 − ... − n_{k−1} − n_k)!)]
= n!/(n_1! n_2! ... n_k!)

Example: What is the probability that in a throw of 12 dice each face occurs twice?
Solution: The total number of elements in the sample space of the outcomes of a single throw of 12 dice is 6^12.
The number of favourable outcomes is the number of ways in which the 12 dice can be arranged in six groups of size 2 each - group 1 consisting of two dice each showing 1, group 2 consisting of two dice each showing 2, and so on.
Therefore, the total number of such distinct arrangements is
12!/(2!2!2!2!2!2!) = 12!/(2!)^6
Hence the required probability is
12!/((2!)^6 6^12)
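Numerically (a short sketch using only Python's standard library), this probability works out to about 0.0034:

from math import factorial

favourable = factorial(12) // factorial(2) ** 6   # 12!/(2!)^6 distinct arrangements
total = 6 ** 12                                   # all outcomes of throwing 12 dice
print(favourable, total, favourable / total)      # 7484400 2176782336 ~0.0034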
Conditional probability
Consider the probability space (S, F, P). Let A and B be two events in F. We ask the following question -
Given that A has occurred, what is the probability of B?
The answer is the conditional probability of B given A, denoted by P(B/A). We shall develop the concept of conditional probability and explain under what condition this conditional probability is the same as P(B).
Notation
P( B / A) = Conditional
probability of B given A

Let us consider the case of equiprobable outcomes discussed earlier. Let N_AB sample points be favourable to the joint event A ∩ B.

Clearly
P(B/A) = (Number of outcomes favourable to both A and B)/(Number of outcomes in A)
= N_AB / N_A
= (N_AB / N)/(N_A / N)
= P(A ∩ B)/P(A)

This motivates the definition of conditional probability. The probability of an event B under the condition that another event A has occurred is called the conditional probability of B given A and is defined by
P(B/A) = P(A ∩ B)/P(A),   P(A) ≠ 0

We can similarly define the conditional probability of A given B, denoted by P( A / B).


From the definition of conditional probability, we have the joint probability P ( A ∩ B) of two
events A and B as follows
P ( A ∩ B ) = P (A ) P ( B / A ) = P ( B ) P ( A / B )

Example 1
Consider the example tossing the fair die. Suppose
A = event of getting an even number = {2, 4, 6}
B = event of getting a number less than 4 = {1, 2,3}
∴ A ∩ B = {2}
∴ P(B/A) = P(A ∩ B)/P(A) = (1/6)/(3/6) = 1/3

Example 2
A family has two children. It is known that at least one of the children is a girl. What is the
probability that both the children are girls?
A = event of at least one girl
B = event of two girls

Clearly
S = {gg , gb, bg , bb}, A = {gg , gb, bg} and B = {gg}
A ∩ B = {gg}
∴ P(B/A) = P(A ∩ B)/P(A) = (1/4)/(3/4) = 1/3

Conditional probability and the axioms of probability

In the following we show that the conditional probability satisfies the axioms of probability.

By definition, P(B/A) = P(A ∩ B)/P(A), P(A) ≠ 0.
Axiom 1
P(A ∩ B) ≥ 0 and P(A) > 0
∴ P(B/A) = P(A ∩ B)/P(A) ≥ 0
Axiom 2 We have S ∩ A = A
∴ P(S/A) = P(S ∩ A)/P(A) = P(A)/P(A) = 1
Axiom 3 Consider a sequence of disjoint events B_1, B_2, ..., B_n, ... We have
(∪_{i=1}^∞ B_i) ∩ A = ∪_{i=1}^∞ (B_i ∩ A)
(See the Venn diagram below for an illustration of the finite version of this result.)
Note that the sequence B_i ∩ A, i = 1, 2, ... is also a sequence of disjoint events.
∴ P(∪_{i=1}^∞ (B_i ∩ A)) = ∑_{i=1}^∞ P(B_i ∩ A)
∴ P(∪_{i=1}^∞ B_i / A) = P((∪_{i=1}^∞ B_i) ∩ A)/P(A) = ∑_{i=1}^∞ P(B_i ∩ A)/P(A) = ∑_{i=1}^∞ P(B_i / A)

Properties of Conditional Probabilities

(1) If A ⊆ B, then P(B/A) = 1 and P(A/B) ≥ P(A)

We have A ∩ B = A
∴ P(B/A) = P(A ∩ B)/P(A) = P(A)/P(A) = 1
and
P(A/B) = P(A ∩ B)/P(B) = P(A)/P(B) ≥ P(A), since P(B) ≤ 1.
(2) Chain rule of probability
P(A1 ∩ A2 ... An ) = P(A1 ) P( A2 / A1 )P( A3 / A1 ∩ A2 )...P ( An / A1 ∩ A2 .... ∩ An −1 )
We have
(A ∩ B ∩ C) = (A ∩ B) ∩ C
∴ P(A ∩ B ∩ C) = P(A ∩ B) P(C / A ∩ B)
= P(A) P(B / A) P(C / A ∩ B)
We can generalize the above to get the chain rule of probability
P(A1 ∩ A2 ... An ) = P(A1 ) P( A2 / A1 )P( A3 / A1 ∩ A2 )...P ( An / A1 ∩ A2 .... ∩ An −1 )
(3) Theorem of Total Probability: Let A_1, A_2, ..., A_n be n events such that
S = A_1 ∪ A_2 ∪ ... ∪ A_n and A_i ∩ A_j = φ for i ≠ j. Then for any event B,

P(B) = ∑_{i=1}^n P(A_i) P(B/A_i)

Proof: We have ∪_{i=1}^n (B ∩ A_i) = B, and the sequence B ∩ A_i, i = 1, ..., n is disjoint.
∴ P(B) = P(∪_{i=1}^n (B ∩ A_i))
= ∑_{i=1}^n P(B ∩ A_i)
= ∑_{i=1}^n P(A_i) P(B/A_i)

[Fig.: the sample space S partitioned into A_1, A_2, A_3, ...]

Remark
(1) A decomposition of a set S into 2 or more disjoint nonempty subsets is called a partition of S .
The subsets A1 , A2 , . . . An form a partition of S if
S = A1 ∪ A2 ..... ∪ An and Ai ∩ Aj = φ for i ≠ j.
(2) The theorem of total probability can be used to determine the probability of a complex event in terms of related simpler events. This result will be used in Bayes' theorem, to be discussed at the end of the lecture.

Example 3 Suppose a box contains 2 white and 3 black balls. Two balls are picked at random without replacement. Let A_1 = the event that the first ball is white and A_1^c = the event that the first ball is black. Clearly A_1 and A_1^c form a partition of the sample space corresponding to picking two balls from the box. Let B = the event that the second ball is white. Then
P(B) = P(A_1) P(B/A_1) + P(A_1^c) P(B/A_1^c)
= (2/5) × (1/4) + (3/5) × (2/4) = 2/5
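The same computation, done with exact fractions in Python (an illustrative sketch; the variable names are ours):

from fractions import Fraction

P_A1, P_A1c = Fraction(2, 5), Fraction(3, 5)          # first ball white / black
P_B_given_A1 = Fraction(1, 4)                         # 1 white left among 4 balls
P_B_given_A1c = Fraction(2, 4)                        # 2 white left among 4 balls

P_B = P_A1 * P_B_given_A1 + P_A1c * P_B_given_A1c     # theorem of total probability
print(P_B)                                            # 2/5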
Independent events

Two events are called independent if the probability of occurrence of one event does not affect the
probability of occurrence of the other. Thus the events A and B are independent if
P( B / A) = P( B) and P( A / B) = P( A).
where P( A) and P( B) are assumed to be non-zero.
Equivalently if A and B are independent, we have

P(A ∩ B)/P(A) = P(B)
or
P(A ∩ B) = P(A) P(B)

i.e. the joint probability is the product of the individual probabilities.

Two events A and B are called statistically dependent if they are not independent.
Similarly, we can define the independence of n events. The events A_1, A_2, ..., A_n are called independent if and only if, for every choice of distinct indices i, j, k, ...,
P(A_i ∩ A_j) = P(A_i) P(A_j)
P(A_i ∩ A_j ∩ A_k) = P(A_i) P(A_j) P(A_k)
⋮
P(A_1 ∩ A_2 ∩ ... ∩ A_n) = P(A_1) P(A_2) ... P(A_n)

Example 4 Consider the example of tossing a fair coin twice. The resulting sample space is given
by S = {HH , HT , TH , TT } and all the outcomes are equiprobable.
Let A = {TH , TT } be the event of getting ‘tail’ in the first toss and B = {TH , HH } be the event of
getting ‘head’ in the second toss. Then
P(A) = 1/2 and P(B) = 1/2.
Again, A ∩ B = {TH}, so that
P(A ∩ B) = 1/4 = P(A) P(B)
Hence the events A and B are independent.
Example 5 Consider the experiment of picking two balls at random discussed in Example 3. In this case, P(B) = 2/5 and P(B/A_1) = 1/4.
Therefore, P(B) ≠ P(B/A_1), and A_1 and B are dependent.

Bayes' Theorem
Suppose A_1, A_2, ..., A_n form a partition of S, i.e. S = A_1 ∪ A_2 ∪ ... ∪ A_n and A_i ∩ A_j = φ for i ≠ j.
Suppose the event B occurs if one of the events A_1, A_2, ..., A_n occurs, and that we know the probabilities P(A_i) and P(B/A_i), i = 1, 2, ..., n. We ask the following question:
Given that B has occurred, what is the probability that a particular event A_k has occurred? In other words, what is P(A_k / B)?

We have P(B) = ∑_{i=1}^n P(A_i) P(B/A_i)   (using the theorem of total probability)

∴ P(A_k | B) = P(A_k) P(B/A_k) / P(B)
= P(A_k) P(B/A_k) / ∑_{i=1}^n P(A_i) P(B/A_i)

This result is known as Bayes' theorem. The probability P(A_k) is called the a priori probability and P(A_k / B) is called the a posteriori probability. Thus Bayes' theorem enables us to determine the a posteriori probability P(A_k | B) from the observation that B has occurred. This result is of practical importance and is at the heart of Bayesian classification, Bayesian estimation, etc.

Example 6
In a binary communication system, a zero and a one are transmitted with probabilities 0.6 and 0.4 respectively. Due to error in the communication system, a zero becomes a one with a probability
0.1 and a one becomes a zero with a probability 0.08. Determine the probability (i) of receiving a
one and (ii) that a one was transmitted when the received message is one.

Let S be the sample space corresponding to binary communication. Let T_0 be the event of transmitting 0, T_1 the event of transmitting 1, and R_0 and R_1 the corresponding events of receiving 0 and 1 respectively.

Given P (T0 ) = 0.6, P (T1 ) = 0.4, P ( R1 / T0 ) = 0.1 and P ( R0 / T1 ) = 0.08.


(i) P(R_1) = Probability of receiving 'one'
= P(T_1) P(R_1/T_1) + P(T_0) P(R_1/T_0)
= 0.4 × 0.92 + 0.6 × 0.1
= 0.428
(ii) Using Bayes' rule,
P(T_1/R_1) = P(T_1) P(R_1/T_1) / P(R_1)
= P(T_1) P(R_1/T_1) / [P(T_1) P(R_1/T_1) + P(T_0) P(R_1/T_0)]
= (0.4 × 0.92)/(0.4 × 0.92 + 0.6 × 0.1)
≈ 0.8598
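The arithmetic of Example 6 can be verified with a few lines of Python (a sketch; the variable names are ours):

P_T0, P_T1 = 0.6, 0.4          # a priori probabilities of transmitting 0 and 1
P_R1_T0 = 0.1                  # P(receive 1 | transmit 0)
P_R1_T1 = 1 - 0.08             # P(receive 1 | transmit 1) = 0.92

P_R1 = P_T1 * P_R1_T1 + P_T0 * P_R1_T0      # theorem of total probability
P_T1_R1 = P_T1 * P_R1_T1 / P_R1             # Bayes' theorem: a posteriori probability
print(round(P_R1, 4), round(P_T1_R1, 4))    # 0.428 0.8598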

Example 7 In an electronics laboratory, there are identical-looking capacitors of three makes A_1, A_2 and A_3 in the ratio 2:3:4. It is known that 1% of A_1, 1.5% of A_2 and 2% of A_3 are defective. What percentage of capacitors in the laboratory are defective? If a capacitor picked at random is found to be defective, what is the probability that it is of make A_3?
Let D be the event that the item is defective. Here we have to find P(D) and P(A_3/D).
Here P(A_1) = 2/9, P(A_2) = 3/9 = 1/3 and P(A_3) = 4/9.
The conditional probabilities are P(D/A_1) = 0.01, P(D/A_2) = 0.015 and P(D/A_3) = 0.02.
∴ P(D) = P(A_1) P(D/A_1) + P(A_2) P(D/A_2) + P(A_3) P(D/A_3)
= (2/9) × 0.01 + (1/3) × 0.015 + (4/9) × 0.02
≈ 0.0161
and
P(A_3/D) = P(A_3) P(D/A_3) / P(D)
= [(4/9) × 0.02] / 0.0161
≈ 0.552
Repeated Trials
In our discussions so far, we considered the probability defined over a sample space
corresponding to a random experiment. Often, we have to consider several random
experiments in a sequence. For example, the experiment corresponding to sequential
transmission of bits through a communication system may be considered as a sequence
of experiments each representing transmission of single bit through the channel.

Suppose two experiments E1 and E2 with the corresponding sample space S1 and S 2 are
performed sequentially. Such a combined experiment is called the product of two
experiments E1 and E2.

Clearly, the outcome of this combined experiment consists of the ordered pair
( s1 , s2 ) where s1 ∈ S1 and s2 ∈ S 2 . The sample space corresponding to the combined
experiment is given by S = S1 × S2 . The events in S consist of all the Cartesian products
of the form A1 × A2 where A1 is an event in S1 and A2 is an event in S 2 . Our aim is to
define the probability P ( A1 × A2 ) .

The sample space S1 × S2 and the events A1 × S 2 ,S1 × A2 and A1 × A2 are illustrated in in
Fig.

We can easily show that

P ( S1 × A2 ) = P2 ( A2 )
and
P ( A1 × S2 ) = P1 ( A1 )

where P_i is the probability defined on the events of S_i, i = 1, 2. This is because the event A_1 × S_2 in S occurs whenever A_1 occurs in S_1, irrespective of the outcome in S_2.

Also note that
A_1 × A_2 = (A_1 × S_2) ∩ (S_1 × A_2)
∴ P(A_1 × A_2) = P[(A_1 × S_2) ∩ (S_1 × A_2)]

[Fig.: the rectangle S_1 × S_2 with the strips A_1 × S_2 and S_1 × A_2; their intersection is A_1 × A_2]
Independent Experiments:

In many experiments, the events A_1 × S_2 and S_1 × A_2 are independent for every selection of events A_1 in S_1 and A_2 in S_2. Such experiments are called independent experiments. In this case we can write

P(A_1 × A_2) = P[(A_1 × S_2) ∩ (S_1 × A_2)]
= P(A_1 × S_2) P(S_1 × A_2)
= P_1(A_1) P_2(A_2)

Example 1
Suppose S_1 is the sample space of the experiment of rolling a six-faced fair die and S_2 is the sample space of the experiment of tossing a fair coin.
Clearly,
S_1 = {1, 2, 3, 4, 5, 6}, A_1 = {2, 3}
and
S_2 = {H, T}, A_2 = {H}
∴ S_1 × S_2 = {(1, H), (1, T), (2, H), (2, T), (3, H), (3, T), (4, H), (4, T), (5, H), (5, T), (6, H), (6, T)}
and
P(A_1 × A_2) = (2/6) × (1/2) = 1/6

Example 2

In a digital communication system transmitting 1 and 0, 1 is transmitted twice as often as


0. If two bits are transmitted in a sequence, what is the probability that both the bits will
be 1?
S_1 = {0, 1}, A_1 = {1} with P_1(A_1) = 2/3
and
S_2 = {0, 1}, A_2 = {1} with P_2(A_2) = 2/3
∴ S_1 × S_2 = {(0, 0), (0, 1), (1, 0), (1, 1)}
and, since the two transmissions are independent,
P(A_1 × A_2) = P_1(A_1) P_2(A_2) = (2/3) × (2/3) = 4/9
Generalization
We can similarly define the sample space S = S_1 × S_2 × ... × S_n corresponding to n experiments and the Cartesian product of events
A_1 × A_2 × ... × A_n = {(s_1, s_2, ..., s_n) | s_1 ∈ A_1, s_2 ∈ A_2, ..., s_n ∈ A_n}.

If the experiments are independent, we can write

P(A_1 × A_2 × ... × A_n) = P_1(A_1) P_2(A_2) ... P_n(A_n)

where P_i is the probability defined on the events of S_i.

Bernoulli trial

Suppose in an experiment, we are only concerned whether a particular event A has


occurred or not. We call this event as the ‘success’ with probability P ( A) = p and the
complementary event Ac as the ‘failure’ with probability P ( Ac ) = 1 − p. Such a random
experiment is called Bernoulli trial.

Binomial Law:

We are interested in finding the probability of k ‘successes’ in n independent Bernoulli


trials. This probability p_n(k) is given by

p_n(k) = nCk p^k (1 − p)^(n−k)

Consider n independent repetitions of the Bernoulli trial. Let S_1 be the sample space associated with each trial, and let A ⊆ S_1 be the particular event of interest with complement A^c, such that P(A) = p and P(A^c) = 1 − p. If A occurs in a trial, then we have a 'success'; otherwise a 'failure'.

Thus the sample space corresponding to the n repeated trials is S = S1 × S2 × ....... × Sn .

Any event in S is of the form A1 × A2 × ........ × An where some Ai s are A and remaining
Ai s are Ac .

Using the property of independent experiment we have,

P( A1 × A2 × ......... × An ) = P ( A1 ) P ( A2 ) ...........P ( An ) .

If k Ai s are A and the remaining n - k Ai s are Ac , then


P ( A1 × A2 × ......... × An ) = p k (1 − p)n −k

But there are nCk events in S with k occurrences of A and n − k occurrences of A^c. For example, if n = 4, k = 2, the possible events are

A × Ac × Ac × A
A × A × Ac × A c
A × Ac × A × A c
Ac × A × A × A c
Ac × A c × A × A
A c × A × Ac × A

We also note that all the nCk events are mutually exclusive.

Hence the probability of k successes in n independent repetitions of the Bernoulli trial is


given by
p_n(k) = nCk p^k (1 − p)^(n−k)
Example 1:

A fair die is rolled 6 times. What is the probability that a 4 appears exactly four times?

Solution:

We have S_1 = {1, 2, 3, 4, 5, 6},
A = {4} with P(A) = p = 1/6
and A^c = {1, 2, 3, 5, 6} with P(A^c) = 1 − p = 5/6

∴ P_6(4) = 6C4 × (1/6)^4 × (5/6)^2 = (6 × 5 / 2) × (1/6)^4 × (5/6)^2 ≈ 0.008

Example2:

A communication source emits binary symbols 1 and 0 with probability 0.6 and 0.4
respectively. What is the probability that there will be 5 1s in a message of 20 symbols?

Solution:
S_1 = {0, 1}
A = {1}, P(A) = p = 0.6
∴ P_20(5) = 20C5 (0.6)^5 (0.4)^15 ≈ 0.0013
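Both examples can be checked with a direct implementation of the binomial law (a minimal sketch; binomial_pmf is our own helper name):

from math import comb

def binomial_pmf(n, k, p):
    # p_n(k) = nCk * p^k * (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(round(binomial_pmf(6, 4, 1 / 6), 4))    # ~0.008  (Example 1)
print(round(binomial_pmf(20, 5, 0.6), 4))     # ~0.0013 (Example 2)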

Example 3
In a binary communication system, a bit error occurs with a probability of 10^(−5). What is the probability of getting at least one error bit in a message of 8 bits?

Here we can consider the sample space

S_1 = {error in transmission of 1 bit} ∪ {no error in transmission of 1 bit}
with p = P{error in transmission of 1 bit} = 10^(−5)
∴ Probability of no bit error in the transmission of 8 bits
= P_8(0) = (1 − 10^(−5))^8 ≈ 0.99992
∴ Probability of at least one bit error in the transmission of 8 bits = 1 − P_8(0) ≈ 8 × 10^(−5)

Typical plots of binomial probabilities are shown in the figure.

Approximations of the Binomial probabilities

Two interesting approximations of the binomial probabilities are very important.

Case 1
Suppose n is very large, p is very small, and np = λ, a constant. Then
P_n(k) = nCk p^k (1 − p)^(n−k)
= nCk (λ/n)^k (1 − λ/n)^(n−k)
= [n(n − 1)...(n − k + 1)/k!] × (λ^k/n^k) × (1 − λ/n)^n / (1 − λ/n)^k
= [(1 − 1/n)(1 − 2/n)...(1 − (k − 1)/n) λ^k (1 − λ/n)^n] / [k! (1 − λ/n)^k]
Since lim_{n→∞} (1 − λ/n)^k = 1 and lim_{n→∞} (1 − λ/n)^n = e^(−λ),

∴ p_n(k) ≈ λ^k e^(−λ) / k!
This distribution is known as Poisson probability and widely used in engineering and
other fields. We shall discuss more about this distribution in a later class.
Case 2
When n is sufficiently large and np(1 − p) >> 1, p_n(k) may be approximated as

p_n(k) ≈ [1/√(2π np(1 − p))] e^(−(k − np)^2 / (2np(1 − p)))

The right hand side is an expression for normal distribution to be discussed in a later
class.

Example: Consider the problem in Example 2, where n = 20 and p = 0.6, so that np = 12 and np(1 − p) = 4.8. Here
p_20(5) ≈ [1/√(2π × 4.8)] e^(−(5 − 12)^2 / (2 × 4.8)) ≈ 0.0011
which may be compared with the exact value 0.0013 obtained earlier.
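The quality of both approximations can be seen numerically. The sketch below is illustrative only (the chosen n, p values for Case 1 are ours): it compares the exact binomial probabilities with the Poisson expression for a 'large n, small p' case, and with the Gaussian expression for p_20(5) of Example 2.

from math import comb, exp, factorial, pi, sqrt

def binomial(n, k, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

def normal_approx(n, k, p):
    mu, var = n * p, n * p * (1 - p)
    return exp(-(k - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Case 1: n = 1000, p = 0.001, so lambda = np = 1.
for k in range(5):
    print(k, round(binomial(1000, k, 0.001), 5), round(poisson(k, 1.0), 5))

# Case 2: normal approximation of p_20(5) with p = 0.6.
print(round(binomial(20, 5, 0.6), 4), round(normal_approx(20, 5, 0.6), 4))   # 0.0013 vs 0.0011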
Random Variables
In application of probabilities, we are often concerned with numerical values which are
random in nature. These random quantities may be considered as real-valued function on
the sample space. Such a real-valued function is called real random variable and plays an
important role in describing random data. We shall introduce the concept of random
variables in the following sections.

Mathematical Preliminaries
Real-valued point function on a set

Recall that a real-valued function f : S → ℝ maps each element s ∈ S to a unique element f(s) ∈ ℝ. The set S is called the domain of f and the set R_f = {f(s) | s ∈ S} is called the range of f. Clearly R_f ⊆ ℝ. The range and domain of f are shown in the figure.

[Fig.: the domain S = {s1, s2, s3, s4} mapped by f into its range {f(s1), f(s2), f(s3), f(s4)} ⊆ ℝ]

Image and Inverse image


For a point s ∈ S, the functional value f(s) ∈ ℝ is called the image of the point s. If A ⊆ S, then the set of the images of the elements of A is called the image of A and is denoted by f(A). Thus
f(A) = {f(s) | s ∈ A}
Clearly f(A) ⊆ R_f.
Suppose B ⊆ ℝ. The set {s | f(s) ∈ B} is called the inverse image of B under f and is denoted by f^(−1)(B).
Example Suppose S = {H, T} and f : S → ℝ is defined by f(H) = 1 and f(T) = −1.
Therefore,
• R_f = {1, −1} ⊆ ℝ
• The image of H is 1 and that of T is −1.
• For a subset of ℝ, say B_1 = (−∞, 1.5],
f^(−1)(B_1) = {s | f(s) ∈ B_1} = {H, T}.
For another subset B_2 = [5, 9], f^(−1)(B_2) = φ.
Random variable
A random variable associates the points in the sample space with real numbers.

Consider the probability space (S, F, P) and a function X : S → ℝ mapping the sample space S into the real line. Let us define the probability of a subset B ⊆ ℝ by
P_X({B}) = P(X^(−1)(B)) = P({s | X(s) ∈ B}).
Such a definition will be valid only if X^(−1)(B) is a valid event. If S is a discrete sample space, X^(−1)(B) is always a valid event, but the same may not be true if S is uncountable. The concept of sigma algebra is again necessary to overcome this difficulty. We also need the Borel sigma algebra B - the sigma algebra defined on the real line.

The function X : S → ℝ is called a random variable if the inverse image of every Borel set under X is an event. Thus, if X is a random variable, then
X^(−1)(B) = {s | X(s) ∈ B} ∈ F.

[Fig.: Random variable - the set A = X^(−1)(B) ⊆ S is the inverse image of the Borel set B ⊆ ℝ]

Notations:
• Random variables are represented by upper-case letters.
• Values of a random variable are denoted by lower-case letters.
• X(s) = x means that x is the value of the random variable X at the sample point s.
• Usually, the argument s is omitted and we simply write X = x.

Remark
• S is the domain of X .
• The range of X, denoted by R_X, is given by
R_X = {X(s) | s ∈ S}.
Clearly R_X ⊆ ℝ.
• The above definition of the random variable requires that the mapping X is such that X^(−1)(B) is a valid event in S. If S is a discrete sample space, this requirement is met by any mapping X : S → ℝ. Thus any mapping defined on a discrete sample space is a random variable.

Example 1: Consider the example of tossing a fair coin twice. The sample space is S = {HH, HT, TH, TT} and all four outcomes are equally likely. Then we can define a random variable X as follows:

Sample point    Value of the random variable X = x
HH              0
HT              1
TH              2
TT              3

Here R_X = {0, 1, 2, 3}.

Example 2: Consider the sample space associated with a single toss of a fair die. The sample space is given by S = {1, 2, 3, 4, 5, 6}.
If we define the random variable X that associates with each outcome the number on the face of the die, then
R_X = {1, 2, 3, 4, 5, 6}

Probability Space induced by a Random Variable

The random variable X induces a probability measure PX on B defined by


P_X({B}) = P(X^(−1)(B)) = P({s | X(s) ∈ B})
The probability measure P_X satisfies the three axioms of probability:
Axiom 1
P_X(B) = P(X^(−1)(B)) ≥ 0
Axiom 2
P_X(ℝ) = P(X^(−1)(ℝ)) = P(S) = 1
Axiom 3
Suppose B_1, B_2, ... are disjoint Borel sets. Then X^(−1)(B_1), X^(−1)(B_2), ... are disjoint events in F. Therefore,
P_X(∪_{i=1}^∞ B_i) = P(∪_{i=1}^∞ X^(−1)(B_i))
= ∑_{i=1}^∞ P(X^(−1)(B_i))
= ∑_{i=1}^∞ P_X(B_i)
Thus the random variable X induces a probability space (ℝ, B, P_X).
Probability Distribution Function
We have seen that the events B and {s | X(s) ∈ B} are equivalent and P_X({B}) = P({s | X(s) ∈ B}). The underlying sample space is omitted in the notation and we simply write {X ∈ B} and P({X ∈ B}) instead of {s | X(s) ∈ B} and P({s | X(s) ∈ B}) respectively.

Consider the Borel set (−∞, x], where x represents any real number. The equivalent event X^(−1)((−∞, x]) = {s | X(s) ≤ x, s ∈ S} is denoted as {X ≤ x}. The event {X ≤ x} can be taken as a representative event in studying the probability description of a random variable X. Any other event can be represented in terms of this event. For example,
{X > x} = {X ≤ x}^c,   {x_1 < X ≤ x_2} = {X ≤ x_2} \ {X ≤ x_1},
{X = x} = ∩_{n=1}^∞ ({X ≤ x} \ {X ≤ x − 1/n})
and so on.

The probability P({X ≤ x}) = P({s | X(s) ≤ x, s ∈ S}) is called the probability distribution function (also called the cumulative distribution function, abbreviated as CDF) of X and is denoted by F_X(x). Thus
F_X(x) = P({X ≤ x})

In the notation F_X(x), the subscript identifies the random variable and the argument x is the value at which the distribution function is evaluated.
Example 3 Consider the random variable X in Example 1.
We have

Value of the random variable X = x    P({X = x})
0                                     1/4
1                                     1/4
2                                     1/4
3                                     1/4

For x < 0,
F_X(x) = P({X ≤ x}) = 0
For 0 ≤ x < 1,
F_X(x) = P({X ≤ x}) = P({X = 0}) = 1/4
For 1 ≤ x < 2,
F_X(x) = P({X ≤ x})
= P({X = 0} ∪ {X = 1})
= P({X = 0}) + P({X = 1})
= 1/4 + 1/4 = 1/2
For 2 ≤ x < 3,
F_X(x) = P({X ≤ x})
= P({X = 0} ∪ {X = 1} ∪ {X = 2})
= P({X = 0}) + P({X = 1}) + P({X = 2})
= 1/4 + 1/4 + 1/4 = 3/4
For x ≥ 3,
F_X(x) = P({X ≤ x})
= P(S)
= 1
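The staircase nature of this CDF is easy to reproduce programmatically (a small sketch; make_cdf is our own helper name):

def make_cdf(pmf):
    # return F_X(x) = P(X <= x) for a discrete pmf given as {value: probability}
    def F(x):
        return sum(p for v, p in pmf.items() if v <= x)
    return F

F = make_cdf({0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25})   # pmf of Examples 1 and 3
for x in (-1, 0, 0.5, 1, 2.7, 3, 10):
    print(x, F(x))   # 0, 0.25, 0.25, 0.5, 0.75, 1.0, 1.0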

Properties of Distribution Function


• 0 ≤ FX ( x) ≤ 1
This follows from the fact that FX (x) is a probability and its value should lie between
0 and 1.
• F_X(x) is a non-decreasing function of x. Thus, if x_1 < x_2, then F_X(x_1) ≤ F_X(x_2).

x_1 < x_2
⇒ {X(s) ≤ x_1} ⊆ {X(s) ≤ x_2}
⇒ P{X(s) ≤ x_1} ≤ P{X(s) ≤ x_2}
∴ F_X(x_1) ≤ F_X(x_2)

• F_X(x) is right-continuous:
F_X(x⁺) = lim_{h→0, h>0} F_X(x + h) = F_X(x)
This is because lim_{h→0, h>0} P{X(s) ≤ x + h} = P{X(s) ≤ x} = F_X(x).

(Recall: a real function f(x) is continuous at a point a if and only if (i) f(a) is defined and (ii) lim_{x→a+} f(x) = lim_{x→a−} f(x) = f(a). The function f(x) is right-continuous at a if and only if f(a) is defined and lim_{x→a+} f(x) = f(a).)

• FX (−∞) = 0
Because, FX (−∞) = P{s | X ( s) ≤ −∞} = P(φ ) = 0

• FX (∞) = 1
Because, FX (∞) = P{s | X ( s) ≤ ∞} = P( S ) = 1

• P({x1 < X ≤ x2 }) = FX ( x2 ) − FX ( x1 )
We have
{ X ≤ x2 } = { X ≤ x1} ∪ {x1 < X ≤ x2 }
∴ P ({ X ≤ x2 }) = P({ X ≤ x1}) + P ({x1 < X ≤ x2 })
⇒ P ({x1 < X ≤ x2 }) = P ({ X ≤ x2 }) − P ({ X ≤ x1}) = FX ( x2 ) − FX ( x1 )

• F_X(x⁻) = F_X(x) − P(X = x)
F_X(x⁻) = lim_{h→0, h>0} F_X(x − h)
= lim_{h→0, h>0} P{X(s) ≤ x − h}
= P{X(s) ≤ x} − P(X(s) = x)
= F_X(x) − P(X = x)
We can further establish the following results on probability of events on the real line:

P{x1 ≤ X ≤ x2 } = FX ( x2 ) − FX ( x1 ) + P ( X = x1 )
P ({x1 ≤ X < x2 }) = FX ( x2 ) − FX ( x1 ) + P ( X = x1 ) − P ( X = x2 )
P ({ X > x}) = P ({x < X < ∞}) = 1 − FX ( x)

Thus we have seen that, given F_X(x), −∞ < x < ∞, we can determine the probability of any event involving values of the random variable X. Thus F_X(x), for all x, is a complete description of the random variable X.

Example 4 Consider the random variable X defined by

F_X(x) = 0,                x < −2
       = (1/8)x + 1/4,     −2 ≤ x < 0
       = 1,                x ≥ 0

Find a) P(X = 0)
b) P{X ≤ 0}
c) P{X > 2}
d) P{−1 < X ≤ 1}
Solution:

a) P(X = 0) = F_X(0⁺) − F_X(0⁻)
= 1 − 1/4 = 3/4
b) P{X ≤ 0} = F_X(0)
= 1
c) P{X > 2} = 1 − F_X(2)
= 1 − 1 = 0
d) P{−1 < X ≤ 1}
= F_X(1) − F_X(−1)
= 1 − 1/8 = 7/8

[Fig.: plot of F_X(x)]
Conditional Distribution and Density function:

We discussed conditional probability in an earlier class. For two events A and B


with P(B) ≠ 0, the conditional probability P(A/B) was defined as
P(A/B) = P(A ∩ B)/P(B)

Clearly, the conditional probability can be defined on events involving a random variable
X.

Consider the event { X ≤ x} and any event B involving the random variable X. The
conditional distribution function of X given B is defined as

F_X(x/B) = P[{X ≤ x} / B]
= P[{X ≤ x} ∩ B]/P(B),   P(B) ≠ 0

We can verify that FX ( x / B ) satisfies all the properties of the distribution function.

In a similar manner, we can define the conditional density function f X ( x / B ) of the


random variable X given the event B as

f_X(x/B) = (d/dx) F_X(x/B)

Example 1: Suppose X is a random variable with the distribution function F_X(x). Define B = {X ≤ b}.

Then
F_X(x/B) = P[{X ≤ x} ∩ B]/P(B)
= P[{X ≤ x} ∩ {X ≤ b}]/P{X ≤ b}
= P[{X ≤ x} ∩ {X ≤ b}]/F_X(b)

Case 1: x < b
Then {X ≤ x} ∩ {X ≤ b} = {X ≤ x}, so
F_X(x/B) = P{X ≤ x}/F_X(b) = F_X(x)/F_X(b)
and f_X(x/B) = (d/dx)[F_X(x)/F_X(b)] = f_X(x)/F_X(b)

Case 2: x ≥ b
Then {X ≤ x} ∩ {X ≤ b} = {X ≤ b}, so
F_X(x/B) = P{X ≤ b}/F_X(b) = 1
and f_X(x/B) = (d/dx) F_X(x/B) = 0

FX ( x / B ) and f X ( x / B ) are plotted in the following figures.

Remark: We can state Bayes' rule in a similar manner.

Suppose the sample space is partitioned into non-overlapping subsets B_1, B_2, ..., B_n such that S = ∪_{i=1}^n B_i.

Then F_X(x) = ∑_{i=1}^n P(B_i) F_X(x/B_i)

∴ P(B / {X ≤ x}) = P[B ∩ {X ≤ x}] / P{X ≤ x}
= P(B) F_X(x/B) / ∑_{i=1}^n P(B_i) F_X(x/B_i)
Mixed Type Random variable:

A random variable X is said to be of mixed type if its distribution function F_X(x) is discontinuous at a countable number of points and also increases strictly over at least one interval of values of the random variable X.

Thus for a mixed-type random variable X, F_X(x) has discontinuities, but it is not of the staircase type as in the case of a discrete random variable. A typical plot of the distribution function of a mixed-type random variable is shown in the figure.

Suppose S_D denotes the countable subset of points of S_X at which the RV X is characterized by the probability mass function p_X(x), x ∈ S_D. Similarly, let S_C be a continuous subset of points of S_X on which the RV is characterized by the probability density function f_X(x), x ∈ S_C.

Clearly the subsets S_D and S_C partition the set S_X. If P(S_D) = p, then P(S_C) = 1 − p.

Thus the probability of the event { X ≤ x} can be expressed as

P { X ≤ x} = P ( S D ) P ({ X ≤ x} | S D ) + P ( SC ) P ({ X ≤ x} | SC )
= pFD ( x ) + (1 − p ) FC ( x )

where F_D(x) is the conditional distribution function of X given that X is discrete and F_C(x) is the conditional distribution function given that X is continuous.

Clearly F_D(x) is a staircase function and F_C(x) is a continuous function.


Also note that,

p = P ( SD )
= ∑
x∈ X D
pX ( x )

Example: Consider the distribution function of a random variable X given by

F_X(x) = 0,                    x < 0
       = 1/4 + x/16,           0 ≤ x < 4
       = 3/4 + (x − 4)/16,     4 ≤ x ≤ 8
       = 1,                    x > 8

Express FX ( x ) as the distribution function of a mixed type random variable.

Solution:

The distribution function FX ( x ) is as


shown in the figure.

Clearly FX ( x ) has jumps


at x = 0 and x = 4 .

∴ p = p_X(0) + p_X(4) = 1/4 + 1/4 = 1/2

and

F_D(x) = [p_X(0) u(x) + p_X(4) u(x − 4)] / p
       = (1/2) u(x) + (1/2) u(x − 4)

F_C(x) = 0       for −∞ < x < 0
       = x/8     for 0 ≤ x ≤ 8
       = 1       for x > 8

Example 2:

X is the RV representing the lifetime of a device, with pdf f_X(x) for x > 0.

Define the following random variable:

Y = X   if X ≤ a
  = a   if X > a

Here S_D = {a} and S_C = (0, a).

p = P{Y ∈ S_D}
= P{X > a}
= 1 − F_X(a)

∴ F_Y(x) = p F_D(x) + (1 − p) F_C(x)
Discrete, Continuous and Mixed-type random variables

• A random variable X is called a discrete random variable if F_X(x) is piece-wise constant. Thus F_X(x) is flat except at the points of jump discontinuity. If the sample space S is discrete, the random variable X defined on it is always discrete.
• X is called a continuous random variable if F_X(x) is an absolutely continuous function of x. Thus F_X(x) is continuous everywhere on ℝ and F_X′(x) exists everywhere except at a finite or countably infinite number of points.
• X is called a mixed random variable if F_X(x) has jump discontinuities at a countable number of points and it increases continuously over at least one interval of values of x. For such an RV X,
F_X(x) = p F_Xd(x) + (1 − p) F_Xc(x)
where F_Xd(x) is the distribution function of a discrete RV and F_Xc(x) is the distribution function of a continuous RV. Typical plots of F_X(x) for discrete, continuous and mixed-type random variables are shown in the figures below.

[Fig.: Plot of F_X(x) vs. x for a discrete random variable]

[Fig.: Plot of F_X(x) vs. x for a continuous random variable]

[Fig.: Plot of F_X(x) vs. x for a mixed-type random variable]

Discrete Random Variables and Probability mass functions

A random variable is said to be discrete if the number of elements in the range of RX is finite or
countably infinite. Examples 1 and 2 are discrete random variables.
Assume R_X to be finite (the countably infinite case is similar), with elements x_1, x_2, x_3, ..., x_N. Here the mapping X(s) partitions S into N subsets {s | X(s) = x_i}, i = 1, 2, ..., N.
The discrete random variable in this case is completely specified by the probability mass function (pmf) p_X(x_i) = P(s | X(s) = x_i), i = 1, 2, ..., N.

Clearly,

• p_X(x_i) ≥ 0 for all x_i ∈ R_X, and
• ∑_{x_i ∈ R_X} p_X(x_i) = 1
• Suppose D ⊆ R_X. Then
P({X ∈ D}) = ∑_{x_i ∈ D} p_X(x_i)

[Fig.: Discrete random variable - the sample points s1, s2, s3, s4 are mapped to the values X(s1), X(s2), X(s3), X(s4)]

Example

Consider the random variable X with the distribution function

F_X(x) = 0       for x < 0
       = 1/4     for 0 ≤ x < 1
       = 1/2     for 1 ≤ x < 2
       = 1       for x ≥ 2
The plot of the FX ( x) is shown in Fig.


FX ( x)

1
2

1
4

0 1 2 x

The probability mass function of the random variable is given by

Value of the random variable X = x    p_X(x)
0                                     1/4
1                                     1/4
2                                     1/2

We shall describe some useful discrete probability mass functions in a later class.

Continous Random Variables and Probability Density Functions

For a continuous random variable X, F_X(x) is continuous everywhere. Therefore, F_X(x) = F_X(x⁻) for all x ∈ ℝ. This implies that

p_X(x) = P({X = x})
= F_X(x) − F_X(x⁻)
= 0

Therefore, the probability mass function of a continuous RV X is zero for all x. A continuous random variable cannot be characterized by the probability mass function. Instead, a continuous random variable has a very important characterisation in terms of a function called the probability density function.

If F_X(x) is differentiable, the probability density function (pdf) of X, denoted by f_X(x), is defined as

f_X(x) = (d/dx) F_X(x)

Interpretation of f_X(x)

f_X(x) = (d/dx) F_X(x)
= lim_{Δx→0} [F_X(x + Δx) − F_X(x)]/Δx
= lim_{Δx→0} P({x < X ≤ x + Δx})/Δx

so that

P({x < X ≤ x + Δx}) ≈ f_X(x) Δx.

Thus the probability of X lying in some small interval (x, x + Δx] is determined by f_X(x). In that sense, f_X(x) represents the concentration of probability just as the density represents the concentration of mass.

Properties of the Probability Density Function

• f_X(x) ≥ 0.
This follows from the fact that F_X(x) is a non-decreasing function.
• F_X(x) = ∫_{−∞}^{x} f_X(u) du
• ∫_{−∞}^{∞} f_X(x) dx = 1
• P(x_1 < X ≤ x_2) = ∫_{x_1}^{x_2} f_X(x) dx

[Fig.: P({x_0 < X ≤ x_0 + Δx_0}) ≈ f_X(x_0) Δx_0, the area under f_X(x) over the interval (x_0, x_0 + Δx_0]]

Example Consider the random variable X with the distribution function

F_X(x) = 0                        for x < 0
       = 1 − e^(−ax), a > 0,      for x ≥ 0

The pdf of the RV is given by

f_X(x) = 0                        for x < 0
       = a e^(−ax), a > 0,        for x ≥ 0

Remark: Using the Dirac delta function we can define a density function for a discrete random variable.
Consider the random variable X defined by the probability mass function (pmf) p_X(x_i) = P(s | X(s) = x_i), i = 1, 2, ..., N.
The distribution function F_X(x) can be written as

F_X(x) = ∑_{i=1}^N p_X(x_i) u(x − x_i)

where u(x − x_i) is the shifted unit-step function given by

u(x − x_i) = 1 for x ≥ x_i
           = 0 otherwise

Then the density function f_X(x) can be written in terms of the Dirac delta function as

f_X(x) = ∑_{i=1}^N p_X(x_i) δ(x − x_i)
Example
Consider the random variable defined in Example 1 and Example 3. The distribution
function FX ( x) can be written as

F_X(x) = (1/4) u(x) + (1/4) u(x − 1) + (1/2) u(x − 2)
and
f_X(x) = (1/4) δ(x) + (1/4) δ(x − 1) + (1/2) δ(x − 2)

Probability Density Function of Mixed-type Random Variable


Suppose X is a mixed-type random variable with F_X(x) having jump discontinuities at X = x_i, i = 1, 2, ..., n. As already stated, the CDF of a mixed-type random variable X is given by
F_X(x) = p F_Xd(x) + (1 − p) F_Xc(x)
where F_Xd(x) is the distribution function of a discrete RV and F_Xc(x) is the distribution function of a continuous RV. Therefore,

f_X(x) = p f_Xd(x) + (1 − p) f_Xc(x)

where

f_Xd(x) = ∑_{i=1}^n p_X(x_i) δ(x − x_i)

Example Consider the random variable X with the distribution function

F_X(x) = 0              for x < 0
       = 0.1            for x = 0
       = 0.1 + 0.8x     for 0 < x < 1
       = 1              for x ≥ 1

The plot of F_X(x) is shown in Fig.

F_X(x) can be expressed as

F_X(x) = 0.2 F_Xd(x) + 0.8 F_Xc(x)

where
F_Xd(x) = 0       for x < 0
        = 0.5     for 0 ≤ x < 1
        = 1       for x ≥ 1
and
F_Xc(x) = 0       for x < 0
        = x       for 0 ≤ x ≤ 1
        = 1       for x > 1

The pdf is given by

f_X(x) = 0.2 f_Xd(x) + 0.8 f_Xc(x)

where
f_Xd(x) = 0.5 δ(x) + 0.5 δ(x − 1)
and
f_Xc(x) = 1   for 0 ≤ x ≤ 1
        = 0   elsewhere

[Fig.: plot of f_X(x), showing impulses at x = 0 and x = 1 and a uniform density of height 0.8 on (0, 1)]
Functions of Random Variables

Often we have to consider random variables which are functions of other random
variables. Let X be a random variable and let g(·) be a function defined on the real line. Then Y = g(X) is a random variable. We are interested in finding the pdf of Y. For example, suppose X represents the random voltage input to a full-wave rectifier. Then the rectifier output Y is given by Y = |X|. We have to find the probability description of the random variable Y. We consider the following cases:

(a) X is a discrete random variable with probability mass function p X ( x)

The probability mass function of Y is given by


p_Y(y) = P({Y = y})
= P({x | g(x) = y})
= ∑_{x: g(x)=y} P({X = x})
= ∑_{x: g(x)=y} p_X(x)

(b) X is a continuous random variable with probability density function f X ( x) and


y = g ( x ) is one-to-one and monotonically increasing

The probability distribution function of Y is given by


F_Y(y) = P{Y ≤ y}
= P{g(X) ≤ y}
= P{X ≤ g^(−1)(y)}
= F_X(x) |_{x = g^(−1)(y)}

∴ f_Y(y) = dF_Y(y)/dy
= [dF_X(x)/dy]_{x = g^(−1)(y)}
= [(dF_X(x)/dx)(dx/dy)]_{x = g^(−1)(y)}
= [f_X(x)/g′(x)]_{x = g^(−1)(y)}

Thus
f_Y(y) = f_X(x)/(dy/dx) = [f_X(x)/g′(x)]_{x = g^(−1)(y)}

This is illustrated in Fig.


Example 1: Probability density function of a linear function of random variable
Suppose Y = aX + b, a > 0.
Then x = (y − b)/a and dy/dx = a
∴ f_Y(y) = f_X(x)/(dy/dx) = (1/a) f_X((y − b)/a)
Example 2: Probability density function of the distribution function of a random
variable
Suppose the distribution function FX ( x) of a continuous random variable X is
monotonically increasing and one-to-one and define the random variable
Y = F_X(X). Then f_Y(y) = 1 for 0 ≤ y ≤ 1.
Here y = F_X(x), so clearly 0 ≤ y ≤ 1, and
dy/dx = dF_X(x)/dx = f_X(x)
∴ f_Y(y) = f_X(x)/(dy/dx) = f_X(x)/f_X(x) = 1
∴ f_Y(y) = 1, 0 ≤ y ≤ 1.
Remark
(1) The distribution given by f_Y(y) = 1, 0 ≤ y ≤ 1, is called the uniform distribution over the interval [0, 1].
(2) The above result is particularly important for simulating a random variable with a particular distribution function. We assumed F_X(x) to be a one-to-one function for invertibility. However, the result is more general - the random variable defined by the distribution function of any random variable is uniformly distributed over [0, 1]. For example, if X is a discrete RV,
F_Y(y) = P(Y ≤ y)
= P(F_X(X) ≤ y)
= P(X ≤ F_X^(−1)(y))
= F_X(F_X^(−1)(y))
= y   (assigning F_X^(−1)(y) to the left-most point of the interval for which F_X(x) = y)
∴ f_Y(y) = dF_Y(y)/dy = 1, 0 ≤ y ≤ 1.

[Fig.: Y = F_X(X); the value y corresponds to x = F_X^(−1)(y)]
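This is exactly how random variables with a prescribed CDF are generated in practice (inverse-transform sampling). The sketch below is our own illustration under the assumption F_X(x) = 1 − e^(−ax), whose inverse is F_X^(−1)(u) = −ln(1 − u)/a; a uniform sample pushed through F_X^(−1) reproduces the target distribution.

import math
import random

def sample_exponential(a, n, seed=0):
    rng = random.Random(seed)
    # U = rng.random() is uniform on [0, 1); X = F^{-1}(U) then has CDF 1 - exp(-a*x).
    return [-math.log(1.0 - rng.random()) / a for _ in range(n)]

a = 2.0
xs = sample_exponential(a, 200000)
for x in (0.2, 0.5, 1.0):
    empirical = sum(1 for v in xs if v <= x) / len(xs)
    print(x, round(empirical, 3), round(1 - math.exp(-a * x), 3))   # empirical vs exact CDF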

(c) X is a continuous random variable with probability density function f X ( x) and


y = g ( x) has multiple solutions for x

Suppose that for a given y, the equation y = g(x) has solutions x_i, i = 1, 2, 3, ..., n. Then

f_Y(y) = ∑_{i=1}^n f_X(x_i) / |dy/dx|_{x = x_i}

Proof:

Consider the plot of Y = g ( X ) . Suppose at a point y = g ( x) , we have three distinct roots


as shown. Consider the event { y < Y ≤ y + dy} . This event will be equivalent to union
events

{ x1 < X ≤ x1 + dx1} , { x2 − dx2 < X ≤ x2 } and { x3 < X ≤ x3 + dx3}

∴ P { y < Y ≤ y + dy} = P { x1 < X ≤ x1 + dx1} + P { x2 − dx2 < X ≤ x2 } + P { x3 < X ≤ x3 + dx3 }

∴ fY ( y )dy = f X ( x1 )dx1 + f X ( x2 )(−dx2 ) + f X ( x3 )dx3

Where the negative sign in −dx2 is used to account for positive probability.

Therefore, dividing by dy and taking the limit, we get


f_Y(y) = f_X(x_1)(dx_1/dy) + f_X(x_2)(−dx_2/dy) + f_X(x_3)(dx_3/dy)
= f_X(x_1)|dx_1/dy| + f_X(x_2)|dx_2/dy| + f_X(x_3)|dx_3/dy|
= ∑_{i=1}^3 f_X(x_i)/|dy/dx|_{x = x_i}

In the above we assumed y = g(x) to have three roots. In general, if y = g(x) has n roots, then

f_Y(y) = ∑_{i=1}^n f_X(x_i)/|dy/dx|_{x = x_i}

Example 3: Probability density function of a linear function of a random variable

Suppose Y = aX + b, a ≠ 0.
Then x = (y − b)/a and |dy/dx| = |a|
∴ f_Y(y) = f_X(x)/|dy/dx| = (1/|a|) f_X((y − b)/a)

Example 4: Probability density function of the output of a full-wave rectifier

Suppose Y = X , − a ≤ X ≤ a, a>0
Y

−y y X

dy
y = x has two solutions x1 = y and x2 = − y and = 1 at each solution point.
dx

f X ( x) ]x = y f X ( x)]x =− y
∴ fY ( y ) = +
1 1
= f X ( x) + f X (− x)

Example 5: Probability density function of the output a square-law device

Y = CX 2 , C ≥ 0

y
∴ y = cx 2 => x = ± y≥0
c

dy dy
And = 2cx so that = 2c y / c = 2 cy
dx dx

∴ fY ( y ) =
fX ( )
y / c + fX ( −y /c ) y>0
2 cy

= 0 otherwise
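A Monte Carlo check of this formula (an illustrative sketch with c = 1 and X uniform on [−1, 1], so f_X(x) = 1/2 there and the formula gives f_Y(y) = 1/(2√y) on (0, 1)):

import math
import random

rng = random.Random(0)
c = 1.0
samples = [c * (2.0 * rng.random() - 1.0) ** 2 for _ in range(500000)]   # Y = c*X^2

def f_Y(y):
    return (0.5 + 0.5) / (2.0 * math.sqrt(c * y))   # the analytic density derived above

width = 0.1
for lo in (0.05, 0.25, 0.55, 0.85):
    hist = sum(1 for s in samples if lo <= s < lo + width) / (len(samples) * width)
    print(round(lo, 2), round(hist, 3), round(f_Y(lo + width / 2), 3))   # histogram vs formula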
Expected Value of a Random Variable

• The expectation operation extracts a few parameters of a random variable and


provides a summary description of the random variable in terms of these
parameters.
• It is far easier to estimate these parameters from data than to estimate the
distribution or density function of the random variable.

Expected value or mean of a random variable

The expected value of a random variable X is defined by

EX = ∫_{−∞}^{∞} x f_X(x) dx

provided the integral exists.
EX is also called the mean or statistical average of the random variable X and is denoted by μ_X.

Note that, for a discrete RV X defined by the probability mass function (pmf) p_X(x_i), i = 1, 2, ..., N, the pdf f_X(x) is given by
f_X(x) = ∑_{i=1}^N p_X(x_i) δ(x − x_i)
∴ μ_X = EX = ∫_{−∞}^{∞} x ∑_{i=1}^N p_X(x_i) δ(x − x_i) dx
= ∑_{i=1}^N p_X(x_i) ∫_{−∞}^{∞} x δ(x − x_i) dx
= ∑_{i=1}^N x_i p_X(x_i)

Thus for a discrete random variable X with pmf p_X(x_i), i = 1, 2, ..., N,

μ_X = ∑_{i=1}^N x_i p_X(x_i)

Interpretation of the mean


• The mean gives an idea about the average value of the random variable. The values of the random variable are spread about this value.
• Observe that
μ_X = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{−∞}^{∞} x f_X(x) dx / ∫_{−∞}^{∞} f_X(x) dx      (since ∫_{−∞}^{∞} f_X(x) dx = 1)
Therefore, the mean can also be interpreted as the centre of gravity of the pdf curve.
Fig. Mean of a random variable

Example 1 Suppose X is a random variable defined by the pdf


f_X(x) = 1/(b − a)   for a ≤ x ≤ b
       = 0           otherwise

Then

EX = ∫_{−∞}^{∞} x f_X(x) dx
= ∫_a^b x/(b − a) dx
= (a + b)/2
Example 2 Consider the random variable X with pmf as tabulated below

Value of the random variable x    0      1      2      3
p_X(x)                            1/8    1/8    1/4    1/2

∴ μ_X = ∑_{i=1}^N x_i p_X(x_i)
= 0 × 1/8 + 1 × 1/8 + 2 × 1/4 + 3 × 1/2
= 17/8

Remark If f_X(x) is an even function of x, then ∫_{−∞}^{∞} x f_X(x) dx = 0. Thus the mean of an RV with an even symmetric pdf is 0.
Expected value of a function of a random variable
Suppose Y = g(X) is a function of a random variable X as discussed in the last class.

Then EY = Eg(X) = ∫_{−∞}^{∞} g(x) f_X(x) dx

We shall illustrate the theorem in the special case when y = g(x) is a one-to-one and monotonically increasing function of x. In this case,

f_Y(y) = [f_X(x)/g′(x)]_{x = g^(−1)(y)}

EY = ∫_{−∞}^{∞} y f_Y(y) dy
= ∫_{y_1}^{y_2} y f_X(g^(−1)(y))/g′(g^(−1)(y)) dy

where y_1 = g(−∞) and y_2 = g(∞). Substituting x = g^(−1)(y), so that y = g(x) and dy = g′(x) dx, we get

EY = ∫_{−∞}^{∞} g(x) f_X(x) dx
The following important properties of the expectation operation can be immediately
derived:

(a) If c is a constant,
Ec = c

Clearly
Ec = ∫_{−∞}^{∞} c f_X(x) dx = c ∫_{−∞}^{∞} f_X(x) dx = c

(b) If g_1(X) and g_2(X) are two functions of the random variable X and c_1 and c_2 are constants,
E[c_1 g_1(X) + c_2 g_2(X)] = c_1 Eg_1(X) + c_2 Eg_2(X)

E[c_1 g_1(X) + c_2 g_2(X)] = ∫_{−∞}^{∞} [c_1 g_1(x) + c_2 g_2(x)] f_X(x) dx
= ∫_{−∞}^{∞} c_1 g_1(x) f_X(x) dx + ∫_{−∞}^{∞} c_2 g_2(x) f_X(x) dx
= c_1 ∫_{−∞}^{∞} g_1(x) f_X(x) dx + c_2 ∫_{−∞}^{∞} g_2(x) f_X(x) dx
= c_1 Eg_1(X) + c_2 Eg_2(X)
The above property means that E is a linear operator.

Mean-square value

EX² = ∫_{−∞}^{∞} x² f_X(x) dx
Variance
For a random variable X with the pdf f_X(x) and mean µ_X, the variance of X is
denoted by σ_X² and defined as

σ_X² = E(X − µ_X)² = ∫_{−∞}^{∞} (x − µ_X)² f_X(x) dx

Thus for a discrete random variable X with p X ( xi ), i = 1, 2,...., N ,

σ_X² = Σ_{i=1}^{N} (x_i − µ_X)² p_X(x_i)

The standard deviation of X is defined as


σ_X = √(E(X − µ_X)²)

Example 3 Find the variance of the random variable discussed in Example 1.

σ_X² = E(X − µ_X)²
= ∫_a^b (x − (a + b)/2)² · 1/(b − a) dx
= (1/(b − a)) [ ∫_a^b x² dx − 2 · ((a + b)/2) ∫_a^b x dx + ((a + b)/2)² ∫_a^b dx ]
= (b − a)²/12

Example 4 Find the variance of the random variable discussed in Example 2.


As already computed, µ_X = 17/8.
σ_X² = E(X − µ_X)²
= (0 − 17/8)² × 1/8 + (1 − 17/8)² × 1/8 + (2 − 17/8)² × 1/4 + (3 − 17/8)² × 1/2
= 71/64
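As a quick numerical cross-check of Examples 2 and 4, here is a minimal Python sketch (illustrative, not part of the original notes) that computes the mean and variance directly from the tabulated pmf:

# Mean and variance of a discrete RV from its pmf (check of Examples 2 and 4)
values = [0, 1, 2, 3]
probs  = [1/8, 1/8, 1/4, 1/2]

mean = sum(x * p for x, p in zip(values, probs))                 # E[X]
var  = sum((x - mean) ** 2 * p for x, p in zip(values, probs))   # E[(X - mu)^2]

print(mean, var)   # 2.125 = 17/8 and 1.109375 = 71/64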
Remark
• Variance is a central moment and measure of dispersion of the random
variable about the mean.
• E(X − µ_X)² is the average of the squared deviation from the mean. It
gives information about the deviation of the values of the RV about the
mean. A smaller σ_X² implies that the random values are more clustered
about the mean; similarly, a bigger σ_X² means that the random values are
more scattered.
For example, consider two random variables X_1 and X_2 with pmfs as
shown below. Note that each of X_1 and X_2 has zero mean. σ²_{X_1} = 1/2 and
σ²_{X_2} = 5/3, implying that X_2 has more spread about the mean.

x:          −1     0     1
p_{X_1}(x):  1/4   1/2   1/4

x:          −2     −1     0     1     2
p_{X_2}(x):  1/6    1/6   1/3   1/6   1/6

[Figure: pmfs of X_1 and X_2]

Fig. shows the pdfs of two continuous random variables with the same mean but different
variances.
• We could have used the mean absolute deviation E|X − µ_X| for the same
purpose, but it is more difficult to handle both analytically and numerically.

Properties of variance
(1) σ X2 = EX 2 − µ X2
σ X2 = E ( X − µ X ) 2
= E ( X 2 − 2 µ X X + µ X2 )
= EX 2 − 2 µ X EX + E µ X2
= EX 2 − 2 µ X2 + µ X2
= EX 2 − µ X2

(2) If Y = cX + b, where c and b are constants, then σ Y2 = c 2σ X2


σ Y2 = E (cX + b − c µ X − b) 2
= Ec 2 ( X − µ X ) 2
= c 2σ X2

(3) If c is a constant,
var(c ) = 0.
nth moment of a random variable
We can define the nth moment and the nth central-moment of a random variable X
by the following relations

nth-order moment: EX^n = ∫_{−∞}^{∞} x^n f_X(x) dx, n = 1, 2, ...

nth-order central moment: E(X − µ_X)^n = ∫_{−∞}^{∞} (x − µ_X)^n f_X(x) dx, n = 1, 2, ...

Note that
• The mean µ X = EX is the first moment and the mean-square value EX 2
is the second moment
• The first central moment is 0 and the variance σ X2 = E ( X − µ X ) 2 is the
second central moment
• The third central moment measures the lack of symmetry of the pdf of a random
variable. E(X − µ_X)³/σ_X³ is called the coefficient of skewness; if the pdf is
symmetric, this coefficient is zero.
• The fourth central moment measures the flatness or peakedness of the pdf of a
random variable. E(X − µ_X)⁴/σ_X⁴ is called the kurtosis. If the peak of the pdf is
sharper, then the random variable has a higher kurtosis.


Inequalities based on expectations
The mean and variance also give some quantitative information about the bounds of
RVs. The following inequalities are extremely useful in many practical problems.

Chebyshev Inequality
Suppose X is a parameter of a manufactured item with known mean µ_X and variance σ_X². The quality control department rejects the item if the absolute
deviation of X from µ_X is greater than 2σ_X. What fraction of the manufactured
items does the quality control department reject? Can you roughly guess it?
The standard deviation gives us an intuitive idea of how the random variable is
distributed about the mean. This idea is made more precise in the remarkable
Chebyshev inequality stated below. For a random variable X with mean µ_X and variance σ_X²,

P{|X − µ_X| ≥ ε} ≤ σ_X²/ε²
Proof:

σ_X² = ∫_{−∞}^{∞} (x − µ_X)² f_X(x) dx
≥ ∫_{|x − µ_X| ≥ ε} (x − µ_X)² f_X(x) dx
≥ ∫_{|x − µ_X| ≥ ε} ε² f_X(x) dx
= ε² P{|X − µ_X| ≥ ε}

∴ P{|X − µ_X| ≥ ε} ≤ σ_X²/ε²

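The bound can be checked empirically. The sketch below (illustrative Python, assuming numpy is available) compares the empirical probability P{|X − µ_X| ≥ ε} with the Chebyshev bound σ_X²/ε² for an exponential random variable:

import numpy as np

rng = np.random.default_rng(0)
lam = 1.0
x = rng.exponential(scale=1/lam, size=100_000)   # mean 1/lam, variance 1/lam^2

mu, var = 1/lam, 1/lam**2
for eps in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x - mu) >= eps)   # P{|X - mu| >= eps}
    bound = var / eps**2                         # Chebyshev bound
    print(eps, empirical, bound)                 # empirical <= bound in each case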
Markov Inequality
For a random variable X which takes only nonnegative values,
P{X ≥ a} ≤ E(X)/a, where a > 0.

E(X) = ∫_0^∞ x f_X(x) dx
≥ ∫_a^∞ x f_X(x) dx
≥ ∫_a^∞ a f_X(x) dx
= a P{X ≥ a}

∴ P{X ≥ a} ≤ E(X)/a

Remark: P{(X − k)² ≥ a} ≤ E(X − k)²/a
Example A nonnegative RV X has the mean µ_X = 1. Find an upper bound on the
probability P(X ≥ 3).
By Markov's inequality,
P(X ≥ 3) ≤ E(X)/3 = 1/3.
Hence the required upper bound is 1/3.
Characteristic Functions of Random Variables

Just as discrete-time and continuous-time signals have frequency-domain characterisations, the
probability mass function and the probability density function can also be characterised in the
frequency domain by means of the characteristic function of a random variable. These functions
are particularly important in

• calculating the moments of a random variable
• evaluating the pdf of combinations of multiple RVs.

Characteristic function

Consider a random variable X with probability density function f_X(x). The characteristic
function of X, denoted by φ_X(ω), is defined as
φ_X(ω) = E e^{jωX} = ∫_{−∞}^{∞} e^{jωx} f_X(x) dx
where j = √(−1).
Note the following:

• φ_X(ω) is a complex quantity, representing the Fourier transform of f_X(x) with
e^{jωX} used instead of e^{−jωX}. This implies that the properties of the Fourier
transform apply to the characteristic function.
• The interpretation of φ_X(ω) as the expectation of e^{jωX} helps in calculating moments
with the help of the characteristic function.
• As |e^{jωx}| = 1 and ∫_{−∞}^{∞} f_X(x) dx = 1, φ_X(ω) always exists.

[Recall that the Fourier transform of a function f(t) exists if ∫_{−∞}^{∞} |f(t)| dt < ∞, i.e., f(t) is absolutely
integrable.]

We can get f_X(x) from φ_X(ω) by the inverse transform

f_X(x) = (1/2π) ∫_{−∞}^{∞} φ_X(ω) e^{−jωx} dω

Example 1 Consider the random variable X with pdf f_X(x) given by

f_X(x) = 1/(b − a), a ≤ x ≤ b
= 0 otherwise.

Solution: The characteristic function is given by

φ_X(ω) = ∫_a^b (1/(b − a)) e^{jωx} dx
= (1/(b − a)) [e^{jωx}/(jω)]_a^b
= (e^{jωb} − e^{jωa}) / (jω(b − a))

Example 2 The characteristic function of the random variable X with


f_X(x) = λe^{−λx}, λ > 0, x > 0, is

φ_X(ω) = ∫_0^∞ e^{jωx} λe^{−λx} dx
= λ ∫_0^∞ e^{−(λ − jω)x} dx
= λ/(λ − jω)

Characteristic function of a discrete random variable:

Suppose X is a random variable taking values from the discrete set RX = { x1 , x2 ,.....} with
corresponding probability mass function p X ( xi ) for the value xi .

Then

φ_X(ω) = E e^{jωX} = Σ_{x_i ∈ R_X} p_X(x_i) e^{jωx_i}

Note that φ_X(ω) can be interpreted as the discrete-time Fourier transform with e^{jωx_i}
substituting e^{−jωx_i} in the original discrete-time Fourier transform. The inverse relation (for integer-valued x_i) is
p_X(x_i) = (1/2π) ∫_{−π}^{π} e^{−jωx_i} φ_X(ω) dω

Example 3 Suppose X is a random variable with the probability mass function

p_X(k) = ⁿC_k p^k (1 − p)^{n−k}, k = 0, 1, ...., n

Then
φ_X(ω) = Σ_{k=0}^{n} ⁿC_k p^k (1 − p)^{n−k} e^{jωk}
= Σ_{k=0}^{n} ⁿC_k (pe^{jω})^k (1 − p)^{n−k}
= [pe^{jω} + (1 − p)]^n   (using the binomial theorem)

Example 4 The characteristic function of the discrete random variable X with

p_X(k) = p(1 − p)^k, k = 0, 1, .... is

φ_X(ω) = Σ_{k=0}^{∞} e^{jωk} p(1 − p)^k
= p Σ_{k=0}^{∞} [(1 − p)e^{jω}]^k
= p / (1 − (1 − p)e^{jω})
Moments and the characteristic function

Given the characteristic function φ_X(ω), the kth moment is given by

EX^k = (1/j^k) d^k φ_X(ω)/dω^k |_{ω=0}

To prove this, consider the power series expansion of e^{jωX}:

e^{jωX} = 1 + jωX + (jω)²X²/2! + ...... + (jω)ⁿXⁿ/n! + ...

Taking the expectation of both sides and assuming EX, EX², ..., EXⁿ to exist, we get

φ_X(ω) = 1 + jω EX + (jω)² EX²/2! + ...... + (jω)ⁿ EXⁿ/n! + .....

Taking the first derivative of φ_X(ω) with respect to ω at ω = 0, we get

dφ_X(ω)/dω |_{ω=0} = j EX

Similarly, taking the nth derivative of φ_X(ω) with respect to ω at ω = 0, we get

dⁿφ_X(ω)/dωⁿ |_{ω=0} = jⁿ EXⁿ

Thus
EX = (1/j) dφ_X(ω)/dω |_{ω=0}
EXⁿ = (1/jⁿ) dⁿφ_X(ω)/dωⁿ |_{ω=0}

Example 5 First two moments of the random variable in Example 2

φ_X(ω) = λ/(λ − jω)
dφ_X(ω)/dω = jλ/(λ − jω)²
d²φ_X(ω)/dω² = 2j²λ/(λ − jω)³

EX = (1/j) · jλ/(λ − jω)² |_{ω=0} = 1/λ
EX² = (1/j²) · 2j²λ/(λ − jω)³ |_{ω=0} = 2/λ²

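The moment formula EXⁿ = (1/jⁿ) dⁿφ_X(ω)/dωⁿ|_{ω=0} can also be checked symbolically. The sketch below (illustrative Python, assuming sympy is installed) recovers EX = 1/λ and EX² = 2/λ² from φ_X(ω) = λ/(λ − jω):

import sympy as sp

w, lam = sp.symbols('w lam', positive=True)
j = sp.I
phi = lam / (lam - j*w)                         # characteristic function of the exponential RV

EX  = (sp.diff(phi, w, 1) / j**1).subs(w, 0)    # first moment
EX2 = (sp.diff(phi, w, 2) / j**2).subs(w, 0)    # second moment

print(sp.simplify(EX), sp.simplify(EX2))        # 1/lam and 2/lam**2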
Probability generating function:

If the random variable under consideration takes non-negative integer values only, it is convenient
to characterize the random variable in terms of the probability generating function G_X(z) defined
by

G_X(z) = E z^X = Σ_{k=0}^{∞} p_X(k) z^k

Note that
• G_X(z) is related to the z-transform; in the actual z-transform, z^{−k} is used instead of z^k.
• The characteristic function of X is given by φ_X(ω) = G_X(e^{jω}).
• G_X(1) = Σ_{k=0}^{∞} p_X(k) = 1
• G_X'(z) = Σ_{k=0}^{∞} k p_X(k) z^{k−1}, so that G_X'(1) = Σ_{k=0}^{∞} k p_X(k) = EX
• G_X''(z) = Σ_{k=0}^{∞} k(k − 1) p_X(k) z^{k−2}, so that G_X''(1) = Σ_{k=0}^{∞} k² p_X(k) − Σ_{k=0}^{∞} k p_X(k) = EX² − EX

∴ σ_X² = EX² − (EX)² = G_X''(1) + G_X'(1) − (G_X'(1))²

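To illustrate these relations, here is a small symbolic sketch (illustrative Python, assuming sympy is available) that recovers the mean and variance of the geometric pmf p_X(k) = p(1 − p)^k from its pgf; the same case is worked out by hand in Example 2 below.

import sympy as sp

z, p = sp.symbols('z p', positive=True)
G = p / (1 - (1 - p)*z)              # pgf of the geometric pmf p_X(k) = p(1-p)^k, k = 0, 1, ...

G1 = sp.diff(G, z).subs(z, 1)        # G'(1)  = EX
G2 = sp.diff(G, z, 2).subs(z, 1)     # G''(1) = EX^2 - EX
mean = sp.simplify(G1)
var  = sp.simplify(G2 + G1 - G1**2)  # sigma^2 = G''(1) + G'(1) - (G'(1))^2

print(mean, var)                     # (1-p)/p and (1-p)/p**2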
Example 1: Binomial distribution

p_X(x) = ⁿC_x p^x (1 − p)^{n−x}, x = 0, 1, ..., n
∴ G_X(z) = Σ_x ⁿC_x p^x (1 − p)^{n−x} z^x
= Σ_x ⁿC_x (pz)^x (1 − p)^{n−x}
= (1 − p + pz)^n

G_X'(1) = EX = np
G_X''(1) = EX² − EX = n(n − 1)p²

∴ EX² = G_X''(1) + EX
= n(n − 1)p² + np
= n²p² + npq
Example 2: Geometric distribution

p_X(x) = p(1 − p)^x, x = 0, 1, ...

G_X(z) = Σ_x p(1 − p)^x z^x
= p Σ_x [(1 − p)z]^x
= p / (1 − (1 − p)z)

G_X'(z) = p(1 − p) / (1 − (1 − p)z)²
G_X'(1) = p(1 − p)/(1 − 1 + p)² = p(1 − p)/p² = q/p = EX

G_X''(z) = 2p(1 − p)² / (1 − (1 − p)z)³
G_X''(1) = 2pq²/p³ = 2(q/p)²

EX² = G_X''(1) + EX = 2(q/p)² + q/p

Var(X) = EX² − (EX)² = 2(q/p)² + q/p − (q/p)² = (q/p)² + q/p = q/p²
Moment Generating Function:

Sometimes it is convenient to work with a function similar to the Laplace transform, known as the moment generating function.

For a random variable X, the moment generating function M_X(s) is defined by

M_X(s) = E e^{sX} = ∫_{R_X} f_X(x) e^{sx} dx

where R_X is the range of the random variable X.

If X is a non-negative continuous random variable, we can write

M_X(s) = ∫_0^∞ f_X(x) e^{sx} dx

Note the following:

• M_X'(s) = ∫_0^∞ x f_X(x) e^{sx} dx, so that M_X'(0) = EX
• d^k M_X(s)/ds^k = ∫_0^∞ x^k f_X(x) e^{sx} dx, and evaluating at s = 0 gives EX^k.
Example Let X be a continuous random variable with
f_X(x) = α / (π(x² + α²)), −∞ < x < ∞, α > 0

Then
EX = ∫_{−∞}^{∞} x f_X(x) dx
Consider (α/π) ∫_0^∞ 2x/(x² + α²) dx = (α/π) ln(x² + α²)]_0^∞ → ∞

Hence EX does not exist. This density function is known as the Cauchy density function.


The joint characteristic function of two random variables X and Y is defined by,

φ_{X,Y}(ω_1, ω_2) = E e^{jω_1 X + jω_2 Y}
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) e^{jω_1 x + jω_2 y} dy dx

And the joint moment generating function φ_{X,Y}(s_1, s_2) is defined by

φ_{X,Y}(s_1, s_2) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) e^{s_1 x + s_2 y} dx dy
= E e^{s_1 X + s_2 Y}

If Z = aX + bY, then

φ_Z(s) = E e^{Zs} = E e^{(aX + bY)s} = φ_{X,Y}(as, bs)

Suppose X and Y are independent. Then

φ_{X,Y}(s_1, s_2) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{s_1 x + s_2 y} f_{X,Y}(x, y) dy dx
= ∫_{−∞}^{∞} e^{s_1 x} f_X(x) dx ∫_{−∞}^{∞} e^{s_2 y} f_Y(y) dy
= φ_X(s_1) φ_Y(s_2)

Particularly if Z=X+Y and X and Y are independent, then

φZ ( s) = φ X ,Y ( s, s )
= φ X ( s)φY ( s)
Using the convolution property of the Laplace transform, we get

f Z ( z ) = f X ( z ) * fY ( z )

Let us recall the MGF of a Gaussian random variable X ~ N(µ_X, σ_X²):

φ_X(s) = E e^{Xs}
= (1/(√(2π) σ_X)) ∫_{−∞}^{∞} e^{−(x − µ_X)²/2σ_X²} e^{xs} dx

Completing the square in the exponent,
−(x − µ_X)²/2σ_X² + xs = −(x − µ_X − σ_X² s)²/2σ_X² + µ_X s + σ_X² s²/2
so that
φ_X(s) = e^{µ_X s + σ_X² s²/2} (1/(√(2π) σ_X)) ∫_{−∞}^{∞} e^{−(x − µ_X − σ_X² s)²/2σ_X²} dx
= e^{µ_X s + σ_X² s²/2}
We have
φ_{X,Y}(s_1, s_2) = E e^{Xs_1 + Ys_2}
= E(1 + Xs_1 + Ys_2 + (Xs_1 + Ys_2)²/2! + ......)
= 1 + s_1 EX + s_2 EY + s_1² EX²/2 + s_2² EY²/2 + s_1 s_2 EXY + ......

Hence,
EX = ∂φ_{X,Y}(s_1, s_2)/∂s_1 ]_{s_1=0, s_2=0}
EY = ∂φ_{X,Y}(s_1, s_2)/∂s_2 ]_{s_1=0, s_2=0}
EXY = ∂²φ_{X,Y}(s_1, s_2)/∂s_1∂s_2 ]_{s_1=0, s_2=0}

We can generate the joint moments of the RVs from the moment generating function.
Important Discrete Random Variables

Discrete uniform random variable

A discrete random variable X is said to be a uniform random variable if it assumes each


of the values x1 , x2 ,...xn with equal probability. The probability mass function of the
uniform random variable X is given by

p_X(x_i) = 1/n, i = 1, 2, ..., n
Its pdf can be written in terms of delta functions as

f_X(x) = (1/n) Σ_{i=1}^{n} δ(x − x_i)

[Figure: pmf of the discrete uniform random variable — n equal spikes of height 1/n at x_1, x_2, ..., x_n]

Mean and variance of the Discrete uniform random variable


µ_X = EX = Σ_{i=1}^{n} x_i p_X(x_i) = (1/n) Σ_{i=1}^{n} x_i

EX² = Σ_{i=1}^{n} x_i² p_X(x_i) = (1/n) Σ_{i=1}^{n} x_i²

∴ σ_X² = EX² − µ_X² = (1/n) Σ_{i=1}^{n} x_i² − ((1/n) Σ_{i=1}^{n} x_i)²

Example: Suppose X is the random variable representing the outcome of a single roll of a
fair die. Then X can assume any of the 6 values in the set {1, 2, 3, 4, 5, 6} with the
probability mass function

p_X(x) = 1/6, x = 1, 2, 3, 4, 5, 6

Bernoulli random variable

Suppose X is a random variable that takes two values 0 and 1, with probability mass
functions

p X (1) = P { X = 1} = p
and p X (0) = 1 − p, 0 ≤ p ≤1

Such a random variable X is called a Bernoulli random variable, because it describes


the outcomes of a Bernoulli trial.

The typical cdf of the Bernoulli RV X is as shown in Fig.

[Figure: CDF of the Bernoulli RV — a step of height 1 − p at x = 0 and a step of height p at x = 1]

Remark We can define the pdf of X with the help of the delta function. Thus
f_X(x) = (1 − p)δ(x) + pδ(x − 1)

Example 1: Consider the experiment of tossing a biased coin. Suppose


P { H } = p and P {T } = 1 − p .

If we define the random variable X ( H ) = 1 and X (T ) = 0 then X is a Bernoulli random


variable.

Mean and variance of the Bernoulli random variable

µ_X = EX = Σ_{k=0}^{1} k p_X(k) = 1 × p + 0 × (1 − p) = p
EX² = Σ_{k=0}^{1} k² p_X(k) = 1 × p + 0 × (1 − p) = p

∴ σ_X² = EX² − µ_X² = p(1 − p)

Remark
• The Bernoulli RV is the simplest discrete RV. It can be used as the building block
for many discrete RVs.
• For the Bernoulli RV, EX^m = p for m = 1, 2, 3, .... Thus all the moments of the
Bernoulli RV have the same value p.

Binomial random variable:

Suppose X is a discrete random variable taking values from the set {0,1,......., n} . X is
called a binomial random variable with parameters n and 0 ≤ p ≤ 1 if
p_X(k) = ⁿC_k p^k (1 − p)^{n−k}, k = 0, 1, ..., n
where
ⁿC_k = n! / (k!(n − k)!)
As we have seen, the probability of k successes in n independent repetitions of a
Bernoulli trial is given by the binomial law. If X is a discrete random variable
representing the number of successes in this case, then X is a binomial random variable.
For example, the number of heads in n independent tosses of a fair coin is a binomial
random variable.

• The notation X ~ B(n, p) is used to represent a binomial RV with the parameters


n and p.
• p_X(k), k = 0, 1, ..., n, defines a valid probability mass function. This is because
Σ_{k=0}^{n} p_X(k) = Σ_{k=0}^{n} ⁿC_k p^k (1 − p)^{n−k} = [p + (1 − p)]^n = 1.

• The sum of n independent identically distributed Bernoulli random variables is a


binomial random variable.
• The binomial distribution is useful when there are two types of objects - good,
bad; correct, erroneous; healthy , diseased etc.

Example: In a binary communication system, the probability of bit error is 0.01. If
a block of 8 bits is transmitted, find the probability that
(a) exactly 2 bit errors will occur
(b) at least 2 bit errors will occur
(c) all the bits will be erroneous
Suppose X is the random variable representing the number of bit errors in a block of
8 bits. Then X ~ B(8, 0.01). Therefore,
(a) P(exactly 2 bit errors will occur) = p_X(2) = ⁸C_2 × 0.01² × 0.99⁶
(b) P(at least 2 bit errors will occur) = 1 − p_X(0) − p_X(1) = 1 − 0.99⁸ − ⁸C_1 × 0.01 × 0.99⁷
(c) P(all 8 bits will be erroneous) = p_X(8) = 0.01⁸ = 10⁻¹⁶

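These three probabilities are easy to check numerically. A minimal sketch (illustrative Python, standard library only):

from math import comb

n, p = 8, 0.01
pmf = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)   # binomial pmf

print(pmf(2))                  # (a) exactly 2 errors,  about 2.64e-3
print(1 - pmf(0) - pmf(1))     # (b) at least 2 errors, about 2.69e-3
print(pmf(8))                  # (c) all 8 bits erroneous = 1e-16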
The probability mass function for a binomial random variable with n = 6 and p =0.8 is
shown in the figure below.
Mean and variance of the Binomial random variable
We have
EX = Σ_{k=0}^{n} k p_X(k)
= Σ_{k=0}^{n} k ⁿC_k p^k (1 − p)^{n−k}
= 0 × (1 − p)^n + Σ_{k=1}^{n} k ⁿC_k p^k (1 − p)^{n−k}
= Σ_{k=1}^{n} k [n!/(k!(n − k)!)] p^k (1 − p)^{n−k}
= Σ_{k=1}^{n} [n!/((k − 1)!(n − k)!)] p^k (1 − p)^{n−k}
= np Σ_{k=1}^{n} [(n − 1)!/((k − 1)!(n − k)!)] p^{k−1} (1 − p)^{n−1−(k−1)}
= np Σ_{k_1=0}^{n−1} [(n − 1)!/(k_1!(n − 1 − k_1)!)] p^{k_1} (1 − p)^{n−1−k_1}   (substituting k_1 = k − 1)
= np (p + 1 − p)^{n−1}
= np
Similarly,
EX² = Σ_{k=0}^{n} k² p_X(k)
= Σ_{k=1}^{n} k² ⁿC_k p^k (1 − p)^{n−k}
= Σ_{k=1}^{n} k [n!/((k − 1)!(n − k)!)] p^k (1 − p)^{n−k}
= np Σ_{k=1}^{n} (k − 1 + 1) [(n − 1)!/((k − 1)!(n − k)!)] p^{k−1} (1 − p)^{n−1−(k−1)}
= np Σ_{k=1}^{n} (k − 1) [(n − 1)!/((k − 1)!(n − k)!)] p^{k−1} (1 − p)^{n−1−(k−1)} + np Σ_{k=1}^{n} [(n − 1)!/((k − 1)!(n − k)!)] p^{k−1} (1 − p)^{n−1−(k−1)}
= np × (n − 1)p + np   (the first sum is the mean of B(n − 1, p); the second sum equals 1)
= n(n − 1)p² + np
∴ σ_X² = variance of X = n(n − 1)p² + np − n²p² = np(1 − p)
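The results EX = np and σ_X² = np(1 − p) can be checked by simulation, using the fact (noted above) that a binomial RV is a sum of independent Bernoulli RVs. An illustrative Python sketch, assuming numpy is available:

import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 20, 0.3, 200_000

bernoulli = rng.random((trials, n)) < p    # each row: n independent Bernoulli(p) trials
x = bernoulli.sum(axis=1)                  # binomial(n, p) samples

print(x.mean(), n * p)                     # sample mean     vs  np
print(x.var(), n * p * (1 - p))            # sample variance vs  np(1-p)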
Geometric random variable:

Let X be a discrete random variable with range R_X = {1, 2, .......}. X is called a geometric
random variable if
p_X(k) = p(1 − p)^{k−1}, k = 1, 2, ..., 0 < p ≤ 1

• X describes the number of independent Bernoulli trials, each with probability of
success p, needed until the first 'success' occurs. If the first success occurs at the kth trial,
then the outcomes of the Bernoulli trials are
Failure, Failure, .., Failure (k − 1 times), Success
∴ p_X(k) = (1 − p)^{k−1} p = p(1 − p)^{k−1}
• R_X is countably infinite, because we may have to wait arbitrarily long before the
first success occurs.
• The geometric random variable X with the parameter p is denoted by
X ~ geo(p).
• The CDF of X ~ geo(p) is given by
F_X(k) = Σ_{i=1}^{k} (1 − p)^{i−1} p = 1 − (1 − p)^k
which gives the probability that the first 'success' will occur before the (k + 1)th trial.

The following figure shows the pmf of a random variable X ~ geo(p) for p = 0.25 and
p = 0.5. Observe that the plots have a mode at k = 1.
Example:
Suppose X is the random variable representing the number of independent tosses of a
coin until a head shows up. Clearly X is a geometric random variable.
Example: A fair die is rolled repeatedly. What is the probability that a '6' will show up
before the fourth roll?

Suppose X is the random variable representing the number of independent rolls of the
die until a '6' shows up. Clearly X is a geometric random variable with p = 1/6.
P(a '6' will show up before the 4th roll) = P(X = 1 or X = 2 or X = 3)
= p_X(1) + p_X(2) + p_X(3)
= p + p(1 − p) + p(1 − p)²
= p(3 − 3p + p²)
= 91/216

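A short check of this result (illustrative Python, assuming numpy is available): the exact pmf sum and a direct simulation both give about 91/216 ≈ 0.4213.

import numpy as np

p = 1/6
exact = sum(p * (1 - p)**(k - 1) for k in (1, 2, 3))   # P(X=1) + P(X=2) + P(X=3)

rng = np.random.default_rng(2)
rolls = rng.integers(1, 7, size=(200_000, 3))          # three die rolls per experiment
simulated = np.mean((rolls == 6).any(axis=1))          # a '6' appears within the first 3 rolls

print(exact, 91/216, simulated)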
Mean and variance of the Geometric random variable



Mean = EX = Σ_{k=1}^{∞} k p_X(k)
= Σ_{k=1}^{∞} k p(1 − p)^{k−1}
= p Σ_{k=1}^{∞} k(1 − p)^{k−1}
= −p (d/dp) Σ_{k=0}^{∞} (1 − p)^k
= −p (d/dp)(1/p)   (sum of the geometric series)
= p/p²
= 1/p

EX² = Σ_{k=1}^{∞} k² p_X(k)
= Σ_{k=1}^{∞} k² p(1 − p)^{k−1}
= p Σ_{k=1}^{∞} (k(k − 1) + k)(1 − p)^{k−1}
= p(1 − p) Σ_{k=2}^{∞} k(k − 1)(1 − p)^{k−2} + p Σ_{k=1}^{∞} k(1 − p)^{k−1}
= p(1 − p) (d²/dp²) Σ_{k=0}^{∞} (1 − p)^k + 1/p
= p(1 − p)(2/p³) + 1/p
= 2(1 − p)/p² + 1/p

∴ σ_X² = EX² − µ_X² = 2(1 − p)/p² + 1/p − 1/p² = (1 − p)/p²

Mean µ_X = 1/p
Variance σ_X² = (1 − p)/p²
Poisson Random Variable:

X is a Poisson random variable with the parameter λ > 0 if

p_X(k) = e^{−λ} λ^k / k!, k = 0, 1, 2, ....

The plot of the pmf of the Poisson RV is shown in Fig.

Remark

• p_X(k) = e^{−λ} λ^k / k! is a valid pmf, because
Σ_{k=0}^{∞} p_X(k) = Σ_{k=0}^{∞} e^{−λ} λ^k / k! = e^{−λ} Σ_{k=0}^{∞} λ^k / k! = e^{−λ} e^{λ} = 1

• The distribution is named after the French mathematician S.D. Poisson.


Mean and Variance of the Poisson RV

The mean of the Poisson RV X is given by

µ_X = Σ_{k=0}^{∞} k p_X(k)
= 0 + Σ_{k=1}^{∞} k e^{−λ} λ^k / k!
= λ e^{−λ} Σ_{k=1}^{∞} λ^{k−1} / (k − 1)!
= λ e^{−λ} e^{λ} = λ

EX² = Σ_{k=0}^{∞} k² p_X(k)
= 0 + Σ_{k=1}^{∞} k² e^{−λ} λ^k / k!
= e^{−λ} Σ_{k=1}^{∞} k λ^k / (k − 1)!
= e^{−λ} Σ_{k=1}^{∞} (k − 1 + 1) λ^k / (k − 1)!
= e^{−λ} Σ_{k=2}^{∞} λ^k / (k − 2)! + e^{−λ} Σ_{k=1}^{∞} λ^k / (k − 1)!
= e^{−λ} λ² Σ_{k=2}^{∞} λ^{k−2} / (k − 2)! + e^{−λ} λ Σ_{k=1}^{∞} λ^{k−1} / (k − 1)!
= e^{−λ} λ² e^{λ} + e^{−λ} λ e^{λ}
= λ² + λ

∴ σ_X² = EX² − µ_X² = λ

Example: The number of calls received in a telephone exchange follows a Poisson
distribution with an average of 10 calls per minute. What is the probability that in a one-
minute duration
(i) no call is received
(ii) exactly 5 calls are received
(iii) more than 3 calls are received

Let X be the random variable representing the number of calls received. Given
p_X(k) = e^{−λ} λ^k / k! where λ = 10. Therefore,
(i) probability that no call is received = p_X(0) = e^{−10} ≈ 4.54 × 10⁻⁵
(ii) probability that exactly 5 calls are received = p_X(5) = e^{−10} × 10⁵ / 5! ≈ 0.0378
(iii) probability that more than 3 calls are received
= 1 − Σ_{k=0}^{3} p_X(k) = 1 − e^{−10}(1 + 10 + 10²/2! + 10³/3!) ≈ 0.9897

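These values can be confirmed with a few lines of code. An illustrative Python sketch (standard library only):

from math import exp, factorial

lam = 10.0
pmf = lambda k: exp(-lam) * lam**k / factorial(k)   # Poisson pmf

print(pmf(0))                               # (i)   about 4.54e-5
print(pmf(5))                               # (ii)  about 0.0378
print(1 - sum(pmf(k) for k in range(4)))    # (iii) about 0.9897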
The Poisson distribution is used to model many practical problems. It is used in many
counting applications to count events that take place independently of one another. Thus
it is used to model the count during a particular length of time of:
• customers arriving at a service station
• telephone calls coming to a telephone exchange
• packets arriving at a particular server
• particles decaying from a radioactive specimen
Poisson approximation of the binomial random variable
The Poisson distribution is also used to approximate the binomial distribution B (n, p )
when n is very large and p is small.

Consider a binomial RV X ~ B(n, p) with n → ∞, p → 0 such that EX = np = λ remains constant. Then
p_X(k) ≈ e^{−λ} λ^k / k!

p_X(k) = ⁿC_k p^k (1 − p)^{n−k}
= [n!/(k!(n − k)!)] p^k (1 − p)^{n−k}
= [n(n − 1)(n − 2).....(n − k + 1)/k!] p^k (1 − p)^{n−k}
= [n^k (1 − 1/n)(1 − 2/n).....(1 − (k − 1)/n)/k!] p^k (1 − p)^{n−k}
= [(1 − 1/n)(1 − 2/n).....(1 − (k − 1)/n)/k!] (np)^k (1 − p)^{n−k}
= (1 − 1/n)(1 − 2/n).....(1 − (k − 1)/n) λ^k (1 − λ/n)^n / [k! (1 − λ/n)^k]

Note that lim_{n→∞} (1 − λ/n)^n = e^{−λ}.

∴ p_X(k) → lim_{n→∞} (1 − 1/n)(1 − 2/n).....(1 − (k − 1)/n) λ^k (1 − λ/n)^n / [k! (1 − λ/n)^k] = e^{−λ} λ^k / k!

Thus the Poisson approximation can be used to compute binomial probabilities for large
n. It also makes the analysis of such probabilities easier. Typical examples are:
• number of bit errors in a received binary data file
• number of typographical errors in a printed page
Example
Suppose there is an error probability of 0.01 per word in typing. What is the probability
that there will be more than 1 error in a page of 120 words?
Suppose X is the RV representing the number of errors per page of 120 words.
X ~ B(120, p) where p = 0.01. Therefore,

λ = 120 × 0.01 = 1.2

P(more than one error)
= 1 − p_X(0) − p_X(1)
≈ 1 − e^{−λ} − λe^{−λ} ≈ 0.337
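The quality of this approximation for the typing-error example can be seen by comparing the exact binomial probabilities with the Poisson ones. An illustrative Python sketch (standard library only):

from math import comb, exp, factorial

n, p = 120, 0.01
lam = n * p                                             # 1.2

binom   = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)
poisson = lambda k: exp(-lam) * lam**k / factorial(k)

for k in range(4):
    print(k, binom(k), poisson(k))                      # the two pmfs agree closely

print(1 - binom(0) - binom(1), 1 - poisson(0) - poisson(1))   # P(more than one error)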
Uniform Random Variable
A continuous random variable X is called uniformly distributed over the interval [a, b], if its
probability density function is given by

⎧ 1
⎪ a≤ x≤b
f X ( x) = ⎨ b − a
⎪⎩ 0 otherwise

We use the notation X ~ U (a, b) to denote a random variable X uniformly distributed over the interval
∞ b
1
[a,b]. Also note that ∫
−∞
f X ( x)dx = ∫
a
b−a
dx = 1.

Distribution function FX ( x)

For x < a,
F_X(x) = 0
For a ≤ x ≤ b,
F_X(x) = ∫_{−∞}^{x} f_X(u) du = ∫_a^x du/(b − a) = (x − a)/(b − a)
For x > b,
F_X(x) = 1

Mean and Variance of a Uniform Random Variable


µ_X = EX = ∫_{−∞}^{∞} x f_X(x) dx = ∫_a^b x/(b − a) dx = (a + b)/2

EX² = ∫_{−∞}^{∞} x² f_X(x) dx = ∫_a^b x²/(b − a) dx = (b² + ab + a²)/3

∴ σ_X² = EX² − µ_X² = (b² + ab + a²)/3 − (a + b)²/4 = (b − a)²/12

The characteristic function of the random variable X ~ U(a, b) is given by
φ_X(ω) = E e^{jωX} = ∫_a^b e^{jωx}/(b − a) dx = (e^{jωb} − e^{jωa}) / (jω(b − a))
Example:
Suppose a random noise voltage X across an electronic circuit is uniformly distributed between −4 V
and 5 V. What is the probability that the noise voltage will lie between 2 V and 4 V? What is the
variance of the voltage?
P(2 < X ≤ 4) = ∫_2^4 dx/(5 − (−4)) = 2/9
σ_X² = (5 + 4)²/12 = 27/4
Remark
• The uniform distribution is the simplest continuous distribution.
• It is used, for example, to model quantization errors. If a signal is discretized into steps of ∆,
then the quantization error is uniformly distributed between −∆/2 and ∆/2.
• The unknown phase of a sinusoid is assumed to be uniformly distributed over [0, 2π] in many
applications. For example, in studying the noise performance of a communication receiver, the
carrier signal is modeled as
X(t) = A cos(w_c t + Φ)
where Φ ~ U(0, 2π).
• A random variable of arbitrary distribution can be generated with the help of a routine to
generate uniformly distributed random numbers. This follows from the fact that, for a
continuous random variable X, the transformed variable F_X(X) is uniformly distributed over
[0, 1] (see Example). Thus if X is a continuous random variable, then F_X(X) ~ U(0, 1).

Normal or Gaussian Random Variable

The normal distribution is the most important distribution used to model natural and man-made
phenomena. In particular, when the random variable is the sum of a large number of random variables, it
can be modeled as a normal random variable.
A continuous random variable X is called a normal or Gaussian random variable with parameters µ_X
and σ_X² if its probability density function is given by

f_X(x) = (1/(√(2π) σ_X)) e^{−(x − µ_X)²/2σ_X²}, −∞ < x < ∞

where µ_X and σ_X > 0 are real numbers.

We write that X is N(µ_X, σ_X²) distributed.

If µ_X = 0 and σ_X² = 1,

f_X(x) = (1/√(2π)) e^{−x²/2}

and the random variable X is called the standard normal variable.

• f_X(x) is a bell-shaped function, symmetrical about x = µ_X.
• σ_X determines the spread of the random variable X. If σ_X² is small, X is more concentrated
around the mean µ_X.

F_X(x) = P(X ≤ x)
= (1/(√(2π) σ_X)) ∫_{−∞}^{x} e^{−(t − µ_X)²/2σ_X²} dt

Substituting u = (t − µ_X)/σ_X, we get

F_X(x) = (1/√(2π)) ∫_{−∞}^{(x − µ_X)/σ_X} e^{−u²/2} du
= Φ((x − µ_X)/σ_X)

where Φ(x) is the distribution function of the standard normal variable.

Thus F_X(x) can be computed from tabulated values of Φ(x). The table of Φ(x) was very useful in the
pre-computer days.

In communication engineering, it is customary to work with the Q function defined by

Q(x) = 1 − Φ(x) = (1/√(2π)) ∫_x^∞ e^{−u²/2} du

Note that Q(0) = 1/2 and Q(−x) = 1 − Q(x).

If X is N(µ_X, σ_X²) distributed, then

EX = µ_X
var(X) = σ_X²

Proof:
EX = ∫_{−∞}^{∞} x f_X(x) dx = (1/(√(2π) σ_X)) ∫_{−∞}^{∞} x e^{−(x − µ_X)²/2σ_X²} dx
Substituting u = (x − µ_X)/σ_X, so that x = uσ_X + µ_X,
EX = (1/√(2π)) ∫_{−∞}^{∞} (uσ_X + µ_X) e^{−u²/2} du
= (σ_X/√(2π)) ∫_{−∞}^{∞} u e^{−u²/2} du + (µ_X/√(2π)) ∫_{−∞}^{∞} e^{−u²/2} du
= 0 + µ_X (1/√(2π)) ∫_{−∞}^{∞} e^{−u²/2} du
= µ_X

Evaluation of ∫_{−∞}^{∞} e^{−x²/2} dx

Suppose I = ∫_{−∞}^{∞} e^{−x²/2} dx. Then

I² = (∫_{−∞}^{∞} e^{−x²/2} dx)²
= ∫_{−∞}^{∞} e^{−x²/2} dx ∫_{−∞}^{∞} e^{−y²/2} dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x² + y²)/2} dy dx

Substituting x = r cos θ and y = r sin θ, we get

I² = ∫_{−π}^{π} ∫_0^∞ e^{−r²/2} r dr dθ
= 2π ∫_0^∞ e^{−r²/2} r dr
= 2π ∫_0^∞ e^{−s} ds   (substituting r²/2 = s)
= 2π × 1 = 2π

∴ I = √(2π)
Var(X) = E(X − µ_X)²
= (1/(√(2π) σ_X)) ∫_{−∞}^{∞} (x − µ_X)² e^{−(x − µ_X)²/2σ_X²} dx
Put (x − µ_X)/σ_X = u, so that dx = σ_X du:
= (1/(√(2π) σ_X)) ∫_{−∞}^{∞} σ_X² u² e^{−u²/2} σ_X du
= 2 (σ_X²/√(2π)) ∫_0^∞ u² e^{−u²/2} du
Put t = u²/2, so that dt = u du:
= 2 (σ_X²/√(2π)) √2 ∫_0^∞ t^{1/2} e^{−t} dt
= 2 (σ_X²/√π) Γ(3/2)
= 2 (σ_X²/√π) (1/2) Γ(1/2)
= (σ_X²/√π) √π
= σ_X²

Note the definition and properties of the gamma function:
Γ(n) = ∫_0^∞ t^{n−1} e^{−t} dt
Γ(n) = (n − 1)Γ(n − 1)
Γ(1/2) = √π
Exponential Random Variable

A continuous random variable X is called exponentially distributed with the parameter λ > 0 if
the probability density function is of the form

f_X(x) = λe^{−λx}, x ≥ 0
= 0 otherwise

The corresponding probability distribution function is
F_X(x) = ∫_{−∞}^{x} f_X(u) du
= 0, x < 0
= 1 − e^{−λx}, x ≥ 0

We have µ_X = EX = ∫_0^∞ x λe^{−λx} dx = 1/λ

σ_X² = E(X − µ_X)² = ∫_0^∞ (x − 1/λ)² λe^{−λx} dx = 1/λ²
The following figure shows the pdf of an exponential RV.
• The time between two consecutive occurrences of independent events can be modeled by the
exponential RV. For example, the exponential distribution gives probability density function
of the time between two successive counts in Poisson distribution
• Used to model the service time in a queueing system.
• In reliability studies, the expected lifetime of a part, the average time between successive
failures of a system, etc., are determined using the exponential distribution.

Memoryless Property of the Exponential Distribution

For an exponential RV X with parameter λ ,

P( X > t + t0 / X > t0 ) = P( X > t ) for t > 0, t0 > 0


Proof:
P(X > t + t_0 / X > t_0) = P[(X > t + t_0) ∩ (X > t_0)] / P(X > t_0)
= P(X > t + t_0) / P(X > t_0)
= (1 − F_X(t + t_0)) / (1 − F_X(t_0))
= e^{−λ(t + t_0)} / e^{−λt_0}
= e^{−λt} = P(X > t)
Hence if X represents the life of a component in hours, the probability that the component will last
more than t + t_0 hours, given that it has lasted t_0 hours, is the same as the probability that a new
component will last t hours. The information that the component has already lasted t_0 hours is not
used. Thus the life expectancy of a used component is the same as that of a new component. Such a
model may not represent many real-world situations, but it is used for its simplicity.

Laplace Distribution
A continuous random variable X is called Laplace distributed with the parameter λ > 0 with the
probability density function is of the form

f_X(x) = (λ/2) e^{−λ|x|}, λ > 0, −∞ < x < ∞

We have µ_X = EX = ∫_{−∞}^{∞} x (λ/2) e^{−λ|x|} dx = 0

σ_X² = E(X − µ_X)² = ∫_{−∞}^{∞} x² (λ/2) e^{−λ|x|} dx = 2/λ²
Chi-square random variable
A random variable is called a Chi-square random variable with n degrees of freedom if its
PDF is given by
f_X(x) = x^{n/2−1} e^{−x/2σ²} / (2^{n/2} σ^n Γ(n/2)), x > 0
= 0, x < 0

with the parameter σ > 0 and Γ(·) denoting the gamma function. A chi-square random
variable with n degrees of freedom is denoted by χ_n².

Note that a χ_2² RV is an exponential RV with λ = 1/(2σ²).

The pdf of χ n2 RVs with different degrees of freedom is shown in Fig. below:
Mean and variance of the chi-square random variable

µ_X = ∫_{−∞}^{∞} x f_X(x) dx
= ∫_0^∞ x · x^{n/2−1} e^{−x/2σ²} / (2^{n/2} σ^n Γ(n/2)) dx
= ∫_0^∞ x^{n/2} e^{−x/2σ²} / (2^{n/2} σ^n Γ(n/2)) dx
= ∫_0^∞ (2σ²)^{n/2} u^{n/2} e^{−u} (2σ²) du / (2^{n/2} σ^n Γ(n/2))   (substituting u = x/2σ²)
= 2σ² Γ((n + 2)/2) / Γ(n/2)
= 2σ² (n/2) Γ(n/2) / Γ(n/2)
= nσ²
Similarly,
EX² = ∫_{−∞}^{∞} x² f_X(x) dx
= ∫_0^∞ x^{(n+2)/2} e^{−x/2σ²} / (2^{n/2} σ^n Γ(n/2)) dx
= ∫_0^∞ (2σ²)^{(n+2)/2} u^{(n+2)/2} e^{−u} (2σ²) du / (2^{n/2} σ^n Γ(n/2))   (substituting u = x/2σ²)
= 4σ⁴ Γ((n + 4)/2) / Γ(n/2)
= 4σ⁴ ((n + 2)/2)(n/2) Γ(n/2) / Γ(n/2)
= n(n + 2)σ⁴
σ_X² = EX² − µ_X² = n(n + 2)σ⁴ − n²σ⁴ = 2nσ⁴

Relation between the chi-square distribution and the Gaussian distribution

Let X_1, X_2, ..., X_n be independent zero-mean Gaussian random variables, each with variance σ².
Then Y = X_1² + X_2² + ... + X_n² has the χ_n² distribution with mean nσ² and variance 2nσ⁴.
This result gives an important application of the chi-square random variable.

Rayleigh Random Variable

A Rayleigh random variable is characterized by the PDF


⎧ x − x 2 / 2σ 2
⎪ e x >0
f X ( x ) = ⎨σ 2
⎪⎩0 x<0
where σ is the parameter of the random variable.
The probability density functions for the Rayleigh RVs are illustrated in Fig.

Mean and variance of the Rayleigh distribution



EX = ∫_{−∞}^{∞} x f_X(x) dx
= ∫_0^∞ x (x/σ²) e^{−x²/2σ²} dx
= (√(2π)/σ) ∫_0^∞ (x²/(√(2π) σ)) e^{−x²/2σ²} dx
= (√(2π)/σ) (σ²/2)
= √(π/2) σ

Similarly
EX² = ∫_{−∞}^{∞} x² f_X(x) dx
= ∫_0^∞ x² (x/σ²) e^{−x²/2σ²} dx
= 2σ² ∫_0^∞ u e^{−u} du   (substituting u = x²/2σ²)
= 2σ²   (noting that ∫_0^∞ u e^{−u} du = 1 is the mean of the exponential RV with λ = 1)

∴ σ_X² = 2σ² − (√(π/2) σ)²
= (2 − π/2)σ²

Relation between the Rayleigh distribution and the Gaussian distribution

A Rayleigh RV is related to Gaussian RVs as follow:


Let X 1 and X 2 be two independent zero-mean Gaussian RVs with a common variance σ 2 . Then
X = X 12 + X 22 has the Rayleigh pdf with the parameter σ .
We shall prove this result in a later class. This important result also suggests the cases where the
Rayleigh RV can be used.

Application of the Rayleigh RV


• Modelling the root mean square error.
• Modelling the envelope of a signal with two orthogonal components, as in the case of a signal
of the form
s(t) = X_1 cos wt + X_2 sin wt
If X_1 ~ N(0, σ²) and X_2 ~ N(0, σ²) are independent, then the envelope X = √(X_1² + X_2²) has the
Rayleigh distribution.
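This relation can be checked empirically. The sketch below (illustrative Python, assuming numpy is available) forms X = √(X_1² + X_2²) from two independent N(0, σ²) samples and compares the sample moments with the Rayleigh values derived above:

import numpy as np

rng = np.random.default_rng(6)
sigma = 2.0
x1 = rng.normal(0.0, sigma, 200_000)
x2 = rng.normal(0.0, sigma, 200_000)

x = np.sqrt(x1**2 + x2**2)                      # should be Rayleigh with parameter sigma

print(x.mean(), np.sqrt(np.pi / 2) * sigma)     # EX   = sqrt(pi/2) * sigma
print(x.var(), (2 - np.pi / 2) * sigma**2)      # var  = (2 - pi/2) * sigma^2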
Simulation of Random Variables
• In many fields of science and engineering, computer simulation is used to study
random phenomena in nature and the performance of an engineering system in a
noisy environment. For example, we may study through computer simulation the
performance of a communication receiver. Sometimes a probability model may
not be analytically tractable and computer simulation is used to calculate
probabilities.
• The heart of all these applications is that it is possible to simulate a random
variable with an empirical CDF or pdf that fits well with the theoretical CDF or
pdf.

Generation of Random Numbers

Generation of random numbers means producing a sequence of independent random


numbers with a specified CDF or pdf. All random number generators rely on a
routine to generate random numbers with the uniform pdf. Such a routine is of vital
importance because the quality of the generated random numbers with any other
distribution depends on it. By the quality of the generated random numbers we mean how
closely the empirical CDF or pdf fits the true one.

There are several algorithms to generate U [0 1] random numbers. Note that these
algorithms generate random number by a reproducible deterministic method. These
numbers are pseudo random numbers because they are reproducible and the same
sequence of numbers repeats after some period of count specific to the generating
algorithm. This period is very high and a finite sample of data within the period appears
to be uniformly distributed. We will not discuss these algorithms here; software
packages provide routines to generate such numbers.

Method of Inverse transform

Suppose we want to generate a random variable X with a prescribed distribution
function F_X(x). We have observed that the random variable Y defined by
Y = F_X(X) is U[0, 1] distributed. Thus, given a U[0, 1] random number Y,
the inverse transform X = F_X^{−1}(Y) will have the CDF F_X(x).
The algorithmic steps for the inverse transform method are as follows:

1. Generate a random number from Y ~ U[0, 1]. Call it y.
2. Compute the value x such that F_X(x) = y.
3. Take x to be the random number generated.

Example Suppose we want to generate a random variable with the pdf f_X(x)
given by
f_X(x) = (2/9)x, 0 ≤ x ≤ 3
= 0 otherwise
The CDF of X is given by

F_X(x) = 0, x < 0
= x²/9, 0 ≤ x ≤ 3
= 1, x > 3

[Figure: F_X(x) = x²/9 on 0 ≤ x ≤ 3]

Therefore, we generate a random number y from the U[0, 1] distribution and set
F_X(x) = y.
We have
x²/9 = y
⇒ x = √(9y) = 3√y

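A minimal sketch of the inverse-transform method for this density (illustrative Python, assuming numpy is available): samples x = 3√y are generated from uniform y, and the sample mean is compared with EX = ∫_0^3 x · (2x/9) dx = 2.

import numpy as np

rng = np.random.default_rng(3)
y = rng.random(100_000)          # y ~ U[0, 1]
x = 3.0 * np.sqrt(y)             # x = F_X^{-1}(y) = sqrt(9y)

print(x.mean())                  # close to the theoretical mean EX = 2
# A histogram of x would follow the triangular pdf f_X(x) = 2x/9 on [0, 3].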
Example Suppose we want to generate a random variable with the exponential
distribution given by f_X(x) = λe^{−λx}, λ > 0, x ≥ 0. Then
F_X(x) = 1 − e^{−λx}
Therefore, given y, we can get x by the mapping

1 − e^{−λx} = y
x = −ln(1 − y)/λ

Since 1 − Y is also uniformly distributed over [0, 1], the above expression can be
simplified to
x = −ln(y)/λ
Generation of Gaussian random numbers

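The source gives no worked example under this heading; one standard approach (stated here as a supplementary sketch, not as the notes' own method) is the Box–Muller transform, which converts two independent U(0, 1) numbers into two independent N(0, 1) numbers. Illustrative Python, assuming numpy is available:

import numpy as np

rng = np.random.default_rng(4)
u1 = rng.random(50_000)
u2 = rng.random(50_000)

r = np.sqrt(-2.0 * np.log(u1))        # Rayleigh-distributed radius
z1 = r * np.cos(2 * np.pi * u2)       # N(0, 1)
z2 = r * np.sin(2 * np.pi * u2)       # N(0, 1), independent of z1

print(z1.mean(), z1.var(), z2.mean(), z2.var())   # approximately 0, 1, 0, 1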
Generation of discrete random variables

Suppose X is a discrete random variable with the probability mass function
p_X(x_i), i = 1, 2, ..., n. The CDF F_X(x) is now a staircase function, and the inverse mapping
from y = F_X(x) is defined as shown in the figure below.

[Figure: inverse mapping x_k = F_X^{−1}(y) for a staircase CDF Y = F_X(X)]

Thus if F_X(x_{k−1}) ≤ y < F_X(x_k), then

F_X^{−1}(y) = x_k

The algorithmic steps for the inverse transform method for simulating discrete random
variables are as follows:

1. Generate a random number from Y ~ U [0, 1]. Call it y.


2. Compute the value x_k such that F_X(x_{k−1}) ≤ y < F_X(x_k).
3. Take x_k to be the random number generated.

Example Generation of Bernoulli random numbers


Suppose we want to generate a random number from X ~ Br ( p). Generate y from the
U [0, 1] distribution. Set

x = 0 for y ≤ 1 − p
= 1 otherwise

[Figure: generation of a Bernoulli random number by thresholding y at 1 − p]
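In code, this thresholding rule is essentially one line. An illustrative Python sketch, assuming numpy is available:

import numpy as np

def bernoulli(p, size, rng=np.random.default_rng(5)):
    """Generate Bernoulli(p) random numbers by thresholding U[0, 1] samples."""
    y = rng.random(size)
    return (y > 1 - p).astype(int)    # x = 0 if y <= 1 - p, else 1

samples = bernoulli(0.3, 100_000)
print(samples.mean())                 # close to p = 0.3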
Jointly distributed random variables

We may define two or more random variables on the same sample space. Let X and
Y be two real random variables defined on the same probability space ( S , F, P). The
mapping S → ℝ² such that for s ∈ S, (X(s), Y(s)) ∈ ℝ² is called a joint random
variable.

[Figure: mapping of a sample point s ∈ S to the point (X(s), Y(s)) ∈ ℝ²]

Figure Joint Random Variable


Remark
• The above figure illustrates the mapping corresponding to a joint random
variable. The joint random variable in the above case is denoted by (X, Y).
• We may represent a joint random variable as a two-dimensional vector
X = [X Y]′.
• We can extend the above definition to define joint random variables of any
dimension. The mapping S → ℝⁿ such that for
s ∈ S, (X_1(s), X_2(s), ..., X_n(s)) ∈ ℝⁿ is called an n-dimensional random variable
and denoted by the vector X = [X_1 X_2 .... X_n]′.

Example1: Suppose we are interested in studying the height and weight of the students
in a class. We can define the joint RV ( X , Y ) where X represents height and
Y represents the weight.

Example 2 Suppose in a communication system X is the transmitted signal and


Y is the corresponding noisy received signal. Then ( X , Y ) is a joint random
variable.

Joint Probability Distribution Function


Recall the definition of the distribution of a single random variable. The event
{X ≤ x} was used to define the probability distribution function F_X(x). Given
F_X(x), we can find the probability of any event involving the random variable.
Similarly, for two random variables X and Y, the event
{X ≤ x, Y ≤ y} = {X ≤ x} ∩ {Y ≤ y} is considered as the representative event.
The probability P{X ≤ x, Y ≤ y}, ∀(x, y) ∈ ℝ², is called the joint distribution function
of the random variables X and Y and denoted by F_{X,Y}(x, y).

[Figure: the event {X ≤ x, Y ≤ y} shown as the region to the lower-left of the point (x, y)]

FX ,Y ( x, y ) satisfies the following properties:


• FX ,Y ( x1 , y1 ) ≤ FX ,Y ( x2 , y2 ) if x1 ≤ x2 and y1 ≤ y2
If x1 < x2 and y1 < y2 ,
{ X ≤ x1 , Y ≤ y1 } ⊆ { X ≤ x2 , Y ≤ y2 }
∴ P{ X ≤ x1 , Y ≤ y1 } ≤ P{ X ≤ x2 , Y ≤ y2 }
∴ FX ,Y ( x1 , y1 ) ≤ FX ,Y ( x2 , y2 )

• FX ,Y (−∞, y) = FX ,Y ( x, −∞) = 0
Note that
{ X ≤ −∞, Y ≤ y} ⊆ { X ≤ −∞}
• FX ,Y (∞, ∞) = 1.
• FX ,Y ( x, y ) is right continuous in both the variables.
• If x_1 < x_2 and y_1 < y_2,
P{x_1 < X ≤ x_2, y_1 < Y ≤ y_2} = F_{X,Y}(x_2, y_2) − F_{X,Y}(x_1, y_2) − F_{X,Y}(x_2, y_1) + F_{X,Y}(x_1, y_1) ≥ 0.
Given FX ,Y ( x, y ), -∞ < x < ∞, -∞ < y < ∞, we have a complete description of
the random variables X and Y .

• FX ( x) = FXY ( x,+∞).

To prove this

( X ≤ x ) = ( X ≤ x ) ∩ ( Y ≤ +∞ )
∴ F X ( x ) = P (X ≤ x ) = P (X ≤ x , Y ≤ ∞ )= F XY ( x , +∞ )

Similarly FY ( y ) = FXY (∞, y ).

• Given FX ,Y ( x, y ), -∞ < x < ∞, -∞ < y < ∞, each of FX ( x) and FY ( y ) is called


a marginal distribution function.

Example

Consider two jointly distributed random variables X and Y with the joint CDF
⎧(1 − e −2 x )(1 − e − y ) x ≥ 0, y ≥ 0
FX ,Y ( x, y ) = ⎨
⎩0 otherwise

(a) Find the marginal CDFs


(b) Find the probability P{1 < X ≤ 2, 1 < Y ≤ 2}

⎧1 − e−2 x x ≥ 0
FX ( x) = lim FX ,Y ( x, y ) = ⎨
y →∞
⎩0 elsewhere
(a)
⎧1 − e − y y ≥ 0
FY ( y ) = lim FX ,Y ( x, y ) = ⎨
x →∞
⎩0 elsewhere

(b)
P{1 < X ≤ 2, 1 < Y ≤ 2} = FX ,Y (2, 2) + FX ,Y (1,1) − FX ,Y (1, 2) − FX ,Y (2,1)
= (1 − e −4 )(1 − e −2 ) + (1 − e −2 )(1 − e −1 ) − (1 − e−2 )(1 − e−2 ) − (1 − e−4 )(1 − e −1 )
=0.0272

Jointly distributed discrete random variables


Let X and Y be two discrete random variables defined on the same probability space
(S, F, P) such that X takes values from the countable subset R_X and Y takes values
from the countable subset R_Y. Then the joint random variable (X, Y) takes values from
the countable subset R_X × R_Y. The joint random variable (X, Y) is completely
specified by its joint probability mass function
p_{X,Y}(x, y) = P{s | X(s) = x, Y(s) = y}, ∀(x, y) ∈ R_X × R_Y.
Given p X ,Y ( x, y ), we can determine other probabilities involving the random variables
X and Y .
Remark
• p X ,Y ( x, y ) = 0 for ( x, y ) ∉ RX × RY

• Σ_{(x,y) ∈ R_X × R_Y} p_{X,Y}(x, y) = 1

This is because
Σ_{(x,y) ∈ R_X × R_Y} p_{X,Y}(x, y) = P( ∪_{(x,y) ∈ R_X × R_Y} {X = x, Y = y} )
= P{s | (X(s), Y(s)) ∈ R_X × R_Y}
= P(S) = 1
• Marginal Probability Mass Functions: The probability mass functions
p X ( x ) and pY ( y ) are obtained from the joint probability mass function as
follows
p_X(x) = P{X = x} = P( ∪_{y ∈ R_Y} {X = x, Y = y} ) = Σ_{y ∈ R_Y} p_{X,Y}(x, y)
and similarly
p_Y(y) = Σ_{x ∈ R_X} p_{X,Y}(x, y)

These probability mass functions p X ( x ) and pY ( y ) obtained from the joint


probability mass functions are called marginal probability mass functions.
Example Consider the random variables X and Y with the joint probability
mass function as tabulated in Table . The marginal probabilities are as shown in
the last column and the last row

Y \ X       0       1       2       p_Y(y)
0           0.25    0.1     0.15    0.5
1           0.14    0.35    0.01    0.5
p_X(x)      0.39    0.45    0.16
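The marginals in this table are just row and column sums of the joint pmf. A minimal sketch (illustrative Python, assuming numpy is available):

import numpy as np

# Rows: Y = 0, 1;  Columns: X = 0, 1, 2
p_xy = np.array([[0.25, 0.10, 0.15],
                 [0.14, 0.35, 0.01]])

p_x = p_xy.sum(axis=0)        # marginal of X: [0.39, 0.45, 0.16]
p_y = p_xy.sum(axis=1)        # marginal of Y: [0.5, 0.5]

print(p_x, p_y, p_xy.sum())   # the joint pmf sums to 1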
Joint Probability Density Function

If X and Y are two continuous random variables and their joint distribution function
is continuous in both x and y, then we can define joint probability density function
f X ,Y ( x, y ) by
∂2
f X ,Y ( x, y ) = FX ,Y ( x, y ), provided it exists.
∂x∂y
Clearly F_{X,Y}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(u, v) dv du
Properties of Joint Probability Density Function

• f_{X,Y}(x, y) is always a non-negative quantity. That is,
f_{X,Y}(x, y) ≥ 0, ∀(x, y) ∈ ℝ²

• ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1

• The probability of any Borel set B can be obtained by


P( B) = ∫∫ f X ,Y ( x, y )dxdy
( x , y )∈B

Marginal density functions


The marginal density functions f_X(x) and f_Y(y) of two joint RVs X and Y are given by
the derivatives of the corresponding marginal distribution functions. Thus
f_X(x) = (d/dx) F_X(x)
= (d/dx) F_{X,Y}(x, ∞)
= (d/dx) ∫_{−∞}^{x} ( ∫_{−∞}^{∞} f_{X,Y}(u, y) dy ) du
= ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
and similarly f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx
Remark
• The marginal CDF and pdf are the same as the CDF and pdf of the concerned
single random variable. The term marginal simply indicates that they are derived
from the corresponding joint distribution or density function of two or more
jointly distributed random variables.
• With the help of the two-dimensional Dirac delta function, we can define the
joint pdf of two jointly distributed discrete random variables. Thus for discrete jointly
distributed random variables X and Y,
f_{X,Y}(x, y) = Σ_{(x_i, y_j) ∈ R_X × R_Y} p_{X,Y}(x_i, y_j) δ(x − x_i, y − y_j)

Example The joint density function f_{X,Y}(x, y) in the previous example is

f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y)/∂x∂y
= ∂²[(1 − e^{−2x})(1 − e^{−y})]/∂x∂y, x ≥ 0, y ≥ 0
= 2e^{−2x} e^{−y}, x ≥ 0, y ≥ 0

Example: The joint pdf of two random variables X and Y is given by
f_{X,Y}(x, y) = cxy, 0 ≤ x ≤ 2, 0 ≤ y ≤ 2
= 0 otherwise
(i) Find c.
(ii) Find F_{X,Y}(x, y).
(iii) Find f_X(x) and f_Y(y).
(iv) What is the probability P(0 < X ≤ 1, 0 < Y ≤ 1)?

(i) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dy dx = c ∫_0^2 ∫_0^2 xy dy dx = 4c = 1
∴ c = 1/4
(ii) F_{X,Y}(x, y) = (1/4) ∫_0^y ∫_0^x uv du dv = x²y²/16, 0 ≤ x ≤ 2, 0 ≤ y ≤ 2
(iii) f_X(x) = ∫_0^2 (xy/4) dy = x/2, 0 ≤ x ≤ 2
Similarly
f_Y(y) = y/2, 0 ≤ y ≤ 2
(iv) P(0 < X ≤ 1, 0 < Y ≤ 1)
= F_{X,Y}(1, 1) + F_{X,Y}(0, 0) − F_{X,Y}(0, 1) − F_{X,Y}(1, 0)
= 1/16 + 0 − 0 − 0
= 1/16
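Parts (i) and (iv) can be cross-checked numerically. An illustrative Python sketch, assuming scipy is available for double integration:

from scipy.integrate import dblquad

f = lambda y, x: 0.25 * x * y          # joint pdf with c = 1/4 (dblquad integrates over y first)

total, _ = dblquad(f, 0, 2, 0, 2)      # integral over the whole square [0,2] x [0,2]
prob, _  = dblquad(f, 0, 1, 0, 1)      # P(0 < X <= 1, 0 < Y <= 1)

print(total, prob)                     # 1.0 and 0.0625 = 1/16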
Conditional probability mass functions
p_{Y/X}(y/x) = P({Y = y}/{X = x})
= P({X = x} ∩ {Y = y}) / P{X = x}
= p_{X,Y}(x, y) / p_X(x), provided p_X(x) ≠ 0
Similarly we can define the conditional probability mass function
pX / Y ( x / y)
• From the definition of conditional probability mass functions, we can
define two independent random variables. Two discrete random variables
X and Y are said to be independent if and only if
pY / X ( y / x ) = pY ( y )
so that

p X , y ( x, y ) = p X ( x ) pY ( y )

• Bayes' rule:
p_{X/Y}(x/y) = P({X = x}/{Y = y})
= P({X = x} ∩ {Y = y}) / P{Y = y}
= p_{X,Y}(x, y) / p_Y(y)
= p_{X,Y}(x, y) / Σ_{x ∈ R_X} p_{X,Y}(x, y)

Example Consider the random variables X and Y with the joint probability
mass function as tabulated below. The marginal probabilities are shown in the
last column and the last row.

Y \ X       0       1       2       p_Y(y)
0           0.25    0.1     0.15    0.5
1           0.14    0.35    0.01    0.5
p_X(x)      0.39    0.45    0.16

p_{Y/X}(1/0) = p_{X,Y}(0, 1)/p_X(0) = 0.14/0.39

Conditional Probability Density Function

f_{Y/X}(y / X = x) = f_{Y/X}(y/x) is called the conditional density of Y given X.

Let us define the conditional distribution function.

We cannot define the conditional distribution function for continuous
random variables X and Y by the relation

   F_{Y/X}(y/x) = P(Y ≤ y / X = x)
                = P(Y ≤ y, X = x) / P(X = x)

as both the numerator and the denominator of the above expression are zero.

The conditional distribution function is therefore defined in the limiting sense as follows:

   F_{Y/X}(y/x) = lim_{Δx→0} P(Y ≤ y / x < X ≤ x + Δx)
                = lim_{Δx→0} P(Y ≤ y, x < X ≤ x + Δx) / P(x < X ≤ x + Δx)
                = lim_{Δx→0} ∫_{-∞}^{y} f_{X,Y}(x, u) Δx du / ( f_X(x) Δx )
                = ∫_{-∞}^{y} f_{X,Y}(x, u) du / f_X(x)
The conditional density is defined in the limiting sense as follows:

   f_{Y/X}(y / X = x) = lim_{Δy→0} [ F_{Y/X}(y + Δy / X = x) − F_{Y/X}(y / X = x) ] / Δy
                      = lim_{Δy→0, Δx→0} [ F_{Y/X}(y + Δy / x < X ≤ x + Δx) − F_{Y/X}(y / x < X ≤ x + Δx) ] / Δy     (1)

because {X = x} = lim_{Δx→0} {x < X ≤ x + Δx}.

The right-hand side of equation (1) is

   lim_{Δy→0, Δx→0} [ F_{Y/X}(y + Δy / x < X ≤ x + Δx) − F_{Y/X}(y / x < X ≤ x + Δx) ] / Δy
      = lim_{Δy→0, Δx→0} P(y < Y ≤ y + Δy / x < X ≤ x + Δx) / Δy
      = lim_{Δy→0, Δx→0} P(y < Y ≤ y + Δy, x < X ≤ x + Δx) / [ P(x < X ≤ x + Δx) Δy ]
      = lim_{Δy→0, Δx→0} f_{X,Y}(x, y) Δx Δy / [ f_X(x) Δx Δy ]
      = f_{X,Y}(x, y) / f_X(x)

   ∴ f_{Y/X}(y/x) = f_{X,Y}(x, y) / f_X(x)                                    (2)

Similarly we have

   f_{X/Y}(x/y) = f_{X,Y}(x, y) / f_Y(y)                                      (3)

•  Two random variables are statistically independent if for all (x, y) ∈ R²,

       f_{Y/X}(y/x) = f_Y(y)

   or equivalently

       f_{X,Y}(x, y) = f_X(x) f_Y(y)                                          (4)

Bayes' rule for continuous random variables:

From (2) and (3) we get Bayes' rule

   f_{X/Y}(x/y) = f_{X,Y}(x, y) / f_Y(y)
                = f_X(x) f_{Y/X}(y/x) / f_Y(y)
                = f_{X,Y}(x, y) / ∫_{-∞}^{∞} f_{X,Y}(x, y) dx
                = f_X(x) f_{Y/X}(y/x) / ∫_{-∞}^{∞} f_X(u) f_{Y/X}(y/u) du     (5)

Given the joint density function we can find out the conditional density function.

Example: For random variables X and Y, the joint probability density function is given
by

   f_{X,Y}(x, y) = (1 + x y)/4     |x| ≤ 1, |y| ≤ 1
                 = 0               otherwise

Find the marginal densities f_X(x), f_Y(y) and the conditional density f_{Y/X}(y/x). Are X and Y independent?

   f_X(x) = ∫_{-1}^{1} (1 + x y)/4 dy = 1/2        −1 ≤ x ≤ 1

Similarly

   f_Y(y) = 1/2                                    −1 ≤ y ≤ 1

and

   f_{Y/X}(y/x) = f_{X,Y}(x, y) / f_X(x)
                = (1 + x y)/2                      |y| ≤ 1

Since f_{Y/X}(y/x) ≠ f_Y(y) (equivalently f_{X,Y}(x, y) ≠ f_X(x) f_Y(y)), X and Y are not independent.
Bayes' rule for mixed random variables

Let X be a discrete random variable with probability mass function p_X(x) and Y be a
continuous random variable with the conditional probability density function f_{Y/X}(y/x).
In practical problems we may have to estimate X from the observed Y. Then

   p_{X/Y}(x/y) = lim_{Δy→0} P(X = x / y < Y ≤ y + Δy)
                = lim_{Δy→0} P(X = x, y < Y ≤ y + Δy) / P(y < Y ≤ y + Δy)
                = lim_{Δy→0} p_X(x) f_{Y/X}(y/x) Δy / ( f_Y(y) Δy )
                = p_X(x) f_{Y/X}(y/x) / f_Y(y)
                = p_X(x) f_{Y/X}(y/x) / Σ_x p_X(x) f_{Y/X}(y/x)

Example  Suppose Y = X + V, where

X is a binary random variable with

   X = 1    with probability p
     = −1   with probability 1 − p

and V is Gaussian noise with mean 0 and variance σ², independent of X.

Then

   p_{X/Y}(x = 1 / y) = p_X(1) f_{Y/X}(y/1) / Σ_x p_X(x) f_{Y/X}(y/x)
                      = p e^{−(y−1)²/2σ²} / [ p e^{−(y−1)²/2σ²} + (1 − p) e^{−(y+1)²/2σ²} ]
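As a quick numerical sketch of this posterior (the function name and parameter values are illustrative, not part of the text):

import numpy as np

def posterior_x_equals_1(y, p=0.5, sigma=1.0):
    """P(X = +1 | Y = y) for X in {+1, -1} observed in Gaussian noise."""
    lik_plus = np.exp(-(y - 1.0) ** 2 / (2 * sigma ** 2))
    lik_minus = np.exp(-(y + 1.0) ** 2 / (2 * sigma ** 2))
    return p * lik_plus / (p * lik_plus + (1 - p) * lik_minus)

print(posterior_x_equals_1(0.0))    # 0.5 -- an observation at 0 is uninformative
print(posterior_x_equals_1(1.5))    # close to 1: the observation favours X = +1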

Independent Random Variables

Let X and Y be two random variables characterised by the joint distribution function

   F_{X,Y}(x, y) = P{X ≤ x, Y ≤ y}

and the corresponding joint density function f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y)/∂x∂y.

Then X and Y are independent if, for all (x, y) ∈ R², {X ≤ x} and {Y ≤ y} are
independent events. Thus,

   F_{X,Y}(x, y) = P{X ≤ x, Y ≤ y}
                 = P{X ≤ x} P{Y ≤ y}
                 = F_X(x) F_Y(y)

   ∴ f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y)/∂x∂y
                   = f_X(x) f_Y(y)

and equivalently

   f_{Y/X}(y/x) = f_Y(y)

Remark:
Suppose X and Y are two discrete random variables with joint probability mass function
p_{X,Y}(x, y). Then X and Y are independent if

   p_{X,Y}(x, y) = p_X(x) p_Y(y)    for all (x, y) ∈ R_X × R_Y
Transformation of two random variables:

We are often interested in finding out the probability density function of a function of
two or more RVs. Following are a few examples.
•  The received signal at a communication receiver is given by
       Z = X + Y
   where Z is the received signal, which is the superposition of the message signal X
   and the noise Y.

•  Frequently applied operations on communication signals, like modulation,
   demodulation, correlation etc., involve multiplication of two signals in the
   form Z = XY.

We have to know the probability distribution of Z in any analysis involving Z.
More formally, given two random variables X and Y with joint probability density
function f X ,Y ( x, y ) and a function Z = g ( X , Y ) , we have to find f Z ( z ) .
In this lecture, we will try to address this problem.

Probability density of the function of two random variables:

We consider the transformation g : R² → R.

Consider the event {Z ≤ z} corresponding to each z. We can find a subset
D_z ⊆ R² such that D_z = {(x, y) | g(x, y) ≤ z}.

[Figure: the region D_z = {(x, y) | g(x, y) ≤ z} in the X–Y plane, mapped to the event {Z ≤ z}.]

   ∴ F_Z(z) = P({Z ≤ z})
            = P{(x, y) | (x, y) ∈ D_z}
            = ∫∫_{(x,y)∈D_z} f_{X,Y}(x, y) dy dx

   ∴ f_Z(z) = dF_Z(z)/dz

Example: Suppose Z = X + Y. Find the pdf f_Z(z).

   Z ≤ z  ⇒  X + Y ≤ z

Therefore D_z = {(x, y) | x + y ≤ z} is the half-plane below the line x + y = z.

   ∴ F_Z(z) = ∫∫_{(x,y)∈D_z} f_{X,Y}(x, y) dx dy
            = ∫_{-∞}^{∞} [ ∫_{-∞}^{z−x} f_{X,Y}(x, y) dy ] dx
            = ∫_{-∞}^{∞} [ ∫_{-∞}^{z} f_{X,Y}(x, u − x) du ] dx       substituting y = u − x
            = ∫_{-∞}^{z} [ ∫_{-∞}^{∞} f_{X,Y}(x, u − x) dx ] du       interchanging the order of integration

   ∴ f_Z(z) = d/dz ∫_{-∞}^{z} [ ∫_{-∞}^{∞} f_{X,Y}(x, u − x) dx ] du
            = ∫_{-∞}^{∞} f_{X,Y}(x, z − x) dx

If X and Y are independent,

   f_{X,Y}(x, z − x) = f_X(x) f_Y(z − x)

   ∴ f_Z(z) = ∫_{-∞}^{∞} f_X(x) f_Y(z − x) dx
            = f_X(z) * f_Y(z)

where * is the convolution operation.


Example:

Suppose X and Y are independent random variables, each uniformly distributed over
(a, b), so that f_X(x) and f_Y(y) are rectangular pulses of height 1/(b − a) on (a, b).

The pdf of Z = X + Y is then the convolution of the two rectangles: a triangular
probability density function on (2a, 2b), peaking at z = a + b.
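A small numerical sketch of this convolution (the values a = 0, b = 1 and the grid spacing are illustrative):

import numpy as np

a, b = 0.0, 1.0
dx = 0.001
x = np.arange(a, b, dx)
f_X = np.full_like(x, 1.0 / (b - a))          # uniform density on (a, b)
f_Y = f_X.copy()

f_Z = np.convolve(f_X, f_Y) * dx              # numerical convolution: density of Z = X + Y

print(f_Z.max())                              # ≈ 1/(b - a): peak of the triangle at z = a + b
print(f_Z.sum() * dx)                         # ≈ 1: the result integrates to one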

Probability density function of Z = XY

   F_Z(z) = ∫∫_{(x,y)∈D_z} f_{X,Y}(x, y) dy dx
          = ∫_{-∞}^{∞} [ ∫_{-∞}^{z/x} f_{X,Y}(x, y) dy ] dx
          = ∫_{-∞}^{∞} (1/|x|) ∫_{-∞}^{z} f_{X,Y}(x, u/x) du dx       substituting u = xy, du = x dy

   ∴ f_Z(z) = dF_Z(z)/dz = ∫_{-∞}^{∞} (1/|x|) f_{X,Y}(x, z/x) dx
            = ∫_{-∞}^{∞} (1/|y|) f_{X,Y}(z/y, y) dy

Probability density function of Z = Y/X

   Z ≤ z  ⇒  Y/X ≤ z

   D_z = {(x, y) | y/x ≤ z}

   ∴ F_Z(z) = ∫∫_{(x,y)∈D_z} f_{X,Y}(x, y) dy dx
            = ∫_{-∞}^{∞} ∫_{-∞}^{zx} f_{X,Y}(x, y) dy dx        (for x > 0; the limits flip for x < 0)
            = ∫_{-∞}^{∞} ∫_{-∞}^{z} |x| f_{X,Y}(x, ux) du dx     substituting y = ux

   ∴ f_Z(z) = dF_Z(z)/dz = ∫_{-∞}^{∞} |x| f_{X,Y}(x, zx) dx

If X and Y are independent random variables, then f_{X,Y}(x, zx) = f_X(x) f_Y(zx) and

   f_Z(z) = ∫_{-∞}^{∞} |x| f_X(x) f_Y(zx) dx

Example:
Suppose X and Y are independent zero-mean Gaussian random variables with unit
standard deviation and Z = Y/X. Then

   f_Z(z) = ∫_{-∞}^{∞} |x| (1/√(2π)) e^{−x²/2} (1/√(2π)) e^{−z²x²/2} dx
          = (1/2π) ∫_{-∞}^{∞} |x| e^{−x²(1 + z²)/2} dx
          = (1/π) ∫_{0}^{∞} x e^{−x²(1 + z²)/2} dx
          = 1 / [ π (1 + z²) ]

which is the Cauchy probability density function.
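This can be checked quickly by simulation. A minimal Monte Carlo sketch (sample size and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.standard_normal(n) / rng.standard_normal(n)   # ratio of two independent N(0,1)

# compare an empirical histogram with the Cauchy pdf on a central interval
edges = np.linspace(-5, 5, 101)
hist, _ = np.histogram(z, bins=edges, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
cauchy = 1.0 / (np.pi * (1.0 + centres ** 2))

print(np.max(np.abs(hist - cauchy)))   # small: only sampling error remains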

Probability density function of Z = √(X² + Y²)

   D_z = {(x, y) | √(x² + y²) ≤ z}
       = {(r, θ) | 0 ≤ r ≤ z, 0 ≤ θ ≤ 2π}      in polar coordinates

   ∴ F_Z(z) = ∫∫_{(x,y)∈D_z} f_{X,Y}(x, y) dy dx
            = ∫_0^{2π} ∫_0^{z} f_{X,Y}(r cos θ, r sin θ) r dr dθ

   ∴ f_Z(z) = dF_Z(z)/dz = ∫_0^{2π} f_{X,Y}(z cos θ, z sin θ) z dθ

Example  Suppose X and Y are two independent Gaussian random variables each with
mean 0 and variance σ², and Z = √(X² + Y²). Then

   f_Z(z) = ∫_0^{2π} f_{X,Y}(z cos θ, z sin θ) z dθ
          = z ∫_0^{2π} f_X(z cos θ) f_Y(z sin θ) dθ
          = z ∫_0^{2π} [ e^{−z² cos²θ/2σ²} e^{−z² sin²θ/2σ²} / (2πσ²) ] dθ
          = (z/σ²) e^{−z²/2σ²}        z ≥ 0

This is the Rayleigh density function we discussed earlier.

Rician Distribution:

Suppose X and Y are independent Gaussian random variables with non-zero means μ_X and μ_Y
respectively and a common variance σ². We have to find the density function of the
random variable Z = √(X² + Y²). This situation arises, for example, as

   •  the envelope of a sinusoid plus narrow-band Gaussian noise, and
   •  the received signal envelope in a multipath situation.

Here

   f_{X,Y}(x, y) = (1/2πσ²) e^{ −[ (x − μ_X)² + (y − μ_Y)² ] / 2σ² }

and Z = √(X² + Y²). We have shown that

   f_Z(z) = ∫_0^{2π} f_{X,Y}(z cos θ, z sin θ) z dθ

Suppose μ_X = μ cos φ and μ_Y = μ sin φ. Then

   f_{X,Y}(z cos θ, z sin θ) = (1/2πσ²) e^{ −[ (z cos θ − μ cos φ)² + (z sin θ − μ sin φ)² ] / 2σ² }
                             = (1/2πσ²) e^{ −(z² + μ²)/2σ² } e^{ z μ cos(θ − φ)/σ² }

   ∴ f_Z(z) = ∫_0^{2π} (1/2πσ²) e^{ −(z² + μ²)/2σ² } e^{ z μ cos(θ − φ)/σ² } z dθ
            = [ z e^{ −(z² + μ²)/2σ² } / (2πσ²) ] ∫_0^{2π} e^{ z μ cos(θ − φ)/σ² } dθ
            = (z/σ²) e^{ −(z² + μ²)/2σ² } I₀(z μ/σ²)        z ≥ 0

where I₀(·) is the modified Bessel function of the first kind of order zero. This is the
Rician density function.

Joint Probability Density Function of two functions of two random variables

We consider the transformation (g₁, g₂) : R² → R². We have to find the joint
probability density function f_{Z₁,Z₂}(z₁, z₂) where Z₁ = g₁(X, Y) and Z₂ = g₂(X, Y).
Suppose the inverse mapping relation is

   x = h₁(z₁, z₂)   and   y = h₂(z₁, z₂)

Consider a differential region of area dz₁ dz₂ at the point (z₁, z₂) in the Z₁–Z₂ plane.

Let us see how the corners of the differential region are mapped to the X–Y plane.
Observe that

   h₁(z₁ + dz₁, z₂) = h₁(z₁, z₂) + (∂h₁/∂z₁) dz₁ = x + (∂h₁/∂z₁) dz₁
   h₂(z₁ + dz₁, z₂) = h₂(z₁, z₂) + (∂h₂/∂z₁) dz₁ = y + (∂h₂/∂z₁) dz₁

Therefore the point (z₁ + dz₁, z₂) is mapped to the point
( x + (∂h₁/∂z₁) dz₁, y + (∂h₂/∂z₁) dz₁ ) in the X–Y plane.

[Figure: the corner (z₁ + dz₁, z₂) of the differential rectangle in the Z₁–Z₂ plane and its image in the X–Y plane.]

We can similarly find the points in the X–Y plane corresponding to (z₁, z₂ + dz₂) and
(z₁ + dz₁, z₂ + dz₂). The mapping is shown in the figure below. We notice that each differential
rectangle in the Z₁–Z₂ plane maps to a parallelogram in the X–Y plane. It can be shown that the
differential parallelogram at (x, y) has area |J(z₁, z₂)| dz₁ dz₂, where J(z₁, z₂) is the Jacobian
of the inverse transformation, defined as the determinant

   J(z₁, z₂) = det [ ∂h₁/∂z₁   ∂h₁/∂z₂
                     ∂h₂/∂z₁   ∂h₂/∂z₂ ]

Further, it can be shown that the absolute values of the Jacobians of the forward and the
inverse transformations are reciprocals of each other, so that

   |J(z₁, z₂)| = 1 / |J(x, y)|

where

   J(x, y) = det [ ∂g₁/∂x   ∂g₁/∂y
                   ∂g₂/∂x   ∂g₂/∂y ]

Therefore, the differential parallelogram in the figure has an area of dz₁ dz₂ / |J(x, y)|.

Suppose the transformation z₁ = g₁(x, y), z₂ = g₂(x, y) has n roots and let
(x_i, y_i), i = 1, 2, .., n be the roots. The inverse image of the differential region in the
Z₁–Z₂ plane then consists of n differential regions corresponding to the n roots. The inverse
mapping is illustrated in the following figure for n = 4. As these parallelograms are
non-overlapping,

   f_{Z₁,Z₂}(z₁, z₂) dz₁ dz₂ = Σ_{i=1}^{n} f_{X,Y}(x_i, y_i) dz₁ dz₂ / |J(x_i, y_i)|

   ∴ f_{Z₁,Z₂}(z₁, z₂) = Σ_{i=1}^{n} f_{X,Y}(x_i, y_i) / |J(x_i, y_i)|
Remark
•  If z₁ = g₁(x, y), z₂ = g₂(x, y) has no root at (z₁, z₂), then f_{Z₁,Z₂}(z₁, z₂) = 0.

[Figure: the differential rectangle at (z₁, z₂) in the Z₁–Z₂ plane and its image parallelogram
in the X–Y plane with corners (x, y), (x + (∂x/∂z₁)dz₁, y + (∂y/∂z₁)dz₁),
(x + (∂x/∂z₂)dz₂, y + (∂y/∂z₂)dz₂) and (x + (∂x/∂z₁)dz₁ + (∂x/∂z₂)dz₂, y + (∂y/∂z₁)dz₁ + (∂y/∂z₂)dz₂).]

Example: pdf of a linear transformation

   Z₁ = aX + bY
   Z₂ = cX + dY

Then

   x = (d z₁ − b z₂)/(ad − bc),   y = (a z₂ − c z₁)/(ad − bc)

   J(x, y) = det [ a  b
                   c  d ] = ad − bc

so that f_{Z₁,Z₂}(z₁, z₂) = f_{X,Y}(x, y) / |ad − bc| evaluated at the above (x, y).

Example: Suppose X and Y are two independent Gaussian random variables each with mean 0 and
variance σ². Given R = √(X² + Y²) and θ = tan⁻¹(Y/X), find f_R(r) and f_θ(θ).

Solution:

We have x = r cos θ and y = r sin θ, so that

   r² = x² + y²   ............. (1)
   tan θ = y/x    ............. (2)

From (1)
   ∂r/∂x = x/r = cos θ
   ∂r/∂y = y/r = sin θ

From (2)
   ∂θ/∂x = −y/(x² + y²) = −sin θ / r
   ∂θ/∂y =  x/(x² + y²) =  cos θ / r

   ∴ J = det [ cos θ        sin θ
               −sin θ / r   cos θ / r ] = 1/r

   ∴ f_{R,θ}(r, θ) = f_{X,Y}(x, y) / |J|   evaluated at x = r cos θ, y = r sin θ
                   = (r/2πσ²) e^{−r² cos²θ/2σ²} e^{−r² sin²θ/2σ²}
                   = (r/2πσ²) e^{−r²/2σ²}

   ∴ f_R(r) = ∫_0^{2π} f_{R,θ}(r, θ) dθ
            = (r/σ²) e^{−r²/2σ²}          0 ≤ r < ∞

   f_θ(θ) = ∫_0^{∞} f_{R,θ}(r, θ) dr
          = (1/2πσ²) ∫_0^{∞} r e^{−r²/2σ²} dr
          = 1/2π                          0 ≤ θ ≤ 2π

Thus R is Rayleigh distributed, θ is uniformly distributed, and since f_{R,θ}(r, θ) = f_R(r) f_θ(θ), R and θ are independent.
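A quick simulation sketch of this result (sample size, seed and σ are illustrative):

import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
n = 100_000
x = sigma * rng.standard_normal(n)
y = sigma * rng.standard_normal(n)

r = np.hypot(x, y)                  # R = sqrt(X^2 + Y^2)
theta = np.arctan2(y, x)            # angle, spread over the full circle

print(r.mean(), sigma * np.sqrt(np.pi / 2))   # Rayleigh mean: sigma*sqrt(pi/2)
print(theta.min(), theta.max())               # uniformly spread over (-pi, pi]
print(np.corrcoef(r, theta)[0, 1])            # ≈ 0, consistent with independence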

Rician Distribution (via transformation of two random variables):

•  X and Y are independent Gaussian random variables with non-zero means μ_X and μ_Y
   respectively and a common variance σ².
•  We have to find the density function of the random variable Z = √(X² + Y²).
•  Examples: the envelope of a sinusoid plus narrow-band Gaussian noise; the received
   signal envelope in a multipath situation.

Consider the transformation

   Z = √(X² + Y²),      φ = tan⁻¹(Y/X)

We have to find J(x, y) corresponding to (z, φ).

From z² = x² + y² and tan φ = y/x,

   ∂z/∂x = x/z = cos φ,                     ∂z/∂y = y/z = sin φ
   ∂φ/∂x = −y/(x² + y²) = −sin φ / z,       ∂φ/∂y = x/(x² + y²) = cos φ / z

   ∴ J(x, y) = det [ cos φ        sin φ
                     −sin φ / z   cos φ / z ] = 1/z

Now

   f_{X,Y}(x, y) = (1/2πσ²) e^{ −[ (x − μ_X)² + (y − μ_Y)² ] / 2σ² }

We have to evaluate this density at the (x, y) corresponding to (z, φ). Writing
μ_X = μ cos φ₀ and μ_Y = μ sin φ₀, we have

   x − μ_X = z cos φ − μ cos φ₀
   y − μ_Y = z sin φ − μ sin φ₀

so that

   f_{X,Y}(x, y) = (1/2πσ²) e^{ −[ (z cos φ − μ cos φ₀)² + (z sin φ − μ sin φ₀)² ] / 2σ² }
                 = (1/2πσ²) e^{ −( z² − 2 z μ cos(φ − φ₀) + μ² ) / 2σ² }
                 = (1/2πσ²) e^{ −(z² + μ²)/2σ² } e^{ z μ cos(φ − φ₀)/σ² }

and therefore

   f_{Z,φ}(z, φ) = f_{X,Y}(x, y) / |J(x, y)| = (z/2πσ²) e^{ −(z² + μ²)/2σ² } e^{ z μ cos(φ − φ₀)/σ² }
Expected Values of Functions of Random Variables
Recall that
• If Y = g ( X ) is a function of a continuous random variable X , then

EY = Eg ( X ) = ∫ g ( x) f X ( x)dx
−∞

• If Y = g ( X ) is a function of a discrete random variable X , then


EY = Eg ( X ) = ∑ g ( x) p X ( x)
x∈RX

Suppose Z = g(X, Y) is a function of continuous random variables X and Y. Then
the expected value of Z is given by

   EZ = Eg(X, Y) = ∫_{-∞}^{∞} z f_Z(z) dz
                 = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy

Thus EZ can be computed without explicitly determining f_Z(z).

We can establish the above result as follows.
Suppose Z = g(X, Y) has n roots (x_i, y_i), i = 1, 2, .., n at Z = z. Then

   {z < Z ≤ z + Δz} = ∪_{i=1}^{n} {(x_i, y_i) ∈ ΔD_i}

where ΔD_i is the differential region containing (x_i, y_i). The mapping is illustrated in the
figure below for n = 3.

   P({z < Z ≤ z + Δz}) = f_Z(z) Δz = Σ_{(x_i, y_i)∈ΔD_i} f_{X,Y}(x_i, y_i) Δx_i Δy_i

   ∴ z f_Z(z) Δz = Σ_{(x_i, y_i)∈ΔD_i} z f_{X,Y}(x_i, y_i) Δx_i Δy_i
                 = Σ_{(x_i, y_i)∈ΔD_i} g(x_i, y_i) f_{X,Y}(x_i, y_i) Δx_i Δy_i

As z is varied over the entire Z axis, the corresponding (non-overlapping)
differential regions in the X–Y plane cover the entire plane.

   ∴ ∫_{-∞}^{∞} z f_Z(z) dz = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy

Thus,

   Eg(X, Y) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy

[Figure: the interval {z < Z ≤ z + Δz} on the Z axis and the corresponding differential regions ΔD₁, ΔD₂, ΔD₃ in the X–Y plane.]

If Z = g(X, Y) is a function of discrete random variables X and Y, we can
similarly show that

   EZ = Eg(X, Y) = Σ_{(x, y)∈R_X×R_Y} g(x, y) p_{X,Y}(x, y)

Example: The joint pdf of two random variables X and Y is given by

   f_{X,Y}(x, y) = (1/4) x y     0 ≤ x ≤ 2, 0 ≤ y ≤ 2
                 = 0             otherwise

Find the joint expectation of g(X, Y) = X²Y.

   Eg(X, Y) = E(X²Y)
            = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy
            = ∫_0^2 ∫_0^2 x² y · (1/4) x y dx dy
            = (1/4) ∫_0^2 x³ dx ∫_0^2 y² dy
            = (1/4) × 4 × (8/3)
            = 8/3
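A one-line symbolic check of this joint expectation, assuming SymPy is available:

import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = x * y / 4                                            # joint pdf on [0, 2] x [0, 2]
expectation = sp.integrate(x**2 * y * f, (x, 0, 2), (y, 0, 2))
print(expectation)                                       # 8/3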
Example: If Z = aX + bY, where a and b are constants, then

   EZ = aEX + bEY

Proof:

   EZ = ∫_{-∞}^{∞} ∫_{-∞}^{∞} (ax + by) f_{X,Y}(x, y) dx dy
      = ∫_{-∞}^{∞} ∫_{-∞}^{∞} a x f_{X,Y}(x, y) dx dy + ∫_{-∞}^{∞} ∫_{-∞}^{∞} b y f_{X,Y}(x, y) dx dy
      = a ∫_{-∞}^{∞} x [ ∫_{-∞}^{∞} f_{X,Y}(x, y) dy ] dx + b ∫_{-∞}^{∞} y [ ∫_{-∞}^{∞} f_{X,Y}(x, y) dx ] dy
      = a ∫_{-∞}^{∞} x f_X(x) dx + b ∫_{-∞}^{∞} y f_Y(y) dy
      = aEX + bEY

Thus, expectation is a linear operator.
Example:
Consider the discrete random variables X and Y discussed in the earlier example. The
joint probability mass function of the random variables is tabulated below.
Find the joint expectation of g(X, Y) = XY.

   Y \ X      0       1       2      p_Y(y)
     0       0.25    0.10    0.15     0.5
     1       0.14    0.35    0.01     0.5
   p_X(x)    0.39    0.45    0.16

Clearly

   EXY = Σ_{(x, y)∈R_X×R_Y} x y p_{X,Y}(x, y)
       = 1×1×0.35 + 1×2×0.01
       = 0.37

Remark
(1) We have shown above that expectation is a linear operator. More generally,

       E[a₁ g₁(X, Y) + a₂ g₂(X, Y)] = a₁ Eg₁(X, Y) + a₂ Eg₂(X, Y)

    Thus E(XY + 5 logₑ XY) = EXY + 5 E logₑ XY.

(2) If X and Y are independent random variables and g(X, Y) = g₁(X) g₂(Y), then

       Eg(X, Y) = E[g₁(X) g₂(Y)]
                = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g₁(x) g₂(y) f_{X,Y}(x, y) dx dy
                = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g₁(x) g₂(y) f_X(x) f_Y(y) dx dy
                = ∫_{-∞}^{∞} g₁(x) f_X(x) dx ∫_{-∞}^{∞} g₂(y) f_Y(y) dy
                = Eg₁(X) Eg₂(Y)
Joint Moments of Random Variables

Just as the moments of a random variable provide a summary description of the
random variable, the joint moments provide a summary description of two random
variables. For two continuous random variables X and Y, the joint moment of order m + n is
defined as

   E(XᵐYⁿ) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} xᵐ yⁿ f_{X,Y}(x, y) dx dy

and the joint central moment of order m + n is defined as

   E[(X − μ_X)ᵐ (Y − μ_Y)ⁿ] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} (x − μ_X)ᵐ (y − μ_Y)ⁿ f_{X,Y}(x, y) dx dy

where μ_X = EX and μ_Y = EY.

Remark
(1) If X and Y are discrete random variables, the joint moments of order m + n are
    defined as

       E(XᵐYⁿ) = Σ_{(x, y)∈R_{X,Y}} xᵐ yⁿ p_{X,Y}(x, y)
       E[(X − μ_X)ᵐ (Y − μ_Y)ⁿ] = Σ_{(x, y)∈R_{X,Y}} (x − μ_X)ᵐ (y − μ_Y)ⁿ p_{X,Y}(x, y)

(2) If m = 1 and n = 1, we have the second-order moment of the random variables
    X and Y given by

       E(XY) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} x y f_{X,Y}(x, y) dx dy     if X and Y are continuous
             = Σ_{(x, y)∈R_{X,Y}} x y p_{X,Y}(x, y)               if X and Y are discrete

(3) If X and Y are independent, E(XY) = EX EY.

Covariance of two random variables


The covariance of two random variables X and Y is defined as
Cov ( X , Y ) = E ( X − µ X )(Y − µY )

Expanding the right-hand side, we get


Cov( X , Y ) = E ( X − µ X )(Y − µY )
= E ( XY − µY X − µ X Y + µ X µY )
= EXY − µY EX − µ X EY + µ X µY
= EXY − µ X µY

The ratio ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y) is called the correlation coefficient. We will give an
interpretation of Cov(X, Y) and ρ(X, Y) later on.


We will show that |ρ(X, Y)| ≤ 1. To establish the relation, we prove the following result:

For two random variables X and Y,   E²(XY) ≤ EX² EY²

Proof:
Consider the random variable Z = aX + Y. Then

   E(aX + Y)² ≥ 0
   ⇒ a²EX² + 2aEXY + EY² ≥ 0

Non-negativity of the left-hand side implies that its minimum over a must also be non-negative.
For the minimum value,

   d(EZ²)/da = 0  ⇒  a = −EXY / EX²

and the corresponding minimum is

   E²(XY)/EX² − 2E²(XY)/EX² + EY² = EY² − E²(XY)/EX²

The minimum being non-negative gives

   EY² − E²(XY)/EX² ≥ 0
   ⇒ E²(XY) ≤ EX² EY²
   ⇒ |EXY| ≤ √(EX² EY²)

Now, applying this result to the zero-mean random variables X − μ_X and Y − μ_Y,

   |ρ(X, Y)| = |E(X − μ_X)(Y − μ_Y)| / √( E(X − μ_X)² E(Y − μ_Y)² )
             ≤ √( E(X − μ_X)² E(Y − μ_Y)² ) / √( E(X − μ_X)² E(Y − μ_Y)² )
             = 1

   ∴ |ρ(X, Y)| ≤ 1

Uncorrelated random variables


Two random variables X and Y are called uncorrelated if
Cov ( X , Y ) = 0
which also means
E ( XY ) =µ X µY

Recall that if X and Y are independent random variables, then f_{X,Y}(x, y) = f_X(x) f_Y(y).

Then, assuming X and Y are continuous,

   EXY = ∫_{-∞}^{∞} ∫_{-∞}^{∞} x y f_{X,Y}(x, y) dx dy
       = ∫_{-∞}^{∞} ∫_{-∞}^{∞} x y f_X(x) f_Y(y) dx dy
       = ∫_{-∞}^{∞} x f_X(x) dx ∫_{-∞}^{∞} y f_Y(y) dy
       = EX EY

Thus two independent random variables are always uncorrelated. The converse is not
always true: two random variables may be dependent and still be uncorrelated. If
there exists correlation between two random variables, one may be represented as
a linear regression on the other. We discuss this point in the next section.

Linear prediction of Y from X

Consider the linear predictor (regression) Ŷ = aX + b.
Prediction error: Y − Ŷ
Mean-square prediction error:

   E(Y − Ŷ)² = E(Y − aX − b)²

Minimising this error with respect to a and b gives the optimal values. At the optimum,

   ∂/∂a E(Y − aX − b)² = 0
   ∂/∂b E(Y − aX − b)² = 0

Solving for a and b, the optimal predictor satisfies

   Ŷ − μ_Y = (σ_{X,Y}/σ_X²)(x − μ_X)

so that

   Ŷ − μ_Y = ρ_{X,Y} (σ_Y/σ_X)(x − μ_X)

where ρ_{X,Y} = σ_{XY}/(σ_X σ_Y) is the correlation coefficient.

[Figure: the regression line of Ŷ − μ_Y against x − μ_X, with slope ρ_{X,Y} σ_Y/σ_X.]

Remark
If ρ_{X,Y} > 0, then X and Y are called positively correlated.
If ρ_{X,Y} < 0, then X and Y are called negatively correlated.
If ρ_{X,Y} = 0, then X and Y are uncorrelated; in that case Ŷ − μ_Y = 0, so
Ŷ = μ_Y is the best (linear) prediction.
Note that independence => Uncorrelatedness. But uncorrelated generally does not imply
independence (except for jointly Gaussian random variables).

Example:
Let Y = X², where X is uniformly distributed between (−1, 1).
X and Y are dependent, but they are uncorrelated, because

   Cov(X, Y) = σ_{XY} = E(X − μ_X)(Y − μ_Y)
             = EXY − EX EY
             = EX³ − EX EY = 0          (∵ EX = 0 and EX³ = 0)

In fact, for any zero-mean symmetric distribution of X, X and X² are uncorrelated.
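A quick Monte Carlo sketch of this example (sample size and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])      # ≈ 0: X and Y = X^2 are uncorrelated
print(np.corrcoef(x ** 2, y)[0, 1]) # = 1: yet Y is a deterministic function of X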


Jointly Gaussian Random Variables

Many random variables occurring in practice are modeled as jointly Gaussian random variables.
For example, the noise in a communication system is modeled as jointly Gaussian.

Two random variables X and Y are called jointly Gaussian if their joint density function is

   f_{X,Y}(x, y) = [1 / (2πσ_X σ_Y √(1 − ρ²_{X,Y}))]
                   × exp{ −1/(2(1 − ρ²_{X,Y})) [ (x − μ_X)²/σ_X² − 2ρ_{X,Y}(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y² ] }

   −∞ < x < ∞, −∞ < y < ∞

The joint pdf is determined by 5 parameters:

   - the means μ_X and μ_Y
   - the variances σ_X² and σ_Y²
   - the correlation coefficient ρ_{X,Y}

We denote the jointly Gaussian random variables X and Y with these parameters as

   (X, Y) ~ N(μ_X, μ_Y, σ_X², σ_Y², ρ_{X,Y})

The pdf has a bell shape centred at (μ_X, μ_Y), as shown in the figure below. The variances
σ_X² and σ_Y² determine the spread of the pdf surface and ρ_{X,Y} determines the orientation of the
surface in the X–Y plane.
Properties of jointly Gaussian random variables

(1) If X and Y are jointly Gaussian, then X and Y are each (marginally) Gaussian.

We have

   f_X(x) = ∫_{-∞}^{∞} f_{X,Y}(x, y) dy

Completing the square in y, the exponent of the joint pdf can be written as

   −(x − μ_X)²/2σ_X²  −  [ y − μ_Y − ρ_{X,Y}(σ_Y/σ_X)(x − μ_X) ]² / (2σ_Y²(1 − ρ²_{X,Y}))

so that

   f_X(x) = (1/√(2π) σ_X) e^{−(x − μ_X)²/2σ_X²}
            × ∫_{-∞}^{∞} [ 1/(√(2π) σ_Y √(1 − ρ²_{X,Y})) ] e^{ −[ y − μ_Y − ρ_{X,Y}(σ_Y/σ_X)(x − μ_X) ]² / (2σ_Y²(1 − ρ²_{X,Y})) } dy
          = (1/√(2π) σ_X) e^{−(x − μ_X)²/2σ_X²}

because the remaining integral is the total area under a Gaussian pdf, which is 1.

Similarly,

   f_Y(y) = (1/√(2π) σ_Y) e^{−(y − μ_Y)²/2σ_Y²}

(2) The converse of the above result is not true. If each of X and Y is Gaussian, X and Y
are not necessarily jointly Gaussian. As a counterexample (taking μ_X = μ_Y = 0 for
simplicity of the argument), suppose

   f_{X,Y}(x, y) = [1/(2πσ_X σ_Y)] e^{ −(1/2)[ x²/σ_X² + y²/σ_Y² ] } (1 + sin x sin y)

f_{X,Y}(x, y) in this example is non-Gaussian, yet it qualifies as a joint pdf, because
f_{X,Y}(x, y) ≥ 0 and

   ∫_{-∞}^{∞} ∫_{-∞}^{∞} [1/(2πσ_X σ_Y)] e^{ −(1/2)[ x²/σ_X² + y²/σ_Y² ] } (1 + sin x sin y) dy dx
      = ∫_{-∞}^{∞} ∫_{-∞}^{∞} [1/(2πσ_X σ_Y)] e^{ −(1/2)[ x²/σ_X² + y²/σ_Y² ] } dy dx
        + [1/(2πσ_X σ_Y)] ∫_{-∞}^{∞} e^{−x²/2σ_X²} sin x dx ∫_{-∞}^{∞} e^{−y²/2σ_Y²} sin y dy
      = 1 + 0          (each of the last two integrands is an odd function, so the integrals vanish)
      = 1

The marginal density f_X(x) is given by

   f_X(x) = ∫_{-∞}^{∞} [1/(2πσ_X σ_Y)] e^{ −(1/2)[ x²/σ_X² + y²/σ_Y² ] } (1 + sin x sin y) dy
          = (1/√(2π) σ_X) e^{−x²/2σ_X²}
            + [1/(2πσ_X σ_Y)] e^{−x²/2σ_X²} sin x ∫_{-∞}^{∞} e^{−y²/2σ_Y²} sin y dy
          = (1/√(2π) σ_X) e^{−x²/2σ_X²} + 0          (odd function in y)
          = (1/√(2π) σ_X) e^{−x²/2σ_X²}

Similarly,

   f_Y(y) = (1/√(2π) σ_Y) e^{−y²/2σ_Y²}

Thus X and Y are each Gaussian, but not jointly Gaussian.


(3) If X and Y are jointly Gaussian, then for any constants a and b the random variable
Z = aX + bY is Gaussian with mean μ_Z = aμ_X + bμ_Y and variance
σ_Z² = a²σ_X² + b²σ_Y² + 2ab σ_X σ_Y ρ_{X,Y}.

(4) Two jointly Gaussian RVs X and Y are independent if and only if X and Y are
uncorrelated (ρ_{X,Y} = 0). Observe that if X and Y are uncorrelated, then

   f_{X,Y}(x, y) = [1/(2πσ_X σ_Y)] e^{ −(1/2)[ (x − μ_X)²/σ_X² + (y − μ_Y)²/σ_Y² ] }
                 = (1/√(2π) σ_X) e^{−(x − μ_X)²/2σ_X²} · (1/√(2π) σ_Y) e^{−(y − μ_Y)²/2σ_Y²}
                 = f_X(x) f_Y(y)
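A small simulation sketch of the jointly Gaussian model (all parameter values are illustrative): pairs with a prescribed correlation are generated with a Cholesky factor of the covariance matrix, and the empirical means, variances and correlation are checked.

import numpy as np

rng = np.random.default_rng(3)
mu_x, mu_y = 1.0, -2.0
sig_x, sig_y, rho = 2.0, 1.0, 0.7

cov = np.array([[sig_x**2,            rho * sig_x * sig_y],
                [rho * sig_x * sig_y, sig_y**2           ]])
L = np.linalg.cholesky(cov)

n = 100_000
w = rng.standard_normal((2, n))            # independent N(0, 1) pairs
xy = (L @ w) + np.array([[mu_x], [mu_y]])  # correlated jointly Gaussian pairs
x, y = xy

print(x.mean(), y.mean())                  # ≈ mu_X, mu_Y
print(x.var(), y.var())                    # ≈ sig_X^2, sig_Y^2
print(np.corrcoef(x, y)[0, 1])             # ≈ rho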

Joint Characteristic Functions of Two Random Variables

The joint characteristic function of two random variables X and Y is defined by

   φ_{X,Y}(ω₁, ω₂) = E e^{jω₁X + jω₂Y}

If X and Y are jointly continuous random variables, then

   φ_{X,Y}(ω₁, ω₂) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} f_{X,Y}(x, y) e^{jω₁x + jω₂y} dy dx

Note that φ_{X,Y}(ω₁, ω₂) is the same as the two-dimensional Fourier transform with the basis function
e^{jω₁x + jω₂y} instead of e^{−(jω₁x + jω₂y)}.

f_{X,Y}(x, y) is related to the joint characteristic function by the Fourier inversion formula

   f_{X,Y}(x, y) = (1/4π²) ∫_{-∞}^{∞} ∫_{-∞}^{∞} φ_{X,Y}(ω₁, ω₂) e^{−jω₁x − jω₂y} dω₁ dω₂

If X and Y are discrete random variables, we can define the joint characteristic function in terms
of the joint probability mass function as follows:

   φ_{X,Y}(ω₁, ω₂) = Σ_{(x, y)∈R_X×R_Y} p_{X,Y}(x, y) e^{jω₁x + jω₂y}

Properties of Joint Characteristic Functions of Two Random Variables

The joint characteristic function has properties similar to those of the characteristic function
of a single random variable. We can easily establish the following properties:

1. φ_X(ω) = φ_{X,Y}(ω, 0)
2. φ_Y(ω) = φ_{X,Y}(0, ω)
3. If X and Y are independent random variables, then

       φ_{X,Y}(ω₁, ω₂) = E e^{jω₁X + jω₂Y}
                       = E( e^{jω₁X} e^{jω₂Y} )
                       = E e^{jω₁X} E e^{jω₂Y}
                       = φ_X(ω₁) φ_Y(ω₂)

4. We have

       φ_{X,Y}(ω₁, ω₂) = E e^{jω₁X + jω₂Y}
                       = E( 1 + j(ω₁X + ω₂Y) + j²(ω₁X + ω₂Y)²/2! + .... )
                       = 1 + jω₁EX + jω₂EY + j²ω₁²EX²/2 + j²ω₂²EY²/2 + j²ω₁ω₂EXY + .....

   Hence,

       φ_{X,Y}(0, 0) = 1
       EX  = (1/j) ∂φ_{X,Y}(ω₁, ω₂)/∂ω₁ |_{ω₁=0, ω₂=0}
       EY  = (1/j) ∂φ_{X,Y}(ω₁, ω₂)/∂ω₂ |_{ω₁=0, ω₂=0}
       EXY = (1/j²) ∂²φ_{X,Y}(ω₁, ω₂)/∂ω₁∂ω₂ |_{ω₁=0, ω₂=0}

   In general, the (m + n)th-order joint moment is given by

       E(XᵐYⁿ) = (1/j^{m+n}) ∂^{m+n} φ_{X,Y}(ω₁, ω₂)/∂ω₁ᵐ∂ω₂ⁿ |_{ω₁=0, ω₂=0}

Example  Joint characteristic function of the jointly Gaussian random variables X and Y with the
joint pdf

   f_{X,Y}(x, y) = [1/(2πσ_X σ_Y √(1 − ρ²_{X,Y}))]
                   × exp{ −1/(2(1 − ρ²_{X,Y})) [ ((x − μ_X)/σ_X)² − 2ρ_{X,Y}((x − μ_X)/σ_X)((y − μ_Y)/σ_Y) + ((y − μ_Y)/σ_Y)² ] }

Let us first recall the characteristic function of a Gaussian random variable X ~ N(μ_X, σ_X²):

   φ_X(ω) = E e^{jωX}
          = ∫_{-∞}^{∞} (1/√(2π) σ_X) e^{−(x − μ_X)²/2σ_X²} e^{jωx} dx
          = e^{jμ_Xω − σ_X²ω²/2} ∫_{-∞}^{∞} (1/√(2π) σ_X) e^{−(x − μ_X − jσ_X²ω)²/2σ_X²} dx     (completing the square)
          = e^{jμ_Xω − σ_X²ω²/2}                                                                (area under a Gaussian = 1)

If X and Y are jointly Gaussian with the above joint pdf, we can similarly show that

   φ_{X,Y}(ω₁, ω₂) = E e^{j(Xω₁ + Yω₂)}
                   = e^{ jμ_Xω₁ + jμ_Yω₂ − (1/2)(σ_X²ω₁² + 2ρ_{X,Y}σ_Xσ_Yω₁ω₂ + σ_Y²ω₂²) }

We can use the joint characteristic function to simplify probabilistic analysis, as illustrated
below:

Example 2  Suppose Z = aX + bY. Then

   φ_Z(ω) = E e^{jωZ} = E e^{j(aX + bY)ω} = φ_{X,Y}(aω, bω)

If X and Y are jointly Gaussian, then

   φ_Z(ω) = φ_{X,Y}(aω, bω)
          = e^{ j(aμ_X + bμ_Y)ω − (1/2)(a²σ_X² + 2ab ρ_{X,Y}σ_Xσ_Y + b²σ_Y²)ω² }

which is the characteristic function of a Gaussian random variable with

   mean     μ_Z  = aμ_X + bμ_Y   and
   variance σ_Z² = a²σ_X² + 2ab ρ_{X,Y}σ_Xσ_Y + b²σ_Y²

Example 3  If Z = X + Y and X and Y are independent, then

   φ_Z(ω) = φ_{X,Y}(ω, ω)
          = φ_X(ω) φ_Y(ω)

Using the convolution property of the Fourier transform, we get

   f_Z(z) = f_X(z) * f_Y(z)
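A Monte Carlo sketch of Example 2 (the values of a, b and the Gaussian parameters are illustrative):

import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, 2.0])
sig_x, sig_y, rho = 1.5, 0.5, -0.4
cov = np.array([[sig_x**2,            rho * sig_x * sig_y],
                [rho * sig_x * sig_y, sig_y**2           ]])

a, b = 2.0, 3.0
xy = rng.multivariate_normal(mu, cov, size=200_000)
z = a * xy[:, 0] + b * xy[:, 1]

print(z.mean(), a * mu[0] + b * mu[1])                                   # mu_Z
print(z.var(), a**2 * sig_x**2 + 2*a*b*rho*sig_x*sig_y + b**2 * sig_y**2)  # sigma_Z^2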
Conditional Expectation

Recall that

•  If X and Y are continuous random variables, then the conditional density
   function of Y given X = x is given by
       f_{Y/X}(y/x) = f_{X,Y}(x, y) / f_X(x)

•  If X and Y are discrete random variables, then the conditional probability mass function of
   Y given X = x is given by
       p_{Y/X}(y/x) = p_{X,Y}(x, y) / p_X(x)

The conditional expectation of Y given X = x is defined by

   μ_{Y/X=x} = E(Y / X = x) = ∫_{-∞}^{∞} y f_{Y/X}(y/x) dy        if X and Y are continuous
                            = Σ_{y∈R_Y} y p_{Y/X}(y/x)            if X and Y are discrete

Remark
•  The conditional expectation of Y given X = x is also called the conditional mean
   of Y given X = x.
•  We can similarly define the conditional expectation of X given Y = y, denoted by E(X / Y = y).
•  Higher-order conditional moments can be defined in a similar manner.
•  In particular, the conditional variance of Y given X = x is given by
       σ²_{Y/X=x} = E[ (Y − μ_{Y/X=x})² / X = x ]

Example:
Consider the discrete random variables X and Y discussed in the earlier example. The
joint probability mass function of the random variables is tabulated below.
Find the conditional expectation E(Y / X = 2).

   Y \ X      0       1       2      p_Y(y)
     0       0.25    0.10    0.15     0.5
     1       0.14    0.35    0.01     0.5
   p_X(x)    0.39    0.45    0.16

The conditional probability mass function is given by

   p_{Y/X}(y/2) = p_{X,Y}(2, y) / p_X(2)

   ∴ p_{Y/X}(0/2) = p_{X,Y}(2, 0) / p_X(2) = 0.15/0.16 = 15/16
     p_{Y/X}(1/2) = p_{X,Y}(2, 1) / p_X(2) = 0.01/0.16 = 1/16

   E(Y / X = 2) = 0 × p_{Y/X}(0/2) + 1 × p_{Y/X}(1/2) = 1/16

Example
Suppose X and Y are jointly uniform random variables with the joint probability
density function given by

   f_{X,Y}(x, y) = 1/2     x ≥ 0, y ≥ 0, x + y ≤ 2
                 = 0       otherwise

Find E(Y / X = x).

The pdf equals 1/2 on the triangular region shown in the figure. We have

   f_X(x) = ∫_0^{2−x} f_{X,Y}(x, y) dy
          = ∫_0^{2−x} (1/2) dy
          = (2 − x)/2           0 ≤ x ≤ 2

   ∴ f_{Y/X}(y/x) = f_{X,Y}(x, y) / f_X(x) = 1/(2 − x)        0 ≤ y ≤ 2 − x

   ∴ E(Y / X = x) = ∫_{-∞}^{∞} y f_{Y/X}(y/x) dy
                  = ∫_0^{2−x} y · 1/(2 − x) dy
                  = (2 − x)/2

Example

Suppose X and Y are jointly Gaussian random variables with the joint probability
density function

   f_{X,Y}(x, y) = [1/(2πσ_X σ_Y √(1 − ρ²_{X,Y}))]
                   × exp{ −1/(2(1 − ρ²_{X,Y})) [ (x − μ_X)²/σ_X² − 2ρ_{X,Y}(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y² ] }

We have to find E(Y / X = x).

We have

   f_{Y/X}(y/x) = f_{X,Y}(x, y) / f_X(x)
                = [1/(√(2π) σ_Y √(1 − ρ²_{X,Y}))]
                  × exp{ −[ (y − μ_Y) − ρ_{X,Y}(σ_Y/σ_X)(x − μ_X) ]² / (2σ_Y²(1 − ρ²_{X,Y})) }

(dividing the joint pdf by f_X(x) = (1/√(2π) σ_X) e^{−(x − μ_X)²/2σ_X²} and completing the square).

Therefore,

   E(Y / X = x) = ∫_{-∞}^{∞} y [1/(√(2π) σ_Y √(1 − ρ²_{X,Y}))]
                  e^{ −[ (y − μ_Y) − ρ_{X,Y}(σ_Y/σ_X)(x − μ_X) ]² / (2σ_Y²(1 − ρ²_{X,Y})) } dy
                = μ_Y + ρ_{X,Y}(σ_Y/σ_X)(x − μ_X)

since the integrand is a Gaussian density with mean μ_Y + ρ_{X,Y}(σ_Y/σ_X)(x − μ_X).
Conditional Expectation as a random variable

Note that E(Y / X = x) is a function of x. Using this function, we may define a
random variable φ(X) = E(Y / X). Thus we may consider E(Y / X) as a function of the random
variable X, and E(Y / X = x) as the value of E(Y / X) at X = x.

An important property is

   E[ E(Y / X) ] = EY

Proof:

   E[ E(Y / X) ] = ∫_{-∞}^{∞} E(Y / X = x) f_X(x) dx
                 = ∫_{-∞}^{∞} [ ∫_{-∞}^{∞} y f_{Y/X}(y/x) dy ] f_X(x) dx
                 = ∫_{-∞}^{∞} ∫_{-∞}^{∞} y f_X(x) f_{Y/X}(y/x) dy dx
                 = ∫_{-∞}^{∞} ∫_{-∞}^{∞} y f_{X,Y}(x, y) dy dx
                 = ∫_{-∞}^{∞} y [ ∫_{-∞}^{∞} f_{X,Y}(x, y) dx ] dy
                 = ∫_{-∞}^{∞} y f_Y(y) dy
                 = EY

Thus E[ E(Y / X) ] = EY and similarly E[ E(X / Y) ] = EX.

Bayesian Estimation theory and conditional expectation

Consider two random variables X and Y with joint pdf f_{X,Y}(x, y). Suppose
Y is observable and f_X(x) is known. We have to estimate X for a given observation Y = y in
some optimal sense. The prior knowledge that some values of X are more likely than others
(the a priori information) is represented by the prior density function f_X(x).
(In the following we omit the suffixes in the density functions, just for notational simplicity,
where there is no ambiguity.)

[Block diagram: a random variable X with density f_X(x) passes through a channel described by
f_{Y/X}(y/x), producing the observation Y = y.]

The conditional density function f_{Y/X}(y/x) is called the likelihood function in
estimation terminology, and

   f_{X,Y}(x, y) = f_X(x) f_{Y/X}(y/x)

Also we have Bayes' rule

   f_{X/Y}(x/y) = f_X(x) f_{Y/X}(y/x) / f_Y(y)

where f_{X/Y}(x/y) is the a posteriori density function.

Suppose the optimum estimator X̂(Y) is a function of the random variable Y that
minimizes the mean-square estimation error E( X̂(Y) − X )². Such an estimator is known as the
minimum mean-square error (MMSE) estimator.

The estimation problem is

   minimize  ∫_{-∞}^{∞} ∫_{-∞}^{∞} ( X̂(y) − x )² f_{X,Y}(x, y) dx dy    with respect to X̂(y).

This is equivalent to minimizing

   ∫_{-∞}^{∞} ∫_{-∞}^{∞} ( X̂(y) − x )² f_Y(y) f_{X/Y}(x/y) dx dy
      = ∫_{-∞}^{∞} [ ∫_{-∞}^{∞} ( X̂(y) − x )² f_{X/Y}(x/y) dx ] f_Y(y) dy

Since f_Y(y) is always positive, the above integral will be minimum if the inner integral is
minimum for each y. This results in the problem:

   minimize  ∫_{-∞}^{∞} ( X̂(y) − x )² f_{X/Y}(x/y) dx    with respect to X̂(y).

The minimum is given by

   ∂/∂X̂(y) ∫_{-∞}^{∞} ( X̂(y) − x )² f_{X/Y}(x/y) dx = 0
   ⇒ 2 ∫_{-∞}^{∞} ( X̂(y) − x ) f_{X/Y}(x/y) dx = 0
   ⇒ ∫_{-∞}^{∞} X̂(y) f_{X/Y}(x/y) dx = ∫_{-∞}^{∞} x f_{X/Y}(x/y) dx
   ⇒ X̂(y) = E(X / Y = y)

Thus the minimum mean-square error estimator is the conditional mean E(X / Y = y).
Multiple Random Variables
In many applications we have to deal with many random variables. For example, in the
navigation problem, the position of a space craft is represented by three random variables
denoting the x, y and z coordinates. The noise affecting the R, G, B channels of colour
video may be represented by three random variables. In such situations, it is convenient to
define the vector-valued random variables where each component of the vector is a
random variable.

In this lecture, we extend the concepts of joint random variables to the case of multiple
random variables. A generalized analysis will be presented for n random variables defined
on the same sample space.

Joint CDF of n random variables

Consider n random variables X 1 , X 2 ,.., X n defined on the same probability space


( S, F, P). We define the random vector X as the column vector

   X = [X₁  X₂  .  .  X_n]′

where ' indicates the transpose operation.


Thus an n − dimensional random vector X. is defined by the mapping X : S → R n . A
particular value of the random vector X. is denoted by

x=[ x1 x2 .. xn ]'.

The CDF of the random vector X. is defined as the joint CDF of X 1 , X 2 ,.., X n . Thus
FX1 , X 2 ,.., X n ( x1 , x2 ,..xn ) = FX (x)
= P ({ X 1 ≤ x1 , X 2 ≤ x2 ,.. X n ≤ xn })

Some of the most important properties of the joint CDF are listed below. These properties
are mere extensions of the properties of two joint random variables.

Properties of the joint CDF of n random variables

(a) FX1 , X 2 ,.. X n ( x1 , x2 ,..xn ) is a non-decreasing function of each of its arguments.


(b) FX1 , X 2 ,.. X n ( −∞, x2 ,..xn ) = FX1 , X 2 ,.. X n ( x1 , −∞,..xn ) = ... = FX1 , X 2 ,.. X n ( x1 , x2 ,.. − ∞ ) = 0
(c) FX1 , X 2 ,.. X n (∞, ∞,.., ∞ ) = 1
(d) FX1 , X 2 ,.. X n ( x1 , x2 ,..xn ) is right-continuous in each of its arguments.

(e) The marginal CDF of a random variable X i is obtained from FX1 , X 2 ,.. X n ( x1 , x2 ,..xn ) by
letting all random variables except X i tend to ∞. Thus
FX1 ( x1 ) = FX1 , X 2 ,.. X n ( x1 , ∞,.., ∞),
FX 2 ( x2 ) = FX1 , X 2 ,.. X n (∞, x2 ,.., ∞)
and so on.

Joint pmf of n discrete random variables


Suppose X is a discrete random vector defined on the same probability space
( S , F , P). Then X is completely specified by the joint probability mass function
PX1 , X 2 ,.., X n ( x1 , x2 ,..xn ) = PX (x)
= P ({ X 1 = x1 , X 2 = x2 ,.. X n = xn })

Given p_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) we can find the marginal probability mass function as an
(n − 1)-fold summation:

   p_{X₁}(x₁) = Σ_{x₂} Σ_{x₃} .... Σ_{x_n} p_{X₁,X₂,...,X_n}(x₁, x₂, ..., x_n)

Joint PDF of n random variables


If X is a continuous random vector, that is, FX1 , X 2 ,.. X n ( x1 , x2 ,..xn ) is continuous in each of

its argument, then X can be specified by the joint probability density function
f X (x) = f X1 , X 2 ,.. X n ( x1 , x2 ,..xn )
∂n
= FX , X ,.. X ( x1 , x2 ,..xn )
∂x1∂x2 ...∂xn 1 2 n

Properties of joint PDF of n random variables


The joint pdf of n random variables satisfies the following important properties:

(1) f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) is always a non-negative quantity. That is,

       f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) ≥ 0    for all (x₁, x₂, .., x_n) ∈ Rⁿ

(2) ∫_{-∞}^{∞} ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) dx₁ dx₂ ... dx_n = 1

(3) Given f_X(x) = f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) for all (x₁, x₂, .., x_n) ∈ Rⁿ, we can find the
    probability of a Borel set (region) B ⊆ Rⁿ:

       P({(x₁, x₂, .., x_n) ∈ B}) = ∫∫..∫_B f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) dx₁ dx₂ ... dx_n

(4) The marginal pdf of a random variable X_i is related to f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) by
    the (n − 1)-fold integral

       f_{X_i}(x_i) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_i, .., x_n) dx₁ dx₂ ... dx_n

    where the integral is performed over all the arguments except x_i.

    Similarly,

       f_{X_i,X_j}(x_i, x_j) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) dx₁ dx₂ ... dx_n      ((n − 2)-fold integral)

    and so on.

The conditional density functions are defined in a similar manner. Thus

   f_{X_{m+1},..,X_n / X₁,..,X_m}(x_{m+1}, .., x_n / x₁, .., x_m)
      = f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) / f_{X₁,X₂,..,X_m}(x₁, x₂, .., x_m)

Independent random variables:


The random variables X 1 , X 2 ,..............., X n are called (mutually) independent if and only
if

   f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) = Π_{i=1}^{n} f_{X_i}(x_i)

For example, if X₁, X₂, ..., X_n are independent Gaussian random variables, then

   f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) = Π_{i=1}^{n} (1/√(2π) σ_i) e^{−(x_i − μ_i)²/2σ_i²}

Remark: X₁, X₂, ..., X_n may be pairwise independent and yet not mutually independent.

Identically distributed random variables:

The random variables X₁, X₂, ..., X_n are called identically distributed if each
random variable has the same marginal distribution function, that is,

   F_{X₁}(x) = F_{X₂}(x) = ......... = F_{X_n}(x)    for all x

An important subclass of independent random variables is the independent and identically
distributed (iid) random variables. The random variables X₁, X₂, ....., X_n are called iid if
they are mutually independent and each has the same marginal distribution function.

Example: If X₁, X₂, ....., X_n are iid random variables generated by n independent
tosses of a fair coin, each taking values 0 and 1, then

   p_X(1, 1, ....., 1) = (1/2)ⁿ
Moments of Multiple random variables
Consider n jointly random variables represented by the random vector
X = [X₁, X₂, ....., X_n]′. The expected value of any scalar-valued function g(X) is defined
using the n-fold integral as

   E(g(X)) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} ... ∫_{-∞}^{∞} g(x₁, x₂, .., x_n) f_{X₁,X₂,..,X_n}(x₁, x₂, .., x_n) dx₁ dx₂ ... dx_n

The mean vector of X, denoted by μ_X, is defined as

   μ_X = E(X) = [ E(X₁) E(X₂) ...... E(X_n) ]′ = [ μ_{X₁} μ_{X₂} ... μ_{X_n} ]′

Similarly, for each (i, j), i = 1, 2, .., n, j = 1, 2, .., n, we can define the covariance

   Cov(X_i, X_j) = E(X_i − μ_{X_i})(X_j − μ_{X_j})

All the possible covariances can be represented in terms of a matrix called the covariance
matrix C_X defined by

   C_X = E(X − μ_X)(X − μ_X)′

       = [ var(X₁)         cov(X₁, X₂)   ...   cov(X₁, X_n)
           cov(X₂, X₁)     var(X₂)       ...   cov(X₂, X_n)
           ...             ...           ...   ...
           cov(X_n, X₁)    cov(X_n, X₂)  ...   var(X_n)      ]

Properties of the Covariance Matrix


• CX is a symmetric matrix because Cov ( X i , X j ) = Cov ( X j , X i )

• CX is a non-negative definite matrix in the sense that for any real vector z ≠ 0,
the quadratic form z ′CX z ≥ 0. The result can be proved as follows:
z ′C X z = z ′E (( X − µ X )( X − µ X )′)z
= E (z ′( X − µ X )( X − µ X )′z )
= E (z ′( X − µ X )) 2
≥0

The covariance matrix represents second-order relationship between each pair of the
random variables and plays an important role in applications of random variables.
• The n random variables X 1 , X 2 ,.., X n are called uncorrelated if for each
(i, j ) i = 1, 2,.., n, j = 1, 2,.., n
Cov( X i , X j ) = 0
If X 1 , X 2 ,.., X n are uncorrelated, CX will be a diagonal matrix.
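A brief numerical sketch (the component distributions are illustrative): estimating the mean vector and covariance matrix from samples, with uncorrelated components giving a nearly diagonal C_X.

import numpy as np

rng = np.random.default_rng(5)
n = 50_000
# three mutually independent (hence uncorrelated) components with different variances
samples = np.column_stack([rng.normal(0, 1, n),
                           rng.normal(2, 3, n),
                           rng.uniform(-1, 1, n)])

mu_hat = samples.mean(axis=0)            # estimate of the mean vector
C_hat = np.cov(samples, rowvar=False)    # estimate of the covariance matrix

print(mu_hat)
print(np.round(C_hat, 3))                # nearly diagonal: ≈ diag(1, 9, 1/3)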

Multiple Jointly Gaussian Random Variables

For any positive integer n, let X₁, X₂, ....., X_n represent n jointly distributed random variables.
These n random variables define a random vector X = [X₁, X₂, ....., X_n]′. The
random variables are called jointly Gaussian if X₁, X₂, ....., X_n
have the joint probability density function

   f_{X₁,X₂,.....,X_n}(x₁, x₂, ... x_n) = e^{ −(1/2)(x − μ_X)′ C_X⁻¹ (x − μ_X) } / √( (2π)ⁿ det(C_X) )

where C_X = E(X − μ_X)(X − μ_X)′ is the covariance matrix and
μ_X = E(X) = [E(X₁), E(X₂), ......, E(X_n)]′ is the vector formed by the means of the random
variables.
Vector space Interpretation of Random Variables

Consider a set V whose elements are called vectors, and the field of real numbers R.
V is called a vector space if and only if:

1. An operation of vector addition '+' is defined in V such that (V, +) is a
   commutative group. Thus (V, +) satisfies the following properties.
   (i)   For any pair of elements v, w ∈ V, there exists a unique element (v + w) ∈ V.
   (ii)  Vector addition is associative: v + (w + z) = (v + w) + z for any three vectors v, w, z ∈ V.
   (iii) There is a vector 0 ∈ V such that v + 0 = 0 + v = v for any v ∈ V.
   (iv)  For any v ∈ V there is a vector −v ∈ V such that v + (−v) = 0 = (−v) + v.
   (v)   For any v, w ∈ V, v + w = w + v.

2. For any element v ∈ V and any r ∈ R, the scalar product rv ∈ V.
   This scalar product has the following properties for any r, s ∈ R and any v, w ∈ V:

3. r(sv) = (rs)v for r, s ∈ R and v ∈ V
4. r(v + w) = rv + rw
5. (r + s)v = rv + sv
6. 1v = v

It is easy to verify that the set of all random variables defined on a probability space
(S, F, P) forms a vector space with respect to addition and scalar multiplication. Similarly
the set of all n-dimensional random vectors forms a vector space. Interpretation of
random variables as elements of a vector space helps in understanding many operations
involving random variables.

Linear Independence
Consider N vectors v₁, v₂, ...., v_N.
If c₁v₁ + c₂v₂ + .... + c_N v_N = 0 implies that c₁ = c₂ = .... = c_N = 0,
then v₁, v₂, ...., v_N are linearly independent.

Similarly, for N random vectors X₁, X₂, ...., X_N, if c₁X₁ + c₂X₂ + .... + c_N X_N = 0 implies
that c₁ = c₂ = .... = c_N = 0, then X₁, X₂, ...., X_N are linearly independent.

Inner Product
If v and w are vectors in a vector space V defined over the field R, the inner product
< v, w > is a scalar such that, for all v, w, z ∈ V and r ∈ R,

1. < v, w > = < w, v >
2. < v, v > = ||v||² ≥ 0, where ||v|| is the norm induced by the inner product
3. < v + w, z > = < v, z > + < w, z >
4. < rv, w > = r < v, w >

In the case of two random variables X and Y, the joint expectation EXY defines an
inner product between X and Y. Thus

   < X, Y > = EXY

We can easily verify that EXY satisfies the axioms of an inner product.
The norm of a random variable X is given by

   ||X||² = EX²

For two n-dimensional random vectors X = [X₁ X₂ ... X_n]′ and Y = [Y₁ Y₂ ... Y_n]′, the inner product is

   < X, Y > = EX′Y = Σ_{i=1}^{n} EX_iY_i

The norm of a random vector X is given by

   ||X||² = < X, X > = EX′X = Σ_{i=1}^{n} EX_i²

•  The set of RVs along with the inner product defined through the joint expectation
   operation and the corresponding norm defines a Hilbert space.

Schwarz Inequality
For any two vectors v and w belonging to a Hilbert space V,

   |< v, w >| ≤ ||v|| ||w||

This means that for any two random variables X and Y,

   |E(XY)| ≤ √(EX² EY²)

Similarly, for any two random vectors X and Y,

   |E(X′Y)| ≤ √(EX′X) √(EY′Y)

Orthogonal Random Variables and Orthogonal Random Vectors

Two vectors v and w are called orthogonal if < v, w > = 0.
Two random variables X and Y are called orthogonal if EXY = 0.
Similarly, two random vectors X and Y are called orthogonal if

   EX′Y = Σ_{i=1}^{n} EX_iY_i = 0

Just like the independent random variables and the uncorrelated random variables, the
orthogonal random variables form an important class of random variables.

Remark
If X and Y are uncorrelated, then

   E(X − μ_X)(Y − μ_Y) = 0

so (X − μ_X) is orthogonal to (Y − μ_Y). If each of X and Y is zero-mean, then
Cov(X, Y) = EXY, and in this case EXY = 0 ⇔ Cov(X, Y) = 0.
Minimum Mean-square-error Estimation
Suppose X is a random variable which is not observable and Y is another observable
random variable which is statistically dependent on X through the joint probability
density function f_{X,Y}(x, y). We pose the following problem:

Given a value of Y, what is the best guess for X?

This problem is known as the estimation problem and has many practical applications.
One application is the estimation of a signal from noisy observations, as illustrated below:

[Block diagram: signal X plus noise gives the noisy observation Y, which is fed to an
estimator producing the estimated signal X̂.]

Let X̂(Y) be the estimate of the random variable X based on the random variable Y.
Clearly X̂(Y) is a function of Y. We have to find the best estimate X̂(Y) in some
meaningful sense. Observe that

•  X is the unknown random variable
•  X̂(Y) is the estimate of X
•  X − X̂(Y) is the estimation error
•  E(X − X̂(Y))² is the mean of the square error

One meaningful criterion is to minimize E(X − X̂(Y))² with respect to X̂(Y); the
corresponding estimation principle is called the minimum mean square error principle. Such
a function to be minimized is called a cost function in optimization theory. For
finding X̂(Y), we have to minimize the cost function

   E(X − X̂(Y))² = ∫_{-∞}^{∞} ∫_{-∞}^{∞} (x − X̂(y))² f_{X,Y}(x, y) dy dx
                 = ∫_{-∞}^{∞} ∫_{-∞}^{∞} (x − X̂(y))² f_Y(y) f_{X/Y}(x/y) dy dx
                 = ∫_{-∞}^{∞} f_Y(y) [ ∫_{-∞}^{∞} (x − X̂(y))² f_{X/Y}(x/y) dx ] dy

Since f_Y(y) is always positive, the minimization of E(X − X̂(Y))² with respect to X̂(Y)
is equivalent to minimizing the inner integral ∫_{-∞}^{∞} (x − X̂(y))² f_{X/Y}(x/y) dx with
respect to X̂(y). The condition for the minimum is

   ∂/∂X̂ ∫_{-∞}^{∞} (x − X̂(y))² f_{X/Y}(x/y) dx = 0
   ⇒ 2 ∫_{-∞}^{∞} (X̂(y) − x) f_{X/Y}(x/y) dx = 0
   ⇒ ∫_{-∞}^{∞} X̂(y) f_{X/Y}(x/y) dx = ∫_{-∞}^{∞} x f_{X/Y}(x/y) dx
   ⇒ X̂(y) = E(X / Y = y)

Thus, the minimum mean-square error estimation involves the conditional expectation
E(X / Y = y). To find E(X / Y = y), we have to determine the a posteriori probability
density f_{X/Y}(x/y) and perform the integration ∫_{-∞}^{∞} x f_{X/Y}(x/y) dx. These operations are
computationally expensive when they have to be performed numerically.

Example  Consider two zero-mean jointly Gaussian random variables X and Y with the
joint pdf

   f_{X,Y}(x, y) = [1/(2πσ_X σ_Y √(1 − ρ²_{X,Y}))]
                   × exp{ −1/(2(1 − ρ²_{X,Y})) [ x²/σ_X² − 2ρ_{X,Y} x y/(σ_X σ_Y) + y²/σ_Y² ] }

   −∞ < x < ∞, −∞ < y < ∞

The marginal density f_Y(y) is Gaussian and given by

   f_Y(y) = (1/√(2π) σ_Y) e^{−y²/2σ_Y²}

   ∴ f_{X/Y}(x/y) = f_{X,Y}(x, y) / f_Y(y)
                  = [1/(√(2π) σ_X √(1 − ρ²_{X,Y}))] e^{ −[ x − ρ_{X,Y}(σ_X/σ_Y) y ]² / (2σ_X²(1 − ρ²_{X,Y})) }

which is Gaussian with mean ρ_{X,Y}(σ_X/σ_Y) y. Therefore, the MMSE estimator of X given
Y = y is

   X̂(y) = E(X / Y = y) = ρ_{X,Y}(σ_X/σ_Y) y

This example illustrates that in the case of jointly Gaussian random variables X and Y,
the minimum mean-square estimator of X given Y = y is linearly related to y. This important result
gives us a clue to a simpler version of the mean-square error estimation problem, discussed
below.

Linear Minimum Mean-square-error Estimation and the Orthogonality Principle


We assume that X and Y are both zero-mean and restrict the estimator to the linear form
X̂(y) = a y. The estimation problem is now to find the optimal value of a. Thus we have the
linear minimum mean-square error criterion, which minimizes E(X − aY)² with respect to a.

[Block diagram: signal X plus noise gives the noisy observation Y; the estimator forms X̂ = aY.]

   d/da E(X − aY)² = 0
   ⇒ E[ d/da (X − aY)² ] = 0
   ⇒ E(X − aY)Y = 0
   ⇒ E e Y = 0

where e = X − aY is the estimation error.

Thus the optimum value of a is such that the estimation error (X − aY) is orthogonal to
the observed random variable Y, and the optimal estimator aY is the orthogonal
projection of X on Y. This orthogonality principle forms the heart of a class of
estimation problems called Wiener filtering. The orthogonality principle is illustrated
geometrically in the following figure.

[Figure: X, its orthogonal projection aY on Y, and the error e = X − aY orthogonal to Y.]

The optimum value of a is given by

   E(X − aY)Y = 0
   ⇒ EXY − aEY² = 0
   ⇒ a = EXY / EY²

The corresponding minimum linear mean-square error (LMMSE) is

   LMMSE = E(X − aY)²
         = E(X − aY)X − a E(X − aY)Y
         = E(X − aY)X − 0            (E(X − aY)Y = 0, using the orthogonality principle)
         = EX² − a EXY

The orthogonality principle can be applied to the optimal estimation of a random variable
from more than one observation. We illustrate this in the following example.

Example Suppose X is a zero-mean random variable which is to be estimated from two


zero-mean random variables Y1 and Y2 . Let the LMMSE estimator be Xˆ = a1Y1 + a2Y2 .
Then the optimal values of a1 and a2 are given by
   ∂E(X − a₁Y₁ − a₂Y₂)²/∂a_i = 0,   i = 1, 2.

This results in the orthogonality conditions

   E(X − a₁Y₁ − a₂Y₂)Y₁ = 0
   and
   E(X − a₁Y₁ − a₂Y₂)Y₂ = 0

Rewriting the above equations we get

   a₁EY₁² + a₂EY₂Y₁ = EXY₁
   and
   a₁EY₁Y₂ + a₂EY₂² = EXY₂

Solving these equations we can find a₁ and a₂.
Further, the corresponding minimum linear mean-square error (LMMSE) is

   LMMSE = E(X − a₁Y₁ − a₂Y₂)²
         = E(X − a₁Y₁ − a₂Y₂)X − a₁E(X − a₁Y₁ − a₂Y₂)Y₁ − a₂E(X − a₁Y₁ − a₂Y₂)Y₂
         = E(X − a₁Y₁ − a₂Y₂)X − 0          (using the orthogonality principle)
         = EX² − a₁EXY₁ − a₂EXY₂
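A minimal numerical sketch of this two-observation LMMSE problem (the signal/noise model and all values are illustrative, not from the text): the coefficients are found by solving the normal (orthogonality) equations built from sample moments.

import numpy as np

rng = np.random.default_rng(6)
n = 100_000
x = rng.standard_normal(n)                    # zero-mean unobserved signal
y1 = x + 0.5 * rng.standard_normal(n)         # two noisy observations of x
y2 = x + 1.0 * rng.standard_normal(n)

# normal equations:  [EY1^2  EY1Y2] [a1]   [EXY1]
#                    [EY1Y2  EY2^2] [a2] = [EXY2]
R = np.array([[np.mean(y1 * y1), np.mean(y1 * y2)],
              [np.mean(y1 * y2), np.mean(y2 * y2)]])
r = np.array([np.mean(x * y1), np.mean(x * y2)])
a = np.linalg.solve(R, r)

x_hat = a[0] * y1 + a[1] * y2
lmmse = np.mean((x - x_hat) ** 2)             # agrees with EX^2 - a1*EXY1 - a2*EXY2
print(a, lmmse, np.mean(x * x) - a @ r)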
Convergence of a sequence of random variables
Let X₁, X₂, ..., X_n be a sequence of n independent and identically distributed random
variables. Suppose we want to estimate the mean of the random variable on the basis of
the observed data by means of the relation

   μ_n = (1/n) Σ_{i=1}^{n} X_i
How closely does µ n represent the true mean µ X as n is increased? How do we

measure the closeness between µ n and µ X ?


Notice that µ n is a random variable. What do we mean by the statement µ n converges

to µ X ?

•  Consider a deterministic sequence of real numbers x₁, x₂, .... x_n .... The sequence
   converges to a limit x if, corresponding to every ε > 0, we can find a positive integer
   N such that |x − x_n| < ε for n > N. For example, the sequence
   1, 1/2, ..., 1/n, ... converges to the number 0.

•  The Cauchy criterion gives the condition for convergence of a sequence without
   actually finding the limit. The sequence x₁, x₂, .... x_n .... converges if and only if,
   for every ε > 0, there exists a positive integer N such that

       |x_{n+m} − x_n| < ε    for all n > N and all m > 0.

Convergence of a random sequence X 1 , X 2 ,.... X n .... cannot be defined as above. Note

that for each s ∈ S , X 1 ( s ), X 2 ( s ),.... X n ( s ).... represent a sequence of numbers . Thus

X 1 , X 2 ,.... X n .... represents a family of sequences of numbers. Convergence of a random

sequence is to be defined using different criteria. Five of these criteria are explained
below.

Convergence Everywhere
A sequence of random variables is said to converge everywhere to X if

   |X(s) − X_n(s)| → 0  as n → ∞,  for every s ∈ S.

Note that here the sequence of numbers corresponding to every sample point is convergent.

Almost sure (a.s.) convergence or convergence with probability 1

A random sequence X₁, X₂, .... X_n, .... may not converge for every s ∈ S.
Consider the event {s | X_n(s) → X(s)}.

The sequence X₁, X₂, .... X_n, .... is said to converge to X almost surely, or with probability
1, if

   P{s | X_n(s) → X(s)} = 1 as n → ∞,

or equivalently, for every ε > 0 there exists N such that

   P{s : |X_n(s) − X(s)| < ε for all n ≥ N} = 1

We write X_n → X  a.s. in this case.

One important application is the Strong Law of Large Numbers (SLLN):

If X₁, X₂, .... X_n .... are independent and identically distributed random variables with a
finite mean μ_X, then (1/n) Σ_{i=1}^{n} X_i → μ_X with probability 1 as n → ∞.

Remark:
•  μ_n = (1/n) Σ_{i=1}^{n} X_i is called the sample mean.
•  The strong law of large numbers states that the sample mean converges to the true
   mean as the sample size increases.
•  The SLLN is one of the fundamental theorems of probability. There is a weaker
   version of the law that we will discuss later.
Convergence in mean-square sense

A random sequence X₁, X₂, .... X_n .... is said to converge in the mean-square sense (m.s.) to
a random variable X if

   E(X_n − X)² → 0  as n → ∞

X is called the mean-square limit of the sequence and we write

   l.i.m. X_n = X

where l.i.m. means limit in mean-square. We also write X_n → X (m.s.).

•  The following Cauchy criterion gives the condition for m.s. convergence of a
   random sequence without actually finding the limit. The sequence
   X₁, X₂, .... X_n .... converges in m.s. if and only if

       E[ X_{n+m} − X_n ]² → 0  as n → ∞, for all m > 0.

Example:

If X₁, X₂, .... X_n .... are iid random variables, then (1/n) Σ_{i=1}^{n} X_i → μ_X in the
mean-square sense as n → ∞.

We have to show that lim_{n→∞} E( (1/n) Σ_{i=1}^{n} X_i − μ_X )² = 0.

Now,

   E( (1/n) Σ_{i=1}^{n} X_i − μ_X )² = E( (1/n) Σ_{i=1}^{n} (X_i − μ_X) )²
      = (1/n²) Σ_{i=1}^{n} E(X_i − μ_X)² + (1/n²) Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} E(X_i − μ_X)(X_j − μ_X)
      = nσ_X²/n² + 0          (because of independence)
      = σ_X²/n

   ∴ lim_{n→∞} E( (1/n) Σ_{i=1}^{n} X_i − μ_X )² = 0
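A small simulation sketch of this σ_X²/n decay (the distribution, σ_X and the trial counts are illustrative):

import numpy as np

rng = np.random.default_rng(7)
sigma_x = 2.0
trials = 2_000

for n in (10, 100, 1000):
    samples = sigma_x * rng.standard_normal((trials, n))   # zero-mean iid, so mu_X = 0
    sample_means = samples.mean(axis=1)
    mse = np.mean(sample_means ** 2)                       # empirical E(mu_n - mu_X)^2
    print(n, mse, sigma_x ** 2 / n)                        # the two columns agree closely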

Convergence in probability

Associated with the sequence of random variables X₁, X₂, .... X_n, ...., we can define a
sequence of probabilities P{|X_n − X| > ε}, n = 1, 2, ..., for every ε > 0.

The sequence X₁, X₂, .... X_n .... is said to converge to X in probability if this sequence
of probabilities converges to 0, that is,

   P{|X_n − X| > ε} → 0  as n → ∞

for every ε > 0. We write X_n → X (P) to denote convergence in probability of the
sequence of random variables X₁, X₂, .... X_n .... to the random variable X.

If a sequence is convergent in the mean-square sense, then it is convergent in probability also,
because by the Markov inequality

   P{ |X_n − X|² > ε² } ≤ E(X_n − X)² / ε²

so that

   P{ |X_n − X| > ε } ≤ E(X_n − X)² / ε²

If E(X_n − X)² → 0 as n → ∞ (mean-square convergence), then

   P{ |X_n − X| > ε } → 0 as n → ∞.

Example:
Suppose {X_n} is a sequence of random variables with

   P{X_n = 1}  = 1 − 1/n
   P{X_n = −1} = 1/n

Clearly, for every ε > 0,

   P{|X_n − 1| > ε} = P{X_n = −1} = 1/n → 0   as n → ∞.

Therefore X_n → 1 in probability. Thus the above sequence converges to a constant in probability.

Remark:
Convergence in probability is also called stochastic convergence.
Weak Law of Large Numbers
If X₁, X₂, .... X_n .... are independent and identically distributed random variables with
sample mean μ_n = (1/n) Σ_{i=1}^{n} X_i, then μ_n → μ_X in probability as n → ∞.

We have

   μ_n = (1/n) Σ_{i=1}^{n} X_i
   ∴ Eμ_n = (1/n) Σ_{i=1}^{n} EX_i = μ_X

and

   E(μ_n − μ_X)² = σ_X²/n          (as shown above)

   P{|μ_n − μ_X| ≥ ε} ≤ E(μ_n − μ_X)² / ε² = σ_X²/(nε²)

   ∴ P{|μ_n − μ_X| ≥ ε} → 0 as n → ∞.

Convergence in distribution

Consider the random sequence X₁, X₂, .... X_n .... and a random variable X. Suppose
F_{X_n}(x) and F_X(x) are the distribution functions of X_n and X respectively. The sequence
is said to converge to X in distribution if

   F_{X_n}(x) → F_X(x)  as n → ∞

for all x at which F_X(x) is continuous. Here the two distribution functions eventually
coincide. We write X_n → X (d) to denote convergence in distribution of the random
sequence X₁, X₂, .... X_n .... to the random variable X.

Example: Suppose X₁, X₂, .... X_n .... is a sequence of iid RVs, each random variable X_i
having the uniform density

   f_{X_i}(x) = 1/a     0 ≤ x ≤ a
              = 0       otherwise

Define Z_n = max(X₁, X₂, .... X_n). We can show that

   F_{Z_n}(z) = 0           z < 0
              = zⁿ/aⁿ       0 ≤ z < a
              = 1           z ≥ a

Clearly,

   lim_{n→∞} F_{Z_n}(z) = F_Z(z) = 0,   z < a
                                 = 1,   z ≥ a

∴ {Z_n} converges in distribution to the constant random variable Z = a.

Relation between Types of Convergence
Almost-sure convergence (X_n →(a.s.) X) implies convergence in probability (X_n →(P) X), and so does mean-square convergence (X_n →(m.s.) X). Convergence in probability in turn implies convergence in distribution (X_n →(d) X).
Central Limit Theorem
Consider n independent random variables X_1, X_2, ..., X_n. The mean and variance of each of the random variables are known. Suppose E(X_i) = μ_{X_i} and var(X_i) = σ_{X_i}^2.
Form a random variable
Y_n = X_1 + X_2 + ... + X_n
The mean and variance of Y_n are given by
EY_n = μ_{Y_n} = μ_{X_1} + μ_{X_2} + ... + μ_{X_n}
and
var(Y_n) = σ_{Y_n}^2 = E{Σ_{i=1}^{n} (X_i − μ_{X_i})}^2
= Σ_{i=1}^{n} E(X_i − μ_{X_i})^2 + Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} E(X_i − μ_{X_i})(X_j − μ_{X_j})
= σ_{X_1}^2 + σ_{X_2}^2 + ... + σ_{X_n}^2
(∵ X_i and X_j are independent for i ≠ j)

Thus we can determine the mean and variance of Y_n. Can we guess the probability distribution of Y_n?
The central limit theorem (CLT) provides an answer to this question.
The CLT states that under very general conditions Y_n = Σ_{i=1}^{n} X_i converges in distribution to Y ~ N(μ_Y, σ_Y^2) as n → ∞. The conditions are:
1. The random variables X_1, X_2, ..., X_n are independent with the same mean and variance, but not necessarily identically distributed.
2. The random variables X_1, X_2, ..., X_n are independent with different means and the same variance, and not identically distributed.
3. The random variables X_1, X_2, ..., X_n are independent with different means, and each variance is neither too small nor too large.


We shall consider the first condition only. In this case, the central limit theorem can be stated as follows:
Suppose X_1, X_2, ..., X_n is a sequence of independent and identically distributed random variables, each with mean μ_X and variance σ_X^2, and Y_n = Σ_{i=1}^{n} (X_i − μ_X)/√n. Then the sequence {Y_n} converges in distribution to a Gaussian random variable Y with mean 0 and variance σ_X^2. That is,
lim_{n→∞} F_{Y_n}(y) = ∫_{−∞}^{y} (1/(√(2π) σ_X)) e^{−u^2/(2σ_X^2)} du

Remarks
• The central limit theorem is really a property of convolution. Consider the sum of two statistically independent random variables, say, Y = X_1 + X_2. Then the pdf f_Y(y) is the convolution of f_{X_1}(x) and f_{X_2}(x). This can be shown with the help of the characteristic functions as follows:
φ_Y(ω) = E[e^{jω(X_1 + X_2)}]
       = E(e^{jωX_1}) E(e^{jωX_2}) = φ_{X_1}(ω) φ_{X_2}(ω)
∴ f_Y(y) = f_{X_1}(y) * f_{X_2}(y)
         = ∫_{−∞}^{∞} f_{X_1}(τ) f_{X_2}(y − τ) dτ
where * denotes the convolution operation.


We can illustrate this by convolving two uniform distributions repeatedly. The
convolution of two uniform distributions gives a triangular distribution. Further
convolution gives a parabolic distribution and so on.
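This repeated-convolution picture can be reproduced numerically. The sketch below (Python/NumPy; the grid spacing and the uniform density on [0, 1] are illustrative assumptions) convolves a uniform pdf with itself several times and prints the maximum gap between the result and the Gaussian pdf having the same mean and variance:

```python
import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)
f = np.ones_like(x)                 # uniform pdf on [0, 1)

g = f.copy()
for k in range(2, 5):               # pdf of the sum of k iid uniforms
    g = np.convolve(g, f) * dx
    t = np.arange(g.size) * dx      # support of the sum is approximately [0, k]
    mean, var = k * 0.5, k / 12.0   # mean and variance of the sum of k uniforms
    gauss = np.exp(-(t - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    print(k, np.max(np.abs(g - gauss)))   # gap to the matching Gaussian shrinks as k grows
```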

Proof of the central limit theorem


We give a less rigorous proof of the theorem with the help of the characteristic function. Further, we consider each of X_1, X_2, ..., X_n to have zero mean. Thus,
Y_n = (X_1 + X_2 + ... + X_n)/√n.
Clearly,
μ_{Y_n} = 0,
σ_{Y_n}^2 = σ_X^2,
E(Y_n^3) = E(X^3)/√n, and so on.


The characteristic function of Y_n is given by
φ_{Y_n}(ω) = E(e^{jωY_n}) = E[e^{jω (1/√n) Σ_{i=1}^{n} X_i}]
We will show that as n → ∞ the characteristic function φ_{Y_n} is of the form of the characteristic function of a Gaussian random variable.

Expanding e^{jωY_n} in a power series,
e^{jωY_n} = 1 + jωY_n + (jω)^2 Y_n^2 / 2! + (jω)^3 Y_n^3 / 3! + ....
Assume all the moments of Y_n to be finite. Then
φ_{Y_n}(ω) = E(e^{jωY_n}) = 1 + jωμ_{Y_n} + ((jω)^2/2!) E(Y_n^2) + ((jω)^3/3!) E(Y_n^3) + ...
Substituting μ_{Y_n} = 0 and E(Y_n^2) = σ_{Y_n}^2 = σ_X^2, we get
φ_{Y_n}(ω) = 1 − (ω^2/2!) σ_X^2 + R(ω, n)
where R(ω, n) collects the terms involving ω^3 and higher powers of ω.


Note also that each term in R(ω, n) involves a ratio of a higher moment and a power of n and therefore,
lim_{n→∞} R(ω, n) = 0
∴ lim_{n→∞} φ_{Y_n}(ω) = e^{−ω^2 σ_X^2 / 2}
which is the characteristic function of a Gaussian random variable with 0 mean and variance σ_X^2.
∴ Y_n →(d) N(0, σ_X^2)

Remark:

(1) Under the conditions of the CLT, the sample mean
μ̂_X = (1/n) Σ_{i=1}^{n} X_i is approximately distributed as N(μ_X, σ_X^2/n) for large n. In other words, if samples are taken from any distribution with mean μ_X and variance σ_X^2, then as the sample size n increases, the distribution function of the sample mean approaches the distribution function of a Gaussian random variable.
(2) The CLT states that the distribution function F_{Y_n}(y) converges to a Gaussian distribution function. The theorem does not say that the pdf f_{Y_n}(y) is a Gaussian pdf in the limit. For example, suppose each X_i has a Bernoulli distribution. Then the pdf of Y_n consists of impulses and can never approach the Gaussian pdf.
(3) The Cauchy distribution does not meet the conditions for the central limit theorem to hold. As we have noted earlier, this distribution does not have a finite mean or variance.
Suppose a random variable X_i has the Cauchy distribution
f_{X_i}(x) = 1/(π(1 + x^2)),  −∞ < x < ∞.
The characteristic function of X_i is given by
φ_{X_i}(ω) = e^{−|ω|}
The sample mean μ̂_X = (1/n) Σ_{i=1}^{n} X_i will have the characteristic function
φ_{μ̂_X}(ω) = (φ_{X_i}(ω/n))^n = e^{−|ω|}
so the sample mean is again Cauchy distributed. Thus the sum of a large number of Cauchy random variables will not follow a Gaussian distribution.
(4) The central limit theorem is one of the most widely used results of probability. If a random variable is the result of several independent causes, then the random variable can be considered to be Gaussian. For example,
- the thermal noise in a resistor is the result of the independent motion of billions of electrons and is modelled as Gaussian;
- the observation/measurement error of any process is modelled as Gaussian.
(5) The CLT can be used to simulate a Gaussian distribution given a routine to simulate a particular random variable, as sketched below.
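A minimal sketch of this idea in Python/NumPy (the choice of 12 uniform variates per Gaussian sample is a common illustrative convention, not something prescribed here): summing iid uniforms and normalizing by the exact mean and variance of the sum gives approximately N(0, 1) variates.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_samples = 12, 100000

# Each row: n iid Uniform(0,1) variates; their normalized sum is approximately N(0,1).
U = rng.random((n_samples, n))
Z = (U.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)

print(Z.mean(), Z.var())            # close to 0 and 1
print(np.mean(np.abs(Z) <= 1.0))    # close to 0.6827, the standard Gaussian value
```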

Normal approximation of the Binomial distribution

One of the applications of the CLT is the approximation of Binomial probabilities.

Suppose X_1, X_2, X_3, ..., X_n, ... is a sequence of Bernoulli(p) random variables with P{X_i = 1} = p and P{X_i = 0} = 1 − p.
Then Y_n = Σ_{i=1}^{n} X_i has a Binomial distribution with μ_{Y_n} = np and σ_{Y_n}^2 = np(1 − p).
Thus,
(Y_n − np)/√(np(1 − p)) →(d) N(0, 1)
or, approximately, Y_n ~ N(np, np(1 − p)).
∴ P(k − 1 < Y_n ≤ k) ≈ ∫_{k−1}^{k} (1/√(2π np(1 − p))) e^{−(y − np)^2 / (2np(1 − p))} dy
∴ P(Y_n = k) ≈ (1/√(2π np(1 − p))) e^{−(k − np)^2 / (2np(1 − p))}    (taking the integrand as constant over an interval of length 1)

This is the normal approximation to the Binomial probabilities and is known as the De Moivre–Laplace approximation.
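A short numerical comparison of the exact Binomial pmf with the De Moivre–Laplace approximation (Python, standard library; n = 100, p = 0.3 and the values of k are illustrative assumptions):

```python
from math import comb, exp, pi, sqrt

n, p = 100, 0.3
mean, var = n * p, n * p * (1 - p)

for k in [20, 25, 30, 35, 40]:
    exact = comb(n, k) * p**k * (1 - p)**(n - k)               # Binomial pmf
    approx = exp(-(k - mean)**2 / (2 * var)) / sqrt(2 * pi * var)  # De Moivre-Laplace
    print(k, exact, approx)    # the two columns agree closely for k near np
```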
RANDOM PROCESSES
In practical problems we deal with time-varying waveforms whose value at any time is random in nature. For example, the speech waveform, the signal received by a communication receiver, or the daily record of stock-market data represent random variables that change with time. How do we characterize such data? Such data are characterized as random or stochastic processes. This lecture covers the fundamentals of random processes.
Random processes
Recall that a random variable maps each sample point in the sample space to a point in
the real line. A random process maps each sample point to a waveform.
Consider a probability space {S, F, P}. A random process can be defined on {S, F, P} as an indexed family of random variables {X(s, t), s ∈ S, t ∈ Γ}, where Γ is an index set which may be discrete or continuous, usually denoting time. Thus a random process is a function of the sample point s and the index variable t and may be written as X(t, s).
Remark
• For a fixed t (= t_0), X(t_0, s) is a random variable.
• For a fixed s (= s_0), X(t, s_0) is a single realization of the random process and is a deterministic function of t.
• For a fixed s (= s_0) and a fixed t (= t_0), X(t_0, s_0) is a single number.
• When both t and s vary, we have the random process X(t, s).


The random process {X(s, t), s ∈ S, t ∈ Γ} is normally denoted by {X(t)}. The following figure illustrates a random process.
[Figure: Random Process – realizations X(t, s_1), X(t, s_2), X(t, s_3) corresponding to sample points s_1, s_2, s_3, plotted against t]

Example Consider a sinusoidal signal X (t ) = A cos ω t where A is a binary random


variable with probability mass functions p A (1) = p and p A (−1) = 1 − p.

Clearly, { X (t ), t ∈ Γ} is a random process with two possible realizations X 1 (t ) = cos ω t

and X 2 (t ) = − cos ω t. At a particular time t0 X (t0 ) is a random variable with two values

cos ω t0 and − cos ω t0 .

Continuous-time vs. discrete-time process


If the index set Γ is continuous, {X(t),t∈Γ} is called a continuous-time process.

Example Suppose X (t ) = A cos( w0 t + φ ) where A and w0 are constants and φ is


uniformly distributed between 0 and 2π . X (t ) is an example of a continuous-time
process.

Four realizations of the process are illustrated below.
[Figure: four realizations of X(t) = A cos(w_0 t + φ) for φ = 0.8373π, 0.9320π, 1.6924π and 1.8636π]
If the index set Γ is a countable set, {X(t), t ∈ Γ} is called a discrete-time process. Such a random process can be represented as {X[n], n ∈ Z} and is called a random sequence. Sometimes the notation {X_n, n ≥ 0} is used to describe a random sequence indexed by the set of non-negative integers.


We can define a discrete-time random process on discrete points of time. Particularly,
we can get a discrete-time random process { X [n], n ∈ Z } by sampling a continuous-time

process { X (t ), t ∈ Γ} at a uniform interval T such that X [n] = X (nT ).


The discrete-time random process is more important in practical implementations. Advanced statistical signal processing techniques have been developed to process signals of this type.
Example: Suppose X_n = √2 cos(ω_0 n + Y), where ω_0 is a constant and Y is a random variable uniformly distributed between −π and π.
{X_n} is an example of a discrete-time process.
[Figure: three realizations of X_n for phase values 0.4623π, 1.9003π and 0.9720π]

Continuous-state vs. discrete-state process:

The value of a random process X(t) at any time t can be described from its probabilistic model.

The state is the value taken by X(t) at a time t, and the set of all such states is called the state space. A random process is discrete-state if the state space is finite or countable. This also means that the corresponding sample space is finite or countable. Otherwise, the random process is called a continuous-state process.
Example Consider the random sequence { X n , n ≥ 0} generated by repeated tossing of a
fair coin where we assign 1 to Head and 0 to Tail.
Clearly X n can take only two values- 0 and 1. Hence { X n , n ≥ 0} is a discrete-time two-
state process.
How to describe a random process?

As we have observed above, X(t) at a specific time t is a random variable and can be
described by its probability distribution function FX (t ) ( x) = P ( X (t ) ≤ x). This distribution

function is called the first-order probability distribution function. We can similarly


dFX ( t ) ( x)
define the first-order probability density function f X (t ) ( x) = .
dx
To describe {X(t), t ∈ Γ} we have to use the joint distribution functions of the random variables at all possible values of t. For any positive integer n, X(t_1), X(t_2), ..., X(t_n) represent n jointly distributed random variables. Thus a random process {X(t), t ∈ Γ} can be described by specifying the n-th order joint distribution function
FX ( t1 ), X ( t2 )..... X ( tn ) ( x1 , x2 .....xn ) = P ( X (t1 ) ≤ x1 , X (t2 ) ≤ x2 ..... X (tn ) ≤ xn ), ∀ n ≥ 1 and ∀tn ∈ Γ

or the n-th order joint density function


∂n
f X (t1 ), X (t2 )..... X (tn ) ( x1 , x2 .....xn ) = FX (t1 ), X (t2 )..... X (tn ) ( x1 , x2 .....xn )
∂x1∂x2 ...∂xn

If { X (t ), t ∈ Γ} is a discrete-state random process, then it can be also specified by the


collection of n-th order joint probability mass function

p X ( t1 ), X ( t2 )..... X ( tn ) ( x1 , x2 .....xn ) = P ( X (t1 ) = x1 , X (t2 ) = x2 ..... X (tn ) = xn ), ∀ n ≥ 1 and ∀tn ∈ Γ

If the random process is continuous-state, it can similarly be specified by the collection of n-th order joint probability density functions.

Moments of a random process
We defined the moments of a random variable and joint moments of random variables.
We can define all the possible moments and joint moments of a random process
{ X (t ), t ∈ Γ}. Particularly, following moments are important.

• μ_X(t) = mean of the random process at t = E(X(t))
• R_X(t_1, t_2) = autocorrelation function of the process at times t_1, t_2 = E(X(t_1)X(t_2))
Note that
R_X(t_1, t_2) = R_X(t_2, t_1) and
R_X(t, t) = EX^2(t) = second moment or mean-square value at time t.

• The autocovariance function C X (t1 , t2 ) of the random process at time

t1 and t2 is defined by

C X (t1 , t2 ) = E ( X (t 1 ) − µ X (t1 ))( X (t2 ) − µ X (t2 ))


=RX (t1 , t2 ) − µ X (t1 ) µ X (t2 )
C X (t , t ) = E ( X (t ) − µ X (t )) 2 = variance of the process at time t.

These moments give partial information about the process.


The ratio ρ_X(t_1, t_2) = C_X(t_1, t_2) / √(C_X(t_1, t_1) C_X(t_2, t_2)) is called the correlation coefficient.

The autocorrelation function and the autocovariance functions are widely used to
characterize a class of random process called the wide-sense stationary process.

We can also define higher-order moments


R_X(t_1, t_2, t_3) = E(X(t_1)X(t_2)X(t_3)) = triple correlation function at t_1, t_2, t_3, etc.

The above definitions are easily extended to a random sequence { X n , n ≥ 0}.

Example

(a) Gaussian Random Process


For any positive integer n, X(t_1), X(t_2), ..., X(t_n) represent n jointly distributed random variables. These n random variables define a random vector X = [X(t_1), X(t_2), ..., X(t_n)]'. The process X(t) is called Gaussian if the random vector [X(t_1), X(t_2), ..., X(t_n)]' is jointly Gaussian with the joint density function
f_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = (1/((2π)^{n/2} √det(C_X))) e^{−(1/2)(x − μ_X)' C_X^{−1} (x − μ_X)}
where C_X = E(X − μ_X)(X − μ_X)'
and μ_X = E(X) = [E(X(t_1)), E(X(t_2)), ..., E(X(t_n))]'.
The Gaussian random process is completely specified by the mean vector μ_X and the autocovariance matrix C_X, and hence by the mean vector and the autocorrelation matrix R_X = EXX'.
(b) Bernoulli Random Process
A Bernoulli process is a discrete-time random process consisting of a sequence of independent and identically distributed Bernoulli random variables. Thus the discrete-time random process {X_n, n ≥ 0} is a Bernoulli process if
P{ X n = 1} = p and
P{ X n = 0} = 1 − p
Example
Consider the random sequence { X n , n ≥ 0} generated by repeated tossing of a fair coin

where we assign 1 to Head and 0 to Tail. Here { X n , n ≥ 0} is a Bernoulli process where

each random variable X n is a Bernoulli random variable with

1
p X (1) = P{ X n = 1} = and
2
1
p X (0) = P{ X n = 0} =
2
(c) A sinusoid with a random phase
X(t) = A cos(w_0 t + φ), where A and w_0 are constants and φ is uniformly distributed between 0 and 2π. Thus
f_Φ(φ) = 1/(2π) for 0 ≤ φ ≤ 2π, and 0 otherwise.
X(t) at a particular t is a random variable and it can be shown that
f_{X(t)}(x) = 1/(π√(A^2 − x^2)) for |x| < A
            = 0 otherwise
The pdf is sketched in the Fig. below:
The mean and autocorrelation of X (t ) :
μ_X(t) = EX(t)
= E[A cos(w_0 t + φ)]
= ∫_0^{2π} A cos(w_0 t + φ) (1/(2π)) dφ
= 0
R_X(t_1, t_2) = E[A cos(w_0 t_1 + φ) A cos(w_0 t_2 + φ)]
= A^2 E[cos(w_0 t_1 + φ) cos(w_0 t_2 + φ)]
= (A^2/2) E[cos(w_0 (t_1 − t_2)) + cos(w_0 (t_1 + t_2) + 2φ)]
= (A^2/2) cos(w_0 (t_1 − t_2)) + (A^2/2) ∫_0^{2π} cos(w_0 (t_1 + t_2) + 2φ) (1/(2π)) dφ
= (A^2/2) cos(w_0 (t_1 − t_2))
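These ensemble averages can be checked by simulating many independent realizations of X(t) = A cos(w_0 t + φ). A minimal sketch in Python/NumPy (A = 2, w_0 = 2π and the two time instants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
A, w0 = 2.0, 2.0 * np.pi
n_real = 200000                      # number of independent realizations
phi = rng.uniform(0.0, 2.0 * np.pi, n_real)

t1, t2 = 0.30, 0.45
x1 = A * np.cos(w0 * t1 + phi)
x2 = A * np.cos(w0 * t2 + phi)

print(x1.mean())                                              # estimate of mu_X(t1), near 0
print(np.mean(x1 * x2), A**2 / 2 * np.cos(w0 * (t1 - t2)))    # R_X(t1,t2) vs (A^2/2)cos(w0(t1-t2))
```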

Two or More Random Processes

In practical situations we deal with two or more random processes. We often deal with
the input and output processes of a system. To describe two or more random processes
we have to use the joint distribution functions and the joint moments.
Consider two random processes {X(t), t ∈ Γ} and {Y(t), t ∈ Γ}. For any positive integers n and m, X(t_1), X(t_2), ..., X(t_n), Y(t_1'), Y(t_2'), ..., Y(t_m') represent n + m jointly distributed random variables. Thus these two random processes can be described by the (n + m)-th order joint distribution function
F_{X(t_1), ..., X(t_n), Y(t_1'), ..., Y(t_m')}(x_1, ..., x_n, y_1, ..., y_m)
= P(X(t_1) ≤ x_1, ..., X(t_n) ≤ x_n, Y(t_1') ≤ y_1, ..., Y(t_m') ≤ y_m)
or the corresponding (n + m)-th order joint density function
f_{X(t_1), ..., X(t_n), Y(t_1'), ..., Y(t_m')}(x_1, ..., x_n, y_1, ..., y_m)
= ∂^{n+m}/(∂x_1 ... ∂x_n ∂y_1 ... ∂y_m) F_{X(t_1), ..., X(t_n), Y(t_1'), ..., Y(t_m')}(x_1, ..., x_n, y_1, ..., y_m)
/ /

Two random processes can be partially described by the joint moments:

Cross-correlation function of the processes at times t_1, t_2:
R_{XY}(t_1, t_2) = E(X(t_1)Y(t_2))

Cross-covariance function of the processes at times t_1, t_2:
C_{XY}(t_1, t_2) = E(X(t_1) − μ_X(t_1))(Y(t_2) − μ_Y(t_2))
                = R_{XY}(t_1, t_2) − μ_X(t_1) μ_Y(t_2)

Cross-correlation coefficient:
ρ_{XY}(t_1, t_2) = C_{XY}(t_1, t_2) / √(C_X(t_1, t_1) C_Y(t_2, t_2))

On the basis of the above definitions, we can study the degree of dependence between
two random processes

• Independent processes: Two random processes {X(t), t ∈ Γ} and {Y(t), t ∈ Γ} are called independent if, for all choices of sample points, the random vectors (X(t_1), ..., X(t_n)) and (Y(t_1'), ..., Y(t_m')) are independent, i.e. their joint distribution function factors into the product of the joint distribution function of the X's and that of the Y's.

• Uncorrelated processes: Two random processes {X(t), t ∈ Γ} and {Y(t), t ∈ Γ} are called uncorrelated if


C XY (t1 , t2 ) = 0 ∀ t1 , t2 ∈ Γ

This also implies that for such two processes


RXY (t1 , t2 ) = µ X (t1 ) µY (t2 )
.

• Orthogonal processes: Two random processes {X(t), t ∈ Γ} and {Y(t), t ∈ Γ} are called orthogonal if


R XY (t1 , t2 ) = 0 ∀ t1 , t2 ∈ Γ

Example Suppose X (t ) = A cos( w0 t + φ ) and Y (t ) = A sin( w0t + φ ) where A and w0 are


constants and φ is uniformly distributed between 0 and 2π .
Important Classes of Random Processes

Having characterized the random process by the joint distribution ( density) functions and
joint moments we define the following two important classes of random processes.

(a) Independent and Identically Distributed Process


Consider a discrete-time random process { X n }. For any finite choice of time instants
n1 , n2 ,...., nN , if the random variables X n1 , X n2 ,..., X nN are jointly independent with a
common distribution, then { X n } is called an independent and identically distributed (iid)
random process. Thus for an iid random process { X n },
F_{X_{n_1}, X_{n_2}, ..., X_{n_N}}(x_1, x_2, ..., x_N) = F_X(x_1) F_X(x_2) ... F_X(x_N)
and equivalently
p_{X_{n_1}, X_{n_2}, ..., X_{n_N}}(x_1, x_2, ..., x_N) = p_X(x_1) p_X(x_2) ... p_X(x_N)

Moments of the IID process:


It is easy to verify that for an iid process { X n }
• Mean EX n = µ X = constant
• Variance E ( X n − µ X ) 2 = σ X2 =constant
• Autocovariance C_X(n, m) = E(X_n − μ_X)(X_m − μ_X)
  = E(X_n − μ_X) E(X_m − μ_X) = 0 for n ≠ m (by independence)
  = σ_X^2 for n = m
  i.e. C_X(n, m) = σ_X^2 δ[n, m], where δ[n, m] = 1 for n = m and 0 otherwise.
• Autocorrelation RX (n, m) = C X (n, m) + µ X2 = σ X2 δ (n, m) + µ X2

Example Bernoulli process: Consider the Bernoulli process { X n } with


p X (1) = p and
p X (0) = 1 − p
This process is an iid process.
Using the iid property, we can obtain the joint probability mass functions of any
order in terms of p. For example,
p X1 , X 2 (1, 0) = p (1 − p )
p X1 , X 2 , X 3 (0, 0,1) = (1 − p ) 2 p
and so on.
Similarly, the mean, the variance and the autocorrelation function are given by
μ_X = EX_n = p
var(X_n) = p(1 − p)
R_X(n_1, n_2) = EX_{n_1} X_{n_2} = EX_{n_1} EX_{n_2} = p^2 for n_1 ≠ n_2
              = EX_{n_1}^2 = p for n_1 = n_2

(b) Independent Increment Process

A random process { X (t )} is called an independent increment process if for any


n > 1 and t1 < t2 < ... < tn ∈ Γ, the set of n random variables
X (t1 ), X (t2 ) − X (t1 ),... ,X (tn ) − X (tn −1 ) are jointly independent random variables.

If the probability distribution of X(t + r) − X(t' + r) is the same as that of X(t) − X(t'), for any choice of t, t' and r, then {X(t)} is called a stationary increment process.
• The above definitions of the independent increment process and the stationary
increment process can be easily extended to discrete-time random processes.
• The independent increment property simplifies the calculation of joint
probability distribution, density and mass functions from the corresponding
first-order quantities. As an example,for t1 < t2 , x1 < x2 ,

FX (t1 ), X (t2 ) ( x1 , x2 ) = P ({ X (t1 ) ≤ x1 , X (t2 ) ≤ x2 }


= P ({ X (t1 ) ≤ x1}) P({ X (t2 ) ≤ x2 }/{ X (t1 ) ≤ x1})
= P ({ X (t1 ) ≤ x1 ) P({ X (t2 ) − X (t1 ) ≤ x2 − x1})
= FX (t1 ) ( x1 ) FX ( t2 ) − X (t1 ) ( x2 − x1 )
• The independent increment property simplifies the computation of the
autocovariance function.
For t1 < t2 , the autocorrelation function of X (t ) is given by
RX (t1 , t2 ) = EX (t1 ) X (t2 )
= EX (t1 )( X (t1 ) + X (t2 ) − X (t1 ))
= EX 2 (t1 ) + EX (t1 ) E ( X (t2 ) − X (t1 ))
= EX 2 (t1 ) + EX (t1 ) EX (t2 ) − ( EX (t1 )) 2
= var( X (t1 )) + EX (t1 ) EX (t2 )
∴ C X (t1 , t2 ) = EX (t1 ) X (t2 ) − EX (t1 ) EX (t2 ) = var( X (t1 ))
Similarly, for t1 > t2 ,
C X (t1 , t2 ) = var( X (t2 ))
Therefore
C X (t1 , t2 ) = var( X (min(t1 , t2 )))
Example: Two continuous-time independent increment processes are widely studied.
They are:
(a) Wiener process with the increments following Gaussian distribution and
(b) Poisson process with the increments following Poisson distribution. We shall
discuss these processes shortly.

Random Walk process


Consider an iid process {Z_n} taking two values, Z_n = 1 and Z_n = −1, with the probability mass functions
p_Z(1) = p and p_Z(−1) = q = 1 − p.
Then the sum process { X n } given by
n
X n = ∑ Z i = X n −1 + Z n
i =1

with X 0 = 0 is called a Random Walk process.


• This process is one of the widely studied random processes.
• It is an independent increment process. This follows from the fact that
X n − X n −1 = Z n and {Z n } is an iid process.
n
• If we call Z n = 1 as success and Z n = −1 as failure, then X n = ∑ Z i
i =1

represents the total number of successes in n independent trials.


1
• If p = , { X n } is called a symmetrical random walk process.
2

Probability mass function of the Random Walk Process


At an instant n, X n can take integer values from −n to n
Suppose X n = k .
Clearly k = n1 − n−1
where n1 = number of successes and n−1 = number of failures in n trials of Z n such
that n1 + n−1 = n.
n+k n−k
∴ n1 = and n−1 =
2 2
Also n1 and n−1 are necessarily non-negative integers.
∴ p_{X_n}(k) = ⁿC_{(n+k)/2} p^{(n+k)/2} (1 − p)^{(n−k)/2}   if (n+k)/2 and (n−k)/2 are non-negative integers
             = 0 otherwise
Mean, Variance and Covariance of a Random Walk process
Note that
EZ n = 1× p − 1× (1 − p) = 2 p − 1
EZ n2 = 1× p + 1× (1 − p) = 1
and
var( Z n ) = EZ n2 − (EZ n )2
= 1- 4 p 2 + 4 p − 1
= 4 pq
n
∴ EX n = ∑ EZ i = n(2 p − 1)
i =1

and
n
var( X n ) = ∑ var( Z i ) ∵ Z i s are independent random variables
i =1

= 4npq

Since the random walk process { X n } is an independent increment process, the


autocovariance function is given by
C X (n1 , n2 ) = 4 pq min(n1 , n2 )

Three realizations of a random walk process are shown in the figure below:

Remark If the increment Z n of the random walk process takes the values of s and − s,
then
n
∴ EX n = ∑ EZ i = n(2 p − 1) s
i =1

and
n
var( X n ) = ∑ var( Z i )
i =1

= 4npqs 2
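The mean, variance and pmf of the random walk derived above can be verified by simulation. A minimal sketch in Python/NumPy (p = 0.6, n = 50, step size s = 1 and the value k = 10 are illustrative assumptions):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(5)
p, n, n_walks = 0.6, 50, 100000

# Steps Z_i = +1 with probability p, -1 with probability 1-p; X_n is their sum after n steps.
Z = np.where(rng.random((n_walks, n)) < p, 1, -1)
Xn = Z.sum(axis=1)

print(Xn.mean(), n * (2 * p - 1))          # empirical vs theoretical mean n(2p-1)
print(Xn.var(), 4 * n * p * (1 - p))       # empirical vs theoretical variance 4npq

k = 10                                     # n + k must be even for a nonzero probability
pmf = comb(n, (n + k) // 2) * p**((n + k) // 2) * (1 - p)**((n - k) // 2)
print(np.mean(Xn == k), pmf)               # empirical vs theoretical P{X_n = k}
```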

(c) Markov process


A process { X (t )} is called a Markov process if for any sequence of time t1 < t2 < ..... < tn ,
P ({ X (tn ) ≤ x | X (t1 ) = x1 , X (t2 ) = x2 ,..., X (tn −1 ) = xn −1 }) = P ({ X (tn ) ≤ x | X (tn −1 ) = xn −1 })
• Thus for a Markov process “the future of the process, given present, is
independent of the past.”
• A discrete-state Markov process is called a Markov Chain. If { X n } is a discrete-
time discrete-state random process, the process is Markov if
P ({ X n = xn | X 0 = x0 , X 1 = x1 ,..., X n −1 = xn −1 }) = P ({ X n = xn | X n −1 = xn −1 })
• An iid random process is a Markov process.
• Many practical signals with strong correlation between neighbouring samples are
modelled as Markov processes

Example Show that the random walk process { X n } is Markov.


Here,
P({ X n = xn | X 0 = 0, X 1 = x1 ,..., X n −1 = xn −1 })
= P({ X n −1 + Z n = xn | X 0 = 0, X 1 = x1 ,..., X n −1 = xn −1 })
= P({Z n = xn − xn −1 })
= P({ X n = xn | X n −1 = xn −1 })

Wiener Process

Consider a symmetrical random walk process { X n } given by


X n = X (n∆ )
where the discrete instants in the time axis are separated by ∆ as shown in the Fig. below.
Assume Δ to be infinitesimally small.
[Time axis: 0, Δ, 2Δ, ..., t = nΔ]
Clearly,
EX_n = 0
var(X_n) = 4pqns^2 = 4 × (1/2) × (1/2) × ns^2 = ns^2
For large n, the distribution of X_n approaches the normal with mean 0 and variance
ns^2 = (t/Δ) s^2 = αt, where α = s^2/Δ.
As Δ → 0 and n → ∞, X_n becomes the continuous-time process X(t) with the pdf
f_{X(t)}(x) = (1/√(2παt)) e^{−x^2/(2αt)}.
This process {X(t)} is called the Wiener process.

A random process { X ( t )} is called a Wiener process or the Brownian motion process if it


satisfies the following conditions:

(1) X ( 0 ) = 0
(2) X ( t ) is an independent increment process.
(3) For each s ≥ 0, t ≥ 0, X(s + t) − X(s) has the normal distribution with mean 0 and variance αt:
f_{X(s+t)−X(s)}(x) = (1/√(2παt)) e^{−x^2/(2αt)}

• Wiener process was used to model the Brownian motion – microscopic particles
suspended in a fluid are subject to continuous molecular impacts resulting in the
zigzag motion of the particle named Brownian motion after the British Botanist
Brown.
• Wiener Process is the integration of the white noise process.

A realization of the Wiener process is shown in the figure below:


RX ( t1 , t2 ) = EX ( t1 ) X ( t2 )
= EX ( t1 ) { X ( t2 ) − X ( t1 ) + X ( t1 )} Assuming t2 > t1
= EX ( t1 ) E { X ( t2 ) − X ( t1 )} + EX 2 ( t1 )
= EX 2 ( t1 )
= α t1

Similarly if t1 > t2

RX ( t1 , t2 ) = α t2
∴ RX ( t1 , t2 ) = α min ( t1 , t2 )

f_{X(t)}(x) = (1/√(2παt)) e^{−x^2/(2αt)}

Remark
C X ( t1 , t2 ) = α min ( t1 , t2 )
X ( t ) is a Gaussian process.
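A realization of the Wiener process can be generated by accumulating independent zero-mean Gaussian increments of variance αΔt, and the relations EX(t) = 0, var X(t) = αt and R_X(t_1, t_2) = α min(t_1, t_2) can then be checked across many realizations. A minimal sketch in Python/NumPy (α = 2, the time step and the two time instants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, dt, n_steps, n_paths = 2.0, 0.01, 500, 20000

# Independent Gaussian increments with variance alpha*dt; cumulative sum gives X(t), X(0) = 0.
dX = rng.normal(0.0, np.sqrt(alpha * dt), size=(n_paths, n_steps))
X = np.cumsum(dX, axis=1)
t = dt * np.arange(1, n_steps + 1)

i1, i2 = 199, 399                                        # indices of t1 = 2.0 and t2 = 4.0
print(X[:, i2].mean())                                   # near 0
print(X[:, i2].var(), alpha * t[i2])                     # var X(t2) vs alpha*t2
print(np.mean(X[:, i1] * X[:, i2]), alpha * t[i1])       # R_X(t1,t2) vs alpha*min(t1,t2)
```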
Poisson Process

Consider a random process representing the number of occurrences of an event up to time t (i.e., over the time interval (0, t]). Such a process is called a counting process and we shall denote it by {N(t), t ≥ 0}. Clearly {N(t), t ≥ 0} is a continuous-time, discrete-state process and any of its realizations is a non-decreasing function of time.

The counting process {N(t), t ≥ 0} is called a Poisson process with rate parameter λ if
(i) N(0) = 0
(ii) N(t) is an independent increment process. Thus increments over non-overlapping intervals, such as N(t_2) − N(t_1) and N(t_4) − N(t_3), are independent.
(iii) P({N(Δt) = 1}) = λΔt + o(Δt)
where o(Δt) denotes any function such that lim_{Δt→0} o(Δt)/Δt = 0.
(iv) P({N(Δt) ≥ 2}) = o(Δt)

The assumptions are valid for many applications. Some typical examples are

• Number of alpha particles emitted by a radioactive substance.


• Number of binary packets received at switching node of a communication
network.
• Number of cars arriving at a petrol pump during a particular interval of time.

P({N(t + Δt) = n}) = Probability of occurrence of n events up to time t + Δt
= P({N(t) = n, N(t + Δt) − N(t) = 0}) + P({N(t) = n − 1, N(t + Δt) − N(t) = 1}) + P({N(t) ≤ n − 2, N(t + Δt) − N(t) ≥ 2})
= P({N(t) = n})(1 − λΔt − o(Δt)) + P({N(t) = n − 1})(λΔt + o(Δt)) + o(Δt)
∴ lim_{Δt→0} [P({N(t + Δt) = n}) − P({N(t) = n})] / Δt = −λ [P({N(t) = n}) − P({N(t) = n − 1})]
∴ d/dt P({N(t) = n}) = −λ [P({N(t) = n}) − P({N(t) = n − 1})]        (1)

The above is a first-order linear differential equation with initial conditions P({N(0) = 0}) = 1 and P({N(0) = n}) = 0 for n ≥ 1. This differential equation can be solved recursively.

First consider the problem of finding P({N(t) = 0}).
From (1),
d/dt P({N(t) = 0}) = −λ P({N(t) = 0})
⇒ P({N(t) = 0}) = e^{−λt}

Next, to find P({N(t) = 1}),
d/dt P({N(t) = 1}) = −λ P({N(t) = 1}) + λ P({N(t) = 0})
                  = −λ P({N(t) = 1}) + λ e^{−λt}
with initial condition P({N(0) = 1}) = 0.
Solving the above first-order linear differential equation we get
P({N(t) = 1}) = λt e^{−λt}

Now by mathematical induction we can show that
P({N(t) = n}) = (λt)^n e^{−λt} / n!
Remark
(1) The parameter λ is called the rate or intensity of the Poisson process.
It can be shown that
P({N(t_2) − N(t_1) = n}) = (λ(t_2 − t_1))^n e^{−λ(t_2 − t_1)} / n!
Thus the probability of the increments depends on the length of the interval t_2 − t_1 and not on the absolute times t_2 and t_1. Thus the Poisson process is a process with stationary increments.
(2) The independent and stationary increment properties help us to compute the joint
probability mass function of N (t ). For example,
P({N(t_1) = n_1, N(t_2) = n_2}) = P({N(t_1) = n_1}) P({N(t_2) = n_2} | {N(t_1) = n_1})
= P({N(t_1) = n_1}) P({N(t_2) − N(t_1) = n_2 − n_1})
= [(λt_1)^{n_1} e^{−λt_1} / n_1!] × [(λ(t_2 − t_1))^{n_2 − n_1} e^{−λ(t_2 − t_1)} / (n_2 − n_1)!]

Mean, Variance and Covariance of the Poisson process

We observe that at any time t > 0, N (t ) is a Poisson random variable with the parameter
λt. Therefore,
EN (t ) = λ t
and var N (t ) = λ t
Thus both the mean and variance of a Poisson process varies linearly with time.
As N (t ) is a random process with independent increment, we can readily show that
CN (t1 , t2 ) = var( N (min(t1 , t2 )))
= λ min(t1 , t2 )
∴ RN (t1 , t2 ) = CN (t1 , t2 ) + EN (t1 ) EN (t2 )
= λ min(t1 , t2 ) + λ 2t1t2

A typical realization of a Poisson process is shown below:

Example: A petrol pump serves on the average 30 cars per hour. Find the probability that during a period of 5 minutes (i) no car comes to the station, (ii) exactly 3 cars come to the station and (iii) more than 3 cars come to the station.

Average arrival rate λ = 30 cars/hr = 1/2 car/min, so λt = (1/2) × 5 = 2.5 for a 5-minute period.
(i) Probability of no car in 5 minutes:
P{N(5) = 0} = e^{−2.5} ≈ 0.0821
(ii) P{N(5) = 3} = ((2.5)^3 / 3!) e^{−2.5} ≈ 0.2138
(iii) P{N(5) > 3} = 1 − Σ_{k=0}^{3} P{N(5) = k} = 1 − e^{−2.5}(1 + 2.5 + 2.5^2/2! + 2.5^3/3!) ≈ 0.2424

Binomial model: if instead we crudely model each of the 5 one-minute slots as containing a car with probability p = 1/2, then
P(X = 0) = (1 − p)^5 = 0.5^5 ≈ 0.031
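The numbers in this example follow directly from the Poisson pmf with λt = 2.5; a minimal check in Python (standard library only):

```python
from math import exp, factorial

lam_t = 0.5 * 5                      # lambda = 1/2 car per minute, t = 5 minutes

def poisson_pmf(n, m):               # P{N(t) = n} with m = lambda*t
    return m**n * exp(-m) / factorial(n)

p0 = poisson_pmf(0, lam_t)
p3 = poisson_pmf(3, lam_t)
p_more_than_3 = 1.0 - sum(poisson_pmf(k, lam_t) for k in range(4))

print(p0, p3, p_more_than_3)         # approximately 0.0821, 0.2138, 0.2424
```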

Inter-arrival time and Waiting time of Poisson Process:

[Figure: event times t_1, t_2, ..., t_{n−1}, t_n on the time axis, with inter-arrival times T_1, T_2, ..., T_{n−1}, T_n between successive events]

Let T_n = time elapsed between the (n−1)-th event and the n-th event. The random sequence {T_n, n = 1, 2, ...} represents the inter-arrival times of the Poisson process.
T1 = time elapsed before the first event take place. Clearly T1 is a continuous random
variable.

Let us find out the probability P({T1 > t})

P ({T1 > t}) = P({0 event upto time t})


= e− λt

∴ FT1 (t ) = 1 − P{T1 > t} = 1 − e − λt


∴ fT1 (t ) = λ e − λt t ≥ 0

Similarly,
P ({Tn > t}) = P( {0 event occurs in the interval (tn −1 , tn −1 + t )/(n-1)th event occurs at (0, tn −1 ] })
= P( {0 event occurs in the interval (tn −1 , tn −1 + t ]})
= e − λt

∴ fTn (t ) = λ e − λt

Thus the inter-arrival times of a Poisson process with parameter λ are exponentially distributed, with
f_{T_n}(t) = λ e^{−λt}, t ≥ 0, for every n.
Remark
• We have seen that the inter-arrival times are identically distributed with the
exponential pdf. Further, we can easily prove that the inter-arrival times are
independent also. For example

FT1 ,T2 (t1 , t2 ) = P({T1 ≤ t1 , T2 ≤ t2 })


= P({T1 ≤ t1}) P({T2 ≤ t2 }/{T1 ≤ t1})
= P({T_1 ≤ t_1}) P({zero event occurs in (T_1, T_1 + t_2)} / {one event occurs in (0, T_1]})
= P({T1 ≤ t1}) P({zero event occurs in (T1 , T1 +t2 ) })
= P({T1 ≤ t1}) P({T2 ≤ t2 })
= FT1 (t1 ) FT2 (t2 )

• It is interesting to note that the converse of the above result is also true. If the
inter-arrival times between two events of a discrete state {N (t ), t ≥ 0} process are
1
exponentially distributed with mean , then {N (t ), t ≥ 0} is a Poisson process
λ
with the parameter λ .
• The exponential distribution of the inter-arrival process indicates that the arrival
process has no memory. Thus
P ({Tn > t0 + t1 / Tn > t1}) = P ({Tn > t0 }) ∀t0 , t1

Another important quantity is the waiting time Wn . This is the time that elapses before the
nth event occurs. Thus
n
Wn = ∑ Ti
i =1

How to find the first order pdf of Wn is left as an exercise. Note that Wn is the sum of n
independent and identically distributed random variables.
∴ EW_n = Σ_{i=1}^{n} ET_i = n/λ
and
var(W_n) = Σ_{i=1}^{n} var(T_i) = n/λ^2
Example
The number of customers arriving at a service station is a Poisson process with a rate
10 customers per minute.
(a) What is the mean arrival time of the customers?
(b) What is the probability that the second customer will arrive 5 minutes after the
first customer has arrived?
(c) What is the average waiting time before the 10th customer arrives?

Semi-random Telegraph signal

Consider a two-state random process { X (t )} with the states X (t ) = 1 and X (t ) = −1.


Suppose X (t ) = 1 if the number of events of the Poisson process N (t ) in the interval
(0, t ] is even and X (t ) = −1 if the number of events of the Poisson process N (t ) in the
interval (0, t ] is odd. Such a process { X (t )} is called the semi-random telegraph signal
because the initial value X (0) is always 1.
p_{X(t)}(1) = P({X(t) = 1})
= P({N(t) = 0}) + P({N(t) = 2}) + ...
= e^{−λt} {1 + (λt)^2/2! + ...}
= e^{−λt} cosh λt
and
p_{X(t)}(−1) = P({X(t) = −1})
= P({N(t) = 1}) + P({N(t) = 3}) + ...
= e^{−λt} {λt/1! + (λt)^3/3! + ...}
= e^{−λt} sinh λt

We can also find the conditional and joint probability mass functions. For example, for
t1 < t2 ,
p X ( t1 ), X ( t2 ) (1,1) = P({ X (t1 ) = 1}) P({ X (t2 ) = 1)}/{ X (t1 ) = 1)}
= e − λt1 cosh λ t1 P ({N (t2 ) is even )}/{N (t1 ) is even)}
= e − λt1 cosh λ t1 P ({N (t2 ) − N (t1 ) is even )}/{N (t1 ) is even)}
= e − λt1 cosh λ t1 P ({N (t2 ) − N (t1 ) is even )}
= e − λt1 cosh λ t1e − λ ( t2 −t1 ) cosh λ (t2 − t1 )
= e − λt2 cosh λ t1 cosh λ (t2 − t1 )
Similarly
p X ( t1 ), X ( t2 ) (1, −1) = e − λt2 cosh λ t1 sinh λ (t2 − t1 ),
p X ( t1 ), X ( t2 ) (−1,1) = e − λt2 sinh λt1 sinh λ (t2 − t1 )
p X ( t1 ), X ( t2 ) (−1, −1) = e − λt2 sinh λ t1 cosh λ (t2 − t1 )

Mean, autocorrelation and autocovariance function of X (t )

EX (t ) = 1× e − λt cosh λt − 1× e− λt sinh λ t
= e − λt (cosh λt − sinh λt )
= e −2 λ t
EX 2 (t ) = 1× e − λt cosh λ t + 1× e− λt sinh λ t
= e − λt (cosh λt + sinh λt )
= e − λt eλt
=1
∴ var( X (t )) = 1 − e −4 λt

For t1 < t2
RX (t1 , t2 ) = EX (t1 ) X (t2 )
= 1× 1× p X ( t1 ), X ( t2 ) (1,1) + 1× (−1) × p X ( t1 ), X ( t2 ) (1, −1)
+ (−1) ×1× p X ( t1 ), X ( t2 ) (−1,1) + (−1) × (−1) × p X ( t1 ), X ( t2 ) (−1, −1)
= e − λt2 cosh λ t1 cosh λ (t2 − t1 ) − e − λt2 cosh λt1 sinh λ (t2 − t1 )
− e − λt2 sinh λt1 sinh λ (t2 − t1 ) + e− λt2 sinh λt1 cosh λ (t2 − t1 )
= e^{−λt_2} cosh λt_1 (cosh λ(t_2 − t_1) − sinh λ(t_2 − t_1)) + e^{−λt_2} sinh λt_1 (cosh λ(t_2 − t_1) − sinh λ(t_2 − t_1))
= e − λt2 e − λ ( t2 −t1 ) (cosh λt1 + sinh λ t1 )
= e − λt2 e − λ ( t2 −t1 ) eλt1
= e −2 λ ( t2 −t1 )
Similarly, for t_1 > t_2,
R_X(t_1, t_2) = e^{−2λ(t_1 − t_2)}
∴ R_X(t_1, t_2) = e^{−2λ|t_1 − t_2|}
Random Telegraph signal
Consider a two-state random process {Y (t )} with the states Y (t ) = 1 and Y (t ) = −1.
Suppose P({Y(0) = 1}) = 1/2 and P({Y(0) = −1}) = 1/2 and Y(t) changes polarity with an equal probability with each occurrence of an event in a Poisson process of parameter λ.
Such a random process {Y (t )} is called a random telegraph signal and can be expressed
as
Y (t ) = AX (t )
where {X(t)} is the semi-random telegraph signal and A is a random variable independent of X(t) with P({A = 1}) = 1/2 and P({A = −1}) = 1/2.
Clearly,
1 1
EA = (−1) × + 1× = 0
2 2
and
1 1
EA2 = (−1) 2 × + 12 × = 1
2 2
Therefore,
EY (t ) = EAX (t )
= EAEX (t ) ∵ A and X (t ) are independent
=0
RY (t1 , t2 ) = EAX (t1 ) AX (t2 )
= EA2 EX (t1 ) X (t2 )
= e^{−2λ|t_1 − t_2|}
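The autocorrelation e^{−2λ|t_1 − t_2|} can be estimated by simulating the telegraph signal from Poisson counts, using the independent-increment property. A minimal sketch in Python/NumPy (λ = 1.5 and the two time instants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
lam, n_real = 1.5, 200000
t1, t2 = 1.0, 1.8

# Semi-random telegraph: X(t) = (-1)**N(t); random telegraph: Y(t) = A*X(t) with A = +/-1.
N1 = rng.poisson(lam * t1, n_real)                 # events in (0, t1]
dN = rng.poisson(lam * (t2 - t1), n_real)          # independent increment over (t1, t2]
A = rng.choice([-1, 1], n_real)

Y1 = A * (-1.0) ** N1
Y2 = A * (-1.0) ** (N1 + dN)

print(Y1.mean())                                         # near 0
print(np.mean(Y1 * Y2), np.exp(-2 * lam * (t2 - t1)))    # R_Y(t1,t2) vs e^{-2 lambda |t1-t2|}
```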
Stationary Random Process

The concept of stationarity plays an important role in solving practical problems


involving random processes. Just like time-invariance is an important characteristics of
many deterministic systems, stationarity describes certain time-invariant property of a
random process. Stationarity also leads to frequency-domain description of a random
process.

Strict-sense Stationary Process


A random process { X (t )} is called strict-sense stationary (SSS) if its probability

structure is invariant with time. In terms of the joint distribution function, { X (t )} is

called SSS if
FX ( t1 ), X (t2 )..., X (tn ) ( x1 , x2 ..., xn ) = FX (t1 + t0 ), X ( t2 + t0 )..., X ( tn + t0 ) ( x1 , x2 ..., xn )

∀n ∈ N , ∀t0 ∈ Γ and for all choices of sample points t1 , t2 ,...tn ∈ Γ.

Thus the joint distribution functions of any set of random variables X(t_1), X(t_2), ..., X(t_n) do not depend on the placement of the origin of the time axis. This requirement is very strict. Less strict forms of stationarity may be defined.
Particularly,
if FX ( t1 ), X ( t2 )..... X ( tn ) ( x1 , x2 .....xn ) = FX ( t1 +t0 ), X ( t2 +t0 )..... X ( tn +t0 ) ( x1 , x2 .....xn ) for n = 1, 2,.., k , then

{ X (t )} is called kth order stationary.


• If { X (t )} is stationary up to order 1

FX ( t1 ) ( x1 ) = FX ( t1 + t0 ) ( x1 ) ∀t 0 ∈ T

Let us assume t0 = −t1 . Then

FX ( t1 ) ( x1 ) = FX (0) ( x1 ) which is independent of time.

As a consequence
EX (t1 ) = EX (0) = µ X (0) = constant

• If { X (t )} is stationary up to order 2
FX ( t1 ), X ( t 2 ) ( x1 , x 2 .) = FX ( t1 + t0 ), X ( t 2 +t0 ) ( x1 , x 2 )

Put t 0 = −t 2

FX (t1 ), X (t2 ) ( x1 , x2 ) = FX ( t1 −t2 ), X (0) ( x1 , x2 )


This implies that the second-order distribution depends only on the time-lag t1 − t2 .
As a consequence, for such a process
R_X(t_1, t_2) = E(X(t_1)X(t_2))
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} x_1 x_2 f_{X(t_1 − t_2), X(0)}(x_1, x_2) dx_1 dx_2
= R_X(t_1 − t_2)
Similarly,
C X (t1 , t2 )= C X (t1 − t2 )
Therefore, the autocorrelation function of a SSS process depends only on the time lag
t1 − t2 .
We can also define the joint stationarity of two random processes. Two processes
{X(t)} and {Y(t)} are called jointly strict-sense stationary if their joint probability distributions of any order are invariant under translation of time. A complex process {Z(t) = X(t) + jY(t)} is called SSS if {X(t)} and {Y(t)} are jointly SSS.

Example An iid process is SSS. This is because, ∀n,


FX ( t1 ), X (t2 )..., X ( tn ) ( x1 , x2 ..., xn ) = FX ( t1 ) ( x1 ) FX ( t2 ) ( x2 )...FX ( tn ) ( xn )
= FX ( x1 ) FX ( x2 )...FX ( xn )
= FX ( t1 + t0 ), X (t2 +t0 )..., X ( tn + t0 ) ( x1 , x2 ..., xn )

Example: The Poisson process {N(t), t ≥ 0} is not stationary, because
EN(t) = λt
which varies with time.
Wide-sense stationary process
It is very difficult to test whether a process is SSS or not. A subclass of the SSS process
called the wide sense stationary process is extremely important from practical point of
view.
A random process { X (t )} is called wide sense stationary process (WSS) if
EX (t ) = µ X = constant
and
EX (t1 ) X (t2 ) = RX (t1 − t2 ) is a function of time lag t1 − t2 .
(Equivalently, Cov( X (t1 ) X (t2 )) = C X (t1 − t2 ) is a function of time lag t1 − t2 )

Remark
(1) For a WSS process { X (t )},

∴ EX 2 (t ) = RX (0) = constant
var( X (t )=EX 2 (t ) − ( EX (t ))2 = constant
CX (t1 , t2 ) = EX (t1 ) X (t2 ) − EX (t1 ) EX (t2 )
= RX (t2 − t1 ) − µ X2
∴ CX (t1 , t2 ) is a function of lag (t2 − t1 ).

(2) An SSS process is always WSS, but the converse is not always true.
Example: Sinusoid with random phase
Consider the random process { X (t )} given by

X (t ) = A cos( w0 t + φ ) where A and w0 are constants and φ is uniformly distributed


between 0 and 2π .
• This is the model of the carrier wave (sinusoid of fixed frequency) used to
analyse the noise performance of many receivers.

Note that

f_Φ(φ) = 1/(2π) for 0 ≤ φ ≤ 2π
       = 0 otherwise

By applying the rule for the transformation of a random variable, we get

f_{X(t)}(x) = 1/(π√(A^2 − x^2)) for −A < x < A
            = 0 otherwise

which is independent of t. Hence { X (t )} is first-order stationary.
Note that
EX(t) = E[A cos(w_0 t + φ)]
= ∫_0^{2π} A cos(w_0 t + φ) (1/(2π)) dφ
= 0, which is a constant,
and
R_X(t_1, t_2) = EX(t_1)X(t_2)
= E[A cos(w_0 t_1 + φ) A cos(w_0 t_2 + φ)]
= (A^2/2) E[cos(w_0 t_1 + φ + w_0 t_2 + φ) + cos(w_0 t_1 + φ − w_0 t_2 − φ)]
= (A^2/2) E[cos(w_0 (t_1 + t_2) + 2φ) + cos(w_0 (t_1 − t_2))]
= (A^2/2) cos(w_0 (t_1 − t_2)), which is a function of the lag t_1 − t_2.
Hence {X(t)} is wide-sense stationary.
Example: Sinusoid with random amplitude
Consider the random process { X (t )} given by

X (t ) = A cos( w0 t + φ ) where φ and w0 are constants and A is a random variable. Here,

EX (t ) = EA × cos( w0t + φ )

which is independent of time only if EA = 0.


R_X(t_1, t_2) = EX(t_1)X(t_2)
= E[A cos(w_0 t_1 + φ) A cos(w_0 t_2 + φ)]
= EA^2 × cos(w_0 t_1 + φ) cos(w_0 t_2 + φ)
= (EA^2/2) [cos(w_0 (t_1 + t_2) + 2φ) + cos(w_0 (t_1 − t_2))]
which is not a function of (t_1 − t_2) only.

Example: Random binary wave


Consider a binary random process { X (t )} consisting of a sequence of random pulses of
duration T with the following features:
• Pulse amplitude A_k is a random variable with two values, p_{A_k}(1) = 1/2 and p_{A_k}(−1) = 1/2.
• Pulse amplitudes at different pulse durations are independent of each other.
• The start time of the pulse sequence can be any value between 0 to T. Thus the
random start time D (Delay) is uniformly distributed between 0 and T.

A realization of the random binary wave is shown in the figure above. Such waveforms are used in binary communication: a pulse of amplitude 1 is used to transmit '1' and a pulse of amplitude −1 is used to transmit '0'.

The random process X(t) can be written as
X(t) = Σ_{n=−∞}^{∞} A_n rect((t − nT − D)/T)

For any t,
EX(t) = 1 × 1/2 + (−1) × 1/2 = 0
EX^2(t) = 1^2 × 1/2 + (−1)^2 × 1/2 = 1

Thus mean and variance of the process are constants.

To find the autocorrelation function R_X(t_1, t_2), let us consider the case 0 < t_1 < t_2 = t_1 + τ < T. Depending on the delay D, the points t_1 and t_2 may lie in one pulse interval or in two adjacent pulse intervals.
If 0 < D < t_1 or t_2 < D < T, then t_1 and t_2 lie in the same pulse interval, so that
X(t_1)X(t_2) = (1)(1) = 1 or X(t_1)X(t_2) = (−1)(−1) = 1.
If t_1 < D < t_2, then t_1 and t_2 lie in two adjacent pulse intervals whose amplitudes are independent, so that
X(t_1)X(t_2) = 1 or −1, each with probability 1/2.
Thus,
E[X(t_1)X(t_2) | D] = 1 for 0 < D < t_1 or t_2 < D < T
                    = 1 × 1/2 − 1 × 1/2 = 0 for t_1 < D < t_2
∴ R_X(t_1, t_2) = E[E[X(t_1)X(t_2) | D]]
= 1 × P(0 < D < t_1 or t_2 < D < T) + 0 × P(t_1 < D < t_2)
= 1 × (1 − (t_2 − t_1)/T)
= 1 − (t_2 − t_1)/T

We also have R_X(t_2, t_1) = EX(t_2)X(t_1) = EX(t_1)X(t_2) = R_X(t_1, t_2),
so that R_X(t_1, t_2) = 1 − |t_2 − t_1|/T for |t_2 − t_1| ≤ T.
For |t_2 − t_1| > T, t_1 and t_2 lie in different pulse intervals, so
EX(t_1)X(t_2) = EX(t_1)EX(t_2) = 0.
Thus the autocorrelation function of the random binary waveform depends only on τ = t_2 − t_1, and we can write
R_X(τ) = 1 − |τ|/T for |τ| ≤ T
       = 0 otherwise

The plot of R_X(τ) is shown below.
[Figure: R_X(τ) is a triangle of unit height on −T ≤ τ ≤ T and zero outside]
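The triangular autocorrelation can be estimated by simulating the random binary wave directly from its definition. A minimal sketch in Python/NumPy (T = 1, the observation time t_1 and the lags are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
T, n_real, t1 = 1.0, 200000, 0.37
taus = [0.0, 0.25, 0.5, 0.75, 1.0, 1.5]

D = rng.uniform(0.0, T, n_real)                     # random start delay per realization
# Enough independent +/-1 pulse amplitudes per realization to cover the times we sample.
amps = rng.choice([-1.0, 1.0], size=(n_real, 5))

def X(t):
    n = np.floor((t - D) / T).astype(int) + 1       # pulse index, shifted to start at 0
    return amps[np.arange(n_real), n]

x1 = X(t1)
for tau in taus:
    x2 = X(t1 + tau)
    theory = max(0.0, 1.0 - tau / T)
    print(tau, np.mean(x1 * x2), theory)            # empirical vs 1 - |tau|/T (0 beyond T)
```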
Example Gaussian Random Process
Consider the Gaussian process {X(t)} discussed earlier. For any positive integer n, X(t_1), X(t_2), ..., X(t_n) are jointly Gaussian with the joint density function
f_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = (1/((2π)^{n/2} √det(C_X))) e^{−(1/2)(x − μ_X)' C_X^{−1} (x − μ_X)}
where C_X = E(X − μ_X)(X − μ_X)'
and μ_X = E(X) = [E(X(t_1)), E(X(t_2)), ..., E(X(t_n))]'.

If {X(t)} is WSS, then
μ_X = [EX(t_1), EX(t_2), ..., EX(t_n)]' = [μ_X, μ_X, ..., μ_X]'
and
C_X = E[(X − μ_X)(X − μ_X)']
    = [ C_X(0)          C_X(t_2 − t_1)  ...  C_X(t_n − t_1)
        C_X(t_2 − t_1)  C_X(0)          ...  C_X(t_n − t_2)
        ...
        C_X(t_n − t_1)  C_X(t_n − t_2)  ...  C_X(0) ]

We see that f_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) depends only on the time lags. Thus, for a Gaussian random process, wide-sense stationarity implies strict-sense stationarity, because such a process is completely described by the mean and the autocorrelation function.
Properties Autocorrelation Function of a real WSS Random Process
Autocorrelation of a deterministic signal
Consider a deterministic signal x(t ) such that
0 < lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x^2(t) dt < ∞
Such signals are called power signals. For a power signal x(t), the autocorrelation function is defined as
R_x(τ) = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x(t + τ) x(t) dt

Rx (τ ) measures the similarity between a signal and its time-shifted version.

Particularly, R_x(0) = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} x^2(t) dt is the mean-square value. If x(t) is a voltage waveform across a 1 ohm resistance, then R_x(0) is the average power delivered to the resistance. In this sense, R_x(0) represents the average power of the signal.

Example Suppose x(t ) = A cos ω t. The autocorrelation function of x(t) at lag τ is given
by
R_x(τ) = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} A cos ω(t + τ) A cos ωt dt
= lim_{T→∞} (A^2/(4T)) ∫_{−T}^{T} [cos(2ωt + ωτ) + cos ωτ] dt
= (A^2 cos ωτ)/2
We see that R_x(τ) of the above periodic signal is also periodic and its maximum occurs when τ = 0, ±2π/ω, ±4π/ω, etc. The power of the signal is R_x(0) = A^2/2.
The autocorrelation of the deterministic signal gives us insight into the properties of the
autocorrelation function of a WSS process. We shall discuss these properties next.
Properties of the autocorrelation function of a WSS process
Consider a real WSS process {X(t)}. Since the autocorrelation function RX(t1, t2) of such a process is a function of the lag τ = t1 − t2 only, we can define a one-parameter autocorrelation function as

RX(τ) = EX(t + τ)X(t)

If {X(t)} is a complex WSS process, then

RX(τ) = EX(t + τ)X*(t)

where X*(t) is the complex conjugate of X(t). For a discrete random sequence, we can define the autocorrelation sequence similarly.

The autocorrelation function is an important function characterising a WSS random process. It possesses some general properties, which we briefly describe below.
1. RX (0) = EX 2 (t ) is the mean-square value of the process. If X (t) is a voltage

signal applied across a 1 ohm resistance, then RX (0) is the ensemble average
power delivered to the resistance. Thus,
RX (0) = EX 2 (t ) ≥ 0.

2. For a real WSS process X(t), RX(τ) is an even function of the lag τ, that is, RX(−τ) = RX(τ). This follows because

RX(−τ) = EX(t − τ)X(t)
       = EX(t)X(t − τ)
       = EX(t1 + τ)X(t1)    (substituting t1 = t − τ)
       = RX(τ)

Remark For a complex process X(t), RX (−τ ) = R*X (τ )

3. |RX(τ)| ≤ RX(0). This follows from the Cauchy–Schwarz inequality

⟨X(t), X(t + τ)⟩² ≤ ‖X(t)‖² ‖X(t + τ)‖²

We have

RX²(τ) = {EX(t)X(t + τ)}²
       ≤ EX²(t) EX²(t + τ)
       = RX(0) RX(0)
∴ |RX(τ)| ≤ RX(0)

4. RX(τ) is a positive semi-definite function in the sense that for any positive integer n, any real a1, a2, ..., an and any time instants t1, t2, ..., tn,

Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj RX(ti − tj) ≥ 0

(a small numerical illustration of this property is given after this list of properties).

Proof
Define the random variable

Y = Σ_{i=1}^{n} ai X(ti)

Then we have

0 ≤ EY² = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj EX(ti)X(tj)
        = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj RX(ti − tj)

It can be shown that the sufficient condition for a function RX (τ ) to be the


autocorrelation function of a real WSS process { X (t )} is that RX (τ ) be real, even and
positive semidefinite.
5. If X (t ) is MS periodic, then R X (τ ) is also periodic with the same period.
Proof:
Note that a real WSS random process { X (t )} is called mean-square periodic ( MS

periodic) with a period T p if for every t ∈ Γ

E ( X (t + Tp ) − X (t )) 2 = 0
⇒ EX 2 (t + Tp ) + EX 2 (t ) − 2 EX (t + Tp ) X (t ) = 0
⇒ RX (0) + RX (0) − 2 RX (Tp ) = 0
⇒ RX (Tp ) = RX (0)

Again, by the Cauchy–Schwarz inequality,

(E[(X(t + τ + Tp) − X(t + τ)) X(t)])² ≤ E(X(t + τ + Tp) − X(t + τ))² EX²(t)
⇒ (RX(τ + Tp) − RX(τ))² ≤ 2(RX(0) − RX(Tp)) RX(0)
⇒ (RX(τ + Tp) − RX(τ))² ≤ 0    (∵ RX(0) = RX(Tp))
∴ RX(τ + Tp) = RX(τ)

For example, X(t) = A cos(ω0 t + φ), where A and ω0 are constants and φ ~ U[0, 2π], is an MS periodic random process with period 2π/ω0. Its autocorrelation function

RX(τ) = (A² cos ω0τ)/2

is periodic with the same period 2π/ω0.

The converse of this result is also true: if RX(τ) is periodic with period Tp, then X(t) is MS periodic with period Tp. This property helps us in determining the time period of an MS periodic random process.

6. Suppose X(t) = µX + V(t), where V(t) is a zero-mean WSS process and lim_{τ→∞} RV(τ) = 0. Then

lim_{τ→∞} RX(τ) = µX²
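As a numerical illustration of Property 4 (a sketch under assumed values: the triangular autocorrelation of the random binary waveform and an arbitrary set of time instants are the only inputs), the matrix [RX(ti − tj)] should have no negative eigenvalues:

import numpy as np

T = 1.0
R = lambda tau: np.maximum(1.0 - np.abs(tau) / T, 0.0)   # triangular autocorrelation, zero for |tau| > T
t = np.linspace(0.0, 3.0, 40)                            # arbitrary time instants t1, ..., tn
M = R(t[:, None] - t[None, :])                           # M[i, j] = R_X(t_i - t_j)
print(np.linalg.eigvalsh(M).min())                       # >= 0 up to round-off: positive semidefinite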

Interpretation of the autocorrelation function of a WSS process

The autocorrelation function RX(τ) measures the correlation between the two random variables X(t) and X(t + τ). If RX(τ) drops quickly with τ, the samples X(t) and X(t + τ) decorrelate quickly; this in turn means that the signal changes rapidly with time and has strong high-frequency components. If RX(τ) drops slowly, the signal samples remain highly correlated and the signal has fewer high-frequency components. Later we will see that RX(τ) is directly related to the frequency-domain representation of a WSS process.


Cross correlation function of jointly WSS processes

If {X(t)} and {Y(t)} are two real jointly WSS random processes, their cross-correlation function is independent of t and depends only on the time lag. We can write the cross-correlation function as

RXY(τ) = EX(t + τ)Y(t)

The cross correlation function satisfies the following properties:

(i ) RXY (τ ) = RYX (−τ )


This is because
RXY (τ ) = EX (t + τ )Y (t )
= EY (t ) X (t + τ )
= RYX (−τ )

Fig. RXY(τ) and RYX(τ) plotted against τ, illustrating RXY(τ) = RYX(−τ).

(ii) |RXY(τ)| ≤ √(RX(0) RY(0))

We have

|RXY(τ)|² = |EX(t + τ)Y(t)|²
          ≤ EX²(t + τ) EY²(t)    (using the Cauchy–Schwarz inequality)
          = RX(0) RY(0)
∴ |RXY(τ)| ≤ √(RX(0) RY(0))

Further,

√(RX(0) RY(0)) ≤ (1/2)(RX(0) + RY(0))    (∵ geometric mean ≤ arithmetic mean)

∴ |RXY(τ)| ≤ √(RX(0) RY(0)) ≤ (1/2)(RX(0) + RY(0))

(iii) If X(t) and Y(t) are uncorrelated, then RXY(τ) = EX(t + τ) EY(t) = µX µY

(iv) If X(t) and Y(t) are orthogonal processes, then RXY(τ) = EX(t + τ)Y(t) = 0


Example
Consider a random process Z(t) which is the sum of two real jointly WSS random processes X(t) and Y(t). We have

Z(t) = X(t) + Y(t)
RZ(τ) = EZ(t + τ)Z(t)
      = E[X(t + τ) + Y(t + τ)][X(t) + Y(t)]
      = RX(τ) + RY(τ) + RXY(τ) + RYX(τ)

If X(t) and Y(t) are orthogonal processes, then RXY(τ) = RYX(τ) = 0, so that

RZ(τ) = RX(τ) + RY(τ)
Continuity and Differentiation of Random Processes

• We know that the dynamic behavior of a system is described by differential


equations involving input to and output of the system. For example, the behavior
of the RC network is described by a first order linear differential equation with the
source voltage as the input and the capacitor voltage as the output. What happens
if the input voltage to the network is a random process?

• Each realization of the random process is a deterministic function, and the concepts of differentiation and integration can be applied to it. Our aim is to extend these concepts to the ensemble of realizations and apply calculus to random processes.

We have discussed the convergence and the limit of a random sequence. The continuity of a random process can be defined with the help of the convergence and limits of a random process. Continuity can be defined in several senses: continuity with probability 1, mean-square continuity, continuity in probability, and so on. We shall discuss mean-square continuity and the elementary concepts of the corresponding mean-square calculus.

Mean-square continuity of a random process


Recall that a sequence of random variables {Xn} converges to a random variable X in the mean-square sense if

lim_{n→∞} E[Xn − X]² = 0

and we write

l.i.m._{n→∞} Xn = X

A random process {X(t)} is said to be continuous at a point t = t0 in the mean-square sense if l.i.m._{t→t0} X(t) = X(t0), or equivalently

lim_{t→t0} E[X(t) − X(t0)]² = 0

Mean-square continuity and autocorrelation function

(1) A random process {X(t)} is MS continuous at t0 if its autocorrelation function RX(t1, t2) is continuous at (t0, t0).

Proof:

E[X(t) − X(t0)]² = E( X²(t) − 2X(t)X(t0) + X²(t0) )
                 = RX(t, t) − 2RX(t, t0) + RX(t0, t0)

If RX(t1, t2) is continuous at (t0, t0), then

lim_{t→t0} E[X(t) − X(t0)]² = lim_{t→t0} [ RX(t, t) − 2RX(t, t0) + RX(t0, t0) ]
                            = RX(t0, t0) − 2RX(t0, t0) + RX(t0, t0)
                            = 0

(2) If {X(t)} is MS continuous at t0, its mean is continuous at t0.

This follows from the fact that

(E[X(t) − X(t0)])² ≤ E[(X(t) − X(t0))²]

∴ lim_{t→t0} (E[X(t) − X(t0)])² ≤ lim_{t→t0} E[X(t) − X(t0)]² = 0

∴ EX(t) is continuous at t0.

Example
Consider the random binary wave {X(t)} discussed in an earlier example. A typical realization of the process is shown in the figure below; the realization is a discontinuous function.

Fig. A typical realization of the random binary wave X(t): a sequence of rectangular pulses of width Tp and amplitude ±1.

The process has the autocorrelation function

RX(τ) = 1 − |τ|/Tp    for |τ| ≤ Tp
      = 0             otherwise

We observe that RX(τ) is continuous at τ = 0, and hence RX(τ) is continuous at every τ. In particular, RX(t1, t2) is continuous at every point (t0, t0), so {X(t)} is MS continuous everywhere even though its realizations are discontinuous.


Example For a Wiener process { X ( t )} ,
RX ( t1 , t2 ) = α min ( t1 , t2 )
where α is a constant.
∴ RX ( t , t ) = α min ( t , t ) = α t
Thus the autocorrelation function of a Wiener process is continuous everywhere implying
that a Wiener process is m.s. continuous everywhere. We can similarly show that the
Poisson process is m.s. continuous everywhere.

Mean-square differentiability

The random process {X(t)} is said to have the mean-square derivative X′(t) at a point t ∈ Γ provided (X(t + ∆t) − X(t))/∆t approaches X′(t) in the mean-square sense as ∆t → 0. In other words, the random process {X(t)} has an m.s. derivative X′(t) at t if

lim_{∆t→0} E[ (X(t + ∆t) − X(t))/∆t − X′(t) ]² = 0

Remark
(1) If all the sample functions of a random process X(t) are differentiable, then the above condition is satisfied and the m.s. derivative exists.
Example Consider the random-phase sinusoid { X (t )} given by

X (t ) = A cos( w0 t + φ ) where A and w0 are constants and φ ~ U [0, 2π ]. Then for each
φ , X (t ) is differentiable. Therefore, the m.s. derivative is

X ′(t ) = − Aw0 sin( w0t + φ )

M.S. Derivative and Autocorrelation functions


The m.s. derivative of a random process X(t) at a point t ∈ Γ exists if ∂²RX(t1, t2)/∂t1∂t2 exists at the point (t, t).

Applying the Cauchy criterion, the condition for existence of the m.s. derivative is

lim_{∆t1, ∆t2→0} E[ (X(t + ∆t1) − X(t))/∆t1 − (X(t + ∆t2) − X(t))/∆t2 ]² = 0

Expanding the square and taking the expectation, we get

E[ (X(t + ∆t1) − X(t))/∆t1 − (X(t + ∆t2) − X(t))/∆t2 ]²
= [ RX(t + ∆t1, t + ∆t1) + RX(t, t) − 2RX(t + ∆t1, t) ]/∆t1²
  + [ RX(t + ∆t2, t + ∆t2) + RX(t, t) − 2RX(t + ∆t2, t) ]/∆t2²
  − 2[ RX(t + ∆t1, t + ∆t2) − RX(t + ∆t1, t) − RX(t, t + ∆t2) + RX(t, t) ]/(∆t1 ∆t2)

Each of the terms within square brackets converges to ∂²RX(t1, t2)/∂t1∂t2 |_{t1 = t, t2 = t} if this second partial derivative exists. Therefore,

lim_{∆t1, ∆t2→0} E[ (X(t + ∆t1) − X(t))/∆t1 − (X(t + ∆t2) − X(t))/∆t2 ]²
= [ ∂²RX(t1, t2)/∂t1∂t2 + ∂²RX(t1, t2)/∂t1∂t2 − 2 ∂²RX(t1, t2)/∂t1∂t2 ] |_{t1 = t, t2 = t}
= 0

Thus, {X(t)} is m.s. differentiable at t ∈ Γ if ∂²RX(t1, t2)/∂t1∂t2 exists at (t, t) ∈ Γ × Γ.

In particular, if X(t) is WSS,

RX(t1, t2) = RX(t1 − t2)

Substituting τ = t1 − t2, we get

∂²RX(t1, t2)/∂t1∂t2 = ∂²RX(t1 − t2)/∂t1∂t2
= ∂/∂t1 [ dRX(τ)/dτ · ∂(t1 − t2)/∂t2 ]
= −∂/∂t1 [ dRX(τ)/dτ ]
= −d²RX(τ)/dτ² · ∂τ/∂t1
= −d²RX(τ)/dτ²

Therefore, a WSS process X(t) is m.s. differentiable if RX(τ) has a second derivative at τ = 0.

Example
Consider a WSS process {X(t)} with autocorrelation function

RX(τ) = exp(−a|τ|)

RX(τ) does not have first and second derivatives at τ = 0. Therefore, {X(t)} is not mean-square differentiable.
Example The random binary wave {X(t)} has the autocorrelation function

RX(τ) = 1 − |τ|/Tp    for |τ| ≤ Tp
      = 0             otherwise

RX(τ) does not have first and second derivatives at τ = 0. Therefore, {X(t)} is not mean-square differentiable.
Example For a Wiener process {X(t)},

RX(t1, t2) = α min(t1, t2)

where α is a constant. For a fixed t2,

RX(t1, t2) = α t1    if t1 < t2
           = α t2    if t1 ≥ t2

so that

∂RX(t1, t2)/∂t1 = α    if t1 < t2
                = 0    if t1 > t2
                and does not exist at t1 = t2.

∴ ∂²RX(t1, t2)/∂t1∂t2 does not exist at any point (t, t), in particular at (t1 = 0, t2 = 0).

Thus a Wiener process is m.s. differentiable nowhere.

Mean and Autocorrelation of the Derivative process

We have

EX′(t) = E lim_{∆t→0} (X(t + ∆t) − X(t))/∆t
       = lim_{∆t→0} (EX(t + ∆t) − EX(t))/∆t
       = lim_{∆t→0} (µX(t + ∆t) − µX(t))/∆t
       = µX′(t)

For a WSS process, EX′(t) = µX′(t) = 0, since µX(t) = constant.

EX(t1)X′(t2) = RXX′(t1, t2)
             = EX(t1) lim_{∆t2→0} (X(t2 + ∆t2) − X(t2))/∆t2
             = lim_{∆t2→0} E[X(t1)X(t2 + ∆t2) − X(t1)X(t2)]/∆t2
             = lim_{∆t2→0} [RX(t1, t2 + ∆t2) − RX(t1, t2)]/∆t2
             = ∂RX(t1, t2)/∂t2

Similarly we can show that

EX′(t1)X′(t2) = ∂²RX(t1, t2)/∂t1∂t2

For a WSS process,

EX(t1)X′(t2) = ∂RX(t1 − t2)/∂t2 = −dRX(τ)/dτ |_{τ = t1 − t2}

and

EX′(t1)X′(t2) = RX′(t1 − t2) = ∂²RX(t1, t2)/∂t1∂t2 = −d²RX(τ)/dτ² |_{τ = t1 − t2}

∴ var(X′(t)) = RX′(0) = −d²RX(τ)/dτ² |_{τ = 0}
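For the random-phase sinusoid X(t) = A cos(ω0 t + φ) with RX(τ) = (A²/2)cos ω0τ, the m.s. derivative X′(t) = −Aω0 sin(ω0 t + φ) has EX′²(t) = A²ω0²/2, which should equal −d²RX(τ)/dτ² at τ = 0. A finite-difference check (the constants below are assumptions for illustration):

import numpy as np

A, w0 = 1.5, 2.0
R = lambda tau: (A**2 / 2) * np.cos(w0 * tau)
h = 1e-4
R2 = (R(h) - 2 * R(0.0) + R(-h)) / h**2      # central-difference estimate of R''(0)
print(-R2, (A * w0)**2 / 2)                  # -R''(0) should equal A^2 w0^2 / 2 = var(X'(t))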
Mean Square Integral
Recall that the definite (Riemann) integral of a function x(t) over the interval [t0, t] is defined as the limiting sum

∫_{t0}^{t} x(τ) dτ = lim_{n→∞, ∆k→0} Σ_{k=0}^{n−1} x(τk) ∆k

where t0 < t1 < ... < tn−1 < tn = t is a partition of the interval [t0, t], ∆k = tk+1 − tk and τk ∈ [tk, tk+1].

For a random process {X(t)}, the m.s. integral can be similarly defined as the process {Y(t)} given by

Y(t) = ∫_{t0}^{t} X(τ) dτ = l.i.m._{n→∞, ∆k→0} Σ_{k=0}^{n−1} X(τk) ∆k

Existence of M.S. Integral

• It can be shown that a sufficient condition for the m.s. integral ∫_{t0}^{t} X(τ) dτ to exist is that the double integral ∫_{t0}^{t} ∫_{t0}^{t} RX(τ1, τ2) dτ1 dτ2 exists.

• If {X(t)} is M.S. continuous, then the above condition is satisfied and the process is M.S. integrable.
Mean and Autocorrelation of the Integral of a WSS process
We have

EY(t) = E ∫_{t0}^{t} X(τ) dτ = ∫_{t0}^{t} EX(τ) dτ = ∫_{t0}^{t} µX dτ = µX (t − t0)

Therefore, if µX ≠ 0, {Y(t)} is necessarily non-stationary.

RY(t1, t2) = EY(t1)Y(t2)
           = E ∫_{t0}^{t1} ∫_{t0}^{t2} X(τ1) X(τ2) dτ1 dτ2
           = ∫_{t0}^{t1} ∫_{t0}^{t2} EX(τ1) X(τ2) dτ1 dτ2
           = ∫_{t0}^{t1} ∫_{t0}^{t2} RX(τ1 − τ2) dτ1 dτ2

which is a function of t1 and t2 (and not of t1 − t2 alone).

Thus the integral of a WSS process is always non-stationary.

Fig. (a) A realization of a WSS process X(t); (b) the corresponding integral Y(t).

Remark The non-stationarity of the M.S. integral of a random process has physical importance: the output of an integrator driven by stationary noise grows without bound.
Example The random binary wave {X(t)} has the autocorrelation function

RX(τ) = 1 − |τ|/Tp    for |τ| ≤ Tp
      = 0             otherwise

RX(τ) is continuous at τ = 0, implying that {X(t)} is M.S. continuous. Therefore, {X(t)} is mean-square integrable.
Time averages and Ergodicity

Often we are interested in finding the various ensemble averages of a random process {X(t)} by means of the corresponding time averages determined from a single realization of the random process. For example, we can compute the time-averaged mean of a single realization of the random process by the formula

⟨µx⟩ = lim_{T→∞} (1/2T) ∫_{−T}^{T} x(t) dt

which is a constant for the selected realization; it represents the dc value of x(t). Another important average used in electrical engineering is the time-averaged mean-square value (the square of the rms value) given by

⟨x²_rms⟩ = lim_{T→∞} (1/2T) ∫_{−T}^{T} x²(t) dt

Can ⟨µx⟩ and ⟨x²_rms⟩ represent µX and EX²(t) respectively? To answer such questions we have to understand the various time averages and their properties.

Time averages of a random process


The time average of a function g(X(t)) of a continuous-time random process {X(t)} is defined by

⟨g(X(t))⟩_T = (1/2T) ∫_{−T}^{T} g(X(t)) dt

where the integral is defined in the mean-square sense.

Similarly, the time average of a function g(Xn) of a discrete-time random process {Xn} is defined by

⟨g(Xn)⟩_N = (1/(2N + 1)) Σ_{i=−N}^{N} g(Xi)

These definitions are in contrast to the corresponding ensemble averages defined by

Eg(X(t)) = ∫_{−∞}^{∞} g(x) f_{X(t)}(x) dx            for the continuous case
         = Σ_{i ∈ R_X(t)} g(xi) p_{X(t)}(xi)          for the discrete case
The following time averages are of particular interest:

(a) Time-averaged mean

⟨µX⟩_T = (1/2T) ∫_{−T}^{T} X(t) dt                    (continuous case)
⟨µX⟩_N = (1/(2N + 1)) Σ_{i=−N}^{N} Xi                 (discrete case)

(b) Time-averaged autocorrelation function

⟨RX(τ)⟩_T = (1/2T) ∫_{−T}^{T} X(t) X(t + τ) dt        (continuous case)
⟨RX[m]⟩_N = (1/(2N + 1)) Σ_{i=−N}^{N} Xi Xi+m         (discrete case)

Note that ⟨g(X(t))⟩_T and ⟨g(Xn)⟩_N are random variables, governed by their respective probability distributions. However, the determination of these distributions is difficult, and we shall instead discuss the behaviour of these averages in terms of their means and variances. We shall further assume that the random processes {X(t)} and {Xn} are WSS.

Mean and Variance of the time averages


Let us consider the simplest case, the time-averaged mean of a discrete-time WSS random process {Xn}, given by

⟨µX⟩_N = (1/(2N + 1)) Σ_{i=−N}^{N} Xi

The mean of ⟨µX⟩_N is

E⟨µX⟩_N = E (1/(2N + 1)) Σ_{i=−N}^{N} Xi = (1/(2N + 1)) Σ_{i=−N}^{N} EXi = µX

and the variance is

E(⟨µX⟩_N − µX)² = E( (1/(2N + 1)) Σ_{i=−N}^{N} Xi − µX )²
= E( (1/(2N + 1)) Σ_{i=−N}^{N} (Xi − µX) )²
= (1/(2N + 1)²) [ Σ_{i=−N}^{N} E(Xi − µX)² + Σ_{i=−N}^{N} Σ_{j=−N, j≠i}^{N} E(Xi − µX)(Xj − µX) ]

If the samples X_{−N}, X_{−N+1}, ..., X_{N−1}, X_N are uncorrelated,

E(⟨µX⟩_N − µX)² = (1/(2N + 1)²) Σ_{i=−N}^{N} E(Xi − µX)² = σX²/(2N + 1)

We also observe that

lim_{N→∞} E(⟨µX⟩_N − µX)² = 0

From the above results, we conclude that ⟨µX⟩_N → µX in the mean-square sense.
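A Monte Carlo sketch of this σX²/(2N + 1) behaviour (the Gaussian distribution of the samples, N, σX and the number of trials are assumptions):

import numpy as np

rng = np.random.default_rng(2)
N, sigma, trials = 10, 2.0, 200000
X = rng.normal(0.0, sigma, size=(trials, 2 * N + 1))   # uncorrelated samples X_{-N}, ..., X_N per trial
mu_hat = X.mean(axis=1)                                # time-averaged mean of each realization
print(mu_hat.var(), sigma**2 / (2 * N + 1))            # empirical variance vs sigma_X^2 / (2N+1)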

Let us consider the time-averaged mean for the continuous case. We have

⟨µX⟩_T = (1/2T) ∫_{−T}^{T} X(t) dt

∴ E⟨µX⟩_T = (1/2T) ∫_{−T}^{T} EX(t) dt = (1/2T) ∫_{−T}^{T} µX dt = µX
and the variance is

E(⟨µX⟩_T − µX)² = E( (1/2T) ∫_{−T}^{T} X(t) dt − µX )²
= E( (1/2T) ∫_{−T}^{T} (X(t) − µX) dt )²
= (1/4T²) ∫_{−T}^{T} ∫_{−T}^{T} E(X(t1) − µX)(X(t2) − µX) dt1 dt2
= (1/4T²) ∫_{−T}^{T} ∫_{−T}^{T} CX(t1 − t2) dt1 dt2

The above double integral is evaluated over the square region bounded by t1 = ±T and t2 = ±T. We divide this square region into thin trapezoidal strips parallel to the line t1 − t2 = 0. Putting t1 − t2 = τ and noting that the differential area between the lines t1 − t2 = τ and t1 − t2 = τ + dτ is (2T − |τ|)dτ, the double integral is converted to a single integral as follows:

E(⟨µX⟩_T − µX)² = (1/4T²) ∫_{−T}^{T} ∫_{−T}^{T} CX(t1 − t2) dt1 dt2
= (1/4T²) ∫_{−2T}^{2T} (2T − |τ|) CX(τ) dτ
= (1/2T) ∫_{−2T}^{2T} (1 − |τ|/2T) CX(τ) dτ

Fig. The square region of integration bounded by t1 = ±T and t2 = ±T, showing the strip between the lines t1 − t2 = τ and t1 − t2 = τ + dτ (the extreme lines are t1 − t2 = ±2T).

Ergodicity Principle

If the time averages converge to the corresponding ensemble averages in a probabilistic sense, then a time average computed from a sufficiently long realization can be used as the value of the corresponding ensemble average. This is the ergodicity principle, discussed below.

Mean ergodic process

A WSS process {X(t)} is said to be mean ergodic (ergodic in mean) if ⟨µX⟩_T → µX in the mean-square sense as T → ∞. Thus for a mean ergodic process {X(t)},

lim_{T→∞} E⟨µX⟩_T = µX
and
lim_{T→∞} var⟨µX⟩_T = 0

We have earlier shown that

E⟨µX⟩_T = µX
and
var⟨µX⟩_T = (1/2T) ∫_{−2T}^{2T} CX(τ) [1 − |τ|/2T] dτ

Therefore, the condition for ergodicity in mean is

lim_{T→∞} (1/2T) ∫_{−2T}^{2T} CX(τ) [1 − |τ|/2T] dτ = 0

If CX(τ) decays to 0 sufficiently fast as |τ| increases (for example, if CX(τ) = 0 for |τ| greater than some τ0), the above condition is satisfied.

Further,

| (1/2T) ∫_{−2T}^{2T} CX(τ) [1 − |τ|/2T] dτ | ≤ (1/2T) ∫_{−2T}^{2T} |CX(τ)| dτ

Therefore, a sufficient condition for mean ergodicity is

∫_{−∞}^{∞} |CX(τ)| dτ < ∞

Example Consider the random binary waveform {X(t)} discussed in an earlier example. The process has the autocovariance function

CX(τ) = 1 − |τ|/Tp    for |τ| ≤ Tp
      = 0             otherwise

Here

∫_{−∞}^{∞} |CX(τ)| dτ = 2 ∫_{0}^{Tp} (1 − τ/Tp) dτ = 2 (Tp − Tp/2) = Tp

∴ ∫_{−∞}^{∞} |CX(τ)| dτ < ∞

Hence {X(t)} is mean ergodic.
Autocorrelation ergodicity

⟨RX(τ)⟩_T = (1/2T) ∫_{−T}^{T} X(t) X(t + τ) dt

If we define Z(t) = X(t)X(t + τ), then µZ = EZ(t) = RX(τ), and {X(t)} will be autocorrelation ergodic if {Z(t)} is mean ergodic. Thus {X(t)} will be autocorrelation ergodic if

lim_{T→∞} (1/2T) ∫_{−2T}^{2T} (1 − |τ1|/2T) CZ(τ1) dτ1 = 0

where

CZ(τ1) = EZ(t)Z(t − τ1) − EZ(t)EZ(t − τ1)
       = EX(t)X(t + τ)X(t − τ1)X(t − τ1 + τ) − RX²(τ)

This involves a fourth-order moment of the process. For particular processes, such as jointly Gaussian processes, this fourth-order moment can be expressed in terms of the autocorrelation function, and the above condition for autocorrelation ergodicity can then be checked.

Example
Consider the random-phase sinusoid
X(t) = A cos(ω0 t + φ), where A and ω0 are constants and φ ~ U[0, 2π] is a random variable. We have earlier proved that this process is WSS with µX = 0 and

RX(τ) = (A²/2) cos ω0τ

For any particular realization x(t) = A cos(ω0 t + φ1),

⟨µx⟩_T = (1/2T) ∫_{−T}^{T} A cos(ω0 t + φ1) dt = (A cos φ1 sin ω0T)/(ω0 T)

and

⟨Rx(τ)⟩_T = (1/2T) ∫_{−T}^{T} A cos(ω0 t + φ1) A cos(ω0 (t + τ) + φ1) dt
          = (A²/4T) ∫_{−T}^{T} [cos ω0τ + cos(ω0 (2t + τ) + 2φ1)] dt
          = (A² cos ω0τ)/2 + (A²/(4ω0T)) cos(ω0τ + 2φ1) sin 2ω0T

We see that as T → ∞, ⟨µx⟩_T → 0 = µX and ⟨Rx(τ)⟩_T → (A² cos ω0τ)/2 = RX(τ).

For each realization, both the time-averaged mean and the time-averaged autocorrelation function converge to the corresponding ensemble averages. Thus the random-phase sinusoid is ergodic in both mean and autocorrelation.
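This convergence can be observed numerically from a single realization (the constants, the random seed and the averaging span below are illustrative assumptions):

import numpy as np

A, w0, tau = 1.0, 2 * np.pi, 0.25
rng = np.random.default_rng(3)
phi = rng.uniform(0.0, 2 * np.pi)              # one realization of the random phase
T = 500.0
t = np.linspace(-T, T, 2000001)
x = A * np.cos(w0 * t + phi)
x_shift = A * np.cos(w0 * (t + tau) + phi)
print(np.mean(x))                                          # time-averaged mean, close to mu_X = 0
print(np.mean(x * x_shift), A**2 * np.cos(w0 * tau) / 2)   # time-averaged autocorrelation vs R_X(tau)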

Remark
A random process {X(t)} is ergodic if its time averages converge in the M.S. sense to the corresponding ensemble averages. This is a stronger requirement than stationarity: the ensemble averages of all orders of such a process are independent of time, which implies that an ergodic process is necessarily stationary in the strict sense. The converse is not true: there are stationary random processes that are not ergodic.

The following figure shows a hierarchical classification of random processes.

Fig. Hierarchy of random processes: ergodic processes ⊂ strict-sense stationary processes ⊂ WSS processes ⊂ random processes.

Example
Suppose X(t) = C where C ~ U[0, a]. {X(t)} is a family of constant (straight-line) realizations, as illustrated in the figure below.

Fig. Realizations of X(t) = C: horizontal lines at levels such as 0, a/4, a/2, 3a/4 and a.

Here µX = a/2, while

⟨µX⟩_T = lim_{T→∞} (1/2T) ∫_{−T}^{T} C dt = C

is a different constant for different realizations. Hence {X(t)} is not mean ergodic.


Spectral Representation of a Wide-sense Stationary Random Process

Recall that the Fourier transform (FT) of a real signal g(t) is given by

G(ω) = FT(g(t)) = ∫_{−∞}^{∞} g(t) e^{−jωt} dt

where e^{−jωt} = cos ωt − j sin ωt is the complex exponential.
The Fourier transform G(ω) exists if g(t) satisfies the following Dirichlet conditions:
1) g(t) is absolutely integrable, that is, ∫_{−∞}^{∞} |g(t)| dt < ∞
2) g(t) has only a finite number of discontinuities in any finite interval
3) g(t) has only a finite number of maxima and minima within any finite interval.

The signal g(t) can be obtained from G(ω) by the inverse Fourier transform (IFT) as follows:

g(t) = IFT(G(ω)) = (1/2π) ∫_{−∞}^{∞} G(ω) e^{jωt} dω

The existence of the inverse Fourier transform implies that we can represent a function g(t) as a superposition of a continuum of complex sinusoids. The Fourier transform G(ω) is the strength of the sinusoids of frequency ω present in the signal. If g(t) is a voltage signal measured in volts, G(ω) has the unit of volt-seconds (volts per unit of angular frequency). The function G(ω) is also called the spectrum of g(t).

We can also define the Fourier transform in terms of the frequency variable f = ω/2π. In this case, the Fourier transform and the inverse Fourier transform are

G(f) = ∫_{−∞}^{∞} g(t) e^{−j2πft} dt
and
g(t) = ∫_{−∞}^{∞} G(f) e^{j2πft} df

The Fourier transform is a linear transform and has many interesting properties. In particular, the energy of the signal g(t) is related to G(ω) by Parseval's theorem:

∫_{−∞}^{∞} g²(t) dt = (1/2π) ∫_{−∞}^{∞} |G(ω)|² dω

How do we obtain a frequency-domain representation of a random process, particularly a WSS process? The answer is the spectral representation of a WSS random process, discovered independently by Wiener (1930) and Khinchin (1934); Einstein (1914) had also used the concept.
Difficulty in Fourier Representation of a Random Process
We cannot define the Fourier transform of a WSS process X(t) by the mean-square integral

FT(X(t)) = ∫_{−∞}^{∞} X(t) e^{−jωt} dt

The existence of the above integral would imply the existence of the Fourier transform of every realization of X(t). But the very notion of stationarity demands that the realizations do not decay with time, so the first Dirichlet condition is violated. This difficulty is avoided by a frequency-domain representation of X(t) in terms of the power spectral density (PSD). Recall that the power of a WSS process X(t) is the constant EX²(t). The PSD describes how this power is distributed over frequency.
Definition of Power Spectral Density of a WSS Process
Let us define the truncated process

X_T(t) = X(t)    for −T < t < T
       = 0       otherwise
       = X(t) rect(t/2T)

where rect(t/2T) is the unity-amplitude rectangular pulse of width 2T centred at the origin. As T → ∞, X_T(t) represents the random process X(t). Define the mean-square integral

FTX_T(ω) = ∫_{−T}^{T} X_T(t) e^{−jωt} dt

Applying Parseval's theorem, we find the energy of the signal

∫_{−T}^{T} X_T²(t) dt = (1/2π) ∫_{−∞}^{∞} |FTX_T(ω)|² dω

Therefore, the power associated with X_T(t) is

(1/2T) ∫_{−T}^{T} X_T²(t) dt = (1/2π) ∫_{−∞}^{∞} [ |FTX_T(ω)|² / 2T ] dω

and the average power is given by

E (1/2T) ∫_{−T}^{T} X_T²(t) dt = (1/2π) ∫_{−∞}^{∞} [ E|FTX_T(ω)|² / 2T ] dω

where E|FTX_T(ω)|²/2T is the contribution to the average power at frequency ω and represents the power spectral density of X_T(t). As T → ∞, the left-hand side of the above expression approaches the average power of X(t). Therefore, the PSD SX(ω) of the process X(t) is defined in the limiting sense by

SX(ω) = lim_{T→∞} E|FTX_T(ω)|² / 2T
Relation between the autocorrelation function and PSD: Wiener-Khinchin-Einstein
theorem
We have
E |FTX_T(ω)|² / 2T = E FTX_T(ω) FTX_T*(ω) / 2T
= (1/2T) ∫_{−T}^{T} ∫_{−T}^{T} EX_T(t1) X_T(t2) e^{−jωt1} e^{+jωt2} dt1 dt2
= (1/2T) ∫_{−T}^{T} ∫_{−T}^{T} RX(t1 − t2) e^{−jω(t1 − t2)} dt1 dt2

Fig. The square region of integration bounded by t1 = ±T and t2 = ±T, showing the strip between the lines t1 − t2 = τ and t1 − t2 = τ + dτ (the extreme lines are t1 − t2 = ±2T).
Note that the above integral is performed over the square region bounded by t1 = ±T and t2 = ±T. Substitute t1 − t2 = τ, so that t2 = t1 − τ describes a family of straight lines parallel to t1 − t2 = 0. The differential area between the lines t1 − t2 = τ and t1 − t2 = τ + dτ is the shaded strip of area (2T − |τ|)dτ. The double integral is now replaced by a single integral in τ. Therefore,

E FTX_T(ω) FTX_T*(ω) / 2T = (1/2T) ∫_{−2T}^{2T} RX(τ) e^{−jωτ} (2T − |τ|) dτ
                          = ∫_{−2T}^{2T} RX(τ) e^{−jωτ} (1 − |τ|/2T) dτ

If RX(τ) is absolutely integrable, then the right-hand integral converges to ∫_{−∞}^{∞} RX(τ) e^{−jωτ} dτ as T → ∞, so that

lim_{T→∞} E|FTX_T(ω)|² / 2T = ∫_{−∞}^{∞} RX(τ) e^{−jωτ} dτ

As we have noted earlier, SX(ω) = lim_{T→∞} E|FTX_T(ω)|²/2T is the contribution to the average power at frequency ω and is called the power spectral density of X(t). Thus

SX(ω) = ∫_{−∞}^{∞} RX(τ) e^{−jωτ} dτ

and, using the inverse Fourier transform,

RX(τ) = (1/2π) ∫_{−∞}^{∞} SX(ω) e^{jωτ} dω

Example 1 The autocorrelation function of a WSS process X(t) is given by

RX(τ) = a² e^{−b|τ|},    b > 0

Find the power spectral density of the process.

SX(ω) = ∫_{−∞}^{∞} RX(τ) e^{−jωτ} dτ
      = ∫_{−∞}^{∞} a² e^{−b|τ|} e^{−jωτ} dτ
      = ∫_{−∞}^{0} a² e^{bτ} e^{−jωτ} dτ + ∫_{0}^{∞} a² e^{−bτ} e^{−jωτ} dτ
      = a²/(b − jω) + a²/(b + jω)
      = 2a²b/(b² + ω²)

The autocorrelation function and the PSD are shown in Fig.
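The transform pair of Example 1 can be checked by direct numerical integration (a, b, the lag grid and the test frequencies below are assumptions):

import numpy as np

a, b = 1.0, 2.0
tau = np.linspace(-40.0, 40.0, 400001)
dtau = tau[1] - tau[0]
R = a**2 * np.exp(-b * np.abs(tau))
for w in (0.0, 1.0, 3.0):
    S_num = np.sum(R * np.cos(w * tau)) * dtau   # imaginary part vanishes since R is even
    print(w, S_num, 2 * a**2 * b / (b**2 + w**2))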
Example 2 Suppose X(t) = A + B sin(ωc t + Φ), where A is a constant bias and Φ ~ U[0, 2π]. Find RX(τ) and SX(ω).

RX(τ) = EX(t + τ)X(t)
      = E(A + B sin(ωc(t + τ) + Φ))(A + B sin(ωc t + Φ))
      = A² + (B²/2) cos ωcτ

∴ SX(ω) = 2πA² δ(ω) + (πB²/2) ( δ(ω + ωc) + δ(ω − ωc) )

where δ(ω) is the Dirac delta function.

Fig. SX(ω): an impulse at ω = 0 of weight 2πA² and impulses at ω = ±ωc of weight πB²/2 each.

Example 3 PSD of the amplitude-modulated random-phase sinusoid

X(t) = M(t) cos(ωc t + Φ),    Φ ~ U(0, 2π)

where M(t) is a WSS process independent of Φ.

RX(τ) = E M(t + τ) cos(ωc(t + τ) + Φ) M(t) cos(ωc t + Φ)
      = E M(t + τ)M(t) E cos(ωc(t + τ) + Φ) cos(ωc t + Φ)    (using the independence of M(t) and the sinusoid)
      = (1/2) RM(τ) cos ωcτ

∴ SX(ω) = (1/4) ( SM(ω + ωc) + SM(ω − ωc) )

where SM(ω) is the PSD of M(t).

Fig. The baseband PSD SM(ω) and the modulated PSD SX(ω), consisting of copies of SM(ω)/4 centred at ±ωc.

Example 4 The PSD of a (band-pass) noise process is given by

SN(ω) = N0/2    for |ω ± ωc| ≤ W/2
      = 0       otherwise

where W denotes the bandwidth of each band. Find the autocorrelation of the process.

RN(τ) = (1/2π) ∫_{−∞}^{∞} SN(ω) e^{jωτ} dω
      = (1/2π) × 2 × ∫_{ωc − W/2}^{ωc + W/2} (N0/2) cos ωτ dω
      = (N0/2πτ) [ sin((ωc + W/2)τ) − sin((ωc − W/2)τ) ]
      = (N0 W/2π) · [sin(Wτ/2)/(Wτ/2)] · cos ωcτ
Properties of the PSD
SX(ω), being the Fourier transform of RX(τ), shares the properties of the Fourier transform. Here we discuss the important properties of SX(ω).

The average power of a random process X(t) is

EX²(t) = RX(0) = (1/2π) ∫_{−∞}^{∞} SX(ω) dω

The average power in the frequency band [ω1, ω2] is

2 × (1/2π) ∫_{ω1}^{ω2} SX(ω) dω = (1/π) ∫_{ω1}^{ω2} SX(ω) dω

(the factor 2 accounts for the equal contribution of the band [−ω2, −ω1]).

• If {X(t)} is real, RX(τ) is a real and even function of τ. Therefore,

SX(ω) = ∫_{−∞}^{∞} RX(τ) e^{−jωτ} dτ
      = ∫_{−∞}^{∞} RX(τ) (cos ωτ − j sin ωτ) dτ
      = ∫_{−∞}^{∞} RX(τ) cos ωτ dτ
      = 2 ∫_{0}^{∞} RX(τ) cos ωτ dτ

Thus SX(ω) is a real and even function of ω.


• From the definition, SX(ω) = lim_{T→∞} E|FTX_T(ω)|²/2T is always non-negative. Thus SX(ω) ≥ 0.

• If { X (t )} has a periodic component, R X (τ ) is periodic and so S X (ω ) will have

impulses.
Remark
1) The function SX(ω) is the PSD of a WSS process {X(t)} if and only if SX(ω) is a non-negative, real and even function of ω and

∫_{−∞}^{∞} SX(ω) dω < ∞
2) The above condition on SX (ω) also ensures that the corresponding

autocorrelation function R X (τ ) is non-negative definite. Thus the

non-negative definite property of an autocorrelation function can be


tested through its power spectrum.
3) Recall that a periodic function has a Fourier series expansion. If {X(t)} is M.S. periodic, we can have an equivalent Fourier series expansion of {X(t)}.
Cross power spectral density
Consider a random process Z(t) which is the sum of two real jointly WSS random processes X(t) and Y(t). As we have seen earlier,

RZ (τ ) = RX (τ ) + RY (τ ) + RXY (τ ) + RYX (τ )
If we take the Fourier transform of both sides,
S Z (ω ) = S X (ω ) + SY (ω ) + FT ( RXY (τ )) + FT ( RYX (τ ))
where FT (.) stands for the Fourier transform.
Thus we see that S Z (ω ) includes contribution from the Fourier transform of the cross-

correlation functions RXY (τ ) and RYX (τ ). These Fourier transforms represent cross power

spectral densities.
Definition of Cross Power Spectral Density
Given two real jointly WSS random processes X(t) and Y(t), the cross power spectral
density (CPSD) S XY (ω ) is defined as

FTX T∗ (ω ) FTYT (ω )
S XY (ω ) = lim E
T →∞ 2T
where FTX_T(ω) and FTY_T(ω) are the Fourier transforms of the truncated processes X_T(t) = X(t) rect(t/2T) and Y_T(t) = Y(t) rect(t/2T) respectively, and * denotes complex conjugation.
We can similarly define SYX (ω ) by

FTYT∗ (ω ) FTX T (ω )
SYX (ω ) = lim E
T →∞ 2T
Proceeding in the same way as the derivation of the Wiener-Khinchin-Einstein theorem
for the WSS process, it can be shown that

S XY (ω ) = ∫ RXY (τ )e − jωτ dτ
−∞

and

SYX (ω ) = ∫ RYX (τ )e − jωτ dτ
−∞
The cross-correlation function and the cross power spectral density thus form a Fourier transform pair, and we can write

RXY(τ) = (1/2π) ∫_{−∞}^{∞} SXY(ω) e^{jωτ} dω

and

RYX(τ) = (1/2π) ∫_{−∞}^{∞} SYX(ω) e^{jωτ} dω

Properties of the CPSD


The CPSD is a complex function of the frequency ω. Some properties of the CPSD of
two jointly WSS processes X(t) and Y(t) are listed below:

(1) SXY(ω) = S*YX(ω)

Note that RXY(τ) = RYX(−τ)

∴ SXY(ω) = ∫_{−∞}^{∞} RXY(τ) e^{−jωτ} dτ
         = ∫_{−∞}^{∞} RYX(−τ) e^{−jωτ} dτ
         = ∫_{−∞}^{∞} RYX(τ) e^{jωτ} dτ
         = S*YX(ω)

(2) Re(SXY(ω)) is an even function of ω and Im(SXY(ω)) is an odd function of ω.

We have

SXY(ω) = ∫_{−∞}^{∞} RXY(τ) (cos ωτ − j sin ωτ) dτ
       = ∫_{−∞}^{∞} RXY(τ) cos ωτ dτ − j ∫_{−∞}^{∞} RXY(τ) sin ωτ dτ
       = Re(SXY(ω)) + j Im(SXY(ω))

where

Re(SXY(ω)) = ∫_{−∞}^{∞} RXY(τ) cos ωτ dτ is an even function of ω, and
Im(SXY(ω)) = −∫_{−∞}^{∞} RXY(τ) sin ωτ dτ is an odd function of ω.

(3) If X(t) and Y(t) are uncorrelated and have constant means, then

SXY(ω) = SYX(ω) = 2π µX µY δ(ω)

Observe that

RXY(τ) = EX(t + τ)Y(t)
       = EX(t + τ) EY(t)
       = µX µY = µY µX = RYX(τ)

∴ SXY(ω) = SYX(ω) = 2π µX µY δ(ω)

(4) If X(t) and Y(t) are orthogonal, then

SXY(ω) = SYX(ω) = 0

Indeed, if X(t) and Y(t) are orthogonal,

RXY(τ) = EX(t + τ)Y(t) = 0 = RYX(τ)

∴ SXY(ω) = SYX(ω) = 0

(5) The cross power PXY between X(t) and Y(t) is defined by

PXY = lim_{T→∞} (1/2T) E ∫_{−T}^{T} X(t)Y(t) dt

Applying Parseval's theorem, we get

PXY = lim_{T→∞} (1/2T) E ∫_{−T}^{T} X(t)Y(t) dt
    = lim_{T→∞} (1/2T) E ∫_{−∞}^{∞} X_T(t)Y_T(t) dt
    = lim_{T→∞} (1/2T) (1/2π) E ∫_{−∞}^{∞} FTX_T*(ω) FTY_T(ω) dω
    = (1/2π) ∫_{−∞}^{∞} lim_{T→∞} [ E FTX_T*(ω) FTY_T(ω) / 2T ] dω
    = (1/2π) ∫_{−∞}^{∞} SXY(ω) dω

∴ PXY = (1/2π) ∫_{−∞}^{∞} SXY(ω) dω

Similarly,

PYX = (1/2π) ∫_{−∞}^{∞} SYX(ω) dω = (1/2π) ∫_{−∞}^{∞} S*XY(ω) dω = P*XY
Example Consider the random process Z(t) = X(t) + Y(t) discussed at the beginning of the lecture, where X(t) and Y(t) are real jointly WSS random processes. We have

RZ(τ) = RX(τ) + RY(τ) + RXY(τ) + RYX(τ)

Taking the Fourier transform of both sides,

SZ(ω) = SX(ω) + SY(ω) + SXY(ω) + SYX(ω)

∴ (1/2π) ∫ SZ(ω) dω = (1/2π) ∫ SX(ω) dω + (1/2π) ∫ SY(ω) dω + (1/2π) ∫ SXY(ω) dω + (1/2π) ∫ SYX(ω) dω

Therefore,

PZ = PX + PY + PXY + PYX

Remark
• PXY + PYX is the additional power contributed by the interaction of X(t) and Y(t) to the resulting power of X(t) + Y(t).
• If X(t) and Y(t) are orthogonal, then

SZ(ω) = SX(ω) + SY(ω) + 0 + 0 = SX(ω) + SY(ω)

and consequently PZ = PX + PY. Thus, for two jointly WSS orthogonal processes, the power of the sum of the processes is equal to the sum of the respective powers.

Power spectral density of a discrete-time WSS random process


Suppose g[n] is a discrete-time real signal, obtained, say, by sampling a continuous-time signal g(t) at a uniform interval T such that g[n] = g(nT), n = 0, ±1, ±2, ...
The discrete-time Fourier transform (DTFT) of the signal g[n] is defined by

G(ω) = Σ_{n=−∞}^{∞} g[n] e^{−jωn}

G(ω) exists if {g[n]} is absolutely summable, that is, Σ_{n=−∞}^{∞} |g[n]| < ∞. The signal g[n] is obtained from G(ω) by the inverse discrete-time Fourier transform

g[n] = (1/2π) ∫_{−π}^{π} G(ω) e^{jωn} dω

Following observations about the DTFT are important:


• ω is a frequency variable representing the frequency of a discrete sinusoid.
Thus the signal g[n] = A cos(ω0 n) has a frequency of ω0 radian/samples.

• G (ω ) is always periodic in ω with a period of 2π . Thus G (ω ) is uniquely


defined in the interval −π ≤ ω ≤ π .
• Suppose {g[n]} is obtained by sampling a continuous-time signal g_a(t) at a uniform interval T such that

  g[n] = g_a(nT), n = 0, ±1, ±2, ...

  The frequency ω of the discrete-time signal is related to the frequency Ω of the continuous-time signal by the relation Ω = ω/T, where T is the uniform sampling interval. The symbol Ω for the frequency of a continuous-time signal is used in the signal-processing literature just to distinguish it from the corresponding frequency of the discrete-time signal. This is illustrated in the Fig. below.

• We can define the Z − transform of the discrete-time signal by the relation



G ( z ) = ∑ g[n]z − n
n =−∞

where z is a complex variable. G (ω ) is related to G ( z ) by


G (ω ) = G ( z ) z = e jω

Power spectrum of a discrete-time real WSS process { X [n]}


Consider a discrete-time real WSS process {X[n]}. The very notion of stationarity poses a problem in the frequency-domain representation of {X[n]} through the discrete-time Fourier transform. The difficulty is avoided, as in the case of the continuous-time WSS process, by defining the truncated process

X_N[n] = X[n]    for |n| ≤ N
       = 0       otherwise

The power spectral density SX(ω) of the process {X[n]} is defined as

SX(ω) = lim_{N→∞} (1/(2N + 1)) E |DTFTX_N(ω)|²

where

DTFTX_N(ω) = Σ_{n=−∞}^{∞} X_N[n] e^{−jωn} = Σ_{n=−N}^{N} X[n] e^{−jωn}

Note that the average power of { X [n]} is RX [0] = E X 2 [n ] and the power spectral

density S X (ω ) indicates the contribution to the average power of the sinusoidal

component of frequency ω.

Wiener-Einstein-Khinchin theorem
The Wiener-Einstein-Khinchin theorem is also valid for discrete-time random processes.
The power spectral density S X (ω ) of the WSS process { X [n]} is the discrete-time
Fourier transform of autocorrelation sequence.

SX(ω) = Σ_{m=−∞}^{∞} RX[m] e^{−jωm},    −π ≤ ω ≤ π

RX[m] is related to SX(ω) by the inverse discrete-time Fourier transform:

RX[m] = (1/2π) ∫_{−π}^{π} SX(ω) e^{jωm} dω

Thus RX[m] and SX(ω) form a discrete-time Fourier transform pair. A generalized PSD can be defined in terms of the z-transform as follows:

SX(z) = Σ_{m=−∞}^{∞} RX[m] z^{−m}

Clearly, SX(ω) = SX(z) |_{z = e^{jω}}
Example Suppose RX[m] = 2^{−|m|}, m = 0, ±1, ±2, ±3, .... Then

SX(ω) = Σ_{m=−∞}^{∞} RX[m] e^{−jωm}
      = 1 + Σ_{m≠0} (1/2)^{|m|} e^{−jωm}
      = 3/(5 − 4 cos ω)

The plot of the autocorrelation sequence and the power spectral density is shown in Fig.
below.
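A direct numerical check of this DTFT sum (the truncation length and the test frequencies are assumptions; terms beyond |m| = 60 are negligible):

import numpy as np

m = np.arange(-60, 61)
R = 0.5 ** np.abs(m)                      # R_X[m] = 2^{-|m|}
for w in (0.0, 1.0, np.pi / 2, np.pi):
    S_num = np.sum(R * np.exp(-1j * w * m)).real
    print(w, S_num, 3 / (5 - 4 * np.cos(w)))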

Properties of the PSD of a discrete-time WSS process
• For the real discrete-time process { X [ n]}, the autocorrelation function RX [m] is
real and even. Therefore, S X (ω ) is real and even.
• S X (ω ) ≥ 0.
• The average power of {X[n]} is given by

  EX²[n] = RX[0] = (1/2π) ∫_{−π}^{π} SX(ω) dω

  Similarly, the average power in the frequency band [ω1, ω2] is given by

  (1/π) ∫_{ω1}^{ω2} SX(ω) dω

  (counting the equal contribution of the band [−ω2, −ω1]).

• S X (ω ) is periodic in ω with a period of 2π .

Interpretation of the power spectrum of a discrete-time WSS process


Assume that the discrete-time WSS process {X[n]} is obtained by sampling a continuous-time random process {X_a(t)} at a uniform interval, that is,

X[n] = X_a(nT), n = 0, ±1, ±2, ...

The autocorrelation function RX [m] is defined by

R X [ m] = E X [ n + m] X [ n ]
= E X a (n T + mT ) X a (n T )
= RX a ( mT )

∴ RX [ m] = RX a ( mT ) m = 0, ±1, ±2,...

Thus the sequence { RX [m] } is obtained by sampling the autocorrelation function


RX a ( τ ) at a uniform interval T .

The frequency ω of the discrete-time WSS process is related to the frequency Ω of the continuous-time process by the relation Ω = ω/T.
White noise process
A white noise process {W(t)} is defined by

SW(ω) = N0/2,    −∞ < ω < ∞

where N0 is a real constant called the intensity of the white noise. The corresponding autocorrelation function is given by

RW(τ) = (N0/2) δ(τ)

where δ(τ) is the Dirac delta. The average power of white noise is

P_avg = EW²(t) = (1/2π) ∫_{−∞}^{∞} (N0/2) dω → ∞

The autocorrelation function and the PSD of a white noise process are shown in Fig. below.

Fig. The flat PSD SW(ω) = N0/2 and the impulsive autocorrelation RW(τ) = (N0/2)δ(τ) of a white noise process.

Remarks

• The term white noise is analogous to white light which contains all visible light
frequencies.
• We generally consider zero-mean white noise process.
• A white noise process is unpredictable, as the noise samples at different instants of time are uncorrelated:

  CW(ti, tj) = 0 for ti ≠ tj.

• White noise is a mathematical abstraction; it cannot be realized physically since it has infinite average power.

• If the system bandwidth (BW) is sufficiently narrower than the noise BW and the noise PSD is flat over the system band, we can model the noise as a white noise process. Thermal noise, the noise generated in resistors due to the random motion of electrons, is well modelled as white Gaussian noise, since it has a very flat PSD over a very wide band (up to the GHz range).
• A white noise process can have any probability density function. Particularly, if the
white noise process {W (t )} is a Gaussian random process, then {W (t )} is called a

white Gaussian random process.


• A white noise process is called strict-sense white noise process if the noise samples at
distinct instants of time are independent. A white Gaussian noise process is a strict-
sense white noise process. Such a process represents a ‘purely’ random process,
because its samples at arbitrarily close intervals also will be independent.
Example A random-phase sinusoid corrupted by white noise
Suppose X(t) = B sin(ωc t + Φ) + W(t), where B and ωc are constants, Φ ~ U[0, 2π], and {W(t)} is a zero-mean white Gaussian noise process with PSD N0/2, independent of Φ. Find RX(τ) and SX(ω).

RX(τ) = EX(t + τ)X(t)
      = E( B sin(ωc(t + τ) + Φ) + W(t + τ) )( B sin(ωc t + Φ) + W(t) )
      = (B²/2) cos ωcτ + RW(τ)

∴ SX(ω) = (πB²/2) ( δ(ω + ωc) + δ(ω − ωc) ) + N0/2

where δ(ω) is the Dirac delta function.

Band-limited white noise


A noise process which has a non-zero constant PSD over a finite frequency band and zero PSD elsewhere is called band-limited white noise. Thus the WSS process {X(t)} is band-limited white noise if

SX(ω) = N0/2,    −B < ω < B

For example, thermal noise, which has a constant PSD up to very high frequencies, is better modelled by a band-limited white noise process. The corresponding autocorrelation function RX(τ) is given by

RX(τ) = (N0 B/2π) (sin Bτ)/(Bτ)
The plot of S X (ω ) and RX (τ ) of a band-limited white noise process is shown in Fig.

below. Further assume that { X (t )} is a zero-mean process.

Fig. SX(ω) = N0/2 for −B ≤ ω ≤ B (zero elsewhere).

Observe that

• The average power of the process is EX²(t) = RX(0) = N0 B/2π.

• RX(τ) = 0 for τ = ±π/B, ±2π/B, ±3π/B, .... This means that X(t) and X(t + nπ/B), where n is a non-zero integer, are uncorrelated. Thus we can get uncorrelated samples by sampling a band-limited white noise process at a uniform interval of π/B.
• A band-limited white noise process may also be a band-pass process with PSD as shown in the Fig. and given by

  SX(ω) = N0/2    for |ω − ω0| < B/2 or |ω + ω0| < B/2
        = 0       otherwise

  The corresponding autocorrelation function is given by

  RX(τ) = (N0 B/2π) [sin(Bτ/2)/(Bτ/2)] cos ω0τ

Fig. Band-pass PSD SX(ω) = N0/2 over the bands ω0 − B/2 ≤ |ω| ≤ ω0 + B/2.
Coloured Noise
A noise process which is not white is called coloured noise. Thus the noise process {X(t)} with RX(τ) = a² e^{−b|τ|}, b > 0, and PSD SX(ω) = 2a²b/(b² + ω²) is an example of coloured noise.

White Noise Sequence


A random sequence {W[n]} is called a white noise sequence if

SW(ω) = N/2,    −π ≤ ω ≤ π

Therefore

RW[m] = (N/2) δ[m]

where δ[m] is the unit impulse sequence. The autocorrelation function and the PSD of a white noise sequence are shown in Fig.

Fig. RW[m]: a single impulse of height N/2 at m = 0; SW(ω) = N/2 for −π ≤ ω ≤ π.

A realization of a white noise sequence is shown in Fig. below.

Remark
• The average power of the white noise sequence is EW²[n] = RW[0] = (1/2π) × (N/2) × 2π = N/2. The average power of the white noise sequence is finite and uniformly distributed over all frequencies.
• If the white noise sequence {W [n]} is a Gaussian sequence, then {W [n]} is
called a white Gaussian noise (WGN) sequence.
• An i.i.d. random sequence is always white. Such a sequence may be called a strict-sense white noise sequence. A WGN sequence is a strict-sense stationary white noise sequence.
• The white noise sequence model looks artificial, but it plays a key role in random signal modelling. It plays a role similar to that of the impulse function in the modelling of deterministic signals. A class of WSS processes called regular processes can be considered as the output of a linear system with white noise as input, as illustrated in Fig.
• The notion of the sequence of i.i.d. random variables is also important in
statistical inference.

White noise process → [Linear system] → Regular WSS random process
Response of Linear time-invariant system to WSS input:

In many applications, physical systems are modeled as linear time-invariant (LTI) systems. The dynamic behavior of an LTI system with deterministic inputs is described by linear differential equations. We are familiar with time-domain and transform-domain (such as Laplace transform and Fourier transform) techniques for solving these equations. In this lecture, we develop techniques to analyse the response of an LTI system to a WSS random process.

The purpose of this study is twofold:

(1) Analysis of the response of a system.

(2) Finding an LTI system that can optimally estimate an unobserved random process from an observed process. The observed random process is statistically related to the unobserved random process. For example, we may have to find an LTI system (also called a filter) to estimate the signal from noisy observations.

Basics of Linear Time Invariant Systems:

A system is modeled by a transformation T that maps an input signal x(t ) to an output


signal y(t). We can thus write,

y(t ) = T [ x(t )]

Linear system
The system is called linear if superposition applies: the weighted sum of inputs results in
the weighted sum of the corresponding outputs. Thus for a linear system

T ⎡⎣ a1 x1 ( t ) + a2 x2 ( t ) ⎤⎦ = a1T ⎡⎣ x1 ( t ) ⎤⎦ + a2T ⎡⎣ x2 ( t ) ⎤⎦

Example: Consider the output of a linear differentiator, given by

y(t) = d x(t)/dt

Then,

d/dt ( a1 x1(t) + a2 x2(t) ) = a1 d x1(t)/dt + a2 d x2(t)/dt

Hence the differentiator is a linear system.


Linear time-invariant system

Consider a linear system with y(t) = T[x(t)]. The system is called time-invariant if

T[x(t − t0)] = y(t − t0) for every t0

It is easy to check that the differentiator in the above example is a linear time-invariant system.

Causal system
The system is called causal if the output of the system at t = t0 depends only on the
present and past values of input. Thus for a causal system

y ( t0 ) = T ( x(t ), t ≤ t0 )
Response of a linear time-invariant system to deterministic input
A linear system can be characterised by its impulse response h(t) = T[δ(t)], where δ(t) is the Dirac delta function.

δ(t) → [LTI system] → h(t)

Recall that any function x(t) can be represented in terms of the Dirac delta function as follows:

x(t) = ∫_{−∞}^{∞} x(s) δ(t − s) ds

If x(t) is input to the linear system y(t) = T[x(t)], then

y(t) = T ∫_{−∞}^{∞} x(s) δ(t − s) ds
     = ∫_{−∞}^{∞} x(s) T[δ(t − s)] ds    [using the linearity property]
     = ∫_{−∞}^{∞} x(s) h(t, s) ds

where h(t, s) = T[δ(t − s)] is the response at time t due to the shifted impulse δ(t − s).

If the system is time invariant,


h ( t, s ) = h (t − s )
Therefore, for a linear time-invariant system,

y(t) = ∫_{−∞}^{∞} x(s) h(t − s) ds = x(t) * h(t)

where * denotes the convolution operation.

Thus for a LTI System,

y (t ) = x (t ) * h(t ) = h(t ) * x(t )


x(t ) LTI System y (t )
h(t )

X (ω ) LTI System Y (ω )
H (ω )

Taking the Fourier transform, we get

Y(ω) = H(ω) X(ω)

where H(ω) = FT(h(t)) = ∫_{−∞}^{∞} h(t) e^{−jωt} dt is the frequency response of the system.

Response of an LTI System to WSS input

Consider an LTI system with impulse response h(t). Suppose { X (t )} is a WSS process
input to the system. The output {Y (t )} of the system is given by
Y(t) = ∫_{−∞}^{∞} h(s) X(t − s) ds = ∫_{−∞}^{∞} h(t − s) X(s) ds
where we have assumed that the integrals exist in the mean square (m.s.) sense.

Mean and autocorrelation of the output process {Y ( t )}

The mean of the output process is given by,


EY(t) = E ∫_{−∞}^{∞} h(s) X(t − s) ds
      = ∫_{−∞}^{∞} h(s) EX(t − s) ds
      = ∫_{−∞}^{∞} h(s) µX ds
      = µX ∫_{−∞}^{∞} h(s) ds
      = µX H(0)

where H(0) is the frequency response H(ω) at zero frequency (ω = 0), given by

H(ω)|_{ω=0} = ∫_{−∞}^{∞} h(t) e^{−jωt} dt |_{ω=0} = ∫_{−∞}^{∞} h(t) dt

Therefore, the mean of the output process {Y(t)} is a constant.

The cross-correlation of the input X(t) and the output Y(t) is given by

EX(t + τ)Y(t) = E X(t + τ) ∫_{−∞}^{∞} h(s) X(t − s) ds
             = ∫_{−∞}^{∞} h(s) E X(t + τ) X(t − s) ds
             = ∫_{−∞}^{∞} h(s) RX(τ + s) ds
             = ∫_{−∞}^{∞} h(−u) RX(τ − u) du    [putting s = −u]
             = h(−τ) * RX(τ)

∴ RXY(τ) = h(−τ) * RX(τ)
Also RYX(τ) = RXY(−τ) = h(τ) * RX(−τ) = h(τ) * RX(τ)
Thus we see that RXY (τ ) is a function of lag τ only. Therefore, X ( t ) and Y ( t ) are
jointly wide-sense stationary.

The autocorrelation function of the output process Y(t) is given by

EY(t + τ)Y(t) = E [ ∫_{−∞}^{∞} h(s) X(t + τ − s) ds ] Y(t)
             = ∫_{−∞}^{∞} h(s) E X(t + τ − s) Y(t) ds
             = ∫_{−∞}^{∞} h(s) RXY(τ − s) ds
             = h(τ) * RXY(τ) = h(τ) * h(−τ) * RX(τ)

Thus the autocorrelation of the output process {Y(t)} depends only on the time lag τ, i.e., EY(t)Y(t + τ) = RY(τ). Thus

RY(τ) = RX(τ) * h(τ) * h(−τ)

The above analysis indicates that for an LTI system with WSS input
(1) the output is WSS and
(2) the input and output are jointly WSS.

The average power of the output process {Y(t)} is given by

PY = RY(0) = [ RX(τ) * h(τ) * h(−τ) ] |_{τ=0}
Power spectrum of the output process
Using the property of Fourier transform, we get the power spectral density of the output
process given by

SY(ω) = SX(ω) H(ω) H*(ω) = SX(ω) |H(ω)|²

Also note that


R XY (τ ) = h ( −τ ) * R X (τ )
and RYX (τ ) = h (τ ) * RX (τ )
Taking the Fourier transform of RXY (τ ) we get the cross power spectral density S XY (ω )
given by

S XY (ω ) = H * (ω ) S X (ω )
and
SYX (ω ) = H (ω ) S X (ω )

SX(ω) → [H*(ω)] → SXY(ω) → [H(ω)] → SY(ω)
RX(τ) → [h(−τ)] → RXY(τ) → [h(τ)] → RY(τ)
Example:

(a) White noise process X(t) with power spectral density N0/2 is input to an ideal low-pass filter of bandwidth B. Find the PSD and autocorrelation function of the output process.

Fig. The ideal low-pass frequency response: |H(ω)| = 1 for −B ≤ ω ≤ B and 0 otherwise.

The input process X(t) is white noise with power spectral density SX(ω) = N0/2.

The output power spectral density SY(ω) is given by

SY(ω) = |H(ω)|² SX(ω) = 1 × N0/2 = N0/2,    −B ≤ ω ≤ B
      = 0 otherwise

∴ RY(τ) = inverse Fourier transform of SY(ω)
        = (1/2π) ∫_{−B}^{B} (N0/2) e^{jωτ} dω
        = N0 sin Bτ / (2πτ)

The output PSD SY(ω) and the output autocorrelation function RY(τ) are illustrated in Fig. below.

Fig. SY(ω) = N0/2 for −B ≤ ω ≤ B; RY(τ) = N0 sin Bτ/(2πτ).

Example 2:

A random voltage modeled by a white noise process X ( t ) with power spectral density
N0
is input to a RC network shown in the fig.
2

Find (a) output PSD SY (ω )


(b) output auto correlation function RY (τ )
(c) average output power EY 2 ( t ) R
The frequency response of the system is given by

1
jCω
H (ϖ ) =
1
R+
jCω
1
=
jRCω + 1

Therefore,
(a)
SY (ω ) = H (ω ) S X (ω )
2

1
= S X (ω )
R C ω 2 +1
2 2

1 N0
= 2 2 2
R C ω +1 2
(b) Taking inverse Fourier transform

τ
N 0 − RC
RY (τ ) = e
4 RC

(c) Average output power


N0
EY 2 ( t ) = RY ( 0 ) =
4 RC
Discrete-time Linear Shift Invariant System with WSS Random Inputs
We have seen that the Dirac delta function δ (t ) plays a very important role in the analysis
of the response of the continuous-time LTI systems to deterministic and random inputs.
Similar role in the case of the discrete-time LTI system is played by the unit sample
sequence δ [n] defined by

⎧1 for n = 0
δ [ n] = ⎨
⎩0 otherwise
Any discrete-time signal x[ n] can be expressed in terms of δ [n] as follows:

x[n] = ∑ x[ k ]δ [ n − k ]
k =−∞

A discrete-time linear shift-invariant system is characterized by the unit sample response


h[ n ] which is the output of the system to the unit sample sequence δ [n ].
δ[n] → [LTI system] → h[n]

The transfer function of such a system is given by



H (ω ) = ∑ h[ n]e − jω n
n =−∞

An analysis similar to that for the continuous-time LTI system shows that the response
y[ n] of a the linear time-invariant system with impulse response h[ n ] to a deterministic
input x[ n] is given by

y[ n] = ∑ x[ k ]h[ n − k ] = x[ n]* h[n]
k =−∞

Consider a discrete-time linear system with impulse response h[n] and WSS input X [n]

X[n] → [h[n]] → Y[n]

Y[n] = X[n] * h[n] = Σ_{k=−∞}^{∞} h[k] X[n − k]

E Y[n] = E ( X[n] * h[n] )

For the WSS input X[n],

µY = E Y[n] = µX * h[n] = µX Σ_{n=−∞}^{∞} h[n] = µX H(0)

where H(0) is the dc gain of the system, given by

H(0) = H(ω)|_{ω=0} = Σ_{n=−∞}^{∞} h[n] e^{−jωn} |_{ω=0} = Σ_{n=−∞}^{∞} h[n]

RY[m] = E Y[n] Y[n − m]
      = E ( X[n] * h[n] )( X[n − m] * h[n − m] )
      = RX[m] * h[m] * h[−m]

RY[m] is a function of the lag m only.

From the above we get

SY(ω) = |H(ω)|² SX(ω)

SX(ω) → [|H(ω)|²] → SY(ω)

• Note that even when the input is an uncorrelated (white) process, the output is in general a correlated process.
Consider the case of a discrete-time system with a random sequence x[n] as input.

x[n] → [h[n]] → y[n]

RY[m] = RX[m] * h[m] * h[−m]

Taking the z-transform, we get

SY(z) = SX(z) H(z) H(z^{−1})

Notice that if H(z) is causal, then H(z^{−1}) is anti-causal. Similarly, if H(z) is minimum-phase, then H(z^{−1}) is maximum-phase.

RX[m], SX(z) → [H(z)] → [H(z^{−1})] → RY[m], SY(z)
Example
Suppose H(z) = 1/(1 − αz^{−1}), |α| < 1, and x[n] is a white-noise sequence with EX²[n] = σX², so that SX(z) = σX². Then

SY(z) = H(z) H(z^{−1}) SX(z)
      = σX² [1/(1 − αz^{−1})] [1/(1 − αz)]

By partial fraction expansion and the inverse z-transform, we get

RY[m] = (σX²/(1 − α²)) α^{|m|}

(for the unity-variance case σX² = 1, RY[m] = α^{|m|}/(1 − α²)).
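This autocorrelation can be confirmed by simulating the first-order recursion y[n] = α y[n−1] + x[n], which realizes H(z) = 1/(1 − αz^{−1}); the values of α, σX² and the sample size below are assumptions:

import numpy as np

alpha, sigma2, n = 0.8, 1.0, 400000
rng = np.random.default_rng(4)
x = rng.normal(0.0, np.sqrt(sigma2), n)       # white-noise input with variance sigma2
y = np.empty(n)
y[0] = 0.0
for k in range(1, n):
    y[k] = alpha * y[k - 1] + x[k]            # y[n] = alpha*y[n-1] + x[n]  <=>  H(z) = 1/(1 - alpha z^-1)
for m in (0, 1, 2, 5):
    Ry = np.mean(y[m:] * y[:n - m])           # time-average estimate of R_Y[m]
    print(m, Ry, sigma2 * alpha**m / (1 - alpha**2))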

Spectral factorization theorem

A WSS random signal X[n] that satisfies the Paley–Wiener condition

∫_{−π}^{π} |ln SX(ω)| dω < ∞

can be considered as the output of a linear filter fed by a white noise sequence.

If SX(ω) is an analytic function of ω and ∫_{−π}^{π} |ln SX(ω)| dω < ∞, then

SX(z) = σv² H_c(z) H_a(z)

where
H_c(z) is a causal minimum-phase transfer function,
H_a(z) is an anti-causal maximum-phase transfer function,
and σv² is a constant interpreted as the variance of a white-noise sequence.


Innovation sequence

v[n] → [H_c(z)] → X[n]        Figure: Innovation filter

Since H_c(z) is minimum-phase, the corresponding inverse filter exists:

X[n] → [1/H_c(z)] → v[n]      Figure: Whitening filter


Since ln SX(z) is analytic in an annular region ρ < |z| < 1/ρ,

ln SX(z) = Σ_{k=−∞}^{∞} c[k] z^{−k}

where c[k] = (1/2π) ∫_{−π}^{π} ln SX(ω) e^{jωk} dω is the kth-order cepstral coefficient.

For a real signal, c[k] = c[−k] and c[0] = (1/2π) ∫_{−π}^{π} ln SX(ω) dω.

SX(z) = exp( Σ_{k=−∞}^{∞} c[k] z^{−k} )
      = e^{c[0]} · exp( Σ_{k=1}^{∞} c[k] z^{−k} ) · exp( Σ_{k=−∞}^{−1} c[k] z^{−k} )

Let

H_C(z) = exp( Σ_{k=1}^{∞} c[k] z^{−k} ),    |z| > ρ
       = 1 + h_c[1] z^{−1} + h_c[2] z^{−2} + ......    (∵ h_c[0] = lim_{z→∞} H_C(z) = 1)

H_C(z) and ln H_C(z) are both analytic for |z| > ρ

⇒ H_C(z) is a minimum-phase filter.


Similarly, let

H_a(z) = exp( Σ_{k=−∞}^{−1} c[k] z^{−k} )
       = exp( Σ_{k=1}^{∞} c[k] z^{k} ) = H_C(z^{−1}),    |z| < 1/ρ

Therefore,

SX(z) = σ_V² H_C(z) H_C(z^{−1})

where σ_V² = e^{c[0]}.

Salient points
• SX(z) can be factorized into a minimum-phase factor and a maximum-phase factor, namely H_C(z) and H_C(z^{−1}).
• In general, spectral factorization is difficult; however, for a signal with a rational power spectrum, the spectral factorization can be carried out easily.
• Since H_C(z) is a minimum-phase filter, 1/H_C(z) exists and is stable; therefore we can filter the given signal with 1/H_C(z) to obtain the innovation sequence.
• X[n] and v[n] are related through an invertible transform, so they contain the same information.
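A small sketch of the whitening idea for the first-order example above (α, the sample size and the Gaussian innovations are assumptions): filtering X[n] with 1/H_C(z) = 1 − αz^{−1} should return a sequence whose autocorrelation is (approximately) an impulse.

import numpy as np

alpha, n = 0.8, 400000
rng = np.random.default_rng(5)
v = rng.normal(0.0, 1.0, n)                  # unit-variance white innovation sequence
x = np.empty(n)
x[0] = 0.0
for k in range(1, n):
    x[k] = alpha * x[k - 1] + v[k]           # innovation filter H_C(z) = 1/(1 - alpha z^-1)
w = x[1:] - alpha * x[:-1]                   # whitening filter 1 - alpha z^-1 applied to x
for m in (0, 1, 2, 5):
    Rw = np.mean(w[m:] * w[:len(w) - m])
    print(m, Rw)                             # approx. 1 at m = 0 and approx. 0 otherwise: w is white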

Wold’s Decomposition
Any WSS signal X [n ] can be decomposed as a sum of two mutually orthogonal
processes
• a regular process X r [ n] and a predictable process X p [n] , X [n ] = X r [n ] + X p [n ]

• X_r[n] can be expressed as the output of a linear filter driven by a white noise sequence.
• X p [n] is a predictable process, that is, the process can be predicted from its own

past with zero prediction error.
