You are on page 1of 128

18.05 Spring 2005 Lecture Notes 18.

05 Lecture 1 February 2, 2005

Required Textbook - DeGroot & Schervish, Probability and Statistics, Third Edition Recommended Introduction to Probability Text - Feller, Vol. 1

1.2-1.4. Probability, Set Operations. What is probability? Classical Interpretation: all outcomes have equal probability (coin, dice) Subjective Interpretation (nature of problem): uses a model, randomness involved (such as weather) ex. drop of paint falls into a glass of water, model can describe P(hit bottom before sides) or, P(survival after surgery)- subjective, estimated by the doctor. Frequency Interpretation: probability based on history P(make a free shot) is based on history of shots made. Experiment has a random outcome. 1. Sample Space - set of all possible outcomes. coin: S={H, T}, die: S={1, 2, 3, 4, 5, 6} two dice: S={(i, j), i, j=1, 2, ..., 6} 2. Events - any subset of sample space ex. A S, A - collection of all events. 3. Probability Distribution - P: A [0, 1] Event A S, P(A) or Pr(A) - probability of A Properties of Probability: 1. 0 P(A) 1 2. P(S ) = 1 3. For disjoint (mutually exclusive) events A, B (denition A B = ) P(A or B) = P(A) + P(B) - this can be written for any number of events. For a sequence of events A 1 , ..., An , ... all disjoint (Ai Aj = , i = j): P(

Ai ) =

i=1

which is called countably additive.


If continuous, cant talk about P(outcome), need to consider P(set)
Example: S = [0, 1], 0 < a < b < 1.
P([a, b]) = b a, P(a) = P(b) = 0.

i=1

P(Ai )

Need to group outcomes, not sum up individual points since they all have P = 0.

1.3 Events, Set Operations

Union of Sets: A B = {s S : s A or s B }

Intersection: A B = AB = {s S : s A and s B }

Complement: Ac = {s S : s / A}

Set Dierence: A \ B = A B = {s S : s A and s / B} = A B c

Symmetric Dierence: (A B c ) (B Ac ) Summary of Set Operations: 1. Union of Sets: A B = {s S : s A or s B } 2. Intersection: A B = AB = {s S : s A and s B } 3. Complement: Ac = {s S : s / A} 4. Set Dierence: A \ B = A B = {s S : s A and s / B} = A B c 5. Symmetric Dierence:
/ B ) or (s B and s /
A)} = AB = {s S : (s A and s (A B c ) (B Ac ) Properties of Set Operations: 1. A B = B A 2. (A B ) C = A (B C ) Note that 1. and 2. are also valid for intersections. 3. For mixed operations, associativity matters:
(A B ) C = (A C ) (B C )
think of union as addition and intersection as multiplication: (A+B)C = AC + BC
4. (A B )c = Ac B c - Can be proven by diagram below:

Both diagrams give the same shaded area of intersection. 5. (A B )c = Ac B c - Prove by looking at a particular point: s (A B )c = s / (A B ) s / A or s / B = s Ac or s B c s (Ac B c ) QED ** End of Lecture 1

18.05 Lecture 2 February 4, 2005

1.5 Properties of Probability. 1. P(A) [0, 1] 2. P(S ) = 1 3. P(Ai ) = P (Ai ) if disjoint Ai Aj = , i = j The probability of a union of disjoint events is the sum of their probabilities. 4. P(), P(S ) = P(S ) = P(S ) + P() = 1
where S and are disjoint by denition, P(S) = 1 by #2., therefore, P() = 0.
5. P(Ac ) = 1 P(A)
because A, Ac are disjoint, P(A Ac ) = P(S ) = 1 = P(A) + P(Ac )
the sum of the probabilities of an event and its complement is 1. 6. If A B, P(A) P(B )
by denition, B = A (B \ A), two disjoint sets.
P(B ) = P(A) + P(B \ A) P(A)
7. P(A B ) = P(A) + P(B ) P(AB )
must subtract out intersection because it would be counted twice, as shown:

write in terms of disjoint pieces to prove it:


P(A) = P(A \ B ) + P(AB )
P(B ) = P(B \ A) + P(AB )
P(A B ) = P(A \ B ) + P(B \ A) + P(AB )
Example: A doctor knows that P(bacterial infection) = 0.7 and P(viral infection) = 0.4
What is P(both) if P(bacterial viral) = 1?
P(both) = P(B V)
1 = 0.7 + 0.4 - P(BV)
P(BV) = 0.1

Finite Sample Spaces There are a nite # of outcomes S = {s1 , ..., sn } Dene pi = P(si ) as the probability function.

pi 0,

n i=1

pi = 1 P(s)

P(A) =

sA

Classical, simple sample spaces - all outcomes have equal probabilities. A) P(A) = #( #(S ) , by counting methods. Multiplication rule: #(s1 ) = m, #(s2 ) = n, #(s1 s2 ) = mn Sampling without replacement: one at a time, order is important
s1 ...sn outcomes
k n (k chosen from n)
#(outcome vectors) = (a1 , a2 , ..., ak ) = n(n 1) ... (n k + 1) = Pn,k
Example: order the numbers 1, 2, and 3 in groups of 2. (1, 2) and (2, 1) are dierent.
P3,2 = 3 2 = 6
Pn,n = n(n 1) ... 1 = n!
Pn,k = n! (n k )!

Example: Order 6 books on a shelf = 6! permutations.

Sampling with replacement, k out of n


number of possibilities = n n n... = nk
Example: Birthday Problem- In a group of k people,
what is the probability that 2 people will have the same birthday?
Assume n = 365 and that birthdays are equally distributed throughout the year, no twins, etc.
# of dierent combinations of birthdays= #(S = all possibilities) = 365k
# where at least 2 are the same = #(S ) #(all are dierent) = 365k P365,k
P(at least 2 have the same birthday) = 1 P365,k 365k

each set can be ordered k! ways, so divide that out of Pn,k Cn,k - binomial coecients Binomial Theorem: (x + y )n =
n n

Sampling without replacement, k at once s1 ...sn sample a subset of size k, b1 ...bk , if we arent concerned with order. n n! number of subsets = Cn,k = = k k !(n k )!

k=0

xk y nk

There are

Example: a - red balls, b - black balls.


number of distinguishable ways to order in a row =
a+b a+b = a b Example: r1 + ... + rk = n; ri = number of balls in each box; n, k given
How many ways to split n objects into k sets?
Visualize the balls in boxes, in a line - as shown:

n
k

times that each term will show up in the expansion.

Fix the outer walls, rearrange the balls and the separators. If you x the outer walls of the rst and last boxes,
you can rearrange the separators and the balls using the binomial theorem.
There are n balls and k-1 separators (k boxes).
Number of dierent ways to arrange the balls and separators =
n+k1 n+k1 = n k1 Example: f (x1 , x2 , ..., xk ), take n partial derivatives: nf 2 x1 x2 5 x3 ...xk k boxes k coordinates
n balls n partial derivatives
k1 n+k1
number of dierent partial derivatives = n+n = k 1

Example: In a deck of 52 cards, 5 cards are chosen.


What is the probability that all 5 cards have dierent face values?
total number of outcomes = 52 5 total number of face value combinations = 13 5 total number of suit possibilities, with replacement = 45 13 5 5 4 P(all 5 dierent face values) = 52
5

** End of Lecture 2.

18.05 Lecture 3 February 7, 2005


n! Pn,k = (n k)! - choose k out of n, order counts, without replacement. k n - choose k out of n, order counts, with replacement. n! Cn,k = k!(n k)! - choose k out of n, order doesnt count, without replacement.

1.9 Multinomial Coecients These values are used to split objects into groups of various sizes.
s1 , s2 , ..., sn - n elements such that n1 in group 1, n2 in group 2, ..., nk in group k.
n1 + ... + nk = n
n n n1 n n1 n2 n n1 ... nk2 nk ... n2 n3 n k 1 nk n1 = (n n1 )! (n n1 n2 )! (n n1 ... nk2 )! n! 1 ... n1 !(n n1 )! n2 !(n n1 n2 )! n3 !(n n1 n2 n3 )! nk1 !(n n1 ... nk1 )! = n! = n1 !n2 !...nk1 !nk ! n n1 , n2 , ..., nk

These combinations are called multinomial coecients.

Further explanation: You have n spots in which you have n! ways to place your elements.
However, you can permute the elements within a particular group and the splitting is still the same.
You must therefore divide out these internal permutations.
This is a distinguishable permutations situation.
Example #1 - 20 members of a club need to be split into 3 committees (A, B, C) of 8, 8, and 4 people,
respectively. How many ways are there to split the club into these committees?
20 20! ways to split = = 8!8!4! 8, 8, 4 Example #2 - When rolling 12 dice, what is the probability that 6 pairs are thrown?
This can be thought of as each number appears twice
There are 612 possibilities for the dice throws, as each of the 12 dice has 6 possible values.
In pairs, the only freedom is where the dice show up.
12! 12! 12 = P= = 0.0034 (2!)6 612 2, 2, 2, 2, 2, 2 (2!)6

Example #3 - Playing Bridge


Players A, B, C, and D each get 13 cards.
P(A 6s, B 4s, C 2s, D 1) =?
13 39 (choose s)(choose other cards) 6,4,2,1 7,9,11,12 = P= = 0.00196 52 (ways to arrange all cards) 13,13,13,13 Note - If it didnt matter who got the cards, multiply by 4! to arrange people around the hands. Alternate way to solve - just track the locations of the s 13131313 P=
6 4 2 52 13 1

Probabilities of Unions of Events:

P(A B ) = P(A) + P(B ) P(AB )

P(A B C ) = P(A) + P(B ) + P(C ) P(AB ) P(BC ) P(AC ) + P(ABC ) 1.10 - Calculating a Union of Events - P(union of events)
P(A B ) = P(A) + P(B ) P(AB ) (Figure 1)
P(A B C ) = P(A) + P(B ) + P(C ) P(AB ) P(BC ) P(AC ) + P(ABC ) (Figure 2)
Theorem:

P(

i=1

Ai ) =

i n

P(Ai )

i<j

P(Ai Aj ) +

i<j<k

P(Ai Aj Ak ) ... + (1)n+1 P(Ai ...An )

Express each disjoint piece, then add them up according to what sets each piece
belongs or doesnt belong to.
A1 ... An can be split into a disjoint partition of sets:

c Ai1 Ai2 ... Aik Ac i(k+1) ... Ain

where k = last set the piece is a part of. P(


n

Ai ) =

i=1

P(disjoint partition)

To check if the theorem is correct, see how many times each partition is counted.
P (A1 ), P(A2 ), ..., P ( Ak ) - k times
k P ( A A ) i j i<j 2 times
(needs to contain Ai and Aj in k dierent intersections.) Example: Consider the piece A B C c , as shown:

This piece is counted: P(A B C ) = once. P(A) + P(B ) + P(C ) = counted twice.
P(AB ) P(AC ) P(BC ) = subtracted once.
+P(ABC ) = counted zero times.
The sum: 2 - 1 + 0 = 1, piece only counted once.
Example: Consider the piece A1 A2 A3 Ac 4 k = 3, n = 4.
P(A1 ) + P(A2 ) + P(A3 ) + P(A4 ) = counted k times (3 times).
P(A1 A2 ) P(A1 A3 ) P(A1 A4 ) P(A2 A3 ) P(A2 A4 ) P(A3 A4 ) = counted k 2
times (3 times).
k as follows: i<j<k = counted 3 times (1 time). k k k+1 k total in general: k k 2 + 3 4 + ... + (1) k = sum of times counted. To simplify, this is a binomial situation.

0 = (1 1) =

k k i=0

(1) (1)

(ki)

k k k k = + ... 0 1 2 3

0 = 1 sum of times counted therefore, all disjoint pieces are counted once.

** End of Lecture 3

10

18.05 Lecture 4 February 11, 2005

Union of Events P(A1 ... An ) =


i

P(Ai )

i<j

P(Ai Aj ) +

i<j<k

P(Ai Aj Ak ) + ...

It is often easier to calculate P(intersections) than P(unions)


Matching Problem: You have n letters and n envelopes, randomly stu the letters into the envelopes.
What is the probability that at least one letter will match its intended envelope?
P(A1 ... An ), Ai = {ith position will match}
1)! 1 = (n P(Ai ) = n n! (permute everyone else if just Ai is in the right place.) 2)! P(Ai Aj ) = (n (Ai and Aj are in the right place) n! k)! P(Ai1 Ai2 ...Aik ) = (n n! 1 n (n 2)! n (n 3)! n (n n)! P(A1 ... An ) = n + ... + (1)n+1 n n! n 2 n! 3 n! general term: n!(n k )! 1 n (n k )! = = k n! k !(n k )!n! k! SUM = 1
2

1 1 1 + ... + (1)n+1 2! 3! n!
3

x Recall: Taylor series for ex = 1 + x + x 2! + 3! + ... 1 1 1 for x= -1, e = 1 1 + 2 3! + ... therefore, SUM = 1 - limit of Taylor series as n When n is large, the probability converges to 1 e1 = 0.63

2.1 - Conditional Probability Given that B happened, what is the probability that A also happened? The sample space is narrowed down to the space where B has occurred:

The sample size now only includes the determination that event B happened. Denition: Conditional probability of Event A given Event B: P(A|B ) = P(AB ) P(B )

Visually, conditional probability is the area shown below: 11

It is sometimes easier to calculate intersection given conditional probability: P(AB ) = P(A|B )P(B ) Example: Roll 2 dice, sum (T) is odd. Find P(T < 8). B = {T is odd}, A = {T < 8} P(A|B ) = P(AB ) 18 1 , P(B ) = 2 = P(B ) 6 2

All possible odd T = 3, 5, 7, 9, 11.


Ways to get T = 2, 4, 6, 4, 2 - respectively.
1/3 2 12 =1 P(AB ) = 36 3 ; P(A|B ) = 1/2 = 3 Example: Roll 2 dice until sum of 7 or 8 results (T = 7 or 8)
P(A = {T = 7}), B = {T = 7 or 8}
This is the same case as if you roll once.
(AB ) P(A) 6/36 6 P(A|B ) = PP (B ) = P(B ) = (6+5)/36 = 11 Example: Treatments for a Result Relapse No Relapse disease, results A B C 18 13 22 22 25 16 after 2 years: Placebo 24 10
24 24+10

Example, considering Placebo: B = Placebo, A = Relapse. P(A|B ) = 13 Example, considering treatment B: P(A|B ) = 13+25 = 0.34

= 0.7

As stated earlier, conditional probability can be used to calculate intersections:


Example: You have r red balls and b black balls in a bin.
Draw 2 without replacement, What is P(1 = red, 2 = black)?
r What is P(2 = black) given that 1 = red ? P(1 = red) = r+ b Now, there are only r - 1 red balls and still b black balls. b b r P(2 = black|1 = red) = b+r 1 P(AB ) = b+r 1 r +b P(A1 A2 ...An ) = P(A1 ) P(A2 |A1 ) P(A3 |A2 |A1 ) ... P(An |An1 ...A2 |A1 ) = = P(A1 ) P(An An1 ...A1 ) P(A2 A1 ) P(A3 A2 A1 ) ... = P(A1 ) P(A2 A1 ) P(An1 ...A1 ) = P(An An1 ...A1 ) Example, continued: Now, nd P(r, b, b, r) 12

r b b1 r1 r+b r1+b r+b2 r+b3

Example, Casino game - Craps. Whats the probability of actually winning??


On rst roll: 7, 11 - win; 2, 3, 12 - lose; any other number (x1 ), you continue playing.
If you eventually roll 7 - lose; x1 , you win!
P(win) = P(x1 = 7 or 11) + P(x1 = 4)P(get 4 before 7|x1 = 4)+ +P(x1 = 5)P(get 5 before 7|x1 = 5) + ... = 0.493 The game is almost fair! ** End of Lecture 4

13

18.05 Lecture 5 February 14, 2005

2.2 Independence of events. (AB ) P(A|B ) = PP (B ) ; Denition - A and B are independent if P(A|B ) = P(A) P(A|B ) = P(AB ) = P(A) P(AB ) = P(A)P(B ) P(B )

Experiments can be physically independent (roll 1 die, then roll another die),
or seem physically related and still be independent.
Example: A = {odd}, B = {1, 2, 3, 4}. Related events, but independent.
2 1 .P(B ) = 3 .AB = {1, 3} P(A) = 2 1 2 1 , therefore independent. P(AB ) = 2 3 = P(AB ) = 3 Independence does not imply that the sets do not intersect.

Disjoint = Independent. If A, B are independent, nd P(AB c )


P(AB ) = P(A)P(B )
AB c = A \ AB , as shown:

so, P(AB c ) = P(A) P(AB )


= P(A) P(A)P(B )
= P(A)(1 P(B ))
= P(A)P(B c )
therefore, A and B c are independent as well.
similarly, Ac and B c are independent. See Pset 3 for proof.
Independence allows you to nd P(intersection) through simple multiplication. 14

Example: Toss an unfair coin twice, these are independent events. P(H ) = p, 0 p 1, nd P(T H ) = tails rst, heads second P(T H ) = P(T )P(H ) = (1 p)p Since this is an unfair coin, the probability is not just 1 4 TH 1 If fair, HH +HT +T H +T T = 4 If you have several events: A1 , A2 , ...An that you need to prove independent:
It is necessary to show that any subset is independent.
Total subsets: Ai1 , Ai2 , ..., Aik , 2 k n
Prove: P(Ai1 Ai2 ...Aik ) = P(Ai1 )P(Ai2 )...P(Aik )
You could prove that any 2 events are independent, which is called pairwise independence,
but this is not sucient to prove that all events are independent.
Example of pairwise independence:
Consider a tetrahedral die, equally weighted.
Three of the faces are each colored red, blue, and green,
but the last face is multicolored, containing red, blue and green.
P(red) = 2/4 = 1/2 = P(blue) = P(green)
P(red and blue) = 1/4 = 1/2 1/2 = P(red)P(blue)
Therefore, the pair {red, blue} is independent.
The same can be proven for {red, green} and {blue, green}.
but, what about all three together?
P(red, blue, and green) = 1/4 = P(red)P(blue)P(green) = 1/8, not fully independent.
Example: P(H ) = p, P(T ) = 1 p for unfair coin
Toss the coin 5 times P(HTHTT)
= P(H )P(T )P(H )P(T )P(T )
= p(1 p)p(1 p)(1 p) = p2 (1 p)3
Example: Find P(get 2H and 3T, in any order)
= sum of probabilities for ordering
= P(HHT T T ) + P(HT HT T ) = ...
2 =p (1 p)3 + p2 (1 p)3 + ...
5 = 2 p2 (1 p)3

General Example: Throw a coin n times, P(k heads out of n throws)


n k = p (1 p)nk k

Example: Toss a coin until the result is heads; there are n tosses before H results.
P(number of tosses = n) =?
needs to result as TTT....TH, number of Ts = (n - 1)
P(tosses = n) = P(T T...H ) = (1 p)n1 p

Example: In a criminal case, witnesses give a specic description of the couple seen eeing the scene.
P(random couple meets description) = 8.3 108 = p
We know at the beginning that 1 couple exists. Perhaps a better question to be asked is:
Given a couple exists, what is the probability that another couple ts the same description?
P(2 couples exists)
A = P(at least 1 couple), B = P(at least 2 couples), nd P(B |A)
(BA) P(B ) P(B |A) = PP (A) = P(A) 15

n Out of n couples, P(A) = P(at least 1 couple) = 1 P(no couples) = 1 i=1 (1 p)


*Each* couple doesnt satisfy the description, if no couples exist.
Use independence property, and multiply.
P(A) = 1 (1 p)n
P(B ) = P(at least two) = 1 P(0 couples) P(exactly 1 couple)
= 1 (1 p)n n p(1 p)n1 , keep in mind that P(exactly 1) falls into P(k out of n)
P(B |A) = 1 (1 p)n np(1 p)n1 1 (1 p)n

If n = 8 million people, P(B |A) = 0.2966, which is within reasonable doubt! P(2 couples) < P(1 couple), but given that 1 couple exists, the probability that 2 exist is not insignicant.

In the large sample space, the probability that B occurs when we know that A occured is signicant! 2.3 Bayess Theorem It is sometimes useful to separate a sample space S into a set of disjoint partitions:

B1 , ..., Bk - a partition of sample space S. k Bi Bj = , for i = j, S = i=1 Bi (disjoint) k k Total probability: P(A) = i=1 P(ABi ) = i =1 P(A|Bi )P(Bi ) k (all ABi are disjoint, i=1 ABi = A)

** End of Lecture 5

16

18.05 Lecture 6 February 16, 2005

Solutions to Problem Set #1 1-1 pg. 12 #9 Bn = i=n Ai , Cn = i=n Ai a) Bn Bn +1 ... Bn = An ( i=n+1 Ai ) = An Bn+1 s Bn+1 s Bn+1 An = Bn Cn Cn+1 ... Cn = An Cn+1 s Cn = n Cn+1 s Cn+1 A b) s n=1 Bn s Bn for all n s i=1 Ai for all n s some Ai for i n, for all n s innitely many events Ai Ai happen innitely often.
c) s n=1 Cn s some Cn = i=n Ai for some n, s all Ai for i n
s all events starting at n. 1-2 pg. 18 #4 P (at least 1 fails) = 1 P (neither fail) = 1 0.4 = 0.6 1-3 pg. 18 #12 A1 , A2 , ... c c B1 = A1 , B2 = Ac 1 A2 , ..., Bn = A1 ...An1 An n n P ( i=1 Ai ) = i=1 P (B i ) splits the union into disjoint events, and covers the entire space. n follows from: n A = i i=1 i=1 Bi n take point (s) in i=1
Ai , s at least one s A1 = B1 , c
if not, s Ac 1 , if s A2 , then s A1 A2 = B2 , if not... etc. at some point, the point belongs to a set. c The sequence stops when s Ac Ac 1 2 ... Ak1 Ak = Bk n n n s i=1 Bi .P ( i=1 Ai ) = P ( i=1 Bi ) n = i=1 P (Bi ) if Bi s are disjoint. Should also prove that the point in Bi belongs in Ai . Need to prove Bi s disjoint - by construction: c Bi , Bj B i = A c i ... Ai1 Ai

c c Bj = A1 ... Ai ... Ac j 1 A j s B i s A i , s Bj s / Ai . implies that s = s 1-4 pg. 27 #5 #(S ) = 6 6 6 6 = 64 #(all dierent) = 6 5 4 3 = P6,4 P6,4 5 P (all dierent) = 6 = 18 4 1-5 pg. 27 #7
12 balls in 20 boxes.
P(no box receives > 1 ball, each box will have 0 or 1 balls)
also means that all balls fall into dierent boxes.
#(S ) = 2012
#(all dierent) = 20 19... 9 = P20,12

17

P (...) =

P20,12 2012

1-6 pg. 27 #10


100 balls, r red balls.
Ai = {draw red at step i}
think of arranging the balls in 100 spots in a row.
r a) P (A1 ) = 100 b) P (A50 )
sample space = sequences of length 50.
#(S ) = 100 99 ... 50 = P100,50
#(A50 ) = r P99,49 red on 50. There are 99 balls left, r choices to put red on 50.
r P (A50 ) = 100 , same as part a.
c) As shown in part b, the particular draw doesnt matter, probability is the same.
r P (A100 ) = 100 1-7 pg. 34 #6
Seat n people in n spots.
#(S ) = n!
#(AB sit together) =?
visualize n seats, you have n-1 choices for the pair.
2(n-1) ways to seat the pair, because you can switch the two people.
but, need to account for the (n-2) people remaining!
#(AB ) = 2(n 1)(n 2)!
1)! 2 =n therefore, P = 2(nn ! or, think of the pair as 1 entity. There are (n-1) entities, permute them, multiply by 2 to swap the pair. 1-8 pg. 34 #11 Out of 100, choose 12. #(S ) = 100 98 12 #(AB are on committee) = 10 , choose 10 from the 98 remaining. (98 10) P = 100 ( 12 ) 1-9 pg. 34 #16 50 states 2 senators each. a) Select 8 , #(S ) = 100 8 298 2 #(state 1 or state 2) = 98 6 2 + 1 7

1-10 pg. 34 #17


In the sample 13
of the aces in the hands.
space, only consider the positions , #(all go to 1 player) = 4 #(S ) = 52 4 4 (13 4) P = 4 52 (4) 1-11 r balls, n boxes, no box is empty.
rst of all, put 1 ball in each box from the beginning.
r-n balls remain to be distributed in n boxes.

50 b) #(one senator from each state) = 2 100 select group of 50 = 50

or, calculate: 1 P(neither chosen) = 1

(98 8) (100 8 )

18

n + (r n) 1 rn

r1 rn

1-12 30 people, 12 months.


P(6 months with 3 birthdays, 6 months with 2 birthdays)
#(S ) = 1230
Need to choose the 6 months with 3 or 2 birthdays, then the multinomial coecient:
12 30 #(possibilities) = 6 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2 ** End of Lecture 6

19

18.05 Lecture 7 February 18, 2005

Bayes Formula.

Partition B1 , ..., Bk k B = S, Bi Bj = for i = j i i=1 k P(A) = k i=1 P(ABi ) = i=1 P(A|Bi )P(Bi ) - total probability.

Example: In box 1, there are 60 short bolts and 40 long bolts. In box 2,
there are 10 short bolts and 20 long bolts. Take a box at random, and pick a bolt.
What is the probability that you chose a short bolt?
B1 = choose Box 1.
B2 = choose Box 2.
60 1 1 P(short) = P(short|B1 )P(B1 ) + P(short|B2 )P(B2 ) = 100 ( 2 ) + 10 30 ( 2 )
Example:
Partitions: B1 , B2 , ...Bk and you know the distribution.
Events: A, A, ..., A and you know the P(A) for each Bi
If you know that A happened, what is the probability that it came from a particular B i ?
P(Bi |A) = P(Bi A) P(A|Bi )P(Bi ) = : Bayess Formula P(A) P(A|B1 )P(B1 ) + ... + P(A|Bk )P(Bk )

Example: Medical detection test, 90% accurate.


Partition - you have the disease (B1 ), you dont have the disease (B2 )
The accuracy means, in terms of probability: P(positive|B1 ) = 0.9, P(positive|B2 ) = 0.1
In the general public, the chance of getting the disease is 1 in 10,000.
In terms of probability: P(B1 ) = 0.0001, P(B2) = 0.9999
If the result comes up positive, what is the probability that you actually have the disease? P(B 1 |positive)?
P(B1 |positive) = P(positive|B1 )P(B1 ) P(positive|B1 )P(B1 ) + P(positive|B2 )P(B2 )

(0.9)(0.0001) = 0.0009 (0.9)(0.0001) + (0.1)(0.9999)

The probability is still very small that you actually have the disease.

20

Example: Identify the source of a defective item.


There are 3 machines: M1 , M2 , M3 . P(defective): 0.01, 0.02, 0.03, respectively.
The percent of items made that come from each machine is: 20%, 30%, and 50%, respectively.
Probability that the item comes from a machine: P (M1 ) = 0.2, P (M2 ) = 0.3, P (M3 ) = 0.5
Probability that a machines item is defective: P (D|M1 ) = 0.01, P (D|M2 ) = 0.02, P (D|M3 ) = 0.03
Probability that it came from Machine 1: P (M1 |D) = P (D|M1 )P (M1 ) P (D|M1 )P (M1 ) + P (D|M2 )P (M2 ) + P (D|M3 )P (M |3) (0.01)(0.2) = 0.087 (0.01)(0.2) + (0.02)(0.3) + (0.03)(0.5)

Example: A gene has 2 alleles: A, a. The gene exhibits itself through a trait with two versions.
The possible phenotypes are dominant, with genotypes AA or Aa, and recessive, with genotype aa.
Alleles travel independently, derived from a parents genotype.
In a population, the probability of having a particular allele: P(A) = 0.5, P(a) = 0.5
Therefore, the probabilities of the genotypes are: P(AA) = 0.25, P(Aa) = 0.5, P(aa) = 0.25
Partitions: genotypes of parents: (AA, AA), (AA, Aa), (AA, aa), (Aa, Aa), (Aa, aa), (aa, aa).
Assume pairs match regardless of genotype.
Parent genotypes
(AA, AA)
(AA, Aa)
(AA, aa)
(Aa, Aa)
(Aa, aa)
(aa, aa)
Probabilities
1 2 (1 4 )( 2 ) = 1 1 2 ( 4 )( 4 ) = 1 1 (1 2 )( 2 ) = 4 1 2 (1 2 )( 4 ) = 1 16 1 16 1 4 1 8 1 4

Probability that child has dominant phenotype 1 1 1


3 4 1 2

If you see that a person has dark hair, predict the genotypes of the parents: P ((AA, AA)|A) =
1 16 (1) 1 4 (1) 1 8 (1) 1 1 16 1 1 3 + 4(4) + 1 4(2) + 1 16 (0)

1 12

You can do the same computation to nd the probabilities of each type of couple. Bayess formula gives a prediction inside the parents that you arent able to directly see. Example: You have 1 machine.
In good condition: defective items only produced 1% of the time. P(in good condition) = 90%
In broken condition: defective items produced 40% of the time. P(broken) = 10%
Sample 6 items, and nd that 2 are defective. Is the machine broken?
This is very similar to the medical example worked earlier in lecture:
P(good|2 out of 6 are defective) =
= P (2of 6|good)P (good) P (2of 6|good)P (good) + P (2of 6|broken)P (broken) 6 2 4 2 (0.01) (0.99) (0.9) 6 = 6 = 0.04 2 4 2 4 2 (0.01) (0.99) (0.9) + 2 (0.4) (0.6) (0.1)

** End of Lecture 7

21

18.05 Lecture 8 February 22, 2005

3.1 - Random Variables and Distributions Transforms the outcome of an experiment into a number.
Denitions:
Probability Space: (S, A, P)
S - sample space, A - events, P - probability
Random variable is a function on S with values in real numbers, X:S R
Examples:
Toss a coin 10 times, Sample Space = {HTH...HT, ....}, all congurations of H & T.
Random Variable X = number of heads, X: S R
X: S {0, 1, ..., 10} for this example.
There are fewer outcomes than in S, you need to give the distribution of the
random variable in order to get the entire picture. Probabilities are therefore given.
Denition: The distribution of a random variable X:S R, is dened by: A R, P(A) = P(X A) = P(s S : X (s) A)

The random variable maps outcomes and probabilities to real numbers.


This simplies the problem, as you only need to dene the mapped R, P, not the original S, P.
The mapped variables describe X, so you dont need to consider the original
complicated probability space.
1 k 1 10k 10 1 From the example, P(X = #(heads in 10 tosses) = k ) = 10 = k 210 k (2) (2) Note: need to distribute the heads among the tosses,
account for probability of both heads and tails tossed.
This is a specic example of the more general binomial problem:
A random variable X {1, ..., n}
n k P(X = k ) = p (1 p)nk k This distribution is called the binomial distribution: B(n, p), which is an example of a discrete distribution. Discrete Distribution A random variable X is called discrete if it takes a nite or countable number (sequence) of values: X {s1 , s2 , s3 , ...} It is completely described by telling the probability of each outcome. Distribution dened by: P(X = sk ) = f (sk ), the probability function (p.f.) p.f. cannot be negative and should sum to 1 over all outcomes. P(X A) = sk A f (sk )

Example: Uniform distribution of a nite number of values {1, 2, 3, ..., n} each outcome 22

1 : uniform probability function. has equal probability f (sk ) = n random variable X R, P(A) = P(X A), A R
can redene probability space on random variable distribution:
(R, A, P) - sample space, X: R R, X (x) = x (identity map)
P(A) = P(X : X (x) A) = P(x A) = P(x A) = P(A)
all you need is the outcomes mapped to real numbers and relative probabilities
of the mapped outcomes.

Example: Poisson Distribution, {0, 1, 2, 3, ...} (), = intensity


probability function:
f (k ) = P(X = k ) =
k k0 k! e

k e , where parameter > 0. k!

k = e k 0 e = e0 = 1 k! = e Very common distribution, will be used later in statistics.


Represents a variety of situations - ex. distribution of typos in a book on a particular page,
number of stars in a random spot in the sky, etc.
Good approximation for real world problems, as P > 10 is small.
Continuous Distribution Need to consider intervals not points.
Probability distribution function (p.d.f.): f (x) 0.
Summation replaced by integral: f (x)dx = 1
then, P(A) = A f (x)dx, as shown:

Example: In a uniform distribution [a, b], denoted U[a, b]: 1 p.d.f.: f (x) = b / [a, b] a , for x [a, b]; 0, for x Example: On an interval [a, b], such that a < c < d < b, d 1 dc P([c, d]) = c b a dx = ba (probability on a subinterval) Example: Exponential Distribution

If you were to choose a random point on an interval, the probability of choosing


a particular point is equal to zero.
You cant assign positive probability to any point, as it would add up innitely on a continuous interval.
It is necessary to take P(point is in a particular sub-interval).
a Denition implies that P({a}) = a f (x)dx = 0

E (), > 0 parameter p.d.f.: f (x) = ex , if x 0; 0, if x < 0 Check that it integrates to 1:

23

1 x ex dx = ( e |0 = 1 Real world: Exponential distribution describes the life span of quality products (electronics). 0

** End of Lecture 8

24

18.05 Lecture 9 February 23, 2005

Cumulative distribution function (c.d.f):


F (x) = P(X x), x R
Properties:
1. x1 x2 , {X x1 } {X x2 } P(X x1 ) P(X x2 ) non-decreasing function. 2. limx F (x) = P(X ) = 0, limx F (x) = P(X ) = 1. A random variable only takes real numbers, as x , set becomes empty.
1 , P(X = 1) = Example: P(X = 0) = 2 1 2

Discrete Random Variable: - dened by probability function (p.f.) {s1 , s2 , ...}, f (si ) = P(X = si ) Continuous: probability distribution function (p.d.f.) - also called density function. f (x) 0, f (x)dx, P(X A) = A f (x)dx

P(X x < 0) = 0 1 1 P(X 0) = P(X = 0) = 2 , P(X x) = P(X = 0) = 2 , x [0, 1) P(X x) = P(X = 0 or 1) = 1, x [1, ) 3. right continuous: limyx+ F (y ) = F (x) F (y ) = P(X y ), event {X y }

n=1

{X yn } = {X x}, F (yn ) P(X x) = F (x)

Probability of random variable occuring within interval: P(x1 < X < x2 ) = P({X x2 }\{X x1 }) = P(X x2 ) P(X x1 ) = F (x2 ) F (x1 )

25

{X x2 } {X x1 } Probability of a point x, P(X = x) = F (x) F (x ) where F (x ) = limxx F (x), F (x+ ) = limxx+ F (x)
If continuous, probability at a point is equal to 0, unless there is a jump,
where the probability is the value of the jump.

P(x1 X x2 ) = F (x2 ) F (x 1 ) P(A) = P(X A)
X - random variable with distribution P
When observing a c.d.f:

Discrete: sum of probabilities at all the jumps = 1. Graph is horizontal in between the jumps, meaning that probability = 0 in those intervals.

x Continuous: F (x) = P(X x) = f (x)dx eventually, the graph approaches 1.

26

If f continuous, f (x) = F (x) Quantile: p [0, 1], p-quantile = inf {x : F (x) = P(X x) p} nd the smallest point such that the probability up to the point is at least p.
The area underneath F(x) up to this point x is equal to p.
If the 0.25 quantile is at x = 0, P(X 0) 0.25

Note that if disjoint, the 0.25 quantile is at x = 0, but so is the 0.3, 0.4...all the way up to 0.5. What if you have 2 random variables? multiple?
ex. take a person, measure weight and height. Separate behavior tells you nothing
about the pairing, need to describe the joint distribution.
Consider a pair of random variables (X, Y)
Joint distribution of (X, Y): P((X, Y ) A)
Event, set A R2
27

2 1 2 Discrete distribution: {(s1 1 , s1 ), (s2 , s2 ), ...} (X, Y ) 1 2 2 Joint p.f.: f (si , si ) = P((X, Y ) = (s1 i , s1 )) 1 2 = P(X = si , Y = si ) Often visualized as a table, assign probability for each point:

1 1.5 3 Continuous: f (x, y ) 0,

0 0.1 0 0.2

-1 0 0 0

-2.5 0.2 0 0.4

5 0 0.1 0

f (x, y )dxdy =

Joint p.d.f. f (x, y ) : P((X, Y ) A) = A f (x, y )dxdy Joint c.d.f. F (x, y ) = P(X x, Y y )

R2

f (x, y )dxdy = 1

If you want the c.d.f. only for x, F (x) = P(X x) = P(X x, Y +)


= F (x, ) = limy F (x, y )
Same for y.
To nd the probability within a rectangle on the (x, y) plane:

Continuous: F (x, y ) =

x y

f (x, y )dxdy. Also,

2F xy

= f (x, y )

** End of Lecture 9

28

18.05 Lecture 10 February 25, 2005

x y In the continuous case: F (x, y ) = P(X x, T y ) = f (x, y )dxdy. Marginal Distributions Given the joint distribution of (X, Y), the individual distributions of X, Y
are marginal distributions.
Discrete (X, Y): marginal
probability function f1 (x) = P(X = x) = y P(X = x, Y = y ) = y f (x, y )
In the table for the previous lecture, of probabilities for each point (x, y):
Add up all values for y in the row x = 1 to determine P(X = 1)
Continuous (X, Y): joint p.d.f. f(x, y); p.d.f. of X: f1 (x) = f (x, y )dy x F (x) = P(X x) = P(X x, Y ) = f (x, y )dydx

Review of Distribution Types Discrete distribution for (X, Y): joint p.f. f (x, y ) = P(X = x, Y = y ) Continuous: joint p.d.f. f (x, y ) 0, R2 f (x, y )dxdy = 1 Joint c.d.f.: F (x, y ) = P(X x, Y y ) F (x) = P(X x) = limy F (x, y )

f1 (x) = F x = f (x, y )dy Why not integrate over x line?


P({X = x}) = ( x f (x, y )dx)dy = 0
P(of continuous random variable at a specic point) = 0. Example: Joint p.d.f. 2 2 f (x, y ) = 21 4 x y, x y 1, 0 x 1; 0 otherwise

29

What is the distribution of x? 1 1 2 1 2 2 x ydy = 21 p.d.f. f1 (x) = x2 21 4 4 x 2 y |x 2 =

21 2 8 x (1

x4 ), 1 x 1

Discrete values for X, Y in tabular form: 1 2 1 0.5 0 0.5 2 0 0.5 0.5 0.5 0.5 Note: If all entries had 0.25 values, the two variables would have the same marginal dist. Independent X and Y: Denition: X, Y independent if P(X A, Y B ) = P(X A)P(Y B )
Joint c.d.f. F (x, y ) = P(X x, Y y ) = P(X x)P(Y y ) = F1 (x)F2 (y ) (intersection of events)
The joint c.d.f can be factored for independent random variables.
Implication: (X, Y): joint p.d.f. f(x, y), x), f2 (y ) y x continuous x marginal f1 ( y F (x, y ) = f (x, y )dydx = F1 (x)F2 (y ) = f1 (x)dx f2 (y )dy
2

Take xy of both sides: f (x, y ) = f1 (x)f2 (y ) Independent if joint density is a product.

Much simpler in the discrete case:


Discrete (X, Y): f (x, y ) = P(X = x, Y = y ) = P(X = x)P(Y = y ) = f1 (x)f2 (y ) by denition.
Example: Joint p.d.f.
f (x, y ) = kx2 y 2 , x2 + y 2 1; 0 otherwise
X and Y are not independent variables.
f (x, y ) = f1 (x)f2 (y ) because of the circle condition.

30

P(square) = 0 = P(X side) P(Y side) Example: f (x, y ) = kx2 y 2 , 0 x 1, 0 y 1; 0 otherwise Can be written as a product, as they are independent:
f (x, y ) = kx2 y 2 I (0 x 1, 0 y 1) = k1 x2 I (0 x 1) k2 y 2 I (0 y 1)
Conditions on x and y can be separated.
Note: Indicator Notation
/
A I (x A) = 1, x A; 0, x For the discrete case, given a table of values, you can tell independence: b1 p11 ... ... pn1 p+1 b2 p12 ... ... ... p+2 ... ... ... ... ... ... bm p1m ... ... pnm p+n

a1 a2 ... an

p1+ p2+ ... pn+

pij = P(X = ai , Y = bj ) = P(X = ai )P(Y = bj ) m pi+ = P(X = ai ) = j =1 pij n p+j = P(Y = bj ) = i=1 pij pij = pi+ p+j , for every i, j - all points in table. ** End of Lecture 10

31

18.05 Lecture 11 February 28, 2005

A pair (X, Y) of random variables:


f(x, y) joint p.f. (discrete), joint p.d.f. (continuous)
Marginal Distributions: f (x) = y f (x, y ) - p.f. of X (discrete)
f (x) = f (x, y )dy - p.d.f. of X (continuous)
Conditional Distributions Discrete Case: P(X = x|Y = y ) =
f (x,y ) f (y ) = f (x|y ) conditional (x,y ) f (y |x) = ff (x) conditional p.f.

P(X = x, Y = y ) P(Y = y )

P=

p.f. of X given Y = y. Note: dened when f(y) is positive.

of Y given X = x. Note: dened when f(x) is positive. If the marginal probabilities are zero, conditional probability is undened.
Continuous Case:
Formulas are the same, but cant treat like exact possibilities at xed points.
Consider instead in terms of probability density:
Conditional c.d.f. of X given Y=y;
P(X x|Y [y , y + ]) = P(X x, Y [y , y + ]) P(Y [y , y + ])

Joint p.d.f. f (x, y ), P(A) =

f (x, y )dxdy

= As 0:

y+ x 1 2 y f (x, y )dxdy y+ 1 f (x, y )dxdy 2 y f (x, y )dx f (x, y )dx = x

Conditional c.d.f:

f (x, y )dx f (y )

32

P(X x|Y = y ) = Conditional p.d.f: f (x|y ) =

f (x, y )dx f (y )

f (x, y ) P(X x|Y = y ) = x f (y )

Same result as discrete.


Also, f (x|y ) only dened when f(y) > 0.
Multiplication Rule f (x, y ) = f (x|y )f (y ) Bayess Theorem: f (y |x) = f (x, y ) f (x|y )f (y ) f (x|y )f (y ) = = f (x) f (x, y )dy f (x|y )f (y )dy

Bayess formula for Random Variables. For each y, you know the distribution of x. Note: When considering the discrete case, In statistics, after observing data, gure out the parameter using Bayess Formula. Example: Draw X uniformly on [0, 1], Draw Y uniformly on [X, 1] p.d.f.: 1 f (x) = 1 I (0 x 1), f (y |x) = I (x y 1) 1x Joint p.d.f: 1 I (0 x y 1) f (x, y ) = f (y |x)f (x) = 1x Marginal: y 1 f (y ) = f (x, y )dx = dx = ln(1 x)|y 0 = ln(1 y ) 1 x 0

Keep in mind, this condition is everywhere: given, y [0, 1] and f(y) = 0 if y / [0, 1] Conditional (of X given Y): f (x|y ) = 1 f (x, y ) I (0 x y 1) = f (y ) (1 x) ln(1 y )

Multivariate Distributions Consider n random variables: X1 , X2 , ..., Xn Joint p.f.: f (x1 , x2 , ..., xn ) = P(X f =1 1 = x1 , ..., Xn = xn ) 0, Joint p.d.f.: f (x1 , x2 , ..., xn ) 0, f dx1 dx2 ...dxn = 1 Marginal, Conditional in the same way: Dene notation as vectors to simplify: X = (X1 , ..., Xn ), x = (x1 , ..., xn ) X = ( Y , Z ) subsets of coordinates: Y = (X1 , ..., Xk ), y = (y1 ...yk ) z = (z1 ...znk ) Z = (Xk+1 , ..., Xn ), Joint p.d.f. or joint p.f. of X , f ( x ) = f ( y , z) 33

Marginal: f ( y)= Conditional:

f ( y , z )d z , f ( z)=

f ( y , z )d y

y | z )f ( f ( z) f ( y , z) , f ( z | y)= f ( y | z)= f( z ) z f ( y | z )f ( z )d

Functions of Random Variables Consider random variable X and a function r: R R, Y = r(X ), and you want to calculate the distribution of Y. Discrete Case: Discrete p.f.: f (y ) = P(Y = y ) = P(r(X ) = y ) = P(x : r(x) = y ) = (very similar to change of variable) Continuous Case:
Find the c.d.f. of Y = r(X) rst.
P(Y y ) = P(r(X ) y ) = P(x : r(x) y ) = P(A(y )) = p.d.f. f (y ) =
y

x:r (x)=y

P(X = x) =

x:r (x)=y

f (x)

** End of Lecture 11

f (x)dx
A(y )

A(y )

f (x)dx

34

18.05 Lecture 12
March 2, 2005

Functions of Random Variables X - random variable, continuous with p.d.f. f(x)


Y = r(X)
Y doesnt have to be continuous, if it is, nd the p.d.f.
To nd the p.d.f., rst nd the c.d.f.
P(Y y ) = P(r(X ) y ) = P(x : r(X ) y ) = Then, dierentiate the c.d.f to nd the p.d.f.: f (y ) = (P(Y y )) y f (x)dx.
x:r (x)y

Example:
Take random variable X, uniform on [-1, 1]. Y = X 2 , nd distribution of Y.
1 p.d.f. f (x) = { 2 for 1 x 1; 0 otherwise} Y = X , P(Y y ) = P(X Y ) = P( y X y ) =
2 2

. Take derivative before integrating.

f (x)dx

1 1 1 P(Y y ) = f ( y ) + f (y ) = (f ( y) + f ( y)) y 2 y 2 y y 1 f (y ) = { , 0 y 1; 0 otherwise.} y Suppose r is monotonic (strictly one-to-one function).


X = r(y ), can always nd inverse: y = r 1 (x) = s(y ) - inverse of r.
P(Y y ) = P(r(x) y ) =
= P(X s(y )) if r is increasing (1)
= P(X s(y )) if r is decreasing (2) (1) = F (s(y )) where F () - c.d.f. of X , P(Y y ) F (s(y )) = = f (s(y ))s (y ) y y (2) = 1 P(X < s(y )) = 1 F (s(y )), P(Y y ) = f (s(y ))s (y ) y

If r is increasing s = r 1 is increasing. s (y ) 0 s (y ) = |s (y )| If r is decreasing s = r 1 is decreasing. s (y ) 0 s (y ) = |s (y )| Answer: p.d.f. of Y : f (y ) = f (s(y ))|s (y )|

35

Example: f (x) = {3(1 x)2 , 0 x 1; 0 otherwise.} Y = 10e5x Y = 10e5x X = f (y ) = 3(1 1 1 Y ln( ); X = 5 10 5Y

1 Y 1 ln( )| |, 10 y 10e5 ; 0, otherwise. 5 10 5Y

X has c.d.f. F (X ) = P(X x), continuous. Y = F (X ), 0 Y 1, what is the distribution of Y?

c.d.f. P(Y y ) = P(F (X ) y ) = P(X F 1 (y )) = F (F 1 (y )) = y, 0 y 1 p.d.f. f (y ) = {1, 0 y 1; 0, otherwise.} Y - uniform on interval [0, 1]

X - uniform on interval [0, 1]; F - c.d.f. of Y .

36

Y = F 1 (X ); P(Y y ) = P(F 1 (x) y ) = P(X F (y )) = F (y ). Random Variable Y = F 1 (X ) has c.d.f. F (y ). Suppose that (X, Y) has joint p.d.f. f(x, y). Z = X + Y. P(Z z ) = P(X + Y z ) = f (x, y )dxdy =
x+y z

z x

f (x, y )dydx,

p.d.f.: f (z ) = P(Z z ) = z

f (x, z x)dx

If X, Y independent, f1 (x) = p.d.f. of X. f2 (y ) = p.d.f. of Y Joint p.d.f.: f (x, y ) = f1 (x)f2 (y ); f (z ) = f1 (x)f2 (z x)dx

Example: X, Y independent, have p.d.f.: f (x) = {ex , x 0; 0, otherwise}. Z =X +Y : f (z ) =


z 0

ex e(zx) dx

This distribution describes the lifespan of a high quality product.


It should work like new after a point, given it doesnt break early on.
Distribution of X itself:

Limits determined by: (0 x, z x 0 0 x z ) z f (z ) = 2 ez dx = 2 ez


0

z 0

dx = 2 zez

37

X, P(X x) = Conditional Probability:

1 = ex ex dx = ex ( )| x

P(X x + t|X x) = ** End of Lecture 12

P(X x + t) e(X +t) P(X x + t, x x) = = = et = P(X t) P(X x) P(X x) ex

38

18.05 Lecture 13 March 4, 2005 Functions of random variables. If (X, Y) with joint f (x, y ), consider Z = X + Y. p.d.f. p.d.f. of Z: f (z ) = f (x, z x)dx If X and Y independent: f (z ) = f1 (x)f2 (z x)dx

Example:
X, Y independent, uniform on [0, 1], X, Y U [0, 1], Z = X + Y
p.d.f. of X, Y:
f1 (x) = {1, 0 x 1; 0 otherwise} = I (0 x 1),
f2 (y ) =I (0 y 1) = I (0 z x 1)
f (z ) = I (0 x 1) I (0 z x 1)dx
Limits: 0 x 1; z 1 x z
Both must be true, consider all the cases for values of z:

Case 1: (z 0) = 0 z Case 2: (0 z 1) 0 1dx = z 1 Case 3: (1 z 2) z1 1dx = 2 z Case 4: (z 2) = 0 Random variables likely to add up near 1, peak of f(z) graph. Example: Multiplication of Random Variables X 0, Y 0.Z = XY (Z is positive) First, look at the c.d.f.:

P(Z z ) = P(XY z ) = p.d.f. of Z:

f (x, y )dxdy =
XY z

z/x

f (x, y )dydx
0

39

P(Z z ) f (z ) = = z

Example: Ratio of Random Variables zy Z = X/Y (all positive), P(Z z ) = P(X zY ) = xzy f (x, y )dxdy = 0 0 f (x, y )dxdy p.d.f. f (z ) = 0 f (zy, y )ydy In general, look at c.d.f. and express in terms of x and y. Example: X1 , X2 , ..., Xn - independent with same distribution (same c.d.f.)
f (x) = F (x) - p.d.f. of Xi
P(Xi x) = F (x)
Y = maximum among X1 , X2 ...Xn
P(Y y ) = P(max(X1 , ..., Xn ) y ) = P(X1 y, X2 y...Xn y )
Now, use denition of independence to factor:
= P(X1 y )P(X2 y )...P(Xn y ) = F (y )n p.d.f. of Y: (y ) = F (y )n = nF (y )n1 F (y ) = nF (y )n1 f (y ) f y Y = min(X1 , . . . , Xn ), P(Y y ) = P(min(X1 , ..., Xn ) y )
Instead of intersection, use union. But, ask if greater than y :
= 1 P(min(X1 , ..., Xn ) > y )
= 1 P(X1 > y, ..., Xn > y )
= 1 P(X1 > y )P(X2 > y )...P(Xn > y )
= 1 P(X1 > y )n
= 1 (1 P(X1 y ))n
= 1 (1 F (y ))n

z 1 f (x, ) dx x x


X = (X1 , X2 , .., Xn ), Y = (Y1 , Y2 , ..., Yn ) = r( X ) Y1 = r1 (X1 , ..., Xn )
Y2 = r2 (X1 , ..., Xn )
...
Yn = rn (X1 , ..., Xn )

Suppose that a map r has inverse. X = r1 ( Y ) P( Y A) = A g ( y )d y g ( y ) is the joint p.d.f. of Y P( Y A) = P(r( X ) A) = P( X s(A)) = s(A) f ( x )dx = A f (s( y ))|J |d y, Note: change of variable x = s( y ) Note: J = Jacobian:
s1 y1

J = det The p.d.f. of Y : f (s( y ))|J |

...
sn y1

... ... ...

s1 yn

...
sn yn

Example:
(X1 , X2 ) with joint p.d.f. f (x1 , x2 ) = {4x1 x2 , for 0 x1 1, 0 x2 1; 0, otherwise}
40

Y1 =

X1 ; Y 2 = X 1 X2 X2 Y2
= s2 (Y1 , Y2 ) Y1

But, keep in mind the intervals for non-zero values: J = det Joint p.d.f. of (Y1 , Y2 ): g (y1 , y2 ) = {4 y1 y2
y 2 2 y1 y2 2y1
3/2


Y1 = r1 (X1 , X2 ), Y2 = r2 (X1 , X2 ) inverse X1 = Y1 Y2 = s1 (Y1 , Y2 ), X2 =
y 1 2 y2 1 2 y 2 y1

1 1 1 + = 4y1 4y1 2y1

Condition implies that they are positive, absolute value is unneccessary.

2y2 y2 |J | = , if 0 y1 y2 1, and 0 y1 |y1 |

y2 1; 0 otherwise } y1

Joint p.d.f. of (Y1 , Y2 ) ** Last Lecture of Coverage on Exam 1 ** ** End of Lecture 13

41

18.05 Lecture 14 March 7, 2005 Linear transformations of random vectors: Y = r( X ) y1 x1 . . . =A . . . yn xn 0 A 1 = B A - n by n matrix, X = A1 Y if det A = x1 = b1 y1 + ... + b1n yn b11 ... b1n ... ... where b J = Jacobian = det i s are partial derivatives of si with respect to yi bn1 ... bnn det B = det A1 = 1 detA p.d.f. of Y: g (y ) = Example: X = (x1 , x2 ) with p.d.f.: f (x1 , x2 ) = {cx1 x2 , 0 x1 1, 0 x2 1; 0 otherwise} To make integral equal 1, c = 4. Y1 = X1 + 2X2 , Y2 = 2X1 + X2 ; A = Calculate the inverse functions: 1 1 X1 = (Y1 2Y2 ), X2 = (Y2 2Y1 ) 3 3 New joint function: 1 1 1 g (y1 , y2 ) = { 4( (y1 2y2 ))( (y2 2y1 )) 3 3 3 1 1 for 0 (y1 2y2 ) 1 and 0 (y2 2y1 ) 1; 3 3 0, otherwise} Simplied: f (y1 , y2 ) = { 4 (y1 2y2 )(y2 2y1 ) for 3 y1 2y2 0, 3 y2 2y1 0; 27 0, otherwise} 1 2 2 det(A) = 3 1 1 f (A1 x) |detA|

42

Linear transformation distorts the graph from a square to a parallelogram. Note: From Lecture 13, when min() and max() functions were introduced, such functions
describe engines in series (min) and parallel (max).
When in series, the length of time a device will function is equal to the minimum life
in all the engines (weakest link).
When in parallel, this is avoided as a device can function as long as one engine functions.
Review of Problems from PSet 4 for the upcoming exam: (see solutions for more details) Problem 1 - f (x) = {ce2x for x 0; 0 otherwise}
Find c by integrating over the range and setting equal to 1:
c 1 1= ce2x dx = ce2x | 0 = 1 = 1 c = 2 2 2 0 2 2x P(1 X 2) = 1 2e dx = e2 e4

Problem 3 - X U [0, 5], Y = 0 if X 1; Y = X if 1 X 3; Y = 5 if 3 < X 5 Draw the c.d.f. of Y, showing P(Y y )

Graph of Y vs. X, not the c.d.f. Write in terms of X P(X ?)

Cumulative Distribution Function

43

Cases:
y < 0 P(Y y ) = P() = 0
0 y 1 P(Y y ) = P(0 X 1) = 1/5 1 < y 3 P(Y y ) = P(X y ) = y/5 3 < y 5 P(Y y ) = P(X 3) = 3/5 y > 5 P(Y 5) = P(X 5) = 1 These values over X from 0 to give its c.d.f. Problem 8 - 0 x 3, 0 y 4 1 c.d.f. F (x, y ) = 156 xy (x2 + y ) P(1 x 2, 1 y 2) = F (2, 2) F (2, 1) F (1, 2) + F (1, 1)

Rectangle probability algorithm. or, you can nd the p.d.f. and integrate (more complicated): c.d.f. of Y: P(Y y ) = P(X , Y y ) = P(X 3, Y y )
(based on the domain of the joint c.d.f.)
1 P(Y y ) = 156 3y (9 + y ) for 0 y 4
Must also mention: y 0, P(Y y ) = 0; y 4, P(Y y ) = 1 Find the joint p.d.f. of x and y: f (x, y ) = 2 F (x, y ) 1 ={ (3x2 + 2y ), 0 x 3, 0 y 4; 0 otherwise} xy 156

P(Y X ) = ** End of Lecture 14

f (x, y )dxdy =
y x

3 0

x 0

93 1 (3x2 + 2y )dydx = 156 208

44

18.05 Lecture 15 March 9, 2005

Review for Exam 1 Practice Test 1: 1. In the set of all green envelopes, only 1 card can be green.
Similarly, in the set of red envelopes, only 1 card can be red.
Sample Space = 10! ways to put cards into envelopes, treating each separately.
You cant have two of the same color matching, as that would be 4 total.
Degrees of Freedom = which envelope to choose (5 5) and which card to select (5 5)
Then, arrange the red in green envelopes (4!), and the green in red envelopes (4!)
P= 2. Bayes formula: 0.53 0.5 P(HHH |f air)P(f air) = 3 P(HHH |f air)P(f air) + P(HHH |unf air)P(unf air) 0.5 0.5 + 1 0.5 54 (4!)2 10!

P(f air|HHH ) =

3. f1 (x) = 2xI (0 < x < 1), f2 (x) = 3x2 I (0 < x < 1) Y = 1, 2 P(Y = 1) = 0.5, P(Y = 2) = 0.5 f (x, y ) = 0.5 I (y = 1) 2xI (0 < x < 1) + 0.5 I (y = 2) 3x2 I (0 < x < 1) f (x) = 0.5 2xI (0 < x < 1) + 0.5 3x2 I (0 < x < 1) = (x + 1.5x2 )I (0 < x < 1) P(Y = 1|X =
1 f1 ( 1 1 4) 2 )= 1 1 1 4 ) f1 ( 4 ) 2 + f 2 ( 4 1 2

2 1/4 1/2 2 1/4 1/2 + 3 1/16 1/2

4. f (z ) = 2e2z I (Z > 0), T = 1/Z we know t > 0 P(T t) = P(1/Z t) = P(Z 1/t) = 1 2 F (T t) = 2e2/t 2 = 2 e2/t , t > 0 (0 otherwise) = 2e2z dz, p.d.f. f (t) = t t t 1/t T = r(Z ), Z = s(T ) =
1 T

g (t) = |s (t)|f (1/t) by change of variable.

5. f (x) = ex I (x > 0)
Joint p.d.f. f (x, y ) = ex I (x > 0)ey I (y > 0) = e(x+y) I (x > 0, y > 0)
X ;V = X +Y X +Y Step 1 - Check values for random variables: (0 < V < ), (0 < U < 1) Step 2 - Account for change of variables: X = U V ; Y = V U V = V (1 U ) Jacobian: U= J = det
X U Y U X V Y V

V -V

U = V (1 U ) + U V = V 1-U

45

g (u, v ) = f (uv, v (1 u)) |v |I (uv > 0, v (1 u) > 0) = ev vI (v > 0, 0 < u < 1) Problem Set #5 (practice pset, see solutions for details): p. 175 #4
f (x1 , x2 ) = x1 + x2 I (0 < x1 < 1, 0 < x2 < 1)
Y = X1 X2 (0 < Y < 1)
First look at the c.d.f.: P(Y y ) = P(X1 X2 y ) = {x1 x2 y}={x2 y/x1 } f (x1 , x2 )dx1 dx2

Due to the complexity of the limits, you can integrate the area in pieces, or you can nd the complement, which is easier with only 1 set of limits. f (x1 , x2 ) = 1 =1
1 y

{x1 x2 >y }

1 y/x1

(x1 + x2 )dx2 dx1 = 1 (1 y )2 = 2y y 2

f (x1 , x2 ) = 0 for y < 0; 2y y 2 for 0 < y < 1; 1 for y > 1. p.d.f.: g (y ) = { P(Y y ) = 2(1 y ), y (0, 1); 0, otherwise.} y

p. 164 #3 f (x) = { x 2 , 0 x 2; 0 otherwise}


Y = X (2 X ), nd the p.d.f. of Y.
First, nd the limits of Y, notice that it is not a one-to-one function.

Y varies from 0 to 1 as X varies from 0 to 2. Look at the c.d.f.: P(Y y ) = P(X (2 X ) y ) = P(X 2 2X + 1 1 y ) = P((1 X )2 1 y ) = = P(|1 X |
1 y ) = P(1 X 46

1 y or 1 X 1 y) =

= P(X 1 = P(0 X 1 ={
1 1y

x dx + 2

1 y or X 1 +

1 y ) + P(1 +

1+ 1y

Take derivative to get the p.d.f.:


x dx = 1 1 y, 0 y 1; 0, y < 0; 1, y > 1}
2


1 y X 2) =


1 y) =

1
g (y ) = , 0 y 1; 0, otherwise. 4 1y ** End of Lecture 15

47

18.05 Lecture 16 March 14, 2005

Expectation of a random variable. X - random variable


roll a die - average value = 3.5
ip a coin - average value = 0.5 if heads = 0 and tails = 1
Denition: If X is discrete, p.f. f (x) = p.f. of X ,
Then, expectation of X is EX = xf (x)
For a die:
f(x) 1 1/6
1 6

2 1/6

3 1/6
1 6

4 1/6

5 1/6

6 1/6

E=1

+ ... + 6

= 3.5

Another way to think about it:

Consider each pi as a weight on a horizontal bar. Expectation = center of gravity on the bar. If X - continuous, f (x) = p.d.f. then E(X ) = xf (x)dx 1 Example: X - uniform on [0, 1], E(X ) = 0 (x 1)dx = 1/2

Consider Y = r(x), then EY = x r(x)f (x) or r(x)f (x)dx p.f. g (y ) = {x:y=r(x)} f (x)
E(Y ) = y yg (y ) = y y {x:y=r(x)} f (x) = y {x:r(x)=y} yf (x) = y {x:r(x)=y} r(x)f (x)
then, can drop y since no reference to y : E(Y ) = x r(x)f (x) Example: X - uniform on [0, 1] 1 EX 2 = 0 X 2 1dx = 1/3

X1 , ..., Xn - random variables with joint p.f. or p.d.f. f (x1 ...xn ) E(r(X1 , ..., Xn )) = r(x1 , ..., xn )f (x1 , ..., xn )dx1 ...dxn Example: Cauchy distribution p.d.f.:

f (x) = Check validity of integration:


1 (1 + x2 )

1 1 dx = tan1 (x)| = 1 (1 + x2 )

But, the expectation is undened:

48

1 dx = 2 E|X | = |x| (1 + x2 ) Note: Expectation of X is dened if E|X | < Properties of Expectation:

1 x = ln(1 + x2 )| 0 = (1 + x2 ) 2

1) E(aX + b) = aE(X ) + b Proof: E(aX + b) = (aX + b)f (x)dx = a xf (x)dx + b f (x)dx = aE(X ) + b 2) E(X1 + X2 + ... + X ) = EX1 + EX2 + ... + EXn n Proof: E ( X + X ) = (x + x2 )f (x1 , x2 )dx1 dx2 = 1 2 1 f ( x , x ) dx dx + = x 1 1 2 1 2 2 f (x1 , x2 )dx1 dx2 = x f (x1 , x2 )dx1 dx2 = dx + x = x1 f (x1 , x2 )dx 2 1 2 = x1 f1 (x1 )dx1 + x2 f2 (x2 )dx2 = EX1 + EX2

Example: Toss a coin n times, T on i: Xi = 1; H on i: Xi = 0. Number of tails = X1 + X2 + ... + Xn E(number of tails) = E(X1 + X2 + ... + Xn ) = EX1 + EX2 + ... + EXn EXi = 1 P(Xi = 1) + 0 P(Xi = 0) = p, probability of tails Expectation = p + p + ... + p = np This is natural, because you expect np of n for p probability.

k nk Y = Number oftails, P(Y = k ) = n k p (1 p) n n k nk = np E(Y ) = k=0 k k p (1 p) More dicult to see though denition, better to use sum of expectations method. Two functions, h and g, such that h(x) g (x), for all x R Then, E(h(X )) E(g (X )) E(g (X ) h(X )) 0 (g (x) h(x)) f (x)dx 0 You know that f (x) 0, therefore g (x) h(x) must also be 0 If a X b a E(X ) E(b) b E(I (X A)) = 1 P(X A) + 0 P(X / A), for A being a set on R / A) = 1 P(X A) Y = I (X A) = {1, with probability P(X A); 0, with probability P(X E(I (X A)) = P(X A)} In this case, think of the expectation as an indicator as to whether the event happens. Chebyshevs Inequality Suppose that X 0, consider t > 0, then: 1 E(X ) t Proof: E(X ) = E(X )I (X < t) + E(X )I (X t) E(X )I (X t) E(t)I (X t) = tP(X t) P(X t) ** End of Lecture 16

49

18.05 Lecture 17 March 16, 2005

Properties of Expectation. Law of Large Numbers. E(X1 + ... + Xn ) = EX1 + ... + EXn Matching Problem (n envelopes, n letters)
Expected number of letters in correct envelopes?
Y - number of matches
Xi = {1, letter i matches; 0, otherwise}, Y = X1 + ... + Xn
E(Y ) = EX1 + ... + EXn , but EXi = 1 P(Xi = 1) + 0 P(Xi = 0) = P(Xi = 1) = Therefore, expected match = 1: 1 =1 n If X1 , ..., Xn are independent, then E(X1 ... Xn ) = EX1 ... EXn As with the sum property, we will prove for two variables: EX1 X2 = EX1 EX2
joint p.f. or p.d.f.: f (x1 , x2 ) = f1 (x1 )f x2 )
2 ( X = x x f ( x , x ) dx dx = x1 x2 f1 (x1 )f2 (x2 )dx1 dx2 =
EX 1 2 1 2 1 2 1 2 = f1 (x1 )x1 dx1 f2 (x2 )x2 dx2 = EX1 EX2
E(Y ) = n

1
n

2 X1 , X2 , X3 - independent, uniform on [0, 1]. Find EX1 (X2 X3 )2 .


2 2 = EX1 E(X2 X3 ) by independence.
2 2 2 2 2 2 2EX2 X3 )
+ EX3 (EX2 ) = EX1 2X2 X3 + X3 E(X2 = EX1 2 2 2 By independence of X2 , X3 ; = EX1 (EX2 + EX3 2EX2 EX3 )
1 1 2 EX1 = 0 x 1dx = 1/2, EX1 = 0 x2 1dx = 1/3 (same for X2 and X3 ) 1 1 1 1 1 1 2 EX1 (X2 X3 )2 = 3 (3 + 3 2( 2 )( 2 )) = 18

For discrete random variables, X takes values 0, 1, 2, 3, ... E(X ) = n=0 nP(x = n) for n = 0, contribution = 0; for n = 1, P(1); for n = 2, 2P(2); for n = 3, 3P(3); ... E(X ) = n=1 P(X n) Example: X - number of trials until success.
P(success) = p
P(f ailure) = 1 p = q
E(X ) =

n=1

P(X n) =

n=1

Formula based upon reasoning that the rst n - 1 times resulted in failure.
Much easier than the original formula:
n1
n P ( X = n ) = p n=0 n=1 n(1 p) Variance: Denition: Var(X ) = E(X E(X ))2 = 2 (X )
Measure of the deviation from the expectation (mean).
Var(X ) = (X E(X ))2 f (x)dx - moment of inertia.
50

(1 p)n1 = 1 + q + q 2 + ... =

1 1 = 1q p

Standard Deviation:
(X ) = Var(X ) Var(aX + b) = a2 Var(X ) (aX + b) = |a| (X ) Proof by denition:
E((aX + b) E(aX + b))2 = E(aX + b aE(X ) b)2 = a2 E(X E(X ))2 = a2 Var(X )
Property: Var(X ) = EX 2 (E(X ))2
Proof:
Var(X ) = E(X E(X ))2 = E(X 2 2X E(X ) + (E(X ))2 ) =
EX 2 2E(X ) E(X ) + (E(X ))2 = E(X )2 (E(X ))2
Example: X U [0, 1] EX =

(X center of gravity )2 mx

1 0

x 1dx =

1 , EX 2 = 2

1 0

x2 1dx =

1 3

1 1 1 ( )2 = 3 2 12 If X1 , ..., Xn are independent, then Var(X1 + ... + Xn ) = Var(X1 ) + ... + Var(Xn ) Proof: Var(X ) = Var(X1 + X2 ) = E(X1 + X2 E(X1 + X2 ))2 = E((X1 EX1 ) + (X2 EX2 ))2 = = E(X1 EX1 )2 + E(X2 EX2 )2 + 2E(X1 EX1 )(X2 EX2 ) = = Var(X1 ) + Var(X2 ) + 2E(X1 EX1 ) E(X2 EX2 ) By independence of X1 and X2 : = Var(X1 ) + Var(X2 )
2 Property: Var(a1 X1 + ... + an Xn + b) = a2 1 Var(X1 ) + ... + an Var(Xn )

k nk Example: Binomial distribution - B (n, p), P(X = k ) = n k p (1 p) X = X1 + ... + Xn , Xi = {1, Trial i is success ; 0, Trial i is failure.} Var(X ) = n i=1 Var(Xi ) Var(Xi ) = EXi2 (EXi )2 , EXi = 1(p) + 0(1 p) = p; EXi2 = 12 (p) + 02 (1 p) = p. Var(Xi ) = p p2 = p(1 p) Var(X ) = np(1 p) = npq Law of Large Numbers: X1 , X2 , ..., Xn - independent, identically distributed. X1 + ... + Xn n EX1 n Take > 0 - but small, P(|Sn EX1 | > ) 0 as n By Chebyshevs Inequality: Sn = 51

P((Sn EX1 )2 > 2 ) = P(Y > M )

1 EY = M

1 1 X1 + ... + Xn 1 1 EX1 )2 = 2 Var( (X1 + ... + Xn )) = E(Sn EX1 )2 = 2 E( 2 n n = for large n. ** End of Lecture 17 1 2 n2 (Var(X1 ) + ... + Var(Xn )) = nVar(X1 ) Var(X1 ) = 0 2 2 n n2

52

18.05 Lecture 18
March 18, 2005

Law of Large Numbers. X1 , ..., Xn - i.i.d. (independent, identically distributed) X1 + ... + Xn as n , EX1 n Can be used for functions of random variables as well:
Consider Yi = r(X1 ) - i.i.d.
x= r(X1 ) + ... + r(Xn ) as n ,
EY1 = Er(X1 ) n Relevance for Statistics: Data points xi , as n
, The average converges to the unknown expected value of the distribution which often contains a lot (or all) of information about the distribution. Y = Example: Conduct a poll for 2 candidates:
p [0, 1] is what were looking for
Poll: choose n people randomly: X1 , ..., Xn
P(Xi = 1) = p
P(Xi = 0) = 1 p
EX1 = 1(p) + 0(1 p) = p X1 + ... + Xn as n n

Other characteristics of distribution:


Moments of the distribution: for each integer, k 1, kth moment EX k
kth moment is dened only if E|X |k <
Moment generating function: consider a parameter y R.
and dene (t) = EetX where X is a random variable.
(t) - m.g.f. of X
Taylor series of (t) =
k (0)

k=0

k!

tk

Taylor series of EetX = E EX = (0)


k k

k=0

tk (tX )k = EX k k! k!
k=0

Example: Exponential distribution E () with p.d.f. f (x) = {ex , x 0; 0, x < 0}


Compute the moments:
EX k = 0 xk ex dx is a dicult integral.
Use the m.g.f.: tX tx x (t) = Ee = e e dx = e(t)x dx
0 0

(dened if t < to keep the integral nite)

53

Recall the formula for geometric series:

t tk 1 e(t)x |0 = 1 = = = ( )k = EX k t t t 1 t/ k!
k=0

xk =

k=0

1 when k < 1 1x

1 Exk k! = Exk = k k k! The moment generating function completely describes the distribution.
Exk = xk f (x)dx
If f(x) unknown, get a system of equations for f unique distribution for a set of moments.
M.g.f. uniquely determines the distribution. X1 , X2 from E(), Y = X1 + X2 .
To nd distribution of sum, we could use the convolution formula,
but, it is easier to nd the m.g.f. of sum Y :
EetY = Eet(X1 +X2 ) = EetX1 etX2 = EetX1 EetX2 Moment generating function of each: t ( Consider the exponential distribution: E () X1 , EX = 2 ) t

For the sum:

1 , f (x) = {ex , x 0; 0, x < 0} This distribution describes the life span of quality products. = E1 X , if small, life span is large. Median: m R such that: P(X m) 1 1 , P(X m) 2 2

(There are times in discrete distributions when the probability cannot ever equal exactly 0.5) When you exclude the point itself: P(X > m) 1 2 P(X m) + P(X > m) = 1 The median is not always uniquely dened. Can be an interval where no point masses occur.

54

For a continuous distribution, P(X ≤ m) = P(X ≥ m) = 1/2, but there are still cases in which the median is not unique.

The average measures the center of gravity and is skewed easily by outliers;
it will be pulled toward the tail of a p.d.f. relative to the median.

Mean: find a ∈ R such that E(X - a)^2 is minimized over a.
d/da E(X - a)^2 = -2E(X - a) = 0, so EX - a = 0, i.e. a = EX:
the squared deviation is minimized by the expectation.

Median: find a ∈ R such that E|X - a| is minimized. Claim: E|X - a| ≥ E|X - m|, where m is a median.
Equivalently, E(|X - a| - |X - m|) ≥ 0, i.e. ∫ (|x - a| - |x - m|) f(x) dx ≥ 0.

Suppose a > m and look at |x - a| - |x - m| in each region:
1) x ≤ m:  (a - x) - (m - x) = a - m
2) x ≥ a:  (x - a) - (x - m) = m - a = -(a - m)
3) m ≤ x ≤ a:  (a - x) - (x - m) = a + m - 2x ≥ -(a - m)

The integral can now be bounded:
∫ (|x - a| - |x - m|) f(x) dx ≥ (a - m) ∫_{-∞}^m f(x) dx - (a - m) ∫_m^∞ f(x) dx
= (a - m)( P(X ≤ m) - P(X > m) ) ≥ 0,

since both (a - m) and the difference of probabilities are non-negative (P(X ≤ m) ≥ 1/2 ≥ P(X > m)); the case a < m is similar.
The absolute deviation is minimized by the median.
** End of Lecture 18

18.05 Lecture 19 March 28, 2005

Covariance and Correlation. Consider 2 random variables X, Y with σ_x^2 = Var(X), σ_y^2 = Var(Y).

Definition 1: the covariance of X and Y is
Cov(X, Y) = E(X - EX)(Y - EY).
It is positive when the deviations tend to be both high or both low.

Definition 2: the correlation of X and Y is
ρ(X, Y) = Cov(X, Y)/(σ_x σ_y) = Cov(X, Y)/√(Var(X)Var(Y)).
The scaling is thus removed from the covariance.

Cov(X, Y) = E(XY - X·EY - Y·EX + EX·EY) = E(XY) - EX·EY - EY·EX + EX·EY = E(XY) - EX·EY.

Property 1: if the variables are independent, then Cov(X, Y) = 0 (they are uncorrelated):
Cov(X, Y) = E(XY) - EX·EY = EX·EY - EX·EY = 0.

Example: X takes values {-1, 0, 1} with equal probabilities {1/3, 1/3, 1/3}, and Y = X^2.
X and Y are dependent, but they are uncorrelated:
Cov(X, Y) = E(X·X^2) - EX·EX^2 = EX^3 - EX·EX^2, and EX = 0, EX^3 = EX = 0,
so the covariance is 0, yet X and Y are still dependent.
Also, the correlation is always between -1 and 1.
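A small simulation sketch (added for illustration, not from the notes) of the example above: X uniform on {-1, 0, 1} and Y = X^2 have sample covariance and correlation near 0, even though Y is a deterministic function of X.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.choice([-1, 0, 1], size=100_000)   # X with equal probabilities 1/3
y = x ** 2                                 # Y = X^2, completely determined by X

cov = np.mean((x - x.mean()) * (y - y.mean()))
corr = np.corrcoef(x, y)[0, 1]
print(cov, corr)   # both close to 0, yet X and Y are clearly dependent
```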

Cauchy-Schwarz Inequality: (EXY)^2 ≤ EX^2 EY^2, also known as the dot-product inequality |(v, u)| ≤ |v||u|.
To prove it for expectations, consider φ(t) = E(tX + Y)^2 = t^2 EX^2 + 2t EXY + EY^2 ≥ 0.
A quadratic in t that is always non-negative has no two distinct roots, so its discriminant satisfies
D = 4(EXY)^2 - 4 EX^2 EY^2 ≤ 0.
Equality is possible only if φ(t) = 0 at some point t: E(tX + Y)^2 = 0 means tX + Y = 0, i.e. Y = -tX, linear dependence.

Applying this to X - EX and Y - EY:
(Cov(X, Y))^2 = (E(X - EX)(Y - EY))^2 ≤ E(X - EX)^2 E(Y - EY)^2 = σ_x^2 σ_y^2, so |Cov(X, Y)| ≤ σ_x σ_y
and |ρ(X, Y)| = |Cov(X, Y)|/(σ_x σ_y) ≤ 1.

Property 2: -1 ≤ ρ(X, Y) ≤ 1. When is the correlation equal to 1 or -1?
|ρ(X, Y)| = 1 only when Y - EY = c(X - EX), i.e. Y = aX + b for some constants a, b
(this occurs when the data points lie on a straight line). If Y = aX + b:
ρ(X, Y) = (E(aX^2 + bX) - EX·E(aX + b)) / (σ_x · |a|σ_x) = a Var(X)/(|a| Var(X)) = sign(a).
If a is positive the correlation is 1: X and Y are completely positively correlated.
If a is negative the correlation is -1: X and Y are completely negatively correlated.
Looking at the distribution of points on Y = X^2, there is NO linear dependence, so the correlation is 0.
However, if Y = X^2 + cX, then some linear dependence is introduced in the skewed graph.

Property 3: Var(X + Y) = E(X + Y - EX - EY)^2 = E((X - EX) + (Y - EY))^2
= E(X - EX)^2 + 2E(X - EX)(Y - EY) + E(Y - EY)^2 = Var(X) + Var(Y) + 2Cov(X, Y).

Conditional Expectation: (X, Y) - a random pair.
What is the average value of Y given that you know X?
If f(x, y) is the joint p.d.f. or p.f., then f(y|x) is the conditional p.d.f. or p.f., and the conditional expectation is
E(Y | X = x) = ∫ y f(y|x) dy  or  Σ y f(y|x).
E(Y | X) = h(X) = ∫ y f(y|X) dy is a function of X, hence still a random variable.

Property 4: E(E(Y|X)) = EY.
Proof: E(E(Y|X)) = E(h(X)) = ∫ h(x) f(x) dx = ∫ ( ∫ y f(y|x) dy ) f(x) dx = ∫∫ y f(y|x) f(x) dy dx
= ∫∫ y f(x, y) dy dx = ∫ y ( ∫ f(x, y) dx ) dy = ∫ y f(y) dy = EY.

Property 5: E(a(X) Y | X) = a(X) E(Y | X). See text for the proof.

Summary of Common Distributions:
1) Bernoulli distribution B(p), p ∈ [0, 1] - parameter. Possible values X ∈ {0, 1}; f(x) = p^x (1-p)^{1-x}, P(1) = p, P(0) = 1 - p.
E(X) = p, Var(X) = p(1 - p).
2) Binomial distribution B(n, p) - n repetitions of Bernoulli. X ∈ {0, 1, ..., n}; f(x) = C(n, x) p^x (1-p)^{n-x}.
E(X) = np, Var(X) = np(1 - p).
3) Exponential distribution E(λ), parameter λ > 0. X ∈ [0, ∞), p.d.f. f(x) = { λe^{-λx}, x ≥ 0; 0, otherwise }.
EX = 1/λ, EX^k = k!/λ^k, Var(X) = 2/λ^2 - 1/λ^2 = 1/λ^2.
** End of Lecture 19

18.05 Lecture 20 March 30, 2005

5.4 Poisson Distribution


Π(λ), parameter λ > 0; the random variable takes values {0, 1, 2, ...} with p.f.
f(x) = P(X = x) = (λ^x / x!) e^{-λ}.

Moment generating function:
φ(t) = E e^{tX} = Σ_{x≥0} e^{tx} (λ^x/x!) e^{-λ} = e^{-λ} Σ_{x≥0} (λe^t)^x / x! = e^{-λ} e^{λe^t} = e^{λ(e^t - 1)}.

EX^k = φ^{(k)}(0):
EX = φ'(0) = e^{λ(e^t - 1)} λe^t |_{t=0} = λ,
EX^2 = φ''(0) = (λ e^{λ(e^t - 1) + t})' |_{t=0} = λ e^{λ(e^t - 1) + t} (λe^t + 1) |_{t=0} = λ(λ + 1),
Var(X) = EX^2 - (EX)^2 = λ(λ + 1) - λ^2 = λ.

If X_1 ~ Π(λ_1), X_2 ~ Π(λ_2), ..., X_n ~ Π(λ_n), all independent, let Y = X_1 + ... + X_n and find the moment generating function of Y:
φ(t) = E e^{tY} = E e^{t(X_1 + ... + X_n)} = E e^{tX_1} ··· e^{tX_n}.
By independence this equals E e^{tX_1} E e^{tX_2} ··· E e^{tX_n} = e^{λ_1(e^t - 1)} e^{λ_2(e^t - 1)} ··· e^{λ_n(e^t - 1)} = e^{(λ_1 + λ_2 + ... + λ_n)(e^t - 1)},
which is the moment generating function of Π(λ_1 + ... + λ_n). So Y ~ Π(λ_1 + ... + λ_n).
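A short check of this fact (my addition): compare the empirical distribution of X_1 + X_2, with X_1 ~ Π(2) and X_2 ~ Π(3), against the Π(5) p.f.; the parameter values 2 and 3 are arbitrary.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
y = rng.poisson(2.0, 200_000) + rng.poisson(3.0, 200_000)   # X_1 + X_2

for k in range(8):
    # empirical frequency of the sum vs. Poisson(5) probability
    print(k, round((y == k).mean(), 4), round(poisson.pmf(k, 5.0), 4))
```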

If the summands are dependent the result can fail; for example,
X_1 + X_1 = 2X_1 ∈ {0, 2, 4, ...} skips the odd numbers, so it is not Poisson.
Approximation of the Binomial: X_1, ..., X_n ~ B(p), P(X_i = 1) = p, P(X_i = 0) = 1 - p,
Y = X_1 + ... + X_n ~ B(n, p), P(Y = k) = C(n, k) p^k (1-p)^{n-k}.
Suppose p is very small and n is large; let λ = np (e.g. p = 1/100, n = 100, so λ = np = 1). Then
C(n, k) p^k (1-p)^{n-k} = (n!/(k!(n-k)!)) (λ/n)^k (1 - λ/n)^{n-k}.
Many factors can be simplified when n is large, using lim_{n→∞} (1 + x/n)^n = e^x.

Simplify the left fraction:
n!/(k!(n-k)!) · 1/n^k = (1/k!) (n-k+1)(n-k+2)···n / n^k = (1/k!)(1 - (k-1)/n)(1 - (k-2)/n)···(1 - 1/n) → 1/k!.

So, in the end:
C(n, k) p^k (1-p)^{n-k} → (λ^k/k!) e^{-λ},
and the Poisson distribution with parameter λ = np results.

Example: B(100, 1/100) ≈ Π(1); P(Y = 2) ≈ (1^2/2!) e^{-1} = 1/(2e), very close to the actual binomial probability.
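The quality of the approximation is easy to check numerically; this comparison (not from the notes) uses scipy for B(100, 1/100) versus Π(1).

```python
from scipy.stats import binom, poisson

n, p = 100, 0.01
lam = n * p   # = 1

for k in range(5):
    print(k, round(binom.pmf(k, n, p), 5), round(poisson.pmf(k, lam), 5))
# e.g. for k = 2 the two values are about 0.185 and 0.184
```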

Counting Processes: wrong connections to a phone number, number of typos on a page of a book,
number of bacteria in a region of a plate.
Properties:
1) count(S) - the number of random objects in a region S ⊆ T, with E(count(S)) = λ|S|, where |S| is the size of S
(property of proportionality);
2) counts on disjoint regions are independent;
3) P(count(S) ≥ 2) is very small if the size of the region S is small.
Properties 1, 2 and 3 lead to count(S) ~ Π(λ|S|), where λ is the intensity parameter.

A region [0, T] is split into n sections, each of size |T|/n, with counts X_1, ..., X_n.
By 2), X_1, ..., X_n are independent. By 3), P(X_i ≥ 2) is small when n is large. By 1),
EX_i = λ|T|/n = 0·P(X_i = 0) + 1·P(X_i = 1) + 2·P(X_i = 2) + ..., and the terms for values ≥ 2 contribute very little,
so P(X_i = 1) ≈ λ|T|/n and P(X_i = 0) ≈ 1 - λ|T|/n.
Therefore P(count(T) = k) = P(X_1 + ... + X_n = k) ≈ B(n, λ|T|/n) ≈ Π(λ|T|), i.e. ≈ (λ|T|)^k e^{-λ|T|} / k!.

5.6 Normal Distribution

To find the normalizing constant, compute I = ∫ e^{-x^2/2} dx by changing variables to facilitate integration:
I^2 = ( ∫ e^{-x^2/2} dx )^2 = ∫ e^{-x^2/2} dx ∫ e^{-y^2/2} dy = ∫∫ e^{-(x^2 + y^2)/2} dx dy.
Convert to polar coordinates:
= ∫_0^{2π} ∫_0^∞ e^{-r^2/2} r dr dθ = 2π ∫_0^∞ e^{-r^2/2} r dr = 2π ∫_0^∞ e^{-r^2/2} d(r^2/2) = 2π ∫_0^∞ e^{-t} dt = 2π.

So I = √(2π), and the original integral normalizes to 1:
∫ (1/√(2π)) e^{-x^2/2} dx = 1.

p.d.f.: f(x) = (1/√(2π)) e^{-x^2/2}, the standard normal distribution N(0, 1).
** End of Lecture 20

18.05 Lecture 21 April 1, 2005

Normal Distribution. Standard normal distribution N(0, 1), p.d.f.:
f(x) = (1/√(2π)) e^{-x^2/2}.
m.g.f.: φ(t) = E(e^{tX}) = e^{t^2/2}.
Proof - simplify the integral by completing the square:
φ(t) = ∫ e^{tx} (1/√(2π)) e^{-x^2/2} dx = (1/√(2π)) ∫ e^{tx - x^2/2} dx
= (1/√(2π)) ∫ e^{t^2/2 - t^2/2 + tx - x^2/2} dx = e^{t^2/2} (1/√(2π)) ∫ e^{-(x - t)^2/2} dx.
Then perform the change of variables y = x - t:
= e^{t^2/2} (1/√(2π)) ∫ e^{-y^2/2} dy = e^{t^2/2} ∫ f(y) dy = e^{t^2/2}.

Use the m.g.f. to find the expectation of X and X^2, and therefore Var(X):
E(X) = φ'(0) = t e^{t^2/2} |_{t=0} = 0;  E(X^2) = φ''(0) = e^{t^2/2} t^2 + e^{t^2/2} |_{t=0} = 1;  Var(X) = 1.

Consider X ~ N(0, 1) and Y = σX + μ; find the distribution of Y:
P(Y ≤ y) = P(σX + μ ≤ y) = P(X ≤ (y - μ)/σ) = ∫_{-∞}^{(y-μ)/σ} (1/√(2π)) e^{-x^2/2} dx.
p.d.f. of Y:
f(y) = d/dy P(Y ≤ y) = (1/(σ√(2π))) e^{-(y-μ)^2/(2σ^2)},  i.e.  Y ~ N(μ, σ).

EY = E(σX + μ) = σ(0) + μ(1) = μ,
E(Y - μ)^2 = E(σX + μ - μ)^2 = σ^2 E(X^2) = σ^2, the variance of N(μ, σ); σ = √Var(Y) is the standard deviation.

Compared to the standard normal N(0, 1), the density of N(μ, σ) has its peak at the new mean μ, and the points of inflection occur a distance σ away from μ.
Moment Generating Function of N(μ, σ): with Y = σX + μ,
φ(t) = E e^{tY} = E e^{t(σX + μ)} = e^{tμ} E e^{(tσ)X} = e^{tμ} e^{(tσ)^2/2} = e^{μt + σ^2 t^2/2}.

Note: if X_1 ~ N(μ_1, σ_1), ..., X_n ~ N(μ_n, σ_n) are independent and Y = X_1 + ... + X_n, use the moment generating function to find the distribution of Y:
E e^{tY} = E e^{t(X_1 + ... + X_n)} = E e^{tX_1} ··· e^{tX_n} = E e^{tX_1} ··· E e^{tX_n}
= e^{μ_1 t + σ_1^2 t^2/2} ··· e^{μ_n t + σ_n^2 t^2/2} = e^{(Σ μ_i) t + (Σ σ_i^2) t^2/2},
so Y ~ N( Σ μ_i, √(Σ σ_i^2) ).
The sum of independent normal random variables is still normal!
This is not always true for other distributions (such as the exponential).

Example: X ~ N(μ, σ), Y = cX; the distribution is still normal:
Y = c(σ N(0, 1) + μ) = (cσ) N(0, 1) + (cμ), so Y ~ cN(μ, σ) = N(cμ, cσ).

Example: Y ~ N(μ, σ).
P(a ≤ Y ≤ b) = P(a ≤ σX + μ ≤ b) = P( (a - μ)/σ ≤ X ≤ (b - μ)/σ ).
This indicates the new limits for the standard normal.

Example: Suppose that the heights of women are X ~ N(65, 1) and of men Y ~ N(68, 2).
P(randomly chosen woman is taller than randomly chosen man) = P(X > Y) = P(X - Y > 0).
Z = X - Y ~ N(65 - 68, √(1^2 + 2^2)) = N(-3, √5).
P(Z > 0) = P( (Z + 3)/√5 > 3/√5 ) = P(standard normal > 3/√5 ≈ 1.342) ≈ 0.09.
Probability values are tabulated in the back of the textbook.
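A one-line numerical check of the heights example (added here, not in the original notes), using scipy's standard normal; recall the notes' convention N(μ, σ) takes the standard deviation as the second parameter.

```python
from math import sqrt
import numpy as np
from scipy.stats import norm

# Z = X - Y ~ N(-3, sqrt(5)); we want P(Z > 0) = P(standard normal > 3/sqrt(5))
print(norm.sf(3 / sqrt(5)))            # about 0.09

# the same answer by simulation
rng = np.random.default_rng(4)
x = rng.normal(65, 1, 1_000_000)       # heights of women, N(65, 1)
y = rng.normal(68, 2, 1_000_000)       # heights of men, N(68, 2)
print((x > y).mean())                  # also about 0.09
```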

Central Limit Theorem.
Flip 100 coins: you expect 50 tails, and something in the range 45-55 is considered typical.
Flip 10,000 coins: you expect 5,000 tails, and the deviation can be larger in absolute terms, perhaps 4,950-5,050 is typical.
Here X_i = {1 (tail); 0 (head)}, (X_1 + ... + X_n)/n = (number of tails)/n → E(X_1) = 1/2 by the LLN, and Var(X_1) = (1/2)(1 - 1/2) = 1/4, σ = 1/2.
But how do you describe the deviations?

X_1, X_2, ..., X_n are independent with some distribution P, μ = EX_1, σ^2 = Var(X_1), x̄ = (1/n) Σ_{i=1}^n X_i → EX_1.
The deviation x̄ - μ is on the order of 1/√n, and √n (x̄ - μ)/σ behaves like a standard normal:
√n (x̄ - μ)/σ is approximately N(0, 1) for large n, i.e.
P( √n (x̄ - μ)/σ ≤ x ) → P(standard normal ≤ x) = N(0, 1)(-∞, x).
This is useful in statistics to describe outcomes as likely or unlikely in an experiment.

P(number of tails ≤ 4,900) = P(X_1 + ... + X_10,000 ≤ 4,900) = P(x̄ ≤ 0.49)
= P( √10,000 (x̄ - 1/2)/(1/2) ≤ √10,000 (0.49 - 0.5)/(1/2) ) ≈ N(0, 1)( -∞, 100(-0.01)/(1/2) = -2 ) ≈ 0.0228.
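The normal approximation above can be compared with the exact binomial probability; this check is my addition and uses scipy.

```python
from scipy.stats import binom, norm

n, p = 10_000, 0.5
print(binom.cdf(4_900, n, p))   # exact P(number of tails <= 4900), about 0.023
print(norm.cdf(-2))             # CLT approximation, about 0.0228
```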

Tabulated values give, for positive x, the area to the left; to look up -2 in the table, find the value for 2 and take the complement. ** End of Lecture 21


18.05 Lecture 22 April 4, 2005

Central Limit Theorem. X_1, ..., X_n i.i.d. (independent, identically distributed), x̄ = (1/n)(X_1 + ... + X_n), μ = EX_1, σ^2 = Var(X_1). Then
√n (x̄ - μ)/σ → N(0, 1) as n → ∞.
You can use the knowledge of the standard normal distribution to describe your data. This refines the law of large numbers:
it tells you exactly how much the average value and the expected value should differ.
Write
√n (x̄ - μ)/σ = (1/√n)( (X_1 - μ)/σ + ... + (X_n - μ)/σ ) = (1/√n)(Z_1 + ... + Z_n),  where Z_i = (X_i - μ)/σ, E(Z_i) = 0, Var(Z_i) = 1.
Consider the m.g.f. and see that it is very similar to that of the standard normal:
E e^{t(Z_1 + ... + Z_n)/√n} = E e^{tZ_1/√n} ··· e^{tZ_n/√n} = ( E e^{tZ_1/√n} )^n.
E e^{tZ_1} = 1 + t EZ_1 + (t^2/2) EZ_1^2 + (t^3/6) EZ_1^3 + ... = 1 + t^2/2 + (t^3/6) EZ_1^3 + ...,
so E e^{(t/√n)Z_1} = 1 + t^2/(2n) + (t^3/(6 n^{3/2})) EZ_1^3 + ... ≈ 1 + t^2/(2n).
Therefore ( E e^{tZ_1/√n} )^n ≈ (1 + t^2/(2n))^n → e^{t^2/2}, the m.g.f. of the standard normal distribution!

Gamma Distribution. Gamma function: for α > 0, β > 0,

Γ(α) = ∫_0^∞ x^{α-1} e^{-x} dx.

Normalizing gives the p.d.f. of the Gamma distribution with parameter α:
1 = ∫_0^∞ (1/Γ(α)) x^{α-1} e^{-x} dx,  f(x) = { (1/Γ(α)) x^{α-1} e^{-x}, x ≥ 0; 0, x < 0 }.
Change of variable x = βy, to stretch the function:
1 = ∫_0^∞ (β^α/Γ(α)) y^{α-1} e^{-βy} dy.

p.d.f. of the Gamma distribution Γ(α, β):  f(x | α, β) = { (β^α/Γ(α)) x^{α-1} e^{-βx}, x ≥ 0; 0, x < 0 }.


Properties of the Gamma Function. Integrate by parts:
Γ(α) = ∫_0^∞ x^{α-1} e^{-x} dx = -∫_0^∞ x^{α-1} d(e^{-x}) = -x^{α-1} e^{-x} |_0^∞ + (α - 1) ∫_0^∞ x^{α-2} e^{-x} dx = 0 + (α - 1) Γ(α - 1).

In summary, Property 1: Γ(α) = (α - 1) Γ(α - 1).

You can expand Property 1 as follows, for an integer n:
Γ(n) = (n-1) Γ(n-1) = (n-1)(n-2) Γ(n-2) = ... = (n-1)(n-2)···(1) Γ(1) = (n-1)! Γ(1),  and  Γ(1) = ∫_0^∞ e^{-x} dx = 1.

In summary, Property 2: Γ(n) = (n-1)!.

Moments of the Gamma Distribution: X ~ Γ(α, β).
EX^k = ∫_0^∞ x^k (β^α/Γ(α)) x^{α-1} e^{-βx} dx = (β^α/Γ(α)) ∫_0^∞ x^{(α+k)-1} e^{-βx} dx.
Make this integral into a density to simplify:
= (β^α/Γ(α)) · (Γ(α+k)/β^{α+k}) ∫_0^∞ (β^{α+k}/Γ(α+k)) x^{(α+k)-1} e^{-βx} dx = (β^α/Γ(α)) · Γ(α+k)/β^{α+k},
since the integral is just over the Gamma density with parameters (α + k, β). Therefore
EX^k = Γ(α+k)/(Γ(α) β^k) = (α+k-1)(α+k-2)···α / β^k.

For k = 1:  E(X) = α/β.
For k = 2:  E(X^2) = α(α+1)/β^2, so Var(X) = α(α+1)/β^2 - α^2/β^2 = α/β^2.

Example: If the mean = 50 and the variance = 1 are given for a Gamma distribution,
solve α/β = 50 and α/β^2 = 1 to get α = 2500 and β = 50, which characterizes the distribution.
If the mean = 50 and variance = 1 are given for a Gamma distribution, Solve for = 2500 and = 50 to characterize the distribution. Beta Distribution:
1 0

x1 (1 x) 1 dx =

()( ) ,1 = ( + )

Beta distribution p.d.f. - f (x|, ) Proof: ()( ) =


0

1 0

( + ) 1 x (1 x) 1 dx ()( )

x1 ex dx

Set up for change of variables:

y 1 ey dy =

x1 y 1 e(x+y) dxdy

x1 y 1 e(x+y) = x1 ((x + y ) x) 1 e(x+y) = x1 (x + y ) 1 (1 Change of Variables: s = x + y, t = Substitute: =


1 0

x 1 (x+y) ) e x+y

x , x = st, y = s(1 t) J acobian = s(1 t) (st) = s x+y


0

t1 s+ 2 (1 t) 1 es sdsdt = =
1 0

1 0

t1 (1 t) 1 dt

s+ 1 es ds =

t1 (1 t) 1 ( + ) = ()( )

Moments of Beta Distribution: 68

EX k =

Once again, the integral is the density function for a beta distribution. = ( + ) ( + k )( ) ( + ) ( + k ) ( + k 1) ... = = ()( ) ( + + k ) ( + + k ) () ( + + k 1) ... ( + ) EX = For k = 2: EX 2 = Var(X ) = ( + 1) ( + + 1)( + ) +

1 0

xk

( + ) 1 ( + ) x (1 x) 1 dx = ()( ) ()( )

1 0

x(+k)1 (1 x) 1 dx

For k = 1:

( + 1) 2 = ( + + 1)( + ) ( + )2 ( + )2 ( + + 1)

Shape of beta distribution. ** End of Lecture 22


18.05 Lecture 23 April 6, 2005

Estimation Theory: If only 2 outcomes: Bernoulli distribution describes your experiment.


If calculating wrong numbers: Poisson distribution describes experiment.
May know the type of distribution, but not the parameters involved.
A sample (i.i.d.) X1 , ..., Xn has distribution P from the family of distributions: {P : }
P = P0 , 0 is unknown
Estimation Theory - take data and estimate the parameter.
It is often obvious based on the relation to the problem itself.
Example: B(p), sample: 0 0 1 1 0 1 0 1 1 1
p = E(X ) x = 6/10 = 0.6
Example: E (), ex , x 0, E(X ) = 1/.
Once again, parameter is connected to the expected value.
1/ = E(X ) x, 1/x - estimate of alpha.
Bayes Estimators: - used when intuitive model can be used in describing the data.

X1 , ..., Xn P0 , 0 Prior Distribution - describes the distribution of the set of parameters (NOT the data)
f () - p.f. or p.d.f. corresponds to intuition.
P0 has p.f. or p.d.f.; f (x|)
Given x1 , ..., xn joint p.f. or p.d.f.: f (x1 , ..., xn |) = f (x1 |) ... f (xn |) To nd the Posterior Distribution - distribution of the parameter given your collected data. Use Bayes formula:

The posterior distribution adjusts your assumption (prior distribution) based upon your sample data. Example: B (p), f (x|p) = px (1 p)1x ; 70

f (|x1 , ..., xn ) =

f (x1 , .., xn |)f () f (x1 , ..., xn |)f ()d

f (x1 , ..., xn |p) = pxi (1 p)1xi = p

xi

(1 p)n

xi

Suppose the only possibilities are p = 0.4 and p = 0.6, and you put a prior distribution on these two values based on how probable you believe each to be.
Prior assumption: f(0.4) = 0.7, f(0.6) = 0.3.
You collect the data and find 9 successes out of 10 (sample proportion 0.9).
Based on this data, find the probability that the actual p equals 0.4 or 0.6; you would expect the posterior to shift toward the larger value.
Joint p.f. for each value:
f(x_1, ..., x_10 | 0.4) = 0.4^9 (0.6)^1,  f(x_1, ..., x_10 | 0.6) = 0.6^9 (0.4)^1.
Then find the posterior distribution:
f(0.4 | x_1, ..., x_n) = (0.4^9 (0.6)^1)(0.7) / [ (0.4^9 (0.6)^1)(0.7) + (0.6^9 (0.4)^1)(0.3) ] ≈ 0.08,
f(0.6 | x_1, ..., x_n) = (0.6^9 (0.4)^1)(0.3) / [ (0.4^9 (0.6)^1)(0.7) + (0.6^9 (0.4)^1)(0.3) ] ≈ 0.92.
Note that it becomes much more likely that p = 0.6 than p = 0.4.
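The posterior above is a direct application of Bayes' formula; here is a tiny computation sketch (my addition) reproducing the 0.08 / 0.92 split.

```python
prior = {0.4: 0.7, 0.6: 0.3}
successes, n = 9, 10

# likelihood of the observed data under each candidate value of p
like = {p: p**successes * (1 - p)**(n - successes) for p in prior}
norm_const = sum(like[p] * prior[p] for p in prior)
posterior = {p: like[p] * prior[p] / norm_const for p in prior}
print(posterior)   # approximately {0.4: 0.08, 0.6: 0.92}
```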
Example: B(p), prior distribution on [0, 1].
Choose any prior to fit intuition, but simplify by choosing the conjugate prior:
f(p | x_1, ..., x_n) = p^{Σx_i} (1-p)^{n - Σx_i} f(p) / ∫ (...) dp.
Choose f(p) to simplify the integral. The Beta distribution works for the Bernoulli distribution, so take the prior
f(p) = (Γ(α+β)/(Γ(α)Γ(β))) p^{α-1} (1-p)^{β-1},  0 ≤ p ≤ 1,
and then choose α and β so that E(X) and Var(X) fit intuition. The posterior is
f(p | x_1, ..., x_n) = (Γ(α+β+n)/(Γ(α+Σx_i)Γ(β+n-Σx_i))) p^{(α + Σx_i)-1} (1-p)^{(β + n - Σx_i)-1},
i.e. Posterior Distribution = Beta(α + Σx_i, β + n - Σx_i).
With a conjugate prior the posterior belongs to the same family of distributions as the prior.
Example (continuing the data with 9 successes out of 10): choose B(α, β) such that EX = 0.4 and Var(X) = 0.1.
Use the relations of the parameters to the expectation and variance to solve:
EX = α/(α+β) = 0.4,  Var(X) = αβ/((α+β)^2 (α+β+1)) = 0.1.
The posterior distribution is therefore Beta(α + 9, β + 1), and the new expected value is shifted:
EX = (α + 9)/(α + β + 10).

Once the posterior is calculated, estimate the parameter by its expected value.
Definition of Bayes Estimator: the Bayes estimator of the unknown parameter θ_0 is
θ̂(X_1, ..., X_n) = expectation of the posterior distribution.
Example: B(p), prior Beta(α, β), sample X_1, ..., X_n, posterior Beta(α + Σx_i, β + n - Σx_i).
Bayes Estimator: θ̂ = (α + Σx_i)/(α + Σx_i + β + n - Σx_i) = (α + Σx_i)/(α + β + n).
To see the relation to the prior, divide by n:
θ̂ = (α/n + x̄)/(α/n + β/n + 1).
Note that the prior intuition is erased for large n.
The Bayes Estimator becomes the average for large n.
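A small sketch (not in the notes) of the conjugate update and the resulting Bayes estimator; the prior parameters α = β = 1 (a uniform prior) and p_true = 0.3 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 1.0, 1.0             # hypothetical prior Beta(1, 1)
p_true = 0.3
x = rng.binomial(1, p_true, 500)   # Bernoulli sample

alpha_post = alpha + x.sum()                 # Beta(alpha + sum x_i, beta + n - sum x_i)
beta_post = beta + len(x) - x.sum()
bayes_estimate = alpha_post / (alpha_post + beta_post)
print(bayes_estimate, x.mean())              # close to each other (and to p_true) for large n
```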
** End of Lecture 23


18.05 Lecture 24 April 8, 2005

Bayes Estimator. Prior Distribution f () compute posterior f (|X1 , ..., Xn ) Bayess Estimator = expectation of the posterior. E(X a)2 minimize a a = EX Example: B (p), f (p) = Beta(, ) f (p|x1 , ..., xn ) = Beta( + + xi (x1 , ..., xn ) = ++n Example: Poisson Distribution x x!e
P

xi , + n

xi )

(), f (x|) = Joint p.f.:

If f () is the prior distribution, posterior:

n  xi xi n e = e f (x1 , ..., xn |) = x! xi ! i=1 i

f (|x1 , ..., xn ) = Note that g does not depend on :

xi n f () xi ! e P x i g (x1 ...xn ) = xi ! en f ()d


P

f (|x1 , ..., xn ) Take f () - p.d.f. of (, ),

xi n

f ()

Need to choose the appropriate prior distribution, Gamma distribution works for Poisson.

f (|x1 , ..., xn ) Bayes Estimator:

xi +1 (n+ )

1 e () ( + xi , + n) f () =

(x1 , ..., xn ) = EX =

Once again, balances both prior intuition and data, by law of large numbers: /n + xi /n (x1 , ..., xn ) = n x E(X1 ) 1 + /n The estimator approaches what youre looking for, with large n. Exponential E (), f (x|) = ex , x P0 xi = n e( xi ) f (x1 , ..., xn |) = n i=1 e If f () - prior, the posterior:

+ xi n+


f (|x1 , ..., xn ) n e( Once again, a Gamma distribution is implied. Choose f () (u, v ) f () = New posterior: f (|x1 , ..., xn ) n+u1 e( Bayes Estimator: (x1 , ..., xn ) =
P

xi )

f ()

v u u1 v e (u)

xi +v )u

(u + n, v +

xi )

1 u+n u/n + 1 1 = n = v + xi v/n + xi /n x EX


2 1 1 e 22 (x) 2

Normal Distribution: N (, ), f (x|, ) = f (x1 , ..., xn |, ) =

1 1 e 22 n ( 2 )

Pn

i=1 (xi )

It is dicult to nd simple prior when both , are unknown. Say that is given, and is the only parameter:
2 1 1 Prior: f () = e 2b2 (a) = N (a, b) b 2

Posterior: f (|X1 , ..., Xn ) e 22 Simplify the exponent: = 1 2 1 1 a xi 2 n 2 2 2 2 x + ) + ( 2 a + a ) = ( + ) 2 ( + 2 ) + ... ( x i i 2 2 2 2 2 2 2b 2 2b 2 2b = 2 A 2B + ... = A(2 2 f (|X1 , ..., Xn ) eA( A ) = e Normal Bayes Estimator: (X1 , ..., Xn ) = 2 a + nb2 x 2 a/n + b2 x = n x E(X1 ) = 2 + nb2 2 /n + b2
B 2 1 2(1/ 2A)2 1

2 1 (xi )2 2b 2 (a)

B B2 B B + ( )2 ) + ... = A( )2 + ... A A A A

B 1 2 b2 2 A + nb2 x B 2 ) = N( , ) = N( 2 , 2 ) 2 A A + nb + nb2 2A

** End of Lecture 24


18.05 Lecture 25 April 11, 2005

Maximum Likelihood Estimators X1 , ..., Xn have distribution P0 {P : } Joint p.f. or p.d.f.: f (x1 , ..., xn ) = f (x1 |) ... f (xn |) = () - likelihood function. If P - discrete, then f (x|) = P (X = x), and () - the probability to observe X1 , ..., Xn Denition: A Maximum likelihood estimator (M.L.E.):
= (X1 , ..., Xn ) such that ( ) = max ()
Suppose that there are two possible values of the parameter, = 1, = 2
p.f./p.d.f. - f (x|1), f (x|2)
Then observe points x1 , ..., xn
view probability with rst parameter and second parameter:
(1) = f (x1 , ..., xn |1) = 0.1, (2) = f (x1 , ..., xn |2) = 0.001,
The parameter is much more likely to be 1 than 2. Example: Bernoulli Distribution B(p), P p [0.1], P (p) = f (x1 , ..., xn |p) = p xi (1 p)n xi () max log () max (log-likelihood)
log (p) = xi log p + (n xi ) log(1 p), maximize over [0, 1]
Find the critical point:
log (p) = 0 p n xi xi =0 p 1p xi p xi np + p xi = 0 xi (1 p) p(n xi ) = p =

xi = x E(X ) = p n For Bernoulli distribution, the MLE converges to the actual parameter of the distribution, p. Example: Normal Distribution: N (, 2 ),
2 1 1 f (x|, 2 ) = e 22 (x) 2

(, 2 ) = (

1 1 )n e 22 2

Pn

i=1 (xi )

Note that the two parameters are decoupled. First, for a xed , we minimize n
i=1 (xi

n 1 log (, 2 ) = n log( 2 ) 2 (xi )2 max : , 2 2 i=1

)2 over


To summarize, the estimator of for a Normal distribution is the sample mean. To nd the estimator of the variance:
n 1 (xi x)2 maximize over n log( 2 ) 2 2 i=1 n n 1 (xi x)2 = 0 = + 3 i=1

n i=1

n n 2(xi ) = 0, (xi )2 = i=1 i=1

xi n = 0, =

1 xi = x E(X ) = 0 n i=1

2 = Find 2
2 =

1
2 (xi x)2 - MLE of 0 ; 2 a sample variance n

1 2 1 2 1 1 2
xi + (x)2 = (xi 2xi x + (x)2 ) = xi 2x xi 2(x)2 + (x)2 = n n n n = 1 2 2 2 xi (x)2 = x2 (x)2 E(x2 1 ) E(x1 ) = 0 n

2 for a Normal distribution is the sample variance.


To summarize, the estimator of 0

Example: U(0, θ), θ > 0 - parameter.
f(x | θ) = { 1/θ, 0 ≤ x ≤ θ; 0, otherwise }.
Here, when finding the maximum we need to take into account that the distribution is supported on the finite interval [0, θ]:
φ(θ) = Π_{i=1}^n (1/θ) I(0 ≤ x_i ≤ θ) = (1/θ^n) I(0 ≤ x_1, x_2, ..., x_n ≤ θ).
The likelihood function is 0 if any point falls outside of the interval: such a θ would give the observed data probability 0, so it cannot be the right parameter.
Maximize φ(θ) over θ > 0: 1/θ^n decreases in θ, but the likelihood drops to zero as soon as θ falls below the largest data point. Therefore
θ̂ = max(X_1, ..., X_n).
The estimator converges to the actual parameter θ_0: as you keep collecting points, the maximum gets closer and closer to θ_0.
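A quick simulation sketch (my addition) of the MLE θ̂ = max(X_1, ..., X_n) for U(0, θ), showing it approaches θ_0 from below as n grows; θ_0 = 2 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(6)
theta0 = 2.0
for n in [10, 100, 1000, 10_000]:
    sample = rng.uniform(0, theta0, n)
    print(n, sample.max())   # the MLE; always <= theta0 and converging to it
```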
Sketch of the consisteny of MLE. () max Ln () = 1 log () max n
n

 1 1 1 log f (xi |) L() = E0 log f (x1 |). log () = log f (xi |) = n n n i=1

by denition of MLE. Let us show that L() is maximized at 0 . Ln () is maximized at , 0 . L() L(0 ) : Then, evidently, Expand the inequality: f (x|) log f (x|0 )dx f (x|0 ) f (x|) 1 f (x|0 )dx f (x|0 )

L() L(0 )

= =

(f (x|) f (x|0 )) dx = 1 1 = 0.

Here, we used that the graph of the logarithm will be less than the line y = x - 1 except at the tangent point. ** End of Lecture 25


18.05 Lecture 26 April 13, 2005

Condence intervals for parameters of Normal distribution.


2 2 Condence intervals for 0 , 0 in N (0 , 0 ) 2 2 2 = x (x) = x, 2 with large n, but how close exactly? 0 , 2 0

You can guarantee that the mean or variance are in a particular interval with some probability:
Denition: Take [0, 1], condence level
If P(S1 (X1 , ..., Xn ) 0 S2 (X1 , ..., Xn )) = ,
then interval [S1 , S2 ] is the condence interval for 0 with condence level . Consider Z0 , ..., Zn - i.i.d., N(0, 1)
2 2 2 + Z2 + ... + Zn is called a chi-square (2 ) distribution,
Denition: The distribution of Z1 with n degrees of freedom. 1 As shown in 7.2, the chi-square distribution is a Gamma distribution ( n 2 , 2) Denition: The distribution of Z0
1 2 n (Z1 2) + ... + Zn

is called a t-distribution with n d.o.f.

The t-distribution is also called Students distribution, see 7.4 for detail.
2 ), need the following: To nd the condence interval for N (0 , 0 Fact: Z1 , ..., Zn i.i.d.N (0, 1)

z=

Then, A = nz N (0, 1), B = n(z 2 (z )2 ) 2 n1 , and A and B are independent.


2 2 Take X1 , ..., Xn N (0 , 0 ), 0 , 0 unknown.

1 2 1 1 2 zi ) (Z1 + ... + Zn ), z 2 (z )2 = zi ( n n n

Z1 =

xn 0 x1 0 , ..., Zn = N (0, 1) 0 0 A= nz = n( x 0 ) 0

B = n(z 2 (z )2 ) = = To summarize:

n(

1 (xi 0 )2 x 0 2 n 1 (xi 0 )2 (x 0 )2 ) = ) ) = ( ( 2 2 n 0 0 0 n n 2 2 2 (x (x) ) 0

2 0

2 2 (x2 20 x + 2 0 x + 20 x 0 ) =

A=

x 0 n ) N (0, 1); B = 2 (x2 (x)2 ) 2 n( n1 0 0 78

and, A and B are independent.


You cant compute B, because you dont know 0 , but you know the distribution:
B= n(x2 (x)2 ) 2 n1 2 0

Choose the most likely values for B, between c1 and c2 .

Choose the c values from the chi-square tabled values, such that area = condence. With probability = condence (), c1 B c2 c1 Solve for 0 : n(x2 (x)2 ) 2 0 c2 n(x2 (x)2 ) c1 n(x2 (x)2 ) c2 2 0

Choose c1 and c2 such that the right tail has probability 1 2 , same as left tail. This results in throwing away the possibilities outside c1 and c2 c1 given Or, you could choose to make the interval as small as possible, minimize: c1 1 2

Why wouldnt you throw away a small interval in between c1 and c2 , with area 1 ? Though its the same area, you are throwing away very likely values for the parameter! ** End of Lecture 26


18.05 Lecture 27 April 15, 2005

Take sample X1 , ..., Xn N (0, 1) A=

n(x ) n(x2 (x)2 N (0, 1), B = 2 n1 2

A, B - independent.
To determine the condence interval for , must eliminate from A:
A
1 n1 B

Where Z0 , Z1 , .., Zn1 N (0, 1) The standard normal is a symmetric distribution, and

Z0
1 2 n1 (z1 2 + ... + zn 1 ) 1 2 n1 (Z1

tn1

2 2 + ... + Zn 1 ) EZ1 = 1

So tn -distribution still looks like a normal distribution (especially for large n), and it is symmetric about zero. Given (0, 1) nd c, tn1 (c, c) = c n(x ) / A
1 n1 B

with probability = condence () c

1 n(x2 (x)2 ) c 2 n1

By the law of large numbers, x EX =


The center of the interval is a typical estimator (for example, MLE). error estimate of variance
2 2 n

c (x)2 ) 1 1 (x2 (x)2 ) x + c (x2 (x)2 )


xc n1 n1
1 2 n1 (x

for large n.

x2

(x) is a sample variance and it converges to the true variance, 80

by LLN 2 2 1 1 2 + ... + x2 E 2 = E (x2 n ) E( (x1 + ... + xn )) = n 1 n


2 = EX1

1 1 2 2 EXi Xj = EX1 2 (nEX1 + n(n + 1)(EX1 )2 ) n2 i,j n n1 n1 2 EX1 (EX1 )2 = n n

Note that for i = j, EXi Xj = EXi EXj = (EX1 )2 = 2 , n(n - 1) terms with dierent indices. E 2 = = Therefore: n1 2 < 2 n Good estimator, but more often than not, less than actual. So, to compensate for the lower error: E 2 = E Consider ( )2 =
n 2 , n1

n1 n1 2 n1 2 (EX1 (EX1 )2 ) = Var(X1 ) = n n n

unbiased sample variance. 1 1 1 2 c (x2 (x)2 ) = c 2 = c ( ) n1 n1 n ( )2 ( )2 x+c xc n n

n 2 = 2 n1

7.5 pg. 140 Example: Lactic Acid in Cheese 0.86, 1.53, 1.57, ..., 1.58, n = 10 N (, 2 ), x = 1.379, 2 = x2 (x)2 = 0.0966 Predict parameters with condence = 95% Use a t-distribution with n - 1 = 9 degrees of freedom.


See table: t_9(-∞, c) = 0.975 gives c = 2.262, so with confidence 95%
x̄ - 2.262 √(0.0966/9) ≤ μ ≤ x̄ + 2.262 √(0.0966/9),  i.e.  1.145 ≤ μ ≤ 1.613.
This is a fairly large interval, due to the high guarantee and the small number of samples.
If we change to 90% confidence, c = 1.833 and the interval becomes 1.189 ≤ μ ≤ 1.569, a better-sized interval.

Confidence interval for the variance:
c_1 ≤ n σ̂^2 / σ^2 ≤ c_2, where the c values come from the χ^2_{n-1} distribution,
so n σ̂^2/c_2 ≤ σ^2 ≤ n σ̂^2/c_1.
The χ^2 distribution is not symmetric (all its values are positive). Here c_1 = 2.7, c_2 = 19.02, giving
0.0508 ≤ σ^2 ≤ 0.3579,
again a wide interval as a result of small n and high confidence.
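The intervals above can be reproduced with scipy's t and chi-square quantiles; this sketch is my addition and plugs in the summary statistics quoted in the example (n = 10, x̄ = 1.379, σ̂² = 0.0966).

```python
from math import sqrt
from scipy.stats import t, chi2

n, xbar, sigma2_hat = 10, 1.379, 0.0966    # summary statistics from the lactic acid example
conf = 0.90                                # confidence level for the mean

c = t.ppf(1 - (1 - conf) / 2, n - 1)       # 1.833 for 90%
half = c * sqrt(sigma2_hat / (n - 1))
print(xbar - half, xbar + half)            # about (1.189, 1.569)

c1 = chi2.ppf(0.025, n - 1)                # 2.70   (95% interval for the variance)
c2 = chi2.ppf(0.975, n - 1)                # 19.02
print(n * sigma2_hat / c2, n * sigma2_hat / c1)   # about (0.051, 0.358)
```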

Sketch of the proof of Fisher's theorem. z_1, ..., z_n ~ N(0, 1) i.i.d. Claim:
√n z̄ = (1/√n)(z_1 + ... + z_n) ~ N(0, 1)  and  n( (1/n)Σ z_i^2 - (z̄)^2 ) ~ χ^2_{n-1}, and the two are independent.
The joint density is
f(z_1, ..., z_n) = (1/√(2π))^n e^{-(1/2) Σ z_i^2} = (1/√(2π))^n e^{-(1/2) r^2},
which depends only on the length r of the vector (z_1, ..., z_n), so the graph is symmetric with respect to rotations.
Rotating the coordinates therefore gives again an i.i.d. standard normal sequence:
f(y_1, ..., y_n) = (1/√(2π))^n e^{-(1/2) r^2} = (1/√(2π))^n e^{-(1/2) Σ y_i^2} = Π_i (1/√(2π)) e^{-(1/2) y_i^2},  y_1, ..., y_n i.i.d. N(0, 1).
Choose the new coordinate system such that
y_1 = (1/√n)(z_1 + ... + z_n),  i.e. v_1 = (1/√n, ..., 1/√n) is the new first axis,
and choose all other vectors however you want to complete an orthogonal basis. Then
√n z̄ = y_1 ~ N(0, 1),
and since the length does not change under rotation, y_1^2 + ... + y_n^2 = z_1^2 + ... + z_n^2, so
n( (1/n)Σ z_i^2 - (z̄)^2 ) = Σ z_i^2 - n(z̄)^2 = Σ y_i^2 - y_1^2 = y_2^2 + ... + y_n^2 ~ χ^2_{n-1}.
** End of Lecture 27

18.05 Lecture 28 April 20, 2005

Review for Exam 2 pg. 280, Problem 5


= 300, = 10; X1 , X2 , X3 N (300, 100 = 2 )
P(X1 > 290 X2 > 290 X3 > 290) = 1 P(X1 290)P(X2 290)P(X3 290)
x2 300 x3 300 x1 300 1)P( 1)P( 1) 10 10 10 Table for x = 1 gives 0.8413, x = -1 is therefore 1 - 0.8413 = 0.1587 = 1 (0.1587)3 = 0.996 = 1 P( pg. 291, Problem 11
600 seniors, a third bring both parents, a third bring 1 parent, a third bring no parents.
Find P(< 650 parents)
Xi 0, 1, 2 parents for the ith student.
1
P(Xi = 2) = P(Xi = 1) = P(Xi = 0) = 3 P(X1 + ... + X600 < 650) - use central limit theorem. = 0(1/3) + 1(1/3) + 2(1/3) = 1 EX 2 = 02 (1/3) + 12 (1/3) + 22 (1/3) = 5 3 5 Var(X ) = EX 2 (EX )2 = 3 1= 2 3
= 2/3 P xi 650 1) 1) 600 ( 600 600( 600

P( < ) 2/3 2/3 n(x ) < 2.5) N (0, 1), P(Z 2.5) = (2.5) = 0.9938 P( pg. 354, Problem 10 Time to serve X E (), n = 20, X1 , ..., X20 , x = 3.8 min Prior distribution of is a Gamma dist. with mean 0.2 and std. dev. 1 / = 0.2, / 2 = 1 = 0.2, = 0.04 Get the posterior distribution: P f (x|) = ex , f (x1 , ..., xn |) = n e xi P 1 f () = ( e , f (|x1 , ..., xn ) (+n)1 e( + xi ) ) Posterior is ( + n, + xi ) = (0.04 + 20, 0.2 + 3.8(20)) Bayes estimator = mean of posterior distribution = = 20.04 3.8(20) + 0.2

Problem 4 f (x|) = {ex, x 0; 0, x < 0} Find the MLE of Likelihood () = f (x1 |) ... f (xn |) P = ex1 ...exn I (x1 , ..., xn ) = en xi I (min(x1 , ..., xn ) ) 84

Maximize over .

Note that the graph increases in , but must be less than the min value.
If greater, the value drops to zero.
Therefore:
= min(x1 , ..., xn ) Also, by observing the original distribution, the maximum probability is at the smallest Xi .

To nd c, use the t distribution with n - 1 degrees of freedom:

p. 415, Problem 7:
To get the condence interval, compute the average and sample variances:
Condence interval for :
1 1 2 2 (x (x) ) x c (x2 (x)2 ) xc n1 n1

tn1 = t19 (, c) = 0.95, c = 1.729 Condence interval for 2 : n(x ) n(x2 (x)2 ) N (0, 1), 2 n1 2


tn1 Use the table for 2 n1

N (0, 1) n(x )/ tn1 = 1 2 1 n(x2 (x)2 ) 2 n1 n1


n1

c1 From the Practice Problems: (see solutions for more detail)

n(x2 (x)2 ) c2 2

p. 196, Number 9 P(X1 = def ective) = p Find E(X Y ) Xi = {1, def ective; 1, notdef ective}; X Y = X1 + ... + Xn E(X Y ) = EX1 + ... + EXn = nEX1 = n(1 p 1(1 P )) = n(2p 1) p. 396, Number 10 X1 , ..., X6 N (0, 1) c ((X1 + X2 + X3 )2 + (X4 + X 5 + X 6 )2 ) 2 n 2 ( c(X1 + X2 + X3 )) + ( c(X4 + X5 + X6 ))2 2 2 But each needs a distribution of N(0, 1) E c( X1 + X2 + X3 ) = c(EX1 + EX2 + EX3 ) = 0 Var( c(X1 + X2 + X3 )) = c(Var(x1 ) + Var(X2 ) + Var(X3 )) = 3c In order to have the standard normal distribution, variance must equal 1. 3c = 1, c = 1/3 ** End of Lecture 28


18.05 Lecture 29 April 25, 2005

Score distribution for Test 2: 70-100 A, 40-70 B, 20-40 C, 10-20 D Average = 45 Hypotheses Testing. X1 , ..., Xn with unknown distribution P Hypothesis possibilities: H1 : P = P 1 H2 : P = P 2 ... Hk : P = P k There are k simple hypotheses. A simple hypothesis states that the distribution is equal to a particular probability distribution. Consider two normal distributions: N(0, 1), and N(1, 1).

There is only 1 point of data: X1


Depending on where the point is, it is more likely to come from either N(0, 1) or N(1, 1).
Hypothesis testing is similar to maximum likelihood testing
Within your k choices, pick the most likely distribution given the data.
However, hypothesis testing is NOT like estimation theory, as there is a dierent goal:
Denition: Error of type i
P(make a mistake |Hi is true) = i
Decision Rule: : X n (H1 , H2 , ..., Hk )
Given a sample (X1 , ..., Xn ), (X1 , ..., Xn ) {H1 , ..., Hk }
i = P( = Hi |Hi ) - error of type i
The decision rule picks the wrong hypothesis = error. Example: Medical test, H1 - positive, H2 - negative.
Error of Type 1: 1 = P( = H1 |H1 ) = P(negative|positive)
Error of Type 2: 2 = P( = H2 |H2 ) = P(positive|negative) These are very dierent errors, have dierent severity based on the particular situation. Example: Missile Detection vs. Airplane Type 1 P(airplane|missile), Type 2 P(missile|airplane) Very dierent consequences based on the error made. Bayes Decision Rules Choose a prior distribution on the hypothesis. 87

Assign a weight to each hypothesis, based upon the importance of the dierent errors.
(1), ..., (k ) 0, (i) = 1
Bayes error ( ) = (1)1 + (2)2 + ... + (k )k
Minimize the Bayes error, choose the appropriate decision rule.
Simple solution to nding the decision rule:
X = (X1 , ..., Xn ), let fi (x) be a p.f. or p.d.f. of Pi
fi (x) = fi (x1 ) ... fi (xn ) - joint p.f./p.d.f.
Theorem: Bayes Decision Rule:
= {Hi : (i)fi (x) = maxij k (j )fj (x) Similar to max. likelihood.
Find the largest of joint densities, but weighted in this case.
( ) = (i)Pi ( = Hi ) = (i)(1 Pi ( = Hi )) =
(i)Pi ( = Hi ) = 1 (i) I ( (x) = Hi )fi (x)dx =
=1 = 1 ( (i)I ( (x) = Hi )fi (x))dx - minimize, so maximize the integral:
Function within the integral:
I ( = H1 ) (1)f1 (x) + ... + I ( = Hk ) (k )fk (x) The indicators pick the term
= H1 : 1 (1)f1 (x) + 0 + 0 + ... + 0
So, just choose the largest term to maximize the integral.
Let pick the largest term in the sum.
Most of the time, we will consider 2 simple hypotheses:
= {H1 : (1)f1 (x) > (2)f2 (x), Example:
H1 : N (0, 1), H2 : N (1, 1)
(1)f1 (x) + (2)f2 (x) minimize

P 2 P 2 1 1 1 1 f1 (x) = ( )n e 2 xi ; f2 (x) = ( )n e 2 (xi 1) 2 2 P 2 1P P 2 1 n f1 (x) (2) = e 2 xi + 2 (xi 1) = e 2 xi > f2 (x) (1)

(2) f1 (x) ; H2 if <; H1 or H2 if =} > f2 (x) (1)

= {H1 :

Considering the earlier example, N(0, 1) and N(1, 1)

xi <

n (2) log ; H2 if >; H1 or H2 if =} 2 (1)

88

X1 , n = 1, (1) = (2) =

1 2

1 1 ; H2 x1 > ; H1 or H2 if =} 2 2 However, if 1 distribution were more important, it would be weighted. = {H1 : x1 <

If N(0, 1) were more important, you would choose it more of the time, even on 1 some occasions when xi > 2 Denition: H1 , H2 - two simple hypotheses, then: 1 ( ) = P( = H1 |H2 ) - level of signicance. ( ) = 1 2 ( ) = P( = H2 |H2 ) - power. For more than 2 hypotheses,
1 ( ) is always the level of signicance, because H1 is always the
Most Important hypothesis.
( ) becomes a power function, with respect to each extra hypothesis.
Denition: H0 - null hypothesis
Example, when a drug company evaluates a new drug,
the null hypothesis is that it doesnt work.
H0 is what you want to disprove rst and foremost,
you dont want to make that error!
Next time: consider class of decision rules.
K = { : 1 ( ) }, [0, 1]
Minimize 2 ( ) within the class K
** End of Lecture 29


18.05 Lecture 30 April 27, 2005

Bayes Decision Rule (1)1 ( ) + (2)2 ( ) minimize. = {H1 : Example: see pg. 469, Problem 3 H0 : f1 (x) = 1 for 0 x 1 H1 : f2 (x) = 2x for 0 x 1 Sample 1 point x1 Minimize 30 ( ) + 11 ( ) = {H0 : Simplify the expression: 3 3 ; H1 : x1 > } 2 2 Since x1 is always between 0 and 1, H0 is always chosen. = H0 always. = {H0 : x1 Errors:
H0 ) = 0
0 ( ) = P0 ( = 1 ( ) = P1 ( = H1 ) = 1 We made the 0 very important in the weighting, so it ended up being 0. Most powerful test for two simple hypotheses. Consider a class K = { such that 1 ( ) [0, 1]} Take the following decision rule: = {H1 : f1 (x) f1 (x) < c} c; H2 : f2 (x) f2 (x) 1 1 1 1 < ; either if equal} > ; H1 : 2x1 3 2x1 3 f1 (x) (2) > ; H2 : if <; H1 or H2 : if =} f2 (x) (1)

Calculate the constant from the condence level : 1 ( ) = P1 ( = H1 ) = P1 ( f1 (x) < c) = f2 (x)

Sometimes it is dicult to nd c, if discrete, but consider the simplest continuous case rst: (2) Find (1), (2) such that (1) + (2) = 1, (1) = c
Then, is a Bayes decision rule.
(1)1 ( ) + (2)2 ( ) (1)1 ( ) + (2)2 ( )
for any decision rule
If K then 1 ( ) . Note: 1 ( ) = , so: (1) + (2)2 ( ) (1) + (2)2 ( )
Therefore: 2 ( ) 2 ( ), is the best (mosst powerful) decision rule in K
Example:
H1 : N (0, 1), H2 : N (1, 1), 1 ( ) = 0.05


P 2 1P P 2 n 1 f1 (x) = e 2 xi + 2 (xi 1) = e 2 xi c f2 (x)

Always simplify rst: n n xi log(c), xi + log(c), xi c 2 2 The decision rule becomes: = {H1 : xi c ; H 2 : xi > c }

Now, nd c 1 ( ) = P1 ( xi > c )
recall, subscript on P indicates that x1 , ..., xn N (0, 1)
Make into standard normal:
c xi P1 ( > ) = 0.05 n n Check the table for P(z > c ) = 0.05, c = 1.64, c = n(1.64)

Note: a very common error with the central limit theorem: 1 n xi xi n ) xi n( n

These two conversions are the same! Dont combine techniques from both. The Bayes decision rule now becomes: = {H1 : xi 1.64 n; H2 : xi > 1.64 n}

Error of Type 2:
2 ( ) = P2 ( xi c = 1.64 n)
Note: subscript indicates that X1 , ..., Xn N (1, 1) xi n(1) 1.64 n n ) = P2 (z 1.64 n) = P2 ( n n

Use tables for standard normal to get the probability.


If n = 9 P2 (z 1.64 9) = P2 (z 1.355) = 0.0877
Example:
H1 : N (0, 2), H2 : N (0, 3), 1 ( ) = 0.05

2(2) P 2 1 3 n/2 12 f1 (x) xi P 1 2 =( ) c = e x 1 n f2 (x) 2 i 2(3) ( 3 2 ) e 2 = {H1 : x2 xi > c } i c ; H2 :

1 ( 2 )n e 2

x2 i

This is intuitive, as the sum of squares sample variance. If small = 2 If large = 3


x
2 2 c 2 i 1 ( ) = P1 ( x
i
> c ) = P1 ( 2
> 2 ) = P1 (n > c ) = 0.05 2
If n = 10, P1 (10 > c ) = 0.05; c = 18.31, c = 36.62 Can nd error of type 2 in the same way as earlier: c 2
P(2 n > 3 ) P(10 > 12.1) 0.7 A dierence of 1 in variance is a huge deal! Large type 2 error results, small n. ** End of Lecture 30


18.05 Lecture 31 April 29, 2005

t-test X1 , ..., Xn - a random sample from N (, 2 ) 2-sided Hypothesis Test: H1 : = 0 H2 : = 0 2 sided hypothesis - parameter can be greater or less than 0 Take (0, 1) - level of signicance (error of type 1) Construct a condence interval condence = 1 - If 0 falls in the interval, choose H1 , otherwise choose H2 How to construct the condence interval in terms of the decision rule: T = x 0 (x)2 ) t distribution with n - 1 degrees of freedom.

1 2 n1 (x

Under the hypothesis H1 , T is has a t-distribution.


See if the T value falls in the expected area of the t-distribution:
Accept the null hypothesis (H1 ), if c T c, Reject if otherwise.
Choose c such that area between c and -c is 1 , each tail area = /2 Error of type 1: 1 = P1 (T < c, T > c) = 2 + 2 = Denition: p-value

93

p-value = probability of values less likely than T


If p-value , accept the null hypothesis.
If p-value < , reject the null hypothesis.
Example: p-value = 0.0001, very unlikely that this T value would occur
if the mean were 0 . Reject the null hypothesis!
1-sided Hypothesis Test: H1 : 0 H2 : > 0 T = x 0 (x)2 )

1 2 n1 (x

See how the distribution behaves for three cases: 1) If = 0 , T tn1 .

2) If < 0 : T = x 0 (x)2 ) = 0 (x)2 ) + x (x)2 )

1 2 n1 (x

( 0 ) n 1 T tn1 +

1 2 n1 (x

1 2 n1 (x

3) If > 0 , similarly T + Decision Rule: = {H1 : T ; H : T > }


1 = P1 (T > c) =
p-value: Still the probability of values less likely than T ,
but since it is 1-sided,
you dont need to consider the area to the left of T as you would in the 2-sided case.

The p-value is the area of everything to the right of T Example: 8.5.1, 8.5.4 0 = 5.2.n = 15, x = 5.4, = 0.4226 5.2 H1 : = 5.2, H2 : = T is calculated to be = 1.833, which leads to a p-value of 0.0882

If = 0.05, accept H1 , = 5.2. because the p-value is over 0.05


Decision rule:
Such that = 0.05, the areas of each tail in the 2-sided case = 2.5%


From the table c = 2.145 = {H1 : 2.145 T 2.145; H2 otherwise} Consider 2 samples, want to compare their means: 2 2 X1 , ..., Xn N (1 , 1 ) and Y1 , ..., Ym N (2 , 2 ) Paired t-test: Example (textbook): Crash test dummies, driver and passenger seats (X, Y)
See if there is a dierence in severity of head injuries depending on the seat:
(X1 , Y1 ), ..., (Xn , Yn )
Observe the paired observations (each car) and calculate the dierence:
Hypothesis Test:
H1 : 1 = 2
H2 : 1 = 2
Consider Z1 = X1 Y1 , ..., Zn = Xn Yn N (1 2 = , 2 )
H1 : = 0; H2 : = 0
Just a regular t-test:
p-values comes out as < 106 , so they are likely to be dierent.
** End of Lecture 31


18.05 Lecture 32 May 2, 2005

Two-sample t-test X1 , ..., Xm N (1 , 2 )


Y1 , ..., Yn N (2 , 2 )
Samples are independent.
Compare the means of the distributions.
Hypothesis Tests:
H1 : 1 = 2 , 1 2
H2 : 1 = 2 , 1 > 2
By properties of Normal distribution and Fishers theorem: m(x 1 ) n(y 2 ) , N (0, 1)
2 2 x = x2 (x)2 , y = y 2 (y )2 2 2 ny mx 2 2 , m 1 n1 2 2

T =

1 2 n1 (x

x (x)2 )

tn1

Calculate x y x 1 1 1 1 y 2 N (0, 1) = N (0, ), N (0, ) m n m x 1 y 2 (x y ) (1 2 ) 1 1 = N (0, + ) m n (x y ) (1 2 ) N (0, 1) 1 1 m +n


2 2 ny mx + 2 m+n2 2 2


Construct the t-statistic: N (0, 1)


1 2 m+n2 (m+n2 )

tm+n2

Construct the test:


H1 : 1 = 2 , H2 : 1 = 2
If H1 is true, then:

(x y ) (1 2 ) (x y ) (1 2 ) T = tm+n2 = 2 +n 2 m 1 1 1 2 + n 2 ) x y 1 1 1 ( + ) ( m m +n ( ) x y 2 m n m + n 2 m+n2

T =

1 1 2 2 +n ) m+1 (m n2 (mx + ny )

xy

tm+n2

Decision Rule: = {H1 : c T c, H2 : otherwise} where the c values come from the t distribution with m + n - 2 degrees of freedom.
c = T value where the area is equal to /2, as the failure is both below -c and above +c
If the test were: H1 : 1 2 , H2 : 1 > 2 ,
then the T value would correspond to an area in one tail, as the failure is only above +c.

There are dierent functions you can construct to approach the problem,
based on dierent combinations of the data.
This is why statistics is entirely based on your assumptions and the resulting


distribution function! Example: Testing soil types in dierent locations by amount of aluminum oxide present.
m = 14, x = 12.56 N (1 , 2 ); n = 5, y = 17.32 N (2 , 2 )
H1 : 1 2 ; H2 : 1 > 2 T = 6.3 t14+52=17

c-value is 1.74, however this is a one-sided test. T is very negative, but we still accept H 1
If the hypotheses were: H1 : 1 2 ; H2 : 1 < 2 ,
Then the T value of -6.3 is way to the left of the c-value of -1.74. Reject H1

Goodness-of-t tests. Setup: Consider r dierent categories for the random variable. The probability that a data point takes value Bi is pi pi = p1 + ... + pr = 1 Hypotheses: H1 : pi = p0 i for all i = 1, ..., r; H2 : otherwise. Example: (9.1.1)
3 categories exist, regarding a familys nancial situation.
They are either worse, better, or the same this year as last year.
Data: Worse = 58, Same = 64, Better = 67 (n = 189)
1 Hypothesis: H1 : p1 = p2 = p3 = 3 , H2 : otherwise. Ni = number of observations in each category.
You would expect, under H1 , that N1 = np1 , N2 = np2 , N3 = np3
Measure using the central limit theorem:
N1 np1
N (0, 1) np1 (1 p1 ) 99

However, keep in mind that the Ni values are not independent!! (they sum to 1) Ignore part of the scaling to account for this (proof beyond scope):
N1 np1 1 p1 N (0, 1) = N (0, 1 p1 ) np1 T = If H1 is true, then: T = If H1 is not true, then:
r (Ni np0 )2 i i=1

Pearsons Theorem:

(N1 np1 )2 (Nr npr )2 2 + ... + r 1 np1 npr

np0 i

2 r 1

T + Proof: if p1 = p0 1, N1 np0 N1 np1 n(p1 p0 1) i




= N (0, 2 ) + () + 0 0 0 npi npi np1

However, this term is squared in T, so under H2 the statistic T → +∞.
Decision Rule: δ = {H1: T ≤ c; H2: T > c}.
The example yields T = 0.666; under H1, T has approximately the χ^2_{r-1} = χ^2_{3-1} = χ^2_2 distribution, and the critical value c is much larger, therefore accept H1:
the difference among the categories is not significant.
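A short computation sketch (added here) reproducing the statistic for the worse/same/better data; scipy's chisquare gives the same T together with a p-value.

```python
import numpy as np
from scipy.stats import chisquare, chi2

observed = np.array([58, 64, 67])            # worse, same, better
n = observed.sum()                           # 189
expected = n * np.array([1/3, 1/3, 1/3])     # expected counts under H1

T = ((observed - expected) ** 2 / expected).sum()
print(T)                                     # about 0.666
print(chi2.ppf(0.95, df=2))                  # critical value c, about 5.99, much larger than T
print(chisquare(observed, expected))         # same statistic, with a p-value
```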

** End of Lecture 32


18.05 Lecture 33 May 4, 2005

Simple goodness-of-t test: H1 : p i = p 0 i , i r ; H2 : otherwise. T =


r (Ni np0 )2 i i=1

np0 i

2 r 1

Decision Rule: = {H1 : T c; H2 : T > c} If the distribution is continuous or has innitely many discrete points: Hypotheses: H1 : P = P0 ; H2 : P = P0

Discretize the distribution into intervals, and count the points in each interval.
You know the probability of each interval by area, then, consider a nite number of intervals.
This discretizes the problem.

New Hypotheses: H1 : pi = P(X Ii ) = P0 (X Ii ); H2 otherwise.


If H1 is true H1 is also true.

Rule of Thumb: np0 i = nP0 (X Ii ) 5


If too small, too unlikely to nd points in the interval,
does not approximate the chi-square distribution well.
Example 9.1.2 Data N (3.912, 0.25), n = 23
H1 : P N (3.912, 0.25)
1 Choose k intervals p0 i = k 1 23 n( k ) 5 k 5, k = 4 101

3.912 N (3.912, 0.25) X X N (0, 1) 0.25 Dividing points: c1 , c2 = 3.912, c3 Find the normalized dividing points by the following relation:

ci 3.912 = c i 0.5

The c i values are from the std. normal distribution. c 1


= 0.68 c1 = 0.68(0.5) + 3.912 = 3.575 c 2 = 0 c2 = 0(0.5) + 3.912 = 3.912
c 3 = 0.68 c3 = 0.68(0.5) + 3.912 = 4.249 Then, count the number of data points in each interval.
Data: N1 = 3, N2 = 4, N3 = 8, N4 = 8; n = 23
Calculate the T statistic:
T = Now, decide if T is too large. = 0.05 - signicance level. 2 2 r 1 3 , c = 7.815 (3 23(0.25))2 (8 23(0.5))2 + ... + = 3.609 23(0.25 23(0.25)


Decision Rule:
= {H1 : T 7.815; H2 : T > 7.815}
T = 3.609 < 7.815, conclusion: accept H1
The distribution is relatively uniform among the intervals.
Composite Hypotheses: H1 : pi = pi (), i r for - parameter set.
H2 : not true for any choice of
Step 1: Find that best describes the data.
Find the MLE of
Likelihood Function: () = p1 ()N1 p2 ()N 2 ... pr ()Nr
Take the log of () maximize is good enough. Step 2: See if the best choice of ) for i r, H2 : otherwise. H1 : pi = pi ( T =
i=1 r ))2 (Ni npi (

where s - dimension of the parameter set, number of free parameters. Example: N (, 2 ) s = 2


If there are a lot of free parameters, it makes the distribution set more exible.
Need to subtract out this exibility by lowering the degrees of freedom.
Decision Rule:
= {H1 : T c; H2 : T > c}

Choose c from 2 r s1 with area =

) npi (

2 r s1

Example: (pg. 543) Gene has 2 possible alleles A1 , A2 Genotypes: A1 A1 , A1 A2 , A2 A2 Test that P(A1 ) = , P(A2 ) = 1 ,


but you only observe genotype. H1 : P(A1 A2 ) = 2(1 ) N2 P(A1 A1 ) = 2 N1 P(A2 A2 ) = (1 )2 N3 r = 3 categories. s = 1 (only 1 parameter, ) () = (2 )N1 (2(1 ))N2 ((1 )2 )N3 = 2N2 2N1 +N2 (1 )2N3 +N2 log () = N2 log 2 + (2N1 + N2 ) log + (2N3 + N2 ) log(1 ) 2N3 + N2 2N1 + N2 = =0 1 (2N1 + N2 )(1 ) (2N3 + N2 ) = 0 = 2N1 + N2 2N1 + N2 = 2N1 + 2N2 + 2N3 2n

based on data. compute 2 , p0 = 2 (1 ), p0 = (1 )2 p0 = 2 3 i T =

(Ni np0 )2
i

np0 i

2 2 r s1 = 1

For an = 0.05, c = 3.841 from the 2 1 distribution. Decision Rule: = {H1 : T 3.841; H2 : T > 3.841} ** End of Lecture 33


18.05 Lecture 34 May 6, 2005

Contingency tables, test of independence. Feature 2 = 1 N11 ... ... ... Na1 N+a F2 = 2
F2 = 3
...
...
...
...
...
...
...
...
...
...
...
...
... ... ... ... ... ... ... F2 = b N1b ... ... ... Nab N+b row total N1+ ... ... ... Na+ n

Feature 1 = 1 F1 = 2 F1 = 3 ... F1 = a col. total Xi1 {1, ..., a} Xi2 {1, ..., b}

Random Sample: 1 2 1 2 X1 = (X1 , X1 ), ..., Xn = (Xn , Xn ) Question: Are X 1 , X 2 independent? Example: When asked if your nances are better, worse, or the same as last year, see if the answer depends on income range: 20K 20K - 30K 30K Worse 20 24 14 Same 15 27 22 Better 12 32 23

Check if the dierences and subtle trend are signicant or random. ij = P(i, j ) = P(i) P(j ) if independent, for all cells ij Independence hypothesis can be written as: H1 : ij = pi qj where p1 + ... + pa = 1, q1 + ... + qb = 1 H2 : otherwise. r = number of categories = ab s = dimension of parameter set = a + b 2 The MLE p i , qj needs to be found T =
2 (Nij np i qj ) 2 r s1=ab(a+b2)1=(a1)(b1) q np i j i,j

Distribution has (a - 1)(b - 1) degrees of freedom. Likelihood: ( p , q)= 


i,j

(pi qj )Nij =


i

pi

Ni+


j

qj

N+j

Note: Ni+ = j Nij and N+j = i Nij Maximize each factor to maximize the product. 105

Use Lagrange multipliers to solve the constrained maximization: N log p ( p 1) maxp min i + i i i i
i

Ni+ log pi max, p1 + ... + pa = 1

Ni+ Ni+ = = 0 pi = pi pi pi = n Ni+ = 1 = n p i = n p i = N+j Ni+ , qj = n n

T =

(Nij Ni+ N+j /n)2 2 (a1)(b1) N N /n i + + j i,j

Decision Rule:
= {H1 : T c; H2 : T > c}
Choose c from the chi-square distribution, (a - 1)(b - 1) d.o.f., at a level of signicance = area.
From the above example:
N1+ = 47, N2+ = 83, N3+ = 59
N+1 = 58, N+2 = 64, N+3 = 67
n = 189
For each cell, the component of the T statistic adds as follows:
T = Σ_{i,j} (N_ij - N_i+ N_+j / n)^2 / (N_i+ N_+j / n) = (20 - 58(47)/189)^2 / (58(47)/189) + ... = 5.210.
Is T too large? Under H1, T ≈ χ^2 with (3 - 1)(3 - 1) = 4 degrees of freedom.
For this distribution c = 9.488 at α = 0.05, and according to the decision rule we accept H1, because 5.210 ≤ 9.488.
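The same number comes out of scipy's contingency-table test; this check is my addition, with rows = answer (worse/same/better) and columns = income group, matching the table above.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 24, 14],    # worse
                  [15, 27, 22],    # same
                  [12, 32, 23]])   # better; columns are the three income groups

T, p_value, dof, expected = chi2_contingency(table, correction=False)
print(T, dof, p_value)             # T about 5.21 with 4 degrees of freedom
```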

Test of Homogeneity - very similar to the independence test.

Group 1 ... Group a

Category 1 N11 ... Na1

... ... ... ...

Category b N1b ... Nab

1. Sample from entire population. 2. Sample from each group separately, independently between the groups. Question: P(category j | group i) = P(category j) This is the same as independence testing! P(category j, group i) = P(category j)P(group i) P(Cj |Gi ) = P(Cj Gi ) P(Cj )P(Gi ) = = P(Cj ) P(Gi ) P(Gi )

Consider a situation where group 1 is 99% of the population, and group 2 is 1%.
You would be better o sampling separately and independently.
Say you sample 100 of each, just need to renormalize within the population.
The test now becomes a test of independence.
Example: pg. 560
100 people were asked if service by a re station was satisfactory or not.
Then, after a re occured, the people were asked again.
See if the opinion changed in the same people.
Before Fire After Fire 80 72 satised 20 28 unsatised

But, you cant use this if you are asking the same people! Not independent! Better way to arrange: Originally Satised Originally Unsatised 70 2 After, Satised 10 18 After, Not Satised

If taken from the entire population, this is ok. Otherwise you are taking from a dependent population. ** End of Lecture 34


18.05 Lecture 35 May 9, 2005

Kolmogorov-Smirnov (KS) goodness-of-t test Chi-square test is used with discrete distributions.
If continuous - split into intervals, treat as discrete.
This makes the hypothesis weaker, however, as the distribution isnt characterized fully.
The KS test uses the entire distribution, and is therefore more consistent.
Hypothesis Test:
H1 : P = P0
H2 : P = P0
P0 - continuous In this test, the c.d.f. is used. Reminder: c.d.f. F (x) = P(X x), goes from 0 to 1.

The c.d.f. describes the entire function. Approximate the c.d.f. from the data Empirical Distribution Function: 1 #(points x) Fn (x) = I (X x) = n i=1 n by LLN, Fn (x) EI (X1 x) = P(X1 x) = F (x)
n

From the data, the composed c.d.f. jumps by 1/n at each point. It converges to the c.d.f. at large n. Find the largest dierence (supremum) between the disjoint c.d.f. and the actual. sup |Fn (x) F (x)| n 0
x

108

For a xed x: n(Fn (x) F (x)) = (I (Xi x) EI (X1 x)) n

By the central limit theorem: N 0, Var(I (Xi x)) = p(1 p) = F (x)(1 F (x) You can tell exactly how close the values should be! Dn =
x

n sup |Fn (x) F (x)|

a) Under H1 , Dn has some proper known distribution. b) Under H2 , Dn + If F (x) implies a certain c.d.f. which is away from that predicted by H0

Fn (x) F (x), |Fn (x) F0 (x)| > /2 n n|Fn (x) F0 (x)| > 2 + The distribution of Dn does not depend on F(x), this allows to construct the KS test. Dn = n supx |Fn (x) F (x)| = n supy |Fn (F 1 (y )) y | y = F (x), x = F 1 (y ), y [0, 1] Fn (F 1 (y )) = 1 1 1 I (Xi F 1 (y )) = I (F (Xi ) y ) = I (Yi y ) n i=1 n i=1 n i=1
n n n

Y values generated independently of F .


P(Yi y ) = P(F (Xi ) y ) = P (Xi F 1 (y )) = F (F 1 (y )) = y
Xi F (x)
F (Xi ) uniform on [0, 1], independent of Y.

Dn is tabulated for dierent values of n, since not dependent on the distribution.


(nd table on pg. 570)
For large n, converges to another distribution, whose table you can alternatively use.
2 2 P(Dn t) H (t) = 1 2 i=1 (1)i1 e2i t The function represents Brownian Motion of a particle suspended in liquid.


Distribution - distance the particle travels from the starting point. The maximum distance is the distribution of Dn H(t) = distribution of the largest deviation of particle in liquid (Brownian Motion) Decision Rule:
} = {H1 : Dn c; H2 : Dn > c
Choose c such that the area to the right is equal to

Example: n = 10 data points:
0.58, 0.42, 0.52, 0.33, 0.43, 0.23, 0.58, 0.76, 0.53, 0.64
H1: P is uniform on [0, 1].
Step 1: Arrange in increasing order:
0.23, 0.33, 0.42, 0.43, 0.52, 0.53, 0.58, 0.58, 0.64, 0.76
Step 2: Find the largest difference between the empirical c.d.f. F_n(x) and F(x) = x.
Note: the largest difference will occur just before or just after a jump, so only those end points need to be considered.

x:                        0.23   0.33   0.42   ...
F(x) = x:                 0.23   0.33   0.42   ...
F_n(x) before jump:       0      0.1    0.2    ...
F_n(x) after jump:        0.1    0.2    0.3    ...
|F_n(x) - F(x)| before:   0.23   0.23   0.22   ...
|F_n(x) - F(x)| after:    0.13   0.13   0.12   ...

The largest difference occurs near the end: |0.9 - 0.64| = 0.26, so D_n = √10 (0.26) ≈ 0.82.
Decision Rule: δ = {H1: 0.82 ≤ c; H2: 0.82 > c}. For α = 0.05, c = 1.35. Conclusion: accept H1.
** End of Lecture 35
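scipy implements the same statistic for this example; note it reports the unscaled supremum sup|F_n - F| rather than √n times it. This check is my addition.

```python
import numpy as np
from scipy.stats import kstest

data = [0.58, 0.42, 0.52, 0.33, 0.43, 0.23, 0.58, 0.76, 0.53, 0.64]

result = kstest(data, 'uniform')               # compares against the U(0, 1) c.d.f. F(x) = x
print(result.statistic)                        # sup |F_n(x) - F(x)| = 0.26
print(np.sqrt(len(data)) * result.statistic)   # the notes' D_n, about 0.82
print(result.pvalue)                           # large p-value: no evidence against H1
```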


18.05 Lecture 36 May 11, 2005

Review of Test 2 (see solutions for more details)


Problem 1:
1 P(X = 2c) = 2 , P(X = 1 2
c) = 5 n EXn = ( 4 ) c
1 2 1 1 5 EX = 2c( 2 )+ 1 2 c( 2 ) = 4 c

Problem 2: X1 , ..., Xn n = 1000 1


P(Xi = 1) = 2 , P (Xi = 0) = 1 2 1
= EX = 2 , Var(X1 ) = p(1 p) = Sn = X1 + ... + Xn P(440 Sn k ) = 0.5

1 4

Sn 1000(1/2) Sn 500 Sn nEX1


=
250 nVar(X1 ) 1000(1/4) P( 440 500 k 500 ) = 0.5 z 250 250 by the Central Limit Theorem: 440 500 k 500 k 500 k 500 ) ( ) = ( ) (3.75) = ( ) 0.0001 = 0.5 ( 250 250 250 250 Therefore: k 500 k 500 = 0, k = 500 ) = 0.5001 ( 250 250 Problem 3: f (x) = n en e I (x e); () = +1 max +1 x ( xi )  xi

Easier to maximize the log-likelihood:

log () = n log() + n ( + 1) log

Problem 5:
Condence Intervals, keep in mind the formulas!
1 1 2 2 xc (x x ) x + c (x2 x2 ) n1 n1 Find c from the T distribution with n - 1 degrees of freedom.

 n n + n log xi = 0 = log xi n


Set up such that the area between -c and c is equal to 1 In this example, c = 1.833 n(x2 x2 ) n(x2 x2 ) 2 c1 c2 Find c from the chi-square distribution with n - 1 degrees of freedom.

Set up such that the area between c1 and c2 is equal to 1


In this example, c1 = 3.325, c2 = 16.92
Problem 4:
Prior Distribution:
f () = 1 e ()

Posterior Distribution:

n en f (x1 , ..., xn |) = +1 ( xi ) f (|x1 , ..., xn ) f ()f (x1 , ..., xn |)

Q Q n en 1 e = +n1 e+n e log xi = (+n)1 e( n+log xi ) ( xi ) Posterior = ( + n, n + log xi ) Bayes Estimator:

+n n + log xi 112

Final Exam Format Cumulative, emphasis on after Test 2.


9-10 questions.
Practice Test posted Friday afternoon.
Review Session on Tuesday Night - 5pm, Bring Questions!
Optional PSet:
pg. 548, Problem 3:
Gene has 3 alleles, so there are 6 possible combinations.
2 2 p1 = 1 , p2 = 2 , p3 = (1 1 2 )2
p4 = 21 2 , p5 = 21 (1 1 2 ), p6 = 22 (1 1 2 )
Number of categories r = 6, s = 2.
2 Free Parameters.
T =
r (Ni npi )2 i=1

npi

2 r s1=3 1 2 ))N6

2N1 2N2 (1 , 2 ) = 1 2
(1 1 2 )2N3 (21 2 )N4 (21 (1 1 2 ))N5 (22 (1 N4 +N5 +N6 2N1 +N4 +N5 2N2 +N4 +N6 (1 1 2 )2N3 +N5 +N6 2
1 =2

Maximize the log likelihood over the parameters.


log = const. + (2N1 + N4 + N5 ) log 1 + (2N2 + N4 + N6 ) log 2 + (2N3 + N5 + N6 ) log(1 1 2 )
Max over 1 , 2
log = a log1 + b log2 + c log (1 1 2 ) a c b c = = 0; = =0 1 1 2 1 1 2 2 1 1 2 a b = a2 = b1 1 2 a a1 a2 c1 = 0, a a1 b1 c1 = 0 1 = Write in terms of the givens:
1 = where n = Ni
2N1 + N2 + N5 1 2N2 + N4 + N6 1 = , 2 = = 2n 5 2n 2
a b , 2 = a+b+c a + b + c

Solve for 1 , 2


Decision Rule:
= {H1 : T c, H2 : T < c}
Find c values from chi-square dist. with r
s - 1 d.o.f. Area above c = c = 7.815
Problem 5:
There are 4 blood types (O, A, B, AB)
There are 2 Rhesus factors (+, -)
Test for independence:
+ - O 82 13 95 T = A 89 27 116 (82 B 54 7 61 AB 19 9 28 244 56 300

244(95) 2 300 ) 244(95) 300

+ ...

Find the T statistic for all 8 cells. 2 2 (a1)(b1) = 3 , and the test is same as before. ** End of Lecture 36


18.05 Lecture 37 May 17, 2005

Final Exam Review - solutions to practice nal.


u 1. f (x|v ) = {uv u xu1 e ( x v ) for x 0; 0 otherwise.} Find the MLE of v. Maximize v in the likelihood function = joint p.d.f.  P u (v ) = un v nu ( xi )u1 e (xi /v) i u log (v ) = n log u nu log v + (u 1) log( xi ) ( x v ) Maximize with respect to v.

nu nu u u (log (v )) (u+1) xi = v = +u xu =0 iv v v v v u xi v u xu i n = ( u+1 ) = u v vu vu = v=( 2. Xi , ..., Xn U [0, ] f (x|) = 1 I (0 x ) Prior: 1 u xi n

1 u 1 xi ) u MLE n

f () =

192 I ( 4) 4

Data: X1 = 5, X2 = 3, X3 = 8 Posterior: f (|x1 , ..., xn ) f (x1 , ..., xn |)f () 1 f (x1 , ..., xn |) = 1 n I (0 all xs ) = n I (max(X1 , ..., Xn ) ) 1 f (|x1 , ..., xn ) n+4 I ( 4)I (max(x1 , ..., xn ) ) n1 +4 I ( 8) Find constant so it integrates to 1. c c6 1 d n = 3 1 = | 8 = 8 6 = 1 1= c7 d n +4 6 6 8 8 c = 6 86 3. Two observations (X1 , X2 ) from f(x)
H1 : f (x) = 1/2, I (0 x 2)
H2 : f (x) = {1/2, 0 x 1, 2/3, 1 < x
2} H3 : f (x) = {3/4, 0 x 1, 1/4, 1 < x 2} minimizes 1 ( ) + 22 ( ) + 23 ( ) = P( = Hi |Hi ) i ( ) Find (i)i , Decision rule picks (i)fi (x1 , ..., xn ) max for each region. 115

(i)fi (x1 )fi (x2 ) both x1 , x2 [0, 1] point in [0, 1], [1, 2] both in [1, 2]

H1 (1)(1/2)(1/2) = 1/4 (1)(1/2)(1/2) = 1/4 (1)(1/2)(1/2) = 1/4

H2 (2)(1/3)(1/3) = 1/3 (2)(1/3)(2/3) = 4/9 (2)(1/3)(2/3) = 8/9

H3 (2)(3/4)(3/4) = 9/8 (2)(3/4)(1/4) = 3/8 (2)(1/4)(1/4) = 1/8

Decision Rule:
= {H1 : never pick , H2 : both in [1, 2], one in [0, 1], [1, 2] , H3 : both in [0, 1]}
If two hypotheses:
f1 (x) (2) > f2 (x) (1) Choose H1 , H2 4.
4. f(x|θ) = { (1/(x√(2π))) e^(−(ln x − θ)²/2) for x ≥ 0; 0 for x < 0 }.
If X has this distribution, find the distribution of ln X. Let Y = ln X.
c.d.f. of Y: P(Y ≤ y) = P(ln X ≤ y) = P(X ≤ e^y) = ∫_0^(e^y) f(x) dx.
However, you don't need to carry out the integral: differentiating in y,
p.d.f. of Y: f_Y(y) = f(e^y) · e^y = (1/(e^y √(2π))) e^(−(y − θ)²/2) · e^y = (1/√(2π)) e^(−(y − θ)²/2),
so Y = ln X ~ N(θ, 1).
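A quick Monte Carlo sanity check of this fact (not part of the notes; numpy assumed, and θ = 2 is an arbitrary choice).

import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
x = np.exp(theta + rng.standard_normal(100_000))  # X = e^(theta + Z) has the density above
y = np.log(x)
print(y.mean(), y.var())                          # close to theta and 1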

5. n = 10, H1: θ = 1, H2: θ = −1, with the X_i from the density of problem 4. We want α_1(δ) = P_1(f_1/f_2 < c) = 0.05.
δ = {H1: f_1(x)/f_2(x) ≥ c; H2: f_1(x)/f_2(x) < c}

f_1(x)/f_2(x) = [∏ (1/(x_i√(2π))) e^(−(ln x_i − 1)²/2)] / [∏ (1/(x_i√(2π))) e^(−(ln x_i + 1)²/2)] = e^(2 Σ ln x_i),

so the rule can be written in terms of Σ ln x_i (relabeling the constant):
δ = {H1: Σ ln x_i ≥ c = 4.81; H2: Σ ln x_i < c = 4.81}.

Under H1, ln x_i ~ N(1, 1), so Σ ln x_i ~ N(n, n) and
0.05 = P_1(Σ ln x_i ≤ c) = P_1((Σ ln x_i − n)/√n ≤ (c − n)/√n) = Φ((c − n)/√n)
⟹ (c − n)/√n = −1.64 ⟹ c = n − 1.64√n = 4.81.

Power = 1 − (type 2 error) = 1 − P_2(δ ≠ H2) = 1 − P_2(Σ ln x_i ≥ c) = P_2(Σ ln x_i < c)
= P_2((Σ ln x_i − n(−1))/√n < (4.81 + 10)/√10) = Φ(4.68) ≈ 1.
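The cutoff and the power can be reproduced numerically; a sketch (scipy assumed), using the exact N(n, n) and N(−n, n) distributions of Σ ln x_i under H1 and H2.

import numpy as np
from scipy.stats import norm

n = 10
c = n + norm.ppf(0.05) * np.sqrt(n)            # about 10 - 1.645*sqrt(10) = 4.80
power = norm.cdf((c - n * (-1)) / np.sqrt(n))  # P2(sum of ln x_i < c)
print(c, power)                                # the power is essentially 1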


6. H1: p_1 = θ/2, p_2 = θ/3, p_3 = 1 − 5θ/6 for some θ ∈ [0, 1].

Step 1) Find the MLE θ̂.
Step 2) Compute p̂_1 = θ̂/2, p̂_2 = θ̂/3, p̂_3 = 1 − 5θ̂/6.
Step 3) Calculate the T statistic:
T = Σ_{i=1}^{r} (N_i − n p̂_i)² / (n p̂_i) ~ χ² with r − s − 1 = 3 − 1 − 1 = 1 d.o.f.

Likelihood:
φ(θ) = (θ/2)^(N_1) (θ/3)^(N_2) (1 − 5θ/6)^(N_3)
log φ(θ) = (N_1 + N_2) log θ + N_3 log(1 − 5θ/6) − N_1 log 2 − N_2 log 3 → max
(N_1 + N_2)/θ − (5/6) N_3 / (1 − 5θ/6) = 0
(N_1 + N_2)(1 − 5θ/6) − (5/6) θ N_3 = 0 ⟹ (N_1 + N_2) − (5/6) θ (N_1 + N_2 + N_3) = 0
Solve for θ: θ̂ = 6(N_1 + N_2)/(5n), where n = N_1 + N_2 + N_3.
Compute the statistic: T = 0.586.
δ = {H1: T ≤ 3.841, H2: T > 3.841} ⟹ Accept H1.

7. n = 17, X̄ = 3.2, and the sample variance σ̂² = (1/n) Σ (X_i − X̄)² = 0.09, from N(μ, σ²).
H1: μ ≤ 3, H2: μ > 3 at α = 0.05.
T = (X̄ − 3)/√(σ̂²/(n − 1)) = (3.2 − 3)/√(0.09/16) = 2.67 ~ t_(n−1)

Choose the decision rule using the t distribution with 17 − 1 = 16 degrees of freedom:

δ = {H1: T ≤ 1.746, H2: T > 1.746}. Since T = 2.67 > 1.746, H1 is rejected.
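A short Python sketch of this one-sample t test (scipy assumed), using the summary numbers above.

import numpy as np
from scipy.stats import t

n, xbar, s2 = 17, 3.2, 0.09                   # s2 = (1/n) * sum((x_i - xbar)^2)
T = (xbar - 3.0) / np.sqrt(s2 / (n - 1))
c = t.ppf(0.95, df=n - 1)                     # 1.746
print(T, c, "reject H1" if T > c else "accept H1")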


8. Calculate the T statistic for the contingency table:
T = Σ_{i,j} (N_ij − N_{i+} N_{+j}/n)² / (N_{i+} N_{+j}/n) = 12.1
The statistic is χ² with (a − 1)(b − 1) = 3·2 = 6 d.o.f.; at α = 0.05, c = 12.59.
δ = {H1: T ≤ 12.59, H2: T > 12.59} ⟹ Accept H1.
Note, however, that if the significance level changes to α = 0.10, then c = 10.64 and H1 would be rejected.
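The flip of the decision with the level α is just the movement of the chi-square(6) cutoff; a two-line check with scipy:

from scipy.stats import chi2

for alpha in (0.05, 0.10):
    print(alpha, chi2.ppf(1 - alpha, df=6))   # 12.59 and 10.64; compare with T = 12.1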

9.
f(x) = (1/2) I(0 ≤ x ≤ 2)
F(x) = ∫_(−∞)^x f(t) dt = x/2 for 0 ≤ x ≤ 2.

x      F(x)   F_n(x) before   F_n(x) after   |diff| before   |diff| after
0.02   0.01   0               0.1            0.01            0.09
0.18   0.09   0.1             0.2            0.01            0.11
0.20   0.10   0.2             0.3            0.1             0.2
...    ...    ...             ...            ...             ...

Δ_n = max_x |F(x) − F_n(x)| = 0.295; the critical value c for α = 0.05 is 1.35.
D_n = √n · Δ_n = √10 (0.295) = 0.932872
δ = {H1: 0.932872 ≤ 1.35, H2: 0.932872 > 1.35} ⟹ Accept H1.
** End of Lecture 37
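A Python sketch of the Kolmogorov-Smirnov computation above (numpy assumed); only the first three data points come from the notes, the rest are made up to have n = 10.

import numpy as np

x = np.sort([0.02, 0.18, 0.20, 0.55, 0.81, 1.05, 1.33, 1.47, 1.68, 1.91])
n = len(x)
F = x / 2.0                          # hypothesized c.d.f. F(x) = x/2 on [0, 2]
before = np.arange(0, n) / n         # empirical c.d.f. just before each x_i
after = np.arange(1, n + 1) / n      # empirical c.d.f. at and after each x_i
sup = np.max(np.maximum(np.abs(F - before), np.abs(F - after)))
D = np.sqrt(n) * sup
print(sup, D, "accept H1" if D <= 1.35 else "reject H1")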

*** End of 18.05 Spring 2005 Lecture Notes.


18.05. Practice test 1.
(1) Suppose that 10 cards, of which five are red and five are green, are placed at random in 10 envelopes, of which five are red and five are green. Determine the probability that exactly two envelopes will contain a card with a matching color.
(2) Suppose that a box contains one fair coin and one coin with a head on each side. Suppose that a coin is selected at random and that when it is tossed three times, a head is obtained three times. Determine the probability that the coin is the fair coin.
(3) Suppose that either of two instruments might be used for making a certain measurement. Instrument 1 yields a measurement whose p.d.f. is
f_1(x) = { 2x for 0 < x < 1; 0 otherwise }.
Instrument 2 yields a measurement whose p.d.f. is
f_2(x) = { 3x² for 0 < x < 1; 0 otherwise }.

Suppose that one of the two instruments is chosen at random and a measurement X is made with it. (a) Determine the marginal p.d.f. of X. (b) If X = 1/4, what is the probability that instrument 1 was used?
(4) Let Z be the rate at which customers are served in a queue. Assume that Z has p.d.f.
f(z) = { 2e^(−2z) for z > 0; 0 otherwise }.
Find the p.d.f. of the average waiting time T = 1/Z.
(5) Suppose that X and Y are independent random variables, each with p.d.f.
f(x) = { e^(−x) for x > 0; 0 otherwise }.
Determine the joint p.d.f. of the random variables U = X/(X + Y) and V = X + Y.

18.05. Practice test 2.
(1) page 280, No. 5
(2) page 291, No. 11
(3) page 354, No. 10
(4) Suppose that X_1, ..., X_n form a random sample from a distribution with p.d.f.
f(x|θ) = { e^(θ−x) for x ≥ θ; 0 for x < θ }.
Find the MLE of the unknown parameter θ.
(5) page 415, No. 7. (Also compute a 90% confidence interval for σ².)

Extra practice problems: page 196, No. 9; page 346, No. 19; page 396, No. 10; page 409, No. 3; page 415, No. 3.

Go over psets 5, 6, 7 and examples in class.

18.05. Test 1.
(1) Consider events A = {HHH at least once} and B = {TTT at least once}. We want to find the probability P(A ∩ B). The complement of A ∩ B is A^c ∪ B^c, i.e. no TTT or no HHH, and P(A ∩ B) = 1 − P(A^c ∪ B^c). To find the last one we can use the probability of a union formula:
P(A^c ∪ B^c) = P(A^c) + P(B^c) − P(A^c ∩ B^c).
The event A^c, i.e. no HHH, means that on each toss we don't get HHH. The probability not to get HHH on one toss is 7/8 and therefore
P(A^c) = (7/8)^10.
The same holds for P(B^c). The event A^c ∩ B^c, i.e. no HHH and no TTT, means that on each toss we get neither HHH nor TTT. The probability to get neither HHH nor TTT on one toss is 6/8 and, therefore,
P(A^c ∩ B^c) = (6/8)^10.
Finally, we get
P(A ∩ B) = 1 − (7/8)^10 − (7/8)^10 + (6/8)^10 = 1 − 2(7/8)^10 + (6/8)^10.
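Numerically (a one-line check, not part of the original solution):

print(1 - 2 * (7/8)**10 + (6/8)**10)   # about 0.53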

(2) We have P(F) = P(M) = 0.5, P(CB|M) = 0.05 and P(CB|F) = 0.0025. Using Bayes' formula,
P(M|CB) = P(CB|M)P(M) / (P(CB|M)P(M) + P(CB|F)P(F)) = (0.05)(0.5) / ((0.05)(0.5) + (0.0025)(0.5)).

(3) We want to find
f(y|x) = f(x, y)/f_1(x),
which is defined only when f_1(x) > 0. To find f_1(x) we have to integrate out y, i.e.
f_1(x) = ∫ f(x, y) dy.
To find the limits we notice that for a given x, 0 < y² < 1 − x², which is not empty only if x² < 1, i.e. −1 < x < 1. Then −√(1 − x²) < y < √(1 − x²). So if −1 < x < 1 we get
f_1(x) = ∫_(−√(1−x²))^(√(1−x²)) c(x² + y²) dy = c(x²y + y³/3) |_(−√(1−x²))^(√(1−x²)) = 2c(x²√(1 − x²) + (1/3)(1 − x²)^(3/2)).
Finally, for −1 < x < 1,
f(y|x) = c(x² + y²) / [2c(x²√(1 − x²) + (1/3)(1 − x²)^(3/2))] = (x² + y²) / (2x²√(1 − x²) + (2/3)(1 − x²)^(3/2))
if −√(1 − x²) < y < √(1 − x²), and 0 otherwise.

(4) Let us find the c.d.f. first:
P(Y ≤ y) = P(max(X_1, X_2) ≤ y) = P(X_1 ≤ y, X_2 ≤ y) = P(X_1 ≤ y) P(X_2 ≤ y).
The c.d.f. of X_1 and X_2 is
P(X_1 ≤ y) = P(X_2 ≤ y) = ∫_(−∞)^y f(x) dx.
If y ≤ 0, this is
P(X_1 ≤ y) = ∫_(−∞)^y e^x dx = e^x |_(−∞)^y = e^y,
and if y > 0 this is
P(X_1 ≤ y) = ∫_(−∞)^0 e^x dx = 1.
Finally, the c.d.f. of Y is
P(Y ≤ y) = { e^(2y) for y ≤ 0; 1 for y > 0 }.
Taking the derivative, the p.d.f. of Y is
f(y) = { 2e^(2y) for y ≤ 0; 0 for y > 0 }.


[Figure 1: Region {x ≤ zy} for z ≤ 1 and z > 1.]

(5) Let us find the c.d.f. of Z = X/Y first. Note that for X, Y ∈ (0, 1), Z can take only values > 0, so let z > 0. Then
P(Z ≤ z) = P(X/Y ≤ z) = P(X ≤ zY) = ∫∫_{x ≤ zy} f(x, y) dx dy.
To find the limits, we have to consider the intersection of the set {x ≤ zy} with the square 0 < x < 1, 0 < y < 1. When z ≤ 1, the limits are
∫_0^1 ∫_0^(zy) (x + y) dx dy = ∫_0^1 (x²/2 + xy) |_0^(zy) dy = ∫_0^1 (z²/2 + z) y² dy = z²/6 + z/3.
When z > 1, the limits are different:
∫_0^1 ∫_(x/z)^1 (x + y) dy dx = ∫_0^1 (xy + y²/2) |_(x/z)^1 dx = 1 − 1/(6z²) − 1/(3z).
So the c.d.f. of Z is
P(Z ≤ z) = { z²/6 + z/3 for 0 < z ≤ 1; 1 − 1/(6z²) − 1/(3z) for z > 1 },
and zero otherwise, i.e. for z ≤ 0. The p.d.f. is
f(z) = { z/3 + 1/3 for 0 < z ≤ 1; 1/(3z³) + 1/(3z²) for z > 1 },
and zero otherwise.

18.05. Test 2.
(1) Let X be the player's fortune after one play. Then
P(X = 2c) = 1/2 and P(X = c/2) = 1/2,
and the expected value is
EX = 2c · (1/2) + (c/2) · (1/2) = (5/4)c.
Repeating this n times, we get the expected value after n plays, (5/4)^n c.

(2) Let X_i, i = 1, ..., n = 1000 be the indicators of getting heads. Then S_n = X_1 + ... + X_n is the total number of heads. We want to find k such that P(440 ≤ S_n ≤ k) ≈ 0.5. Since μ = EX_i = 0.5 and σ² = Var(X_i) = 0.25, by the central limit theorem
Z = (S_n − nμ)/(σ√n) = (S_n − 500)/√250
is approximately standard normal, i.e.
P(440 ≤ S_n ≤ k) = P((440 − 500)/√250 = −3.79 ≤ Z ≤ (k − 500)/√250) ≈ Φ((k − 500)/√250) − Φ(−3.79) = 0.5.
From the table we find that Φ(−3.79) = 0.0001 and therefore Φ((k − 500)/√250) = 0.5001. Using the table once again, we get (k − 500)/√250 ≈ 0, so k ≈ 500.
and the log-likelihood is

n en ( ) = ( Xi )+1
Xi .

log ( ) = n log + n ( + 1) log

We want to nd the maximum of log-likelihood so taking the derivative we get


n + n log Xi = 0 and solving for , the MLE is = (4) The prior distribution is f ( ) = log n . Xi n

1 e ()

and the joint p.d.f. of X1 , . . . , Xn is n en f (X1 , . . . , Xn | ) = . ( Xi )+1

Therefore, the posterior is proportional to (as usual, we keep track only of the terms that depend on ) 1 +n1 e+n n en = f ( |X1 , . . . , Xn ) 1 e ( Xi )+1 Xi ( Xi ) Q Q +n1 e+n log Xi = (+n)1 e( n+log Xi ) . This shows that the posterior is again a gamma distribution with parameters
( + n, n + log Xi ). = +n . n + log Xi

Bayes estimate is the expectation of the posterior which in this case is

(5) The condence interval for is given by 1 1 2) X +c 2) c (X 2 X X (X 2 X n1 n1 where c that corresponds to 90% condence is found from the condition t101 (c) t101 (c) = 0.9

or t9 (c) = 0.95 and c = 1.833. The condence interval for 2 is 2) 2) n(X 2 X n(X 2 X 2 c2 c2 where c1 , c2 satisfy
2 2 101 (c1 ) = 0.05 and 101 (c2 ) = 0.95,

and c1 = 3.325, c2 = 16.92.

You might also like