Consider observing some phenomenon with exactly two possible outcomes (say,
success and failure) repeatedly until the first success occurs, where the trials
are independent of one another. Then it can be shown that the probability
function of the number Y of trials until the first success occurs is given by
$p(y|p) = p(1 - p)^{y-1}$, for y = 1, 2, 3, ...
The parameter p is the probability of success. For example, suppose we examine
astronomical objects at random, looking for a particular kind of object, and
count the number of objects examined until the first object of that kind is found.
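As a rough illustration, the geometric probability function P(Y = y) = (1 - p)^(y-1) p can be evaluated directly. The sketch below is not part of the original notes; the function name `geometric_pmf` and the value p = 0.1 are assumptions chosen only for illustration.

```python
def geometric_pmf(y: int, p: float) -> float:
    """P(Y = y): the first success occurs on trial y (y = 1, 2, ...)."""
    if y < 1:
        return 0.0
    return (1.0 - p) ** (y - 1) * p


p = 0.1  # hypothetical success probability, for illustration only

# Probabilities that the first success occurs on trial 1, 2, ..., 5.
first_five = [geometric_pmf(y, p) for y in range(1, 6)]
print(first_five)

# Summing far enough out, the probabilities add up to (essentially) 1,
# and the mean number of trials is 1/p.
total = sum(geometric_pmf(y, p) for y in range(1, 2000))
mean_trials = sum(y * geometric_pmf(y, p) for y in range(1, 2000))
print(total, mean_trials)
```

With p = 0.1 the mean number of objects examined until the first success is 1/p = 10.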
$E(X^2) = \sum_k k^2 P(X = k)$
The mean and variance of the Poisson distribution are both equal to λ. For
'large' values of λ, say λ > 25 (or even smaller), the Poisson distribution is
approximately normal. A probability histogram of the Poisson distribution with
λ = 25 is given below.
[Figure: Scatterplot of p(y|25) vs y, for y from 10 to 40; the probabilities range from 0.00 to about 0.08.]
What does the distribution look like? Yeah, normal! So, if λ is large, one can
approximate Poisson probabilities using the normal distribution with mean λ and
standard deviation √λ.
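To see how close the normal approximation is, the sketch below compares an exact Poisson probability with its normal approximation for a mean of 25. The helper names, the interval 20 to 30, and the use of a continuity correction (adding or subtracting 0.5 at the endpoints) are assumptions added here, not taken from the notes.

```python
import math


def poisson_pmf(y: int, lam: float) -> float:
    """Poisson probability P(Y = y), computed on the log scale for stability."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


lam = 25.0

# Exact P(20 <= Y <= 30) from the Poisson probability function.
exact = sum(poisson_pmf(y, lam) for y in range(20, 31))

# Normal approximation with mean lam and standard deviation sqrt(lam);
# the +/- 0.5 is a continuity correction (an assumption of this sketch).
sigma = math.sqrt(lam)
approx = normal_cdf(30.5, lam, sigma) - normal_cdf(19.5, lam, sigma)

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```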
Definition. A continuous random variable X is one for which the outcome can be
any number in an interval or collection of intervals.
Probabilities are obtained as areas under a curve, called the probability density
function f(x). Below is a graph of the pdf $f(x|20) = \frac{1}{20} e^{-x/20}$,
for x > 0 and 0 elsewhere; it is called the exponential pdf with mean µ = 20.
The standard deviation is also 20.
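The normal pdf with mean µ and standard deviation σ is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$, for all real x.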
The graph of a normal pdf is the (familiar) unimodal, symmetric, bell-shaped curve.
The CDF Φ(x) is an elongated S-shaped curve. The mean and variance of a
normal distribution are the parameters µ and σ². Many natural phenomena have
normal distributions: physical measurements, astronomical variables, etc.
Descriptive Statistics.
Types of Data: We classify all ‘data’ about a variable into two types:
Numerical (also called quantitative) variables are divided into two types: discrete
and continuous.
Samples. When we obtain a sample from the population we also say we obtained a
sample from the probability distribution.
Numerical Summaries:
1. Measures of Location:
Three commonly used measures of the center of a set of numerical values
are the mean, median, and trimmed mean.
The First Quartile Q1 is the median of the numbers below the median, or the
25th percentile.
The Third Quartile Q3 is the median of the numbers above the median, or the
75th percentile.
97.1 97.2 97.4 97.6 97.8 97.8 98.0 98.2 98.2 98.2 98.4
98.4 98.5 98.6 98.6 99.0 99.2 99.7
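The quartile definition above (median of the lower half and median of the upper half) can be sketched for these 18 values as follows. The function name `quartiles` is hypothetical, and statistical packages often interpolate between observations instead, so their Q1 and Q3 may differ slightly from this rule.

```python
from statistics import median

body_temp = [97.1, 97.2, 97.4, 97.6, 97.8, 97.8, 98.0, 98.2, 98.2, 98.2, 98.4,
             98.4, 98.5, 98.6, 98.6, 99.0, 99.2, 99.7]


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n, the middle value is left out of both halves.
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return median(lower), median(xs), median(upper)


q1, med, q3 = quartiles(body_temp)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr:.1f}")
```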
[Figure: Boxplot of BodyTemp; vertical axis BodyTemp, from 97.0 to 100.0.]
Outliers:
An observation is a mild outlier if it is more than 1.5 IQR’s below Q1 or 1.5 IQR’s
above Q3. It is an extreme outlier if it is more than 3 IQR’s below Q1 or above
Q3.
Software packages often identify outliers in some fashion; e.g., Minitab puts an ‘*’
for outliers (not necessarily all of them though).
Example. Number of CDs owned by students in statistics classes at Penn State
University:
Extreme Outliers: 3 IQR = (3)(75) = 225. Extreme outliers are values of #CDs
below 25 - 225 = -200 (a negative value, so none occur on the low side) or above
100 + 225 = 325. By this rule, there are several extreme outliers. See the
boxplot below.
[Figure: Boxplot of CDs; vertical axis CDs, with tick marks from 100 to 500.]
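The outlier rules can be written as a small helper. The sketch below uses Q1 = 25 and Q3 = 100 from the CDs example; the function names and the test values 50, 250, and 400 are hypothetical and only illustrate the classification.

```python
def outlier_fences(q1: float, q3: float) -> dict:
    """Return the mild (1.5 IQR) and extreme (3 IQR) fences around the quartiles."""
    iqr = q3 - q1
    return {
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }


def classify(x: float, q1: float, q3: float) -> str:
    """Label a value as 'extreme', 'mild', or 'not an outlier'."""
    fences = outlier_fences(q1, q3)
    low_ext, high_ext = fences["extreme"]
    low_mild, high_mild = fences["mild"]
    if x < low_ext or x > high_ext:
        return "extreme"
    if x < low_mild or x > high_mild:
        return "mild"
    return "not an outlier"


# Quartiles from the CDs example: Q1 = 25, Q3 = 100, so IQR = 75.
fences = outlier_fences(25, 100)
print(fences)

# Hypothetical values of #CDs, used only to show the three labels.
for value in (50, 250, 400):
    print(value, classify(value, 25, 100))
```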
Stem-and-Leaf Plots: A stem-and-leaf plot is a graphical display of data
consisting of a stem (the most important part of a number) and leaves (the
second most important part of a number).
109 0 00000001111111111111111111111112222222222222222222222222222+
(56) 0 55555555555555555555555555555556666666667777777888899999
71 1 00000000000000000000000001122
42 1 555555555555555
27 2 00000000001
16 2 55555
11 3 000000
5 3 5
4 4 0
3 4 55
1 5 0
The average salary is $200K. Delete the highest salary and find that the mean is
$100K. Delete the two highest salaries and calculate the mean to be $20K. The
median is $20K in both situations. The median is a resistant statistic and the
average is not.
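The contrast between a resistant and a non-resistant statistic is easy to reproduce. The salaries below are hypothetical (they are not the data behind the example above); they only show how one very large value moves the mean while leaving the median alone.

```python
from statistics import mean, median

# Hypothetical salaries in $K: nine ordinary salaries and one very large one.
salaries = [20] * 9 + [1000]

mean_all = mean(salaries)
median_all = median(salaries)

# Delete the highest salary and recompute both statistics.
trimmed = sorted(salaries)[:-1]
mean_trimmed = mean(trimmed)
median_trimmed = median(trimmed)

print(f"all salaries:    mean = {mean_all}, median = {median_all}")
print(f"highest deleted: mean = {mean_trimmed}, median = {median_trimmed}")
```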
Example 2. Remove the 5 extreme outliers in the CDs dataset and redo the
descriptive statistics.
Note that the only statistic in the 5-number summary that changed was the Max
(which had to change!). Note also that the mean decreased.
Examples. Resistant Statistics: Median, 1st and 3rd quartiles, and IQR (for
moderate samples, roughly n = 10 or more).
The Standard Deviation (SD) is roughly the average distance values are from the
mean. The actual definition of the standard deviation (sd), denoted by s, is the
square root of the sample variance s², where
$s^2 = \sum_i (x_i - \bar{x})^2 / (n - 1)$
and $\sum_i (x_i - \bar{x})^2$ is the sum of squared deviations of the values
from the mean. The sd is not a resistant statistic.
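As a check on the definition, the sketch below computes the sample variance and standard deviation of the 18 body temperatures listed earlier, dividing by n - 1. The function name `sample_variance` is hypothetical; the result can be compared with the Minitab summary for these data (Mean 98.22, StDev 0.6836).

```python
import math
from statistics import stdev

body_temp = [97.1, 97.2, 97.4, 97.6, 97.8, 97.8, 98.0, 98.2, 98.2, 98.2, 98.4,
             98.4, 98.5, 98.6, 98.6, 99.0, 99.2, 99.7]


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


xbar = sum(body_temp) / len(body_temp)
s = math.sqrt(sample_variance(body_temp))
print(f"mean = {xbar:.2f}, s = {s:.4f}")

# The standard library's stdev uses the same n - 1 definition.
print(f"statistics.stdev = {stdev(body_temp):.4f}")
```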
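Empirical Rule says that if the data are symmetric and bell-shaped (unimodal),
indicative of a normal distribution, then approximately
68% of the values fall within 1 standard deviation of the mean,
95% of the values fall within 2 standard deviations of the mean, and
99.7% of the values fall within 3 standard deviations of the mean.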
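The percentages usually quoted with the empirical rule are normal-distribution probabilities: for a normal variable, the chance of falling within k standard deviations of the mean is erf(k/√2). A minimal sketch, with the function name `within_k_sd` being a hypothetical choice:

```python
import math


def within_k_sd(k: float) -> float:
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} sd: {100 * within_k_sd(k):.1f}%")
```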
For the variable ‘ln(flu_tot)’, we find that the intervals and the percentages are as
follows:
[Figure: two panels plotting lf_tot (about -8 to -4) and flutot (0.0000 to 0.0007) against gmark (groups 1 to 4).]
Standard Errors. The sample mean xbar is a statistic (a quantity calculated from
a sample). As such, it varies from sample to sample and hence is a random
variable with a probability distribution. It can be shown that the standard
deviation of Xbar is
$SD(\bar{X}) = \sigma / \sqrt{n}$,
where σ is the standard deviation of the population from which the sample is
taken and n is the size of the sample.
The standard error of the mean, denoted by se mean or s.e.(xbar), is the estimated
standard deviation of Xbar:
$s.e.(\bar{x}) = s / \sqrt{n}$.
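For a concrete number, the sketch below computes the standard error of the mean from the body temperature summary used in these notes (n = 18, StDev 0.6836), dividing s by the square root of n. The function name is a hypothetical choice.

```python
import math


def standard_error(s: float, n: int) -> float:
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)


# Summary values for the body temperature sample: n = 18, s = 0.6836.
se_mean = standard_error(0.6836, 18)
print(f"s.e.(xbar) = {se_mean:.4f}")
```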
The empirical CDF (ECDF) is a step function that jumps by 1/n at each value in
the sample (or by k/n if there are k identical values at some value).
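Example. Body Temperature. Here are the values of n = 18 body
temperatures and a graph of the ECDF:
97.1 97.2 97.4 97.6 97.8
97.8 98.0 98.2 98.2 98.2
98.4 98.4 98.5 98.6 98.6
99.0 99.2 99.7
[Figure: Empirical CDF of BodyTemp (Percent versus BodyTemp) with a fitted Normal CDF; Mean 98.22, StDev 0.6836.]
The ECDF equals 0 for x < 97.1, 1/18 for 97.1 ≤ x < 97.2, and so on, up to 1
for x ≥ 99.7.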
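An ECDF is simple to compute: at any x it is the fraction of sample values less than or equal to x. The sketch below builds it for the 18 body temperatures; the function name `make_ecdf` is hypothetical.

```python
from bisect import bisect_right

body_temp = [97.1, 97.2, 97.4, 97.6, 97.8, 97.8, 98.0, 98.2, 98.2, 98.2, 98.4,
             98.4, 98.5, 98.6, 98.6, 99.0, 99.2, 99.7]


def make_ecdf(sample):
    """Return a function x -> (number of sample values <= x) / n."""
    xs = sorted(sample)
    n = len(xs)

    def ecdf(x: float) -> float:
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(xs, x) / n

    return ecdf


ecdf = make_ecdf(body_temp)
for x in (97.0, 97.1, 98.2, 99.7):
    print(f"ECDF({x}) = {ecdf(x):.4f}")
```

The jump at 98.2 is 3/18 rather than 1/18 because three observations share that value.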
These plots are used to determine whether a particular distribution fits data or to
compare different sample distributions. Q-Q Plots graph quantiles of one variable
against quantiles of a second variable. Two uses for quantile-quantile plots are:
The basic idea of Q-Q Plots is to plot the quantiles of the two distributions
(either 1 or 2 above) against one another. If the two distributions are roughly
the same, then their quantiles should be about the same, so a plot of the
quantiles against one another should be roughly a straight line.
Example: Normal Q-Q and Probability Plots.
Sample: We have n = 5 observations: 20, 24, 25, 27, and 30. Their mean is 25.2
and their standard deviation is s = 3.70135.
 x    i    p = p_i     Q_p (pth quantile)    other p's    other q's
20    1    0.129630    21.0243               0.1          20.4565
24    2    0.314815    23.4150               0.2          22.0849
25    3    0.500000    25.2000               0.3          23.2590
27    4    0.685185    26.9850               0.4          24.2623
30    5    0.870370    29.3757               0.5          25.2000
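The table can be reproduced with a few lines of code. The tabulated p values are consistent with the plotting positions p_i = (i - 0.3)/(n + 0.4); that formula is inferred from the numbers here, not stated in the notes, so treat it as an assumption. The quantiles are those of a normal distribution with the sample mean and standard deviation.

```python
from statistics import NormalDist, mean, stdev

x = [20, 24, 25, 27, 30]
n = len(x)
xbar, s = mean(x), stdev(x)

# Normal distribution with the sample mean and sample standard deviation.
fitted = NormalDist(mu=xbar, sigma=s)

rows = []
for i, xi in enumerate(sorted(x), start=1):
    p_i = (i - 0.3) / (n + 0.4)   # assumed plotting position (see lead-in)
    q_p = fitted.inv_cdf(p_i)     # pth quantile of the fitted normal
    rows.append((xi, i, p_i, q_p))
    print(f"{xi:3d} {i:2d} {p_i:.6f} {q_p:.4f}")

# Quantiles at other values of p, as in the last two columns of the table.
other_q = [fitted.inv_cdf(p) for p in (0.1, 0.2, 0.3, 0.4, 0.5)]
print([round(q, 4) for q in other_q])
```

Plotting the first column against the last gives the normal Q-Q plot; if the points lie close to a straight line, the normal distribution is a reasonable fit.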
The graph on the bottom left is a plot of quantiles vs. p (so a normal probability
plot)
The graph on the bottom right is a normal probability plot (Terminology depends
on scale of y-axis)
[Figure: three panels. (1) pth quantile versus x. (2) Probability Plot of x, Normal (Percent versus x): Mean 25.2, StDev 3.701, N 5, AD 0.161, P-Value 0.884. (3) p versus pth quantile.]
Example. Let’s re-examine the astronomy example on page 10: data from
Mukherjee, Feigelson, Babu, et al. in “Three types of Gamma-Ray Bursts”. The
five-number summary of the two variables ‘flu_tot’ and ‘ln(flu_tot)’ is given
again, along with histograms of the data.
[Figure: four panels. Histograms (Frequency) of flu_tot and of C13, and probability plots (Percent) of C13 and of ln(flu_tot).]
The information in the box within the right hand graph is reproduced below: