
Lecture 3: Descriptive Statistics

Matt Golder & Sona Golder


Pennsylvania State University
Introduction
In a broad sense, making an inference implies partially or completely describing
a phenomenon.
Before discussing inference making, we must have a method for characterizing
or describing a set of numbers.
The characterizations must be meaningful so that knowledge of the descriptive measures enables us to clearly visualize a set of numbers.
Generally, we can characterize a set of numbers using either graphical or
numerical methods.
We're now going to look at numerical methods.
Numerical Methods
Graphical displays are not usually adequate for the purpose of making
inferences.
We need rigorously defined quantities for summarizing the information contained in the sample. These sample quantities typically have mathematical properties that allow us to make probability statements regarding the goodness of our inferences.
The quantities we dene are descriptive measures of a set of data.
Descriptive Measures
Typically, we are interested in two types of descriptive numbers: (i) measures of
central tendency and (ii) measures of dispersion or variation.
Measures of central tendency, sometimes called location statistics,
summarize a distribution by its typical value.
Measures of dispersion indicate, through a single number, the extent to which a distribution's observations differ from one another.
Together with information on its shape, the measures convey the most
distinctive aspects of a distribution.
Central Tendency
There are three basic measures of central tendency:
1. Mean
2. Median
3. Mode
Central Tendency
Redskins 7 Giants 16 Bengals 10 Ravens 17
Jets 20 Dolphins 14 Chiefs 10 Patriots 17
Texans 17 Steelers 38 Jaguars 10 Titans 17
Lions 21 Falcons 34 Seahawks 10 Bills 34
Rams 3 Eagles 38 Buccaneers 20 Saints 24
Bears 29 Colts 13 Panthers 26 Chargers 24
Cardinals 23 49ers 13 Cowboys 28 Browns 10
Vikings 19 Packers 24 Broncos 41 Raiders 14
Figure: Points Scored in Week One of the 2008 NFL Season
[Histograms of points scored: density and frequency on the vertical axes, points scored on the horizontal axis]
Mean
The mean is also known as an expected value or, colloquially, as an average.
The mean of a sample of N measured responses $X_1, X_2, \ldots, X_N$ is given by:

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i = \frac{1}{N}(X_1 + X_2 + \ldots + X_N) = \frac{X_1 + X_2 + \ldots + X_N}{N}$$

The symbol $\bar{X}$, read "X bar," refers to the sample mean. $\mu$ is used to denote the population mean.
Mean
Calculating a mean only makes sense if the data are numeric, and (in general)
if the variable in question is measured at least at the interval level.
While the mean of an ordinal variable might provide some useful information, the fact that the data are only ordinal means that the assumption necessary for summation (that the values of the variable are equally spaced) is questionable.
It never makes sense to calculate the mean of a nominal-level variable.
Mean
It's common to think of the mean as the "balance point" of the data; that is, the point in X at which, if every value of X was given weight according to its size, the data would balance.

Thus, if we add a single data point (call it $X_{N+1}$) to an existing variable X, the new mean (using N + 1 observations) will:
- be greater than the old one if $X_{N+1} > \bar{X}$,
- be less than the old one if $X_{N+1} < \bar{X}$, and
- be the same as the old one if $X_{N+1} = \bar{X}$.
Mean
A mean can also be thought of as the value of X that minimizes the squared deviations between itself and every value of $X_i$. That is, suppose we are interested in choosing a value for X (call it $\mu$) that minimizes:

$$f(\mu) = \sum_{i=1}^{N}(X_i - \mu)^2 = \sum_{i=1}^{N}\left(X_i^2 + \mu^2 - 2\mu X_i\right)$$
Mean
To find the minimum, we need to calculate $\frac{\partial f(\mu)}{\partial \mu}$, set that equal to zero, and solve.

$$\frac{\partial f(\mu)}{\partial \mu} = \sum_{i=1}^{N}(2\mu - 2X_i)$$

$$\sum_{i=1}^{N}(2\mu - 2X_i) = 0$$

$$2N\mu - 2\sum_{i=1}^{N} X_i = 0$$

$$2N\mu = 2\sum_{i=1}^{N} X_i$$

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i \equiv \bar{X}$$
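As a quick numerical check of this result, here is a short R sketch (the points vector holds the Week One scores from the table above) that minimizes the sum of squared deviations directly and recovers the sample mean:

points <- c(7, 16, 10, 17, 20, 14, 10, 17, 17, 38, 10, 17,
            21, 34, 10, 34, 3, 38, 20, 24, 29, 13, 26, 24,
            23, 13, 28, 10, 19, 24, 41, 14)

# Sum of squared deviations from a candidate center m
f <- function(m) sum((points - m)^2)

# Numerical minimization over a plausible range of scores
optimize(f, interval = c(0, 50))$minimum  # approximately 20.03
mean(points)                              # 20.03125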
Mean
Note that the mean is very susceptible to the effect of outliers in the data. Exceptionally large values in X will tend to pull the mean in their direction, and so can provide a misleading picture of the actual central location of the variable in question.
Geometric Mean
So far, we have looked at what formally is called the arithmetic mean.
There are two other potentially useful variants on the mean: the geometric
mean and the harmonic mean.
The geometric mean is defined as:

$$\bar{X}_G = \left(\prod_{i=1}^{N} X_i\right)^{1/N}$$

which can also be written as:

$$\bar{X}_G = \sqrt[N]{X_1 \times X_2 \times \ldots \times X_N}$$
Geometric Mean
The geometric mean can also be written using logarithms:

$$\left(\prod_{i=1}^{N} X_i\right)^{1/N} = \exp\left(\frac{1}{N}\sum_{i=1}^{N} \ln X_i\right)$$

That is, the geometric mean of a variable X is equal to the exponential of the arithmetic mean of the natural logarithm of that variable.
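A short R check of this identity, using the points vector defined earlier (the log form is the safer one in practice, since prod() can overflow for long vectors):

prod(points)^(1 / length(points))   # 17.57103
exp(mean(log(points)))              # 17.57103, the same value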
Geometric Mean
The geometric mean has a geometric interpretation. You can think of the
geometric mean of two values X1 and X2 as the answer to the question:
What is the length of one side of a square that has an area equal to
that of a rectangle with width X1 and height X2?
Similarly, for three values X1, X2, and X3, one could ask:
What is the length of one side of a cube that has a volume equal to
that of a box with width X1, height X2, and depth X3?
Notes
Notes
Notes
Geometric Mean
This idea is represented graphically (for the two-value case) in the following figure.

Figure: The Geometric Mean: A (Geometric!) Interpretation ($\sqrt{X_1 \times X_2} = \bar{X}_G$)
Geometric Mean
Finally, note a few things:
- The geometric mean is only appropriate for variables with positive values.
- The geometric mean is always less than or equal to the arithmetic mean: $\bar{X} \geq \bar{X}_G$. The two are only equal if the values of X are the same for all N observations.
- While rare, the geometric mean is actually the more appropriate measure of central tendency to use for phenomena (such as percentages) that are more accurately multiplied rather than summed.
Geometric Mean
Example: Suppose that the price of something doubled (went up to 200 percent of the original price) in 2008, and then decreased by 50 percent (back to its original price) in 2009.

The actual average change in the price across the two years is not $(200 + 50)/2 = 125$ percent, but rather $\sqrt{200 \times 50} = \sqrt{10000} = 100$, or zero percent average (annualized) net change.
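The same calculation in R, treating each year's price as a percentage of the previous year's:

growth <- c(200, 50)        # doubled, then halved
mean(growth)                # 125: the misleading arithmetic answer
exp(mean(log(growth)))      # 100: zero average net change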
Harmonic Mean
The harmonic mean is defined as:

$$\bar{X}_H = \frac{N}{\sum_{i=1}^{N}\frac{1}{X_i}}$$

i.e., it is the product of N and the reciprocal of the sum of the reciprocals of the $X_i$s.

Equivalently:

$$\bar{X}_H = \frac{1}{\overline{\left(\frac{1}{X}\right)}},$$

i.e., the harmonic mean is the reciprocal of the (arithmetic) mean of the reciprocals of X.
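Both forms are easy to verify in R with the points vector from earlier:

length(points) / sum(1 / points)   # 14.65255
1 / mean(1 / points)               # the same value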
Harmonic Mean
Because it considers reciprocals, the harmonic mean is the "friendliest" toward small values in the data, and the least friendly to large values.

This means, among other things, that:
- the harmonic mean will always be the smallest of the means in value (the arithmetic mean will be the biggest, and the geometric will be in between the other two), and
- the harmonic mean tends to limit the impact of large outliers, and increase the weight given to small values of X.
To be honest, there are not a lot of instances where one is likely to use the
harmonic mean.
Median
The median of a variable X (sometimes labeled $X_{Med}$ or $\tilde{X}$) is the middle observation, or the 50th percentile.

Practically speaking, the median is the value of:
- the $\left(\frac{N-1}{2} + 1\right)$th-largest value of X when N is odd, and
- the mean of the $\left(\frac{N}{2}\right)$th- and $\left(\frac{N+2}{2}\right)$th-largest values of X when N is even.
Median
Example: For our NFL opening day data, there are 32 teams.
The median is therefore the average of the numbers of points scored by the
16th- and 17th-most point-scoring teams.
Those values are 17 and 19, making the median 18 (even though exactly 18
points were not scored by any team that week).
The median is typically only calculated for ordinal-, interval- or ratio-level data.
There is no particular problem with ordinal variables, since the median just reflects a middle value (and an ordinal variable orders the observations).
The median is not calculated for nominal data.
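The NFL example is easy to replicate in R:

sort(points)[16:17]   # 17 19: the two middle scores
median(points)        # 18: their average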
Median
While the mean is the number that minimizes the squared distance to the data, the median is the value of c that minimizes the absolute distance to the data:

$$X_{Med} = \arg\min_c \left\{\sum_{i=1}^{N} |X_i - c|\right\}.$$

The median is (relatively) unaffected by outliers and, as a result, is often known as a "robust" statistic.
Mode
The mode ($X_{Mode}$) is nothing more than the most commonly-occurring value of X.

In our NFL data, for example, the most common number of points scored was ten, i.e., $X_{Mode} = 10$.
The mode is the only measure of central tendency appropriate for use with data
measured at any level, including nominal.
That means that the mode is the only descriptive statistic appropriate for
nominal variables.
The mode is (technically) undened for any variable that has equal maximum
frequencies for two or more variable values.
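Base R has no built-in mode function; a minimal sketch is to tabulate the values and take the most frequent one:

tab <- table(points)
as.numeric(names(tab)[which.max(tab)])   # 10, scored by five teams

Note that which.max() returns only the first maximum, so this sketch silently picks one value even when the mode is technically undefined.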
Dichotomous Variables
Think about a dichotomous (binary) variable, D. Note a few things:
- The mean $\bar{D} = \frac{1}{N}\sum_{i=1}^{N} D_i \in [0, 1]$ is equal to the proportion of 1s in the data.
- The median $D_{Med} \in \{0, 1\}$, depending on whether there are more 0s or 1s in the data.
- The mode $D_{Mode} = D_{Med}$.

The mean of a binary variable tells us a lot about it: not just its mean but, as we'll see, its variance as well.

It is never the case that a binary variable's mean is equal to a value actually present in the data.
It is common to use the median/mode as a measure of central tendency for
binary variables.
Relationships
Figure: Central Tendency in a Symmetric Distribution with a Single Peak
[A single-peaked symmetric density: the Mean, Median, and Mode all coincide at the peak]
In a perfectly symmetrical continuous variable, the mean and median are
identical.
If the same variable is unimodal, then the mode is also equal to the mean and
the median.
Relationships
If a continuous variable Z is right-skewed, then

$$Z_{Mode} \leq Z_{Med} \leq \bar{Z}$$

Similarly, if a continuous variable Z is left-skewed, then

$$\bar{Z} \leq Z_{Med} \leq Z_{Mode}$$

In both cases, the mean is the most affected by outliers, as can be seen in the next figure.
Relationships
Figure: Points Scored in Week One of the 2008 NFL Season Redux
[Histogram of points scored, with Mode = 10, Median = 18, and Mean = 20.03 marked]
Cautions
The mean, mode, and median are all poor descriptions of the data when you have a bimodal distribution.

Strongly bimodal distributions are an indication that what you've really got is a dichotomous variable, or possibly that there are multiple dimensions in the intermediate values (Polity).
With skewed distributions, notably exponential and power-law distributions,
the mean can be much larger than the median and the mode.
Dispersion
A measure of central tendency does not provide an adequate description of
some variable X on its own because it only locates the center of the
distribution of data.
Figure: Frequency Distributions with Equal Means but Different Amounts of Variation
Range
The range is the most basic measure of dispersion; it is the difference between the highest and lowest values of an (interval- or ratio-level) variable:

$$\text{Range}(X) = X_{max} - X_{min}$$

Example: The range of our NFL points variable is $(41 - 3) = 38$.

Things to note about the range:
- The range tells you how much variation there is in your variable, and does so in units that are native to the variable itself.
- The range also scales with the variable, i.e., if you rescale the variable, the range adjusts as well.
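In R, range() returns the minimum and maximum rather than the difference, so the statistic itself takes one more step:

range(points)          # 3 41
diff(range(points))    # 38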
Percentiles, IQR, etc.
We can also get a handle on the variation in a variable by looking at
percentiles of a variable.
The kth percentile is the value of the variable below which k percent of the observations fall.
- If we have data on N = 100 observations, the 50th percentile is the value of the 50th-largest-valued observation.
- If we have data on N = 6172, then the 50th percentile is the value of the $6172 \times 0.5 = 3086$th-largest-valued observation in the data.

Thus,
- The 50th percentile is the same thing as the median $X_{Med}$.
- The 0th percentile is the same thing as $X_{min}$.
- The 100th percentile is the same thing as $X_{max}$.
Percentiles, IQR, etc.
We generally look at evenly-numbered percentiles, in particular quartiles and
deciles.
Quartiles are the 25th, 50th, and 75th percentiles of a variable.
Example: In our NFL data
Lower quartile = 13 (25% of teams scored less than or equal to 13 points)
Middle quartile = 18.
Upper quartile = 24.5 (only 25% of teams scored more than 24.5 points)
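R's quantile() reproduces these values directly (its default interpolation rule happens to match here):

quantile(points, probs = c(0.25, 0.50, 0.75))   # 13.0 18.0 24.5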
Percentiles, IQR, etc.
The inter-quartile range (IQR), sometimes called the "midspread", is defined as:

$$\text{IQR}(X) = \text{75th percentile}(X) - \text{25th percentile}(X)$$

Example: The IQR for our NFL data is $(24.5 - 13) = 11.5$, which means that the "middle 50 percent" of teams scored points that fell within about a 12-point range.

The IQR is analogous to the range, except that it is robust to outlying data points.

Example: If Denver had scored 82 points instead of 41, the range would then be $(82 - 3) = 79$, rather than 38, but the IQR would be unaffected.
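The Denver thought experiment is a one-liner in R:

points2 <- replace(points, which(points == 41), 82)
diff(range(points2))   # 79: the range blows up
IQR(points)            # 11.5
IQR(points2)           # 11.5: the IQR does not move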
Percentiles, IQR, etc.
Deciles are percentiles by tens: the tenth, twentieth, thirtieth, etc. percentiles of the data.

Deciles can be useful when the data are skewed, i.e., when there are small numbers of relatively high or low values in the data. This is because they provide a finer-grained picture of the variation in X than quartiles.

Deciles are often used to analyze data where there are large disparities between large and small values.
Deviations
A limitation of percentiles and ranges is that they do not make use of the
information in all of the data.
Approaches that make use of all the information are typically based on the idea
of a deviation.
A deviation is just the extent to which an observation's value on X differs from some benchmark value.

The typical benchmarks are (i) the median and (ii) the mean. In either case, the deviation is just the (signed) difference between an observation's value and that benchmark.
- Deviations from the mean are $(X_i - \bar{X})$.
- Deviations from the median are $(X_i - X_{Med})$.
Mean Deviation
Individual deviations are not particularly interesting. Instead we want, say, a "typical" deviation.

We might think to start with the average deviation from the mean value of X:

$$\text{Mean Deviation} = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})$$
But note that this measure is actually useless because the mean deviation
always equals 0, no matter the spread of the distribution.
A corollary of the fact that the mean minimizes the sum of squared deviations
is that the mean is also the value of X that makes the sum of deviations from
it equal to zero.
Mean Deviation
Proof.
Let the sum of the differences be:

$$\sum d = \sum(X - \bar{X})$$

Take $\sum$ through the brackets. Note that $\sum \bar{X}$ is the same as $N\bar{X}$, and so

$$\sum d = \sum X - N\bar{X}$$

And we know that $\bar{X} = \frac{\sum X}{N}$. And so

$$\sum d = \sum X - N\frac{\sum X}{N}$$

The Ns cancel, leaving

$$\sum d = \sum X - \sum X = 0$$
Mean Squared Deviation
One way to avoid having the positive and negative deviations cancel each other out is to use the squared deviation from the mean, $(X_i - \bar{X})^2$.

If we consider the average of this value, we get the mean squared deviation (MSD):

$$\text{Mean Squared Deviation} = \text{MSD} = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2$$
The MSD is intuitive enough, in that it is the average squared deviation from
the mean.
Mean Squared Deviation
But, think for a minute about the idea of variability. Imagine a dataset with a
single observation:
team points
------------------
Rams 14
------------------
The mean is (obviously) 14; but what's the MSD?

The MSD is zero, because the mean and $X_i$ are identical.

The point is that from one observation we can know something about how many points were scored on average, but we can't know anything about the distribution (spread) of those points. Were all the games 14-14 ties? Were they mostly 28-0 blowouts? There's no way to know.
Mean Squared Deviation
Now, add an observation:
team points
------------------
Rams 14
Titans 20
------------------
The (new) mean is now 17, and the MSD is

$$\frac{1}{2}\left[(14 - 17)^2 + (20 - 17)^2\right] = \frac{1}{2}(9 + 9) = \frac{18}{2} = 9$$
At this point, we can begin to learn something not just about the mean of the
data, but also about its variability.
Mean Squared Deviation
Informally, this suggests a principle that it is wise to always remember:
You cannot learn about more characteristics of data than you have observations.
observations.
If you have one observation, you can learn about the mean, but not the
variability.
With two observations, you can begin to learn about the mean and the
variation around that mean, but not the skewness (see below). And so forth.
The formal name for this intuitive idea is degrees of freedom.
Degrees of freedom are essentially pieces of information that are equal to the
sample size, N, minus the number of parameters, P, estimated from the data.
Variance
The relevance of degrees of freedom is that, in the simple little example above, we actually only have one effective observation telling us about the variation in X, not two.

As a result, we should consider revising the denominator of our estimate of the MSD downwards by one.
This gives us the variance:

$$\text{Variance} = s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2.$$

The variance in the sample is denoted by $s^2$, whereas the variance in the population is denoted by $\sigma^2$.

Note that as $N \to \infty$, $s^2 \to \text{MSD}$, but the two can be quite different in small samples.
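The difference is easy to see in R, again with the points vector from earlier:

n <- length(points)                      # 32
msd <- mean((points - mean(points))^2)   # 90.66: divides by N
var(points)                              # 93.58: divides by N - 1
msd * n / (n - 1)                        # identical to var(points)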
Variance
We can provide a shortcut for calculating $\sum(X - \bar{X})^2$.

$$(X - \bar{X})^2 = (X - \bar{X})(X - \bar{X}) = X^2 - 2X\bar{X} + \bar{X}^2$$

Now apply the summation:

$$\sum(X - \bar{X})(X - \bar{X}) = \sum X^2 - 2\frac{\sum X}{N}\sum X + N\left(\frac{\sum X}{N}\right)^2$$

We have taken advantage of the fact that $\sum \bar{X} = N\bar{X}$ and that $\bar{X} = \frac{\sum X}{N}$. We can now collect terms:

$$\sum(X - \bar{X})^2 = \sum X^2 - 2\frac{(\sum X)^2}{N} + N\frac{(\sum X)^2}{N^2} = \sum X^2 - \frac{(\sum X)^2}{N}$$

In other words, you can calculate the variance with just two pieces of information: $\sum X^2$ and $(\sum X)^2$.
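A quick R verification of the shortcut:

ss <- sum(points^2) - sum(points)^2 / length(points)
ss / (length(points) - 1)   # 93.58, the same as var(points)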
Standard Deviation
The variance is expressed in terms of squared units rather than the units of X.

To put $s^2$ back on the same scale as X, we can take its square root and obtain the standard deviation:

$$\text{Standard Deviation} = s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2}$$

$s$ (or $\sigma$ for a population) is the closest analogue to an "average deviation from the mean" that we have.

$s$ is also the standard measure of empirical variability used with interval- and ratio-level data. In fact, it's generally good practice to report $s$ every time you report $\bar{X}$.
Standard Deviation
Notice that, as with the mean, $s$ is expressed in the units of the original variable X.

Example: In our 32-team NFL data, the variance $s^2$ is 93.6, while the standard deviation is about 9.67.

That means that an average (or, more accurately, "typical") team's score was about 9-10 points away from the empirical mean of (about) 20.
Changing the Origin and Scale
For a change in the origin, we have the following results:

If $x^{*} = x + a$, then $\bar{x}^{*} = \bar{x} + a$ and $s_{x^{*}} = s_x$

For a change in the scale, we have the following results:

If $x^{*} = bx$, then $\bar{x}^{*} = b\bar{x}$ and $s_{x^{*}} = |b|s_x$

We can combine these results into one by considering $y = a + bx$:

If $y = a + bx$, then $\bar{y} = a + b\bar{x}$ and $s_y = |b|s_x$
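A short R sketch makes the combined rule concrete, using hypothetical constants a = 3 and b = -2:

a <- 3; b <- -2
y <- a + b * points
mean(y); a + b * mean(points)    # the two means agree
sd(y); abs(b) * sd(points)       # and so do the two SDs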
A Variant: Geometric s
As with the (arithmetic) mean, there's a variant of $s$ that is better used when the geometric mean is the more appropriate statistic.

The geometric standard deviation is defined as:

$$s_G = \exp\left(\sqrt{\frac{\sum_{i=1}^{N}(\ln X_i - \ln \bar{X}_G)^2}{N}}\right)$$

$s_G$ is the geometric analogue to $s$, and is best applied in every circumstance where the geometric mean is also more appropriately used.
Mean Absolute Deviation
Variances and standard deviations, because they are based on means, have many of the same problems that means have.

In particular, they can be drastically affected by outlier values of X, with the result that one or two small or large values can artificially distort the picture presented by $s$.

Example: Suppose that Denver put up 82 points instead of 41, but the other teams' scores were identical. The result is that the mean of points rises (from 20 to about 21.3), but the standard deviation rises even more dramatically (from 9.6 to about 14.2). That is, changing the points scored by a single team caused $s$ to increase by 50 percent.
Median Absolute Deviation
An alternative to the variance and standard deviation is to consider not squared deviations, but rather absolute values of deviations from some benchmark.

Because the concern is over resistance to outliers, we typically substitute the median for the mean, in two ways:
1. We consider deviations around the median, rather than the mean, and
2. When considering a "typical" value for the deviations, we use the median value rather than the mean.

The result is the median absolute deviation (MAD), defined as:

$$\text{Median Absolute Deviation} = \text{MAD} = \text{median}|X_i - X_{Med}|.$$
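One caution when computing this in R: by default mad() multiplies the raw statistic by 1.4826 so that it estimates the standard deviation under normality; set constant = 1 for the plain median absolute deviation:

mad(points, constant = 1)   # 6: the raw median of |X_i - 18|
mad(points)                 # 8.896 = 6 * 1.4826, as in the R output below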
Mean Absolute Deviation
Note that some textbooks refer to the mean absolute deviation when referring to MAD:

$$\text{Mean Absolute Deviation} = \text{MAD}_2 = \frac{1}{N}\sum_{i=1}^{N}|X_i - \bar{X}|$$

This is merely a substitution of the absolute value into the standard formula for a mean deviation, but one that lacks the robustness to outliers that MAD has.

The median absolute deviation (MAD) is often presented where there are highly influential outlier observations, i.e., in the same situations where the median is used as a measure of central tendency.
Example
Table: Mean Squared Deviation (MSD), Variance (s²), and Standard Deviation (s)

Data             Deviations     Squared Deviations
X                (X - X̄)        (X - X̄)²
10               -30            900
20               -20            400
30               -10            100
50                10            100
90                50            2500

X̄ = 200/5 = 40   Sum = 0        MSD = 4000/5 = 800
                                s² = 4000/4 = 1000
                                s ≈ 32
Example
Table: Deviations, MAD, and MAD₂

Data             Absolute Deviations    Absolute Deviations
X                (Median) |X - X_Med|   (Mean) |X - X̄|
10               20                     30
20               10                     20
30               0                      10
50               20                     10
90               60                     50

X̄ = 200/5 = 40   MAD = 20               MAD₂ = 120/5 = 24
X_Med = 30
Grouped (Frequency) Data
If we have grouped (frequency) data, then it is possible to calculate our various
measures of central tendency and dispersion using relative frequencies.
This can be useful if we only have a summary of the original data, rather than the data itself.

Mean for Grouped Data:

$$\bar{X} = \frac{1}{N}\sum_{j=1}^{J} X_j f_j = \frac{1}{N}\Big[\underbrace{(X_1 + X_1 + \ldots + X_1)}_{f_1 \text{ times}} + \underbrace{(X_2 + X_2 + \ldots + X_2)}_{f_2 \text{ times}} + \ldots\Big] = \frac{1}{N}(X_1 f_1 + X_2 f_2 + \ldots + X_J f_J)$$

where j indexes distinct values of X, and $f_j$ are the frequencies (counts) of each of the J distinct values of X.
Grouped (Frequency) Data
Variance for Grouped Data:

$$\text{Variance} = s^2 = \frac{1}{N-1}\sum_{j=1}^{J}(X_j - \bar{X})^2 f_j$$

Standard Deviation for Grouped Data:

$$\text{Standard Deviation} = s = \sqrt{\frac{1}{N-1}\sum_{j=1}^{J}(X_j - \bar{X})^2 f_j}$$
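Here is a brief R sketch of these grouped formulas, using the height data from the table that follows (the table rounds the deviations to whole numbers, so its variance differs very slightly):

x <- c(60, 63, 66, 69, 72, 75, 78)        # distinct heights X_j
f <- c(4, 12, 44, 64, 56, 16, 4)          # frequencies f_j
N <- sum(f)                               # 200
xbar <- sum(x * f) / N                    # 69.3
s2 <- sum((x - xbar)^2 * f) / (N - 1)     # about 12.8
c(xbar, s2, sqrt(s2))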
Example
Table: Calculation of Mean and Standard Deviation from Relative Frequency Distribution (n = 200)

Height    Weighting    Deviation    Squared       Weighted Squared
X         f_j          (X - X̄)      Deviations    Deviations
                                    (X - X̄)²      (X - X̄)² f_j
60        4            -9           81            324
63        12           -6           36            432
66        44           -3           9             396
69        64           0            0             0
72        56           3            9             504
75        16           6            36            576
78        4            9            81            324

X̄ = 69.3;  s² = 2556/199 = 12.8;  s = √12.8 = 3.6
Moments
Means and variances are examples of moments.
Moments are typically used to characterize random variables (rather than
empirical distributions of variables), but they give rise to some useful additional
statistics.
Think of the moment as a description of a distribution...

We can take a moment around some number, usually the mean (or zero). In general, the kth moment around a variable's mean is $M_k = E[(X - \mu)^k]$.
- The mean is the first moment: $M_1 = \bar{X} = E(X) = \frac{\sum X}{N}$.
- The variance is the second moment: $M_2 = s^2 = E[(X - \mu)^2] = \frac{\sum(X - \bar{X})^2}{N-1}$.
Skew
Skew is the dimensionless version of the third moment:

$$M_3 = E[(X - \mu)^3] = \frac{\sum(X - \bar{X})^3}{N}$$

The third moment is rendered dimensionless by dividing by the cube of the standard deviation of X:

$$s^3 = \text{s.d.}(X)^3 = \left(\sqrt{s^2}\right)^3$$

Thus, the skew of a distribution is given as:

$$\text{Skew} = \gamma_1 = \frac{M_3}{s^3} = \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^3}{\left[\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2\right]^{3/2}}$$
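Computed by hand in R (using 1/N moments, as in the formula above), this matches the skewness reported by Stata and by moments::skewness() later in the lecture:

m3 <- mean((points - mean(points))^3)
m2 <- mean((points - mean(points))^2)
m3 / m2^(3/2)   # 0.522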
Skew
Skew is a measure of symmetry; it measures the extent to which a distribution has long, drawn-out tails on one side or the other.
- Skew = 0 is symmetrical.
- Skew > 0 is positive (tail to the right).
- Skew < 0 is negative (tail to the left).
A Normal distribution is symmetrical and has a skew of 0. This will be useful in
determining whether a variable is normally distributed.
Skew
Figure: Symmetrical Distribution (Zero Skew)
Skew
Figure: Right-Skewed Distribution (Skew > 0)
Skew
Figure: Left-Skewed Distribution (Skew < 0)
Kurtosis
Kurtosis is the dimensionless version of the fourth moment:

$$M_4 = E[(X - \mu)^4] = \frac{\sum(X - \bar{X})^4}{N}$$

The fourth moment is rendered dimensionless by dividing by the square of the variance of X:

$$s^4 = (s^2)^2$$

Thus, the kurtosis of a distribution is given as:

$$\text{Kurtosis} = \gamma_2 = \frac{M_4}{s^4} - 3 = \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^4}{\left[\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2\right]^{2}} - 3$$
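The same calculation in R; note that Stata's summarize and moments::kurtosis() (both shown later) report $M_4/s^4$ without subtracting 3, so their Normal benchmark is 3 rather than 0:

m4 <- mean((points - mean(points))^4)
m2 <- mean((points - mean(points))^2)
m4 / m2^2       # 2.518: raw kurtosis, as the software reports it
m4 / m2^2 - 3   # -0.482: excess kurtosis, per the formula above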
Kurtosis
Kurtosis has to do with the "peakedness" of a distribution.
- Thin-tailed/Peaked = leptokurtic ($M_4$ is large).
- Medium-tailed = mesokurtic.
- Fat-tailed/Flat = platykurtic ($M_4$ is small).

Kurtosis can also be viewed as a measure of non-Normality. The Normal distribution is bell-shaped, whereas a kurtotic distribution is not. The Normal distribution has zero (excess) kurtosis, while a flat-topped (platykurtic) distribution has negative values, and a pointy (leptokurtic) distribution has positive values.
Moments
Figure: Kurtosis: Examples
[Three density panels: Mesokurtic, Leptokurtic, and Platykurtic]
Special Cases
If the data are symmetrical, then a few things are true:
- The median is equal to the average (mean) of the first and third quartiles. That, in turn, means that half the IQR equals the MAD (i.e., the distance between the median and (say) the 75th percentile is equal to the MAD).
- The skew is zero.
Special Cases
If you have nominal-level variables, all you can do is report category
percentages.
You should not use any of the measures discussed above with
nominal-level data.
If you have ordinal-level variables, it is usually a good idea to stick with robust measures of variation/dispersion. That means using MADs and the like, rather than variances and standard deviations.
While this is not usually (read: almost never) done in practice, it remains the
right thing to do.
Special Cases
If you have dichotomous variables, then the descriptive statistics described
above have some special characteristics.
- The range is always 1.
- The percentiles are always either 0 or 1.
- The IQR is always either 0 (if there are fewer than 25 or more than 75 percent 1s) or 1 (if there are between 25 and 75 percent 1s).
- The variance of a dichotomous variable D is related to the mean:

$$s_D^2 = \bar{D}(1 - \bar{D})$$

Since $\bar{D}$ is necessarily between zero and one, $s_D^2$ is as well. It follows that $s_D > s_D^2$.

$s_D$ and $s_D^2$ will always be greatest when $\bar{D} = 0.5$, i.e., an equal number of zeros and ones, and will decline as one moves away from this value.
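An R check, mirroring the NFC indicator in the Stata output below (16 zeros and 16 ones); var() uses N - 1 in the denominator, which is why it comes out slightly above $\bar{D}(1 - \bar{D})$:

D <- rep(c(0, 1), each = 16)
mean(D) * (1 - mean(D))   # 0.25
var(D)                    # 0.2580645 = 0.25 * 32/31, as in the Stata output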
Special Cases
Example: If we added an indicator called NFC to our NFL Week One data, it would look like this:
. sum NFC, detail
NFC
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 0 0
10% 0 0 Obs 32
25% 0 0 Sum of Wgt. 32
50% .5 Mean .5
Largest Std. Dev. .5080005
75% 1 1
90% 1 1 Variance .2580645
95% 1 1 Skewness 0
99% 1 1 Kurtosis 1
Best Practices
It is useful to present summary statistics for every variable you use in a paper.
Presenting summary measures is a good way of ensuring that your reader can
(better) understand what is going on.
Typically, one presents means, standard deviations, and minimums &
maximums, as well as an indication of the total number of observations in the
data.
For an important dependent variable, it's also generally a good idea to present some sort of graphical display of the distribution of the response variable. This is less important if the variable is dichotomous.
Best Practices
Summary Statistics

                                            Standard
Variable                             Mean   Deviation   Minimum   Maximum
Assassination                        0.01   0.09            0         1
Previous Assassinations Since 1945   0.45   0.76            0         4
GDP Per Capita / 1000                5.83   6.04         0.33     46.06
Political Unrest                     0.01   1.01        -1.67     20.11
Political Instability               -0.03   0.92        -4.66     10.08
Executive Selection                  1.54   1.34            0         4
Executive Power                      3.17   2.39            0         6
Repression                           1.67   1.19            0         3

Note: N = 5614. Statistics are based on all non-missing observations in the model in Table X.
Best Practices
Figure: Conditioning Plots / Statistics
[Histograms of points scored, by conference: AFC and NFC panels]
Stata
. summarize points, detail
points
-------------------------------------------------------------
Percentiles Smallest
1% 3 3
5% 7 7
10% 10 10 Obs 32
25% 13 10 Sum of Wgt. 32
50% 18 Mean 20.03125
Largest Std. Dev. 9.673657
75% 25 34
90% 34 38 Variance 93.57964
95% 38 38 Skewness .5219589
99% 41 41 Kurtosis 2.518387
. ameans points
Variable | Type Obs Mean [95% Conf. Interval]
-------------+----------------------------------------------------------
points | Arithmetic 32 20.03125 16.54352 23.51898
| Geometric 32 17.57103 14.36262 21.49615
| Harmonic 32 14.65255 11.30963 20.8009
------------------------------------------------------------------------
Stata
. tabstat points, stats(mean median sum max min range sd variance skewness kurtosis q)
variable | mean p50 sum max min range
-------------+-------------------------------------------------------------
points | 20.03125 18 641 41 3 38
---------------------------------------------------------------------------
sd variance skewness kurtosis p25 p50 p75
--------------------------------------------------------------------------
9.673657 93.57964 .5219589 2.518387 13 18 25
--------------------------------------------------------------------------
Stata
. summarize population, detail
. return list
scalars:
r(N) = 42
r(sum_w) = 42
r(mean) = 18531.88095238095
r(Var) = 871461347.2293844
r(sd) = 29520.52416928576
r(skewness) = 2.480049617027598
r(kurtosis) = 9.764342116660073
r(sum) = 778339
r(min) = 27
r(max) = 146001
r(p1) = 27
r(p5) = 32
r(p10) = 276
r(p25) = 2405
r(p50) = 5372
r(p75) = 15892
r(p90) = 59330
r(p95) = 65667
r(p99) = 146001
. scalar mean = r(mean)
. display mean
18531.881
Stata
. sort NFC
. by NFC: summarize points
----------------------------------------------------------------------
-> NFC = AFC
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
points | 16 19.125 10.07224 10 41
----------------------------------------------------------------------
-> NFC = NFC
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
points | 16 20.9375 9.497149 3 38
R
> install.packages("foreign")
> library(foreign)
> install.packages("psych")
> library(psych)
> NFL <- read.dta("NFLWeekOne.dta")  # change path
> attach(NFL)
> summary(points)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 13.00 18.00 20.03 24.50 41.00
> geometric.mean(points)
[1] 17.57103
> harmonic.mean(points)
[1] 14.65255
R
> library(Hmisc)
> describe(points)
points
n missing unique Mean .05 .10 .25 .50 .75 .90 .95
32 0 18 20.03 8.65 10.00 13.00 18.00 24.50 34.00 38.00
3 7 10 13 14 16 17 19 20 21 23 24 26 28 29 34 38 41
Frequency 1 1 5 2 2 1 4 1 2 1 1 3 1 1 1 2 2 1
% 3 3 16 6 6 3 12 3 6 3 3 9 3 3 3 6 6 3
> library(pastecs)
> stat.desc(points)
nbr.val nbr.null nbr.na min max range sum median
32.0000 0.0000 0.0000 3.0000 41.0000 38.0000 641.0000 18.0000
mean SE.mean CI.mean.0.95 var std.dev coef.var
20.0312 1.7101 3.4877 93.5796 9.6737 0.4829
R
> var(points)
[1] 93.58
> sd(points)
[1] 9.674
> mad(points)
[1] 8.896
> library(moments)
> skewness(points)
[1] 0.522
> kurtosis(points)
[1] 2.518
> by(points,NFC,summary)
NFC: AFC
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 12.2 17.0 19.1 21.0 41.0
------------------------------------------------------------------------------------
NFC: NFC
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.0 15.2 22.0 20.9 26.5 38.0
