
2. Descriptive Statistics STAT 226

Learning Outcomes:
Calculate the mean, median, mode, range, standard deviation and variance.
Recognize skewed and symmetric histograms.
Produce and interpret frequency distributions, histograms, ogives, stem and leaf diagrams, and box plots.

Measures of Location: Mean, Median and Mode

Measures of Dispersion (Spread or Variability): Range, Mean Deviation, Variance, Standard Deviation


2.1 MEASURES OF LOCATION (CENTRAL TENDENCY)

1. THE MEAN
The mean (arithmetic mean or average) of a data set is obtained by adding all the values and dividing
the total by the number of values. For a sample of $n$ values, the sample mean is

$$\bar{x} = \frac{\sum x}{n}$$



Example 2.1: The number of hours 8 hospital patients slept following the administration of a certain
anesthetic are as follows:
7 10 12 4 8 7 3 8

The sample mean number of hours slept is

$$\bar{x} = \frac{7 + 10 + 12 + 4 + 8 + 7 + 3 + 8}{8} = \frac{59}{8} = 7.375 \text{ hours}$$




2. THE MEDIAN
The median of a data set is the middle value when the values are arranged in order (increasing or
decreasing). Half of the data are less than the median and half are greater.
To find the median, first rank the values (arrange them in order), and then choose the middle value.

Example 2.2: The following data set consists of white blood cell counts taken from 7 patients:
5 10 12 4 8 7 22

To obtain the median, we first arrange the data in (ascending) order:
4 5 7 8 10 12 22
Because the number of values is odd (7), the median is the value that is located in the exact middle of
the list. The median is 8.



Example 2.3: Find the median of the following values:
7 10 12 4 8 7 3 8

Solution: To obtain the median, we first arrange the data in (ascending) order:
3 4 7 7 8 8 10 12

Since the number of values is even (8), the median is found by computing the mean of the two middle
values, 7 and 8. The median is

$$\frac{7 + 8}{2} = 7.5$$





3. THE MODE
The mode is the most frequently occurring value/category. When two different values occur with
the same greatest frequency, each one is a mode and the distribution is referred to as bimodal. When
more than two values occur with the same greatest frequency, each is a mode and the distribution is
said to be multimodal. When no value is repeated, we say that there is no mode.


Example 2.4: Find the mode for the following values: 7 10 12 4 6 7 3 8
Solution: 3 4 6 7 7 8 10 12

The mode is 7.



Example 2.5: Find the mode for the following values: 7 10 12 4 8 7 3 8
Solution: 3 4 7 7 8 8 10 12

The distribution is bimodal with modes 7 and 8.

Example 2.6: Find the mode for the following values: 8 8 12 7 8 7 3 8

Solution: 3 7 7 8 8 8 8 12

The mode is 8.

Example 2.7: Find the mode for the following values: 7 10 12 4 8 9 3 5

Solution: The distribution has no mode (no value occurred more than once).
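
The three measures of location can be checked with a short Python sketch. It uses the standard-library `statistics` module for the mean and median. The `modes` helper is a hypothetical function written only for this illustration; it assumes the convention stated above that a data set with no repeated value has no mode.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return every mode in ascending order, or [] when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value occurs more than once, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)


hours = [7, 10, 12, 4, 8, 7, 3, 8]   # data of Examples 2.1, 2.3 and 2.5
wbc = [5, 10, 12, 4, 8, 7, 22]       # data of Example 2.2

print(mean(hours))                          # 7.375
print(median(wbc))                          # 8
print(median(hours))                        # 7.5
print(modes(hours))                         # [7, 8]  (bimodal)
print(modes([8, 8, 12, 7, 8, 7, 3, 8]))     # [8]
print(modes([7, 10, 12, 4, 8, 9, 3, 5]))    # []  (no mode)
```

Returning a list keeps the unimodal, bimodal, and no-mode cases in one function.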



Measures of Location & Levels of Measurement
The Mean is used for interval level data. The mean is sensitive to extreme values (outliers).
The Median can be used for ordinal or interval level data. It is usually employed over the mean when
there are extreme values in a data set because it is not sensitive to the extreme values.
The Mode is appropriate for nominal data.


Exercise 1: The following are the ages of 12 patients present in the emergency room of a hospital on a
Saturday night:
36 32 26 37 46 41 62 42 33 49 35 24

Find the mean, median, and mode for this data set.

Mean =

Median =

Mode =



2.2 MEASURES OF SPREAD (DISPERSION)
A measure of spread (or dispersion) depicts how close together (or far apart) the data values are.

1. THE RANGE
The range is the difference between the highest and lowest values (Range = highest value - lowest
value).

Example 2.8: A random sample of 12 patients in a waiting line at a Toronto hospital reveals the
following waiting times (in minutes). Find the range.
26 63 29 31 29 62 41 42 24 38 30 42

Solution: Range = 63 - 24 = 39


2. THE MEAN DEVIATION
The mean deviation is the average of the absolute distances (deviations or differences) from the
mean.

$$MD = \frac{\sum |x - \bar{x}|}{n}$$



Example 2.9: Calculate the mean deviation of the following sample of birthweights (in kg) of 5 live-born
infants at a Toronto hospital
2.9 3.5 3.2 2.4 4.5


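Solution:

$$\bar{x} = \frac{\sum x}{n} = \frac{16.5}{5} = 3.3$$

$x$        $x - \bar{x}$      $|x - \bar{x}|$
2.9        -0.4               0.4
3.5         0.2               0.2
3.2        -0.1               0.1
2.4        -0.9               0.9
4.5         1.2               1.2

$\sum x = 16.5$    $\sum (x - \bar{x}) = 0$    $\sum |x - \bar{x}| = 2.8$

$$MD = \frac{\sum |x - \bar{x}|}{n} = \frac{2.8}{5} = 0.56 \text{ kg}$$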

3. THE VARIANCE
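The variance is the average of the squared distances (deviations) from the mean.

Population variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

Sample variance:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$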




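Example 2.10: Calculate the variance of the sample of birthweights in the previous example.

Solution:

$x$        $x - \bar{x}$      $(x - \bar{x})^2$
2.9        -0.4               0.16
3.5         0.2               0.04
3.2        -0.1               0.01
2.4        -0.9               0.81
4.5         1.2               1.44

$\sum x = 16.5$    $\sum (x - \bar{x}) = 0$    $\sum (x - \bar{x})^2 = 2.46$

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{2.46}{5 - 1} = 0.615 \text{ kg}^2$$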
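
The mean deviation and variance calculations for the birthweight data can be sketched in Python as below. This is a minimal illustration, not the notes' own method. It assumes the sample variance divides the sum of squared deviations by n - 1, and the function names are chosen here for illustration.

```python
def mean_deviation(values):
    """Average absolute deviation from the mean (divisor n)."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def sample_variance(values):
    """Sample variance, assuming the divisor n - 1."""
    return sum_of_squares(values) / (len(values) - 1)


weights = [2.9, 3.5, 3.2, 2.4, 4.5]   # birthweights in kg (Example 2.9)

# Results are rounded because binary floating point is not exact.
print(round(mean_deviation(weights), 2))    # 0.56
print(round(sum_of_squares(weights), 2))    # 2.46
print(round(sample_variance(weights), 3))   # 0.615
```

Taking the square root of the sample variance gives the sample standard deviation, in the same unit (kg) as the data.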


4. THE STANDARD DEVIATION
The standard deviation is the square root of the average squared distance (deviation) from the mean,
or simply the square root of the variance. The standard deviation is expressed in the same unit as the
original measurement. That is, if measurements are in cm, the standard deviation will be in cm as well.

Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

Sample standard deviation:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$




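Example 2.11: Calculate the sample standard deviation of the birthweights in the previous example.

Solution:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{2.46}{5 - 1}} = \sqrt{0.615} \approx 0.78 \text{ kg}$$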





5. COEFFICIENT OF VARIATION
The coefficient of variation expresses the standard deviation as a percentage of the mean. It allows us to
compare the variability of two data sets when they have different units, or when they have the same
units and similar standard deviations but different means.

$$CV = \frac{s}{\bar{x}} \times 100\%$$



Example 2.13: Calculate the coefficient of variation (CV) for the following two samples and
determine which sample is more variable.


            Age         Mean weight    Standard deviation
Sample 1    30 years    150 pounds     15 pounds
Sample 2    12 years    90 pounds      11 pounds

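Solution:

$$CV_1 = \frac{15}{150} \times 100\% = 10\%$$

$$CV_2 = \frac{11}{90} \times 100\% \approx 12.2\%$$

Sample 2 has more variability because its coefficient of variation is larger.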
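
The comparison can be reproduced with a few lines of Python. This is only a sketch; the function name `coefficient_of_variation` is an illustrative choice, and the formula assumed is the standard deviation divided by the mean, times 100.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


cv1 = coefficient_of_variation(15, 150)   # Sample 1
cv2 = coefficient_of_variation(11, 90)    # Sample 2

print(round(cv1, 1))   # 10.0
print(round(cv2, 1))   # 12.2
print("Sample 2" if cv2 > cv1 else "Sample 1", "is more variable")
```

Although Sample 1 has the larger standard deviation in pounds, Sample 2 is more variable relative to its mean.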


THE EMPIRICAL RULE
For a symmetric unimodal (bell-shaped or normal) distribution, the Empirical Rule states that:

Approximately 68% of all observations fall within one standard deviation of the mean, $\bar{x} \pm s$.
Approximately 95% of all observations fall within two standard deviations of the mean, $\bar{x} \pm 2s$.
Approximately 99.7% of all observations fall within three standard deviations of the mean, $\bar{x} \pm 3s$.




Example 2.14: The average score on a STAT 226 midterm exam is 72 with a standard deviation
of 8. Assume the distribution of the midterm scores is bell shaped.
a) Approximately 95% of the scores will be within what values?
b) What percentage of the scores will fall between 48 and 96?

Solution:

a) Approximately 95% of the scores will be between $\bar{x} \pm 2s = 72 \pm 2(8) = 72 \pm 16$:
Lower Limit = 72 - 16 = 56     Upper Limit = 72 + 16 = 88

b) $\bar{x} - 3s = 72 - 3(8) = 72 - 24 = 48$

$\bar{x} + 3s = 72 + 3(8) = 72 + 24 = 96$
By the empirical rule, approximately 99.7% of the scores are between 48 and 96.
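
The intervals in this example can be generated with a small Python sketch. The helper name `empirical_interval` is hypothetical, and the percentages are the approximate values stated by the Empirical Rule, which applies only when the distribution is bell shaped.

```python
def empirical_interval(mean, std_dev, k):
    """Return (lower, upper) limits: mean minus/plus k standard deviations."""
    return mean - k * std_dev, mean + k * std_dev


# Approximate coverage stated by the Empirical Rule for k = 1, 2, 3.
COVERAGE = {1: "68%", 2: "95%", 3: "99.7%"}

for k, pct in COVERAGE.items():
    low, high = empirical_interval(72, 8, k)
    print(f"about {pct} of scores lie between {low} and {high}")
```

For k = 2 the limits are 56 and 88, and for k = 3 they are 48 and 96, matching parts a) and b).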


Exercise 2:
Arinami et al, 1988 (Table 1) analyzed the auditory brain-stem responses using a sample of 12 mentally
retarded males with fragile X syndrome. The subjects had the following IQ scores:
17 22 17 18 17 19 34 26 14 33 21 29

Source: Am J Hum Genet. 1988 July; 43(1): 4651 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1715284/?page=2)
Calculate the
a) Mean b) Range c) Mean Deviation d) Variance e) Standard deviation
f) Coefficient of variation.


2.3 FREQUENCY DISTRIBUTIONS

A frequency distribution lists categories (or intervals) and the number of times each occurs (frequency).
A relative frequency distribution lists categories (or intervals) and the proportion of times each occurs.






Frequency Distribution for Qualitative Data
Example 2.15: A sample of 20 patients at a Toronto Hospital reveals the following blood type distribution:
O B O O A AB AB A B O A AB A O B B O AB A O

Construct a frequency distribution and a relative frequency distribution for the data:

Solution:
Frequency Distribution
Table 2.1
Blood Type    Frequency
A             5
AB            4
B             4
O             7
Total         20

Relative Frequency Distribution
Table 2.2
Blood Type    Frequency    Relative Frequency
A             5            5/20 = 0.25
AB            4            4/20 = 0.20
B             4            4/20 = 0.20
O             7            7/20 = 0.35
Total         20           1
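
Tables 2.1 and 2.2 can be reproduced with a short Python sketch that tallies the blood types using `collections.Counter` from the standard library. The variable names are illustrative only.

```python
from collections import Counter

# The 20 observations of Example 2.15.
blood_types = "O B O O A AB AB A B O A AB A O B B O AB A O".split()

freq = Counter(blood_types)   # category -> frequency
n = len(blood_types)          # total number of observations

print("Blood Type  Frequency  Relative Frequency")
for category in sorted(freq):
    count = freq[category]
    print(f"{category:<11} {count:<10} {count}/{n} = {count / n:.2f}")
print(f"{'Total':<11} {n}")
```

The relative frequencies always sum to 1 because each is a frequency divided by the same total.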


Graphical Representation of Qualitative Data (Bar Chart & Pie Chart)
Bar Chart - consists of rectangular bars showing the frequencies of each category of a distribution.
The bar chart for the frequency distribution in Table 2.1 is seen below:

[Figure: Bar Chart of Frequency (0 to 8) by Blood Type (A, AB, B, O)]

Pie Chart - a circle divided into sections reflecting the percentage of frequencies in each category of
the distribution. Pie Chart for Table 2.2:

[Figure: Pie Chart with sections A 25%, AB 20%, B 20%, O 35%]


Frequency Distribution for Quantitative Data

Example 2.16: The following are the ages of the 25 patients on the 6th floor of a hospital. Construct a
frequency distribution and a relative frequency distribution for these data:

12 21 51 53 26 21 21 38 55 52 38 37 40
13 38 55 31 54 48 49 54 42 56 26 54

Solution:
Table 2.3
Class Interval    Frequency    Relative Frequency
10 - 19           2            2/25 = 0.08 or 8%
20 - 29           5            5/25 = 0.20 or 20%
30 - 39           5            5/25 = 0.20 or 20%
40 - 49           4            4/25 = 0.16 or 16%
50 - 59           9            9/25 = 0.36 or 36%
Total             25           1

Note that it is more convenient to use classes with interval widths of 5 or 10.
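
Grouping quantitative data into class intervals can be sketched in Python as follows. The sketch assumes equal-width classes of width 10 whose lower limits are multiples of 10, as in Table 2.3. The variable names are illustrative.

```python
from collections import Counter

# Ages of the 25 patients in Example 2.16.
ages = [12, 21, 51, 53, 26, 21, 21, 38, 55, 52, 38, 37, 40,
        13, 38, 55, 31, 54, 48, 49, 54, 42, 56, 26, 54]

WIDTH = 10  # assumed class width

# Integer division maps each age to the lower limit of its class (12 -> 10).
counts = Counter((age // WIDTH) * WIDTH for age in ages)

table = {f"{low} - {low + WIDTH - 1}": counts[low] for low in sorted(counts)}

for label, f in table.items():
    print(f"{label}: frequency {f}, relative frequency {f / len(ages):.2f}")
```

With a different starting point or class width the frequencies change, so the class limits should be stated alongside any frequency table.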

HISTOGRAM
A histogram is a graph that displays interval data by using vertical bars to represent the frequencies.

Histogram for the frequency distribution in Table 2.3 above:

[Figure: Histogram of Frequency (0 to 10) by Age class (10-19, 20-29, 30-39, 40-49, 50-59)]

Shapes of Distributions
SYMMETRY
A distribution is said to be symmetric if, when a vertical line is drawn through the center of the
distribution, the two sides are identical in shape and size. In symmetric distributions, the mean and
median are equal.
Examples of Symmetric Distributions

[Figure: three histograms of Frequency against Variable, titled "Symmetric, bell-shaped", "Symmetric, bimodal", and "Symmetric, Uniform"]

SKEWNESS
A distribution is said to be skewed if it has a long tail extending to either the right or the left.
Examples of Skewed Distributions

[Figure: two histograms of Frequency against Variable, titled "Positively Skewed" and "Negatively Skewed"]

Positively Skewed: Mode < Median < Mean
Negatively Skewed: Mean < Median < Mode


Exercise: Determine whether each of the following sets of measures of centre represents a symmetric,
positively skewed or negatively skewed distribution.


      Mean    Median    Mode    Shape
a)    7       10        14
b)    8       8         8
c)    7       5         2

Cumulative Frequency Distribution
The cumulative frequency of a particular class is the sum of the frequency of that class and the
frequencies of all preceding classes in the distribution.

Example 2.17: Construct a table showing the cumulative frequency distribution and the cumulative
relative frequency distribution for the data in Table 2.3.

Solution:
Table 2.4

Class Interval    Frequency    Cumulative Frequency    Cumulative Relative Frequency
10 - 19           2            2                       2/25 = 0.08 or 8%
20 - 29           5            2+5 = 7                 7/25 = 0.28 or 28%
30 - 39           5            7+5 = 12                12/25 = 0.48 or 48%
40 - 49           4            12+4 = 16               16/25 = 0.64 or 64%
50 - 59           9            16+9 = 25               25/25 = 1.00 or 100%
Total             25
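
The running totals in Table 2.4 can be computed with `itertools.accumulate` from the Python standard library. This is a minimal sketch using the class frequencies of Table 2.3; the variable names are illustrative.

```python
from itertools import accumulate

classes = ["10 - 19", "20 - 29", "30 - 39", "40 - 49", "50 - 59"]
frequencies = [2, 5, 5, 4, 9]

# Each cumulative frequency is the running total up to and including the class.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]
cumulative_relative = [c / total for c in cumulative]

for label, f, c, r in zip(classes, frequencies, cumulative, cumulative_relative):
    print(f"{label}: frequency {f}, cumulative {c}, cumulative relative {r:.2f}")
```

The last cumulative frequency always equals the total number of observations, so the last cumulative relative frequency is 1.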


A cumulative frequency graph (ogive) is a graph that represents the cumulative frequencies for the
classes in a frequency distribution.

Example 2.18: Construct an ogive for the following distribution:

Class Interval    1 - 5    6 - 10    11 - 15    16 - 20    21 - 25    26 - 30    31 - 35
Frequency         3        7         5          4          6          4          1



Solution:

Class Interval    Frequency    Cumulative Frequency    Relative Cumulative Frequency (%)
1 - 5             3            3                       10%
6 - 10            7            10                      33%
11 - 15           5            15                      50%
16 - 20           4            19                      63%
21 - 25           6            25                      83%
26 - 30           4            29                      97%
31 - 35           1            30                      100%
Total             30

[Figure: Cumulative Frequency Graph (Ogive); horizontal axis from 0 to 35, vertical axes Cumulative Frequency (0 to 30) and Cumulative Percent (0% to 100%)]

From the ogive we can see that about 33% of the scores are 10 or less, about 50% are 15 or less, and
about 83% are 25 or less.

2.4 MEASURES OF POSITION (Quantiles)

Quartiles, Deciles, and Percentiles
Quartiles divide a set of observations into four equal parts
Deciles divide a set of observations into 10 equal parts
Percentiles divide a set of observations into 100 equal parts

Quartiles
The three quartiles, denoted by $Q_1$, $Q_2$, and $Q_3$, divide sorted data (data arranged in order) into four
equal parts.
$Q_1$ is the data value that separates the bottom 25% of the sorted data from the top 75%.
$Q_2$ is the median, which divides the data in half (50%), and
$Q_3$ separates the top 25% of the sorted data from the bottom 75%.

The difference between the third quartile and the first quartile is called the Inter-quartile Range,
denoted by IQR. This is also a good measure of variation in a data set that has extreme values or
outliers. Inter-quartile Range: $IQR = Q_3 - Q_1$.

Deciles
There are nine deciles, $D_1, D_2, D_3, \dots, D_9$, which partition the sorted data into 10 groups with about
10% of the data in each group.

Percentiles
There are also 99 percentiles, denoted by $P_1, P_2, P_3, \dots, P_{99}$, which partition sorted data into 100 groups
with about 1% of the scores in each group.

Quartiles, Deciles and Percentiles:
Quartiles: $Q_1 = P_{25}$, $Q_2 = P_{50}$, $Q_3 = P_{75}$

Deciles: $D_1 = P_{10}$, $D_2 = P_{20}$, $\dots$, $D_8 = P_{80}$, $D_9 = P_{90}$

Finding the value of the $k$th percentile ($P_k$)
To find the value of $P_k$, follow these 3 steps:
Step 1: Arrange the data values in ascending order.

Step 2: Find the locator $L_k$ using the formula

$$L_k = \frac{k}{100}(n + 1)$$

where $n$ is the number of values and $k$ is the percentile in question.

Step 3: If $L_k$ is a whole number, then the value of $P_k$ is the $L_k$th value in the data set, counting from
the lowest. If $L_k$ is not a whole number, then $P_k$ lies between two adjacent data values and is found by
interpolation, as in the following example.


Example 2.19: The following data represent the quarterly incomes (in thousand dollars) of part-time
employees in a small hospital.
5, 11, 12, 19, 14, 21, 13, 8, 29, 24, 22, 16, 24, 6, 1, 3, 6, 26, 16
Arrange the ungrouped data into an array and calculate the values of $Q_3$ and $P_{94}$.

Solution: 1, 3, 5, 6, 6, 8, 11, 12, 13, 14, 16, 16, 19, 21, 22, 24, 24, 26, 29

Finding $Q_3$:
Since $Q_3 = P_{75}$, we first find the locator $L_{75}$:
$k = 75$; $n = 19$; Locator

$$L_{75} = \frac{k}{100}(n + 1) = \frac{75}{100}(19 + 1) = 15$$

which is a whole number.

Therefore the value of $P_{75}$ corresponds to the 15th value in the sorted data, which is 22.
$P_{75} = 22$: about 75% of the part-time employees have incomes less than 22 thousand dollars.


Finding $P_{94}$:
We first find the locator $L_{94}$:
$k = 94$; $n = 19$; Locator

$$L_{94} = \frac{k}{100}(n + 1) = \frac{94}{100}(19 + 1) = 18.8$$

which is not a whole number.

Therefore $P_{94}$ corresponds to the point 80% of the distance between the 18th and 19th data values.
Since the 18th value = 26 and the 19th value = 29, $P_{94}$ is computed as follows:
$P_{94} = 26 + 0.8(29 - 26) = 26 + 0.8(3) = 26 + 2.4 = 28.4$
(Note: 0.8 is the decimal part of 18.8.)
$P_{94} = 28.4$: about 94% of the part-time employees have incomes less than 28.4 thousand dollars.
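
The three-step procedure can be sketched as a Python function. This is an illustration under the assumption, taken from the worked example, that the locator is (k/100)(n + 1) and that a fractional locator means linear interpolation between the two neighbouring values. The name `percentile` and the handling of locators that fall outside the data are choices made here, not part of the notes.

```python
def percentile(values, k):
    """k-th percentile using the locator L = (k / 100) * (n + 1)."""
    data = sorted(values)               # Step 1: ascending order
    n = len(data)
    locator = k * (n + 1) / 100         # Step 2: position in the sorted data
    whole = int(locator)                # whole-number part of the locator
    fraction = locator - whole          # decimal part of the locator
    if whole < 1:                       # assumption: clamp below the 1st value
        return data[0]
    if whole >= n:                      # assumption: clamp above the last value
        return data[-1]
    lower, upper = data[whole - 1], data[whole]
    return lower + fraction * (upper - lower)   # Step 3: interpolate


incomes = [5, 11, 12, 19, 14, 21, 13, 8, 29, 24,
           22, 16, 24, 6, 1, 3, 6, 26, 16]

print(round(percentile(incomes, 75), 1))   # 22.0
print(round(percentile(incomes, 94), 1))   # 28.4
```

Statistical software often uses other locator conventions, so its percentiles may differ slightly from values computed this way.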

BOX PLOT
A box plot is a graphical display, based on quartiles, that helps picture the distribution of a data set.
The following values are needed to construct a box plot:
1. Minimum value (lowest value)
2. First quartile ($Q_1$)
3. Median ($Q_2$)
4. Third quartile ($Q_3$)
5. Maximum value (highest value)


Example 2.20: Listed are the weekly incomes earned, in dollars, last month by a sample of 15 nurses:
$2038 1758 1721 1637 2097 2047 2205 1787 2287 1940 2311 2054 2406 1471 1460
1. Draw a Box Plot for this data.
2. Use the boxplot to determine the skewness of the distribution.
3. What is the Interquartile Range?

Solution:

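1. The first step is to arrange the data from the smallest to the largest.
$1460 1471 1637 1721 1758 1787 1940 2038 2047 2054 2097 2205 2287 2311 2406

Minimum = $1460

$$L_{25} = \frac{25}{100}(15 + 1) = 4 \quad\Rightarrow\quad Q_1 = \text{4th value} = 1721$$

$$L_{50} = \frac{50}{100}(15 + 1) = 8 \quad\Rightarrow\quad Q_2 = \text{8th value} = 2038$$

$$L_{75} = \frac{75}{100}(15 + 1) = 12 \quad\Rightarrow\quad Q_3 = \text{12th value} = 2205$$

Maximum = $2406

2. The box plot shows that the left tail is longer, which indicates negative skewness.
3. Inter-quartile Range: $IQR = Q_3 - Q_1 = 2205 - 1721 = 484$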
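
The five values needed for the box plot in Example 2.20 can be computed with a short Python sketch. It assumes the locator (k/100)(n + 1) used in these notes; in this example every locator is a whole number, so no interpolation is needed. The helper name `value_at_percentile` is hypothetical.

```python
def value_at_percentile(sorted_data, k):
    """Value at locator L = (k / 100) * (n + 1); assumes L is a whole number."""
    locator = k * (len(sorted_data) + 1) / 100
    if locator != int(locator):
        raise ValueError("locator is not a whole number; interpolation needed")
    return sorted_data[int(locator) - 1]   # positions are counted from 1


incomes = sorted([2038, 1758, 1721, 1637, 2097, 2047, 2205, 1787,
                  2287, 1940, 2311, 2054, 2406, 1471, 1460])

five_numbers = {
    "Min": incomes[0],
    "Q1": value_at_percentile(incomes, 25),
    "Q2": value_at_percentile(incomes, 50),
    "Q3": value_at_percentile(incomes, 75),
    "Max": incomes[-1],
}
iqr = five_numbers["Q3"] - five_numbers["Q1"]

print(five_numbers)
print("IQR =", iqr)   # IQR = 484
# The left whisker (Q1 - Min) is longer than the right whisker (Max - Q3).
print(five_numbers["Q1"] - five_numbers["Min"],
      five_numbers["Max"] - five_numbers["Q3"])   # 261 201
```

The longer left whisker is consistent with the negative skewness read from the box plot.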

Exercise:
The Apollo space program lasted from 1967 until 1972 and included 13 missions. The durations
of the flights were: 7, 9, 10, 142, 147, 192, 195, 216, 241, 244, 260, 295, 301
a) Find the 45th and 82nd percentiles.
b) Draw a Box Plot for this data.
c) What is the Interquartile Range?


[Figure: Box Plot for Commissions Earned (Example 2.20); horizontal axis from 1400 to 2600, with Min, Q1, Q2, Q3, and Max marked]

STEM & LEAF PLOT

A stem and leaf plot is constructed by dividing each data value into a stem (the leading digit(s)) and a
leaf (the trailing digit) and collecting all data points with the same stem on a single row (or column).
The stem & leaf plot reveals all the numerical values in a data set. It is most effective for small data
sets.

Example 2.21: Construct a Stem & Leaf plot for the following data. Describe the shape of each
distribution.

a) 12 21 31 53 29 21 24 35 44 32 38 37 40
13 38 32 31 34 48 49 54 42 36 26 34

b) 106 127 119 109 102 114 122 105 115 114 107
102 148 104 129 140 115 92 161 134 109


Solution:

a)
Sorted Data:
12 13 21 21 24 26 29 31 31
32 32 34 34 35 36 37 38 38
40 42 44 48 49 53 54

Stem | Leaf
1    | 2 3
2    | 1 1 4 6 9
3    | 1 1 2 2 4 4 5 6 7 8 8
4    | 0 2 4 8 9
5    | 3 4

The distribution is approximately symmetric.

b)
Sorted Data:
92 102 102 104 105 106 107
109 109 114 114 115 115 119
122 127 129 134 140 148 161

Stem | Leaf
09   | 2
10   | 2 2 4 5 6 7 9 9
11   | 4 4 5 5 9
12   | 2 7 9
13   | 4
14   | 0 8
15   |
16   | 1

The distribution is positively skewed (its longer tail extends toward the larger values).
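
A stem and leaf plot can be built programmatically. The Python sketch below assumes whole-number data and takes the last digit of each value as the leaf and the remaining leading digits as the stem. The function name `stem_and_leaf` is illustrative, and stems with no leaves are kept so that gaps in the data stay visible.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted list of leaves (last digit)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)     # e.g. 148 -> stem 14, leaf 8
        rows[stem].append(leaf)
    # Include empty stems between the smallest and largest stem.
    return {s: rows.get(s, []) for s in range(min(rows), max(rows) + 1)}


data_b = [106, 127, 119, 109, 102, 114, 122, 105, 115, 114, 107,
          102, 148, 104, 129, 140, 115, 92, 161, 134, 109]

plot = stem_and_leaf(data_b)
for stem, leaves in plot.items():
    print(f"{stem:02d} | {' '.join(str(leaf) for leaf in leaves)}")
```

Counting the leaves is a quick check on a hand-drawn plot: the number of leaves must equal the number of data values (21 here).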
