Objectives
Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
Data Description
Describe data using the measures of variation, such as the range, variance, and standard deviation. Identify the position of a data value in a data set using various measures of position, such as percentiles, deciles, and quartiles.
Objectives (contd.)
Use the techniques of exploratory data analysis, including boxplots and five-number summaries to discover various aspects of data.
Introduction
Statistical methods can be used to summarize data. Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange. Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.
Introduction (contd.)
Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values. The most common measures of position are percentiles, deciles, and quartiles.
Introduction (contd.)
The measures of central tendency, variation, and position are part of what is called traditional statistics. These measures are typically used to confirm conjectures about the data.
Introduction (contd.)
Another type of statistics is called exploratory data analysis. These techniques include the boxplot and the five-number summary. They can be used to explore data to see what they show.
Basic Vocabulary
A statistic is a characteristic or measure obtained by using the data values from a sample. A parameter is a characteristic or measure obtained by using all the data values for a specific population. When the data in a data set are ordered, the result is called a data array.
\[ \bar{X} = \frac{\sum X}{n} \]
Assume that data are obtained from a sample unless otherwise specified.
\[ \bar{X} = \frac{\sum wX}{\sum w} \]
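As a rough illustration of the weighted mean formula, the sketch below computes a grade point average. The grade points and credit hours are hypothetical values chosen for illustration; they do not come from the slides.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical data: grade points earned in four courses and the
# credit hours of each course (the weights).
grade_points = [4, 2, 3, 1]
credit_hours = [3, 3, 4, 2]

gpa = weighted_mean(grade_points, credit_hours)
print(round(gpa, 2))
```

Each course counts in proportion to its credit hours, so a four-credit course influences the result more than a two-credit course.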
Midrange
The midrange is defined as the sum of the lowest and highest values in the data set divided by 2. The symbol for midrange is MR.
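The four measures of central tendency can be sketched with Python's standard `statistics` module. The data set below is hypothetical and is used only to show the calculations.

```python
import statistics

# Hypothetical sample, not taken from the slides.
data = [2, 3, 3, 5, 8, 9]

mean = statistics.mean(data)              # sum of values divided by n
median = statistics.median(data)          # middle of the ordered data array
modes = statistics.multimode(data)        # list of the most frequent value(s)
midrange = (min(data) + max(data)) / 2    # MR = (lowest + highest) / 2

print(mean, median, modes, midrange)
```

`multimode` returns a list because a data set can have more than one mode.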
Eighty randomly selected light bulbs were tested to determine their lifetimes (in hours). This frequency distribution was obtained. Find the (a) mean and (b) modal class.

Class boundaries    Frequency
52.5-63.5           6
63.5-74.5           12
74.5-85.5           25
85.5-96.5           18
96.5-107.5          14
107.5-118.5         5
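One way to work this example is sketched below. It assumes the usual grouped-data approach, in which each class is represented by its midpoint and the modal class is the class with the largest frequency.

```python
# Class boundaries and frequencies from the light bulb example.
boundaries = [
    (52.5, 63.5),
    (63.5, 74.5),
    (74.5, 85.5),
    (85.5, 96.5),
    (96.5, 107.5),
    (107.5, 118.5),
]
frequencies = [6, 12, 25, 18, 14, 5]

# Represent each class by its midpoint.
midpoints = [(low + high) / 2 for low, high in boundaries]

n = sum(frequencies)  # total number of bulbs
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# The modal class is the class (or classes) with the highest frequency.
max_frequency = max(frequencies)
modal_classes = [b for b, f in zip(boundaries, frequencies) if f == max_frequency]

print(n, grouped_mean, modal_classes)
```

The grouped mean is an approximation, because the individual lifetimes inside each class are not known.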
Distribution Shapes
In a positively skewed or right skewed distribution, the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution.
The Range
The range is the highest value minus the lowest value in a data set. The symbol R is used for the range.
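Population variance:
\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \]

With n in the denominator,
\[ s^2 = \frac{\sum (X - \bar{X})^2}{n} \]
is a biased estimator of the population variance.

Population standard deviation:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]

Sample variance:
\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \]

Shortcut formula for the sample variance:
\[ s^2 = \frac{\sum X^2 - \left[ (\sum X)^2 / n \right]}{n - 1} \]

Sample variance for grouped data, where \(X_m\) is the class midpoint:
\[ s^2 = \frac{\sum f X_m^2 - \left[ (\sum f X_m)^2 / n \right]}{n - 1} \]

Sample standard deviation:
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{\sum X^2 - \left[ (\sum X)^2 / n \right]}{n - 1}} \]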
Coefficient of Variation
The coefficient of variation is the standard deviation divided by the mean. The result is expressed as a percentage. The coefficient of variation is used to compare standard deviations when the units are different for the two variables being compared.

The number of incidents where police were needed for a sample of ten schools in Allegheny County is 7, 37, 3, 8, 48, 11, 6, 0, 10, 3. Assume the data represent samples. Find the range, variance, standard deviation, and coefficient of variation.
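The Allegheny County example can be checked with a short sketch using Python's standard `statistics` module. Because the data are treated as a sample, the sketch uses `statistics.variance` and `statistics.stdev`, which divide by n - 1.

```python
import statistics

# Incidents per school, from the example above (treated as a sample).
incidents = [7, 37, 3, 8, 48, 11, 6, 0, 10, 3]

data_range = max(incidents) - min(incidents)   # R = highest - lowest
mean = statistics.mean(incidents)
variance = statistics.variance(incidents)      # sample variance, n - 1 denominator
std_dev = statistics.stdev(incidents)          # square root of the sample variance
cv_percent = std_dev / mean * 100              # coefficient of variation, in percent

print(data_range, mean)
print(round(variance, 1), round(std_dev, 1), round(cv_percent, 1))
```

A coefficient of variation near 120% means the standard deviation is larger than the mean, so these counts vary widely from school to school.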
The data show the number of murders in 25 selected cities. Find the variance and standard deviation.
Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable. The measures of variance and standard deviation are used to determine the consistency of a variable.
Chebyshev's Theorem
The proportion of values from a data set that will fall within k standard deviations of the mean will be at least \(1 - 1/k^2\), where k is a number greater than 1. This theorem applies to any distribution regardless of its shape.
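The bound in Chebyshev's theorem is easy to tabulate. The sketch below uses a hypothetical helper name, `chebyshev_lower_bound`, chosen here for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


for k in (1.5, 2, 3):
    print(k, chebyshev_lower_bound(k))
```

For k = 2 the bound is 0.75, so at least 75% of the values in any data set lie within two standard deviations of the mean.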
Standard Scores
A standard score or z score is used when direct comparison of raw scores is impossible. A standard score or z score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation.
\[ z = \frac{X - \bar{X}}{s} \]

\[ z = \frac{42 - 39}{4} = \frac{3}{4} \]
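The same calculation as a minimal sketch, using the value 42, mean 39, and standard deviation 4 from the worked example (the function name `z_score` is chosen here for illustration):

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above or below the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev


z = z_score(42, 39, 4)
print(z)
```

A positive z score means the value is above the mean; a negative z score means it is below.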
Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group. A percentile, P, is an integer between 1 and 99 such that the Pth percentile is a value where P% of the data values are less than or equal to the value and (100 - P)% of the data values are greater than or equal to the value.
Find the percentile ranks of each weight in the data set. The weights are in pounds.
Data: 78, 82, 86, 88, 92, 97
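\[ \text{Percentile} = \frac{(\text{number of values below}) + 0.5}{\text{total number of values}} \times 100\% \]

For 78: \( \frac{0 + 0.5}{6} \times 100\% \approx 8.3\% \), the 8th percentile.
For 82: \( \frac{1 + 0.5}{6} \times 100\% = 25\% \), the 25th percentile.
For 97: \( \frac{5 + 0.5}{6} \times 100\% \approx 91.7\% \), the 92nd percentile.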
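The percentile rank formula can be applied to every weight in the data set with a short sketch. Rounding to the nearest whole percentile is an assumption that matches the three worked values above.

```python
def percentile_rank(data, value):
    """Percentile rank: (number of values below + 0.5) / total * 100."""
    below = sum(1 for x in data if x < value)
    return (below + 0.5) / len(data) * 100


weights = [78, 82, 86, 88, 92, 97]  # pounds, from the example above

# Round each rank to the nearest whole percentile.
ranks = {w: round(percentile_rank(weights, w)) for w in weights}

for weight, rank in ranks.items():
    print(weight, rank)
```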
Outliers
An outlier is an extremely high or an extremely low data value when compared with the rest of the data values. Outliers can be the result of measurement or observational error. When a distribution is normal or bell-shaped, data values that are beyond three standard deviations of the mean can be considered suspected outliers.
Outliers (contd.)
6 12 13 18 20 51 2
[Figure: Q-to-Q Inflation in New Zealand, 1985 to 1995]
Detecting Outliers
1. Arrange the data in order and find Q1 and Q3.
2. Find the interquartile range: IQR = Q3 - Q1.
3. Multiply IQR by 1.5.
4. Subtract the value obtained in step 3 from Q1 and add the value to Q3.
5. Check the data set for any data value that is smaller than Q1 - 1.5(IQR) or larger than Q3 + 1.5(IQR).
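The five steps above can be sketched as follows. Two things here are assumptions rather than content from the slides: the data set is hypothetical, and Q1 and Q3 are taken as the medians of the lower and upper halves of the ordered data. Other quartile conventions exist and can give slightly different fences.

```python
import statistics


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(data)                 # step 1: arrange the data in order
    n = len(ordered)
    lower_half = ordered[: n // 2]         # values below the median position
    upper_half = ordered[(n + 1) // 2 :]   # values above the median position
    return statistics.median(lower_half), statistics.median(upper_half)


def find_outliers(data):
    """Return the values outside Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1                          # step 2: interquartile range
    step = 1.5 * iqr                       # step 3: multiply IQR by 1.5
    low_fence = q1 - step                  # step 4: subtract from Q1 ...
    high_fence = q3 + step                 # ... and add to Q3
    # Step 5: flag any value beyond either fence.
    return [x for x in data if x < low_fence or x > high_fence]


# Hypothetical data set used only for illustration.
sample = [5, 6, 12, 13, 15, 18, 22, 50]
print(quartiles(sample), find_outliers(sample))
```

For this sample the fences are -7.5 and 36.5, so 50 is the only value flagged.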
These data represent the volumes in cubic yards of the largest dams in the U.S. and South America. Construct a boxplot of the data for each region and compare the distributions.

United States    South America
125,628          311,539
92,000           274,026
78,008           105,944
77,700           102,014
66,500           56,242
62,850           46,563
52,435
50,000
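A boxplot is drawn from the five-number summary (minimum, Q1, median, Q3, maximum), so computing that summary for each region is a reasonable first step. The sketch below assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data; software that uses a different quartile convention may report slightly different values.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a data set."""
    ordered = sorted(data)
    n = len(ordered)
    lower_half = ordered[: n // 2]         # values below the median position
    upper_half = ordered[(n + 1) // 2 :]   # values above the median position
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )


# Dam volumes in cubic yards, from the table above.
united_states = [125628, 92000, 78008, 77700, 66500, 62850, 52435, 50000]
south_america = [311539, 274026, 105944, 102014, 56242, 46563]

us_summary = five_number_summary(united_states)
sa_summary = five_number_summary(south_america)

print("United States:", us_summary)
print("South America:", sa_summary)
```

Comparing the two summaries shows that the South American volumes have a larger median and a much wider spread between Q1 and Q3 than the U.S. volumes.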
Summary
Some basic ways to summarize data include measures of central tendency, measures of variation or dispersion, and measures of position. The three most commonly used measures of central tendency are the mean, median, and mode. The midrange is also used to represent an average.
Summary (contd.)
The three most commonly used measures of variation are the range, variance, and standard deviation. The most common measures of position are percentiles, quartiles, and deciles. Data values are distributed according to Chebyshev's theorem and, in special cases, the empirical rule.
Summary (contd.)
The coefficient of variation is used to describe the standard deviation in relationship to the mean. These methods are commonly called traditional statistics. Other methods, such as the boxplot and five-number summary, are part of exploratory data analysis; they are used to examine data to see what they reveal.
Conclusions
By combining these techniques, the student is now able to collect, organize, summarize, and present data.