
Lecture 2 Descriptive Statistics

Source of Data
Types of Data
Graphical methods for describing a set of data
Numerical methods for describing a set of data
Summary
Readings: Chap. 1
IELM151/ Stuart X. Zhu 1

Source of Data
Data distributed by an organization or an individual
HK Census and Statistics Department

A survey
A designed experiment
An observational study


Types of Data
Quantitative
Numerical, computable, describes quantity
E.g., height, weight, salary, cost, time, distance

Qualitative
Nonnumerical, categorical, describes an attribute
E.g., blood type, gender, grading, professional rank

Types of Data
Data
  Categorical (defined categories)
    Examples: Marital Status, Political Party, Eye Color
  Numerical
    Discrete (counted items)
      Examples: Number of Children, Defects per hour
    Continuous (measured characteristics)
      Examples: Weight, Voltage

Graphical Approaches
(Relative) Frequency Histogram Stem and Leaf Plot


PR Example
Here are the data concerning the percentages of revenue (PR) spent on R&D for 30 HK companies.

Company   1     2     3     4     5     6     7     8     9     10
%         9.4   8.4   12.5  6.7   10    7.8   10.2  9.5   6.5   11.4

Company   11    12    13    14    15    16    17    18    19    20
%         7.5   10.2  9.9   8.2   8.8   11.7  7.9   10.3  7.5   11

Company   21    22    23    24    25    26    27    28    29    30
%         11.1  8.5   9.4   9.7   12.3  10.6  8.9   8.1   6.9   10

(Relative) Frequency Histogram


Principles for constructing histograms:

Determine the number of classes
Sturges' formula: k = 1 + 3.3 log10(n)

Determine the class width
Approximately (Maximum - Minimum) / k

Locate the class boundaries
Start from a lowest class boundary that is smaller than the minimum and locate the others one by one. Note that a measurement cannot fall on a boundary. (See L02_Descriptive_s1.xls)

Usage of Relative Frequency Histogram


Proportion of PR > 10.05% in the 30 HK companies (Sample)
Estimate the fraction of PR > 10.05% for all HK companies (Population)
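The construction steps can be sketched in Python for the PR data. This is a minimal sketch and not the worksheet in L02_Descriptive_s1.xls: rounding k to the nearest integer, the starting boundary of 6.05 (chosen to match the 10.05% threshold used on this slide), and all variable names are assumptions.

```python
import math

# PR data: percentage of revenue spent on R&D for the 30 HK companies
pr = [9.4, 8.4, 12.5, 6.7, 10, 7.8, 10.2, 9.5, 6.5, 11.4,
      7.5, 10.2, 9.9, 8.2, 8.8, 11.7, 7.9, 10.3, 7.5, 11,
      11.1, 8.5, 9.4, 9.7, 12.3, 10.6, 8.9, 8.1, 6.9, 10]
n = len(pr)

# Step 1: number of classes from Sturges' formula (rounding is an assumption)
k = round(1 + 3.3 * math.log10(n))

# Step 2: class width, approximately (max - min) / k
width = round((max(pr) - min(pr)) / k, 1)

# Step 3: boundaries start below the minimum. 6.05 is an assumed starting
# point with one more decimal than the data, so no value falls on a boundary.
# Classes are added until the maximum is covered.
boundaries = [6.05]
while boundaries[-1] < max(pr):
    boundaries.append(round(boundaries[0] + len(boundaries) * width, 2))

# Frequency and relative frequency of each class
freq = [sum(lo < x < hi for x in pr)
        for lo, hi in zip(boundaries, boundaries[1:])]
rel_freq = [f / n for f in freq]

# Sample proportion of companies with PR > 10.05%
prop_above = sum(x > 10.05 for x in pr) / n

print(k, width, boundaries)
print(freq)
print(round(prop_above, 4))
```

With the assumed start of 6.05, one class more than k is needed to cover the maximum of 12.5, which illustrates that Sturges' formula is only a guide to the number of classes.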

Stem and Leaf Plot


A stem-and-leaf plot organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.

Stem-and-leaf Plot of PR Example


Split each sample into two parts consisting of a stem and a leaf

Stem   Leaf     Frequency
6      579      3
7      5589     4
8      124589   6
9      44579    5
10     002236   6
11     0147     4
12     35       2
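The stem/leaf split can be sketched in a few lines of Python. This is an illustrative sketch that assumes one-decimal data, so the integer part is the stem and the tenths digit is the leaf; the variable names are hypothetical.

```python
from collections import defaultdict

# PR data: percentage of revenue spent on R&D for the 30 HK companies
pr = [9.4, 8.4, 12.5, 6.7, 10, 7.8, 10.2, 9.5, 6.5, 11.4,
      7.5, 10.2, 9.9, 8.2, 8.8, 11.7, 7.9, 10.3, 7.5, 11,
      11.1, 8.5, 9.4, 9.7, 12.3, 10.6, 8.9, 8.1, 6.9, 10]

stems = defaultdict(list)
for x in sorted(pr):
    # Work in tenths so that 9.4 becomes 94: stem 9, leaf 4
    stem, leaf = divmod(round(x * 10), 10)
    stems[stem].append(leaf)

# One row per stem: stem, leaves in increasing order, frequency
for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:>2} | {leaves:<8} {len(stems[stem])}")
```

Sorting the data first keeps the leaves in increasing order within each row, which is what makes the median easy to read off the plot.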

(Relative) Frequency Histogram versus Stem and Leaf Plot

Stem and Leaf Plot: rotate the plot counterclockwise 90°

[Figure: Frequency Histogram of the PR data; x-axis: Percentage of Revenue (6.55 to 12.55), y-axis: Frequency (0 to 8)]

Double-Stem-and-leaf Plot of PR Example


Stem   Leaf    Frequency
6      579     3
7      5589    4
8.     124     3
8*     589     3
9.     44      2
9*     579     3
10.    00223   5
10*    6       1
11.    014     3
11*    7       1
12.    3       1
12*    5       1

Numerical Approaches
Parameters
Numerical descriptive measures computed from POPULATION measurements

Statistics
Numerical descriptive measures computed from SAMPLE measurements

Facts
Parameters are constant though they may be unknown
Statistics change from sample to sample (random)

Measure of Location (Central Tendency, Center of the Distribution)


Mean
Sample mean: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Population mean: $\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i$
Sample: 5, 15, 7, 34, 450

Median
The middle value of the $x_i$'s after sorting them in increasing order (even / odd)
Can be obtained from the stem and leaf plot
$\tilde{x} = x_{(n+1)/2}$, if n is odd
$\tilde{x} = \frac{1}{2} \left( x_{n/2} + x_{n/2+1} \right)$, if n is even

Mode
The value that appears most frequently among the $x_i$'s
Sample: 3, 7, 24, 5, 9, 11, 13, 15, 66, 66
Modal Class in (Relative) Frequency Histogram
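The three measures of location can be checked on the slide samples with Python's standard `statistics` module. This is a minimal sketch; `median_by_formula` is a hypothetical helper that spells out the odd/even rule on the sorted values.

```python
import statistics

# Sample from the mean/median slide: one extreme value (450)
sample = [5, 15, 7, 34, 450]
mean = statistics.mean(sample)      # (5 + 15 + 7 + 34 + 450) / 5
median = statistics.median(sample)  # middle value of the sorted sample

# Sample from the mode slide
mode_sample = [3, 7, 24, 5, 9, 11, 13, 15, 66, 66]
mode = statistics.mode(mode_sample)

def median_by_formula(xs):
    """Median via the odd/even rule applied to the sorted values."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # x_{(n+1)/2}, 1-based
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average of x_{n/2} and x_{n/2+1}

print(mean, median, mode)
print(median_by_formula(sample), median_by_formula(mode_sample))
```

The single value 450 pulls the mean far above the median, which is why the median is often preferred when extreme values are present.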


Skewness of Data
Three types
Symmetric: mean = median
Skewed to the right: mean > median
Skewed to the left: mean < median

Remark
The more skewed the data, the greater the differences among the measures of central tendency.
When the data are skewed or contain extreme values, the median and mode are better descriptions.
Advantages of the mean over the median and mode:
More amenable to mathematical and theoretical treatment
More stable if n is large

PR Example
mean = 9.36 < median = 9.45
A little bit skewed to the left
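The comparison for the PR data can be reproduced as follows. This is a minimal sketch using the standard `statistics` module; comparing mean and median is only the rule of thumb from the slide, not a formal skewness test.

```python
import statistics

# PR data: percentage of revenue spent on R&D for the 30 HK companies
pr = [9.4, 8.4, 12.5, 6.7, 10, 7.8, 10.2, 9.5, 6.5, 11.4,
      7.5, 10.2, 9.9, 8.2, 8.8, 11.7, 7.9, 10.3, 7.5, 11,
      11.1, 8.5, 9.4, 9.7, 12.3, 10.6, 8.9, 8.1, 6.9, 10]

mean = statistics.mean(pr)
median = statistics.median(pr)

# Rule of thumb: mean < median suggests a longer tail on the left
if mean < median:
    shape = "skewed to the left"
elif mean > median:
    shape = "skewed to the right"
else:
    shape = "symmetric"

print(round(mean, 2), round(median, 2), shape)
```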

Measure of Variability (Dispersion of the Distribution)


Range = Maximum - Minimum
Sample 1: 5, 15, 7, 34, 30
Sample 2: 5, 19, 18, 20, 34

Measure of deviation from the mean
Mean Absolute Deviation (MAD): $\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$

Variance
Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
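The two samples on this slide have the same range, so the range alone cannot tell them apart. The sketch below computes range, MAD, and sample variance directly from their definitions; the function name `spread` is hypothetical.

```python
def spread(xs):
    """Return (range, MAD, sample variance) of a list of numbers."""
    n = len(xs)
    xbar = sum(xs) / n
    value_range = max(xs) - min(xs)
    mad = sum(abs(x - xbar) for x in xs) / n          # divides by n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)   # divides by n - 1
    return value_range, mad, s2

sample1 = [5, 15, 7, 34, 30]
sample2 = [5, 19, 18, 20, 34]

for name, xs in (("Sample 1", sample1), ("Sample 2", sample2)):
    value_range, mad, s2 = spread(xs)
    print(f"{name}: range = {value_range}, MAD = {mad:.2f}, s^2 = {s2:.2f}")
```

Both ranges equal 29, while MAD and the sample variance are clearly larger for Sample 1, whose values sit further from its mean.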

An Exercise
Exercise 1:
(a) Sample 1: 12, 6, 15, 3, 8, 7, 10, 10, 9, 9. Compute mean, median, range, MAD, and sample variance.
(b) Sample 2: 12, 6, 15, 3, 4, 7, 10, 15, 9, 7. Do the same calculation. (See L02-04_Descriptive_s2-1.xls)

An important and useful short-cut formula
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n-1)}$
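The short-cut formula can be checked against the definition on Sample 1 of Exercise 1. The sketch below uses exact fractions so that the two sides can be compared without rounding error; the variable names are illustrative.

```python
from fractions import Fraction

sample1 = [12, 6, 15, 3, 8, 7, 10, 10, 9, 9]
n = len(sample1)
xbar = Fraction(sum(sample1), n)

# Definition: sum of squared deviations from the mean, divided by n - 1
s2_definition = sum((x - xbar) ** 2 for x in sample1) / (n - 1)

# Short-cut: needs only the sum and the sum of squares of the data
sum_x = sum(sample1)
sum_x2 = sum(x * x for x in sample1)
s2_shortcut = Fraction(n * sum_x2 - sum_x ** 2, n * (n - 1))

print(s2_definition, s2_shortcut, float(s2_shortcut))
```

The short-cut form needs only the running totals of x and x squared, so it can be updated when new observations arrive without revisiting the original data.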

Another Exercise
Exercise 2:
Sample: 17, 5, 4, 10, 2, 11
Verify that mean = 8.1667, median = 7.5, s = 5.5648, n = 6.
Now, the and the maximum of three additional data are respectively 75, 991, and 20.
What are the sample mean, sample median, and the sample standard deviation of the combined set of 9 data?

Measure of Relative Variation


Coefficient of variation CV (for positive measurements only)

$CV = \frac{s}{\bar{x}} \times 100\%$

Sample A: 3, 10, 7, 4, 6, mean = 6, s = 2.7386, CV = 45.64%
Sample B: 30, 100, 70, 40, 60, mean = 60, s = 27.386, CV = 45.64%
Sample C: 13, 20, 17, 14, 16, mean = ?, s = ?, CV = ?
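The CV values for Samples A and B can be reproduced with the standard `statistics` module. This is a minimal sketch; the helper name `cv_percent` is hypothetical.

```python
import statistics

def cv_percent(xs):
    """Coefficient of variation in percent (for positive measurements only)."""
    return statistics.stdev(xs) / statistics.mean(xs) * 100

sample_a = [3, 10, 7, 4, 6]
sample_b = [30, 100, 70, 40, 60]   # Sample A multiplied by 10

print(f"CV(A) = {cv_percent(sample_a):.2f}%")
print(f"CV(B) = {cv_percent(sample_b):.2f}%")
```

Multiplying every value by 10 scales both s and the mean by 10, so the CV is unchanged. The same function can be applied to Sample C, where every value of Sample A is shifted by 10 instead.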

Measure of Relative Standing


Percentile
Population: The pth percentile is the value of x such that p% of the measurements are less than that value of x and (100 - p)% are greater.
Sample: Lower Quartile (QL) 25%, Upper Quartile (QU) 75%

Sample 1: 12, 6, 15, 3, 8, 7, 10, 10, 9, 9
Sorted: 3, 6, 7, 8, 9, 9, 10, 10, 12, 15


QL = (10 + 1)/4 = 2.75th obs. = 6 + 0.75 (7 - 6) = 6.75
QU = 3(10 + 1)/4 = 8.25th obs. = 10 + 0.25 (12 - 10) = 10.5
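One common way to handle a fractional position such as the 2.75th observation is linear interpolation between the two neighbouring sorted values. The sketch below assumes that convention; the helper name `percentile` is hypothetical, and other conventions (for example, taking the midpoint of the two neighbours) give slightly different quartiles.

```python
def percentile(xs, p):
    """p-th sample percentile using the (n + 1) * p / 100 position rule."""
    s = sorted(xs)
    n = len(s)
    pos = (n + 1) * p / 100   # 1-based position, possibly fractional
    lower = int(pos)          # index of the observation just below pos
    frac = pos - lower        # how far pos lies towards the next observation
    if lower < 1:             # position before the first observation
        return s[0]
    if lower >= n:            # position at or beyond the last observation
        return s[-1]
    return s[lower - 1] + frac * (s[lower] - s[lower - 1])

sample1 = [12, 6, 15, 3, 8, 7, 10, 10, 9, 9]
q_l = percentile(sample1, 25)   # position 2.75, between 6 and 7
q_u = percentile(sample1, 75)   # position 8.25, between 10 and 12
print(q_l, q_u)
```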

Concepts based on Percentile


Interquartile range = QU - QL
Trimmed mean
The mean of the trimmed sample obtained by eliminating data below the pth percentile and above the (100 - p)th percentile

Sample 1: 12, 6, 15, 3, 8, 7, 10, 10, 9, 9


Interquartile range = QU - QL = 10.5 - 6.75 = 3.75
The 10% trimmed mean is the mean of the following sample: 12, 6, 8, 7, 10, 10, 9, 9, which is 8.875
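Both quantities can be reproduced with the standard `statistics` module (Python 3.8 or later). This is a sketch under two assumptions: quartiles are taken with the `"exclusive"` method, which uses (n + 1)p positions with interpolation, and the 10% trimmed mean drops 10% of the observations from each end. The helper `trimmed_mean` is hypothetical.

```python
import statistics

sample1 = [12, 6, 15, 3, 8, 7, 10, 10, 9, 9]

# Quartiles: three cut points that divide the sorted data into four parts
q_l, q_m, q_u = statistics.quantiles(sample1, n=4, method="exclusive")
iqr = q_u - q_l

def trimmed_mean(xs, fraction):
    """Mean after dropping the given fraction of observations from each end."""
    s = sorted(xs)
    k = int(len(s) * fraction)   # number of observations cut from each end
    return statistics.mean(s[k:len(s) - k])

print(q_l, q_u, iqr)
print(trimmed_mean(sample1, 0.10))
```

For this sample the trimming removes the smallest value (3) and the largest value (15), leaving the eight observations listed on the slide.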


Z-Score (Standard Score)


The sample z-score corresponding to a particular observation x is
$z\text{-score} = \frac{x - \bar{x}}{s}$

Criterion

2 < |z-score| < 3 is quite unlikely
|z-score| > 3 is very unlikely

If |z-score| > 2, the observation is a possible OUTLIER.
Sample: 5, 15, 27, 14, 20, 35, 27, 450; |z-score| = 2.47 for the last observation; check the source of the data before further analysis
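The z-score of each observation in the sample can be computed as below. This is a minimal sketch using the sample standard deviation (n - 1 divisor) from the standard `statistics` module; the threshold of 2 is the criterion stated on this slide.

```python
import statistics

sample = [5, 15, 27, 14, 20, 35, 27, 450]
xbar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation (n - 1 divisor)

# z-score of every observation
z_scores = [(x - xbar) / s for x in sample]

# Observations flagged as possible outliers by the |z-score| > 2 criterion
flagged = [x for x, z in zip(sample, z_scores) if abs(z) > 2]

for x, z in zip(sample, z_scores):
    print(f"x = {x:>3}, z = {z:6.2f}")
print("possible outliers:", flagged)
```

Note that the extreme value itself inflates both the mean and s, so its own z-score stays below 3 even though it is far from the rest of the data.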

Summary
Graphical methods are good for presenting data, but they are not easy to compare and are difficult to use for statistical inference.
Numerical methods mainly focus on the CENTRAL VALUE and the SPREAD of data.
Different measures have their own advantages and disadvantages. Be careful and smart in using them.

Cathay Pacific's Experiment


The marketing group wanted to increase the number of business class seats sold on its off-peak flights.
Key factors were identified as advertising level and pricing strategy.
There exist two levels of advertising campaigns and three pricing strategies in geographically.
Question
Which level and strategy are the best?

334 Reform
Effect on HKUST undergraduate education
The development of students' ability
Employment opportunity
