1 B Review of Statistical Concepts

REVIEW OF STATISTICAL
CONCEPTS
Numerical Descriptive Measures
R K BANSAL NITJ
Topics
 Measures of Central Tendency

 Mean, Median, Mode
 Measure of Variation
 Range, Variance and Standard Deviation,
Coefficient of Variation
 Shape
 Symmetric, Skewed, Using Box-and-Whisker Plots
R K BANSAL 2
Topics
(continued)
 The Empirical Rule and the Bienayme-

Chebyshev Rule
 Pitfalls in Numerical Descriptive Measures and
Ethical Issues
R K BANSAL 3
Summary Measures
Summary Measures
Central Tendency Variation
Mean Mode
Median Range Coefficient of
Variation
Variance
Standard Deviation
R K BANSAL 4
Measures of Central Tendency
Central Tendency
Mean Median Mode

n
X i
X  i 1
n
N for a sample/ subgroup
X i
 i 1
N for a population or universe

R K BANSAL 5
Mean (Arithmetic Mean)
 Mean (Arithmetic Mean) of Data Values

 Sample mean
n Sample Size
X i
X1  X 2   Xn
X i 1

n n
 Population mean
N Population Size
X i
X1  X 2   XN
 i 1

N N
R K BANSAL 6
(continued)
 The Most Common Measure of Central

Tendency
 Affected by Extreme Values (Outliers)
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 12 14
Mean = 5 Mean = 6
R K BANSAL 7
(continued)
 Approximating the Arithmetic Mean

 Used when raw data are not available
c
m
j 1
j fj
X

n
n  sample size
c  number of classes in the frequency distribution
m j  midpoint of the jth class
f j  frequencies of the jth class
R K BANSAL 8
Median
 Robust Measure of Central Tendency
 Not Affected by Extreme Values
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 12 14
Median = 5 Median = 5
 In an Ordered Array, the Median is the

‘Middle’ Number
 If n or N is odd, the median is the middle number
 If n or N is even, the median is the average of the
2 middle numbers
R K BANSAL 9
Median
M = L + (h / f) (N/2 – C)
Where
L is the lower limit of the median class
f is the frequency of the median class
h is the class interval
C is the cumulative frequency of class
preceding the median class
R K BANSAL 10
Mode
 A Measure of Central Tendency
 Value that Occurs Most Often
 Not Affected by Extreme Values
 There May Not Be a Mode
 There May Be Several Modes
 Used for Either Numerical or Categorical Data
0 1 2 3 4 5 6
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Mode = 9 No Mode
R K BANSAL 11
Mode
Mode = L + h(f1 – f0) /(2f1 – f0 –f2)
Where
L is the lower limit of the modal class
f1 is the frequency of the modal class
f0 is the frequency of class preceding the
modal class
f2 is the frequency of class proceeding the
modal class
h is the class interval
R K BANSAL 12
Measures of Variation
Variation
Variance Standard Deviation Coefficient

of Variation
Range Population
Population
Variance Standard
Sample Deviation
Variance Sample
Standard
Deviation
R K BANSAL 13
Range
 Measure of Variation
 Difference between the Largest and the
Smallest Observations:
Range  X Largest  X Smallest
 Ignores How Data are Distributed (SPREAD)
Range = 12 - 7 = 5 Range = 12 - 7 = 5
7 8 9 10 11 12 7 8 9 10 11 12
R K BANSAL 14
Variance
 Important Measure of Variation

 Shows Variation about the Mean
 Sample Variance: n
 X X
2
i
S 
2 i 1
n 1
 Population Variance:
N
 X 
2
i
 
2 i 1
R K BANSAL
N 15
Standard Deviation
 Most Important Measure of Variation
 Shows Variation about the Mean
 Has the Same Units as the Original Data
 Sample Standard Deviation: n
 X X
2
i
S i 1
n 1
 Population Standard Deviation: N
 X 
2
i
 i 1
N
R K BANSAL 16
Standard Deviation
(continued)
 Approximating the Standard Deviation

 Used when the raw data are not available and the
only source of data is a frequency distribution
m  X  fj
c
2
j

j 1
S
n 1
n  sample size
c  number of classes in the frequency distribution
m j  midpoint of the jth class
f j  frequencies of the jth class
R K BANSAL 17
Comparing Standard Deviations
Data A Mean = 15.5
s = 3.338
11 12 13 14 15 16 17 18 19 20 21
Data B
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 s = .9258
Data C
Mean = 15.5
11 12 13 14 15 16 17 18 19 20 21 s = 4.57
R K BANSAL 18
 Measure of Relative Variation
 Always in Percentage (%)
 Shows Variation Relative to the Mean
 Used to Compare Two or More Sets of Data
Measured in Different Units
 S 
CV   100%
X 
 Sensitive to Outliers
R K BANSAL 19
Comparing Coefficient
of Variation
 Stock A:
 Average price last year = $50
 Standard deviation = $2
 Stock B:
 Average price last year = $100
 Standard deviation = $5
 Coefficient of Variation:
 Stock A: S   $2 
CV    100%    100%  4%
X   $50 
 Stock B:
S   $5 
CV   100%   100%  5%
R K BANSAL
X   $100  20
Shape of a Distribution
 Describe How Data are Distributed
 Measures of Shape
 Skeweness – indicates lack of symmetry of a distb.
 Kurtosis – Indicates peakedness of a distbn
Left-Skewed Symmetric Right-Skewed

Mean < Median < Mode Mean = Median =Mode Mode < Median < Mean
Sk < 0 (-ve) Sk = 0 Sk > 0 (+ve)
R K BANSAL 21
Skeweness and Kurtosis
 Sk = Σ f. (x-μ)3/σs3 Σ f
x is the cell midpoint of the grouped data
σs is the standard deviation of the sample data
 Kurtosis is represented by K
K = Σ f. (x-μ)4/σs4 Σ f
K = 3 for Normal Distribution
K > 3 for Distribution more peaked than Normal
K < 3 for Distribution less peaked than Normal
 Sk and K tell us about shape of the
distribution
R K BANSAL 22
The Empirical Rule
 For Most Data Sets, Roughly 68% of the

Observations Fall Within 1 Standard Deviation
Around the Mean
 Roughly 95% of the Observations Fall Within
2 Standard Deviations Around the Mean
 Roughly 99.7% of the Observations Fall
Within 3 Standard Deviations Around the
Mean
R K BANSAL 23
The Bienayme-Chebyshev Rule
 The Percentage of Observations Contained
Within Distances of k Standard Deviations
Around the Mean Must Be at Least 1  1/ k 2 100%
 Applies regardless of the shape of the data set
 At least 75% of the observations must be contained
within distances of 2 standard deviations around the
mean
 At least 88.89% of the observations must be
contained within distances of 3 standard deviations
around the mean
 At least 93.75% of the observations must be
contained within distances of 4 standard deviations
R K BANSAL
around the mean 24
Camp Meidell Inequality
 The Percentage of Observations Contained

Within Distances of k Standard Deviations
Around the Mean Must Be at Least(1-1/2.25k2)
 Following conditions apply for above to be

applicable
 Applies for uni-modal distribution (data set)
 Also if the mean = mode for the data set
 When K=3 then area within (mean±3σ) >95.1%
R K BANSAL 25
Scatter Plots of Data with
Various Correlation Coefficients
Y Y Y
X X X
r = -1 r = -.6 r=0
Y Y
X X
R K BANSAL
r = .6 r=1 26
Pitfalls in Numerical Descriptive
Measures and Ethical Issues
 Data Analysis is Objective
 Should report the summary measures that best
meet the assumptions about the data set
 Data Interpretation is Subjective
 Should be done in a fair, neutral and clear manner
 Ethical Issues
 Should document both good and bad results
 Presentation should be fair, objective and neutral
 Should not use inappropriate summary measures
to distort the facts 27
R K BANSAL
Summary
 Described Measures of Central Tendency

 Mean, Median, Mode, Geometric Mean
 Described Measures of Variation
 Range, Variance and Standard Deviation,
 Illustrated Shape of Distribution
 Symmetric, Skewed
R K BANSAL 28
Summary
(continued)
 Described the Empirical Rule and the

Bienayme - Chebyshev Rule
 Discussed Correlation Coefficient
 Addressed Pitfalls in Numerical Descriptive
Measures and Ethical Issues
R K BANSAL 29
Reasons for Drawing a Sample
 Less Time Consuming Than a Census

 Less Costly to Administer Than a Census
 Less Cumbersome and More Practical to
Administer Than a Census of the Targeted
Population
R K BANSAL 30
Types of Sampling Methods
Samples
Non-Probability Probability Samples

Samples
(Convenience)
Simple
Random Stratified
Judgement Chunk
Cluster
Systematic
Quota
R K BANSAL 31
Probability Sampling
 Subjects of the Sample are Chosen Based on

Known Probabilities
Probability Samples
Simple
Systematic Stratified Cluster
Random
R K BANSAL 32
Simple Random Samples
 Every Individual or Item from the Frame Has

an Equal Chance of Being Selected
 Selection May Be With Replacement or
Without Replacement
 One May Use Table of Random Numbers or
Computer Random Number Generators to
Obtain Samples
R K BANSAL 33
Systematic Samples
 Decide on Sample Size: n
 Divide Frame of N individuals into Groups of k
Individuals: k=N/n
 Randomly Select One Individual from the 1st
Group
 Select Every k-th Individual Thereafter
N = 64
n=8 First Group
R K BANSAL
k=8 34
Stratified Samples
 Population Divided into 2 or More Groups
According to Some Common Characteristic
 Simple Random Sample Selected from Each
Group
 The Two or More Samples are Combined into
One
R K BANSAL 35
Cluster Samples
 Population Divided into Several “Clusters,”
Each Representative of the Population
 A Random Sampling of Clusters is Taken
 All Items in the Selected Clusters are Studied
Randomly Population
selected 2 divided
clusters into 4
clusters
R K BANSAL 36
Advantages and Disadvantages
 Simple Random Sample & Systematic Sample
 Simple to use
 May not be a good representation of the
population’s underlying characteristics
 Stratified Sample
 Ensures representation of individuals across the
entire population
 Cluster Sample
 More cost effective
 Less efficient (need larger sample to acquire the
same level of precision)
R K BANSAL 37

1 B Review of Statistical Concepts

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 B Review of Statistical Concepts

Uploaded by

Copyright:

Available Formats

REVIEW OF STATISTICAL

Numerical Descriptive Measures

 Measures of Central Tendency

 The Empirical Rule and the Bienayme-

Central Tendency Variation

Mean Median Mode

N for a population or universe

 Mean (Arithmetic Mean) of Data Values

 The Most Common Measure of Central

 Approximating the Arithmetic Mean

 In an Ordered Array, the Median is the

Variance Standard Deviation Coefficient

 Important Measure of Variation

 Approximating the Standard Deviation

Left-Skewed Symmetric Right-Skewed

Sk < 0 (-ve) Sk = 0 Sk > 0 (+ve)

 For Most Data Sets, Roughly 68% of the

 The Percentage of Observations Contained

 Following conditions apply for above to be

 When K=3 then area within (mean±3σ) >95.1%

 Described Measures of Central Tendency

 Described the Empirical Rule and the

 Less Time Consuming Than a Census

Non-Probability Probability Samples

 Subjects of the Sample are Chosen Based on

 Every Individual or Item from the Frame Has

You might also like