CONCEPTS
R K BANSAL NITJ
Topics
Topics
(continued)
Summary Measures
Central Tendency: Mean, Median, Mode
Variation: Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Central Tendency
Central Tendency
Sample (or subgroup) mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Population mean: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
[Figure: two dot plots, one on a 0 to 10 scale with Mean = 5 and one on a 0 to 14 scale with Mean = 6]
Mean (Arithmetic Mean)
(continued)
$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n}$
Where
$n$ = sample size
$c$ = number of classes in the frequency distribution
$m_j$ = midpoint of the jth class
$f_j$ = frequency of the jth class
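The grouped-data mean above can be sketched in a few lines of Python. The function name `grouped_mean` and the three-class frequency table are assumptions made for illustration, not values from the slides.

```python
def grouped_mean(midpoints, freqs):
    """Approximate the mean from class midpoints m_j and frequencies f_j."""
    n = sum(freqs)  # sample size is the total frequency
    return sum(m * f for m, f in zip(midpoints, freqs)) / n


# Hypothetical classes 0-10, 10-20, 20-30 with midpoints 5, 15, 25
midpoints = [5, 15, 25]
freqs = [2, 5, 3]
print(grouped_mean(midpoints, freqs))  # (10 + 75 + 75) / 10 = 16.0
```

The result is only an approximation of the raw-data mean, because every observation in a class is treated as if it sat at the class midpoint.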
Median
Robust Measure of Central Tendency
Not Affected by Extreme Values
[Figure: two dot plots, one on a 0 to 10 scale and one on a 0 to 14 scale; Median = 5 in both]
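A small sketch of why the median is called robust: replacing one value with an extreme one moves the mean but leaves the median where it was. The data below are hypothetical and are not the values plotted on the slide.

```python
import statistics

# Hypothetical data, chosen only to illustrate the effect of an extreme value.
base = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]  # largest value replaced by an extreme one

mean_base = statistics.mean(base)
mean_outlier = statistics.mean(with_outlier)
median_base = statistics.median(base)
median_outlier = statistics.median(with_outlier)

print(mean_base, mean_outlier)      # the mean shifts from 3 to 12
print(median_base, median_outlier)  # the median stays at 3
```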
M = L + (h / f) (N/2 – C)
Where
L is the lower limit of the median class
f is the frequency of the median class
h is the width of the class interval
C is the cumulative frequency of the class
preceding the median class
N is the total number of observations
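The grouped median formula can be sketched as follows. The function name, the equal-width assumption, and the example table are illustrative assumptions; the median class is taken to be the first class whose cumulative frequency reaches N/2.

```python
def grouped_median(lower_limits, freqs, h):
    """Median of grouped data: M = L + (h / f) * (N/2 - C).

    Assumes equal class width h and classes listed in ascending order.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    cum = 0  # cumulative frequency of the classes before the current one (C)
    for lower, f in zip(lower_limits, freqs):
        if cum + f >= n / 2:  # first class reaching N/2 is the median class
            return lower + (h / f) * (n / 2 - cum)
        cum += f
    raise ValueError("median class not found")


# Hypothetical classes 0-10, 10-20, 20-30
print(grouped_median([0, 10, 20], [2, 5, 3], 10))  # 10 + (10/5) * (5 - 2) = 16.0
```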
Mode
A Measure of Central Tendency
Value that Occurs Most Often
Not Affected by Extreme Values
There May Not Be a Mode
There May Be Several Modes
Used for Either Numerical or Categorical Data
[Figure: two dot plots; one on a 0 to 14 scale with Mode = 9, one on a 0 to 6 scale with No Mode]
Mode
Mode = L + h(f1 – f0) /(2f1 – f0 –f2)
Where
L is the lower limit of the modal class
f1 is the frequency of the modal class
f0 is the frequency of the class preceding the
modal class
f2 is the frequency of the class succeeding the
modal class
h is the width of the class interval
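A minimal sketch of the grouped mode formula, under a few stated assumptions: classes have equal width h, the first class with the highest frequency is taken as the modal class when there are ties, and a missing neighbouring class counts as frequency 0. The function name and data are hypothetical.

```python
def grouped_mode(lower_limits, freqs, h):
    """Mode of grouped data: L + h * (f1 - f0) / (2*f1 - f0 - f2)."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])  # index of modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0               # class before the modal class
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0  # class after the modal class
    denom = 2 * f1 - f0 - f2
    if denom == 0:
        raise ValueError("formula undefined: neighbouring frequencies equal f1")
    return lower_limits[i] + h * (f1 - f0) / denom


# Hypothetical classes 0-10, 10-20, 20-30
print(grouped_mode([0, 10, 20], [2, 5, 3], 10))  # 10 + 10 * 3 / 5 = 16.0
```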
Measures of Variation
Variation
Measure of Variation
Difference between the Largest and the
Smallest Observations:
$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$
Ignores How Data are Distributed (SPREAD)
[Figure: two dot plots on a 7 to 12 scale; Range = 12 - 7 = 5 in both]
Variance
Sample Variance:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Population Variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
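The only difference between the sample and population variance is the divisor, n - 1 versus N. The sketch below makes that concrete; the function names and the data set are assumptions for illustration.

```python
def sample_variance(xs):
    """S^2: squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def population_variance(xs):
    """sigma^2: squared deviations from the population mean, divided by N."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


data = [7, 8, 9, 10, 11, 12]  # hypothetical observations
print(sample_variance(data))      # 17.5 / 5 = 3.5
print(population_variance(data))  # 17.5 / 6, roughly 2.917
```

Taking the square root of either result gives the corresponding standard deviation, in the same units as the data.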
Standard Deviation
Most Important Measure of Variation
Shows Variation about the Mean
Has the Same Units as the Original Data
Sample Standard Deviation:
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Population Standard Deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
Standard Deviation
(continued)
$S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n - 1}}$
Where
$n$ = sample size
$c$ = number of classes in the frequency distribution
$m_j$ = midpoint of the jth class
$f_j$ = frequency of the jth class
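The grouped-data standard deviation can be sketched like this, reusing the midpoint approximation for the mean. The function name and the frequency table are hypothetical.

```python
import math


def grouped_std(midpoints, freqs):
    """Approximate sample standard deviation from midpoints m_j and frequencies f_j."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    squared_dev = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
    return math.sqrt(squared_dev / (n - 1))


# Hypothetical classes with midpoints 5, 15, 25; the grouped mean is 16
s = grouped_std([5, 15, 25], [2, 5, 3])
print(round(s, 4))  # sqrt(490 / 9), roughly 7.3786
```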
Comparing Standard Deviations
[Figure: three dot plots on an 11 to 21 scale]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57
Coefficient of Variation
Measure of Relative Variation
Always in Percentage (%)
Shows Variation Relative to the Mean
Used to Compare Two or More Sets of Data
Measured in Different Units
$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$
Sensitive to Outliers
Comparing Coefficient
of Variation
Stock A:
Average price last year = $50
Standard deviation = $2
Stock B:
Average price last year = $100
Standard deviation = $5
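Coefficient of Variation:
Stock A: $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \left(\frac{\$2}{\$50}\right) \cdot 100\% = 4\%$
Stock B: $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \left(\frac{\$5}{\$100}\right) \cdot 100\% = 5\%$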
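The stock comparison can be reproduced with a short sketch. The function name `coefficient_of_variation` is an assumption; the inputs are the figures quoted for Stock A and Stock B.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: (S / X-bar) * 100."""
    return std_dev * 100 / mean


cv_a = coefficient_of_variation(2, 50)   # Stock A: S = $2, mean = $50
cv_b = coefficient_of_variation(5, 100)  # Stock B: S = $5, mean = $100
print(cv_a, cv_b)  # 4.0 5.0
```

Stock B has the larger CV, so its price varies more relative to its own average even though both standard deviations are small in dollar terms.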
Shape of a Distribution
Describe How Data are Distributed
Measures of Shape
Skewness – indicates lack of symmetry of a distribution
Kurtosis – indicates peakedness of a distribution
Skewness and Kurtosis
Skewness is represented by Sk
$Sk = \frac{\sum f (x - \mu)^3}{\sigma_s^3 \sum f}$
x is the cell midpoint of the grouped data
σs is the standard deviation of the sample data
Kurtosis is represented by K
$K = \frac{\sum f (x - \mu)^4}{\sigma_s^4 \sum f}$
K = 3 for Normal Distribution
K > 3 for Distribution more peaked than Normal
K < 3 for Distribution less peaked than Normal
Sk and K tell us about shape of the
distribution
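A hedged sketch of the Sk and K formulas for grouped data follows. The slides do not say which divisor the sample standard deviation uses here; the code assumes n - 1 by default (the `ddof` parameter is an assumption and can be set to 0 for the population form). The function name and data are hypothetical.

```python
def grouped_shape(midpoints, freqs, ddof=1):
    """Return (Sk, K) for grouped data with cell midpoints x and frequencies f.

    ddof=1 uses the n - 1 divisor for the variance (an assumption);
    ddof=0 uses the divisor n.
    """
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
    var = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / (n - ddof)
    third = sum(f * (x - mean) ** 3 for x, f in zip(midpoints, freqs))
    fourth = sum(f * (x - mean) ** 4 for x, f in zip(midpoints, freqs))
    sk = third / (var ** 1.5 * n)  # sigma^3 is the variance to the power 3/2
    k = fourth / (var ** 2 * n)    # sigma^4 is the variance squared
    return sk, k


# Hypothetical symmetric distribution: skewness should be zero
sk, k = grouped_shape([1, 2, 3], [1, 2, 1])
print(sk, k)
```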
The Empirical Rule
The Bienayme-Chebyshev Rule
The Percentage of Observations Contained
Within Distances of k Standard Deviations
Around the Mean Must Be at Least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$
Applies regardless of the shape of the data set
At least 75% of the observations must be contained
within distances of 2 standard deviations around the
mean
At least 88.89% of the observations must be
contained within distances of 3 standard deviations
around the mean
At least 93.75% of the observations must be
contained within distances of 4 standard deviations
around the mean
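The three percentages quoted above follow directly from the bound. A minimal sketch, with a hypothetical function name, assuming k > 1 (for k at or below 1 the bound says nothing useful):

```python
def chebyshev_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return (1 - 1 / k ** 2) * 100


for k in (2, 3, 4):
    print(k, round(chebyshev_bound(k), 2))  # 75.0, 88.89, 93.75
```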
Camp Meidell Inequality
Scatter Plots of Data with
Various Correlation Coefficients
[Figure: five scatter plots of Y against X, with r = -1, r = -0.6, r = 0, r = 0.6, and r = 1]
Pitfalls in Numerical Descriptive
Measures and Ethical Issues
Data Analysis is Objective
Should report the summary measures that best
meet the assumptions about the data set
Data Interpretation is Subjective
Should be done in a fair, neutral and clear manner
Ethical Issues
Should document both good and bad results
Presentation should be fair, objective and neutral
Should not use inappropriate summary measures
to distort the facts
Summary
Summary
(continued)
Reasons for Drawing a Sample
Types of Sampling Methods
Samples
Probability Samples: Simple Random, Systematic, Stratified, Cluster
Simple Random Samples
Systematic Samples
Decide on Sample Size: n
Divide Frame of N individuals into Groups of k
Individuals: k=N/n
Randomly Select One Individual from the 1st
Group
Select Every k-th Individual Thereafter
[Figure: frame of N = 64 individuals divided into n = 8 groups of k = 8, with the first group labeled]
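The systematic sampling steps can be sketched as below, using the N = 64, n = 8 example. The function name and the optional `rng` parameter (for reproducibility) are assumptions; the sketch also assumes N is an exact multiple of n.

```python
import random


def systematic_sample(frame, n, rng=None):
    """Pick one random start in the first group, then every k-th individual."""
    rng = rng or random.Random()
    k = len(frame) // n       # group size k = N / n
    start = rng.randrange(k)  # random individual within the first group
    return [frame[start + i * k] for i in range(n)]


frame = list(range(64))  # hypothetical frame of N = 64 individuals
sample = systematic_sample(frame, 8, random.Random(0))
print(sample)  # 8 individuals, spaced k = 8 apart
```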
Stratified Samples
Population Divided into 2 or More Groups
According to Some Common Characteristic
Simple Random Sample Selected from Each
Group
The Two or More Samples are Combined into
One
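A minimal sketch of stratified sampling as described above: split the population by a common characteristic, draw a simple random sample from each group, and combine the results. The function name, the `key` and `per_stratum` parameters, and the equal allocation per stratum are all assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, rng=None):
    """Draw a simple random sample of up to per_stratum items from each stratum."""
    rng = rng or random.Random()
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)  # group by the common characteristic
    combined = []
    for group in strata.values():
        size = min(per_stratum, len(group))  # never ask for more than exists
        combined.extend(rng.sample(group, size))
    return combined


# Hypothetical population with two strata, "A" and "B"
population = [("A", i) for i in range(10)] + [("B", i) for i in range(5)]
picked = stratified_sample(population, key=lambda p: p[0], per_stratum=2,
                           rng=random.Random(1))
print(picked)  # two items from stratum A and two from stratum B
```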
Cluster Samples
Population Divided into Several “Clusters,”
Each Representative of the Population
A Random Sampling of Clusters is Taken
All Items in the Selected Clusters are Studied
[Figure: population divided into 4 clusters; 2 clusters randomly selected]
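Cluster sampling can be sketched in the same style: choose whole clusters at random, then study every item in the chosen clusters. The function name and the four hypothetical clusters mirror the 4-cluster, 2-selected picture and are assumptions.

```python
import random


def cluster_sample(clusters, n_clusters, rng=None):
    """Randomly choose n_clusters whole clusters and return all of their items."""
    rng = rng or random.Random()
    chosen = rng.sample(clusters, n_clusters)  # sample clusters, not individuals
    return [item for cluster in chosen for item in cluster]


# Hypothetical population already divided into 4 clusters
clusters = [[1, 2], [3, 4], [5, 6], [7, 8]]
studied = cluster_sample(clusters, 2, random.Random(0))
print(studied)  # every item from the 2 selected clusters
```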
Advantages and Disadvantages
Simple Random Sample & Systematic Sample
Simple to use
May not be a good representation of the
population’s underlying characteristics
Stratified Sample
Ensures representation of individuals across the
entire population
Cluster Sample
More cost effective
Less efficient (need larger sample to acquire the
same level of precision)