Measures of Central Tendency

Data
Data are the values associated with a trait or
property that helps distinguish the occurrences of
something (Data plural and Datum Singular)
Data point, observation, response
Variable: A trait or property of something with which
values (data) are associated
Variable: A characteristic of an item or individual
Data: The set of individual values associated with a
variable
Some Characteristics of Data
Not all data is the same. There are some limitations
as to what can and cannot be done with a data set,
depending on the characteristics of the data
Some key characteristics that must be considered
are:
A. Continuous vs. Discrete
B. Grouped vs. Individual
C. Scale of Measurement
A. Continuous vs. Discrete Data
Continuous data can include any value (i.e.,
real numbers)
e.g., 1, 1.43, and 3.1415926 are all acceptable values.
Geographic examples: distance, tree height, amount of
rainfall, etc
Discrete data only consists of discrete values,
and the numbers in between those values are not
defined (i.e., whole or integer numbers)
e.g., 1, 2, 3.
Geographic examples: number of vegetation types
B. Grouped vs. Individual Data
The distinction between individual and
grouped data is somewhat self-explanatory, but
the issue pertains to the effects of grouping data
While a family income value is collected for each
household (individual data), for the purpose of
analysis it is transformed into a set of classes
(e.g., <Rs10K, Rs10K-20K, > Rs20K)
e.g., elevation (1000m vs. < 500m, 500-1000m,
1000-2000m, > 2000m)
B. Grouped vs. Individual Data
In grouped data, the raw individual data is
categorized into several classes, and then analyzed
The act of grouping the data, by taking the central
value of each class, as well as the frequency of the
class interval, and using those values to calculate a
measure of central tendency has the potential to
introduce a significant distortion
Grouping always reduces the amount of
information contained in the data
C. Scales of Measurement
Data is the plural of datum; data are generated
by the recording of measurements
Measurement involves the categorization of
an item (i.e., assigning an item to a set of types)
when the measure is qualitative,
or the use of a number to give something a
quantitative measurement
C. Scales of Measurement
The data used in statistical analyses can be divided
into four types:
1. The Nominal Scale
2. The Ordinal Scale
3. The Interval Scale
4. The Ratio Scale
As we progress through these scales, the types
of data they describe have increasing
information content
The Nominal Scale
Nominal scale data are data that can simply be
broken down into categories, i.e., having to do
with names or types
It simply labels objects
Dichotomous or binary nominal data has just
two types, e.g., yes/no, female/male, is/is not,
hot/cold, etc
Multichotomous data has more than two types,
e.g., vegetation types, soil types, counties, eye
color, etc
Not a scale in the sense that categories cannot
be ranked or ordered (no greater/less than)
Nominal data
a more crude form of data: limited possibilities for
statistical analysis
categories, classifications, or groupings
pigeon-holing or labeling
merely measures the presence or absence of
something
gender: male or female
immigration status: documented, undocumented
zip codes: 90210, 92634, 91784
nominal categories aren't hierarchical; one
category isn't better or higher than another
assignment of numbers to the categories has no
mathematical meaning
nominal categories should be mutually
exclusive and exhaustive
Nominal data (continued)
nominal data is usually
represented descriptively
graphic representations include
tables, bar graphs, pie charts.
there are limited statistical
tests that can be performed on
nominal data
if nominal data can be
converted to averages,
advanced statistical analysis is
possible
The Ordinal Scale
Ordinal scale data can be categorized AND can be placed in
an order, i.e., categories can be assigned a relative
importance and ranked
e.g., star-system restaurant rankings:
5 stars > 4 stars, 4 stars > 3 stars, 5 stars > 2 stars
BUT ordinal data still are not scalar in the sense that
differences between categories do not have a quantitative
meaning, i.e., there is no information regarding the differences
(intervals) between points on the scale
e.g., a 5-star restaurant is not necessarily superior to a 4-star
restaurant by the same amount as a 4-star restaurant is
to a 3-star restaurant
Ordinal data
more sensitive than nominal data, but
still lacking in precision
exists in a rank order, hierarchy, or
sequence
highest to lowest, best to worst,
first to last
allows for comparisons along some
dimension
example: X is prettier than Y, A is
taller than B
no assumption of equidistance of
numbers
increments or gradations aren't
necessarily uniform
researchers do sometimes treat ordinal
data as if it were interval data
there are limited statistical tests available
with ordinal data
examples:
1st, 2nd, 3rd place
finishes in a horse race
top 10 movie box
office successes of
2016
bestselling books (#1,
#2, #3 bestseller, etc.)
The Interval Scale
Interval scale data take the notion of ranking
items in order one step further, since the distances
between adjacent points on the scale are equal
For instance, the Fahrenheit scale is an interval
scale, since each degree is equal but there is no
absolute zero point.
This means that although we can add and
subtract degrees (100° is 10° warmer than 90°),
we cannot multiply values or create ratios (100° is
not twice as warm as 50°)
Interval Scale
An interval scale is a scale on which equal
intervals between objects represent equal
differences
gradations, increments, or units of measure are
uniform, constant
The interval differences are meaningful
But, we can't defend ratio relationships
examples:
Scale data: Likert scales, Semantic Differential scales
The Ratio Scale
Similar to the interval scale, but with the
addition of having a meaningful zero value,
which allows us to compare values using
multiplication and division operations, e.g.,
precipitation, weights, heights, etc
e.g., rain: we can say that 2 inches of rain is
twice as much rain as 1 inch of rain because this
is a ratio scale measurement
e.g., age: a 100-year-old person is indeed twice
as old as a 50-year-old one
The Ratio Scale
the most sensitive, powerful type of data
ratio measures contain the most precise
information about each observation that is made
more prevalent in the natural sciences, less common
in social science research
includes a true zero point (complete absence of the
phenomenon being measured)
allows for absolute comparisons
If X can lift 200 lbs and Y can lift 100 lbs, X can lift twice as
much as Y, i.e., a 2:1 ratio
CLASSIFY THE FOLLOWING AS TO QUALITATIVE OR
QUANTITATIVE MEASUREMENT. THEN STATE THE
LEVEL OF MEASUREMENT.

Eye Color (blue, brown, green, hazel)


Rating scale (poor, good, excellent)
ACT score
Salary
Age
Ranking of IPL cricket teams in India
Nationality
Temperature
Zip code
Which one is better: mean,
median, or mode?
The mean is valid only for interval data or ratio
data.
The median can be determined for ordinal data
as well as interval and ratio data.
The mode can be used with nominal, ordinal,
interval, and ratio data
Mode is the only measure of central tendency that
can be used with nominal data
Descriptive Statistics
Two Sorts of Statistics
Descriptive statistics
To describe and summarize the characteristics
of the sample

Inferential statistics
To infer something about the population from
the sample
Terminology
Population
A collection of items of interest in research
A complete set of things
A group that you wish to generalize your
research to
An example: all the trees in Botanical Park
Sample
A subset of a population
Its size is smaller than the size of the population
An example: 100 trees randomly selected
from Botanical Park
Terminology
Representative: an accurate reflection of the
population (a primary problem in statistics)
Variables: the properties of a population that
are to be measured (i.e., how do parts of the
population differ?)
Constant: something that does not vary
Parameter: a constant measure which describes
the characteristics of a population
Statistic: the corresponding measure for a
sample
Descriptive Statistics
Descriptive statistics: statistics that describe
and summarize the characteristics of a dataset
(sample or population)
these measures describe scores that group
around a central value
The most common way of describing a variable
distribution is in terms of two of its properties:
central tendency & dispersion
Descriptive Statistics
Measures of central tendency
Measures of the location of the middle or the
center of a distribution
Mean, median, mode
Measures of dispersion
Describe how the observations are distributed
Variance, standard deviation, range, etc
Terminology/Notation
A data distribution = A set of data/scores (the whole
thing)
1, 2, 4, 7
X = A raw, single score (i.e., 2 from above)
Σ = Summation (added up)
ΣX = 14 (each individual score added up)
n = sample size (distribution size, or number of
scores)
n = 4 (from above)
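As a small illustration of this notation, the sketch below computes ΣX and n for the distribution 1, 2, 4, 7. The variable names are chosen here for illustration only.

```python
# The example distribution from the notation slide.
scores = [1, 2, 4, 7]

total = sum(scores)  # ΣX: every raw score added up
n = len(scores)      # n: the number of scores in the distribution

print("ΣX =", total)
print("n =", n)
```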
Descriptive Statistics
Descriptive statistics = describing the
data
n = 50, a test score of 83%
Where does it fit in the class??

Making sense out of chaos


Descriptive Statistics
Transform a set of numbers or
observations into indices that describe
or characterize the data
Summary statistics
A large group of statistics that are used in
all research manuscripts
Even the most complex statistical tests and
studies start with descriptive statistics
Descriptive Statistics
Descriptive statistics usually accomplish
two major goals:
1) Describe the central location of the data
2) Describe how the data are dispersed
about that point
In other words, they provide:
1) Measures of Central Tendency
2) Measures of Variability
Descriptive Statistics
Measurement scales: nominal, ordinal, interval, ratio
Graphic portrayals: frequencies, histograms, bar graphs, normal distribution
Central tendency: mean, median, mode
Variability: range, standard deviation, standardized scores
Relationship: scatterplot, correlation, regression
Measure of Central Tendency
What SINGLE summary value best describes
the CENTRAL location of an entire
distribution?
Mode: which value occurs most often
Median: the value above and below which 50%
of the cases fall (the middle; 50th percentile)
Mean: mathematical balance point;
arithmetic/mathematical average
Mode
Most frequent occurrence
What if data were?
17, 19, 20, 20, 22, 23, 25, 28
17, 19, 20, 20, 22, 23, 23, 28
Problem: set of numbers can be bimodal,
or trimodal, depending on the scores
Not a stable measure
Ex. 17, 19, 20, 22, 23, 28, 28
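The three data sets above can be checked with a short sketch. It assumes Python 3.8 or later, where the standard library's `statistics.multimode` returns every value tied for most frequent; the dictionary labels are made up for this example.

```python
from statistics import multimode

# The three example data sets from the Mode slide.
datasets = {
    "single mode": [17, 19, 20, 20, 22, 23, 25, 28],
    "bimodal": [17, 19, 20, 20, 22, 23, 23, 28],
    "mode at the top end": [17, 19, 20, 22, 23, 28, 28],
}

# multimode() lists all values that share the highest frequency.
modes = {label: multimode(values) for label, values in datasets.items()}

for label, found in modes.items():
    print(label, "->", found)
```

Changing a single score (25 to 23) turns one mode into two, and in the last set the mode sits at the largest value, which is why the mode is described as not stable.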
Croxton and Cowden defined it as follows: the mode
of a distribution is the value at the point around
which the items tend to be most heavily concentrated.
It may be regarded as the most typical of a series
of values

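The exact value of the mode can be obtained by the
following formula:

$$Z = L_1 + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i$$

where L1 is the lower limit of the modal class, f1 is the
frequency of the modal class, f0 and f2 are the frequencies
of the classes before and after it, and i is the class width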
Example: Calculate the mode for the distribution of
monthly rent paid by libraries in Karnataka

Monthly rent (Rs)    Number of libraries (f)
500-1000             5
1000-1500            10
1500-2000            8
2000-2500            16
2500-3000            14
3000 & above         12
Total                65
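$$Z = 2000 + \frac{16 - 8}{2 \times 16 - 8 - 14} \times 500$$

$$Z = 2000 + \frac{8}{10} \times 500$$

$$Z = 2000 + 0.8 \times 500 = 2000 + 400$$

$$Z = 2400$$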
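The rent example can be reproduced with a minimal sketch of the grouped-mode interpolation, Z = L1 + (f1 - f0) / (2·f1 - f0 - f2) × i. The function name and the (lower limit, frequency) layout are assumptions made for this illustration, and the sketch only handles a modal class that has a neighbour on each side.

```python
def grouped_mode(classes, width):
    """Interpolated mode for equal-width classes.

    classes: list of (lower limit, frequency) pairs in ascending order.
    Assumes the modal class is neither the first nor the last class.
    """
    k = max(range(len(classes)), key=lambda j: classes[j][1])  # modal class index
    lower, f1 = classes[k]
    f0 = classes[k - 1][1]  # frequency of the class before the modal class
    f2 = classes[k + 1][1]  # frequency of the class after the modal class
    return lower + (f1 - f0) * width / (2 * f1 - f0 - f2)


# Monthly rent paid by libraries: (lower limit in Rs, number of libraries).
rent = [(500, 5), (1000, 10), (1500, 8), (2000, 16), (2500, 14), (3000, 12)]
z = grouped_mode(rent, width=500)
print(z)  # 2400.0
```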
Advantages of Mode :
Mode is readily comprehensible and easily
calculated
It is the best representative of data
It is not at all affected by extreme values.
The value of mode can also be determined
graphically.
It is usually an actual value of an
important part of the series.
Disadvantages of Mode :

It is not based on all observations.


It is not capable of further mathematical
manipulation.
Mode is affected to a great extent by
sampling fluctuations.
Choice of grouping has great influence
on the value of mode.
Median
Rank numbers, pick middle one
What if data were?
17, 19, 20, 23, 23, 28
Solution: add up two middle scores, divide
by 2 (=21.5)
Best measure in asymmetrical distribution
(i.e. skewed), not sensitive to extreme scores
Ex. 17, 19, 20, 23, 23, 428
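A quick check of the two data sets above, using the standard library's `statistics.median` and `statistics.mean`; the variable names are illustrative.

```python
from statistics import mean, median

scores = [17, 19, 20, 23, 23, 28]
with_extreme = [17, 19, 20, 23, 23, 428]  # same data with one extreme score

# With an even number of scores, median() averages the two middle values.
median_plain = median(scores)
median_extreme = median(with_extreme)

print(median_plain, median_extreme)       # 21.5 21.5
print(mean(scores), mean(with_extreme))   # the mean shifts a lot, the median does not
```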
Calculation of Median: Discrete series
i. Arrange the data in ascending or descending
order.
ii. Calculate the cumulative frequencies.
iii. Apply the formula: Median = size of the
((N + 1) / 2)th item, where N is the total frequency.
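Calculation of Median: Continuous series
The median is the middle of the distribution.
It is the 50th percentile
50% of the scores in the frequency distribution fall
below the median and 50% fall above it
For calculation of the median in a continuous
frequency distribution the following formula will
be employed:

$$M = L + \frac{N/2 - cf}{f} \times i$$

where L is the lower limit of the median class, N is the
total frequency, cf is the cumulative frequency of the class
before the median class, f is the frequency of the median
class, and i is the class width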
Example: Median of a set of grouped data:
distribution of respondents by age

Age group    Frequency (f)    Cumulative frequency (cf)
0-20         15               15
20-40        32               47
40-60        54               101
60-80        30               131
80-100       19               150
Total        150

The median class is 40-60, the first class whose cumulative
frequency reaches N/2 = 75
$$M = 40 + \frac{75 - 47}{54} \times 20$$

$$= 40 + \frac{28}{54} \times 20$$

$$\approx 40 + 0.5185 \times 20$$

$$= 40 + 10.37$$

$$= 50.37$$
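The age example can be reproduced with a short sketch of the grouped-median interpolation, M = L + (N/2 - cf) / f × i. The function name and data layout are assumptions for this illustration, not a reference implementation.

```python
def grouped_median(classes, width):
    """Interpolated median for equal-width classes.

    classes: list of (lower limit, frequency) pairs in ascending order.
    """
    half = sum(f for _, f in classes) / 2  # N / 2
    cumulative = 0                         # cf of the classes already passed
    for lower, f in classes:
        if cumulative + f >= half:         # first class that reaches N / 2
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty distribution")


# Respondents by age: (lower limit of age group, frequency).
ages = [(0, 15), (20, 32), (40, 54), (60, 30), (80, 19)]
m = grouped_median(ages, width=20)
print(round(m, 2))  # 50.37
```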
Advantages of Median:
Median can be calculated in all distributions.
Median can be understood even by common
people.
Median can be ascertained even with the extreme
items.
It can be located graphically
It is most useful when dealing with qualitative data
Disadvantages of Median:
It is not based on all the values.
It is not capable of further mathematical
treatment.
It is affected by sampling fluctuations.
With an even number of values, the median may
not be a value from the data.
Mean = X̄
Add up the numbers and divide by the sample
size (the number of numbers!)
X̄ = ΣX / n
Try this one: 2, 3, 5, 6, 9
(2 + 3 + 5 + 6 + 9) / 5 = 25 / 5 = 5
(Usually) the best measure of the three: it uses the
most information (all values from the distribution
contribute)
Measures of Central Tendency
Mean

Sample mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Population mean:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
Arithmetic Mean: Calculation Methods
Direct method
Short-cut method
Step-deviation method
Mean: Grouped Data
If data is collected in group form
then the calculation for the mean is a
little different.

Data set for channel depth:
45, 36, 36, 28, 24, 19, 16, 16, 12,
7, 3, 3, 3, 1.

To find the mean you take the
midpoint of the group and multiply
it by the frequency.
You then find the sum of the final
column and divide it by the total
number in the data set.

Channel depth (cm)    Midpoint (x)    Frequency (f)    Midpoint × frequency (f x)
0-9                   4.5             5                22.5
10-19                 14.5
20-29                 24.5
30-39                 34.5
40-49                 44.5
                                      n =              Σ f x =

Mean = Σ f x / n = ____
Mean: Grouped Data - Answers
If data is collected in group form
then the calculation for the mean is a
little different.

To find the mean you take the
midpoint of the group and multiply
it by the frequency.
You then find the sum of the final
column and divide it by the total
number in the data set.

Channel depth (cm)    Midpoint (x)    Frequency (f)    Midpoint × frequency (f x)
0-9                   4.5             5                22.5
10-19                 14.5            4                58
20-29                 24.5            2                49
30-39                 34.5            2                69
40-49                 44.5            1                44.5
                                      n = 14           Σ f x = 243

Mean = Σ f x / n = 243 / 14 ≈ 17.4 cm
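The table can be recomputed directly from the midpoints and frequencies, and compared with the mean of the 14 raw depths. This is a minimal sketch; the variable names are illustrative.

```python
# Raw channel depths (cm) from the data set slide.
raw = [45, 36, 36, 28, 24, 19, 16, 16, 12, 7, 3, 3, 3, 1]

# Grouped table as (midpoint x, frequency f) for classes 0-9, 10-19, ..., 40-49.
table = [(4.5, 5), (14.5, 4), (24.5, 2), (34.5, 2), (44.5, 1)]

n = sum(f for _, f in table)           # total number of observations
sum_fx = sum(x * f for x, f in table)  # Σ f x: sum of midpoint × frequency
grouped_mean = sum_fx / n
raw_mean = sum(raw) / len(raw)

print(n, sum_fx)               # 14 243.0
print(round(grouped_mean, 1))  # 17.4
print(round(raw_mean, 1))      # 17.8
```

The 20-29 row contributes 24.5 × 2 = 49. The grouped mean (about 17.4 cm) differs slightly from the mean of the raw values (about 17.8 cm), a small instance of the information lost by grouping.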
Weighted Mean
We can also calculate a weighted mean using
some weighting factor:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

e.g. What is the average income of all
people in cities A, B, and C?

City    Avg. Income    Population
A       23,000         100,000
B       20,000         50,000
C       25,000         150,000

Here, population is the weighting factor
and the average income is the variable
of interest
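Worked through for the three cities, as a minimal sketch with population as the weight w and average income as the variable x (the dictionary layout is an assumption for this example):

```python
# city: (average income x, population w)
cities = {
    "A": (23_000, 100_000),
    "B": (20_000, 50_000),
    "C": (25_000, 150_000),
}

weighted_sum = sum(income * pop for income, pop in cities.values())  # Σ w·x
total_weight = sum(pop for _, pop in cities.values())                # Σ w
weighted_mean = weighted_sum / total_weight

# For comparison: the unweighted mean of the three city averages.
simple_mean = sum(income for income, _ in cities.values()) / len(cities)

print(weighted_mean)          # 23500.0
print(round(simple_mean, 2))  # 22666.67
```

The weighted mean is higher than the simple average of the three figures because the largest city, C, also has the highest average income.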
Characteristics of the Mean
Balance point
Point around which deviations sum to zero
Deviation = X − X̄
For instance, if scores are 2, 3, 5, 6, 9
Mean is 5
Sum of deviations: (-3) + (-2) + 0 + 1 + 4 = 0
Σ(X − X̄) = 0
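The balance-point property can be verified for the scores 2, 3, 5, 6, 9 with a few lines; the names are illustrative.

```python
scores = [2, 3, 5, 6, 9]

mean = sum(scores) / len(scores)         # 5.0
deviations = [x - mean for x in scores]  # each score's distance from the mean

print(deviations)       # [-3.0, -2.0, 0.0, 1.0, 4.0]
print(sum(deviations))  # 0.0
```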
Characteristics of the Mean
Affected by extreme scores
Example 1
Scores 7, 11, 11, 14, 17
Mean = 12, Mode and Median = 11
Example 2
Scores 7, 11, 11, 14, 170
Mean = 42.6, Mode & Median = 11
Characteristics of the Mean
Balance point
Affected by extreme scores
Appropriate for use with interval or ratio
scales of measurement
More stable than Median or Mode when
multiple samples are drawn from the same
population
Basis for inferential stats
Guidelines to Choose Measure
of Central Tendency
Mean is preferred because it is the basis
of inferential statistics
Median may be better for skewed data
Distribution of wealth in the US, e.g., annual
household income in Washington state for
2000: mean = $76,818; median = $42,024
Mode to describe average of nominal
data (eye color, hair color, etc)
Which Measure of Central
Tendency is best to use?

The mean is used the most because it takes
every number into account. It is best to use for
sets with no outliers.

The median is a good choice to represent a set
with one or two outliers.

The mode is only useful for sets of data that
have many identical values.
Comparison of mean, median
and mode
Mode
Good for nominal variables
Good if you need to know most frequent
observation
Quick and easy
Median
Good for bad distributions
Good for distributions with arbitrary ceiling or
floor
Comparison of mean, median
& mode
Mean
Used for inference as well as description;
best estimator of the parameter
Based on all data in the distribution
Generally preferred except for bad
distribution. Most commonly used statistic
for central tendency.
Summary of the Measures of
Central Tendency

Measure of Central Tendency    Best used to describe data when
Mean                           The data has no outliers
Median                         The data has outliers
Mode                           The data has many repeated numbers
Matching!

Mean Middle
Median Difference
Mode Most
Range Average
Outliers
Sometimes there are extreme values that are
separated from the rest of the data. These
extreme values are called outliers. Outliers affect
the mean.
Daily High Temperatures (for any given date) Over the Last Decade
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
59 50 49 13 40 46 50 53 58 47

The daily high temperature in 1996 is the outlier.


Mean
59 + 50 + 49 + 13 + 40 + 46 + 50 + 53 + 58 + 47 = 465
465 ÷ 10 = 46.5
Because outliers can affect the mean, the median
may be a better measure of central tendency. You
might consider the median to best represent the
expected temperature.

Median
13 40 46 47 49 50 50 53 58 59
49 + 50 = 99
99 ÷ 2 = 49.5
Sometimes the mode is more helpful when
analyzing data. If you were trying to determine
what clothes to wear for a day trip, you might
base your decision on the mode temperature
because the mode temperature is the
temperature which occurred most often.
13 40 46 47 49 50 50 53 58 59
Dropping the outlier may help when determining
the mean.

59 + 50 + 49 + 13 + 40 + 46 + 50 + 53 + 58 + 47 = 465
465 ÷ 10 = 46.5

40 + 46 + 47 + 49 + 50 + 50 + 53 + 58 + 59 = 452
452 ÷ 9 ≈ 50.2

When the outlier of 13 is dropped, the average
daily temperature increases by almost 4° to
50.2, which is closer to both the median of
49.5 and the mode of 50.
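The temperature figures can be recomputed from the table with the standard library's `statistics` module. This is a sketch; treating 1996 as the outlier follows the slide, not a formal outlier test.

```python
from statistics import mean, median, mode

# Daily high temperature by year, from the table.
temps = {1993: 59, 1994: 50, 1995: 49, 1996: 13, 1997: 40,
         1998: 46, 1999: 50, 2000: 53, 2001: 58, 2002: 47}

values = list(temps.values())
without_outlier = [t for year, t in temps.items() if year != 1996]

all_mean = mean(values)
all_median = median(values)
all_mode = mode(values)
trimmed_mean = mean(without_outlier)

print(all_mean, all_median, all_mode)  # 46.5 49.5 50
print(round(trimmed_mean, 1))          # 50.2
```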
You Try It!
Jayesh's test scores in Statistics for the first
semester are 93, 79, 88, 77, 92, 88, 80, 34, 84,
88. Calculate the range, mean, median, and
mode. Then make and explain a prediction for
next semester's test scores.
Range: 59 Mean: 80.3
Median: 86 Mode: 88
Predictions will vary: Jayesh will score an
estimated average of 85 on his tests. I determined
this by removing the outlying score of 34 and
recalculating the mean.
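The answers can be checked with a short sketch. Dropping the lowest score before recalculating the mean mirrors the sample prediction above and is an assumption about which score counts as the outlier.

```python
from statistics import mean, median, mode

scores = [93, 79, 88, 77, 92, 88, 80, 34, 84, 88]

score_range = max(scores) - min(scores)
score_mean = mean(scores)
score_median = median(scores)
score_mode = mode(scores)

# Remove the outlying score of 34 and recalculate the mean.
without_outlier = [s for s in scores if s != 34]
predicted = mean(without_outlier)

print(score_range, score_mean, score_median, score_mode)  # 59 80.3 86.0 88
print(round(predicted))                                   # 85
```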
Which one is better: mean,
median, or mode?
It also depends on your goals
Consider a company that has nine employees with
salaries of 35,000 a year, and their supervisor
makes 150,000 a year.
If you want to describe the typical salary in the
company, which statistics will you use?
I will use mode or median (35,000), because it
tells what salary most people get
What if you are a recruiting officer for the
company that wants to make a good impression
on a prospective employee?
The mean is (35,000 × 9 + 150,000) / 10 = 46,500. I
would probably say: "The average salary in our
company is 46,500", using the mean
Puzzle and Quiz
Across
1 A chance pick from a number of
items.
3 The middle number or the average
of the two middle numbers in an
ordered set of data.
4 A section of a whole group.
8 A set of data is _______ if it has 2
modes.
9 Another word for a mean.
10 The collection, organization,
presentation, interpretation and
analysis of data.
11 An arithmetic _______ is the
number obtained by dividing the
sum of a set of numbers by the
total number of items in the set.
12 The difference between the highest
and lowest values in a set of data.
Down
2 Numbers which have been collected for study.
3 The number that occurs most often in a set of data.
5 A whole set of data from which a statistical sample is
drawn.
6 Mean, median and mode are _______ of central
tendency.
7 The number of times a particular item appears in a set
of data.
10 A numerical value.
Puzzle Answer
Thank You
