Data
Data are the values associated with a trait or
property that help distinguish the occurrences of
something (data is plural; datum is singular)
Data point, observation, response
Variable: A trait or property of something with which
values (data) are associated
Variable: A characteristic of an item or individual
Data: The set of individual values associated with a
variable
Some Characteristics of Data
Not all data is the same. There are some limitations
as to what can and cannot be done with a data set,
depending on the characteristics of the data
Some key characteristics that must be considered
are:
A. Continuous vs. Discrete
B. Grouped vs. Individual
C. Scale of Measurement
A. Continuous vs. Discrete Data
Continuous data can include any value (i.e.,
real numbers)
e.g., 1, 1.43, and 3.1415926 are all acceptable values.
Geographic examples: distance, tree height, amount of
rainfall, etc
Discrete data only consists of discrete values,
and the numbers in between those values are not
defined (i.e., whole or integer numbers)
e.g., 1, 2, 3.
Geographic examples: number of vegetation types
B. Grouped vs. Individual Data
The distinction between individual and
grouped data is somewhat self-explanatory, but
the issue pertains to the effects of grouping data
While a family income value is collected for each
household (individual data), for the purpose of
analysis it is transformed into a set of classes
(e.g., <Rs10K, Rs10K-20K, > Rs20K)
e.g., elevation (an individual value of 1000 m vs. the
classes < 500 m, 500-1000 m, 1000-2000 m, > 2000 m)
In grouped data, the raw individual data are
categorized into several classes and then analyzed
Calculating a measure of central tendency from the
central value and frequency of each class interval,
rather than from the raw values, has the potential
to introduce a significant distortion
Grouping always reduces the amount of
information contained in the data
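The effect of grouping can be sketched in a few lines of Python. The incomes below are hypothetical (in thousands of rupees), and the open-ended classes are closed at 0 and 40 purely so that each class has a midpoint; both are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical household incomes in thousands of rupees (invented values).
incomes = [4, 6, 9, 11, 12, 18, 19, 25, 30]

# Class bounds (lower inclusive, upper exclusive). Closing the open-ended
# classes at 0 and 40 is an assumption so that a midpoint exists.
classes = [(0, 10), (10, 20), (20, 40)]

def grouped_mean(values, bounds):
    """Estimate the mean from class midpoints and class frequencies."""
    total = 0.0
    for lower, upper in bounds:
        frequency = sum(1 for v in values if lower <= v < upper)
        midpoint = (lower + upper) / 2
        total += midpoint * frequency
    return total / len(values)

raw_mean = mean(incomes)                          # uses every individual value
grouped_estimate = grouped_mean(incomes, classes) # uses only midpoints and counts

print(raw_mean, grouped_estimate)
```

The two results differ slightly because every value inside a class is treated as if it sat at the class midpoint.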
C. Scales of Measurement
Data is the plural of datum; data are generated
by the recording of measurements
Measurement involves the categorization of
an item (i.e., assigning an item to a set of types)
when the measure is qualitative,
or the use of a number to give something a
quantitative measurement
The data used in statistical analyses can be divided
into four types:
1. The Nominal Scale
2. The Ordinal Scale
3. The Interval Scale
4. The Ratio Scale
As we progress through these scales, the types
of data they describe have increasing
information content
The Nominal Scale
Nominal scale data are data that can simply be
broken down into categories, i.e., having to do
with names or types
It simply labels objects
Dichotomous or binary nominal data has just
two types, e.g., yes/no, female/male, is/is not,
hot/cold, etc
Multichotomous data has more than two types,
e.g., vegetation types, soil types, counties, eye
color, etc
Not a scale in the sense that categories cannot
be ranked or ordered (no greater/less than)
Nominal data
a cruder form of data: limited possibilities for
statistical analysis
categories, classifications, or groupings
pigeon-holing or labeling
merely measures the presence or absence of
something
gender: male or female
immigration status: documented, undocumented
zip codes: 90210, 92634, 91784
Nominal categories
aren't hierarchical; one category isn't better
or higher than another
assignment of numbers to the categories has no
mathematical meaning
nominal categories should be mutually exclusive
and exhaustive
Inferential statistics
To infer something about the population from
the sample
Terminology
Population
A collection of items of interest in research
A complete set of things
A group that you wish to generalize your
research to
Example: all the trees in Botanical Park
Sample
A subset of a population
Its size is smaller than the size of the population
Example: 100 trees randomly selected
from Botanical Park
Representative: An accurate reflection of the
population (a primary problem in statistics)
Variables: The properties of a population that
are to be measured (i.e., how do parts of the
population differ?)
Constant: Something that does not vary
Parameter: A constant measure which describes
the characteristics of a population
Statistic: The corresponding measure for a
sample
Descriptive Statistics
Descriptive statistics: Statistics that describe
and summarize the characteristics of a dataset
(sample or population)
These measures describe how scores group
around a central value
The most common way of describing a variable
distribution is in terms of two of its properties:
central tendency & dispersion
Measures of central tendency
Measures of the location of the middle or the
center of a distribution
Mean, median, mode
Measures of dispersion
Describe how the observations are distributed
Variance, standard deviation, range, etc
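As a rough illustration of the dispersion measures just listed, the sketch below uses Python's standard statistics module. The scores are hypothetical and are treated as a whole population (an assumption); the sample versions, statistics.variance and statistics.stdev, divide by n - 1 instead of n.

```python
from statistics import pstdev, pvariance

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(scores) - min(scores)  # largest score minus smallest score
variance = pvariance(scores)             # mean squared deviation from the mean
std_dev = pstdev(scores)                 # square root of the variance

print(value_range, variance, std_dev)
```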
Terminology/Notation
A data distribution = A set of data/scores (the whole
thing)
1, 2, 4, 7
X = A raw, single score (e.g., 2 from above)
$\sum$ = Summation (added up)
$\sum X = 14$ (each individual score added up)
n = sample size (distribution size, or number of
scores)
n = 4 (from above)
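The notation maps directly onto a few lines of Python. This is only a sketch; the variable names are chosen here for illustration.

```python
scores = [1, 2, 4, 7]  # the data distribution from above

x = scores[1]          # one raw, single score X
total = sum(scores)    # the summation of X: every score added up
n = len(scores)        # the sample size

print(x, total, n)
```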
Descriptive Statistics
Descriptive statistics = describing the
data
n = 50, a test score of 83%
Where does it fit in the class??
[Diagram: overview of topics]
Measurement Scales: Nominal, Ordinal, Interval, Ratio
Descriptive Statistics
Graphic Portrayals: Frequencies, Histograms, Bar graphs, Normal distribution
Central Tendency: Mean, Median, Mode
Variability: Range, Standard deviation, Standardized scores
Relationship: Scatterplot, Correlation, Regression
Measure of Central Tendency
What SINGLE summary value best describes
the CENTRAL location of an entire
distribution?
Mode: which value occurs most often
Median: the value above and below which 50%
of the cases fall (the middle; 50th percentile)
Mean: mathematical balance point;
arithmetic/mathematical average
Mode
Most frequent occurrence
What if the data were:
17, 19, 20, 20, 22, 23, 25, 28
17, 19, 20, 20, 22, 23, 23, 28
Problem: set of numbers can be bimodal,
or trimodal, depending on the scores
Not a stable measure
Ex. 17, 19, 20, 22, 23, 28, 28
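The three data sets above can be checked with statistics.multimode from Python's standard library (available from Python 3.8), which returns every value tied for most frequent. A minimal sketch; the dictionary labels are chosen here for illustration.

```python
from statistics import multimode

# The three data sets listed above.
datasets = {
    "first": [17, 19, 20, 20, 22, 23, 25, 28],
    "second": [17, 19, 20, 20, 22, 23, 23, 28],
    "third": [17, 19, 20, 22, 23, 28, 28],
}

# multimode returns a list of all most-frequent values, so a bimodal
# data set yields two entries.
modes = {name: multimode(values) for name, values in datasets.items()}

for name, found in modes.items():
    print(name, found)
```

Changing a single score (25 to 23) makes the data bimodal, and in the third set the mode sits at the extreme value 28, which is why the mode is described as not stable.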
Croxton and Cowden defined the mode as follows: the mode
of a distribution is the value at the point around
which the items tend to be most heavily concentrated.
It may be regarded as the most typical of a series
of values
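$Z = L_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$
where $L_1$ is the lower limit of the modal class, $f_1$ its
frequency, $f_0$ and $f_2$ the frequencies of the classes just
before and after it, and $i$ the class width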
Example: Calculate the mode for the distribution of
monthly rent paid by libraries in Karnataka
Z = 2000 + 0.8 × 500
Z = 2000 + 400
Z = 2400
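The arithmetic can be sketched with the usual interpolation formula for the mode of grouped data, Z = L1 + (f1 - f0) / (2·f1 - f0 - f2) × i. The frequency table for this example is not available here, so the frequencies 10, 2 and 8 below are hypothetical, chosen only so that the fraction equals the 0.8 shown above; the lower limit 2000 and class width 500 come from the example.

```python
def grouped_mode(lower_limit, f_modal, f_prev, f_next, width):
    """Mode of grouped data by interpolation within the modal class."""
    fraction = (f_modal - f_prev) / (2 * f_modal - f_prev - f_next)
    return lower_limit + fraction * width

# Lower limit (2000) and class width (500) are taken from the example;
# the three frequencies are HYPOTHETICAL stand-ins giving a fraction of 0.8.
z = grouped_mode(lower_limit=2000, f_modal=10, f_prev=2, f_next=8, width=500)

print(round(z, 2))
```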
Advantages of Mode :
Mode is readily comprehensible and easily
calculated
It is the best representative of data
It is not at all affected by extreme values.
The value of mode can also be determined
graphically.
It is usually an actual value of an
important part of the series.
Disadvantages of Mode :
Median = 40 + 0.52 × 20 (fraction rounded to two decimals)
= 40 + 10.37
= 50.37
Advantages of Median:
Median can be calculated in all distributions.
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{i=1}^{N} x_i}{N}$
Methods of Calculating the Arithmetic Mean:
Direct Method:
Class 40-49, midpoint 44.5
To find the mean you take the
midpoint of each group and multiply
it by the frequency.
You then find the sum of the final
column and divide it by the total
number in the data set.
$\text{Mean} = \frac{\sum f x}{n}$
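The procedure can be sketched as follows. Only the class 40-49 with midpoint 44.5 appears in these notes; the other classes and all of the frequencies are hypothetical, added so that the calculation has something to work on.

```python
# Each row is (class lower limit, class upper limit, frequency).
# Only the 40-49 class comes from the notes; everything else is HYPOTHETICAL.
table = [
    (30, 39, 2),
    (40, 49, 5),
    (50, 59, 3),
]

n = sum(f for _, _, f in table)  # total number in the data set

# Midpoint times frequency for each class: the "final column" (f x).
fx_column = [((low + high) / 2) * f for low, high, f in table]

grouped_mean = sum(fx_column) / n  # sum of f x divided by n

print(n, fx_column, grouped_mean)
```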
Mean Grouped Data - Answers
If data is collected in group form
then the calculation for mean is a
little different.
Table columns: Channel depth (cm); Midpoint (x); Frequency (f); Midpoint × frequency (f x)
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
City    Avg. Income    Population
A       23,000         100,000
B       20,000         50,000
C       25,000         150,000
Here, population is the weighting factor
and the average income is the variable
of interest
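A short sketch of the weighted mean for the three cities, using the incomes and populations from the table. The unweighted mean is computed alongside purely for comparison.

```python
# (city, average income, population) as given in the table.
cities = [
    ("A", 23_000, 100_000),
    ("B", 20_000, 50_000),
    ("C", 25_000, 150_000),
]

# Weighted mean: sum of (weight * value) divided by the sum of the weights,
# with population as the weight and average income as the value.
weighted_sum = sum(income * population for _, income, population in cities)
total_weight = sum(population for _, _, population in cities)
weighted_mean = weighted_sum / total_weight

# Plain mean of the three city averages, ignoring city size.
unweighted_mean = sum(income for _, income, _ in cities) / len(cities)

print(weighted_mean, round(unweighted_mean, 2))
```

The weighted figure is pulled toward city C because it has the largest population.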
Characteristics of the Mean
Balance point
Point around which deviations sum to zero
Deviation = $X - \bar{X}$
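The balance-point property can be checked on the small distribution 1, 2, 4, 7 used earlier in these notes. A minimal sketch:

```python
from statistics import mean

scores = [1, 2, 4, 7]
x_bar = mean(scores)  # the mean, 14 / 4

# Deviation of each score from the mean.
deviations = [x - x_bar for x in scores]

print(x_bar, deviations, sum(deviations))
```

The negative deviations below the mean cancel the positive deviations above it, so the total is zero.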
Mean Middle
Median Difference
Mode Most
Range Average
Outliers
Sometimes there are extreme values that are
separated from the rest of the data. These
extreme values are called outliers. Outliers affect
the mean.
Daily High Temperatures (for any given date) Over the Last Decade
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
59 50 49 13 40 46 50 53 58 47
Median
13 40 46 47 49 50 50 53 58 59
49 + 50 = 99
99 ÷ 2 = 49.5
Sometimes the mode is more helpful when
analyzing data. If you were trying to determine
what clothes to wear for a day trip, you might
base your decision on the mode temperature
because the mode temperature is the
temperature which occurred most often.
13 40 46 47 49 50 50 53 58 59
Dropping the outlier may help when determining
the mean.
59 + 50 + 49 + 13 + 40 + 46 + 50 + 53 + 58 + 47 = 465
465 ÷ 10 = 46.5
40 + 46 + 47 + 49 + 50 + 50 + 53 + 58 + 59 = 452
452 ÷ 9 ≈ 50.2
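The temperature example can be reproduced with Python's standard statistics module. A minimal sketch; treating 13 as the outlier follows the text above, and the variable names are chosen here for illustration.

```python
from statistics import mean, median, mode

# Daily high temperatures for 1993-2002, as listed in the table above.
temps = [59, 50, 49, 13, 40, 46, 50, 53, 58, 47]

# Drop the single extreme value identified as the outlier.
without_outlier = [t for t in temps if t != 13]

mean_all = mean(temps)                  # pulled down by the outlier
mean_trimmed = mean(without_outlier)    # mean after dropping the outlier
middle = median(temps)                  # average of the two middle values
most_common = mode(temps)               # the most frequent temperature

print(mean_all, round(mean_trimmed, 1), middle, most_common)
```

The median and mode barely react to the single low reading, while the mean shifts by several degrees once it is removed.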