20 25 24 33 13
26 8 19 31 11
16 21 17 11 34
14 15 21 18 17
The above data can be organized into a frequency distribution (grouped
data) in several ways. One method is to use class intervals as a basis.
The smallest value in the above data is 8 and the largest is 34.
Table 2: Frequency distribution of the time taken (in seconds) by the group of
students to answer a simple math question
Category        Frequency
Smart           5
Normal          10
Below normal    5
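The interval boundaries behind Table 2 are not stated. As a minimal sketch, assuming classes of width 10 starting at 5 (5-14, 15-24, 25-34), the 20 raw times can be tallied as follows. The function name and the class limits are illustrative assumptions, not part of the original table.

```python
from collections import Counter

# The 20 recorded times (in seconds) from the raw data above.
times = [20, 25, 24, 33, 13,
         26, 8, 19, 31, 11,
         16, 21, 17, 11, 34,
         14, 15, 21, 18, 17]

def class_interval(value, start=5, width=10):
    """Return the (lower, upper) class limits containing value.

    start and width are assumed here; the source does not state them.
    """
    lower = start + ((value - start) // width) * width
    return (lower, lower + width - 1)

# Count how many observations fall in each class interval.
frequency = Counter(class_interval(t) for t in times)

for (lower, upper), count in sorted(frequency.items()):
    print(f"{lower}-{upper}: {count}")
```

Under these assumed limits the counts are 5, 10 and 5, which agrees with the frequencies shown in Table 2.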
Qualitative data
Qualitative data describe items in terms of some quality or
categorization that in some cases may be 'informal'.
In regression analysis, dummy variables are a type of qualitative
data.
For example, if various features are observed about each of various
human subjects, one such feature might be gender, in which case a
dummy variable can be constructed that equals 0 if the subject is
male and equals 1 if the subject is female. Then this dummy
variable can be used as an independent variable (explanatory
variable) in an ordinary least squares regression. Dummy variables
can also be used as dependent variables, in which case the probit
or logistic regression technique would typically be used.
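A minimal sketch of the dummy-variable idea, using hypothetical observations that are not from the source. Gender is coded 0 for male and 1 for female, and a one-variable ordinary least squares fit is computed by hand. With a single dummy regressor, the intercept equals the mean outcome of the group coded 0 and the slope equals the difference between the two group means.

```python
from statistics import mean

# Hypothetical (gender, outcome) observations, invented for illustration.
subjects = [("male", 10.0), ("female", 14.0), ("male", 12.0),
            ("female", 16.0), ("male", 11.0), ("female", 15.0)]

# Dummy variable: 0 if the subject is male, 1 if the subject is female.
dummy = [1 if gender == "female" else 0 for gender, _ in subjects]
y = [outcome for _, outcome in subjects]

# Ordinary least squares for y = intercept + slope * dummy.
x_bar, y_bar = mean(dummy), mean(y)
slope = (sum((x - x_bar) * (yi - y_bar) for x, yi in zip(dummy, y))
         / sum((x - x_bar) ** 2 for x in dummy))
intercept = y_bar - slope * x_bar

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```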
Quality of data
The quality of the data should be checked as early as
possible.
Data quality can be assessed in several ways, using
different types of analyses: frequency counts,
descriptive statistics (mean, standard deviation,
median), normality (skewness, kurtosis, frequency
histograms, normal probability plots), associations
(correlations, scatter plots).
Data analysis tools
Commonly used approaches or tools
Statistics
Models
Standards
Statistics
Statistics is the study of the collection, organization,
analysis, and interpretation of data.
Modeling
Data modeling is a method used to define and
analyze data requirements needed to support the
business processes of an organization.
Standards
American Measurement Standard (AMS)
Deutsches Institut für Normung (DIN; in English,
the German Institute for Standardization)
International Organization for Standardization (ISO)
Standards Australia
Institute for Reference Materials and
Measurements (EU)
Statistical Analysis
Two main areas of statistics
Descriptive statistics summarizes the population data by describing what
was observed in the sample, numerically or graphically. Numerical
descriptors include the mean and standard deviation for continuous data types
(like heights or weights), while frequency and percentage are more useful for
describing categorical data (like race). It involves data collection,
organization, and summarization.
Inferential statistics uses patterns in the sample data to draw inferences
about the population represented, accounting for randomness. These
inferences may take the form of answering yes/no questions about the data
(hypothesis testing), estimating numerical characteristics of the data
(estimation), describing associations within the data (correlation), and
modeling relationships within the data (for example, using regression
analysis). In short, it generalizes from samples to populations. It involves
performing hypothesis testing, determining relationships among variables, and
making predictions.
DATA DESCRIPTION
Three aspects:
1. Measures of Central Tendency
Measure     Definition                                         Symbol
Mean        Sum of values divided by total number of values    μ, x̄
Median      Middle point in the data set                       MD
Mode        Most frequent data value                           None
Midrange    (Lowest value plus highest value)/2                MR
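As a minimal sketch, the four measures in this table can be computed for the 20 timing values listed at the start of this section using Python's standard statistics module. The variable names are illustrative.

```python
from statistics import mean, median, multimode

# The 20 recorded times (in seconds) from the start of this section.
times = [20, 25, 24, 33, 13,
         26, 8, 19, 31, 11,
         16, 21, 17, 11, 34,
         14, 15, 21, 18, 17]

mean_value = mean(times)                    # sum of values / number of values
median_value = median(times)                # middle point of the ordered data
modes = sorted(multimode(times))            # most frequent value(s)
midrange = (min(times) + max(times)) / 2    # (lowest + highest) / 2

print(mean_value, median_value, modes, midrange)
```

Note that this data set has more than one mode, which is why multimode is used rather than a function that returns a single value.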
2. Measures of Variation
Sometimes the mean is not good enough to describe a data
set, as in the following example.
Measure     Definition                                         Symbol
Range       Distance between highest and lowest value          R
3. Measures of Position
Measure          Definition                                              Symbol
Standard score   Number of standard deviations a data value is           z
or z score       above or below the mean
Percentile       Position in hundredths a data value is in the           Pn
                 distribution
Decile           Position in tenths a data value is in the distribution  Dn
Quartile         Position in fourths a data value is in the distribution Qn
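A minimal sketch of the standard score, z = (value - mean) / standard deviation, applied to the 20 timing values from the start of this section. Using the sample standard deviation (dividing by n - 1) is an assumption here; the table does not say which form is intended.

```python
from statistics import mean, stdev

# The 20 recorded times (in seconds) from the start of this section.
times = [20, 25, 24, 33, 13,
         26, 8, 19, 31, 11,
         16, 21, 17, 11, 34,
         14, 15, 21, 18, 17]

x_bar = mean(times)
s = stdev(times)  # sample standard deviation (assumed choice)

# z score: number of standard deviations each value lies above or below the mean.
z_scores = [(t - x_bar) / s for t in times]

for t, z in sorted(zip(times, z_scores)):
    print(f"{t:>3}  z = {z:+.2f}")
```

Values below the mean get negative z scores and values above it get positive ones, so the fastest and slowest times stand out immediately.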
Mode
The mode is the most repeated value in a distribution.
It is represented by Mo.
It is possible to find the mode for categorical and quantitative
variables.
Median
The median is the score of the scale that separates the upper half
of the distribution from the lower, that is to say, it divides the
series of data into two equal parts.
The median is denoted by Me.
The median can only be found for quantitative variables.
Calculation of the Median for Grouped Data
Mean
In statistics, mean has two related meanings:
the arithmetic mean (as distinguished from the geometric
mean or harmonic mean);
the expected value of a random variable, which is also called the
population mean.
The arithmetic mean is the "standard" average, often
simply called the "mean".
For example, the arithmetic mean of six values: 34, 27, 45,
55, 22, 34 is
(34 + 27 + 45 + 55 + 22 + 34) / 6 = 217 / 6 ≈ 36.17
Geometric mean (GM)
The geometric mean is an average that is useful for sets of
positive numbers that are interpreted according to their
product and not their sum (as is the case with the
arithmetic mean) e.g. rates of growth.
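A minimal sketch of why the geometric mean suits rates of growth, using two hypothetical growth factors that are not from the source: a 25% rise followed by a 20% fall. Their product is 1, so there is no net growth, and the geometric mean reflects that while the arithmetic mean does not.

```python
import math

# Hypothetical growth factors: +25% in one period, -20% in the next.
growth_factors = [1.25, 0.80]
n = len(growth_factors)

# Arithmetic mean is based on the sum of the values.
arithmetic = sum(growth_factors) / n

# Geometric mean is the n-th root of the product of the values.
geometric = math.prod(growth_factors) ** (1 / n)

print(f"arithmetic mean = {arithmetic:.4f}")
print(f"geometric mean  = {geometric:.4f}")
```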
Next compute the average of these values, and take the square root:
Example:
Calculate the standard deviation for the following sample data using all
methods: 2, 4, 8, 6, 10, and 12.
Solution:
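The worked solution is not included in the text. As a minimal sketch, assuming that "all methods" refers to the definitional formula and the computational (shortcut) formula, both can be checked as follows; the population form is shown alongside for comparison.

```python
import math

data = [2, 4, 8, 6, 10, 12]
n = len(data)
x_bar = sum(data) / n  # 42 / 6 = 7

# Definitional method: sum of squared deviations from the mean.
ss_definitional = sum((x - x_bar) ** 2 for x in data)

# Shortcut method: sum of squares minus (sum squared) / n.
ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n

# Sample standard deviation divides by n - 1; population form divides by n.
sample_sd = math.sqrt(ss_definitional / (n - 1))
population_sd = math.sqrt(ss_definitional / n)

print(f"sum of squares: {ss_definitional} (definitional), {ss_shortcut} (shortcut)")
print(f"sample s = {sample_sd:.3f}, population sigma = {population_sd:.3f}")
```

Both methods give a sum of squares of 70, so the sample variance is 70 / 5 = 14 and the sample standard deviation is the square root of 14, about 3.74.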
Percentile
A percentile (or centile) is the value of a variable below
which a certain percent of observations fall.
One definition of percentile, often given in texts, is that the
P-th percentile of N ordered values (arranged from least
to greatest) is obtained by first calculating the (ordinal) rank
n and then taking the value at that rank in the ordered list.
Thus the 30th percentile is 20, the second number in the
sorted list.
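The rank formula itself is missing from the text. As a minimal sketch, assuming the nearest-rank convention n = ceil(P/100 × N), the calculation looks like this. The five-value list is hypothetical, chosen only because it reproduces the stated result (the 30th percentile is 20, the second number in the sorted list); other textbooks use slightly different rank formulas.

```python
import math

def nearest_rank_percentile(values, p):
    """Return the p-th percentile using the nearest-rank convention.

    This convention is an assumption; the source's own formula is not shown.
    """
    ordered = sorted(values)
    # Ordinal rank: smallest integer n with n >= p/100 * N (at least 1).
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical ordered values, not taken from the source.
sample = [15, 20, 35, 40, 50]
p30 = nearest_rank_percentile(sample, 30)
print(p30)
```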
44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106
Box plot
A box plot or boxplot (also known as a box-and-whisker diagram or plot) is a
convenient way of graphically depicting groups of numerical data through their
five-number summaries: the smallest observation (sample minimum), lower quartile
(Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum).
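A minimal sketch of a five-number summary, computed for the 17 ordered values listed just before the box plot heading. Whether those values were meant for this purpose is an assumption. Quartile conventions also differ between textbooks; this sketch uses the default ("exclusive") method of Python's statistics.quantiles.

```python
from statistics import quantiles

# The 17 ordered values listed just before the box plot heading.
values = [44, 46, 47, 49, 63, 64, 66, 68, 68,
          72, 72, 75, 76, 81, 84, 88, 106]

# Quartiles Q1, Q2 (median), Q3 using the default "exclusive" method.
q1, q2, q3 = quantiles(values, n=4)

five_number_summary = (min(values), q1, q2, q3, max(values))
print(five_number_summary)
```

For this data the exclusive method agrees with taking the median of the lower and upper halves (excluding the overall median), which is the approach many introductory texts describe.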
Information Obtained from a Box Plot
4 , 10 , 7 , 7 , 6 , 9 , 3 , 8 , 9
Find
a) the mode,
b) the median,
c) the mean (Arithmetic, Geometric and Harmonic)
d) the sample standard deviation.
e) If we replace the data value 6 in the data set above by 24, will the
standard deviation increase, decrease or stay the same?
Solution
a) The given data set has two modes: 7 and 9.
b) Ordered data: 3, 4, 6, 7, 7, 8, 9, 9, 10; median = 7.
c) Arithmetic mean: m = (3+4+6+7+7+8+9+9+10) / 9 = 63 / 9 = 7.
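The remaining parts (geometric and harmonic means, sample standard deviation, and the effect of replacing 6 by 24) are not worked in the text. A minimal sketch using Python's standard statistics module can be used to check them; the variable names are illustrative.

```python
from statistics import (mean, median, multimode, geometric_mean,
                        harmonic_mean, stdev)

data = [4, 10, 7, 7, 6, 9, 3, 8, 9]

modes = multimode(data)            # a) most frequent values
median_value = median(data)        # b) middle of the ordered data
arithmetic = mean(data)            # c) arithmetic mean
geometric = geometric_mean(data)   # c) n-th root of the product
harmonic = harmonic_mean(data)     # c) n divided by the sum of reciprocals
s = stdev(data)                    # d) sample standard deviation (n - 1)

# e) Replace the value 6 by 24 and recompute the standard deviation.
replaced = [24 if x == 6 else x for x in data]
s_replaced = stdev(replaced)

print(modes, median_value, arithmetic)
print(f"geometric = {geometric:.3f}, harmonic = {harmonic:.3f}")
print(f"s = {s:.3f}, after replacing 6 by 24: s = {s_replaced:.3f}")
```

The sum of squared deviations from the mean of 7 is 44, so the sample variance is 44 / 8 = 5.5. Replacing 6 by 24 puts a value far from the rest of the data, so the standard deviation increases.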
Problem
Given the data set
62 , 65 , 68 , 70 , 72 , 74 , 76 , 78 , 80 , 82 , 96 , 101,
find
a) the median,