Data Analysis - Statistics-021216 - NH (Compatibility Mode)

Data Analysis: Statistics
Overview
Purpose of exploratory data analysis (EDA)
 To find patterns in data, to get a sense of what the data look like, and perhaps to uncover some surprises
Patterns
 Relations between variables (e.g. linear or nonlinear)
 Correlations (positive, negative, none)
 Outliers
 Distributions (histograms, charts, etc.), e.g. is the distribution normal?
[Figure: scatter plot of y with two labelled points, A and B]
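
As a rough illustration of positive and negative correlation, the sketch below computes the Pearson correlation coefficient by hand. The data values and the function name `pearson_r` are hypothetical, made up for this example; they do not come from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]      # tends to rise with x
y_down = [6, 4, 5, 4, 2]    # tends to fall with x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(round(r_up, 3), round(r_down, 3))
```

A value near +1 suggests a positive linear relation, a value near -1 a negative one, and a value near 0 little or no linear relation.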

Patterns: Trends and cycles
 Detecting trends and cycles needs a long series
 Trends may be
   Rigid (dotted line in the figure)
   Flexible (moving average (MA), shown as the light curve), e.g. if data = {1, 3, 5, 4, 8, ...}, a 3-period MA is {3, 4, 17/3, ...}
 Cycles may be irregular
[Figure: y plotted against Time, with the Trend line and the Period and Amplitude of a cycle marked]
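
The 3-period moving average in the example can be checked with a short sketch. The function name `moving_average` is an assumption chosen for this illustration; exact fractions are used so that 17/3 appears as on the slide.

```python
from fractions import Fraction

def moving_average(data, window=3):
    """Return the mean of each run of `window` consecutive values."""
    return [
        Fraction(sum(data[i:i + window]), window)
        for i in range(len(data) - window + 1)
    ]

data = [1, 3, 5, 4, 8]          # the first five values from the slide
ma = moving_average(data)
print([str(v) for v in ma])     # ['3', '4', '17/3']
```

A longer window gives a smoother (less flexible) trend, at the cost of losing more points at the ends of the series.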
Index numbers
 Used to track changes over time, e.g. house prices, steel prices
 For a single homogeneous product (i.e. quality does not change over time), a possible index is
$$1,\ P_1/P_0,\ \ldots,\ P_n/P_0$$
where $P_t$ is the price at time t. We often multiply by 100, so that the base period reads 100 rather than 1.
Index numbers
 To construct an index of fruit prices, select a representative basket and compute the index from the prices below:

Item             2000 (t=0)  t=1    t=2    t=3
Apple            30c         35c    45c    50c
Orange           40c         40c    50c    55c
Banana (per kg)  $1.50       $1.70  $1.80  $2.00

t   Index
0   1
1   (35+40+170)/(30+40+150) = 245/220 = 1.11
2   (45+50+180)/220 = 275/220 = 1.25
3   (50+55+200)/220 = 305/220 = 1.39

i.e. the cost of fruit has gone up.
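
The basket calculation above can be reproduced in a few lines. This is a minimal sketch that assumes, as the slide's arithmetic implies, a basket of one apple, one orange and 1 kg of bananas; the variable names are illustrative.

```python
# Prices in cents for periods t = 0, 1, 2, 3 (banana is per kg).
prices = {
    "apple":  [30, 35, 45, 50],
    "orange": [40, 40, 50, 55],
    "banana": [150, 170, 180, 200],
}

# Cost of the basket (one unit of each item) in every period.
basket_cost = [sum(p[t] for p in prices.values()) for t in range(4)]

# Index relative to the base period t = 0.
index = [cost / basket_cost[0] for cost in basket_cost]

print(basket_cost)                    # [220, 245, 275, 305]
print([round(i, 2) for i in index])   # [1.0, 1.11, 1.25, 1.39]
```

A different choice of basket quantities would weight the items differently and could give a different index.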
Statistics and random variables
 "Statistics" can mean
   Methods of analyzing data (e.g. I study statistics);
   The data themselves (e.g. Do you have the statistics?); or
   Formulas based on sample values (e.g. the sample mean $m = \sum x / n$ is a statistic)
 A random variable (rv) refers to an outcome that varies, e.g.
   Gender, e.g. G = 1 (if male) and G = 0 (if female). Since G takes on discrete values, it is a discrete rv.
   Weight, e.g. $W_i$ = weight of the ith person. W is a continuous rv.
 Often, we omit the term “random” and simply call it a variable
Discrete distribution
 Suppose we toss 2 fair coins together.
 Let rv x = number of heads.
 Its probability distribution f(x) is given below.

x   f(x)
0   0.25
1   0.50
2   0.25

[Figure: bar chart of f(x) against x = 0, 1, 2, with heights 0.25, 0.50 and 0.25]
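
The distribution of the number of heads can be derived by listing the four equally likely outcomes. The sketch below does this by enumeration; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing 2 fair coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give each number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# f(x) = (number of outcomes with x heads) / (total number of outcomes).
f = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}
print({x: float(p) for x, p in f.items()})   # {0: 0.25, 1: 0.5, 2: 0.25}
```

The probabilities sum to 1, as they must for any probability distribution.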
Distributions: Continuous
 Consider a population of 200 students whose test scores (x) are distributed as below. A score can take any value within the range, e.g. x = 75.2 marks.

Score    Frequency  Prob. distribution, f(x)  Cumulative distribution
0-20     10         0.05                      0.05
21-40    30         0.15                      0.20
41-60    90         0.45                      0.65
61-80    50         0.25                      0.90
81-100   20         0.10                      1.00
Total    200        1.00
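
The last two columns of the table follow directly from the frequencies. A minimal sketch, with illustrative variable names:

```python
from itertools import accumulate

classes = ["0-20", "21-40", "41-60", "61-80", "81-100"]
frequency = [10, 30, 90, 50, 20]

total = sum(frequency)                            # 200 students
rel = [count / total for count in frequency]      # f(x): share in each class
cum = [c / total for c in accumulate(frequency)]  # cumulative share

for label, count, p, c in zip(classes, frequency, rel, cum):
    print(f"{label:>6} {count:>4} {p:.2f} {c:.2f}")
```

The cumulative column is a running total of f(x), so it must end at 1.00.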
Distribution: Continuous
 We can plot its relative frequency polygon
 If the class interval becomes smaller and smaller, the polygon approaches a smooth curve, called a probability density function
[Figure: curve of f(x) against x; axis ticks at f(x) = 0.2, 0.4 and x = 30, 70]
Population mean and variance
 We often use 2 parameters to describe a population distribution: mean ($\mu$) and variance ($\sigma^2$). ($\sigma$ is the standard deviation.)
 Note that Greek letters refer to population parameters; we will use m for the sample mean and s for the sample standard deviation.
 The mean or expectation is given by
$$E[x] = \mu = \begin{cases} \sum x f(x) & \text{if } x \text{ is discrete} \\ \int x f(x)\,dx & \text{if } x \text{ is continuous} \end{cases}$$
 The variance is given by
$$\sigma^2 = E[(x-\mu)^2] = \begin{cases} \sum (x-\mu)^2 f(x) & \text{if } x \text{ is discrete} \\ \int (x-\mu)^2 f(x)\,dx & \text{if } x \text{ is continuous} \end{cases}$$
Example: Discrete case
 If x takes the values shown in the table,

x   f(x)
0   0.2
2   0.5
3   0.3

then
$\mu = 0.2(0) + 0.5(2) + 0.3(3) = 1.9$.
$\sigma^2 = 0.2(0 - 1.9)^2 + 0.5(2 - 1.9)^2 + 0.3(3 - 1.9)^2 = 1.09$.
Thus $\sigma = \sqrt{1.09} \approx 1.044$.
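
The discrete example can be verified numerically by applying the definitions of mean and variance directly. A minimal sketch, with illustrative variable names:

```python
from math import sqrt

# Probability distribution from the table: value -> probability.
f = {0: 0.2, 2: 0.5, 3: 0.3}

mu = sum(x * p for x, p in f.items())                # E[x]
var = sum((x - mu) ** 2 * p for x, p in f.items())   # E[(x - mu)^2]
sigma = sqrt(var)

print(round(mu, 3), round(var, 3), round(sigma, 3))
```

The rounded results agree with the hand calculation: 1.9, 1.09 and 1.044.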
Sample statistics
 In practice, we only have a sample of n values $\{x_1, \ldots, x_n\}$, with n < N, where N is the population size
Sample mean: $m = \sum x_i / n$
Sample variance: $s^2 = \sum (x_i - m)^2 / (n-1)$
 Example: if our sample is {0, 2, 5, 5},
$m = 12/4 = 3$
$s^2 = [(0-3)^2 + (2-3)^2 + (5-3)^2 + (5-3)^2]/3 = 18/3 = 6$
$s = \sqrt{6} \approx 2.449$.
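
The same sample statistics can be obtained from Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 divisor. This is only a cross-check of the worked example:

```python
import statistics

sample = [0, 2, 5, 5]

m = statistics.mean(sample)        # sum of the values divided by n
s2 = statistics.variance(sample)   # sample variance, divides by n - 1
s = statistics.stdev(sample)       # square root of the sample variance

print(m, s2, round(s, 3))
```

For the population formulas (divisor N), the module provides `pvariance` and `pstdev` instead.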
Distribution of sample mean m
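 Left panel: population distribution of x, e.g. height of UI students, with unknown $\mu$ and $\sigma$. Note that x is not normally distributed.
 Right panel: sampling distribution of m, e.g. if we take, say, 100 samples each of size n (= 9, say) and compute m for each sample, we have 100 values of m from which to draw the curve.
 Note that $m \sim N(\mu, \sigma^2/n)$ approximately for large n. This is the Central Limit Theorem. For small n (< 30), $m \sim t_{n-1}(\mu, s^2/n)$, i.e. $(m - \mu)/(s/\sqrt{n})$ follows a t distribution with n - 1 degrees of freedom; strictly, this requires x itself to be approximately normal.
 We often use these results to make inferences about $\mu$.
[Figure: left panel, population distribution of x; right panel, sampling distribution of m with spread $\sigma/\sqrt{n}$; $\mu$ is marked on each horizontal axis]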
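
The sampling experiment described above (100 samples of size 9) can be simulated. The sketch below assumes a hypothetical non-normal population, an exponential distribution with mean 1 and variance 1, chosen only for illustration; the slides do not specify the population.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical non-normal population: exponential with mean 1, variance 1.
mu, sigma2 = 1.0, 1.0
n = 9              # size of each sample, as on the slide
num_samples = 100  # number of samples, as on the slide

# Draw the samples and record the mean m of each one.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
var_of_means = statistics.variance(sample_means)

print("mean of m:", round(mean_of_means, 3), "vs mu =", mu)
print("variance of m:", round(var_of_means, 3), "vs sigma^2/n =", round(sigma2 / n, 3))
```

With only 100 samples the agreement is rough; the values of m should cluster around $\mu$ with variance near $\sigma^2/n$, and a histogram of `sample_means` should look more symmetric than the population itself.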
Summary: What is statistics?
 Statistics is the scientific application of mathematical principles to the collection, analysis, and presentation of data
 At the foundation of all of statistics is data.
[Diagram: Statistics deals with the collection, presentation, analysis and use of data to make decisions and solve problems]
Engineering statistics
 Engineering statistics is the study of how best to…
   Collect engineering data
   Summarize or describe engineering data
   Draw formal inferences and practical conclusions on the basis of engineering data, all the while recognizing the reality of variation
