
Data Analysis – Statistics
Overview

Purpose of exploratory data analysis (EDA)

• To find patterns in data, to get a sense of what the data look like, and perhaps to turn up some surprises

Patterns

• Relations between variables (e.g. linear or nonlinear)
• Correlations (positive, negative, none)
• Outliers
• Distributions (histograms, charts, etc.), e.g. is the distribution normal?

[Figure: scatter plot of y with two labeled points, A and B]

Patterns: Trends and cycles

• Need a long series
• Trends may be
  - Rigid (dotted line below)
  - Flexible (a moving average (MA), shown as the light curve), e.g. if data = {1, 3, 5, 4, 8, ...}, a 3-period MA is {3, 4, 17/3, ...}, where each value is the mean of three consecutive observations
• Cycles may be irregular

[Figure: time series of y against Time, annotated with the trend and with the period and amplitude of the cycle]
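The 3-period moving average quoted in the trend example can be checked with a short Python sketch. The function name `moving_average` and the use of exact fractions are illustrative assumptions, not part of the original slides.

```python
from fractions import Fraction

def moving_average(data, window=3):
    """Return the simple moving average of `data` over `window` periods."""
    if window < 1 or window > len(data):
        raise ValueError("window must be between 1 and len(data)")
    # Average each run of `window` consecutive observations.
    return [
        Fraction(sum(data[i:i + window]), window)
        for i in range(len(data) - window + 1)
    ]

data = [1, 3, 5, 4, 8]  # the first five values given on the slide
ma = moving_average(data)
print([str(v) for v in ma])  # ['3', '4', '17/3']
```

A wider window gives a smoother, more rigid trend line, while a narrower one follows the data more closely.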

Index numbers

• Used to track changes over time, e.g. in house prices or steel prices
• For a single homogeneous product (i.e. quality does not change over time), a possible index is
  $1,\ P_1/P_0,\ \ldots,\ P_n/P_0$
  where $P_t$ is the price at time $t$. We often multiply by 100 to shift the decimal place.

Index numbers

• To construct an index of fruit prices, select a representative basket and compute as follows:

Item             2000 (t=0)   t=1     t=2     t=3
Apple            30c          35c     45c     50c
Orange           40c          40c     50c     55c
Banana (per kg)  $1.50        $1.70   $1.80   $2.00

t   Index
0   1
1   (35 + 40 + 170)/(30 + 40 + 150) = 245/220 = 1.11
2   (45 + 50 + 180)/220 = 275/220 = 1.25
3   (50 + 55 + 200)/220 = 305/220 = 1.39

i.e. the cost of fruit has gone up (all prices in cents, with banana prices converted from dollars).
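As a check on the basket arithmetic, the index can be recomputed in a few lines of Python. This is a minimal sketch: the helper name `basket_index` is hypothetical, and the basket is assumed to hold one apple, one orange, and one kilogram of bananas, as the slide's sums imply.

```python
# Basket prices in cents at t = 0, 1, 2, 3 (banana prices converted from dollars per kg).
prices = {
    "apple":  [30, 35, 45, 50],
    "orange": [40, 40, 50, 55],
    "banana": [150, 170, 180, 200],
}

def basket_index(prices, base=0):
    """Return the total basket cost in each period divided by the base-period cost."""
    periods = len(next(iter(prices.values())))
    totals = [sum(series[t] for series in prices.values()) for t in range(periods)]
    return [total / totals[base] for total in totals]

index = basket_index(prices)
print([round(v, 2) for v in index])  # [1.0, 1.11, 1.25, 1.39]
```

Multiplying each value by 100 gives the index in the more common base-100 form.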

Statistics and random variables

• Statistics are
  - Methods of analyzing data (e.g. I study statistics);
  - Data themselves (e.g. Do you have the statistics?); or
  - Formulas based on sample values (e.g. the sample mean $m = \sum x / n$ is a statistic)
• A random variable (rv) refers to an outcome that varies, e.g.
  - Gender, e.g. G = 1 (if male) and G = 0 (if female). Since G takes on discrete values, it is a discrete rv.
  - Weight, e.g. $W_i$ = weight of the $i$th person. W is a continuous rv.
• Often, we omit the term “random” and call it a variable

Discrete distribution

• Suppose we toss 2 fair coins together.
• Let rv x = number of heads.
• Its probability distribution f(x) is given below.

x   f(x)
0   0.25
1   0.50
2   0.25

[Figure: bar chart of f(x) against x = 0, 1, 2, with bar heights 0.25, 0.50, 0.25]
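The distribution of the number of heads can be derived by listing the four equally likely outcomes. The following Python sketch does this by enumeration; the variable names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing two fair coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give each number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# f(x) = (number of outcomes with x heads) / (total number of outcomes).
f = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in f.items():
    print(x, float(p))
```

The probabilities sum to 1, as any probability distribution must.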

Distributions: Continuous

• Consider a population of 200 students whose test scores (x) are distributed as below; an individual score might be, e.g., x = 75.2 marks.

Score    Frequency   Prob. distribution, f(x)   Cumulative distribution
0-20     10          0.05                       0.05
21-40    30          0.15                       0.20
41-60    90          0.45                       0.65
61-80    50          0.25                       0.90
81-100   20          0.10                       1.00
Total    200         1.00
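The two right-hand columns of the score table follow directly from the frequencies. A minimal Python sketch, with variable names chosen only for illustration:

```python
from itertools import accumulate

bins = ["0-20", "21-40", "41-60", "61-80", "81-100"]
freq = [10, 30, 90, 50, 20]
total = sum(freq)  # population size, 200

# Relative frequency of each score band.
rel = [count / total for count in freq]

# Cumulative distribution: running total of the counts, divided by the population size.
cum = [running / total for running in accumulate(freq)]

for band, p, c in zip(bins, rel, cum):
    print(f"{band:>6}  {p:.2f}  {c:.2f}")
```

Accumulating the integer counts before dividing avoids the small rounding drift that can appear when floating-point probabilities are summed directly.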

Distribution: Continuous

• We can plot its relative frequency polygon.
• If the column interval becomes smaller and smaller, the polygon approaches a curve, called a probability density function.

[Figure: f(x) against x; vertical axis ticks at 0.2 and 0.4, horizontal axis ticks at 30 and 70]

Population mean and variance

• We often use 2 parameters to describe a population distribution: the mean ($\mu$) and the variance ($\sigma^2$), where $\sigma$ is the standard deviation.
• Note that Greek letters refer to population parameters; we will use m for the sample mean and s for the sample standard deviation.

• The mean or expectation is given by
  $E[x] = \mu = \sum x f(x)$ if x is discrete
  $E[x] = \mu = \int x f(x)\,dx$ if x is continuous.
• The variance is given by
  $\sigma^2 = E[(x-\mu)^2] = \sum (x-\mu)^2 f(x)$ if x is discrete
  $\sigma^2 = E[(x-\mu)^2] = \int (x-\mu)^2 f(x)\,dx$ if x is continuous.

Example: Discrete case

• If x takes values as shown in the table,

x   f(x)
0   0.2
2   0.5
3   0.3

$\mu = 0.2(0) + 0.5(2) + 0.3(3) = 1.9$.
$\sigma^2 = 0.2(0 - 1.9)^2 + 0.5(2 - 1.9)^2 + 0.3(3 - 1.9)^2 = 1.09$.

Thus $\sigma = \sqrt{1.09} \approx 1.044$.
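The discrete example can be reproduced by applying the mean and variance definitions directly. This Python sketch uses exact fractions so the results are not affected by rounding; the name `pmf` is an illustrative assumption.

```python
from fractions import Fraction
from math import sqrt

# Probability distribution from the table: x -> f(x).
pmf = {0: Fraction(2, 10), 2: Fraction(5, 10), 3: Fraction(3, 10)}

# Population mean: sum of x * f(x).
mu = sum(x * p for x, p in pmf.items())

# Population variance: sum of (x - mu)^2 * f(x).
var = sum((x - mu) ** 2 * p for x, p in pmf.items())

sigma = sqrt(var)
print(float(mu), float(var), round(sigma, 3))  # 1.9 1.09 1.044
```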

Sample statistics

• In practice, we only have a sample of n values $\{x_1, \ldots, x_n\}$, with n < N, where N is the population size.

Sample mean: $m = \sum x_i / n$
Sample variance: $s^2 = \sum (x_i - m)^2 / (n-1)$

• Example: if our sample is {0, 2, 5, 5},
  $m = 12/4 = 3$
  $s^2 = [(0-3)^2 + (2-3)^2 + (5-3)^2 + (5-3)^2]/3 = 18/3 = 6$
  $s = \sqrt{6} \approx 2.45$.
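The sample example can be verified with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n-1 divisor of the sample variance formula. This is only a quick check, not part of the original slides.

```python
import statistics

sample = [0, 2, 5, 5]

m = statistics.mean(sample)       # sample mean, sum(x) / n
s2 = statistics.variance(sample)  # sample variance, divisor n - 1
s = statistics.stdev(sample)      # sample standard deviation, sqrt(s2)

print(m, s2, round(s, 3))
```

For a full population rather than a sample, the same module offers `pvariance` and `pstdev`, which divide by N instead.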

Distribution of sample mean m

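• Left panel: population distribution of x, e.g. height of UI students, with unknown $\mu$ and $\sigma$. Note that x is not normally distributed.
• Right panel: sampling distribution of m, e.g. if we take, say, 100 samples, each of size n (say n = 9), and compute m for each sample, we have 100 values of m from which to draw the curve.
• Note that $m \sim N(\mu, \sigma^2/n)$ approximately for large n. This is the Central Limit Theorem. For small n (< 30), $\sigma$ is estimated by s and the t distribution with n-1 degrees of freedom is used instead: $(m - \mu)/(s/\sqrt{n}) \sim t_{n-1}$, which is exact when x itself is normally distributed.
• We often use these results to make inferences about $\mu$.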

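[Figure: left panel, distribution of x centered at $\mu$; right panel, distribution of m centered at $\mu$ with spread $\sigma/\sqrt{n}$]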
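The sampling experiment described above (100 samples, each of size 9) can be simulated. The sketch below assumes an exponential population with mean 1 and variance 1, chosen only because it is clearly non-normal; the slides do not specify a population, and the function name `sample_means` is hypothetical.

```python
import random
import statistics

def sample_means(num_samples, n, seed=0):
    """Draw `num_samples` samples of size `n` from an exponential population
    (mean 1, variance 1) and return the mean of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

# 100 samples of size 9, as in the slide's example.
means = sample_means(num_samples=100, n=9)

# The average of the sample means should be near the population mean (1),
# and their variance near sigma^2 / n = 1/9.
print(statistics.fmean(means), statistics.variance(means))
```

With only 100 samples the estimates are noisy; increasing `num_samples` brings them closer to 1 and 1/9, and a histogram of `means` looks increasingly bell-shaped as n grows.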

Summary: What is statistics?

• Statistics is the scientific application of mathematical principles to the collection, analysis, and presentation of data.
• At the foundation of all of statistics is data.
• Statistics deals with the collection, presentation, analysis, and use of data to make decisions and solve problems.


Engineering statistics

• Engineering statistics is the study of how best to…
  - Collect engineering data
  - Summarize or describe engineering data
  - Draw formal inferences and practical conclusions on the basis of engineering data, all the while recognizing the reality of variation
