
7 - 1

Chapter 7: Data Analysis in the
Service of Modeling
The Art of Modeling with
Spreadsheets
S.G. Powell and K.R. Baker
John Wiley and Sons, Inc.
PowerPoint Slides Prepared By:
Tava Olsen
Washington University in St. Louis
7 - 2
Data Analysis in the
Context of Modeling
Supports the modeling process
Improves accuracy of model
Improves usefulness of conclusions
Modeling is the primary goal
Data analysis is a means to that goal
7 - 3
Topics for Chapter
Finding facts in databases
Searching, editing, sorting, and filtering
Estimating parameters
Point estimates and interval estimates
Estimating relationships among variables
Single, multiple, and nonlinear regression
Forecasting a single variable
Time series methods
7 - 4
Databases
Tables of information
Each row is a record in the database
Each column is a field for the records
Excel calls such a table a list
7 - 5
Excel Lists
First row contains names for each field
Each successive row contains one record
Lists may be:
Searched and edited
Sorted
Filtered
Tabulated
7 - 6
Searching and Editing Lists
First assign a range name to entire list
Include column titles
With list selected, choose Data → Form
Examine records one at a time:
Find Prev
Find Next
Enter new record with New button
Delete record with Delete button
7 - 7
Database Form
***Insert Figure 7.8
7 - 8
Criteria Button
Found under Data → Form
Allows for searching of records
Enter data into a field
Click Find Next
7 - 9
Alternate Excel Search
Techniques
Highlight entire database
Use Edit → Find to search
Use Find and Replace to edit entries
In Find and Replace
? stands for any single symbol
* stands for any sequence of symbols
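The two wildcards can be tried outside Excel as well. The sketch below uses Python's standard `fnmatch` module, which happens to follow the same `?` and `*` conventions. The field values are invented for illustration, and Excel's own matching (for example its case handling or partial-cell matching) may differ in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical field values, invented for illustration only.
entries = ["Mac", "Max", "Macintosh", "Mask"]

def matches(pattern, values):
    """Return the values that match a wildcard pattern in full."""
    return [v for v in values if fnmatchcase(v, pattern)]

print(matches("Ma?", entries))   # ? matches exactly one character
print(matches("Mac*", entries))  # * matches any sequence, including an empty one
```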
7 - 10
Sorting: Data Sort Command
***insert figure 7.10
7 - 11
Filtering
Select database, then Data → Filter → AutoFilter
Will filter lists based on values
Found under arrow at the title of each column
Arrow on title turns blue to indicate that the list is filtered
Can remove filter by:
Selecting (All) using the list arrow; or
Selecting Show All under Data → Filter
7 - 12
More Filtering
Top 10 option returns records with smallest
or largest values of a numerical field
Custom option allows filtering with
compound criteria
More complicated compound criteria can be
achieved with the Data → Filter → Advanced
Filter submenu
7 - 13
Tabulating
Select Data → Pivot Table
Creates summary tables
Layout button on
third step of wizard
creates the format
for the table
7 - 14
Analyzing Sample Data
Data is unlikely to cover whole population
Work with sample from population
Statistics are summary measures about sample
Want to construct statistics that represent population
Convenience sampling
Have easy access to information on subset of population
Subset may not be representative
Random sampling
All objects in population have equal chance of appearing
in sample

7 - 15
Descriptive Statistics
Summarizes information in sample
Gives numerical picture of observations
Excel: Tools → Data Analysis
Descriptive Statistics table produced based on
data given as input
7 - 16
Inferential Statistics
Use information in sample to make inferences about
population
Systematic Error
If sample not representative of population
Avoid by careful sampling
Sampling Error
Sample is merely subset of population
Mitigated by taking large samples
7 - 17
Point Estimates
The sample average is calculated as:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
The sample variance is calculated as:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
and its square root is the sample standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
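As a minimal sketch, the three point estimates can be computed directly from their definitions. The function names and the data values are illustrative assumptions, not part of the slides.

```python
import math

def sample_mean(xs):
    """Sample average: sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance: squared deviations from the average, divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

# Made-up observations for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_mean(data), sample_variance(data), sample_std(data))
```

Python's standard `statistics.mean`, `statistics.variance`, and `statistics.stdev` use the same n - 1 convention and should agree with these functions.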
7 - 18
Interval Estimates
$P(L \le \mu \le U) = 1 - \alpha$

L and U represent the lower and upper limits of the
interval
$1 - \alpha$ represents the confidence level
Usually a large percentage like 95 or 99%
$\mu$ represents the (unknown) true value of the
parameter.
7 - 19
Sampling Theory
Working with a population described by a Normal
probability model
Mean $\mu$ and standard deviation $\sigma$.
Take repeated samples of n items from population
Calculate the sample average each time
The sample averages will follow a Normal
distribution with a mean of $\mu$ and a variance of $\sigma^2 / n$
7 - 20
Estimates
Standard error: the standard deviation of
some function being used to provide an
estimate
Use the sample average to estimate the
population mean
The standard deviation of the sample average
is called the standard error of the mean:
$$\sigma_{\bar{x}} = \sigma / \sqrt{n}$$
7 - 21
Z-scores
The z-score measures the number of standard
deviations away from the mean
The z-score corresponding to any particular sample
average is:
$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Tells how many standard errors from the mean
90% of the sample averages will have z-scores
between $-1.64$ and $+1.64$
The chances are 90% that the sample average will fall no
more than 1.64 standard errors from the true mean
7 - 22
Confidence Intervals for Means
Upper and lower limits on estimate for mean:
$$\bar{x} \pm z (\sigma / \sqrt{n})$$
n > 30 recommended unless original population
resembles Normal
z can be computed using NORMSINV($1 - \alpha/2$)
Replace $\sigma$ by the sample standard deviation s
Provided that sample is larger than n = 30
Excel Descriptive Statistics also will calculate half-width
of confidence interval
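A hedged sketch of the interval for a mean, using only Python's standard library. `NormalDist().inv_cdf` plays the role that NORMSINV plays on the slide, the sample standard deviation s stands in for the unknown population value, and the function name and data are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(xs, confidence=0.95):
    """Return (lower, upper) limits: sample average +/- z * s / sqrt(n)."""
    n = len(xs)
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)    # counterpart of NORMSINV(1 - alpha/2)
    half_width = z * stdev(xs) / math.sqrt(n)  # s replaces the unknown sigma
    center = mean(xs)
    return center - half_width, center + half_width

# Made-up sample of 35 observations (larger than n = 30).
data = [50 + (i % 7) - 3 for i in range(35)]
lower, upper = mean_confidence_interval(data, confidence=0.95)
print(lower, upper)
```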
7 - 23
Interval Estimates for a
Proportion
To estimate a population proportion from the
sample proportion p, the interval estimate is:
$$p \pm z \sqrt{\frac{p(1 - p)}{n}}$$
Sample size should be at least 50 for this
formula to be reliable
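The same pattern applies to a proportion. The following is a sketch under the assumption that the data arrive as a count of "successes" out of n observations; the function name and the numbers are illustrative only.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return (lower, upper) limits: p +/- z * sqrt(p * (1 - p) / n)."""
    p = successes / n                              # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Made-up example: 60 of 200 sampled records have the attribute of interest.
lower, upper = proportion_confidence_interval(60, 200)
print(lower, upper)
```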
7 - 24
Sample Size Determination
Suppose we want to estimate the population mean to
within a range of $\pm R$
$$n = (z\sigma / R)^2$$

Assumes:
Sampling from Normal distribution
Known variance (can begin with a small sample
to estimate the standard deviation)
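A short sketch of the sample-size formula for a mean. Rounding up to the next whole observation is an added assumption (the slide gives only the formula), and the values of sigma and R are made up.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, R, confidence=0.95):
    """Smallest whole n with n >= (z * sigma / R)**2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / R) ** 2)

# Illustrative values: standard deviation 10, desired precision +/- 2.
print(sample_size_for_mean(sigma=10, R=2))
```

Halving R roughly quadruples the required sample, since R enters the formula squared.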

7 - 25
Sample Size Determination for
Proportions
Suppose we want to estimate a proportion to
within a range of $\pm R$
$$n = z^2 p(1 - p) / R^2$$
Value maximized at p = 0.5
Conservative value:
$$n = (z/2)^2 / R^2$$
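The proportion version can be sketched the same way. Leaving p at its default of 0.5 gives the conservative value; the function name, the rounding up, and the example numbers are assumptions for illustration.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(R, p=0.5, confidence=0.95):
    """Smallest whole n with n >= z**2 * p * (1 - p) / R**2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / R ** 2)

print(sample_size_for_proportion(R=0.05))         # conservative case, p = 0.5
print(sample_size_for_proportion(R=0.05, p=0.2))  # smaller n if p is believed near 0.2
```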


7 - 26
Estimating Relationships
Scatter plot: visualize association
Correlation:
$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
n: number of pairs of observations for x, y
$s_x$, $s_y$: standard deviations of x, y
r measures strength of linear relationship between
x and y
7 - 27
r-statistic
Independent of units of measurement
Lies in range [-1, 1]
r > 0: positive association
r < 0: negative association
r close to 1 (or −1) implies a strong association
r close to 0 implies a weak association
Excel function: CORREL(xrange,yrange)
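For readers working outside Excel, the r-statistic can be sketched from its definition. The function name and data are illustrative assumptions; `statistics.stdev` is the sample (n - 1) standard deviation, which matches the formula for r.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y deviations."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up paired observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
print(correlation(xs, ys))
# Rescaling y (a change of units) leaves r unchanged.
print(correlation(xs, [100 * y for y in ys]))
```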


7 - 28
Regression Relationships
Relationships based on empirical data
Dependent variable predicted from values
of one or more independent variables
Regression models can be:
Linear or nonlinear
Simple or multiple
7 - 29
Simple Linear Regression
y = a + bx + e

y - dependent variable
x - independent variable
e - an error term.
Constants a and b represent the intercept and
slope, respectively, of the regression line
7 - 30
Error Term in Regression
Unexplained noise in the relationship
May represent limitations of knowledge
Or may represent random deviations of the
dependent variable from its mean, y
7 - 31
Regression Goal
Want to find line to most closely match the
observed relationship between x and y
Define most closely as minimizing sum of
squared differences between observed and model
values
Minimizing sum of differences would set y equal to its
mean
Penalizes large differences more than small differences
7 - 32
Performing Regression
Residuals:
$$e_i = y_i - \hat{y}_i = y_i - (a + b x_i)$$
Sum of squared differences between observations
and model:
$$SS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
The regression problem: choose a and b to
minimize SS
7 - 33
Regression Analysis
Assumes residuals are normally distributed with
mean 0
Regression parameters can be calculated directly
from the data:
$$b = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$
$$a = \bar{y} - b\bar{x}$$
Simpler to use Excel's regression tool
(Under Data Analysis menu)
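A minimal sketch of the direct calculation of the slope b and intercept a. The function name and the five data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n          # a = y_bar - b * x_bar
    return a, b

# Made-up observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
a, b = fit_line(xs, ys)
print(a, b)
```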
7 - 34
Quantifying Regression Fit
Coefficient of determination: $R^2$
Lies in range [0, 1]
Closer to one: better fit
Measures how much of the variation in y-values
is explained by model
1: perfect match to model
0: equation explains none of observed variation
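One common way to compute the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of y. The slides do not spell out a formula, so treat this as a sketch under that assumption; the coefficients a and b are taken as already fitted, and the data are made up.

```python
def r_squared(xs, ys, a, b):
    """R^2 = 1 - SS_residual / SS_total for the line y = a + b*x."""
    y_bar = sum(ys) / len(ys)
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Made-up observations with an assumed, previously fitted line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
print(r_squared(xs, ys, a=0.22, b=1.96))
```

A flat line at the mean of y (b = 0) explains none of the variation and gives a value of 0 under this definition.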

7 - 35
Regression Window
*** insert Figure 7.28
7 - 36
Regression Output
R Squared
Degree of significance
(under 0.1 is significant)
Estimate for a
Estimate for b
P values of under 0.1
are statistically significant
7 - 37
Simple Nonlinear Regression
A straight line may not be the most plausible
description of dependency, e.g., $y = ax^b$

Can follow previous ideas to minimize sum
of squared differences
No Excel functions or simple formulas
Or can transform non-linear relationship into
linear one, e.g., log y = log a + b log x
Give up some intuition for convenience
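The log transformation can be sketched as follows: fit a straight line to (log x, log y), then convert the intercept back with the exponential. The function name and data are assumptions, and all x and y must be positive. Note that this minimizes squared differences in log units, not in the original units of y.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y); needs x, y > 0."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_lx, mean_ly = sum(lx) / n, sum(ly) / n
    # Slope in centered form, algebraically equivalent to the direct formula.
    slope = (sum((u - mean_lx) * (v - mean_ly) for u, v in zip(lx, ly))
             / sum((u - mean_lx) ** 2 for u in lx))
    intercept = mean_ly - slope * mean_lx      # estimate of log a
    return math.exp(intercept), slope          # (a, b)

# Made-up data that follow y = 3 * x**2 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x ** 2 for x in xs]
a, b = fit_power_law(xs, ys)
print(a, b)
```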
7 - 38
Multiple Linear Regression
Multiple independent variables
$$y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m + e$$
Work with n observations, each of which has:
One observation of dependent variable
One observation each of the m independent variables
Seek to minimize the sum of squared differences
Put all independent variables into x-range in Excel's
regression tool
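For illustration only, the least-squares coefficients can also be obtained without Excel by solving the normal equations. The sketch below uses plain Python with Gauss-Jordan elimination. This is a swapped-in textbook technique, not necessarily what Excel's tool does internally, and the function name and data are assumptions.

```python
def fit_multiple(X, y):
    """Return [a0, a1, ..., am] minimizing the sum of squared residuals.

    X is a list of n rows, each holding the m independent-variable values;
    y is the list of n observations of the dependent variable.
    """
    rows = [[1.0] + list(r) for r in X]        # constant column for a0
    k = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y).
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

# Made-up data generated from y = 1 + 2*x1 - 3*x2 with no noise.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 7]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in X]
coefficients = fit_multiple(X, y)
print(coefficients)
```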
7 - 39
Regression Output
Coefficient of multiple determination
Coefficients of regression equation
P values of under 0.1
are statistically significant
Square root of R square
Accounts for presence of multiple variables
7 - 40
Values to Include in Regression
Ideally pick values that can be justified based
on practical or theoretical grounds
Could choose set that generates largest value
of adjusted $R^2$
Also could choose based on those with
significant p-values for coefficients
Remember that good models require good
forecasts for the independent variables

7 - 41
Regression Assumptions
Errors in the regression model
Follow a Normal distribution
Are mutually independent
Have the same variance
Linearity is assumed to hold
7 - 42
Forecasting with Time Series
Models
Use historical data
Assume near-term future will resemble past
Hypothesize a model with:
An average level: $x_t = \mu + e$
$\mu$: mean value; e: random noise term
A trend
A seasonal or cyclic fluctuation
7 - 43
Measures of Forecast Accuracy
MSE: Mean Squared Error between forecast
and actual
MAD: Mean Absolute Deviation between
forecast and actual
MAPE: Mean Absolute Percent Error
between forecast and actual
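The three accuracy measures can be sketched as below. The slides name the measures without formulas, so the usual definitions are assumed here (MAPE expressed in percent of the actual value), and the actual and forecast values are made up.

```python
def mse(actual, forecast):
    """Mean of squared forecast errors."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def mad(actual, forecast):
    """Mean of absolute forecast errors."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute error as a percentage of the actual value (actuals nonzero)."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Made-up actual and forecast values.
actual = [100.0, 110.0, 120.0, 130.0]
forecast = [90.0, 115.0, 120.0, 143.0]
print(mse(actual, forecast), mad(actual, forecast), mape(actual, forecast))
```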

7 - 44
Moving Average Model
$x_t$: observation from period t
n-period moving average forecast:
$$F_t = (x_t + x_{t-1} + \cdots + x_{t-n+1}) / n$$
Under Excel Data Analysis → Moving Average:
interval = number of periods
Pairs forecast $F_t$ and observation $x_t$
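A minimal sketch of the n-period moving average. The function returns one value for every period that has n observations available; the function name and the series are illustrative assumptions.

```python
def moving_average_forecasts(xs, n):
    """Return F_t = average of the n most recent observations, for each t
    that has at least n observations behind it (0-based t = n-1, n, ...)."""
    return [sum(xs[t - n + 1:t + 1]) / n for t in range(n - 1, len(xs))]

# Made-up series.
series = [10.0, 12.0, 14.0, 16.0, 18.0]
print(moving_average_forecasts(series, n=3))
```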


7 - 45
Exponential Smoothing
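Historic observations: $x_t$, $x_{t-1}$, $x_{t-2}$, etc.
Forecast: $F_t = \alpha x_t + (1 - \alpha) F_{t-1}$
Smoothing constant: $\alpha$

Implies:
$$F_t = \alpha x_t + \alpha(1 - \alpha) x_{t-1} + \alpha(1 - \alpha)^2 x_{t-2} + \alpha(1 - \alpha)^3 x_{t-3} + \cdots$$
$$F_t = F_{t-1} + \alpha (x_t - F_{t-1})$$

Under Excel Data Analysis: damping factor = $1 - \alpha$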
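The recursion can be sketched in a few lines. The slides do not say how to start it, so initializing the forecast at the first observation is an assumption made here; the function name and the series are also illustrative.

```python
def exponential_smoothing(xs, alpha, initial=None):
    """Return the list of F_t values from F_t = alpha*x_t + (1 - alpha)*F_{t-1}."""
    # Assumption: with no starting value given, use the first observation.
    f = xs[0] if initial is None else initial
    forecasts = []
    for x in xs:
        f = alpha * x + (1 - alpha) * f
        forecasts.append(f)
    return forecasts

# Made-up series.
series = [10.0, 20.0, 30.0]
print(exponential_smoothing(series, alpha=0.5))
```

A larger smoothing constant weights recent observations more heavily; in Excel's tool the corresponding input would be the damping factor, one minus the smoothing constant.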
7 - 46
Exponential Smoothing
with a Trend
$$x_t = \mu + \beta t + e$$

Forecast calculated after the observation for period t
will be calculated as $(F_t + T_t)$
$\alpha$ and $\beta$: smoothing constants

$$F_t = \alpha x_t + (1 - \alpha)(F_{t-1} + T_{t-1})$$
$$T_t = \beta (F_t - F_{t-1}) + (1 - \beta) T_{t-1}$$
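A hedged sketch of the two update equations for smoothing with a trend. The starting level and trend must be supplied by the caller, since the slides do not specify an initialization; the function name, parameter names, and data are assumptions.

```python
def smooth_with_trend(xs, alpha, beta, level0, trend0):
    """Apply the level (F) and trend (T) updates to each observation.

    Returns a list of (F_t, T_t, F_t + T_t); the last entry of each tuple
    is the forecast made after observing period t.
    """
    F, T = level0, trend0
    history = []
    for x in xs:
        F_new = alpha * x + (1 - alpha) * (F + T)
        T = beta * (F_new - F) + (1 - beta) * T
        F = F_new
        history.append((F, T, F + T))
    return history

# Made-up series rising by exactly 2 per period, with a matching start.
series = [12.0, 14.0, 16.0, 18.0]
history = smooth_with_trend(series, alpha=0.5, beta=0.5, level0=10.0, trend0=2.0)
print(history[-1])
```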


7 - 47
Exponential Smoothing with
Trend and Seasonality
$$x_t = (\mu + \beta t) S_t + e$$
p = number of periods in a cycle
Forecast calculated after the observation for period t
will be calculated as $(F_t + T_t) S_{t-p+1}$
$\alpha$, $\beta$, and $\gamma$: smoothing constants

$$F_t = \alpha x_t / S_{t-p} + (1 - \alpha)(F_{t-1} + T_{t-1})$$
$$T_t = \beta (F_t - F_{t-1}) + (1 - \beta) T_{t-1}$$
$$S_t = \gamma x_t / F_t + (1 - \gamma) S_{t-p}$$
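The seasonal version adds one more update. The sketch below assumes three smoothing constants (named `alpha`, `beta`, `gamma` here) and requires caller-supplied starting values for the level, the trend, and the p seasonal factors, since no initialization is given on the slide. All names and data are illustrative assumptions.

```python
def smooth_with_trend_and_seasonality(xs, p, alpha, beta, gamma,
                                      level0, trend0, seasonals0):
    """Return the forecast (F_t + T_t) * S_{t-p+1} made after each period t.

    seasonals0 holds the p starting seasonal factors, in period order.
    """
    F, T = level0, trend0
    S = list(seasonals0)          # S[t % p] is the latest factor for that season
    forecasts = []
    for t, x in enumerate(xs):
        s_old = S[t % p]                                      # S_{t-p}
        F_new = alpha * x / s_old + (1 - alpha) * (F + T)
        T = beta * (F_new - F) + (1 - beta) * T
        F = F_new
        S[t % p] = gamma * x / F + (1 - gamma) * s_old        # S_t
        forecasts.append((F + T) * S[(t + 1) % p])            # uses S_{t-p+1}
    return forecasts

# Made-up series: level 10 + 2t (t = 1..4) times seasonal factors 0.5, 1.5.
series = [6.0, 21.0, 8.0, 27.0]
forecasts = smooth_with_trend_and_seasonality(
    series, p=2, alpha=0.5, beta=0.5, gamma=0.5,
    level0=10.0, trend0=2.0, seasonals0=[0.5, 1.5])
print(forecasts)
```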



7 - 48
Summary
Data collection and analysis should support
modeling
Locate relevant information
Estimate parameters and relations
Construct routine forecasts
Excel provides many tools
Databases: searching, sorting, filtering, and tabulating
Data Analysis: descriptive statistics, linear regression,
moving average and exponentially smoothed forecasts
