Excel Regression

Spreadsheet Problem Solving
fitting models to data

straight-line regression
multilinear regression
nonlinear regression
model building and selection
using
Trendline
Data Analysis Regression tool
Solver
Review of Straight-line Linear Regression

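[ from Class #6 ]
[Figure: data points scattered about the model line y = ax + b. For a data point at x_i, the error e_i is the vertical distance between the measured value y_i and the value predicted by the model line.]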
For each data point, there is an error between that point and the model line. Fitting the model has to do with minimizing these errors.
Finding the model parameters that give the best fit

For the straight-line model, the model parameters are
the slope (a) and the intercept (b).
The problem is then to find the values of a and b that
give the best fit. What is meant by the best fit?
The standard measure of goodness of fit is the sum
of squares of the errors:
$$ SSE = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad \hat{y}_i = a\,x_i + b $$
So, the problem reduces to finding the minimum of
SSE by adjusting a and b.
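The SSE definition above can be sketched in a few lines of Python. This is only an illustration: the function name `sse` and the data values are made up for the example and do not come from the slides.

```python
def sse(a, b, xs, ys):
    """Sum of squared errors between measured ys and the line a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data lying exactly on y = 2x + 1 (an assumption)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

sse_exact = sse(2.0, 1.0, xs, ys)    # the generating line: every error is zero
sse_shifted = sse(2.0, 0.0, xs, ys)  # intercept off by 1: each of 4 errors is 1
```

Any change in a or b away from the best values raises SSE, which is what makes it usable as the quantity to minimize.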
Fitting a straight-line model to data

The minimization of SSE can be solved by calculus
to give formulas for the best values of a and b:
$$ a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} $$
$$ b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n} $$
and Excel solves problems like this with either formulas
or built-in tools (Data Analysis Regression & Trendline).
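As a cross-check outside Excel, the closed-form expressions for a and b can be written directly in Python. This is a minimal sketch: the function name `fit_line` and the five data points are hypothetical and are not the CO2 data used in the slides.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the model y = a*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

# Made-up illustrative data (an assumption, not from the slides)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

a, b = fit_line(xs, ys)  # slope about 1.96, intercept about 0.22
```

The sums here are the same column totals a spreadsheet would compute, so the function mirrors the formula-based approach rather than any built-in tool.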
Example: straight-line fit
Transfer the data to an Excel spreadsheet and create a graph
[Figure: "CO2 Emissions for the US". y-axis: CO2 Emissions (MMT C), 1320 to 1520; x-axis: Year, 1989 to 2000.]
Calculating the slope and intercept using Excel formulas
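$$ a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} $$
$$ b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n} $$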
The formulas behind the numbers
Using the model straight-line equation to compute the predictions: and copy these to the graph, displaying as a straight line
[Figure: CO2 emissions data with the fitted line y = 21.32x - 41090; x-axis: Year, 1989 to 2000; y-axis: 1300 to 1550.]
Using an alternate, shortcut approach
Trendline
Start with a simple graph of the data

1) Select the data series by clicking on it
2) Right-click on a data point to get the context-sensitive menu
3) Select the Add Trendline option
[Figure: CO2 emissions data; y-axis: 1320 to 1520; x-axis: Year, 1989 to 2000.]
The Add Trendline dialog box
Linear is selected by default, which is OK for this problem.
Click on the Options tab.
Options tab
Set for "Display equation on chart".
Click OK.
Initial form of graph with straight-line added
Fix up the equation display.
[Figure: CO2 emissions data with trendline and equation y = 21.315x - 41090; x-axis: Year, 1989 to 2000; y-axis: 1300 to 1550.]

[Figure: CO2 emissions data with trendline, y = 21.315x - 41090; x-axis: Year, 1989 to 2000; y-axis: 1300 to 1550.]
Looks just like before, but we got there quicker
But neither of these approaches gives us much information about the model, how good it is, etc.
A 2nd alternate approach

Tools > Data Analysis
Recall that, if Data Analysis does not appear on the Tools menu, you will need to check Analysis ToolPak in the Add-ins dialog box [if it's not there, you will have to go back to Microsoft Office/Excel set-up].
[Figure: initial, empty Regression dialog box]
Regression dialog box set up for our problem
Checking Residuals will also give us the model predictions.
Initial (poorly formatted) Regression output display

[ on new worksheet ]
Format > AutoFormat > OK, and fix up the display for appropriate significant figures
Final Display of Regression Output

[ tons of info, most of which you will not understand for a couple years ]
Regression statistics: used to judge goodness of fit
Coefficients: intercept and slope values
P-values: used to judge whether terms belong in the model
Predicted values: add to data graph for visual comparison with model
Judging Goodness of Fit
Correlation coefficient: if close to +1 or -1, indicates strong correlation between x and y [something we already know from the original graph!]
Coefficient of determination (R²): percentage of the variability in y that's accounted for by the model
Standard Error: gives an idea of how far off the model predictions will be
Adjusted R²: adjustment to R² that penalizes the value for using a model with too many terms
Adjusted R² or Standard Error can be used to compare different models and choose which fits best. The higher the value of Adjusted R² the better, the lower the value of Standard Error the better.
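To make these statistics concrete, here is a minimal Python sketch using the standard textbook definitions for a model that includes an intercept. The function name `fit_statistics`, the data, and the coefficients 1.96 and 0.22 are illustrative assumptions, and the sketch is not claimed to reproduce Excel's internals.

```python
import math

def fit_statistics(ys, y_pred, n_terms):
    """Return (R^2, adjusted R^2, standard error) for a fitted model.

    n_terms is the number of fitted coefficients, including the intercept.
    """
    n = len(ys)
    y_mean = sum(ys) / n
    sse = sum((y - yp) ** 2 for y, yp in zip(ys, y_pred))  # unexplained variation
    sst = sum((y - y_mean) ** 2 for y in ys)               # total variation in y
    dof = n - n_terms                                      # residual degrees of freedom
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (sse / dof) / (sst / (n - 1))
    std_err = math.sqrt(sse / dof)
    return r2, adj_r2, std_err

# Made-up data and its least-squares line y = 1.96x + 0.22 (assumptions)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
y_pred = [1.96 * x + 0.22 for x in xs]

r2, adj_r2, std_err = fit_statistics(ys, y_pred, n_terms=2)
```

Adding a term always raises or holds R², but it also lowers the residual degrees of freedom, which is why Adjusted R² and Standard Error are the fairer values for comparing models of different size.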
Judging whether terms belong in the model

A P-value is the probability of obtaining a coefficient estimate at least this far from zero if the true value of the coefficient were zero.
A P-value of 5% (0.05) or greater causes suspicion that the coefficient may not be significant and that the term should probably be dropped from the model.
P-values that are quite small, like these, indicate that there is little question about the significance of the term coefficients. In our case here, that means that both the intercept term and the slope term belong in the model.
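The P-values in the regression output are derived from t statistics, each being a coefficient divided by its standard error. The sketch below computes those t statistics for a straight-line fit using the standard formulas. The function name and data are made up, and the final conversion from t to a P-value (which needs the t distribution with n - 2 degrees of freedom) is deliberately left out.

```python
import math

def coefficient_t_statistics(xs, ys):
    """t statistics (coefficient / standard error) for slope and intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a = sxy / sxx                # slope
    b = y_mean - a * x_mean      # intercept
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))                         # standard error of the fit
    se_a = s / math.sqrt(sxx)                            # standard error of the slope
    se_b = s * math.sqrt(1.0 / n + x_mean ** 2 / sxx)    # standard error of the intercept
    return a / se_a, b / se_b

# Made-up illustrative data (an assumption, not from the slides)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

t_slope, t_intercept = coefficient_t_statistics(xs, ys)
```

For this made-up data the slope's t statistic is about 51 and the intercept's is about 1.7. With 3 degrees of freedom the two-sided 5% critical value of t is about 3.18, so the slope would pass the 5% test and the intercept would not.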
The Data Analysis Regression tool appears much more complicated and involved than the shortcut Trendline tool, so . . .
Why use Data Analysis Regression?
1) It provides more information that lets us judge the goodness of fit and significance of model terms
2) It can handle model forms that cannot be handled by Trendline
So, generally, when using Excel, we prefer the Data Analysis Regression tool over Trendline, but Trendline is still quite good for quick and dirty looks at the data.
Learn to use both!
More complicated models

Polynomial models
$$ y = a + b x + c x^2 + d x^3 + \cdots $$
Note: it is called linear regression, even when there are nonlinear terms in x, because the terms are linear in the model parameters, a, b, c, etc.
General linear models
$$ y = a f_1(x) + b f_2(x) + c f_3(x) + d f_4(x) + \cdots $$
Examples:
polynomial models above
$$ y = a + b\,\frac{1}{x} + c \ln x $$
Multilinear models
$$ y = a f_1(x_1, x_2, \ldots) + b f_2(x_1, x_2, \ldots) + c f_3(x_1, x_2, \ldots) + \cdots $$
Examples:
$$ y = a + b x_1 + c x_2 + d x_1 x_2 $$
$$ y = a\,e^{x_1 / x_2} $$
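Because every model on this slide is linear in its parameters, one least-squares routine can fit all of them once the functions f1, f2, ... are supplied. The sketch below solves the normal equations with plain Gaussian elimination. It is an illustration under assumptions: the function names and data are hypothetical, and the method is a textbook approach, not a description of what Excel's Regression tool does internally.

```python
import math

def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def fit_general_linear(basis, xs, ys):
    """Least-squares coefficients for y = sum of coeff_k * basis[k](x)."""
    F = [[f(x) for f in basis] for x in xs]  # one row per data point
    k = len(basis)
    FtF = [[sum(row[i] * row[j] for row in F) for j in range(k)] for i in range(k)]
    Fty = [sum(row[i] * y for row, y in zip(F, ys)) for i in range(k)]
    return solve_linear_system(FtF, Fty)

# Basis for the example model y = a + b*(1/x) + c*ln(x)
basis = [lambda x: 1.0, lambda x: 1.0 / x, lambda x: math.log(x)]

# Made-up data generated from a = 1, b = 2, c = 3 (an assumption)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0 + 2.0 / x + 3.0 * math.log(x) for x in xs]

coeffs = fit_general_linear(basis, xs, ys)  # approximately [1.0, 2.0, 3.0]
```

The same function handles multilinear models if each entry of `xs` is a tuple such as `(x1, x2)` and the basis functions take that tuple as their argument.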
Nonlinear models
Transformable to linear
$$ y = a\,e^{b x} \quad\Longrightarrow\quad \ln y = \ln a + b x $$
[ straight-line regression! ]
Not transformable
$$ P = 10^{\,A - \frac{B}{T + C}} $$
We can use the Data Analysis Regression tool for everything except the nonlinear models that can't be transformed into linear. For those, we can use the Solver.
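The transformable case can be sketched as follows: take logarithms of y, run an ordinary straight-line fit of ln y against x, and convert the intercept back with exp. The function name `fit_exponential` and the data are made up for illustration.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by straight-line regression of ln(y) on x."""
    ln_ys = [math.log(y) for y in ys]  # requires every y > 0
    n = len(xs)
    x_mean = sum(xs) / n
    ln_y_mean = sum(ln_ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (ly - ln_y_mean) for x, ly in zip(xs, ln_ys))
    slope = sxy / sxx                       # this is b
    intercept = ln_y_mean - slope * x_mean  # this is ln(a)
    return math.exp(intercept), slope

# Made-up data generated from a = 2, b = 0.5 (an assumption)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

a, b = fit_exponential(xs, ys)  # approximately (2.0, 0.5)
```

Note that this minimizes the squared errors in ln y rather than in y, so with noisy data the result can differ somewhat from a direct nonlinear fit of the original model.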
Example: polynomial regression

[Figure: "Viscosity of Water at Atmospheric Pressure". y-axis: Viscosity (cp), 0 to 2; x-axis: Temperature (degF), 0 to 250. Callout: curvature evident.]
Setting up for polynomial fits
Select for quadratic model, etc.
Check Labels because headings are included in the selections for Y and X.
Check Residuals.
Quadratic model regression results
model performance: adjR²
model coefficients
copy to graph
Quadratic model really doesn't capture the behavior of the data
[Figure: viscosity data and the quadratic model fit; y-axis: Viscosity (cp), 0 to 2; x-axis: Temperature (degF), 0 to 250.]
Continue with fits of cubic, 4th- & 5th-order polynomials
Summary of results
Looks like 5th-order offers best performance, but improvement is marginal over 4th-order.
Resulting model:
$$ \text{Visc} = 3.161 - 0.05699\,T + 5.023\times10^{-4}\,T^2 - 2.162\times10^{-6}\,T^3 + 3.593\times10^{-9}\,T^4 $$

[Figure: viscosity data with quadratic, cubic, and 4th-order fits; y-axis: Viscosity (cp), 0 to 2; x-axis: Temperature (degF), 20 to 220.]
Precautions on polynomial fitting

Try to use the lowest-order model that gives a good fit.
Higher-order models will have wiggles between data
points that will cause prediction errors.
In fact, an (n-1)th-order polynomial will provide a perfect
fit to the n data points, but it will usually do bizarre things
in between the data points.
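The perfect-fit-but-bizarre behavior is easy to reproduce. The sketch below passes a 10th-order polynomial exactly through 11 points using Lagrange interpolation, which yields the same unique polynomial that an exact (n-1)th-order least-squares fit would. The test function (Runge's function) and all names are assumptions chosen for illustration and are not from the slides.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the unique (n-1)th-order polynomial through n points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def runge(x):
    """Smooth bell-shaped test function standing in for measured data."""
    return 1.0 / (1.0 + 25.0 * x * x)

# 11 equally spaced "data points" on [-1, 1]
nodes = [-1.0 + 0.2 * i for i in range(11)]
values = [runge(x) for x in nodes]

# Error at the data points: the polynomial reproduces them
error_at_points = max(abs(lagrange_eval(nodes, values, x) - y)
                      for x, y in zip(nodes, values))

# Error on a fine grid between the data points
grid = [-1.0 + 0.01 * i for i in range(201)]
error_between = max(abs(lagrange_eval(nodes, values, x) - runge(x))
                    for x in grid)
```

Here the function itself never exceeds 1, yet the error between the points is larger than the whole range of the data, even though the error at the points is zero.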
Example: multi-linear regression
Model 1: $$ y = a + b x_1 + c x_2 $$
Model 2: $$ y = b x_1 + c x_2 $$
X-input range includes two independent variables: x1 and x2
High P value for intercept in Model 1 suggests Model 2 without intercept, but there is a significant loss in adjR²
Multilinear Model Performance

[Figure: Predicted y vs Measured y for Model 1 and Model 2; both axes 0 to 12.]
Model performance isn't that great for either model, and Model 1 doesn't appear dramatically better than Model 2.
Note: for multi-linear models, we plot Predicted vs Measured y. A perfect model would place points directly on the 45-degree line.
Nonlinear Regression
Fitting the parameters of the van der Waals equation of state
Data for SO2
$$ P = \frac{RT}{V - b} - \frac{a}{V^2} $$
Find the values of a and b that give the best predictions for P, when compared to the measured values of P
Strategy for Nonlinear Regression

1) estimate initial values for a and b
2) compute predicted P values using the data for V and T
3) compute errors between the predicted and measured P values
4) sum the squares of these errors to compute SSE
5) have the Solver minimize SSE by adjusting the values of a and b
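The same strategy can be sketched in Python. This is not Excel's Solver algorithm: as a simple stand-in, the sketch uses a golden-section search over b and, at each trial b, solves for the best a in closed form (the van der Waals equation is linear in a). The data are synthetic, generated from made-up parameter values in SI units, and every name below is hypothetical.

```python
R = 8.314  # gas constant, J/(mol K)

def vdw_pressure(a, b, V, T):
    """van der Waals pressure (Pa) for molar volume V (m^3/mol) and T (K)."""
    return R * T / (V - b) - a / V ** 2

def sse(a, b, data):
    """Sum of squared errors between measured and predicted pressures."""
    return sum((P - vdw_pressure(a, b, V, T)) ** 2 for V, T, P in data)

def best_a_for_b(b, data):
    """For fixed b, SSE is quadratic in a, so the best a has a closed form."""
    num = sum((R * T / (V - b) - P) / V ** 2 for V, T, P in data)
    den = sum(1.0 / V ** 4 for V, T, P in data)
    return num / den

def fit_vdw(data, b_low, b_high, iterations=100):
    """Minimize SSE over b by golden-section search, with a solved at each b."""
    ratio = (5 ** 0.5 - 1) / 2

    def profile(b):
        return sse(best_a_for_b(b, data), b, data)

    lo, hi = b_low, b_high
    for _ in range(iterations):
        b1 = hi - ratio * (hi - lo)
        b2 = lo + ratio * (hi - lo)
        if profile(b1) < profile(b2):
            hi = b2  # minimum lies in [lo, b2]
        else:
            lo = b1  # minimum lies in [b1, hi]
    b = (lo + hi) / 2
    return best_a_for_b(b, data), b

# Synthetic "measurements" from made-up parameters (assumptions, not SO2 data)
A_TRUE, B_TRUE = 0.7, 6.0e-5
data = [(V, T, vdw_pressure(A_TRUE, B_TRUE, V, T))
        for V in (4e-4, 6e-4, 1e-3, 2e-3)
        for T in (350.0, 400.0, 450.0)]

# Search b between 0 (the b >= 0 constraint) and half the smallest volume
a_fit, b_fit = fit_vdw(data, 0.0, 0.5 * min(V for V, _, _ in data))
```

With real, noisy measurements the minimum SSE will not be zero, and a general-purpose optimizer adjusting a and b together is the closer analogue of the Solver approach.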
[Spreadsheet: basic data, with pressure calculated by both the ideal gas law and van der Waals; SSE is the sum of squares of the error column.]
[Spreadsheet formulas: Ideal Gas Calculation, van der Waals Calculation, Error Calculation, Sum of Squares Calculation]
Setting up Solver Parameters

Set SSE as the Target Cell, choose Minimize, by adjusting a and b, with the constraint b >= 0
Results
Fit of van der Waals Eqn for SO2 and Comparison to Ideal Gas Law
[Figure: Predicted Pressure (Pa) vs Measured Pressure (Pa), both axes 0 to 12,000,000, for van der Waals and Ideal Gas.]
Note departure of ideal gas predictions at higher pressures
