
Chi-square goodness-of-fit test

The chi-square goodness-of-fit test can be used to determine whether sample data
conform to an expected distribution when the data are categorical (nominal or
ordinal). The test determines whether the data fit a given distribution, such as
uniform, normal, …

χ² = Σ (fo − fe)² / fe          df = k – 1 – m
Where:
fo = frequency of observed (or actual) values
fe = frequency of expected (or theoretical) values
k = number of categories
m = number of parameters being estimated from the sample data
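
A minimal sketch of this test in Python with SciPy; the category counts below are illustrative, not from the slides, and ddof carries the m estimated parameters so that df = k – 1 – m:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([18, 22, 20, 40])              # observed frequencies, k = 4 categories
expected = np.full(4, 0.25) * observed.sum()       # expected counts under H0 (uniform here)
m = 0                                              # no parameters estimated from the sample

stat, p = chisquare(observed, expected, ddof=m)    # df = k - 1 - ddof = k - 1 - m
print(stat, p)
```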

Chi-square test for independence
The chi-square test for independence is based on the counts in a
contingency (or cross-tabs) table. It tests whether the counts for the
row categories are probabilistically independent of the counts for the
column categories.

χ² = Σij (Oij − Eij)² / Eij          df = (rows – 1)(cols – 1)
Where:
Oij = observed number of observations in row i, column j
Eij = expected number of observations in row i, column j (under independence)

Chi-square test – Local survey
◆ In a national survey, consumers were asked the question, ‘In general, how would
you rate the level of service that businesses in this country provide?’
◆ The distribution of responses is in the National column below.
◆ Suppose a manager wants to find out whether this result applies to the customers
of her store in the city.
◆ She did a similar survey of 207 randomly selected customers in her store and
observed the results in the Local column.
◆ She can use the chi-square test to see if her observed frequencies of responses
match the frequencies that would be expected from the national survey.

Response       National response   Local response (of 207 asked)
Excellent      8%                  21
Pretty good    47%                 109
Only fair      34%                 62
Poor           11%                 15
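
A minimal sketch of this test in Python, building the expected frequencies from the national percentages and the 207 local responses above (for these figures it works out to roughly χ² ≈ 6.2 on 3 degrees of freedom, p ≈ 0.10):

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([21, 109, 62, 15])           # local responses (207 customers)
national = np.array([0.08, 0.47, 0.34, 0.11])    # national proportions
expected = national * observed.sum()             # expected counts if the store matched the nation

chi_sq = np.sum((observed - expected) ** 2 / expected)
df = len(observed) - 1                           # k - 1 = 3 (no parameters estimated)
p_value = chi2.sf(chi_sq, df)
print(chi_sq, p_value)
```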
Hypothesis Testing – Local Survey Example

◆ Example using Excel

Steps in Hypothesis Testing

1: State the null and alternative hypotheses.


2: Make a judgment about the population distribution,
the level of measurement, and then select the
appropriate statistical test.
3: Decide upon the desired level of significance.
4: Collect data from a sample and compute the
statistical test to see if the level of significance is
met.
5: Accept or reject the null hypothesis.

Contingency Tables
Two way table

Test whether rows and columns are associated (or independent)

Can calculate expected numbers in each cell if rows and columns are independent

Compare with actual (observed) counts

χ² = Σ (Oi - Ei)²/Ei
Contingency Tables - Example
Two way table
eg. responses to question 6 (a, b, or c) by two groups:

Q6 Group 1 Group 2

a 10 18
b 12 22
c 15 26

The numbers in the table are counts (frequencies) of the number falling into each category.
Contingency Tables - Example

Q6    Group 1   Group 2   (column percentages)

a     27%       27%
b     32%       33%
c     41%       39%
Contingency Tables - Example

Q6 Group 1 Group 2
a 10 18
b 12 22
c 15 26
Chi-square test statistic χ² = 0.0142

p-value = 0.9929

Not significant
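
These figures can be reproduced with SciPy's contingency-table test; a minimal sketch using the counts above:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[10, 18],    # response a: Group 1, Group 2
                  [12, 22],    # response b
                  [15, 26]])   # response c

stat, p, df, expected = chi2_contingency(table)
print(stat, p, df)             # chi-square ≈ 0.0142, p ≈ 0.993 -> no evidence of association
```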
Statistical Decision
For t-test (for mean or proportion):
Null Hypothesis: no-change situation
For Chi square test:
Null Hypothesis: two variable sets are independent

test value t > t critical value (usually about 2): reject the null hypothesis

p-value < alpha (usually 0.05): reject the null hypothesis
test value chi-square > chi-square critical value: reject the null hypothesis
Type I and Type II errors
Two ways a hypothesis test result can be wrong:

I - the test finds the hypothesis is wrong, when it is in fact correct

II - the test finds the hypothesis is correct, when it is in fact wrong
Type I and Type II errors
                              REALITY
TEST FINDS              Hypothesis correct        Hypothesis wrong

Hypothesis correct      √                         × type II error

Hypothesis wrong        × type I error            √
                        (test significance level)
Type I and Type II errors
Prob value = observed probability of type I error

In control charts, control limits are often set at ± 3 standard deviations
– equivalent to setting the probability of a type I error at about 0.003
– minimises reacting when there is no need to

Using t = 2 is equivalent to setting the probability of a type I error at about 0.05
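
A small check of these two rules of thumb from the normal distribution (a sketch using SciPy):

```python
from scipy.stats import norm

for z in (2, 3):
    alpha = 2 * norm.sf(z)     # two-sided tail area beyond +/- z standard deviations
    print(z, round(alpha, 4))
# z = 2 -> about 0.0455 (the "t = 2 is roughly the 5% level" rule)
# z = 3 -> about 0.0027 (the 0.003 used for control chart limits)
```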

BUSM 4074 Management Decision Making

Prof. Clive Morley

Graduate School of Business

4. Multiple regression

5. Multiple regression (cont)


Unit 4&5 - Learning Objectives
◆ To understand the use of the multiple regression technique,
including linear, log-log, logit, autoregressive and time series
models
◆ To be able to carry out straightforward multiple regression
model estimation
◆ To be able to interpret standard computer output from a multiple
regression exercise, including to assess variables for
significance, estimate the size of an explanatory variable’s
impact on the dependent variable, assess model fit and to use
the model to estimate values of the dependent variable

… as my salary increases, computers are getting
cheaper,
therefore to get cheaper computers, pay me more

What is wrong with this (very attractive) argument?


Multiple regression

A very powerful, widely used statistical technique
Many applications in all sorts of areas
Used to estimate the relationship between variables

For example, Y might be the sales of a certain item and X
its price. The linear relationship estimated is:
Y = a + bX
Multiple regression

The parameters a and b are estimated from data on the variables X and Y

Correlation establishes whether a linear relationship exists and how strong it is
Regression estimates what the relationship is
Multiple regression

The model is readily extended to include other explanatory
variables: for example, sales (Y) might depend on price
(X1), buyers’ incomes (X2) and advertising expenditure
(X3), giving the equation to be estimated

Y = a + b1X1 + b2X2 + b3X3

Data on a number of cases (eg. various sales areas or
different times) is needed for all the variables
Multiple regression

The explanatory variables do not exactly predict the value of Y
– due to random effects
– the impacts of other (hopefully minor) variables, etc.

so the equation does not exactly fit → residuals
Purposes of multiple regression
◆ to estimate the equation, so we can predict Y for given
values of the explanatory variables, or
◆ to estimate the effects of variables on Y (through the b
parameters of the variables of interest, and also through
the variables’ correlation with Y), or
◆ to determine which potential explanatory variables
have a significant impact on Y (through testing the
significance of the relevant b values).
Theory – Least squares

The computer finds the values for the parameters that give the “line of best fit”

Best fit is defined as minimising the sum of squared errors (SSE)
Theory – model specification

Y is some function of a lot of explanatory variables

Narrow the explanatory variables down to those expected to be important (ignore others)

Then specify the functional form of the relationship – linear is the usual starting point for regression
(but see the discussion of log-log models below)
Theory – model specification

Model specification – which variables, linear (or other) form, etc. – is based on relevant theory

The estimated relationship is then based on data
Multiple regression

The overall fit of the equation estimated is measured by R-squared (R²)
– the proportion of the variation in Y explained by the equation
– also the square of the correlation between the fitted and actual Y values

Each parameter estimated (and hence each variable) can be tested for individual significance
Linear Regression – Example
Data: House Price Sq Feet
(y) (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
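
A minimal sketch of fitting this simple regression in Python with statsmodels, using the ten data points above; the coefficients it prints may differ a little from the output quoted on the surrounding slides:

```python
import numpy as np
import statsmodels.api as sm

sq_feet = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price   = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])   # house price ($'000)

X = sm.add_constant(sq_feet)       # adds the intercept (constant) term
fit = sm.OLS(price, X).fit()
print(fit.params)                  # intercept and slope
print(fit.rsquared)                # R-squared
print(fit.summary())               # full table: coefficients, t values, p values
```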
Linear Regression – Example

Plot of data with fitted line:
y = 75.814 + 0.123 × Sq.Feet
Simple Linear Regression Model
Yi = β0 + β1 Xi + εi

[Diagram: regression line with intercept β0 and slope β1; εi is the random error for the observed value of Y at Xi]
Excel Residual Output for House Price model
X (SqFt)   Y ($’000)   Predicted Ŷ   Residual (Y − Ŷ)
1400       245         251.92        -6.92316
1600       312         273.88        38.12329
1700       279         284.85        -5.85348
1875       308         304.06        3.93716
1100       199         218.99        -19.99284
1550       219         268.39        -49.38832
2350       405         356.20        48.79749
2450       324         367.18        -43.17929
1425       319         254.67        64.33264
1700       255         284.85        -29.85348

The residuals show how well the regression line fits the data
points. The best and worst predictions were 3.94 and 64.3, respectively.
Measures of variation
[Diagram: decomposition of the variation of Y around its mean into SSR and SSE]

SSyy = Σ(Yi − Ȳ)²
SSR = Σ(Ŷi − Ȳ)²
SSE = Σ(Yi − Ŷi)²

SSE: Sum of Squares of Error, SSR: Sum of Squares of Regression
Measures of variation

◆ Total variation is made up of two parts: SSyy = SSR + SSE

Total Sum of Squares:        SSyy = Σ(Yi − Ȳ)²
Regression Sum of Squares:   SSR = Σ(Ŷi − Ȳ)²
Error Sum of Squares:        SSE = Σ(Yi − Ŷi)²

SSyy = Total Sum of Squares
  Measures the variation of the Yi values around their mean Ȳ
SSR = Regression Sum of Squares
  Explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
  Variation attributable to factors other than the relationship between X and Y
Where: Ȳ = average value of the dependent variable
  Yi = observed values of the dependent variable
  Ŷi = predicted value of Y for the given Xi value
Standard Error of the Estimate

The standard error of the estimate is the standard deviation of the


error of a regression model

Sum of Squares SSE = ∑ ( y − yˆ ) 2

Error
= ∑ y 2 − b0 ∑ y − b1 ∑ xy

Standard Error SSE


of the
Estimate
se = n −2

Standard Error of the Estimate tells us how spread-out the errors is.
33
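
A small Python sketch of these quantities; variation_measures is a hypothetical helper, and the n − 2 divisor applies to simple (one-X) regression:

```python
import numpy as np

def variation_measures(y, y_hat):
    """Decompose total variation and return the standard error of the estimate."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares, SSyy
    ssr = np.sum((y_hat - y.mean()) ** 2)    # regression (explained) sum of squares
    sse = np.sum((y - y_hat) ** 2)           # error (residual) sum of squares
    se = np.sqrt(sse / (len(y) - 2))         # standard error of the estimate
    return sst, ssr, sse, se

# e.g. with the house-price fit above: variation_measures(price, fit.fittedvalues)
```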
Linear Regression – Example
Computer output:
Correlation R = 0.837 , R-squared = 0.700

Coefficient t sig

Constant 75.813 2.508 0.0204

Sq. feet 0.123 7.009 0.0000


Linear regression - Example
the model estimated is:
House Price = 75.813 + 0.123× Sq.Feet
◆ correlation between House Price and Sq. Feet is high, at
0.837, and the fit of the regression model is quite strong
R² = 0.700, i.e. 70%
◆ The Sq. Feet variable is highly significant
t = 7.009, p = 0.0000

◆ The implicit null hypothesis is that the coefficient is zero, i.e. the variable
has no impact
Linear regression - Example

Add another variable to data - Location


Price Sq. Feet Location
245 1400 2
312 1600 3
279 1700 4
308 1875 3
199 1100 5
219 1550 1
405 2350 1
324 2450 5
319 1425 4
etc
Linear regression - Example
Computer output:
Correlation R = 0.839 , R-squared = 0.705
Without Location: Correlation R = 0.837 , R-squared = 0.700

Coefficient t sig

Constant 73.510 2.366 0.0282

Sq. Feet 0.120 6.475 0.0000


Location 2.283 0.525 0.6050
Linear Regression – Example

Slight improvement in R²

Location is not significant (‘sig’ or p-value high)
– consider dropping it from the model
Coefficient t sig
Constant 73.510 2.366 0.0282
Sq. Feet 0.120 6.475 0.0000
Location 2.283 0.525 0.6050
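
A sketch of adding the second variable in Python with statsmodels, using the nine rows listed above (the slide's output uses the full data set, so the numbers will differ somewhat):

```python
import pandas as pd
import statsmodels.api as sm

# First nine rows from the slide (the deck continues with 'etc')
houses = pd.DataFrame({
    'Price':    [245, 312, 279, 308, 199, 219, 405, 324, 319],
    'SqFeet':   [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425],
    'Location': [2, 3, 4, 3, 5, 1, 1, 5, 4],
})

X = sm.add_constant(houses[['SqFeet', 'Location']])
fit = sm.OLS(houses['Price'], X).fit()
print(fit.summary())    # check each variable's p-value ('P>|t|' column);
                        # a high value, like Location's 0.605 on the slide, suggests dropping it
```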
Linear regression - example
Model Market to Book Value (MBV) as function of Revenue

Data
Company MBV Revenue
1 2.011 39.505
2 1.814 4.165
3 1.522 10.406
4 1.826 7.602
5 1.824 2.942
6 1.337 5.228
7 1.650 1.697
etc
Linear regression - example
Output
Dep Var: MBV N: 71
Multiple R: 0.318
Squared multiple R: 0.101

variable coefficient t value sig

Constant 2.010 11.465 0.000

Revenue 0.046 2.789 0.007


Linear regression - example

Model is:
MBV = 2.010 + 0.046× Revenue

Fit not great (R² = 0.10, 10%)


But significant (F = 7.778, p = 0.007)

Revenue variable is significant


(t = 2.789, p = 0.007)
Linear regression - example

More factors (variables) impact on MBV and need to be


considered
*** WARNING ***
Case 1 has large leverage
(Leverage = 0.243)
Case 8 has large leverage
(Leverage = 0.163)
Case 56 is an outlier
(Standardized Residual = 5.167)
Durbin-Watson D Statistic 1.682
First Order Autocorrelation 0.140
Multiple regression
Avoid step-wise regression
Look for non-linear patterns in the scatter plot
Diagnostic checks
◆ Multicollinearity (different x’s move together in systematic way)
◆ Autocorrelation (successive error terms are correlated with each
other)
◆ Outliers (data points that are not together with the rest)
◆ Heteroscedasticity (non-constant variance)
◆ Leverage (observation with large effects on outcomes)
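
A minimal sketch of how these checks might be run with statsmodels; 'fit' and 'X' are assumed to be a fitted OLS result and its design matrix (with constant):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

def diagnostics(fit, X):
    # Multicollinearity: VIF above ~10 (tolerance below ~0.1) is a warning sign
    vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    # Autocorrelation: Durbin-Watson close to 2 suggests little first-order autocorrelation
    dw = durbin_watson(fit.resid)
    # Outliers and leverage
    infl = fit.get_influence()
    std_resid = infl.resid_studentized_internal   # |values| around 3 or more flag outliers
    leverage = infl.hat_matrix_diag               # large values flag high-leverage cases
    # Heteroscedasticity: plot fit.resid against fit.fittedvalues and look for a fan shape
    return vif, dw, std_resid, leverage
```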
Multiple regression - Example

Hospital and Nursing Salary example


(9.10 of textbook)
Multiple regression - Example

Dialogue box
Dependent: Annual Nursing Salary
Independents: Number of beds in home
Annual medical in-patient days
Annual total patient days
Rural (1) and non-rural (0) homes
Multiple regression - Example

Model Summary
R R Square Adjusted R Square
0.8803 0.775 0.7557

Std. Error of the Estimate $82,024.63

ANOVA F Sig
Regression 40.4375 0.000
Multiple regression - Example
R = 0.88: coefficient of correlation, the strength of the relationship between the two
variables. R = −1: strong negative relationship; R = 1: strong positive
relationship; R = 0: no relationship between the two variables.
R Square = 0.775: coefficient of determination. 77.5% of the variation in Y
is explained by the X variables; the other 22.5% is due to
other factors. This fit is quite strong.
Adjusted R Square = 0.7557: adjusted for multiple variables. A
decrease in Adjusted R Square means the newly added variable is not
significant.
Multiple regression - Example

Std. Error of the Estimate $82,024.63


Sig = 0.000: significant fit
p = 0.1799 (beds): too high (compared to α); should
consider dropping “beds” as a variable
Multiple regression - Example

                         Coefficient   Standard Error   t value   p value (sig)
Constant (Intercept) 113.5003 495.4654 0.2291 0.8198
Number of beds in home 9.6399 7.0804 1.3615 0.1799
Annual medical in-patient -7.4072 2.4012 -3.0848 0.0034
days (100s)
Annual total patient days 15.7674 2.7550 5.7232 0.0000
(100s)
Rural (1) and non-rural (0) -79.5796 288.1857 -0.2761 0.7837
homes
Multiple regression - Example
• The interpretation of the coefficients is that if the in-patient days,
total patient days and rural factor are held constant, then the annual
nursing salary is expected to increase by $9.64 for each extra bed in
home.
• Similarly, annual nursing salary is expected to change by −$740.72, +$1576.74 and
−$79.58 for each extra in-patient day, patient day and for rural homes,
respectively, other variables held constant.
• The $11,300 can be interpreted as the annual base salary.
Coefficients   Standard Error   t value   p value (sig)
Constant (Intercept) 113.5003 495.4654 0.2291 0.8198
Number of beds in home 9.6399 7.0804 1.3615 0.1799
Annual medical in-patient days (100s) -7.4072 2.4012 -3.0848 0.0034
Annual total patient days (100s) 15.7674 2.7550 5.7232 0.0000
Rural (1) and non-rural (0) homes -79.5796 288.1857 -0.2761 0.7837
Multiple regression - Example

◆ Compare the intercepts and the slopes of the multiple
regression with those of the simple linear regression:
changes have occurred (difficult to analyse in detail).
◆ se is still the standard error of the estimate. Note that the multiple
regression yields a better se than the simple linear regression.
◆ R² – similarly (but it would increase with extra x’s)
◆ Adjusted R² – a decrease indicates an added x does not
belong in the equation.
Multiple regression - Example

◆ Tolerance stats OK (> 0.1), so no multicollinearity
issue. If an individual R² is too high (almost equal to the R² of the
multiple regression): suspect multicollinearity!
◆ Durbin-Watson stat d = 2.4789, suggesting some negative
autocorrelation. Values between about 1 and 2 would indicate no
autocorrelation concern.
◆ Outliers – see graphs of residuals: normal shape on the
histogram, random (no pattern) on the scatter plots.
Multiple regression - Example

[Histogram of regression standardized residuals; dependent variable: Current Salary.
Std. Dev = 1.00, Mean = 0.00, N = 474]

The standardized residual distribution is relatively normal – a relatively good fit.
Multiple regression - Example

Scatter plot – randomly distributed, on both sides of 0.00

[Scatterplot of regression standardized residuals against the estimate; dependent variable: Current Salary]

The red plot is an example of heteroscedasticity (unequal variance distribution).
Multiple regression - Example

Dummy variables – categorical data, related to the dependent variable
Other names: indicators, 0-1 variables
If dummy variable = 1: the case is in that category
If dummy variable = 0: the case is not in that category

The coefficient of this variable indicates the difference in the dependent
variable due to this (dummy) variable
Multiple regression - Example
Coefficients   Standard Error   t value   p value (sig)
Constant (Intercept) 113.5003 495.4654 0.2291 0.8198
Number of beds in home 9.6399 7.0804 1.3615 0.1799
Annual medical in-patient days (100s) -7.4072 2.4012 -3.0848 0.0034
Annual total patient days (100s) 15.7674 2.7550 5.7232 0.0000
Rural (1) and non-rural (0) homes -79.5796 288.1857 -0.2761 0.7837

Salary = 113.50 + 9.64 Bed – 7.41 In-ptDay + 15.77 Tot-ptDay – 79.58 Rural

Between Rural = 0 and Rural = 1: salary difference = –$7,958
(rural is lower)
Two or more categorical variables can be involved. The coefficient
indicates the difference in y when the rest is the same.
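
A small illustration of how a 0-1 dummy shifts the prediction, using the slide's coefficients; the other input values are made up for the example:

```python
def salary(beds, inpt_days, totpt_days, rural):
    # Coefficients from the slide's fitted equation
    return (113.50 + 9.64 * beds - 7.41 * inpt_days
            + 15.77 * totpt_days - 79.58 * rural)

non_rural = salary(beds=50, inpt_days=100, totpt_days=150, rural=0)   # assumed inputs
rural     = salary(beds=50, inpt_days=100, totpt_days=150, rural=1)
print(rural - non_rural)   # -79.58: exactly the dummy's coefficient, whatever the other values
```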
Analysing a Regression
◆ p-value of the Regression
◆ p-value of each x, to consider dropping it or not
◆ Adjusted R-square value
◆ Standard Error of the Regression estimate
◆ Scatter plot of residuals – randomness, outliers, heteroscedasticity (non-
equal variance)
◆ Histogram of the residuals
◆ Durbin-Watson statistic (d)
Linear, Quadratic and Log regression
example

The Public Service Electric Company produces different
quantities of electricity each month, depending on the
demand. The file Poly and Log examples - Power.xls lists
the number of units of electricity produced (Units) and
the total cost of producing these (Cost) for a 36-month
period. How can regression be used to analyse the
relationship between Cost and Units?
Multiple regression - Example

R Square 0.7359
Standard Error 2733.7424

R Square 0.8216
Standard Error 2280.7998
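
A sketch of how linear and quadratic specifications might be compared in Python; the file name is taken from the slide, but the column names 'Units' and 'Cost' are assumptions:

```python
import pandas as pd
import statsmodels.api as sm

power = pd.read_excel('Poly and Log examples - Power.xls')    # assumed columns: Units, Cost

linear = sm.OLS(power['Cost'], sm.add_constant(power[['Units']])).fit()

quad_X = power[['Units']].copy()
quad_X['Units_sq'] = quad_X['Units'] ** 2                     # add a squared term
quadratic = sm.OLS(power['Cost'], sm.add_constant(quad_X)).fit()

# Compare R-squared and the standard error of each specification
print(linear.rsquared, linear.mse_resid ** 0.5)
print(quadratic.rsquared, quadratic.mse_resid ** 0.5)
```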
Log model
Very often we use multiple regression to fit a multiplicative model:
Y = a X1^b1 X2^b2 X3^b3

If an explanatory variable changes by 1%, the dependent variable
changes by a (roughly) constant percentage – that variable’s b coefficient

This can be estimated by making a logarithmic transformation of
the equation, which gives:
ln(Y) = ln(a) + b1 ln(X1) + b2 ln(X2) + b3 ln(X3)
Log model

Thus we can calculate ln(Y), ln(X1), ln(X2), ln(X3)

and regress these variables in the usual way, to estimate
the parameters of the original equation.
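
A minimal sketch of this transform-then-regress approach with statsmodels; 'data' and its column names are assumptions, not the slides' data:

```python
import numpy as np
import statsmodels.api as sm

# 'data' is assumed to be a DataFrame of positive-valued columns Y, X1, X2, X3
logged = np.log(data[['Y', 'X1', 'X2', 'X3']])

X = sm.add_constant(logged[['X1', 'X2', 'X3']])
fit = sm.OLS(logged['Y'], X).fit()

b = fit.params              # b1, b2, b3 are the elasticities; 'const' estimates ln(a)
a = np.exp(b['const'])      # back-transform to recover a in the multiplicative model
```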
Log model example

File CarSales.xls contains annual data (1970 – 1999) on


domestic auto sales in the United States. The variables
are defined as:
Sales: annual domestic auto sales (in number of units)
PriceIndex: consumer price index of transportation
Income: real disposable income
Interest: prime rate of interest
Multiple regression - Example

MultiRegres Coefficients t value p value Regression and Correlation


Intercept 513941538.55 0.7356 0.4688 Observations 30
Year -258651.57 -0.7234 0.4761 R Square 0.5414
PriceIndex -18121.97 -0.4786 0.6364 Standard Error 758049.7773
Income 2175.75 1.1204 0.2732 Adjusted R Square 0.4680
Interest -8895378.05 -1.4810 0.1511 Multiple R 0.7358

LogRegres Coefficients t value p value Regression and Correlation


Intercept -110360558.48 -45.9500 0.0000 Observations 30
Log(Sales) 7522741.47 54.4195 0.0000 R Square 0.9956
Log(PriceIndex) 35983.70 0.2297 0.8202 Standard Error 74199.1103

Log(Income) -162258.29 -0.6222 0.5395 Adjusted R Square 0.9949

Log(Interest) -13588.13 -0.2133 0.8328 Multiple R 0.9978


Multiple regression - Example
Log model

Probably a slightly better model

R-square = 0.99, good
Fewer outliers, slightly better
Residual plots not necessarily better
Multiple Regression Goal

Remove any unimportant or problematic (multicollinearity,
autocorrelation, etc.) variables from the equation and decide
which variable(s) are important for the regression model.

Use that model for your prediction.
Multiple regression – time series example

Plot CarSales.xls data, Year vs. Sales


Multiple regression – time series example

Period Sales (‘000)


2003 Quarter I 25.4
2003 Quarter II 23.8
2003 Quarter III 22.0
2003 Quarter IV 28.6
2004 Quarter I 28.5
2004 Quarter II 27.0
etc
Multiple regression – time series example

[Plot of SALES (’000) against TIME]
Multiple regression – time series example
Create dummy variables for the Quarters and time period

Period Sales Time QII QIII QIV


2003 Q I 25.4 1 0 0 0
2003 Q II 23.8 2 1 0 0
2003 Q III 22.0 3 0 1 0
2003 Q IV 28.6 4 0 0 1
2004 Q I 28.5 5 0 0 0
2004 Q II 27.0 6 1 0 0
etc
Multiple regression – time series example
Squared multiple R: 0.987
Effect Coefficient t P
CONSTANT 23.679 50.5 0.000
TIME 1.005 28.5 0.000
QII -2.525 -5.2 0.000
QIII -5.070 -9.8 0.000
QIV 0.450 0.9 0.401
Could drop QIV and re-estimate
Multiple regression – time series example
Model as estimated is:
Sales
= 23.679 + 1.005× Time - 2.525QII – 5.070QIII + 0.450QIV

Say data ended at Time = 24 i.e. 2008 QIV


Use model to forecast
e.g. forecast sales in 2009 in quarters I and II
Multiple regression – time series example
2009 quarter I is Time = 25, QI = 1, QII = 0, QIII = 0, QIV = 0
Sales = 23.679 + 1.005× 25 - 0 – 0 + 0
= 48.808 i.e. $48,800

2009 quarter II is Time = 26, QI = 0, QII = 1, QIII = 0, QIV = 0


Sales = 23.679 + 1.005× 26 - 0 – 2.525 + 0
= 47.284 i.e. $47,300
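
A small sketch of these forecasts in Python, using the coefficients reported above:

```python
# Coefficients from the fitted seasonal-dummy model on the earlier slide
b0, b_time, b_q2, b_q3, b_q4 = 23.679, 1.005, -2.525, -5.070, 0.450

def forecast(time, quarter):
    """Forecast sales ('000) for a time index and quarter number (1-4)."""
    return (b0 + b_time * time
            + b_q2 * (quarter == 2)
            + b_q3 * (quarter == 3)
            + b_q4 * (quarter == 4))

print(forecast(25, 1))   # 2009 Q I  -> about 48.8
print(forecast(26, 2))   # 2009 Q II -> about 47.3
```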
Autoregression

Another way of dealing with time series is autoregression

Often used when Durbin-Watson indicates autocorrelation (a common issue with time series data)
Or because it makes theoretical sense that one period’s value depends (partly) on the previous value of the series

Use previous (lagged) values as an explanatory variable
Autoregression
In example, add another variable, which is the lagged sales

Period Sales Time QII QIII QIV lagSales


2003 Q I 25.4 1 0 0 0 -
2003 Q II 23.8 2 1 0 0 25.4
2003 Q III 22.0 3 0 1 0 23.8
2003 Q IV 28.6 4 0 0 1 22.0
2004 Q I 28.5 5 0 0 0 28.6
2004 Q II 27.0 6 1 0 0 28.5
etc
Autoregression

The lagged variable would replace the Time (trend) variable

First data point is lost, as we don’t have a lagged value for it


Can handle seasonality by having another variable, Sales
lagged by the seasonality period (e.g. 4 terms)
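
A sketch of setting up the lagged variable with pandas; 'sales' is an assumed DataFrame holding the quarterly data and dummies:

```python
import statsmodels.api as sm

# 'sales' is assumed to have columns 'Sales', 'QII', 'QIII', 'QIV' as in the table above
sales['lagSales'] = sales['Sales'].shift(1)   # previous quarter's sales
ar_data = sales.dropna()                      # the first row has no lagged value

X = sm.add_constant(ar_data[['lagSales', 'QII', 'QIII', 'QIV']])
ar_fit = sm.OLS(ar_data['Sales'], X).fit()
print(ar_fit.summary())
```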
Logit regression

If the dependent variable is categorical, not metric

e.g. for accounting graduates, membership of CPA Aust (or
not) is the dependent variable
X variables might be gender, age, importance of
joining cost, importance of brand status, etc.

Regression is possible, but with special technical issues
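
A minimal sketch of a logit model with statsmodels; 'grads' and its column names are illustrative assumptions, not the slides' data:

```python
import statsmodels.api as sm

# 'grads' is an assumed DataFrame: 'member' is 1/0 (CPA member or not),
# and the explanatory columns are illustrative
X = sm.add_constant(grads[['gender', 'age', 'joining_cost', 'brand_status']])
logit_fit = sm.Logit(grads['member'], X).fit()
print(logit_fit.summary())      # coefficients are on the log-odds scale
```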


Reference

Ragsdale (2008), chapter 9 and pp. 522–28
