
Chi-square goodness-of-fit test

The chi-square goodness-of-fit test can be used to determine whether sample data
conform to an expected distribution when the data are categorical (nominal or
ordinal). The test determines whether the data fit a given distribution, such as
uniform, normal, …

χ² = Σ (fo − fe)² / fe          df = k – 1 – m
Where:
fo = frequency of observed (or actual) values
fe = frequency of expected (or theoretical) values
k = number of categories
m = number of parameters being estimated from the sample data
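
A minimal sketch of this test in Python with SciPy; the category counts below are illustrative, not from the slides, and ddof carries the m estimated parameters so that df = k – 1 – m:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([18, 22, 20, 40])              # observed frequencies, k = 4 categories
expected = np.full(4, 0.25) * observed.sum()       # expected counts under H0 (uniform here)
m = 0                                              # no parameters estimated from the sample

stat, p = chisquare(observed, expected, ddof=m)    # df = k - 1 - ddof = k - 1 - m
print(stat, p)
```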

Chi-square test for independence
The chi-square test for independence is based on the counts in a
contingency (or cross-tabs) table. It tests whether the counts for the
row categories are probabilistically independent of the counts for the
column categories.

χ² = Σij (Oij − Eij)² / Eij          df = (rows – 1)(cols – 1)
Where:
Oij = observed number of observations in row i, column j
Eij = expected number of observations in row i, column j (under independence)

Chi-square test – Local survey
◆ In a national survey, consumers were asked the question, ‘In general, how would
you rate the level of service that businesses in this country provide?’
◆ The distribution of responses is in the National column below.
◆ Suppose a manager wants to find out whether this result applies to the customers
of her store in the city.
◆ She did a similar survey of 207 randomly selected customers in her store and
observed the results in the Local column.
◆ She can use the chi-square test to see if her observed frequencies of responses
match the frequencies that would be expected from the national survey.

Response       National response   Local response (of 207 asked)
Excellent      8%                  21
Pretty good    47%                 109
Only fair      34%                 62
Poor           11%                 15
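
A minimal sketch of this test in Python, building the expected frequencies from the national percentages and the 207 local responses above (for these figures it works out to roughly χ² ≈ 6.2 on 3 degrees of freedom, p ≈ 0.10):

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([21, 109, 62, 15])           # local responses (207 customers)
national = np.array([0.08, 0.47, 0.34, 0.11])    # national proportions
expected = national * observed.sum()             # expected counts if the store matched the nation

chi_sq = np.sum((observed - expected) ** 2 / expected)
df = len(observed) - 1                           # k - 1 = 3 (no parameters estimated)
p_value = chi2.sf(chi_sq, df)
print(chi_sq, p_value)
```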
Hypothesis Testing – Local Survey Example

◆ Example using Excel

Steps in Hypothesis Testing

1: State the null and alternative hypotheses.


2: Make a judgment about the population distribution,
the level of measurement, and then select the
appropriate statistical test.
3: Decide upon the desired level of significance.
4: Collect data from a sample and compute the
statistical test to see if the level of significance is
met.
5: Accept or reject the null hypothesis.

Contingency Tables
Two way table

Test whether rows and columns are associated (or independent)

Can calculate expected numbers in each cell if rows and columns are independent

Compare with actual (observed) counts

χ² = Σ (Oi - Ei)²/Ei
Contingency Tables - Example
Two way table
eg. responses to question 6 (a, b, or c) by two groups:

Q6 Group 1 Group 2

a 10 18
b 12 22
c 15 26

The numbers in the table are counts (frequencies) of the number falling into each category.
Contingency Tables - Example

Q6    Group 1   Group 2   (column percentages)

a     27%       27%
b     32%       33%
c     41%       39%
Contingency Tables - Example

Q6 Group 1 Group 2
a 10 18
b 12 22
c 15 26
Chi-square test statistic χ² = 0.0142

p-value = 0.9929

Not significant
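
These figures can be reproduced with SciPy's contingency-table test; a minimal sketch using the counts above:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[10, 18],    # response a: Group 1, Group 2
                  [12, 22],    # response b
                  [15, 26]])   # response c

stat, p, df, expected = chi2_contingency(table)
print(stat, p, df)             # chi-square ≈ 0.0142, p ≈ 0.993 -> no evidence of association
```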
Statistical Decision
For t-test (for mean or proportion):
Null Hypothesis: no-change situation
For Chi square test:
Null Hypothesis: two variable sets are independent

test value t > t critical value (usually about 2): reject the null hypothesis

p-value < alpha (usually 0.05): reject the null hypothesis
test value chi-square > chi-square critical value: reject the null hypothesis
Type I and Type II errors
Two ways a hypothesis test result can be wrong:

I - the test finds the hypothesis is wrong, when it is in fact correct

II - the test finds the hypothesis is correct, when it is in fact wrong
Type I and Type II errors
                              REALITY
TEST FINDS              Hypothesis correct        Hypothesis wrong

Hypothesis correct      √                         × type II error

Hypothesis wrong        × type I error            √
                        (test significance level)
Type I and Type II errors
Prob value = observed probability of type I error

In control charts, control limits are often set at ± 3 standard deviations
– equivalent to setting the probability of a type I error at about 0.003
– minimises reacting when there is no need to

Using t = 2 is equivalent to setting the probability of a type I error at about 0.05
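
A small check of these two rules of thumb from the normal distribution (a sketch using SciPy):

```python
from scipy.stats import norm

for z in (2, 3):
    alpha = 2 * norm.sf(z)     # two-sided tail area beyond +/- z standard deviations
    print(z, round(alpha, 4))
# z = 2 -> about 0.0455 (the "t = 2 is roughly the 5% level" rule)
# z = 3 -> about 0.0027 (the 0.003 used for control chart limits)
```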

BUSM 4074 Management Decision Making

Prof. Clive Morley

Graduate School of Business

4. Multiple regression

5. Multiple regression (cont)


Unit 4&5 - Learning Objectives
◆ To understand the use of the multiple regression technique,
including linear, log-log, logit, autoregressive and time series
models
◆ To be able to carry out straightforward multiple regression
model estimation
◆ To be able to interpret standard computer output from a multiple
regression exercise, including to assess variables for
significance, estimate the size of an explanatory variable’s
impact on the dependent variable, assess model fit and to use
the model to estimate values of the dependent variable

… as my salary increases, computers are getting
cheaper,
therefore to get cheaper computers, pay me more

What is wrong with this (very attractive) argument?


Multiple regression

A very powerful, widely used statistical technique
Many applications in all sorts of areas
Used to estimate the relationship between variables

For example, Y might be the sales of a certain item and X
its price. The linear relationship estimated is:
Y = a + bX
Multiple regression

The parameters a and b are estimated from data on the variables X and Y

Correlation establishes whether a linear relationship exists and how strong it is
Regression estimates what the relationship is
Multiple regression

The model is readily extended to include other explanatory
variables: for example, sales (Y) might depend on price
(X1), buyers’ incomes (X2) and advertising expenditure
(X3), giving the equation to be estimated

Y = a + b1X1 + b2X2 + b3X3

Data on a number of cases (eg. various sales areas or
different times) is needed for all the variables
Multiple regression

The explanatory variables do not exactly predict the value of Y
– due to random effects
– the impacts of other (hopefully minor) variables, etc.

so the equation does not exactly fit → residuals
Purposes of multiple regression
◆ to estimate the equation, so we can predict Y for given
values of the explanatory variables, or
◆ to estimate the effects of variables on Y (through the b
parameters of the variables of interest, and also through
the variables’ correlation with Y), or
◆ to determine which potential explanatory variables
have a significant impact on Y (through testing the
significance of the relevant b values).
Theory – Least squares

The computer finds the values for the parameters that give the “line of best fit”

Best fit is defined as minimising the sum of squared errors (SSE)
Theory – model specification

Y is some function of a lot of explanatory variables

Narrow the explanatory variables down to those expected to be important (ignore others)

Then specify the functional form of the relationship – linear is the usual starting point for regression
(but see the discussion of log-log models below)
Theory – model specification

Model specification – which variables, linear (or other) form, etc. – is based on relevant theory

The estimated relationship is then based on data
Multiple regression

The overall fit of the equation estimated is measured by R-squared (R²)
– the proportion of the variation in Y explained by the equation
– also the square of the correlation between the fitted and actual Y values

Each parameter estimated (and hence each variable) can be tested for individual significance
Linear Regression – Example
Data: House Price Sq Feet
(y) (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
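
A minimal sketch of fitting this simple regression in Python with statsmodels, using the ten data points above; the coefficients it prints may differ a little from the output quoted on the surrounding slides:

```python
import numpy as np
import statsmodels.api as sm

sq_feet = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price   = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])   # house price ($'000)

X = sm.add_constant(sq_feet)       # adds the intercept (constant) term
fit = sm.OLS(price, X).fit()
print(fit.params)                  # intercept and slope
print(fit.rsquared)                # R-squared
print(fit.summary())               # full table: coefficients, t values, p values
```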
Linear Regression – Example

Plot of data with fitted line:
y = 75.814 + 0.123 × Sq.Feet
Simple Linear Regression Model
Yi = β0 + β1 Xi + εi

[Diagram: regression line with intercept β0 and slope β1; εi is the random error for the observed value of Y at Xi]
Excel Residual Output for House Price model
X (SqFt)   Y ($’000)   Predicted Ŷ   Residual (Y − Ŷ)
1400       245         251.92        -6.92316
1600       312         273.88        38.12329
1700       279         284.85        -5.85348
1875       308         304.06        3.93716
1100       199         218.99        -19.99284
1550       219         268.39        -49.38832
2350       405         356.20        48.79749
2450       324         367.18        -43.17929
1425       319         254.67        64.33264
1700       255         284.85        -29.85348

The residuals show how well the regression line fits the data
points. The best and worst predictions were 3.94 and 64.3, respectively.
Measures of variation
[Diagram: decomposition of the variation of Y around its mean into SSR and SSE]

SSyy = Σ(Yi − Ȳ)²
SSR = Σ(Ŷi − Ȳ)²
SSE = Σ(Yi − Ŷi)²

SSE: Sum of Squares of Error, SSR: Sum of Squares of Regression
Measures of variation

◆ Total variation is made up of two parts: SSyy = SSR + SSE

Total Sum of Squares:        SSyy = Σ(Yi − Ȳ)²
Regression Sum of Squares:   SSR = Σ(Ŷi − Ȳ)²
Error Sum of Squares:        SSE = Σ(Yi − Ŷi)²

SSyy = Total Sum of Squares
  Measures the variation of the Yi values around their mean Ȳ
SSR = Regression Sum of Squares
  Explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
  Variation attributable to factors other than the relationship between X and Y
Where: Ȳ = average value of the dependent variable
  Yi = observed values of the dependent variable
  Ŷi = predicted value of Y for the given Xi value
Standard Error of the Estimate

The standard error of the estimate is the standard deviation of the


error of a regression model

Sum of Squares SSE = ∑ ( y − yˆ ) 2

Error
= ∑ y 2 − b0 ∑ y − b1 ∑ xy

Standard Error SSE


of the
Estimate
se = n −2

Standard Error of the Estimate tells us how spread-out the errors is.
33
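
A small Python sketch of these quantities; variation_measures is a hypothetical helper, and the n − 2 divisor applies to simple (one-X) regression:

```python
import numpy as np

def variation_measures(y, y_hat):
    """Decompose total variation and return the standard error of the estimate."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares, SSyy
    ssr = np.sum((y_hat - y.mean()) ** 2)    # regression (explained) sum of squares
    sse = np.sum((y - y_hat) ** 2)           # error (residual) sum of squares
    se = np.sqrt(sse / (len(y) - 2))         # standard error of the estimate
    return sst, ssr, sse, se

# e.g. with the house-price fit above: variation_measures(price, fit.fittedvalues)
```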
Linear Regression – Example
Computer output:
Correlation R = 0.837 , R-squared = 0.700

Coefficient t sig

Constant 75.813 2.508 0.0204

Sq. feet 0.123 7.009 0.0000


Linear regression - Example
the model estimated is:
House Price = 75.813 + 0.123× Sq.Feet
◆ correlation between House Price and Sq. Feet is high, at
0.837, and the fit of the regression model is quite strong
R² = 0.700, i.e. 70%
◆ The Sq. Feet variable is highly significant
t = 7.009, p = 0.0000

◆ The implicit null hypothesis is that the coefficient is zero, i.e. the variable
has no impact
Linear regression - Example

Add another variable to data - Location


Price Sq. Feet Location
245 1400 2
312 1600 3
279 1700 4
308 1875 3
199 1100 5
219 1550 1
405 2350 1
324 2450 5
319 1425 4
etc
Linear regression - Example
Computer output:
Correlation R = 0.839 , R-squared = 0.705
Without Location: Correlation R = 0.837 , R-squared = 0.700

Coefficient t sig

Constant 73.510 2.366 0.0282

Sq. Feet 0.120 6.475 0.0000


Location 2.283 0.525 0.6050
Linear Regression – Example

Slight improvement in R²

Location is not significant (‘sig’ or p-value high)
– consider dropping it from the model
Coefficient t sig
Constant 73.510 2.366 0.0282
Sq. Feet 0.120 6.475 0.0000
Location 2.283 0.525 0.6050
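
A sketch of adding the second variable in Python with statsmodels, using the nine rows listed above (the slide's output uses the full data set, so the numbers will differ somewhat):

```python
import pandas as pd
import statsmodels.api as sm

# First nine rows from the slide (the deck continues with 'etc')
houses = pd.DataFrame({
    'Price':    [245, 312, 279, 308, 199, 219, 405, 324, 319],
    'SqFeet':   [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425],
    'Location': [2, 3, 4, 3, 5, 1, 1, 5, 4],
})

X = sm.add_constant(houses[['SqFeet', 'Location']])
fit = sm.OLS(houses['Price'], X).fit()
print(fit.summary())    # check each variable's p-value ('P>|t|' column);
                        # a high value, like Location's 0.605 on the slide, suggests dropping it
```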
Linear regression - example
Model Market to Book Value (MBV) as function of Revenue

Data
Company MBV Revenue
1 2.011 39.505
2 1.814 4.165
3 1.522 10.406
4 1.826 7.602
5 1.824 2.942
6 1.337 5.228
7 1.650 1.697
etc
Linear regression - example
Output
Dep Var: MBV N: 71
Multiple R: 0.318
Squared multiple R: 0.101

variable coefficient t value sig

Constant 2.010 11.465 0.000

Revenue 0.046 2.789 0.007


Linear regression - example

Model is:
MBV = 2.010 + 0.046× Revenue

Fit not great (R² = 0.10, 10%)


But significant (F = 7.778, p = 0.007)

Revenue variable is significant


(t = 2.789, p = 0.007)
Linear regression - example

More factors (variables) impact on MBV and need to be


considered
*** WARNING ***
Case 1 has large leverage
(Leverage = 0.243)
Case 8 has large leverage
(Leverage = 0.163)
Case 56 is an outlier
(Standardized Residual = 5.167)
Durbin-Watson D Statistic 1.682
First Order Autocorrelation 0.140
Multiple regression
Avoid step-wise regression
Look for non-linear patterns in the scatter plot
Diagnostic checks
◆ Multicollinearity (different x’s move together in systematic way)
◆ Autocorrelation (successive error terms are correlated with each
other)
◆ Outliers (data points that are not together with the rest)
◆ Heteroscedasticity (non-constant variance)
◆ Leverage (observation with large effects on outcomes)
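
A minimal sketch of how these checks might be run with statsmodels; 'fit' and 'X' are assumed to be a fitted OLS result and its design matrix (with constant):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

def diagnostics(fit, X):
    # Multicollinearity: VIF above ~10 (tolerance below ~0.1) is a warning sign
    vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    # Autocorrelation: Durbin-Watson close to 2 suggests little first-order autocorrelation
    dw = durbin_watson(fit.resid)
    # Outliers and leverage
    infl = fit.get_influence()
    std_resid = infl.resid_studentized_internal   # |values| around 3 or more flag outliers
    leverage = infl.hat_matrix_diag               # large values flag high-leverage cases
    # Heteroscedasticity: plot fit.resid against fit.fittedvalues and look for a fan shape
    return vif, dw, std_resid, leverage
```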
Multiple regression - Example

Hospital and Nursing Salary example


(9.10 of textbook)
Multiple regression - Example

Dialogue box
Dependent: Annual Nursing Salary
Independents: Number of beds in home
Annual medical in-patient days
Annual total patient days
Rural (1) and non-rural (0) homes
Multiple regression - Example

Model Summary
R R Square Adjusted R Square
0.8803 0.775 0.7557

Std. Error of the Estimate $82,024.63

ANOVA F Sig
Regression 40.4375 0.000
Multiple regression - Example
R = 0.88: coefficient of correlation, the strength of the relationship between the two
variables. R = −1: strong negative relationship; R = 1: strong positive
relationship; R = 0: no relationship between the two variables.
R Square = 0.775: coefficient of determination. 77.5% of the variation in Y
is explained by the X variables; the other 22.5% is due to
other factors. This fit is quite strong.
Adjusted R Square = 0.7557: adjusted for multiple variables. A
decrease in Adjusted R Square means the newly added variable is not
significant.
Multiple regression - Example

Std. Error of the Estimate $82,024.63


Sig = 0.000: significant fit
p = 0.1799 (beds): too high (compared to α); should
consider dropping “beds” as a variable
Multiple regression - Example

                         Coefficient   Standard Error   t value   p value (sig)
Constant (Intercept) 113.5003 495.4654 0.2291 0.8198
Number of beds in home 9.6399 7.0804 1.3615 0.1799
Annual medical in-patient -7.4072 2.4012 -3.0848 0.0034
days (100s)
Annual total patient days 15.7674 2.7550 5.7232 0.0000
(100s)
Rural (1) and non-rural (0) -79.5796 288.1857 -0.2761 0.7837
homes
Multiple regression - Example
• The interpretation of the coefficients is that if the in-patient days,
total patient days and rural factor are held constant, then the annual
nursing salary is expected to increase by $9.64 for each extra bed in
home.
• Similarly, annual nursing salary is expected to change by −$740.72, +$1576.74 and
−$79.58 for each extra in-patient day, patient day and for rural homes,
respectively, other variables held constant.
• The $11,300 can be interpreted as the annual base salary.
Coefficients   Standard Error   t value   p value (sig)
Constant (Intercept) 113.5003 495.4654 0.2291 0.8198
Number of beds in home 9.6399 7.0804 1.3615 0.1799
Annual medical in-patient days (100s) -7.4072 2.4012 -3.0848 0.0034
Annual total patient days (100s) 15.7674 2.7550 5.7232 0.0000
Rural (1) and non-rural (0) homes -79.5796 288.1857 -0.2761 0.7837
Multiple regression - Example

◆ Compare the intercepts and the slopes of the multiple
regression with those of the simple linear regression:
changes have occurred (difficult to analyse in detail).
◆ se is still the standard error of the estimate. Note that the multiple
regression yields a better se than the simple linear regression.
◆ R² – similarly (but it would increase with extra x’s)
◆ Adjusted R² – a decrease indicates an added x does not
belong in the equation.
Multiple regression - Example

◆ Tolerance stats OK (> 0.1), so no multicollinearity
issue. If an individual R² is too high (almost equal to the R² of the
multiple regression): suspect multicollinearity!
◆ Durbin-Watson stat d = 2.4789, suggesting some negative
autocorrelation. Values between about 1 and 2 would indicate no
autocorrelation concern.
◆ Outliers – see graphs of residuals: normal shape on the
histogram, random (no pattern) on the scatter plots.
Multiple regression - Example

[Histogram of regression standardized residuals; dependent variable: Current Salary.
Std. Dev = 1.00, Mean = 0.00, N = 474]

The standardized residual distribution is relatively normal – a relatively good fit.
Multiple regression - Example

Scatter plot – randomly distributed, on both sides of 0.00

[Scatterplot of regression standardized residuals against the estimate; dependent variable: Current Salary]

The red plot is an example of heteroscedasticity (unequal variance distribution).
Multiple regression - Example

Dummy variables – categorical data, related to the dependent variable
Other names: indicators, 0-1 variables
If dummy variable = 1: the case is in that category
If dummy variable = 0: the case is not in that category

The coefficient of this variable indicates the difference in the dependent
variable due to this (dummy) variable
Multiple regression - Example
Coefficients   Standard Error   t value   p value (sig)
Constant (Intercept) 113.5003 495.4654 0.2291 0.8198
Number of beds in home 9.6399 7.0804 1.3615 0.1799
Annual medical in-patient days (100s) -7.4072 2.4012 -3.0848 0.0034
Annual total patient days (100s) 15.7674 2.7550 5.7232 0.0000
Rural (1) and non-rural (0) homes -79.5796 288.1857 -0.2761 0.7837

Salary = 113.50 + 9.64 Bed – 7.41 In-ptDay + 15.77 Tot-ptDay – 79.58 Rural

Between Rural = 0 and Rural = 1: salary difference = –$7,958
(rural is lower)
Two or more categorical variables can be involved. The coefficient
indicates the difference in y when the rest is the same.
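
A small illustration of how a 0-1 dummy shifts the prediction, using the slide's coefficients; the other input values are made up for the example:

```python
def salary(beds, inpt_days, totpt_days, rural):
    # Coefficients from the slide's fitted equation
    return (113.50 + 9.64 * beds - 7.41 * inpt_days
            + 15.77 * totpt_days - 79.58 * rural)

non_rural = salary(beds=50, inpt_days=100, totpt_days=150, rural=0)   # assumed inputs
rural     = salary(beds=50, inpt_days=100, totpt_days=150, rural=1)
print(rural - non_rural)   # -79.58: exactly the dummy's coefficient, whatever the other values
```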
Analysing a Regression
◆ p-value of the Regression
◆ p-value of each x, to consider dropping it or not
◆ Adjusted R-square value
◆ Standard Error of the Regression estimate
◆ Scatter plot of residuals – randomness, outliers, heteroscedasticity (non-
equal variance)
◆ Histogram of the residuals
◆ Durbin-Watson statistic (d)
Linear, Quadratic and Log regression
example

The Public Service Electric Company produces different
quantities of electricity each month, depending on the
demand. The file Poly and Log examples - Power.xls lists
the number of units of electricity produced (Units) and
the total cost of producing these (Cost) for a 36-month
period. How can regression be used to analyse the
relationship between Cost and Units?
Multiple regression - Example

R Square 0.7359
Standard Error 2733.7424

R Square 0.8216
Standard Error 2280.7998
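
A sketch of how linear and quadratic specifications might be compared in Python; the file name is taken from the slide, but the column names 'Units' and 'Cost' are assumptions:

```python
import pandas as pd
import statsmodels.api as sm

power = pd.read_excel('Poly and Log examples - Power.xls')    # assumed columns: Units, Cost

linear = sm.OLS(power['Cost'], sm.add_constant(power[['Units']])).fit()

quad_X = power[['Units']].copy()
quad_X['Units_sq'] = quad_X['Units'] ** 2                     # add a squared term
quadratic = sm.OLS(power['Cost'], sm.add_constant(quad_X)).fit()

# Compare R-squared and the standard error of each specification
print(linear.rsquared, linear.mse_resid ** 0.5)
print(quadratic.rsquared, quadratic.mse_resid ** 0.5)
```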
Log model
Very often we use multiple regression to fit a multiplicative model:
Y = a X1^b1 X2^b2 X3^b3

If an explanatory variable changes by 1%, the dependent variable
changes by a (roughly) constant percentage – that variable’s b coefficient

This can be estimated by making a logarithmic transformation of
the equation, which gives:
ln(Y) = ln(a) + b1 ln(X1) + b2 ln(X2) + b3 ln(X3)
Log model

Thus we can calculate ln(Y), ln(X1), ln(X2), ln(X3)

and regress these variables in the usual way, to estimate
the parameters of the original equation.
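
A minimal sketch of this transform-then-regress approach with statsmodels; 'data' and its column names are assumptions, not the slides' data:

```python
import numpy as np
import statsmodels.api as sm

# 'data' is assumed to be a DataFrame of positive-valued columns Y, X1, X2, X3
logged = np.log(data[['Y', 'X1', 'X2', 'X3']])

X = sm.add_constant(logged[['X1', 'X2', 'X3']])
fit = sm.OLS(logged['Y'], X).fit()

b = fit.params              # b1, b2, b3 are the elasticities; 'const' estimates ln(a)
a = np.exp(b['const'])      # back-transform to recover a in the multiplicative model
```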
Log model example

File CarSales.xls contains annual data (1970 – 1999) on


domestic auto sales in the United States. The variables
are defined as:
Sales: annual domestic auto sales (in number of units)
PriceIndex: consumer price index of transportation
Income: real disposable income
Interest: prime rate of interest
Multiple regression - Example

MultiRegres Coefficients t value p value Regression and Correlation


Intercept 513941538.55 0.7356 0.4688 Observations 30
Year -258651.57 -0.7234 0.4761 R Square 0.5414
PriceIndex -18121.97 -0.4786 0.6364 Standard Error 758049.7773
Income 2175.75 1.1204 0.2732 Adjusted R Square 0.4680
Interest -8895378.05 -1.4810 0.1511 Multiple R 0.7358

LogRegres Coefficients t value p value Regression and Correlation


Intercept -110360558.48 -45.9500 0.0000 Observations 30
Log(Sales) 7522741.47 54.4195 0.0000 R Square 0.9956
Log(PriceIndex) 35983.70 0.2297 0.8202 Standard Error 74199.1103

Log(Income) -162258.29 -0.6222 0.5395 Adjusted R Square 0.9949

Log(Interest) -13588.13 -0.2133 0.8328 Multiple R 0.9978


Multiple regression - Example
Log model

Probably a slightly better model

R-square = 0.99, good
Fewer outliers, slightly better
Residual plots not necessarily better
Multiple Regression Goal

Remove any unimportant or problematic (multicollinearity,
autocorrelation, etc.) variables from the equation and decide
which variable(s) are important for the regression model.

Use that model for your prediction.
Multiple regression – time series example

Plot CarSales.xls data, Year vs. Sales


Multiple regression – time series example

Period Sales (‘000)


2003 Quarter I 25.4
2003 Quarter II 23.8
2003 Quarter III 22.0
2003 Quarter IV 28.6
2004 Quarter I 28.5
2004 Quarter II 27.0
etc
Multiple regression – time series example

[Plot of SALES (’000) against TIME]
Multiple regression – time series example
Create dummy variables for the Quarters and time period

Period Sales Time QII QIII QIV


2003 Q I 25.4 1 0 0 0
2003 Q II 23.8 2 1 0 0
2003 Q III 22.0 3 0 1 0
2003 Q IV 28.6 4 0 0 1
2004 Q I 28.5 5 0 0 0
2004 Q II 27.0 6 1 0 0
etc
Multiple regression – time series example
Squared multiple R: 0.987
Effect Coefficient t P
CONSTANT 23.679 50.5 0.000
TIME 1.005 28.5 0.000
QII -2.525 -5.2 0.000
QIII -5.070 -9.8 0.000
QIV 0.450 0.9 0.401
Could drop QIV and re-estimate
Multiple regression – time series example
Model as estimated is:
Sales
= 23.679 + 1.005× Time - 2.525QII – 5.070QIII + 0.450QIV

Say data ended at Time = 24 i.e. 2008 QIV


Use model to forecast
e.g. forecast sales in 2009 in quarters I and II
Multiple regression – time series example
2009 quarter I is Time = 25, QI = 1, QII = 0, QIII = 0, QIV = 0
Sales = 23.679 + 1.005× 25 - 0 – 0 + 0
= 48.808 i.e. $48,800

2009 quarter II is Time = 26, QI = 0, QII = 1, QIII = 0, QIV = 0


Sales = 23.679 + 1.005× 26 - 0 – 2.525 + 0
= 47.284 i.e. $47,300
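
A small sketch of these forecasts in Python, using the coefficients reported above:

```python
# Coefficients from the fitted seasonal-dummy model on the earlier slide
b0, b_time, b_q2, b_q3, b_q4 = 23.679, 1.005, -2.525, -5.070, 0.450

def forecast(time, quarter):
    """Forecast sales ('000) for a time index and quarter number (1-4)."""
    return (b0 + b_time * time
            + b_q2 * (quarter == 2)
            + b_q3 * (quarter == 3)
            + b_q4 * (quarter == 4))

print(forecast(25, 1))   # 2009 Q I  -> about 48.8
print(forecast(26, 2))   # 2009 Q II -> about 47.3
```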
Autoregression

Another way of dealing with time series is autoregression

Often used when Durbin-Watson indicates autocorrelation (a common issue with time series data)
Or because it makes theoretical sense that one period’s value depends (partly) on the previous value of the series

Use previous (lagged) values as an explanatory variable
Autoregression
In example, add another variable, which is the lagged sales

Period Sales Time QII QIII QIV lagSales


2003 Q I 25.4 1 0 0 0 -
2003 Q II 23.8 2 1 0 0 25.4
2003 Q III 22.0 3 0 1 0 23.8
2003 Q IV 28.6 4 0 0 1 22.0
2004 Q I 28.5 5 0 0 0 28.6
2004 Q II 27.0 6 1 0 0 28.5
etc
Autoregression

The lagged variable would replace the Time (trend) variable

First data point is lost, as we don’t have a lagged value for it


Can handle seasonality by having another variable, Sales
lagged by the seasonality period (e.g. 4 terms)
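
A sketch of setting up the lagged variable with pandas; 'sales' is an assumed DataFrame holding the quarterly data and dummies:

```python
import statsmodels.api as sm

# 'sales' is assumed to have columns 'Sales', 'QII', 'QIII', 'QIV' as in the table above
sales['lagSales'] = sales['Sales'].shift(1)   # previous quarter's sales
ar_data = sales.dropna()                      # the first row has no lagged value

X = sm.add_constant(ar_data[['lagSales', 'QII', 'QIII', 'QIV']])
ar_fit = sm.OLS(ar_data['Sales'], X).fit()
print(ar_fit.summary())
```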
Logit regression

If the dependent variable is categorical, not metric

e.g. for accounting graduates, membership of CPA Aust (or
not) is the dependent variable
X variables might be gender, age, importance of
joining cost, importance of brand status, etc.

Regression is possible, but with special technical issues
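
A minimal sketch of a logit model with statsmodels; 'grads' and its column names are illustrative assumptions, not the slides' data:

```python
import statsmodels.api as sm

# 'grads' is an assumed DataFrame: 'member' is 1/0 (CPA member or not),
# and the explanatory columns are illustrative
X = sm.add_constant(grads[['gender', 'age', 'joining_cost', 'brand_status']])
logit_fit = sm.Logit(grads['member'], X).fit()
print(logit_fit.summary())      # coefficients are on the log-odds scale
```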


Reference

Ragsdale (2008), chapter 9 and pp. 522–28
