
DSC2008

NATIONAL UNIVERSITY OF SINGAPORE

DSC2008 BUSINESS ANALYTICS DATA AND DECISIONS


(Semester 2 AY2013/2014)

Time Allowed : 2 Hours

INSTRUCTIONS TO CANDIDATES
1 Please write only your student number below. Do not write your name.
2 This booklet contains two (2) Sections and comprises fifteen (15) printed
pages.
3 Answer ALL questions. This is an OPEN Book examination. All materials are
allowed.
4 Write legibly. A dark pencil may be used.
5 Graphic calculators or other calculators may be used, but not computers,
tablets or phones.
6 Write your answers in the spaces provided after each part of a question,
except that answers to Section A must only be entered into the Bubble Form
provided. The first column for STUDENT NUMBER on the Bubble Form should
be left blank.
7 Plan your answers to ensure they fit within the spaces provided. Other than
this cover page and the spaces designated for providing your answers, you
may do your rough work anywhere. Whatever you write outside of the
answer spaces may be ignored.

Write your SEAT NUMBER and MATRICULATION NUMBER below.


Seat No:
Matriculation No:

Question                 Max    Marks
Section A                 40
Section B   Question 1    15
            Question 2    15
            Question 3    18
            Question 4    12
Total                    100

Section A (40 marks). Each question carries 2 marks. Choose the most appropriate answer.
1. Which of the following time series forecasting methods could NOT be used to
forecast a time series that exhibits a linear trend with no seasonal or cyclical
patterns?
A. Dummy variable regression.
B. Linear trend regression.
C. Holt-Winters double exponential smoothing.
D. Ratio-to-Moving-Average method with linear trend model.
2. To check for positive or negative first-order autocorrelation, we can use
A. the Durbin-Watson statistic.
B. a one-sample t test.
C. multiplicative Winters method.
D. dummy variable regression.
3. If the errors produced by a forecasting method for 3 observations are 2, -1,
and -6, then what is the closest value to their root mean squared error?
A. -5/3
B. 41/3
C. 3
D. 11/3

[Answer] RMSE = $\sqrt{(2^2 + (-1)^2 + (-6)^2)/3} = \sqrt{41/3} \approx 3.70 \approx 11/3$
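The arithmetic behind the RMSE answer can be checked with a short sketch. It uses only the three forecast errors given in the question; the variable names are illustrative.

```python
import math

# Forecast errors for the 3 observations, as given in the question
errors = [2, -1, -6]

# Mean squared error: average of the squared errors
mse = sum(e ** 2 for e in errors) / len(errors)

# Root mean squared error: square root of the MSE
rmse = math.sqrt(mse)

print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

Squaring removes the signs of the errors, so an RMSE can never be negative, which rules out a negative option immediately.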

4. Consider the following time series.

Time period   1    2    3    4    5    6    7
Sales         53   60   47   55   56   53   ?
Compute the sales forecast for time period 7 using
i. the 3-period weighted moving average, giving a weight of 1/2 to the most recent period, a weight of 1/3 to the second most recent period, and a weight of 1/6 to the third most recent period, and
ii. exponential smoothing with a smoothing constant of 1/2.
What is the closest value to the difference between the two forecasts of i and ii?
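A. 53
B. 54
C. 1/3
D. 1 1/3

[Answer]
SES(0.5) = $\frac{1}{2}\,53 + \frac{1}{2^2}\,56 + \frac{1}{2^3}\,55 + \frac{1}{2^4}\,47 + \frac{1}{2^5}\,60 + \frac{1}{2^6}\,53 = 53.02$
WMA(3) = $\frac{1}{2}\,53 + \frac{1}{3}\,56 + \frac{1}{6}\,55 = 54.33$
The difference = $1.31 \approx 1\frac{1}{3}$.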
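The two forecasts can be reproduced with a minimal sketch. It follows the worked answer in writing exponential smoothing as a sum with geometric weights (1/2)^k over the six observations; a recursive implementation with a different initial forecast would give a slightly different value. Variable names are illustrative.

```python
sales = [53, 60, 47, 55, 56, 53]  # periods 1..6
newest_first = sales[::-1]        # [53, 56, 55, 47, 60, 53]

# (i) 3-period weighted moving average: weights for the most recent,
# second most recent and third most recent periods
weights = [1 / 2, 1 / 3, 1 / 6]
wma = sum(w * y for w, y in zip(weights, newest_first[:3]))

# (ii) exponential smoothing with alpha = 1/2, expanded as in the worked
# answer: weight alpha**k on the k-th most recent observation
alpha = 0.5
ses = sum(alpha ** (k + 1) * y for k, y in enumerate(newest_first))

difference = wma - ses
print(f"WMA(3) = {wma:.2f}, SES(0.5) = {ses:.2f}, difference = {difference:.2f}")
```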

5. Which of the following statements is true concerning the autocorrelation function (ACF) and partial autocorrelation function (PACF)?
A. The ACF will always be smaller than PACF.
B. The PACF for an AR(q) model will be zero beyond lag q.
C. The ACF for an AR(p) model will be zero beyond lag p.
D. The ACF and PACF will always be identical for an ARMA(1,1) model.


Answer questions 6 to 10 according to the following data analysis.


Lind fabric retailer sells a variety of fabrics and sewing supplies, craft materials,
frames, home decorations, artificial floral items, and seasonal goods. It
operates more than 300 stores, with most located in mall shopping centers. The
quarterly revenues of Lind increased from about USD 5 million in Year 2009 to
more than USD 50 million in Year 2013.

[Figure: time series plot of Lind's quarterly Revenue; vertical axis from 0 to 60,000,000.]

Lind invited you to predict the future revenues in Year 2014. Given the
existence of a seasonal pattern, you adopted a seasonal dummy model. Answer
the following questions according to the Excel and SAS output below.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.978
R Square
Adjusted R Square
Standard Error       3443218.65
Observations         20

ANOVA
              df    SS           MS           F         Significance F
Regression     4    3.88088E+    9.70221E+    81.835    5.30312E-
Residual      15    1.77836E+    1.18558E+
Total         19    4.05872E+

              Coefficients   Standard Error   t Stat        P-value
Intercept     17447690.7     2147703.31       8.12388311    7.12257E-
Time          1132370.43                      8.31981952    5.29612E-
Q1                           2194629.88       -             6.76499E-
Q2                           2181931.82                     0.00066196
Q4            11997271.0     2181931.82       5.49846284    6.12299E-

The residuals of the fitted model are analyzed in SAS. Please refer to the SAS
output.

Augmented Dickey-Fuller Unit Root Tests for Residuals

Type          Lags   Rho        Pr < Rho   Tau     Pr < Tau   F      Pr > F
Single Mean   0      -13.2380   0.0261     -2.83   0.0722     4.05   0.1168
              1      -22.8343   0.0002     -3.07   0.0477     4.70   0.0726
              2      353.722    0.0001     -3.39   0.0264     5.76   0.0370

Autocorrelation Check of Residuals

To Lag   ChiSquare   DF   Pr > ChiSq   Autocorrelations
6        8.21         6   0.2234        0.271  -0.144  -0.385  -0.209  -0.118  -0.134
12       13.82       12   0.3123       -0.037  -0.058   0.130   0.246   0.194  -0.070
18       19.44       18   0.3655       -0.057  -0.063  -0.043  -0.163   0.023   0.094

6. What is the t-stat of Q2? Is Q2 significant at the 5% level?
A. -8.15. Insignificant.
B. -8.15. Significant.
C. -4.28. Insignificant.
D. -4.28. Significant.

[Answer] D. The t-stat = Coef./SE = -9331747.045/2181931.822 = -4.277.
Given a P-value less than 5%, the variable Q2 is significant.
7. Which quarter is considered the base season in your model, and what is the
fitted model for the base season?
A. Q1. Revenue = 17447690.73 + 1132370.43 × Time - 17903183.80 × Q1.
B. Q3. Revenue = 17447690.73 + 1132370.43 × Time.
C. Q3. Revenue = 17447690.73 + 1132370.43 × Time - 14105289 × Q3.
D. Q4. Revenue = 17447690.73 + 1132370.43 × Time.


8. What is the 95% interval forecast (using Normal approximation) for the first
quarter of Year 2014 (2014Q1) with Time = 21?
A. [16.6 million, 30.1 million].
B. 23.3 million.
C. [6.6 million, 30.1 million].
D. 16.6 million.

[Answer] A. The point forecast = 17447690.73 + 1132370.43 × 21 - 17903183.80 = 23 324 286.
The 95% interval forecast = 23324286 ± 1.96 × 3443218.65 = [16 575 577, 30 072 995].
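The interval forecast can be reproduced with a short sketch. The coefficients and the regression standard error are those quoted in the worked answers; the Normal approximation (z = 1.96) is the one the question asks for, and the variable names are illustrative.

```python
# Fitted seasonal-dummy model, values as quoted in the worked answers
intercept = 17447690.73
b_time = 1132370.43
b_q1 = -17903183.80      # Q1 dummy coefficient
std_error = 3443218.65   # regression standard error
z = 1.96                 # Normal approximation for a 95% interval

# 2014Q1: Time = 21 and the Q1 dummy is switched on
time, q1 = 21, 1
point = intercept + b_time * time + b_q1 * q1

half_width = z * std_error
lower, upper = point - half_width, point + half_width

print(f"point = {point:,.0f}, interval = [{lower:,.0f}, {upper:,.0f}]")
```

This interval uses only the residual standard error and so ignores the uncertainty in the estimated coefficients, which is what the stated Normal approximation implies.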
9. Are the residuals stationary?
A. Yes. The ADF test rejects the null hypothesis.
B. No. The ADF test rejects the null hypothesis.
C. Yes. The Ljung-Box test indicates that the residuals are independent.
D. No. The Ljung-Box test indicates that the residuals are independent.
10. Is the model adequate?
A. Yes, given the adjusted R square being 0.978.
B. Yes, since the ADF test rejects the null hypothesis.
C. Yes, given that the Ljung-Box tests do not reject the null hypothesis.
D. No. The residuals are dependent.
11. One-way Analysis of Variance (ANOVA) is not which of the following?
A. A generalization of the 2-sided pooled t-test for means, when variances
are unknown but equal.
B. A method for comparing multiple entire distributions of variables.
C. A component of the usual multiple regression output.
D. A simple model for the design of experiments.
12. P-value in hypothesis testing is not which of the following?
A. Given the truth of the Null Hypothesis, the probability of a fresh sample
favoring the Alternative Hypothesis at least as much as the current
sample.
B. When compared with the preset probability of Type I Error, leads to
whether the Null Hypothesis is rejected.
C. In a stepwise regression, can be used for selecting the variable to be
deleted from a regression model.
D. The probability that the Null Hypothesis is true.
13. What percentage of the data lies between the second and third quartiles?
A. 25%
B. 68%
C. 50%
D. Depends on the Standard Deviation.
14. Which of the following is not a measure of spread for a variable's
distribution?
A. Interquartile range
B. Median absolute deviation
C. Range
D. The average of the first and third quartiles
15. Which of the following is a description of the correlation coefficient?

A. It measures the correlation between two regression coefficients.
B. It is the square root of the Adjusted R^2 in regression.
C. For X and Y variables standardized, it measures the extent of clustering
around either the positively or negatively sloping 45-degree line through
the origin.
D. It is not symmetrical in the two variables involved.
16. Logistic regression is
A. the regression of log(Y) on log(X).
B. frequently used when deciding between only two possible outcomes.
C. where the regression error term is assumed to follow the Logistic rather
than Normal distribution.
D. the regression of log(Y) on X.
17. The Central Limit Theorem says that:
A. in the limit, everything tends toward the centre.
B. the limiting distribution of the sample average is Normal as the sample
size is increased.
C. in a different round, a corresponding observation tends to be relatively
closer to the centre.
D. nowhere is the central tendency of a distribution more limiting than at
the average.
18. Indicator or dummy variables may not be used for
A. identifying important outliers.
B. imputing missing values.
C. indicating dummy collinearity.
D. representing categorical variables.
19. Residual plots may sometimes provide useful regression diagnostics, except
for identifying
A. inconstant error variance.
B. correlated errors.
C. regression effect.
D. departure from Normality assumption for errors.
20. Simpson's Paradox warns against
A. rampant use of transformed X variables.
B. failure to identify any regression effect.
C. confusion between the power and the strength of a test.
D. carelessly adding numbers from disparate groups.


Section B (60 marks). There are 4 questions.


Question 1 (15 marks)
Two scientists Flury and Riedwyl collected and analyzed 200 old Swiss
banknotes. Each banknote was measured (unit: centimeter), as indicated in the
figure,
X1 = length of the banknote
X2 = height of the banknote (left)
X3 = height of the banknote (right)
X4 = distance of the inner frame to the lower border
X5 = distance of the inner frame to the upper border
X6 = length of the diagonal of the central picture
The aim was to study how these measurements may be used to distinguish
genuine and counterfeit banknotes.

[Figure: diagram of a banknote marking the six measurements X1 to X6.]

The K-means method was used to cluster these banknotes. Analyst YC repeated
the analysis with K = 2 and K = 5.
The SAS output is displayed as follows.

Cluster Means K = 2
Cluster   X1       X2       X3       X4      X5      X6
1         214.82   130.30   130.19   10.55   11.13   139.44
2         214.97   129.95   129.72    8.31   10.18   141.50

Cluster Means K = 5
Cluster   X1       X2       X3       X4      X5      X6
1         215.27   130.28   130.01    8.81   10.18   141.22
2         214.91   130.30   130.21    9.52   11.57   139.29
3         214.75   130.30   130.18   11.36   10.75   139.59
4         214.85   129.84   129.61    7.83   10.49   141.54
5         214.95   129.84   129.67    8.95    9.32   141.85

a) Choose an appropriate K-means method and justify your selection.
[4 marks]

Since the aim was to determine whether a banknote is genuine or counterfeit, it is natural to
set K = 2. One represents the genuine notes and the other is for the counterfeit notes.
b) Discuss the differences among the clusters of your selected method, in
terms of the 6 measurements.
[4 marks]

According to the K-means result, the length, height (left and right) and distance of the inner
frame to the upper border, i.e. X1, X2, X3 and X5, of genuine and counterfeit banknotes are
quite similar, whereas the two centers differ substantially in the distance of the inner frame
to the lower border (X4, by 2.24cm) and in the diagonal length of the central picture
(X6, by 2.06cm). This implies that one can focus on X4 and X6 to distinguish genuine from
counterfeit banknotes.
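The comparison of the two cluster centers can be sketched as follows. The values are the K = 2 cluster means from the SAS output; the dictionary layout and names are illustrative, and the output does not say which cluster holds the genuine notes.

```python
# K = 2 cluster means from the SAS output
centers = {
    1: {"X1": 214.82, "X2": 130.30, "X3": 130.19, "X4": 10.55, "X5": 11.13, "X6": 139.44},
    2: {"X1": 214.97, "X2": 129.95, "X3": 129.72, "X4": 8.31, "X5": 10.18, "X6": 141.50},
}

# Absolute difference between the two centers, one value per measurement
diffs = {name: abs(centers[1][name] - centers[2][name]) for name in centers[1]}

# Rank measurements from most to least separating
ranked = sorted(diffs.items(), key=lambda item: item[1], reverse=True)

for name, gap in ranked:
    print(f"{name}: {gap:.2f}")
```

Ranking raw differences is only a rough guide, since it ignores how much each measurement varies within a cluster.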
c) Analyst YC also tried a hierarchical method. The result is consistent with your
selection. However, YC lost the distance measure between two banknotes.
The measures of the two banknotes are given:

Banknote   X1      X2      X3      X4    X5    X6
1          214.8   131.0   131.1   9.0   9.7   141.0
2          214.6   129.7   129.7   8.1   9.5   141.7

What is the city-block distance between the two banknotes?
[4 marks]

City-block distance = |214.8 - 214.6| + |131.0 - 129.7| + |131.1 - 129.7| + |9.0 - 8.1| + |9.7 - 9.5| + |141.0 - 141.7| = 4.7.
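A minimal sketch of the city-block calculation, using the two banknotes from the table. The Euclidean distance is computed alongside purely for comparison; the variable names are illustrative.

```python
import math

# Measurements X1..X6 of the two banknotes
note1 = [214.8, 131.0, 131.1, 9.0, 9.7, 141.0]
note2 = [214.6, 129.7, 129.7, 8.1, 9.5, 141.7]

# City-block (Manhattan) distance: sum of absolute coordinate differences
city_block = sum(abs(a - b) for a, b in zip(note1, note2))

# Euclidean distance: square root of the sum of squared differences
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(note1, note2)))

print(f"city-block = {city_block:.1f}, Euclidean = {euclidean:.2f}")
```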
d) List 3 popular distance measures in cluster analysis and discuss their pros
and cons.
[3 marks]

I. Euclidean Distance: scale dependent. Not affected when new objects are included.
II. Manhattan Distance: scale dependent. Not affected when new objects are included.
Compared to Euclidean Distance, it dampens the effect of outliers, as differences are not
squared.
III. Mahalanobis Distance: does not depend on the scales of measurement used. However, it
is affected when new objects are included, as the standard deviation needs to be
recalculated for the enlarged sample.


Question 2 (15 marks)


Poverty is a state of a lack of goods and services that are commonly taken for
granted by members of mainstream society. The most common measure of
poverty in the U.S. is the poverty level set by the U.S. government according to
total income received. For example, the poverty level in Year 2012 was set at
$23,050 (total yearly income) for a family of 4 in the U.S.
Consider the time series of poverty rates, percent of all the individuals under
the age of 18 living below the poverty level, in the U.S. from Year 1959 to 2003:

Analyst YC applied the Holt-Winters method with a season period of M = 4 years
to forecast the poverty rates (i.e. some component pattern repeats every 4
years). The initial forecasts of level and trend are 21.18 and -1.26. The initial
forecasts of seasonality are 1.09, 1.09, 0.99 and 0.83. The optimal smoothing
constants are α = 1, β = 0.16, γ = 0.
a) Explain whether or not Holt-Winters method is appropriate to analyze the
series of poverty rates.
[4 marks]

The time series plot displays a time varying trend with possible seasonal/cyclical variations.
However the underlying dependence is not very clear. The HW method is flexible to
forecast time series data when the underlying dynamics is not apparent and may have trend
and seasonal variations.
b) Interpret the meaning of the optimal smoothing constants α, β, γ.
[4 marks]

The magnitude of a smoothing constant indicates how strongly the most recent values
influence the forecast. Given α = 1 for the level, the level of the time series is fully
updated whenever a new observation arrives. In contrast, γ = 0 for the seasonality indicates
that the initial seasonal values are quite accurate and there is no need to update the
seasonal forecast. The trend forecast updates slowly, given β = 0.16.
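The effect of these constants can be illustrated with one update step. This is a sketch under assumptions: it uses the textbook multiplicative Holt-Winters recursions (the paper does not state which variant was fitted), takes level, trend and seasonal constants of 1, 0.16 and 0 as in the answer above, and uses a hypothetical observation of 22.0 that is not taken from the data.

```python
def holt_winters_step(y, level, trend, season, alpha, beta, gamma):
    """One multiplicative Holt-Winters update (assumed variant).

    Returns the updated (level, trend, seasonal index).
    """
    new_level = alpha * (y / season) + (1 - alpha) * (level + trend)
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    new_season = gamma * (y / new_level) + (1 - gamma) * season
    return new_level, new_trend, new_season


# Initial forecasts quoted in the question
level0, trend0, season0 = 21.18, -1.26, 1.09

# Hypothetical new observation, for illustration only
y_hypothetical = 22.0

new_level, new_trend, new_season = holt_winters_step(
    y_hypothetical, level0, trend0, season0, alpha=1.0, beta=0.16, gamma=0.0
)

print(f"level = {new_level:.3f}, trend = {new_trend:.3f}, season = {new_season:.2f}")
```

With a level constant of 1 the new level is simply the deseasonalised observation, with a seasonal constant of 0 the seasonal index never moves from its initial value, and the trend moves only 16% of the way toward the latest change in level.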
c) Are there seasonal/cyclical variations in the time series?
[3 marks]

Yes, there are seasonal/cyclical variations, with stable season forecasts of 1.09, 1.09, 0.99
and 0.83. In each 4-year cycle, the poverty rate in the first two years is 9% higher than the
average, while it is 1% and 17% lower in the remaining two years.
d) List 2 alternative models that are appropriate to analyze the time series of
poverty rates.
[4 marks]

We can try a multiplicative seasonal model and an additive seasonal model.


Question 3 (18 marks)
Salaries for employees at a bank are analyzed using regression, with the
following results:
Regression Statistics
Multiple R           0.816645142
R Square             0.666909288
Adjusted R Square    0.653518706
Standard Error       6625.6718
Observations         208

ANOVA
              df     SS             MS            F              Significance F
Regression      8    17491101398    2186387675    49.80435632    1.64008E-43
Residual      199    8736005833     43899526.8
Total         207    26227107231

Y: Salary              Coefficients    Standard Error   t Stat          P-value
Intercept              31829.30731     3405.456736      9.346560469     1.89825E-17
YrsThisBank            1495.083406     159.6484805      9.364845828     1.68338E-17
YrsThisBank*Female     -1109.41783     204.3454289      -5.429129666    1.63858E-07
YrsPriorBank*Female    988.3191454     356.2003226      2.774616087     0.006053119
ComputerRelated        4018.024602     1674.774669      2.399143405     0.017355991
YrsPriorBank           -585.1391184    305.4275282      -1.91580347     0.056823454
Female                 1896.763788     4320.417514      0.439023262     0.661120987
Age*Female             -46.21756033    142.7899567      -0.323675148    0.746523894
Age                    2.591559621     119.9336832      0.021608272     0.982782086

Please note that Female and ComputerRelated are indicator/dummy variables,
with 1 for their respective categories and 0 otherwise. All employees with
ComputerRelated jobs are female. YrsThisBank and YrsPriorBank record the
time an employee has worked for this bank and for any prior bank.
Variable*Female is the product of the corresponding Variable with the Female
indicator variable.
(a) Using the regression results above, what are the separate prediction
equations for the Salaries of Male and Female employees at the bank?
Please round the coefficients to whole numbers.
[3 marks]

Male: Salary = 31829 + 1495*YrsThisBank - 585*YrsPriorBank + 3*Age

Female: Salary = (31829+1897) + (1495-1109)*YrsThisBank + (-585+988)*YrsPriorBank
+ (2.59-46.22)*Age + 4018*ComputerRelated
= 33726 + 386*YrsThisBank + 403*YrsPriorBank - 44*Age + 4018*ComputerRelated
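The two prediction equations follow mechanically from the coefficient table: for female employees, each Variable*Female coefficient is added to the coefficient of the matching Variable. A minimal sketch, with illustrative function and key names:

```python
# Coefficients from the first regression output
coef = {
    "Intercept": 31829.30731,
    "YrsThisBank": 1495.083406,
    "YrsThisBank*Female": -1109.41783,
    "YrsPriorBank*Female": 988.3191454,
    "ComputerRelated": 4018.024602,
    "YrsPriorBank": -585.1391184,
    "Female": 1896.763788,
    "Age*Female": -46.21756033,
    "Age": 2.591559621,
}


def group_equation(female):
    """Rounded coefficients of the prediction equation for one group (female = 0 or 1)."""
    equation = {
        "Intercept": coef["Intercept"] + female * coef["Female"],
        "YrsThisBank": coef["YrsThisBank"] + female * coef["YrsThisBank*Female"],
        "YrsPriorBank": coef["YrsPriorBank"] + female * coef["YrsPriorBank*Female"],
        "Age": coef["Age"] + female * coef["Age*Female"],
    }
    if female:
        # All employees with ComputerRelated jobs are female
        equation["ComputerRelated"] = coef["ComputerRelated"]
    return {name: round(value) for name, value in equation.items()}


male = group_equation(0)
female = group_equation(1)
print("Male:  ", male)
print("Female:", female)
```

Summing before rounding matters for Age: 2.59 - 46.22 = -43.63 rounds to -44, whereas combining the rounded values 3 and -46 would give -43.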


(b) Based on your answers to (a), comment on expected yearly salary
increments for male and female employees of this bank. How would these
numbers be commonly interpreted?
[3 marks]

Male: 1495 + 3 = 1498 (3 is the coefficient of Age)
Female: 386 - 44 = 342
Female employees' average yearly salary increment is only 23% of that of male employees.
This probably reflects the relatively junior positions of female employees. Note: the fact
that the female employees' intercept is 1897 larger wasn't really an issue.
(c) Explain how you might improve the model by discarding a variable. Please
be specific.
[3 marks]

Age has the largest P-value; in fact it is almost 1. This variable should be deleted and a new
model fitted.
A variables-selection technique is employed, resulting in the new model below:
Regression Statistics
Multiple R           0.816241456
R Square             0.666250115
Adjusted R Square    0.657988979
Standard Error       6582.791103
Observations         208

ANOVA
              df     SS             MS            F              Significance F
Regression      5    17473813213    3494762643    80.64873091    3.12637E-46
Residual      202    8753294018     43333138.7
Total         207    26227107231

Y: Salary              Coefficients    Standard Error   t Stat          P-value
Intercept              32171.74593     924.3653924      34.80414368     2.80414E-87
YrsThisBank            1485.262977     75.76029523      19.60476754     1.24796E-48
YrsThisBank*Female     -1127.603829    87.24034017      -12.92525713    3.03196E-28
YrsPriorBank*Female    1005.09522      286.5924527      3.507054043     0.000558289
ComputerRelated        4092.763696     1639.307593      2.4966417       0.013337044
YrsPriorBank           -608.86039      248.3842939      -2.451283777    0.015084658

(d) Please explain whether the above is indeed a better model.
[3 marks]

The Significance F, P-values and number of predictors are all smaller, and Adjusted R^2 is a
bit higher, hence a better model.
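The Adjusted R^2 figures in the two outputs can be checked from R^2, the number of observations n and the number of predictors k, using the standard formula 1 - (1 - R^2)(n - 1)/(n - k - 1). A short sketch with illustrative names:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a regression with n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R Square, Observations and regression df from the two outputs
full_model = adjusted_r2(0.666909288, n=208, k=8)
reduced_model = adjusted_r2(0.666250115, n=208, k=5)

print(f"8 predictors: {full_model:.6f}")
print(f"5 predictors: {reduced_model:.6f}")
```

Dropping three predictors lowers R^2 only slightly, so the smaller penalty for model size leaves the reduced model with the higher Adjusted R^2.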
(e) How come Age is now not a factor in explaining Salary?
[3 marks]

Age was probably collinear with both YrsPriorBank & YrsThisBank. In the presence of
these 2 variables, Age is no longer needed for explaining Salary.
(f) Based on this new model, compare the expected yearly salary increments
for male and female employees of this bank. Please comment on the
negative coefficient of YrsPriorBank.
[3 marks]

Male: Salary = 32172 + 1485*YrsThisBank - 609*YrsPriorBank

Female: Salary = 32172 + (1485.26-1127.60)*YrsThisBank + (-608.86+1005.10)*YrsPriorBank +
4093*ComputerRelated
= 32172 + 358*YrsThisBank + 396*YrsPriorBank + 4093*ComputerRelated
The ratio of female to male employee salary increments is now 358 to 1485, which is
24%, almost the same as for the first model. For female employees, with Age out of the
way, it seems that each year of work is worth more in any prior bank than in the current
bank!
The negative coefficient of -609 for YrsPriorBank might reflect that, among male employees,
the older recruits occupy more junior positions in the bank, since YrsPriorBank is probably
a surrogate for Age.
Question 4 (12 marks)
The pay of U.S. law school graduates was studied by Forbes magazine in
March 2014. Law schools whose graduates volunteered their pay information
were ranked. The schools with the 25 highest average annual starting pays at
graduation also had their graduates' mid-career pay reported. The following
plot shows mid-career pay against starting pay.

[Figure: scatterplot titled "Mid-Career vs Starting Pay"; axis ticks run from 70,000 to 150,000 for starting pay and from 110,000 to 220,000 for mid-career pay.]

The boxplot for the two data series is as follows:

[Figure: boxplots of the two pay series on a horizontal axis from 0 to 250,000, in US$ annually.]
The very top (hence rich) U.S. law schools encourage graduates to take on low-paying
judicial clerkships in the public sector, partly through awarding study-loan-forgiving
post-graduate fellowships. Some of these top graduates continue
into a career of public service, while others become academics (with pays
nowhere near what private law firms offer). However, sought-after clerkships
(especially at the U.S. Supreme Court) are also stepping stones to very high-paying
high-street law firms, so many top students who are not primarily
interested in public-sector positions also opt for those low-paying jobs, for a
stint, upon graduation.
(a) How do you explain the shape of the boxplot for law graduates' average
starting pays?
[3 marks]

If there are many law schools in the US, the top 25 average salaries would represent only
the right tail of average salary distribution. As we start moving left from the right tail
(perhaps representing law schools with few graduates starting in public-sector), the
clustering of salaries will initially become increasingly dense. Hence, the lower boxplot
represents a left-truncated distribution with very short left tail compared to the long right
tail where there are outliers.
(b) Please comment on the data summary depicted by the two boxplots.
[3 marks]

The Starting-pay boxplot skews to the right, whereas the more-symmetrical Mid-career-pay
boxplot skews to the left, probably because some schools continue to have a large number of
mid-career graduates in the public sector. The top-half of the first plot has a range bigger
than the top-half of second plot, although overall spread is larger for the Mid-career group.
The difference in the medians of starting and mid-career pays is almost US$100,000.
Perhaps graduates in top law firms, who are not in the public sector at their mid-career,
cause their schools to be in the top half of the second boxplot, in which case the schools end
up having graduates with rather similarly high average salaries of nearly US$200,000.
(c) Why is the scatterplot without the usual oval or elliptical shape? Please
comment on how this might affect the prediction of Mid-career pay from
Starting pay.
[3 marks]

The left truncation of the Starting-pay distribution largely contributed to the
unusual scatterplot, i.e. the high degree of skewness of the X variable led to the absence of
the oval. Given the small sample size, the one big outlier on the right also makes it hard for
the ellipse to appear. All these make for low forecast accuracy when predicting, for a school
among those with the highest 25 average Starting pays, the corresponding average
Mid-career pay of the school's graduates from its average Starting pay. [In fact, the R^2 is
only 0.2.]
(d) Please write an executive summary for the readers of Forbes magazine.
[3 marks]

Top U.S. law graduates do not all go for the highest-paying jobs in the first instance, since a
stint as a judicial clerk can pay very high dividends later when switching to a private law
firm. This voluntary hardship posting is also encouraged by generous study-loan-payback
subsidies offered by the very best law schools that encourage their graduates to go into
public service. Hence, the median of the top 25 average-starting-salary law schools is only
about US$80,000 yearly. These salaries are self-reported by graduates, so may represent a
biased sample.
By mid-career, however, only lawyers who are committed to low-paying public service
remain in the judicial system, so the median of the average salaries of lawyers from those
same 25 schools more than doubled to about US$175,000. Since the proportions of
public-sector-moved-to-private-sector graduates are not indicated for the schools, it is difficult to
predict those top schools' graduates' average mid-career salaries from their respective
average starting salaries; presumably, schools with the higher proportions of those career
switches saw the larger jumps in average salaries.
========= END OF PAPER =========
