INSTRUCTIONS TO CANDIDATES
1 Please write only your student number below. Do not write your name.
2 This booklet contains two (2) Sections and comprises fifteen (15) printed
pages.
3 Answer ALL questions. This is an OPEN Book examination. All materials are
allowed.
4 Write legibly. A dark pencil may be used.
5 Graphic calculators or other calculators may be used, but not computers,
tablets or phones.
6 Write your answers in the spaces provided after each part of a question,
except that answers to Section A must only be entered into the Bubble Form
provided. The first column for STUDENT NUMBER on the Bubble Form should
be left blank.
7 Plan your answers to ensure they fit within the spaces provided. Other than
this cover page and the spaces designated for providing your answers, you
may do your rough work anywhere. Whatever you write outside of the
answer spaces may be ignored.
DSC2008

Question                 Max    Marks
Section A                 40
Section B  Question 1     15
           Question 2     15
           Question 3     18
           Question 4     12
Total                    100
[Answer] RMSE =
54
C.
1
3
1
3
D. 1
1
12
13
14
15
16
SES(0.5) = (1/2)(53) + (1/2^2)(56) + (1/2^3)(55) + (1/2^4)(47) + (1/2^5)(60) + (1/2^6)(53) = 53.02
WMA(3) = (1/2)(53) + (1/3)(56) + (1/6)(55) = 54.33
The difference = 54.33 - 53.02 = 1.31.
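The two forecasts can be checked numerically. The sketch below is only an illustration: it assumes the six observations are listed most recent first, that SES(0.5) is expanded as a weighted sum with weights 1/2, 1/4, ..., 1/64, and that WMA(3) uses the weights 1/2, 1/3 and 1/6. The function names are hypothetical.

```python
def ses_expanded(values, alpha):
    """Expanded SES: sum of alpha * (1 - alpha)**k * y_k, most recent value first."""
    return sum(alpha * (1 - alpha) ** k * y for k, y in enumerate(values))


def weighted_ma(values, weights):
    """Weighted moving average; weights are paired with values in order."""
    return sum(w * y for w, y in zip(weights, values))


# Observations as they appear in the worked sums (assumed most recent first).
obs = [53, 56, 55, 47, 60, 53]

ses = round(ses_expanded(obs, 0.5), 2)
wma = round(weighted_ma(obs[:3], [1 / 2, 1 / 3, 1 / 6]), 2)

# The difference is taken between the rounded forecasts, as in the worked answer.
diff = round(wma - ses, 2)
print(ses, wma, diff)
```

Note that the difference is computed from the two rounded forecasts; differencing the unrounded values and rounding afterwards can change the last digit.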
[Answer] SES(0.5) =
[Figure: time series plot of Revenue; vertical axis from 0 to 60,000,000.]
Lind invited you to predict the future revenues in Year 2014. Given the
existence of a seasonal pattern, you adopted a seasonal dummy model. Answer
the following questions according to the Excel and SAS output below.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.978
R Square
Adjusted R Square
Standard Error       3443218.65
Observations         20

ANOVA
              df    SS           MS           F        Significance F
Regression     4    3.88088E+    9.70221E+    81.835   5.30312E-
Residual      15    1.77836E+    1.18558E+
Total         19    4.05872E+

            Coefficients   Standard Error   t Stat        P-value
Intercept   17447690.7     2147703.31       8.12388311    7.12257E
Time        1132370.43                      8.31981952    5.29612E
Q1                         2194629.88       -             6.76499E
Q2                         2181931.82                     0.00066196
Q4          11997271.0     2181931.82       5.49846284    6.12299E-
The residuals of the fitted model are analyzed in SAS. Please refer to the SAS
output.
Type          Lags   Pr > F
Single Mean   0      0.1168
              1      0.0726
              2      0.0370

Autocorrelation Check of Residuals
To Lag   ChiSquare   DF   Pr > ChiSq   Autocorrelations
 6        8.21        6   0.2234        0.271  -0.144  -0.385  -0.209  -0.118  -0.134
12       13.82       12   0.3123       -0.037  -0.058   0.130   0.246   0.194  -0.070
18       19.44       18   0.3655       -0.057  -0.063  -0.043  -0.163   0.023   0.094
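The ChiSquare column of the residual check can be reproduced approximately from the autocorrelations. The sketch below assumes the statistic is the Ljung-Box Q, that the sample size is the 20 observations of the regression output, and that the autocorrelations are ordered by lag 1 to 18 as read from the SAS output; small discrepancies come from the rounded autocorrelations.

```python
def ljung_box(acf, n, max_lag):
    """Ljung-Box Q = n (n + 2) * sum_{k=1..max_lag} r_k^2 / (n - k)."""
    return n * (n + 2) * sum(
        acf[k - 1] ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Residual autocorrelations for lags 1..18 (assumed ordering, see lead-in).
acf = [
    0.271, -0.144, -0.385, -0.209, -0.118, -0.134,   # lags 1-6
    -0.037, -0.058, 0.130, 0.246, 0.194, -0.070,     # lags 7-12
    -0.057, -0.063, -0.043, -0.163, 0.023, 0.094,    # lags 13-18
]
n = 20  # number of observations in the regression output

q6, q12, q18 = (ljung_box(acf, n, m) for m in (6, 12, 18))
print(round(q6, 2), round(q12, 2), round(q18, 2))
```

Each Q is compared with a chi-square distribution whose degrees of freedom equal the number of lags; the large Pr > ChiSq values in the output indicate no significant residual autocorrelation.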
8. What is the 95% interval forecast (using Normal approximation) for the first
quarter of Year 2014 (2014Q1) with Time = 21?
A. [16.6 million, 30.1 million].
B. 23.3 million.
C. [6.6 million, 30.1 million].
D. 16.6 million.
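A 95% interval forecast under the Normal approximation is the point forecast plus or minus 1.96 standard errors. The sketch below uses the regression Standard Error from the Excel output; the point forecast passed in is a hypothetical value for illustration only, since the Q1 coefficient is not legible in the output above, and the function name is an assumption.

```python
def normal_interval(point, std_error, z=1.96):
    """Return (lower, upper) = point -/+ z * std_error."""
    half_width = z * std_error
    return point - half_width, point + half_width


STD_ERROR = 3443218.65   # "Standard Error" from the regression output
point_forecast = 23.35e6  # hypothetical point forecast, illustration only

low, high = normal_interval(point_forecast, STD_ERROR)
print(f"[{low / 1e6:.1f} million, {high / 1e6:.1f} million]")
```

The half-width of about 6.7 million depends only on the standard error, so any point forecast from the fitted model can be substituted.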
Cluster Means K = 2
Cluster   X1       X2       X3       X4      X5      X6
1         214.82   130.30   130.19   10.55   11.13   139.44
2         214.97   129.95   129.72    8.31   10.18   141.50

Cluster Means K = 5
Cluster   X1       X2       X3       X4      X5      X6
1         215.27   130.28   130.01    8.81   10.18   141.22
2         214.91   130.30   130.21    9.52   11.57   139.29
3         214.75   130.30   130.18   11.36   10.75   139.59
4         214.85   129.84   129.61    7.83   10.49   141.54
5         214.95   129.84   129.67    8.95    9.32   141.85
Since the aim was to determine whether a banknote is genuine or counterfeit, it is natural to
set K = 2. One cluster represents the genuine notes and the other the counterfeit notes.
b) Discuss the differences among the clusters of your selected method, in
terms of the 6 measurements.
[4 marks]
According to the K-means result, the length (X1), the left and right heights (X2, X3) and the
distance of the inner frame to the upper border (X5) of genuine and counterfeit banknotes are
quite similar, whereas the two cluster centers differ noticeably in the distance of the inner
frame to the lower border (X4, by 2.24 cm) and in the diagonal length of the central picture
(X6, by 2.06 cm). This implies that one can focus on X4 and X6 to distinguish genuine from
counterfeit banknotes.
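The comparison of the two centers can be reproduced from the K = 2 cluster means. The following is a minimal sketch; the dictionary layout and variable names are illustrative choices, not part of the original output.

```python
# Cluster means for K = 2, copied from the table of cluster means.
centers = {
    1: {"X1": 214.82, "X2": 130.30, "X3": 130.19,
        "X4": 10.55, "X5": 11.13, "X6": 139.44},
    2: {"X1": 214.97, "X2": 129.95, "X3": 129.72,
        "X4": 8.31, "X5": 10.18, "X6": 141.50},
}

# Absolute difference between the two centers, variable by variable.
diffs = {
    var: round(abs(centers[1][var] - centers[2][var]), 2)
    for var in centers[1]
}

# Variables ordered from the largest to the smallest gap.
ranked = sorted(diffs, key=diffs.get, reverse=True)
print(diffs)
print(ranked)
```

Ranking by the raw gaps ignores the different spreads of the six variables; a standardized comparison would be a natural refinement.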
c) Analyst YC also tried the Hierarchical method. The result is consistent with your
selection. However, YC lost the distance measure between two banknotes.
The measurements of the two banknotes are given:
Banknote   X1      X2      X3      X4    X5    X6
1          214.8   131.0   131.1   9.0   9.7   141.0
2          214.6   129.7   129.7   8.1   9.5   141.7

City-block distance:
|214.8 - 214.6| + |131.0 - 129.7| + |131.1 - 129.7| + |9.0 - 8.1| + |9.7 - 9.5| + |141.0 - 141.7| = 4.7.
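The city-block calculation can be verified in a few lines. The sketch below also computes the Euclidean distance for comparison; the function names are illustrative.

```python
import math

# Measurements X1..X6 of the two banknotes, from the table above.
note1 = [214.8, 131.0, 131.1, 9.0, 9.7, 141.0]
note2 = [214.6, 129.7, 129.7, 8.1, 9.5, 141.7]


def city_block(a, b):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


cb = city_block(note1, note2)
eu = euclidean(note1, note2)
print(round(cb, 1), round(eu, 2))
```

Because the Euclidean distance squares each difference, the larger gaps in X2 and X3 carry relatively more weight there than in the city-block sum.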
d) List 3 popular distance measures in cluster analysis and discuss their pros
and cons.
[3 marks]
I. Euclidean Distance: scale dependent. Not affected when new objects are included.
II. Manhattan Distance: scale dependent. Not affected when new objects are included.
Compared to Euclidean Distance, it dampens the effect of outliers because differences
are not squared.
III. Mahalanobis Distance: does not depend on the scales of measurement used. However, it
is affected when new objects are included, because the standard deviations need to be
recalculated for the enlarged sample.
The time series plot displays a time-varying trend with possible seasonal/cyclical variations.
However, the underlying dependence is not very clear. The Holt-Winters (HW) method is flexible
enough to forecast time series whose underlying dynamics are not apparent and which may have
trend and seasonal variations.
b) Interpret the meaning of the optimal smoothing constants.
[4 marks]
The magnitude of each smoothing constant indicates how strongly the most recent values
influence the forecast. With the level constant equal to 1, the level of the time series is
fully updated whenever a new observation arrives. In contrast, the seasonal constant of 0
indicates that the initial seasonal values are quite accurate and the seasonal forecasts
need no updating. The trend constant of 0.16 means that the trend forecast is updated slowly.
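The effect of these constants can be seen in a single Holt-Winters update step. The sketch below assumes the multiplicative-seasonality form of the method; the function name, argument names and the numerical inputs are hypothetical and serve only to illustrate constants of 1 (level), 0.16 (trend) and 0 (season).

```python
def hw_update(y, level, trend, season, w_level, w_trend, w_season):
    """One multiplicative Holt-Winters update for a new observation y.

    `season` is the seasonal index that applies to the period of y.
    """
    new_level = w_level * (y / season) + (1 - w_level) * (level + trend)
    new_trend = w_trend * (new_level - level) + (1 - w_trend) * trend
    new_season = w_season * (y / new_level) + (1 - w_season) * season
    return new_level, new_trend, new_season


# Hypothetical state and observation, for illustration only.
y, level, trend, season = 12.0, 11.0, 0.5, 1.09

new_level, new_trend, new_season = hw_update(
    y, level, trend, season, w_level=1.0, w_trend=0.16, w_season=0.0
)
print(new_level, new_trend, new_season)
```

With a level constant of 1 the new level is simply the deseasonalized observation, with a seasonal constant of 0 the seasonal index is left unchanged, and with a trend constant of 0.16 the trend moves only 16% of the way toward the latest change in level.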
c) Are there seasonal/cyclical variations in the time series?
[3 marks]
Yes, there are seasonal/cyclical variations, with stable seasonal forecasts of 1.09, 1.09, 0.99
and 0.83. In each cycle, the poverty rate in the first two years is 9% higher than the
average, and it is 1% and 17% lower than the average in the remaining two years.
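The percentages follow directly from the seasonal indices, read as multiplicative factors around an average of 1. A small sketch (variable names are illustrative):

```python
# Seasonal indices reported in the answer above.
indices = [1.09, 1.09, 0.99, 0.83]

# Percentage deviation of each index from the average level of 1.
pct_deviation = [round((s - 1) * 100) for s in indices]

# Multiplicative indices over a full cycle should average to about 1.
mean_index = sum(indices) / len(indices)
print(pct_deviation, round(mean_index, 2))
```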
d) List 2 alternative models that are appropriate to analyze the time series of
poverty rates.
[4 marks]
ANOVA
             df     SS            MS           F             Significance F
Regression     8    17491101398   2186387675   49.80435632   1.64008E-43
Residual     199    8736005833    43899526.8
Total        207    26227107231

Coefficients    Standard Error   t Stat         P-value
31829.30731     3405.456736      9.346560469    1.89825E-17
1495.083406     159.6484805      9.364845828    1.68338E-17
-1109.41783     204.3454289      -5.429129666   1.63858E-07
988.3191454     356.2003226      2.774616087    0.006053119
4018.024602     1674.774669      2.399143405    0.017355991
-585.1391184    305.4275282      -1.91580347    0.056823454
1896.763788     4320.417514      0.439023262    0.661120987
-46.21756033    142.7899567      -0.323675148   0.746523894
2.591559621     119.9336832      0.021608272    0.982782086
Age has the largest P-value; in fact it is almost 1. This variable should be deleted and a new
model fitted.
A variable-selection technique is employed, resulting in the new model below:
Regression Statistics
Multiple R           0.816241456
R Square             0.666250115
Adjusted R Square    0.657988979
Standard Error       6582.791103
Observations         208

ANOVA
             df     SS            MS           F             Significance F
Regression     5    17473813213   3494762643   80.64873091   3.12637E-46
Residual     202    8753294018    43333138.7
Total        207    26227107231

Y: Salary             Coefficients    Standard Error   t Stat         P-value
Intercept             32171.74593     924.3653924      34.80414368    2.80414E-87
YrsThisBank           1485.262977     75.76029523      19.60476754    1.24796E-48
YrsThisBank*Female    -1127.603829    87.24034017      -12.92525713   3.03196E-28
YrsPriorBank*Female   1005.09522      286.5924527      3.507054043    0.000558289
ComputerRelated       4092.763696     1639.307593      2.4966417      0.013337044
YrsPriorBank          -608.86039      248.3842939      -2.451283777   0.015084658
The Significance F, the P-values and the number of predictors are all smaller, and the
Adjusted R² is a bit higher; hence this is a better model.
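The comparison can be checked from the two ANOVA tables alone. The sketch below recomputes R², Adjusted R² and F for the full model (8 predictors) and the new model (5 predictors) using the standard formulas; the function and variable names are illustrative.

```python
def fit_summary(ss_reg, ss_res, n, k):
    """Return (R2, adjusted R2, F) from ANOVA sums of squares.

    n is the number of observations and k the number of predictors.
    """
    r2 = ss_reg / (ss_reg + ss_res)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f_stat = (ss_reg / k) / (ss_res / (n - k - 1))
    return r2, adj_r2, f_stat


n = 208  # observations

# Sums of squares copied from the two ANOVA tables.
r2_full, adj_full, f_full = fit_summary(17491101398, 8736005833, n, k=8)
r2_new, adj_new, f_new = fit_summary(17473813213, 8753294018, n, k=5)

print(round(adj_full, 4), round(adj_new, 4))
print(round(f_full, 2), round(f_new, 2))
```

Dropping three predictors lowers R² slightly, but the penalty for model size in the Adjusted R² more than compensates, which is the sense in which the smaller model is preferred.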
(e) How come Age is now not a factor in explaining Salary?
[3 marks]
Age was probably collinear with both YrsPriorBank & YrsThisBank. In the presence of
these 2 variables, Age is no longer needed for explaining Salary.
(f) Based on this new model, compare the expected yearly salary increments
for male and female employees of this bank. Please comment on the
[Figures: boxplots of average Starting pay and Mid-career pay, and a scatterplot of the two;
pay in US$ annually.]
The very top (hence rich) U.S. law schools encourage graduates to take on low-paying
judicial clerkships in the public sector, partly through awarding study-loan-forgiving
post-graduate fellowships. Some of these top graduates continue
into a career of public service, while others become academics (with pay
nowhere near what private law firms offer). However, sought-after clerkships
(especially at the U.S. Supreme Court) are also stepping stones to very high-paying
high-street law firms, so many top students who are not primarily
interested in public-sector positions also elect to take those low-paying jobs, for a
stint, upon graduation.
(a) How do you explain the shape of the boxplot for law graduates' average
starting pay?
[3 marks]
If there are many law schools in the US, the top 25 average salaries would represent only
the right tail of the average-salary distribution. As we start moving left from the right tail
(perhaps representing law schools with few graduates starting in the public sector), the
clustering of salaries will initially become increasingly dense. Hence, the lower boxplot
represents a left-truncated distribution with a very short left tail compared to the long right
tail, where there are outliers.
(b) Please comment on the data summary depicted by the two boxplots.
[3 marks]
The Starting-pay boxplot skews to the right, whereas the more symmetrical Mid-career-pay
boxplot skews to the left, probably because some schools continue to have a large number of
mid-career graduates in the public sector. The top half of the first plot has a bigger range
than the top half of the second plot, although the overall spread is larger for the Mid-career
group. The difference in the medians of starting and mid-career pay is almost US$100,000.
Perhaps graduates in top law firms, who are not in the public sector at mid-career,
cause their schools to be in the top half of the second boxplot, in which case the schools end
up having graduates with rather similarly high average salaries of nearly US$200,000.
(c) Why is the scatterplot without the usual oval or elliptical shape? Please
comment on how this might affect the prediction of Mid-career pay from
Starting pay.
[3 marks]
The left truncation of the Starting-pay distribution largely contributed to the
unusual scatterplot, i.e. the high degree of skewness of the X variable led to the absence of
the oval. Given the small sample size, the one big outlier on the right also makes it hard for
the ellipse to appear. All these make for low forecast accuracy when predicting, for a school
among those with the highest 25 average Starting pay, the corresponding average Mid-career
pay of the school's graduates from their average Starting pay. [In fact, the R² is only
0.2.]
(d) Please write an executive summary for the readers of Forbes magazine.
[3 marks]
Top U.S. law graduates do not all go for the highest-paying jobs in the first instance, since a
stint as a judicial clerk can pay very high dividends later when switching to a private law
firm. This voluntary hardship posting is also encouraged by generous study-loan-payback
subsidies offered by the very best law schools, which encourage their graduates to go into
public service. Hence, the median of the top 25 average-starting-salary law schools is only
about US$80,000 yearly. These salaries are self-reported by graduates, so may represent a
biased sample.
By mid-career, however, only lawyers who are committed to low-paying public service
remain in the judicial system, so the median of the average salaries of lawyers from those
same 25 schools has more than doubled to about US$175,000. Since the proportions of graduates
who moved from the public sector to the private sector are not indicated for the schools, it
is difficult to predict those top schools' graduates' average mid-career salaries from their
respective average starting salaries; presumably, schools with the higher proportions of those
career switches saw the larger jumps in average salaries.
========= END OF PAPER =========