IS4240 Business Intelligence Systems

AY 2013/14 Semester 2
Assignment Data Mining 1


Regression (10 marks)

This question is based on the Bike Sharing dataset taken from the UCI Machine Learning
Repository, http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset (originally from
http://capitalbikeshare.com/system-data).

The original source of the dataset is attributed to:
Fanaee-T, H., Gama, J., Event Labeling Combining Ensemble Detectors and
Background Knowledge, Progress in Artificial Intelligence, 2013, pp. 1-15, Springer
Berlin Heidelberg.

The dataset is concerned with the domain of bike sharing systems. Bike sharing systems are
a new generation of traditional bike rentals in which the whole process, from membership to
rental and return, has become automatic. Through these systems, a user can easily rent a bike
at one location and return it at another. Currently, there are over 500 bike-sharing programs
around the world, comprising over 500 thousand bicycles. Today, there is great interest in
these systems because of their important role in traffic, environmental and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics
of the data generated by these systems make them attractive for research. Unlike other
transport services such as buses or subways, the duration of travel and the departure and
arrival positions are explicitly recorded in these systems. This feature turns a bike sharing
system into a virtual sensor network that can be used for sensing mobility in the city. Hence,
it is expected that most important events in the city could be detected by monitoring these data.

The dataset comes in two versions. In the first version, the bike rental records are organized
by day. In the second version, the bike rental records are organized by hour of day. In
general, you may think of one record in the first version for a particular day as being divided
into 24 records in the second version, i.e., one for each hour of the day. However, if no bike
was rented out in a particular hour, that hour is excluded from the dataset. In other words,
the first version of the dataset contains 731 observations, but the second version contains
fewer than 731 x 24 = 17,544 observations. In fact, the second version of the dataset has only
17,379 observations. The dataset is deemed to have no missing data. This assignment is based
on the second version of the dataset.
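
The record counts above can be checked with a little arithmetic. The following Python sketch uses only the figures quoted in this section; the variable names are illustrative.

```python
# Figures quoted in the dataset description above.
DAYS = 731               # observations in the daily version
HOURS_PER_DAY = 24
HOURLY_RECORDS = 17_379  # observations in the hourly version

# Upper bound if every hour of every day had at least one rental.
max_hourly_records = DAYS * HOURS_PER_DAY

# Hours excluded from the hourly version (per the description, hours with no rentals).
excluded_hours = max_hourly_records - HOURLY_RECORDS

print(max_hourly_records, excluded_hours)
```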

The second version of the dataset consists of 17 variables:

1. instant record index
2. dteday date
3. season season (1: spring, 2: summer, 3: fall, 4: winter)
4. yr year (0: 2011, 1: 2012)
5. mnth month (1 to 12)
6. hr hour (0 to 23)
7. holiday whether day is holiday or not
8. weekday day of the week
9. workingday 1 if day is neither weekend nor holiday, otherwise 0
10. weathersit weather situation
    1: Clear, Few clouds, Partly cloudy
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain +
       Scattered clouds
    4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
11. temp Normalized temperature in Celsius. The values are divided by 41 (max)
12. atemp Normalized feeling temperature in Celsius. The values are divided by 50 (max)
13. hum Normalized humidity. The values are divided by 100 (max)
14. windspeed Normalized wind speed. The values are divided by 67 (max)
15. casual count of casual users
16. registered count of registered users
17. cnt count of total rental bikes including both casual and registered
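
The four normalized variables can be mapped back to their original scale by undoing the division. The sketch below assumes the normalization is a plain division by the stated maximum, as the list above says; the `SCALE` table and the `denormalize` helper are illustrative names, not part of the dataset.

```python
# Divisors as stated in the variable list (assumed to be plain division by the maximum).
SCALE = {"temp": 41, "atemp": 50, "hum": 100, "windspeed": 67}


def denormalize(variable, value):
    """Return a normalized value on its original scale, e.g. temp in Celsius."""
    return value * SCALE[variable]


# A normalized temp of 0.5 corresponds to half of the stated maximum.
print(denormalize("temp", 0.5))
```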

For the purpose of this assignment, the target variable is cnt (i.e., variable 17). There are 12
explanatory variables from season to windspeed (i.e., variables 3 to 14).

Perform the following tasks and answer the respective questions:

1) Using SAS, perform a bivariate correlation analysis on the appropriate explanatory
variables.

List the variables that you have included in the analysis and explain any potential
problem(s) that you detect. (1 mark)
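
As a reminder of what a bivariate correlation analysis measures, here is a minimal Python sketch of the Pearson correlation coefficient for one pair of variables. The `pearson` helper and the sample values are hypothetical and are not taken from the dataset; the task itself is to be carried out in SAS.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only, not taken from the dataset.
x = [0.2, 0.4, 0.6, 0.8]
y = [0.25, 0.40, 0.58, 0.75]
print(round(pearson(x, y), 3))
```

A coefficient close to +1 or -1 between two explanatory variables indicates that they carry largely the same information.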

2) Using SAS, build a multiple linear regression model to predict the value of cnt using all
12 explanatory variables. Request SAS to calculate the Variance Inflation Factor (VIF).
Do NOT recode any explanatory variables.

Report the Model Sum of Squares (MSS), Error Sum of Squares (ESS), F-Value, p-Value
of the F-Value, R² and Adjusted R².

Explain your observation and identify one major problem with this model. (2 marks)
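
The requested statistics are related to one another by standard formulas. The Python sketch below shows how R², Adjusted R² and the overall F-Value follow from MSS and ESS, and how a VIF follows from an auxiliary R². It assumes a full-rank model with an intercept; the helper names and the numbers are hypothetical and are not results from the dataset.

```python
def fit_statistics(mss, ess, n, p):
    """Derive R-squared, adjusted R-squared and the overall F-value.

    mss: Model Sum of Squares, ess: Error Sum of Squares,
    n: number of observations, p: number of explanatory variables
    (assumes a full-rank model with an intercept).
    """
    total_ss = mss + ess
    r2 = mss / total_ss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    f_value = (mss / p) / (ess / (n - p - 1))
    return r2, adj_r2, f_value


def vif(r2_j):
    """VIF of a variable whose regression on the other explanatory
    variables has R-squared r2_j."""
    return 1 / (1 - r2_j)


# Hypothetical numbers for illustration only, not results from the dataset.
r2, adj_r2, f_value = fit_statistics(mss=60.0, ess=40.0, n=21, p=2)
print(round(r2, 4), round(adj_r2, 4), round(f_value, 4))
print(round(vif(0.9), 2))
```

The closer an explanatory variable comes to being a linear combination of the others, the closer its auxiliary R² is to 1 and the larger its VIF.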

3) Correct the problem identified in (2) by removing one explanatory variable from the
model in (2). Which explanatory variable did you remove and why?

Fit the new model and report the MSS, ESS, F-Value, p-Value of the F-Value, R² and
Adjusted R².

Comment on the validity of the model and attempt to interpret those regression
coefficients that are statistically significant. Is there any problem with this model?
(1 mark)

4) Fit a new model with stepwise model selection using the 11 explanatory variables from
(3). Report the MSS, ESS, F-Value, p-Value of the F-Value, R² and Adjusted R².

Is the stepwise model better than the model in (3)? (1 mark)

5) The preceding multiple linear regression models appear to suffer from a common
problem. To resolve this problem, examine each of the 11 explanatory variables from (3)
carefully and attempt to recode them as appropriate. Explain the measures that you have
taken.

Fit a new model using the recoded explanatory variables. Did you notice any major
problem when fitting this model? State the problem that you have encountered and
explain how you have attempted to resolve it.

After resolving the problem, fit a final model and report the MSS, ESS, F-Value, p-Value
of the F-Value, R² and Adjusted R².

Comment on the validity of the final model and attempt to interpret those regression
coefficients that are statistically significant. Is the final model better than the preceding
models? (5 marks)
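
One common way to recode a nominal variable is indicator (dummy) coding, sketched below in Python for the season codes from the variable list. This is only an assumption about what recoding could involve, not the prescribed answer; the `indicator_columns` helper and the sample values are hypothetical.

```python
def indicator_columns(values, levels, reference):
    """Encode a nominal variable as 0/1 indicator columns.

    One column is created per level except the reference level; keeping
    every level alongside an intercept would make the columns linearly
    dependent, because they would sum to the intercept column.
    """
    kept = [level for level in levels if level != reference]
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in kept
    }


# season codes as defined in the variable list: 1 spring, 2 summer, 3 fall, 4 winter.
# The sample values are hypothetical, not rows from the dataset.
season = [1, 2, 3, 4, 1]
encoded = indicator_columns(season, levels=[1, 2, 3, 4], reference=1)
print(encoded)
```

Each resulting coefficient is then read as the difference in cnt relative to the reference level, rather than as the effect of a one-unit increase in an arbitrary numeric code.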
