
I. Preprocessing and scaling

a. Preprocessing

In order to work with the data, we first had to preprocess it. Data-gathering methods
are often loosely controlled, resulting in out-of-range values, impossible data combinations
(e.g., Day: Monday) and missing values. Analyzing data that has not been carefully screened
for such problems can produce misleading results. Thus, the representation and quality of the
data come first and foremost, before running any analysis.

The first thing we did was to handle the DATE column. In the
train_2011_2012_2013.csv file, this column looked like yyyy-dd-mm hh:mm:ss. We
decided to create four columns, YEAR, MONTH, DAY and TIME, using four functions we
wrote:
def get_year(x):
    # Extract the year from a datetime object.
    return x.year

def get_month(x):
    # Extract the month from a datetime object.
    return x.month

def get_day(x):
    # Extract the day of the month from a datetime object.
    return x.day

def get_time(x):
    # Encode the time of day as minutes since midnight.
    hour = x.hour
    minutes = x.minute
    return 60 * hour + minutes
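
These functions can then be applied to the parsed DATE column with pandas, for instance
as in the following minimal sketch (the semicolon separator is an assumption about the
file's format):

import pandas as pd

# Minimal sketch: parse the DATE column, then derive the four new columns.
# The semicolon separator is an assumption about the csv file's format.
df = pd.read_csv("train_2011_2012_2013.csv", sep=";", parse_dates=["DATE"])
df["YEAR"] = df["DATE"].apply(get_year)
df["MONTH"] = df["DATE"].apply(get_month)
df["DAY"] = df["DATE"].apply(get_day)
df["TIME"] = df["DATE"].apply(get_time)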

To avoid impossible data combinations, we converted the categorical variables into
indicator variables with the pandas function get_dummies.
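
For example, a minimal sketch (assuming the assignment column is named
ASS_ASSIGNMENT):

import pandas as pd

# Minimal sketch: turn a categorical column into 0/1 indicator columns.
# The column name ASS_ASSIGNMENT is an assumption.
dummies = pd.get_dummies(df["ASS_ASSIGNMENT"], prefix="ASS")
df = pd.concat([df.drop("ASS_ASSIGNMENT", axis=1), dummies], axis=1)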

During the preprocessing, our main fight was to avoid overfitting. As a reminder,
overfitting occurs when a model is excessively complex, such as having too many parameters
relative to the number of observations. The file provided contained 86 features, which is far
too many. Some of them had no link at all with the prediction (for example
CSPL_Ringtime); they were very difficult to use for our testing set, so we decided to delete
them, all the more so as the submission.txt file only has two features: the date and the
assignment. However, we added other features because they were derived from the date
information and had an impact on the prediction, among them the school holidays and the
days off. We also built a trend feature to make our model more accurate, using the
function rolling_mean applied over the last 3, 5 and 10 days.
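
Such a trend feature could be computed as in the following sketch (CALLS is a
placeholder name for the quantity being predicted; recent versions of pandas replace
the old rolling_mean function with the rolling(...).mean() idiom):

# Sketch: moving averages of the call volume over the last 3, 5 and 10 days.
# CALLS is a placeholder name for the target column.
for window in (3, 5, 10):
    df["TREND_%d" % window] = df["CALLS"].rolling(window, min_periods=1).mean()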

In order to get a training set similar to the testing set, we decided to aggregate the rows
(we summed them) with the pandas function groupby.
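
A sketch of this aggregation (the grouping keys are assumptions):

# Sketch: sum the rows sharing the same date and assignment, so the training
# set has the same granularity as the test set (column names assumed).
df_agg = df.groupby(["DATE", "ASS_ASSIGNMENT"], as_index=False).sum()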

Last but not least, we created the matrix X_test, filling it with the features described
above.
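
A sketch of what this construction could look like (the separator of submission.txt, the
holiday calendar and the feature names are assumptions, not the original code):

import pandas as pd

# Hypothetical sketch of building X_test; the file separator, the holiday
# calendar and the column names are all assumptions.
test = pd.read_csv("submission.txt", sep="\t", parse_dates=["DATE"])
test["YEAR"] = test["DATE"].apply(get_year)
test["MONTH"] = test["DATE"].apply(get_month)
test["DAY"] = test["DATE"].apply(get_day)
test["TIME"] = test["DATE"].apply(get_time)
# school_holidays is an assumed, precomputed set of datetime.date objects.
test["HOLIDAY"] = test["DATE"].dt.date.isin(school_holidays).astype(int)
dummies = pd.get_dummies(test["ASS_ASSIGNMENT"], prefix="ASS")
X_test = pd.concat([test[["YEAR", "MONTH", "DAY", "TIME", "HOLIDAY"]], dummies],
                   axis=1).values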
b. Scaling

Since the range of values of raw data varies widely, objective functions will not work
properly without normalization. For example, most classifiers compute the distance
between two points with the Euclidean distance. If one of the features has a broad range of
values, the distance will be dominated by that particular feature. Therefore, the range of all
features should be normalized so that each feature contributes approximately proportionately
to the final distance.
For every data value, the mean is subtracted and the result is divided by the standard
deviation: x' = (x - mean) / std.

We used the scikit-learn class: Scale = preprocessing.StandardScaler()
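
In practice, the scaler is fitted on the training data and the same transformation is then
applied to the test data, as in this minimal sketch (variable names are assumptions):

from sklearn import preprocessing

# Minimal sketch: fit the scaler on the training features only, then apply the
# same mean/std transformation to both sets so they stay comparable.
Scale = preprocessing.StandardScaler()
X_train_scaled = Scale.fit_transform(X_train)
X_test_scaled = Scale.transform(X_test)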

II. Machine learning algorithms


III. Final submission
