14/03/2019 .NET, TensorFlow, and the Windmills of Kaggle - CodeProject

.NET, TensorFlow, and the Windmills of Kaggle


LOST_FREEMAN, 23 Feb 2019

Hands-on data science competition with TensorFlow on .NET

This is a series of articles about my ongoing journey into the dark forest of Kaggle
competitions as a .NET developer.

I will be focusing on (almost) pure neural networks in this and the following articles. This
means that most of the boring parts of dataset preparation, like filling in missing
values, feature selection, outlier analysis, etc., will be intentionally skipped.

The tech stack is C# plus TensorFlow's tf.keras API. As of today, it also requires
Windows. Larger models in future articles may need a suitable GPU for their training
time to remain sane.

Let's Predict Real Estate Prices!


House Prices is a great competition for novices to start with. Its dataset is small, there are no special rules, the public leaderboard has
many participants, and you can submit up to 4 entries a day.

Register on Kaggle if you have not done so yet, join this competition, and download the data. The goal is to predict the sale price
(SalePrice column) for the entries in test.csv. The archive contains train.csv, which has about 1500 entries with known sale prices to train
on. We'll begin by loading that dataset and exploring it a little before getting into neural networks.

Analyze Training Data


Did I say we will skip the dataset preparation? I lied! You have to take a look at least once.

To my surprise, I did not find an easy way to load a .csv file in the .NET standard class library, so I installed a NuGet package, called
CsvHelper. To simplify data manipulation, I also got my new favorite LINQ extension package MoreLinq.

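// Requires: using System.Data; using System.IO; using CsvHelper;
static DataTable LoadData(string csvFilePath) {
    var result = new DataTable();
    // Note: recent CsvHelper releases also expect a CultureInfo argument in the CsvReader constructor.
    using (var reader = new CsvDataReader(new CsvReader(new StreamReader(csvFilePath)))) {
        result.Load(reader);
    }
    return result;
}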

Using DataTable for training data manipulation is, actually, a bad idea.

ML.NET is supposed to support .csv loading and many of the data preparation and exploration operations. However, it was not
ready for that particular purpose yet when I first entered the House Prices competition.

The data looks like this (only a few rows and columns):

Id  MSSubClass  MSZoning  LotFrontage  LotArea
1   60          RL        65           8450
2   20          RL        80           9600
3   60          RL        68           11250
4   70          RL        60           9550

After loading data, we need to remove the Id column, as it is actually unrelated to the house prices:

var trainData = LoadData("train.csv");
trainData.Columns.Remove("Id");

Analyzing the Column Data Types


DataTable does not automatically infer the data types of the columns and assumes they are all strings. So the next step is to
determine what we actually have. For each column, I computed the following statistics: the number of distinct values, how many of
them are integers, and how many of them are floating-point numbers (the source code with all helper methods is linked at the
end of the article):

var values = rows.Select(row => (string)row[column]);
double floats = values.Percentage(v => double.TryParse(v, out _));
double ints = values.Percentage(v => int.TryParse(v, out _));
int distincts = values.Distinct().Count();
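
Percentage is one of the helper methods that is not shown in the article. A minimal sketch of what it might look like follows; the class name, the signature, and the choice to return a fraction between 0 and 1 (rather than a number between 0 and 100) are assumptions, not the author's actual code.

```csharp
using System;
using System.Collections.Generic;

static class ColumnStats {
    // Hypothetical helper: fraction (0..1) of items that satisfy the predicate.
    // Returns 0 for an empty sequence instead of dividing by zero.
    public static double Percentage<T>(this IEnumerable<T> items, Func<T, bool> predicate) {
        int total = 0, matching = 0;
        foreach (T item in items) {
            total++;
            if (predicate(item)) matching++;
        }
        return total == 0 ? 0.0 : (double)matching / total;
    }
}
```

For a column holding "1", "2.5" and "NA", the integer fraction under this sketch is one third, because only "1" parses as an int.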

Numeric Columns
It turns out that most columns are actually ints, but since neural networks mostly work on floating-point numbers, we will convert them
to doubles anyway.

Categorical Columns
Other columns describe categories the property on sale belonged to. None of them have too many different values, which is good.
To use them as an input for our future neural network, they have to be converted to double too.

Initially, I simply assigned numbers from 0 to distinctValueCount - 1 to them, but that does not make much sense, as
there is actually no progression from "Facade: Blue" through "Facade: Green" into "Facade: White". So early on, I
changed that to what's called a one-hot encoding, where each unique value gets a separate input column. E.g. "Facade: Blue"
becomes [1,0,0], and "Facade: White" becomes [0,0,1].
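
To make the encoding concrete, here is a minimal sketch of one-hot encoding for a single categorical value. The OneHot class and its Encode method are illustrative names, and the facade colors are the made-up example from the text, not a real column of the dataset.

```csharp
using System;

static class OneHot {
    // Returns a vector with a single 1 at the position of `value` in `domain`.
    // A value that is not in the domain produces an all-zero vector.
    public static double[] Encode(string[] domain, string value) {
        var vector = new double[domain.Length];
        int index = Array.IndexOf(domain, value);
        if (index >= 0) vector[index] = 1;
        return vector;
    }
}
```

With the domain { "Blue", "Green", "White" }, OneHot.Encode(domain, "Blue") gives [1,0,0] and OneHot.Encode(domain, "White") gives [0,0,1].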

Getting Them All Together

CentralAir: 2 values, ints: 0.00%, floats: 0.00%
Street: 2 values, ints: 0.00%, floats: 0.00%
Utilities: 2 values, ints: 0.00%, floats: 0.00%
....
LotArea: 1073 values, ints: 100.00%, floats: 100.00%

Many value columns:
Exterior1st: AsbShng, AsphShn, BrkComm, BrkFace, CBlock, CemntBd, HdBoard,
ImStucc, MetalSd, Plywood, Stone, Stucco, VinylSd, Wd Sdng, WdShing
Exterior2nd: AsbShng, AsphShn, Brk Cmn, BrkFace, CBlock, CmentBd, HdBoard,
ImStucc, MetalSd, Other, Plywood, Stone, Stucco, VinylSd, Wd Sdng, Wd Shng
Neighborhood: Blmngtn, Blueste, BrDale, BrkSide, ClearCr, CollgCr, Crawfor,
Edwards, Gilbert, IDOTRR, MeadowV, Mitchel, NAmes, NoRidge, NPkVill,
NridgHt, NWAmes, OldTown, Sawyer, SawyerW, Somerst, StoneBr, SWISU, Timber, Veenker

non-parsable floats:
GarageYrBlt: NA
LotFrontage: NA
MasVnrArea: NA

float ranges:
BsmtHalfBath: 0...2
HalfBath: 0...2
...
GrLivArea: 334...5642
LotArea: 1300...215245

With that in mind, I built the following ValueNormalizer, which takes some information about the values inside the column
and returns a function that transforms a value (a string) into a numeric feature vector for the neural network (double[]):

static Func<string, double[]> ValueNormalizer(double floats, IEnumerable<string> values) {
    if (floats > 0.01) {
        // Numeric column: scale by the largest value; unparsable cells (e.g. "NA") become -1.
        double max = values.AsDouble().Max().Value;
        return s => new[] { double.TryParse(s, out double v) ? v / max : -1 };
    } else {
        // Categorical column: one-hot encode; index 0 is used for values missing from the domain.
        string[] domain = values.Distinct().OrderBy(v => v).ToArray();
        return s => new double[domain.Length+1]
            .Set(Array.IndexOf(domain, s)+1, 1);
    }
}
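
ValueNormalizer relies on two helpers, AsDouble and Set, that are not shown here. The sketch below is one plausible shape for them, inferred only from how they are called above; the class name, the exact signatures, and the use of the invariant culture for parsing are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class NormalizerHelpers {
    // Hypothetical: parses each string, yielding null for cells that are not numbers (e.g. "NA").
    public static IEnumerable<double?> AsDouble(this IEnumerable<string> values) {
        return values.Select(text =>
            double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out double parsed)
                ? parsed
                : (double?)null);
    }

    // Hypothetical: writes `value` at `index` and returns the same array, so the call can be chained.
    public static double[] Set(this double[] array, int index, double value) {
        array[index] = value;
        return array;
    }
}
```

Max() over a sequence of double? skips the nulls, which is why values.AsDouble().Max().Value works on a column that contains a few "NA" cells.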

Now we've got the data converted into a format suitable for a neural network. It is time to build one.

Build a Neural Network


If you already have Python 3.6 and TensorFlow 1.10.x installed, all you need is:

<PackageReference Include="Gradient" Version="0.1.10-tech-preview4" />

in your modern .csproj file. Otherwise, refer to the Gradient manual to do the initial setup.

Once the package is up and running, we can create our first shallow deep network.

using tensorflow;
using tensorflow.keras;
using tensorflow.keras.layers;
using tensorflow.train;

...

var model = new Sequential(new Layer[] {
    new Dense(units: 16, activation: tf.nn.relu_fn),
    new Dropout(rate: 0.1),
    new Dense(units: 10, activation: tf.nn.relu_fn),
    new Dense(units: 1, activation: tf.nn.relu_fn),
});

model.compile(optimizer: new AdamOptimizer(), loss: "mean_squared_error");

This will create an untrained neural network with 3 neuron layers and a dropout layer that helps to prevent overfitting.

tf.nn.relu_fn is the activation function for our neurons. ReLU is known to work well in deep networks because it mitigates the vanishing
gradient problem: the derivatives of earlier saturating activation functions, such as sigmoid and tanh, tended to become very small as the
error propagated back from the output layer. That meant that the layers closer to the input would adjust only very slightly, which
slowed the training of deep networks significantly.
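
A rough numeric illustration of the effect is sketched below. It multiplies the activation derivative across several layers while ignoring the weights entirely, which is a simplification made for this example; the class and method names are illustrative.

```csharp
using System;

static class GradientDecay {
    // Derivative of the logistic sigmoid; it never exceeds 0.25 (reached at x = 0).
    public static double SigmoidDerivative(double x) {
        double s = 1.0 / (1.0 + Math.Exp(-x));
        return s * (1.0 - s);
    }

    // Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    public static double ReluDerivative(double x) => x > 0 ? 1.0 : 0.0;

    // Factor by which the error signal shrinks after passing back through `layers` layers,
    // assuming every layer sees the same pre-activation x and all weights are 1.
    public static double ChainedDerivative(Func<double, double> derivative, double x, int layers)
        => Math.Pow(derivative(x), layers);
}
```

Even in the best case for sigmoid (x = 0), ten layers scale the signal by 0.25 to the power of 10, which is below one millionth, while ReLU with a positive input keeps the factor at exactly 1.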

Dropout is a special-purpose layer in neural networks, which does not contain neurons as such. Instead, during training it takes
each individual input and randomly replaces it with 0 in its own output, with the probability given by rate. The remaining values are
passed along (Keras additionally scales them up by 1/(1 - rate), so that the expected sum stays the same). By
doing so, it helps to prevent overfitting to less relevant features in a small dataset. For example, if we had not removed the Id
column, the network could potentially have memorized the <Id> -> <SalePrice> mapping exactly, which would give us 100%
accuracy on the training set, but completely unrelated numbers on any other data.

Why do we need dropout? Our training data has only ~1500 examples, and this tiny neural network we've built has > 1800 tunable
weights. If it were a simple polynomial, it could match the price function we are trying to approximate exactly on the training set.
But then it would produce enormous values on any inputs outside of the original training set.
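
Both ideas in this section can be sketched in a few lines of plain C#. DropoutSketch mimics what a dropout layer does to one input vector during training (including the rescaling of surviving values that Keras applies), and ParameterCount counts the tunable weights of a stack of dense layers, assuming one bias per neuron. The class names and the input width are illustrative assumptions; the article does not state how many input features the network receives.

```csharp
using System;

static class DropoutSketch {
    // Zeroes each input with probability `rate`; surviving values are scaled by 1 / (1 - rate).
    public static double[] Apply(double[] inputs, double rate, Random random) {
        double scale = 1.0 / (1.0 - rate);
        var output = new double[inputs.Length];
        for (int i = 0; i < inputs.Length; i++)
            output[i] = random.NextDouble() < rate ? 0.0 : inputs[i] * scale;
        return output;
    }
}

static class ParameterCount {
    // A dense layer has one weight per (input, unit) pair plus one bias per unit.
    public static int Dense(int inputs, int units) => (inputs + 1) * units;

    // Total for a stack of dense layers; dropout layers add no parameters.
    public static int Total(int inputWidth, params int[] layerUnits) {
        int total = 0, inputs = inputWidth;
        foreach (int units in layerUnits) {
            total += Dense(inputs, units);
            inputs = units;
        }
        return total;
    }
}
```

Under these assumptions, ParameterCount.Total(101, 16, 10, 1) is 1813, so a figure above 1800 implies an input vector of at least 101 features after one-hot encoding.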

Feed the Data


TensorFlow expects its data either in NumPy arrays, or existing tensors. I am converting DataRows into NumPy arrays:

using numpy;

...

const string predict = "SalePrice";

ndarray GetInputs(IEnumerable<DataRow> rowSeq) {
    return np.array(rowSeq.Select(row => np.array(
            columnTypes
                .Where(c => c.column.ColumnName != predict)
                .SelectMany(column => column.normalizer(
                    row.Table.Columns.Contains(column.column.ColumnName)
                        ? (string)row[column.column.ColumnName]
                        : "-1"))
                .ToArray()))
        .ToArray()
    );
}

var predictColumn = columnTypes.Single(c => c.column.ColumnName == predict);
ndarray trainOutputs = np.array(predictColumn.trainValues
    .AsDouble()
    .Select(v => v ?? -1)
    .ToArray());
ndarray trainInputs = GetInputs(trainRows);

In the code above, we convert each DataRow into an ndarray by taking every cell in it, and applying the ValueNormalizer
corresponding to its column. Then, we put all rows into another ndarray, getting an array of arrays.

No such transform is needed for outputs, where we just convert train values to another ndarray.

Time to Get Down the Gradient


With this setup, all we need to do to train our network is to call the model's fit function:

model.fit(trainInputs, trainOutputs,
    epochs: 2000,
    validation_split: 0.075,
    verbose: 2);

This call will actually set aside the last 7.5% of the training set for validation, then repeat the following 2000 times:

1. Split the rest of trainInputs into batches
2. Feed these batches one by one into the neural network
3. Compute error using the loss function we defined above
4. Backpropagate the error through the gradients of individual neuron connections, adjusting weights
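
The loop described above can be sketched in plain C# for the simplest possible model, a line y = w * x + b trained with mean squared error. This is an illustration of the procedure only, under several stated simplifications: it uses plain gradient descent instead of Adam, it does not shuffle the training samples between epochs, and all names (MiniBatchTraining, Fit, Loss) and default values are made up for this example.

```csharp
using System;
using System.Linq;

static class MiniBatchTraining {
    // Mean squared error of y = w * x + b on the given samples (NaN if there are none).
    public static double Loss(double[] xs, double[] ys, double w, double b)
        => xs.Length == 0
            ? double.NaN
            : xs.Select((x, i) => Math.Pow(w * x + b - ys[i], 2)).Average();

    public static (double w, double b, double loss, double valLoss) Fit(
        double[] xs, double[] ys, int epochs, double validationSplit,
        int batchSize = 32, double learningRate = 0.1) {
        // Set aside the last part of the data for validation.
        int trainCount = (int)(xs.Length * (1 - validationSplit));
        double[] trainX = xs.Take(trainCount).ToArray(), trainY = ys.Take(trainCount).ToArray();
        double[] valX = xs.Skip(trainCount).ToArray(), valY = ys.Skip(trainCount).ToArray();

        double w = 0, b = 0;
        for (int epoch = 0; epoch < epochs; epoch++) {
            // Steps 1 and 2: walk over the training part batch by batch.
            for (int start = 0; start < trainCount; start += batchSize) {
                int end = Math.Min(start + batchSize, trainCount);
                double gradW = 0, gradB = 0;
                for (int i = start; i < end; i++) {
                    // Step 3: prediction error; the loss is its square, averaged over the batch.
                    double error = w * trainX[i] + b - trainY[i];
                    gradW += 2 * error * trainX[i];
                    gradB += 2 * error;
                }
                // Step 4: move the weights against the gradient of the batch loss.
                w -= learningRate * gradW / (end - start);
                b -= learningRate * gradB / (end - start);
            }
        }
        return (w, b, Loss(trainX, trainY, w, b), Loss(valX, valY, w, b));
    }
}
```

Feeding it noiseless points from y = 3x + 1 with x between 0 and 1 should bring w close to 3 and b close to 1, with both the training and the validation loss near zero.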

While training, it will output the network's error on the data it set aside for validation as val_loss, and the error on the training
data itself as just loss. Generally, if val_loss becomes much greater than loss, it means the network has started overfitting. I
will address that in more detail in the following articles.

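If you did everything correctly, the square root of either of your losses should be on the order of 20,000. Since the loss is the mean
squared error, its square root is roughly the typical prediction error in the units of SalePrice.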

Submission
I won't talk much about generating the file to submit here. The code to compute outputs is simple:

const string SubmissionInputFile = "test.csv";
DataTable submissionData = LoadData(SubmissionInputFile);
var submissionRows = submissionData.Rows.Cast<DataRow>();
ndarray submissionInputs = GetInputs(submissionRows);
ndarray submissionOutputs = model.predict(submissionInputs);

which mostly uses functions that were defined earlier.

Then you need to write them into a .csv file, which is simply a list of Id, predicted_value pairs.
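
A minimal sketch of that last step is shown below. It assumes the predictions have already been copied out of the ndarray into a plain list of doubles (how to do that depends on the Gradient version and is not shown), and that the file uses the Id,SalePrice header of the competition's sample submission. The Submission class is an illustrative name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

static class Submission {
    // Writes a header line followed by one "Id,price" line per prediction.
    // ids[i] must be the Id of the test row that produced predictions[i].
    public static void Write(TextWriter writer, IReadOnlyList<string> ids, IReadOnlyList<double> predictions) {
        if (ids.Count != predictions.Count)
            throw new ArgumentException("ids and predictions must have the same length");
        writer.WriteLine("Id,SalePrice");
        for (int i = 0; i < ids.Count; i++) {
            // Invariant culture keeps '.' as the decimal separator regardless of OS settings.
            writer.WriteLine(ids[i] + "," + predictions[i].ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

To produce a file, pass a StreamWriter, for example inside using (var writer = new StreamWriter("submission.csv")) { ... }; the file name here is arbitrary.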

When you submit your result, you should get a score on the order of 0.17, which would be somewhere in the last quarter of the
public leaderboard table. But hey, if it were as simple as a 3-layer network with 27 neurons, those pesky data scientists would not be
getting $300k+/y total compensation from the major US companies.
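
To get a feel for what a score of 0.17 means, the sketch below computes the root mean squared error between the logarithms of predicted and actual prices. This assumes that is how the competition scores submissions, as its evaluation page describes; the Score class is an illustrative name, and the numbers in the usage note are made up.

```csharp
using System;

static class Score {
    // Root mean squared error between log(predicted) and log(actual).
    // All prices must be positive; a prediction of 0 would make the score infinite.
    public static double LogRmse(double[] predicted, double[] actual) {
        if (predicted.Length != actual.Length || predicted.Length == 0)
            throw new ArgumentException("arrays must be non-empty and of equal length");
        double sum = 0;
        for (int i = 0; i < predicted.Length; i++) {
            double diff = Math.Log(predicted[i]) - Math.Log(actual[i]);
            sum += diff * diff;
        }
        return Math.Sqrt(sum / predicted.Length);
    }
}
```

Because the error is measured on logarithms, it is relative: a prediction that is too high by a factor of e^0.17 (about 18.5%) on every house scores 0.17, whether the house costs 100,000 or 500,000. Note that the last layer of our network uses ReLU, so it can output exactly 0, which this metric cannot handle.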

Wrapping Up
The full source code for this entry (with all of the helpers, and some commented-out parts of my earlier exploration and
experiments) is about 200 lines on PasteBin.

In the next article, you will see my shenanigans trying to get into the top 50% of that public leaderboard. It's going to be an amateur
journeyman's adventure, a fight with The Windmill of Overfitting with the only tool the wanderer has: a bigger model (i.e., a deep
NN; remember, no manual feature engineering!). It will be less of a coding tutorial, and more of a thought quest with really crooked
math and a weird conclusion.

Stay tuned!

Links
Kaggle
House Prices competition on Kaggle
TensorFlow regression tutorial
TensorFlow home page
TensorFlow API reference
Gradient (TensorFlow binding)

License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author


LOST_FREEMAN
No Biography provided
United States

Article Copyright 2019 by LOST_FREEMAN
Comments: https://www.codeproject.com/Articles/1278115/NET-TensorFlow-and-the-Windmills-of-Kaggle