I’m going through the Kaggle Intermediate Machine Learning course again to make sure that I understand the material as I remember feeling a bit lost when I started on a different project.
Here are some things that I’ve gone over again from the “Missing Values” section of the course.
Introduction
In machine learning, we do not like missing values. They make the code uncomfortable and it refuses to work. In this case, a missing value is “NaN”. A missing value isn’t “0” – imagine an empty cell rather than a zero. If we don’t know anything about a value, then we can’t use it to make a prediction.
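As a quick (made-up) illustration of the difference, pandas treats a 0 as real information but an empty cell as NaN:

import numpy as np
import pandas as pd

# A toy column of house prices: 0 is information, np.nan is an empty cell
prices = pd.Series([200_000, 0, np.nan])

print(prices.isnull())  # only the last entry is flagged as missing
print(prices.mean())    # NaN is skipped but 0 is not: (200000 + 0) / 2 = 100000.0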
As a result, there are a few techniques we can use when playing around with our data to deal with missing values. We can get rid of the column completely (including the rows _with_ information), fill in the missing values with a certain strategy, or fill in the missing values while keeping track of which ones we filled in.
One final note for those who haven’t come across this series before – I write in Python, I tell bad jokes and I’m not good at this.
The techniques
First, let’s simply find the columns with missing values:
missing_columns = [col for col in training_data.columns if training_data[col].isnull().any()]
This uses a list comprehension – it’s great, and worth wrapping your head around if you haven’t already.
The Kaggle course uses Mean Absolute Error (MAE) to test how good a prediction is. In short, if you have a prediction and data you know is correct… how far away is your prediction from the true value? The closer to 0 the better (in most cases – I suppose it may not be that helpful if you think you’re overfitting your data).
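As a rough sketch (the numbers here are made up, not from the course), scikit-learn’s mean_absolute_error does exactly this calculation:

from sklearn.metrics import mean_absolute_error

# Made-up true values and predictions, purely for illustration
y_true = [100, 200, 300]
y_pred = [110, 190, 305]

# Average of |prediction - truth|: (10 + 10 + 5) / 3 ≈ 8.33
print(mean_absolute_error(y_true, y_pred))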
Dropping values
If we come across a column with a missing value, we can opt to drop it completely.
In this case, we may end up missing out on a lot of information. For example, if I have a column with ten thousand rows and 100 of them are NaN, then dropping the column throws away 9,900 perfectly good pieces of information!
smaller_training_set = training_data.drop(missing_columns, axis=1)
smaller_validation_set = validation_data.drop(missing_columns, axis=1)
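To see what dropping columns actually costs you, here’s a minimal sketch of how you might score each version – assuming y_train and y_valid already exist, and using a random forest plus MAE (similar in spirit to the helper the Kaggle course uses, though the variable names here are mine):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Train a small random forest and report MAE on the validation set
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    return mean_absolute_error(y_valid, predictions)

print(score_dataset(smaller_training_set, smaller_validation_set, y_train, y_valid))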

Filling in the values
This uses scikit-learn’s “SimpleImputer” – imputation is the act of replacing missing values by inferring information from the values you do have.
You can take the average of the values (mean, median etc) and just fill in all of the missing values with that information. It isn’t perfect, but nothing is. We can do this using the following:
import pandas as pd
from sklearn.impute import SimpleImputer

SimpImp = SimpleImputer()  # the default strategy fills gaps with the column mean
imputed_X_train = pd.DataFrame(SimpImp.fit_transform(X_train), columns=X_train.columns)
imputed_X_valid = pd.DataFrame(SimpImp.transform(X_valid), columns=X_valid.columns)
So if you’re a bit confused as to why we use “.fit_transform()” on the training set but only “transform()” on the validation set, here is why:
“fit_transform()” performs two actions in one line. It “fits” the imputer by calculating the fill value for each column of the training set (by default, the mean). It then transforms the training data by filling the gaps with those values.
We don’t want to “fit” on the validation data because the prediction is made based on what has been learned from the training set. The validation set is different data, and we want to know how well the model performs when faced with new information – so we only “transform” it using what was learned from training.
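As a quick sanity check (not from the course, just something I find reassuring), you can peek at what the imputer learned during “fit” – with the default strategy it’s simply the mean of each training column:

# The fill values learned during fit come from X_train only, never X_valid
print(SimpImp.statistics_)

# With the default "mean" strategy this matches the training column means
print(X_train.mean().values)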
This method generally performs better than just dropping columns, and it is possible to play around with the imputation strategy. The default replaces missing values with the mean, but it’s worthwhile trying the other strategies (median, most frequent) and seeing what gives you better results.
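For example, swapping in the median is a one-word change (a sketch, reusing the same X_train from above):

# strategy can be "mean" (default), "median", "most_frequent" or "constant"
median_imputer = SimpleImputer(strategy="median")
imputed_X_train_median = pd.DataFrame(
    median_imputer.fit_transform(X_train), columns=X_train.columns
)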
It’s also important to note that this only works with numerical information! Why? A lot of data sets have missing values in non-numerical columns too. For example, if you had the colour of a house in your data set, trying to take the mean of green and orange is impossible (it’s also ugly).
The end
There are many other methods with different pros and cons (there’s forward filling and backward filling, which I won’t go into detail on here) but I want to keep this post relatively short.
This is where bias and mistakes begin to creep into the model, because we are always making decisions about what to do with the information that we have. That’s important to keep in mind as models become more complex.
Hope you enjoyed and happy machine learning.
If you want to listen to this post, you can here.
