March 2021 – Improving Slowly

And we back. Welcome to Data Science Some Days. A series where I go through some of the things I’ve learned in the field of Data Science.

For those who are new… I write in Python, these aren’t tutorials, and I’m bad at this.

Today, we’re going to go talk about Pipelines – a tool that seems remarkably helpful in reducing the amount of code needed to achieve an end result.

Why pipelines?

When creating a basic model, you need a few things:

Exploratory Data Analysis
Data cleaning and sorting
Predictions
Testing
Return to number one or two and repeat.

As you make different decisions with your data, the more complex the information, the easier it might be to miss a step or simply forget the steps you’ve made! My first model involved so many random bits of code. If I had to reread it, I’d have no idea what I wanted to do. That’s even with comments.

Pipelines help combine different steps into smaller amounts of code so it’s easier. Let’s take a look at what this might look like.

The data we have to work with

As we may know, the data we work with might have many different types of data. Within the columns, they may also have missing values everywhere. From the previous post, we learned that’s a no go. We need to fill in our data with something.

test_dataset.isna().sum()

pickles 100
oats 22
biscuits 15bananas 0

Alright, a bunch of missing values, let’s see what data type they are.

test_dataset.dtype

pickles int64
oats object
biscuits object
bananas int64

We have numbers (int64) and strings (object – not strictly just strings but we’ll work with this).

So we know that we have to fill in the values for 3 of the 4 columns we have. Additionally, we have to do this for different kinds of data. We can utilise a Simple Imputation and OneHotEncoder to do this. Let’s try to do this in as few steps as possible.

Creating a pipeline

#fills missing values with the mean across the column when applied to dataset
numerical_transformer = SimpleImputer(strategy="mean") 

# replaces the categorical columns with a number (1 for yes, 0 for no) when applied to dataset
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Let's apply the above steps using a "column transformer". A very helpful tool when
preprocesser = ColumnTransformer(transformers=[("num", numerical_transformer, numerical_columns),
("cat", categorical_transformer, categorical_columns)])

Ok, we’ve done a lot here. We’ve defined the methods we want to use to fill in missing values and how we’re going to handle categorical variables.

Just to prevent confusion, “ColumnTransformer” can be imported from “sklearn.compose”.

Now we know that the methods we are using are consistent for the entirety of the dataset. If we want to change this, it’ll be easier to find and simpler to change.

Then we can put this all into a pipeline which contains the model we wish to work with:

# bundles the above into a pipeline containing a model
pipeline_model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

This uses a small amount of code. So now we can use this when fitting our model to the training data we have then later making predictions. This is instead of trying to make sure our tables are all clean and accidentally applying predictions to the wrong table (I have done this…), we can just send the raw data through this pipeline and we’ll be left with a prediction.

It’ll look something like this:

fitted_model = pipeline_model.fit(X_train, y_train)
my_prediction = pipeline_model.predict(X_valid)

Conclusion

Pipelines have been an interesting introduction to my Data Science journey and I hope this helps give a rough idea of what they are and why they might be useful.

Of course, they can (and will) become more complex if you are faced with more difficult problems. You might want to apply different methods of filling in missing values to different sections of the data set. You might want to test multiple different models. You might want to delete Python from your computer because you keep getting random errors.

Whatever it may be, just keep trying, and experimenting.

Some helpful resources:

Pipelines | Kaggle Intermediate Machine Learning – Alexis Cook
A Simple Guide to Scikit-learn Pipelines | Rebecca Vickery
A Simple Example of Pipeline in Machine Learning with Scikit-learn | Saptashwa Bhattacharyya

They’re all better than this post. Promise – it’s not hard to do.

I’m going through the Kaggle Intermediate Machine Learning course again to make sure that I understand the material as I remember feeling a bit lost when I started on a different project.

Here are some things that I’ve gone over again from the “Missing Values” section of the course.

Introduction

In machine learning, we do not like missing values. They make the code uncomfortable and it refuses to work. In this case, a missing value is “NaN”. A missing value isn’t “0”. Imagine an empty cell rather than “0”. If we don’t know anything about a value when we can’t use it to make a prediction.

As a result, there are a few techniques we can use when playing around with our data to remove missing values. We can do that by getting rid of the column completely (including the rows _with_ information), filling in the missing values with a certain strategy, and filling in missing values but making sure we say which ones we’ve filled in.

Another final note before those who haven’t come across this series before – I write in Python, I tell bad jokes and I’m not good at this.

The techniques

First, let’s simply find the columns with missing values:

missing_columns = [x for x in training_data[x] if train_columns[x].isnull().isna()]

This uses List Comprehension – it’s great. I’d use that link to wrap your head around it (if you haven’t).

In the Kaggle course, it uses a Mean Absolute Error method of testing how good a prediction is. In short, if you have a prediction and data you know is correct… how far away is your prediction from the true value? The closer to 0 the better (in most cases. I suppose it may not be that helpful if you think you’re overfitting your data).

Dropping values

If we come across a column with a missing value, we can opt to drop it completely.

In this case, we may end up missing out on a lot of information. For example, I have a column with ten thousand rows and 100 are NaN, then I can no longer use over 9 thousand pieces of information!

smaller_training_set = training_data.drop(missing_columns, axis=1)
smaller_validation_set = validation_data.drop(missing_columns, axis=1)

Comic of man knocking down a column from a table with two people walking by — Maybe dropping all of them is overkill

Filling in the values

This uses a function called “Simple Imputation” – the act of replacing missing values by inferring information from the values you do have.

You can take the average of the values (mean, median etc) and just fill in all of the missing values with that information. It isn’t perfect, but nothing is. We can do this using the following:

SimpImp = SimpleImputer()

imputed_X_train = pd.DataFrame(SimpImp.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(SimpImp.transform(X_valid))

So if you’re a bit confused as to why we use “.fit_transform()” on the training set but only “transform()” on the validation set, here is why:

“fit_transform()” performs two actions in one line. It “fits” the data by calculating the mean and variance of the training set. It then transforms the rest of the features using that mean and variance information. We are training the model.

We don’t want to do “fit” the validation data because the prediction is made based on what has been learned from the training set. The data set is different and we want to know how well the model performs when faces with new information.

Here is a better explanation. | Here is another better explanation.

This method performs better than just dropping values and it is possible to play around with the method in imputation. The default is replacing the values with the mean but it’s worthwhile playing around and seeing what gives you better results.

It’s also important to note that this only works with numerical information! Why? A lot of data sets have missing labels. For example, if you had the colour of a house in your data set, trying to get the mean of green and orange is impossible (it’s also ugly).

The end

There are many other methods with different pros and cons (there’s forward filling and backward filling which I won’t go into detail here) but I want to keep this post relatively short.

This is were bias and mistakes begin to creep into the model because we are always making decisions about what to do with the information that we have. That’s important to keep on mind as they become more complex.

Hope you enjoyed and happy machine learning.

My Twitter.

If you want to listen to this post, you can here.

Month: March 2021

I’m lost and can’t get out of this pipeline | Data Science Some Days

Why pipelines?

The data we have to work with

Creating a pipeline

Conclusion

No data left behind | Data Science Somedays

Why pipelines?

The data we have to work with

Creating a pipeline

Conclusion

If you liked this post, share it with others!

If you liked this post, share it with others!