And we back. Welcome to Data Science Some Days. A series where I go through some of the things I’ve learned in the field of Data Science.
For those who are new… I write in Python, these aren’t tutorials, and I’m bad at this.
Today, we’re going to talk about Pipelines – a tool that seems remarkably helpful in reducing the amount of code needed to achieve an end result.
Why pipelines?
When creating a basic model, you need a few things:
- Exploratory Data Analysis
- Data cleaning and sorting
- Predictions
- Testing
- Return to step one or two and repeat.
As you make different decisions with your data, and as the information gets more complex, it becomes easier to miss a step or simply forget the steps you’ve taken! My first model involved so many random bits of code that if I had to reread it, I’d have no idea what I wanted to do. That’s even with comments.
Pipelines help combine these steps into a smaller amount of code, so the whole process is easier to read and repeat. Let’s take a look at what this might look like.
The data we have to work with
As we may know, the data we work with can contain many different types. Within the columns, there may also be missing values everywhere. From the previous post, we learned that’s a no go. We need to fill in our data with something.
test_dataset.isna().sum()
pickles 100
oats 22
biscuits 15
bananas 0
Alright, a bunch of missing values. Let’s see what data types they are.
test_dataset.dtypes
pickles int64
oats object
biscuits object
bananas int64
We have numbers (int64) and strings (object – not strictly just strings, but we’ll work with this).
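The real dataset isn’t shown in this post, so if you want to follow along, here’s a made-up stand-in with the same columns (note that a numeric column containing NaNs will show up as float64 rather than int64):

import numpy as np
import pandas as pd

# a tiny invented stand-in for the real dataset, for illustration only
test_dataset = pd.DataFrame({
    "pickles": [3, 7, np.nan, 2],
    "oats": ["rolled", np.nan, "steel cut", "rolled"],
    "biscuits": [np.nan, "digestive", "rich tea", "digestive"],
    "bananas": [1, 0, 2, 5]})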
So we know that we have to fill in the values for 3 of the 4 columns we have. Additionally, we have to do this for different kinds of data. We can use a SimpleImputer and a OneHotEncoder to do this. Let’s try to do this in as few steps as possible.
Creating a pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# fills missing values with the column mean when applied to the dataset
numerical_transformer = SimpleImputer(strategy="mean")
# fills missing categorical values with a constant, then one-hot encodes them
# (each category becomes its own column of 1s and 0s)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# our column names, split by type
numerical_columns = ["pickles", "bananas"]
categorical_columns = ["oats", "biscuits"]
# Let's apply the above steps using a "column transformer". A very helpful tool
# when different columns need different preprocessing.
preprocessor = ColumnTransformer(transformers=[("num", numerical_transformer, numerical_columns),
                                               ("cat", categorical_transformer, categorical_columns)])
Ok, we’ve done a lot here. We’ve defined the methods we want to use to fill in missing values and how we’re going to handle categorical variables.
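If the one-hot step is new to you, here’s a tiny standalone sketch of what OneHotEncoder does (the biscuit names are made up):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({"biscuits": ["digestive", "rich tea", "digestive"]})
enc = OneHotEncoder(handle_unknown='ignore')
# each category becomes its own 0/1 column, ordered alphabetically: [digestive, rich tea]
print(enc.fit_transform(demo).toarray())
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]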
Just to prevent confusion, “ColumnTransformer” can be imported from “sklearn.compose”.
Now we know that the methods we are using are consistent for the entirety of the dataset. If we want to change this, it’ll be easier to find and simpler to change.
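For example, if we later decided the median was a better fit for a skewed column, it’s a one-line change (a hypothetical tweak, not something this dataset necessarily needs):

numerical_transformer = SimpleImputer(strategy="median")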
Then we can put this all into a pipeline which contains the model we wish to work with:
from sklearn.ensemble import RandomForestClassifier

# bundles the preprocessing and a model into a single pipeline
pipeline_model = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('classifier', RandomForestClassifier())])
This uses a small amount of code, and now we can use it to fit our model to the training data and later make predictions. Instead of trying to make sure our tables are all clean and accidentally applying predictions to the wrong table (I have done this…), we can just send the raw data through this pipeline and be left with a prediction.
It’ll look something like this:
# fit the whole pipeline (imputing, encoding, and the model) in one go
fitted_model = pipeline_model.fit(X_train, y_train)
# .fit() returns the fitted pipeline, so either name works here
my_prediction = fitted_model.predict(X_valid)
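And to cover the “Testing” step from the list at the start, here’s a minimal sketch of scoring those predictions, assuming a classification problem where y_valid holds the true labels:

from sklearn.metrics import accuracy_score

# compare the pipeline's predictions against the held-out labels
print(accuracy_score(y_valid, my_prediction))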
Conclusion
Pipelines have been an interesting introduction to my Data Science journey and I hope this helps give a rough idea of what they are and why they might be useful.
Of course, they can (and will) become more complex if you are faced with more difficult problems. You might want to apply different methods of filling in missing values to different sections of the data set. You might want to test multiple different models. You might want to delete Python from your computer because you keep getting random errors.
Whatever it may be, just keep trying and experimenting.
Some helpful resources:
- Pipelines | Kaggle Intermediate Machine Learning – Alexis Cook
- A Simple Guide to Scikit-learn Pipelines | Rebecca Vickery
- A Simple Example of Pipeline in Machine Learning with Scikit-learn | Saptashwa Bhattacharyya
They’re all better than this post. Promise – it’s not hard to do.