Taste the food while you’re cooking | Data Science Some Days

And we back. Welcome to Data Science Some Days. A series where I go through some of the things I’ve learned in the field of Data Science.

For those who are new… I write in Python, these aren’t tutorials, and I’m bad at this.


When I’m cooking, sometimes I’m too nervous to taste the food as I’m going along. It’s like, I must think that I’m being judged by my guests over my shoulder.

That’s a rubbish approach because I might reach the end and not realise that I’ve missed some important seasoning.

So, if you want a good dinner, it’s important you just check as you’re going along.

This analogy doesn’t quite work with what I’m about to explain but I’ve committed. It’s staying.

Cross-Validation

The first introduction to testing is usually just an 80/20 split. This is where you take 80% of your training data and use that to train your model. The final 20% is used for validation. It comes in the following form:

from sklearn.model_selection import train_test_split

# Separate the prediction target from the rest of the training data before splitting
y = training_data["prediction target"]
X = training_data.drop(columns=["prediction target"])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2)

This is fine for large data sets because the 20% you’re using to test your model against will be large enough to offset the random chance that you’ve just managed to pick a “good 20%”.

With small data sets (such as the one from the Housing Prices competition from Kaggle), you could just get lucky with your test split. This is where cross validation comes into play.

This means you test your model against multiple different subsets of your data. It just takes longer to run (because you’re testing your model many different times).

It looks like this instead:

from sklearn.model_selection import cross_val_score

# The pipeline here contains my model and the preprocessing steps I've used to clean up the data.
# Multiply by -1 because sklearn's scoring convention is "higher is better", so
# 'neg_mean_absolute_error' returns the negative MAE; flipping the sign gives the usual positive MAE.
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

“cv=5” determines the number of subsets (better known as “folds”).

The more folds you use, the smaller each validation subset becomes (and the more times the model has to be trained). It takes longer to run, but with small data sets the extra time is negligible.
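To make “folds” a bit more concrete, here’s a tiny sketch (separate from the competition code) of how a 5-fold split carves up ten rows of data:

import numpy as np
from sklearn.model_selection import KFold

# Ten pretend rows of data, purely for illustration
X_demo = np.arange(10)

# With cv=5, the rows are split into 5 folds and each fold takes one turn as the validation set
for fold, (train_idx, valid_idx) in enumerate(KFold(n_splits=5).split(X_demo)):
    print(f"Fold {fold}: train on rows {train_idx}, validate on rows {valid_idx}")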

cross_val_score will spit out an array of scores (one per fold), which you can average to get a single estimate of how well the model performs.
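For example, averaging the five MAE values from the scores array above into a single number:

# One MAE per fold; the mean gives a single overall estimate
print("MAE for each fold:", scores)
print("Average MAE:", scores.mean())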

I heard the test results were positive.

Conclusion

Cross validation is a helpful way to test your models. It helps reduce the chance that you simply got lucky with the validation portion of your data.

With larger data sets, this is less likely though.

It’s important to keep in mind that cross validation increases the run time of your code because it trains and scores your model multiple times (once per fold).

Also, I’m aware that my comic doesn’t make that much sense but it made me laugh so I’m keeping it.

I’m lost and can’t get out of this pipeline | Data Science Some Days

And we back. Welcome to Data Science Some Days. A series where I go through some of the things I’ve learned in the field of Data Science.

For those who are new… I write in Python, these aren’t tutorials, and I’m bad at this.

Today, we’re going to talk about Pipelines – a tool that seems remarkably helpful in reducing the amount of code needed to achieve an end result.

Why pipelines?

When creating a basic model, you need a few things:

  1. Exploratory Data Analysis
  2. Data cleaning and sorting
  3. Predictions
  4. Testing
  5. Return to number one or two and repeat.

As you make different decisions with your data, and as the information gets more complex, it becomes easier to miss a step or simply forget the steps you’ve taken! My first model involved so many random bits of code. If I had to reread it, I’d have no idea what I wanted to do. That’s even with comments.

Pipelines help combine these different steps into a smaller amount of code, which makes everything easier to follow. Let’s take a look at what this might look like.

The data we have to work with

As we may know, the data we work with might contain many different data types, and within the columns there may be missing values everywhere. From the previous post, we learned that’s a no go. We need to fill in our data with something.

test_dataset.isna().sum()

pickles     100
oats         22
biscuits     15
bananas       0

Alright, a bunch of missing values, let’s see what data type they are.

test_dataset.dtypes

pickles int64
oats object
biscuits object
bananas int64

We have numbers (int64) and strings (object – not strictly just strings but we’ll work with this).

So we know that we have to fill in the values for 3 of the 4 columns we have. Additionally, we have to do this for different kinds of data. We can use a SimpleImputer and a OneHotEncoder to do this. Let’s try to do it in as few steps as possible.
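(If one-hot encoding is new: it turns each category into its own column of 0/1 flags. Here’s a rough pandas sketch of the idea, with made-up values for the “oats” column – the pipeline below uses OneHotEncoder to do the same job.)

import pandas as pd

# Made-up values, purely to show the idea behind one-hot encoding:
# each category becomes its own column of 0/1 (or True/False) flags
oats = pd.DataFrame({"oats": ["rolled", "steel-cut", "rolled"]})
print(pd.get_dummies(oats))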

Creating a pipeline

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Column names taken from the dataset above
numerical_columns = ["pickles", "bananas"]
categorical_columns = ["oats", "biscuits"]

# Fills missing values with the mean of each column when applied to the dataset
numerical_transformer = SimpleImputer(strategy="mean")

# Fills missing values with a constant, then one-hot encodes each category into its own 0/1 column
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Apply the above steps with a "column transformer" - a very helpful tool when
# different columns need different preprocessing
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_columns),
    ("cat", categorical_transformer, categorical_columns)])

Ok, we’ve done a lot here. We’ve defined the methods we want to use to fill in missing values and how we’re going to handle categorical variables.

Just to prevent confusion, “ColumnTransformer” can be imported from “sklearn.compose”.

Now we know that the methods we are using are consistent for the entirety of the dataset. If we want to change this, it’ll be easier to find and simpler to change.

Then we can put this all into a pipeline which contains the model we wish to work with:

from sklearn.ensemble import RandomForestClassifier

# Bundles the preprocessing above and a model into one pipeline
pipeline_model = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('classifier', RandomForestClassifier())])

This uses a small amount of code. Now we can use it when fitting our model to our training data and later when making predictions. Instead of trying to make sure our tables are all clean and then accidentally applying predictions to the wrong table (I have done this…), we can just send the raw data through the pipeline and we’ll be left with a prediction.

It’ll look something like this:

fitted_model = pipeline_model.fit(X_train, y_train)
my_prediction = pipeline_model.predict(X_valid)
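If you want a quick sanity check on those predictions, here’s a minimal sketch, assuming y_valid holds the true answers that go with X_valid:

from sklearn.metrics import accuracy_score

# Compare the pipeline's predictions against the answers we held back for validation
print("Validation accuracy:", accuracy_score(y_valid, my_prediction))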

Conclusion

Pipelines have been an interesting introduction to my Data Science journey and I hope this helps give a rough idea of what they are and why they might be useful.

Of course, they can (and will) become more complex if you are faced with more difficult problems. You might want to apply different methods of filling in missing values to different sections of the data set. You might want to test multiple different models. You might want to delete Python from your computer because you keep getting random errors.

Whatever it may be, just keep trying and experimenting.

Some helpful resources:

  1. Pipelines | Kaggle Intermediate Machine Learning – Alexis Cook
  2. A Simple Guide to Scikit-learn Pipelines | Rebecca Vickery
  3. A Simple Example of Pipeline in Machine Learning with Scikit-learn | Saptashwa Bhattacharyya

They’re all better than this post. Promise – it’s not hard to do.

What I’m currently learning in Data Science | Data Science Somedays

It is 26 September as I write this meaning that I’m on day 26 of #66daysofdata.

If this is unfamiliar to you, it’s a small journey started by a Data Scientist named Ken Jee. He decided to “restart” his data science journey and invited us all to come along for the ride.

I’m not a data scientist, I’ve just always found the young field interesting. I thought, for this instance of Data Science Somedays, I’ll go through some of the things I’ve learned (in non-technical detail).


Data Ethics

I’m starting with this because I actually think it’s one of the most important, yet overlooked parts of Data Science. Just because you can do something, doesn’t mean you should. Not everything is good simply because it can be completed with an algorithm.

One of the problems with Data Science, at least in the commercial sphere, is that there’s a lot of value in having plenty of data. Sometimes, this value is taken as a priority versus privacy. In addition, many adversaries understand the value of data and as a result, aim to muddy the waters with large disinformation campaigns or steal personal data. What does the average citizen do in this scenario?

Where am I learning this? Fast.ai’s Practical Data Ethics course.


Coding

How do I even start?

Quite easily because I’m not that good at programming so I haven’t learned all that much. Some of the main things that come to mind are:

  1. Object Oriented Programming (this took me forever to wrap my head around… it’s still difficult).
  2. Python decorators
  3. Functions

All of this stuff has helped me create:

None of them are impressive. But they exist and I was really happy when I fixed my bugs (if there are more, don’t tell me).

Where am I learning this? 2020 Complete Python Bootcamp: From Zero to Hero in Python.

(I said earlier I haven’t learned much – that’s just me being self-deprecating. It’s a good course – I’m just not good at programming… yet.

I also bought this for £12. Udemy is on sale all the time (literally))


Data visualisation and predictions

Pandas

After a while, I wanted to direct my coding practice to more data work rather than gaining a general understanding of Python.

To do this, I started learning Pandas, which is a library (a bunch of code that helps you quickly do other things) that focuses on data manipulation. In short, I can now use Excel files with Python. It included things such as:

  • How to rename columns
  • How to find averages, reorganise information, and then create a new table
  • How to answer basic data analysis questions

Pandas is definitely more powerful than the minor things I mentioned above. It’s still quite difficult to remember how to use all of the syntax so I still have to Google a lot of basic information but I’ll get there.
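As a rough sketch of what those basics look like in practice (the snacks.csv file and column names here are made up):

import pandas as pd

df = pd.read_csv("snacks.csv")

# Rename a column
df = df.rename(columns={"cost_gbp": "price"})

# Find averages, reorganise the information, and create a new table
average_price = df.groupby("snack_type")["price"].mean().reset_index()

# Answer a basic data analysis question: which snack type costs the most on average?
print(average_price.sort_values("price", ascending=False).head(1))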

Where am I learning this? Kaggle – Pandas

Bokeh and Seaborn

When I could mess around with Excel files and data sets, I took my talents to data visualisation.

Data visualisation will always be important because looking at tables is 1) boring, 2) slow, and 3) boring. How could I make my data sets at least look interesting?

Seaborn is another library that makes data visualisation much simpler (e.g. “creating a bar chart in one line of code” simpler).

Bokeh is another library that seems to be slightly more powerful, in the sense that I can make my visualisations interactive, which is helpful when you have a lot of information to display at once.
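For instance, the “bar chart in one line of code” bit of Seaborn looks roughly like this (assuming a DataFrame called df with made-up snack_type and price columns):

import seaborn as sns
import matplotlib.pyplot as plt

# One line to draw a bar chart of the average price per snack type
sns.barplot(data=df, x="snack_type", y="price")
plt.show()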

I knew that going through tutorials would have its limits, as my hand is always being held, so I found a data set on ramen and created Kaggle notebooks. My aim was to practice and show others what my thought process was.

Where am I learning this? Seaborn | Bokeh


Machine learning

This is my most recent venture. How can I begin to make predictions using code, computers and coffee?

I still find all of the above quite difficult, and it will be a little while until I can say “I know Python”, but this topic seemed like the one with the biggest black box.

If I say

filepath = "hello.csv"
pandas.read_csv(filepath)

I understand that I’m taking a function from the Pandas library, and that function will allow me to interact with the .csv file I’ve called.

If I say something like my_model.predict(X_new_data) – honestly, what is even happening? Half the time, I feel like it’s just luck that I get a good outcome.
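For context, here’s a minimal sketch of the kind of call I mean, using the DecisionTreeRegressor from Kaggle’s intro course and made-up variable names:

from sklearn.tree import DecisionTreeRegressor

# Fit a model on data where we know the answers, then ask it to guess answers for new data
# (X_train, y_train and X_new_data are made-up names here)
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)
predictions = model.predict(X_new_data)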

Where am I learning this? Kaggle – Intro to machine learning


What is next?

I’m going to continue learning about data manipulation with Pandas and Bokeh as those were the modules I found the most interesting to learn about. However, that could very easily change.

My approach to learning all of this is to go into practice as soon as I can even if it’s a bit scary. It exposes my mistakes and reminds me that working through tutorials often leaves me feeling as though I’ve learned more than I have.

There’s also a second problem – I’m not a Computer Science student so I don’t have the benefit of learning the theory behind all of this stuff. Part of me wants to dive in, the other part is asking that I stay on course and keep learning the practical work so I can utilise it in my work.

Quite frequently, I get frustrated by not understanding and remembering what I’m learning “straight away”. However, this stuff isn’t easy by any stretch of the imagination. So it might take some time.

And that’s alright. Because we’re improving slowly.