Start more projects, please | Data Science Some days

And we back. Welcome to Data Science Some Days. A series where I go through some of the things I’ve learned in the field of Data Science.

For those who are new… I write in Python, these aren’t tutorials, and I’m bad at this.

I haven’t written one of these for an entire month. My mistake… time flies when you procrastinate.


I want to spend a little bit of time talking about learning Data Science itself.

I won’t say how long I’ve been learning to code and such because I don’t really know the answer. This isn’t to say I’ve been doing it for a long time; I just don’t have much of a memory for this stuff.

I will, however, say one of the big mistakes I’ve made in my journey:

Not completing enough projects.

Any time I start a new project, I feel figuratively naked – as though all the knowledge I’ve ever gathered has deserted me and I’ll never be able to find it again.

I find it difficult to do anything at all until I get past the initial uncomfortable feeling of not having my hand held through to the end. Will I fall over? Yes. But learning how to get back up is a really useful skill, even if you fall over after another step.


Let’s talk about tutorials

There are a lot of tutorials online about all sorts of things. Many of them are good, some are bad, some are brilliant. When it comes to programming, you will never have a shortage of materials for beginners. As a result, they’re tempting and the entry point is quite low. So low, in fact, that many beginners have no idea where to start.

Tutorials are also easy to get lost in because they do a lot of the heavy lifting in the background. That makes the information easier to deliver, but perhaps less useful for the learner. This isn’t to say all tutorials and courses are “easy”. Far from it. Rather, no one has ever become a developer or programmer purely off the back of completing a handful of tutorials.

Don’t get attached to tutorials or courses. They can only take us so far. It’s also difficult to stay entertained by them for the long haul.

Learning just enough

My new enjoyment of projects comes from a video by Tina Huang on How to self-study technical things. She mentioned a helpful principle:

“Learn just enough to start on a project”

This divorces you quite quickly from an attachment to completing courses or selecting the “right one”. If you’ve got what you need out of a course, move on and use the knowledge to create something. Fortunately, the information doesn’t disappear just because you’ve decided not to finish it. It’s fine to refer back to it during projects, anyway.

You’ll get to the difficult parts more quickly which lets you understand the true gaps in your knowledge/skill. It’s perfectly fine for this to be humbling. Getting better at anything requires humility.

It’s more fun

Being the person responsible for creating something is a really satisfying feeling, even if it sucks. (It likely does suck, but only in comparison to the work of people much more experienced than you, which is an unfair comparison. Comparison is a fool’s game.)

You can point to a model you’ve trained or visualisation you’ve created and say “That was ME”. And it’ll be true.

When you look at a list of potential projects, you’re more likely to add your own twist to it (it doesn’t matter if that’s just experimenting with different colours). If you’re following a tutorial to a T, you miss out on something important:

Ownership.

The difficulties and successes are yours.

Leave yourself open to surprises

I’ve noticed a few things in a recent project of mine (more on that in the next DS Somedays post, it’s nothing special):

  1. I know and understand a bit more than I gave myself credit for
  2. There is so much more I can add to my knowledge base to improve the project
  3. Courses, tutorials, and tools are just there to help me reach my end goal. That helps explain why I always have so many tabs open

Projects can be challenging, which might also explain why they’re easy to avoid. However, I’ll definitely have to work towards doing more – if not for my portfolio, then for general enjoyment.

Project-based learning is the way forward.

Further resources:

  1. How to self-study technical things.
  2. Project based tutorials (many different programming languages)
  3. Projectlearn.io

Taste the food while you’re cooking | Data Science Some Days

And we back. Welcome to Data Science Some Days. A series where I go through some of the things I’ve learned in the field of Data Science.

For those who are new… I write in Python, these aren’t tutorials, and I’m bad at this.


When I’m cooking, sometimes I’m too nervous to taste the food as I’m going along. It’s as though I think my guests are standing over my shoulder, judging me.

That’s a rubbish approach because I might reach the end and only then realise that I’ve missed some important seasoning.

So, if you want a good dinner, it’s important you just check as you’re going along.

This analogy doesn’t quite work with what I’m about to explain but I’ve committed. It’s staying.

Cross-Validation

The first introduction to testing is usually just an 80/20 split. This is where you take 80% of your data to train your model and keep the final 20% for validation. It comes in the following form:

from sklearn.model_selection import train_test_split

X = training_data.drop(columns=["prediction target"])  # features only, without the target
y = training_data["prediction target"]                  # the prediction target

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2)

This is fine for large data sets because the 20% you’re using to test your model against will be large enough to offset the random chance that you’ve just managed to pick a “good 20%”.

With small data sets (such as the one from the Housing Prices competition from Kaggle), you could just get lucky with your test split. This is where cross validation comes into play.

This means you test your model against multiple different subsets of your data. It just takes longer to run (because you’re testing your model many different times).

It looks like this instead:

# cross_val_score comes from sklearn.model_selection.
from sklearn.model_selection import cross_val_score

# The pipeline here just contains my model and the different steps I've used to clean up the data.
# You multiply by -1 because sklearn scoring follows "higher number is better", so the MAE comes back negative.

scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

“cv=5” determines the number of subsets (better known as “folds”) the data is split into.

The more folds you use, the smaller the chunk of data held out for each test, and the more times the model has to be trained. It takes a bit longer to run, but with small data sets the difference is negligible.

It’ll spit out an array of scores (one per fold), and you can average them to get a single estimate of how well the model performs.
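For example, continuing from the snippet above:

# scores has one MAE value per fold; the mean gives a single summary number
print("MAE per fold:", scores)
print("Average MAE:", scores.mean())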

I heard the test results were positive.

Conclusion

Cross validation is a helpful way to test your models. It helps reduce the chance that you simply got lucky with the validation portion of your data.

With larger data sets, that kind of luck is less of a concern, so a single split is usually enough.

It’s important to keep in mind that cross validation increases the run time of your code because it trains and evaluates your model multiple times (once per fold).

Also, I’m aware that my comic doesn’t make that much sense but it made me laugh so I’m keeping it.


I’m lost and can’t get out of this pipeline | Data Science Some Days

And we back. Welcome to Data Science Some Days. A series where I go through some of the things I’ve learned in the field of Data Science.

For those who are new… I write in Python, these aren’t tutorials, and I’m bad at this.

Today, we’re going to talk about Pipelines – a tool that seems remarkably helpful in reducing the amount of code needed to achieve an end result.

Why pipelines?

When creating a basic model, you need a few things:

  1. Exploratory Data Analysis
  2. Data cleaning and sorting
  3. Predictions
  4. Testing
  5. Return to number one or two and repeat.

As you make more decisions about your data, and as the information gets more complex, it becomes easier to miss a step or simply forget the steps you’ve taken! My first model involved so many random bits of code. If I had to reread it, I’d have no idea what I wanted to do. That’s even with comments.

Pipelines bundle those steps into a smaller amount of code so the whole workflow is easier to follow and repeat. Let’s take a look at what this might look like.

The data we have to work with

The data we work with often contains several different data types, and the columns may be riddled with missing values. From the previous post, we learned that’s a no go. We need to fill in our data with something.

test_dataset.isna().sum()

pickles     100
oats         22
biscuits     15
bananas       0

Alright, a bunch of missing values, let’s see what data type they are.

test_dataset.dtypes

pickles int64
oats object
biscuits object
bananas int64

We have numbers (int64) and strings (object – not strictly just strings but we’ll work with this).

So we know that we have to fill in the values for 3 of the 4 columns we have. Additionally, we have to do this for different kinds of data. We can utilise a SimpleImputer and a OneHotEncoder to do this. Let’s try to do this in as few steps as possible.

Creating a pipeline

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# fills missing values with the mean of the column when applied to the dataset
numerical_transformer = SimpleImputer(strategy="mean")

# fills missing categorical values with a constant, then one-hot encodes each category
# into its own 0/1 column when applied to the dataset
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Let's apply the above steps using a "column transformer". A very helpful tool when
# different columns need different preprocessing.
# numerical_columns and categorical_columns are lists of the relevant column names.
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_columns),
    ("cat", categorical_transformer, categorical_columns)])

Ok, we’ve done a lot here. We’ve defined the methods we want to use to fill in missing values and how we’re going to handle categorical variables.

Just to prevent confusion, “ColumnTransformer” can be imported from “sklearn.compose”.

Now we know that the methods we are using are consistent for the entirety of the dataset. If we want to change this, it’ll be easier to find and simpler to change.

Then we can put this all into a pipeline which contains the model we wish to work with:

from sklearn.ensemble import RandomForestClassifier

# bundles the preprocessing above into a pipeline together with a model
pipeline_model = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('classifier', RandomForestClassifier())])

This uses a small amount of code, and we can now use it when fitting our model to the training data and later when making predictions. Instead of trying to make sure our tables are all clean and accidentally applying predictions to the wrong table (I have done this…), we can just send the raw data through the pipeline and be left with a prediction.

It’ll look something like this:

fitted_model = pipeline_model.fit(X_train, y_train)
my_prediction = pipeline_model.predict(X_valid)
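To sanity-check those predictions, you can score them against the held-out labels. A minimal sketch, assuming a classification target to match the RandomForestClassifier above (accuracy is just one possible metric):

from sklearn.metrics import accuracy_score

# compare the pipeline's predictions with the true validation labels
print("Validation accuracy:", accuracy_score(y_valid, my_prediction))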

Conclusion

Pipelines have been an interesting introduction to my Data Science journey and I hope this helps give a rough idea of what they are and why they might be useful.

Of course, they can (and will) become more complex if you are faced with more difficult problems. You might want to apply different methods of filling in missing values to different sections of the data set. You might want to test multiple different models. You might want to delete Python from your computer because you keep getting random errors.

Whatever it may be, just keep trying and experimenting.

Some helpful resources:

  1. Pipelines | Kaggle Intermediate Machine Learning – Alexis Cook
  2. A Simple Guide to Scikit-learn Pipelines | Rebecca Vickery
  3. A Simple Example of Pipeline in Machine Learning with Scikit-learn | Saptashwa Bhattacharyya

They’re all better than this post. Promise – it’s not hard to do.

What I’m currently learning in Data Science | Data Science Somedays


It is 26 September as I write this meaning that I’m on day 26 of #66daysofdata.

If this is unfamiliar to you, it’s a small journey started by a Data Scientist named Ken Jee. He decided to “restart” his data science journey and invited us all to come along for the ride.

I’m not a data scientist; I’ve just always found this young field interesting. I thought that, for this instance of Data Science Somedays, I’d go through some of the things I’ve learned (in non-technical detail).


Data Ethics

I’m starting with this because I actually think it’s one of the most important, yet overlooked parts of Data Science. Just because you can do something, doesn’t mean you should. Not everything is good simply because it can be completed with an algorithm.

One of the problems with Data Science, at least in the commercial sphere, is that there’s a lot of value in having plenty of data. Sometimes, that value gets prioritised over privacy. In addition, many adversaries understand the value of data and, as a result, aim to muddy the waters with large disinformation campaigns or steal personal data. What does the average citizen do in this scenario?

Where am I learning this? Fast.ai’s Practical Data Ethics course.


Coding

How do I even start?

Quite easily because I’m not that good at programming so I haven’t learned all that much. Some of the main things that come to mind are:

  1. Object Oriented Programming (this took me forever to wrap my head around… it’s still difficult).
  2. Python decorators (there’s a rough sketch after this list)
  3. Functions
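
Since decorators took me a while to get my head around, here is a minimal sketch (the function names are made up purely for illustration):

import functools

def shout(func):
    # a decorator takes a function and returns a new function that wraps it
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    return f"hello, {name}"

print(greet("data science"))  # prints HELLO, DATA SCIENCE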

All of this stuff has helped me create a handful of small projects.

None of them are impressive. But they exist and I was really happy when I fixed my bugs (if there are more, don’t tell me).

Where am I learning this? 2020 Complete Python Bootcamp: From Zero to Hero in Python.

(I said earlier I haven’t learned much – that’s just me being self-deprecating. It’s a good course; I’m just not good at programming… yet.)

(I also bought this for £12. Udemy is on sale all the time – literally.)


Data visualisation and predictions

Pandas

After a while, I wanted to direct my coding practice to more data work rather than gaining a general understanding of Python.

To do this, I started learning Pandas, a library (a bunch of code that helps you quickly do other things) that focuses on data manipulation. In short, I can now work with Excel files in Python. It covered things such as the following (there’s a rough sketch after the list):

  • How to rename columns
  • How to find averages, reorganise information, and then create a new table
  • How to answer basic data analysis questions
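
A minimal sketch of the first two bullets, with a made-up table and invented column names (not the data sets I actually used):

import pandas as pd

# a hypothetical table of snacks, purely for illustration
df = pd.DataFrame({
    "category": ["biscuit", "biscuit", "fruit"],
    "name": ["digestive", "hobnob", "banana"],
    "price": [1.20, 1.50, 0.30],
})

df = df.rename(columns={"name": "product"})                      # rename a column
summary = df.groupby("category")["price"].mean().reset_index()   # averages, reorganised into a new table
print(summary)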

Pandas is definitely more powerful than the minor things I mentioned above. It’s still quite difficult to remember how to use all of the syntax so I still have to Google a lot of basic information but I’ll get there.

Where am I learning this? Kaggle – Pandas

Bokeh and Seaborn

Once I could mess around with Excel files and data sets, I took my talents to data visualisation.

Data visualisation will always be important because looking at tables is 1) boring, 2) slow, and 3) boring. How could I make my data sets at least look interesting?

Seaborn is another library that makes data visualisation much simpler (e.g. “creating a bar chart in one line of code” simpler).
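For example, a one-line bar chart (the tiny data set here is made up for illustration, not from my actual notebooks):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# a tiny made-up data set, purely for illustration
ramen_df = pd.DataFrame({"brand": ["A", "B", "C"], "rating": [3.5, 4.2, 2.8]})

sns.barplot(data=ramen_df, x="brand", y="rating")  # the chart itself really is one line
plt.show()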

Bokeh is another library that seems to be slightly more powerful, in the sense that I can make my visualisations interactive, which is helpful. Especially when you have a lot of information to display at once.
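
And a minimal Bokeh sketch (again, the numbers are made up; the output is an HTML file you can pan and zoom around in):

from bokeh.plotting import figure, output_file, show

output_file("ratings.html")  # the interactive plot gets written to this HTML file

p = figure(title="Ratings over time", x_axis_label="year", y_axis_label="average rating")
p.line([2016, 2017, 2018, 2019], [3.2, 3.5, 3.9, 4.1], line_width=2)  # made-up data
show(p)  # opens the plot in a browser with pan/zoom tools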

I knew that going through tutorials would have its limits, as my hand is always being held, so I found a data set on ramen and created some Kaggle notebooks. My aim was to practise and show others what my thought process was.

Where am I learning this? Seaborn | Bokeh


Machine learning

This is my most recent venture. How can I begin to make predictions using code, computers and coffee?

I still find all of the above quite difficult, and it will be a little while until I can say “I know Python”, but this topic seemed like the one with the biggest black box.

If I say

filepath = "hello.csv"

pandas.read_csv(filepath)

I understand that I’m calling a function from the Pandas library, and that function will load the .csv file I’ve pointed it at so I can interact with it.

If I say model.predict(X_new_data) on a fitted scikit-learn model – honestly, what is even happening? Half the time, I feel like it’s just luck that I get a good outcome.

Where am I learning this? Kaggle – Intro to machine learning


What is next?

I’m going to continue learning about data manipulation with Pandas and Bokeh as those were the modules I found the most interesting to learn about. However, that could very easily change.

My approach to learning all of this is to go into practice as soon as I can even if it’s a bit scary. It exposes my mistakes and reminds me that working through tutorials often leaves me feeling as though I’ve learned more than I have.

There’s also a second problem – I’m not a Computer Science student so I don’t have the benefit of learning the theory behind all of this stuff. Part of me wants to dive in, the other part is asking that I stay on course and keep learning the practical work so I can utilise it in my work.

Quite frequently, I get frustrated by not understanding and remembering what I’m learning “straight away”. However, this stuff isn’t easy by any stretch of the imagination. So it might take some time.

And that’s alright. Because we’re improving slowly.