Start more projects, please | Data Science Somedays

And we back. Welcome to Data Science Somedays, a series where I go through some of the things I’ve learned in the field of Data Science.

For those who are new… I write in Python, these aren’t tutorials, and I’m bad at this.

I haven’t written one of these for an entire month. My mistake… time flies when you procrastinate.


I want to spend a little bit of time talking about learning Data Science itself.

I won’t say how long I’ve been learning to code and such because I don’t really know the answer. This isn’t to say I’ve been doing it for a long time, I just don’t have much of a memory for this stuff.

I will, however, say one of the big mistakes I’ve made in my journey:

Not completing enough projects.

Any time I start a new project, I feel figuratively naked – as though all the knowledge I’ve ever gathered has deserted me and I’ll never be able to find it again.

I find it difficult to do anything at all until I get past the initial uncomfortable feeling of not having my hand held through to the end. Will I fall over? Yes. But learning how to get back up is a really useful skill, even if you fall over after another step.


Let’s talk about tutorials

There are a lot of tutorials online about all sorts of things. Many of them are good, some are bad, some are brilliant. When it comes to programming, you will never have a shortage of materials for beginners. As a result, they’re tempting and the entry point is low – so low that many beginners have no idea where to start.

Tutorials are also easy to get lost in because they do a lot of the heavy lifting in the background. That makes the material easier to follow, but perhaps less useful for the learner. This isn’t to say all tutorials and courses are “easy”. Far from it. Rather, no one has ever become a developer or programmer purely off the back of completing a handful of tutorials.

Don’t get attached to tutorials or courses. They can only take us so far. It’s also difficult to stay entertained by them for the long haul.

Learning just enough

My new enjoyment of projects comes from a video by Tina Huang on How to self-study technical things. She mentioned a helpful principle:

“Learn just enough to start on a project”

This divorces you quite quickly from an attachment to completing courses or selecting the “right one”. If you’ve got what you need out of it, then move on and use the knowledge to create something. Fortunately, the information doesn’t disappear if you tell yourself you might not complete it. It’s fine to refer to them during projects, anyway.

You’ll get to the difficult parts more quickly which lets you understand the true gaps in your knowledge/skill. It’s perfectly fine for this to be humbling. Getting better at anything requires humility.

It’s more fun

Being the person responsible for creating something is a really satisfying feeling, even if it sucks. (And it probably only sucks in comparison to the work of people far more experienced than you, which is unfair. Comparison is a fool’s game.)

You can point to a model you’ve trained or visualisation you’ve created and say “That was ME”. And it’ll be true.

When you look at a list of potential projects, you’re more likely to add your own twist to it (it doesn’t matter if that’s just experimenting with different colours). If you’re following a tutorial to the T, you miss out on something important:

Ownership.

The difficulties and successes are yours.

Leave yourself open to surprises

I’ve noticed a few things in a recent project of mine (more on that in the next DS Somedays post, it’s nothing special):

  1. I know and understand a bit more than I gave myself credit for
  2. There is so much more I can add to my knowledge base to improve the project
  3. Courses, tutorials, tools are just there to help me reach my end goal. It helps explain why I always have so many tabs open

Projects can be challenging, which might also explain why they’re easy to avoid. However, I’ll definitely have to work towards doing more – if not for my portfolio, then for general enjoyment.

Project-based learning is the way forward.

Further resources:

  1. How to self-study technical things.
  2. Project based tutorials (many different programming languages)
  3. Projectlearn.io

To survive the Titanic, become a 50-year-old woman | Data Science Somedays

Firstly, the title is a joke. I really have no helpful insights to share as you’ll see from my work.

This will be split into a few sections

  1. What is machine learning?
  2. Train and test data
  3. Visualising the training data
  4. Creating a feature
  5. Cleaning the data
  6. Converting the data
  7. Testing predictions with the test data
  8. Final thoughts

It should definitely be mentioned that this is the furthest thing from a tutorial you will ever witness. I’m not writing to teach but to learn and tell bad jokes.

If you want a helpful tutorial (one that helped me along), follow Titanic – Data Science Solutions on Kaggle.


  1. What is Machine Learning?

One of the basic tasks in machine learning is classification: you want to predict something as either “A will happen” or “B will happen”. You can do this using historical data and by selecting algorithms that best fit the purpose.

The problem we are posed with is:

Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.

Kaggle – Machine Learning From Disaster

2. Train and Test data

Kaggle, the data science website, has a beginner problem called “Titanic – Machine Learning from Disaster” where you’re given data about who survived the Titanic sinking, along with information about their age, name, number of siblings and so on. You’re then asked to predict the outcome for the 418 passengers in the test set.

The original table looks something like this:

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S
Initial table for titanic problem

This is what we call “training data”. It is information that we know the outcome for, and we can use it to fit our algorithms and then make predictions.

There is also “test” data. It is similar to the data above but with the Survived column removed. We will make our predictions on this set and let Kaggle tell us how well all of the visualisations and algorithm abuse have paid off.
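For anyone following along, here’s roughly how the two files get loaded – a sketch rather than my exact script. The new_combined variable that appears in later snippets is just my way of handling both tables at once, so treat the exact names as assumptions about my own messy code.

import pandas as pd

train_df = pd.read_csv("train.csv")   # 891 passengers, including the Survived column
test_df = pd.read_csv("test.csv")     # 418 passengers, with Survived removed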


3. Visualising the data

To start with, it’s important to simply have a look at the data and see what insights we can gather from a bird’s-eye view. Otherwise we’re just staring at tables and hoping for the best.

Information as a histogram
Information as a box plot
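Those two figures came from something like the snippet below – a minimal sketch rather than my exact notebook code, reusing the train_df name from the loading step above.

import matplotlib.pyplot as plt
import seaborn as sns

# histograms of every numerical column in one go
train_df.hist(figsize=(10, 8))
plt.show()

# box plots for a couple of individual features
fig, axes = plt.subplots(1, 2)
sns.boxplot(y=train_df["Age"], ax=axes[0])
sns.boxplot(y=train_df["SibSp"], ax=axes[1])
plt.show()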

I won’t go through everything (and yes, it is very rough) but we can gain some basic insights from this. It might influence whether we want to create any new features or focus on certain features when trying to predict survival rates.

For example, we can see from the box plots that most people were roughly 30 years old and had one sibling on board (2nd row, first two box plots). From the histograms, we can see that most people were in passenger class 3 (we have no idea what that means in real life) and a lot of people on the Titanic (at least in this dataset) were pretty young.

How does this impact survival? I’m glad you asked. Let’s look at some more graphs.

Survival rates vs passenger class, sex and embarking location. Women in passenger class 1 seemed to live…
Women seemed to have a much higher chance of survival at first glance
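If you’d rather read numbers than squint at my graphs, a quick groupby tells much the same story – again a sketch, reusing train_df.

import seaborn as sns

# survival rate by sex and by passenger class
train_df[["Sex", "Survived"]].groupby("Sex").mean()
train_df[["Pclass", "Survived"]].groupby("Pclass").mean()

# or the plotted version
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=train_df)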

Now, we could just make predictions based on these factors if we really wanted to. However, we can also create new features from the information that we have. This is called feature engineering.


4. Creating a feature

I know, this seems like I’m playing God with data. In part, that is why I’m doing this. To feel something.

We have their names with their titles included. We can extract those titles and create a feature called “Title”. With this, we’ll also be able to see whether people with fancy titles were saved first, or married women, and so on.

# pull out the word that sits just before a full stop in the Name column, e.g. "Mr", "Mrs", "Miss"
for dataset in new_combined:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

You don’t need to understand everything or the variables here. They are specific to the code written which is found on my GitHub.

It basically takes the name “Braund, Mr. Owen Harris” and looks for a run of letters (A-Za-z) with a dot at the end. When this code is run, it pulls out “Mr” because “Mr.” fits that pattern. If the title were written as “mr” with no dot, the pattern wouldn’t match and the title would be missed. It’s great, I’ll definitely be using the str.extract method again.
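To sanity-check the extraction, something like this helps. The crosstab shows what got pulled out, and the grouping of rarer titles follows the Kaggle tutorial’s approach rather than anything clever of mine – treat it as a sketch.

import pandas as pd

# which titles did we find, and for which sex? (train_df being one of the tables in new_combined)
pd.crosstab(train_df["Title"], train_df["Sex"])

# lump the uncommon titles together so the groups aren't tiny
for dataset in new_combined:
    dataset["Title"] = dataset["Title"].replace(
        ["Lady", "Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer", "Dona"], "Rare")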


5. Cleaning the data

A lot of data is bad. Data can regularly contain missing values, mistakes or simply be remarkably unhelpful for our goals. I’ve been told that this is a large part of the workflow when trying to solve problems that require prediction.

We can get this information pretty quickly:

new_combined.info() # shows the non-null count for every column
new_combined.isna().sum() # counts the missing values in each column (quicker to read than the first method)

In the titanic data set, we have loads of missing data in the “age” column and a small amount in the “embarked” column.

For the “age” column, I followed the advice from the tutorial linked above and guessed the ages based on each passenger’s class and sex.

For the “embarked” column, because there were so few missing values, I filled them in with the most common port of embarkation.
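In code, both fixes look roughly like this. It’s a simplified sketch – the tutorial builds a little grid of guessed ages, whereas this leans on groupby/transform to get to much the same place.

# Embarked: only a couple of values missing, so use the most common port
for dataset in new_combined:
    most_common_port = dataset["Embarked"].mode()[0]
    dataset["Embarked"] = dataset["Embarked"].fillna(most_common_port)

# Age: fill gaps with the median age for that passenger class and sex
for dataset in new_combined:
    dataset["Age"] = dataset.groupby(["Pclass", "Sex"])["Age"].transform(
        lambda ages: ages.fillna(ages.median()))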

As you can see, cleaning data requires some assumptions to be made and can utilise different techniques. It is definitely something to keep in mind as datasets get bigger and messier. The dataset I’m working with is actually pretty good which is likely a luxury.

It isn’t sexy but important. I suppose that’s the case with many things in life.


6. Converting the data

In order for this information to be useful to an algorithm, we need to make sure that the information we have in our table is numerical.

We can do this by mapping groups of information to numbers. I did this for all features.

It basically follows this format:

# the model only understands numbers, so map male to 0 and female to 1
for item in new_combined:
    item.Sex = item.Sex.map({"male": 0, "female": 1}).astype(int)

It is important to note that this only works if all of the info is filled in (which is why the previous step is so important).

For features that have a large number of entries (for example, “age” could potentially have 891 unique values), we can group them together so we have a smaller number of numerical values. This is the same for “fare” and the “title” feature created earlier.

It is basically the same as above but there is one prior step – creating the bands! It is simply using the “pd.cut()” feature. This segments whichever column we specify into the number of bands we want. Then we use those bands and say something like:

“If this passenger is between the age of 0 and 16, we’ll assign them a “1”.”
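In code, that two-step dance looks roughly like this. I’ve borrowed the Kaggle tutorial’s band edges and codes here (which start at 0 rather than 1), so your cut points may differ.

import pandas as pd

# step one: let pandas suggest five equal-width age bands
for dataset in new_combined:
    dataset["AgeBand"] = pd.cut(dataset["Age"], 5)

# step two: replace Age with an integer code for the band it falls in
for dataset in new_combined:
    dataset.loc[dataset["Age"] <= 16, "Age"] = 0
    dataset.loc[(dataset["Age"] > 16) & (dataset["Age"] <= 32), "Age"] = 1
    dataset.loc[(dataset["Age"] > 32) & (dataset["Age"] <= 48), "Age"] = 2
    dataset.loc[(dataset["Age"] > 48) & (dataset["Age"] <= 64), "Age"] = 3
    dataset.loc[dataset["Age"] > 64, "Age"] = 4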

Our final table will look like this:

Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title
0 | 3 | 0 | 1 | 1 | 0 | 0.0 | 1 | 3
1 | 1 | 1 | 2 | 1 | 0 | 3.0 | 3 | 4
1 | 3 | 1 | 1 | 0 | 0 | 1.0 | 1 | 2
1 | 1 | 1 | 2 | 1 | 0 | 3.0 | 1 | 4
0 | 3 | 0 | 2 | 0 | 0 | 1.0 | 1 | 3
Much less interesting to look at but more useful for our next step

7. Testing predictions with the test data

Now that we have a table prepared for our predictions, we can select algorithms, fit them to our training data, and then make predictions.

While the previous stages were definitely frustrating to wrap my head around, this section certainly exposed just how much more there is to learn! Exciting but somewhat demoralising.

There are multiple models you can use to create predictions and there are also multiple ways to test whether what you have done is accurate.

So again, this is not a tutorial. Just an exposé of my poor ability.

Funnily enough, I also think this is where it went wrong. My predictions don’t really make any sense.

To set the scene – we have:

  • A table of features we’ll use to make a prediction (the above table) = X
  • A prediction target (the “survived” column) = y

We can split our data into 4 sections and it looks like so:

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

This splits our data into the four variables I’ve specified. random_state=0 just means we get the same split every time the script is run (the shuffle still happens, it’s just reproducible rather than different each run).

Now we can define our models. I picked a variety of different models to see what results I would get and will hopefully be able to explain the models in another post. However, a detailed understanding of them isn’t necessary at the moment.

I used two linear models and four non-linear models. The most accurate model I used was “SVC” or Support Vector Classification.

from sklearn.svm import SVC

SVM = SVC(gamma='auto') # defines the model

SVM.fit(train_X.drop(["Survived"], axis=1), train_y) # lets the model "learn" from the training data we provide

Y_prediction = SVM.predict(test_X.drop(["PassengerId"], axis=1)) # predicts the values that should go in the "Survived" column

acc_log = round(SVM.score(train_X.drop(["Survived"], axis=1), train_y) * 100, 2) # mean accuracy on the training data

acc_log # shows the accuracy as a percentage

My final result was 83.7% accuracy!

My first attempt led me to a 99.7% accuracy – Ain’t no way! And the kicker? It predicted everyone would die!

I did this entire project for this comic.

At this point, my brain rightfully died and I submitted my prediction to the Kaggle competition, where it scored better than 77% of other users. So there is much room for improvement.

8. Final thoughts

This is a beginner problem designed to help people get used to the basics of machine learning so the dataset is better than you’d usually get in the real world.

As I was working through this, I noticed that there are a lot of decisions we can make when creating a prediction. It sounds obvious but it’s important. This is where normal cognitive biases can creep in unnoticed – especially when the information we’re working with is far more complex and less complete.

For example, if any of the features were less complete, our choices about how to fill them in would have a greater impact on our results. And the algorithms we choose are never a one-size-fits-all solution (which is why we often test many).

I’ll publish my code on my GitHub page when I’ve cleaned it up slightly and removed the swear words.

I’ve probably made a really dumb mistake somewhere so if you feel like looking at the code, please do let me know what that might be…

And with that, I bring it to an end.

There will be much more to improve and learn but I’m glad I’ve given this a shot.


Twitter @ImprovingSlowly

Recent Data Science Somedays posts

How to copy the entire internet | Data Science Somedays


My honourable guests, thank you for joining me today to learn how to copy the entire internet and store it in a less efficient format.

A recent project of mine is working on the Police Rewired Hackathon which asks us to think of ways to address hate speech online. Along with three colleagues, we started hacking the internet to put an end to hate speech online.

We won the Hackathon and all of us were given ownership of Google as a reward. Google is ours.

Thank you for reading and accepting your new digital overlords.

Our idea is simple and I will explain it without code because my code is often terrible and I’ve been saved by Sam, our python whisperer, on many occasions.

  1. Select a few Twitter users (UCL academics in this case)
  2. Take their statuses and replies to those statuses
  3. Analyse the replies and classify them as hate speech, offensive speech, or neither
  4. Visualise our results and see if there are any trends.

In this post, I will only go through the first two.

Taking information from Twitter

This is the part of the hackathon I’ve been most involved in because I’ve never created a Twitter scraper before (a program that takes information from Twitter and stores it in a spreadsheet or database). It was a good chance to learn.

For the next part to make sense, here’s a very small background of what a “tweet” is.

It is a message/status on Twitter with a limited amount of text – you can also attach images.
These tweets carry a lot of information which can be used for all sorts of analysis. For example, a single tweet contains:

  1. Text
  2. Coordinates of where it was posted (if geolocation is enabled)
  3. The platform it came from (“Twitter for iPhone”)
  4. Likes (and who liked them)
  5. Retweets (and who did this)
  6. Time it was posted

And so on. With thousands of tweets, you can extract a number of potential trends and this is what we are trying to do. Does hate speech come from a specific area in the world?

OK, now how do we get this information?

There are two main ways to do this. The first is by using the Twitter Application Programming Interface (API). In short, Twitter provides an interface that code like mine can talk to, and it hands information back. For example, every tweet has a “status ID” that I can use to differentiate between tweets.

All you need to do is apply for developer status and you’ll be given authentication keys. There is a large limitation though – it’s owned by Twitter and Twitter, like most private companies, values making money.

There is a free developer status, but that only allows a small sample of tweets to be collected, going back at most 7 days. Anything older than that and I receive nothing. I also can’t call the API too often before it tells me to shut up.

Collecting thousands of tweets at a decent rate would cost a lot of money (which people like myself… and most academics, cannot afford).

Fine.

Programmers are quite persistent. There are helpful Python modules (a bunch of code that helps you write other code) such as Twint.

Twint is a wonderfully comprehensive module that allows for significant historical analysis of Twitter. It uses a lot of the information that Twitter exposes and does what the API does, but without Twitter’s artificial limitations. However, it is fickle – for an entire month it was broken because Twitter changed a URL.

Not sustainable.

Because I don’t want to incriminate myself, I will persist with the idea that I used the Twitter API.

How does it work?

Ok, I said no code but I lied. I didn’t know how else to explain it.
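The loop below assumes an authenticated api object and a list of users already exist. Setting those up looks something like this – the key names are placeholders for the credentials Twitter hands you.

import tweepy

# credentials from the Twitter developer dashboard (placeholders)
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# wait_on_rate_limit makes tweepy sleep instead of erroring when Twitter tells us to shut up
api = tweepy.API(auth, wait_on_rate_limit=True)

users = ["ImprovingSlowly"]  # in the real project, a list of UCL academics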

for user in users:
    tweets_pulled = dict()
    replies = []
    for user_tweets in tweepy.Cursor(api.user_timeline, screen_name=user).items(20): # the user's 20 most recent tweets
        for tweet in tweepy.Cursor(api.search, q='to:'+user, result_type='recent').items(100): # process up to 100 replies (limit per hour)
            if hasattr(tweet, 'in_reply_to_status_id_str'):
                if tweet.in_reply_to_status_id_str == user_tweets.id_str: # keep only replies to one of those 20 tweets
                    replies.append(tweet)

I’ve removed some stuff to make it slightly easier to read. However, it is a simple “for loop”. This takes a user (“ImprovingSlowly”) and takes 20 tweets from their timeline.

After it has a list of these tweets, it searches Twitter for tweets sent to “ImprovingSlowly” and adds to a list any that were replies to one of those statuses.

Do that for 50 users with many tweets each and you’ll find yourself with a nice number of tweets.
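And to actually store them in that “less efficient format” I promised, the replies list can be flattened into a CSV. A sketch – treat the choice of columns as illustrative rather than exactly what we kept.

import pandas as pd

rows = []
for reply in replies:
    rows.append({
        "user": reply.user.screen_name,
        "text": reply.text,
        "created_at": reply.created_at,
        "in_reply_to": reply.in_reply_to_status_id_str,
    })

pd.DataFrame(rows).to_csv("replies.csv", index=False)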

If we ignore the hundred errors I received, multiple expletives at 11pm, and the three times I slammed my computer shut because life is meaningless, the code was pretty simple all things considered. It helped us on our way to addressing the problem of hate speech on Twitter.

Limitations

So there are many limitations to this approach. Here are some of the biggest:

  1. With hundreds of thousands of tweets, this is slow. Especially with the limits placed on us by Twitter, it can take hours to barely scratch the surface
  2. You have to “catch” the hate speech. If hate speech is caught and deleted before I run the code, I have no evidence it ever existed.
  3. …We didn’t find much hate speech. Of course this is good. But a thousand “lol” replies doesn’t really do much for a hackathon on hate speech.

Then there’s the bloody idea of “what even is hate speech?”

I’m not answering that in this blog post. I probably never will.

Conclusion

Don’t be mean to people on Twitter.

I don’t know who you are. I don’t know what you want. If you are looking for retweets, I can tell you I don’t have any to give, but what I do have are a very particular set of skills.

Skills I have acquired over a very short Hackathon.

Skills that make me a nightmare for people like you.

If you stop spreading hate, that’ll be the end of it. I will not look for you, I will not pursue you, but if you don’t, I will look for you, I will find you and I will visualise your hate speech on a Tableau graph.

What I’m currently learning in Data Science | Data Science Somedays


It is 26 September as I write this, meaning that I’m on day 26 of #66daysofdata.

If this is unfamiliar to you, it’s a small journey started by a data scientist named Ken Jee. He decided to “restart” his data science journey and invited us all to come along for the ride.

I’m not a data scientist, I’ve just always found the young field interesting. So, for this instalment of Data Science Somedays, I thought I’d go through some of the things I’ve learned (in non-technical detail).


Data Ethics

I’m starting with this because I actually think it’s one of the most important, yet overlooked parts of Data Science. Just because you can do something, doesn’t mean you should. Not everything is good simply because it can be completed with an algorithm.

One of the problems with Data Science, at least in the commercial sphere, is that there’s a lot of value in having plenty of data. Sometimes that value is prioritised over privacy. In addition, many adversaries understand the value of data and, as a result, aim to muddy the waters with large disinformation campaigns or steal personal data. What does the average citizen do in this scenario?

Where am I learning this? Fast.ai’s Practical Data Ethics course.


Coding

How do I even start?

Quite easily because I’m not that good at programming so I haven’t learned all that much. Some of the main things that come to mind are:

  1. Object Oriented Programming (this took me forever to wrap my head around… it’s still difficult).
  2. Python decorators
  3. Functions

All of this stuff has helped me create a few small projects of my own.

None of them are impressive. But they exist and I was really happy when I fixed my bugs (if there are more, don’t tell me).

Where am I learning this? 2020 Complete Python Bootcamp: From Zero to Hero in Python.

(I said earlier I haven’t learned much – that’s just me being self-deprecating. It’s a good course, I’m just not good at programming… yet. I also bought it for £12 – Udemy is on sale all the time, literally.)


Data visualisation and predictions

Pandas

After a while, I wanted to direct my coding practice to more data work rather than gaining a general understanding of Python.

To do this, I started learning Pandas, a library (a bunch of pre-written code that helps you do things quickly) that focuses on data manipulation. In short, I can now work with Excel files in Python. It included things such as:

  • How to rename columns
  • How to find averages, reorganise information, and then create a new table
  • How to answer basic data analysis questions

Pandas is definitely more powerful than the minor things I mentioned above. It’s still quite difficult to remember how to use all of the syntax so I still have to Google a lot of basic information but I’ll get there.
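To give a flavour of what those bullet points mean in practice (the file and column names here are made up):

import pandas as pd

df = pd.read_excel("sales.xlsx")  # an imaginary spreadsheet

# rename a column
df = df.rename(columns={"qty": "quantity"})

# find averages, reorganise the information, and create a new table
summary = df.groupby("region")["quantity"].mean().reset_index()

# answer a basic data analysis question: which region sold the most?
summary.sort_values("quantity", ascending=False).head(1)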

Where am I learning this? Kaggle – Pandas

Bokeh and Seaborn

When I could mess around with Excel files and data sets, I took my talents to data visualisation.

Data visualisation will always be important because looking at tables is 1) boring, 2) slow, and 3) boring. How could I make my data sets at least look interesting?

Seaborn is another library that makes data visualisation much simpler (e.g. “creating a bar chart in one line of code” simpler).

Bokeh is another library that seems to be slightly more powerful, in the sense that I can make my visualisations interactive, which is helpful – especially when you have a lot of information to display at once.
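Roughly what I mean, side by side – a minimal sketch rather than anything lifted from my notebooks:

import seaborn as sns
from bokeh.plotting import figure, show

# Seaborn: the bar chart really is one line ("tips" is a practice dataset that ships with Seaborn)
tips = sns.load_dataset("tips")
sns.barplot(x="day", y="total_bill", data=tips)

# Bokeh: a few more lines, but the result is interactive (pan, zoom, save)
p = figure(title="Example", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4], [4, 7, 2, 5])
show(p)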

I knew that going through tutorials would have its limits, as my hand is always being held, so I found a data set on ramen and created Kaggle notebooks. My aim was to practise and show others what my thought process was.

Where am I learning this? Seaborn | Bokeh


Machine learning

This is my most recent venture. How can I begin to make predictions using code, computers and coffee?

I still find all of the above quite difficult, and it will be a little while before I can say “I know Python”, but this topic seemed like the one with the biggest black box.

If I write:

filepath = "hello.csv"
pandas.read_csv(filepath)

I understand that I’m taking a function from the Pandas library, and that function will read the .csv file I’ve pointed it at.

If I write model.predict(X_new_data) for some scikit-learn model – honestly, what is even happening? Half the time, I feel like it’s just luck that I get a good outcome.
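For context, the define–fit–predict pattern itself is only a few lines – it’s what happens inside fit() that’s the black box. This is a sketch with placeholder variable names:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)               # learn patterns from data we know the answers for
predictions = model.predict(X_new_data)   # apply those patterns to data we don't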

Where am I learning this? Kaggle – Intro to machine learning


What is next?

I’m going to continue learning about data manipulation with Pandas and Bokeh as those were the modules I found the most interesting to learn about. However, that could very easily change.

My approach to learning all of this is to go into practice as soon as I can even if it’s a bit scary. It exposes my mistakes and reminds me that working through tutorials often leaves me feeling as though I’ve learned more than I have.

There’s also a second problem – I’m not a Computer Science student so I don’t have the benefit of learning the theory behind all of this stuff. Part of me wants to dive in, the other part is asking that I stay on course and keep learning the practical work so I can utilise it in my work.

Quite frequently, I get frustrated by not understanding and remembering what I’m learning “straight away”. However, this stuff isn’t easy by any stretch of the imagination. So it might take some time.

And that’s alright. Because we’re improving slowly.