To survive the titanic, become a 50 year old woman | Data Science Somedays

Firstly, the title is a joke. I really have no helpful insights to share as you’ll see from my work.

This will be split into a few sections

What is machine learning?
Train and test data
Visualising the training data
Creating a feature
Cleaning the data
Converting the data
Testing predictions with the test data
Final thoughts

It should definitely be mentioned that this is the furthest thing from a tutorial you will ever witness. I’m not writing to teach but to learn and tell bad jokes.

If you want a helpful tutorial (one that helped me along), follow Titanic – Data Science Solutions on Kaggle.

1. What is Machine Learning?

One of the basic tasks in machine learning is classification. You want to predict something as either “A will happen” or “B will happen”. You can do this by using historical data and selecting algorithms that are best fit for purpose.

The problem we are posed with is:

Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.
Kaggle – Machine Learning From Disaster

2. Train and Test data

Kaggle, the data science website, has a beginner problem called “Titanic – Machine Learning from Disaster” where you’re given data about who survived the Titanic disaster along with information about their age, name, number of siblings and so on. You’re then asked to predict the outcome for 418 other passengers.

The original table looks something like this:

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Initial table for titanic problem

This is what we call “training data”. It is information that we know the outcome for, and we can use it to fit our algorithms and then make a prediction.

There is also “test” data. It is similar to the data above but with the survived column removed. We will use this to check our predictions against and see how well our efforts have done with all of the visualisations and algorithm abuse we’re doing.

3. Visualising the data

To start with, it’s important to simply have a look at the data to see what insights we can gather from a bird’s-eye view. Otherwise we’re just staring at tables and then hoping for the best.

I won’t go through everything (and yes, it is very rough) but we can gain some basic insights from this. It might influence whether we want to create any new features or focus on certain features when trying to predict survival rates.

For example, we can see from the box plots that most people were roughly 30 years old and had one sibling on board (2nd row, first two box plots). From the histograms, we can see that most people were in passenger class 3 (we have no idea what that means in real life) and a lot of people on the titanic (at least in this dataset) were pretty young.

How does this impact survival? I’m glad you asked. Let’s look at some more graphs.

Survival rates vs passenger class, sex and embarking location. Women in passenger class 1 seemed to live…

Women seemed to have a much higher chance of survival at first glance

Now, we could just make predictions based off these factors if we really wanted to. However, we can also create features based on the information that we have. This is called feature engineering.

4. Creating a feature

I know, this seems like I’m playing God with data. In part, that is why I’m doing this. To feel something.

We have their names with their titles included. We can extract their titles and create a feature called “Title”. With this, we’ll also be able to make a distinction between whether people with fancy titles were saved first or married women and so on.

for dataset in new_combined:
    # Raw string (r'...') so that "\." reaches the regex engine as a literal dot
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

You don’t need to understand everything or the variables here. They are specific to the code written which is found on my GitHub.

It basically takes the name “Braund, Mr. Owen Harris” and looks for a space, then a run of letters (the A-Za-z part), then a dot. When this code is run, it’ll pull out “Mr” because it fits that pattern (the dot has to be there, but only the letters in the brackets are kept). If it was written as “mr” with no dot, the code would miss the title and leave a missing value for that name. It’s great, I’ll definitely be using the str.extract method again.
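To poke at the pattern without loading the whole dataset, here is a minimal sketch that applies the same regular expression with Python's built-in re module instead of pandas. The helper name extract_title is made up for this example and is not from my project code.

```python
import re

# Same pattern as the str.extract call above: a space, a run of letters, then a literal dot.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")


def extract_title(name):
    """Return the title found in a passenger name, or None if nothing matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


print(extract_title("Braund, Mr. Owen Harris"))  # Mr
print(extract_title("Heikkinen, Miss. Laina"))   # Miss
print(extract_title("Braund, mr Owen Harris"))   # None (there is no dot after "mr")
```

The pandas version behaves the same way, except that a name with no match ends up as a missing value in the new column rather than None.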

5. Cleaning the data

A lot of data is bad. Data can regularly contain missing values, mistakes or simply be remarkably unhelpful for our goals. I’ve been told that this is a large part of the workflow when trying to solve problems that require prediction.

We can get this information pretty quickly:

new_combined.info() #This tells us how many non-null values each column has
new_combined.isna().sum() #This tells us how many null values each column has (it's quicker than the first method)
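As a rough illustration of what the second call returns, here is a sketch on a tiny made-up table. The values are invented for the example and are not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic table; every value here is made up.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# One count per column: how many entries are missing.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

The result is one number per column, so it points at which columns need cleaning rather than which rows.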

In the titanic data set, we have loads of missing data in the “age” column and a small amount in the “embarked” column.

For the “age” section, I followed the advice from the tutorial linked above and guessed the ages based on their passenger class, and sex.

For the “embarked” section, because there were so few missing values, I filled them in using the most common location someone embarked on.
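Both fills can be sketched in a few lines of pandas. This is a toy example with invented rows, and it assumes the age "guess" is the median of each class and sex group; I haven't said above which statistic was actually used, so treat that as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows, just enough to have one gap in each column.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [38.0, 35.0, np.nan, 22.0, 30.0, np.nan],
    "Embarked": ["C", "S", "S", "S", None, "S"],
})

# Assumption: fill a missing age with the median age of the same class and sex.
group_median = df.groupby(["Pclass", "Sex"])["Age"].transform("median")
df["Age"] = df["Age"].fillna(group_median)

# Fill a missing port with the most common port in the column.
most_common_port = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common_port)

print(df)
```

Using the group median rather than one overall average keeps the guess closer to passengers who look similar, which is the point of splitting by class and sex.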

As you can see, cleaning data requires some assumptions to be made and can utilise different techniques. It is definitely something to keep in mind as datasets get bigger and messier. The dataset I’m working with is actually pretty good which is likely a luxury.

It isn’t sexy but it’s important. I suppose that’s the case with many things in life.

6. Converting the data

In order for this information to be useful to an algorithm, we need to make sure that the information we have in our table is numerical.

We can do this by mapping groups of information to numbers. I did this for all features.

It basically follows this format:

for item in new_combined:
    item.Sex = item.Sex.map({"male":0, "female":1}).astype(int)

It is important to note that this only works if all of the info is filled in (which is why the previous step is so important).
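Here is a small sketch of why the gaps matter, using a made-up Series rather than the real column. A value that isn't in the mapping (including a missing one) comes out as NaN, and NaN can't be turned into an integer.

```python
import pandas as pd

# Made-up column with one missing entry.
sex = pd.Series(["male", "female", None, "female"])

mapped = sex.map({"male": 0, "female": 1})
print(mapped.isna().sum())  # how many entries could not be mapped

try:
    mapped.astype(int)
    cast_failed = False
except (ValueError, TypeError) as err:
    cast_failed = True
    print("astype(int) failed:", err)
```

Filling in the missing values first means the map produces a number for every row, so the final astype(int) goes through.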

For features that have a large number of entries (for example, “age” could potentially have 891 unique values), we can group them together so we have a smaller number of numerical values. This is the same for “fare” and the “title” feature created earlier.

It is basically the same as above but there is one prior step – creating the bands! It is simply using the “pd.cut()” feature. This segments whichever column we specify into the number of bands we want. Then we use those bands and say something like:

“If this passenger is between the age of 0 and 16, we’ll assign them a “1”.”
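A minimal sketch of the banding step, on made-up ages. The band edges and the labels 1 to 5 are assumptions picked for the example (the real edges depend on what pd.cut finds in the data).

```python
import pandas as pd

ages = pd.Series([4, 15, 22, 35, 47, 63, 80])  # made-up ages

# Passing an integer asks pd.cut for that many equal-width bands.
bands = pd.cut(ages, 5)
print(bands.cat.categories)

# Passing explicit edges plus labels assigns each age a band number.
edges = [0, 16, 32, 48, 64, 80]  # hypothetical band edges
age_codes = pd.cut(ages, bins=edges, labels=[1, 2, 3, 4, 5]).astype(int)
print(age_codes.tolist())  # [1, 1, 2, 3, 3, 4, 5]
```

Each band includes its right-hand edge by default, so an age of exactly 16 lands in the first band here.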

Our final table will look like this:

Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	Title
0	3	0	1	1	0.0	1	3
1	1	1	2	1	3.0	3	4
1	3	1	1	0	1.0	1	2
1	1	1	2	1	3.0	1	4
0	3	0	2	0	1.0	1	3

Much less interesting to look at but more useful for our next step

7. Testing predictions with the test data

Now we have a table prepared for our predictions, we can select algorithms, fit them to our training data, then make a prediction.

While the previous stages were definitely frustrating to wrap my head around, this section certainly exposed just how much more there is to learn! Exciting but somewhat demoralising.

There are multiple models you can use to create predictions and there are also multiple ways to test whether what you have done is accurate.

So again, this is not a tutorial. Just an exposé of my poor ability.

Funnily enough, I also think this is where it went wrong. My predictions don’t really make any sense.

To set the scene – we have:

A table of features we’ll use to make a prediction (the above table) = X
A prediction target (the “survived” column) = y

We can split our data into 4 sections and it looks like so:

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

This splits our data into the four variables I’ve specified. By default, a quarter of the rows are held back as the validation set. “random_state = 0” just means we get the same shuffled split every time the script is run (the rows are still shuffled, just in a repeatable way).
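To see what the split does, here is a sketch on made-up arrays rather than the Titanic table. It assumes scikit-learn's defaults, where no test_size is given.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 made-up rows with 2 features each
y = np.arange(20) % 2             # made-up 0/1 labels

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
again_train_X, again_val_X, _, _ = train_test_split(X, y, random_state=0)

print(train_X.shape, val_X.shape)          # (15, 2) (5, 2)
print(np.array_equal(val_X, again_val_X))  # True: same seed, same split
```

Changing random_state to another number gives a different, but equally repeatable, shuffle.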

Now we can define our models. I picked a variety of different models to see what results I would get and will hopefully be able to explain the models in another post. However, a detailed understanding of them isn’t necessary at the moment.

I used two linear models and four non-linear models. The most accurate model I used was “SVC” or Support Vector Classification.

from sklearn.svm import SVC

SVM = SVC(gamma='auto') #Defines the model

SVM.fit(train_X.drop(["Survived"],axis=1), train_y) #Allows the model to "learn" from the data we have provided

Y_prediction = SVM.predict(test_X.drop(["PassengerId"], axis=1)) #Predicts the values that should be in the "Survived" column

acc_log = round(SVM.score(train_X.drop(["Survived"],axis=1), train_y) * 100, 2) #Mean accuracy on the training rows themselves (not on held-out data)

acc_log #Shows the accuracy as a percentage
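The score in the snippet above is computed on train_X and train_y, the same rows the model was fitted on, so it tends to flatter the model. A rough sketch of scoring on the held-back validation rows instead, using synthetic data from make_classification as a stand-in (this is an assumption for the example, not my Titanic table):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 400 rows, 7 numeric features, 0/1 target.
X, y = make_classification(n_samples=400, n_features=7, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = SVC(gamma="auto")
model.fit(train_X, train_y)

# Accuracy on rows the model has seen versus rows it has not.
train_acc = round(model.score(train_X, train_y) * 100, 2)
val_acc = round(model.score(val_X, val_y) * 100, 2)
print(train_acc, val_acc)
```

The validation number is usually the more honest of the two, since those rows played no part in fitting.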

My final result was 83.7% accuracy!

My first attempt led me to a 99.7% accuracy – Ain’t no way! And the kicker? It predicted everyone would die!

I did this entire project for this comic.

At this point, my brain rightfully died and I submitted my prediction to the Kaggle competition with it being better than 77% of other users. So there is much room for improvement.
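One sanity check for the "everyone would die" prediction mentioned above is to work out what that guess scores on the training labels. This is a plain Python sketch; the only outside fact it leans on is that 549 of the 891 passengers in Kaggle's train.csv did not survive.

```python
# 549 of the 891 training passengers died (Survived == 0), 342 survived.
actual = [0] * 549 + [1] * 342
predicted = [0] * len(actual)  # the "everyone dies" guess

# Fraction of passengers where the guess matches what happened.
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(round(accuracy * 100, 2))  # 61.62
```

So an all-dead prediction tops out at roughly 62% against these labels, which suggests a 99.7% score was being measured against something other than the real outcomes.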

8. Final thoughts

This is a beginner problem designed to help people get used to the basics of machine learning so the dataset is better than you’d usually get in the real world.

As I was working through this, I noticed that there are a lot of decisions we can make when creating a prediction. It sounds obvious but it’s important. This is where normal cognitive biases creep in which can go unnoticed – especially when the information we’re working with is far more complex and less complete.

For example, if any of the features were less complete, our decisions on how to fill them in would have a greater impact on our predictions. The algorithms we choose are never a one size fits all solution (which is why we often test many).

I’ll publish my code on my GitHub page when I’ve cleaned it up slightly and removed the swear words.

I’ve probably made a really dumb mistake somewhere so if you feel like looking at the code, please do let me know what that might be…

And with that, I bring it to an end.

There will be much more to improve and learn but I’m glad I’ve given this a shot.

Twitter @ImprovingSlowly

Recent Data Science Somedays posts

The Sunday Monday Post | I Can Swim

I thought I’d start the Sunday Monday Post so I can talk more loosely about the things I’ve enjoyed within the self-improvement sphere and how I think I’ve improved in the past week (or since the time of the last edition).

It won’t be a very structured article and will probably involve more jokes than are necessary. However, you probably won’t notice them because I’m not very funny. If I say I’ve told a joke then you need to laugh to make sure I don’t cry.

Thanks.

Nonetheless, let me think about what’s happened to me this week

I have ~~great~~ ~~amazing~~ unbelievable news.

I can swim.

As in, when I go into the water and try to move forward I don’t begin to drown straight away or wonder why I decided to ever even think about getting wet with chlorine in the first place. I actually move forward (or backwards because I can do the backstroke too. Just saying.) It’s fascinating.

When I first moved through the water without touching the floor, I nearly punched the pool wall because I was so excited that it happened. I’ve only had four lessons so I didn’t expect it to happen as quickly as it did. Then I tried again but drank far too much pool water. Then I tried again, made a few changes, then I stopped drinking an excessive amount of pool water. But then I might make a different mistake like not actually kicking my legs. Then I’d go again.

But at least I’d be making small changes every time I came to stop. It made the whole swimming thing much easier to manage than trying to complete everything at once. Nonetheless, at the end of the session, I was swimming a decent amount. I can’t do it very far or for very long but it’s much better than the way I was like 15 years ago.

Any time I’d try to get into the water, I’d just flail around, it’d take me forever to progress onto the floats but as soon as I had to support the majority of my body weight, it’d be like my body mass tripled and rather than moving forward through the water, I’d just move down.

Let’s forget the general idea that humans actually float in the water or the fact that you can stand up in training pools. I couldn’t do either. I’d just be dead for the most part.

But now, I don’t die. I just swim for a bit and die a bit later.

To commemorate this moment, I drew a bunch of pictures:

Before my swimming lessons, I found a few different swimming tutorials which gave a few pointers on how to get over the fear of water.

I started to break down the different parts of swimming and practised them individually (though, I always tried to breathe). It made swimming much more manageable.

I’ve conquered years and years of fears by learning how to swim. I’m not very good but that’s OK. I’ve taken the first step. Now I can continue working on swimming and improving slowly in the process.

And dammit I’m proud.

As always, thanks for reading :)

I have facebook and twitter. Check them out @improvingslowly

Yes this is on a Tuesday. No, I don’t know why.

Sisu – Developing mental toughness in the face of adversity

World War Two. 30th November 1939.

The Finnish had a slight problem.

The Soviet Union invaded Finland with a total of one million troops to the Finns’ 300 thousand. 4000 planes to their 114. 2500 tanks to their 32. Being outnumbered 3 to 1 in war isn’t only near insurmountable, it is guaranteed death.

Seeing no silver lining – they jumped into the storm anyway.

Sisu

This term, Sisu, doesn’t have a direct translation into English. Emilia Lahti, in her remarkable TED-talk, explains that it can be seen as “extraordinary determination and resoluteness in the face of extreme adversity”.

Sisu is an interesting term because it is almost like the final boost we have when we feel like we’re breaking down in front of a problem. I’d liken it to getting a blue shell in Mario Kart but that’s disgraceful – Sisu is not. It’s remarkable.

And much like those Finnish warriors, we all have it.

The Winter War presented the Finnish soldiers with the biggest challenge they will ever have to face. A Goliath knocked on their door and demanded they surrender. Finland refused. The Goliath barged in and chose to take it by force. Again, Finland stood its ground.

How do you approach such problems with high morale instead of feeling defeated?

How do you keep on going when failure is the only thing inviting itself in?

What do you do when you’ve reached the end of your capabilities?

Sisu.

Extraordinary determination in the face of extreme adversity

How do we utilise Sisu?

Stay in the present moment – Don’t create extra problems that don’t exist yet by looking into the future or mulling on past regrets. Staying in the present means we focus on the problem as it is rather than how we think it might be.

As a result, we don’t needlessly exaggerate problems. Doing so rarely helps and instead paralyses our desire to take action.

Of course, this does not mean problems can’t be horrendous or extremely tough to manage. We’ll grieve, cry, become angry and curse the gods for leaving us here.

However, this cannot be the only thing we do.

Make a choice to take action. When the Finnish were fighting, they had to make the conscious decision that they were going to do something about it. This is important. While we’re likely to think we’re going to do something, often times, we do just that. Think about it and never move forward.

It’s much easier to think about how you’re going to handle something difficult indefinitely by getting stuck in the loop of justifying yourself. Never facing the fear of completing what you set out to do. Sometimes it’s best to let the fear pass but in these difficult situations when Sisu is needed, the cloud of fear may never leave to reveal a clear sky. You jump into the storm anyway.

Becoming a person of action when faced with problems fills us with great confidence and shows us that we’re often able to handle it much more than we could have ever imagined.

If you do these two things, a few benefits follow.

We limit complaining.

Life can seriously suck sometimes. There are a multitude of barriers we might face. It ranges from eating spicy food, wiping your eyes with chili covered hands, crying then realising you have no tissues to being in a violent spiral of debt.

Endless complaining, no matter how justified it seems, prevents anything from happening and gives us reasons to complain even more. Much like venting our anger, complaining might become an enjoyable thing to do even if we can’t admit that to ourselves.

We regain confidence.

Remembering Sisu helps us realise that we have more power over our problems than at first thought. When our intense effort and determination pays off, we’ll only have ourselves to thank.

We remember that these are very human problems.

This is one of those things that’s easier to understand than it is to explain.

Everyone struggles at some point in their lives. Sometimes it lasts longer with greater intensity than others.

The idea isn’t to compare your problems to others. Nor is it to think you’re solely unique. It’s extremely helpful to find strength in the fact that others have suffered similar fates and made it through.

You might feel alone.

You might feel lost.

You might feel hopeless.

You can make it through the storm. No doubt it’ll be difficult and frustrating. But we’re stronger than we think.

Life without any adversity will be a life without any progress. We have this idea that we become better despite our problems but what if we become better because of them?

“The obstacle in the path becomes the path. Never forget, within every obstacle is an opportunity to improve our condition”

This doesn’t mean we can’t dislike the adversity. I hate being in pain for example. However, that isn’t to say that I haven’t learned anything at all. I don’t think I would have started this blog or learned how to study effectively without it.

Sometimes, we owe our strengths to the problems that forced us to develop them.

Remember Sisu when you’re writing essays and revising for exams but feel like you have no more left.

Remember Sisu when you’re in pain and the only goal you have is to get out of bed.

Remember Sisu when everything in front of you seems to fall apart and blocks the light at the end of the tunnel.

If the tunnel caves in, Sisu reminds us that there’s one more option – break the wall down and create the light at the end of the tunnel yourself.

Thanks to James Clear and Emilia Lahti (TEDx talk) for introducing this idea to me.

I have a poorly used twitter account. Follow it. Thanks. You the best. @ImprovingSlowly

July Reading List

Suddenly, two months turns into eight. I don’t know how it happened but it did. I promise I’ve been reading though. Here are the previous reading lists:

October reading list

August reading list

Onto the current books…

Deep Work: Rules for Focused Success in a Distracted World by Cal Newport

Cal writes a blog called Study Hacks over at calnewport.com and I’ve been following his work for a few years. Over the past year or so, he’s become really interested in learning how we can focus more by employing what he calls “deep work”. He defines it as:

Professional activities performed in a state of distraction-free concentration that push your cognitive capacities to their limit. These efforts create new value, improve your skill and are hard to replicate.

The alternative, shallow work, is the opposite. Non-demanding tasks which are often performed while distracted and easy to replicate. The plight of every student around – writing an essay with Facebook in the background.

Originally, I thought that there cannot be much to say about concentrating really hard on really tough work for a really long time. After all, the crux of the book might be seen as ‘get rid of distractions and get to work’ but there’s much more to it. He goes through multiple tactics for increasing the amount of “deep work” you can get out of the day (it’s very limited since it’s quite tough. So don’t expect eight hours straight away) and why “deep work” is valuable both in a professional and personal sense.

After spending some time with the book and trying to increase my deep work (so I have to work less during the day), I found that it became much easier to do and resulted in a decent dissertation effort towards the end of my degree. More importantly, I found that this sort of stuff can be improved through training (and lost through the lack of it). Much like meditation.

I hope to share some of the things I’ve learned about working more efficiently but here’s one huge take away he loves to talk about – email is not important. Stop checking it so often.

If you do any kind of academic or creative work, you’ll benefit greatly from Deep Work.

An Astronaut’s Guide to Life on Earth by Chris Hadfield

Chris Hadfield is my Canadian dad.

I’m not sure how that’s possible but I want it to be, so it is.

An astronaut with decades of life experience writes about how to live on earth. One hugely desirable virtue of Chris’s writing style is that he gives advice without sounding patronising and without the slightest hint of superiority over the reader. What you see is a character who is confident in his skills and abilities because of his experience in space.

Each chapter goes through a lesson he’s learned from his hours in space and shows us a moment in time where it applies. The great thing about these ‘lessons’ is how applicable they are to a multitude of problems we have in everyday scenarios. He might say “prepare for the worst” in the context of crying in space (without gravity, tears don’t fall to the ground – they just ball up at the front of your eyes) or falling down a flight of stairs in front of loads of people where everyone is too far away to help but close enough to see (my tears fell to the ground perfectly. Thanks for asking).

Despite being an astronaut and being closer to the stars than most of us ever will, he seems to be very well grounded. The advice he offers is enclosed in funny and interesting stories that can entertain even the most apathetic about space.

He’s achieved a lot in his life but despite the magnitude of what he’s done, it isn’t discouraging. He inspires others to do the same.

Ready Player One by Ernest Cline

When I moved home, I started using the library more and came across a book called Jimmy Coates: Killer by Joe Craig. I fell in love with that book and the whole series. I’d stay up reading it and be too tired for school. When I’d write a story in class, I’d steal half my themes from the books and brand myself a literary genius.

I even emailed Joe saying that he’s awesome and can’t wait for his next book to come out. (I’m so glad I’ve stopped ending emails with “please reply, bye (a great fan)”.)

Ready Player One is probably the closest I’ve come to feeling that way again. The content isn’t similar but the pace and overall feel is just fantastic. I always wanted to know what happened next but also caught myself wanting to slow down and appreciate feeling so excited about a story again.

“Oh this chapter isn’t too short, you can read until the end. It’ll be the last one.”

The last time I lied to myself that much, I said I’d start my dissertation “today”.

Honourable mentions:

Hyperbole and a Half by Allie Brosh – This book feels so nice. Seriously, go touch this book, you’ll understand what I mean. It feels brilliant. The stuff inside is also hilarious.

Empathy by Roman Krznaric

Better by Atul Gawande

Do you have any book recommendations? Share them below!

I’ve remembered I have an ill-used twitter account (@improvingslowly go follow it because it’s probably great).

As always, thanks for reading.

If you liked this post, share it with others!
