Start more projects, please | Data Science Some days

And we back. Welcome to Data Science Some Days. A series where I go through some of the things I’ve learned in the field of Data Science.

For those who are new… I write in Python, these aren’t tutorials, and I’m bad at this.

I haven’t written one of these for an entire month. My mistake… time flies when you procrastinate.

I want to spend a little bit of time talking about learning Data Science itself.

I won’t say how long I’ve been learning to code and such because I don’t really know the answer. This isn’t to say I’ve been doing it for a long time, I just don’t have much of a memory for this stuff.

I will, however, say one of the big mistakes I’ve made in my journey:

Not completing enough projects.

Any time I start a new project, I feel figuratively naked – as though all the knowledge I’ve ever gathered has deserted me and I’ll never be able to find it again.

I find it difficult to do anything at all until I get past the initial uncomfortable feeling of not having my hand held through to the end. Will I fall over? Yes. But learning how to get back up is a really useful skill, even if you fall over after another step.

Let’s talk about tutorials

There are a lot of tutorials online about all sorts of things. Many of them are good, some are bad, some are brilliant. When it comes to programming, you will never have a shortage of materials for beginners. As a result, they’re tempting and the entry point is quite low. Many even have no idea where to start.

Tutorials are also easy to get lost in because they do a lot of the heavy lifting in the background. That’s the more helpful way to teach information but perhaps less useful for the learner. This isn’t to say all tutorials and courses are “easy”. Far from it. Rather, no one has ever become a developer or programmer purely off the back of completing a handful of tutorials.

Don’t get attached to tutorials or courses. They can only take us so far. It’s also difficult to stay entertained by them for the long haul.

Learning just enough

My new enjoyment of projects comes from a video by Tina Huaug on How to self-study technical things. She mentioned a helpful principle:

“Learn just enough to start on a project”

This divorces you quite quickly from an attachment to completing courses or selecting the “right one”. If you’ve got what you need out of it, then move on and use the knowledge to create something. Fortunately, the information doesn’t disappear if you tell yourself you might not complete it. It’s fine to refer to them during projects, anyway.

You’ll get to the difficult parts more quickly which lets you understand the true gaps in your knowledge/skill. It’s perfectly fine for this to be humbling. Getting better at anything requires humility.

It’s more fun

Being the person responsible for creating something is a really satisfying feeling, even if it sucks. (It likely does, only in comparison to those who are much more experienced than you, which is unfair. Comparison is a fools game.)

You can point to a model you’ve trained or visualisation you’ve created and say “That was ME”. And it’ll be true.

When you look at a list of potential projects, you’re more likely to add your own twist to it (it doesn’t matter if that’s just experimenting with different colours). If you’re following a tutorial to the T, you miss out on something important:

Ownership.

The difficulties and successes are yours.

Leave yourself open to surprises

I’ve noticed a few things in a recent project of mine (more on that in the next DS Somedays post, it’s nothing special):

I know and understand a bit more than I gave myself credit for
There is so much more I can add to my knowledge base to improve the project
Courses, tutorials, tools are just there to help me reach my end goal. It helps explain why I always have so many tabs open

They can be challenging, which might also explain why they’re easy to avoid. However, I’ll definitely have to work towards doing more. If not for my portfolio but general enjoyment.

Project-based learning is the way forward.

Further resources:

How to self-study technical things.
Project based tutorials (many different programming languages)
Projectlearn.io

Taste the food while you’re cooking | Data Science Some Days

When I’m cooking, sometimes I’m too nervous to taste the food as I’m going along. It’s like, I must think that I’m being judged by my guests over my shoulder.

That’s a rubbish approach because I might reach the end and not realise that I’ve missed some important seasoning.

So, if you want a good dinner, it’s important you just check as you’re going along.

This analogy doesn’t quite work with what I’m about to explain but I’ve committed. It’s staying.

Cross-Validation

The first introduction to testing is usually just an 80/20 split. This is where you take 80% of your training data and use that to train your model. The final 20% is used for validation. It comes in the following form:

X = training_data
y = training_data["prediciton target"]

X_train, X_valid, y_train, y_valid = train_test_split(X,y,train_size=0.8,test_size=0.2)

This is fine for large data sets because the 20% you’re using to test your model against will be large enough to offset the random chance that you’ve just managed to pick a “good 20%”.

With small data sets (such as the one from the Housing Prices competition from Kaggle), you could just get lucky with your test split. This is where cross validation comes into play.

This means you test your model against multiple different subsets of your data. It just takes longer to run (because you’re testing your model many different times).

It looks like this instead:

# cross_val_score comes from sklean.model_selection. 
# The pipeline here just contains my model and the different things I've done to clean up the data
# You multiply by -1 because sklearn uses "higher number is better" and this allows consistency 

scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

“cv=5” determines the number of subsets (or better known as “folds”).

The more you use, the smaller the amount of data you use per test. It might take longer but with small data sets, the difference is negligible.

It’ll spit out a list of scores, then you can just average it to get the best model.

Conclusion

Cross validation is a helpful way to test your models. It helps reduce the chance that you simply got lucky with the validation portion of your data.

With larger data sets, this is less likely though.

It’s important to keep in mind that cross validation increases the run time of your code because it runs your model multiple times (one on each fold for example).

Also, I’m aware that my comic doesn’t make that much sense but it made me laugh so I’m keeping it.

I’m lost and can’t get out of this pipeline | Data Science Some Days

Today, we’re going to go talk about Pipelines – a tool that seems remarkably helpful in reducing the amount of code needed to achieve an end result.

Why pipelines?

When creating a basic model, you need a few things:

Exploratory Data Analysis
Data cleaning and sorting
Predictions
Testing
Return to number one or two and repeat.

As you make different decisions with your data, the more complex the information, the easier it might be to miss a step or simply forget the steps you’ve made! My first model involved so many random bits of code. If I had to reread it, I’d have no idea what I wanted to do. That’s even with comments.

Pipelines help combine different steps into smaller amounts of code so it’s easier. Let’s take a look at what this might look like.

The data we have to work with

As we may know, the data we work with might have many different types of data. Within the columns, they may also have missing values everywhere. From the previous post, we learned that’s a no go. We need to fill in our data with something.

test_dataset.isna().sum()

pickles 100
oats 22
biscuits 15bananas 0

Alright, a bunch of missing values, let’s see what data type they are.

test_dataset.dtype

pickles int64
oats object
biscuits object
bananas int64

We have numbers (int64) and strings (object – not strictly just strings but we’ll work with this).

So we know that we have to fill in the values for 3 of the 4 columns we have. Additionally, we have to do this for different kinds of data. We can utilise a Simple Imputation and OneHotEncoder to do this. Let’s try to do this in as few steps as possible.

Creating a pipeline

#fills missing values with the mean across the column when applied to dataset
numerical_transformer = SimpleImputer(strategy="mean") 

# replaces the categorical columns with a number (1 for yes, 0 for no) when applied to dataset
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Let's apply the above steps using a "column transformer". A very helpful tool when
preprocesser = ColumnTransformer(transformers=[("num", numerical_transformer, numerical_columns),
("cat", categorical_transformer, categorical_columns)])

Ok, we’ve done a lot here. We’ve defined the methods we want to use to fill in missing values and how we’re going to handle categorical variables.

Just to prevent confusion, “ColumnTransformer” can be imported from “sklearn.compose”.

Now we know that the methods we are using are consistent for the entirety of the dataset. If we want to change this, it’ll be easier to find and simpler to change.

Then we can put this all into a pipeline which contains the model we wish to work with:

# bundles the above into a pipeline containing a model
pipeline_model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

This uses a small amount of code. So now we can use this when fitting our model to the training data we have then later making predictions. This is instead of trying to make sure our tables are all clean and accidentally applying predictions to the wrong table (I have done this…), we can just send the raw data through this pipeline and we’ll be left with a prediction.

It’ll look something like this:

fitted_model = pipeline_model.fit(X_train, y_train)
my_prediction = pipeline_model.predict(X_valid)

Conclusion

Pipelines have been an interesting introduction to my Data Science journey and I hope this helps give a rough idea of what they are and why they might be useful.

Of course, they can (and will) become more complex if you are faced with more difficult problems. You might want to apply different methods of filling in missing values to different sections of the data set. You might want to test multiple different models. You might want to delete Python from your computer because you keep getting random errors.

Whatever it may be, just keep trying, and experimenting.

Some helpful resources:

Pipelines | Kaggle Intermediate Machine Learning – Alexis Cook
A Simple Guide to Scikit-learn Pipelines | Rebecca Vickery
A Simple Example of Pipeline in Machine Learning with Scikit-learn | Saptashwa Bhattacharyya

They’re all better than this post. Promise – it’s not hard to do.

To survive the titanic, become a 50 year old woman | Data Science Somedays

Firstly, the title is a joke. I really have no helpful insights to share as you’ll see from my work.

This will be split into a few sections

What is machine learning?
Train and test data
Visualising the training data
Creating a feature
Cleaning the data
Converting the data
Testing predictions with the test data
Final thoughts

It should definitely be mentioned that this is the furthest thing from a tutorial you will ever witness. I’m not writing to teach but to learn and tell bad jokes.

If you want a helpful tutorial (one that I helped me along), follow Titanic – Data Science Solutions on Kaggle.

What is Machine Learning?

One of the basic tasks in machine learning is classification. You want to predict something as either “A will happen” or “B will happen”. You can do this with historical data and selecting algorithms that are best fit for purpose.

The problem we are posed with is:

Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.
Kaggle – Machine Learning From Disaster

2. Train and Test data

Kaggle, the data science website, has a beginner problem called “Titanic – Machine Learning from Disaster” where you’re given data about who survives the titanic crash with information about their age, name, number of siblings and so on. You’re then asked to predict the outcome for 400 people.

The original table looks something like this:

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	PC 17599	71.2833	C85	C	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	S

Initial table for titanic problem

This is what we call “training data”. It is information that we know the outcome for and we can use this to make our fit our algorithms to then make a prediction.

There is also “test” data. It is similar to the data above but with the survived column removed. We will use this to check our predictions against and see how well our efforts have done with all of the visualisations and algorithm abuse we’re doing.

3. Visualising the data

To start with, it’s important to simply have a look at the data to see what insights we can gather from a birds eye view. Otherwise we’re just staring at tables and then hoping for the best.

I won’t go through everything (and yes, it is very rough) but we can gain some basic insights from this. It might influence whether we want to create any new features or focus on certain features when trying to predict survival rates.

For example, we can see from the box plots that most people were roughly 30 years old and had one sibling on board (2nd row, first two box plots). From the histograms, we can see that most people were in passenger class 3 (we have no idea what that means in real life) and a lot of people on the titanic (at least in this dataset) were pretty young.

How does this impact survival? I’m glad you asked. Let’s look at some more graphs.

Survival rates vs passenger class, sex and embarking location. Women in passenger class 1 seemed to live…

Women seemed to have a much higher chance of survival at first glance

Now, we could just make predictions based off these factors if we really wanted to. However, we can also create features based on the information that we have. This is called feature engineering.

4. Creating a feature

I know, this seems like I’m playing God with data. In part, that is why I’m doing this. To feel something.

We have their names with their titles includes. We can extract their titles and create a feature called “Title”. With this, we’ll also be able to make a distinction between whether people with fancy titles were saved first or married women and so on.

for dataset in new_combined:
    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)

You don’t need to understand everything or the variables here. They are specific to the code written which is found on my GitHub.

It basically takes the name “Braund, Mr. Owen Harris” and finds a pattern of the kind A-Za-z with a dot at the end. When this code is run, it’ll take out “Mr.” because it fits that pattern. If it was written as “mr” then the code would miss the title and ignore the name. It’s great, I’ll definitely be using the str.extract feature again.

5. Cleaning the data

A lot of data is bad. Data can regularly contain missing values, mistakes or simply be remarkably unhelpful for our goals. I’ve been told that this is large part of the workflow when trying to solve problems that require prediction.

We can get this information pretty quickly:

new_combined.info() #This tells us all the non-null values in the data set
new_combined.isna().sum() #This tells us which rows have null values (it's quicker then the first method)

In the titanic data set, we have loads of missing data in the “age” column and a small amount in the “embarked” column.

For the “age” section, I followed the advice from the tutorial linked above and guessed the ages based on their passenger class, and sex.

For the “embarked” section, because there were so few missing values, I filled them in using the most common location someone embarked on.

As you can see, cleaning data requires some assumptions to be made and can utilise different techniques. It is definitely something to keep in mind as datasets get bigger and messier. The dataset I’m working with is actually pretty good which is likely a luxury.

It isn’t sexy but important. I suppose that’s the case with many things in life.

5. Converting the data

In order for this information to be useful to an algorithm, we need to make sure that he information we have in our table is numerical.

We can do this by mapping groups of information to numbers. I did this for all features.

It basically follows this format:

for item in new_combined:
    item.Sex = item.Sex.map({"male":0, "female":1}).astype(int)

It is important to note that this only works if all of the info is filled in (which is why the previous step is so important).

For features that have a large number of entries (for example, “age” could potentially have 891 unique values), we can group them together so we have a smaller number of numerical values. This is the same for “fare” and the “title” feature created earlier.

It is basically the same as above but there is one prior step – creating the bands! It is simply using the “pd.cut()” feature. This segments whichever column we specify into the number of bands we want. Then we use those bands and say something like:

“If this passenger is between the age of 0 and 16, we’ll assign them a “1”.”

Our final table will look like this:

Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	Title
0	3	0	1	1	0.0	1	3
1	1	1	2	1	3.0	3	4
1	3	1	1	0	1.0	1	2
1	1	1	2	1	3.0	1	4
0	3	0	2	0	1.0	1	3

Much less interesting to look at but more useful for our next step

6. Testing predictions with the test data

Now we have a table prepared for our predictions, we can select algorithms, fit them to our training data, then make a prediction.

While the previous stages were definitely frustrating to wrap my head around, this section certainly exposed just how much more there is to learn! Exciting but somewhat demoralising.

There are multiple models you can use to create predictions and there are also multiple ways to test whether what you have done is accurate.

So again, this is not a tutorial. Just an expose of my poor ability.

Funnily enough, I also think this is where it went wrong. My predictions don’t really make any sense.

To set the scene – we have:

A table of features we’ll use to make a prediction (the above table) = X
A prediction target (the “survived” column) = y

We can split our data into 4 sections and it looks like so:

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

This splits our data into the four variables I’ve specified. “Random_state = 0” just means we get the same data split every time the script is run (so randomised data splits is false).

Now we can define our models. I picked a variety of different models to see what results I would get and will hopefully be able to explain the models in another post. However, a detailed understanding of them isn’t necessary at the moment.

I used two linear models and four non-linear models. The most accurate model I used was “SVC” or Support Vector Classification.

SVM = SVC(gamma='auto')
 #Defines the model

SVM.fit(train_X.drop(["Survived"],axis=1), train_y) #Allows the model to "learn" from the data we have provided

Y_prediction = SVM.predict(test_X.drop(["PassengerId"], axis=1)) #predicts the values that should be in the "survived" column 

acc_log = round(SVM.score(train_X.drop(["Survived"],axis=1), train_y) * 100, 2)
 # Returns the mean accuracy based on the labels provided

acc_log # returns the accuracy as a percentage

My final result was 83.7% accuracy!

My first attempt led me to a 99.7% accuracy – Ain’t no way! And the kicker? It predicted everyone would die!

I did this entire project for this comic.

At this point, my brain rightfully died and I submitted my prediction to the Kaggle competition with it being better than 77% of other users. So there is much room for improvement.

8. Final thoughts

This is a beginner problem designed to help people get used to the basics of machine learning so the dataset is better than you’d usually get in the real world.

As I was working through this, I noticed that there are a lot of decisions we can make when creating a prediction. It sounds obvious but it’s important. This is where normal cognitive biases creep in which can go unnoticed – especially when the information we’re working with is far more complex and less complete.

For example, if any of the features were less complete, our decisions on how to fill them in would make a greater impact on our decisions. The algorithms we choose are never a one size fits all solution (which is why we often test many).

I’ll publish my code on my GitHub page when I’ve cleaned it up slightly and removed the swear words.

I’ve probably made a really dumb mistake somewhere so if you feel like looking at the code, please do let me know what that might be…

And with that, I bring it to an end.

There will be much more to improve and learn but I’m glad I’ve given this a shot.

Twitter @ImprovingSlowly

Recent Data Science Somedays posts

How to copy the entire internet | Data Science Somedays

My honourable guests, thank you for joining me today to learn how to copy the entire internet and store it in a less efficient format.

A recent project of mine is working on the Police Rewired Hackathon which asks us to think of ways to address hate speech online. Along with three colleagues, we started hacking the internet to put an end to hate speech online.

~~We won the Hackathon and all of us were given ownership of Google as a reward. Google is ours.~~

~~Thank you for reading and accepting your new digital overlords.~~

Our idea is simple and I will explain it without code because my code is often terrible and I’ve been saved by Sam, our python whisperer, on many occasions.

Select a few twitter users (UCL academics in this case)
Take their statuses and replies to those statuses
Analyse the replies and classify them as hate speech, offensive speech, or neither
Visualise our results and see if there are any trends.

In this post, I will only go through the first two.

Taking information from twitter

This is the part of the hackathon I’ve been most involved in because I’ve never created a Twitter scraper before (a program that takes information from Twitter and stores it in a spreadsheet or database). It was a good chance to learn.

For the next part to make sense, here’s a very small background of what a “tweet” is.

It is a message/status on Twitter. With limited amounts of text – you can also attach images.
These tweets contain a lot of information which can be used to form all sorts of analysis on. For example, a single tweet contains:

Text
Coordinates of where it was posted (if geolocation is enabled)
the platform it came from (“Twitter for iPhone”)
Likes (and who liked them)
Retweets (and who did this)
Time it was posted

And so on. With thousands of tweets, you can extract a number of potential trends and this is what we are trying to do. Does hate speech come from a specific area in the world?

OK, now how do we get this information?

There are two main ways to do this. The first is by using the Twitter Application Programming Interface (API). In short, Twitter has created this book of code which people like me can interact with with my code and it’ll give me information. For example, every tweet as a “status ID” that I can use to differentiate between tweets.

All you need to do is apply for developer status and you’ll be given authentication keys. There is a large limitation though – it’s owned by Twitter and Twitter, like most private companies, value making money.

There is a free developer status but that only allows for a small sample of tweets to be collected up to 7 days in the past. Anything beyond that, I’ll receive no information. I also can’t interact with the API too often before it tells me to shut up.

Collecting thousands of tweets at a decent rate would cost a lot of money (which people like myself… and most academics, cannot afford).

Fine.

Programmers are quite persistent. There are helpful Python modules (a bunch of code that helps you write other code) such as Twint.

Twint is a wonderfully comprehensive module that allows for significant historical analysis of Twitter. It uses a lot of the information that Twitter provides, does what the API does but without the artificial limitations from Twitter. However, it is fickle – for an entire month it was broken because twitter changed a URL.

Not sustainable.

Because I don’t want to incriminate myself, I will persist with the idea that I used the Twitter API.

How does it work?

Ok, I said no code but I lied. I didn’t know how else to explain it.

for user in users:
    tweets_pulled = dict()
    replies=[]
    for user_tweets in tweepy.Cursor(api.user_timeline,screen_name=user).items(20): 
        for tweet in tweepy.Cursor(api.search,q='to:'+user, result_type='recent').items(100): # process up to 100 replies (limit per hour)
            if hasattr(tweet, 'in_reply_to_status_id_str'):
                if (tweet.in_reply_to_status_id_str==user_tweets.id_str):
                    replies.append(tweet)

I’ve removed some stuff to make it slightly easier to read. However, it is a simple “for loop”. This takes a user (“ImprovingSlowly”) and takes 20 tweets from their timeline.

After it has a list of these tweets, it searches twitter for “ImprovingSlowly” and adds to a list whether the tweets found were replies to any statuses.

Do that for 50 users with many tweets each, you’ll find yourself with a nice number of tweets.

If we ignore the hundred errors I received, multiple expletives at 11pm, and the three times I slammed my computer shut because life is meaningless, the code was pretty simple all things considered. It helped us on our way to addressing the problem of hate speech on Twitter.

Limitations

So there are many limitations to this approach. Here are some of the biggest:

With hundreds of thousands of tweets, this is slow. Especially with the limits placed on us by Twitter, it can take hours to barely scratch the surface
You have to “catch” the hate speech. If hate speech is caught and deleted before I run the code, I have no evidence it ever existed.
…We didn’t find much hate speech. Of course this is good. But a thousand “lol” replies doesn’t really do much for a hackathon on hate speech.

Then there’s the bloody idea of “what even is hate speech?”

I’m not answering that in this blog post. I probably never will.

Conclusion

Don’t be mean to people on Twitter.

I don’t know who you are. I don’t know what you want. If you are looking for retweets, I can tell you I don’t have any to give, but what I do have are a very particular set of skills.

Skills I have acquired over a very short Hackathon.

Skills that make me a nightmare for people like you.

If you stop spreading hate, that’ll be the end of it. I will not look for you, I will not pursue you, but if you don’t, I will look for you, I will find you and I will visualise your hate speech on a Tableau graph.

What I’m currently learning in Data Science | Data Science Somedays

Black and white keyboard with red space invader icon

It is 26 September as I write this meaning that I’m on day 26 of #66daysofdata.

If this is unfamiliar to you, it’s a small journey started by a Data Scientist named Ken Jee. He decided to “restart” his data science journey is invited us all to come along for the ride.

I’m not a data scientist, I’ve just always found the young field interesting. I thought, for this instance of Data Science Somedays, I’ll go through some of the things I’ve learned (in non-technical detail).

Data Ethics

I’m starting with this because I actually think it’s one of the most important, yet overlooked parts of Data Science. Just because you can do something, doesn’t mean you should. Not everything is good simply because it can be completed with an algorithm.

One of the problems with Data Science, at least in the commercial sphere, is that there’s a lot of value in having plenty of data. Sometimes, this value is taken as a priority versus privacy. In addition, many adversaries understand the value of data and as a result, aim to muddy the waters with large disinformation campaigns or steal personal data. What does the average citizen do in this scenario?

Where am I learning this? Fast.ai’s Pratical Data Ethics course.

Coding

How do I even start?

Quite easily because I’m not that good at programming so I haven’t learned all that much. Some of the main things that come to mind are:

Object Oriented Programming (this took me forever to wrap my head around… it’s still difficult).
Python decorators
Functions

All of this stuff has helped me create:

None of them are impressive. But they exist and I was really happy when I fixed my bugs (if there are more, don’t tell me).

Where am I learning this? 2020 Complete Python Bootcamp: From Zero to Hero in Python.

(I said earlier I haven’t learned much – that’s just me being self-deprecating. It’s a good course – I’m just not good at programming… yet.

I also bought this for £12. Udemy is on sale all the time (literally))

Data visualisation and predictions

Pandas

After a while, I wanted to direct my coding practice to more data work rather than gaining a general understanding of Python.

To do this, I started learning Pandas which is a library (a bunch of code that helps you quickly do other things), that focuses on data manipulation. In short, I can now use Excel files with python. It included things such as:

How to rename columns
How to find averages, reorganise information, and then create a new table
How to answer basic data analysis questions

Pandas is definitely more powerful than the minor things I mentioned above. It’s still quite difficult to remember how to use all of the syntax so I still have to Google a lot of basic information but I’ll get there.

Where am I learning this? Kaggle – Pandas

Bokeh and Seaborn

When I could mess around with excel files and data sets, I took my talents to data visualisation.

Data visualisation will always be important because looking at tables are 1) boring, 2) slow, and 3) boring. How could I make my data sets at least look interesting?

Seaborn is another library that makes data visualisation much simpler (e.g. “creating a bar chart in one line of code” simpler).

Bokeh is another library that seems to be slightly more powerful in the sense that I can then make my visualisations interactive which is helpful. Especially when you have a lot of information to display at once.

I knew that going through tutorials will have their limit as my hand is always being held so I found a data set on ramen and created Kaggle notebooks. My aim was to practice and show others what my thought process was.

Where am I learning this? Seaborn | Bokeh

Machine learning

This is my most recent venture. How can I begin to make predictions using code, computers and coffee?

So for all of the above, I still find quite difficult and there will be a little while until I can say “I know Python” but this topic seemed like the one with the biggest black box.

If I say

filepath = “hello.csv”

“pandas.read_csv(filepath)”

I understand that I’m taking a function from the Pandas library, and that function will allow me to interact with the .csv file I’ve called.

If I say sklearn.predict(X_new_data) – honestly what is even happening? Half the time, I feel like it’s just luck that I get a good outcome.

Where am I learning this? Kaggle – Intro to machine learning

What is next?

I’m going to continue learning about data manipulation with Pandas and Bokeh as those were the modules I found the most interesting to learn about. However, that could very easily change.

My approach to learning all of this is to go into practice as soon as I can even if it’s a bit scary. It exposes my mistakes and reminds me that working through tutorials often leaves me feeling as though I’ve learned more than I have.

There’s also a second problem – I’m not a Computer Science student so I don’t have the benefit of learning the theory behind all of this stuff. Part of me wants to dive in, the other part is asking that I stay on course and keep learning the practical work so I can utilise it in my work.

Quite frequently, I get frustrated by not understanding and remembering what I’m learning “straight away”. However, this stuff isn’t easy by any stretch of the imagination. So it might take some time.

And that’s alright. Because we’re improving slowly.

Let’s write an email | Data Science Some Days

Ladies and gentlemen, this has been a long time coming.

So, I’ve finally started programming. Technically, I started months ago but I’ve made such a piss-poor effort at being consistent, that I’ve done next to nothing.

I was growing frustrated – often having ~~nightmares~~ asking the question – will I ever be able to say “Hello World”?

Evidently, simply thinking about it would never work. I can buy as many Udemy courses as I want – that won’t turn me into a data scientist. My wonderful solution to this is to go straight into a project and learn as I go. I’m learning python 3.6. But first…

print("Hello World")

I have joined the elites.

Ok, the project goes as follows:

I need to send emails to specific groups of people a week before the event starts. The email should also include attachments specific to the person I’m sending the email to and it will have HTML elements to it.

To break it down…

I need to send an email
I need to send a HTML email
I need to send a HTML email with attachments
I need to send a HTML email with attachments to certain people

There’s more but I’ve only managed the first two so far. This programming stuff is difficult and the only reason why I’m not computer illiterate is because I was born in the 90s.

I started off by opening these tutorials:

They’re both great and use slightly different methods to achieve the same result. Maybe you’ll notice that my final solution ends up being a ~~desperate cry for help~~ combination of them both.

LET US BEGIN.

Here is the first iteration of the code:

import smtplib #sets up simple mail transer protocol

smtpObj = smtplib.SMTP('smtp-mail.outlook.com', 587) type(smtpObj) #Connects to the outlok SMTP server

smtpObj.ehlo() #Says "hello" to the server

smtpObj.starttls() #Puts SMTP connection in TLS mode. #I didn't get any confirmation when I ran the program though...

smtpObj.login('email1@email.com', input("Please enter password: ") #Calls an argument to log into the server and input password.

smtpObj.sendmail('email1@email.com', 'email2@gmail.com', 'Subject: Hello mate \nLet\'s hope you get this mail') #Email it's coming from, email it's going to, the message

smtpObj.quit() #ends the session

print('Session ended') #Tells me in the terminal it's now complete

Boom, pretty simple right? I mainly took everything from Automate the Boring stuff and just swapped in my details. Well, of course not. I kept on getting an error – nothing was happening.

Well – I was somehow using the WRONG EMAIL. FUCK. It took me an hour to realise that.

Next step… let’s send an email with bold and italics.

This was frustrating because nothing worked. All it really requires is for you to put in the message in HTML format. Because I’m not learning HTML, I decided to just use this nifty HTML converter to make this part less painful.

Here are my errors…

#regularly get "Syntax error" with smtpObj.sendmail - I was missing a fucking bracket

This literally made me to go bed angry.

'''everything in the HTML goes into the subject line - smtpObj.sendmail('email@email', 'email2@gmail.com', "Subject: " f"Hello mate \n {html}")'''

This was a surprise but I figured out to stop it…

'''Now nothing shows up in the body of the message: smtpObj.sendmail('email1@.ac.uk', 'email2@gmail.com', f"Subject: \n Hello mate {html}") Solved by putting {html} next to \n'''

…then it somehow got worse…

'''Now... the email doesn't actually show up in html format smtpObj.sendmail('email1@.ac.uk', 'email2@gmail.com', f"Subject: Hello mate\n{html}")'''

…and even worse.

At this point, I changed tactic and, in the process, the entirety of my code. I won’t show you everything here otherwise this post will look too technical to those who have no experience with coding but I’ve uploaded my progress so far onto Github.

To summarise, rather than trying to send HTML with the technique I used earlier, I used a module that was essentially created to make sending emails with python much easier (email.mime). This essentially means, it contains prewritten code that you can then use to create other programs.

But… I was successful in sending myself a HTML email. Now the next step is to add a bloody attachment without putting my head through my computer.

Dear reader, you may be wondering, “why is he putting himself through so much suffering? He sounds incredibly angry.

I’m not, I promise. This has actually been the most fun I’ve had in my free time in a while. It was a good challenge and I could sense myself improving after every mistake.

Granted, I probably should have just completed an online course or something before trying to jump into this project but that would have been less fun.

Onto the next one…

Thanks for reading!

Tag: Data Science

Start more projects, please | Data Science Some days

Taste the food while you’re cooking | Data Science Some Days

Cross-Validation

Conclusion

Further reading

I’m lost and can’t get out of this pipeline | Data Science Some Days

Why pipelines?

The data we have to work with

Creating a pipeline

Conclusion

To survive the titanic, become a 50 year old woman | Data Science Somedays

How to copy the entire internet | Data Science Somedays

What I’m currently learning in Data Science | Data Science Somedays

Let’s write an email | Data Science Some Days

If you liked this post, share it with others!

Cross-Validation

Conclusion

Further reading

If you liked this post, share it with others!

Why pipelines?

The data we have to work with

Creating a pipeline

Conclusion

If you liked this post, share it with others!

If you liked this post, share it with others!

If you liked this post, share it with others!

If you liked this post, share it with others!

If you liked this post, share it with others!