How to copy the entire internet | Data Science Somedays

Photo by Markus Winkler on

My honourable guests, thank you for joining me today to learn how to copy the entire internet and store it in a less efficient format.

A recent project of mine is the Police Rewired Hackathon, which asks us to think of ways to address hate speech online. Along with three colleagues, I started hacking the internet to put an end to hate speech online.

We won the Hackathon and all of us were given ownership of Google as a reward. Google is ours.

Thank you for reading and accepting your new digital overlords.

Our idea is simple and I will explain it without code because my code is often terrible and I’ve been saved by Sam, our Python whisperer, on many occasions.

  1. Select a few Twitter users (UCL academics in this case)
  2. Take their statuses and replies to those statuses
  3. Analyse the replies and classify them as hate speech, offensive speech, or neither
  4. Visualise our results and see if there are any trends.

In this post, I will only go through the first two.

Taking information from Twitter

This is the part of the hackathon I’ve been most involved in because I’ve never created a Twitter scraper before (a program that takes information from Twitter and stores it in a spreadsheet or database). It was a good chance to learn.

For the next part to make sense, here’s a very brief background on what a “tweet” is.

It is a message/status on Twitter with a limited amount of text. You can also attach images.
These tweets contain a lot of information which can be used for all sorts of analysis. For example, a single tweet contains:

  1. Text
  2. Coordinates of where it was posted (if geolocation is enabled)
  3. The platform it came from (“Twitter for iPhone”)
  4. Likes (and who liked them)
  5. Retweets (and who did this)
  6. Time it was posted

And so on. With thousands of tweets, you can extract a number of potential trends and this is what we are trying to do. Does hate speech come from a specific area in the world?
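
To make that concrete, here is a minimal sketch of a tweet as a Python data structure, using only the fields listed above. The `Tweet` class, the `count_by_source` helper and the sample values are all made up for illustration; real tweets from the API carry many more fields under their own names.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass
class Tweet:
    """Hypothetical stand-in for a tweet, holding only the fields listed above."""
    text: str
    source: str                      # platform, e.g. "Twitter for iPhone"
    created_at: datetime             # time it was posted
    likes: int = 0
    retweets: int = 0
    # Only present if the author enabled geolocation
    coordinates: Optional[Tuple[float, float]] = None


def count_by_source(tweets):
    """Count how many tweets came from each platform."""
    return Counter(t.source for t in tweets)


# Made-up sample data, purely for illustration
sample = [
    Tweet("lol", "Twitter for iPhone", datetime(2020, 5, 1, 9, 30)),
    Tweet("Great thread", "Twitter Web App", datetime(2020, 5, 1, 10, 0), likes=3),
    Tweet("lol", "Twitter for iPhone", datetime(2020, 5, 2, 23, 0),
          coordinates=(51.52, -0.13)),
]

print(count_by_source(sample))
```

Swap `source` for `coordinates` or `created_at` and the same counting idea is how you would start looking for trends by place or time.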

OK, now how do we get this information?

There are two main ways to do this. The first is by using the Twitter Application Programming Interface (API). In short, Twitter has created this book of code which people like me can interact with using our own code, and it’ll give us information. For example, every tweet has a “status ID” that I can use to differentiate between tweets.

All you need to do is apply for developer status and you’ll be given authentication keys. There is a large limitation though – it’s owned by Twitter and Twitter, like most private companies, values making money.

There is a free developer status but that only allows for a small sample of tweets to be collected, up to 7 days in the past. For anything older than that, I receive no information. I also can’t interact with the API too often before it tells me to shut up.
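
As a small sketch of what that 7-day limit means in practice, the check below decides whether a tweet is still recent enough to be found. It assumes the window is exactly seven days from the moment you run the code, which is a simplification; `is_searchable` and the dates are invented for the example and are not part of any Twitter library.

```python
from datetime import datetime, timedelta, timezone

SEARCH_WINDOW = timedelta(days=7)  # assumed look-back of the free tier


def is_searchable(created_at, now=None):
    """Return True if a tweet posted at `created_at` is still inside the window."""
    now = now or datetime.now(timezone.utc)
    age = now - created_at
    return timedelta(0) <= age <= SEARCH_WINDOW


# Made-up reference time so the example gives the same answer every run
reference = datetime(2020, 5, 10, 12, 0, tzinfo=timezone.utc)
two_days_old = datetime(2020, 5, 8, 12, 0, tzinfo=timezone.utc)
ten_days_old = datetime(2020, 4, 30, 12, 0, tzinfo=timezone.utc)

print(is_searchable(two_days_old, reference))
print(is_searchable(ten_days_old, reference))
```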

Collecting thousands of tweets at a decent rate would cost a lot of money (which people like myself… and most academics, cannot afford).


Programmers are quite persistent, which brings us to the second way: skipping the API altogether. There are helpful Python modules (a bunch of code that helps you write other code) such as Twint.

Twint is a wonderfully comprehensive module that allows for significant historical analysis of Twitter. It uses a lot of the information that Twitter provides and does what the API does, but without the artificial limitations from Twitter. However, it is fickle – for an entire month it was broken because Twitter changed a URL.

Not sustainable.

Because I don’t want to incriminate myself, I will persist with the idea that I used the Twitter API.

How does it work?

Ok, I said no code but I lied. I didn’t know how else to explain it.

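import tweepy  # assumes `api` is an authenticated tweepy.API and `users` is a list of screen names

for user in users:
    tweets_pulled = dict()
    # Take the 20 most recent statuses from this user's timeline
    for user_tweets in tweepy.Cursor(api.user_timeline, screen_name=user).items(20):
        # api.search is an assumption (Tweepy 3.x name; Tweepy 4 calls it api.search_tweets)
        for tweet in tweepy.Cursor(api.search, q='to:'+user, result_type='recent').items(100): # process up to 100 replies (limit per hour)
            # Keep only the tweets that reply to this particular status
            if getattr(tweet, 'in_reply_to_status_id_str', None) == user_tweets.id_str:
                tweets_pulled[tweet.id_str] = tweet.text  # assumed body; the original was cut here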

I’ve removed some stuff to make it slightly easier to read. However, it is a simple “for loop”. This takes a user (“ImprovingSlowly”) and takes 20 tweets from their timeline.

After it has a list of these tweets, it searches Twitter for tweets sent to “ImprovingSlowly” and keeps the ones that are replies to any of those statuses.

Do that for 50 users with many tweets each and you’ll find yourself with a nice number of tweets.
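
The matching step can be tried out without touching Twitter at all. The sketch below uses plain dictionaries as stand-ins for tweets (the keys are borrowed from the attribute names in the loop above, the data is invented) and groups replies under the status they answer. `group_replies` is a hypothetical helper, not something from Tweepy.

```python
from collections import defaultdict


def group_replies(statuses, candidates):
    """Map each status ID to the texts of the candidate tweets replying to it.

    Both arguments are lists of plain dicts standing in for tweets.
    """
    status_ids = {s["id_str"] for s in statuses}
    replies = defaultdict(list)
    for tweet in candidates:
        parent = tweet.get("in_reply_to_status_id_str")
        if parent in status_ids:  # skip mentions that are not replies to these statuses
            replies[parent].append(tweet["text"])
    return dict(replies)


# Invented data: two statuses from one user, four tweets sent to that user
statuses = [
    {"id_str": "1", "text": "New paper out"},
    {"id_str": "2", "text": "Lecture moved to Friday"},
]
candidates = [
    {"id_str": "10", "text": "Congrats!", "in_reply_to_status_id_str": "1"},
    {"id_str": "11", "text": "lol", "in_reply_to_status_id_str": "1"},
    {"id_str": "12", "text": "See you there", "in_reply_to_status_id_str": "2"},
    {"id_str": "13", "text": "Unrelated mention", "in_reply_to_status_id_str": None},
]

print(group_replies(statuses, candidates))
```

Looking the parent ID up in a set means the search results only need to be fetched once per user rather than once per status, which may help when the number of requests is rationed.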

If we ignore the hundred errors I received, multiple expletives at 11pm, and the three times I slammed my computer shut because life is meaningless, the code was pretty simple all things considered. It helped us on our way to addressing the problem of hate speech on Twitter.


So there are many limitations to this approach. Here are some of the biggest:

  1. With hundreds of thousands of tweets, this is slow. Especially with the limits placed on us by Twitter, it can take hours to barely scratch the surface
  2. You have to “catch” the hate speech. If hate speech is caught and deleted before I run the code, I have no evidence it ever existed.
  3. …We didn’t find much hate speech. Of course this is good. But a thousand “lol” replies don’t really do much for a hackathon on hate speech.
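
For a rough feel of the first limitation, here is a back-of-the-envelope sketch. The cap of 100 replies per hour is borrowed from the comment in the loop above and is only an assumption about how the limit plays out in practice; `hours_needed` is a made-up helper.

```python
import math


def hours_needed(total_replies, replies_per_hour=100):
    """Whole hours needed to collect `total_replies` at a fixed hourly cap."""
    return math.ceil(total_replies / replies_per_hour)


for target in (1_000, 100_000):
    print(target, "replies ->", hours_needed(target), "hours")
```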

Then there’s the bloody idea of “what even is hate speech?”

I’m not answering that in this blog post. I probably never will.


Don’t be mean to people on Twitter.

I don’t know who you are. I don’t know what you want. If you are looking for retweets, I can tell you I don’t have any to give, but what I do have are a very particular set of skills.

Skills I have acquired over a very short Hackathon.

Skills that make me a nightmare for people like you.

If you stop spreading hate, that’ll be the end of it. I will not look for you, I will not pursue you, but if you don’t, I will look for you, I will find you and I will visualise your hate speech on a Tableau graph.
