It is 26 September as I write this, meaning that I’m on day 26 of #66daysofdata.
I’m not a data scientist; I’ve just always found the young field interesting. I thought that, for this instance of Data Science Somedays, I’d go through some of the things I’ve learned (in non-technical detail).
Data Ethics
I’m starting with this because I actually think it’s one of the most important, yet most overlooked, parts of Data Science. Just because you can do something doesn’t mean you should. Not everything is good simply because it can be completed with an algorithm.
One of the problems with Data Science, at least in the commercial sphere, is that there’s a lot of value in having plenty of data. Sometimes, this value is prioritised over privacy. In addition, many adversaries understand the value of data and, as a result, aim to muddy the waters with large disinformation campaigns or steal personal data. What does the average citizen do in this scenario?
Where am I learning this? Fast.ai’s Practical Data Ethics course.
Coding
How do I even start?
Quite easily, because I’m not that good at programming, so I haven’t learned all that much. Some of the main things that come to mind are:
- Object Oriented Programming (this took me forever to wrap my head around… it’s still difficult).
- Python decorators
- Functions
All of this stuff has helped me create:
None of them are impressive. But they exist and I was really happy when I fixed my bugs (if there are more, don’t tell me).
Where am I learning this? 2020 Complete Python Bootcamp: From Zero to Hero in Python.
(I said earlier I haven’t learned much – that’s just me being self-deprecating. It’s a good course – I’m just not good at programming… yet.
I also bought this for £12. Udemy is on sale all the time (literally))
Data visualisation and predictions
Pandas
After a while, I wanted to direct my coding practice to more data work rather than gaining a general understanding of Python.
To do this, I started learning Pandas, a library (a bunch of code that helps you quickly do other things) that focuses on data manipulation. In short, I can now work with Excel files in Python. It included things such as:
- How to rename columns
- How to find averages, reorganise information, and then create a new table
- How to answer basic data analysis questions
Pandas is definitely more powerful than the minor things I mentioned above. It’s still quite difficult to remember how to use all of the syntax so I still have to Google a lot of basic information but I’ll get there.
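To make those bullet points a bit more concrete, here is a minimal sketch of what they can look like in code. The table, the column names and the ratings are all made up for illustration; a real data set would normally be loaded from a file instead.

```python
import pandas as pd

# Hypothetical data: a tiny made-up table standing in for a real spreadsheet.
df = pd.DataFrame({
    "Brand": ["A", "A", "B", "B"],
    "Stars": [3.0, 5.0, 4.0, 2.0],
})

# Rename columns.
df = df.rename(columns={"Brand": "brand", "Stars": "stars"})

# Find the average rating per brand, reorder it, and keep it as a new table.
avg_by_brand = (
    df.groupby("brand", as_index=False)["stars"]
    .mean()
    .sort_values("stars", ascending=False)
    .reset_index(drop=True)
)

# Answer a basic question: which brand has the highest average rating?
best_brand = avg_by_brand.loc[0, "brand"]

print(avg_by_brand)
print(best_brand)
```

Chaining the steps like this is only one way to write it; doing each step on its own line and printing the result in between is just as valid while learning.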
Where am I learning this? Kaggle – Pandas
Bokeh and Seaborn
When I could mess around with Excel files and data sets, I took my talents to data visualisation.
Data visualisation will always be important because looking at tables is 1) boring, 2) slow, and 3) boring. How could I make my data sets at least look interesting?
Seaborn is another library that makes data visualisation much simpler (e.g. “creating a bar chart in one line of code” simpler).
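As a rough sketch of that “one line” claim, assuming a small made-up table (the brand and stars columns are hypothetical), the chart itself really is a single call to sns.barplot:

```python
import pandas as pd
import seaborn as sns

# Hypothetical data: made-up ratings, two rows per brand.
df = pd.DataFrame({
    "brand": ["A", "A", "B", "B"],
    "stars": [3.0, 5.0, 4.0, 2.0],
})

# The one line: a bar per brand, with bar height set to the mean rating.
ax = sns.barplot(data=df, x="brand", y="stars")

# Optional extra: a title, set through the Matplotlib axes Seaborn returns.
ax.set_title("Average rating per brand")
```

In a notebook the chart shows up under the cell; in a plain script, something like matplotlib.pyplot.show() is needed to display it.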
Bokeh is another library that seems to be slightly more powerful in the sense that I can then make my visualisations interactive which is helpful. Especially when you have a lot of information to display at once.
I knew that going through tutorials would only take me so far, as my hand is always being held, so I found a data set on ramen and created Kaggle notebooks. My aim was to practise and to show others what my thought process was.
Where am I learning this? Seaborn | Bokeh
Machine learning
This is my most recent venture. How can I begin to make predictions using code, computers and coffee?
I still find all of the above quite difficult, and it will be a little while until I can say “I know Python”, but this topic seemed like the one with the biggest black box.
If I say
    import pandas
    filepath = "hello.csv"
    pandas.read_csv(filepath)
I understand that I’m taking a function from the Pandas library, and that function will allow me to interact with the .csv file I’ve called.
If I say model.predict(X_new_data) with scikit-learn – honestly what is even happening? Half the time, I feel like it’s just luck that I get a good outcome.
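One way to make the box slightly less black is to shrink the problem until the answer can be checked by hand. The sketch below is only an illustration with made-up numbers, and the choice of a decision tree is an assumption: the model first learns from examples with fit, and predict then looks up an answer for new inputs.

```python
from sklearn.tree import DecisionTreeRegressor

# Made-up training data: each target is simply double the input.
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# fit() is where the "learning" happens: the tree works out which
# ranges of inputs should map to which answers.
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)

# predict() sends each new row down the tree and returns the answer
# stored at the leaf it lands in.
X_new_data = [[2], [10]]
predictions = model.predict(X_new_data)
print(predictions)
```

An input of 10 is far outside anything the tree has seen, so it can only give back the answer of the closest range it learned rather than 20. A good-looking prediction is therefore not luck, but it is only as good as the examples the model was trained on.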
Where am I learning this? Kaggle – Intro to machine learning
What is next?
I’m going to continue learning about data manipulation with Pandas and Bokeh as those were the modules I found the most interesting to learn about. However, that could very easily change.
My approach to learning all of this is to go into practice as soon as I can even if it’s a bit scary. It exposes my mistakes and reminds me that working through tutorials often leaves me feeling as though I’ve learned more than I have.
There’s also a second problem – I’m not a Computer Science student so I don’t have the benefit of learning the theory behind all of this stuff. Part of me wants to dive in, the other part is asking that I stay on course and keep learning the practical work so I can utilise it in my work.
Quite frequently, I get frustrated by not understanding and remembering what I’m learning “straight away”. However, this stuff isn’t easy by any stretch of the imagination. So it might take some time.
And that’s alright. Because we’re improving slowly.