From data mining to data magic

What’s the biggest difference between doing data analysis for a client and doing it for yourself? The data part.

TL;DR: Open source data (somewhere in the middle of the text) + Excel (somewhere on your computer).

TL; did read, because it might give you some ideas

Of course, sometimes your job requires data mining as well, but more often the data is already there. Probably not in good shape: there might be untrimmed whitespace, email addresses in the postal address column, typos, messed-up blends, different data sources with no keys to join on, open and closed dates mixed up in a way that leads to unwanted predictive analytics, too many nulls, not enough granularity but plenty of determination to drill down, … And even if it does get clean at some point, that won’t happen before you are supposed to start your analysis. It seldom does. However, you’re usually not the data owner. Nor the steward. You’re just the data servant. (And the cook – if we keep the high-class hierarchy – who cooks a chart and serves it as hot as possible. But not before tasting testing.)

All-in-one role

However, when you do data analysis for your own pleasure or enlightenment, you play all the parts. First of all, you’re your own client – you need to know the requirements, preferably in a form you also understand. The good thing is, you’re agile, so when the client changes the brief, you quickly adapt. Thanks to your small team, your stand-ups can be kept short and sweet and done while waiting for a bus.

Briefing yourself

My mind works somewhat like this: “Hmm, I’m wondering how much cocaine costs in Sydney. Where does it even come from? And how much does it cost there? Are those people able to buy some? I guess those workers are paid something close to a minimum wage in that country…”* or simply “Whoa, this Raw Data podcast on chocolate and traceability is fascinating. Where can I get some data on child labour?” (UNICEF is the answer.) These are my client-analyst internal talks.

*Just to be clear, I don’t do cocaine, and even though I’m not your mother and can’t tell you what to do with your life, neither should you. I’m just a very curious person interested in a broad spectrum of topics.

Data magic

As the new Snow White story in Fairy Tales for Millennials says: “And besides being cleaner [than coal], data mining paid a whole lot more.”*

Get your (open) data. First, DuckDuckGo it (or Google it). You never know what the mighty SEO finds for you. Then go to open data sites and sites relevant to your topic. Want to do dirty analytics? Try PornHub’s data. Want to do something even dirtier? How about Yahoo Finance or your local government?

* Unless you live in Australia. (ABS stands for Australian Bureau of Statistics. They have great data, including censuses.)

Your own data
– I have an Excel sheet that I’ve been using as my own bullet journal for the past two years. I make notes on my mood, travels, reading, gym attendance and period.
– Bank statements. Your own.
– Your website traffic.
– Fitbit, Apple Watch, fitness & health apps (Want to know how to get data from your Apple Health App? Click here!)
Do I have to mention you should always be careful with any sensitive data? I don’t, right?

Other people’s data
Kaggle
Wikipedia (Thinking about using Power BI? Good idea! Here’s why. And here’s why I stick with Tableau: iSheep)
Censuses & government data
UNICEF
NASA
Global Open Data Index
ICSU World Data System
Nature (scientific data)
Open Science Data Cloud
Center for Open Science
Google Trends
Yahoo Finance
Twitter API (APIs are great in general)
Gutenberg (books are great for text analysis)
PornHub Insights
Even more and more data sources mentioned there.
You can also scrape some data. Respect privacy and copyright, though. (Want to scrape someone’s Instagram? Click here! Don’t use it for creepy stuff, though, please.)

Let’s be completely honest with each other. This part won’t be easy. There’s only a very slight chance you’ll find a data set that ticks all your boxes. It’s similar to your soul mate chase…* Instead, you can do magic. I mean, not in real life, and if you can, please tell me how. Data magic. Sounds much better than hours of manual labour, doesn’t it? Some would call it blending, enriching or enhancing data, but nothing beats magic.

Remember the cocaine story? Good, means you probably also remember the 90s. Anyway, the storyline leads you from cocaine price to consumption, origin countries and minimum wage. Because as you can see, I also decided to compare the price of cocaine to beer and bottled water price. I had to find data for six different things.

So, I did. As cocaine was my main interest and I was lucky enough to find some (a bit outdated) data, all my other data searches focused only on the countries I had data for in the first place. Some data were easy to copy-paste into my elegant Excel spreadsheet, some required typing. I’m good at typing.** And when I’m eager to get some knowledge, I’m more than happy to work for it. Stop rolling your eyes, it might happen to you, too.
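If you would rather let Python do the matching than your fingers, here is a minimal sketch of that blending step. Everything in it is invented for illustration: the country names, the source names and the numbers are placeholders, not the data I actually used, and `blend` and `ratio` are hypothetical helper names. The only idea it carries over from the story is that the main data set decides which countries matter, and every other source gets looked up against it.

```python
# Placeholder data: countries and numbers are made up for illustration only.
main_prices = {"Country A": 100.0, "Country B": 250.0, "Country C": 80.0}
beer_prices = {"Country A": 4.0, "Country B": 5.0, "Country D": 3.0}
water_prices = {"Country A": 1.0, "Country C": 2.0}


def blend(main, **extras):
    """Left-join extra sources onto the main one, keyed by country.

    Countries missing from an extra source get None, so the gaps stay visible.
    Countries that appear only in an extra source are ignored.
    """
    rows = []
    for country, main_value in main.items():
        row = {"country": country, "main": main_value}
        for name, source in extras.items():
            row[name] = source.get(country)
        rows.append(row)
    return rows


def ratio(row, numerator, denominator):
    """Return numerator / denominator, or None if a value is missing or zero."""
    top, bottom = row.get(numerator), row.get(denominator)
    if top is None or not bottom:
        return None
    return top / bottom


blended = blend(main_prices, beer=beer_prices, water=water_prices)
for row in blended:
    # How many beers one unit of the main item would buy, where both are known.
    print(row["country"], ratio(row, "main", "beer"))
```

Keeping the gaps as `None` instead of dropping the country is a deliberate choice here: it shows you exactly where the typing still has to happen.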

*You might happen to stumble upon a fantastic data set you want to visualize, but that’s another and much shorter story.

**I have this amazing story – I love cycling. I hate helmets. Australia is one of the few countries where wearing a helmet is mandatory, and the penalties are pretty harsh. I wanted to know – does wearing a helmet save lives? How much is the penalty, and how much was it in the past? Do Australians even cycle? I got data for the penalties from Revenue NSW in a pretty pivot table. The cycling Australians data is from censuses. I GOT DATA FOR THE PAST 40 YEARS. How good is that? Well, they didn’t use computers back then. They used typewriters. But I’m not complaining, the data is scanned and available online. The least I could do was to retype that data into an Excel sheet. The feeling? Better than the one after drinking some oat milk chai latte. Delicious, BTW.

***Another data-mining story of mine orbits around the Sun. As I was preparing my background images topic for a Tableau Train the Trainer training, I decided to do some data magic. I started looking for planets’ diameters (yes, all the solar system objects are true to their miniaturized size), weight, density, temperatures, fun facts and other stuff. One would say that all of this must be Wikipedia’s domain, right? I wish. NASA has some awesun stuff on their site. But Wikipedia helped as well. Always does. At least a tiny bit.

Source: NASA

Sketch it

You get your data, yay! From here, it’s all about your data analytics skills.
Now it’s time for a pen and paper. Don’t kid yourself, it might be old-school, but it’s definitely more efficient to sketch things by hand than to create dummy data to make a mock-up in Tableau. Drawing, no matter how ugly it is, is an important part of the process. Do you have a big window or a mirror at home? Even better. It’s more difficult to lose one of those.

Prep it

Keep it in Excel. Or use Alteryx, Tableau Prep, Python, SQL, R or anything else you know or have access to.
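If Python is the tool you reach for, a first prep pass could look something like the sketch below. It is only an illustration under a few assumptions of mine: the sheet has been exported to CSV and loaded as a list of dictionaries (for example with `csv.DictReader`), the column names `postal_address`, `opened` and `closed` are hypothetical, and dates are written as YYYY-MM-DD. It trims whitespace, turns empty cells into proper nulls and flags two of the classics: an email sitting in the address column and a closed date that comes before the open date.

```python
import re
from datetime import date

# Deliberately loose pattern: good enough to spot an email in the wrong column.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def clean_row(row):
    """Trim whitespace, turn empty strings into None, and list problems found."""
    cleaned = {}
    for key, value in row.items():
        value = value.strip() if isinstance(value, str) else value
        cleaned[key] = value if value != "" else None

    problems = []
    address = cleaned.get("postal_address")
    if address and EMAIL.fullmatch(address):
        problems.append("email in postal_address")

    # Assumes ISO dates (YYYY-MM-DD); anything else raises ValueError.
    opened, closed = cleaned.get("opened"), cleaned.get("closed")
    if opened and closed:
        if date.fromisoformat(closed) < date.fromisoformat(opened):
            problems.append("closed before opened")

    missing = [key for key, value in cleaned.items() if value is None]
    if missing:
        problems.append("missing: " + ", ".join(missing))
    return cleaned, problems


# Made-up rows, just to show the shape of the input.
rows = [
    {"name": "  Ada ", "postal_address": "ada@example.com",
     "opened": "2020-03-01", "closed": "2020-02-01"},
    {"name": "Bob", "postal_address": "1 Example Street",
     "opened": "2020-01-05", "closed": ""},
]
for row in rows:
    cleaned, problems = clean_row(row)
    print(cleaned["name"], problems)
```

The function only reports problems, it doesn’t fix them. Deciding what a misplaced email or a backwards date should become is still your call.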

Make it happen

That’s it. You have your data, it’s your turn to do something about it. Use Tableau, Power BI, Python, R, Excel. Whatever you like. Data is for using, so use it. Make it pretty, make it meaningful. Learn by doing. And learn by playing. Hey, did you know that pictorial textbooks written in native languages instead of Latin, teaching based on gradual development from simple to more comprehensive concepts, lifelong learning with a focus on logical thinking over memorization, equal opportunity for impoverished children and women, and universal and practical instruction were introduced by a Czech philosopher and pedagogue, Jan Amos Komensky?

One last thing to keep in mind

When working with data, be responsible. Be aware that your data set might be incomplete and might give you an incomplete image of the situation. Especially when it comes to social data: unless you’re an expert in the given field or industry, or a researcher, you might make wrong assumptions even when they seem supported by data. For example, I know my Women in Movies analysis is based only on a sample of the 100 most-rated movies on IMDb. I don’t know anything about the reviewers’ demographics and their life experience. I have not analyzed all the movies ever made. That doesn’t make my analysis any less relevant, as long as I make sure the audience gets this message along with the one about how underrepresented women are in the film industry.
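One way to keep that caveat from getting lost is to never let a number travel without its sample size and source. Here is a small sketch of the idea. The function names, the `female_lead` field and the four sample rows are all hypothetical placeholders and have nothing to do with my actual Women in Movies data.

```python
def share_with_context(items, predicate, source, note):
    """Return a share together with the sample size and source it rests on."""
    total = len(items)
    if total == 0:
        raise ValueError("empty sample: nothing to summarise")
    hits = sum(1 for item in items if predicate(item))
    return {
        "share": hits / total,
        "hits": hits,
        "sample_size": total,
        "source": source,
        "note": note,
    }


def caption(result):
    """Build a chart caption that carries the caveat along with the number."""
    return (
        f"{result['hits']} of {result['sample_size']} ({result['share']:.0%}). "
        f"Source: {result['source']}. {result['note']}"
    )


# Placeholder sample, invented purely for illustration.
sample = [
    {"title": "Movie 1", "female_lead": True},
    {"title": "Movie 2", "female_lead": False},
    {"title": "Movie 3", "female_lead": False},
    {"title": "Movie 4", "female_lead": False},
]
result = share_with_context(
    sample,
    lambda movie: movie["female_lead"],
    source="placeholder sample",
    note="Not all movies ever made.",
)
print(caption(result))
```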

One more thing

Be honest and always state your source(s).

And one last thing, I promise

Have fun!
