Instagram Scraping with Python

While writing an article about Czech political influencers, I came across some pretty interesting numbers. I was honestly surprised that I couldn't find any data analysis covering this, so I decided to do it myself (with lots of help from Jiri, my data guru and coding mentor 🙂).

Getting Data

There's no coherent list of Czech politicians, let alone of politicians with an Instagram account, so I had to compile one manually. I went through several Wikipedia lists of MPs, Senators, Ministers, and popular politicians and political parties in the Czech Republic. (No relevant open data is available.) In total, I ended up with about 350 subjects.

I had to look up the profiles manually because Instagram users don't have to use their real name or job title, and there's nothing like categories either. I may have missed some insignificant profiles, but that shouldn't matter for the purposes of this analysis.

This is the tech part of my analysis; you'll find much more colourful outcomes here.

Instagram API

I hoped to use the Instagram API to get the data I needed. Unfortunately, Instagram disabled this to “improve Instagram users’ privacy and security”.

Therefore, I wrote a Python script to scrape the profiles, using this script as a starting point. You can find my code on my GitHub account.

Data Cleansing

Scraping gave me HTML with a JSON object somewhere in the middle. I used BeautifulSoup to parse the HTML and a Python script to prepare the JSON. I stripped everything else and picked the data I found relevant for my analysis, then saved it into two separate TSV files:

  • profile data (name, username, number of followers, number of following, number of posts)
  • posts info (username, post id, timestamp, caption, number of comments, number of likes)

[Image: Tidy JSON with data for the choco_afro account]
[Image: Dumped profile data]
[Image: Dumped posts data]
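
To illustrate the "JSON somewhere in the middle of HTML" step, here is a minimal sketch. It is not the code from the GitHub repo. It uses Python's built-in html.parser instead of BeautifulSoup to stay dependency-free (with BeautifulSoup, looping over `soup.find_all("script")` would do the same job). The sample page, the marker `window._sharedData =`, and the numbers in it are assumptions made up for the example.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the JSON sits in a <script> tag as a variable assignment.
# The marker name and all values below are made up for illustration.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>window._sharedData = {"user": {"username": "choco_afro", "followers": 1200, "posts": 34}};</script>
</body></html>
"""

MARKER = "window._sharedData ="  # assumed name of the embedded variable


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self.in_script:
            self.scripts[-1] += data


def extract_embedded_json(html, marker=MARKER):
    """Return the JSON object assigned to `marker` inside a <script> tag."""
    collector = ScriptCollector()
    collector.feed(html)
    for text in collector.scripts:
        text = text.strip()
        if text.startswith(marker):
            payload = text[len(marker):].strip().rstrip(";")
            return json.loads(payload)
    raise ValueError("no embedded JSON found")


data = extract_embedded_json(SAMPLE_HTML)
print(data["user"]["username"])  # choco_afro
```

Once the JSON is a regular Python dictionary, picking out the fields for the two TSV files is just a matter of indexing into it.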

Tiny Challenges

The web version of an Instagram profile displays only the 12 newest posts. Even though it'd be great to have all the data, I think this is enough for this case study.

Every post has a UNIX timestamp. Converted naively, it shows up in the viewer's time zone, not the uploader's, so I normalized the time data to UTC to make it more universal.
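
A minimal sketch of the UTC normalization, assuming the timestamp is given in seconds since the epoch. The helper name `to_utc_iso` is made up for this example and is not taken from the repo.

```python
from datetime import datetime, timezone


def to_utc_iso(unix_ts):
    """Convert a UNIX timestamp (seconds since the epoch) to an ISO 8601 UTC string."""
    # Passing tz explicitly avoids the default conversion to the machine's local time.
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).isoformat()


print(to_utc_iso(1500000000))  # 2017-07-14T02:40:00+00:00
```

Without the `tz` argument, `datetime.fromtimestamp` returns local time, which is how the same post can end up with different times on different machines.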

How to use this scraper

To start, you need a CSV file with the usernames you want to scrape: a single column containing just the usernames. A “username” header row is OK.
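
Reading such a file could look roughly like this. It's a sketch under assumptions, not the repo's code: the function name `read_usernames` and the second account name are hypothetical, and the header is skipped only when the first non-empty value is literally "username".

```python
import csv
import io


def read_usernames(lines):
    """Return usernames from a single-column CSV, skipping an optional header."""
    names = [row[0].strip() for row in csv.reader(lines) if row and row[0].strip()]
    if names and names[0].lower() == "username":
        names = names[1:]
    return names


# In-memory stand-in for a real file; "another_account" is a made-up name.
sample = io.StringIO("username\nchoco_afro\nanother_account\n")
usernames = read_usernames(sample)
print(usernames)  # ['choco_afro', 'another_account']
```

With a real file, the same function works on an open file handle, for example `with open("test_data.csv", newline="") as f: read_usernames(f)`.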

Come up with a name for a TSV data file for scraped profile data and another one for scraped posts data.

You’ll use those three file names as parameters when running the script.

This is what you’ll find on GitHub:

run.py test_data.csv dumped_profile_data.tsv dumped_posts_data.tsv
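
For readers curious how a script might accept those three file names and produce tab-separated output, here is a minimal sketch. It is an illustration only, not the actual run.py: the argument names, the column names, and the sample row are all assumptions.

```python
import argparse
import csv
import io

# Assumed column names, loosely following the profile fields listed earlier.
PROFILE_FIELDS = ["name", "username", "followers", "following", "posts"]


def parse_args(argv=None):
    """Parse the three positional file names (argument names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Scrape profiles listed in a CSV file.")
    parser.add_argument("usernames_csv", help="single-column CSV with usernames")
    parser.add_argument("profiles_tsv", help="output TSV for profile data")
    parser.add_argument("posts_tsv", help="output TSV for posts data")
    return parser.parse_args(argv)


def write_tsv(handle, fieldnames, rows):
    """Write dictionaries as tab-separated rows with a header line."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


args = parse_args(["test_data.csv", "dumped_profile_data.tsv", "dumped_posts_data.tsv"])

# Made-up sample row, written to memory instead of args.profiles_tsv.
buffer = io.StringIO()
write_tsv(buffer, PROFILE_FIELDS, [
    {"name": "Example Name", "username": "choco_afro",
     "followers": 1200, "following": 300, "posts": 34},
])
print(buffer.getvalue())
```

To write to disk instead, open the target with `open(args.profiles_tsv, "w", newline="", encoding="utf-8")` and pass that handle to `write_tsv`. The csv module quotes any caption that contains a tab or a line break, so each row stays parseable.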
