Digital Academy Project

Digital Academy is a three-month hands-on course preparing bunch of girls, moms, and women for a brand-new career in data analysis. We learnt how to get, clean and show all kinds of data using Python, SQL, Power BI and Tableau. This project was a collaboration of my classmate and me.

Czechitas is a non-profit organization based in the Czech Republic that’s oriented on getting more women to tech industry by providing short- and long-term programming workshops.

Authors: Šárka Hubáček & Zdenka Šunková
Mentor: Vojtěch Dostál, Wikimedia Czech Republic

Wikimedia Czech Republic organizes various workshops for both current and future Wikipedia-content contributors. There are no metrics for course impact recognition.

Tools we used

As Wikimedia would like to learn more about courses efficiency, we decided to recognize number of edits done before, during, and after each course as a relevant metric. We had to ask a few more questions that might look too simple or obvious, but as this was the first course-related analytics Wikimedia CZ ever did, even this analysis made sense as a starting point. We decided to do a data analysis on set of workshops for seniors.
Also, even though it’d be nice to incorporate demographics data, Wikimedia doesn’t gather data of this nature.

Query: instructors data
Query: attendees data
Query: courses data
Query: categories data used for a word cloud
Query: filtering attendees and instructors; getting rid of user id 85654 as they’re an instructor falsely labeled as attendee

Getting & Cleansing Data

At first, the extent of data Wikimedia has, was a bit difficult to get a grasp of at the beginning. We had to decide what data we want to work with in our analysis. All the data we needed were stored in MariaDB and easily accessible via SQL queries in Quarry web interface. When I say easily accessible, I mean it included some horrific suffering and very long evenings full of desperation. The beginnings were pretty harsh but when we finally understood the beauty of SQL, we were selecting and joining across databases with pure joy.

We needed some pretty basic information about the courses:
Who were the attendees [user.user_id]
What course did they attend [ep_courses.course_id, ep_courses.course_title]
When did the course take place [ep_courses.course_start, ep_courses.course_end]
Where did the course take place [ep_courses.course_description]
Who was the instructor [ep_courses.course_instructor]
How many edits were made by the attendees [count (revision.rev_id)]
When were the edits made [revision.rev_timestamp]

We also had to filter other workshops- and attendees-related data:
Filter out any bots [user.ug_group]
Distinguish minor and major edits [revision.rev_minor_edit]
Distinguish course instructors and course attendees [up_users_per_course.upc.role]
Select courses for seniors only [ep_course.course_title]
Select articles edits only [page.page_namespace]

The most important data for us were in [users], [pages] and [ep_courses, ep_user_per_course, ep_org] tables.

Our database structure and tables we used in our project.

Tools

This documentation was very helpful for orientation in Wikimedia tables (as seen in a scheme below) and MariaDB.
The biggest chunk of data cleansing was done right away when selecting data in Quarry. We also used Excel when dealing with data prep for word cloud and Tableau for simple derived columns.
We worked with both Power BI and Tableau when making our visualisations.

Source: Wikimedia.org
Failed query: No window functions in this MariaDB version.

How to embed Tableau dashboards into a WordPress post and keep calm? Read on.

Joining in Tableau.
If statements in Tableau.
If statements helps with colouring data…
…as well as with aggregation.
No window function in MariaDB? No worries, try Tableau.
Small changes can lead to big results.

Visualisation

400 000+ registered users in the Czech Republic
175 courses in the Czech Republic
44 courses for seniors
2400 of all courses attendees
300 of senior courses attendees
157 is an average number of edits by seniors
18 is a median of edits by seniors
Source data from between 2014 and June 2018.

How many courses do seniors attend and how many edits they do?
Majority of participants attend only one course, even though there were few of those extra keen – not only attendees but editors as well.

When do they register as a Wiki user?
The largest number of attendees register after the course beginning as it’s a part of the curriculum. Around a quarter registers before the course start; that number involves long-time editors who attend courses to boost up their activity.

What’s the most popular course?
The most popular courses are the ones in Central Municipal Library of Prague (Ústřední knihovna v Praze) and Elpida community centre.
We set success criteria for an active editor to at least 5 edits in total. About 80% of senior courses attendees fulfill it. About 8% of attendees does no edit at all.

The most successful course measured by number of edits
Municipal Library of Prague – Intermediate II (Městská knihovna v Praze – Pokročilí II): 32 444 edits by 15 attendees
Municipal Library of Prague – Intermediate I (Městská knihovna v Praze – Pokročilí I): 24 055 edits by 11 attendees

Activity of attendees
First, we visualized edits per course in time (red for edits during the course, green for before and after edits). As we see, there’s a long continuity in doing edits by attendees of course in Central Municipal Library of Prague and attendees of beginners’ course in Centrum Elpida. The edits are ongoing since 2015 onwards. We deem those courses as the most successful ones.

We created four charts with data aggregated by days / weeks / months / years. The X axis is derived from course and edit dates, and represents relative distance between these. Y displaying edit count is scaled logarithmically. Green-marked are edits done before the course. Red-marked are edits done during the course (graphs for days, weeks and months). Yellow-marked are edits done 6 weeks (in days and weeks graph), 6 months (in months graph) or after the course (years graph). Purple-marked are edits done more than 6 weeks (days / weeks) or 6 months (months) after the course. Even though this might look a bit messy, those are a few of the most useful graphs we made for Wikimedia.

The following graphs are very similar to those you’ve already seen. The tiny difference is the relative starting point. We used FIXED function to set the point zero to the first attended course.

This visualization shows attendance and number of attendees with no edits (orange) and at least one edit (green). It’s possible to filter data by instructor’s name.

Minor vs major edits
Minor edits are significantly more common amongst the senior courses attendees. Minor edits include grammar, spelling or punctuation edits as well as formatting and vandalism editing with no subject changing impact. Major edits are significantly changing or extending the article.

What are the course-wise most active cities?
The most courses took place in Prague, second place goes to Brno.

The highest number of edits were made by attendees of courses that took place in Prague, then in Karlovy Vary.

Top 100 categories
I really wanted to make a word cloud. Because it’s cool and looks nice. There weren’t that many applications, so I picked articles categories edited by senior courses attendees. As there are no general categories, those top 100 are very specific and some of them are quite funny or a bit weird in a way.

Completing the task

In the beginning, we defined the most successful instructor as the one with the highest number of attendees’ edits done during and after their course. Based on our data analysis, we opted for using more than just one metric, therefore we defined three more.
The highest number of edits per instructor: 58 614 (instructor: Vojtěch Veselý)
The highest number of edits per attendee: 864,7 (instructor: Gampe)
The highest number of edits per course: 6 918 (instructor: Gampe)
The course with the highest number of edits: 32 444 (Městská knihovna v Praze, Ústřední knihovna – Pokročilí II.; instructors: Gampe & Jagro)

Are the courses successful?
Yes, they are. Even though there are courses not as successful as the others, based on the analysis we can say that the courses for seniors either make new active editors or boost up activity of already experienced contributors.

What did I learn?

During the Digital Academy courses, I was used to use SQL queries only in Snowflake RDBMS. Working with MariaDB made me realize how much different can database servers be. What worked smoothly in Snowflake (especially window functions), didn’t work in the Wikimedia’s MariaDB version at all. My nicely written queries got killed quite often because of this and other rooky mistakes – like missing commas.
In the beginning, I thought I’d ask questions, get data, do some analysis and visualize what I had. I soon realized I’d need to ask more questions, verify and validate, to understand the domain and bring added value. I learnt in the process how to use different tools and how to do some work-arounds when some tools didn’t work as expected.

Leave a Reply

Your email address will not be published. Required fields are marked *