With visualization in Python
For this mini-project, I started with two datasets. One included users to a fake social media website, and the other included activity (likes, shares, comments) from those users.
All of the data is included in dump.sql.
To create the necessary tables:
CREATE TABLE users (id INT(11) PRIMARY KEY NOT NULL, name VARCHAR(40) NOT NULL, date DATETIME NOT NULL);
CREATE TABLE events (id INT(11) PRIMARY KEY NOT NULL, type VARCHAR(15), user_id INT(11) NOT NULL, date DATETIME NOT NULL, FOREIGN KEY (user_id) REFERENCES users(id))
The goal of this project was just to gain some familiarity with SQL and cohort analysis.
import MySQLdb as mdb
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
con = mdb.connect('localhost', 'root', 'password', 'cohortanalysis');
cur = con.cursor();
Let's take a look at a sample of the data in the two tables
SQL = '''select * from users limit 5;'''
pd.read_sql(SQL, con=con)
id | name | date | |
---|---|---|---|
0 | 1 | Miriam | 2013-02-12 10:03:40 |
1 | 2 | Patricia | 2013-02-12 15:47:37 |
2 | 3 | Paula | 2013-02-12 17:20:12 |
3 | 4 | Anita | 2013-02-12 17:33:17 |
4 | 5 | Herman | 2013-02-14 16:09:33 |
SQL = '''select * from events limit 5;'''
pd.read_sql(SQL, con=con)
id | type | user_id | date | |
---|---|---|---|---|
0 | 1 | like | 1 | 2013-02-10 13:40:03 |
1 | 2 | like | 1 | 2013-02-10 15:35:58 |
2 | 3 | twitter share | 1 | 2013-02-11 16:40:14 |
3 | 4 | like | 1 | 2013-02-12 04:10:14 |
4 | 5 | twitter share | 4 | 2013-02-12 18:12:02 |
What we would like to do is to:
- For each month, find the users who signed up during that month.
- Track the activity for those users for each successive month.
Here is the final SQL query I ended up using (developed incrementally):
SELECT a.cohort, a.months, a.actives as active_users, t.total_users as total_users, a.actives/t.total_users as percent_active
FROM
(SELECT DATE_FORMAT(u.date, '%Y/%m') as cohort,
PERIOD_DIFF(DATE_FORMAT(e.date, '%Y%m'), DATE_FORMAT(u.date, '%Y%m')) as months,
COUNT(DISTINCT(u.id)) as actives
FROM users as u
JOIN events as e ON u.id = e.user_id
GROUP BY cohort, months
HAVING months >= 0) as a
JOIN
(SELECT DATE_FORMAT(gd_month(u.date), '%Y/%m') as cohort,
COUNT(DISTINCT(u.id)) as total_users
FROM users as u
GROUP BY cohort) as t
ON a.cohort = t.cohort
ORDER BY a.cohort, a.months ASC;
SQL = '''SELECT a.cohort, a.months, a.months + RIGHT(a.cohort, 1) - 2 as months_from, a.actives as active_users, t.total_users as total_users, a.actives/t.total_users as percent_active FROM (SELECT DATE_FORMAT(u.date, '%Y/%m') as cohort, PERIOD_DIFF(DATE_FORMAT(e.date, '%Y%m'), DATE_FORMAT(u.date, '%Y%m')) as months, COUNT(DISTINCT(u.id)) as actives FROM users as u JOIN events as e ON u.id = e.user_id GROUP BY cohort, months HAVING months >= 0) as a JOIN (SELECT DATE_FORMAT(gd_month(u.date), '%Y/%m') as cohort, COUNT(DISTINCT(u.id)) as total_users FROM users as u GROUP BY cohort) as t ON a.cohort = t.cohort ORDER BY a.cohort, a.months ASC;'''
df = pd.read_sql(SQL, con=con)
df['months_from'] = df['months_from'].astype(int)
And we have a tabulated version of our cohort analysis.
df
cohort | months | months_from | active_users | total_users | percent_active | |
---|---|---|---|---|---|---|
0 | 2013/02 | 0 | 0 | 19 | 48 | 0.3958 |
1 | 2013/02 | 1 | 1 | 35 | 48 | 0.7292 |
2 | 2013/02 | 2 | 2 | 35 | 48 | 0.7292 |
3 | 2013/02 | 3 | 3 | 33 | 48 | 0.6875 |
4 | 2013/02 | 4 | 4 | 42 | 48 | 0.8750 |
5 | 2013/02 | 5 | 5 | 33 | 48 | 0.6875 |
6 | 2013/03 | 0 | 1 | 113 | 338 | 0.3343 |
7 | 2013/03 | 1 | 2 | 242 | 338 | 0.7160 |
8 | 2013/03 | 2 | 3 | 259 | 338 | 0.7663 |
9 | 2013/03 | 3 | 4 | 308 | 338 | 0.9112 |
10 | 2013/03 | 4 | 5 | 182 | 338 | 0.5385 |
11 | 2013/04 | 0 | 2 | 607 | 1699 | 0.3573 |
12 | 2013/04 | 1 | 3 | 1273 | 1699 | 0.7493 |
13 | 2013/04 | 2 | 4 | 1566 | 1699 | 0.9217 |
14 | 2013/04 | 3 | 5 | 932 | 1699 | 0.5486 |
15 | 2013/05 | 0 | 3 | 2410 | 7658 | 0.3147 |
16 | 2013/05 | 1 | 4 | 7107 | 7658 | 0.9280 |
17 | 2013/05 | 2 | 5 | 4295 | 7658 | 0.5609 |
18 | 2013/06 | 0 | 4 | 16635 | 24716 | 0.6730 |
19 | 2013/06 | 1 | 5 | 13892 | 24716 | 0.5621 |
20 | 2013/07 | 0 | 5 | 6008 | 20082 | 0.2992 |
Not much use in tabulated form. Let's visualize it!
fig = plt.figure(figsize=(14, 6))
a = sns.barplot(x='months', y='percent_active', hue='cohort', data=df)
a.set_title('Cohort Analysis of User Engagement over 5 month period')
a.set_xlabel('Months Since Start of Cohort')
a.set_ylabel('% Users Active in Cohort')
fig.savefig('Cohort Analysis of User Engagement, Months since Start.png')
Observations
The majority of cohorts experience an increase in activity during the second month of their lifecycle, with the exception of the 2013/06 (June) cohort, which experiences a dip in activity.
- February cohort experience roughly 5 months of growth before a decline
- March cohort experiences 4 months of growth before a decline
- The April cohort experiences 3 months of growth before a decline
- The May cohort experiences 2 months of growth before a decline,
- The June cohort experiences 1 month of growth before a decline
When we write out the behavior of each cohort, we can see a pattern, but we can't really tell what's going on with this particular visualization.
This plot let's us easily compare the relative course activity between cohort, but it does not allow us to easily see how the cohorts perform in terms of absolute time (actual months), only relative time (compared with when the cohort started).
Let's take a look at the absolute bar chart to see what's happening there.
def get_month(x):
lst = ['February','March','April','May','June','July']
return lst[x]
df['months_from_feb'] = df['months_from'].apply(get_month)
df.drop('months_from', axis=1, inplace=True)
fig = plt.figure(figsize=(14, 6))
a = sns.barplot(x='months_from_feb', y='percent_active', hue='cohort', data=df)
a.set_title('Cohort Analysis of User Engagement over 5 month period')
a.set_xlabel('Month')
a.set_ylabel('% Users Active in Cohort')
fig.savefig('Cohort Analysis of User Engagement, Absolute Months.png')
Observations
In this plot it is easier to view the activity of each cohort relative to the month that cohort was active.
Conclusions
Here it is easy to see that in June, all cohorts have more active users, but in July, all cohorts experience a drop in active users. So the problem isn't something about the length of time of the cohorts, but instead the problem is due to either a change in the product in June, or circumstances external to the product that occurred in June.