Cohort Analysis in SQL

With visualization in Python

For this mini-project, I started with two datasets. One included users to a fake social media website, and the other included activity (likes, shares, comments) from those users.

All of the data is included in dump.sql.
To create the necessary tables:

CREATE TABLE users (id INT(11) PRIMARY KEY NOT NULL, name VARCHAR(40) NOT NULL, date DATETIME NOT NULL);

CREATE TABLE events (id INT(11) PRIMARY KEY NOT NULL, type VARCHAR(15), user_id INT(11) NOT NULL, date DATETIME NOT NULL, FOREIGN KEY (user_id) REFERENCES users(id))

The goal of this project was just to gain some familiarity with SQL and cohort analysis.

import MySQLdb as mdb
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

con = mdb.connect('localhost', 'root', 'password', 'cohortanalysis');
cur = con.cursor();

Let's take a look at a sample of the data in the two tables

SQL = '''select * from users limit 5;'''
pd.read_sql(SQL, con=con)

	id	name	date
0	1	Miriam	2013-02-12 10:03:40
1	2	Patricia	2013-02-12 15:47:37
2	3	Paula	2013-02-12 17:20:12
3	4	Anita	2013-02-12 17:33:17
4	5	Herman	2013-02-14 16:09:33

SQL = '''select * from events limit 5;'''
pd.read_sql(SQL, con=con)

	id	type	user_id	date
0	1	like	1	2013-02-10 13:40:03
1	2	like	1	2013-02-10 15:35:58
2	3	twitter share	1	2013-02-11 16:40:14
3	4	like	1	2013-02-12 04:10:14
4	5	twitter share	4	2013-02-12 18:12:02

What we would like to do is to:

For each month, find the users who signed up during that month.
Track the activity for those users for each successive month.

Here is the final SQL query I ended up using (developed incrementally):

SELECT a.cohort, a.months, a.actives as active_users, t.total_users as total_users, a.actives/t.total_users as percent_active
FROM 
(SELECT DATE_FORMAT(u.date, '%Y/%m') as cohort,
        PERIOD_DIFF(DATE_FORMAT(e.date, '%Y%m'), DATE_FORMAT(u.date, '%Y%m')) as months, 
        COUNT(DISTINCT(u.id)) as actives
FROM users as u
JOIN events as e ON u.id = e.user_id
GROUP BY cohort, months
HAVING months >= 0) as a
JOIN 
(SELECT DATE_FORMAT(gd_month(u.date), '%Y/%m') as cohort,
        COUNT(DISTINCT(u.id)) as total_users
FROM users as u
GROUP BY cohort) as t
ON a.cohort = t.cohort
ORDER BY a.cohort, a.months ASC;

SQL = '''SELECT a.cohort, a.months, a.months + RIGHT(a.cohort, 1) - 2 as months_from, a.actives as active_users, t.total_users as total_users, a.actives/t.total_users as percent_active FROM (SELECT DATE_FORMAT(u.date, '%Y/%m') as cohort,        PERIOD_DIFF(DATE_FORMAT(e.date, '%Y%m'), DATE_FORMAT(u.date, '%Y%m')) as months,         COUNT(DISTINCT(u.id)) as actives FROM users as u JOIN events as e ON u.id = e.user_id GROUP BY cohort, months HAVING months >= 0) as a JOIN  (SELECT DATE_FORMAT(gd_month(u.date), '%Y/%m') as cohort,         COUNT(DISTINCT(u.id)) as total_users FROM users as u GROUP BY cohort) as t ON a.cohort = t.cohort ORDER BY a.cohort, a.months ASC;'''


df = pd.read_sql(SQL, con=con)
df['months_from'] = df['months_from'].astype(int)

And we have a tabulated version of our cohort analysis.

df

	cohort	months	months_from	active_users	total_users	percent_active
0	2013/02	0	0	19	48	0.3958
1	2013/02	1	1	35	48	0.7292
2	2013/02	2	2	35	48	0.7292
3	2013/02	3	3	33	48	0.6875
4	2013/02	4	4	42	48	0.8750
5	2013/02	5	5	33	48	0.6875
6	2013/03	0	1	113	338	0.3343
7	2013/03	1	2	242	338	0.7160
8	2013/03	2	3	259	338	0.7663
9	2013/03	3	4	308	338	0.9112
10	2013/03	4	5	182	338	0.5385
11	2013/04	0	2	607	1699	0.3573
12	2013/04	1	3	1273	1699	0.7493
13	2013/04	2	4	1566	1699	0.9217
14	2013/04	3	5	932	1699	0.5486
15	2013/05	0	3	2410	7658	0.3147
16	2013/05	1	4	7107	7658	0.9280
17	2013/05	2	5	4295	7658	0.5609
18	2013/06	0	4	16635	24716	0.6730
19	2013/06	1	5	13892	24716	0.5621
20	2013/07	0	5	6008	20082	0.2992

Not much use in tabulated form. Let's visualize it!

fig = plt.figure(figsize=(14, 6))
a = sns.barplot(x='months', y='percent_active', hue='cohort', data=df)
a.set_title('Cohort Analysis of User Engagement over 5 month period')
a.set_xlabel('Months Since Start of Cohort')
a.set_ylabel('% Users Active in Cohort')
fig.savefig('Cohort Analysis of User Engagement, Months since Start.png')

Observations
The majority of cohorts experience an increase in activity during the second month of their lifecycle, with the exception of the 2013/06 (June) cohort, which experiences a dip in activity.

February cohort experience roughly 5 months of growth before a decline
March cohort experiences 4 months of growth before a decline
The April cohort experiences 3 months of growth before a decline
The May cohort experiences 2 months of growth before a decline,
The June cohort experiences 1 month of growth before a decline

When we write out the behavior of each cohort, we can see a pattern, but we can't really tell what's going on with this particular visualization.
This plot let's us easily compare the relative course activity between cohort, but it does not allow us to easily see how the cohorts perform in terms of absolute time (actual months), only relative time (compared with when the cohort started).

Let's take a look at the absolute bar chart to see what's happening there.

def get_month(x):
    lst = ['February','March','April','May','June','July']
    return lst[x]

df['months_from_feb'] = df['months_from'].apply(get_month)
df.drop('months_from', axis=1, inplace=True)

fig = plt.figure(figsize=(14, 6))
a = sns.barplot(x='months_from_feb', y='percent_active', hue='cohort', data=df)
a.set_title('Cohort Analysis of User Engagement over 5 month period')
a.set_xlabel('Month')
a.set_ylabel('% Users Active in Cohort')
fig.savefig('Cohort Analysis of User Engagement, Absolute Months.png')

Observations
In this plot it is easier to view the activity of each cohort relative to the month that cohort was active.

Conclusions
Here it is easy to see that in June, all cohorts have more active users, but in July, all cohorts experience a drop in active users. So the problem isn't something about the length of time of the cohorts, but instead the problem is due to either a change in the product in June, or circumstances external to the product that occurred in June.

NathanZorndorf/sql-cohort-analysis

Cohort Analysis in SQL