/sql-cohort-analysis

A short cohort analysis performed in SQL, visualized in Python.

Primary LanguageJupyter Notebook

Cohort Analysis in SQL

With visualization in Python

For this mini-project, I started with two datasets. One included users to a fake social media website, and the other included activity (likes, shares, comments) from those users.

All of the data is included in dump.sql.
To create the necessary tables:

CREATE TABLE users (id INT(11) PRIMARY KEY NOT NULL, name VARCHAR(40) NOT NULL, date DATETIME NOT NULL);

CREATE TABLE events (id INT(11) PRIMARY KEY NOT NULL, type VARCHAR(15), user_id INT(11) NOT NULL, date DATETIME NOT NULL, FOREIGN KEY (user_id) REFERENCES users(id))

The goal of this project was just to gain some familiarity with SQL and cohort analysis.

import MySQLdb as mdb
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

con = mdb.connect('localhost', 'root', 'password', 'cohortanalysis');
cur = con.cursor();

Let's take a look at a sample of the data in the two tables

SQL = '''select * from users limit 5;'''
pd.read_sql(SQL, con=con)
id name date
0 1 Miriam 2013-02-12 10:03:40
1 2 Patricia 2013-02-12 15:47:37
2 3 Paula 2013-02-12 17:20:12
3 4 Anita 2013-02-12 17:33:17
4 5 Herman 2013-02-14 16:09:33
SQL = '''select * from events limit 5;'''
pd.read_sql(SQL, con=con)
id type user_id date
0 1 like 1 2013-02-10 13:40:03
1 2 like 1 2013-02-10 15:35:58
2 3 twitter share 1 2013-02-11 16:40:14
3 4 like 1 2013-02-12 04:10:14
4 5 twitter share 4 2013-02-12 18:12:02

What we would like to do is to:

  1. For each month, find the users who signed up during that month.
  2. Track the activity for those users for each successive month.

Here is the final SQL query I ended up using (developed incrementally):

SELECT a.cohort, a.months, a.actives as active_users, t.total_users as total_users, a.actives/t.total_users as percent_active
FROM 
(SELECT DATE_FORMAT(u.date, '%Y/%m') as cohort,
        PERIOD_DIFF(DATE_FORMAT(e.date, '%Y%m'), DATE_FORMAT(u.date, '%Y%m')) as months, 
        COUNT(DISTINCT(u.id)) as actives
FROM users as u
JOIN events as e ON u.id = e.user_id
GROUP BY cohort, months
HAVING months >= 0) as a
JOIN 
(SELECT DATE_FORMAT(gd_month(u.date), '%Y/%m') as cohort,
        COUNT(DISTINCT(u.id)) as total_users
FROM users as u
GROUP BY cohort) as t
ON a.cohort = t.cohort
ORDER BY a.cohort, a.months ASC;
SQL = '''SELECT a.cohort, a.months, a.months + RIGHT(a.cohort, 1) - 2 as months_from, a.actives as active_users, t.total_users as total_users, a.actives/t.total_users as percent_active FROM (SELECT DATE_FORMAT(u.date, '%Y/%m') as cohort,        PERIOD_DIFF(DATE_FORMAT(e.date, '%Y%m'), DATE_FORMAT(u.date, '%Y%m')) as months,         COUNT(DISTINCT(u.id)) as actives FROM users as u JOIN events as e ON u.id = e.user_id GROUP BY cohort, months HAVING months >= 0) as a JOIN  (SELECT DATE_FORMAT(gd_month(u.date), '%Y/%m') as cohort,         COUNT(DISTINCT(u.id)) as total_users FROM users as u GROUP BY cohort) as t ON a.cohort = t.cohort ORDER BY a.cohort, a.months ASC;'''


df = pd.read_sql(SQL, con=con)
df['months_from'] = df['months_from'].astype(int)

And we have a tabulated version of our cohort analysis.

df
cohort months months_from active_users total_users percent_active
0 2013/02 0 0 19 48 0.3958
1 2013/02 1 1 35 48 0.7292
2 2013/02 2 2 35 48 0.7292
3 2013/02 3 3 33 48 0.6875
4 2013/02 4 4 42 48 0.8750
5 2013/02 5 5 33 48 0.6875
6 2013/03 0 1 113 338 0.3343
7 2013/03 1 2 242 338 0.7160
8 2013/03 2 3 259 338 0.7663
9 2013/03 3 4 308 338 0.9112
10 2013/03 4 5 182 338 0.5385
11 2013/04 0 2 607 1699 0.3573
12 2013/04 1 3 1273 1699 0.7493
13 2013/04 2 4 1566 1699 0.9217
14 2013/04 3 5 932 1699 0.5486
15 2013/05 0 3 2410 7658 0.3147
16 2013/05 1 4 7107 7658 0.9280
17 2013/05 2 5 4295 7658 0.5609
18 2013/06 0 4 16635 24716 0.6730
19 2013/06 1 5 13892 24716 0.5621
20 2013/07 0 5 6008 20082 0.2992

Not much use in tabulated form. Let's visualize it!

fig = plt.figure(figsize=(14, 6))
a = sns.barplot(x='months', y='percent_active', hue='cohort', data=df)
a.set_title('Cohort Analysis of User Engagement over 5 month period')
a.set_xlabel('Months Since Start of Cohort')
a.set_ylabel('% Users Active in Cohort')
fig.savefig('Cohort Analysis of User Engagement, Months since Start.png')

png



Observations
The majority of cohorts experience an increase in activity during the second month of their lifecycle, with the exception of the 2013/06 (June) cohort, which experiences a dip in activity.

  • February cohort experience roughly 5 months of growth before a decline
  • March cohort experiences 4 months of growth before a decline
  • The April cohort experiences 3 months of growth before a decline
  • The May cohort experiences 2 months of growth before a decline,
  • The June cohort experiences 1 month of growth before a decline

When we write out the behavior of each cohort, we can see a pattern, but we can't really tell what's going on with this particular visualization.
This plot let's us easily compare the relative course activity between cohort, but it does not allow us to easily see how the cohorts perform in terms of absolute time (actual months), only relative time (compared with when the cohort started).

Let's take a look at the absolute bar chart to see what's happening there.

def get_month(x):
    lst = ['February','March','April','May','June','July']
    return lst[x]

df['months_from_feb'] = df['months_from'].apply(get_month)
df.drop('months_from', axis=1, inplace=True)
fig = plt.figure(figsize=(14, 6))
a = sns.barplot(x='months_from_feb', y='percent_active', hue='cohort', data=df)
a.set_title('Cohort Analysis of User Engagement over 5 month period')
a.set_xlabel('Month')
a.set_ylabel('% Users Active in Cohort')
fig.savefig('Cohort Analysis of User Engagement, Absolute Months.png')

png

Observations
In this plot it is easier to view the activity of each cohort relative to the month that cohort was active.

Conclusions
Here it is easy to see that in June, all cohorts have more active users, but in July, all cohorts experience a drop in active users. So the problem isn't something about the length of time of the cohorts, but instead the problem is due to either a change in the product in June, or circumstances external to the product that occurred in June.