The objective of this exercise is to build a model that can predict, in advance, which merchants are currently active but are likely to stop transacting in the near future. The data available to train this model consists of around 1.5 million transactions over 2 years across 14,351 merchants.
A merchant is said to have churned if they have stopped transacting over a given observation period. The length of this observation period is an assumption we have to make (a minimal sketch of this labelling rule follows the list below). There are two key considerations for choosing an appropriate length for the observation period -
- The observation period should be long enough that inactivity during it can be interpreted as the merchant having stopped using the service. If we choose a short period, say 1 week, the merchants we identify as having churned might simply have had no transactions in that particular week and may transact again later, which would make our churn labels inaccurate.
- The length of the window should be short enough that -
  - We are able to identify churned merchants from the given dataset, which only has 2 years' worth of data. If we choose a long window, say 18 months, we will not be able to compute the churn indicator for a majority of the merchants and might not have enough data to train a model.
  - The churn prediction from the model is actually actionable. This is much more of a subjective criterion determined by the actual use case of the model. We might care more about the merchants not likely to transact within the next 30 days, in which case the model should be trained to predict over a similar time frame.
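As a minimal sketch of the labelling rule above (the window length and dates here are illustrative assumptions, not values taken from the dataset):
# Minimal sketch of the churn labelling rule; window length and dates are illustrative only.
from datetime import date, timedelta

def is_churned(last_txn_date, as_of_date, observation_window_days):
    """A merchant is labelled churned if they have had no transactions for the entire
    observation window ending at `as_of_date`."""
    return (as_of_date - last_txn_date) > timedelta(days=observation_window_days)

is_churned(date(2023, 1, 1), date(2023, 4, 11), 90)  # True: 100 days of inactivity > 90-day window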
import pandas as pd
import numpy as np
import os
import project_s.src.common_utils as utils
import project_s.src as src
import math
from plotnine import *
from datetime import timedelta
import warnings
warnings.filterwarnings('ignore')
pd.DataFrame.removeUnivarOutliers = utils.remove_univariate_outliers
raw_data = (pd.read_csv(os.path.join(src.dir_path , "data/raw/takehome_ds_written.csv"),index_col=0,parse_dates = ["time"]).
assign(date = lambda x : pd.to_datetime(x["time"].dt.strftime('%Y-%m-%d')),# add date
amount_in_usd = lambda x :x.amount_usd_in_cents/100)) # amount in usd
def add_next_transaction(df):
    """
    Add the next transaction date for the same merchant to the raw data.
    """
    return (df
            .sort_values("time")
            .assign(next_transaction_date=lambda x: x.date.shift(-1),
                    days_to_next_transaction=lambda x: (x.next_transaction_date - x.date).dt.days))  # null if no later transaction

# add fields for next transaction date and days to next transaction, per merchant
tranformed_df = raw_data.groupby("merchant", as_index=False).apply(add_next_transaction)
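A quick illustration, on hypothetical toy data, of what `add_next_transaction` produces: within each merchant, `days_to_next_transaction` is the gap to that merchant's next transaction and is null for the final one.
# Hypothetical toy data, only to show the per-merchant shift(-1) behaviour.
toy = pd.DataFrame({"merchant": ["a", "a", "b"],
                    "time": pd.to_datetime(["2023-01-01", "2023-01-15", "2023-01-03"])}
                   ).assign(date=lambda x: x["time"].dt.normalize())
toy.groupby("merchant", as_index=False).apply(add_next_transaction)
# merchant "a": days_to_next_transaction is 14 then NaN; merchant "b": NaN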
churn_window_candidates = list(np.linspace(0,360,int(360/30)+1))
churn_data = []
for this_window in churn_window_candidates:
    churn_df = (tranformed_df
                .loc[lambda x: x["time"] < (x["time"].max() - timedelta(days=max(churn_window_candidates)))]  # keep only the cohort observable for the longest candidate window
                .groupby("merchant")
                .apply(lambda x: x["days_to_next_transaction"].fillna(np.inf).gt(this_window).any()))
    churn_data.append({"churn_window_len": this_window,
                       "churn_percent": churn_df.mean(),
                       "n_merchants": churn_df.count()})
churn_df = pd.DataFrame(churn_data).assign( decrease_percent = lambda x : ((x.churn_percent.shift(1) - x.churn_percent))*100 )
churn_df
|    | churn_window_len | churn_percent | n_merchants | decrease_percent |
|----|------------------|---------------|-------------|------------------|
| 0  | 0.0   | 1.000000 | 6433 | NaN       |
| 1  | 30.0  | 0.811130 | 6433 | 18.886989 |
| 2  | 60.0  | 0.679776 | 6433 | 13.135396 |
| 3  | 90.0  | 0.602363 | 6433 | 7.741334  |
| 4  | 120.0 | 0.547179 | 6433 | 5.518421  |
| 5  | 150.0 | 0.505518 | 6433 | 4.166019  |
| 6  | 180.0 | 0.468677 | 6433 | 3.684129  |
| 7  | 210.0 | 0.436966 | 6433 | 3.171149  |
| 8  | 240.0 | 0.414581 | 6433 | 2.238458  |
| 9  | 270.0 | 0.398414 | 6433 | 1.616664  |
| 10 | 300.0 | 0.381004 | 6433 | 1.741023  |
| 11 | 330.0 | 0.367480 | 6433 | 1.352402  |
| 12 | 360.0 | 0.350070 | 6433 | 1.741023  |
ggplot(churn_df, aes(x = "churn_window_len" , y = "decrease_percent")) + geom_line() + scale_y_continuous(labels=lambda l: ["%d%%" % (v) for v in l])
Above we can see the churn rate of the same cohort of active merchants (all merchants active in 2033) as we vary the size of the observation period. As we increase the churn window from 0 to 360 days, the churn rate starts to stabilize around 240 days: most merchants who churn within a 360-day observation period have already churned within a 240-day window, so the incremental benefit of a longer observation period is marginal. This indicates that a 240-day window for identifying churn will work reasonably well for training our model.
Now that we have fixed our definition of churn as the absence of any transaction for 240 days, we can identify the merchants that have already churned at least once in the dataset.
churned_merchants = list((tranformed_df
                          .loc[lambda x: x["date"] < (x["date"].max() - timedelta(days=240))]  # exclude the last 240 days, where churn cannot yet be observed
                          .assign(days_to_next_transaction=lambda x: x.days_to_next_transaction.fillna(np.inf))
                          .loc[lambda x: x.days_to_next_transaction.gt(240)])["merchant"].unique())
pd.DataFrame(churned_merchants).to_csv(os.path.join(src.dir_path, "data/processed/churned_merchants.csv"))
4,017 out of the 14,351 merchants have churned at some point during these 2 years.
To generate the training data, I have chosen to sample merchant activity every 30 days. At each sampling date, if the merchant has not churned as of that date, I compute features over the past 240 days (the training period) to capture the merchant's activity. The target variable indicates whether or not the merchant had any transactions over the performance window, which runs from the sampling date to 240 days after it. I did not choose a higher sampling frequency because oversampling would produce a very large number of rows that take a long time to compute while adding little information, given that merchants transact at a median frequency of roughly once per week. Additionally, the choice to also keep the training period at 240 days is arbitrary; ideally this would be a parameter tuned based on model performance and data availability.
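To make the windowing concrete, here is a small illustration for one hypothetical sampling date (the date itself is an assumption; only the 240-day lengths come from the choices above):
# For a hypothetical sampling date, features come from the 240 days before it and the churn label
# from the 240 days after it. This mirrors the offset_date convention used below, where
# offset_date <= 0 falls in the training window and offset_date > 0 in the performance window.
sample_date = pd.Timestamp("2034-01-01")  # hypothetical
training_window_bounds = (sample_date - timedelta(days=240), sample_date)
performance_window_bounds = (sample_date, sample_date + timedelta(days=240))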
training_period = 240
performance_window = 240
date_samples = pd.date_range(raw_data["date"].min() + timedelta(days =training_period ) , raw_data["date"].max() - timedelta(days =performance_window ) , freq = '30D')
To capture merchant activity over the training window, I compute the following features. Each of them captures one of three attributes -
- Average activity over the training window
- Recent activity
- Recent activity normalised by overall activity over the training window
feature_dict = { "total_amount_usd" : "Total amount transacted over the training window" ,
"count_txns" : "Number of transactions over the training window" ,
"avg_txn_amount" : "Average transaction amount over the training window",
"count_txns_l30d" : "Number of transactions in the last 30 days",
"count_txns_l60d" :"Number of transactions in the last 60 days" ,
"count_txns_l90d" : "Number of transactions in the last 90 days",
"count_txns_l30d_ratio" : "Number of transactions in the last 30 days divided by total transactions over the training window" ,
"count_txns_l60d_ratio" :"Number of transactions in the last 60 days divided by total transactions over the training window",
"count_txns_l90d_ratio" :"Number of transactions in the last 90 days divided by total transactions over the training window"}
def create_features_target(grouped_df):
    training_window = grouped_df[grouped_df.offset_date <= 0]
    if training_window.shape[0] == 0:  # no activity in the training window, merchant has already churned
        return None
    performance_window = grouped_df[grouped_df.offset_date > 0]
    features = {"total_amount_usd": training_window["amount_in_usd"].sum(),  # total amount transacted
                "count_txns": training_window["amount_in_usd"].count(),  # total number of transactions
                "avg_txn_amount": training_window["amount_in_usd"].mean(),  # average transaction size
                "count_txns_l30d": training_window.loc[training_window["offset_date"] >= -30, "amount_in_usd"].count(),
                "count_txns_l60d": training_window.loc[training_window["offset_date"] >= -60, "amount_in_usd"].count(),
                "count_txns_l90d": training_window.loc[training_window["offset_date"] >= -90, "amount_in_usd"].count(),
                "count_txns_l30d_ratio": training_window.loc[training_window["offset_date"] >= -30, "amount_in_usd"].count() / training_window["amount_in_usd"].count(),
                "count_txns_l60d_ratio": training_window.loc[training_window["offset_date"] >= -60, "amount_in_usd"].count() / training_window["amount_in_usd"].count(),
                "count_txns_l90d_ratio": training_window.loc[training_window["offset_date"] >= -90, "amount_in_usd"].count() / training_window["amount_in_usd"].count()}
    churn_indicator = 1 if performance_window.shape[0] == 0 else 0  # churned if the merchant did not transact within the performance window
    features.update({"churn_indicator": churn_indicator})
    training_row = pd.DataFrame(features, index=[0])
    return training_row
training_df = pd.DataFrame()
for date_sample in date_samples:
    training_df = pd.concat([training_df,
                             (tranformed_df.assign(sample_date=date_sample,
                                                   offset_date=lambda x: (x.date - x.sample_date).dt.days))], axis=0)
training_data = (training_df
.loc[lambda x : (x.offset_date >-training_period) & (x.offset_date <=performance_window)]
.groupby(["merchant","sample_date"])
.apply(create_features_target)).reset_index().drop(["level_2"],axis = 1)
import py_mob
features = [ "total_amount_usd" ,
"count_txns" ,
"avg_txn_amount",
"count_txns_l30d",
"count_txns_l60d",
"count_txns_l90d",
"count_txns_l30d_ratio",
"count_txns_l60d_ratio" ,
"count_txns_l90d_ratio" ]
target = "churn_indicator"
feature_summary = []
for feat in features:
    feat_binned = py_mob.qtl_bin(training_data[feat], training_data[target])  # quantile binning of the feature against the target
    summ_bin = py_mob.summ_bin(feat_binned)  # summary statistics (IV, KS) for the binned feature
    summ_bin.update({"feature": feat})
    feature_summary.append(summ_bin)
pd.DataFrame(feature_summary)
|   | sample size | bad rate | iv     | ks    | missing | feature |
|---|-------------|----------|--------|-------|---------|---------|
| 0 | 51764 | 0.3032 | 0.2557 | 21.13 | 0.0 | total_amount_usd |
| 1 | 51764 | 0.3032 | 0.2735 | 21.04 | 0.0 | count_txns |
| 2 | 51764 | 0.3032 | 0.0054 | 3.25  | 0.0 | avg_txn_amount |
| 3 | 51764 | 0.3032 | 1.3531 | 51.26 | 0.0 | count_txns_l30d |
| 4 | 51764 | 0.3032 | 1.2477 | 48.32 | 0.0 | count_txns_l60d |
| 5 | 51764 | 0.3032 | 1.0598 | 42.23 | 0.0 | count_txns_l90d |
| 6 | 51764 | 0.3032 | 1.0685 | 48.07 | 0.0 | count_txns_l30d_ratio |
| 7 | 51764 | 0.3032 | 0.4920 | 33.98 | 0.0 | count_txns_l60d_ratio |
| 8 | 51764 | 0.3032 | 0.2410 | 24.16 | 0.0 | count_txns_l90d_ratio |
feat = "count_txns_l30d_ratio"
feat_binned = py_mob.qtl_bin(training_data[feat], training_data[target] )
pd.DataFrame(feat_binned['tbl']).to_clipboard()
training_data[feat].describe(np.linspace(0,1,11))
count 51764.000000
mean 0.235151
std 0.332740
min 0.000000
0% 0.000000
10% 0.000000
20% 0.000000
30% 0.000000
40% 0.000000
50% 0.064516
60% 0.159091
70% 0.250523
80% 0.489177
90% 1.000000
100% 1.000000
max 1.000000
Name: count_txns_l30d_ratio, dtype: float64
Above is the information value (IV) of each feature over the training set, which indicates the strength of the feature's relationship with churn. The churn rate for the sampled training set is around 30% on a training set of 51,764 rows after sampling. Looking at the information values, all features except average transaction amount are strongly related to churn. Clearly, transaction activity is highly predictive of future churn, and the more recent the activity, the stronger the relationship. We go ahead and train an xgboost model on this training set with these features.
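For reference, a minimal sketch of how information value is typically computed from a binned feature; this is the standard WoE/IV formula, not necessarily py_mob's exact implementation, and the function name and arguments are hypothetical.
def information_value(df, bin_col, target_col):
    # IV = sum over bins of (% of non-events - % of events) * WoE,
    # where WoE = ln(% of non-events / % of events) and an event is a churn (target == 1).
    # For brevity this sketch assumes every bin contains both events and non-events.
    grouped = df.groupby(bin_col)[target_col].agg(events="sum", total="count")
    grouped["nonevents"] = grouped["total"] - grouped["events"]
    dist_event = grouped["events"] / grouped["events"].sum()
    dist_nonevent = grouped["nonevents"] / grouped["nonevents"].sum()
    woe = np.log(dist_nonevent / dist_event)
    return ((dist_nonevent - dist_event) * woe).sum()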
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV,train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
# A parameter grid for XGBoost
params = {
'min_child_weight': [1, 5, 10],
'max_depth': [3, 4, 5],
'learning_rate' : [0.001, 0.01,0.1],
'n_estimators' : [50,100,150,200]
}
xgb = XGBClassifier(objective='binary:logistic',silent=True, nthread=18)
folds = 3
skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)
X_train, X_test, y_train, y_test = train_test_split(training_data[features], training_data[target], test_size=0.33, random_state=42)
grid_search = GridSearchCV(xgb, param_grid=params, scoring='roc_auc', n_jobs=4, cv=skf.split(X_train,y_train), verbose=3 )
# Here we go
grid_search.fit(X_train, y_train)
Fitting 3 folds for each of 108 candidates, totalling 324 fits
def ks(data=None, target=None, prob=None, asc=False):
    """Decile-level KS table for a binary target and a predicted probability."""
    data['target0'] = 1 - data[target]
    data['bucket'] = pd.qcut(data[prob], 10, duplicates='raise')
    grouped = data.groupby('bucket', as_index=False)
    kstable = pd.DataFrame()
    kstable['min_prob'] = grouped.min()[prob]
    kstable['max_prob'] = grouped.max()[prob]
    kstable['events'] = grouped.sum()[target]
    kstable['nonevents'] = grouped.sum()['target0']
    kstable['total'] = grouped.count()['target0']
    kstable['bucket_event_rate'] = kstable.events / kstable.total
    kstable = kstable.sort_values(by="min_prob", ascending=asc).reset_index(drop=True)
    kstable['event_rate'] = (kstable.events / data[target].sum()).apply('{0:.2%}'.format)
    kstable['nonevent_rate'] = (kstable.nonevents / data['target0'].sum()).apply('{0:.2%}'.format)
    kstable['cum_eventrate'] = (kstable.events / data[target].sum()).cumsum()
    kstable['cum_noneventrate'] = (kstable.nonevents / data['target0'].sum()).cumsum()
    kstable['KS'] = np.round(kstable['cum_eventrate'] - kstable['cum_noneventrate'], 3) * 100
    # Formatting
    kstable['cum_eventrate'] = kstable['cum_eventrate'].apply('{0:.2%}'.format)
    kstable['cum_noneventrate'] = kstable['cum_noneventrate'].apply('{0:.2%}'.format)
    # kstable.index = range(1, len(kstable)+1)
    kstable.index = range(1, 11)
    kstable.index.rename('Decile', inplace=True)
    return kstable
from sklearn.metrics import roc_auc_score
classifier = grid_search.best_estimator_
classifier.fit(X_train,y_train)
y_train_probs = classifier.predict_proba(X_train)[:, 1]
y_test_probs = classifier.predict_proba(X_test)[:, 1]
train_auc = roc_auc_score(y_train, y_train_probs)
test_auc = roc_auc_score(y_test, y_test_probs)
print(f"Model results. Train AUC : {train_auc}. Test AUC : {test_auc}")
Model results. Train AUC : 0.8787946226704462. Test AUC : 0.851806268393682
test_result_df = pd.DataFrame({"target" : y_test , "proba": y_test_probs })
ks(test_result_df,"target","proba")
| Decile | min_prob | max_prob | events | nonevents | total | bucket_event_rate | event_rate | nonevent_rate | cum_eventrate | cum_noneventrate | KS |
|--------|----------|----------|--------|-----------|-------|-------------------|------------|---------------|---------------|------------------|------|
| 1  | 0.680263 | 0.933116 | 1380 | 315  | 1695 | 0.814159 | 26.10% | 2.67%  | 26.10%  | 2.67%   | 23.4 |
| 2  | 0.606068 | 0.679982 | 1132 | 588  | 1720 | 0.658140 | 21.41% | 4.99%  | 47.50%  | 7.66%   | 39.8 |
| 3  | 0.492019 | 0.605952 | 909  | 801  | 1710 | 0.531579 | 17.19% | 6.79%  | 64.69%  | 14.45%  | 50.2 |
| 4  | 0.325948 | 0.491943 | 660  | 1043 | 1703 | 0.387551 | 12.48% | 8.84%  | 77.17%  | 23.29%  | 53.9 |
| 5  | 0.225464 | 0.325637 | 522  | 1191 | 1713 | 0.304729 | 9.87%  | 10.10% | 87.05%  | 33.39%  | 53.7 |
| 6  | 0.158576 | 0.225438 | 316  | 1393 | 1709 | 0.184903 | 5.98%  | 11.81% | 93.02%  | 45.20%  | 47.8 |
| 7  | 0.096585 | 0.158527 | 181  | 1527 | 1708 | 0.105972 | 3.42%  | 12.95% | 96.44%  | 58.14%  | 38.3 |
| 8  | 0.050022 | 0.096504 | 111  | 1597 | 1708 | 0.064988 | 2.10%  | 13.54% | 98.54%  | 71.68%  | 26.9 |
| 9  | 0.017068 | 0.050011 | 62   | 1646 | 1708 | 0.036300 | 1.17%  | 13.96% | 99.72%  | 85.64%  | 14.1 |
| 10 | 0.000351 | 0.016972 | 15   | 1694 | 1709 | 0.008777 | 0.28%  | 14.36% | 100.00% | 100.00% | 0.0  |
The final classifier performs well: the test AUC is almost 0.85. However, from a usage perspective, we need to choose a threshold for the predicted probability of churn above which we will take action to prevent churn. This comes down to the trade-off between the cost of that action and our need to cut churn, and the table above helps navigate it. For example, if we set the probability threshold at around 0.5 (decile 3), we would have to act on about 30% of the population on average, but that 30% would account for roughly 65% of all churners.
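As a small illustration of this trade-off on the test set (the 0.5 cut-off below is just the example value from the text, not a recommendation):
threshold = 0.5  # candidate cut-off; in practice it would be tuned against the cost of the retention action
flagged = test_result_df["proba"] >= threshold
population_actioned = flagged.mean()  # share of merchants we would act on
churners_covered = test_result_df.loc[flagged, "target"].sum() / test_result_df["target"].sum()  # share of actual churners covered
print(f"Act on {population_actioned:.0%} of merchants, covering {churners_covered:.0%} of churners")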
feature_importance_df = pd.DataFrame({"features":features , "importance": classifier.feature_importances_})
ggplot(feature_importance_df , aes(x = "features" , y = "importance")) + geom_col() +theme(axis_text_x= element_text(angle = 90))
Within the model, the two most important features are those capturing recent activity and recent activity normalised by the overall level of activity, which makes a lot of sense.
# predict which currently active merchants are likely to churn in the future
as_of_date = tranformed_df["date"].max()
X_oot = (tranformed_df
         .assign(offset_date=lambda x: (x.date - as_of_date).dt.days)
         .loc[lambda x: x.offset_date > -performance_window]  # keep the most recent 240 days of activity as the feature window
         .groupby("merchant")
         .apply(create_features_target)
         ).reset_index().drop(["level_1", "churn_indicator"], axis=1)
predicted_churn_prob_df = pd.DataFrame({"merchant" : X_oot["merchant"] , "predicted_churn_prob":classifier.predict_proba(X_oot[features])[:, 1]})
predicted_churn_prob_df.to_csv(os.path.join(src.dir_path , "data/processed/active_merchant_churn_probs.csv"))
Some possible next steps -
- Explore more hyperparameter combinations and try other models where possible (a minimal sketch of a wider search follows this list)
- Feature engineering, both by creating new features and by tweaking the time windows of the existing ones. Most of the signal clearly lies in recent activity, so we might be able to shorten the training window
- Choose an appropriate threshold for the probability output of the model, above which we will commission an action to curtail possible churn
- Find similar data on which to validate this model
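As a sketch of the first item, a wider randomized search over the same estimator; the expanded parameter ranges below are assumptions, not tuned recommendations.
from scipy.stats import randint, uniform

# Randomized search over broader, illustrative ranges; reuses skf, X_train and y_train from above.
wider_params = {"min_child_weight": randint(1, 20),
                "max_depth": randint(3, 10),
                "learning_rate": uniform(0.001, 0.3),
                "n_estimators": randint(50, 500),
                "subsample": uniform(0.5, 0.5),
                "colsample_bytree": uniform(0.5, 0.5)}
random_search = RandomizedSearchCV(XGBClassifier(objective='binary:logistic', nthread=18),
                                   param_distributions=wider_params, n_iter=50,
                                   scoring='roc_auc', cv=skf, random_state=42, n_jobs=4)
# random_search.fit(X_train, y_train)  # then compare best_score_ with the grid search above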