Churn risk score prediction

HackerEarth machine learning challenge: How NOT to lose a customer in 10 days

Problem
Task
Data description
Evaluation metric
Steps
Ideas
Areas to improve
Final submission

Problem

Churn rate is a marketing metric that describes the number of customers who leave a business over a specific time period. Every user is assigned a prediction value that estimates their state of churn at any given time. This value is based on:

User demographic information
Browsing behavior
Historical purchase data among other information

It factors in our unique and proprietary predictions of how long a user will remain a customer. This score is updated every day for all users who have a minimum of one conversion. The values assigned are between 1 and 5.

↑ back to top

Task

Predict the churn risk score of each customer of a website based on the features provided in the dataset.

↑ back to top

Data description

The dataset folder contains the following files:

train.csv: 36992 rows x 25 columns
test.csv: 19919 rows x 24 columns

See the columns in the dataset here

↑ back to top

Evaluation metric

score = 100 x metrics.f1_score(actual, predicted, average="macro")
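
To make the metric concrete, here is a minimal pure-Python sketch of a macro-averaged F1 score scaled to 0-100. The function name `macro_f1_score` is hypothetical; the sketch is intended to mirror what `metrics.f1_score(..., average="macro")` from scikit-learn computes over the labels that appear in either list, and is not the challenge's official scoring code.

```python
def macro_f1_score(actual, predicted):
    """Return 100 x the unweighted mean of per-class F1 scores."""
    # Consider every label that occurs in the ground truth or the predictions.
    labels = sorted(set(actual) | set(predicted))
    pairs = list(zip(actual, predicted))

    f1_values = []
    for label in labels:
        tp = sum(a == label and p == label for a, p in pairs)
        fp = sum(a != label and p == label for a, p in pairs)
        fn = sum(a == label and p != label for a, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1_values.append(2 * tp / denom if denom else 0.0)

    # Macro averaging gives every class the same weight, however rare it is.
    return 100 * sum(f1_values) / len(f1_values)


score = macro_f1_score([1, 2, 3, 3], [1, 3, 3, 3])
```

Because every class counts equally, a model that ignores the rare scores 1 and 2 is penalised heavily even if it is accurate on 3, 4, and 5.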

↑ back to top

Steps

↑ back to top

Ideas

NaNs in gender column replaced with F based on customer names
Rows with churn risk score = -1 removed
- Trial 1. Computed the correlation of every column with the churn risk score column
- Replacing the -1 scores with 4 gave the highest correlation
- Trial 2. Removing the rows with a -1 score gave the best model accuracy
NaNs in medium of operation replaced with 'both' (increased correlation with churn risk score)
Incorrect negative values were converted to positive in columns
- avg_time_spent
- points_in_wallet
- avg_frequency_login_days
NaNs in the remaining columns were filled with the column mean for float columns and with forward fill (ffill) otherwise
Values in columns joining_date and last_visit_time were converted to datetime
Created new columns (increased model F1 score)
- joining_year
- joining_month
- joining_day
- diff (total days)
Label encoding for columns
- gender
- used_special_discount
- offer_application_preference
- past_complaint
- joined_through_referral
- membership_category
- feedback
One hot encoding for columns
- region_category
- preferred_offer_types
- medium_of_operation
- internet_option
- complaint_status
Dropped unnecessary columns
Tried various oversampling techniques, as churn risk scores 1 and 2 had far fewer data points than 3, 4, and 5
Tried various models and found that XGBoost and random forest worked best, with XGBoost having a slight edge
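
The cleaning, feature-creation, and encoding ideas above can be sketched with pandas roughly as follows. This is a minimal illustration under stated assumptions, not the code used for the submission: the function name `preprocess` and its `reference_date` parameter are hypothetical, `diff` is assumed to be the number of days between `joining_date` and that reference date, `churn_risk_score` is assumed to be the target column name, and `cat.codes` stands in for whatever label encoder was actually used. The handling of `last_visit_time`, the dropped columns, oversampling, and model training are left out.

```python
import pandas as pd

TARGET = "churn_risk_score"  # assumed name of the target column
FILL_VALUES = {"gender": "F", "medium_of_operation": "both"}
ABS_COLS = ["avg_time_spent", "points_in_wallet", "avg_frequency_login_days"]
LABEL_COLS = ["gender", "used_special_discount", "offer_application_preference",
              "past_complaint", "joined_through_referral",
              "membership_category", "feedback"]
ONE_HOT_COLS = ["region_category", "preferred_offer_types",
                "medium_of_operation", "internet_option", "complaint_status"]


def preprocess(df, reference_date):
    """Apply the listed ideas to whichever of the named columns are present."""
    df = df.copy()

    def present(cols):
        return [c for c in cols if c in df.columns]

    # Remove rows whose target is the invalid value -1 (training data only).
    if TARGET in df.columns:
        df = df[df[TARGET] != -1].reset_index(drop=True)

    # Targeted NaN replacements for gender and medium of operation.
    df = df.fillna({c: v for c, v in FILL_VALUES.items() if c in df.columns})

    # Treat negative values as sign errors; coerce to numeric first in case
    # a column was read as text (an assumption about the raw data).
    for col in present(ABS_COLS):
        df[col] = pd.to_numeric(df[col], errors="coerce").abs()

    # Date features; `diff` relative to reference_date is an assumption.
    joined = pd.to_datetime(df["joining_date"])
    df["joining_year"] = joined.dt.year
    df["joining_month"] = joined.dt.month
    df["joining_day"] = joined.dt.day
    df["diff"] = (pd.Timestamp(reference_date) - joined).dt.days
    df = df.drop(columns=["joining_date"])

    # Remaining NaNs: column mean for floats, forward fill otherwise.
    for col in df.columns:
        if pd.api.types.is_float_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].ffill()

    # Label encoding via integer category codes, then one-hot encoding.
    for col in present(LABEL_COLS):
        df[col] = df[col].astype("category").cat.codes
    return pd.get_dummies(df, columns=present(ONE_HOT_COLS))


# Toy frame with made-up values, only to show the call.
toy = pd.DataFrame({
    "gender": ["M", None, "F", "F"],
    "medium_of_operation": ["Desktop", "Smartphone", None, "Desktop"],
    "avg_time_spent": [-30.0, 45.5, None, 60.0],
    "joining_date": ["2017-08-17", "2016-01-05", "2015-11-30", "2016-06-01"],
    "churn_risk_score": [3, -1, 5, 4],
})
features = preprocess(toy, reference_date="2021-01-01")
```

When the same function is applied to the test set, the one-hot columns may need to be aligned with those of the training set, since a category missing from one file produces a missing column.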

Nilavan/churn-risk-score-prediction
