HackerEarth machine learning challenge: How NOT to lose a customer in 10 days
Churn rate is a marketing metric that describes the number of customers who leave a business over a specific time period. Every user is assigned a prediction value that estimates their state of churn at any given time. This value is based on:
- User demographic information
- Browsing behavior
- Historical purchase data among other information
It factors in our unique and proprietary predictions of how long a user will remain a customer. This score is updated every day for all users who have a minimum of one conversion. The values assigned are between 1 and 5.
The task is to predict the churn score for a website based on the features provided in the dataset.
The dataset folder contains the following files:
- train.csv: 36992 x 25
- test.csv: 19919 x 24
See the columns in the dataset here
score = 100 x metrics.f1_score(actual, predicted, average="macro")
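To make the metric concrete, here is a minimal pure-Python sketch of a macro-averaged F1 score scaled to 100. It is meant to mirror what `metrics.f1_score(actual, predicted, average="macro")` computes, and is not the challenge's own scoring code. The helper name `macro_f1` and the toy labels are assumptions for illustration.

```python
def macro_f1(actual, predicted):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(actual) | set(predicted))
    per_class = []
    for label in labels:
        pairs = list(zip(actual, predicted))
        tp = sum(a == label and p == label for a, p in pairs)
        fp = sum(a != label and p == label for a, p in pairs)
        fn = sum(a == label and p != label for a, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true positives contributes an F1 of 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy churn risk scores (made up for this example).
actual = [1, 2, 3, 3, 4, 5, 5, 5]
predicted = [1, 3, 3, 3, 4, 5, 5, 4]
score = 100 * macro_f1(actual, predicted)
print(round(score, 2))
```

Because every class counts equally in the macro average, missing the single example of class 2 costs as much as misclassifying a whole majority class, which is why the rare scores 1 and 2 matter so much here.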
- Load data
- Preprocess data
- Perform exploratory data analysis
- Engineer features
- Build and test different models
- Make predictions using the best model (XGBoost)
- Submit
- NaNs in the gender column were replaced with F, based on customer names
- Rows with churn risk score = -1 were removed
  - Trial 1: computed the correlation of every column with the churn risk score column; replacing the -1 scores with 4 gave the best correlation
  - Trial 2: removing the rows with a -1 score gave the best model accuracy
- NaNs in medium of operation were replaced with 'both' (this increased the correlation with churn risk score)
- Incorrect negative values were converted to positive in these columns:
  - avg_time_spent
  - points_in_wallet
  - avg_frequency_login_days
- NaNs in the remaining columns were filled with the mean for float columns and with the ffill method otherwise
- Values in the joining_date and last_visit_time columns were converted to datetime
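The preprocessing steps above could be sketched in pandas roughly as follows. This is an illustration under assumptions, not the original notebook code: the column names `churn_risk_score` and `medium_of_operation`, the time format of `last_visit_time`, the function name `preprocess`, and the toy rows are all assumed, and the name-based gender lookup is reduced to a plain fill with "F".

```python
import numpy as np
import pandas as pd

NEGATIVE_FIX_COLS = ["avg_time_spent", "points_in_wallet", "avg_frequency_login_days"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows whose target is the invalid placeholder -1.
    df = df[df["churn_risk_score"] != -1].copy()
    # Fill the categorical gaps with the values named in the list above.
    df["gender"] = df["gender"].fillna("F")
    df["medium_of_operation"] = df["medium_of_operation"].fillna("both")
    # Negative values in these columns are treated as sign errors.
    for col in NEGATIVE_FIX_COLS:
        df[col] = df[col].abs()
    # Remaining NaNs: column mean for floats, forward fill for everything else.
    float_cols = df.select_dtypes(include="float").columns
    df[float_cols] = df[float_cols].fillna(df[float_cols].mean())
    df = df.ffill()
    # Parse the date and time columns (the time format is an assumption).
    df["joining_date"] = pd.to_datetime(df["joining_date"])
    df["last_visit_time"] = pd.to_datetime(df["last_visit_time"], format="%H:%M:%S")
    return df.reset_index(drop=True)


# Toy rows, made up for this example.
raw = pd.DataFrame({
    "gender": ["M", np.nan, "F", "M"],
    "medium_of_operation": ["Desktop", np.nan, "Smartphone", "Both"],
    "avg_time_spent": [300.5, -120.0, np.nan, 50.0],
    "points_in_wallet": [700.0, -50.0, 600.0, 10.0],
    "avg_frequency_login_days": [10.0, 5.0, -3.0, 1.0],
    "joining_date": ["2017-08-17", "2016-11-11", "2015-03-01", "2017-01-01"],
    "last_visit_time": ["16:08:02", "12:38:13", "22:53:21", "01:00:00"],
    "churn_risk_score": [2, 5, 3, -1],
})
clean = preprocess(raw)
print(clean)
```

Taking absolute values before computing the column means keeps the sign errors from dragging the fill value down.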
- Created new columns (increased the model's F1 score)
  - joining_year
  - joining_month
  - joining_day
  - diff (total days)
- Label encoding for columns
  - gender
  - used_special_discount
  - offer_application_preference
  - past_complaint
  - joined_through_referral
  - membership_category
  - feedback
- One hot encoding for columns
  - region_category
  - preferred_offer_types
  - medium_of_operation
  - internet_option
  - complaint_status
- Dropped unnecessary columns
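A rough pandas sketch of the feature engineering steps listed above is shown here. Several details are assumptions: the source does not say which two dates `diff` is measured between, so a hypothetical `reference_date` is used; label encoding is done with pandas category codes rather than any particular encoder; and the function name and toy rows are invented for the example.

```python
import pandas as pd


def engineer_features(df, label_cols, one_hot_cols, reference_date="2021-01-01"):
    df = df.copy()
    joined = pd.to_datetime(df["joining_date"])
    # Split the joining date into separate numeric parts.
    df["joining_year"] = joined.dt.year
    df["joining_month"] = joined.dt.month
    df["joining_day"] = joined.dt.day
    # Assumption: diff is the number of days from joining to a fixed reference date.
    df["diff"] = (pd.Timestamp(reference_date) - joined).dt.days
    # Label encoding: each category becomes an integer code (alphabetical order).
    for col in label_cols:
        df[col] = df[col].astype("category").cat.codes
    # One hot encoding: one indicator column per category value.
    df = pd.get_dummies(df, columns=one_hot_cols)
    # The raw date is no longer needed once its parts are extracted.
    return df.drop(columns=["joining_date"])


# Toy rows using a subset of the columns, made up for this example.
sample = pd.DataFrame({
    "joining_date": ["2017-08-17", "2016-11-11"],
    "gender": ["M", "F"],
    "past_complaint": ["No", "Yes"],
    "region_category": ["Town", "City"],
    "internet_option": ["Wi-Fi", "Mobile_Data"],
})
features = engineer_features(
    sample,
    label_cols=["gender", "past_complaint"],
    one_hot_cols=["region_category", "internet_option"],
)
print(features.columns.tolist())
```

Label encoding suits columns with two values or a natural order, while one hot encoding avoids implying an order among values such as regions.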
- Tried various oversampling techniques, as churn risk scores 1 and 2 had far fewer data points than 3, 4, and 5
- Tried various models and found that XGBoost and random forest worked best, with XGBoost having an edge
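As an illustration of the modeling step, the sketch below balances the classes with plain random oversampling and scores a random forest with the challenge metric. It is a stand-in, not the submitted pipeline: the source does not name the oversampling techniques that were tried, the data is synthetic, the helper `random_oversample` is hypothetical, and scikit-learn's random forest is used in place of the XGBoost model that performed best.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def random_oversample(X, y, seed=0):
    """Add resampled copies of minority rows until every class matches the largest."""
    data = X.assign(_target=y)
    target_size = data["_target"].value_counts().max()
    parts = []
    for _, group in data.groupby("_target"):
        extra = group.sample(n=target_size - len(group), replace=True, random_state=seed)
        parts.append(pd.concat([group, extra]))
    balanced = pd.concat(parts)
    return balanced.drop(columns="_target"), balanced["_target"]


# Synthetic data: scores 1 and 2 are rare, as in the challenge dataset.
rng = np.random.default_rng(0)
y = pd.Series(np.repeat([1, 2, 3, 4, 5], [20, 30, 150, 150, 150]))
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.5, len(y)),
    "noise": rng.normal(0, 1, len(y)),
})

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
# Oversample only the training fold so the validation fold stays untouched.
X_bal, y_bal = random_oversample(X_train, y_train)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
score = 100 * f1_score(y_valid, model.predict(X_valid), average="macro")
print(round(score, 2))
```

Oversampling before the split would leak duplicated rows into the validation fold and inflate the offline score, so the split comes first.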
- Response coding instead of one hot encoding
- More feature engineering
- Different methods of handling NaNs
- Online score: 76.76408
- Offline score: 76.64014
- Rank: 61