Churn risk score prediction

HackerEarth machine learning challenge: How NOT to lose a customer in 10 days

Contents

  • Problem
  • Task
  • Data description
  • Evaluation metric
  • Steps
  • Ideas
  • Areas to improve
  • Final submission

Problem

Churn rate is a marketing metric that describes the number of customers who leave a business over a specific time period. Every user is assigned a prediction value that estimates their state of churn at any given time. This value is based on:

  • User demographic information
  • Browsing behavior
  • Historical purchase data, among other information

It also factors in a proprietary prediction of how long a user will remain a customer. The score is updated every day for all users who have at least one conversion, and the assigned values range from 1 to 5.

↑ back to top

Task

Predict the churn risk score for a website's customers based on the features provided in the dataset.

↑ back to top

Data description

The dataset folder contains the following files:

  • train.csv: 36992 rows × 25 columns
  • test.csv: 19919 rows × 24 columns (no target column)

See the columns in the dataset here

↑ back to top

Evaluation metric

score = 100 * metrics.f1_score(actual, predicted, average="macro")
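
The metric is scikit-learn's macro-averaged F1 score scaled to 100. A minimal sketch of computing it locally, where actual and predicted are placeholders standing in for the true and predicted churn risk scores:

```python
from sklearn import metrics

# Placeholder arrays standing in for the true and predicted churn risk scores (1-5)
actual = [1, 2, 3, 4, 5, 3, 4]
predicted = [1, 2, 3, 4, 4, 3, 5]

# Macro averaging weights all five classes equally, regardless of class imbalance
score = 100 * metrics.f1_score(actual, predicted, average="macro")
print(f"score = {score:.5f}")
```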

↑ back to top

Steps

↑ back to top

Ideas

  1. NaNs in the gender column were replaced with 'F', inferred from the customer names
  2. Rows with churn risk score = -1 were removed
    • Trial 1: computed the correlation of every column with the churn risk score and found that replacing the -1 scores with 4 gave the best correlation
    • Trial 2: dropping the rows with a -1 score gave the best model accuracy, so they were removed
  3. NaNs in medium_of_operation were replaced with 'both', which increased the correlation with the churn risk score
  4. Incorrect negative values in the following columns were converted to positive (items 4-9 are sketched in the code after this list)
    • avg_time_spent
    • points_in_wallet
    • avg_frequency_login_days
  5. NaNs in the remaining columns were filled with the column mean for float columns and with forward fill (ffill) otherwise
  6. Values in columns joining_date and last_visit_time were converted to datetime
  7. Created new columns, which increased the model's F1 score
    • joining_year
    • joining_month
    • joining_day
    • diff (total days)
  8. Label encoding for columns
    • gender
    • used_special_discount
    • offer_application_preference
    • past_complaint
    • joined_through_referral
    • membership_category
    • feedback
  9. One hot encoding for columns
    • region_category
    • preferred_offer_types
    • medium_of_operation
    • internet_option
    • complaint_status
  10. Dropped unnecessary columns
  11. Tried various oversampling techniques, as churn risk scores 1 and 2 had far fewer data points than scores 3, 4, and 5
  12. Tried various models and found that XGBoost and random forest worked best, with XGBoost having an edge (see the resampling and modelling sketch after this list)
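
A rough, minimal sketch of the cleaning, feature-engineering, and encoding steps in items 4-9. The column names come from the dataset; the file path, the use of pandas get_dummies for one-hot encoding, and the reference date behind the diff feature are assumptions made for illustration, not necessarily what the notebooks do.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv("dataset/train.csv")  # path is an assumption; adjust to the repo layout

# Item 4: convert spurious negative values to positive
# (to_numeric guards against non-numeric entries; anything unparseable becomes NaN)
for col in ["avg_time_spent", "points_in_wallet", "avg_frequency_login_days"]:
    train[col] = pd.to_numeric(train[col], errors="coerce").abs()

# Item 5: mean-fill float columns, forward-fill everything else
for col in train.columns:
    if train[col].isna().any():
        if train[col].dtype == "float64":
            train[col] = train[col].fillna(train[col].mean())
        else:
            train[col] = train[col].ffill()

# Items 6-7: datetime conversion and date-derived features
train["joining_date"] = pd.to_datetime(train["joining_date"])
train["last_visit_time"] = pd.to_datetime(train["last_visit_time"], errors="coerce")  # converted but unused here
train["joining_year"] = train["joining_date"].dt.year
train["joining_month"] = train["joining_date"].dt.month
train["joining_day"] = train["joining_date"].dt.day
# "diff (total days)" is taken here as days since joining, measured against the latest
# joining date in the data; the exact reference used in the notebooks may differ
train["diff"] = (train["joining_date"].max() - train["joining_date"]).dt.days

# Item 8: label encoding
label_cols = ["gender", "used_special_discount", "offer_application_preference",
              "past_complaint", "joined_through_referral", "membership_category", "feedback"]
for col in label_cols:
    train[col] = LabelEncoder().fit_transform(train[col].astype(str))

# Item 9: one-hot encoding (dtype=int keeps the dummy columns numeric)
onehot_cols = ["region_category", "preferred_offer_types", "medium_of_operation",
               "internet_option", "complaint_status"]
train = pd.get_dummies(train, columns=onehot_cols, dtype=int)
```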
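
The resampling and model comparison in items 11-12 could look roughly like the sketch below, continuing from the train DataFrame of the previous sketch and assuming the target column is named churn_risk_score. SMOTE is just one of the oversampling techniques that may have been tried, and the hyperparameters are illustrative defaults rather than the ones behind the final submission.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Item 2: drop the rows whose churn risk score is the invalid value -1
train = train[train["churn_risk_score"] != -1]

# Keep only numeric features (a shortcut standing in for item 10's column dropping);
# shift the target to 0-4 because XGBClassifier expects zero-based class labels
X = train.drop(columns=["churn_risk_score"]).select_dtypes(include="number")
y = train["churn_risk_score"] - 1

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Item 11: oversample the rare classes (scores 1 and 2) on the training fold only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

# Item 12: compare XGBoost and random forest on the held-out fold using the challenge metric
for name, model in [
    ("xgboost", XGBClassifier(n_estimators=500, learning_rate=0.05)),
    ("random forest", RandomForestClassifier(n_estimators=500, random_state=42)),
]:
    model.fit(X_res, y_res)
    pred = model.predict(X_val)
    print(name, 100 * f1_score(y_val, pred, average="macro"))
```
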
↑ back to top

Areas to improve

  • Response coding instead of one-hot encoding (a sketch follows this list)
  • More feature engineering
  • Different methods of handling NaNs
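
Response coding replaces a categorical column with one feature per target class: the (smoothed) probability of observing that class given the category value. A hypothetical minimal sketch is below; the function name, smoothing constant, and column names are illustrative, and in practice the encoding should be fit on training folds only to avoid target leakage.

```python
import pandas as pd

def response_encode(df, col, target="churn_risk_score", alpha=1.0):
    """Replace `col` with one column per target class holding the
    Laplace-smoothed probability P(class | category value)."""
    counts = pd.crosstab(df[col], df[target])
    probs = (counts + alpha).div(counts.sum(axis=1) + alpha * counts.shape[1], axis=0)
    probs.columns = [f"{col}_resp_{c}" for c in probs.columns]
    return df.join(probs, on=col).drop(columns=[col])

# Toy example (category and target names are assumptions, not the notebook's API)
toy = pd.DataFrame({"region_category": ["Town", "City", "Town", "Village"],
                    "churn_risk_score": [3, 4, 5, 3]})
print(response_encode(toy, "region_category"))
```
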
↑ back to top

Final submission

  • Online score: 76.76408
  • Offline score: 76.64014
  • Rank: 61
↑ back to top