Customer churn is the number or percentage of customers who stop using a product or unsubscribe during a given period. Common causes include customer dissatisfaction, cheaper offers from competitors, and better marketing by competitors.
In a growing business, the cost of acquiring new customers is far greater than the cost of keeping existing ones. Customer churn leads to lost revenue and damages the company's reputation, which in turn makes it harder to attract new customers.
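The definition of churn as a percentage can be written as a simple ratio: customers lost during the period divided by customers at the start of the period. The sketch below is only an illustration; the function name `churn_rate` and the figures are hypothetical and are not taken from the dataset.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the fraction of customers lost during a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return customers_lost / customers_at_start


# Hypothetical figures: 1,000 customers at the start of the month, 50 of whom left.
rate = churn_rate(1000, 50)
print(f"Monthly churn rate: {rate:.1%}")
```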
- Analyze the factors in this dataset that are potential causes of customer churn
- Build a machine learning model to predict customer churn
The dataset is provided by Kaggle for analyzing and predicting customer churn. It is sample data from IBM consisting of 7043 samples and 21 columns, described as follows:
- customerID: Customer ID
- gender: Whether the customer is a male or a female
- SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
- Partner: Whether the customer has a partner or not (Yes, No)
- Dependents: Whether the customer has dependents or not (Yes, No)
- tenure: Number of months the customer has stayed with the company
- PhoneService: Whether the customer has a phone service or not (Yes, No)
- MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
- InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
- OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
- TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
- StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
- StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
- Contract: The contract term of the customer (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
- PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- Churn: Whether the customer churned or not (Yes or No)
Build supervised machine learning models with 4 algorithms:
- Logistic Regression
- XGBoost Classifier
- Random Forest
- K-Nearest Neighbors
Since the dataset is imbalanced, I ran 4 experiments:
- Training models on the imbalanced dataset
- Training models on a balanced dataset (random under-sampling)
- Training models on a balanced dataset (random over-sampling)
- Training models on a balanced dataset (SMOTE)
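To make the difference between the two random resampling strategies concrete, here is a minimal standard-library sketch. It is not the code used in the experiments: the helper names and the toy rows are hypothetical. In practice a library such as imbalanced-learn (`RandomUnderSampler`, `RandomOverSampler`, `SMOTE`) would typically be used. SMOTE is not shown; instead of duplicating rows, it creates synthetic minority rows by interpolating between neighbors.

```python
import random
from collections import Counter


def _group_by_label(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))  # sampling without replacement
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_over_sample(rows, labels, seed=0):
    """Randomly duplicate rows until every class is as large as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (hypothetical): 8 retained customers and 2 churned customers.
# Each row is (tenure, MonthlyCharges).
rows = [(month, 20.0 + month) for month in range(10)]
labels = ["No"] * 8 + ["Yes"] * 2

under_rows, under_labels = random_under_sample(rows, labels)
over_rows, over_labels = random_over_sample(rows, labels)

print("original:    ", Counter(labels))
print("under-sample:", Counter(under_labels))
print("over-sample: ", Counter(over_labels))
```

Under-sampling discards majority-class rows, while over-sampling keeps every row and repeats minority-class ones. Resampling is normally applied to the training split only, so that the test set keeps the original class distribution.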
Best Model:
- Because the dataset is imbalanced, evaluation is based on Recall, F1 Score, and AUC Score
- Logistic Regression trained on a balanced dataset (random over-sampling) achieved the highest Recall, F1 Score, and AUC Score
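The following sketch illustrates why Recall, F1 Score, and AUC are more informative than accuracy on an imbalanced dataset. It uses only the standard library; the function names, labels, and scores are made up for illustration and are not results from this project. scikit-learn's `recall_score`, `f1_score`, and `roc_auc_score` compute the same quantities.

```python
def recall_f1(y_true, y_pred, positive=1):
    """Return (recall, F1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels (1 = churned) and model scores for 10 customers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

# A baseline that never predicts churn.
always_no = [0] * len(y_true)

print("model    accuracy:", accuracy(y_true, y_pred), "recall/F1:", recall_f1(y_true, y_pred))
print("baseline accuracy:", accuracy(y_true, always_no), "recall/F1:", recall_f1(y_true, always_no))
print("model AUC:", roc_auc(y_true, scores))
```

The always-"No" baseline still reaches 70% accuracy on this toy data while finding no churned customers at all, which is the failure mode that Recall and F1 expose.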