- Classification Modeling project to predict customer churn in a Teclecommunications database from SQL
- Project created using the data science pipeline (Acquisition, Preparation, Exploration, Analysis & Statistical Testing, and finally Modeling and Evaluation)
- Target: Use the initial phases of the data science pipeline to discover drivers of churn and create a model that can predict whether a customer will churn using those derived parameters
Target | Description | Data Type |
---|---|---|
Churn | Column indicating whether or not a customer will churn originally valued as 'Yes' or 'No' | Object |
Column Name | Description | Data Type |
---|---|---|
payment_type_id | Numerical identity of the payment_type column (1, 2, 3, 4) | int64 |
internet_service_type_id | Numerical identity of the internet_service column (1, 2, 3) | int64 |
contract_type_id | Numerical identity of the contract_type column (1, 2, 3) | int64 |
customer_id | Alphanumeric code for unique customer identity | object |
gender | Categorical variable determining customer gender (male, female) | object |
senior_citizen | Categorical variable determining customer's age represented as 0 or 1 | int64 |
partner | Categorical variable determining if a customer has a partner or not | object |
dependants | Categorical variable determining if a customer has dependants | object |
tenure | Numerical value determining how many months a customer has been with the company from their origin | int64 |
phone_service | Categorical variable of the phone service for each customer (Yes, No) | object |
multiple_lines | Categorical variable of the customer's number of phone line status (Yes: Multiple, No: 1 line, No phone service) | object |
paperless_billing | Categorical variable of the customer's payment arrangement for monthly charges (Yes: paperless billing, No: no paperless) | object |
monthly_charges | Numerical value determining the customer's monthly charges for service | float64 |
total_charges | Numerical value determining the customer's total charges since point of origin with the company | converted to float64 from object |
contract_type | Categorical variable determining the type of contract the customer is in (Month-to-Month, Two year, One year) | object |
internet_service_type | Categorical variable determining the type of internet a customer is using (Fiber optic, DSL, None) | object |
payment_type | Categorical variable determining the type of payment arrangement a customer has agreed too (Electronic Check, Mailed Check, Bank transfer, Credit card) | object |
Column Name | Description | Data Type |
---|---|---|
month-to-month | categorical variable with a boolean value of 1 if customer is not in a contract or 0 if they are | int64 |
fiber | categorical variable with a boolean value of 1 if a customer has fiber internet and 0 if they do not | int64 |
e_check | categorical variable with a boolean value of 1 if the customer uses electronic checks and 0 if they don't | int64 |
2_contract | categorical variable with a boolean value of 1 if a customer has a 2 year contract and 0 if not | int64 |
Question 1: Is there a relationship between customers with paperless billing and whether or not they churned
- Null Hypothesis: There is no relationship between paperless billing and whether or not a customer has churned
- Alternate Hypothesis: There is a relationship between paperless billing customers and whether or not they have churned
After some chi squared testing we determined there is a significant relationship between paperless billing customers and whether or not they have churned
Question 2: Is there a relationship between if a customer has multiple lines and whether or not they have churned
- Null Hypothesis: There is no relationship between customers having multiple lines and whether or not they have churned
- Alternate Hypothesis: There is a relationship between customers having multiple lines and whether or not they have churned
After some chi squared testing we can determine there is a significant relationship between customers who have multiple lines and whether or not they have churned.
- After lots of feature and parameter tweaking in conclusion the Decision Tree model performed the best and was used on the test dataframe.
- Accuracy was 76.72% compared to the basline of 73.47%
- Precision was a solid 82.5%
- The best recall achieved was 60.8%
- With more time I think more feature engineering could be used to better optimize for recall to improve the True Positive rate.
- Use the prediction model to uncover 6 out of every 10 customers who will churn and target them with greater incentives or promotions.
- Given the models good precision and decent recall prediction we can use it to save an estimated 60% of the churn customers.
-
You will need your own env.py file with your Codeup database credentials to use the sql_connect function
-
Read the readme file and download the acquire, prepare, and evaluate.py files. Use the functions to recreate the dataframes and seperate the data ad the final_notebook suggest. Follow the comments to complete statistical testing and finally use the visualizations to create your own parameters and build the models.
-
Last but not least choose your best models from the training dataframe to run a validate test. Then choose your best validate scored model to run the model on the test dataframe.
-
Finish with concluding remarks and reccomendations for business use of the model.