
Tag-along project for the Machine Learning: Classification course

Primary Language: Jupyter Notebook

Telco Customer Churn


Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.


In the dataset, each row represents a customer and each column contains one of the customer’s attributes, as described in the column metadata.

The data set includes information about:

  • Customers who left within the last month – the column is called Churn

  • Services that each customer has signed up for:

    • phone,

    • multiple lines,

    • Internet,

    • online security,

    • online backup,

    • device protection,

    • tech support, and

    • streaming TV and movies

  • Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

  • Demographic info about customers – gender, age range, and if they have partners and dependents

Project Instructions


  • Perform initial data preparation by converting the 'TotalCharges' column to numeric values and filling missing values with 0.
  • Convert the 'Churn' column to binary values, where 'No' is mapped to 0 and 'Yes' is mapped to 1.
  • Split the data into an 80-20 train-test split with a random state of 1. Select these features:
    categorical = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
    numerical = ['tenure', 'MonthlyCharges', 'TotalCharges']
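The preparation steps above can be sketched roughly as follows. This is a minimal sketch, not the course's reference solution: it assumes the raw data is already loaded into a pandas DataFrame, and the helper names `prepare_telco` and `split_telco` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

categorical = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
               'PhoneService', 'MultipleLines', 'InternetService',
               'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
               'TechSupport', 'StreamingTV', 'StreamingMovies',
               'Contract', 'PaperlessBilling', 'PaymentMethod']
numerical = ['tenure', 'MonthlyCharges', 'TotalCharges']


def prepare_telco(df):
    """Hypothetical helper: clean 'TotalCharges' and binarize 'Churn'."""
    df = df.copy()
    # errors="coerce" turns entries that cannot be parsed as numbers
    # (for example blank strings) into NaN, which is then filled with 0.
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce').fillna(0)
    # Map the target to 0/1 as the instructions require.
    df['Churn'] = df['Churn'].map({'No': 0, 'Yes': 1})
    return df


def split_telco(df):
    """Hypothetical helper: 80-20 split of the selected features, random state 1."""
    X = df[categorical + numerical]
    y = df['Churn']
    return train_test_split(X, y, test_size=0.2, random_state=1)
```

Calling `X_train, X_test, y_train, y_test = split_telco(prepare_telco(df))` would then give the four pieces used in the later steps.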

Feature engineering

  • Scale the numerical features using StandardScaler, then convert the output back to a dataframe and restore the column names.
  • One-hot encode the categorical features using OneHotEncoder (with sparse_output set to False), then convert the output back to a dataframe and restore the column names.
  • Combine the scaled numerical and one-hot encoded categorical features into train and test set dataframes (use pd.concat).
  • Use scikit-learn to train a random forest and an extra trees classifier, and use xgboost and lightgbm to train an extreme gradient boosting model and a light gradient boosting model. Use random_state = 1 for training all models and evaluate on the test set. Answer the following questions:
