/IBM-Attrition

Why do employees leave? This project first compares the predictive performance of three different models, then uses the best model to help reveal the top contributing factors.

Primary LanguageJupyter Notebook

IBM Attrition

For this notebook, I use the IMB Attrition dataset (available on Kaggle) to do the following:

  • Compare the predictive performance of Logistic Regression, Random Forest, and XGBoost models using Confusions Matrices and ROC Curves
  • Explore relative feature importance with respect to employee attrition for the XGBoost model using Gini Importances and Permutation_Importance
  • Further explore feature importance through visualization tools PDP Plots, PDP Interaction Plots, and Shapley Waterfall Plots
  • Use insights from the feature importances to perform dimensionality reduction
  • Rerun models on smaller feature matrix

Insights

  • The three models performed similarly on the original feature matrix, with the XGBoost slightly edging out the others
  • Some important features included 'OverTime', 'NumberofCompaniesWorked', and 'BusinessTravel'
  • After simplifying the model by reducing dimensionality, the simple Logistic Regression outperformed the others, which suffered from overfitting to the training data