
Predicting Employee Attrition: Project Overview

Built a model that predicts whether an employee will leave (86% accuracy, 38% recall on test data), along with an interactive tool to help design retention interventions for employees who are high-risk and hard to replace

  • Used IBM's sample attrition dataset from Kaggle
  • Validated the data and explored features with respect to attrition
  • Reduced the dimensionality of compensation, experience, and sentiment features using PCA
  • Standardized features and compared cross-validated accuracy and recall across four model types: k-nearest neighbors, logistic regression, random forest, and stochastic gradient boosting
  • Optimized the highest CV performer (logistic regression) using grid search CV
  • Interpreted model coefficients to understand which features are likely to increase vs. decrease attrition risk, focusing on those employers can influence, in order to suggest retention intervention strategies
  • Built an interactive visualization to help identify individuals for retention intervention and the intervention strategies that could lower each individual's attrition risk

Data Source

IBM attrition dataset on Kaggle

Data Validation

  • No missing values, but some features had zero variance and would have been useless to the model (dropped, as sketched below)
  • Only 16% of observations represented employees who left, meaning a baseline model that guesses everyone stays would achieve 84% accuracy and 0% recall
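
A minimal sketch of these checks, assuming the public Kaggle CSV (the file name below, and which columns happen to be constant, are assumptions to verify against your copy):

```python
import pandas as pd

# Load the Kaggle IBM attrition sample (file name is an assumption)
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

# No missing values to impute
assert df.isna().sum().sum() == 0

# Drop zero-variance features, which carry no information for the model
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)

# Class balance: ~16% leavers, so "everyone stays" scores ~84% accuracy, 0% recall
leaver_rate = (df["Attrition"] == "Yes").mean()
print(f"Leaver rate: {leaver_rate:.1%}; baseline accuracy: {1 - leaver_rate:.1%}")
```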

Exploratory Analysis

I analyzed how numeric features were distributed for leavers vs. stayers and found that, on average, leavers:

  • had less experience
  • were paid less
  • had longer commutes
  • reported lower sentiment on employee survey measures
  • traveled more for work
  • worked overtime more often

(Figures: experience, monthly income, commute distance, survey satisfaction, business travel, and overtime, split by attrition status.)
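
A quick way to reproduce this comparison in pandas (continuing from the loading sketch above; the column names follow the IBM dataset but are worth verifying):

```python
# Mean of selected numeric features by attrition status; 'Yes' rows are leavers
numeric_cols = ["TotalWorkingYears", "MonthlyIncome", "DistanceFromHome"]
print(df.groupby("Attrition")[numeric_cols].mean())
```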

Many of the features theoretically covered similar constructs, so I built a correlation matrix to check for redundancy. I found strong correlations among features having to do with years of experience or some variant of that idea:

(Figure: correlation matrix of the experience-related features.)
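
One way to surface this redundancy is a heatmap of the correlation matrix; seaborn is an illustrative choice here, and the exact list of experience columns is an assumption:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Candidate experience-related columns (assumed; adjust to your feature list)
experience_cols = ["Age", "TotalWorkingYears", "YearsAtCompany",
                   "YearsInCurrentRole", "YearsSinceLastPromotion",
                   "YearsWithCurrManager", "NumCompaniesWorked"]
sns.heatmap(df[experience_cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```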

Dimension Reduction (PCA)

Because of the redundancy among the experience variables, I ran PCA to see whether they could be decorrelated and represented with fewer features. Conceptually, there were also several features related to compensation and several related to sentiment, so I checked whether those groups could be decorrelated and reduced as well. All three analyses revealed good candidates for PCA dimension reduction, going from 16 features to 7 overall. The plot below shows that the first 2 principal components explain 87% of the variance in the experience-related features:

(Figure: variance explained by principal components of the experience features.)
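
A sketch of that step (features are standardized first, since PCA is variance-based; `experience_cols` is the assumed list from the correlation sketch above):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then inspect how much variance each component captures
X_exp = StandardScaler().fit_transform(df[experience_cols])
print(PCA().fit(X_exp).explained_variance_ratio_.cumsum())

# Keep the first 2 components (~87% of variance) for modeling
exp_pcs = PCA(n_components=2).fit_transform(X_exp)
```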

Feature Engineering, Transformations, and Train/Test Split

  • 7 experience-related features were reduced to 2 PCs (87% of variance explained)
  • 3 compensation-related features were reduced to 2 PCs (99% of variance explained)
  • 5 sentiment-related features were reduced to 3 PCs (80% of variance explained)
  • Categorical features were dummy-encoded so sklearn could interpret them
  • Features without variance were removed
  • The dataset was split randomly into training (70%) and test (30%) sets
  • Training and test features were standardized after splitting, so no information 'leaks' from the test set into the training set, using StandardScaler (KNN requires standardization, and it helps interpretation when numeric features differ widely in scale and variance, as in this dataset); see the sketch after this list
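
A hedged sketch of the split-then-scale order (here `X` is built straight from dummies as a stand-in; in the notebook the PCA components replace the raw experience, compensation, and sentiment blocks):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Binary target and dummy-encoded features (stand-in for the PC-based matrix)
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Fit the scaler on the training set only to avoid test-set leakage
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```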

Model Fitting and Evaluation

The following model types were compared on accuracy and recall using 6-fold cross-validation:

(Figures: cross-validated accuracy and recall by model type.)
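
Roughly how that comparison can be run (hyperparameters are sklearn defaults here, which is an assumption; `subsample < 1` is what makes the gradient boosting 'stochastic'):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

models = {
    "knn": KNeighborsClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
    "stochastic gradient boosting": GradientBoostingClassifier(
        subsample=0.8, random_state=42),
}
for name, model in models.items():
    acc = cross_val_score(model, X_train_s, y_train, cv=6, scoring="accuracy")
    rec = cross_val_score(model, X_train_s, y_train, cv=6, scoring="recall")
    print(f"{name}: accuracy={acc.mean():.3f}, recall={rec.mean():.3f}")
```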

Logistic regression was the clear winner when combining the two metrics. Recall matters here because the cost of an employee leaving whom we failed to identify or retain is greater than the cost of falsely flagging a low-risk person as risky.

The logit model was then tuned using grid search cross-validation; the best model achieved 86% accuracy (2 points over guessing everyone stays) and 38% recall (38 points over guessing everyone stays).
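
An illustrative grid for that tuning step; the notebook's actual parameter grid is not shown in this README, so the values below are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumed search space; liblinear supports both l1 and l2 penalties
param_grid = {"C": [0.01, 0.1, 1, 10],
              "penalty": ["l1", "l2"],
              "solver": ["liblinear"]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=6, scoring="recall")
grid.fit(X_train_s, y_train)
best_model = grid.best_estimator_
print(grid.best_params_)
print(f"Test accuracy: {best_model.score(X_test_s, y_test):.1%}")
```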

Interpretation

An examination of the model coefficients shows which features had the strongest relationships with attrition. I found it helpful to distinguish features that employers have some control over from those that cannot or should not be altered, in order to suggest possible retention interventions. The model suggests the following interventions (in order of coefficient size, i.e., strength of relationship):

  • Cap hours, ensuring the employee does not work overtime very often
  • Pay the employee more
  • Increase the employee's stock options
  • Reduce how often the employee must travel for work
  • Allow work-from-home days or pay for relocation closer to the office (the distance-from-home effect probably reflects long commutes)

The coefficients on the sentiment features are puzzling and deserve follow-up analysis.

(Figure: logistic regression coefficients by feature.)
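
Reading the strongest relationships off the fitted model is straightforward (continuing from the grid search sketch; positive coefficients raise predicted attrition risk):

```python
import pandas as pd

# Rank features by coefficient magnitude; sign gives the direction of the effect
coefs = pd.Series(best_model.coef_[0], index=X.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))
```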

Use Case

Lastly, I built an interactive visualization tool to help narrow the retention intervention pool to employees who are theoretically difficult to replace (high performers, high job levels, critical roles). The tool displays the model's leave probabilities so you can focus on employees more likely to leave. The tooltip then shows the individual's values for the retention levers our model suggested might lower attrition risk (overtime hours, pay, stock options, travel, commute). Below is a screenshot; the interactivity is preserved at the bottom of the HTML version of my 'Attrition Prediction' notebook.

(Screenshot: interactive retention intervention tool.)
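
A sketch of how the probabilities behind the tool can be produced; the `JobLevel`/`PerformanceRating` filter below is only one assumed way to operationalize 'hard to replace':

```python
import pandas as pd

# Model's leave probabilities for the held-out employees
risk = pd.DataFrame({"leave_prob": best_model.predict_proba(X_test_s)[:, 1]},
                    index=X_test.index)

# Narrow to theoretically hard-to-replace employees (assumed criteria)
hard_to_replace = df.loc[X_test.index].query(
    "JobLevel >= 4 or PerformanceRating >= 4")
shortlist = risk.loc[risk.index.intersection(hard_to_replace.index)]
print(shortlist.sort_values("leave_prob", ascending=False).head(10))
```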

The idea is that the model identifies at-risk employees and their retention levers, while you identify which employees are hard to replace and which levers are relevant to each. Because the model is purely observational, we should track interventions and subsequent outcomes (whether the employee left over a given period) to learn which interventions have a causal link to attrition.