The RH Analytics project aims to streamline the employee promotion process at a large multinational corporation (MNC) by predicting which employees are eligible for promotion, for positions at manager level and below. The goal is to identify likely candidates at a checkpoint in the cycle so that promotions can be expedited and delays reduced.
The client faces delays in its promotion cycle because final promotions are announced only after training and evaluations are complete. This project uses predictive analytics to identify candidates for promotion earlier in the process, based on performance metrics and other employee data.
- Training Data: `train.csv`
- Test Data: `test.csv`
The model's performance is evaluated using the F1 Score, which balances precision and recall.
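To make the metric concrete, here is a minimal sketch that computes precision, recall, and F1 by hand for a binary target. The labels are made up for illustration (1 = promoted, 0 = not promoted) and do not come from the project's data; the helper name `f1_score_binary` is likewise hypothetical.

```python
def f1_score_binary(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged but not promoted
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # promoted but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only: 3 promoted employees out of 10.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

precision, recall, f1 = f1_score_binary(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.67 f1=0.57
```

In this toy example 7 of the 10 predictions are correct, yet F1 is only about 0.57. On an imbalanced target, accuracy can look healthy while the positive class is predicted poorly, which is why F1 is the more informative number here.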
- Data Loading: Read the CSV files using pandas.
- Exploration: Conduct exploratory data analysis (EDA) to understand the distribution of data, missing values, and correlations.
- Categorization: Classify variables into categories such as employee information, demographics, and performance metrics.
- Handling Missing Values: Fill missing values in categorical columns with the mode.
- Feature Encoding:
  - Numerical Features: Scale numerical features using `StandardScaler`.
  - Categorical Features: One-hot encode categorical features.
  - Target Encoding: Apply target encoding to the `region` column.
- Handling Imbalance: Use SMOTE to address class imbalance.
- Preprocessing:
  - Numerical features: StandardScaler
  - Categorical features: OneHotEncoder
  - Target encoding: TargetEncoder
- Model: CatBoostClassifier
- Integration: Combine preprocessing, resampling, and modeling in a single imbalanced-learn Pipeline.
- Grid Search: Optimize hyperparameters using GridSearchCV.
- Metrics: Evaluate model performance using F1 Score.
- Best Parameters: Parameters identified through GridSearchCV.
- Test F1 Score: F1 Score of the best model on the test set.
- Accuracy: Accuracy score of the final model.
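The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the project's actual code: the column lists are supplied by the caller, the hyperparameter values are illustrative, numeric columns are assumed to have no missing values, and SMOTE is placed inside the pipeline so that resampling happens only on the training folds during cross-validation.

```python
# Hyperparameter grid; the values are illustrative, not the project's tuned settings.
PARAM_GRID = {
    "model__depth": [4, 6, 8],
    "model__learning_rate": [0.03, 0.1],
    "model__iterations": [300, 500],
}


def build_search(numeric_cols, onehot_cols, target_cols, cv=5):
    """Return an unfitted GridSearchCV wrapping preprocessing, SMOTE, and CatBoost."""
    # Imported here so PARAM_GRID can be inspected without the heavy dependencies.
    from category_encoders import TargetEncoder
    from catboost import CatBoostClassifier
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Fill missing categorical values with the mode, then one-hot encode.
    onehot = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore"),
    )

    preprocess = ColumnTransformer(
        [
            ("num", StandardScaler(), numeric_cols),     # assumes no missing values
            ("cat", onehot, onehot_cols),
            ("target", TargetEncoder(), target_cols),    # e.g. ["region"]
        ],
        sparse_threshold=0,  # force a dense matrix for the steps that follow
    )

    # imblearn's Pipeline applies the sampler during fit only, never at predict time.
    pipeline = Pipeline(
        [
            ("preprocess", preprocess),
            ("smote", SMOTE(random_state=42)),
            ("model", CatBoostClassifier(verbose=0, random_seed=42)),
        ]
    )

    return GridSearchCV(pipeline, PARAM_GRID, scoring="f1", cv=cv, n_jobs=-1)
```

Calling `build_search(...)` and then `.fit(X_train, y_train)` on a pandas DataFrame and target would populate `best_params_` and `best_score_`, where the score is the mean cross-validated F1. The feature column names other than `region` are not given in this README, so they must be taken from the actual data.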
pip install category_encoders xgboost ydata_profiling imbalanced-learn catboost
The code performs the following steps:
- Load and inspect the data.
- Preprocess the data.
- Define and tune the model pipeline.
- Evaluate and report the results.
- Load Data: Read the `train.csv` and `test.csv` files.
- Preprocess Data: Handle missing values and encode features.
- Train Model: Fit the model pipeline on training data.
- Evaluate Model: Assess performance using F1 Score and other metrics.
This project is licensed under the MIT License. See the LICENSE file for details.