This dataset can be retrieved from UCI Machine Learing Repository.
The original website is at link.
The dataset contains 740 observations and 21 features on the profiles of employees who report their absenteeism.
Features include Reasons for absence,Seasons, Transportation Expense, etc.
Logistic regression, randomForest, and Gradient boosting are selected to model such phenomena.
As a result, RandomForest is the best to model absenteeism at work and the overall auc is 0.87.
The prediction panel is hosted on GCP flask and user can input their profile and get a prediction of whether they will be absent from work.
Here is a demo about how the system works:
Step 1: Clone this repository
git clone https://github.com/shangwenyan/IDS721FinalProject.git
Step 2:Create virtual environment if none exists (optional)
virtualenv --python $(which python3) venv
source venv/bin/activate
Step 3: Install all the required packages
pip install -r requirements.txt
Step 4: Run the following command to build a tmeporary testing website
python main.py