This project requires Python 3 with a GPU hardware accelerator (for enhanced performance) and the following Python libraries installed:
- [NumPy]
- [Pandas]
- [matplotlib]
- [scikit-learn]
- [seaborn]
- [xgboost]
Set up the environment from the environment_insurance.yml file before you start.
Training data is provided by ########. I have split the data into train and test sets in an 8:2 ratio.
The code is divided into two sections:
- Python Notebook (under the Code tab) - used for data manipulation, cleaning, visualization, and model selection
  - insurance.ipynb - notebook used for the data exploration and model selection process
  - model_pipeline_full.ipynb - used to train the selected model and to create a pipeline for future use
- Automation - scripts that use the pickled pipeline of the trained model to run on future test or unseen data
  - autoscript.py - checks for new data in the Input folder and runs the model only if new data is present
  - predict.py - uses the pickled trained model pipeline to predict the claim amount and saves it with the claim number in the Output folder
  - config.ini - can be used to change the folder locations easily for the automation code
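
The automation flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of autoscript.py or config.ini: the `[paths]` section, the `input_dir` and `output_dir` keys, the helper names, and the rule that a file counts as "new" when no file of the same name exists in the Output folder are all assumptions.

```python
import configparser
from pathlib import Path


def load_folders(config_path):
    """Read the Input and Output folder locations from an INI file.

    The [paths] section and its key names are hypothetical.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)
    paths = parser["paths"]
    return Path(paths["input_dir"]), Path(paths["output_dir"])


def pending_files(input_dir, output_dir):
    """Return input CSV files that have no matching file in the output folder.

    Assumption: a file is "new" if its name does not yet appear in Output.
    """
    done = {p.name for p in output_dir.glob("*.csv")}
    return sorted(p for p in input_dir.glob("*.csv") if p.name not in done)
```

A driver script could call `pending_files(...)` and invoke the prediction step only when the returned list is non-empty.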
Data visualization is done using a seaborn heatmap and a pandas scatter matrix to visualize the correlation among features.
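
As a rough sketch of these two plots, the snippet below uses a small synthetic DataFrame. The column names `f1` to `f4`, the output file names, and the colour settings are placeholders, not the project's real features or notebook code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: f4 is deliberately correlated with f1.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
df["f4"] = 2 * df["f1"] + rng.normal(scale=0.1, size=100)

# Pairwise Pearson correlations, drawn as an annotated heatmap.
corr = df.corr()
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
fig.savefig("correlation_heatmap.png")

# Scatter plots for every feature pair, with histograms on the diagonal.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6))
plt.savefig("scatter_matrix.png")
```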
Data preprocessing uses PCA to reduce the dimensionality of the data, and PowerTransformer to transform the features toward a normal distribution.
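
A minimal sketch of such a preprocessing pipeline with scikit-learn is shown below. The order of the two steps (PowerTransformer first, then PCA), the number of components, and the synthetic data are assumptions made for illustration. The README does not state the settings used in the notebooks.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed data standing in for the real features.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 6))

preprocess = Pipeline([
    # Yeo-Johnson transform; by default the output is also standardized.
    ("power", PowerTransformer(method="yeo-johnson")),
    # Keep 3 principal components (an arbitrary choice for this sketch).
    ("pca", PCA(n_components=3)),
])

X_reduced = preprocess.fit_transform(X)
print(X_reduced.shape)
```

Wrapping both steps in a `Pipeline` means the same fitted transforms can later be applied to unseen data with `preprocess.transform(...)`.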
Three ML approaches are discussed and implemented in this project.
The model is tested on 20% of the original training data, held out by the 8:2 split. The remaining 80% is used for model training and validation. The metric used to measure model performance is mean absolute error (MAE): the average of the absolute differences between the predicted and actual Incurred values on the test data. The goal of every ML approach implemented is to reduce MAE.
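
The split and the metric can be illustrated with a short NumPy sketch. The helper names, the random seed, and the example values are hypothetical. The notebooks may instead use the equivalent scikit-learn utilities.

```python
import numpy as np


def split_80_20(n_rows, seed=0):
    """Return shuffled row indices for an 80% train and 20% test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    cut = int(n_rows * 0.8)
    return idx[:cut], idx[cut:]


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


train_idx, test_idx = split_80_20(10)

# Made-up actual and predicted values; the absolute errors are 10, 10 and 30.
mae = mean_absolute_error([100.0, 250.0, 400.0], [110.0, 240.0, 430.0])
```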
The MAE scores from all three implementations are reported, and the model with the best (lowest) MAE is chosen as the final model.