Focus on what matters and let go of what doesn't!
- Project Motivation
- Methods Used
- File Descriptions
- Results
- Installation
- Licensing, Authors, and Acknowledgements
## Project Motivation
In this project we look at ways to make the insurance claims process more efficient. Efficiencies in insurance claims severity analysis can help provide suitable insurance packages to customers and offer targeted assistance to better serve them. Finally, we would like to build a model that predicts the severity of claims in order to improve the claims service and ensure a worry-free customer experience.
We use a dataset from Kaggle provided by Allstate, a US-based insurance company. The training dataset consists of 130 attributes (features) plus the loss value for each observation, and it contains 188,318 observations, where each row represents an insurance claim. This means each claim involves 130 different pieces of information. So, the main questions are:
- Do we require all these attributes/information?
- Can we eliminate any of these attributes to be more efficient? If yes, then:
  - which continuous variables are least important and can be dropped?
  - which categorical variables are least important and can be dropped?
- Which attributes are most important for Allstate?
- Finally, can we create an algorithm to predict claims severity?
## Methods Used
- Exploratory data analysis to understand the Allstate insurance claim dataset
- Feature selection and elimination using Correlation, Constant Variance and Chi-Square statistical tests
- Use PCA and Feature Importances to find the most important features
- Understanding ensemble Machine Learning algorithms
- Hyper-parameter tuning using Scikit-Learn functions
- Model selection using RMSE as the model evaluation metric (see the sketch after this list)
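Below is a minimal sketch of what the feature-elimination and tuning steps could look like. It assumes the Kaggle training file `train.csv` with its `id`, `cat1`–`cat116`, `cont1`–`cont14`, and `loss` columns; the thresholds, parameter grid, and choice of a RandomForest are illustrative and not the notebook's exact code, and only the constant-variance and correlation filters plus RMSE-based tuning are shown.

```python
# Illustrative sketch: constant-variance filter, correlation filter, and
# hyper-parameter tuning with RMSE. Column names follow the Kaggle dataset;
# all thresholds and parameter ranges are assumptions for demonstration.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import RandomizedSearchCV

train = pd.read_csv("train.csv")
y = train["loss"]
X = train.drop(columns=["id", "loss"])

# Encode categorical features as integer codes so a tree model can use them.
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category").cat.codes

# 1) Constant-variance filter: drop features that barely vary.
selector = VarianceThreshold(threshold=0.01)
X_reduced = pd.DataFrame(selector.fit_transform(X),
                         columns=X.columns[selector.get_support()])

# 2) Correlation filter: drop one of each pair of highly correlated features.
corr = X_reduced.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X_reduced = X_reduced.drop(columns=to_drop)

# 3) Hyper-parameter tuning with RMSE as the evaluation metric.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={"n_estimators": [100, 200, 400],
                         "max_depth": [8, 16, None]},
    n_iter=5,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=42,
)
search.fit(X_reduced, y)
print("Best RMSE:", -search.best_score_)
```

The score is negated because scikit-learn always maximizes its scoring functions, so RMSE is exposed as `neg_root_mean_squared_error`.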
## File Descriptions
- `Insurance_severity_claims.ipynb`: notebook containing the whole project, including the EDA and the machine learning model
- `API`: folder containing 3 files (a hypothetical sketch of how these pieces fit together follows this list):
  1. `claimsPrediction_model_API.py`: Flask API code for deployment
  2. `columns_to_drop.csv`: csv file containing the features to be dropped; used by the API file
  3. `tunedmodel_rf`: pickle file containing the RandomForest model used for prediction
- `dataset.zip`: zipped folder containing the train and test datasets
- `requirements.txt`: text file containing the required libraries and packages to execute the code
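To illustrate how the three files in the `API` folder might work together, here is a hypothetical sketch of a Flask prediction endpoint. The route name, payload format, file paths, and preprocessing details are assumptions, not the actual contents of `claimsPrediction_model_API.py`.

```python
# Hypothetical sketch of a Flask endpoint that serves the pickled model.
# Paths, route name, and payload format are assumptions for illustration.
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the tuned RandomForest model and the list of columns to discard
# (assumes the csv stores the feature names in its first column).
with open("API/tunedmodel_rf", "rb") as f:
    model = pickle.load(f)
columns_to_drop = pd.read_csv("API/columns_to_drop.csv").iloc[:, 0].tolist()

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON record or list of records with the raw claim attributes.
    payload = request.get_json()
    records = payload if isinstance(payload, list) else [payload]
    data = pd.DataFrame(records).drop(columns=columns_to_drop, errors="ignore")
    # The real API file would also apply the same categorical encoding used
    # at training time before calling the model.
    prediction = model.predict(data)
    return jsonify({"predicted_loss": prediction.tolist()})

if __name__ == "__main__":
    app.run(debug=True)
```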
## Results
- I was able to reduce the number of features from 130 to 39 and trained an ML algorithm that works quite well and is able to make predictions.
- The predictions on the test dataset were submitted to Kaggle and a score of 3011.62 was achieved, which can still be improved.
More information about the project and the main findings of the code can be found in the post available here.
## Installation
- To clone the repository use: `git clone https://github.com/fardil-b/Insurance-Claims-Severity-Prediction.git`
- The code should run with no issues using Python versions 3.0 and above. The additional libraries required to execute the code can be installed using pip with `pip install -r requirements.txt`.
## Licensing, Authors, and Acknowledgements
Credit must be given to Allstate for the data. You can find the licensing for the data and other descriptive information at the Kaggle link available here. Otherwise, feel free to use the code here as you would like!