/ML_Capstone_Project

Allstate Insurance, the second-largest personal lines insurer in the United States and the largest that is publicly held, serves approximately 16 million households. In this project, I use machine learning and data analysis techniques to identify which columns are the best indicators of the severity of an insurance claim and to predict that severity.


Machine Learning Nanodegree Capstone

Allstate Insurance Severity Detection

Project Overview

Allstate Insurance, the second-largest personal lines insurer in the United States and the largest that is publicly held, serves approximately 16 million households. For this challenge, Allstate has provided a large dataset of its insurance claims with several dozen key attributes, through which data analysis could help uncover the underlying patterns that determine the severity of an insurance claim. The data is provided as a mix of categorical and continuous features. This makes the challenge a good fit for machine learning. Automating the review of insurance claims should improve productivity and efficiency, which in turn should lead to happier customers.

Problem Statement

The challenge requires data scientists to predict the label column, named 'loss' in the data provided. Since the dataset contains both categorical and continuous data, several dozen attributes, and well over a hundred thousand rows of insurance claim instances, the final algorithm has to take these constraints into consideration. As such, I believe that a mix of decision-tree-based, supervised regression algorithms will be best able to solve this challenge. For reference, some of the best-performing publicly available kernels on the Kaggle challenge page show that XGBoost, MLP, and similar models perform the best.

Challenge Roadmap

Understanding the Dataset

  • Data Shape
  • Skew
  • Mean, standard deviation, max, etc.
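
The three checks listed above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's code: it builds a small synthetic frame instead of reading train.csv, and the column names (cat_a, cont_a) and the generated values are assumptions made up for the example. Only the 'loss' name comes from the challenge data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv; column names and values are assumptions.
rng = np.random.default_rng(0)
n_rows = 1000
df = pd.DataFrame({
    "cat_a": rng.choice(["A", "B", "C"], size=n_rows),        # hypothetical categorical column
    "cont_a": rng.uniform(0.0, 1.0, size=n_rows),             # hypothetical continuous column
    "loss": rng.lognormal(mean=7.0, sigma=1.0, size=n_rows),  # right-skewed target
})

# Data shape: (rows, columns)
shape = df.shape

# Skew of the numeric columns; a large positive value means a long right tail
numeric = df.select_dtypes(include="number")
skew = numeric.skew()

# Mean, standard deviation, min, max, and quartiles in one table
summary = numeric.describe()

print(shape)
print(skew)
print(summary.loc[["mean", "std", "max"]])
```

With the real data, replacing the synthetic frame with pd.read_csv("train.csv") should give the same three views.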

Data Visualization

  • Scatter Plots
  • Correlation
  • Density Plots

Data Processing

  • Transformation
  • One-Hot encoding
  • Train/Test Data preparation
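
One possible shape for these processing steps is sketched below. The choice of a log transform for the skewed 'loss' target, the 80/20 split, and the synthetic columns (cat_a, cont_a) are all assumptions for illustration; the notebook may use different transformations or proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data; names and values are assumptions.
rng = np.random.default_rng(2)
n_rows = 200
df = pd.DataFrame({
    "cat_a": rng.choice(["A", "B", "C"], size=n_rows),
    "cont_a": rng.uniform(size=n_rows),
    "loss": rng.lognormal(mean=7.0, sigma=1.0, size=n_rows),
})

# Transformation: log1p compresses the long right tail; np.expm1 inverts it
y = np.log1p(df["loss"])

# One-hot encoding: each category level becomes its own indicator column
X = pd.get_dummies(df.drop(columns=["loss"]), columns=["cat_a"])

# Train/validation preparation: hold out 20% of the rows for evaluation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X.columns.tolist())
print(X_train.shape, X_val.shape)
```

Predictions made on the log scale need np.expm1 applied before they are compared with the original 'loss' values.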

Model Evaluation

  • Algorithm implementations
  • Model Evaluations
  • Model Predictions
  • Model Analysis
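
A minimal sketch of the fit, predict, and compare loop is shown below, using two of the scikit-learn regressors listed under the dependencies. The synthetic data, the default hyperparameters, and the use of mean absolute error as the score are assumptions for illustration rather than the notebook's actual setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem with a nonlinear target (an assumption, not the real data)
rng = np.random.default_rng(3)
X = rng.uniform(size=(400, 5))
y = 10.0 * np.sin(3.0 * X[:, 0]) + 5.0 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=400)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate algorithms, all with default settings
models = {
    "LinearRegression": LinearRegression(),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model, predict on the held-out rows, and record its error
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    scores[name] = mean_absolute_error(y_val, predictions)

# Report from lowest error to highest
for name, mae in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {mae:.3f}")
```

Other regressors from the dependency list can be added to the models dictionary without changing the loop.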

Appendix

  • Conclusion
  • Personal Thoughts
  • Thanks and Appreciations

Project Highlights

See report.md for an in-depth analysis of the challenge and its highlights.

Libraries and Dependencies used

  • Sklearn
    • metrics
    • Linear Regression
    • GradientBoosting
    • DecisionTree
    • SGD
    • RandomForest
    • MLP
  • Numpy
  • Pandas
  • Matplotlib
  • Seaborn

Core Files

train.csv = training data provided by the Allstate challenge. The original shape of the data is (188318, 131). This dataset includes the 'loss' column, which is the target to predict in the challenge.

test.csv = test data provided by the challenge. The original shape of the data is (125546, 130). This dataset does not include the 'loss' column, and it contains fewer insurance claim instances than the train dataset.

CapstoneProject.ipynb = the main notebook; it includes all the analysis, model building, and prediction.

Supplementary References and Links

Kaggle Challenge: https://www.kaggle.com/c/allstate-claims-severity

References: