Insurance Premium Prediction Project

Build a solution that can predict an individual's health insurance premium.

About Dataset

The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns): 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region). The nominal features were converted into factors, with a numerical value assigned to each level.
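
As a rough illustration of the encoding described above, the sketch below maps each nominal level to an integer code with pandas. The rows are made-up sample values (not taken from insurance.csv) and the `encode_nominal` helper name is hypothetical; the project's actual encoding may differ.

```python
import pandas as pd

# Made-up sample rows that mirror the 7 columns of insurance.csv.
sample = pd.DataFrame(
    {
        "age": [20, 35, 50],
        "sex": ["female", "male", "female"],
        "bmi": [25.0, 22.5, 31.0],
        "children": [0, 1, 3],
        "smoker": ["yes", "no", "no"],
        "region": ["southwest", "northwest", "southeast"],
        "expenses": [1500.0, 2500.0, 3500.0],
    }
)

NOMINAL_COLUMNS = ["sex", "smoker", "region"]


def encode_nominal(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with each nominal column replaced by integer level codes."""
    encoded = df.copy()
    for column in NOMINAL_COLUMNS:
        # Levels are ordered alphabetically, so the codes are stable
        # for a given set of levels.
        encoded[column] = encoded[column].astype("category").cat.codes
    return encoded


encoded = encode_nominal(sample)
print(encoded)
```

Integer codes imply an ordering between levels, which is harmless for tree-based models but may not suit linear models; one-hot encoding is a common alternative for region.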

Please check the link below:

Streamlit Deployment (current live link): Streamlit-app

Elastic Beanstalk: [*Deployment Link Live-Beanstalk*](http://insurance-env-1.eba-ztjhym2p.ap-south-1.elasticbeanstalk.com/)

Documentation:

Detailed Project Report

Steps Taken:

Installed Python, VS Code and Git.
Created an environment with python=3.9.
Installed the dependencies from the requirements file.
Created an account on MongoDB Atlas.
Downloaded the source dataset from the Kaggle repository.
Framed the task as a regression problem with expenses as the target feature.
Deployed on AWS EC2.

Data Cleaning:

The data was cleaned to fix a header issue, missing values, misplaced values and outliers.

EDA and Feature Engineering:

In this step, we apply Exploratory Data Analysis (EDA) to extract insights from the dataset and identify which features contribute most to predicting insurance expenses, using Pandas for data analysis and Matplotlib and Seaborn for data visualization.
Feature scaling was done with StandardScaler, which standardizes each feature to zero mean and unit variance.
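
To make the scaling step concrete, here is a minimal sketch of what scikit-learn's `StandardScaler` does. The input values are made up for illustration and are not taken from the dataset. Note that the scaled values are centred on 0 with unit standard deviation, but they are not confined to a fixed range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two numerical features (age, bmi).
X = np.array(
    [
        [18.0, 20.0],
        [25.0, 24.0],
        [40.0, 31.0],
        [64.0, 45.0],
    ]
)

scaler = StandardScaler()
# Each column is transformed as z = (x - column mean) / column standard deviation.
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
print(X_scaled.max())         # exceeds 1 for values far from the mean
```

In a train/test workflow, the scaler would normally be fit on the training split only and then applied to the test split with `transform`.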

Model Building

This is a regression problem: the target to predict is the expenses feature.
Models used: Linear Regression, Random Forest, Decision Tree, AdaBoost and Gradient Boosting.
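
A comparison of these five model families could look roughly like the sketch below. It uses synthetic data from `make_regression` as a stand-in for the real features and the expenses target, and the split ratio and default hyperparameters are assumptions rather than the project's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 6 predictors and the expenses target.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(random_state=42),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "Gradient boosting": GradientBoostingRegressor(random_state=42),
}

# Fit every model on the same split and score it on the held-out data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Print the models from best to worst R2.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R2 = {score:.3f}")
```

Because the synthetic target is linear in the features, the ranking printed here says nothing about which model performs best on the insurance data.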


Model Selection

Hyperparameter tuning was done with GridSearchCV for the regression models.
Evaluation metrics: R2 score, adjusted R2 and mean absolute error.
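
The tuning and evaluation step can be sketched as follows. The parameter grid, the choice of `GradientBoostingRegressor` and the synthetic data are all assumptions for illustration. scikit-learn has no built-in adjusted R2, so the hypothetical `adjusted_r2` helper computes it from the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split


def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Synthetic stand-in for the insurance features and the expenses target.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid, kept small so the search runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the best estimator on the held-out split.
y_pred = search.best_estimator_.predict(X_test)
r2 = r2_score(y_test, y_pred)
adj_r2 = adjusted_r2(r2, n_samples=X_test.shape[0], n_features=X_test.shape[1])
mae = mean_absolute_error(y_test, y_pred)

print("Best parameters:", search.best_params_)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, MAE = {mae:.2f}")
```

Adjusted R2 penalizes additional predictors, so it is never higher than the plain R2 on the same predictions.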

Flask, Docker and AWS Deployment:

Built a Flask app with a Dockerfile.
Deployed on AWS EC2 with a CI/CD pipeline through GitHub Actions.
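
For orientation, a prediction endpoint in such a Flask app might look like the minimal sketch below. The `/predict` route, the JSON field names and the `predict_expenses` placeholder are assumptions and not the project's actual code; a real app would load the trained model instead of returning a dummy value.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# The six predictors from the dataset; expenses is the value to predict.
REQUIRED_FIELDS = ["age", "sex", "bmi", "children", "smoker", "region"]


def predict_expenses(features: dict) -> float:
    """Placeholder for the trained model; always returns a dummy value.

    A real app would load the saved model and call its predict method here.
    """
    return 0.0


@app.route("/predict", methods=["POST"])
def predict():
    # Treat a missing or invalid JSON body as an empty payload.
    payload = request.get_json(silent=True) or {}
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return jsonify({"error": f"missing fields: {missing}"}), 400
    features = {field: payload[field] for field in REQUIRED_FIELDS}
    return jsonify({"predicted_expenses": predict_expenses(features)})
```

A POST request to `/predict` with a JSON body containing all six fields returns the placeholder prediction, while a body with missing fields returns a 400 response listing them.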

MLflow and DVC (to facilitate collaboration across the ML lifecycle):

Used MLflow for experiment tracking, logging metrics, parameters, and artifacts during model training.
Used DVC to version control and manage large datasets efficiently.
By integrating MLflow and DVC, we can create a more robust and reproducible machine learning workflow that addresses both code and data versioning concerns.