Build a solution that can predict a person's health insurance premium.
The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns). The dataset contains 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region) that were converted into factors with numerical value designated for each level.
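As a rough illustration of the factor conversion described above, the sketch below replaces the three nominal columns with integer codes using pandas. The helper name `encode_nominal` is hypothetical, and the small inline frame uses made-up rows that only mimic the column layout of insurance.csv; the project's actual encoding may differ.

```python
import pandas as pd

# Nominal columns named in the dataset description.
NOMINAL = ["sex", "smoker", "region"]


def encode_nominal(df, columns=NOMINAL):
    """Return a copy of df with each nominal column replaced by integer codes.

    Codes follow the alphabetical order of the levels (e.g. female=0, male=1).
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category").cat.codes
    return out


# Made-up rows for illustration; real data would come from
# pd.read_csv("insurance.csv").
sample = pd.DataFrame({
    "age": [25, 40, 33],
    "sex": ["female", "male", "male"],
    "bmi": [22.5, 31.0, 27.4],
    "children": [0, 2, 1],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northeast"],
    "expenses": [3000.0, 21000.0, 5200.0],
})

encoded = encode_nominal(sample)
print(encoded)
```

Integer codes keep the frame compact, but they imply an ordering for `region` that linear models may misread; one-hot encoding is a common alternative for that column.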
Streamlit Deployment (current live link): Streamlit-app
Elastic Beanstalk: [*Deployment Link Live-Beanstalk*](http://insurance-env-1.eba-ztjhym2p.ap-south-1.elasticbeanstalk.com/)
- Installed Python, VS Code and Git.
- Created a virtual environment with Python 3.9.
- Installed the dependencies from the requirements file.
- Created an account on Atlas MongoDB.
- Downloaded the source dataset from the Kaggle repository.
- Framed the task as a regression problem, with `expenses` as the feature to predict.
- Deployed on AWS-EC2.
- Cleaned the data, which had a header issue, missing values, misplaced values and outliers.
- Applied Exploratory Data Analysis (EDA) to find which features contribute most to predicting expenses, using Pandas for data analysis and Matplotlib and Seaborn for visualization.
- Applied feature scaling with StandardScaler, which standardizes each feature to zero mean and unit variance.
- The regression models predict the feature `expenses`.
- Models used: Linear Regression, Random Forest, Decision Tree, AdaBoost and Gradient Boosting.
- Hyperparameter tuning was done with GridSearchCV for the regression models.
- Regression metrics: R2 score, adjusted R2 and mean absolute error.
- Built a Flask app with a Dockerfile.
- Deployed on AWS-EC2 with a CI/CD pipeline through GitHub Actions.
- Used MLflow for experiment tracking, logging metrics, parameters, and artifacts during model training.
- Used DVC to version-control and manage large datasets efficiently.
- By integrating MLflow and DVC, we can create a more robust and reproducible machine learning workflow that addresses both code and data versioning concerns.
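
The scaling, tuning and evaluation steps listed above can be sketched roughly as follows. This is a minimal example under stated assumptions, not the project's actual training code: the data is synthetic (generated with `make_regression` as a stand-in for the six predictors and `expenses`), only Random Forest is tuned, the parameter grid is arbitrary, and `adjusted_r2` is a hypothetical helper since scikit-learn does not ship an adjusted R2 metric.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Synthetic stand-in for the six predictors and the expenses target.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled using
# statistics from its own training split only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=0)),
])

# Arbitrary illustrative grid; the project's real grid is not known.
param_grid = {
    "model__n_estimators": [20, 50],
    "model__max_depth": [3, None],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="r2")
search.fit(X_train, y_train)

# Evaluate the refitted best pipeline on the held-out split.
pred = search.predict(X_test)
r2 = r2_score(y_test, pred)
adj_r2 = adjusted_r2(r2, n_samples=len(y_test), n_features=X_test.shape[1])
mae = mean_absolute_error(y_test, pred)

print(f"best params: {search.best_params_}")
print(f"R2={r2:.3f}  adjusted R2={adj_r2:.3f}  MAE={mae:.2f}")
```

Swapping `RandomForestRegressor` for another estimator (with a matching grid) would cover the other models in the list; the metric calls stay the same.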