Multi-Class-Prediction-of-Obesity-Risk

This project extends our entry to the Kaggle competition "Multi Class Prediction of Obesity Risk", where we placed within the top 5%. It improves the original models and productionizes the pipeline using best practices learned in class MGSC-695-076. For security reasons, no access keys are included in this repository.

Tech Stack: Apache Kafka, MLflow, Azure ML, VS Code, Poetry, AutoGluon, H2O, PyCaret, FLAML, PandasAI, Docker, Streamlit, Postman, FastAPI, SHAP

Project Overview

1. Data Preparation and Simulation

Data Source: Original Kaggle CSV data split into Model Development and Hold-Off datasets.
Live Data Simulation: Used Apache Kafka for simulating real-time data feeds.
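
As a rough illustration of the two steps above, the sketch below splits rows into development and hold-off sets and replays the hold-off rows one message at a time. The column names, the 20% hold-off fraction, the topic name `obesity-live`, and the `send` callable are all assumptions for illustration; the repository's actual split and producer code may differ.

```python
import csv
import io
import json
import random
import time
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def split_rows(rows: List[Row], holdoff_fraction: float = 0.2,
               seed: int = 42) -> Tuple[List[Row], List[Row]]:
    """Shuffle rows reproducibly and split them into (development, hold-off)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdoff = int(len(shuffled) * holdoff_fraction)
    return shuffled[n_holdoff:], shuffled[:n_holdoff]


def replay(rows: List[Row], send: Callable[[bytes], None],
           delay_seconds: float = 0.0) -> int:
    """Send each row as one JSON-encoded message, pausing between messages."""
    for row in rows:
        send(json.dumps(row).encode("utf-8"))
        if delay_seconds:
            time.sleep(delay_seconds)
    return len(rows)


# Tiny in-memory CSV standing in for the Kaggle file (columns are made up).
SAMPLE_CSV = "id,Age,Weight\n" + "\n".join(
    f"{i},{20 + i},{60 + i}" for i in range(10)
)
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
dev_rows, holdoff_rows = split_rows(rows)

# With kafka-python, `send` could be wired up like this (hypothetical topic):
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   send = lambda payload: producer.send("obesity-live", payload)
sent: List[bytes] = []
replay(holdoff_rows, sent.append)
```

Passing `send` as a callable keeps the replay logic testable without a running broker; a real producer can be swapped in without changing `replay`.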

2. Azure Machine Learning Setup

Workspace Configuration: Established Azure ML Workspace with RBAC.
Team Roles: Assigned roles for Data Science, Data Engineering, ML Engineering, and Governance.

3. Exploratory Data Analysis (EDA)

Comprehensive Analysis:
- Univariate Analysis: Leveraged PandasAI for detailed insights.
- Bivariate Analysis: Used pairplots and interaction plots.
- Dimensionality Reduction: Applied PCA with KMediansClustering.

4. Data Preprocessing

Feature Engineering: Enhanced performance based on EDA insights.
Normalization and Scaling: Ensured optimal feature scaling.
Missing Data Handling: Applied appropriate strategies for missing data.
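
The README does not say which imputation or scaling strategy was used. As one plausible example, the sketch below fills missing values with the median and then standardizes to zero mean and unit variance, using only the standard library. The sample values are made up.

```python
import statistics
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values: List[float]) -> List[float]:
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


weights = [60.0, None, 80.0, 100.0]  # made-up values
filled = impute_median(weights)       # the median of 60, 80, 100 is 80
scaled = standardize(filled)
```

In a real pipeline the median, mean, and standard deviation would be computed on the training split only and reused for the hold-off data, so that no information leaks from evaluation data into preprocessing.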


5. Dependency Management

Poetry Integration: Managed dependencies for reproducibility.

6. Model Development and Optimization

State-of-the-Art Models:
- Custom models like XGBoost, LightGBM, CatBoost.
- Hyperparameter Tuning: Used Optuna for optimization.
AutoML Exploration:
- Explored PyCaret, AutoGluon, H2O for benchmarking.
- Advanced Techniques: Stacked models, Isolation Forest, custom loss functions.
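
A minimal sketch of how an Optuna search over gradient-boosting hyperparameters might be set up. The parameter names and ranges are illustrative guesses, and `score_fn` is a hypothetical callable that trains a model with the given parameters and returns a validation score; neither is taken from the project's code.

```python
from typing import Any, Callable, Dict


def make_objective(score_fn: Callable[[Dict[str, Any]], float]):
    """Build an Optuna objective that samples parameters and scores them."""
    def objective(trial) -> float:
        params = {
            # Ranges below are illustrative assumptions, not tuned values.
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        }
        return score_fn(params)
    return objective


def tune(score_fn: Callable[[Dict[str, Any]], float],
         n_trials: int = 50) -> Dict[str, Any]:
    """Run the search and return the best parameters (needs optuna installed)."""
    import optuna  # imported lazily so the sketch loads without the dependency

    study = optuna.create_study(direction="maximize")
    study.optimize(make_objective(score_fn), n_trials=n_trials)
    return study.best_params
```

Calling `tune(score_fn)` with a function that returns cross-validated accuracy would search for the parameter set that maximizes it.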

7. Experiment Tracking and Management

MLflow & Azure MLflow Integration:
- Tracked global and local metrics, target distribution.
- SHAP Analysis: Utilized SHAP values for explainability and error analysis.
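
To make the tracking step concrete, here is a hedged sketch that computes one overall metric plus per-class recall and logs them to MLflow. Reading "global" as overall accuracy and "local" as per-class recall is an assumption, and the class labels are illustrative.

```python
from collections import Counter
from typing import Dict, List


def global_and_per_class(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Return overall accuracy plus recall for each class present in y_true."""
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    support = Counter(y_true)
    metrics = {"accuracy": sum(correct.values()) / len(y_true)}
    for label, n in support.items():
        metrics[f"recall_{label}"] = correct[label] / n
    return metrics


def log_run(params: Dict[str, float], metrics: Dict[str, float]) -> None:
    """Log one run to the active MLflow tracking server (needs mlflow installed)."""
    import mlflow  # imported lazily so the metric helper has no dependency

    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)


# Illustrative labels and predictions, not project results.
y_true = ["Normal_Weight", "Obesity_Type_I", "Obesity_Type_I", "Normal_Weight"]
y_pred = ["Normal_Weight", "Obesity_Type_I", "Normal_Weight", "Normal_Weight"]
metrics = global_and_per_class(y_true, y_pred)
```

Logging per-class recall alongside accuracy makes it easier to spot a class that degrades even when the overall score looks stable.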

8. Deployment Strategies

Containerization: Used FastAPI and Docker.
Azure Deployment: Deployed to Azure Container Instances; Kubernetes deployment is planned.
Conversion to Azure Scripts:
- Converted Jupyter notebooks to Python scripts for Azure jobs.
- Azure Pipelines: CI/CD with GitHub Actions and Azure Container Registry.

9. User Interface and Interaction

Streamlit Application: User-friendly interface integrated with APIs.

10. Model Monitoring and Drift Management

Monitoring Strategy: Drift detection, automated endpoint management.
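
The README does not name the drift metric in use. One common choice for a single numeric feature is the Population Stability Index (PSI), sketched below with equal-width bins taken from the baseline. The bin count and the floor for empty bins are assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant baseline

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps the logarithm defined when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [float(x) for x in range(100)]
same = psi(baseline, baseline)
shifted = psi(baseline, [x + 50.0 for x in baseline])
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though suitable thresholds depend on the feature and should be validated on real traffic.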

11. Azure ML Designer Integration

UI-Based Experiments: Ran additional experiments for learning purposes using Azure ML Designer (UI) and SDK v2.

12. Additional Expert Considerations

Cross-Validation: Ensured model generalizability.
Model Governance: Versioning, lineage tracking, compliance.
Scalability and Optimization: Performance tests, scalability checks.
Feedback Loop: Integrated feedback for continuous improvement.
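
For the cross-validation item above, a stratified split is a natural fit for a multi-class target because it keeps class proportions similar in every fold. The sketch below is a dependency-free illustration of the idea, not the project's code; scikit-learn's `StratifiedKFold` provides the same behavior in practice.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_kfold(labels: Sequence[str], k: int = 5,
                     seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) with class balance per fold."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)  # deal each class round-robin across folds

    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


labels = ["A"] * 10 + ["B"] * 5  # toy labels with a 2:1 class ratio
splits = list(stratified_kfold(labels, k=5))
```

Each validation fold here contains two "A" samples and one "B" sample, mirroring the 2:1 ratio of the full label list.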

13. Branches:

Main: For Final Product [Owner - Team]
Experiments: For ML Experiments and tracking [Owners - Arham, Krishan]
ArchDevelopment: For CI/CD [Owner - Nandani]
Streamlit: For front end [Owner - Nandani]
Data Engineering: For Kafka Streaming [Owner - Yash]
Backup: For Backup [Owner - Aasna, Mahrukh]

Technologies Used

Data Analysis/Model Training: Python, Jupyter Notebooks
Experiment Tracking: MLflow
Model Building: PyCaret, LightGBM, XGBoost, CatBoost
Hyperparameter Optimization: Optuna
Containerization: Docker
Real-Time Data Streaming: Kafka
Version Control and CI/CD: Git, GitHub Actions
Cloud Deployment: Azure Machine Learning, Azure Blob Storage
User Interface: Streamlit
Dependency and Environment Management: Poetry

How to Run the Code

Prerequisites

Python 3.8+
Poetry
Docker
Azure Account
Kafka

Setup

Clone the Repository

git clone https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk.git
cd Multi-Class-Prediction-of-Obesity-Risk

Install Dependencies
```
poetry install
```

Set Up Environment Variables

Create a .env file in the root directory and add the necessary environment variables. Example:

AZURE_SUBSCRIPTION_ID=your_subscription_id
AZURE_RESOURCE_GROUP=your_resource_group
AZURE_WORKSPACE_NAME=your_workspace_name
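
A sketch of how a script might read these variables, fail fast when one is missing, and connect to the workspace with the Azure ML SDK v2. The helper names are hypothetical, and the project's scripts may load configuration differently (for example with `python-dotenv`).

```python
from typing import Dict, Iterable

REQUIRED = ("AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZURE_WORKSPACE_NAME")


def parse_env(lines: Iterable[str]) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require(settings: Dict[str, str]) -> Dict[str, str]:
    """Raise a clear error if any required variable is missing or empty."""
    missing = [name for name in REQUIRED if not settings.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: settings[name] for name in REQUIRED}


def get_ml_client(settings: Dict[str, str]):
    """Connect to the workspace (needs azure-ai-ml and azure-identity installed)."""
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    return MLClient(
        DefaultAzureCredential(),
        subscription_id=settings["AZURE_SUBSCRIPTION_ID"],
        resource_group_name=settings["AZURE_RESOURCE_GROUP"],
        workspace_name=settings["AZURE_WORKSPACE_NAME"],
    )


# Placeholder values copied from the example above, not real credentials.
example = parse_env([
    "AZURE_SUBSCRIPTION_ID=your_subscription_id",
    "AZURE_RESOURCE_GROUP=your_resource_group",
    "AZURE_WORKSPACE_NAME=your_workspace_name",
])
settings = require(example)
```

With a real file, `parse_env(open(".env"))` would supply the settings. The `.env` file should stay out of version control, consistent with the note that no access keys are shared.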

Start Docker

Ensure Docker is running on your machine. Build and run the Docker containers:
```
docker-compose up --build
```
Run Streamlit Application
```
streamlit run Streamlit/app.py
```
Run Jupyter Notebooks

Start Jupyter Lab to run and explore notebooks:
```
poetry run jupyter lab
```

Deployment

Azure ML Deployment
- Configure your Azure workspace by setting up the necessary resources.
- Use the provided Azure scripts to deploy models and services.
```
poetry run python deploy/deploy_to_azure.py
```
CI/CD Setup
- Ensure GitHub Actions are configured correctly.
- Push changes to the repository to trigger CI/CD pipelines.
```
git add .
git commit -m "Your commit message"
git push origin main
```

Monitoring and Maintenance

Model Monitoring: Utilize integrated monitoring tools to track model performance and detect drift.
Endpoint Management: Automated endpoint management to ensure availability and performance.

Business Case

Our solution targets healthcare providers for early identification of at-risk patients, public health officials for data-driven policy making, and insurance companies for premium adjustment based on individual risk. The economic impact includes significant healthcare cost savings and revenue generation from tailored wellness programs.

Acknowledgements

This project is an effort by the team to tackle the global health crisis of obesity by employing advanced data science and machine learning techniques, aiming to make a significant impact in the healthcare sector.

Meet the Team

Product Manager - Aasna
Machine Learning Engineer - Arham
ML Ops - Krishan
Data Engineer - Yash
Cloud SME - Nandani
Business Analyst - Mahrukh