Advanced multiclass ML pipeline that predicts obesity class, gamifying healthcare with real-time data (Apache Kafka), Azure ML, robust MLflow experiment tracking, Docker deployment, FastAPI, Poetry, drift management, blue-green deployment, Streamlit UI, SHAP dashboard, error analysis, governance, and market analysis.

Primary Language: Jupyter Notebook

Multi-Class-Prediction-of-Obesity-Risk

This project extends our earlier entry in the Kaggle competition "Multi Class Prediction of Obesity Risk", where we placed within the top 5%. It improves the models and productionizes the project using best practices learned in class MGSC-695-076. For security reasons, no access keys are shared.

Tech Stack: Apache Kafka, MLflow, Azure ML, VS Code, Poetry, AutoGluon, H2O, PyCaret, FLAML, PandasAI, Docker, Streamlit, Postman, FastAPI, SHAP

Project Overview

1. Data Preparation and Simulation

  • Data Source: Original Kaggle CSV data split into Model Development and Hold-Off datasets.
  • Live Data Simulation: Used Apache Kafka for simulating real-time data feeds.
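
The split and the simulated feed described above could be sketched as follows. This is a minimal standard-library illustration, not the project's code: the function names, the 20% hold-off fraction, and the record fields are assumptions. In a real setup each encoded message would be handed to a Kafka producer (for example `KafkaProducer.send` from the kafka-python package) instead of being collected in a list.

```python
import json
import random
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, object]


def split_holdoff(
    rows: List[Record], holdoff_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Record], List[Record]]:
    """Shuffle rows reproducibly and split them into development and hold-off sets."""
    if not 0.0 < holdoff_fraction < 1.0:
        raise ValueError("holdoff_fraction must be between 0 and 1")
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_holdoff = int(round(len(shuffled) * holdoff_fraction))
    return shuffled[n_holdoff:], shuffled[:n_holdoff]


def stream_records(rows: List[Record]) -> Iterator[bytes]:
    """Yield one UTF-8 JSON message per row, the shape a Kafka producer would send."""
    for row in rows:
        yield json.dumps(row, sort_keys=True).encode("utf-8")


# Hypothetical rows standing in for the Kaggle CSV.
rows = [{"id": i, "Age": 20 + i, "Weight": 60.0 + i} for i in range(10)]
development, holdoff = split_holdoff(rows)
messages = list(stream_records(holdoff))
```

Keeping the hold-off rows out of model development means the simulated stream only ever contains records the models have not seen.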

2. Azure Machine Learning Setup

  • Workspace Configuration: Established Azure ML Workspace with RBAC.
  • Team Roles: Assigned roles for Data Science, Data Engineering, ML Engineering, and Governance.

3. Exploratory Data Analysis (EDA)

  • Comprehensive Analysis:
    • Univariate Analysis: Leveraged PandasAI for detailed insights.
    • Bivariate Analysis: Used pairplots and interaction plots.
    • Dimensionality Reduction: Applied PCA with K-Medians clustering.

4. Data Preprocessing

  • Feature Engineering: Enhanced performance based on EDA insights.
  • Normalization and Scaling: Ensured optimal feature scaling.
  • Missing Data Handling: Applied appropriate strategies for missing data.
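
As an illustration of the scaling and missing-data steps, here is a small standard-library sketch. The README does not say which strategies were used, so median imputation and z-score standardization are assumptions, as are the function names and the sample column.

```python
import statistics
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values: List[float]) -> List[float]:
    """Scale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no signal
    return [(v - mean) / std for v in values]


# Hypothetical "Weight" column with one missing entry.
weight = [60.0, None, 80.0, 100.0]
weight_filled = impute_median(weight)
weight_scaled = standardize(weight_filled)
```

In a training pipeline the median, mean, and standard deviation would normally be computed on the development data only and then reused for hold-off and live records.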

5. Dependency Management

  • Poetry Integration: Managed dependencies for reproducibility.

6. Model Development and Optimization

  • State-of-the-Art Models:

    • Custom models like XGBoost, LightGBM, CatBoost.
    • Hyperparameter Tuning: Used Optuna for optimization.
  • AutoML Exploration:

    • Explored PyCaret, AutoGluon, H2O for benchmarking.
    • Advanced Techniques: Stacked models, Isolation Forest, custom loss functions.

7. Experiment Tracking and Management

  • MLflow & Azure MLflow Integration:
    • Tracked global and local metrics, target distribution.
    • SHAP Analysis: Utilized SHAP values for explainability and error analysis.
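
A possible shape for the tracked values is sketched below: one global metric (accuracy), one local metric per class (recall), and the target distribution. The metric names, helper names, and sample labels are assumptions. MLflow is imported only inside `log_run`, so the helpers can be tried without a tracking server.

```python
from collections import Counter
from typing import Dict, List


def target_distribution(labels: List[str]) -> Dict[str, float]:
    """Share of each class among the labels."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Global metric: fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Local metric: recall computed separately for every true class."""
    hits: Counter = Counter()
    totals: Counter = Counter(y_true)
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


def log_run(y_true: List[str], y_pred: List[str], run_name: str = "sketch") -> None:
    """Log the metrics to the configured MLflow tracking backend (requires mlflow)."""
    import mlflow  # imported here so the helpers above work without MLflow

    with mlflow.start_run(run_name=run_name):
        mlflow.log_metric("accuracy", accuracy(y_true, y_pred))
        for cls, value in per_class_recall(y_true, y_pred).items():
            mlflow.log_metric(f"recall_{cls}", value)
        for cls, share in target_distribution(y_true).items():
            mlflow.log_metric(f"target_share_{cls}", share)


# Hypothetical labels; the real project predicts obesity classes.
y_true = ["Normal", "Obese", "Obese", "Overweight"]
y_pred = ["Normal", "Obese", "Overweight", "Overweight"]
```

Logging the target distribution next to the metrics makes it easier to tell a genuine quality drop from a shift in class balance between runs.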

8. Deployment Strategies

  • Containerization: Used FastAPI and Docker.

  • Azure Deployment: Deployed to Azure Container Instances; Kubernetes is planned.

  • Conversion to Azure Scripts:

    • Converted Jupyter notebooks to Python scripts for Azure jobs.
    • Azure Pipelines: CI/CD with GitHub Actions and Azure Container Registry.
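
To give an idea of what the containerized FastAPI service might expose, here is a hedged sketch of a single prediction endpoint. The route, the request fields, and the `predict_label` rule are all hypothetical: the rule is a plain BMI threshold standing in for the trained model, and the class names are illustrative.

```python
from typing import Dict


def predict_label(height_m: float, weight_kg: float) -> str:
    """Stand-in for the trained model: a plain BMI rule, for illustration only."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    bmi = weight_kg / (height_m ** 2)
    if bmi < 18.5:
        return "Insufficient_Weight"
    if bmi < 25.0:
        return "Normal_Weight"
    if bmi < 30.0:
        return "Overweight"
    return "Obesity"


def create_app():
    """Build the FastAPI app (requires fastapi and pydantic to be installed)."""
    from fastapi import FastAPI
    from pydantic import BaseModel

    class PredictionRequest(BaseModel):
        height_m: float
        weight_kg: float

    app = FastAPI(title="Obesity risk sketch")

    @app.post("/predict")
    def predict(request: PredictionRequest) -> Dict[str, str]:
        return {"prediction": predict_label(request.height_m, request.weight_kg)}

    return app
```

If this were saved as a hypothetical `main.py`, it could be served with `uvicorn main:create_app --factory`, and the same command could act as the container's start command.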

9. User Interface and Interaction

  • Streamlit Application: User-friendly interface integrated with APIs.

10. Model Monitoring and Drift Management

  • Monitoring Strategy: Drift detection, automated endpoint management.
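
The README does not name a drift metric. One common choice for numeric features is the Population Stability Index (PSI), sketched below under that assumption. The bin count, the floor for empty bins, and the sample data are illustrative.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample must contain more than one distinct value")
    width = (hi - lo) / bins

    def shares(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in sample:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: a reference window and a shifted live window.
reference = [float(i % 50) for i in range(500)]
shifted = [x + 20.0 for x in reference]
drift_score = psi(reference, shifted)
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold that triggers retraining or an endpoint swap is a project decision.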

11. Azure ML Designer Integration

  • UI-Based Experiments: Additionally ran experiments in Azure ML Designer (UI) and with SDK v2, for learning purposes.

12. Additional Expert Considerations

  • Cross-Validation: Ensured model generalizability.
  • Model Governance: Versioning, lineage tracking, compliance.
  • Scalability and Optimization: Performance tests, scalability checks.
  • Feedback Loop: Integrated feedback for continuous improvement.
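
For the cross-validation point, a stratified split is the usual choice in multi-class problems because it keeps class shares similar in every fold. The sketch below is a standard-library illustration with assumed names and sample labels; in practice scikit-learn's `StratifiedKFold` does the same job.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_kfold(
    labels: Sequence[str], n_splits: int = 5, seed: int = 42
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) with class shares kept per fold."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        validation = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, validation


# Hypothetical imbalanced labels.
labels = ["Normal"] * 10 + ["Obese"] * 5
splits = list(stratified_kfold(labels, n_splits=5))
```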

13. Branches:

  1. Main: For Final Product [Owner - Team]
  2. Experiments: For ML Experiments and tracking [Owners - Arham, Krishan]
  3. ArchDevelopment: For CI/CD [Owner - Nandani]
  4. Streamlit: For front end [Owner - Nandani]
  5. Data Engineering: For Kafka Streaming [Owner - Yash]
  6. Backup: For Backup [Owner - Aasna, Mahrukh]

Technologies Used

  • Data Analysis/Model Training: Python, Jupyter Notebooks
  • Experiment Tracking: MLflow
  • Model Building: PyCaret, LightGBM, XGBoost, CatBoost
  • Hyperparameter Optimization: Optuna
  • Containerization: Docker
  • Real-Time Data Streaming: Kafka
  • Version Control and CI/CD: Git, GitHub Actions
  • Cloud Deployment: Azure Machine Learning, Azure Blob Storage
  • User Interface: Streamlit
  • Dependency and Environment Management: Poetry

How to Run the Code

Prerequisites

  • Python 3.8+
  • Poetry
  • Docker
  • Azure Account
  • Kafka

Setup

  1. Clone the Repository

    git clone https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk.git
    cd Multi-Class-Prediction-of-Obesity-Risk
  2. Install Dependencies

    poetry install
  3. Set Up Environment Variables

    Create a .env file in the root directory and add the necessary environment variables. Example:

    AZURE_SUBSCRIPTION_ID=your_subscription_id
    AZURE_RESOURCE_GROUP=your_resource_group
    AZURE_WORKSPACE_NAME=your_workspace_name
  4. Start Docker

    Ensure Docker is running on your machine. Build and run the Docker containers:

    docker-compose up --build
  5. Run Streamlit Application

    streamlit run Streamlit/app.py
  6. Run Jupyter Notebooks

    Start Jupyter Lab to run and explore notebooks:

    poetry run jupyter lab
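
The three variables from the `.env` example in step 3 could be consumed as sketched below. The helper names are hypothetical, and the code assumes the `.env` file has already been loaded into the process environment (for instance with python-dotenv's `load_dotenv()`). The Azure packages are imported only inside `connect_workspace`.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZURE_WORKSPACE_NAME")


def load_azure_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the three workspace settings, failing early if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def connect_workspace(settings: Mapping[str, str]):
    """Create an Azure ML SDK v2 client (requires azure-ai-ml and azure-identity)."""
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    return MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=settings["AZURE_SUBSCRIPTION_ID"],
        resource_group_name=settings["AZURE_RESOURCE_GROUP"],
        workspace_name=settings["AZURE_WORKSPACE_NAME"],
    )
```

Failing early on a missing variable gives a clearer error than a rejected Azure call later in a notebook or deployment script.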

Deployment

  1. Azure ML Deployment

    • Configure your Azure workspace by setting up the necessary resources.
    • Use the provided Azure scripts to deploy models and services.
    poetry run python deploy/deploy_to_azure.py
  2. CI/CD Setup

    • Ensure GitHub Actions are configured correctly.
    • Push changes to the repository to trigger CI/CD pipelines.
    git add .
    git commit -m "Your commit message"
    git push origin main

Monitoring and Maintenance

  • Model Monitoring: Utilize integrated monitoring tools to track model performance and detect drift.
  • Endpoint Management: Automated endpoint management to ensure availability and performance.

Business Case

Our solution targets healthcare providers for early identification of at-risk patients, public health officials for data-driven policy making, and insurance companies for premium adjustment based on individual risk. The economic impact includes significant healthcare cost savings and revenue generation from tailored wellness programs.

Acknowledgements

This project is an effort by the team to tackle the global health crisis of obesity by employing advanced data science and machine learning techniques, aiming to make a significant impact in the healthcare sector.

Meet the Team

  1. Product Manager - Aasna
  2. Machine Learning Engineer - Arham
  3. ML Ops - Krishan
  4. Data Engineer - Yash
  5. Cloud SME - Nandani
  6. Business Analyst - Mahrukh