Repository for the final project - course 02476 at DTU
- Overall goal: The goal is to classify five varieties of rice (Arborio, Basmati, Ipsala, Jasmine and Karacadag). A framework and model(s) are chosen alongside a simple dataset to focus on the use and implementation of frameworks and tools within MLOps. The main focus is thus on reproducibility, profiling, visualizing and monitoring multiple experiments to assess model performance.
- Framework: The project uses PyTorch Image Models (TIMM), a framework/collection of models with pretrained weights, data loaders, training/validation scripts and more, covering many different models. Only the model functionality is used in the project.
- Data: The rice image dataset publicly available on Kaggle contains 75,000 images, 15,000 for each of the 5 classes; the total size of the dataset is 230 MB. Each image is 250 x 250 pixels with a single grain of rice against a black background.
- Deep learning models used: For the classification we use a Convolutional Neural Network, specifically the model EVA, which is the best-performing model as of January 2023 on the ImageNet validation set (reference).
Local environment
Run the following:
cd dtu-02476-mlops
make create_environment
conda activate mlops_group8
make requirements
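make create_environment presumably creates the conda environment activated above; a sketch of an equivalent environment.yml (the Python version is an assumption, not taken from the project):

```yaml
# hypothetical environment.yml matching what `make create_environment` sets up
name: mlops_group8
dependencies:
  - python=3.11
  - pip
```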
Using the cloud (GCP)
A Google Cloud Platform (GCP) account with credits is necessary for:
- Buckets (storing data and model)
- Container Registry (storing Docker images)
- Trigger (automatically building the Docker images from dockerfiles from the GitHub repository)
- Vertex AI (running the training)
- Cloud Run (to host the inference API)
Training can be done in one of the following three ways:
- Locally (potentially without any cloud connection)
- Containerized locally, using the GCP buckets. The container and entrypoint are the same as those used in the cloud.
- Cloud training, utilizing Vertex AI as a virtual compute engine running the training image/container built from cloudbuild_dockerfiles_train.yaml.
NB: Hyperparameters and settings for training are stated in the config file referred to from default_config.yaml. For a new training run you must create a new exp.yaml file in the experiment folder and refer to it from default_config.yaml.
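A hypothetical sketch of how these config files could fit together (the Hydra-style defaults list and all hyperparameter values below are assumptions, not the project's actual settings):

```yaml
# default_config.yaml -- refers to the experiment config to use
defaults:
  - experiment: exp1

# experiment/exp1.yaml -- hypothetical hyperparameters for one run
hyperparameters:
  lr: 1e-3
  batch_size: 32
  epochs: 10
  seed: 42
```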
1 - Locally
- Pull data
dvc pull -r remote_storage data.dvc # Pulls latest from gcp bucket
make data # Pulls from kaggle - this does not require gcp connection
- Run training
make train-local
- (If desired) Push models to GCP:
dvc add models
dvc push -r remote_storage_models_train models.dvc
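The two DVC remotes used above would be defined in .dvc/config; a hypothetical sketch (bucket names are placeholders, not the project's actual buckets):

```ini
[core]
    remote = remote_storage
['remote "remote_storage"']
    url = gs://<data-bucket>
['remote "remote_storage_models_train"']
    url = gs://<models-bucket>
```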
2 - Local container
NB: This uses dvc
pull and push from/to gcp
buckets as well as the config file specified
See docker/train/ folder for entrypoint and dockerfile.
- Build container from dockerfile and run image:
make train-container
- (If you don't want to rebuild the image) Run:
docker compose up trainer
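The trainer service referenced by docker compose up would be declared in docker-compose.yaml; a hypothetical sketch (build context, dockerfile path and volume mapping are assumptions):

```yaml
# hypothetical trainer service in docker-compose.yaml
services:
  trainer:
    build:
      context: .
      dockerfile: docker/train/Dockerfile
    volumes:
      - ./models:/app/models  # persist trained models on the host
```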
3 - In the cloud (using Vertex AI):
- On GCP a trigger has been set up for the GitHub repository using cloudbuild_dockerfiles_train.yaml, which runs every time the main branch is updated (a webhook from the GitHub workflows was also experimented with). This rebuilds the training image (from this Dockerfile), so the current config file is used in the next step.
- The following creates a compute instance and runs the image (pulled from the GCP Container Registry). This will pull from the data bucket, do the training, and push to the models bucket afterwards. See the docker/train/ folder for the entrypoint and dockerfile used.
make train-cloud
NB: The region (default: europe-west1) and name (default: training) of the training job are specified in the Makefile.
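The Vertex AI job is presumably described by config_vertexai_train_cpu.yaml (see the project tree); a hypothetical sketch of such a config for gcloud ai custom-jobs create (machine type and image URI are assumptions):

```yaml
# hypothetical config_vertexai_train_cpu.yaml
workerPoolSpecs:
  - machineSpec:
      machineType: n1-highmem-2
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/mlops-group8/trainer:latest
```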
Tested a few machines:
- n1-highmem-2: Approx 15 s per iteration (0.14$/hour)
- n1-highmem-32: Approx 1 s per iteration (2.2$/hour)
- c2-standard-16: Approx 1 s per iteration (1$/hour)
- c2-standard-30: Not tested (3.6$/hour) - NB: most likely needs a quota increase (Quota Increase Requests on GCP)
Validation and testing are implemented in the training loop.
- Validation evaluates the current model after each epoch during training on a set of unseen data.
- Testing uses the final model and evaluates on a new set of unseen data.
As of now they do not run independently, but this could easily be implemented if desired.
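The per-epoch validation and final test described above can be sketched as a plain-Python skeleton; the model, data and metric below are dummies for illustration, not the project's actual implementation:

```python
def accuracy(preds, targets):
    # fraction of correct predictions
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def evaluate(model, data):
    # run the model on a set of unseen (input, label) pairs and score it
    preds = [model(x) for x, _ in data]
    targets = [y for _, y in data]
    return accuracy(preds, targets)

def train(model, train_data, val_data, test_data, epochs=3):
    for epoch in range(epochs):
        # ... one pass of training on train_data would go here ...
        val_acc = evaluate(model, val_data)  # validation after each epoch
        print(f"epoch {epoch}: val accuracy {val_acc:.2f}")
    return evaluate(model, test_data)  # final test on a fresh set

# dummy "model" that always predicts class 0, and dummy (input, label) data
model = lambda x: 0
data = [(i, i % 5) for i in range(10)]  # labels cycle over the 5 classes
test_acc = train(model, data, data, data)
```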
Prediction is implemented using FastAPI as the back end and Streamlit as the front end.
Prediction can be run in one of the following three ways:
- Locally (NB: without FastAPI or Streamlit)
- Containerized (NB: you need to run at least api-fastapi, and if the frontend is desired also api_streamlit)
- Cloud Run, which is activated by the trigger and cloudbuild_dockerfiles_api.yaml
1 - Locally
- Run predict
make predict_test model=<path-to-model-file> path_image=<path-to-image-file>
2 - Local container
NB: You can run both FastAPI and Streamlit.
- Build container from dockerfile and run image:
make <api-fastapi/api_streamlit>
- (If you don't want to rebuild the image) Run:
docker compose up <api_fastapi/api_streamlit>
3 - Using API and Cloud Run
- On GCP a trigger has been set up for the GitHub repository using cloudbuild_dockerfiles_api.yaml, which runs every time the main branch is updated and rebuilds the API images.
- Create a Cloud Run service for each API and use the Docker image in gcr.io:
Fastapi:
gcr.io/mlops-group8/api_fastapi:latest
Streamlit:
gcr.io/mlops-group8/api_streamlit:latest
The API URLs will change according to the setup. For the project, the FastAPI (back-end) and Streamlit (front-end) URLs given by Cloud Run were used (presumably not active anymore).
The following contains miscellaneous information useful for development.
Pull from Google Cloud Bucket (must be logged in to gcp):
dvc pull
Create locally from the Kaggle dataset:
- Have the Rice_Image_Dataset folder saved to data/raw
- Kaggle API:
  - Go to www.kaggle.com and log in
  - Go to Settings -> API -> Create new token
  - Save the JSON file to your local home/user/.kaggle folder
- Run the make_dataset file:
make data
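The token created in the Kaggle steps above is a small JSON file (kaggle.json); its shape looks like this, with placeholder values:

```json
{
  "username": "<your-kaggle-username>",
  "key": "<your-api-key>"
}
```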
To create a smaller dataset for unit tests,
make unittest_data
pytest tests/ # to run all unit tests
pytest tests/test_data.py # to run a specific unit test
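A hypothetical sketch of what a data unit test like tests/test_data.py might check, using a dummy in-memory dataset instead of the real processed files:

```python
# Sketch of a data unit test; the loader and its contents are stand-ins,
# not the project's actual tests/test_data.py.
def load_dummy_dataset():
    # stand-in for loading data/processed: 10 images of 250x250 pixels
    images = [[[0] * 250 for _ in range(250)] for _ in range(10)]
    labels = [i % 5 for i in range(10)]  # labels cycle over the 5 classes
    return images, labels

def test_dataset_shapes():
    images, labels = load_dummy_dataset()
    assert len(images) == len(labels), "each image needs a label"
    assert all(len(img) == 250 and len(img[0]) == 250 for img in images)
    assert set(labels) <= set(range(5))  # only the 5 rice classes
```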
To run pytest together with coverage,
coverage run -m pytest tests/
coverage report # to get simple coverage report
coverage report -m # to get missing lines
coverage report -m --omit "/opt/ros/*" # to omit certain files
Enable the pre-commit
pre-commit install
Check the commit with pre-commit
pre-commit run --all-files
After this you can commit as normal. To omit/skip the pre-commit hooks use:
git commit -m "<message>" --no-verify
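The hooks installed above are configured in .pre-commit-config.yaml; a hypothetical sketch with two common hooks (the project's actual hooks may differ):

```yaml
# hypothetical .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
```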
To see the eva models available (use different model names if needed):
python -c "import timm; print(timm.list_models('*eva*'))"
Choose a model with size 224 (to match the image size in the pipeline).
Profiling is added to the evaluation script to show how it can be used. It can be done with the python profilers and Tensorboard.
Using python profilers
Saving profiling to output file:
mkdir outputs/profiling
python -m cProfile -o outputs/profiling/profiling_output.txt mlops_group8/eval_model.py
Show output from the file:
python mlops_group8/utility/profiling_pstats.py
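The utility script above presumably reads the dumped stats with the standard-library pstats module; a self-contained sketch of that flow, where slow_sum is a dummy workload standing in for eval_model.py:

```python
import cProfile
import os
import pstats
import tempfile

def slow_sum(n):
    # dummy workload standing in for the evaluation script
    return sum(i * i for i in range(n))

# collect stats programmatically (equivalent to `python -m cProfile -o <file>`)
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

out_file = os.path.join(tempfile.mkdtemp(), "profiling_output.txt")
profiler.dump_stats(out_file)  # binary stats file, despite the .txt name

# read the stats back, as a profiling_pstats.py utility might do
stats = pstats.Stats(out_file)
stats.sort_stats("cumulative").print_stats(5)  # top 5 by cumulative time
```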
Using TensorBoard
tensorboard --logdir=./log
- Make sure that all needed services are enabled on GCP:
- Cloud Build (in setting also enable Cloud Run, Service Accounts and Cloud Build)
- Cloud Run Admin API
- Cloud Storage (remember to make buckets public)
- Vertex AI
- Artifact Registry (remember to make images public)
Logs Explorer is extremely useful for logging and tracing errors on gcp.
The report for the course is found in the reports folder.
The directory structure of the project looks like this (minor folders and files are omitted):
├── Makefile <- makefile with convenience commands like `make data` or `make train`
├── README.md <- the top-level README for developers using this project.
├── data
│ ├── (processed) <- the final data sets for modeling (only available after data pull or command)
│ ├── (raw) <- the original, immutable data dump (only available after data pull)
│ └── test <- test data
│
├── docker <- dockerfiles and utilities (e.g. shell script for entrypoint)
│ ├── api_fastapi/
│ ├── api_streamlit/
│ └── train/
│
├── docker-compose.yaml <- Docker Compose configuration file for setting up project services
│
├── docs <- documentation folder (NOT used)
│ ├── index.md <- homepage for your documentation
│ ├── mkdocs.yml <- configuration file for mkdocs
│ └── source/ <- source directory for documentation files
│
├── models <- trained and serialized models, model predictions, or model summaries
│
├── notebooks <- jupyter notebooks.
│
├── pyproject.toml <- project configuration file
│
├── reports <- generated analysis as HTML, PDF, LaTeX, etc.
│ ├── figures <- generated graphics and figures to be used in reporting
│ ├── README.md <- answers to the report questions
│ └── report.py <- script for checking the markdown file and generating a html from it
│
├── requirements.txt <- the requirements file for reproducing the complete environment
├── requirements_dev.txt <- the requirements file for reproducing the complete environment for developers (extended installations)
├── requirements_predict.txt <- the requirements file for reproducing the prediction environment
├── requirements_tests.txt <- the requirements file for reproducing the test environment (unittests)
├── requirements_train.txt <- the requirements file for reproducing the training environment
│
├── tests <- test files for unittests
│ └── data/ <- data used for the unittests
│
├── mlops_group8 <- source code for use in this project.
│ │
│ ├── __init__.py <- makes folder a Python module
│ │
│ ├── config <- config files with hyperparameters and run settings
│ │ ├── __init__.py
│ │ ├── experiment/ <- individual config.yaml experiment files containing hyperparams etc.
│ │ └── default_config.yaml <- default config file used in training referring to experiment config.yaml file
│ │
│ ├── data <- scripts to download and/or generate data
│ │ ├── __init__.py
│ │ └── make_dataset.py
│ │
│ ├── utility <- scripts used as utility functions in multiple main scripts or minor misc. scripts used for testing functions etc.
│ │ ├── __init__.py
│ │ └── ...
│ │
│ ├── predict_fastapi.py <- script for predicting from a model, hosting back-end API by fastapi
│ ├── predict_model.py <- script for predicting from a model (used for local testing)
│ ├── streamlit_app.py <- script for hosting front-end app by streamlit
│ ├── sweep_train_model.py <- script for doing hyperparameter sweep on training the model
│ ├── train_model.py <- script for training the model
│ └── validate_model.py <- script for validating the model
│
├── .dvc <- DVC configurations and cache
│ └── config <- DVC configuration file
├── data.dvc <- DVC file for tracking changes and versions in the data directory bucket (gcp)
├── models.dvc <- DVC file for tracking changes and versions in the models directory bucket (gcp)
│
├── .pre-commit-config.yaml <- pre-commit configurations
├── cloudbuild_dockerfiles_api.yaml <- gcp cloudbuild file for deploying the model by the APIs
├── cloudbuild_dockerfiles_train.yaml <- gcp cloudbuild file for building and pushing the train image
├── config_vertexai_train_cpu.yaml <- gcp config file used for Vertex AI
│
└── LICENSE <- open-source license if one is chosen
This project exists thanks to the following contributors:
Lucas Sandby |
Yu Fan |
Esquivelrs |
Steven |
The MIT License (MIT)
Created using mlops_template, a cookiecutter template for getting started with Machine Learning Operations (MLOps).