/model-training

Model Training Demo

Primary LanguagePythonMIT LicenseMIT

Mushroom App: Serverless Model Training Demo

In this tutorial we will move code from notebooks to a python file and then perform serverless training jobs using Vertex AI:

Prerequisites

  • Have Docker installed
  • Cloned this repository to your local machine with a terminal up and running
  • Check that your Docker is running with the following command

docker run hello-world

Install Docker

Install Docker Desktop

Ensure Docker Memory

  • To make sure we can run multiple container go to Docker>Preferences>Resources and in "Memory" make sure you have selected > 4GB

Install VSCode

Follow the instructions for your operating system.
If you already have a preferred text editor, skip this step.

Setup Environments

In this tutorial we will setup a container to manage packaging python code for training and creating jobs on Vertex AI (AI Platform) to run training tasks.

In order to complete this tutorial you will need your GCP account setup and a WandB account setup.

Clone the github repository

  • Clone or download from here

API's to enable in GCP for Project

Search for each of these in the GCP search bar and click enable to enable these API's

  • Vertex AI API

Setup GPU Quotas

In order to do serverless training we need access to GPUs from GCP.

  • Go to Quotas in your GCP console
  • Filter by Quota: Custom model training and select a GPU type, e.g: Custom model training Nvidia T4 GPUs per region

  • Select a few regions

  • Click on EDIT QUOTAS

  • Put in a new limit and click NEXT
  • Enter your Name and SUBMIT REQUEST

  • This processes usually takes a few hours to get approved
  • Also based on how new your GCP account is, you may not be approved

Setup GCP Credentials

Next step is to enable our container to have access to Storage buckets & Vertex AI(AI Platform) in GCP.

Create a local secrets folder

It is important to note that we do not want any secure information in Git. So we will manage these files outside of the git folder. At the same level as the model-training folder create a folder called secrets

Your folder structure should look like this:

   |-model-training
   |-secrets

Setup GCP Service Account

  • Here are the step to create a service account:
  • To setup a service account you will need to go to GCP Console, search for "Service accounts" from the top search box. or go to: "IAM & Admins" > "Service accounts" from the top-left menu and create a new service account called "model-trainer". For "Service account permissions" select "Storage Admin", "AI Platform Admin", "Vertex AI Administrator".
  • This will create a service account
  • On the right "Actions" column click the vertical ... and select "Manage keys". A prompt for Create private key for "model-trainer" will appear select "JSON" and click create. This will download a Private key json file to your computer. Copy this json file into the secrets folder. Rename the json file to model-trainer.json

Create GCS Bucket

We need a bucket to store the packaged python files that we will use for training.

  • Go to https://console.cloud.google.com/storage/browser
  • Create a bucket mushroom-app-trainer [REPLACE WITH YOUR BUCKET NAME]

Get WandB Account API Key

We want to track our model training runs using WandB. Get the API Key for WandB:

  • Login into WandB
  • Go to to User settings
  • Scroll down to the API keys sections
  • Copy the key
  • Set an environment variable using your terminal: export WANDB_KEY=...

Run Container

Run docker-shell.sh or docker-shell.bat

Based on your OS, run the startup script to make building & running the container easy

This is what your docker-shell file will look like:

export IMAGE_NAME=model-training-cli
export BASE_DIR=$(pwd)
export SECRETS_DIR=$(pwd)/../secrets/
export GCS_BUCKET_URI="gs://mushroom-app-trainer" [REPLACE WITH YOUR BUCKET NAME]
export GCP_PROJECT="mlproject01-207413" [REPLACE WITH YOUR PROJECT]


# Build the image based on the Dockerfile
docker build -t $IMAGE_NAME -f Dockerfile .
# M1/2 chip macs use this line
#docker build -t $IMAGE_NAME --platform=linux/arm64/v8 -f Dockerfile .

# Run Container
docker run --rm --name $IMAGE_NAME -ti \
-v "$BASE_DIR":/app \
-v "$SECRETS_DIR":/secrets \
-e GOOGLE_APPLICATION_CREDENTIALS=/secrets/model-trainer.json \
-e GCP_PROJECT=$GCP_PROJECT \
-e GCS_BUCKET_URI=$GCS_BUCKET_URI \
-e WANDB_KEY=$WANDB_KEY \ [MAKE SURE YOU HAVE THIS VARIABLE SET]
$IMAGE_NAME
  • Make sure you are inside the model-training folder and open a terminal at this location
  • Run sh docker-shell.sh or docker-shell.bat for windows
  • The docker-shell file assumes you have the WANDB_KEY as an environment variable and is passed into the container

Package & Upload Python Code

Review Trainer Code

  • Open & Review model-training > package > setup.py
  • All required third party libraries needed for training are specified in setup.py
  • Open & Review model-training > package > trainer > task.py
  • All training code for the mushroom app models are present in task.py

Run sh package-trainer.sh

  • This script will create a trainer.tar.gz file with all the training code bundled inside it
  • Then this script will upload this packaged file to your GCS bucket can call it mushroom-app-trainer.tar.gz

Create Jobs in Vertex AI

  • Open & Review model-training > cli.sh
  • cli.sh is a script file to make calling gcloud ai custom-jobs create easier by maintaining all the parameters in the script
  • Make any required changes to your cli.sh file:
  • List of ACCELERATOR_TYPE are:
    • NVIDIA_TESLA_T4
    • NVIDIA_TESLA_K80
    • NVIDIA_TESLA_P100
    • NVIDIA_TESLA_P4
    • NVIDIA_TESLA_A100
    • NVIDIA_TESLA_V100
  • List of some GCP_REGION are:
    • us-central1
    • us-east1
    • us-east4
    • us-south1
    • us-west1
    • ...
  • Run sh cli.sh

OPTIONAL:Create Jobs in Vertex AI using CPU

  • Edit your cli.sh to not pass the accelerator-type and accelerator-count
  • Run sh cli.sh

View Jobs in Vertex AI

  • Go to Vertex AI Custom Jobs
  • You will see the newly created job ready to be provisioned to run.

View Training Metrics

  • Go to WandB
  • Select the project mushroom-training-vertex-ai
  • You will view the training metrics tracked and automatically updated

OPTIONAL: Multi GPU Training

  • Open & Review model-training > cli-multi-gpu.sh
  • Open & Review model-training > package > trainer > task_multi_gpu.py
  • cli-multi-gpu.sh is a script file to make calling gcloud ai custom-jobs create easier by maintaining all the parameters in the script
  • Make any required changes to your cli-multi-gpu.sh
  • Run sh cli-multi-gpu.sh