Machine Learning (ML) applications can change in three axes (data, code and model) and we need to implement a mechanism to track the changes during the ML experimentation process. This repository is for active development of a demo that can demosntrate how to use Amazon Sagemaker for the purpose of ML application development and use Amazon SageMaker Experiment and Data Version Control (DVC) to track both ML model and dataset.
For the purpose of this demo, we use:
- Git to track changes in code,
- DVC to track changes in datasets and model stored in Amazon S3
- Amazon SageMaker Experiment to capture the experiment observations
- AWS Account
- Git
- DVC
Note: This initial setup needs to be done just for the first time
Create an Empty Repository (e.g. Github) for source code (e.g. sagemaker-dvc-demo)
Setup a Notebook instance in Amazon Sagemaker
Note: make sure to attach S3 Full acess policy to sagemaker execution role
Create an S3 Bucket to store data and DVC artifacts
Note: Make sure to create this bucket in the same region as of Amazon Sagemaker instance
Create the following folder structure in S3 bucket:
sagemaker
|----train
|----validation
|----test
|----output
Setup git...
-
From Notbook instance, open Jupyter lab and from Git menu bar, choose “open git repo in terminal”
-
Use the following commands to setup git in SageMaker Notebook
cd SageMaker mkdir sagemaker-dvc-demo cd sagemaker-dvc-demo git config --global user.name "USERNAME" #Replace USERNAME with your source control username git config --global user.email "EMAIL" #Replace with your source control accouunt email address git init git remote add origin <remote> #Replace REMOTE with your git repo. url
-
(Optional) To avoid entering password with every git command, we recommand you to generate a ssh key, add it to your git account and update the remote origin url
ssh-keygen -t rsa -b 4096 -C "EMAIL" ##Replace with your source control account email address cat ~/.ssh/id_rsa.pub # Copy the content and add it as ssh key to git account git remote set-url origin git+ssh://git@github.com/USERNAME/REPONAME.git # Replace USERNAME and REPONAME with yours
-
Pull the repo
git pull origin master
Install DVC & commit changes to Git
pip install dvc # dvc version 1.6.4 is used for the purpose of this demo
dvc init
git commit -m "initial commit"
Note: DVC automatically add the changes to git commit
Setup the remote for DVC (we use S3 as the remote for DVC) and commit the changes into git
dvc remote add -d storage s3://BUCKETNAME/ # use the bucket name that was created in step 3
git commit .dvc/config -m "initialize DVC local remote"
dvc remote add s3cache s3://BUCKETNAME/cache # use the bucket name that was created in step 3
dvc config cache.s3 s3cache
dvc add --external s3://BUCKETNAME/sagemaker/train
dvc add --external s3://BUCKETNAME/sagemaker/validation
dvc add --external s3://BUCKETNAME/sagemaker/test
dvc add --external s3://BUCKETNAME/sagemaker/output
git add .dvc/config train.dvc validation.dvc test.dvc output.dvc
git commit -m "add source data to DVC"
git push --set-upstream origin master
Go to sagemaker-dvc-demo folder, create a new notebook (use conda_python3 kernel) and follow the main.ipynb file in this repo. to integrate Amazon SageMaker experiment with DVC and track the changes.