- ⭐ -> https://github.com/iterative/dvc
- ⭐ -> https://github.com/iterative/dvclive
- ⭐ -> https://github.com/iterative/cml
- ⭐ -> https://github.com/huggingface/transformers
Completed solution: https://github.com/iterative/workshop-uncool-mlops-solution
The DVC Pipeline is defined in the `dvc.yaml` file.
The pipeline is composed of stages using Python scripts, defined in src:
flowchart TD
node1[compute-data-metrics]
node2[eval]
node3[get-data]
node4[split-data]
node5[train]
node3-->node1
node3-->node4
node4-->node2
node4-->node5
node5-->node2
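The edges in the flowchart determine which stages must finish before others can start. As a rough illustration only (this is not how DVC is implemented), the same graph can be written as a dictionary and sorted with Python's standard `graphlib` module. The names `STAGE_DEPS` and `execution_order` are made up for this sketch.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on, copied from the flowchart edges.
STAGE_DEPS = {
    "get-data": set(),
    "compute-data-metrics": {"get-data"},
    "split-data": {"get-data"},
    "train": {"split-data"},
    "eval": {"split-data", "train"},
}


def execution_order(deps):
    """Return one valid order in which the stages could be executed."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(STAGE_DEPS)
print(order)
```

Any order it prints starts with `get-data` and places `train` before `eval`, which matches the constraint that a stage runs only after its dependencies.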
We use DVC Params, defined in params.yaml, to configure the pipeline.
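Nested parameters are usually referred to by a dotted name, such as the `train.epochs` value changed later in this guide. A minimal sketch of that lookup is shown here, assuming `params.yaml` has already been parsed into a dictionary. The helper `get_param` and the value `10` are hypothetical and not taken from the repository.

```python
def get_param(params, dotted_key):
    """Look up a nested value using a dotted name such as "train.epochs"."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


# Hypothetical contents of params.yaml after parsing; the real values
# live in the repository and may differ.
example_params = {"train": {"epochs": 10}}

print(get_param(example_params, "train.epochs"))
```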
The pipeline can be reproduced locally:
Local Reproducibility
git clone git@github.com:iterative/workshop-uncool-mlops.git
cd workshop-uncool-mlops
pip install -r requirements.txt
dvc repro
This generates DVC Metrics and DVC Plots to evaluate model performance, which can be found in the `outs` folder.
These files are small enough to be tracked by git, so after we run the pipeline we can share the results with others:
git add dvc.lock outs
git commit -m "Run pipeline"
git push
- Go to https://studio.iterative.ai (It's free)
- Connect your GitHub account.
- Add a new view.
https://studio.iterative.ai/user/daavoo/views/workshop-uncool-mlops-solution-ix8fxl0eob
More info:
You should be able to follow all the steps below without leaving the browser.
Navigate to your fork and press `.` (the period key),
or change the URL from "github.com" to "github.dev".
DVC remotes provide a location to store arbitrarily large files and directories.
First, create a new folder in your Google Drive, navigate to the folder, and copy the last part of the URL.
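The "last part of the URL" is the folder ID. A small sketch of extracting it follows, assuming the folder URL has the form `https://drive.google.com/drive/folders/<ID>`. The function name `gdrive_remote_url` and the placeholder ID are invented for illustration.

```python
from urllib.parse import urlsplit


def gdrive_remote_url(folder_url):
    """Build a gdrive:// remote URL from a Google Drive folder URL."""
    # The folder ID is the last non-empty path segment; a query string such
    # as "?usp=sharing" is not part of the path, so it is ignored.
    segments = [s for s in urlsplit(folder_url).path.split("/") if s]
    if not segments:
        raise ValueError(f"no folder ID found in: {folder_url}")
    return f"gdrive://{segments[-1]}"


# "FOLDER_ID_PLACEHOLDER" is a made-up ID used only for illustration.
print(gdrive_remote_url("https://drive.google.com/drive/folders/FOLDER_ID_PLACEHOLDER"))
```

The resulting string is what goes in place of the `gdrive://...` URL when configuring the remote.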
You can now add a DVC remote to your project:
From web
Update `.dvc/config`:
https://github.com/iterative/workshop-uncool-mlops-solution/blob/main/.dvc/config
From CLI
dvc remote add --default myremote gdrive://{YOUR_URL}
More info:
https://dvc.org/doc/command-reference/remote/add#description
Other remote?:
https://dvc.org/doc/command-reference/remote/add#supported-storage-types
The results of the pipeline can now be shared with others by using dvc push and dvc pull.
You will be prompted for Google Drive credentials the first time you run dvc push/pull.
Shared Reproducibility
# Researcher A
# Updates hparam
dvc repro
git add .
git commit -m "Updated hparam"
git push && dvc push
# Researcher B
git pull && dvc pull
# Receives all changes
You need to grant GitHub access to the DVC Remote:
From web
- Get the credentials:
https://colab.research.google.com/drive/1Xe96hFDCrzL-Vt4Zj-cVHOxUgu-fyuBW
- Create a new GitHub Secret:
secrets.GDRIVE_CREDENTIALS_DATA
From CLI
- Get the credentials:
cat ".dvc/tmp/gdrive-user-credentials.json"
- Create a new GitHub Secret:
secrets.GDRIVE_CREDENTIALS_DATA
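A pasted secret that is truncated or not valid JSON tends to surface only as a failed `dvc pull` inside the workflow. One possible pre-flight check is sketched below, assuming the workflow exposes the secret to the job as an environment variable named `GDRIVE_CREDENTIALS_DATA`. The function name is hypothetical, and the check does not verify that the credentials are actually accepted by Google Drive.

```python
import json


def credentials_look_valid(env):
    """Return True if GDRIVE_CREDENTIALS_DATA is present and parses as a JSON object."""
    raw = env.get("GDRIVE_CREDENTIALS_DATA", "")
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


# In a workflow step this could be called as credentials_look_valid(os.environ).
# The dictionaries below are placeholders, not real credentials.
print(credentials_look_valid({"GDRIVE_CREDENTIALS_DATA": '{"placeholder": "value"}'}))
print(credentials_look_valid({}))
```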
Then, you can create a workflow that runs when a Pull Request is created:
Create and fill `.github/workflows/on_pr.yml`
https://github.com/iterative/workshop-uncool-mlops-solution/blob/main/.github/workflows/on_pr.yml
And now you can reproduce the pipeline from the web:
From GitHub
- Edit `params.yaml` from the GitHub Interface.
- Change `train.epochs`.
- Select `Create a new branch for this commit and start a pull request`.
From Studio
- https://studio.iterative.ai/user/daavoo/views/workshop-uncool-mlops-5fgmd70rkt
- Click on the `Run new experiment` button.
More compute?:
https://cml.dev/doc/self-hosted-runners
--
You can also create a workflow that runs on a daily schedule:
Create and fill `.github/workflows/daily.yml`
https://github.com/iterative/workshop-uncool-mlops-solution/blob/main/.github/workflows/daily.yml
For deployment, you can create a workflow that builds and deploys a Docker image.
Create and fill `Dockerfile`
https://github.com/iterative/workshop-uncool-mlops-solution/blob/main/Dockerfile
Create and fill `.github/workflows/deploy_model.yml`
https://github.com/iterative/workshop-uncool-mlops-solution/blob/main/.github/workflows/deploy_model.yml
You can use the published image from anywhere:
Create and fill `.github/workflows/issue_labeler.yml`
See predictions on a newly created issue:
Or use from anywhere:
docker run "ghcr.io/iterative/workshop-uncool-mlops-solution:main" "dvc pull fails when using my S3 remote"
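The same call can be made from a script. The sketch below wraps the `docker run` command above with Python's `subprocess` module. The helper names `build_command` and `predict` are made up, and because the container's output format is not described here, the function returns the raw text it prints.

```python
import subprocess

IMAGE = "ghcr.io/iterative/workshop-uncool-mlops-solution:main"


def build_command(issue_title, image=IMAGE):
    """Return the docker invocation as an argument list (no shell quoting needed)."""
    return ["docker", "run", image, issue_title]


def predict(issue_title, runner=subprocess.run):
    """Run the container and return whatever it prints to stdout."""
    result = runner(
        build_command(issue_title), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Only the command is printed here; calling predict() requires Docker.
print(build_command("dvc pull fails when using my S3 remote"))
```

Passing the title as a separate list element avoids shell quoting issues with titles that contain quotes or spaces.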