This toolbox facilitates synchronization between a Kubernetes (Kube) cluster and local GPU machines, streamlining the process of running and managing batch experiments. The main advantages of using this toolbox include:
- Synchronization of environment variables (found in .env files) between Nautilus cluster executions and local runs, allowing easy access to credentials via os.environ.
- Simplification of the process of loading environment variables, Python environments, and startup scripts.
- Data and output folder synchronization through S3 storage.
- Central management of all potential hyperparameters for various datasets and models.
- Compatibility with a wide range of projects, even those already utilizing configuration files.
- Capability to execute all combinations of experiments in parallel with a single command.
Python packages:
- boto3 (required)
- pyyaml (required)
- loguru (optional, for better logging)
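All three packages are available on PyPI; for example, they can be installed with:
pip install boto3 pyyaml loguru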
Other requirements:
- kubectl (required, for interacting with Kubernetes cluster)
- s5cmd (optional, must be in PATH; enables faster S3 operations)
- Use a hyphen (-) instead of an underscore (_) as the separator in your project name.
- Setup is much easier if your GitHub repo name, GitLab repo name, and conda environment name all match and you carefully follow all naming conventions. While you can customize things when some of them do not match, I cannot test every scenario.
Imagine a scenario where you are handling machine learning workloads: you have two datasets you wish to use both on the cluster and locally, and two baseline models to evaluate.
Start by creating a new git repository with src and data directories:
mkdir example; cd example; mkdir src; mkdir data; git init
Add the toolbox repository as a submodule:
git submodule add https://github.com/Rose-STL-Lab/Zihao-s-Toolbox.git src/toolbox
Create symbolic links for Makefile and launch.py at the root of your workspace:
ln -s src/toolbox/launch.py .
ln -s src/toolbox/Makefile .
Generate two baseline files and two datasets as follows:
echo "1,2,3" > data/1.txt
echo "4,5,6" > data/2.txt
echo "import sys; print(sum(map(float, open(sys.argv[1]).read().split(','))) / 3)" > src/avg.py # Compute average
echo "import sys, statistics; print(statistics.median(map(float, open(sys.argv[1]).read().strip().split(','))))" > src/med.py # Compute median
Create a .env file with your S3 bucket details:
vim .env
S3_BUCKET_NAME=example
S3_ENDPOINT_URL=https://s3-west.nrp-nautilus.io
Set up your Kubernetes configuration in config/kube.yaml (replace <NAMESPACE> and <USER> with your details):
mkdir -p config; vim config/kube.yaml
project_name: example
namespace: <NAMESPACE>
user: <USER>
image: gitlab-registry.nrp-nautilus.io/zihaozhou/example
Note: Here we have created the image for you. If you want to create your own image, please refer to Section 4. After creating your own image, you no longer need to specify the image field in kube.yaml (unless you are collaborating with others and the repo is created under their account).
Define the experiment configurations in config/launch.yaml:
vim config/launch.yaml
project_name: example

model:
  average:
    # <fn> will be automatically replaced by hparam values
    command: >-
      [](make download file=data/;) python src/avg.py <fn>
    # # Model can also have hparam, hparam could be either list or single value
    # hparam:
    #   hid_dim: [256, 512]
    #   ...
    # # Override *non-projectwise* kube config, see Section 2
    gpu_count: 0
    # memory: 8 # 8GiB
    # image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/minimal
    # ...
  median:
    command: >-
      [](make download file=data/;) python src/med.py <fn>
    gpu_count: 0

dataset:
  data1:
    hparam:
      # Launch will automatically consider all combinations of hparam
      # hparam preceded by _ will NOT appear in the job name
      # If your hparam contain special characters, you must _ them
      _fn: data/1.txt
  data2:
    hparam:
      _fn: data/2.txt

run:
  dataset: [data1]
  model: [average]
- [](make download file=data/;) is syntactic sugar. Sometimes we expect a slight difference between local and remote commands (for example, we don't need to re-download data in local runs). Here we use [] to indicate the local command (which does nothing here) and () to indicate the remote command; see the sketch after this list for how a command resolves.
- Another useful piece of syntactic sugar is ##comment##, as YAML does not support comments in multi-line strings. You can use a pair of ## to indicate comments, and they will be removed before execution.
- In python src/med.py <fn>, <fn> will be replaced by the values defined in the hparam section. Hyperparameters can be defined in both the model and dataset sections. They are placeholders so that you don't need to copy and paste the same command with slight modifications.
- gpu_count is a model-wise / dataset-wise Kubernetes configuration that can be overridden in launch.yaml. See Section 2 for all overridable fields. If you don't specify such fields, they will be inherited from kube.yaml. If you specify the same field in both model and dataset, the one in model takes precedence.
- The run section specifies the combinations of experiments to run. You can also add hparam to the run section to specify the hyperparameters you want to run. If you don't specify hparam, all possible combinations of hyperparameters will be run.
- You can specify an additional file section (example: file: [src/temp.py]). Then, when you run make pod or make job, the specified files will be automatically uploaded to the pod or job (overwriting any preexisting files). This is particularly useful when you are debugging and don't want to make a git commit. By default, config/kube.yaml, config/launch.yaml, and .env will be uploaded. You can specify file: null to disable this behavior. You can also run make copy pod=<pod_name> to upload files to a running pod.
- This only supports a limited number of text files and fills the command section with encoded text. The advantage is that you don't need to worry about file uploads for every job or pod creation. If your file section is too long, the pod could fail due to the command length limit.
- The hparam sections can be a list of hparam dictionaries with the same keys; see below for an example. Why do we need this? Sometimes we don't want to run all combinations of hyperparameters, but only a subset of them. In the example below, make will create three jobs: train=paper, train=original, and train=scale.

hparam:
  train:
    paper:
      _learning_rate: 0.000000008192
      _lr_scheduler: linear
      _lr_warmup_steps: 0
    original:
      _learning_rate: 1e-5
      _lr_scheduler: constant
      _lr_warmup_steps: 500
    scale:
      _learning_rate: 1e-5
      _lr_scheduler: constant
      _lr_warmup_steps: 500
- Overridable kube fields can also be specified directly at the root level of launch.yaml. For example:

model: ...
dataset: ...
run: ...
gpu_count: 1

They will override the corresponding fields in kube.yaml.
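As a concrete example of the command template and syntax sugar above: with the average model and _fn: data/1.txt, the command resolves roughly as follows (a sketch of the substitution, not literal toolbox output):

# Template in launch.yaml:
#   [](make download file=data/;) python src/avg.py <fn>
# Local run: keep the [] part (empty here), drop the () part
python src/avg.py data/1.txt
# Remote run: keep the () part
make download file=data/; python src/avg.py data/1.txt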
Execute the following command to run experiments locally: make local
Example output:
Running {
"_fn": "data/1.txt"
} ...
python src/avg.py data/1.txt
2.0
Change the run section in config/launch.yaml to
run:
  dataset: [data1, data2]
  model: [average, median]
and execute make local to run all possible combinations of experiments sequentially.
Running {
"_fn": "data/1.txt"
} ...
python src/avg.py data/1.txt
2.0
Running {
"_fn": "data/1.txt"
} ...
python src/med.py data/1.txt
2.0
Running {
"_fn": "data/2.txt"
} ...
python src/avg.py data/2.txt
5.0
Running {
"_fn": "data/2.txt"
} ...
python src/med.py data/2.txt
5.0
If you have an S3 bucket, update the credentials in .env and use the following command to upload your dataset: make upload file=data/
Example output:
Uploaded data/1.txt to data/1.txt
Uploaded data/2.txt to data/2.txt
If you have installed s5cmd, make will automatically use it as the backend for S3 operations to improve performance. Otherwise, it will use boto3.
If you don't have an S3 bucket, request access from the Nautilus Matrix chat or use a bucket provided by rosedata.ucsd.edu. You don't need a bucket for this tutorial.
To create a remote pod, run: make pod
Example output:
pod/<YOUR-USERNAME>-example-interactive-pod created
You can now open a shell in the pod by running kubectl exec -it <YOUR-USERNAME>-example-interactive-pod -- /bin/bash. Now, run make download file=data/; make local. You should see exactly the same output as running locally.
To run all possible combinations of experiments in parallel with Nautilus, run: make job
Example output:
Job 'example-average-data1' not found. Creating the job.
job.batch/example-average-data1 created
Job 'example-median-data1' not found. Creating the job.
job.batch/example-median-data1 created
Job 'example-average-data2' not found. Creating the job.
job.batch/example-average-data2 created
Job 'example-median-data2' not found. Creating the job.
job.batch/example-median-data2 created
After a while, you can run kubectl logs -f <YOUR-USERNAME>-example-average-data1 to check the logs of the job. You would see:
Downloaded data/1.txt to ./data/1.txt
Downloaded data/2.txt to ./data/2.txt
2.0
Finally, run make delete to clean up all workloads. Be careful: make delete operates by removing all pods and jobs under your user label.
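To preview what would be removed before deleting, you can list your workloads by label; a sketch, assuming the label key is literally user (check Section 2 and your own kube.yaml, as your setup may differ):

kubectl get pods,jobs -n <NAMESPACE> -l user=<USER>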
config/kube.yaml:
##### Project-wise configuration, should be the same across all experiments
# Will not be overwritten by the launch.yaml
project_name: str, required, used for k8s resource names, env name and more; no underscore (_), hyphen (-) allowed
user: str, required, k8s user name
namespace: str, required, k8s namespace
# If you want to use a different environment name
conda_env_name: str, default to <project_name>
##### Other fields, can be overwritten in launch.yaml #####
# env will be overridden by `.env`, therefore never effective in `kube.yaml`
# however, specifying env in `launch.yaml` can add new env variables
env:
  <env-key>: <env-value>
## If startup_script is not explicitly specified, the script will automatically pull the latest git repo using ssh_host and ssh_port, and activate the default environment using conda_home and conda_env_name.
startup_script: str, default to pull the latest git repo and submodules, activate the default conda environment, switch external S3 to internal S3 endpoint
extra_startup_script: str, default to empty, if you want to run the default script and add a few lines of additional commands
conda_home: str, default to /opt/conda
ssh_host: str, default to gitlab-ssh.nrp-nautilus.io
ssh_port: int, default to 30622
# Command for interactive pod
server_command: str, default to `sleep infinity`
## For CPU and Memory, the limit will be twice the requested
gpu_count: int, default to 0
cpu_count: int, in cores, default to 5
ephemeral_storage: int, in gigabytes, default to 100
memory: int, in gigabytes, default to 32
## Mount PVC to path
volumes:
  <pvc-name>: <mount-path>
## Image pull related
image: str, default to <registry_host>/<gitlab_user>/<project_name>:latest
gitlab_user: str, default to <user>
registry_host: str, default to gitlab-registry.nrp-nautilus.io
image_pull_secrets: str, default to <project-name>-read-registry
## Prefix of the names of your workloads
prefix: str
## Will tolerate no-schedule taints
tolerations:
  - <toleration-key>
gpu_whitelist:
  - <usable-gpu-list>
hostname_blacklist:
  - <unusable-node-hostnames-list>
## High-performance GPU specified in https://ucsd-prp.gitlab.io/userdocs/running/gpu-pods/#choosing-gpu-type. Example: "a100", "rtxa6000". Once set, gpu_whitelist and gpu_blacklist will be ignored.
special_gpu: str
gpu_whitelist and gpu_blacklist cannot both be set. If gpu_whitelist is set, only the specified GPUs will be used. If gpu_blacklist is set, all GPUs except the specified ones will be used. The same applies to hostname_blacklist and hostname_whitelist.
Example GPU list:
- NVIDIA-TITAN-RTX
- NVIDIA-RTX-A4000
- NVIDIA-RTX-A5000
- Quadro-RTX-6000
- NVIDIA-GeForce-RTX-3090
- NVIDIA-GeForce-GTX-1080-Ti
- NVIDIA-GeForce-GTX-2080-Ti
- NVIDIA-A10
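For example, to restrict your workloads to a couple of GPU models from the list above, the whitelist in config/kube.yaml could look like this (a sketch; pick the models that fit your workload):

gpu_whitelist:
  - NVIDIA-GeForce-RTX-3090
  - NVIDIA-RTX-A5000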
Ensure consistency in project_name across your GitLab repository (<project_name>.git), conda environment (envs/<project_name>), image pull secret (<project_name>-read-registry), and S3 configuration (<project_name>-s3cfg). Avoid underscores in project_name (use hyphen instead).
Your GitLab username is used as the user value to label your kube workloads. For registry details, refer to the GitLab container registry documentation.
Create a .env file in your project repository with these values:
S3_BUCKET_NAME=<your_s3_bucket_name>
AWS_ACCESS_KEY_ID=<your_access_key>
AWS_SECRET_ACCESS_KEY=<your_secret_key>
S3_ENDPOINT_URL=https://...
Load the environment with export $(grep -v '^#' .env | xargs -d '\n') or through make commands.
You can perform wildcard searches, downloads, uploads, or deletions on S3 files:
❯ make interactive file='Model/*t5*wise*419*'
Local files matching pattern:
S3 files matching pattern:
Model/Yelp/t5_model_12weeks_wise-sky-249-northern-dawn-288_epoch_1419/config.json
...
Choose an action [delete (local), remove (S3), download, upload, exit]:
Use single quotes to prevent shell wildcard expansion. The S3 bucket will sync with your current directory by default, maintaining the original file structure and creating necessary directories.
Beyond Make commands, you can also directly import the functions from src/toolbox/s3utils.py into your Python scripts.
from toolbox.s3utils import download_s3_path

dataset = "my_dataset"  # example value; any folder name under Data/ in your bucket
folder_path = f"Data/{dataset}"
download_s3_path(folder_path)  # mirrors the S3 folder into the same path locally
If your Python script is invoked via make, the environment variables will be automatically loaded.
4. Example Creation of Nautilus Gitlab Image
This section will guide you through the process of creating a GitLab Docker image based on your git repo using the Nautilus platform. This is useful if you want to automate your deployment and integration workflows using GitLab's CI/CD features. The resulting image integrates nicely with Kubeutils.
Note: If the Nautilus SSH endpoint is no longer gitlab-ssh.nrp-nautilus.io:30622, please modify SSH_CONFIG and .gitlab-ci.yml accordingly.
Before you begin, make sure you have:
- A GitHub account, which you can register for here.
- A Nautilus GitLab account, which you can register for here.
- Familiarity with SSH, Docker containers, and continuous integration and deployment (CI/CD) concepts.
- Create a git repo at Nautilus GitLab with the name example. Don't initialize the repository with any file. If you want to use a different name, remember to replace example with your repo name in the following steps. Also, make sure the name is all lowercase and without any special characters.
- Create a git repo with the same name at GitHub.
- Generate an SSH key pair named example on your local machine using the following command: ssh-keygen -f example -N "".
- Generate an SSH key pair named example-write on your local machine using the following command: ssh-keygen -f example-write -N "".
Note: The example key is kept in the image for pulling the code from the private repository, while the example-write key is used for mirroring the code to GitLab. Be careful: if you accidentally leave the example-write key in the image and later make the image public, anyone can push code to your repository.
- Add the public key example.pub to GitLab Repo - Settings - Deploy Keys. Title: example; don't grant write permission.
- Add the public key example-write.pub to GitLab Repo - Settings - Deploy Keys. Title: example-write; grant write permission.
- Add the public key example.pub to GitHub Repo - Settings - Deploy Keys. Title: example; don't grant write permission.
- Deploy tokens are used to securely download (pull) Docker images from your GitLab registry without requiring sign-in. Under GitLab Repo - Settings - Repository - Deploy Tokens, create a new deploy token named example-write-registry. Grant both write_registry and read_registry access. Take note of the username and password of this token for GitLab CI.
- Create a new deploy token named example-read-registry. Grant read_registry access. Take note of the username and password of this token for Kubernetes experiments.
Note: The example-write-registry token is used for pushing the image to the registry from GitHub, while the example-read-registry token is used in the kube cluster to pull the image.
- Run the following command to upload the read token to the cluster.
kubectl create secret docker-registry example-read-registry \
--docker-server=gitlab-registry.nrp-nautilus.io \
--docker-username=<username> \
--docker-password=<password>
- In Github Repo - Settings - Secrets and variables - Actions, enter the following repository secrets:
- SSH_CONFIG: SG9zdCBnaXRodWIuY29tCiAgSG9zdE5hbWUgZ2l0aHViLmNvbQogIFVzZXIgZ2l0CiAgSWRlbnRpdHlGaWxlIH4vLnNzaC9pZF9yc2EKCkhvc3QgZ2l0bGFiLXNzaC5ucnAtbmF1dGlsdXMuaW8KICBIb3N0TmFtZSBnaXRsYWItc3NoLm5ycC1uYXV0aWx1cy5pbwogIFVzZXIgZ2l0CiAgUG9ydCAzMDYyMgogIElkZW50aXR5RmlsZSB+Ly5zc2gvaWRfcnNhCgo=, which is the base64 encoding of:
Host github.com
HostName github.com
User git
IdentityFile ~/.ssh/id_rsa
Host gitlab-ssh.nrp-nautilus.io
HostName gitlab-ssh.nrp-nautilus.io
User git
Port 30622
IdentityFile ~/.ssh/id_rsa
- DOCKER_PASSWORD: the write password from the previous step.
- DOCKER_USERNAME: the write username from the previous step.
- GIT_DEPLOY_KEY: the base64 encoding of the read deploy key you created (base64 -i example; don't include any newlines). See the sketch right after this list of secrets.
- GITLAB_DEPLOY_KEY: the base64 encoding of the write deploy key you created (base64 -i example-write; don't include any newlines).
- GITLAB_USERNAME: your GitLab username, which is in the middle of your GitLab repo URL.
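For reference, the base64 values can be produced on the command line; a minimal sketch, assuming you saved the decoded SSH config above to a file named ssh_config (a hypothetical filename):

base64 -i ssh_config | tr -d '\n'        # value for SSH_CONFIG
base64 -i example | tr -d '\n'           # value for GIT_DEPLOY_KEY
base64 -i example-write | tr -d '\n'     # value for GITLAB_DEPLOY_KEY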
- Create the following files under your repo:
environment.yml
name: example
channels:
- conda-forge
- nvidia
dependencies:
- python=3.11.*
- pip
- poetry=1.*
Dockerfile
- Can be copied using cp src/toolbox/Dockerfile .
.github/workflows/docker.yml
- Can be copied using mkdir -p .github/workflows; cp src/toolbox/workflows/docker.yml .github/workflows/docker.yml
.github/workflows/mirror.yml
- Can be copied using mkdir -p .github/workflows; cp src/toolbox/workflows/mirror.yml .github/workflows/mirror.yml
You should verify the environment creation on your local machine:
conda env create -n example --file environment.yml
conda activate example
poetry init
## Add dependencies interactively or through poetry add
## Examples:
poetry source add --priority=explicit pytorch-gpu-src https://download.pytorch.org/whl/<cuda_version>
poetry add --source pytorch-gpu-src torch
poetry add numpy==1.26.2
...
## Run code on your local machine to make sure all required dependencies are installed.
This procedure creates the lock file, poetry.lock. Commit it to the git repository. Pushing to GitHub will build the image. Any modification of the environment-related files (see the workflow file) will trigger an image update.
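A typical commit at this point might look like the following (a sketch; adjust the file list and branch name to your repository):

git add environment.yml pyproject.toml poetry.lock Dockerfile .github/workflows/
git commit -m "Add environment and CI configuration"
git push origin main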
You may check out https://github.com/ZihaoZhou/example and https://gitlab.nrp-nautilus.io/ZihaoZhou/example as a reference.
If you:
- Don't need CI,
- Don't have full control over the repository,
- Use a different branch than main or master,
- Are not allowed to use GitHub, or
- Just want to build the image manually,
you can follow the steps below.
- Create a git repo at Nautilus Gitlab ... (same as above)
- Create a git repo at Github or any other git hosting service ...
- Generate a read SSH key pair ...
- Generate a write SSH key pair ...
- Add the public key to Gitlab ...
- Create a folder .ssh/ in your working directory, copy your SSH read private key to .ssh/, and rename it to id_rsa. Create the .ssh/config file with the following content:
Host gitlab-ssh.nrp-nautilus.io
HostName gitlab-ssh.nrp-nautilus.io
User git
Port 30622
IdentityFile ~/.ssh/id_rsa
Host github.com
HostName github.com
User git
IdentityFile ~/.ssh/id_rsa
...
(or other git hosting service)
Warning: Add /.ssh* to your .gitignore to avoid uploading your secret credentials to the repository.
- Run ssh-keyscan -p 30622 gitlab-ssh.nrp-nautilus.io >> .ssh/known_hosts and ssh-keyscan github.com >> .ssh/known_hosts (or the equivalent for another git hosting service) to add the host keys to known_hosts.
- Copy the Dockerfile to your local working directory, then change PROJECT_NAME to your project name and PROJECT_SSH_URL to your hosting service URL. You may switch to a different branch by adding --branch <branch-name> after git clone.
- Run the following commands to build and push the image:
docker build -t gitlab-registry.nrp-nautilus.io/<user-name>/<project-name>:<custom-tag> .
docker login gitlab-registry.nrp-nautilus.io
# Enter your write-registry username and password
docker push gitlab-registry.nrp-nautilus.io/<user-name>/<project-name>:<custom-tag>
docker tag gitlab-registry.nrp-nautilus.io/<user-name>/<project-name>:<custom-tag> gitlab-registry.nrp-nautilus.io/<user-name>/<project-name>:latest
docker push gitlab-registry.nrp-nautilus.io/<user-name>/<project-name>:latest
- If you encounter any errors, you may comment out the failing lines in the Dockerfile, then run docker run -it gitlab-registry.nrp-nautilus.io/<user-name>/<project-name>:<custom-tag> /bin/bash to enter the image and test the failing commands manually.
- After the image is successfully pushed, you can free space and delete all built images by running docker images | grep '<project-name>' | awk '{print $3}' | xargs docker rmi and docker rmi $(docker images -f "dangling=true" -q).