This repository stores the RUNAI images for DLAB's compute resources. Usage is simple: pass the image URI as a parameter to RUNAI. If you're setting this up for the first time, check out First Time Instructions.
There are currently two images available:
- base: ghcr.io/jkminder/dlab-runai-images/base:master
  - Logs you in with your GASPAR UID/GID and sets the correct permissions.
  - Installs basic packages (conda, htop, vim, ssh, etc.).
  - Should dlabscratch be mapped, it sets your $HOME to /dlabscratch1/{GASPAR_USERNAME}.
  - Has CUDA 12.2.2 installed.
  - Has Powershell GO installed, a shell wrapper that makes life a bit easier.
  - Automatically generates a .bashrc file in your $HOME if you don't have one.
- pytorch: ghcr.io/jkminder/dlab-runai-images/pytorch:master
  - Creates a default conda environment with pytorch and other default ML python libraries installed. See pytorch/environment.yml and pytorch/requirements.txt for an exhaustive list.
Use the runai submit {JOB_NAME} -i {IMAGE} -- {COMMAND} command. To map the scratch partition, add the flag --pvc runai-dlab-{GASPAR_USERNAME}-scratch:/mnt. If you plan on using the container interactively, add the --interactive flag. This will give you priority in the queue, but be sure to only add it if you need interactive jobs. With -g {num} you can select the number of GPUs, with --cpu {num} the number of CPUs. The flag --memory 10G will allocate you at least 10G of RAM. Should you run into shared memory issues, add the flag --large-shm (sometimes required for massively parallel dataloaders). With --node-type G10 you select the node type.
A few examples:
Submit an interactive job which runs for 1 hour with the name test with 1 GPU.
runai submit -i ghcr.io/jkminder/dlab-runai-images/pytorch:master --pvc runai-dlab-{GASPAR_USERNAME}-scratch:/mnt --interactive -g 1.0 test -- sleep 3600
Submit a training job with the name train with 0.5 GPU.
runai submit -i ghcr.io/jkminder/dlab-runai-images/pytorch:master --pvc runai-dlab-{GASPAR_USERNAME}-scratch:/mnt -g 0.5 train -- python ~/trainer/train.py --my-training-arg 2
Submit an interactive job which runs for 2 hours with the name test with 0.5 GPU and at least 12 CPUs.
runai submit -i ghcr.io/jkminder/dlab-runai-images/pytorch:master --pvc runai-dlab-{GASPAR_USERNAME}-scratch:/mnt --interactive -g 0.5 --cpu 12 test -- sleep 7200
Submit a job with a specific node type. Node types:
- ICC: [S8|G9|G10] "S8" (CPU only), "G9" (Nvidia V100) or "G10" (Nvidia A100)
- RCP: there is only one node type
runai submit -i ghcr.io/jkminder/dlab-runai-images/pytorch:master --pvc runai-dlab-{GASPAR_USERNAME}-scratch:/mnt --interactive -g 1.0 --node-type G10 test -- sleep 3600
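The flags used in the examples above can be combined in a small wrapper. The following is a minimal sketch, not part of this repository: submit_cmd is a hypothetical helper that only prints the resulting command (dry run), and the node type check covers the ICC types listed above (on RCP, pass an empty node type).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this repository. It only prints the
# runai submit command so you can inspect it before running it.
GASPAR_USERNAME="${GASPAR_USERNAME:-your-gaspar-username}"  # placeholder
IMAGE="ghcr.io/jkminder/dlab-runai-images/pytorch:master"

# Usage: submit_cmd JOB_NAME GPUS NODE_TYPE -- COMMAND...
# NODE_TYPE may be empty ("") to omit the --node-type flag.
submit_cmd() {
  local name="$1" gpus="$2" node="$3"
  shift 3
  # Drop the optional "--" separator before the container command.
  if [ "$1" = "--" ]; then shift; fi
  local flags="-g ${gpus}"
  case "$node" in
    "") ;;                                             # no node type requested
    S8|G9|G10) flags="${flags} --node-type ${node}" ;; # ICC node types
    *) echo "unknown node type: ${node}" >&2; return 1 ;;
  esac
  echo "runai submit -i ${IMAGE} --pvc runai-dlab-${GASPAR_USERNAME}-scratch:/mnt ${flags} ${name} -- $*"
}

# Print the command for a job named "test" with 1 GPU on a G10 node.
submit_cmd test 1.0 G10 -- sleep 3600
```

Printing instead of executing keeps the sketch safe to try; once the output looks right, copy it into your terminal.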
I strongly recommend creating some aliases/shell scripts to make your life easier, e.g. alias rs="runai submit -i ghcr.io/jkminder/dlab-runai-images/pytorch:master --pvc runai-dlab-{GASPAR_USERNAME}-scratch:/mnt". See RUNAI ALIASES. Should your shell not support aliases, use the submit.sh script (replace the ENVS in the file first).
For a detailed instruction manual on the runai submit command, see here.
- Don't add the --command flag to runai submit. This will overwrite the script that sets up your GASPAR user.
- You can't log in to a root bash session (with su -). You have passwordless sudo rights on your GASPAR user; use those instead.
- If you already have a .bashrc file in /dlabscratch1/{GASPAR_USERNAME}, please copy the contents of base/.bashrc to your file. This is necessary because the script does not create it if one already exists.
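Copying base/.bashrc into an existing .bashrc can be done by hand, or with a small script. The following is a minimal sketch: merge_bashrc and the marker comment are hypothetical names introduced here so the contents are appended only once.

```shell
#!/usr/bin/env bash
# Sketch: append the repository's base/.bashrc to an existing .bashrc once.
# The marker line is a hypothetical convention used to detect earlier runs.
merge_bashrc() {
  local src="$1" dst="$2"
  local marker="# >>> dlab-runai-images base/.bashrc >>>"
  # Skip if a previous run already appended the contents.
  if grep -qF -e "$marker" "$dst" 2>/dev/null; then
    return 0
  fi
  { echo; echo "$marker"; cat "$src"; } >> "$dst"
}

# Example (assumes the repository is checked out in the current directory):
# merge_bashrc base/.bashrc "/dlabscratch1/${GASPAR_USERNAME}/.bashrc"
```

Appending keeps your own settings intact; review the merged file afterwards in case both parts set the same variables.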
The following steps have to be done once.
- Start a container or use ssh to connect directly to the IC cluster and create the file .ssh/authorized_keys in your folder on scratch (/dlabscratch1/{GASPAR_USERNAME}). Paste your public ssh key (see here for a tutorial on creating ssh keys).
- On your local computer, append the following lines to your ~/.ssh/config file:

      Host runai
        HostName localhost
        User {GASPAR_USERNAME}
        ForwardAgent yes
        IdentityFile {PATH_TO_YOUR_PUBLIC_KEY}
        StrictHostKeyChecking no
        UserKnownHostsFile=/dev/null
        Port 2222

  Should it not exist, create the file. This lets you connect to your runai containers from VS Code or via SSH without any annoying warnings. Be sure to replace the placeholders {...} with the appropriate values. After port forwarding from your container (install the RUNAI ALIASES and run rpf container-name), you can connect to it with ssh runai or use the SSH extension of VS Code to develop directly in the container. You can also use this to run jupyter notebooks on the container via VS Code.
- Check out RUNAI ALIASES
- If you want to customize the images, look at Customization
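The first step above (creating .ssh/authorized_keys on scratch) can be scripted. The following is a minimal sketch under the assumption that you run it inside a container or on the cluster: add_authorized_key is a hypothetical helper, and the restrictive permissions are set because sshd commonly rejects keys stored in files that others can write to.

```shell
#!/usr/bin/env bash
# Sketch: add a public key to .ssh/authorized_keys under a given home directory.
add_authorized_key() {
  local home_dir="$1" pubkey="$2"
  mkdir -p "${home_dir}/.ssh"
  chmod 700 "${home_dir}/.ssh"
  touch "${home_dir}/.ssh/authorized_keys"
  # Append only if this exact key line is not present yet.
  grep -qxF -e "$pubkey" "${home_dir}/.ssh/authorized_keys" \
    || echo "$pubkey" >> "${home_dir}/.ssh/authorized_keys"
  chmod 600 "${home_dir}/.ssh/authorized_keys"
}

# Example (replace the key with the contents of your own public key file):
# add_authorized_key "/dlabscratch1/${GASPAR_USERNAME}" "ssh-ed25519 AAAA... you@laptop"
```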
I added a few aliases that make life a bit easier. Source them in your .bashrc (or the config file of whatever shell you're using) by adding the line source {pathtothisrepo}/.runai_aliases to it.
Available Aliases:
- rl: Short for runai list
- rb: Short for runai bash
  Usage: rb container-name
- rdj: Short for runai delete job
  Usage: rdj container-name
- rpf: Port forward to your container. Especially useful for VS Code usage.
  Usage: rpf container-name
- rs: Short for runai submit -i {image} --pvc runai-dlab-{GASPAR_USERNAME}-scratch:/mnt. Make sure you adapt this to your needs by replacing the image and the {GASPAR_USERNAME} in the .runai_aliases file.
  Usage: rs --interactive -g 1.0 eval -- sleep 3600
- ric: Switches the context to the IC cluster. Short for runai config cluster ic-caas
- rrcp: Switches the context to the RCP cluster. Short for runai config cluster followed by the RCP cluster name; see the .runai_aliases file for the exact value.
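For orientation, here is a sketch of what such an aliases file could look like, reconstructed only from the alias list above. The actual .runai_aliases file in the repository may differ. rpf is left out since its command is not spelled out in the list, and rrcp is left out because the RCP cluster name should be taken from the actual file. The IMAGE variable and the default user name are placeholders.

```shell
# Sketch of an aliases file based on the list above; not the repository's file.
GASPAR_USERNAME="${GASPAR_USERNAME:-your-gaspar-username}"  # placeholder
IMAGE="ghcr.io/jkminder/dlab-runai-images/pytorch:master"    # adapt to your image

alias rl="runai list"
alias rb="runai bash"
alias rdj="runai delete job"
# Double quotes expand IMAGE and GASPAR_USERNAME when the alias is defined.
alias rs="runai submit -i ${IMAGE} --pvc runai-dlab-${GASPAR_USERNAME}-scratch:/mnt"
alias ric="runai config cluster ic-caas"
```

After sourcing the file, rs --interactive -g 1.0 eval -- sleep 3600 expands to the full runai submit call with the image and scratch mapping filled in.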
You can easily customize these images to your desire. No need to manually build docker images because GitHub will do that for you.
- Fork this repository.
- Either create a new folder with a Dockerfile that uses ghcr.io/jkminder/dlab-runai-images/base:master as base image or modify the existing ones. If you just want to install other packages, you can simply modify the pytorch/environment.yml (for conda install) and pytorch/requirements.txt (for pip install) files.
  - Should you create/modify your own Dockerfile, make sure that you don't overwrite the ENTRYPOINT. If you need to overwrite ENTRYPOINT, make sure that it ends with:

        ..., "/tmp/user-entrypoint.sh"]
        CMD ["/bin/bash"]

- (Optional) If you have created a new folder with a new Dockerfile, you also need to create a new GitHub Action that builds and uploads the image. To do so, duplicate the .github/workflows/docker-base.yml file, rename it to docker-{yourimagename}.yml and search-replace base with {yourimagename}. This will automatically build and publish the image under ghcr.io/{github_shortname}/dlab-runai-images/{yourimagename}:master.
- Push your repository. GitHub Actions will automatically build your image. This may take a minute; check the progress under the Actions tab. Once it's done