A set of scripts to help prepare, submit, run and monitor cluster jobs.
All scripts accept options: pass -h or --help for details,
or run a script as-is to use its defaults.
Scripts to:
- Create a Docker image (based on the NVIDIA PyTorch container): create_docker_im.sh
- Submit a job: rubmit
- The script that is executed when the job runs: runai_startup.sh
Example:
- Create the Docker image: create_docker_im.sh --docker_push
- Submit with rubmit
- Modify runai_startup.sh as necessary; it is run once the runai job is created.
Typical submit command:
rubmit --job-name rb-train -- "cd <somewhere>\npython training.py --output_model model.pt"
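The quoted string in the command above packs two shell commands separated by a literal \n. How rubmit handles that string is not shown here; the sketch below assumes it expands backslash escapes (as printf '%b' does) before handing the text to a shell. The helper name expand_job_cmd and the example commands are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: expand backslash escapes such as \n in a job command string.
# Assumption: rubmit does something equivalent before running the command.
expand_job_cmd() {
    printf '%b\n' "$1"
}

# Inside double quotes the shell keeps \n as two literal characters (backslash, n).
job_cmd="cd /tmp\necho training would start here"

# Show the expanded form, one command per line.
expand_job_cmd "$job_cmd"

# Run the expanded commands in a child shell.
expand_job_cmd "$job_cmd" | sh
```

If this assumption holds, each \n in the submitted string marks the end of one command, so longer job scripts can be written as a single quoted argument.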
Scripts to:
- Submit a job: jubmit
- View jobs (wraps sacct): jlist
- View cluster-wide resource usage: jtop
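Since jlist wraps sacct, the following gives a rough idea of the kind of command such a wrapper might build. This is a sketch, not the jlist source: the function name jlist_sketch_cmd and the choice of columns are assumptions. The --user and --format options and the field names are standard sacct ones.

```shell
#!/bin/sh
# Hypothetical sketch of a sacct wrapper; not the actual jlist implementation.
# Prints the sacct command line it would run for a given user
# (defaults to the current user).
jlist_sketch_cmd() {
    user="${1:-$(id -un)}"
    # JobName%30 widens the job name column to 30 characters.
    printf 'sacct --user=%s --format=%s\n' "$user" "JobID,JobName%30,Partition,State,Elapsed"
}

# Print the command instead of running it, so this works without Slurm installed.
jlist_sketch_cmd
```

On a machine with Slurm accounting enabled, piping the printed line to sh would run the query.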