
DGX scripts


Cluster scripts

A set of scripts to help prepare, submit, run and monitor cluster jobs.

Every script accepts options: run it with -h or --help to list them, or run it without arguments to use the defaults.
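The help-or-defaults convention can be sketched as follows. This is a minimal illustration only, not code from the repository: `run_example`, `print_usage`, the `example.sh` name, and the `default-job` default are all hypothetical. Only the `-h|--help` and `--job-name` option names come from this README.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the -h|--help convention; not taken from the real scripts.

print_usage() {
    echo "Usage: example.sh [-h|--help] [--job-name NAME]"
}

run_example() {
    # Assumed default, for illustration only.
    local job_name="default-job"

    while [ "$#" -gt 0 ]; do
        case "$1" in
            -h|--help)
                print_usage
                return 0
                ;;
            --job-name)
                # Guard against a missing value so "shift 2" cannot fail.
                if [ "$#" -lt 2 ]; then
                    echo "Missing value for --job-name" >&2
                    return 1
                fi
                job_name="$2"
                shift 2
                ;;
            *)
                echo "Unknown option: $1" >&2
                return 1
                ;;
        esac
    done

    # With no arguments the loop is skipped and the default is used.
    echo "job name: ${job_name}"
}
```

Calling `run_example --help` prints the usage line, while `run_example` with no arguments falls through to the default and prints `job name: default-job`.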

DGX

Scripts to:

  1. Create a Docker image (based on the NVIDIA PyTorch container): create_docker_im.sh.
  2. Submit a job: rubmit.
  3. Define what is executed when the job runs: runai_startup.sh.

Example:

  • Create the Docker image: create_docker_im.sh --docker_push.
  • Submit with rubmit.
  • Modify runai_startup.sh as necessary; it is run once the runai job is created.
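The example steps can be rehearsed with a dry run before touching the cluster. The sketch below assumes the scripts are on the PATH. `run_step`, `dgx_workflow`, and the `DRY_RUN` variable are hypothetical helpers introduced here and are not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper around the DGX example steps.

run_step() {
    # DRY_RUN defaults to 1, so commands are printed rather than executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

dgx_workflow() {
    # Step 1: build the Docker image and push it.
    run_step create_docker_im.sh --docker_push
    # Step 2: submit the job with the default options.
    run_step rubmit
    # runai_startup.sh is edited by hand; it runs once the runai job is created.
}
```

With `DRY_RUN` unset or set to 1, `dgx_workflow` only prints the two commands, each prefixed with `+`. Setting `DRY_RUN=0` runs them for real.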

Typical submit command:

rubmit --job-name rb-train -- "cd <somewhere>\npython training.py --output_model model.pt"
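Inside double quotes the shell leaves `\n` as a literal backslash followed by `n`, so rubmit receives the command as a single line. Assuming rubmit later turns that `\n` into a line break (this README does not confirm how it does so), the command the job would see can be previewed with `printf '%b'`. `expand_command` is a hypothetical helper used only for this preview.

```shell
#!/usr/bin/env bash
# Hypothetical preview helper; rubmit's real handling of "\n" is an assumption.

expand_command() {
    # %b tells printf to interpret backslash escapes such as \n in the argument.
    printf '%b\n' "$1"
}

# The command string from the typical submit command above.
cmd='cd <somewhere>\npython training.py --output_model model.pt'
expand_command "$cmd"
```

Under that assumption the preview shows two lines: the `cd` command, then the `python` command.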

JADE

Scripts to:

  1. Submit a job: jubmit.
  2. View jobs (wraps sacct): jlist.
  3. View cluster-wide resource usage: jtop.
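To give a feel for what wrapping sacct can look like, here is a minimal sketch. It is not the real jlist: `list_jobs` and the chosen field list are assumptions made for illustration. The `--user` and `--format` options and the `JobID`, `JobName`, `State`, and `Elapsed` fields are standard sacct features.

```shell
#!/usr/bin/env bash
# Hypothetical sacct wrapper; the real jlist may behave differently.

list_jobs() {
    # Default to the current user when no user name is given.
    local user="${1:-${USER:-$(id -un)}}"
    # Assumed selection of columns, for illustration only.
    local fields="JobID,JobName,State,Elapsed"

    if command -v sacct >/dev/null 2>&1; then
        sacct --user="$user" --format="$fields"
    else
        # Off the cluster, show the command instead of failing.
        echo "sacct not found; would run: sacct --user=$user --format=$fields"
    fi
}
```

Run `list_jobs` for your own jobs or `list_jobs <username>` for another user's. On a machine without Slurm it only prints the command it would have run.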

jtop

[Image: jtop output]