This repository contains helper code for cluster management, including:

  • generate submission scripts for different clusters/servers from a given list of commands

  • monitor jobs and keep a constant number of them running

  • monitor job end states and resubmit failed jobs

  • one-liner job submission command (e.g. python "echo excited" --gpu 1 --submit)
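Script generation and one-liner submission can be sketched as a small renderer plus a command-line wrapper. This is a minimal illustration, not this repository's code. It assumes Slurm as the target scheduler, with a plain `local` bash target standing in for "other servers". The names `render_script` and `main` and the `--cluster` flag are hypothetical. Only `--gpu` and `--submit` come from the example command above.

```python
import argparse
import subprocess


def render_script(commands, cluster="slurm", name="job", gpu=0):
    """Return the text of a job script that runs `commands` one after another."""
    lines = ["#!/bin/bash"]
    if cluster == "slurm":
        # Slurm reads scheduler options from #SBATCH comment lines.
        lines.append(f"#SBATCH --job-name={name}")
        if gpu > 0:
            lines.append(f"#SBATCH --gres=gpu:{gpu}")
    elif cluster != "local":
        # Hypothetical: other schedulers (PBS, LSF, ...) would get their own branch.
        raise ValueError(f"unsupported cluster: {cluster}")
    lines.append("set -e")  # stop at the first failing command
    lines.extend(commands)
    return "\n".join(lines) + "\n"


def main(argv=None):
    """Hypothetical one-liner CLI: <command> [--gpu N] [--cluster NAME] [--submit]."""
    parser = argparse.ArgumentParser(description="Render (and optionally submit) a job.")
    parser.add_argument("command", help="shell command to run inside the job")
    parser.add_argument("--gpu", type=int, default=0, help="number of GPUs to request")
    parser.add_argument("--cluster", default="slurm", choices=["slurm", "local"])
    parser.add_argument("--submit", action="store_true", help="submit instead of printing")
    args = parser.parse_args(argv)

    script = render_script([args.command], cluster=args.cluster, gpu=args.gpu)
    if args.submit:
        # sbatch reads the job script from stdin when no file name is given;
        # the "local" target simply runs the script with bash.
        runner = ["sbatch"] if args.cluster == "slurm" else ["bash"]
        subprocess.run(runner, input=script, text=True, check=True)
    else:
        print(script, end="")
    return script
```

Calling `main(["echo excited", "--gpu", "1"])` prints a Slurm script that requests one GPU and runs `echo excited`. Adding `"--submit"` pipes the same text to `sbatch` instead of printing it.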
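Keeping a constant number of jobs running is a "top up, then poll" loop. The loop launches queued work until the target concurrency is reached, then checks which jobs have finished and frees their slots. The following is a minimal sketch under stated assumptions. On a real cluster, the set of running jobs would come from polling the scheduler (for example Slurm's `squeue`). Here, local subprocesses stand in for cluster jobs so the example runs anywhere. The function name `run_throttled` and its parameters are hypothetical.

```python
import subprocess
import time
from collections import deque


def run_throttled(commands, max_running=4, poll_interval=0.1):
    """Run shell commands, keeping at most `max_running` of them alive at once.

    Assumption: local subprocesses stand in for cluster jobs.
    Returns the exit codes in the same order as `commands`.
    """
    pending = deque(enumerate(commands))
    running = {}  # Popen handle -> index of its command
    codes = [None] * len(commands)
    while pending or running:
        # Top up: launch queued commands until the target concurrency is reached.
        while pending and len(running) < max_running:
            idx, cmd = pending.popleft()
            running[subprocess.Popen(cmd, shell=True)] = idx
        # Monitor: record the exit code of every job that has finished.
        for proc in list(running):
            code = proc.poll()
            if code is not None:
                codes[running.pop(proc)] = code
        if running:
            time.sleep(poll_interval)
    return codes
```

Results are indexed by position rather than by command string, so duplicate commands in the list do not overwrite each other's exit codes.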
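Resubmitting failed jobs reduces to a retry loop over job end states. This is a minimal sketch with three assumptions. First, it uses Slurm-style state names, where only `COMPLETED` counts as success and states such as `FAILED` or `TIMEOUT` count as failures. Second, the caller supplies a hypothetical `run_batch` function that submits a list of commands, waits for them, and returns one end state per command. Third, a `max_attempts` cap stops a deterministic failure from being resubmitted forever. `run_with_resubmission` is likewise a hypothetical name.

```python
def run_with_resubmission(commands, run_batch, max_attempts=3):
    """Resubmit jobs that did not finish cleanly, up to `max_attempts` tries each.

    `run_batch` is a caller-supplied (hypothetical) function that runs a list of
    commands and returns their end states in the same order, e.g. "COMPLETED".
    Returns (final end states, number of attempts per command).
    """
    states = [None] * len(commands)
    attempts = [0] * len(commands)
    todo = list(range(len(commands)))
    while todo:
        batch_states = run_batch([commands[i] for i in todo])
        retry = []
        for i, state in zip(todo, batch_states):
            attempts[i] += 1
            states[i] = state
            # Assumption: anything other than a clean completion is a failure.
            if state != "COMPLETED" and attempts[i] < max_attempts:
                retry.append(i)
        todo = retry
    return states, attempts
```

Only the failed subset is resubmitted on each pass, so jobs that already completed are never rerun.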

Most of the code was moved from