jaywonchung/pegasus

Cancelling commands

jaywonchung opened this issue · 1 comments

Cancelling commands run by Pegasus is very difficult. You essentially have to SSH into each node, manually find the PIDs of the commands, and kill them.

Nested commands make things more complicated. For instance, docker exec sh -c "python train.py" spawns the following processes:

  • Run by user: sh -c docker exec sh -c "python train.py"
  • Run by user: docker exec sh -c "python train.py"
  • Run by root: sh -c "python train.py"
  • Run by root: python train.py

Only killing the fourth process, python train.py, will truly achieve cancellation. The bottom line is that it is difficult for Pegasus to infer how to properly terminate a command.

Potential solutions

  • We might ask the user for a cancellation command in queue.yaml, for example sudo kill $(pgrep -f 'train.py'). The ctrl_c handler would then open a new connection to each host and run the designated cancellation command.
  • Somehow figure out the PGID of the sh process and run sudo kill -- -PGID. Can we pgrep -f with the entire command? Shell escaping might become a problem. (Alternatively, pgrep -f with every single word in the command and kill the intersection of all returned PIDs?)
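
The second idea (pgrep -f per word, intersect, then signal the process group) could be sketched roughly as below, in Python for illustration only. Everything named here is an assumption and not part of Pegasus: ere_escape, pgrep_full, candidate_pids, and terminate_group are hypothetical helpers, and the code is assumed to run on the node that hosts the command, with enough privileges to signal the target processes (the issue suggests sudo for root-owned ones).

```python
import os
import shlex
import signal
import subprocess
from typing import Callable, Set

# Characters that are special in the extended regex syntax pgrep uses.
ERE_SPECIAL = set(r".[]{}()*+?^$|\\")


def ere_escape(word: str) -> str:
    """Escape a word so pgrep matches it literally (sidesteps shell escaping)."""
    return "".join("\\" + c if c in ERE_SPECIAL else c for c in word)


def pgrep_full(pattern: str) -> Set[int]:
    """Return PIDs whose full command line matches `pattern` (pgrep -f)."""
    # "--" keeps words such as "-c" from being parsed as pgrep options.
    result = subprocess.run(
        ["pgrep", "-f", "--", pattern], capture_output=True, text=True
    )
    return {int(pid) for pid in result.stdout.split()}


def candidate_pids(
    command: str, lookup: Callable[[str], Set[int]] = pgrep_full
) -> Set[int]:
    """Intersect the PID sets returned for every word of `command`."""
    words = shlex.split(command)
    if not words:
        return set()
    return set.intersection(*(lookup(ere_escape(w)) for w in words))


def terminate_group(pid: int, sig: int = signal.SIGTERM) -> int:
    """Signal the whole process group of `pid`, like kill -- -PGID."""
    pgid = os.getpgid(pid)
    if pgid == os.getpgid(0):
        # Refuse to signal our own group, which would kill the caller too.
        raise RuntimeError("target shares the caller's process group")
    os.killpg(pgid, sig)
    return pgid
```

With the process list from this issue, the intersection would likely contain only the two outer processes run by the user, since the inner python train.py command line contains neither docker nor exec. The sketch therefore does not solve the docker exec case by itself.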

Commands run with docker exec currently have no standard way to be killed.

Following moby/moby#41548