Cancelling commands
jaywonchung opened this issue · 1 comments
jaywonchung commented
Cancelling commands run by Pegasus is very difficult. You essentially have to ssh into each node, manually figure out the PIDs of the commands, and kill them.
Nested commands, so to speak, make things more complicated. For instance, docker exec sh -c "python train.py" will run the following commands:
- Run by user: sh -c docker exec sh -c "python train.py"
- Run by user: docker exec sh -c "python train.py"
- Run by root: sh -c "python train.py"
- Run by root: python train.py
Only killing the fourth command, python train.py, will truly achieve cancellation. The bottom line is that it is difficult for Pegasus to infer how to properly terminate a command.
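To make the nesting problem concrete without Docker, here is a minimal local sketch in Python. It is not Pegasus code: the helper names `pid_alive` and `kill_wrapper_only` are hypothetical, and `sleep 30` stands in for `python train.py`. It starts an `sh` wrapper around a child, signals only the wrapper's PID, and checks whether the child is still there.

```python
import os
import signal
import subprocess


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, it sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def kill_wrapper_only() -> bool:
    """Kill only the `sh` wrapper and report whether its child survived."""
    # The wrapper starts the child in the background, prints the child's PID,
    # then waits on it. `sleep 30` is a stand-in for the real workload.
    wrapper = subprocess.Popen(
        ["sh", "-c", "sleep 30 & echo $!; wait"],
        stdout=subprocess.PIPE,
        text=True,
    )
    child_pid = int(wrapper.stdout.readline())

    wrapper.terminate()  # SIGTERM goes to the wrapper's PID only
    wrapper.wait()

    survived = pid_alive(child_pid)
    if survived:
        os.kill(child_pid, signal.SIGTERM)  # clean up the orphaned child
    wrapper.stdout.close()
    return survived
```

On a typical Linux or macOS host, `kill_wrapper_only()` should return `True`: the wrapper exits, but the child keeps running until it is signalled directly.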
Potential solutions
- We might ask the user for a cancellation command in queue.yaml. For example, sudo kill $(pgrep -f 'train.py'). Then the ctrl_c handler will create a new connection to the hosts and run the designated cancellation command.
- Somehow figure out the PGID of the sh process and run sudo kill -- -PGID. Can we pgrep -f with the entire command? Shell escaping might become a problem. (pgrep -f with every single word in the command and kill the intersection of all PIDs returned?)
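The last idea in the list (match each word separately and intersect the resulting PIDs) can be sketched as a pure function, to see what it would select. This is a rough Python sketch under stated assumptions, not Pegasus code: `candidate_pids` is a hypothetical name, the process table is a made-up snapshot with invented PIDs, and plain substring matching stands in for the regex matching that pgrep -f performs.

```python
import shlex


def candidate_pids(command: str, process_table: dict[int, str]) -> set[int]:
    """Return PIDs whose command line contains every word of `command`."""
    words = shlex.split(command)  # handles the quoting for us
    if not words:
        return set()
    matches = None
    for word in words:
        # Substring match anywhere in the command line, so short words
        # such as "sh" can also match unrelated processes (e.g. "bash").
        pids = {pid for pid, cmdline in process_table.items() if word in cmdline}
        matches = pids if matches is None else matches & pids
    return matches


# Hypothetical snapshot of the four processes from the example above.
table = {
    101: 'sh -c docker exec sh -c "python train.py"',
    102: "docker exec sh -c python train.py",
    103: "sh -c python train.py",
    104: "python train.py",
}
print(sorted(candidate_pids('docker exec sh -c "python train.py"', table)))
# prints [101, 102]
```

In this snapshot the intersection selects only the two user-side processes. The python train.py process does not contain the words docker or exec, so this approach alone would likely not reach the process that actually needs to be killed.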
jaywonchung commented
Commands run with docker exec currently have no standard way to be killed.
Following moby/moby#41548