jaywonchung/pegasus

Ctrl-c is the wrong key to trigger cancellation

Opened this issue · 4 comments

Pressing ctrl-c when Pegasus is running will also send ctrl-c to the entire foreground process group, thus also sending SIGINT to the ssh processes.

Instead of waiting for ctrl-c with the tokio ctcl-c catcher, find another way to catch the user's intent to cancel.

Possible solutions:

  • Listen to stdin for something random. Like q<Enter>.
  • Create a pegasus stop command. This should make sure to identify the specific pegasus process that is running on the current working directory. This will probably require a pid file.
    • It's okay to assume that there is only one instance of pegasus per directory (if not, that means more than one pegasus processes are manipulating queue.yaml and consumed.yaml).

I think the best way to fix this is to double fork and create a new session, or run pegasus as a daemon since many users have the habit of using ctrl+c

@NobodyXu Thanks for the advice! I gave it some more thought. One possible solution that popped up is for the ssh command builder to include this:

pre_exec(|| unsafe { nix::libc::setsid() })

Single-forking would still be okay of ssh does not try to open("/dev/console") somewhere.

I think this will make the ssh processes ctrl-c-proof while still allowing Pegasus to be ergonomic (accepts ctrl-c). But probably this cannot be done without introducing changes in the openssh crate. WDYT?

@NobodyXu Thanks for the advice! I gave it some more thought. One possible solution that popped up is for the ssh command builder to include this:

pre_exec(|| unsafe { nix::libc::setsid() })

Single-forking would still be okay of ssh does not try to open("/dev/console") somewhere.

I think this will make the ssh processes ctrl-c-proof while still allowing Pegasus to be ergonomic (accepts ctrl-c). But probably this cannot be done without introducing changes in the openssh crate. WDYT?

Yes that requires modifications to openssh, however I don't think this is a great solution.

That would mean the ssh multiplex master won't exit and release the resource (including the TCP connection).

Next invocation of pegasus would create another ssh multiplex master and TCP connection.

Since pegasus is designed to run tasks on clusters, I think it is more reasonable to run it as a daemon that keep running tasks on the remote until you explicitly terminate it.

Ah, I get your point. It's not easy to guarantee that all the ssh masters terminate when pegasus dies/terminates/is killed/panics.

Making Pegaus a daemon seems good. 👍 I think I'll implement a stop command for termination.