Program hangs after successful shutdown
Closed this issue · 33 comments
Hi there,
During some recent testing I ran into an "interesting" issue with SIGINT and similar shutdowns. The exact situation I identified is the following:
Deep inside one of my subsystems, I do an async lookup_host()
, which has a timeout of apparently 20 seconds, which is await there. When hitting CTRL+C, all subsystems shut down and I get tokio_graceful_shutdown::toplevel] Shutdown successful.
.
At this point the program does not terminate though – it only exits after the lookup_host()
future resolves/rejects. Do you happen to have an idea on how to force tokio to not wait for such things before shutting down itself?
Theoretically, after the timeout of handle_shutdown_requests
is over, it should cancel all the pending futures ... Would you mind providing a reproducible example? This behaviour sounds like a bug and requires fixing :)
Did you spawn it through a tokio::spawn
? Because that slips away from the reach of the shutdown handler, you would have to use SubsystemHandle::start
instead. tokio::spawn
is like a detached thread, the system looses all knowledge about its existence.
I could theoretically wait for it then through guard objects, but I wouldn't have any possibility of cancelling it. The shutdown manager can only cancel futures it knows, for obvious reasons :)
But none of this matters if you didn't use tokio::spawn
:) So I will be able to help you further once you provide a reproducible example.
Hmm. It seems that lookup_host()
internally uses spawn_blocking()
. Sadly, it's almost impossible to handle spawn_blocking
gracefully during shutdown. Let me construct a minimal example for further discussion.
I was just looking into lookup_host() as well for reproduction, and figured the same thing. I really wonder how this could be resolved nicely.
I do use vanilla threads in two other places in my program, but they all react to oneshot receivers being dropped and exit immediately.
The only thing I found so far would be shutdown_timeout()
or shutdown_background()
, but I'm unsure how to do this with externally provided runtimes. I suspect you can't shut down the runtime you are currently running on.
So spawning another runtime inside of Toplevel
would probably work, but that would override the user's decision what runtime to spawn. I wonder if there is a config option of tokio::main
that kills background tasks on exit, that would probably be the best option.
This seems to exhibit the behaviour you described:
use std::{thread, time::Duration};
use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};
async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
let mut hanging_task = tokio::task::spawn_blocking(|| {
thread::sleep(Duration::from_secs(10));
});
let joinhandle_ref = &mut hanging_task;
select! {
e = joinhandle_ref => e.unwrap(),
_ = subsys.on_shutdown_requested() => {
hanging_task.abort()
},
}
Ok(())
}
#[tokio::main]
async fn main() -> Result<()> {
// Init logging
Builder::from_env(Env::default().default_filter_or("debug")).init();
Toplevel::new()
.catch_signals()
.start("Hanging Task", hanging_task)
.handle_shutdown_requests(Duration::from_millis(2000))
.await
.map_err(Into::into)
}
Yeah, that seems to be pretty much it. I'm screening the tokio docs and code, but haven't found anything for it yet.
I could add something along those lines in handle_shutdown_requests
:
use std::{thread, time::Duration};
use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};
async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
let hanging_task = tokio::task::spawn_blocking(|| {
thread::sleep(Duration::from_secs(10));
});
select! {
e = hanging_task => e.unwrap(),
_ = subsys.on_shutdown_requested() => {
// Spawn thread that kills leftover tasks if necessary
thread::spawn(|| {
thread::sleep(Duration::from_millis(1000));
println!("Shutdown seems to hang. Killing program ...");
std::process::exit(1);
});
},
}
Ok(())
}
#[tokio::main]
async fn main() -> Result<()> {
// Init logging
Builder::from_env(Env::default().default_filter_or("debug")).init();
Toplevel::new()
.catch_signals()
.start("Hanging Task", hanging_task)
.handle_shutdown_requests(Duration::from_millis(200))
.await
.map_err(Into::into)
}
What do you think?
The idea is that we spawn a thread. If main()
exits successfully, the thread is killed. But if main()
hangs, the thread kills the program after a grace period.
I think that's a sane way to handle it, as long as the kill timeout is configurable. Though technically we already have timeout defined in handle_shutdown_requests
.
The only problem is that it isn't guaranteed that Toplevel
is used like this, it could be used in a bunch of ways, including somewhere nested in the program. And in that case we probably don't want to provoke a exit(1)
if the shutdown hangs.
We could add a .exit_program_on_hang()
option that can be appended to .handle_shutdown_requests()
.
Might have to sleep over it, though. I'm not perfectly happy with this yet, it seems too hacky.
Sure thing, take your time (not too much please, I definitely like this resolved ;)). I agree that it should be an opt-in feature though.
@DASPRiD Until I/we come up with a solution, you can work around it like this:
use std::{thread, time::Duration};
use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};
async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
let hanging_task = tokio::task::spawn_blocking(|| {
thread::sleep(Duration::from_secs(10));
});
select! {
e = hanging_task => e.unwrap(),
_ = subsys.on_shutdown_requested() => (),
}
Ok(())
}
#[tokio::main]
async fn main() -> Result<()> {
// Init logging
Builder::from_env(Env::default().default_filter_or("debug")).init();
Toplevel::new()
.catch_signals()
.start("Hanging Task", hanging_task)
.handle_shutdown_requests(Duration::from_millis(200))
.await?;
thread::spawn(|| {
thread::sleep(Duration::from_millis(1000));
log::error!("Shutdown seems to hang. Killing program ...");
std::process::exit(1);
});
Ok(())
}
Feel free to post further thoughts or ideas, I'm open for everything on this problem. I'm struggling to think of something graceful right now, I'll also do some further research.
I agree that this is definitely not the nicest solution. Considering that such a blocking thread should usually just be ignored and an exit should happen immediately, this feels like an issue on the tokio side.
Granted though, I was only running into this issue because I had to restart my program because of network issues, and that's exactly when I noticed this problem of course. As long as the network just works, the shutdown is normal. So this is definitely an edge-case.
The best solution I can come up with right now is to put above chunk of code in a function, like
fn kill_on_hang(timeout: Duration) {
thread::spawn(|| {
thread::sleep(timeout);
log::error!("Shutdown seems to hang. Killing program ...");
std::process::exit(1);
});
}
So that people can use it like this:
use std::{thread, time::Duration};
use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};
async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
let hanging_task = tokio::task::spawn_blocking(|| {
thread::sleep(Duration::from_secs(10));
});
select! {
e = hanging_task => e.unwrap(),
_ = subsys.on_shutdown_requested() => (),
}
Ok(())
}
#[tokio::main]
async fn main() -> Result<()> {
// Init logging
Builder::from_env(Env::default().default_filter_or("debug")).init();
Toplevel::new()
.catch_signals()
.start("Hanging Task", hanging_task)
.handle_shutdown_requests(Duration::from_millis(200))
.await?;
kill_on_hang(Duration::from_secs(1));
Ok(())
}
But it doesn't feel right. I agree that I think the solution should be on tokio side somewhere. And it is, if you use tokio's runtime directly and can call Runtime::shutdown_background()
. It just sucks that tokio::Runtime
's default drop
behaviour is to wait indefinitely.
@DASPRiD In case you want to report it to tokio, this is the minimal reproducible example:
use std::{thread, time::Duration};
#[tokio::main]
async fn main() {
tokio::task::spawn_blocking(|| {
thread::sleep(Duration::from_secs(10));
});
}
In my opinion it should exit immediately, but it doesn't.
My reasoning is that this also exits immediately:
use std::{thread, time::Duration};
fn main() {
thread::spawn(|| {
thread::sleep(Duration::from_secs(10));
});
}
Further, this also exits immediately:
use std::time::Duration;
#[tokio::main]
async fn main() {
tokio::spawn(async {
tokio::time::sleep(Duration::from_secs(10)).await;
});
}
I should do that, thanks a lot! I'll reference this issue over there.
I've created an issue on the tokio repository. Maybe you can chime in there as well if you have anything to add ❤️
@DASPRiD I prototyped a solution in tokio:
Cargo.toml:
[patch.crates-io]
tokio = { version = "1.23.1", git = "https://github.com/Finomnis/tokio", branch = "main_arg_ignore_hanging_threads" }
[dependencies]
tokio-graceful-shutdown = "0.12.1"
tokio = { version = "1.23.1", features = ["full"] }
env_logger = "0.10.0"
miette = { version = "5.5.0", features = ["fancy"] }
use std::{thread, time::Duration};
use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};
async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
let mut hanging_task = tokio::task::spawn_blocking(|| {
thread::sleep(Duration::from_secs(100));
});
let joinhandle_ref = &mut hanging_task;
select! {
e = joinhandle_ref => e.unwrap(),
_ = subsys.on_shutdown_requested() => (),
}
Ok(())
}
#[tokio::main(ignore_hanging_threads = true)]
async fn main() -> Result<()> {
// Init logging
Builder::from_env(Env::default().default_filter_or("debug")).init();
Toplevel::new()
.catch_signals()
.start("Hanging Task", hanging_task)
.handle_shutdown_requests(Duration::from_millis(2000))
.await
.map_err(Into::into)
}
What are your thoughts?
That looks quite good to me. The parameter name is a little ambiguous though maybe, as this is quite specific to tokio spawned threads. Apart from that, it looks like a really clean solution.
Yah, it's still open for debate. Now I just need two things:
- tests
- agreement with the tokio devs
The second one could take a while, tokio is notoriously understaffed with many open pull requests. They are doing an excellent job, just to clarify, but there's just too many people wanting things done :D
So be patient I guess
Maybe ignore_pending_threads
instead of ignore_hanging_threads
? Or detach_threads_on_exit
? I'm not sure, I hate naming things :D
I feel you on that, coming up with short but still descriptive names is… hard. Maybe something like abandon_pending_threads
?
Sounds good, might use it. Gotta leave for today though, can't promise when I'll get to it. Greetz
Thanks a lot for your effort! :)
@DASPRiD I opened a PR for the first part of the fix, fyi: tokio-rs/tokio#5360
Thanks for keeping me in the loop. I assume a second PR would add that to the tokio-main macro?
Exactly :) it's currently completely impossible to fix that reliably, with our without #[tokio::main]
. The first PR adds that functionality to the Runtime.
@DASPRiD It might be unlikely that we get the parameter for #[tokio::main]
merged. I'm currently fighting for the config option for Runtime
, that one might happen. You might have to use the Runtime
object directly then, but that shouldn't be a problem, #[tokio::main]
is really just a thin wrapper around the Runtime
object. Wish me luck ;)
I see that there was no movement on the PR yet, I really hope they get around to it eventually :)
I'm not sure, they seem to be quite busy ... I'll ask again in their discord soon.
@DASPRiD After a long discussion they rejected the pull request and instead proposed a solution on our side.
Their proposal was to wrap the runtime in a newtype that calls shutdown_background()
in its Drop
implementation.
Here is the documentation on what #[tokio::main]
actually expands to: https://docs.rs/tokio/latest/tokio/attr.main.html#using-the-multi-thread-runtime
With that, this is how such a newtype thing would look like:
use std::{future::Future, thread, time::Duration};
use env_logger::{Builder, Env};
use miette::Result;
use tokio::{runtime::Runtime, select};
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};
struct RuntimeWithInstantShutdown(Option<Runtime>);
impl RuntimeWithInstantShutdown {
pub fn new() -> Self {
Self(Some(
tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()
.unwrap(),
))
}
pub fn block_on<F: Future>(&self, future: F) -> F::Output {
self.0.as_ref().unwrap().block_on(future)
}
}
impl Drop for RuntimeWithInstantShutdown {
fn drop(&mut self) {
self.0.take().unwrap().shutdown_background()
}
}
async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
let hanging_task = tokio::task::spawn_blocking(|| {
thread::sleep(Duration::from_secs(10));
});
select! {
e = hanging_task => e.unwrap(),
_ = subsys.on_shutdown_requested() => (),
}
Ok(())
}
fn main() -> Result<()> {
// Init logging
Builder::from_env(Env::default().default_filter_or("debug")).init();
RuntimeWithInstantShutdown::new().block_on(async {
Toplevel::new()
.catch_signals()
.start("Hanging Task", hanging_task)
.handle_shutdown_requests(Duration::from_millis(200))
.await
.map_err(Into::into)
})
}
@Finomnis Interesting, thanks a lot for your effort! Do you have any plan to have a macro for that within this library to make it easier for users?
Not really, to be honest. It's different enough that it shouldn't be part of this crate, in my opinion.
It's just a few lines of code, so I don't think it's even worth abstracting it much. Rather the opposite: because runtime configuration is so highly context dependent, I think this is already the best and most generic API. Writing a macro would only make my crate more opinionated, and I'm trying to keep it as generic and unopinionated as possible.