Program hangs after successful shutdown

Question

Program hangs after successful shutdown

Closed this issue 2 years ago · 33 comments

Hi there,

During some recent testing I ran into an "interesting" issue with SIGINT and similar shutdowns. The exact situation I identified is the following:

Deep inside one of my subsystems, I do an async lookup_host(), which has a timeout of apparently 20 seconds, which is await there. When hitting CTRL+C, all subsystems shut down and I get tokio_graceful_shutdown::toplevel] Shutdown successful..

At this point the program does not terminate though – it only exits after the lookup_host() future resolves/rejects. Do you happen to have an idea on how to force tokio to not wait for such things before shutting down itself?

Answer 1 · 2023-01-03T20:03:55.000Z

Theoretically, after the timeout of handle_shutdown_requests is over, it should cancel all the pending futures ... Would you mind providing a reproducible example? This behaviour sounds like a bug and requires fixing :)

Did you spawn it through a tokio::spawn? Because that slips away from the reach of the shutdown handler, you would have to use SubsystemHandle::start instead. tokio::spawn is like a detached thread, the system looses all knowledge about its existence.

I could theoretically wait for it then through guard objects, but I wouldn't have any possibility of cancelling it. The shutdown manager can only cancel futures it knows, for obvious reasons :)

But none of this matters if you didn't use tokio::spawn :) So I will be able to help you further once you provide a reproducible example.

Answer 2 · 2023-01-03T20:42:15.000Z

Hmm. It seems that lookup_host() internally uses spawn_blocking(). Sadly, it's almost impossible to handle spawn_blocking gracefully during shutdown. Let me construct a minimal example for further discussion.

Answer 3 · 2023-01-03T20:46:29.000Z

I was just looking into lookup_host() as well for reproduction, and figured the same thing. I really wonder how this could be resolved nicely.

I do use vanilla threads in two other places in my program, but they all react to oneshot receivers being dropped and exit immediately.

Answer 4 · 2023-01-03T20:51:12.000Z

The only thing I found so far would be shutdown_timeout() or shutdown_background(), but I'm unsure how to do this with externally provided runtimes. I suspect you can't shut down the runtime you are currently running on.

So spawning another runtime inside of Toplevel would probably work, but that would override the user's decision what runtime to spawn. I wonder if there is a config option of tokio::main that kills background tasks on exit, that would probably be the best option.

Answer 5 · 2023-01-03T21:00:48.000Z

This seems to exhibit the behaviour you described:

use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let mut hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    let joinhandle_ref = &mut hanging_task;

    select! {
        e = joinhandle_ref => e.unwrap(),
        _ = subsys.on_shutdown_requested() => {
            hanging_task.abort()
        },
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(2000))
        .await
        .map_err(Into::into)
}

Answer 6 · 2023-01-03T21:03:28.000Z

Yeah, that seems to be pretty much it. I'm screening the tokio docs and code, but haven't found anything for it yet.

Answer 7 · 2023-01-03T21:14:07.000Z

I could add something along those lines in handle_shutdown_requests:

use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    select! {
        e = hanging_task => e.unwrap(),
        _ = subsys.on_shutdown_requested() => {
            // Spawn thread that kills leftover tasks if necessary
            thread::spawn(|| {
                thread::sleep(Duration::from_millis(1000));
                println!("Shutdown seems to hang. Killing program ...");
                std::process::exit(1);
            });
        },
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(200))
        .await
        .map_err(Into::into)
}

What do you think?

The idea is that we spawn a thread. If main() exits successfully, the thread is killed. But if main() hangs, the thread kills the program after a grace period.

Answer 8 · 2023-01-03T21:16:22.000Z

I think that's a sane way to handle it, as long as the kill timeout is configurable. Though technically we already have timeout defined in handle_shutdown_requests.

Answer 9 · 2023-01-03T21:20:47.000Z

The only problem is that it isn't guaranteed that Toplevel is used like this, it could be used in a bunch of ways, including somewhere nested in the program. And in that case we probably don't want to provoke a exit(1) if the shutdown hangs.

We could add a .exit_program_on_hang() option that can be appended to .handle_shutdown_requests().

Might have to sleep over it, though. I'm not perfectly happy with this yet, it seems too hacky.

Answer 10 · 2023-01-03T21:22:00.000Z

Sure thing, take your time (not too much please, I definitely like this resolved ;)). I agree that it should be an opt-in feature though.

Answer 11 · 2023-01-03T21:26:34.000Z

@DASPRiD Until I/we come up with a solution, you can work around it like this:

use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    select! {
        e = hanging_task => e.unwrap(),
        _ = subsys.on_shutdown_requested() => (),
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(200))
        .await?;

    thread::spawn(|| {
        thread::sleep(Duration::from_millis(1000));
        log::error!("Shutdown seems to hang. Killing program ...");
        std::process::exit(1);
    });

    Ok(())
}

Answer 12 · 2023-01-03T21:29:50.000Z

Feel free to post further thoughts or ideas, I'm open for everything on this problem. I'm struggling to think of something graceful right now, I'll also do some further research.

Answer 13 · 2023-01-03T21:33:02.000Z

I agree that this is definitely not the nicest solution. Considering that such a blocking thread should usually just be ignored and an exit should happen immediately, this feels like an issue on the tokio side.

Granted though, I was only running into this issue because I had to restart my program because of network issues, and that's exactly when I noticed this problem of course. As long as the network just works, the shutdown is normal. So this is definitely an edge-case.

Answer 14 · 2023-01-03T21:36:00.000Z

The best solution I can come up with right now is to put above chunk of code in a function, like

fn kill_on_hang(timeout: Duration) {
    thread::spawn(|| {
        thread::sleep(timeout);
        log::error!("Shutdown seems to hang. Killing program ...");
        std::process::exit(1);
    });
}

So that people can use it like this:

use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    select! {
        e = hanging_task => e.unwrap(),
        _ = subsys.on_shutdown_requested() => (),
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(200))
        .await?;

    kill_on_hang(Duration::from_secs(1));

    Ok(())
}

But it doesn't feel right. I agree that I think the solution should be on tokio side somewhere. And it is, if you use tokio's runtime directly and can call Runtime::shutdown_background(). It just sucks that tokio::Runtime's default drop behaviour is to wait indefinitely.

Answer 15 · 2023-01-03T21:38:56.000Z

@DASPRiD In case you want to report it to tokio, this is the minimal reproducible example:

use std::{thread, time::Duration};

#[tokio::main]
async fn main() {
    tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });
}

In my opinion it should exit immediately, but it doesn't.

My reasoning is that this also exits immediately:

use std::{thread, time::Duration};

fn main() {
    thread::spawn(|| {
        thread::sleep(Duration::from_secs(10));
    });
}

Further, this also exits immediately:

use std::time::Duration;

#[tokio::main]
async fn main() {
    tokio::spawn(async {
        tokio::time::sleep(Duration::from_secs(10)).await;
    });
}

Answer 16 · 2023-01-03T21:39:38.000Z

I should do that, thanks a lot! I'll reference this issue over there.

Answer 17 · 2023-01-03T21:53:08.000Z

I've created an issue on the tokio repository. Maybe you can chime in there as well if you have anything to add ❤️

Answer 18 · 2023-01-04T21:51:38.000Z

@DASPRiD I prototyped a solution in tokio:

Cargo.toml:

[patch.crates-io]
tokio = { version = "1.23.1", git = "https://github.com/Finomnis/tokio", branch = "main_arg_ignore_hanging_threads" }

[dependencies]
tokio-graceful-shutdown = "0.12.1"
tokio = { version = "1.23.1", features = ["full"] }
env_logger = "0.10.0"
miette = { version = "5.5.0", features = ["fancy"] }

use std::{thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::select;
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let mut hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(100));
    });

    let joinhandle_ref = &mut hanging_task;

    select! {
        e = joinhandle_ref => e.unwrap(),
        _ = subsys.on_shutdown_requested() => (),
    }

    Ok(())
}

#[tokio::main(ignore_hanging_threads = true)]
async fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    Toplevel::new()
        .catch_signals()
        .start("Hanging Task", hanging_task)
        .handle_shutdown_requests(Duration::from_millis(2000))
        .await
        .map_err(Into::into)
}

What are your thoughts?

Answer 19 · 2023-01-04T21:54:33.000Z

That looks quite good to me. The parameter name is a little ambiguous though maybe, as this is quite specific to tokio spawned threads. Apart from that, it looks like a really clean solution.

Answer 20 · 2023-01-04T21:56:45.000Z

Yah, it's still open for debate. Now I just need two things:

tests
agreement with the tokio devs

The second one could take a while, tokio is notoriously understaffed with many open pull requests. They are doing an excellent job, just to clarify, but there's just too many people wanting things done :D

So be patient I guess

Answer 21 · 2023-01-04T21:59:36.000Z

Maybe ignore_pending_threads instead of ignore_hanging_threads? Or detach_threads_on_exit? I'm not sure, I hate naming things :D

Answer 22 · 2023-01-04T22:11:57.000Z

I feel you on that, coming up with short but still descriptive names is… hard. Maybe something like abandon_pending_threads?

Answer 23 · 2023-01-04T22:33:27.000Z

Sounds good, might use it. Gotta leave for today though, can't promise when I'll get to it. Greetz

Answer 24 · 2023-01-04T22:34:03.000Z

Thanks a lot for your effort! :)

Answer 25 · 2023-01-06T19:40:35.000Z

@DASPRiD I opened a PR for the first part of the fix, fyi: tokio-rs/tokio#5360

Answer 26 · 2023-01-06T20:06:38.000Z

Thanks for keeping me in the loop. I assume a second PR would add that to the tokio-main macro?

Answer 27 · 2023-01-06T22:24:50.000Z

Exactly :) it's currently completely impossible to fix that reliably, with our without #[tokio::main]. The first PR adds that functionality to the Runtime.

Answer 28 · 2023-01-09T20:37:19.000Z

@DASPRiD It might be unlikely that we get the parameter for #[tokio::main] merged. I'm currently fighting for the config option for Runtime, that one might happen. You might have to use the Runtime object directly then, but that shouldn't be a problem, #[tokio::main] is really just a thin wrapper around the Runtime object. Wish me luck ;)

Answer 29 · 2023-01-20T16:42:36.000Z

I see that there was no movement on the PR yet, I really hope they get around to it eventually :)

Answer 30 · 2023-01-20T20:34:56.000Z

I'm not sure, they seem to be quite busy ... I'll ask again in their discord soon.

Answer 31 · 2023-02-06T18:51:52.000Z

@DASPRiD After a long discussion they rejected the pull request and instead proposed a solution on our side.

Their proposal was to wrap the runtime in a newtype that calls shutdown_background() in its Drop implementation.

Here is the documentation on what #[tokio::main] actually expands to: https://docs.rs/tokio/latest/tokio/attr.main.html#using-the-multi-thread-runtime

With that, this is how such a newtype thing would look like:

use std::{future::Future, thread, time::Duration};

use env_logger::{Builder, Env};
use miette::Result;
use tokio::{runtime::Runtime, select};
use tokio_graceful_shutdown::{SubsystemHandle, Toplevel};

struct RuntimeWithInstantShutdown(Option<Runtime>);
impl RuntimeWithInstantShutdown {
    pub fn new() -> Self {
        Self(Some(
            tokio::runtime::Builder::new_multi_thread()
                .enable_all()
                .build()
                .unwrap(),
        ))
    }
    pub fn block_on<F: Future>(&self, future: F) -> F::Output {
        self.0.as_ref().unwrap().block_on(future)
    }
}
impl Drop for RuntimeWithInstantShutdown {
    fn drop(&mut self) {
        self.0.take().unwrap().shutdown_background()
    }
}

async fn hanging_task(subsys: SubsystemHandle) -> Result<()> {
    let hanging_task = tokio::task::spawn_blocking(|| {
        thread::sleep(Duration::from_secs(10));
    });

    select! {
        e = hanging_task => e.unwrap(),
        _ = subsys.on_shutdown_requested() => (),
    }

    Ok(())
}

fn main() -> Result<()> {
    // Init logging
    Builder::from_env(Env::default().default_filter_or("debug")).init();

    RuntimeWithInstantShutdown::new().block_on(async {
        Toplevel::new()
            .catch_signals()
            .start("Hanging Task", hanging_task)
            .handle_shutdown_requests(Duration::from_millis(200))
            .await
            .map_err(Into::into)
    })
}

Answer 32 · 2023-02-06T19:37:10.000Z

@Finomnis Interesting, thanks a lot for your effort! Do you have any plan to have a macro for that within this library to make it easier for users?

Answer 33 · 2023-02-06T19:38:42.000Z

Not really, to be honest. It's different enough that it shouldn't be part of this crate, in my opinion.

It's just a few lines of code, so I don't think it's even worth abstracting it much. Rather the opposite: because runtime configuration is so highly context dependent, I think this is already the best and most generic API. Writing a macro would only make my crate more opinionated, and I'm trying to keep it as generic and unopinionated as possible.