Stop hooks are not called when Start times out
jkanywhere opened this issue · 17 comments
Describe the bug
None of the Stop
hooks are called when fx.Start
times out.
To Reproduce
A failing example.
Add at least one start and stop hook.
Make a start hook that takes longer than the start timeout.
Expected behavior
For every Start
hook that executes, the corresponding Stop
hook should execute when the app shuts down, for any reason.
Additional context
https://github.com/uber-go/fx/blob/master/app.go#L646
(*App).Start
uses withTimeout
to call (*App).start
.
(*App).start
uses withRollback
to call lifecycle.Stop
.
Because the withTimeout
is outside of the withRollback
, the stop is never called when the start context deadline is reached.
There is at least one leaky test due to this behavior: TestSpireInitTimeout
from //src/code.uber.internal/compute/container-launch-helper/main:go_default_test
.
If one of the Start
hooks returns an error then the previously registered Stop
hooks are indeed called as expected. See example of correct behavior.
It seems like this was a deliberate change in behavior from #875. Is there a particular reason you want Stop hooks to run with an expired context?
I want Stop
hooks to run to clean up resources such as stopping background go routines so tests do not fail go.uber.org/goleak
.
I don't really care which context is used. If a Start
hook runs then the corresponding Stop
hook should run.
In the example I call fx.(*App).Stop
with a fresh non expired context, and the stop hooks do not run.
Ah thanks for the reply. I must have misread the repro example. I'll investigate this further.
From https://github.com/uber-go/fx/pull/875/files DoNotRunStartHooksWithExpiredCtx
proves that after a context expires no additional start hooks should execute. And, since no start hooked ran there no stop hooks will run.
DoNotRunStopHooksWithExpiredCtx
likewise shows that no new Stop
hooks will run after the context expires.
I want a case more like ErrHaltsChainAndRollsBack
where some of the Start
hooks suceeded so those Stop
hooks must run.
In many situations it won't matter very much because resources will get cleaned up when the entire process is shutting down. However, in some cases there may be external finalization that should still occur even with production code. Things like flushing logs or emitting metrics.
Test code builds and destroys many fx.App
instances so it is vital to be able to clean up after a failed Start
.
Alright, I think I found the issue. It's because #994 changed so that the app.Stop returns early without calling Stop hooks if app never started.
I would expect the app to be in state incompleteStart
after the Start context times out.
I don't know how to inspect the state from outside the app.
If you want, I can add unit tests in fx
to try to reproduce the issue.
It's actually stuck at starting
if context times out during Start for some reason
It's actually stuck at
starting
if context times out during Start for some reason
Good to know!
So either we need to move from starting
to incompleteStart
after timeout,
or expand https://github.com/uber-go/fx/blob/master/internal/lifecycle/lifecycle.go#L268 to allow starting
.
It's actually stuck at
starting
if context times out during Start for some reason
I agree with your "for some reason".
How does the defer
block of lifecycle.(* Lifecycle).Start
get missed? Looks like it is really trying to get to incompleteStart
, even after returning on line 208.
https://github.com/uber-go/fx/blob/master/internal/lifecycle/lifecycle.go#L193-L209
l.state = starting
l.startRecords = make(HookRecords, 0, len(l.hooks))
l.mu.Unlock()
var returnState appState = incompleteStart
defer func() {
l.mu.Lock()
l.state = returnState
l.mu.Unlock()
}()
for _, hook := range l.hooks {
// if ctx has cancelled, bail out of the loop.
if err := ctx.Err(); err != nil {
return err
}
How does the
defer
block oflifecycle.(* Lifecycle).Start
get missed?
Yeah, I'm trying to figure this one out..
Okay, if you add time.Sleep(_timeout)
before calling app.Stop(stopCtx)
the state actually transitions into incompleteStart
started
, and the Stop hooks are run accordingly. The reason it goes to started
in this example is because the long running hook is the last Start hook, and we don't check for context.Err() after a hook finishes running. The hook that's sleeping is actually not done running before app.Stop
is called, so the defer
block which sets the lifecycle state is never run, and the program exits.
There's not really an easy way to transition lifecycle state from starting
to incompleteStart
when context expires before a hook finishes running. And even if we expand the condition on https://github.com/uber-go/fx/blob/master/internal/lifecycle/lifecycle.go#L268, we might still end up not running the Stop hook, for which a corresponding Start hook is still running. I'll investigate more on the solution tomorrow.
Internal Ref: GO-1859
hook that's sleeping is actually not done running before
app.Stop
is called
ack. I missed the go routine started in withTimeout.
With that knowledge I might be able to work around this in the particular test I'm debugging.
I'll investigate more on the solution tomorrow.
Thanks!
Following along. This is indeed an interesting edge case.
I was able to investigate on this a little more recently. The biggest issue with a long-running start hook that causes a timeout is that there's no way to stop the hook itself from running when context expires. And we can't really run the corresponding stop hook if a start hook is still running by the time app.Stop is called.
We can, however, make sure to run all stop hooks for which corresponding start hooks finished running when app.Stop is called. There are two options for this:
- Make sure that app's lifecycle state to be
incompleteStart
when app.Start times out - Expand the condition in lifecycle.go to include
starting
Ideally we would go with option 1, but this might require a non-trivial amount of refactor to the way we execute lifecycle hooks in general. So I think for now we should implement option 2, and when we eventually implement option 1, we can revert option 2.
Additionally, we should clarify in documentation that for a long-running start hook that isn't done running when app.Start times out, the corresponding stop hook will NOT be run if it's not done running by the time app.Stop is called.