uber-go/fx

Stop hooks are not called when Start times out

jkanywhere opened this issue · 17 comments

Describe the bug
None of the Stop hooks are called when fx.Start times out.

To Reproduce
A failing example.

Add at least one start and stop hook.
Make a start hook that takes longer than the start timeout.

Expected behavior
For every Start hook that executes, the corresponding Stop hook should execute when the app shuts down, for any reason.

Additional context

https://github.com/uber-go/fx/blob/master/app.go#L646
(*App).Start uses withTimeout to call (*App).start.
(*App).start uses withRollback to call lifecycle.Stop.
Because the withTimeout is outside of the withRollback, the stop is never called when the start context deadline is reached.

There is at least one leaky test due to this behavior: TestSpireInitTimeout from //src/code.uber.internal/compute/container-launch-helper/main:go_default_test.

If one of the Start hooks returns an error then the previously registered Stop hooks are indeed called as expected. See example of correct behavior.

It seems like this was a deliberate change in behavior from #875. Is there a particular reason you want Stop hooks to run with an expired context?

I want Stop hooks to run to clean up resources such as stopping background go routines so tests do not fail go.uber.org/goleak.
I don't really care which context is used. If a Start hook runs then the corresponding Stop hook should run.
In the example I call fx.(*App).Stop with a fresh non expired context, and the stop hooks do not run.

Ah thanks for the reply. I must have misread the repro example. I'll investigate this further.

From https://github.com/uber-go/fx/pull/875/files DoNotRunStartHooksWithExpiredCtx proves that after a context expires no additional start hooks should execute. And, since no start hooked ran there no stop hooks will run.

DoNotRunStopHooksWithExpiredCtx likewise shows that no new Stop hooks will run after the context expires.

I want a case more like ErrHaltsChainAndRollsBack where some of the Start hooks suceeded so those Stop hooks must run.

In many situations it won't matter very much because resources will get cleaned up when the entire process is shutting down. However, in some cases there may be external finalization that should still occur even with production code. Things like flushing logs or emitting metrics.

Test code builds and destroys many fx.App instances so it is vital to be able to clean up after a failed Start.

Alright, I think I found the issue. It's because #994 changed so that the app.Stop returns early without calling Stop hooks if app never started.

It's because #994 changed so that the app.Stop returns early without calling Stop hooks if app never started.

Quite plausible.
See also #1019 which partially rolled back #994.

I would expect the app to be in state incompleteStart after the Start context times out.
I don't know how to inspect the state from outside the app.
If you want, I can add unit tests in fx to try to reproduce the issue.

It's actually stuck at starting if context times out during Start for some reason

It's actually stuck at starting if context times out during Start for some reason

Good to know!
So either we need to move from starting to incompleteStart after timeout,
or expand https://github.com/uber-go/fx/blob/master/internal/lifecycle/lifecycle.go#L268 to allow starting.

It's actually stuck at starting if context times out during Start for some reason

I agree with your "for some reason".

How does the defer block of lifecycle.(* Lifecycle).Start get missed? Looks like it is really trying to get to incompleteStart, even after returning on line 208.
https://github.com/uber-go/fx/blob/master/internal/lifecycle/lifecycle.go#L193-L209

	l.state = starting

	l.startRecords = make(HookRecords, 0, len(l.hooks))
	l.mu.Unlock()

	var returnState appState = incompleteStart
	defer func() {
		l.mu.Lock()
		l.state = returnState
		l.mu.Unlock()
	}()

	for _, hook := range l.hooks {
		// if ctx has cancelled, bail out of the loop.
		if err := ctx.Err(); err != nil {
			return err
		}

How does the defer block of lifecycle.(* Lifecycle).Start get missed?

Yeah, I'm trying to figure this one out..

Okay, if you add time.Sleep(_timeout) before calling app.Stop(stopCtx) the state actually transitions into incompleteStart started, and the Stop hooks are run accordingly. The reason it goes to started in this example is because the long running hook is the last Start hook, and we don't check for context.Err() after a hook finishes running. The hook that's sleeping is actually not done running before app.Stop is called, so the defer block which sets the lifecycle state is never run, and the program exits.

There's not really an easy way to transition lifecycle state from starting to incompleteStart when context expires before a hook finishes running. And even if we expand the condition on https://github.com/uber-go/fx/blob/master/internal/lifecycle/lifecycle.go#L268, we might still end up not running the Stop hook, for which a corresponding Start hook is still running. I'll investigate more on the solution tomorrow.

Internal Ref: GO-1859

hook that's sleeping is actually not done running before app.Stop is called

ack. I missed the go routine started in withTimeout.
With that knowledge I might be able to work around this in the particular test I'm debugging.

I'll investigate more on the solution tomorrow.

Thanks!

zmt commented

Following along. This is indeed an interesting edge case.

I was able to investigate on this a little more recently. The biggest issue with a long-running start hook that causes a timeout is that there's no way to stop the hook itself from running when context expires. And we can't really run the corresponding stop hook if a start hook is still running by the time app.Stop is called.

We can, however, make sure to run all stop hooks for which corresponding start hooks finished running when app.Stop is called. There are two options for this:

  1. Make sure that app's lifecycle state to be incompleteStart when app.Start times out
  2. Expand the condition in lifecycle.go to include starting

Ideally we would go with option 1, but this might require a non-trivial amount of refactor to the way we execute lifecycle hooks in general. So I think for now we should implement option 2, and when we eventually implement option 1, we can revert option 2.

Additionally, we should clarify in documentation that for a long-running start hook that isn't done running when app.Start times out, the corresponding stop hook will NOT be run if it's not done running by the time app.Stop is called.