thejerf/suture

Supervisor.Serve() should block until all services are actually stopped

Closed this issue · 13 comments

I've come across this problem while trying to handle the os.Interrupt signal gracefully.

What I'm doing is: I spawn a goroutine which waits for the signal and calls Supervisor.Stop(), and in the main goroutine I call the blocking Supervisor.Serve(). What happens here is that when I stop the supervisor, it spawns a bunch of goroutines which should stop all services under it, and Serve() returns after that. Unfortunately, since it's in main(), the program terminates before all services have actually stopped gracefully.
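Roughly, the setup looks like this (a sketch against the current suture API; `myService` is just a placeholder for one of my services):

```go
package main

import (
	"os"
	"os/signal"

	"github.com/thejerf/suture"
)

// myService stands in for a real service: Serve blocks until Stop is called.
type myService struct{ stop chan struct{} }

func (m *myService) Serve() { <-m.stop }
func (m *myService) Stop()  { close(m.stop) }

func main() {
	sup := suture.NewSimple("root")
	sup.Add(&myService{stop: make(chan struct{})})

	// Separate goroutine: wait for Ctrl-C, then ask the supervisor to stop.
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt)
	go func() {
		<-c
		sup.Stop()
	}()

	// Blocks until the supervisor is stopped -- but then main() exits,
	// possibly before the services have finished their own shutdown.
	sup.Serve()
}
```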

It's possible to work around that by requiring all services to confirm their termination, but it's not a very nice solution.

So, let's break this into theoretical and practical.

In theory, there's no solution to this if we're going to assume (correctly) that goroutines may not terminate. The problem stems from the fact that there is no way to be sure a goroutine has terminated, and there is no way to force a goroutine to terminate. No matter how you slice it, you can stop a supervisor and it may never finish stopping its routines.

In practical terms, all you can really do is try to stop things and offer a timeout. But Suture could make that nicer and offer a function call that will either return when the supervisors do, or return after a certain timeout. And it could also recurse nicely down the supervisor tree to try to shut the whole thing down gracefully. So I could write up a function like *Supervisor.GracefulStop(time.Duration), which will return either when everything under it terminates or when the duration is up.
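To illustrate the "stop or give up after a timeout" idea, here's a hypothetical caller-side helper (not part of suture; it assumes the usual `time` and `suture` imports, and today's Stop() semantics):

```go
// stopWithTimeout asks the supervisor to stop and reports whether Stop()
// returned within d. Today Stop() returns once the supervisor's own loop
// exits, which is exactly the gap being discussed here.
func stopWithTimeout(sup *suture.Supervisor, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		sup.Stop()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}
```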

Does that sound good? I'll have to sketch this up carefully to make sure I get the blocking behavior correct.

Actually, I take back the need for time.Duration; the supervisor's Timeout specification will do, I think.

Which means that in theory this could be done by modifying the Supervisor's Stop implementation to unconditionally work this way. In theory this shouldn't break anybody, because the described contract for Stop is already that it shouldn't return until the service is stopped. At the moment the Supervisor meets that contract by shutting down its event loop, but in theory it could be arranged to wait for its children to try to terminate and then shut itself down once they return.

Thoughts on changing the Stop behavior for Supervisor? It seems to make sense and conform with the given contracts, and your use case seems sensible to me.

Thanks for the quick response! Yep, this sounds good.

To be a little more detailed, here's my view on a possible solution:

  1. Suture can't be sure whether a service's goroutine has actually terminated.
  2. It's reasonable to assume that if a call to Service.Stop() has returned, the service has terminated. If a service developer cares about graceful termination, they can implement Stop() so that it blocks until the service finishes all mandatory cleanup work (see the sketch after this list).
  3. If the developer doesn't care enough to implement that, suture shouldn't care either :-)
  4. When suture has confirmation that all calls to the underlying Service.Stop() have returned, it's safe for suture to return as well.
  5. If the timeout fires before that, suture should return with an error.
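
As a sketch of point 2, here's a hypothetical service whose Stop() blocks until its Serve goroutine has finished all cleanup (the names are made up, not suture API):

```go
// fileWriter illustrates a Stop() that doesn't return until cleanup is done.
type fileWriter struct {
	stop    chan struct{}
	stopped chan struct{}
}

func newFileWriter() *fileWriter {
	return &fileWriter{
		stop:    make(chan struct{}),
		stopped: make(chan struct{}),
	}
}

func (f *fileWriter) Serve() {
	// ... open files, start doing work ...
	<-f.stop
	// ... flush buffers, close files, other mandatory cleanup ...
	close(f.stopped) // signal that cleanup is complete
}

func (f *fileWriter) Stop() {
	close(f.stop)
	<-f.stopped // block until Serve has actually finished its cleanup
}
```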

PS. I'm very new to golang, so I'm not sure that my view matches common practice :-)

> Thoughts on changing the Stop behavior for Supervisor? It seems to make sense and conform with the given contracts, and your use case seems sensible to me.

Sounds perfectly fine.

Actually, upon further thought, I do think I want a new function, because I want the signature to be GracefulStop() []{Something}, which will return to you a list of services that failed to stop. Probably in the form of strings suitable for logging, since there are two reasons why returning the Service itself is dangerous: one, it can be assumed to be in a failure state at this point, so who knows what trying to call anything on it will do; and two, it's intrinsically racy, because between the failure to stop and the return it may well have stopped. But it's still practically useful to log what failed to stop.

I'd return []ServiceToken. First, it allows the user to identify programmatically which services failed to stop and maybe do something about it, if they care. Second, it's possible to later add the fmt.Stringer interface to the ServiceToken contract and say that ServiceToken.String() should return the same thing as the corresponding Service.String() (if the service implements fmt.Stringer, of course). But maybe I'm overthinking it.

Yeah, I'll have another look at the code after lunch. It's possible that the existing logging of failures to terminate is already enough information about what failed to terminate, so changing Stop() might be OK. I'd definitely bump the rev to 2.0.0 for that, but it may still be the right thing to do. I want to add as little overlap as possible with what is already there.

I have chosen to try to implement it so that the supervisor will wait for its children to stop before terminating, and after some examination of the code, depending on the existing logging seems the most sensible. I've pushed this to a branch called "supervisor_stop"; would you be so kind as to try this out with your use case and see if it meets your needs? If so, I'll merge this as either a 1.1.0 or a 2.0.0 release. (Partially based on my own experiences merging my work code bases that use this.)

The only crazy case I can see is a tree where the top-level Supervisor has a timeout of, say, 10 seconds, but has supervisor children with timeouts of 20 seconds: if the lower-level ones hang, the top-level one will just give up and claim the supervisor failed, rather than the services themselves. It isn't clear to me that's "wrong", though; every layer is doing as you asked.
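
For concreteness, a sketch of that tree (using the Spec.Timeout knob; slowService is a placeholder for something that needs longer than 10 seconds to stop):

```go
parent := suture.New("parent", suture.Spec{Timeout: 10 * time.Second})
child := suture.New("child", suture.Spec{Timeout: 20 * time.Second})

child.Add(&slowService{}) // hypothetical service that needs >10s to stop
parent.Add(child)         // a Supervisor is itself a Service, so it nests

go parent.Serve()
// ... later ...
parent.Stop() // after ~10s the parent gives up and reports the child
              // supervisor as having failed to stop, not slowService
```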

I've just checked; supervisor_stop works fine for me.

> The only crazy case I can see is a tree where the top-level Supervisor has a timeout of, say, 10 seconds, but has supervisor children with timeouts of 20 seconds: if the lower-level ones hang, the top-level one will just give up and claim the supervisor failed, rather than the services themselves.

I think this is actually expected behavior. Generally, a higher-level supervisor doesn't have to be aware of whether a service under it is another supervisor or not, so supervisors shouldn't be treated specially in any way. If a service's timeouts aren't coherent with its supervisor's, that's the service's problem.

Alright, I'm going to take some time this week to put it through its paces in my code bases, and if it passes everything, I'll merge it in as a 1.1.0.

In the meantime, this has no API changes and I expect no API changes to result, so you can use this as you like, and let me know if I've mucked something up. :)

Ok, thanks for the quick reaction! I'll let you know if I spot anything.

Took me a bit longer to get around to it than I thought, but it's merged now.

Cool, thanks!