opsmatic/canstop

Dealing with subordinate tasks

Closed this issue · 5 comments

There are a couple of problems with the initial implementation; let's just list them off.

There's no cleaning up going on. It works fine for the use case I had in mind while writing it (a few goroutines responsible for the lifetime of certain services within the larger program), but it breaks down if you start passing in more ephemeral, numerous pieces of work like TCP connections. It does not work very well for the actual example it was inspired by (the @rcrowley blog post).

On the flip side, the original approach is almost certain to cause contention. If every single goroutine in the program is constantly checking the same channel, we're bound to have a problem. I'm not exactly sure how channels are implemented, but sync is sync: if we have Nk clients on a box with 32 cores, all 32 of those cores are going to be synchronizing on that channel on the regular. PROBABLY I'm pre-optimizing, and it may all be dwarfed by HTTP traffic or whatever, but I guess we'll see. I just worry that we'll end up on the path of "striped" channels anyway, and at that point what are we even doing here?

Switching to a single channel and relying on the close() event does have the advantage that no bookkeeping needs to be done. Then again, the accounting we have to do on tombs/channels would probably end up costing about the same (hashmap lookups, etc.).

ALLOW ME TO TAKE ANOTHER CUT.

Seems like accounting is still needed either way if we want to be able to report specifically un-graceful termination, which I think we do.

One thing we can do is punt on managing the sub-tasks and let that be managed downstream. AKA, the HTTP service will still be notified of cancellation, but will be responsible for managing its own WaitGroup or set of tombs for the connections it leaves open. It can wait to call Tomb.Done() until after it's done its own cleanup. But is this just a punt?
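The "punt" could be sketched like this (all names here — `Service`, `HandleConn`, `Shutdown` — are hypothetical, just standing in for whatever the downstream service does): the service hears about cancellation once, then drains its own connections via a private WaitGroup before signalling completion, which is where the real Tomb.Done() call would go.

```go
package main

import (
	"fmt"
	"sync"
)

// Service owns its own connection accounting. The parent only ever
// sees "shutting down" and "done"; the per-connection WaitGroup is
// an internal detail.
type Service struct {
	stop  chan struct{}
	conns sync.WaitGroup
}

func NewService() *Service {
	return &Service{stop: make(chan struct{})}
}

// HandleConn registers one connection's worth of work. The goroutine
// stands in for servicing a real TCP connection.
func (s *Service) HandleConn(id int) {
	s.conns.Add(1)
	go func() {
		defer s.conns.Done()
		<-s.stop // pretend work; block until told to shut down
		fmt.Printf("conn %d drained\n", id)
	}()
}

// Shutdown is what runs on cancellation. It returns (the analogue of
// calling Tomb.Done()) only after every open connection has finished.
func (s *Service) Shutdown() {
	close(s.stop)
	s.conns.Wait()
}

func main() {
	s := NewService()
	s.HandleConn(1)
	s.HandleConn(2)
	s.Shutdown()
	fmt.Println("service done")
}
```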

Possible epiphany: once you're dealing with individual connections, you're less worried about things like accounting and more just worried about letting the requests finish before main() terminates, aka what sync.WaitGroup is for. What we really need is an abstraction for the consumer of the tomb/control object to keep track of jobs it itself has spawned and wait for those to complete before marking itself as done.

To put it another way, logging "goro servicing connection of type X to specific host Y did not terminate quickly" is less interesting than "service for connections of type X did not exit cleanly."

This makes me think that we probably don't need to provide too much in the way of managing that, because users could use their own sync.WaitGroup, but it might be nice to attach one to every instance of the control object anyway to help reduce boilerplate.

Something like Control.ClockIn and Control.ClockOut (but with less dumb names), which would just call Add() and Done() on a WaitGroup, respectively.
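A minimal sketch of that convenience, using the placeholder names from above (the actual names and the `WaitForJobs` method are invented here, not the library's API):

```go
package main

import (
	"fmt"
	"sync"
)

// Control carries a WaitGroup so consumers don't have to manage their
// own. ClockIn/ClockOut just delegate to Add(1) and Done().
type Control struct {
	jobs sync.WaitGroup
}

func (c *Control) ClockIn()  { c.jobs.Add(1) }
func (c *Control) ClockOut() { c.jobs.Done() }

// WaitForJobs blocks until every clocked-in job has clocked out --
// this is what the consumer calls before marking itself done.
func (c *Control) WaitForJobs() { c.jobs.Wait() }

func main() {
	var c Control
	for i := 0; i < 3; i++ {
		c.ClockIn()
		go func(n int) {
			defer c.ClockOut()
			fmt.Printf("job %d finished\n", n)
		}(i)
	}
	c.WaitForJobs()
	fmt.Println("all jobs done")
}
```

It's barely more than a WaitGroup with opinions, which is kind of the point: the consumer only has to remember one object.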

Realizing that I need to just re-implement @rcrowley's echo server with this to make sure it works well.

This was taken care of with the Service vs Session distinction and passing a full Lifecycle object into everything.