doxout/recluster

Option to fork new worker on worker "stoppedListening"?

sandinmyjoints opened this issue

This still stands between me and zero downtime: on an uncaughtException, I call server.close(), which means the worker is no longer accepting new connections. However, it may take some time before it exits, since it needs to wind down existing connections, clean up, etc. During this time, recluster will not fork a new worker, so I'm down one worker. On a two-core machine, should another uncaught exception be thrown (perhaps a bad request triggered the first, and now the client resends it, so the second worker gets it), I will have no workers accepting new connections.

It seems that in this case, master will queue new connections until a worker accepts them, though I don't know for how long it will hold them, nor how many it will queue. But basically, this is downtime.
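
For illustration, the worker-side handler I have in mind looks roughly like this (a simplified sketch: the request handler, port, and the 30-second safety timeout are placeholders, not my actual app):

    var http = require('http');

    var server = http.createServer(function (req, res) {
      res.end('ok');
    });
    server.listen(process.env.PORT || 3000);

    // On an uncaught exception: stop accepting new connections, let in-flight
    // requests finish, then exit. Until the process actually exits, the master
    // has no signal that this worker is effectively out of rotation.
    process.on('uncaughtException', function (err) {
      console.error('uncaught exception, winding down worker:', err.stack || err);
      server.close(function () {
        process.exit(1);
      });
      // Safety net in case open connections never drain (value is arbitrary).
      setTimeout(function () { process.exit(1); }, 30000).unref();
    });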

On recluster.reload, old workers are disconnected and new workers to replace them are immediately forked, which works great for zero downtime. However, could there be some way for a worker to tell master "I am no longer listening for new connections" which would be another trigger for master to fork a new worker? Maybe master could listen for a message "stoppedListening" from the worker, that I could send in my uncaught exception handler?
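
To make the suggestion concrete, here is a rough sketch in plain cluster terms rather than recluster's API (the 'stoppedListening' message name and the ./app module are hypothetical):

    var cluster = require('cluster');

    if (cluster.isMaster) {
      var forkWorker = function () {
        cluster.fork().on('message', function (msg) {
          if (msg && msg.cmd === 'stoppedListening') {
            // Replace the worker as soon as it stops accepting connections,
            // instead of waiting for it to finish winding down and exit.
            // (A real implementation would also have to make sure the later
            // 'exit' of the same worker does not trigger a second replacement.)
            forkWorker();
          }
        });
      };
      forkWorker();
      forkWorker();
    } else {
      // Worker: the handler from the sketch above, plus one extra line before
      // server.close():  process.send({cmd: 'stoppedListening'});
      require('./app');
    }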

Just throwing this out there for discussion. I know you said you would be busy with a project so no worries if you don't have time for this now, but interested to hear your thoughts whenever you are able!

Hmm, I thought that we were respawning workers in cluster.on('disconnect'). Looks like the sensible thing to do, at least if {readyWhen: 'listening'}. We should run the tests for that.

Off the top of my head, this is how I think the three distinct cases should behave:

readyWhen: started - use worker.on('exit', workerExit) - these guys don't communicate much so assume we need to wait for them to die.
readyWhen: listening - use worker.on('disconnect', workerExit) - we can tell that they should be replaced if they're not listening anymore.
readyWhen: ready - use worker.on('message', function(m) { if (m.cmd == 'disconnect') workerExit(); }) - these are the talkative ones and we expect them to tell us that they should be replaced.

Does that look right?
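
In code, that would be roughly the following (plain cluster terms; the watchWorker name is just for illustration):

    function watchWorker(worker, readyWhen, workerExit) {
      if (readyWhen === 'started') {
        // No feedback channel beyond the process lifetime.
        worker.on('exit', workerExit);
      } else if (readyWhen === 'listening') {
        // A worker that disconnects from the IPC channel is no longer
        // sharing server handles, so treat that as "needs replacing".
        worker.on('disconnect', workerExit);
      } else if (readyWhen === 'ready') {
        // The worker announces its own retirement.
        worker.on('message', function (m) {
          if (m && m.cmd === 'disconnect') workerExit();
        });
      }
    }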

Two things will need to be taken into account:

  1. readyWhen: 'listening' should use on('exit', workerExit) before the worker starts listening. Only afterwards should it remove that listener and switch to on('disconnect', workerExit).
  2. readyWhen: 'ready' should use on('exit', workerExit) all the time, but remove it if a disconnect message fires.

With this and the appropriate tests I think we're going to get the desired respawn behavior in all cases.

Edit: Actually, even better:

  1. All methods should use on('exit', workerExit) - this is because the signals in 2. and 3. may not always fire (uncaught exceptions, syntax errors, etc. can prevent a worker from shutting down gracefully). (TODO: check)
  2. {readyWhen: 'listening'} should also use on('disconnect', workerExit) (TODO)
  3. {readyWhen: 'ready'} should capture a {cmd: 'disconnect'} message for workerExit (TODO)
  4. workerExit should (see the sketch after this list):
    • only fire once per worker (TODO)
    • set up a timeout to kill that worker (already done)
    • spawn a replacement. (already done)
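
A rough sketch of point 4, with made-up names for the timeout value and the replacement helper:

    function makeWorkerExit(worker, timeout, spawnReplacement) {
      var fired = false;
      return function workerExit() {
        if (fired) return;                   // only fire once per worker
        fired = true;

        // Give the old worker some time to wind down, then kill it
        // (the kill is a no-op if it has already exited).
        var timer = setTimeout(function () {
          try { worker.kill(); } catch (e) { /* already gone */ }
        }, timeout);
        worker.once('exit', function () { clearTimeout(timer); });

        spawnReplacement();                  // fork a new worker in its place
      };
    }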

Makes sense. workerExit should remove all listeners so it is only ever called once, as you mention. The question is when to trigger it -- and this notion of "simple server workers", "listening workers", and "ready workers" (corresponding to points 1, 2, and 3 above) answers that question.

I do have a question about the "listening worker" case: should it also capture {cmd: 'disconnect'}? Here is why: the worker's http server may close and stop accepting new connections (for example, my app's uncaught exception handler does this using the reference it has to server), yet the worker process may not disconnect from the cluster IPC channel at that point, so no 'disconnect' event is emitted. (My uncaught exception handler lives in app.js, not cluster.js, so it has no reference to the worker to call disconnect on.) In that situation the worker process is still running and still on the IPC channel, so the master will keep sending it new connections, but its http server is not accepting them, so it really should be disconnected from IPC -- the question is how to do that.

Maybe there is a way I am missing to disconnect from the IPC channel directly via process, without a reference to the worker. But I can definitely call process.send({cmd: 'disconnect'}), so I'm wondering whether that message should also be listened for in the readyWhen: 'listening' case.
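
Concretely, something like this in app.js, where only server and process are in scope (a sketch; the 30-second timeout is arbitrary):

    process.on('uncaughtException', function (err) {
      console.error(err.stack || err);
      server.close();            // the http server stops accepting connections,
                                 // but the worker stays on the IPC channel
      if (process.send) process.send({cmd: 'disconnect'});  // so tell the master explicitly
      setTimeout(function () { process.exit(1); }, 30000).unref();
    });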

I agree, it makes sense to always enable that command. Or at least I don't see harm in enabling it.

The remaining work here is to write the tests that check whether the behavior is correct. This is going to be a little hard, so I'll postpone it for a while longer.

Thanks for the update. I took a look into doing it myself but the tests scared me off TBH -- didn't have time to figure out what was going on there and what it would take to cover the new behavior.

Complete: implemented what I think it should do without breaking old tests.

In progress: implementing new termination tests.

You can check out master if you're curious to see whether the WIP works for you...

Done, with about 98% test coverage. Can you check if this fixes stuff for you?

Awesome, thanks for working on this! At first glance, it looks to me like it should work. I don't have the capacity to try it out right now, but should be able to this weekend.

Cool. And by the way, while I was at it I also fixed the previously mentioned EventEmitter issue - as long as I write tests for that, it should now be safe to use the returned "cluster" as an EventEmitter (and to document that API).

Oh, and possibly because of a bug in node's cluster module, reloads often resulted in workers that were never terminated after the timeout. For some reason node's cluster module kept listing the original workers as the active ones, so recluster was trying to kill the original, long-dead processes while ignoring the ones that were actually active.

This left us with 30-40 active processes in production after a couple of weeks (about 10 live reloads). I fixed it by no longer relying on node's worker management in recluster and replacing it with recluster's own bookkeeping.
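
Conceptually the fix looks like this (illustrative only, not the actual recluster code):

    var cluster = require('cluster');

    // Keep our own list of live workers instead of trusting cluster.workers.
    var active = [];

    function fork() {
      var worker = cluster.fork();
      active.push(worker);
      worker.once('exit', function () {
        active.splice(active.indexOf(worker), 1);
      });
      return worker;
    }

    // On reload, only the workers currently in `active` get disconnected and,
    // after the timeout, killed - never stale entries from cluster.workers.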

Interesting, that's good to know.

I just tried running the tests for the first time tonight. Each run, between 1 and 7 tests fail, I'm pretty sure due to the reliance on timing. I don't have a better idea for it either, though.

Just as a sanity check, what version of Node are you using? I'm on 0.10.18. I mainly ask because I noticed that at some point 0.11 changed the cluster module significantly. Not sure whether/how that would impact recluster.

Using 0.10.21. Yeah, the tests rely on timing, which is awkward. I didn't want to mock things because mocking would mean making assumptions.

Perhaps I could relax the timings a bit...

Edit: done. Although I can't say whether it helps or not - here everything passes after the first run (which did fail, though, perhaps because reading from disk took more time).

On a faster machine, everything passed for me, even before you relaxed the timings. On my 2008-era machine, a few tests are still failing with the relaxed timings. Oh well. I think the logic is correct, which is the important thing. Thanks for implementing this!