doxout/recluster

Domain example

thedug opened this issue · 17 comments

Is there a working example of using domains with recluster to gracefully restart on unhanded exceptions?

You should generally call process.disconnect() to close the IPC channel with the master and/or server.close() to stop listening for new clients. When a process disconnects the IPC channel recluster will notice that and spawn a replacement. It will also wait for a while (the timeout option) then force-kill the process that disconnected. This should give you enough time to keep serving any remaining clients for a while (e.g. pending downloads finishing) and/or tell them to reconnect if needed (e.g. for websocket connections)

process.on('uncaughtException' function() {
  process.disconnect()
  server.close();
  // dispose of any active connection pools or database/other clients (or wait for them to expire)
});

This code doesn't really save you, but its slightly better than crashing immediately in some cases (1)

But domains wont be of much help here compared to the uncaughtException event on process.

IMHO domains should just be completely removed from node, as they're simply unsafe. If you're using domains, you need to reinitialize every bit of state and clean up every resource within that domain. This is pretty much impossible given that some libraries keep their state and resources internal (e.g. using domains with a domain-unaware database connection pool may cause that pool to leak connections).

Since you can't do that, you need to restart the process. In that case, you might as well use the uncaughtException event instead. Using domains in these cases only lets you shut down ungracefully for certain parts of your codebase which could get into a state so bad that it would be dangerous to continue running even for the short while that you're gracefully shutting down. Those cases should be rare (2)

My advice is to use promises . Compared to domains they have a much saner error handling model, and there are resource allocation and disposal patterns which can help.

(1) If backoff > timeout, the master is not in danger of spawning an infinite amount of workers. If a user discovers a way to crash the server and keep doing that action repeatedly, the backoff parameter will eventually be large enough that no workers will be available to serve requests (resulting with denial of service), just like if there wasn't an uncaughtException handler at all.

If however backoff < timeout a malicious user can cause the server to spawn a really large number of workers which can cause the entire machine (virtual or physical) to become unresponsive.

Depending on your setup, you can pick either one or the other

(2) Here is an example: a piece of code needs to disable delete ticket booking abilities for another user after they send their first booking request, but it crashes at disable, so any currently active clients that hold an active connection (e.g. via websockets) can book an infinite number of tickets. (Its hard to come up with an illustrative example, really, but they do exist).

There are multiple use cases for node, and the answer differs depending on the use case

Is the uncaughtException event always processed? I thought that an uncaught exception would crash the process by default.

What does it mean to "close the IPC" channel?

My goal is to let exciting connections complete but not allow new ones, then restart.

Seems like the built in cluster support can respawn workers in this case. Most folks advocate doing this with domains.

In your answer below us that all that is needed assuming I using re cluster? Just listen for that event and shutdown?

On Feb 19, 2015, at 7:52 PM, Gorgi Kosev notifications@github.com wrote:

You should generally call process.disconnect() to close the IPC channel with the master and/or server.close() to stop listening for new clients. When a process disconnects the IPC channel recluster will notice that and spawn a replacement. It will also wait for a while (the timeout option) then force-kill the process that disconnected. This should give you enough time to keep serving any remaining clients.

process.on('uncaughtException' {
process.disconnect()
server.close();
// dispose of any active connection pools or database/other clients (or wait for them to expire)
});
But domains wont be of much help here compared to the uncaughtException event on process.

IMHO domains should just be completely removed from node, as they're simply unsafe. If you're using domains, you need to reinitialize every bit of state and clean up every resource within that domain. This is pretty much impossible given that some libraries keep their state and resources internal (e.g. using domains with a domain-unaware database connection pool may cause that pool to leak connections).

Since you can't do that, you need to restart the process. In that case, you might as well use the uncaughtException event instead. Using domains in these cases only lets you shut down ungracefully for certain parts of your codebase which could get into a state so bad that it would be dangerous to continue running even for the short while that you're gracefully shutting down. Those cases should be rare *

My advice is to use promises. Compared to domains they have a much saner error handling model, and there are resource allocation and disposal patterns which can help.

Here is an example: a piece of code needs to disable delete ticket booking abilities for another user after they send their first booking request, but it crashes at disable, so any currently active clients that hold an active connection can book an infinite number of tickets. (Its hard to come up with an illustrative example, really, but they do exist)

Reply to this email directly or view it on GitHub.

The IPC channel is the way the cluster module delegates requests to the spawned workers.

If the channel is closed, the worker will stop receiving new requests (it can still process the old ones though).

Adding the uncaughtException handler results with the process not crashing when there is an uncaught exception; however, the worker may be in a bad state and should not serve any more requests. It should shut down ASAP.

You can just listen on that event and shut down, yes. Whether its going to be safe or not depends on your application. But the same thing applies for domains, they just limit the scope where uncaught exceptions are caught.

Just to make sure I've got this right, process.disconnect will disconnect the IPC channel and recluster will respawn a new one and the timeout will force kill the any processes that got a disconnect?

Why then do we need to call server.close? Doesn't that also just stop new requests?

Domains do seem much more complicated than just listening to an event. Why do so many examples use domains when they aren't limiting the scope? Typically the only code that is not in the domain scope is some minute bootstrapping and the "shutdown" logic.

When you say "safe" depends on your app. Shouldn't most things be ok as long as you shutdown after bleeding existing connections with timeout fail safe?

Afaik, if you don't call server.close() then the process will remain active trying to listen for new connections (which wont happen, but that doesn't prevent it from waiting). You'd also need to close any other sockets such as db connection pools, etc.

I think domains are just a failed experiment in error handling, unfortunately. I recommend against using them.

Isaac's shank-shiz example is a good demonstration of what could happen with some applications that throw while they're in an inconsistent state. In short, it should be fine unless the exception leaves the app in a state where existing clients can send it dangerous commands.

Do you know of a good way to throw an "uncaught" exception that won't be handled by the express middleware error handling? I'd like to setup some manual tests to make sure I'm exiting properly. The db connection is a great example of something that will keep the process alive.

After more research, folks seem to agree to avoid domains. It is unfortunate that the node documentation says the opposite. The one bummer is that the unhandledException event doesn't provide any context on the exception which makes troubleshooting difficult. Are there any work arounds for this?

The only way to avoid the Express exception handler is to throw your exception outside of the Express call-stack -- in an async callback. So, setTimeout({throw new Error(...)})

@thedug Not that I know of, I'm afraid. Maybe https://github.com/tlrobinson/long-stack-traces ?

In my code I just use promises for everything. Bluebird has excellent debugging facilities: great long stack traces with performance that is usable in production, as well as an excellent promisifyAll method that deep-promisifies all callback-based functions. It was a huge and invasive refactor but completely worth it.

That way most of my code is in a safe promise stack that captures all errors and its much easier to manage resources there.

@spion does that work even for asyc callbacks like @codeaholics example above setTimeout({throw new Error('wat!')});

@thedug thats the idea behind long-stack-traces, yes.

This is pretty comprehensive advice. I can close the the ticket, would it be worth migrating any of this to the wiki or readme so that others can benefit?

@spion what logger do you use with recluster?

I use debug and regular old console.(log|info|warn|error).

Yes, wiki pages sound good, but I'm not sure what topics should be there.

@spion Is log-stack-traces safe for production? I notice that it wraps processor tick.

@spion - do you typically wrap recluster with upstart, systemd, or supervisor to make sure that when you do exit recluster will be restarted?

After some trial and error I landed on the following code

With the following differences

  • I use process.exit instead of process.disconnect as this works in "standalone" mode as well as "recluster" mode.
  • I nested db shutdown and process shutdown inside of the server close callback, otherwise the process would end immediately and not wait for close to finish (this is required even if you use process.disconnect)
  • I added a timeout to force shutdown if server close or db shutdown doesn't complete in time
  • I currently don't have a graceful shutdown option when running in standalone mode as I didn't see a good solution for determining if the signal came from recluster timeout or a manual signal.
function shutdownNow(){
  logger.crit("Could not close connections in time, forcefully shutting down");
  process.exit(1);
}

function handleKill(){
  logger.crit('Recieved kill command. Halting Immedidate. Send SIGUSR2 signal to gracefully reload.')
  process.exit(1);
}

function handleUncaughtException(err){
  logger.crit(err);
  gracefulShutdown(err);
}

var shuttingDown = false;
function gracefulShutdown(err){
  if (shuttingDown) return;
  logger.info('Shutting down.')
  shuttingDown = true;

  server.close();

  // if after
  setTimeout(shutdownNow, 10*1000);
}

// Use the close event so that we can accommodate recluster as well as our graceful shutdown
server.on('close', function() {
  mongoose.disconnect(function(){
    logger.info('Shutdown complete');
  });
});

// listen for TERM signal .e.g. kill
process.on ('SIGTERM', handleKill);

// listen for INT signal e.g. Ctrl-C
process.on ('SIGINT', handleKill);

// capture uncaught exceptions
process.on('uncaughtException', handleUncaughtException);

Note: when using log-stack-trace with recluster I get logs of errors that don't actual halt the app

Uncaught Error: channel closed
    at ChildProcess.send (child_process.js:406:26)
    at Worker.send (cluster.js:406:21)
    at killTimeout (/Users/dferguson/Desktop/tutorials/node/winston/node_modules/recluster/index.js:172:24)
    at EventEmitter.workerReplaceTimeoutTerminate (/Users/dferguson/Desktop/tutorials/node/winston/node_modules/recluster/index.js:149:9)
    at EventEmitter.emit (events.js:95:17)
    at emit (/Users/dferguson/Desktop/tutorials/node/winston/node_modules/recluster/index.js:46:19)
    at Cluster.workerDisconnect (/Users/dferguson/Desktop/tutorials/node/winston/node_modules/recluster/index.js:184:36)
    at Cluster.emit (events.js:95:17)
----------------------------------------
    at EventEmitter.on
    at new Worker (cluster.js:356:16)
    at Cluster.fork (cluster.js:510:11)
    at fork (/Users/dferguson/Desktop/tutorials/node/winston/node_modules/recluster/index.js:93:25)
    at EventEmitter.run (/Users/dferguson/Desktop/tutorials/node/winston/node_modules/recluster/index.js:191:47)
    at Object.<anonymous> (/Users/dferguson/Desktop/tutorials/node/winston/cluster.js:6:9)
    at Module._compile (module.js:456:26)
    at Object..js (module.js:474:10)
    at Module.load (module.js:356:32)