doxout/recluster

Question: what's the correct way to gracefully shut down a cluster and its children?

d6u opened this issue · 3 comments

d6u commented

I have a shutdown handler in child.js

let makeCloseHandler = (sig) => {
  return () => {
    console.log(`Received signal ${sig}`);
    server.close(() => {
      console.log('Server closed');
    });
  };
};

process.on('SIGINT', makeCloseHandler('SIGINT'));
process.on('SIGTERM', makeCloseHandler('SIGTERM'));

And cluster.js

const recluster = require('recluster');
const path = require('path');

const cluster = recluster(path.join(__dirname, 'index.js'), {
  timeout: 120
});

const workerEvent = function(ev) {
  cluster.on(ev, function(worker) {
    console.log('Worker ' + worker.id + ' [' + worker.process.pid + '] ' + ' ' + ev + '.');
  });
};

['online', 'listening', 'disconnect', 'exit'].forEach(function(ev) {
  workerEvent(ev);
});

cluster.run();
console.log('Master ' + process.pid + ' started.');

let makeCloseHandler = (sig) => {
  return () => {
    console.log(`Cluster received signal ${sig}`);
    cluster.terminate(() => {
      console.log('Cluster closed');
    });
  };
};

process.on('SIGINT', makeCloseHandler('SIGINT'));
process.on('SIGTERM', makeCloseHandler('SIGTERM'));

But I see these logs when I stop node cluster.js:

^CReceived signal SIGINT
Received signal SIGINT
Cluster received signal SIGINT
Received signal SIGINT
Server closed
Server closed
Server closed
Received signal SIGINT
Server closed
Cluster closed

It seems like the children are receiving SIGINT before the master does, so I'm confused about how graceful shutdown is handled here. What's the best way to ensure we don't drop a connection halfway through a request?

spion commented

To avoid dropping connections halfway,

(1) If you have another load balancer above the cluster, the best way would be to switch to the replacement process before shutting down the cluster using that load balancer.

(2) If you just want to replace the recluster workers gracefully (without replacing the master process) you should probably use reload instead of terminate.

edit: I forgot that terminate kills the workers immediately. Right now there is no shutdown method that keeps active workers running while they still have open connections (at least until the timeout expires), so we might need to add that. Until that's added, I guess the best workaround for (1) would be to wait sufficiently long after switching, then use terminate.
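
As a side note on the logs above: pressing Ctrl+C in a terminal delivers SIGINT to the whole foreground process group, which is why the workers can log the signal before the master does. Below is a minimal worker-side sketch of a shutdown handler that stops accepting new connections, lets in-flight requests finish, and force-exits after a deadline. It is not part of recluster; the helper name makeGracefulShutdown and its options (timeoutMs, exit, log) are hypothetical, and it relies only on Node's server.close() and timers.

```javascript
// Hypothetical helper, not a recluster API: builds a signal handler that
// drains the server and exits, with a hard deadline as a safety net.
function makeGracefulShutdown(server, options = {}) {
  const timeoutMs = options.timeoutMs ?? 10000; // assumed default deadline
  const exit = options.exit ?? process.exit;    // injectable for testing
  const log = options.log ?? console.log;
  let shuttingDown = false;

  return function shutdown(sig) {
    if (shuttingDown) return false; // ignore repeated signals
    shuttingDown = true;
    log(`Received signal ${sig}`);

    // Force an exit if some connection never finishes.
    const timer = setTimeout(() => exit(1), timeoutMs);
    timer.unref(); // do not keep the event loop alive just for this timer

    // Stop accepting new connections; the callback runs once
    // existing connections have ended.
    server.close(() => {
      clearTimeout(timer);
      log('Server closed');
      exit(0);
    });
    return true;
  };
}

// Usage in the worker (assuming `server` is the worker's http.Server):
//   const shutdown = makeGracefulShutdown(server, { timeoutMs: 30000 });
//   process.on('SIGINT', () => shutdown('SIGINT'));
//   process.on('SIGTERM', () => shutdown('SIGTERM'));
```

This only covers the worker side. If the master calls terminate as soon as it receives the same signal, the workers may still be killed before they finish draining, so the wait-then-terminate workaround described above would still apply.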

d6u commented

Thanks for explaining!

@spion From your last comment, I take it you are open to a PR to add a shutdown method?