The easiest way to do zero-downtime deployments is to run multiple nodes behind a load balancer. Once a node has been removed from the rotation, you can fiddle with it to your heart’s content.
If you only have one box to play with, things aren’t so simple. One option is to push the complexity up to whatever you’re using to orchestrate deployments. You could do something like a blue/green deployment, with two full sets of processes behind nginx as a load balancer; but this felt liable to be very fragile.
I next looked at a process manager, like pm2; but it offered far too many features I didn’t need, and didn’t play that well with systemd. It was the inspiration, though, for going straight to the Node.js cluster API, which has been available for some time and is now marked as “stable”.
The cluster API allows you to run multiple Node.js processes that share a port, which is also useful if you are running on hardware with multiple CPUs. It lets us recycle the processes when the code has changed, without dropping any requests:
```javascript
var cluster = require('cluster');

module.exports = function(start) {
  var env = process.env.NODE_ENV || 'local';
  var cwd = process.cwd();
  var numWorkers = process.env.NUM_WORKERS || 1;

  if (cluster.isMaster) {
    fork();

    cluster.on('exit', function(worker, code, signal) {
      // one for all, let systemd deal with it
      if (code !== 0) {
        process.exit(code);
      }
    });

    process.on('SIGHUP', function() {
      // need to chdir, if the old directory was deleted
      process.chdir(cwd);

      var oldWorkers = Object.keys(cluster.workers);

      // wait until at least one new worker is listening
      // before terminating the old workers
      cluster.once('listening', function(worker, address) {
        kill(oldWorkers);
      });

      fork();
    });
  } else {
    start(env);
  }

  function fork() {
    for (var i = 0; i < numWorkers; i++) {
      cluster.fork();
    }
  }

  function kill(workers) {
    if (workers.length) {
      var id = workers.pop();
      var worker = cluster.workers[id];
      worker.send('shutdown');
      worker.disconnect();
      kill(workers);
    }
  }
};
```
During a deployment, we simply replace the old code and send a signal to the “master” process (kill -HUP $PID), which causes it to spin up a new set of workers. As soon as one of those is ready and listening, the old workers are terminated.
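Wiring this up under systemd is straightforward: a reload action can deliver the SIGHUP for us. A minimal unit sketch (the paths and service name are assumptions, not from the post):

```ini
[Unit]
Description=example node app

[Service]
ExecStart=/usr/bin/node /srv/app/server.js
ExecReload=/bin/kill -HUP $MAINPID
# pairs with the master's process.exit(code) above: if a worker
# dies with a non-zero code, systemd restarts the whole service
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With this, `systemctl reload` triggers the zero-downtime recycle, and `Restart=on-failure` provides the “let systemd deal with it” behaviour from the exit handler.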