mixmaxhq/custody

Harvest orphan webservers

wearhere opened this issue · 1 comments

Background: we tend to have supervisor directly monitor gulp processes, which in turn launch and monitor child server processes: livereload servers using gulp-livereload, and web servers using gulp-nodemon.

Something in this chain tends to lose the server processes, so that they keep running but are no longer under supervisor control. When the supervisor-controlled process eventually starts or restarts, the new servers crash with EADDRINUSE errors because the orphans still hold their ports.

If custody had a way of detecting these EADDRINUSE errors, it could automatically fix the problem.

1. Detect the error

Means of detection are described in #2. Process instrumentation would be the cleanest way to detect these errors, but tailing logs may be easier for EADDRINUSE than for other errors, since its message format is more fixed.
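As a rough illustration of the log-tailing option, here is a minimal sketch of matching a log line. `parsePortInUse` is a hypothetical name, not something custody provides, and the regex assumes the port is the last thing on the line, as in the messages Node prints for a failed `listen` call (for example `listen EADDRINUSE: address already in use :::3000`). Other log formats would need their own pattern.

```javascript
// Hypothetical helper (not part of custody): extract the port from an
// EADDRINUSE log line. Returns the port as a number, or null if the line
// does not look like an EADDRINUSE error ending in ":<port>".
function parsePortInUse(line) {
  // Matches e.g. "Error: listen EADDRINUSE: address already in use :::3000"
  // and the older form "Error: listen EADDRINUSE 127.0.0.1:3000".
  const match = /EADDRINUSE\b.*?:(\d+)\s*$/.exec(line);
  return match ? Number(match[1]) : null;
}
```

A log tailer could call this on each new line and trigger the harvest step only when it returns a port.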

2. Harvest the orphan server

custody can kill an orphan server by doing the equivalent of the following:

kill -9 $(lsof -ti :$PORT)
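In Node, the same idea might look roughly like the sketch below. `parsePids` and `harvestOrphans` are hypothetical names, and the code assumes `lsof` is on the PATH. Like the one-liner, it signals every process with a socket on the port, not only listeners. Restricting the match with `lsof -iTCP:$PORT -sTCP:LISTEN -t` may be safer, but that is untested here.

```javascript
const { execFileSync } = require("child_process");

// Parse the output of `lsof -t`, which prints one PID per line.
function parsePids(output) {
  return output
    .split("\n")
    .map((s) => s.trim())
    .filter((s) => /^\d+$/.test(s))
    .map(Number);
}

// Hypothetical helper (not part of custody): SIGKILL every process that
// has `port` open. Returns the PIDs that were signalled.
function harvestOrphans(port) {
  let output;
  try {
    output = execFileSync("lsof", ["-ti", `:${port}`], { encoding: "utf8" });
  } catch (err) {
    // lsof exits non-zero when nothing matches. This sketch also treats any
    // other lsof failure (e.g. lsof not installed) as "no orphans".
    return [];
  }
  // Never signal ourselves, in case this process has a socket on the port.
  const pids = parsePids(output).filter((pid) => pid !== process.pid);
  for (const pid of pids) {
    try {
      process.kill(pid, "SIGKILL");
    } catch (err) {
      // ESRCH means the process exited between lsof and kill; ignore it.
      if (err.code !== "ESRCH") throw err;
    }
  }
  return pids;
}
```

Keeping the PID parsing separate from the kill makes the parsing easy to check without touching real processes.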

3. Restart the supervisor-controlled server

This differs by type of server. Webservers can be relaunched by touching app.js, since gulp-nodemon monitors that file and will restart the webserver in response.
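A minimal sketch of the "touch" step in Node, assuming the watched file already exists. `touch` is a hypothetical helper, and the path to `app.js` would depend on the service's layout.

```javascript
const fs = require("fs");

// Hypothetical helper (not part of custody): bump a file's access and
// modification times to "now", like `touch` on an existing file.
// Unlike `touch`, this throws ENOENT if the file does not exist.
function touch(file) {
  const now = new Date();
  fs.utimesSync(file, now, now);
}
```

For example, `touch("app.js")` run from the service's directory should change the file's mtime, which is what the issue relies on gulp-nodemon noticing.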

livereload servers are not run under gulp-nodemon at the moment. Perhaps they could be brought under its control? Otherwise, custody could restart gulp altogether, though that is slower because it entails rebuilding the service.

@gaastonsr is also interested in preventing the processes from being orphaned in the first place, perhaps by making doubly sure that we kill all the child processes when restarting gulp. If we could do this, we might be able to avoid some of the logic above. However, we'd need to be really sure we had addressed the causes of orphans.