brettwooldridge/NuProcess

Descriptors/handles are leaked when processes fail to start

bturner opened this issue · 2 comments

The exact codepaths vary, but for every supported platform there are cases where a failure to start a process leaks descriptors or handles.

For example, on Linux, if forkAndExec throws an exception, closePipes() is never called and the 3 *Widow descriptors are leaked. You can confirm this with NuProcess's own test suite:

  • Put a breakpoint in RunTest.noExecutableFound() after the call to pb.start()
  • When the breakpoint is hit, run lsof on the java process

At the end of the lsof output will be something like the following (the IDs will likely vary, of course)

java    3858 bturner   24r     FIFO   0,12       0t0  119418 pipe
java    3858 bturner   27w     FIFO   0,12       0t0  119419 pipe
java    3858 bturner   29w     FIFO   0,12       0t0  119420 pipe

If you use the debugger to manually call process.closePipes() and then run lsof again, you'll see that the 3 pipes are closed.

That specific codepath is handled correctly on macOS and all the pipes are closed, but if there are any errors in createPosixPipes() after it calls createPipes(), since createPosixPipes() is called outside the try/catch in run and start, the pipes are leaked. (This case is much less likely that the Linux one, to be clear; I'm just noting its an edge case that exists.)

WindowsProcess has similar issues with handles it creates in createPipes().

I'm hoping to put together a pull request (or pull requests) for these issues in the next couple days.

The Linux issue with leaked pipe descriptors has actually caused an outage for us in production internally, as well as for a fairly large customer. In both cases the system ran out of file descriptors through no direct fault of NuProcess's; Bitbucket Server was simply running too many concurrent processes, but once that happens it's basically impossible for the system to ever recover without a restart because each new failure to start a process leaks 3 more pipe handles. Similar situations with ProcessBuilder can recover on their own; once the load subsides and the number of concurrent processes drops, things are stable again. With the pipe leaks, even as load declines the number of free descriptors doesn't increase.

Raised pull request #120 for the Linux pipe part of the fix. The issues with macOS and Windows are much more subtle (on Windows I'm actually not certain there are any issues, at this point), and would involve failures of system calls other than posix_spawnp or CreateProcessW, so I'm not certain they're worth the complexity of the changes it would take to fix them.