brettwooldridge/NuProcess

Interrupts while registering processes with ProcessEpoll can leak processes

bturner opened this issue · 2 comments

ProcessEpoll maintains a BlockingQueue of EpollEvent instances. When processes are registered, an instance is pulled off the queue, used to register events for that process, and then put back onto the queue. The queue has 64 EpollEvent instances in it, so In a system that's lightly loaded there's always an instance available. In a heavily loaded Bitbucket Server instance, though, spikes can easily trigger the creation of 100+ processes concurrently.

If the thread is interrupted (e.g. Thread.interrupt()) while it's blocked in take(), a RuntimeException is thrown and caught by LinuxProcess's start or run code, which logs the exception with a message that the process couldn't be started and the process is abandoned. But that's not actually accurate. The process was started, but NuProcess couldn't register it. That means NuProcess never reaps the process and it either hangs (if it needs stdin or blocks after writing too much to stdout or stderr) or terminates and becomes a zombie.

2021-05-25 11:11:15,409 WARN  [threadpool:thread-19] jdoe *D51P0Nx671x3323004x13 1ygxqot 1.1.1.1,2.2.2.2 "GET /rest/api/latest/projects/KEY/repos/slug/branches HTTP/1.1" c.z.n.internal.BasePosixProcess Failed to start process
java.lang.RuntimeException: Interrupted while registering 7224
        at com.zaxxer.nuprocess.linux.ProcessEpoll.registerProcess(ProcessEpoll.java:139)
        at com.zaxxer.nuprocess.linux.ProcessEpoll.<init>(ProcessEpoll.java:74)
        at com.zaxxer.nuprocess.linux.LinuxProcess.run(LinuxProcess.java:102)
        at com.zaxxer.nuprocess.linux.LinProcessFactory.runProcess(LinProcessFactory.java:50)
        at com.zaxxer.nuprocess.NuProcessBuilder.run(NuProcessBuilder.java:273)
        at com.atlassian.bitbucket.internal.process.nu.NuNioProcessHelper.run(NuNioProcessHelper.java:75)
        at com.atlassian.bitbucket.internal.process.NioCommand.call(NioCommand.java:46)
        at com.atlassian.stash.internal.scm.git.command.TransformedGitCommand.call(TransformedGitCommand.java:45)
        at com.atlassian.bitbucket.internal.branch.LatestCommitMetadataProvider.getMetadata(LatestCommitMetadataProvider.java:45)
        at com.atlassian.stash.internal.repository.PluginRefMetadataMapProvider.lambda$null$3(PluginRefMetadataMapProvider.java:99)
        at com.atlassian.stash.internal.concurrent.StateTransferringCallable.call(StateTransferringCallable.java:33)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:400)
        at com.zaxxer.nuprocess.linux.ProcessEpoll.registerProcess(ProcessEpoll.java:116)
        ... 18 common frames omitted

Note that the line numbers for ProcessEpoll above won't line up with any NuProcess release. We had to switch to running a forked build of NuProcess while waiting for pull request #118 and #120 to be merged. What is ProcessEpoll.java:116 in our fork is line 112 In the canonical NuProcess codebase.

$ ps -ef | grep defunct
atlbitb+  7224 26859  0 May25 ?        00:00:00 [git] <defunct>

I'm happy to put together a pull request for this, but before I do I'm curious how you'd like to see it addressed, @brettwooldridge. I can think of a couple options:

  • Remove the ProcessEpoll.eventPool completely and, instead, create an EpollEvent for each LinuxProcess. That way whenever ProcessEpoll needs to register a process or queue for stdin, it can use that process instance's EpollEvent to do it without any need to block
    • This will create more EpollEvent instances overall, but I resuscitated the JMH microbenchmark and ran it on the current 2.0.2 codebase and then on 2.0.2 + this change and the increase in the allocation rate was negligible; 63400 bytes/op and ~0.37 MB/sec before vs. 63500 bytes/op and 0.38 MB/sec after, with no measurable impact on ops/sec
  • Update the ProcessEpoll code to use poll() instead of take() and offer() instead of put() (neither of which can throw InterruptedExceptions). If poll returns null, ProcessEpoll would instantiate an ad hoc EpollEvent which it wouldn't put back on the queue
    • It's likely this wouldn't impact the general allocation rate at all, since it'd almost never happen unless under pretty heavy load/concurrency, but under heavy load this could end up allocating more EpollEvents than if there was a fixed, reusable instance per process. (That said, that level of heavy load is uncommon, and likely to not be continuous.)
    • This approach is a bit more code/complexity

Or maybe there's something else you prefer. I can put together whatever you think is the best option.