Interrupts while registering processes with ProcessEpoll can leak processes
bturner opened this issue · 2 comments
ProcessEpoll
maintains a BlockingQueue
of EpollEvent
instances. When processes are registered, an instance is pulled off the queue, used to register events for that process, and then put back onto the queue. The queue has 64 EpollEvent
instances in it, so In a system that's lightly loaded there's always an instance available. In a heavily loaded Bitbucket Server instance, though, spikes can easily trigger the creation of 100+ processes concurrently.
If the thread is interrupted (e.g. Thread.interrupt()
) while it's blocked in take()
, a RuntimeException
is thrown and caught by LinuxProcess
's start
or run
code, which logs the exception with a message that the process couldn't be started and the process is abandoned. But that's not actually accurate. The process was started, but NuProcess couldn't register it. That means NuProcess never reaps the process and it either hangs (if it needs stdin
or blocks after writing too much to stdout
or stderr
) or terminates and becomes a zombie.
2021-05-25 11:11:15,409 WARN [threadpool:thread-19] jdoe *D51P0Nx671x3323004x13 1ygxqot 1.1.1.1,2.2.2.2 "GET /rest/api/latest/projects/KEY/repos/slug/branches HTTP/1.1" c.z.n.internal.BasePosixProcess Failed to start process
java.lang.RuntimeException: Interrupted while registering 7224
at com.zaxxer.nuprocess.linux.ProcessEpoll.registerProcess(ProcessEpoll.java:139)
at com.zaxxer.nuprocess.linux.ProcessEpoll.<init>(ProcessEpoll.java:74)
at com.zaxxer.nuprocess.linux.LinuxProcess.run(LinuxProcess.java:102)
at com.zaxxer.nuprocess.linux.LinProcessFactory.runProcess(LinProcessFactory.java:50)
at com.zaxxer.nuprocess.NuProcessBuilder.run(NuProcessBuilder.java:273)
at com.atlassian.bitbucket.internal.process.nu.NuNioProcessHelper.run(NuNioProcessHelper.java:75)
at com.atlassian.bitbucket.internal.process.NioCommand.call(NioCommand.java:46)
at com.atlassian.stash.internal.scm.git.command.TransformedGitCommand.call(TransformedGitCommand.java:45)
at com.atlassian.bitbucket.internal.branch.LatestCommitMetadataProvider.getMetadata(LatestCommitMetadataProvider.java:45)
at com.atlassian.stash.internal.repository.PluginRefMetadataMapProvider.lambda$null$3(PluginRefMetadataMapProvider.java:99)
at com.atlassian.stash.internal.concurrent.StateTransferringCallable.call(StateTransferringCallable.java:33)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:400)
at com.zaxxer.nuprocess.linux.ProcessEpoll.registerProcess(ProcessEpoll.java:116)
... 18 common frames omitted
Note that the line numbers for ProcessEpoll
above won't line up with any NuProcess release. We had to switch to running a forked build of NuProcess while waiting for pull request #118 and #120 to be merged. What is ProcessEpoll.java:116
in our fork is line 112 In the canonical NuProcess codebase.
$ ps -ef | grep defunct
atlbitb+ 7224 26859 0 May25 ? 00:00:00 [git] <defunct>
I'm happy to put together a pull request for this, but before I do I'm curious how you'd like to see it addressed, @brettwooldridge. I can think of a couple options:
- Remove the
ProcessEpoll.eventPool
completely and, instead, create anEpollEvent
for eachLinuxProcess
. That way wheneverProcessEpoll
needs to register a process or queue forstdin
, it can use that process instance'sEpollEvent
to do it without any need to block- This will create more
EpollEvent
instances overall, but I resuscitated the JMH microbenchmark and ran it on the current 2.0.2 codebase and then on 2.0.2 + this change and the increase in the allocation rate was negligible; 63400 bytes/op and ~0.37 MB/sec before vs. 63500 bytes/op and 0.38 MB/sec after, with no measurable impact on ops/sec
- This will create more
- Update the
ProcessEpoll
code to usepoll()
instead oftake()
andoffer()
instead ofput()
(neither of which can throwInterruptedException
s). Ifpoll
returnsnull
,ProcessEpoll
would instantiate an ad hocEpollEvent
which it wouldn't put back on the queue- It's likely this wouldn't impact the general allocation rate at all, since it'd almost never happen unless under pretty heavy load/concurrency, but under heavy load this could end up allocating more
EpollEvent
s than if there was a fixed, reusable instance per process. (That said, that level of heavy load is uncommon, and likely to not be continuous.) - This approach is a bit more code/complexity
- It's likely this wouldn't impact the general allocation rate at all, since it'd almost never happen unless under pretty heavy load/concurrency, but under heavy load this could end up allocating more
Or maybe there's something else you prefer. I can put together whatever you think is the best option.