jenkinsci/opentelemetry-plugin

Potential deadlock from GraphListenerAdapterToPipelineListener

Dohbedoh opened this issue ยท 6 comments

Jenkins and plugins versions report

Environment
* core: 2.387.3
* opentelemetry:2.11.0
* workflow-api:1208.v0cc7c6e0da_9e
* workflow-cps:3653.v07ea_433c90b_4
* workflow-job:1289.1291.vb_7c188e7e7df
* workflow-support:839.v35e2736cfd5c

Issue

Jenkins shows a deadlock between open-telemetry GraphListenerAdapterToPipelineListener and the Jenkins initialization thread:

==============
Deadlock Found
==============
"Jenkins initialization thread" id=50 (0x32) state=BLOCKED cpu=95%
    - waiting to lock <0x23e558fe> (a org.jenkinsci.plugins.workflow.cps.CpsFlowExecution)
      owned by "Running CpsFlowExecution[Owner[myfolder/myjob/17:myfolder/myjob #17]]" id=1246 (0x4de)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.isComplete(CpsFlowExecution.java:1277)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getCurrentExecutions(CpsFlowExecution.java:1019)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ResumeStepExecutionListener.onResumed(FlowExecutionList.java:267)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionListener.fireResumed(FlowExecutionListener.java:85)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:589)
    at hudson.model.RunMap.retrieve(RunMap.java:233)
    at hudson.model.RunMap.retrieve(RunMap.java:61)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:650)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:632)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:530)
    at hudson.model.RunMap.getById(RunMap.java:213)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:965)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:977)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:76)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:68)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:197)
    at jenkins.model.Jenkins.<init>(Jenkins.java:1040)
    at hudson.model.Hudson.<init>(Hudson.java:86)
    at hudson.model.Hudson.<init>(Hudson.java:82)
    at hudson.WebAppMain$3.run(WebAppMain.java:247)

"Running CpsFlowExecution[Owner[myfolder/myjob/17:myfolder/myjob #17]]" id=1246 (0x4de) state=BLOCKED cpu=57%
    - waiting to lock <0x76460394> (a hudson.model.RunMap)
      owned by "Jenkins initialization thread" id=50 (0x32)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:519)
    at jenkins.model.lazy.LazyBuildMixIn.getBuildByNumber(LazyBuildMixIn.java:237)
    at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:234)
    at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:105)
    at hudson.model.Run.fromExternalizableId(Run.java:2498)
    at org.jenkinsci.plugins.workflow.support.steps.build.DownstreamFailureCause.getDownstreamBuild(DownstreamFailureCause.java:47)
    at org.jenkinsci.plugins.workflow.support.steps.build.DownstreamFailureCause.getShortDescription(DownstreamFailureCause.java:63)
    at io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener.lambda$endCurrentSpan$11(MonitoringPipelineListener.java:308)
    at io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener$$Lambda$1490/0x00000008425bd440.apply(Unknown Source)
    at java.base@11.0.19/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base@11.0.19/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.base@11.0.19/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base@11.0.19/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base@11.0.19/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base@11.0.19/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base@11.0.19/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener.endCurrentSpan(MonitoringPipelineListener.java:308)
    at io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener.onEndParallelStepBranch(MonitoringPipelineListener.java:291)
    at io.jenkins.plugins.opentelemetry.job.jenkins.GraphListenerAdapterToPipelineListener.fireOnAfterEndParallelStepBranch(GraphListenerAdapterToPipelineListener.java:129)
    at io.jenkins.plugins.opentelemetry.job.jenkins.GraphListenerAdapterToPipelineListener.onNewHead(GraphListenerAdapterToPipelineListener.java:64)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1588)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.notifyNewHead(CpsThreadGroup.java:492)
    at org.jenkinsci.plugins.workflow.cps.FlowHead.setNewHead(FlowHead.java:158)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onProgramEnd(CpsFlowExecution.java:1297)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.croak(CpsFlowExecution.java:882)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.reportProblem(CpsVmExecutorService.java:57)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:69)
    at java.base@11.0.19/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
    at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
    at jenkins.util.ErrorLoggingExecutorService$$Lambda$1086/0x0000000841d64840.run(Unknown Source)
    at java.base@11.0.19/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base@11.0.19/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base@11.0.19/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base@11.0.19/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base@11.0.19/java.lang.Thread.run(Thread.java:829)

Could be due to the usage of a synchronous GraphListener:

What Operating System are you using (both controller, and any agents involved in the problem)?

  • JVM Implementation

    • Name: OpenJDK 64-Bit Server VM
    • Vendor: Red Hat, Inc.
    • Version: 11.0.19+7-LTS
  • Operating system

    • Name: Linux
    • Architecture: amd64
    • Version: 3.10.0-1160.92.1.el7.x86_64

Reproduction steps

No consistently reproducible. But basically:

  • launch many many resumable pipelines
  • restart Jenkins

Expected Results

No deadlocks

Actual Results

deadlock

Anything else?

No response

jglick commented

Did you mean to show a thread dump excerpt here?

@jglick it's in the "Issue" <details> in the description.

Sorry about that. I tried to follow the issue template of the project. I can make it more visible.

@Dohbedoh @dwnusbaum do you have recommendations on how we can fix this? I looked at the logic of the code and it sounded reasonable to me.

@cyrille-leclerc GraphListenerAdapterToPipelineListener should probably not extend GraphListener.Synchronous, just GraphListener, see https://github.com/jenkinsci/workflow-api-plugin/blob/ca5fddb3fcebcc6acf823c80600ec7ab9deb0d05/src/main/java/org/jenkinsci/plugins/workflow/flow/GraphListener.java#L48. Or perhaps you could modify it to only perform a small subset of the current logic synchronously if necessary and then offload all of the complex behavior that acquires locks to a separate thread.

Thanks @dwnusbaum .
Some of it definitively needs to be synchronous so that we can associate a span with the FlowNode. I'll brainstorm to see if something can be made async and how we would orchestrate the sync and async parts.