Database deadlock causes executors to get stuck in a retry loop
ptorkko opened this issue · 4 comments
We've seen issues where a deploy of our service resurrects old batch workflows because they are recovered on restart.
nFlow 7.1.0 running on 4 nFlow nodes with 2 executors each, on JDK 11 + PostgreSQL 11.
In this case we have workflows that are scheduled or poll for data, partition the work, and create child workflows for the partitions. Following the basic batch work pattern, the children complete their work and wake up the parent, which then schedules more child workflows until everything is done (see the sketch below).
After upgrading to 7.1.0 we randomly see this deadlock error when persisting workflow changes.
Currently, any executor this happens to gets caught in a persistent loop, retrying the same operation every 60 seconds.
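For reference, the interaction looks roughly like this. This is a schematic sketch only: partitionPendingWork() and newChildInstance() are hypothetical helpers, the nFlow calls are written from memory so the exact signatures may differ, and the state methods are simplified from our real definitions.

// Parent state method: fan out one batch of child workflows, then park and wait.
public NextAction createChildren(StateExecution execution) {
    List<WorkflowInstance> children = partitionPendingWork().stream()
            .map(this::newChildInstance)                 // hypothetical helper building one child instance
            .collect(toList());
    if (children.isEmpty()) {
        return stopInState(done, "All partitions processed");
    }
    execution.addChildWorkflows(children.toArray(new WorkflowInstance[0]));
    // The parent sleeps; finished children wake it up so it can schedule the next batch.
    return retryAfter(now().plusHours(1), "Waiting for child workflows");
}

// Final state method of a child workflow: wake the parent before finishing.
public NextAction finish(StateExecution execution) {
    execution.wakeUpParentWorkflow();
    return stopInState(done, "Partition processed");
}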
This happens initially:
org.springframework.dao.DeadlockLoserDataAccessException: PreparedStatementCallback; SQL [update nflow_workflow set next_activation = (case when executor_id is null then case when next_activation <= current_timestamp then next_activation else current_timestamp end else next_activation end), external_next_activation = current_timestamp where executor_group = 'nflow' and id = ? and next_activation is not null]; ERROR: deadlock detected
Detail: Process 4017 waits for ShareLock on transaction 18204573; blocked by process 12872.
Process 12872 waits for ShareLock on transaction 18204572; blocked by process 18526.
Process 18526 waits for ExclusiveLock on tuple (1467,1) of relation 16715 of database 16396; blocked by process 4017.
Hint: See server log for query details.
Where: while locking tuple (1467,1) in relation "nflow_workflow"; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
Detail: Process 4017 waits for ShareLock on transaction 18204573; blocked by process 12872.
Process 12872 waits for ShareLock on transaction 18204572; blocked by process 18526.
Process 18526 waits for ExclusiveLock on tuple (1467,1) of relation 16715 of database 16396; blocked by process 4017.
Hint: See server log for query details.
Where: while locking tuple (1467,1) in relation "nflow_workflow"
at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:267)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72)
at org.springframework.jdbc.core.JdbcTemplate.translateException(JdbcTemplate.java:1443)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:633)
at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:862)
at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:917)
at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:927)
at io.nflow.engine.internal.dao.WorkflowInstanceDao.addExpectedStatesToQueryAndUpdate(WorkflowInstanceDao.java:502)
at io.nflow.engine.internal.dao.WorkflowInstanceDao.wakeUpWorkflowExternally(WorkflowInstanceDao.java:480)
at io.nflow.engine.internal.dao.WorkflowInstanceDao$$FastClassBySpringCGLIB$$f1fc6e3.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:685)
at io.nflow.engine.internal.dao.WorkflowInstanceDao$$EnhancerBySpringCGLIB$$4f0e63fa.wakeUpWorkflowExternally(<generated>)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.lambda$processSuccess$2(WorkflowStateProcessor.java:303)
at java.base/java.util.Optional.ifPresent(Optional.java:183)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.processSuccess(WorkflowStateProcessor.java:301)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.persistWorkflowInstanceState(WorkflowStateProcessor.java:292)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.saveWorkflowInstanceState(WorkflowStateProcessor.java:264)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.runImpl(WorkflowStateProcessor.java:180)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.run(WorkflowStateProcessor.java:117)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
Detail: Process 4017 waits for ShareLock on transaction 18204573; blocked by process 12872.
Process 12872 waits for ShareLock on transaction 18204572; blocked by process 18526.
Process 18526 waits for ExclusiveLock on tuple (1467,1) of relation 16715 of database 16396; blocked by process 4017.
Hint: See server log for query details.
Where: while locking tuple (1467,1) in relation "nflow_workflow"
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2497)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2233)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:310)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:446)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:370)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:149)
at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:124)
at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61)
at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java)
at org.springframework.jdbc.core.JdbcTemplate.lambda$update$0(JdbcTemplate.java:867)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:617)
... 19 common frames omitted
And this repeats every 60 seconds afterwards:
org.springframework.dao.EmptyResultDataAccessException: Incorrect result size: expected 1, actual 0
at org.springframework.dao.support.DataAccessUtils.nullableSingleResult(DataAccessUtils.java:97)
at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:784)
at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:809)
at io.nflow.engine.internal.dao.WorkflowInstanceDao.updateWorkflowInstanceWithCTE(WorkflowInstanceDao.java:428)
at io.nflow.engine.internal.dao.WorkflowInstanceDao.updateWorkflowInstanceAfterExecution(WorkflowInstanceDao.java:325)
at io.nflow.engine.internal.dao.WorkflowInstanceDao$$FastClassBySpringCGLIB$$f1fc6e3.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:685)
at io.nflow.engine.internal.dao.WorkflowInstanceDao$$EnhancerBySpringCGLIB$$4f0e63fa.updateWorkflowInstanceAfterExecution(<generated>)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.persistWorkflowInstanceState(WorkflowStateProcessor.java:290)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.saveWorkflowInstanceState(WorkflowStateProcessor.java:264)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.runImpl(WorkflowStateProcessor.java:180)
at io.nflow.engine.internal.executor.WorkflowStateProcessor.run(WorkflowStateProcessor.java:117)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
A thread dump shows that the executors are stuck in WorkflowStateProcessor#sleepIgnoreInterrupted, called from the catch branch of WorkflowStateProcessor#saveWorkflowInstanceState on line 268 in version 7.1.0.
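So the failure path behaves roughly like the loop below (a paraphrase of what the thread dump shows, not the actual nFlow source): the same update keeps failing, the catch branch sleeps for a minute, and the executor thread never returns to polling for new work.

// Schematic of the observed behaviour around saveWorkflowInstanceState in 7.1.0:
// the same instance update is retried forever, one attempt per minute.
boolean saved = false;
while (!saved) {
    try {
        updateWorkflowInstanceAfterExecution(instance);  // keeps throwing EmptyResultDataAccessException
        saved = true;
    } catch (Exception ex) {
        logger.error("Failed to save workflow instance state, retrying in 60 seconds", ex);
        sleepIgnoreInterrupted(60_000);                  // the thread dump shows executors parked here
    }
}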
Looking at the 7.2.0 changes, they don't quite touch these code paths, but we are upgrading to 7.2.0 anyway. As the issue is completely random, it's hard to know whether that fixes it.
Hello,
Sorry for the late reaction to this.
Have you seen this after upgrading to 7.2.0?
Did you check the server log for query details as suggested by the exception message? Was there anything useful?
IMHO the EmptyResultDataAccessException should not be happening anyway; it could be a problem with the CTE version of the workflow instance update statement.
Does your deployment kill the instances immediately or does it let them shut down gracefully? Killing them immediately may cause recovery of workflow instances that the old nodes could have processed successfully before shutting down.
Are you using the BulkWorkflow provided by nFlow to implement your batch work pattern?
br, Edvard
Hello,
#416 should fix this.
It remains a mystery why that deadlock ever happens, but it could be related to some Spring Batch magic. From nFlow's point of view, we are just updating a single row in one table and not even referring to any other tables, so it should not cause any deadlocks.
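One way a cycle like the one in the report could still form is if a single database transaction ended up holding locks on more than one nflow_workflow row, for example if a child's own state update and the external wake-up of its parent happened to share a transaction; two such transactions taking those row locks in different orders would be enough. Purely as an illustration of that mechanism, nothing nFlow-specific (hypothetical table t(id int primary key, n int) and a local test database assumed):

import java.sql.Connection;
import java.sql.DriverManager;

// Two transactions, each doing plain single-row updates, but on the same two rows
// in opposite order: PostgreSQL detects the cycle and aborts one of them with
// "ERROR: deadlock detected", just like in the report above.
public class DeadlockDemo {
    static void update(Connection c, int first, int second) throws Exception {
        c.setAutoCommit(false);
        try (var s = c.createStatement()) {
            s.executeUpdate("update t set n = n + 1 where id = " + first);
            Thread.sleep(500);                           // give the other transaction time to lock its first row
            s.executeUpdate("update t set n = n + 1 where id = " + second);
        }
        c.commit();
    }

    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost/test"; // assumption: local test database with table t
        try (Connection c1 = DriverManager.getConnection(url, "test", "test");
             Connection c2 = DriverManager.getConnection(url, "test", "test")) {
            Thread t1 = new Thread(() -> { try { update(c1, 1, 2); } catch (Exception e) { e.printStackTrace(); } });
            Thread t2 = new Thread(() -> { try { update(c2, 2, 1); } catch (Exception e) { e.printStackTrace(); } });
            t1.start(); t2.start();
            t1.join(); t2.join();
        }
    }
}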
br, Edvard
This has not happened since upgrading to 7.2.0.
Ok thanks, I will close this issue then. If the deadlock still happens, 7.2.2 should handle the situation better.