After workflow repaired task is executed two times

Question

astelmashenko opened this issue 2 years ago · 2 comments

Describe the bug
We noticed that a task is sometimes executed twice. After enabling debug logs, we found that after WorkflowRepairService re-queued the task, it was for some reason executed two times:

INFO  2022-07-04T07:56:38,583 147034  com.netflix.conductor.core.reconciliation.WorkflowRepairService [sweeper-thread-1]  Task 425d9c94-dc30-441b-b21b-73ccc5118829 in workflow d6e20f06-c884-4c25-81a4-4a7c0eb3827e re-queued for repairs

DEBUG 2022-07-04T07:56:42,994 151445  com.netflix.conductor.contribs.tasks.http.HttpTask  [system-task-worker-1]  Response: 200, {bills={partyAUTHOR={biId=5200737, status=OPEN}, partyUNIVERSITY={biId=5200740, status=OPEN}}}, task:425d9c94-dc30-441b-b21b-73ccc5118829

DEBUG 2022-07-04T07:56:42,994 151445  com.netflix.conductor.contribs.tasks.http.HttpTask  [system-task-worker-0]  Response: 200, {bills={partyAUTHOR={biId=5200738, status=OPEN}, partyUNIVERSITY={biId=5200739, status=OPEN}}}, task:425d9c94-dc30-441b-b21b-73ccc5118829

What does WorkflowRepairService do, and do we need it at all? Why does this happen even though we have a lock service configured?
Thanks.

Details
Conductor version: 3.7.2
Persistence implementation: Postgres
Queue implementation: Postgres
Lock: Redis

To Reproduce
This happens from time to time; we have not found steps to reproduce it.

Expected behavior
HTTP task must be executed only once.

The original issue was opened in conductor-community as Netflix/conductor-community#70, but nobody has responded in months.

Answer 1 · 2023-05-25T17:00:57.000Z

Hi @astelmashenko, WorkflowRepairService checks for the taskId before pushing anything into the queue. Are you using locks in your configuration? There is a high chance that workflow execution is not guarded by locks, so the task may be picked up by two different threads.
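
To illustrate the kind of guard being described, here is a minimal Python sketch, not Conductor's actual code. It assumes a hypothetical in-memory `TaskStore` in which a worker must claim a task under a per-task lock and re-check its status before running it, so a re-queued duplicate picked up by a second thread is skipped. All names (`TaskStore`, `execute_once`, `run_duplicates`) are invented for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TaskStore:
    """Hypothetical in-memory store; stands in for real persistence and locking."""

    def __init__(self):
        self._status = {}               # task_id -> status string
        self._locks = {}                # task_id -> per-task lock
        self._guard = threading.Lock()  # protects the _locks dict itself

    def _lock_for(self, task_id):
        with self._guard:
            return self._locks.setdefault(task_id, threading.Lock())

    def execute_once(self, task_id, action):
        """Run action(task_id) only if this caller wins the claim."""
        with self._lock_for(task_id):
            # Re-check the status under the lock: a duplicate queue message
            # for a task that is already claimed must be ignored.
            if self._status.get(task_id, "SCHEDULED") != "SCHEDULED":
                return False
            self._status[task_id] = "IN_PROGRESS"
        action(task_id)
        with self._lock_for(task_id):
            self._status[task_id] = "COMPLETED"
        return True


def run_duplicates(copies=8):
    """Simulate several worker threads receiving the same task id."""
    store = TaskStore()
    calls = []
    with ThreadPoolExecutor(max_workers=copies) as pool:
        results = list(
            pool.map(lambda _: store.execute_once("task-1", calls.append),
                     range(copies))
        )
    return calls, results
```

Calling `run_duplicates()` submits the same task id from eight threads; only the thread that wins the claim runs the action. If the status check happened outside the lock, or no lock were taken at all, two threads could both observe `SCHEDULED` and both execute, which would match the duplicate responses in the logs above.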

Answer 2 · 2023-05-26T07:22:37.000Z

@manan164, yes, we are using a lock (Redis). What I have in mind is a Conductor upgrade. For example, we fix something in our custom task and redeploy Conductor while thousands of workflows are running. How does it stop? For example, does it stop the decider first, wait for all running tasks to complete, close connections, and then shut down?
The question: is the shutdown process deterministic, and is there evidence that it shuts down gracefully?
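
For reference, the shutdown order asked about here (stop accepting work, then wait for running tasks, then release resources) can be sketched as below. This is only an illustration of that sequence in Python under assumed names (`Worker`, `submit`, `shutdown`); it does not describe how Conductor itself shuts down, which this thread does not establish.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Hypothetical task runner that drains in-flight work on shutdown."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._accepting = True
        self._guard = threading.Lock()
        self.completed = []

    def submit(self, task_id, duration=0.05):
        """Accept a task unless shutdown has started."""
        with self._guard:
            if not self._accepting:
                return False
            self._pool.submit(self._run, task_id, duration)
            return True

    def _run(self, task_id, duration):
        time.sleep(duration)  # stands in for the real task body
        self.completed.append(task_id)

    def shutdown(self):
        # Step 1: stop accepting new tasks.
        with self._guard:
            self._accepting = False
        # Step 2: block until every already-submitted task has finished.
        self._pool.shutdown(wait=True)
        # Step 3: connections would be closed here, after the pool is idle.
```

With this ordering, a task submitted before `shutdown()` always runs to completion, and a task submitted afterwards is rejected rather than started and interrupted halfway.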