WAIT task inside DO_WHILE causes infinite creation of tasks that are already completed
appunni-old opened this issue · 4 comments
Describe the bug
While running the workflow below, the server goes into an infinite loop.
Details
Conductor version: 3.15.0
Persistence implementation: Postgres and MySQL
Queue implementation: MySQL and Postgres
Lock: Redis
Workflow definition:
{
  "createTime": 1701489520469,
  "createdBy": "owner@email.com",
  "updatedBy": "owner@email.com",
  "accessPolicy": {},
  "name": "test_do_while",
  "description": "Workflow details",
  "version": 1,
  "tasks": [
    {
      "name": "default__do_while",
      "taskReferenceName": "task_1__loop_databricks",
      "inputParameters": {},
      "type": "DO_WHILE",
      "startDelay": 0,
      "optional": false,
      "asyncComplete": false,
      "loopCondition": "if ($.task_1__loop_databricks['iteration'] < 200) { true; } else { false; }",
      "loopOver": [
        {
          "name": "default__sleep",
          "taskReferenceName": "task_1__wait_databricks",
          "inputParameters": {
            "duration": "20 seconds",
            "tenantId": "csit"
          },
          "type": "WAIT",
          "startDelay": 0,
          "optional": false,
          "asyncComplete": false
        }
      ]
    }
  ],
  "inputParameters": [],
  "outputParameters": {},
  "schemaVersion": 2,
  "restartable": true,
  "workflowStatusListenerEnabled": false,
  "ownerEmail": "owner@email.com",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0,
  "variables": {},
  "inputTemplate": {}
}
Errors in the Conductor server:
conductor-server | 2023-12-02 04:56:48.224 ERROR 13 --- [m-task-worker-8] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: 94c1d30a-aef6-4861-be18-4fbfcd03743c could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:48.227 ERROR 13 --- [-task-worker-11] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: 8c683b28-0c10-42dd-894b-2aebead3e3e8 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:48.234 ERROR 13 --- [m-task-worker-9] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: 976eb165-97af-4451-800b-b506341bd938 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:48.250 ERROR 13 --- [-task-worker-10] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: 86de75ad-195b-4d18-86f6-7b1280702751 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:48.293 ERROR 13 --- [-task-worker-12] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: 822de4fa-2291-4516-82b9-bd6ef7f8b0ac could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:48.409 ERROR 13 --- [-task-worker-13] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: 7ed85099-e58f-45e1-845a-c44e141113e5 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:48.431 ERROR 13 --- [-task-worker-14] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: 784e801d-fcb6-488a-b575-6d476c86a6aa could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:48.466 ERROR 13 --- [-task-worker-15] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: 75e2b4ac-225e-447b-b8ff-b9ab5164c642 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:48.744 ERROR 13 --- [-task-worker-16] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: a2a4b709-7c51-4b24-a26f-f49cffbcf877 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:48.880 ERROR 13 --- [-task-worker-17] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: 73526e0e-0ede-4bda-8e4a-9d77d250e947 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.103 ERROR 13 --- [-task-worker-18] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: a47dded2-3d7c-4d25-82eb-021cdd19f288 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.120 ERROR 13 --- [-task-worker-20] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: a725f7d5-da1c-4d42-a453-0550592f8b06 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.121 ERROR 13 --- [-task-worker-21] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: ab0319c0-cb0f-49f5-8a7e-99c04fef1809 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.123 ERROR 13 --- [-task-worker-22] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: abdb849e-6fd9-43cd-b924-5415a933e6bb could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.189 ERROR 13 --- [-task-worker-23] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: b2a69eb2-bea1-4764-b986-ad75bb82e9dc could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.191 ERROR 13 --- [-task-worker-24] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: c9410bb4-d96e-4ca9-a6be-bd45e6f0ea53 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.192 ERROR 13 --- [m-task-worker-1] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: b3bfbe2f-9c32-41aa-89aa-08bed04c47ce could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.194 ERROR 13 --- [m-task-worker-2] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: b4b10f44-1530-4a93-971d-6025762d837b could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.196 ERROR 13 --- [m-task-worker-3] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: c5496ec6-2730-43eb-a265-86b06ae35807 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.197 ERROR 13 --- [-task-worker-23] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: c3369393-abb7-4c1f-907d-4d02eda5e9a4 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.207 ERROR 13 --- [m-task-worker-5] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: c26ffe62-4c7f-4d64-a1d5-ef203afc4272 could not be found while executing WAIT
conductor-server | 2023-12-02 04:56:49.210 ERROR 13 --- [m-task-worker-6] c.n.c.c.e.AsyncSystemTaskExecutor : TaskId: b6b9aef8-e941-48e2-a6
To Reproduce
Go to the UI at http://localhost:5000
Create the workflow definition above
Go to the Workbench
Trigger the workflow
WARNING: this creates an infinite-loop situation; only run it on a local Conductor setup that can be deleted.
Expected behavior
The loop runs and waits 20 seconds between iterations.
Screenshots
The workflow is stuck and does not move forward.
Additional context
Not able to replicate this on the Orkes platform.
I debugged it by stepping through line by line; attaching the first log lines as well.
595060 [sweeper-thread-24] INFO com.netflix.conductor.core.reconciliation.WorkflowRepairService [] - Task 46abe269-5daf-403a-9b15-cbd7878b8bed in workflow 7d137e5b-304e-449c-9607-6413bfee8fd0 re-queued for repairs
667288 [HikariPool-1 housekeeper] WARN com.zaxxer.hikari.pool.HikariPool [] - HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m16s793ms).
686827 [system-task-worker-2] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 1445ba4c-0bd5-4826-a359-984fd4da86a5 could not be found while executing WAIT
692015 [system-task-worker-3] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 05cbf978-86e3-48ef-b5cf-52b481edd5f5 could not be found while executing WAIT
699409 [system-task-worker-4] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 95ffee82-0cc9-468a-8ce8-af7b1d8438c1 could not be found while executing WAIT
700895 [system-task-worker-5] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: d7f9d0a7-3525-4eff-a07a-179bc57ab349 could not be found while executing WAIT
701862 [system-task-worker-7] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 75811fa6-ec79-40c3-9136-88b33a3a53f3 could not be found while executing WAIT
702397 [system-task-worker-6] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 80b2e11a-4f28-4f28-8737-26d1d7abd010 could not be found while executing WAIT
702762 [system-task-worker-9] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 58055094-5e0d-4613-beb6-078f940994fa could not be found while executing WAIT
Oh sorry, that was wrong: I ran it on the Orkes platform and it went into the same loop. I regret it now; I should have been more careful. Can someone help?
And I definitely think it has something to do with the definition, because I created the same workflow via the UI and it worked completely fine. On the Orkes default cluster the task limit was 1000, but this workflow created 7552 tasks. I terminated the workflow; otherwise it would have kept running.
Issue identified: this happens when a task reference name contains a double underscore, which makes the check below evaluate to false. We should add validation that rejects taskReferenceName values containing a double underscore, either when the workflow definition is registered or in the Start Workflow API.
for (TaskModel t : workflow.getTasks()) {
    if (doWhileTaskModel
                    .getWorkflowTask()
                    .has(TaskUtils.removeIterationFromTaskRefName(t.getReferenceTaskName()))
            && !doWhileTaskModel.getReferenceTaskName().equals(t.getReferenceTaskName())
            && doWhileTaskModel.getIteration() == t.getIteration()) {
        relevantTask = relevantTasks.get(t.getReferenceTaskName());
        if (relevantTask == null || t.getRetryCount() > relevantTask.getRetryCount()) {
            relevantTasks.put(t.getReferenceTaskName(), t);
        }
    }
}
TaskUtils.removeIterationFromTaskRefName(t.getReferenceTaskName()) is the culprit: it tries to recover the base task reference name by splitting on the delimiter, i.e. "__".
public static String removeIterationFromTaskRefName(String referenceTaskName) {
    String[] tokens = referenceTaskName.split(TaskUtils.LOOP_TASK_DELIMITER);
    return tokens.length > 0 ? tokens[0] : referenceTaskName;
}
Because the DO_WHILE executor then never finds the loop-over task it already scheduled, this leads to an infinite loop condition, creating tasks without bound.
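As a stopgap until such validation lands upstream, a minimal pre-registration guard could reject reference names containing the delimiter. This is a hypothetical helper sketched for this issue, not part of Conductor's API:

```java
import java.util.List;

public class TaskRefNameValidator {
    // Same delimiter Conductor uses to suffix loop iterations
    private static final String LOOP_TASK_DELIMITER = "__";

    /**
     * Hypothetical guard: throws if any task reference name contains "__",
     * which would collide with the DO_WHILE iteration suffix.
     */
    public static void validateTaskRefNames(List<String> taskReferenceNames) {
        for (String refName : taskReferenceNames) {
            if (refName.contains(LOOP_TASK_DELIMITER)) {
                throw new IllegalArgumentException(
                        "taskReferenceName '" + refName + "' must not contain '"
                                + LOOP_TASK_DELIMITER + "'");
            }
        }
    }
}
```

Run against the definition above, this would reject both "task_1__loop_databricks" and "task_1__wait_databricks" before the workflow ever starts.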