conductor-oss/conductor

Performance degradation caused by WAIT task change to Async

BatyaPinski opened this issue · 1 comment

Describe the bug
After upgrading our Conductor to version 3.16.0 (from version 3.11.3), we have encountered a significant performance degradation across our system. The degradation is evident in increased CPU usage, memory consumption, and network traffic.
Upon investigation, we have identified that the root cause of this performance degradation is the recent change that made the "WAIT" task asynchronous, which was introduced in version 3.14.0.

When we reverted this change, the performance of our system returned to normal levels.

Details
Conductor version: 3.16.0
Persistence implementation: Redis
Queue implementation: Orkes Queue
Lock: Redis

Expected behavior
The performance of the system should remain stable after the Conductor upgrade without significant degradation.

Screenshots
[Two images attached to the issue]

Suggested Solution
Revert the change that made the "WAIT" task asynchronous to restore optimal system performance.

@BatyaPinski we are investigating. Previously, the WAIT task relied on the sweeper to complete, so the completion granularity for a WAIT task was at least 30 seconds (or the interval at which the decider runs). This meant you could not reliably wait for 30 seconds or less, and scaling a system with a LOT of WAIT tasks was tightly coupled to the performance of the sweeper.

Making WAIT asynchronous solves that issue and allows WAITs as short as a few seconds.
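
To illustrate the timing difference described above, here is a rough toy model in Python. It is not Conductor code: the function names, the fixed 30-second sweep interval, and the assumption that sweeps fire at exact multiples of the interval are all hypothetical simplifications.

```python
import math


def sweeper_completion_time(wait_seconds: float, sweep_interval: float = 30.0) -> float:
    """Earliest time (seconds) a periodic sweeper could complete the WAIT.

    Toy assumption: sweeps fire at t = interval, 2 * interval, ... and a
    WAIT is completed on the first sweep at or after its requested duration.
    """
    sweeps_needed = max(1, math.ceil(wait_seconds / sweep_interval))
    return sweeps_needed * sweep_interval


def async_completion_time(wait_seconds: float) -> float:
    """Earliest time (seconds) the WAIT completes if it is re-evaluated
    exactly when its own duration elapses (toy model of the async approach)."""
    return wait_seconds


if __name__ == "__main__":
    for requested in (5, 30, 45):
        print(
            f"requested={requested}s "
            f"sweeper={sweeper_completion_time(requested)}s "
            f"async={async_completion_time(requested)}s"
        )
```

In this model a 5-second WAIT is rounded up to a full sweep interval when it depends on the sweeper, while the async variant completes it at the requested time. The model only captures completion latency; it says nothing about the CPU, memory, or network cost reported in this issue.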