Netflix/conductor

Operation: ( processUnacks ) failed on key: [conductor_queue.test.UNACK._deciderQueue.c ]

dpozinen opened this issue · 0 comments

Describe the bug
RedisDynoQueue starts failing unpredictably when using the memory db type. The logs show
Operation: ( processUnacks ) failed on key: [conductor_queue.test.UNACK._deciderQueue.c ].

After overriding the class to add more detailed logging, I can see that an NPE in the Redis mock is the cause:

ERROR [up-be-conductor-server,,] 1 --- [ool-17-thread-1] c.n.d.q.r.RedisDynoQueue                 : Error while processing unacks. Operation: ( processUnacks ) failed on key: [conductor_queue.test.UNACK._deciderQueue.c ].
java.lang.RuntimeException: Operation: ( processUnacks ) failed on key: [conductor_queue.test.UNACK._deciderQueue.c ].
at com.netflix.dyno.queues.redis.QueueUtils.executeWithRetry(QueueUtils.java:47) ~[dyno-queues-redis-2.0.22.jar!/:2.0.22]
at com.netflix.dyno.queues.redis.QueueUtils.execute(QueueUtils.java:29) ~[dyno-queues-redis-2.0.22.jar!/:2.0.22]
at com.netflix.dyno.queues.redis.RedisDynoQueue.processUnacks(RedisDynoQueue.java:1426) ~[classes!/:2.0.22]
at com.netflix.dyno.queues.redis.RedisDynoQueue.lambda$new$1(RedisDynoQueue.java:144) ~[classes!/:2.0.22]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: redis.clients.jedis.exceptions.JedisException: java.lang.NullPointerException
at com.netflix.conductor.redis.jedis.JedisMock.zrangeByScoreWithScores(JedisMock.java:958) ~[conductor-redis-persistence-3.13.8.jar!/:3.13.8]
at com.netflix.dyno.queues.redis.RedisDynoQueue.lambda$processUnacks$23(RedisDynoQueue.java:1435) ~[classes!/:2.0.22]
at com.netflix.dyno.queues.redis.QueueUtils.executeWithRetry(QueueUtils.java:36) ~[dyno-queues-redis-2.0.22.jar!/:2.0.22]
... 9 more
Caused by: java.lang.NullPointerException
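
For anyone trying to reproduce the more detailed logging: the top-level RuntimeException message hides the real failure, so it helps to walk the cause chain down to the innermost exception. Below is a minimal sketch that uses only the JDK. The class name `RootCause` and its helper methods are hypothetical (they are not part of Conductor or dyno-queues), and the plain RuntimeException in `main` is only a stand-in for the JedisException shown in the trace above.

```java
public class RootCause {

    /** Walks the cause chain and returns the innermost throwable. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop at the end of the chain; also guard against a self-referential cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /** Renders the chain as "Outer <- Middle <- Inner" using class names only. */
    static String describeChain(Throwable t) {
        StringBuilder sb = new StringBuilder(t.getClass().getName());
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
            sb.append(" <- ").append(current.getClass().getName());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the chain in the trace: outer RuntimeException,
        // a wrapper exception in the middle, and the NPE at the bottom.
        Throwable npe = new NullPointerException();
        Throwable wrapper = new RuntimeException(npe); // stand-in for JedisException
        Throwable outer = new RuntimeException("Operation: ( processUnacks ) failed", wrapper);

        System.out.println("chain: " + describeChain(outer));
        System.out.println("root:  " + rootCause(outer));
    }
}
```

Logging the result of `rootCause(e)` together with its stack trace, in the overridden class's catch block, would surface the NPE's own frames, which the trace above does not show.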

I've looked at the mock library used, and it hasn't been updated since 2015.

Details
Conductor version: 3.13.8
Persistence implementation: memory
Platform: MacBook Pro M1
Docker Engine: 20.10.23

Additional context
I am running Conductor locally inside Docker and executing random workflows as part of a test. It seems to happen on any kind of workflow, as long as it runs long enough (1m+).
It also happens at random points in the workflow, and sometimes (although rarely) it does not happen at all, even on the same workflow. Conductor does not recover from this error: once encountered, it is logged indefinitely and the workflow is not executed.

I realize that the memory db option isn't stable, but I think I'm using it as intended. I also realize that this is potentially a bug in the mock library, but either way it is impacting Conductor. I think moving away from that library, or forking it to fix potential bugs, is the way to go here, since it hasn't been updated since 2015.