Netflix/dyno-queues

RedisDynoQueue.pop() not returning HTTP tasks that are in the queue

rickfish opened this issue · 5 comments

We have HTTP tasks that are in SCHEDULED status but the pop() method of RedisDynoQueue is not returning them. If I call the peek method of the queue I can see them but the pop() method acts like they are not there. I have tried to debug around that class but RedisDynoQueue is in the dyno queues project and that project uses other classes from other projects so it is difficult.

I had a look at the code but without being able to log information at runtime it is tough to find out what is going on. There are also several other very similar HTTP tasks that get processed fine.

This also seems to happen when more than 5 tasks are pushed to the HTTP queue around the same time. Then some of them just stay in the queue, even when the activity dies down.

Is there any reason why the pop() would not return what the peek() method sees?

I was able to dig a little deeper and found that the reason these SCHEDULED task are not popped is because the RedisMock class's zAdd method for these tasks for the HTTP UNACK queue returns 0. That causes the task to not be popped in the _pop method of RedisDynoQueue. Can anyone explain what the unack queue is and how it is used? Is it used to see if the task was ack'ed?

For some tasks, returning 0 from the zAdd method does not cause issues as I guess further calls return a non-zero. But for these tasks a 0 is always returned.

More info. Seems to be a bug with RedisSortedSetCache. It is trying to maintain two maps, the cache map and the scores map. The scores map value is a Map and the cache map value is a Set. When tasks are added to the UNACK.HTTP cache after they have already been added before, a duplicate entry is added to the cache map's Set but the scores entry is just overwritten (since the put does not allow dups). Then when the entry is removed, only one is removed from the cache, leaving one (or more) and the scores entry is deleted entirely.

The Comparator of the cache map does comparisons using the scores map, thinking that there are equivalent entries in both maps. If it finds two scores that are null it will return 0 (equal).

Since the SortedSet contains() method uses the comparator, even though the set does not contain the task, it thinks it does because of the null-scores comparison.

Bottom line on this is that this causes SCHEDULED tasks to be stranded forever and never popped. This is a big problem.

NOTE THAT this bug only happens when more than one HTTP system task is SCHEDULED in parallel and only sporadically. I think it is a race condition but I am not sure yet as this is very difficult to debug. I am also not sure if it only happens with HTTP tasks.

Here’s my debugging history:

In the Redis persistence implementation of conductor, the DynoQueueDAO uses RedisDynoQueue as its queue impl. RedisDynoQueue uses JedisMock as its JedisCommands impl. JedisMock uses RedisMock for its zadd and zrem impl if the params contain “nx”, which they do.

RedisMock uses the RedisSortedSetCache as its ‘zsetCache’. That class has two maps – cache and scores. The cache has a Comparator that doesn’t just compare the task ids, it also compares the scores in the scores map for the two task ids of the comparison.

It appears as though the cache entry that is causing the problem is the “conductor_queues.test.UNACK.HTTP.c” cache.

When the pop() method of DynoQueueDAO is called, it calls the pop() method of the “conductor_queues.test.UNACK.HTTP.c” RedisDynoQueue which calls its internal _pop() method. In its _pop() method, it does this for each task in the UNACK.HTTP queue:

                 long added = quorumConn.zadd(unackQueueName, unackScore, msgId, zParams);
                if(added == 0){
                       monitor.misses.increment();
                       continue;
                }

So if ‘added=0’, that task will not be added to the tasks to pop. In the zadd method of RedisMock, the contains() method of the cache map is used to see if the task exists in the cache. If it does, it returns 0. The contains() method uses the Comparator specified by RedisSortedSet.set() method which doesn’t just check the task id of the others in the set, it also will return a 0 (equals) if both scores for those tasks are null, which they are at that point. So it returns 0 because both scores are null and it is not popped.

Note that this only happens the first time zadd is called. After the contains() method is called to determine if the task is in the queue, it then puts it in the queue with the zsetCache.set() call so from then on it does actually exist and the contains() method will then always return 0. The task stays in there forever and never gets popped.

I am not sure the best way to fix this yet. I am looking for advice.

If this is a Dyno Queues issue, I suggest you open an issue on https://github.com/netflix/dyno-queues so that the proper team can handle it.

hi @rickfish
Is this problem solved?

I also encountered the same problem

conductor version 2.29.0
redis 5.0.5

If you solved it,
I want to know how to solve it, what should I do?

Not sure if it was solved. Since we are now using Postgres as the db it only affects us if running on somebody's laptop using db=memory. In that case, it would just be a developer doing a test of his/her changes so this kind of problem would not be noticed. In any case, I think it is a DynoQueue issue with RedisSortedSetCache and would have to be resolved in that OpenSource project.

Sorry I could not help more.