FoundatioFx/Foundatio.Redis

[RedisQueue] DequeueIdAsync might leave queue item in "phantom" state

Closed this issue · 4 comments

        try {
            return await Run.WithRetriesAsync(async () => {
                return await Database.ListRightPopLeftPushAsync(_queueListName, _workListName).AnyContext();
            }, 3, TimeSpan.FromMilliseconds(100), linkedCancellationToken, _logger).AnyContext();
        } catch (Exception) {
            return RedisValue.Null;
        }

ListRightPopLeftPushAsync has default timeout of 5 seconds, in case of spike in Redis load which makes the action take more than 5 seconds (it's busy with other ops), this function will return RedisValue.Null while eventually the ListRightPopLeftPushAsync will actually be executed in Redis.

Since the item will be moved from "in" to "work" but this function will return null, the following code is not executed:

             await Run.WithRetriesAsync(() => Task.WhenAll(
            _cache.SetAsync(GetDequeuedTimeKey(value), now, wiTimeoutTtl),
            _cache.SetAsync(GetRenewedTimeKey(value), now, wiTimeoutTtl)

and the user handling function is not called.

So the item is now in "work" but only with "enqueued" value so maintenance will ignore it and it will stay in "work" forever without being handled.

Thanks.

Interesting. Thanks for pointing this out. What are your thoughts on how this should be handled? Can you submit a PR with your proposed changes?

One option I can think of is moving the code that set the dequeued time and renewed time to be in transaction with the ListRightPopLeftPushAsync (either by transaction or with LUA script using Database.ScriptEvaluateAsync).
Basically, DequeueIdAsync will move the item and set the tags in the same operation.

In case we have a timeout in this scenario we will return null but the item will be handled by maintenance later.
What do you think?

@niemyjski @ejsmith
A friend noticed that my fix to this issue breaks the support to cluster mode, he already suggested and showed a solution for the cluster as well (LUA script and used keys must be on the same shard), I will ask him to create a pull request for the fix as soon as he can.
Do you want me to open a new issue?

Probably would be better to keep it separate.