imranmomin/Hangfire.AzureDocumentDB

Hangfire.Azure.DocumentDbDistributedLockException

Closed this issue · 3 comments

I'm running 2 instances of Hangfire server using Hangfire.AzureDocumentDB and the same Cosmos DB collection as a storage. During inspecting App Insights exceptions I've found that there are a lot of Hangfire.Azure.DocumentDbDistributedLockException. It occurs constantly while app is running, nearly every second:

timestamp | problemId
-------------------------
2018-06-06T11:43:50.489Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:43:14.4Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:42:14.291Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:42:00.774Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:41:02.144Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:38:40.288Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:38:04.244Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:37:01.088Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:36:50.589Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire

Exception details:

Message: "Error occurred during execution of 'Worker #c5a17354' process. Execution will be retried (attempt #267) in 00:05:00 seconds."

Stacktrace: Hangfire.Azure.DocumentDbDistributedLockException: "Could not place a lock on the resource 'locks:job:dequeue': Lock timeout."
   at Hangfire.Azure.DocumentDbDistributedLock.Acquire(String name, TimeSpan timeout)
   at Hangfire.Azure.Queue.JobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
   at Hangfire.Azure.DocumentDbConnection.FetchNextJob(String[] queues, CancellationToken cancellationToken)
   at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
   at Hangfire.Server.ServerProcessExtensions.Execute(IServerProcess process, BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)

Taking into account a comment by @odinserj for the similar issue it turns out that this type of exception could be expected:

This process runs on each server, and for each queue it acquires a distributed lock to prevent other processing servers to get into the process. And if you have a lot of servers, some of them are inevitably will fail to acquire a lock, causing this exception.

The question is whether getting this exception is a normal behavior? And if so the it would be great to handle it in the library and do not re-throw to the user's code.

@savbace based on the HangfireIO/Hangfire#925 it looks like it an excepted behavior when multiple workers try to have same lock

For JobQueue.Dequeue the defaultLockTimeout = 10 sec.
So, if a worker acquires a lock and is unable to find any jobs inside a queue for 10 seconds will eventually throw the exception to breakout from the queue polling loop.

Well regarding on handling the exception, I think it should be handled on Hangfire.Core or if @odinserj can suggest me solution

I'm not sure if @odinserj has changed the log level from error to debug for the core library.

I believe @savbace was able to fine tune its settings