Hangfire.Azure.DocumentDbDistributedLockException
Closed this issue · 3 comments
I'm running 2 instances of Hangfire server using Hangfire.AzureDocumentDB and the same Cosmos DB collection as a storage. During inspecting App Insights exceptions I've found that there are a lot of Hangfire.Azure.DocumentDbDistributedLockException
. It occurs constantly while app is running, nearly every second:
timestamp | problemId
-------------------------
2018-06-06T11:43:50.489Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:43:14.4Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:42:14.291Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:42:00.774Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:41:02.144Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:38:40.288Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:38:04.244Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:37:01.088Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
2018-06-06T11:36:50.589Z | Hangfire.Azure.DocumentDbDistributedLockException at Hangfire.Azure.DocumentDbDistributedLock.Acquire
Exception details:
Message: "Error occurred during execution of 'Worker #c5a17354' process. Execution will be retried (attempt #267) in 00:05:00 seconds."
Stacktrace: Hangfire.Azure.DocumentDbDistributedLockException: "Could not place a lock on the resource 'locks:job:dequeue': Lock timeout."
at Hangfire.Azure.DocumentDbDistributedLock.Acquire(String name, TimeSpan timeout)
at Hangfire.Azure.Queue.JobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
at Hangfire.Azure.DocumentDbConnection.FetchNextJob(String[] queues, CancellationToken cancellationToken)
at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
at Hangfire.Server.ServerProcessExtensions.Execute(IServerProcess process, BackgroundProcessContext context)
at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
Taking into account a comment by @odinserj for the similar issue it turns out that this type of exception could be expected:
This process runs on each server, and for each queue it acquires a distributed lock to prevent other processing servers to get into the process. And if you have a lot of servers, some of them are inevitably will fail to acquire a lock, causing this exception.
The question is whether getting this exception is a normal behavior? And if so the it would be great to handle it in the library and do not re-throw to the user's code.
@savbace based on the HangfireIO/Hangfire#925 it looks like it an excepted behavior when multiple workers try to have same lock
For JobQueue.Dequeue
the defaultLockTimeout = 10 sec.
So, if a worker acquires a lock and is unable to find any jobs inside a queue for 10 seconds will eventually throw the exception to breakout from the queue polling loop.
Well regarding on handling the exception, I think it should be handled on Hangfire.Core
or if @odinserj can suggest me solution
I'm not sure if @odinserj has changed the log level from error
to debug
for the core library.
I believe @savbace was able to fine tune its settings