gottscj/Hangfire.Mongo

RemoveTimedOutServers can incorrectly remove servers if the system time is not synchronised across all servers

jam-esh opened this issue · 2 comments

Hangfire + Hangfire.MongoDB (using latest packages of both).
Running on Windows Server.

Awesome package that has been rock solid until one recent incident, best summarised as "if it can happen (in production) then, left long enough, it eventually will".

We had an incident recently with Hangfire running on multiple servers. One of the servers kept being marked as "dead".
On the next heartbeat check, the "zombie" Hangfire server in question would cancel all running jobs and restart itself, only to be informed again that it was dead. Repeat ad nauseam.

When we did some post-mortem analysis on this, it transpired that the system clock on the "zombie" Hangfire server's machine had, for some reason, drifted 5 minutes behind all other servers. This is in a corporate environment where the system time for all servers is meant to be synchronised.

When I look at the Hangfire native SQL Server implementation of JobStorageConnection.RemoveTimedOutServers, I can see that the implementation leaves all references to time to the SQL Server. Thus all Hangfire servers use a single reference to time.

When I look at the Hangfire MongoDB implementation of JobStorageConnection.RemoveTimedOutServers (in MongoConnection), I can see that it uses (local to each server) DateTime.UtcNow. Thus each server has its own reference to time.

This caused the "5 minute slow system time" Hangfire server to write a heartbeat that was immediately seen as stale by the other servers. Hence it was repeatedly marked as timed out.
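
To make the failure mode concrete, here is a minimal simulation of the comparison described above. It is not the Hangfire.Mongo code: the function name, the 5-minute timeout and the 10-second delay are assumptions picked for illustration. Only the 5-minute clock skew comes from our incident.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values for illustration; not read from any Hangfire configuration.
SERVER_TIMEOUT = timedelta(minutes=5)
CLOCK_SKEW = timedelta(minutes=-5)  # the "zombie" machine runs 5 minutes slow


def is_timed_out(last_heartbeat, checker_now, timeout=SERVER_TIMEOUT):
    """Return True when the heartbeat is older than `timeout` by the checker's clock."""
    return last_heartbeat < checker_now - timeout


true_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# The slow server stamps its heartbeat with its own local clock.
heartbeat_from_slow_server = true_time + CLOCK_SKEW

# A healthy server runs the cleanup ten seconds later, using its own clock.
checker_now = true_time + timedelta(seconds=10)

# The live server looks stale because two different clocks are being compared.
print(is_timed_out(heartbeat_from_slow_server, checker_now))  # True

# The same heartbeat stamped with a correct clock would not be removed.
print(is_timed_out(true_time, checker_now))  # False
```

With these numbers the heartbeat is already "older" than the timeout the moment it is written, which matches what we saw: the server was removed on every check even though it was alive.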

This use of DateTime.UtcNow in RemoveTimedOutServers requires all servers to have their system time broadly in-sync.

Would it be more robust to leverage the MongoDB server time in your RemoveTimedOutServers implementation, instead of each individual Hangfire server's local machine time?
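
A rough sketch of the idea, again as a simulation rather than real Hangfire.Mongo or MongoDB driver code: the `HeartbeatStore` class and its methods are hypothetical stand-ins for the database. The point is that callers never send a timestamp, so both the heartbeat and the cutoff come from one clock. (MongoDB can stamp documents with its own time, for example through the `$currentDate` update operator; I am not assuming that is how a fix would be written.)

```python
from datetime import datetime, timedelta, timezone


class HeartbeatStore:
    """Hypothetical stand-in for the database: it alone decides what time it is."""

    def __init__(self, now):
        self._now = now
        self._heartbeats = {}

    def advance(self, delta):
        # Move the store's clock forward (simulates time passing).
        self._now += delta

    def heartbeat(self, server_id):
        # The caller sends no timestamp; the store stamps with its own clock.
        self._heartbeats[server_id] = self._now

    def remove_timed_out(self, timeout):
        # The cutoff is computed from the same clock that stamped the heartbeats.
        cutoff = self._now - timeout
        stale = [sid for sid, seen in self._heartbeats.items() if seen < cutoff]
        for sid in stale:
            del self._heartbeats[sid]
        return stale


store = HeartbeatStore(datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc))
timeout = timedelta(minutes=5)

# Both servers report in; their local clocks play no part.
store.heartbeat("server-with-slow-clock")
store.heartbeat("server-with-correct-clock")
store.advance(timedelta(seconds=10))
removed_early = store.remove_timed_out(timeout)
print(removed_early)  # []

# A server that genuinely stops sending heartbeats is still removed.
store.advance(timedelta(minutes=6))
store.heartbeat("server-with-correct-clock")
removed_later = store.remove_timed_out(timeout)
print(removed_later)  # ['server-with-slow-clock']
```

In this model a skewed machine clock cannot cause a false removal, while a server that really goes silent is still cleaned up.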

@jam-esh,

Thank you for the thorough explanation!
I have changed the server heartbeat handling to use the MongoDB instance timestamp.
We already use the MongoDB server time for a lot of other handling:
https://github.com/Hangfire-Mongo/Hangfire.Mongo/blob/master/src/Hangfire.Mongo/MongoConnection.cs#L671

Let me know if you think this would solve your issues:
https://github.com/Hangfire-Mongo/Hangfire.Mongo/pull/357/files#diff-4f513b5e44f47177a784928fa0a45bb91998cf4dbe09b760a528685edd1c7e14

Thanks!

@gottscj, yes, that should solve this issue.

Thank you very much!