gottscj/Hangfire.Mongo

Job stuck in the Processing state (InvisibilityTimeout not working properly)

braca opened this issue · 21 comments

braca commented

During some tests where our server is restarted while jobs are being executed, we found out that sometimes jobs become stuck in the Processing state and never recover.

After we restart the server, those jobs are never executed again no matter how long we wait (we've waited for the invisibility timeout and nothing happened; we even left it for a few hours with no change).

The Hangfire UI seems to know that the job is stuck/aborted, as it shows the following message:

The job was aborted – it is processed by server testpc:server1:33704:7daebff8-39b2-4899-9f5f-3dae35a39863 which is not in the active servers list for now. It will be retried automatically after invisibility timeout, but you can also re-queue or delete it manually.

After digging into this issue a little bit more, I've found a weird behaviour that I think is causing this, and it's related to InvisibilityTimeout.

In our example we have an InvisibilityTimeout of 2 minutes (it could be 30 minutes; it wouldn't make a difference).
We have some long-running jobs that may exceed our InvisibilityTimeout (for our use case, we are OK with that).
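
For reference, our storage configuration looks roughly like this (a simplified sketch; other options are omitted and the exact UseMongoStorage overload may differ between Hangfire.Mongo versions):

using System;
using Hangfire;
using Hangfire.Mongo;

var storageOptions = new MongoStorageOptions
{
    // A job fetched longer ago than this is treated as abandoned and becomes visible to other workers again.
    InvisibilityTimeout = TimeSpan.FromMinutes(2)
};

GlobalConfiguration.Configuration.UseMongoStorage("mongodb://localhost:27017", "hangfire", storageOptions);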

When the InvisibilityTimeout is exceeded, the job's CancellationToken is triggered and the worker that is running that job aborts the execution.
At the same time, a new worker will pick up that job and run it again.
This is the expected behaviour (AFAIK).

The problem is that this only happens once: the worker that is now processing the long-running job will no longer receive any kind of cancellation when the InvisibilityTimeout is exceeded.
The same happens if the server is restarted at this point: the job will no longer be processed and will be stuck in "Processing".

This seems to happen because, when the InvisibilityTimeout cancels the first execution, Worker.Execute(BackgroundProcessContext context) removes the job from the queue, despite the fact that the job is still being processed by another worker.

This happens in these lines:

In Worker.Execute(BackgroundProcessContext context):

var state = PerformJob(context, connection, fetchedJob.JobId);
// NOTE: in this example, state is null.

if (state != null)
{
	// Ignore return value, because we should not do anything when the current state is not Processing.
	TryChangeState(context, connection, fetchedJob, state, new[] { ProcessingState.StateName }, CancellationToken.None, context.ShutdownToken);
}

// NOTE: I'm not sure the assumption below is correct; in our case the job was not performed,
// it is still being executed (by another worker), so I don't think it should be removed from the queue.

// Checkpoint #4. The job was performed, and it is in the one
// of the explicit states (Succeeded, Scheduled and so on).
// It should not be re-queued, but we still need to remove its
// processing information.

requeueOnException = false;
fetchedJob.RemoveFromQueue();

Steps to reproduce

  1. Download this repro project: https://github.com/braca/hangfire-mongo-bug
  2. Run the project and go to: http://localhost:5000/swagger
  3. POST /Test/LongJobtesting
  4. Wait a few minutes
  5. The job will be running for 1 minute in one of the available workers
  6. After 1 minute, the CancellationToken will be triggered (because of the InvisibilityTimeout)
  7. A new worker will pick up the job and execute it
  8. This new worker will never receive any cancellation via its CancellationToken, even if the job execution time exceeds the InvisibilityTimeout
  9. Restart the server
  10. The job will be stuck in Processing and it will never be picked up by any worker
  11. You can check the logs/app.logs and confirm this

Logs

16:50:37.231 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - Started
16:50:37.231 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 0 - Execution time: 00:00:00.0025689
16:50:47.244 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 1 - Execution time: 00:00:10.0158631
16:50:57.253 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 2 - Execution time: 00:00:20.0250140
16:51:07.256 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 3 - Execution time: 00:00:30.0278617
16:51:17.263 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 4 - Execution time: 00:00:40.0343608
16:51:27.262 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 5 - Execution time: 00:00:50.0336615
16:51:37.261 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - Started
16:51:37.261 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 0 - Execution time: 00:00:00.0010493
16:51:37.278 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - i = 6 - Execution time: 00:01:00.0498272
16:51:37.320 [ERR] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: cc1e7f72-5408-42f6-9415-6288f083cbbd - Token was cancelled - Execution time: 00:01:00.0917212
16:51:47.265 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 1 - Execution time: 00:00:10.0047696
16:51:57.280 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 2 - Execution time: 00:00:20.0198516
16:52:07.281 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 3 - Execution time: 00:00:30.0211184
16:52:17.294 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 4 - Execution time: 00:00:40.0335476
16:52:27.311 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 5 - Execution time: 00:00:50.0509575
16:52:37.321 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 6 - Execution time: 00:01:00.0607559
16:52:47.333 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 7 - Execution time: 00:01:10.0725911
16:52:57.352 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 8 - Execution time: 00:01:20.0913322
16:53:07.363 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 9 - Execution time: 00:01:30.1025756
16:53:17.378 [INF] ExecuteLongTaskAsync - Job: 61e992d0172392e01cb7be14 -  Worker: 23f99319-21d4-4001-a383-18027f31abad - i = 10 - Execution time: 00:01:40.1181397

@braca,

Thanks for the links. I will check this out ASAP.

Thanks!

@braca,

It's a bug. I have a fix for it; I just need to verify it.
Thanks!

braca commented

@gottscj any update on this?

Hi @braca,

I usually wait for the next release of Hangfire, but I'm not sure when the next release is due, so I will create a release with this shortly.

Thanks for waiting. :)

Is this bug fixed in version 1.7.0?
I have this issue too, and I need to know whether an update of the package solves this problem or not 😄

Edit:
Or should I wait for the next release?
Btw, I tried version 1.7.0 and this bug still exists.

braca commented

Hi @gottscj,

With the 1.7.0 release I cannot replicate our issue, so it seems to be fixed, but I'll try to do more tests.

Thanks :)

ℹ️ I changed the queue strategy from 'Watch' to 'Poll' and now it works.
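
For anyone hitting the same thing, the switch looks roughly like this (a sketch; CheckQueuedJobsStrategy and QueuePollInterval are assumed to be available as MongoStorageOptions members in the Hangfire.Mongo version in use):

using System;
using Hangfire.Mongo;

var storageOptions = new MongoStorageOptions
{
    // Poll the job queue at a fixed interval instead of relying on a MongoDB change-stream watch.
    CheckQueuedJobsStrategy = CheckQueuedJobsStrategy.Poll,
    QueuePollInterval = TimeSpan.FromSeconds(1)
};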

@braca,

Thank you for validating the fix. And thank you for setting up the repo!

@Sacharkin,

This should be fixed in the 1.7.0 release.
Are you experiencing the same issue as this ticket? Thanks

Not 100 percent the same. My issue was that spamming jobs into the queue (1000 jobs in one second) caused some of them (5-10) to get stuck in the Processing state.

Hangfire runs on multiple instances; with only one instance I cannot reproduce the issue.
After a server restart the jobs were still in Processing, but sometimes it was possible to run them successfully after a requeue.

Info: before the package update to 1.7.0, the stuck jobs count was something like 200/1000.

@gottscj

I think my issue is different, should I create a new task for it?
After some tests, I was able to solve my issue by changing the write and read concern for the MongoDB client to 'majority'.
It was related to having a MongoDB replica set.
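
For reference, the change would look roughly like this (a sketch using MongoDB.Driver's MongoClientSettings; how the resulting client or settings are wired into Hangfire.Mongo depends on your setup):

using MongoDB.Driver;

// Placeholder hosts and replica set name.
var settings = MongoClientSettings.FromConnectionString("mongodb://mongo1,mongo2,mongo3/?replicaSet=rs0");
settings.ReadConcern = ReadConcern.Majority;    // only read data acknowledged by a majority of members
settings.WriteConcern = WriteConcern.WMajority; // wait for a majority of members to acknowledge writes
var client = new MongoClient(settings);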

@Sacharkin,

Please create a new issue for this. Thanks!

I have an issue with OracleStorageOptions: a long-running background job is retried after 30 minutes. What am I supposed to do? HF is 1.7.28. InvisibilityTimeout is not used, as suggested.

@alexandis,

This is the repo for Hangfire support for MongoDB. Unfortunately, I don't have any insight into the storage you are using.

Good luck!

@gottscj

I think my issue is different, should I create a new task for it? After some tests, I was able to solve my issue by changing the write and read concern for the MongoDB client to 'majority'. It was related to having a MongoDB replica set.

Thanks for your inspiration. In my case (MongoDB.Driver 2.22.0), changing the connection string from "readPreference=SecondaryPreferred" to "readPreference=PrimaryPreferred" solved the stuck problem.
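
In other words, roughly (host names, replica set name and database are placeholders):

// Before: reads may be served by a secondary, which can return stale job/queue documents.
// "mongodb://mongo1,mongo2,mongo3/hangfire?replicaSet=rs0&readPreference=SecondaryPreferred"

// After: reads go to the primary whenever it is available.
var connectionString = "mongodb://mongo1,mongo2,mongo3/hangfire?replicaSet=rs0&readPreference=PrimaryPreferred";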

@marshafan,

Thanks for sharing!

Hi @gottscj,

I hope this message finds you well.

I am currently encountering an issue with Hangfire Core version 1.7.31 and Hangfire.Mongo. In my AWS ECS deployment, when executing long-running background jobs, I've observed that when a server is forcibly stopped or scaled down during testing, some jobs are reassigned to other servers, while others remain assigned to the old server that has already terminated or is about to terminate. The issue persists until a new server is started in AWS.

I'm seeking your expertise in resolving this matter. Could you please provide guidance on how we can address and solve this problem?

Your assistance is greatly appreciated!

Thanks in advance.

Hi @dwevedivaibhav,

Did you try setting the InvisibilityTimeout? It is disabled by default, but if you set a value, Hangfire.Mongo will requeue a job whose server has been terminated.

Please be aware that if a job runs for longer than the timeout, Hangfire.Mongo will mistake it for a stale job and requeue it.

I hope this helps

Hi @gottscj,

Thank you for your prompt response.

In my scenario, there are multiple workflows running in parallel within a 24-hour timeframe. The execution time varies based on the load: some workflows complete in 5 minutes and others take up to 3 hours or more. Given this variability, how can I set the InvisibilityTimeout in this situation?

Any leads or guidance you can provide would be greatly appreciated.

@dwevedivaibhav,

You would need to set the timeout to 3+ hours then. Alternatively, if possible, you could consider breaking up the jobs so that each one's execution time is smaller, and save state between the jobs.
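
A minimal sketch of what that could look like with Hangfire continuations (WorkflowSteps and its methods are hypothetical; each step would persist its own state, e.g. in your database, so the next step can pick it up on any server):

using Hangfire;

// Enqueue the steps as a chain of continuations instead of one multi-hour job.
var workflowId = "workflow-42";
var step1 = BackgroundJob.Enqueue<WorkflowSteps>(x => x.RunStep1(workflowId));
var step2 = BackgroundJob.ContinueJobWith<WorkflowSteps>(step1, x => x.RunStep2(workflowId));
BackgroundJob.ContinueJobWith<WorkflowSteps>(step2, x => x.RunStep3(workflowId));

public class WorkflowSteps
{
    // Hypothetical steps; each loads the workflow state, does a bounded amount of work and saves it again.
    public void RunStep1(string workflowId) { /* load state, work, save state */ }
    public void RunStep2(string workflowId) { /* ... */ }
    public void RunStep3(string workflowId) { /* ... */ }
}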

I hope this helps.

@gottscj Thank you for your response!

I have an additional question related to processing background jobs in different workers within the Hangfire server.

To illustrate, let's consider scheduling 20 background jobs across 3 containers, each with 2GB CPU and 4GB Memory. During distribution, everything works seamlessly. However, when the server crashes or scales down and a new server is brought up, Hangfire resumes distributing jobs. I've noticed a glitch: if one container (Hangfire server worker) initially receives 10 jobs, subsequent jobs are not allocated to other servers until the earlier allocated 10 jobs are processed. Meanwhile, other servers with fewer or no jobs continue to remain idle.

Could you provide insights into addressing this issue or optimizing the job distribution process?

Thank you for your assistance!