taskforcesh/bullmq-pro-support

I think something goes wrong with the groups feature when the server crashes

Opened this issue · 14 comments

Hi

I set the following (sketched below):
group concurrency: 1
global concurrency: 15

While one job was being processed, the server was restarted for some reason. The job seems to have gone back to waiting, but it never becomes active again. Even if I delete the stalled job and add it again with the same group id, the new job doesn't run.

It seems that the specific group id gets stuck and doesn't go back to normal when the server crashes during processing.

When I check the group status it returns "maxed", but nothing is being processed at all.
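
For context, the setup looks roughly like this (a minimal sketch, not the actual code; the queue name, group id, and connection details are placeholders):

```ts
import { QueuePro, WorkerPro } from '@taskforcesh/bullmq-pro';

const connection = { host: 'localhost', port: 6379 };

async function main() {
  // Jobs are added to a group by passing a group id in the job options.
  const queue = new QueuePro('my-pro-queue', { connection });
  await queue.add('task', { some: 'data' }, { group: { id: 'customer-42' } });

  // One job at a time per group, up to 15 jobs in total across all groups.
  const worker = new WorkerPro(
    'my-pro-queue',
    async (job) => {
      // ... process the job
    },
    {
      connection,
      concurrency: 15, // global concurrency
      group: { concurrency: 1 }, // per-group concurrency
    },
  );
}

main().catch(console.error);
```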

Which version of BullMQ Pro are you using?

So to summarize your issue:

  1. A job belonging to a group G was being processed.
  2. The server was restarted while the job was being processed.
  3. The job was correctly moved back to wait, but the group is still in "maxed" status.
  4. No other job in that group is being processed (active).

Can you confirm?

Yes, that's right.
Let me explain the situation in detail:

  1. I'm using version 5.1.14 of BullMQ Pro.
  2. I'm also using version 1.76.6 of standard BullMQ for the legacy queues.
  3. So I also use QueueScheduler, and I still create a QueueScheduler for the old queues (I don't know if that affects anything; see the sketch after this list).
  4. The Node server runs in Docker and is restarted by CI/CD.
  5. The weird thing is that retrying does work by itself from my local machine when I kill the Node pid (I don't use Docker locally); it keeps processing the next jobs properly. (I used a local Redis for that test.)
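
For point 3, the legacy side is roughly like this (just a sketch with placeholder queue names and connection details, not my real code):

```ts
import { Queue, QueueScheduler, Worker } from 'bullmq'; // standard BullMQ 1.x

const legacyConnection = { host: 'localhost', port: 6379 };

// In BullMQ 1.x a QueueScheduler is needed per queue so that stalled
// and delayed jobs are moved back to the wait list.
const scheduler = new QueueScheduler('legacy-queue', { connection: legacyConnection });

const legacyQueue = new Queue('legacy-queue', { connection: legacyConnection });

const legacyWorker = new Worker(
  'legacy-queue',
  async (job) => {
    // ... process the legacy job
  },
  { connection: legacyConnection },
);
```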

Please ask me anything if you need more info.
BTW, it's almost 8 PM here, so I may reply tomorrow.
Thank you 🙏

Ok, an explanation for this behavior could be that the standard BullMQ (not Pro) is actually also operating on the new queues. For example, if a Pro worker crashes or is restarted, the standard BullMQ could move that job back to wait, but since it does not know about groups, the group would stay at "maxed".
Also, I wonder why you don't upgrade to the latest BullMQ, or even better, use BullMQ Pro for all queues? (With the newest version you do not need the QueueScheduler either, so it is easier.)

Another thing. By any chance, do you share a Redis connection between the standard and the Pro version?

> Ok, an explanation for this behavior could be that the standard BullMQ (not Pro) is actually also operating on the new queues. For example, if a Pro worker crashes or is restarted, the standard BullMQ could move that job back to wait, but since it does not know about groups, the group would stay at "maxed".

I don't think so, because I completely separate the standard BullMQ queues/workers from the BullMQ Pro queues/workers.
So I don't think it's possible that any job was created by the standard BullMQ queues.

> Also, I wonder why you don't upgrade to the latest BullMQ, or even better, use BullMQ Pro for all queues? (With the newest version you do not need the QueueScheduler either, so it is easier.)

Simply because I don't want to change something that is already working well.

> Another thing. By any chance, do you share a Redis connection between the standard and the Pro version?

No, I use separate connections. I don't know why, but the Node server wouldn't even start when I used the same connection.
This is my comment
BTW, do you know why it's not possible to use the same connection?

> BTW, do you know why it's not possible to use the same connection?

Because when BullMQ starts it loads a bunch of Lua scripts using that connection, and I think that if you use two different versions with the same connection the scripts get mixed up.

By connection I mean an IORedis instance, btw.
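
So something along these lines, with one IORedis instance per library (just a sketch with placeholder names):

```ts
import IORedis from 'ioredis';
import { Queue } from 'bullmq';
import { QueuePro } from '@taskforcesh/bullmq-pro';

// Each library gets its own IORedis instance, so each one loads its own
// Lua scripts on its own connection and they never get mixed up.
const standardConnection = new IORedis();
const proConnection = new IORedis();

const legacyQueue = new Queue('legacy-queue', { connection: standardConnection });
const proQueue = new QueuePro('my-pro-queue', { connection: proConnection });
```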

Also, I am releasing a "repairMaxedGroup" function to Pro and exposing it in Taskforce.sh so that you can fix maxed groups manually. This should never happen, but at least if it does happen now you can do something about it. We will need to investigate further to discover the cause behind it.

It is released now, please give it a try:

[screenshot]

Oh, thank you. It works!
BTW, is the function a heavy process?

I just made a worker that runs every 10 minutes to check every BullMQ Pro queue, like the steps below (sketched in code after the list):

  1. Get the list of groups for each BullMQ Pro queue.
  2. Filter only the ones whose status is maxed.
  3. Run the function that you released.
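
Roughly like this (a sketch; the queue names are placeholders, and I'm assuming the Pro queue exposes the group list as getGroups() and the new function as repairMaxedGroup(groupId), so correct me if those calls look different):

```ts
import { QueuePro } from '@taskforcesh/bullmq-pro';

const connection = { host: 'localhost', port: 6379 };

const queues = ['pro-queue-a', 'pro-queue-b'].map(
  (name) => new QueuePro(name, { connection }),
);

// Every 10 minutes: list the groups of each Pro queue, keep only the ones
// reported as "maxed", and repair them with the newly released function.
setInterval(async () => {
  for (const queue of queues) {
    const groups = await queue.getGroups(); // assumed to return [{ id, status }, ...]
    const maxedGroups = groups.filter((group) => group.status === 'maxed');
    for (const group of maxedGroups) {
      await queue.repairMaxedGroup(group.id);
    }
  }
}, 10 * 60 * 1000);
```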

The function is not designed to be used frequently, as this issue should never happen :) If you are able to reproduce this issue frequently, then please provide some code that reproduces it and we will fix it instead.

OK, thank you.