I think something goes wrong with the groups feature when the server crashes
Hi
I set (see the sketch at the end of this comment):
- group concurrency: 1
- global concurrency: 15
While one job was being processed, the server got restarted for some reason.
The job seems to have moved back to waiting, but it never becomes active.
Even if I delete the stalled job and add it again with the same group id, the job still doesn't run.
It seems that specific group id gets stuck and doesn't return to normal when the server crashes during processing.
When I check the group status, it returns "maxed", but nothing is being processed at all.
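For reference, the worker is configured roughly like this (a simplified sketch; the queue name, processor body, and Redis connection are placeholders):

```ts
import { WorkerPro } from '@taskforcesh/bullmq-pro';

// Placeholder connection, just to illustrate the settings.
const connection = { host: 'localhost', port: 6379 };

const worker = new WorkerPro(
  'grouped-queue',
  async job => {
    // ... process the job ...
  },
  {
    connection,
    concurrency: 15,           // global concurrency
    group: { concurrency: 1 }, // at most one active job per group
  }
);
```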
Which version of BullMQ Pro are you using?
So to summarize your issue:
- A job belonging to a group G was being processed.
- The server was restarted while the job was active.
- The job was correctly moved back to wait, but the group is still in "maxed" status.
- No other job in that group is being processed (active).
Can you confirm?
Yes, that's correct.
Let me explain the situation in detail:
- I'm using version 5.1.14 of BullMQ Pro.
- I'm also using version 1.76.6 of the standard BullMQ alongside it, for the legacy queues.
- So I also use QueueScheduler, and I still create a QueueScheduler for the old queues (I don't know whether that has any effect); see the sketch after this list.
- The Node server runs in Docker and is restarted by CI/CD.
- The weird thing is that retrying does work from my local machine when I kill the Node process (I don't use Docker locally): it keeps processing the other jobs properly. (I used a local Redis for that test.)
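The legacy (v1.x) side looks roughly like this (a sketch; queue names, the processor, and the connection are placeholders):

```ts
import { Queue, QueueScheduler, Worker } from 'bullmq'; // v1.76.6

const connection = { host: 'localhost', port: 6379 };

// In BullMQ v1.x a QueueScheduler is needed per queue to handle
// stalled and delayed jobs; newer versions no longer require it.
const scheduler = new QueueScheduler('legacy-queue', { connection });
const queue = new Queue('legacy-queue', { connection });
const worker = new Worker(
  'legacy-queue',
  async job => {
    // ... legacy processing ...
  },
  { connection }
);
```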
Please ask me anything if you need more info.
BTW it's almost 8 PM here, so I may reply tomorrow.
Thank you!
Ok, an explanation for this behavior could be that the standard BullMQ (not Pro) is actually also using the new queues. For example, if a Pro worker crashes or is re-started, the standard BullMQ could move that job to wait, but since it does not know about groups, the group will stay at "maxed".
Also, I wonder why not upgrade to the latest BullMQ, or even better, use BullMQ Pro for all queues? (With the newest version you do not need the QueueScheduler either, so it is easier.)
Another thing. By any chance, do you share a Redis connection between the standard and the Pro version?
> Ok, an explanation for this behavior could be that the standard BullMQ (not Pro) is actually also using the new queues. For example, if a Pro worker crashes or is re-started, the standard BullMQ could move that job to wait, but since it does not know about groups, the group will stay at "maxed".
I don't think so, because I completely separate the standard BullMQ queues/workers from the BullMQ Pro queues/workers.
So I don't think it's possible that any job was created by a standard BullMQ queue.
> Also, I wonder why not upgrade to the latest BullMQ, or even better, use BullMQ Pro for all queues? (With the newest version you do not need the QueueScheduler either, so it is easier.)
It's just that I don't want to change something that is already working well.
> Another thing. By any chance, do you share a Redis connection between the standard and the Pro version?
No, I use separate connections. I don't know why, but the Node server wouldn't even start when I used the same connection.
Those are my answers. BTW, do you know why it's not possible to use the same connection?
> BTW, do you know why it's not possible to use the same connection?
Because when BullMQ starts it loads a bunch of Lua scripts with that connection, and I think if you use two different versions with the same connection the scripts get mixed up.
By connection I mean an IORedis instance, btw.
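So, as a sketch (queue names are placeholders), each side gets its own IORedis instance:

```ts
import IORedis from 'ioredis';
import { Queue } from 'bullmq';                      // standard BullMQ
import { QueuePro } from '@taskforcesh/bullmq-pro';  // BullMQ Pro

// One connection per library version, so each one loads its own
// Lua scripts on its own IORedis instance.
const legacyConnection = new IORedis({ maxRetriesPerRequest: null });
const proConnection = new IORedis({ maxRetriesPerRequest: null });

const legacyQueue = new Queue('legacy-queue', { connection: legacyConnection });
const proQueue = new QueuePro('grouped-queue', { connection: proConnection });
```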
Also, I am releasing a "repairMaxedGroup" function in Pro and exposing it in Taskforce.sh so that you can fix maxed groups manually. This should never happen, but at least if it does happen you know you can do something about it. We will need to investigate further to discover the cause behind it.
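A minimal usage sketch, assuming the function is exposed on QueuePro and takes the group id (check the Pro docs for the exact signature once it is released):

```ts
import { QueuePro } from '@taskforcesh/bullmq-pro';

const queue = new QueuePro('grouped-queue', {
  connection: { host: 'localhost', port: 6379 },
});

// Repair a group that reports "maxed" although none of its jobs is active.
await queue.repairMaxedGroup('stuck-group-id');
```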
Oh, thank you. It works!
BTW, is the function a heavy process?
I just made a worker that runs every 10 minutes to check every BullMQ Pro queue, like below (rough sketch after the list):
- Get the list of groups for each BullMQ Pro queue.
- Filter only the groups whose status is "maxed".
- Run the repairMaxedGroup function that you released.
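Roughly like this (a sketch under some assumptions: that getGroups returns entries with an id and a status field and that repairMaxedGroup takes the group id; queue names and the connection are placeholders):

```ts
import { QueuePro } from '@taskforcesh/bullmq-pro';

const connection = { host: 'localhost', port: 6379 };
const queueNames = ['pro-queue-a', 'pro-queue-b']; // my BullMQ Pro queues

async function repairMaxedGroups() {
  for (const name of queueNames) {
    const queue = new QueuePro(name, { connection });
    try {
      // Assumed shape: [{ id, status }, ...]
      const groups = await queue.getGroups();
      const maxed = groups.filter(group => group.status === 'maxed');
      for (const group of maxed) {
        await queue.repairMaxedGroup(group.id);
      }
    } finally {
      await queue.close();
    }
  }
}

// Run the check every 10 minutes.
setInterval(() => {
  repairMaxedGroups().catch(console.error);
}, 10 * 60 * 1000);
```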
The function is not designed to be used frequently, as this issue should never happen :) If you are able to reproduce this issue frequently, then please provide some code that reproduces it and we will fix it instead.
OK Thank you