taskforcesh/bullmq-pro-support

Grouped workflows appear to be stuck and maxed out

Opened this issue · 5 comments

I recently upgraded BullMQ Pro from 6.3.4 to 6.6.1 and noticed that our scheduling went completely haywire :)
Downgrading today didn't fix anything, so while it would be a big coincidence, the upgrade may be a red herring.

We are using BullMQ to schedule "workflows", in most cases for immediate execution, but sometimes also for a given point in time. We often use groups with a concurrency of 1 to ensure sequential execution in certain contexts and avoid race conditions.
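For reference, a minimal sketch of that kind of setup. Queue name, connection, and processor body are placeholders, and the group options follow the BullMQ Pro groups documentation, but should be verified against the installed version:

// Sketch: workflow jobs are added to a group, and a worker processes at most one
// job per group at a time. Everything named here is illustrative, not actual code
// from this thread.
import { QueuePro, WorkerPro } from "@taskforcesh/bullmq-pro";

const connection = { host: "localhost", port: 6379 };
const queue = new QueuePro("workflows", { connection });

// schedule a workflow, optionally delayed, inside a group that serializes execution
async function scheduleWorkflow(workflowId: string, groupId: string, delay = 0) {
    await queue.add("run-workflow", { workflowId }, { group: { id: groupId }, delay });
}

// group concurrency of 1: jobs of the same group never run in parallel
const worker = new WorkerPro(
    "workflows",
    async (job) => {
        // ...execute the workflow referenced by job.data.workflowId...
    },
    { group: { concurrency: 1 }, connection },
);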

We also have a routine that detects workflows that were not executed at their scheduled time. If it finds any, it reschedules them for immediate execution (in batches of at most 500 workflows at a time). This normally hardly ever triggers, but since the package upgrade (which was deployed a bit after 20:00), it has skyrocketed:

[screenshot: graph of rescheduled workflows spiking after the deployment]
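That rescheduling step could look roughly like the following sketch. The Workflow shape and the way overdue workflows are discovered are assumptions; only addBulk and the group option come from the BullMQ (Pro) API:

// Hypothetical sketch of the batch rescheduling described above.
import { QueuePro } from "@taskforcesh/bullmq-pro";

interface Workflow {
    id: string;
    name: string;
    groupId?: string;
}

const BATCH_SIZE = 500;

async function rescheduleOverdue(queue: QueuePro, overdue: Workflow[]): Promise<void> {
    // re-add overdue workflows for immediate execution, in chunks of at most 500
    for (let i = 0; i < overdue.length; i += BATCH_SIZE) {
        const batch = overdue.slice(i, i + BATCH_SIZE);
        await queue.addBulk(
            batch.map((wf) => ({
                name: wf.name,
                data: { workflowId: wf.id },
                // keep the workflow inside its group so sequential execution still holds
                opts: wf.groupId ? { group: { id: wf.groupId } } : {},
            })),
        );
    }
}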

I have a testing endpoint that returns all groups. Fetching groups often returns just an empty array (which is good, as it means there are simply no pending groups), but I am now also frequently getting groups like these:

[
    {
        "id": "6530246b25b8336ddc285058",
        "status": "maxed",
        "jobs": []
    },
    {
        "id": "642ac25f4b2ea56c60f5f157",
        "status": "maxed",
        "jobs": [
            {
                "name": "LIAM API.INT",
                "data": {
                    "workflowId": "65310c5893ebef7622ae1ac8",
                    "fingerprint": "1697713242404"
                },
                "opts": {
                    "attempts": 20,
                    "delay": 0,
                    "group": {
                        "id": "642ac25f4b2ea56c60f5f157"
                    },
                    "backoff": {
                        "delay": 10000,
                        "type": "fixed"
                    },
                    "jobId": "LIAM API.INT_64396127-32ce-4352-8013-056402e7e3f6",
                    "removeOnComplete": true
                },
                "id": "LIAM API.INT_64396127-32ce-4352-8013-056402e7e3f6",
                "progress": 0,
                "returnvalue": null,
                "stacktrace": [],
                "attemptsMade": 0,
                "delay": 0,
                "timestamp": 1697713242793,
                "queueQualifiedName": "bull:LIAM API.INT",
                "gid": "642ac25f4b2ea56c60f5f157"
            }
        ]
    },
    {
        "id": "65304473b145909c2307b358",
        "status": "maxed",
        "jobs": []
    }
]

As you can see, they are all maxed, often without any jobs at all. We also had a lot of load during that time from an integration team, which might have played into those maxed-out groups, but they still look very fishy.
Any idea what I could do to help triangulate those?
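One rough way to triangulate: dump the raw group bookkeeping entries with ioredis, using the same key layout as the resetGroup routine further down in this thread (bull:<queue>:groups, :max, :active:count). Queue name and Redis URI are placeholders:

// Sketch of a triangulation helper; key names mirror the reset routine below.
import Redis from "ioredis";

async function dumpGroupState(redisUri: string, queueName: string) {
    const redis = new Redis(redisUri);
    const groupPrefix = `bull:${queueName}:groups`;
    try {
        // groups currently flagged as maxed
        const maxed = await redis.zrange(`${groupPrefix}:max`, 0, -1);
        // per-group active counters (a non-zero count for a group with no running jobs is suspicious)
        const activeCounts = await redis.hgetall(`${groupPrefix}:active:count`);
        // groups still waiting to be picked up, with their scores
        const waiting = await redis.zrange(groupPrefix, 0, -1, "WITHSCORES");
        console.log({ maxed, activeCounts, waiting });
    } finally {
        await redis.quit();
    }
}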

Hey @hardcodet, I took a look at our changelog and at the changes between those versions, but so far I don't see anything suspicious: https://docs.bullmq.io/bullmq-pro/changelog. One question: did the upgrade include any other changes? I'm asking in order to isolate this.

Hi Rogger
No, it was a pretty isolated change; I didn't update any other packages either. But as mentioned, I'm still seeing maxed-out groups today, even though I rolled back to the old version a while ago. It has really been a coincidence that we also had massive load on that integration system (an external team was running heavy automation tests), so the issue might have been there before.

It's also noteworthy that those maxed-out groups disappear after a while. I usually see a few when querying for groups, but never too many, so they are not stuck indefinitely (although I just manually deleted one that seems to have been around for a few hours already).

One factor may be the automated cleanup routines I run on groups: a maintenance routine of mine detects group workflows that were due for execution but didn't run, and then resets the affected groups. I assume that routine would also clear out those empty groups without any jobs, though.

This is the routine that I wrote a while back with your help, which sometimes kicks in due to stuck groups (and kicked in a lot in the last few days):

/**
 * Resets the status of a given Bull group, which should force it to
 * resume any (stuck) jobs.
 */
async resetGroup(groupId: string, throwException: boolean) {
    const queue = this.getQueue(this.config.qualifiedAppName);

    // ad-hoc connection; assumes `import Redis from "ioredis";` at module level
    const redis = new Redis(this.config.redisUri, {
        maxRetriesPerRequest: 20,
        enableReadyCheck: false,
        retryStrategy: (times: number) => Math.min(Math.exp(times), 15000),
    });

    try {
        const groupPrefix = `bull:${queue.name}:groups`;
        // reset the group's active count to zero
        await redis.hdel(`${groupPrefix}:active:count`, groupId);
        this.logger.info(`Sent redis HDEL in order to reset the count of stuck group ${groupId}.`);
        // change the status of the group: first remove it from the max status...
        await redis.zrem(`${groupPrefix}:max`, groupId);
        this.logger.info(`Sent redis ZREM in order to reset the maxed status of stuck group ${groupId}.`);
        // ...then reinsert it into the groups set (we use a high score so that this group starts next):
        await redis.zadd(`${groupPrefix}`, 9999999, groupId);
        this.logger.info(`Sent redis ZADD in order to force BullMQ to resume processing of stuck group ${groupId}. The group's jobs should start executing again now.`);
        return true;
    } catch (err) {
        this.logger.error(`Failed to reset stuck group ${groupId}.`, err);
        if (throwException) {
            throw err;
        }
        return false;
    } finally {
        // close the ad-hoc connection so it doesn't leak
        await redis.quit();
    }
}

FYI, after simply deleting most pending workflows, and thereby breaking the rescheduling cycle, things have normalized again. There was one last empty maxed group that didn't want to go away, so I deleted it manually, and that was the end of it. I'm also back on 6.6.1 and haven't seen any more issues (it's safe to say that I will wait quite a while before rolling that out to PROD though; we'll have to monitor the situation over the next weeks).
So it might actually have been the flood of jobs that overloaded the system. Still, it would be good to know why it happened (and also what was going on with those empty maxed-out groups; I can't say whether they are to be expected or a helpful indicator of something).

Hi @hardcodet, we refactored how we track maxed groups in v7.9.2, in case you want to give it a try.