buildkite/buildkite-agent-metrics

UnfinishedJobsCount stuck at >0


Another issue, which may be completely unrelated to this codebase. Please redirect me elsewhere if necessary.

My org uses a few different queue names for targeting build agents; we currently have default, deploy-non-prod, deploy-prod, ios, and deprecated-2. I was troubleshooting an issue where certain groups of agents weren't scaling down during quiet periods. We use ASG scaling rules based on CloudWatch metric thresholds, and of course we publish those metrics with your buildkite-agent-metrics Lambda.
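For reference, here's roughly how we double-check the raw per-queue values the scaling policy is reacting to. This is a minimal sketch with boto3, assuming the collector publishes to its default `Buildkite` namespace with a `Queue` dimension (adjust if you've overridden either), and the region/queue names are just our setup:

```python
# Sketch: print recent per-queue UnfinishedJobsCount datapoints from CloudWatch.
# Assumes the buildkite-agent-metrics defaults: namespace "Buildkite", dimension "Queue".
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

QUEUES = ["default", "deploy-non-prod", "deploy-prod", "ios", "deprecated-2"]
now = datetime.datetime.now(datetime.timezone.utc)

for queue in QUEUES:
    resp = cloudwatch.get_metric_statistics(
        Namespace="Buildkite",
        MetricName="UnfinishedJobsCount",
        Dimensions=[{"Name": "Queue", "Value": queue}],
        StartTime=now - datetime.timedelta(hours=12),
        EndTime=now,
        Period=3600,           # one datapoint per hour
        Statistics=["Maximum"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(f"{queue:>16} {point['Timestamp']:%Y-%m-%d %H:%M} max={point['Maximum']:.0f}")
```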

The problem we noticed is that even over long periods including quiet times (e.g., a week), some of the queues were always reporting an UnfinishedJobsCount of 1 or 2. I'm unable to find any jobs in the Buildkite UI that appear to have been running that long.

Here's a screenshot of our CloudWatch metrics: notice how in the middle of the night UnfinishedJobsCount is still > 0, even though I can't see agents running any jobs in the UI. Is there a better way to track these down without simply terminating all of the build agent instances?

[Screenshot: CloudWatch metrics showing UnfinishedJobsCount > 0 overnight]

Do you have suggestions for me to debug this? Thank you! 🙌

It turns out we did indeed have some jobs kicking around, but they were really hard to find in the Buildkite UI. Eventually we found https://www.buildkite.com/our-org-slug/builds, filtered on "Running", and cancelled a few builds that had been sitting there for weeks waiting on incorrectly configured concurrency groups. This is fixed, but making stuck jobs more easily discoverable would be an awesome feature 👍

FTR, the URL is actually https://buildkite.com/organizations/<slug>/builds?state=running, just had the same problem 😄
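If you'd rather surface these programmatically instead of clicking through the UI, here's a rough sketch against the Buildkite REST API's builds listing, filtered by state. It assumes an API access token with the read_builds scope; the `BUILDKITE_ORG` / `BUILDKITE_API_TOKEN` environment variables and the "older than a day" threshold are just placeholders for illustration:

```python
# Sketch: list builds that have been in the "running" state for a long time,
# to surface jobs stuck on things like mis-configured concurrency groups.
import datetime
import os

import requests

ORG = os.environ["BUILDKITE_ORG"]          # your org slug
TOKEN = os.environ["BUILDKITE_API_TOKEN"]  # token with read_builds scope

resp = requests.get(
    f"https://api.buildkite.com/v2/organizations/{ORG}/builds",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"state": "running", "per_page": 100},
    timeout=30,
)
resp.raise_for_status()

now = datetime.datetime.now(datetime.timezone.utc)
for build in resp.json():
    started_at = build.get("started_at")
    if not started_at:
        continue
    started = datetime.datetime.fromisoformat(started_at.replace("Z", "+00:00"))
    age = now - started
    # Flag anything that has been running for more than a day.
    if age > datetime.timedelta(days=1):
        print(f"{build['pipeline']['slug']} #{build['number']} "
              f"running for {age.days}d: {build['web_url']}")
```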