buildkite/buildkite-agent-metrics

Error posting metrics to Cloudwatch

toothbrush opened this issue · 7 comments

Hello. First up, apologies if this is PEBKAC.

We've just tried upgrading from a rather ancient version of buildkite-agent-metrics (v1.5.0) to the latest version (v4.1.2). Of course that's quite a jump.

We run it as a lambda in AWS, which always worked fine, however after upgrading (and changing to the Agent Token etc) I get the following errors:

START RequestId: ... Version: $LATEST

02:43:18
2019/02/13 02:43:18 Collecting agent metrics for all queues

02:43:20
2019/02/13 02:43:20 Found organization "myorg"

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > BusyAgentCount=51

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=default > BusyAgentCount=8

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=deprecated-2 > TotalAgentCount=10

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=ios > UnfinishedJobsCount=1

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue= > ScheduledJobsCount=1

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=deploy-non-prod > TotalAgentCount=4

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=deploy-prod > IdleAgentCount=1

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > ScheduledJobsCount=2

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=deprecated-2 > TotalAgentCount=10

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=ios > TotalAgentCount=1

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue= > UnfinishedJobsCount=1

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=deploy-non-prod > BusyAgentCount=0

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=deploy-prod > IdleAgentCount=1

02:43:20
2019/02/13 02:43:20 Buildkite > Org=myorg > Queue=default > ScheduledJobsCount=0

02:43:20
2019/02/13 02:43:20 Extracted 66 cloudwatch metrics from results

02:43:20
2019/02/13 02:43:20 Submitting chunk of 10 metrics to Cloudwatch

02:43:20
2019/02/13 02:43:20 Submitting chunk of 10 metrics to Cloudwatch

02:43:20
InvalidParameter: 2 validation error(s) found. - minimum field size of 1, PutMetricDataInput.MetricData[8].Dimensions[0].Value. - minimum field size of 1, PutMetricDataInput.MetricData[9].Dimensions[0].Value. : ErrInvalidParams null
InvalidParameter: 2 validation error(s) found.
- minimum field size of 1, PutMetricDataInput.MetricData[8].Dimensions[0].Value.
- minimum field size of 1, PutMetricDataInput.MetricData[9].Dimensions[0].Value.
: ErrInvalidParams
null

Note that i have removed a lot of adjacent lines having the same "dimensions" (i.e., identical combinations of Org=x, Queue=y). Is the problem occurring because of the Queue= blank queue names? Or something else? The only environment variable I'm passing to the lambda is BUILDKITE_AGENT_TOKEN.

Thanks!

Having stared at the logs more carefully, i'm seeing a lot of duplicates. For instance, it appears to try and post all the metrics for deprecated-2 twice!

lox commented

Sorry for the slow response, it does look like an issue with blank queue names 🤔 Will do some testing.

With regards to the duplicates, we did that when we added an additional dimension of org-name. All the metrics get submitted with [queue-name] and then duplicated and submitted with [org-name, queue-name].

We will be dropping the older [queue-name] only metrics in the next major version.

lox commented

@toolmantim it looks like the agent metrics API might be returning empty queue names for jobs without an explicit queue.

Thoughts on whether we should handle empty queue names? They cause errors when posting to Cloudwatch.

Strange, I didn't think we would ever be returning an empty queue name. Sorry about that @toothbrush! Do you know which job it was that caused that? I was wondering if you could share the config/pipeline.yml?

(cc @novatan-rb)

Hi @toolmantim, thanks for the response. We have literally thousands of job step definitions over a hundred or so pipelines, so it's hard to say, from this aggregate view, which one is causing the issue – unless i'm missing something obvious. My apologies if so; how could i help you debug this?

I have a suspicion it might be our practise of not specifying queue names though – here's a job definition which is a lot like others we use across the organisation:

env:
  PROJECT_NAME: my-proj
  IMAGE_NAME: redbubble/frubbadygook

steps:
  - name: 'build'
    command: deploy.sh
    env:
      BUILD_ARGS: push clean
    concurrency: 1
    concurrency_group: my-proj/push

  - wait

  - name: 'test'
    command: deploy.sh
    env:
      BUILD_ARGS: test clean
    concurrency: 1
    concurrency_group: my-proj/test

  - wait
... et cetera ...

Could it be something like that causing the issue, do you think?

EDIT: to be extra clear, for cases where we need specific build agents, we'll add something like this to the step definition:

steps:
  - name: 'build'
    command: deploy.sh
    agents:
      queue: some-fancy-queue

Thanks @toothbrush! Leaving off the agent queue is totally normal, and it shouldn't end up with a blank queue name in the metrics. I thought maybe there was somewhere doing "queue=" or similar?

I've just done some testing, creating builds based on these four variants:

steps:
  - command: sleep 600
  - command: sleep 600
    agents:
      queue: ""
  - command: sleep 600
    agents:
      queue:
  - command: sleep 600
    agents:
      queue: default

But all I get is:

{
  "agents": {
    "idle": 0,
    "busy": 0,
    "total": 0,
    "queues": {}
  },
  "jobs": {
    "scheduled": 4,
    "running": 0,
    "waiting": 0,
    "total": 4,
    "queues": {
      "default": {
        "scheduled": 4,
        "running": 0,
        "waiting": 0,
        "total": 4
      }
    }
  },
  "organization": {
    "slug": "..."
  }
}

Or is a blank agent queue? (I can't seem to recreate that state locally though… all my attempts end up as queue=true)

I’m going to close this as stale but if it comes up again don’t hesitate to get in touch or reopen 🙇‍♂️