Buildkite Agent Scaler

An AWS lambda function that handles the scaling of an Amazon Autoscaling Group (ASG) based on metrics provided by the Buildkite Agent Metrics API.

In practice, we've seen 300% faster initial scale-ups with this lambda vs native AutoScaling rules. 🚀

Why?

The Elastic CI Stack depends on being able to scale up quickly from zero instances in response to scheduled Buildkite jobs. Amazon's AutoScaling primatives have a number of limitations that we wanted more granular control over:

The median time for a scaling event to be triggered was 2 minutes, due to needing two samples with a minimum period of 60 seconds between.
Scaling can either be by a fixed rate, a fixed step size or tracking, but tracking doesn't work well with custom metrics like we use.

How does it work?

The lambda (or cli version) polls the Buildkite Metrics API every 10 seconds, and based on the results sets the DesiredCount to exactly what is needed. This allows much faster scale up.

Gracefully scaling in

Whilst the lambda does support scaling in via setting DesiredCount, Amazon ASGs appear to not send Lifecycle Hooks before terminating instances, so jobs in progress are interrupted.

Instead, in the Elastic CI Stack we run the scaler with scale-in disabled (DISABLE_SCALE_IN) and rely on the recent addition in buildkite-agent v3.10.0 of --disconnect-after-idle-timeout in the Agent combined with a systemd PostStop script to terminate the instance and atomically decrease the DesiredCount after the agent has been idle for a time period. We've found it to work really well, and is less complicated than relying on lifecycled and Lifecycle Hooks.

See the forum post for more details.

Publishing Cloudwatch Metrics

The scaler collects it's own metrics and doesn't require buildkite-agent-metrics. It supports optionally publishing the metrics it collects back to Cloudwatch, although it only supports a subset of the metrics that the buildkite-agent-metrics binary collects:

Buildkite > (Org, Queue) > ScheduledJobsCount
Buildkite > (Org, Queue) > RunningJobCount

Running as an AWS Lambda

An AWS Lambda bundle is created and published as part of the build process. The lambda will require the following IAM permissions:

cloudwatch:PutMetricData
autoscaling:DescribeAutoScalingGroups
autoscaling:DescribeScalingActivities
autoscaling:SetDesiredCapacity

Its handler is bootstrap, it uses a provided.al2 runtime and requires the following env vars:

BUILDKITE_AGENT_TOKEN or BUILDKITE_AGENT_TOKEN_SSM_KEY
BUILDKITE_QUEUE
AGENTS_PER_INSTANCE
ASG_NAME

If BUILDKITE_AGENT_TOKEN_SSM_KEY is set, the token will be read from AWS Systems Manager Parameter Store GetParameter which can also read from AWS Secrets Manager.

aws lambda create-function \
  --function-name buildkite-agent-scaler \
  --memory 128 \
  --role arn:aws:iam::account-id:role/execution_role \
  --runtime provided.al2 \
  --zip-file fileb://handler.zip \
  --handler bootstrap

Running locally for development

$ aws-vault exec my-profile -- go run . \
  --asg-name elastic-runners-AgentAutoScaleGroup-XXXXX
  --agent-token "$BUILDKITE_AGENT_TOKEN"

Using Clusters

The BUILDKITE_AGENT_TOKEN is scoped to a specific cluster. It's best to create a unique token for the cluster being targeted by the scaler.

The scaler is set up automatically by the Elastic CI Stack's CloudFormation templates, which reference the agent token and a queue name. A Lambda function running the scaler is then generated using these references (e.g., BUILDKITE_AGENT_TOKEN_SSM_KEY and BUILDKITE_QUEUE).

buildkite/buildkite-agent-scaler