An AWS Lambda function that scales an Amazon EC2 Auto Scaling group (ASG) based on metrics provided by the Buildkite Agent Metrics API.
In practice, we've seen 300% faster initial scale-ups with this lambda vs native AutoScaling rules. 🚀
The Elastic CI Stack depends on being able to scale up quickly from zero instances in response to scheduled Buildkite jobs. Amazon's AutoScaling primitives have a number of limitations in areas where we wanted more granular control:
- The median time for a scaling event to be triggered was 2 minutes, due to needing two samples with a minimum period of 60 seconds between them.
- Scaling can be by a fixed rate, by a fixed step size, or by tracking, but tracking doesn't work well with custom metrics like the ones we use.
The lambda (or CLI version) polls the Buildkite Metrics API every 10 seconds and, based on the results, sets the DesiredCount to exactly what is needed. This allows much faster scale-up.
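The sizing step described above can be sketched as a small pure function. This is a minimal illustration, not the scaler's actual code: the function name desiredInstances, its parameters, and the clamping to a minimum and maximum group size are all assumptions made for the example. It assumes each instance runs a fixed number of agents (as AGENTS_PER_INSTANCE suggests) and that every scheduled or running job needs one agent.

```go
package main

import "fmt"

// desiredInstances is a hypothetical helper (not taken from the scaler's
// source). It returns the number of instances needed so that every
// scheduled or running job has an agent, clamped to [minSize, maxSize].
func desiredInstances(scheduled, running, agentsPerInstance, minSize, maxSize int) int {
	// Guard against a zero or negative divisor; assume one agent per instance.
	if agentsPerInstance < 1 {
		agentsPerInstance = 1
	}
	jobs := scheduled + running

	// Integer ceiling division: a partially filled instance still counts.
	want := (jobs + agentsPerInstance - 1) / agentsPerInstance

	if want < minSize {
		want = minSize
	}
	if want > maxSize {
		want = maxSize
	}
	return want
}

func main() {
	// 7 scheduled + 2 running jobs at 4 agents per instance needs 3 instances.
	fmt.Println(desiredInstances(7, 2, 4, 0, 10))
}
```

Because the target is computed directly from the job counts rather than stepped towards, a single poll can take the group from zero to the full size it needs.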
Whilst the lambda does support scaling in via setting DesiredCount, Amazon ASGs appear not to send Lifecycle Hooks before terminating instances, so jobs in progress are interrupted.
Instead, in the Elastic CI Stack we run the scaler with scale-in disabled (DISABLE_SCALE_IN) and rely on the recent addition in buildkite-agent v3.10.0 of --disconnect-after-idle-timeout in the Agent, combined with a systemd PostStop script, to terminate the instance and atomically decrease the DesiredCount after the agent has been idle for a time period. We've found this works really well, and it is less complicated than relying on lifecycled and Lifecycle Hooks.
See the forum post for more details.
The scaler collects its own metrics and doesn't require buildkite-agent-metrics. It can optionally publish the metrics it collects back to CloudWatch, although it only supports a subset of the metrics that the buildkite-agent-metrics binary collects:
- Buildkite > (Org, Queue) > ScheduledJobsCount
- Buildkite > (Org, Queue) > RunningJobCount
An AWS Lambda bundle is created and published as part of the build process. The lambda will require the following IAM permissions:
- cloudwatch:PutMetricData
- autoscaling:DescribeAutoScalingGroups
- autoscaling:DescribeScalingActivities
- autoscaling:SetDesiredCapacity
Its handler is bootstrap, it uses the provided.al2 runtime, and it requires the following env vars:
- BUILDKITE_AGENT_TOKEN or BUILDKITE_AGENT_TOKEN_SSM_KEY
- BUILDKITE_QUEUE
- AGENTS_PER_INSTANCE
- ASG_NAME
If BUILDKITE_AGENT_TOKEN_SSM_KEY is set, the token will be read from AWS Systems Manager Parameter Store using GetParameter, which can also read from AWS Secrets Manager.
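The token lookup described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the scaler's implementation: resolveToken and paramFetcher are hypothetical names, the Parameter Store call is replaced by an injected function so the example runs without AWS credentials, and giving the SSM key precedence over the plain token is an assumption drawn from the wording above.

```go
package main

import (
	"errors"
	"fmt"
)

// paramFetcher is a hypothetical stand-in for a Parameter Store lookup
// (for example a thin wrapper around GetParameter). It is injected so
// this sketch stays self-contained.
type paramFetcher func(key string) (string, error)

// resolveToken is a hypothetical helper. It returns the agent token,
// preferring the SSM key when it is set (an assumed precedence) and
// falling back to the plain BUILDKITE_AGENT_TOKEN value.
func resolveToken(env map[string]string, fetch paramFetcher) (string, error) {
	if key := env["BUILDKITE_AGENT_TOKEN_SSM_KEY"]; key != "" {
		return fetch(key)
	}
	if tok := env["BUILDKITE_AGENT_TOKEN"]; tok != "" {
		return tok, nil
	}
	return "", errors.New("set BUILDKITE_AGENT_TOKEN or BUILDKITE_AGENT_TOKEN_SSM_KEY")
}

func main() {
	// The parameter name below is made up for the example.
	env := map[string]string{"BUILDKITE_AGENT_TOKEN_SSM_KEY": "/example/agent-token"}

	// A fake fetcher that echoes the key instead of calling AWS.
	fake := func(key string) (string, error) { return "token-from-" + key, nil }

	tok, err := resolveToken(env, fake)
	fmt.Println(tok, err)
}
```

Injecting the fetcher keeps the decision logic testable on its own; a real deployment would pass a function backed by the AWS SDK instead of the fake.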
aws lambda create-function \
  --function-name buildkite-agent-scaler \
  --memory-size 128 \
  --role arn:aws:iam::account-id:role/execution_role \
  --runtime provided.al2 \
  --zip-file fileb://handler.zip \
  --handler bootstrap
$ aws-vault exec my-profile -- go run . \
  --asg-name elastic-runners-AgentAutoScaleGroup-XXXXX \
  --agent-token "$BUILDKITE_AGENT_TOKEN"
The BUILDKITE_AGENT_TOKEN is scoped to a specific cluster. It's best to create a unique token for the cluster being targeted by the scaler.
The scaler is set up automatically by the Elastic CI Stack's CloudFormation templates, which reference the agent token and a queue name. A Lambda function running the scaler is then generated using these references (e.g., BUILDKITE_AGENT_TOKEN_SSM_KEY and BUILDKITE_QUEUE).
Copyright (c) 2014-2019 Buildkite Pty Ltd. See LICENSE for details.