Scaledkite helps you reduce AWS spend on your CI workers by running workers only when you need them. It also helps you avoid build queuing by making a limitless (in theory) number of workers available.
Note: Scaledkite can really only help you reduce CI spend if you're already paying for an EKS cluster that you can run your workers in.
Buildkite is one of the few providers that is an AWS-recognized integrator of AWS EventBridge. EventBridge is a hosted message bus that allows you to attach rules to an event bus that are evaluated on every message sent to the bus. Those rules can do things like trigger Lambda events, which is what we're going to do here! Buildkite supports sending messages to EventBridge for a number of different events -- those are documented here -- but we're just going to focus on the Job Scheduled event.
Whenever EventBridge receives a message on our Buildkite event bus, it will evaluate a rule that checks to see if the `detail-type` of that event is "Job Scheduled". If it is, we'll have the rule kick off our Lambda function with the payload it received.
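As a rough illustration of the check described above, the sketch below decodes the standard EventBridge envelope and tests its `detail-type` field. In the real setup the EventBridge rule does this filtering before the function runs, so this is only a defensive re-check. The type and function names (`envelope`, `isJobScheduled`) are made up for this example and are not taken from Scaledkite's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope holds the subset of the EventBridge event envelope used here.
// The name is hypothetical; only the JSON field names come from EventBridge.
type envelope struct {
	DetailType string          `json:"detail-type"`
	Detail     json.RawMessage `json:"detail"` // Buildkite payload, left undecoded
}

// isJobScheduled reports whether the raw event is a "Job Scheduled" event.
func isJobScheduled(raw []byte) (bool, error) {
	var ev envelope
	if err := json.Unmarshal(raw, &ev); err != nil {
		return false, fmt.Errorf("decoding event: %w", err)
	}
	return ev.DetailType == "Job Scheduled", nil
}

func main() {
	// Made-up sample event; real events carry many more fields.
	sample := []byte(`{"detail-type": "Job Scheduled", "detail": {}}`)
	ok, err := isJobScheduled(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("job scheduled:", ok)
}
```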
(We're assuming that you already have an EKS cluster, that you already have an IAM role configured with cluster access that your Lambda function can assume, and that you're running your agents in the `buildkite` namespace.)
- Create a secret with your Buildkite token:
$ kubectl create secret generic buildkite-agent-token --from-literal token=INSERT-AGENT-TOKEN-HERE --namespace=buildkite
- Create the `buildkite-env-vars` secret containing the following keys and values: `DOCKER_LOGIN_USER`, `DOCKER_LOGIN_PASSWORD`, `GITHUB_TOKEN`. (The `DOCKER_LOGIN_*` env vars are our Docker Hub bot account login; `GITHUB_TOKEN` is a custom env var we use for installing private gems, etc. from GitHub in CI.)
- Create the `buildkite-agent-git-credentials` secret using a `git-credentials` file for a bot account as documented here:
$ kubectl create secret generic buildkite-agent-git-credentials --from-file=./git-credentials --namespace=buildkite
- Create your Lambda function using the payload generated by `make build`. The handler should be `main`, the runtime `go1.x`, the timeout 30 seconds, and the memory 128 MB. You'll need to set up a few environment variables in your function, too -- they're documented below.
- Follow Buildkite's instructions [here](https://buildkite.com/docs/integrations/amazon-eventbridge#configuring) on setting up AWS EventBridge notifications within the Buildkite and AWS consoles.
- Create a rule on your `aws.partner/buildkite.com/......` event bus to trigger your Lambda function. For the rule pattern, select `Event pattern` -> `Custom pattern`, and fill in:
{
  "account": [
    "<your AWS account number>"
  ],
  "detail-type": [
    "Job Scheduled"
  ]
}
In the target section, select `Lambda function` and then select your newly-created function.
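To make the rule pattern's semantics concrete: in an EventBridge event pattern, every listed field has to match, and a field matches when the event's value equals any one of the values in its array. The sketch below models only that exact-match case for top-level string fields. It is a simplified illustration, not EventBridge's actual matcher, and the `pattern` type and `matches` function are hypothetical names.

```go
package main

import "fmt"

// pattern maps a top-level event field to its list of accepted values.
type pattern map[string][]string

// matches returns true when every field in the pattern is present in the
// event and the event's value equals one of the accepted values.
func matches(p pattern, event map[string]string) bool {
	for field, allowed := range p {
		value, present := event[field]
		if !present || !contains(allowed, value) {
			return false
		}
	}
	return true
}

// contains reports whether want appears in values.
func contains(values []string, want string) bool {
	for _, v := range values {
		if v == want {
			return true
		}
	}
	return false
}

func main() {
	p := pattern{
		"account":     {"123456789012"}, // placeholder account number
		"detail-type": {"Job Scheduled"},
	}
	event := map[string]string{
		"account":     "123456789012",
		"detail-type": "Job Scheduled",
	}
	fmt.Println("rule fires:", matches(p, event))
}
```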
The following environment variables can be configured on your Lambda function for Scaledkite:
- [required] `cluster` - the EKS cluster to authenticate to
- [required] `arn` - the IAM role ARN that Scaledkite should use for EKS cluster access
- [required] `buildkite_token` - your Buildkite agent token
- [optional] `namespace` - the namespace your worker agents will run in
- [optional] `pod_prefix` - the prefix used for created k8s jobs/pods
- [optional] `image` - the `buildkite-agent` Docker image to use -- remember that you'll need a custom one with the `pre-exit` hook (see #Quirks)
- [optional] `region` - the region your EKS cluster is in, not required if your Lambda function is running in the same region as your cluster
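A minimal sketch of how a function might read the variables listed above and fail fast when a required one is missing. This is not Scaledkite's actual code: the `config` struct and `loadConfig` function are hypothetical, and optional values are simply left empty when unset because the defaults aren't documented here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// config mirrors the documented environment variables (hypothetical type).
type config struct {
	Cluster, ARN, BuildkiteToken        string // required
	Namespace, PodPrefix, Image, Region string // optional, empty when unset
}

// loadConfig reads settings through getenv (normally os.Getenv) and returns
// an error naming every required variable that is missing.
func loadConfig(getenv func(string) string) (config, error) {
	cfg := config{
		Cluster:        getenv("cluster"),
		ARN:            getenv("arn"),
		BuildkiteToken: getenv("buildkite_token"),
		Namespace:      getenv("namespace"),
		PodPrefix:      getenv("pod_prefix"),
		Image:          getenv("image"),
		Region:         getenv("region"),
	}
	required := []struct{ name, value string }{
		{"cluster", cfg.Cluster},
		{"arn", cfg.ARN},
		{"buildkite_token", cfg.BuildkiteToken},
	}
	var missing []string
	for _, r := range required {
		if r.value == "" {
			missing = append(missing, r.name)
		}
	}
	if len(missing) > 0 {
		return config{}, fmt.Errorf("missing required environment variables: %s",
			strings.Join(missing, ", "))
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("cluster:", cfg.Cluster)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the loader easy to exercise with a plain map in tests.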
- This tool runs Docker-in-Docker containers as privileged in your cluster. This can be somewhat mitigated by running Buildkite pods on segregated nodes (which we do here via workload selectors).
- Scaledkite only creates workers for jobs where the `agent_query_rules` are `queue=dynamic`. See the test CI script in `ci/` for an example (or fork this and edit it to match what you need).
- You probably shouldn't use Scaledkite for steps involving image builds, since there's no layer caching support. (We split image builds off into a separate queue at Basecamp.)
- Scaledkite relies on a custom `buildkite-agent` Docker image that has a `pre-exit` hook that deletes the `docker-dind` sidecar from the worker pod. Until proper sidecar support lands in Kubernetes, this is one of the better options that doesn't require weird entrypoint signal handling to get our `docker-dind` container to shut down when the `buildkite-agent` container does.
- There's an off-chance that a message to EventBridge isn't delivered and an agent isn't scheduled for a specific task. It could be worth running an agent or two with `BUILDKITE_AGENT_TAGS` set to `queue=dynamic` to pick them up.
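The `queue=dynamic` restriction mentioned above could be checked along these lines. This sketch assumes the event's `detail` carries a `job` object whose `agent_query_rules` is a list of strings, and it treats a job as eligible when any rule equals `queue=dynamic`; Scaledkite's real comparison may be stricter. The names `jobScheduledDetail` and `wantsDynamicQueue` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobScheduledDetail models only the field this check needs. The payload
// shape is an assumption for this sketch.
type jobScheduledDetail struct {
	Job struct {
		AgentQueryRules []string `json:"agent_query_rules"`
	} `json:"job"`
}

// wantsDynamicQueue reports whether the job targets the dynamic queue.
func wantsDynamicQueue(detail []byte) (bool, error) {
	var d jobScheduledDetail
	if err := json.Unmarshal(detail, &d); err != nil {
		return false, fmt.Errorf("decoding detail: %w", err)
	}
	for _, rule := range d.Job.AgentQueryRules {
		if rule == "queue=dynamic" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Made-up payload fragment for illustration.
	detail := []byte(`{"job": {"agent_query_rules": ["queue=dynamic"]}}`)
	ok, err := wantsDynamicQueue(detail)
	if err != nil {
		panic(err)
	}
	fmt.Println("create worker:", ok)
}
```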
- Enable Fargate support. It's simple -- just add the annotation to the generated Job config, but we don't use it at Basecamp because we use Docker in CI.
- Move `buildkite_token` to a real secret that isn't a Kubernetes secret.
- Make more things configurable (resource requests, etc.)
- Stop relying on the `pre-exit` hook to stop the `docker-dind` container
- Accept a string of ECR account IDs and regions to authenticate to in the `environment` hook, rather than just assuming we only need us-east-1 in the account the agents are running in.
- Switch to SSH keys for GitHub auth
- EKS cluster authentication code from https://github.com/nbrandaleone/eksClient