actions/runner

Support for autoscaling self-hosted GitHub runners

jwcmd opened this issue · 17 comments

jwcmd commented

Describe the enhancement
I'm looking for a way to put a self-hosted GitHub runner into an autoscale group.

I've discussed this with GitHub Support and they've explained that the registration tokens are only valid for one hour. That's problematic for an autoscale group because it means the group will fail to bring up a runner an hour after I deploy it. They recommended raising my issue here; I apologize if we've both missed an obvious solution for this.

Code Snippet
Not Applicable.

In AWS we do this:

GitHub App tokens:

  • We register a GitHub App which has permission to register runners
  • We store its secrets in AWS Secrets Manager
  • A CloudWatch scheduled event triggers a Lambda which generates a GitHub token for that app and stores it in Secrets Manager (a sketch follows this list)
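
A minimal sketch of what that refresh Lambda could look like (not our exact code; the app ID, installation ID, secret names, and the PyJWT/requests dependencies are placeholders):

```python
import time

import boto3
import jwt       # PyJWT
import requests

APP_ID = "123456"                        # hypothetical GitHub App ID
INSTALLATION_ID = "7890123"              # hypothetical installation ID
PRIVATE_KEY_SECRET = "github-app/private-key"
TOKEN_SECRET = "github-app/installation-token"

secrets = boto3.client("secretsmanager")

def handler(event, context):
    private_key = secrets.get_secret_value(SecretId=PRIVATE_KEY_SECRET)["SecretString"]

    # A GitHub App authenticates with a short-lived RS256-signed JWT.
    now = int(time.time())
    app_jwt = jwt.encode(
        {"iat": now - 60, "exp": now + 600, "iss": APP_ID},
        private_key,
        algorithm="RS256",
    )

    # Exchange the JWT for an installation access token (valid for one hour).
    resp = requests.post(
        f"https://api.github.com/app/installations/{INSTALLATION_ID}/access_tokens",
        headers={
            "Authorization": f"Bearer {app_jwt}",
            "Accept": "application/vnd.github+json",
        },
        timeout=10,
    )
    resp.raise_for_status()

    # Cache the token. If this refresh fails, consumers keep reading the
    # previous (still valid) version -- the "cached token" behavior below.
    secrets.put_secret_value(
        SecretId=TOKEN_SECRET,
        SecretString=resp.json()["token"],
    )
```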

Runner registration:

  • A CloudWatch event trigger is set up for EC2 instances entering the pending state.
  • Those events trigger a Lambda which looks at the Org and Repo tags of the launched instance to decide where to register the runner (this Lambda has permission to read the GitHub token secret)
  • The Lambda fetches a registration token and puts it into SSM Parameter Store, prefixed by the instance id (sketched below)
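
A rough sketch of that registration Lambda, triggered by the EC2 "pending" state-change event (the tag names, parameter path, and secret ID here are placeholders, not our exact setup):

```python
import boto3
import requests

ec2 = boto3.client("ec2")
ssm = boto3.client("ssm")
secrets = boto3.client("secretsmanager")

def handler(event, context):
    instance_id = event["detail"]["instance-id"]

    # Read the Org/Repo tags to decide where the runner registers.
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    tags = {t["Key"]: t["Value"] for t in reservations[0]["Instances"][0]["Tags"]}
    org = tags["Org"]

    # The cached GitHub App installation token written by the refresh Lambda.
    app_token = secrets.get_secret_value(
        SecretId="github-app/installation-token"
    )["SecretString"]

    # Ask GitHub for a runner registration token for the org.
    resp = requests.post(
        f"https://api.github.com/orgs/{org}/actions/runners/registration-token",
        headers={"Authorization": f"token {app_token}",
                 "Accept": "application/vnd.github+json"},
        timeout=10,
    )
    resp.raise_for_status()

    # Stash it in Parameter Store under an instance-id-prefixed key.
    ssm.put_parameter(
        Name=f"/runners/{instance_id}/registration-token",
        Value=resp.json()["token"],
        Type="SecureString",
        Overwrite=True,
    )
```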

The runner instance:

  • The IAM role for that instance has permission to read Parameter Store values for its instance id
  • On boot it polls Parameter Store waiting for a registration token (see the sketch after this list)
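
The boot-time poller could look something like this (a sketch; the parameter path, org URL, and polling interval are assumptions, and it assumes it runs from the runner install directory):

```python
import subprocess
import time

import boto3
import requests

ssm = boto3.client("ssm")

# The instance discovers its own id from the EC2 metadata service.
instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).text

# Poll until the registration Lambda has written the token.
while True:
    try:
        token = ssm.get_parameter(
            Name=f"/runners/{instance_id}/registration-token",
            WithDecryption=True,
        )["Parameter"]["Value"]
        break
    except ssm.exceptions.ParameterNotFound:
        time.sleep(5)  # usually present on the first poll; keep waiting on blips

# Hand the token to the runner's config script (run from the install dir).
subprocess.run(
    ["./config.sh", "--url", "https://github.com/my-org",
     "--token", token, "--unattended"],
    check=True,
)
```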

The ASGs:

  • We run an ASG for each kind of runner we want, and it uses the Org/Repo tags for its instances
  • When running ephemeral runners (we're waiting on GitHub to finish that feature, so we don't do this yet): when a job gets picked up by the runner, it removes itself from the ASG, causing a new runner to be launched and booted (a detach sketch follows this list). In this mode, "desired" = how many pre-warmed runners we want on standby to pick up new jobs; our max number of concurrent jobs is limited only by our account's EC2 limits.
  • When running a fixed pool of reusable runners we use scheduled scaling events
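
The "remove itself from the ASG" step can be done by detaching without decrementing desired capacity, which makes the ASG backfill with a fresh runner. A hypothetical sketch of just that step (detecting the job start is the hard part, discussed further down the thread):

```python
import boto3
import requests

# The instance looks up its own id via the EC2 metadata service.
instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).text

autoscaling = boto3.client("autoscaling")

# Find which ASG this instance belongs to.
info = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
asg_name = info["AutoScalingInstances"][0]["AutoScalingGroupName"]

# Detach; ShouldDecrementDesiredCapacity=False means the ASG launches a
# fresh pre-warmed runner to replace this one.
autoscaling.detach_instances(
    InstanceIds=[instance_id],
    AutoScalingGroupName=asg_name,
    ShouldDecrementDesiredCapacity=False,
)
```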

We do registration this way so that:

  • Runner VMs never have access to GitHub app creds or tokens
  • The registration lambda only has access to the token, not the long-lived creds
  • If GitHub has a blip (heh) we have the cached token: we try to refresh it every 30 mins, but it's good for an hour, and we'll keep using the old one for that long if new tokens are failing
  • The instance polls for the registration token: it's usually there on the first poll, but if GitHub or AWS were to have a blip it will still work fine
  • It's also easy to manually launch a runner in the console if needed: just specify the Org/Repo tags

@j3parker thank you for the solution!
With the ephemeral runners you describe it's a kind of autoscaling, because the instances that remain in the ASG are just idle runners.
A question about ephemeral runners: how do you catch the event when a job gets picked up by the runner? Do you also terminate such an ephemeral runner after the job completes?
Would be nice to know some details.

A question about ephemeral runners: how do you catch the event when a job gets picked up by the runner?

Fantastic question -- I opened an issue about that over here: #699

We've prototyped a few hacks to detect when a job is started (to remove the runner from the ASG, triggering a new one to start booting to replace it + ASG policies to scale-up). We're just waiting patiently for ephemeral runners to be supported 😄

Do you also terminate such an ephemeral runner after the job completes?

Our plan is to terminate, yes. Vaguely I'm assuming the runner will exit and we will trigger a shutdown. You can configure an EC2 instance to terminate on shutdown.
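
For reference, that setting is the instance-initiated shutdown behavior, chosen at launch; a minimal boto3 sketch (the AMI id and instance type are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# With InstanceInitiatedShutdownBehavior="terminate", an OS-level
# `shutdown` terminates the instance instead of just stopping it.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical runner AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    InstanceInitiatedShutdownBehavior="terminate",
)
```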


Spinning up VMs for builds might be expensive. We do a fair clip of builds during the day so one option I'm mulling is to use firecracker rather than VMs, but you need to buy a whole (metal) instance for that. We haven't costed out if that would make sense for us yet.

Hopefully in the long-term someone will develop a turn-key AWS solution that can do a mix of spot-based instances for small load and bulk firecracker-based ones for better latency at scale.

We've prototyped a few hacks to detect when a job is started (to remove the runner from the ASG, triggering a new one to start booting to replace it + ASG policies to scale-up). We're just waiting patiently for ephemeral runners to be supported 😄

I guess GitHub is going to present something new in Q3,
but we need a working solution until that happens.

Our plan is to terminate, yes. Vaguely I'm assuming the runner will exit and we will trigger a shutdown. You can configure an EC2 instance to terminate on shutdown.

@j3parker good tip, thank you.
In my prototype I'm checking if the runner is busy via the GitHub API (/actions/runners). If it's busy, the script removes it from the ASG.
On the next run, the script checks whether the runner is not busy and no longer part of the ASG (no "aws:autoscaling:groupName" tag), then deregisters the runner from GitHub and shuts down/terminates the instance. The only problem is that this checking script runs from cron every minute, so if a job takes less than a minute there is a chance this logic will not work.
My goal is to detect that the runner is busy immediately, not once per minute. Maybe a filewatcher service that detects new Worker* files in the runner's "_diag" folder will help.
But this looks promising for my setup.
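
Roughly, the logic of my script is this (a simplified sketch, not my actual code; the org name, the token plumbing, and the assumption that the runner name matches the instance hostname are all placeholders):

```python
import boto3
import requests

ORG = "my-org"                                      # hypothetical org
HEADERS = {"Authorization": "token <github-token>", # e.g. from Secrets Manager
           "Accept": "application/vnd.github+json"}

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).text
runner_name = requests.get(
    "http://169.254.169.254/latest/meta-data/hostname", timeout=2
).text

# Look up this runner in the org's runner list.
runners = requests.get(
    f"https://api.github.com/orgs/{ORG}/actions/runners",
    headers=HEADERS, timeout=10,
).json()["runners"]
me = next(r for r in runners if r["name"] == runner_name)

# Check whether the instance is still attached to an ASG via its tags.
tags = {t["Key"]: t["Value"] for t in ec2.describe_instances(
    InstanceIds=[instance_id]
)["Reservations"][0]["Instances"][0]["Tags"]}
in_asg = "aws:autoscaling:groupName" in tags

if me["busy"] and in_asg:
    # First pass: a job is running, so pull the instance out of the ASG.
    autoscaling.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=tags["aws:autoscaling:groupName"],
        ShouldDecrementDesiredCapacity=False,
    )
elif not me["busy"] and not in_asg:
    # Second pass: the job finished; deregister and terminate.
    requests.delete(
        f"https://api.github.com/orgs/{ORG}/actions/runners/{me['id']}",
        headers=HEADERS, timeout=10,
    )
    ec2.terminate_instances(InstanceIds=[instance_id])
```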

Spinning up VMs for builds might be expensive. We do a fair clip of builds during the day so one option I'm mulling is to use firecracker rather than VMs, but you need to buy a whole (metal) instance for that. We haven't costed out if that would make sense for us yet.

Why is it expensive, given that EC2 instances are currently billed per second?
Also, spin-up time for ephemeral runners can be improved by baking your own images with the runner pre-installed. Registering the runner is also just a few-seconds task.

In my prototype I'm checking if the runner is busy via the GitHub API (/actions/runners). If it's busy, the script removes it from the ASG.

Nice! That is simple.

Why is it expensive, given that EC2 instances are currently billed per second?

Oh sorry, that was unclear. I meant in terms of time (there is latency to spin up a machine). Spinning up hot capacity in the background can hide that from users, but of course you're also paying for that. With enough concurrent builds it could be worth it (both in terms of money and managing perceived latency) to rent an entire machine from AWS and use Firecracker (which boots things faster than EC2; e.g. it's what powers AWS Lambda).

An i3.metal (required if you want to use Firecracker) has 72 vCPUs, so if you're running 2-vCPU agents that's probably only going to make sense if you're doing >36 or so concurrent builds. You pay for these by the second too, though, and can theoretically buy them on the spot market (I'm not sure if availability is good).

Also, spin-up time for ephemeral runners can be improved by baking your own images

😄 We do that by taking actions/virtual-environments, which defines the GitHub-hosted runners, and patching the Packer files with jsonnet to tweak things for our purposes (and install the runner exe). I definitely recommend it. You need to keep up with versions of the runner so that when your VM connects to GitHub it doesn't accept a job and then download a newer version of the runner (we have a scheduled GitHub Action that polls for new releases of the runner).
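
The polling job itself is a scheduled GitHub Action; as a rough sketch of the check it performs (BAKED_VERSION is a placeholder, and the real job's details differ):

```python
import requests

BAKED_VERSION = "2.300.0"  # hypothetical version baked into the image

# Fetch the latest actions/runner release tag (e.g. "v2.300.1").
latest = requests.get(
    "https://api.github.com/repos/actions/runner/releases/latest",
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
).json()["tag_name"].lstrip("v")

if latest != BAKED_VERSION:
    # e.g. fail the scheduled job so someone (or automation) rebuilds the AMI
    raise SystemExit(f"Runner {latest} released; image has {BAKED_VERSION}")
```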

I'm doing a project like this using GCP Preemptible VMs, but there are some issues:

  • Instance startup: I need 65s to create a new instance and register the runner. I use a startup script to register the runner with GitHub.
  • Cache: I have no idea how to resolve this issue; maybe Google Cloud Storage.

I'm switching to Google Cloud Build. I think it's easier.

GitLab has supported this feature for a long time via the GitLab Runner Manager.
It seems GitHub has no intention of supporting autoscaling of self-hosted runners (AWS, GCP),
because they're trying to build something like Azure DevOps.

Waiting for this feature to run on AWS ECS Fargate.

@vietanhduong, how did you implement that in GCP? I'm trying to use a MIG with runners on it.
Do you have any details on how you made it work?

GitLab has supported this feature for a long time via the GitLab Runner Manager. It seems GitHub has no intention of supporting autoscaling of self-hosted runners (AWS, GCP), because they're trying to build something like Azure DevOps.

How will they make you pay if runners are easy to autoscale? It's similar to "planned obsolescence"; this would be an "authentication nightmare".

You can create a simple cronjob to regenerate the token every 30 minutes, say. I created a scalable environment in an ECS cluster, and sometimes the containers die after more than 1h; before the runner is unsubscribed, a function refreshes the token.
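
A minimal sketch of such a refresh, assuming a GitHub token with the right admin scope (the org name and token plumbing are placeholders):

```python
import requests

ORG = "my-org"  # hypothetical

def refresh_registration_token(github_token: str) -> str:
    """Fetch a fresh runner registration token (valid for one hour)."""
    resp = requests.post(
        f"https://api.github.com/orgs/{ORG}/actions/runners/registration-token",
        headers={"Authorization": f"token {github_token}",
                 "Accept": "application/vnd.github+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["token"]
```

Run it from cron every 30 minutes and the cached token is always comfortably inside the one-hour expiry window.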

I would suppose that's because the features that doc is written around are fairly new, released 20 Sept. :D

https://github.blog/changelog/2021-09-20-github-actions-ephemeral-self-hosted-runners-new-webhooks-for-auto-scaling/

Strange that no one is pointing to the docs on this:

docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners

ashb commented

I've just noticed this warning in the logs of my runner:

Nov 10 10:11:54 ip-172-31-28-50 run.sh[12322]: Warning: '--once' is going to be deprecated in the future, please consider using '--ephemeral' during runner registration.
Nov 10 10:11:54 ip-172-31-28-50 run.sh[12322]: https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling

However this won't work for us as a project in the Apache org unless something has changed about the permissions around registering runners -- in order to register a new runner we need a token to be created, and creating a runner in an org-wide group requires Admin permissions, which we as members of the project don't have (only the central members of the ASF Infra team have that).

It has not changed as per https://docs.github.com/en/rest/reference/actions#self-hosted-runners

In order to create a registration token for an org group (i.e. not belonging to a single repo) I'll need an access token with Admin rights on the org:

GitHub Apps must have the administration permission for repositories or the organization_self_hosted_runners permission for organizations. Authenticated users must have admin access to the repository or organization to use this API.

If this goes ahead then Apache projects won't be able to have single-shot runners anymore.

Hi,

In the CloudWatch logs I see that the Lambda triggers the scale-up function, but it is not creating the EC2 instance, and the job builds are not being queued in SQS either. If my understanding is right, whenever a job is queued it should be posted to the SQS queue, and from there the scale-up Lambda picks it up. But that is not happening. I'm not seeing any messages arrive in SQS; the available message count is always "0".

CloudWatch logs for the scale-up function

2022-07-22 17:55:35.045 INFO [scale-up:b0c371ee-c099-xxxxxxxx index.js:1142xx scaleUp] Received workflow_job from xxxxxxx
{}
2022-07-22 17:55:35.060 INFO [scale-up:b0c371ee-c099-5a8c-ba85-2aba264b3b98 index.js:114235 scaleUp] Received event
{
  "runnerType": "Org",
  "runnerOwner": "xxxxxxx",
  "event": "workflow_job",
  "id": "xxxxxx"
}

Disclaimer: This doesn't answer the actual question, but suggests an alternative:

You can achieve this easily with https://cirun.io/ It creates on-demand runners for GitHub Actions on your cloud and manages the complete lifecycle. You simply connect your cloud provider and define which runners you need in a simple YAML file, and that's it.

See https://docs.cirun.io/reference/examples.html#aws for an example.