philips-labs/terraform-aws-github-runner

breaking release V2 to support multi-runners

npalm opened this issue ยท 22 comments

npalm commented

We have implemented mult-runners! With this change on the branch next. we combine other changes that are breaking to the current v1* release. In this issue we tracking all the PR part of the upcoming v2 release. Till that time, we push out v2.0.0-next.* releases for testing purpose. Currently most of the work is finished. We expect to merge all changes back by the end of December.

Once the work of next is finished we will apply the follong changes

  • The next branch will be merged back with a merge commit to develop
  • Semantic release will be replaced with release-please. This will help to keep a better track on new release via a release branch.
  • The develop branches will be removed, main will become the default.
  • A branch v1 will be created, we will not actively maintain this branch. But we will do our best to merge PR's for some time.

All the help and feedback is welcome, feel free to raise improvement PR's to the branch next.

Migration

Application tagged namespaces

In #2182 the Application tag on the runner is namespace, in this release the support for the old tag is removed (#2705) and only the new tag ghr:Application is used for scaling down. To ensure that olds instances got removed by the scale-down function it is important to first upgrade to 1.17. Alternative you can stop them manually.

Terraform 1.3+ required

We drop support for Terraform versions between lower then 1.3. This means versions 1.0.x till 1.2.x are not longer supported. The impact of upgrading from any Terraform version above 1 to 1.3.x should be minimal.

SSM Paths changed - Update required for tailored start-runner script (#2569)

Migration is only required when using your tailored start-runner script for example when building your own AMI. Before all parameters were based on convention. We have update them to be predictable with still similar conventions as default.

We set the tag ghr:ssm_config_path as EC2 instance tag to the path used for the configuration. In the config path we store the same values as before (run_as, enable_cloudwatch, agent_mode). The tokens are stored in a separate location which can be retrieved by looking up the value token_path in the config.

For a more details example check-out the following diffs:

Multi runner capability (#2519)

  • The multi-runner is available as a stand-alone module. However the change introduced a breaking change on the main module. The variable runner_enable_workflow_job_labels_check is not available under this module as we now only support the flagenable_runner_workflow_job_labels_check_all to check all/partial workflow labels to be present in the runner labels. Running TF apply on existing deployment will apply some changes done in internal modules to existing infrastructure.
  • Impact on using internal (submodules). There are some changes in the webhook internal module as per this feature. So, If you have been using it directly, the changes will be towards accepting runner configuration (queue details and label matchers) as a single object instead of earlier configuration of scattered queue details and criteria to match the workflow labels against the runner labels.

Removal support check_run (#1256

In the past we used for scaling the check_run event, since GitHub had now support for a workflow_job event. Now the event is upport for quite some time. And also support exists on GHES. We will phase out check_run support.

Ensure you configure the GitHub app to subscribe to workflow_job. In case you on an old GHES server version, the only. migration path is to update GHES or remain on old version of the module

Removal old scale-down mechanism

Migration is only required when you are on an older version than 0.19.0.

  • Option 1, upgrade to 2.x could cause orphan instances which will not be terminated by the scale-down function, thoses instances needs to be removed manual.
  • Option 2, upgrade first to a verions 1.17+ and keep this version running till all instances before the upgrade are terminated. Next upgrade to 2.x.

Default Node version and lambda architecture

The default for node is set to node 18 (LTS) and the default for the lambda architecture is changed from x64 to arm64. This architecture change has no impact on the runners (#2763).

Variables

Depcreated variables are removed from the module.

  • environment -> prefix
  • instance_type -> instance_types
  • market_options -> instance_target_capacity_type

The following variables are renamed for consistently. Old variables names will trigger an error with upgrade instruction.

  • enabled_userdata -> enable_userdata (#2784)
  • fifo_build_queue -> enable_fifo_build_queue (#2809)
  • runner_enable_workflow_job_labels_check_all -> enable_runner_workflow_job_labels_check_all (#2809)
rlove commented

Looking forward to this change.

npalm commented

Looking forward to this change.

Do you have time to test the pre-release? Any help to check the branch would be very welcome.

rlove commented

Looking forward to this change.

Do you have time to test the pre-release? Any help to check the branch would be very welcome.

I will try, but I might not have time. I am currently working part-time, as I learn to deal with Parkinson's and loss of vision in one eye.

@npalm Since this will be a major release (in the semver sense), would it be possible to also bump minimum terraform version to 1.3.0+, to support optional, so we could make e.g. the runner block_device_mappings variable more user friendly, by leveraging default values for some object attributes?

npalm commented

@jpalomaki good point, we kept the main module as backward compatible as we can. But agree would help a lot of other refactoring. Feel free to make a small PR to the next branch.

@npalm All right, I will scan the modules for places where a later terraform version might help, and then post a PR

Hey, thanks for the work on this, would you have an idea of an ETA?

npalm commented

Hey, thanks for the work on this, would you have an idea of an ETA?

Still one PR open. Open #2182, #2591, bump the version consistent to terraform 1.3+, and scan the module to any depcreated to remove. Any help is welcome.

Here's the terraform version version bump: #2639. I will do a follow-up PR for the EBS volume mapping variable to use optionals

@npalm Here's the EBS volume mappings variable refactoring: #2664. I think it ought to be backwards-compatible. Do the automated tests cover the usage of this variable already?

npalm commented

@npalm Here's the EBS volume mappings variable refactoring: #2664. I think it ought to be backwards-compatible. Do the automated tests cover the usage of this variable already?

Unit tests does not cover this, EBS block mappings is Terraform only and creates a launch template from which the unit tests are not aware. We have currently not automated end-to-end test.

npalm commented

https://github.com/philips-labs/terraform-aws-github-runner/releases/tag/v2.0.0-next.5

Only 2 PRs out to upgrade node, remove depcrecation and set default lambda arch to arm.

@npalm Minor thing: how about renaming variable enabled_userdata -> enable_userdata to be consistent with the other enable_* variables?

Very much looking forward to this update. I likely won't be able to test before the new year (unsure when you're planning on releasing) but if it's still prerelease in a month or so I'd be happy to give it a go ๐Ÿ‘

npalm commented

@npalm Minor thing: how about renaming variable enabled_userdata -> enable_userdata to be consistent with the other enable_* variables?

Thanks for spotting, feel free to raise a PR. I planning to merge next back in the last week of the year.

npalm commented

Good news, the branch is ready to merge. Cleanup up develop amd main.