buildkite/buildkite-agent-metrics

Add support for refreshing of agent token at runtime

Closed this issue · 4 comments

Description

The buildkite-agent-metrics binary does not support updating the Buildkite agent token at runtime. If the token gets cycled/rotated, the application can no longer ingest metrics.

Solution

Add support for updating agent token at runtime.

Ideas

Create a flag for opt-in refresh behavior. When the flag is set, the program starts a goroutine with a run loop that checks every X seconds whether the agent token is still valid and, if not, updates the token. For example, in an AWS setting we can refetch the token from SSM - following similar semantics for other backends.

Open to additional ideas.

Hi @NotArpit. Thanks for using buildkite-agent-metrics.

Can you provide more details about how you're running this?

We've had a look at the code, and it looks like if you're using a lambda and the SSM or SecretsManager integrations (i.e. the variables BUILDKITE_AGENT_TOKEN_SSM_KEY or BUILDKITE_AGENT_SECRETS_MANAGER_SECRET_ID are set), then the next execution of the lambda should pick up changes to the token.

If you're not using a lambda, or the token is sourced from an environment variable, then it's the responsibility of the user to populate the environment with the correct token.

We think we can improve the situation in the non-lambda case by terminating the process when there is a permissions error response from Buildkite, but it would still only read what's in the environment. Typically, these are run in containers, so restarting the process will restart the container, and it will be populated with an environment containing the updated token.

If you're in the latter situation, please let us know and we can make this change for you.

Hey @triarius - that's correct, we're in the latter situation where the token gets sourced from env vars (we aren't in a FaaS context). So when the token gets updated it runs into permission issues as it does not refresh at runtime.

> We think we can improve the situation in the non-lambda case by terminating the process when there is a permissions error response from Buildkite

Ideally I'd like the process to signal that it has run into a permissions issue (e.g. via SIGTERM) so that we can refresh the environment variable with a valid token and then restart the process via signal handlers. I think your idea would achieve that - happy to put a PR up once we agree on an implementation :)

> Ideally I'd like the process to signal that it has run into a permissions issue (e.g. via SIGTERM) so that we can refresh the environment variable with a valid token and then restart the process via signal handlers. I think your idea would achieve that - happy to put a PR up once we agree on an implementation :)

I think you can use exit codes to detect whether the process exited because the auth token is unauthorised. Then, in the code that launches this executable, you can retry with a refreshed auth token when you detect that exit code. That seems a bit simpler than signal handling.
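The launcher side of that suggestion might look roughly like the sketch below. The exit code value `4`, the environment variable name passed to the child, and the `runChild` and `supervise` helpers are assumptions for illustration only. The demo in `main` uses a fake runner so that it does not depend on any real binary.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// exitUnauthorized is a hypothetical exit code meaning "token was rejected".
const exitUnauthorized = 4

// runChild starts the binary at path with the token in its environment and
// returns its exit code. The variable name is an assumption.
func runChild(path, token string) (int, error) {
	cmd := exec.Command(path)
	cmd.Env = append(os.Environ(), "BUILDKITE_AGENT_TOKEN="+token)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil // child ran and exited non-zero
	}
	return 0, err // err is nil on a clean exit, non-nil if it never started
}

// supervise fetches a token, runs the child, and relaunches with a freshly
// fetched token while the child exits with exitUnauthorized, up to
// maxRetries relaunches. It returns the last exit code.
func supervise(run func(token string) (int, error),
	fetchToken func() (string, error), maxRetries int) (int, error) {
	for attempt := 0; ; attempt++ {
		token, err := fetchToken()
		if err != nil {
			return 0, err
		}
		code, err := run(token)
		if err != nil {
			return 0, err
		}
		if code != exitUnauthorized || attempt >= maxRetries {
			return code, nil
		}
	}
}

func main() {
	// Fake token source: the first token is stale, the second is valid.
	tokens := []string{"old", "new"}
	next := 0
	fetchToken := func() (string, error) {
		t := tokens[next%len(tokens)]
		next++
		return t, nil
	}
	// Fake runner standing in for runChild("/path/to/binary", token).
	launches := 0
	run := func(token string) (int, error) {
		launches++
		if token == "old" {
			return exitUnauthorized, nil
		}
		return 0, nil
	}

	code, err := supervise(run, fetchToken, 3)
	fmt.Println("final exit code:", code, "launches:", launches, "err:", err)
}
```

Passing the runner as a function keeps the retry logic testable; in real use it would be a closure around `runChild`.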

Always happy to review PRs!

Put a PR up.