evryfs/github-actions-runner-operator

Can't resolve `Pods and runner API not in sync, returning early`

jsoref commented

(We're technically running 0.10.0. We'll try to upgrade soon...)

logger.Info("Pods and runner API not in sync, returning early")

Our GitHub runners got stuck. Our pool has a size limit, so we edited each stuck pod to remove its finalizer and then deleted the pod. This may or may not have allowed some runners to start.
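For anyone hitting the same thing, here is a minimal sketch of that workaround using client-go; the namespace and pod name are placeholders, and a `kubectl patch`/`kubectl delete` pair does the same job:

```go
// Hypothetical sketch: clear finalizers on a stuck runner pod so deletion can
// complete, then delete it. Assumes a kubeconfig in the default location.
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ns, pod := "actions-runner", "docker-runner-pool-abcde" // placeholder names

	// JSON merge patch that empties metadata.finalizers.
	patch, _ := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{"finalizers": []string{}},
	})
	if _, err := clientset.CoreV1().Pods(ns).Patch(context.TODO(), pod,
		types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
	if err := clientset.CoreV1().Pods(ns).Delete(context.TODO(), pod, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("finalizers cleared and pod deleted:", pod)
}
```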

Eventually the operator wouldn't start any more runners for the pool, apparently because the system had reached a limit. (Not quite sure why; we can file a bug about that later.)

To try to fix things, someone deleted the GithubActionRunner object, hoping that this would unstick the operator (ArgoCD manages the object, so it was resurrected immediately after deletion).

Instead, we get:

2023-02-03T16:52:49.900Z INFO controllers.GithubActionRunner Pods and runner API not in sync, returning early {"githubactionrunner": "github/docker-runner-pool"}

My guess is that the only way to "fix" this is to restart the operator, but ideally the operator should be more tolerant of this case.

Rough thoughts:

  • It'd be nice if the operator could recognize "the object I'm talking to is not the one I was monitoring, and it's younger than the one I was monitoring", and then discard its old state and start tracking state against the new object (see the sketch after this list).
  • It might also be nice if there were a way to ask the operator to do things like "drop" things, or refresh, or ...
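A minimal sketch of the first idea, assuming controller-runtime/apimachinery: remember which UID the per-CR state was recorded against, and throw that state away whenever the object under the same name shows up with a different UID (i.e. it was deleted and recreated, e.g. by ArgoCD). The cache and field names here are hypothetical, not the operator's actual code.

```go
package controllers

import (
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// trackedState stands in for whatever per-CR bookkeeping the operator keeps.
type trackedState struct {
	knownRunnerIDs map[int64]bool
}

// stateCache remembers, per namespaced name, the UID of the instance the
// state was recorded against.
type stateCache struct {
	mu    sync.Mutex
	uids  map[types.NamespacedName]types.UID
	state map[types.NamespacedName]*trackedState
}

func newStateCache() *stateCache {
	return &stateCache{
		uids:  map[types.NamespacedName]types.UID{},
		state: map[types.NamespacedName]*trackedState{},
	}
}

// stateFor returns the tracked state for obj, resetting it if the object's UID
// differs from the one we recorded: that means the CR was deleted and a
// younger replacement now exists under the same name.
func (c *stateCache) stateFor(key types.NamespacedName, obj metav1.Object) *trackedState {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.uids[key] != obj.GetUID() {
		// New instance of the object: drop stale state and start over.
		c.uids[key] = obj.GetUID()
		c.state[key] = &trackedState{knownRunnerIDs: map[int64]bool{}}
	}
	return c.state[key]
}
```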
jsoref commented

Ok, it appears that we ended up in a state where GitHub thought we had more runners than the operator thought it was managing.

At the time, we couldn't actually delete those additional runners: attempts failed with "Sorry, there was a problem deleting your runner." I opened a ticket (for my reference: https://support.github.com/ticket/personal/0/1995950), and GitHub support confirmed they could reproduce that API failure.

We tried deleting all of our runner pods, and that made things worse: the operator eventually thought it had 0 pods, while GitHub still had ~2-4 dangling runners that no one could delete.

Because there's a fairly tightly coupled state machine that cautiously refuses to make changes while the number of pods the operator manages doesn't match the number of runners GitHub reports, the operator was unwilling to create new pods, and we were stuck watching it repeat the reported log message.
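For context, the behavior we see is consistent with an early-return guard of roughly this shape; this is my paraphrase of the symptom, not the operator's actual source:

```go
package controllers

import (
	"time"

	"github.com/go-logr/logr"
	ctrl "sigs.k8s.io/controller-runtime"
)

// reconcilePool paraphrases the observed behavior (hypothetical, not the real
// code): when the operator's pod count and GitHub's runner count for the pool
// disagree, it logs and requeues without scaling anything. With 0 pods on our
// side and a few undeletable runners on GitHub's side, the counts can never
// converge, so every reconcile takes this branch.
func reconcilePool(logger logr.Logger, podCount, apiRunnerCount int, pollInterval time.Duration) (ctrl.Result, error) {
	if podCount != apiRunnerCount {
		logger.Info("Pods and runner API not in sync, returning early")
		return ctrl.Result{RequeueAfter: pollInterval}, nil
	}
	// ...otherwise scale the pool up or down as needed...
	return ctrl.Result{RequeueAfter: pollInterval}, nil
}
```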

The workaround mentioned in #128 (selecting new pool names) would work, but it's pretty unfortunate and far from ideal.


If the number of pods the operator is managing is lower than the number of runners GitHub reports, and GitHub claims the runners it knows about are idle, then from my perspective it's entirely fair for the operator to tell GitHub to delete the runners it doesn't recognize, provided they are (1) idle and (2) part of the same label set, i.e. runners the operator should own and operate.

... In fact, the GitHub support agent suggested:

can you please let me know whether it works for you if you try to delete the runner via the API?

So, at least GitHub support thinks it's a good idea 😉.
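Here's a hedged sketch of what that cleanup could look like, whether the operator does it or an admin runs it by hand: list the pool's runners via the GitHub API and deregister any that are idle, carry the pool's label, and don't correspond to a pod the operator knows about. It's written against google/go-github (method signatures vary a bit between versions), and the org name, label, and known-pod set are placeholders.

```go
// Hypothetical cleanup sketch: remove idle, label-matched runners that GitHub
// knows about but the operator does not.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v50/github"
	"golang.org/x/oauth2"
)

func hasLabel(r *github.Runner, label string) bool {
	for _, l := range r.Labels {
		if l.GetName() == label {
			return true
		}
	}
	return false
}

func main() {
	ctx := context.Background()
	tc := oauth2.NewClient(ctx, oauth2.StaticTokenSource(
		&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")},
	))
	client := github.NewClient(tc)

	org := "my-org"                   // placeholder
	poolLabel := "docker-runner-pool" // placeholder: the label set the operator owns
	knownPods := map[string]bool{}    // placeholder: runner pods the operator manages

	opts := &github.ListOptions{PerPage: 100}
	for {
		runners, resp, err := client.Actions.ListOrganizationRunners(ctx, org, opts)
		if err != nil {
			panic(err)
		}
		for _, r := range runners.Runners {
			// Only touch runners that (1) are idle, (2) carry our pool's label,
			// and (3) have no matching pod on our side.
			if !r.GetBusy() && hasLabel(r, poolLabel) && !knownPods[r.GetName()] {
				if _, err := client.Actions.RemoveOrganizationRunner(ctx, org, r.GetID()); err != nil {
					fmt.Printf("could not remove %s: %v\n", r.GetName(), err)
					continue
				}
				fmt.Printf("removed dangling runner %s\n", r.GetName())
			}
		}
		if resp.NextPage == 0 {
			break
		}
		opts.Page = resp.NextPage
	}
}
```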
