evryfs/github-actions-runner-operator

Pods and runner API not in sync, returning early

Closed this issue · 6 comments

Hello again,

We've also noticed that every now and then we are getting this error from the operator.

Pods and runner API not in sync, returning early

It seems that this happens when there is a runner in the github repo with no corresponding pod.

Not sure how we get into this state, but is it possible to have the operator automatically remove the unknown runner to keep the "pods and runner API" in sync?
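For anyone who wants to clean up such an orphaned registration by hand (or from a small out-of-band job), here is a minimal sketch using the google/go-github client. The function name, the token/owner/repo arguments, and the podNames set are placeholders and not part of the operator; it only removes runners that report as offline and have no matching pod.

```go
package runnercleanup

import (
	"context"
	"fmt"

	"github.com/google/go-github/v39/github"
	"golang.org/x/oauth2"
)

// cleanupOrphanedRunners is a hypothetical helper, not operator code: it lists
// the runners registered for a repository and removes any that are offline and
// have no corresponding pod name in podNames.
func cleanupOrphanedRunners(ctx context.Context, token, owner, repo string, podNames map[string]bool) error {
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: token})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	runners, _, err := client.Actions.ListRunners(ctx, owner, repo, &github.ListOptions{PerPage: 100})
	if err != nil {
		return err
	}
	for _, r := range runners.Runners {
		// Only touch runners that are offline and unknown to the cluster.
		if r.GetStatus() == "offline" && !podNames[r.GetName()] {
			if _, err := client.Actions.RemoveRunner(ctx, owner, repo, r.GetID()); err != nil {
				return fmt.Errorf("removing runner %s: %w", r.GetName(), err)
			}
		}
	}
	return nil
}
```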

@Mattkin this can happen in the [short] timeframe where pods have been spawned but have not yet registered with GitHub; it should fix itself after the pods have started and registered.
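To make the timing window concrete, here is a rough illustration (not the operator's actual implementation) of the kind of consistency check that produces the message: every runner pod should have a registered runner with the same name and vice versa, which transiently fails while freshly spawned pods are still registering.

```go
// inSync is an illustrative sketch only: it reports whether the set of runner
// pod names matches the set of runner names registered in the GitHub API.
// While a new pod is starting up, its name is still missing on the GitHub
// side and the check fails, which is the transient case described above.
func inSync(podNames, runnerNames []string) bool {
	if len(podNames) != len(runnerNames) {
		return false
	}
	registered := make(map[string]struct{}, len(runnerNames))
	for _, n := range runnerNames {
		registered[n] = struct{}{}
	}
	for _, p := range podNames {
		if _, ok := registered[p]; !ok {
			return false
		}
	}
	return true
}
```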

Hi @davidkarlsen thanks for the response

Is it possible to get into a state where the pod has registered the runner name with GitHub, crashes, and on retry sees a duplicate runner with the same name?

I think this is one use-case we have observed.

Any tips for debugging why a runner isn't starting up? I'm not seeing any logs in the operator that could be indicating a problem.

See the operator log. Also try a kubectl describe on the runner pod and check its log.

I'm faced with the same problem. After many weeks of running smoothly, a runner was not removed by the operator from the GitHub runner API, and from that point on the scaling of runners stopped working; I kept getting this error in the log until I removed the runner by hand.
Any advice, or anything else I can try to solve or debug this?

EDIT:
Found that a call to the kube API server for the leader-election ConfigMap failed with context deadline exceeded.
After this the operator gets a shutdown signal and a new container is started.
I think this causes the behaviour (see the log excerpt and the timing sketch below).

E0412 10:24:41.602036 1 leaderelection.go:330] error retrieving resource lock github-actions-runner/4ef9cd91.tietoevry.com: Get "https://10.253.0.1:443/api/v1/namespaces/github-actions-runner/configmaps/4ef9cd91.tietoevry.com": context deadline exceeded
I0412 10:24:41.602323 1 leaderelection.go:283] failed to renew lease github-actions-runner/4ef9cd91.tietoevry.com: timed out waiting for the condition
2022-04-12T10:24:41.602Z INFO controller.githubactionrunner Shutdown signal received, waiting for all workers to finish {"reconciler group": "garo.tietoevry.com", "reconciler kind": "GithubActionRunner"}
2022-04-12T10:24:41.602Z DEBUG events Normal {"object": {"kind":"ConfigMap","apiVersion":"v1"}, "reason": "LeaderElection", "message": "github-actions-runner-operator-5c5c5f584-8njpz_974449cb-ae6a-4d8e-9389-40d6264b5c87 stopped leading"}
2022-04-12T10:24:41.602Z INFO controller.githubactionrunner All workers finished {"reconciler group": "garo.tietoevry.com", "reconciler kind": "GithubActionRunner"}
2022-04-12T10:24:41.603Z ERROR setup problem running manager {"error": "leader election lost"}
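If the lease is being lost because the API server is occasionally slow to answer, one option is to relax the leader-election timing when constructing the manager. This is a hedged sketch rather than a maintainer recommendation: the field names come from controller-runtime's manager.Options, the LeaderElectionID is the one from the log above, and the durations are illustrative.

```go
package setup

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// newManager builds a manager with more tolerant leader-election timing than
// the controller-runtime defaults (LeaseDuration 15s, RenewDeadline 10s,
// RetryPeriod 2s), so a single slow ConfigMap read is less likely to end in
// "leader election lost" and a container restart.
func newManager() (ctrl.Manager, error) {
	leaseDuration := 60 * time.Second
	renewDeadline := 45 * time.Second // must stay below LeaseDuration
	retryPeriod := 5 * time.Second

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "4ef9cd91.tietoevry.com",
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
}
```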

Check the list of runners at GitHub and forcefully delete any that are malfunctioning.