evryfs/github-actions-runner-operator

Operator doesn't respond to `Evicted` runner

FrederikNJS opened this issue · 8 comments

Our runners are frequently evicted for using too much ephemeral storage, probably because the builds don't clean up after themselves.

However, when a runner is evicted for using too much memory or ephemeral storage, nothing happens. I can see a use for keeping the Evicted pod around for debugging, but the operator should notice the eviction and spin up a new runner so that the minimum number of healthy runners is maintained. If the Evicted pod is deleted, the operator responds immediately by spinning up a new runner.

Would it be possible for the operator to treat an Evicted runner as if it didn't exist?
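A minimal sketch of the behaviour being requested, not the operator's actual code. An evicted pod reports `status.phase: Failed` with `status.reason: Evicted`, so it can be left out when counting runners. The function names and the `min_runners` parameter are hypothetical, and the pods are plain dicts shaped like the items in `kubectl get pods -o json`.

```python
from typing import Any

Pod = dict[str, Any]


def is_evicted(pod: Pod) -> bool:
    """Return True if the pod's status marks it as evicted."""
    status = pod.get("status", {})
    return status.get("phase") == "Failed" and status.get("reason") == "Evicted"


def runners_to_create(pods: list[Pod], min_runners: int) -> int:
    """Number of new runners needed when evicted pods count as absent."""
    live = [p for p in pods if not is_evicted(p)]
    return max(0, min_runners - len(live))


# Hypothetical pool: one healthy runner, one evicted runner.
pods = [
    {"metadata": {"name": "runner-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "runner-b"},
     "status": {"phase": "Failed", "reason": "Evicted"}},
]
print(runners_to_create(pods, min_runners=2))  # prints 1
```

With this rule the Evicted pod can stay around for debugging while a replacement is still scheduled.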

We see the same problem

Thanks for reporting. Can you check whether the runners are still registered at GitHub after the pod has been evicted? (This needs to be checked soon after the eviction, as GitHub will eventually remove dead runners.) Some logic may be needed to unregister them, as well as to handle the pod status.

@davidkarlsen

  1. Yes - Evicted pods are still registered at GitHub (Org Settings -> Actions -> Self-hosted runners -> Runner groups)
  2. They are left there for a while. We had one stuck in Offline for the whole weekend

I have the same problem.

  1. What version are you running?
  2. What do the operator logs say? (It should attempt to deregister them with GitHub and then remove the finalizer, allowing them to be deleted.)

@davidkarlsen OK,
I was stuck on the same issue (212) here. I started the latest version in another namespace, and I checked that the Evicted pods are deleted there.
But the old namespace is still stuck in Terminating status.
Thanks

@davidkarlsen
I am still facing this issue in v0.10.0. New runner pods don't come up when an older pod gets evicted; the runner pool then goes out of sync and all the jobs get queued.

I have to manually delete the evicted pod for the new pods to come up and make everything operational again.

FYI, I've installed the operator via the helm chart mentioned in the Readme and I'm running with GHE.

@davidkarlsen I see that the runner gets unregistered when a pod is evicted, but the pod still isn't deleted and the operator doesn't spin up a new runner.

Also, I don't see finalizers attached to the evicted pods. Does it mean the finalizer was also removed successfully?
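One way to look into this is to inspect `metadata.finalizers` and `metadata.deletionTimestamp` in the pod's JSON (for example from `kubectl get pod <name> -o json`). The sketch below is only an illustration: `describe_cleanup_state` is a hypothetical helper and the finalizer name in the example is made up. An empty finalizer list shows that nothing is currently blocking deletion; it cannot show whether a finalizer was ever set or removed.

```python
def describe_cleanup_state(pod):
    """Summarise whether a pod is being deleted and what blocks it."""
    meta = pod.get("metadata", {})
    finalizers = meta.get("finalizers") or []
    deleting = "deletionTimestamp" in meta
    if finalizers and deleting:
        return "deletion requested, blocked by: " + ", ".join(finalizers)
    if finalizers:
        return "finalizers present, no deletion requested"
    if deleting:
        return "deletion requested, no finalizers"
    return "no finalizers, no deletion requested"


# Hypothetical evicted pod as described above: no finalizers, never deleted.
evicted = {
    "metadata": {"name": "runner-b"},
    "status": {"phase": "Failed", "reason": "Evicted"},
}
print(describe_cleanup_state(evicted))
# prints: no finalizers, no deletion requested

# Hypothetical pod stuck in Terminating behind a made-up finalizer.
stuck = {
    "metadata": {
        "name": "runner-c",
        "deletionTimestamp": "2021-01-01T00:00:00Z",
        "finalizers": ["example.com/runner-cleanup"],
    }
}
print(describe_cleanup_state(stuck))
# prints: deletion requested, blocked by: example.com/runner-cleanup
```

If an evicted pod reports "no finalizers, no deletion requested", nothing has asked for it to be deleted yet, which would match the pod staying around until it is removed by hand.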