evryfs/github-actions-runner-operator

Terminated container(s) don't kill the pod


In certain circumstances the connection to GitHub might fail. This can be due to SSL/TLS issues, GitHub being down, etc.

Example from the runner container log:

√ Runner successfully added
The SSL connection could not be established, see inner exception.
An error occurred: Not configured. Run config.(sh/cmd) to configure the runner.

This leads to the runner container being terminated, but the pod itself keeps running (albeit in an ERROR state), which blocks new pods from being spawned for the affected pool. After the pod is deleted, the pool scales normally again.

This could be solved by using a livenessProbe on the runner container to check whether it is still running. However, if any container in the pod terminates, the pod should also be terminated, and that should be handled in the operator.
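To make the operator-side idea concrete, here is a rough sketch of the check I have in mind. It is not taken from the operator's code: `ContainerState`, `ContainerStatus`, `terminatedContainers` and `shouldDeletePod` are hypothetical, simplified stand-ins for the per-container status a pod reports, and the container names `runner` and `docker` are only assumed for illustration.

```go
package main

import "fmt"

// ContainerState is a simplified, hypothetical stand-in for the state a
// container reports in the pod status (waiting, running, or terminated).
type ContainerState int

const (
	Waiting ContainerState = iota
	Running
	Terminated
)

// ContainerStatus is a simplified stand-in for one container's status entry.
// It is NOT the real Kubernetes API type, only an illustration.
type ContainerStatus struct {
	Name  string
	State ContainerState
}

// terminatedContainers returns the names of all containers that have terminated.
func terminatedContainers(statuses []ContainerStatus) []string {
	var names []string
	for _, s := range statuses {
		if s.State == Terminated {
			names = append(names, s.Name)
		}
	}
	return names
}

// shouldDeletePod reports whether at least one container has terminated,
// in which case the whole pod would be deleted so the pool can replace it.
func shouldDeletePod(statuses []ContainerStatus) bool {
	return len(terminatedContainers(statuses)) > 0
}

func main() {
	// Assumed example: the runner container died, a sidecar is still running.
	pod := []ContainerStatus{
		{Name: "runner", State: Terminated},
		{Name: "docker", State: Running},
	}
	if shouldDeletePod(pod) {
		fmt.Println("delete pod; terminated containers:", terminatedContainers(pod))
	}
}
```

In the real operator this check would presumably run during reconciliation against the actual pod status, followed by a delete call for the pod. That part is left out here because it depends on how the operator is structured.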

The only way I've been able to get around this when GH has hiccups and our runner pool crashes is to either scale the operator to 0 and then back to 1, or force-delete the pool and then let the operator scale it back up.

Would love to have a better solution to this, as GH outages seem to happen 3-4 times a year, if not more.

livenessProbe sounds interesting.