evryfs/github-actions-runner-operator

Add status to operator when it has trouble talking to GitHub

bgolding355 opened this issue · 2 comments

The Problem

GitHub Outage

Within the last day, GitHub had a significant outage: https://www.githubstatus.com/incidents/sksd097hm0y5?utm_ts=1647526099

During the outage, github-actions-runner-operator experienced errors communicating with GitHub.

While trying to debug this, I visited https://github.com/settings/tokens and tried to generate a token, the result of which was:
[image]

Runner Pod Logs

Http response code: InternalServerError from 'POST https://api.github.com/actions/runner-registration'

Operator Pod Logs

ERROR controller.githubactionrunner Reconciler error

{
   "reconciler group": "garo.tietoevry.com", 
   "reconciler kind": "GithubActionRunner", 
   "name": "basic-runner-pool", 
   "namespace": "github", 
   "error": "POST https://api.github.com/orgs/GarnerCorp/actions/runners/registration-token: 500  []" 
}

DEBUG events Warning

{
  "object": {
    "kind": "GithubActionRunner",
    "namespace": "e2e-tests",
    "name": "e2e-runner-pool",
    "uid": "c855cc0e-a161-4673-b7ae-4c4e316f00bf",
    "apiVersion": "garo.tietoevry.com/v1alpha1",
    "resourceVersion": "533367216"
  },
  "reason": "ProcessingError",
  "message": "failed to get installation for owner \"GarnerCorp\": GET https://api.github.com/orgs/GarnerCorp/installation: 500  []"
}
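Both messages above end in the pattern "METHOD URL: status", so the HTTP status can be pulled out of the text to tell a GitHub-side failure (5xx) from a configuration problem. The following is a minimal sketch under that assumption; `statusFromMessage` and `isServerSide` are hypothetical helper names, not part of the operator, and the regular expression is fitted only to the two messages quoted in this issue.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// apiErr matches "<METHOD> <URL>: <status>" as it appears in the messages
// quoted in this issue. The pattern is an assumption based on those two lines.
var apiErr = regexp.MustCompile(`([A-Z]+) (https?://\S+): (\d{3})\b`)

// statusFromMessage extracts the HTTP status code from an error message.
// The second return value is false when no status code is found.
func statusFromMessage(msg string) (int, bool) {
	m := apiErr.FindStringSubmatch(msg)
	if m == nil {
		return 0, false
	}
	code, err := strconv.Atoi(m[3])
	if err != nil {
		return 0, false
	}
	return code, true
}

// isServerSide reports whether the message carries a 5xx status code.
func isServerSide(msg string) bool {
	code, ok := statusFromMessage(msg)
	return ok && code >= 500 && code <= 599
}

func main() {
	msgs := []string{
		`POST https://api.github.com/orgs/GarnerCorp/actions/runners/registration-token: 500  []`,
		`failed to get installation for owner "GarnerCorp": GET https://api.github.com/orgs/GarnerCorp/installation: 500  []`,
	}
	for _, m := range msgs {
		code, _ := statusFromMessage(m)
		fmt.Println(code, isServerSide(m))
	}
}
```

A check like this only reads the text of the message; code with access to the underlying error value would normally inspect the response status directly instead.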

Environment

I am using:

dependencies:
  - name: github-actions-runner-operator
    version: 2.5.5
    repository: https://evryfs.github.io/helm-charts/

I have since upgraded to 2.7.0 but the behavior persists.

Proposed Enhancements

When a GitHub API call fails, it would be very useful to set a status on the GithubActionRunner indicating the problem, especially when the failure is a reconciler error.
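To make the request concrete, here is a minimal sketch of recording a reconcile outcome as a status condition. It is not the operator's code: the `Condition` type, `setCondition`, and `recordReconcileResult` are hypothetical, and the type and reason strings are copied from the `describe` output quoted in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Condition is a hypothetical stand-in for a status condition. Its fields
// follow the ones shown in the describe output in this thread.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition inserts the condition, or updates the existing one of the
// same type. The transition time is kept when the status value is unchanged.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			if conds[i].Status == c.Status {
				c.LastTransitionTime = conds[i].LastTransitionTime
			}
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// recordReconcileResult turns the outcome of one reconcile pass into a
// condition, so that a failed API call becomes visible on the resource.
func recordReconcileResult(conds []Condition, err error, now time.Time) []Condition {
	if err != nil {
		return setCondition(conds, Condition{
			Type:               "ReconcileError",
			Status:             "True",
			Reason:             "LastReconcileCycleFailed",
			Message:            err.Error(),
			LastTransitionTime: now,
		})
	}
	return setCondition(conds, Condition{
		Type:               "ReconcileSuccess",
		Status:             "True",
		Reason:             "LastReconcileCycleSucceded", // spelled as in the quoted output
		LastTransitionTime: now,
	})
}

func main() {
	now := time.Date(2022, 3, 17, 15, 22, 43, 0, time.UTC)
	err := errors.New(`GET https://api.github.com/orgs/GarnerCorp/installation: 500  []`)
	for _, c := range recordReconcileResult(nil, err, now) {
		fmt.Printf("%s=%s reason=%s message=%q\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```

In a real controller the updated conditions would then be written back to the resource's status; that step is left out here because it depends on the client library in use.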

Thank you for the thorough report. This should already be the case. Example from our cluster:

k describe gar dts-default-pool|tail -27
  Reconciliation Period:  30s
Status:
  Conditions:
    Last Transition Time:  2022-03-17T20:19:30Z
    Message:               
    Observed Generation:   50
    Reason:                LastReconcileCycleSucceded
    Status:                True
    Type:                  ReconcileSuccess
    Last Transition Time:  2022-03-17T15:22:43Z
    Message:               failed to get installation for owner "<redacted>": GET https://api.github.com/orgs/<redacted>/installation: 500  []
    Observed Generation:   50
    Reason:                LastReconcileCycleFailed
    Status:                True
    Type:                  ReconcileError
  Current Size:            1
Events:
  Type    Reason   Age   From                Message
  ----    ------   ----  ----                -------
  Normal  Scaling  151m  GithubActionRunner  Created pod garo-default-runner-pool/dts-default-pool-pod-cxdh9
  Normal  Scaling  150m  GithubActionRunner  garo-default-runner-pool/dts-default-pool-pod-52qvs
  Normal  Scaling  147m  GithubActionRunner  Created pod garo-default-runner-pool/dts-default-pool-pod-txgtr
  Normal  Scaling  146m  GithubActionRunner  Created pod garo-default-runner-pool/dts-default-pool-pod-r8gdk
  Normal  Scaling  146m  GithubActionRunner  Created pod garo-default-runner-pool/dts-default-pool-pod-7mnmv
  Normal  Scaling  120m  GithubActionRunner  garo-default-runner-pool/dts-default-pool-pod-cxdh9
  Normal  Scaling  114m  GithubActionRunner  garo-default-runner-pool/dts-default-pool-pod-txgtr
  Normal  Scaling  113m  GithubActionRunner  garo-default-runner-pool/dts-default-pool-pod-r8gdk
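In the output above both conditions report Status True, so the Last Transition Time is what shows which one reflects the latest reconcile. A minimal sketch of picking the most recent condition from the JSON form of the status (for instance from `kubectl get gar <name> -o json`) follows. The lowerCamelCase field names are an assumption based on the usual Kubernetes condition layout, and `parseConditions` and `latestTrue` are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// condition holds the fields shown in the describe output.
// The JSON field names are assumed, not taken from the CRD.
type condition struct {
	Type               string    `json:"type"`
	Status             string    `json:"status"`
	Reason             string    `json:"reason"`
	Message            string    `json:"message"`
	LastTransitionTime time.Time `json:"lastTransitionTime"`
}

// parseConditions decodes a JSON array of conditions.
func parseConditions(raw string) ([]condition, error) {
	var conds []condition
	err := json.Unmarshal([]byte(raw), &conds)
	return conds, err
}

// latestTrue returns the condition with status "True" that transitioned
// most recently. The second return value is false when there is none.
func latestTrue(conds []condition) (condition, bool) {
	var best condition
	found := false
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		if !found || c.LastTransitionTime.After(best.LastTransitionTime) {
			best, found = c, true
		}
	}
	return best, found
}

// sample reproduces the two conditions from the output above.
const sample = `[
  {"type": "ReconcileSuccess", "status": "True", "reason": "LastReconcileCycleSucceded",
   "message": "", "lastTransitionTime": "2022-03-17T20:19:30Z"},
  {"type": "ReconcileError", "status": "True", "reason": "LastReconcileCycleFailed",
   "message": "GET https://api.github.com/orgs/<redacted>/installation: 500  []",
   "lastTransitionTime": "2022-03-17T15:22:43Z"}
]`

func main() {
	conds, err := parseConditions(sample)
	if err != nil {
		panic(err)
	}
	if c, ok := latestTrue(conds); ok {
		fmt.Printf("%s (%s) since %s\n", c.Type, c.Reason, c.LastTransitionTime.Format(time.RFC3339))
	}
}
```

For the sample data the helper selects ReconcileSuccess, since 20:19:30Z is later than the 15:22:43Z transition of ReconcileError.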

Do you have the latest CRD applied in the cluster?

Closing due to inactivity, and because this is already supported. Feel free to re-open should there be anything else.