int128/datadog-actions-metrics

Metric for "lost communication with the server" errors

int128 opened this issue · 0 comments

Problems to solve

Occasionally, a self-hosted runner is killed by OOM or some other issue. GitHub Actions reports this as a "lost communication with the server" error.

When the error occurs, GitHub Actions adds an annotation with the following message:

The self-hosted runner: POD_NAME lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
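
As a rough sketch, the annotation could be recognized by its fixed phrase. The helper names below (`isLostCommunicationError`, `extractRunnerName`) are hypothetical and not part of any existing action. The sketch assumes the message format stays exactly as quoted above.

```typescript
// Assumption: the annotation message always contains this fixed phrase.
const LOST_COMMUNICATION_PHRASE = 'lost communication with the server'

// Hypothetical helper: true if the annotation message reports a lost runner.
const isLostCommunicationError = (message: string): boolean =>
  message.includes(LOST_COMMUNICATION_PHRASE)

// Hypothetical helper: extracts the runner name (POD_NAME in the message above),
// or undefined if the message does not match the assumed format.
const extractRunnerName = (message: string): string | undefined => {
  const matched = /The self-hosted runner: (.+?) lost communication with the server/.exec(message)
  return matched?.[1]
}
```

The runner name could later be attached as a tag, so that errors can be grouped by pod.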

Currently, we send the annotation message to Slack using this action:
https://github.com/int128/workflow-run-summary-action/blob/216f94dd10d099652cfb393e598c2a8f604c3bd0/src/run.ts#L60

How to solve

It would be nice to monitor the count of "lost communication with the server" errors so that we can make fact-based decisions.
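
A minimal sketch of the counting step, assuming the annotations of a workflow run are already available as a list of messages. The `Annotation` and `CountMetric` types, the function names, and the metric name are all hypothetical placeholders, not this action's actual API or naming.

```typescript
// Hypothetical minimal shape of an annotation; real annotations carry more fields.
type Annotation = { message: string }

// Hypothetical metric shape, for illustration only.
type CountMetric = { name: string; value: number; tags: string[] }

// Counts annotations that report a lost runner.
const countLostCommunicationErrors = (annotations: Annotation[]): number =>
  annotations.filter((a) => a.message.includes('lost communication with the server')).length

// Builds a count metric for a single workflow run.
const buildLostCommunicationMetric = (annotations: Annotation[], tags: string[]): CountMetric => ({
  // Assumption: this metric name is a placeholder chosen for the example.
  name: 'github.actions.job.lost_communication_error_total',
  value: countLostCommunicationErrors(annotations),
  tags,
})
```

Emitting the metric even when the count is zero would make it possible to compute an error rate, not only an absolute count.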