Metric of "lost communication with the server" error
int128 opened this issue · 0 comments
Problems to solve
Occasionally, a self-hosted runner is killed by the OOM killer or fails for some other reason. This is reported as a "lost communication with the server" error.
When the error occurs, GitHub Actions adds an annotation with the following message:
The self-hosted runner: POD_NAME lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
Currently, we send the annotation message to Slack using this action:
https://github.com/int128/workflow-run-summary-action/blob/216f94dd10d099652cfb393e598c2a8f604c3bd0/src/run.ts#L60
How to solve
It would be nice to monitor the number of "lost communication with the server" errors so that we can make fact-based decisions.
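As a rough sketch of how the count could be derived, the annotation messages could be matched against the message format quoted above and tallied per runner. The `Annotation` type and the `countLostCommunicationErrors` function are hypothetical names for illustration, not part of the existing action, and how the count is sent to a metrics backend is left open.

```typescript
// Hypothetical shape: only the annotation message is needed for counting.
type Annotation = { message: string }

// Matches the annotation message quoted in this issue and captures the runner name.
const LOST_COMMUNICATION_PATTERN = /^The self-hosted runner: (.+?) lost communication with the server\./

// Returns the number of "lost communication with the server" annotations per runner name.
const countLostCommunicationErrors = (annotations: Annotation[]): Map<string, number> => {
  const counts = new Map<string, number>()
  for (const annotation of annotations) {
    const matched = LOST_COMMUNICATION_PATTERN.exec(annotation.message)
    if (matched === null) {
      continue // unrelated annotation
    }
    const runnerName = matched[1] ?? 'unknown'
    counts.set(runnerName, (counts.get(runnerName) ?? 0) + 1)
  }
  return counts
}
```

The per-runner map could then be summed into a single counter, or emitted with the runner name as a tag if per-pod breakdowns are useful.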