TilBlechschmidt/WebGrid

HeartBeat does not consider service health

TilBlechschmidt opened this issue · 2 comments

πŸ› Bug description

Currently, the HeartBeat struct does not take any external status indicators into account. For manager/session heartbeats this can cause issues: a job may become unavailable and compromise the overall service health while the heartbeat persists. The bug presumably does not manifest itself as of now, because the only resource that could become unavailable is the database, which the HeartBeat struct itself uses. However, job crashes, albeit unlikely, can still trigger it, and fixing the behaviour now eases the future addition of resources.

🦢 Reproduction steps

Steps to reproduce the behavior:

  1. Compromise a job by e.g. letting it crash
  2. Launch the service (manager, session or storage)
  3. Observe that the heartbeat stays in the database even though /status reports Degraded

🎯 Expected behaviour

When the JobScheduler is in a degraded state, the heartbeat should be temporarily removed from the database.


Context

Version
This bug is not version related.

Where did the problem occur?

  • ☸️ Kubernetes
  • 🐳 Docker
  • πŸ‘¨β€πŸ’» Locally

Which browsers cause the bug?
This bug is not browser related

Solution sketch

The ideal solution would be an extension to the HeartBeat struct that takes a &JobScheduler and adds or removes the heartbeat as the status of the scheduler changes. Arguably the simplest way to do this is either to spawn a new job or to incorporate the check into the existing HeartBeat loop.

Solving this ticket would also be an opportunity to move the status evaluation from the status server into the job scheduler (where it belongs), so that the status server and the heartbeat share the same logic.
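
To make the idea concrete, here is a minimal sketch of one iteration of a health-aware heartbeat loop. It is not the existing WebGrid code: the `Health` enum, the `HealthSource` and `HeartbeatStore` traits, the in-memory `MemoryStore`, and the `beat` function are all hypothetical names introduced for illustration. `HealthSource` stands in for a status evaluation exposed by the JobScheduler, and `HeartbeatStore` stands in for the database.

```rust
use std::collections::HashSet;

/// Hypothetical health summary, mirroring what /status would report.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Health {
    Operational,
    Degraded,
}

/// Hypothetical abstraction over the scheduler's status evaluation.
/// Both the status server and the heartbeat could consume this.
pub trait HealthSource {
    fn health(&self) -> Health;
}

/// Trivial stub so a fixed health value can act as a source in examples.
impl HealthSource for Health {
    fn health(&self) -> Health {
        *self
    }
}

/// Hypothetical abstraction over the database that holds heartbeat keys.
pub trait HeartbeatStore {
    fn publish(&mut self, key: &str);
    fn retract(&mut self, key: &str);
}

/// In-memory store used only for illustration; a real store would
/// talk to the database and likely set an expiry on the key.
#[derive(Default)]
pub struct MemoryStore {
    keys: HashSet<String>,
}

impl MemoryStore {
    pub fn contains(&self, key: &str) -> bool {
        self.keys.contains(key)
    }
}

impl HeartbeatStore for MemoryStore {
    fn publish(&mut self, key: &str) {
        self.keys.insert(key.to_owned());
    }

    fn retract(&mut self, key: &str) {
        self.keys.remove(key);
    }
}

/// One iteration of the heartbeat loop: publish the heartbeat while the
/// source is operational, retract it while the source is degraded.
/// Returns the health that was observed.
pub fn beat(
    source: &impl HealthSource,
    store: &mut impl HeartbeatStore,
    key: &str,
) -> Health {
    let health = source.health();
    match health {
        Health::Operational => store.publish(key),
        Health::Degraded => store.retract(key),
    }
    health
}
```

Calling `beat` on every tick of the existing loop would remove the heartbeat as soon as the scheduler reports a degraded state and restore it on the first tick after recovery. Keeping the evaluation behind a single trait is one way to let the status server and the heartbeat share the same logic.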