HeartBeat does not consider service health
TilBlechschmidt opened this issue · 2 comments
Bug description
Currently, the HeartBeat struct does not take any external status indicators into account. For manager/session heartbeats this may cause issues: jobs can become unavailable, compromising the overall service health, while the heartbeat persists. This bug presumably does not manifest itself as of now, because the only resource that could become unavailable is the database (which the HeartBeat struct itself uses). However, crashes of services, albeit unlikely, can still surface it, and fixing this behaviour would ease the future addition of resources.
Reproduction steps
Steps to reproduce the behavior:
- Compromise a job by e.g. letting it crash
- Launch the service (manager, session or storage)
- The heartbeat stays in the database even though /status reports Degraded
Expected behaviour
When the JobScheduler is in a degraded state, the heartbeat should be temporarily removed from the database.
Context
Version
This bug is not version related.
Where did the problem occur?
- Kubernetes
- Docker
- Locally
Which browsers cause the bug?
This bug is not browser related
Solution sketch
The ideal solution would be an extension to the HeartBeat struct that takes &JobScheduler and adds/removes the heartbeat as the status of the scheduler changes. The arguably simplest way would be to either spawn a new job for this or incorporate it into the existing HeartBeat loop.
Solving this ticket would be an opportunity to move the status evaluation from the status server into the job scheduler (where it belongs), so that the logic is shared between the status server and the heartbeat.
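A minimal sketch of that idea, under loud assumptions: the `Status` enum, the `job_alive` field, the `status()` and `tick()` methods, and the `HashSet` standing in for the database are all hypothetical and do not reflect the project's actual types or storage layer. It only illustrates the status evaluation living on the scheduler and the heartbeat being added or removed based on it.

```rust
use std::collections::HashSet;

/// Hypothetical health states; the real project may use different names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Operational,
    Degraded,
}

/// Stand-in for the real JobScheduler: tracks only whether each job is alive.
struct JobScheduler {
    job_alive: Vec<bool>,
}

impl JobScheduler {
    /// Status evaluation lives on the scheduler, so both the status server
    /// and the heartbeat can call the same logic.
    fn status(&self) -> Status {
        if self.job_alive.iter().all(|alive| *alive) {
            Status::Operational
        } else {
            Status::Degraded
        }
    }
}

/// Stand-in for the real HeartBeat; `key` is the (assumed) database key.
struct HeartBeat {
    key: String,
}

impl HeartBeat {
    /// One iteration of the heartbeat loop. The HashSet plays the role of
    /// the database: the key is present only while the scheduler is healthy.
    fn tick(&self, scheduler: &JobScheduler, db: &mut HashSet<String>) {
        match scheduler.status() {
            Status::Operational => {
                db.insert(self.key.clone());
            }
            Status::Degraded => {
                db.remove(&self.key);
            }
        }
    }
}

fn main() {
    let mut db = HashSet::new();
    let mut scheduler = JobScheduler { job_alive: vec![true, true] };
    let heartbeat = HeartBeat { key: "manager".to_string() };

    heartbeat.tick(&scheduler, &mut db);
    println!("healthy, heartbeat present: {}", db.contains("manager"));

    // Simulate a crashed job; the next tick withdraws the heartbeat.
    scheduler.job_alive[0] = false;
    heartbeat.tick(&scheduler, &mut db);
    println!("degraded, heartbeat present: {}", db.contains("manager"));
}
```

In this sketch the heartbeat reappears on the first tick after the scheduler recovers, which matches the "temporarily removed" wording of the expected behaviour. A real implementation would also need to respect whatever expiry the existing heartbeat entries already use.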