[Operator] Detect stuck operator
Closed this issue · 0 comments
jfaltermeier commented
Is your feature request related to a problem? Please describe.
Once we allow multiple replicas of the operator, we should make sure that probes are implemented.
- We don't offer an API endpoint, so implementing the Kubernetes health/ready/live probes does not really make sense.
- We should still try to detect those cases, but then simply stop our Java application.
- What cases do we have where we don't crash but are stuck?
- when watches for CRs are not called anymore
- we had issues with this because of a Kubernetes bug, see #32
- I don't think we have a way to detect issues with watches quickly, so we would have to detect them based on a time window (e.g. if we haven't received an event, or been asked whether we can reconnect, within two hours, we assume that watch to be dead)
- Look for calls with active waiting that might get stuck
- don't do any active waiting on the main thread
- all waits must have a timeout
- ResourceClient.watchUntil
- AddedHandlerUtil.updateSessionURLAsync
- Calls to the kubernetes-client will either throw an exception or return immediately, so no action should be required at the moment.
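
For the "all waits must have a timeout" rule, a minimal sketch of a bounded wait could look like the following. `BoundedWait` and `await` are hypothetical names for illustration only; this is not how `ResourceClient.watchUntil` or `AddedHandlerUtil.updateSessionURLAsync` are implemented, and it assumes the awaited result is available as a `CompletableFuture`.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the operator code base.
class BoundedWait {

    /** Waits for the future for at most the given timeout; returns empty instead of blocking forever. */
    static <T> Optional<T> await(CompletableFuture<T> future, Duration timeout) {
        try {
            return Optional.ofNullable(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Give up on the result so nothing keeps waiting for it.
            future.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for the caller and stop waiting.
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("Awaited operation failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        // prints "false": the wait ends after 100 ms instead of hanging
        System.out.println(await(neverCompletes, Duration.ofMillis(100)).isPresent());
        // prints "ready"
        System.out.println(await(CompletableFuture.completedFuture("ready"), Duration.ofSeconds(1)).get());
    }
}
```

An empty result tells the caller that the wait timed out, so it can log the problem and decide whether to retry or stop the operator.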
Describe the solution you'd like
Stop the operator if it gets stuck.
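
One possible way to do this for watches is a time-based watchdog, sketched below. It assumes the watch callbacks can call `recordActivity()` whenever an event or a reconnect request arrives. `WatchWatchdog`, its methods, and the check interval are hypothetical and only illustrate the idea; the two-hour window is the example value from above.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical watchdog, not part of the operator code base.
class WatchWatchdog {
    private final Duration maxSilence;
    private final Runnable onStale;
    private final LongSupplier nanoClock;
    private final AtomicLong lastActivityNanos;

    WatchWatchdog(Duration maxSilence, Runnable onStale) {
        this(maxSilence, onStale, System::nanoTime);
    }

    /** The clock is injectable so the logic can be tested without waiting. */
    WatchWatchdog(Duration maxSilence, Runnable onStale, LongSupplier nanoClock) {
        this.maxSilence = maxSilence;
        this.onStale = onStale;
        this.nanoClock = nanoClock;
        this.lastActivityNanos = new AtomicLong(nanoClock.getAsLong());
    }

    /** Call from the watch callbacks whenever an event or reconnect request arrives. */
    void recordActivity() {
        lastActivityNanos.set(nanoClock.getAsLong());
    }

    boolean isStale() {
        return nanoClock.getAsLong() - lastActivityNanos.get() > maxSilence.toNanos();
    }

    /** Runs the stale action if the watch has been silent for too long; returns whether it fired. */
    boolean check() {
        if (isStale()) {
            onStale.run();
            return true;
        }
        return false;
    }

    /** Checks periodically on a daemon thread, so the main thread never waits. */
    ScheduledExecutorService start(Duration checkInterval) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "watch-watchdog");
            t.setDaemon(true);
            return t;
        });
        long millis = checkInterval.toMillis();
        scheduler.scheduleAtFixedRate(this::check, millis, millis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        // Assumption: exiting with a non-zero code is an acceptable way to stop the operator.
        WatchWatchdog watchdog = new WatchWatchdog(Duration.ofHours(2), () -> System.exit(1));
        watchdog.start(Duration.ofMinutes(5));
        watchdog.recordActivity(); // would be called from the watch callbacks
        System.out.println("stale: " + watchdog.isStale());
    }
}
```

If the operator runs as a Deployment, exiting the JVM lets Kubernetes restart the container, which re-establishes the watches without needing an HTTP probe endpoint.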
Describe alternatives you've considered
Cluster provider
No response
Additional information
No response