[Operator] Detect stuck operator

Question


Once we allow multiple replicas of the operator we should make sure that the probes are implemented.

We don't offer an API endpoint, so implementing the Kubernetes health/ready/live probes does not really make sense.
We should still try to detect those cases, and then simply stop our Java application.
In which cases do we not crash but get stuck?
- when watches for CRs are not called anymore
  - we had issues with this because of a Kubernetes bug, see #32
  - I don't think we have a way to detect problems with watches quickly, so we would have to detect them based on a timeframe (e.g. if we haven't received an event or been asked whether we can reconnect within two hours, we assume the watch is dead)
- Look for calls with active waiting that might get stuck
  - don't do any active waiting on the main thread
  - all waits must have a timeout
    - ResourceClient.watchUntil
    - AddedHandlerUtil.updateSessionURLAsync
  - Calls to the kubernetes-client either throw an exception or return immediately, so no action should be required at the moment.
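
To illustrate the "all waits must have a timeout" rule, here is a minimal sketch using only the JDK. `BoundedWait` and `await` are hypothetical names and not part of the operator code; how this would be wired into `ResourceClient.watchUntil` or `AddedHandlerUtil.updateSessionURLAsync` is left open. The caller still blocks for at most the timeout, so it should itself run on a worker thread and not on the main thread.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a blocking call on a worker pool and never waits longer than the timeout.
final class BoundedWait {
    private BoundedWait() {}

    /**
     * Returns the result of blockingCall, or Optional.empty() if it did not finish in time.
     * Note: a call that returns null is indistinguishable from a timeout in this sketch.
     */
    static <T> Optional<T> await(ExecutorService worker, Callable<T> blockingCall, Duration timeout) {
        Future<T> future = worker.submit(blockingCall);
        try {
            return Optional.ofNullable(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it does not stay stuck forever
            return Optional.empty();
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
            return Optional.empty();
        } catch (ExecutionException e) {
            // The call itself failed; surface the original cause to the caller.
            throw new IllegalStateException("blocking call failed", e.getCause());
        }
    }
}
```

An empty result can then be treated like any other failure (retry, log, or stop the operator), so a wait can no longer hang indefinitely. Cancelling only helps if the blocking call reacts to interruption.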

Stop the operator if it gets stuck.
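
A timeframe-based check for dead watches, as described above, could look roughly like the following sketch. `WatchWatchdog`, `recordActivity` and the injected time source are hypothetical names; the assumption is that every watch callback (event received, reconnect requested) calls `recordActivity()`, and that one watchdog is created per watch. The two-hour limit in the usage note is the example value from this issue.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical watchdog: triggers onStuck when a watch has been silent for too long.
final class WatchWatchdog {
    private final Duration maxSilence;
    private final Runnable onStuck;
    private final Supplier<Instant> now; // injectable time source, eases testing
    private final AtomicReference<Instant> lastActivity;

    WatchWatchdog(Duration maxSilence, Runnable onStuck, Supplier<Instant> now) {
        this.maxSilence = maxSilence;
        this.onStuck = onStuck;
        this.now = now;
        this.lastActivity = new AtomicReference<>(now.get());
    }

    /** Assumed to be called from every watch callback. */
    void recordActivity() {
        lastActivity.set(now.get());
    }

    /** True if the last recorded activity is older than maxSilence. */
    boolean isStale() {
        return Duration.between(lastActivity.get(), now.get()).compareTo(maxSilence) > 0;
    }

    /** Runs the stop action if the watch is considered dead. */
    void check() {
        if (isStale()) {
            onStuck.run();
        }
    }

    /** Checks periodically on a daemon thread, so nothing waits on the main thread. */
    ScheduledExecutorService start(Duration interval) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "watch-watchdog");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(this::check, interval.toMillis(), interval.toMillis(), TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Usage might be `new WatchWatchdog(Duration.ofHours(2), () -> System.exit(1), Instant::now).start(Duration.ofMinutes(5))`. Exiting with a non-zero code lets Kubernetes restart the pod according to its restart policy. If shutdown hooks could hang as well, `Runtime.getRuntime().halt(1)` skips them, at the cost of no cleanup.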
