flux-coral2-dws service doesn't handle k8s connection failure gracefully
grondo opened this issue · 4 comments
The flux-coral2-dws service is restarting repeatedly on El Cap after an update. The k8s servers are having problems and are down:
Apr 01 13:48:51 elcap1 flux-coral2-dws[2684568]: urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='elcap-kube-apiserver', port=6443): Max retries exceeded with url: /apis/dataworkflowservices.github.io/v1alpha2/storages (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ffe4b96da20>: Failed to establish a new connection: [Errno 113] No route to host',))
The service keeps exiting and getting restarted by systemd. A special RestartPreventExitStatus should perhaps be used in this case to prevent the pointless restarts, or the service should have a way to retry the HTTPS connection itself.
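For the retry option, here's a minimal sketch, not the actual service code: the retry count, backoff, and exit status are made-up placeholders, and it assumes the kubernetes Python client with a kubeconfig. It wraps the Storages poll from the traceback above and, if the apiserver never comes back, exits with a status that a RestartPreventExitStatus= line in the unit's [Service] section could match:

```python
# Minimal sketch only: retry the apiserver with backoff, then exit with a
# sentinel status instead of letting the MaxRetryError kill the service.
import logging
import sys
import time

import urllib3
from kubernetes import client, config

LOG = logging.getLogger("flux-coral2-dws")

# Made-up values for illustration; the real service would choose its own.
K8S_UNREACHABLE_STATUS = 3   # matched by RestartPreventExitStatus=3 in [Service]
RETRIES = 5
BACKOFF_SECONDS = 10


def fetch_storages(api):
    """The Storages poll that raised MaxRetryError in the log above."""
    return api.list_cluster_custom_object(
        "dataworkflowservices.github.io", "v1alpha2", "storages"
    )


def main():
    config.load_kube_config()            # assumes a kubeconfig on the service's node
    api = client.CustomObjectsApi()
    for attempt in range(1, RETRIES + 1):
        try:
            fetch_storages(api)
            return
        except urllib3.exceptions.MaxRetryError as exc:
            LOG.error("cannot reach k8s apiserver (attempt %d/%d): %s",
                      attempt, RETRIES, exc)
            time.sleep(BACKOFF_SECONDS * attempt)
    # Retries exhausted: exit with a status systemd is told not to restart on,
    # leaving the unit in a failed state for admins to notice.
    sys.exit(K8S_UNREACHABLE_STATUS)


if __name__ == "__main__":
    main()
```

The same exit status would need to be listed in the unit (or a drop-in), e.g. RestartPreventExitStatus=3 under [Service].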
What do you think would be best for drawing admins' attention to the fact that something is misconfigured? My guess would be using RestartPreventExitStatus
...
Yeah, that might be best. The error could be logged so admins could see it with journalctl.
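For what it's worth, anything a systemd service writes to stderr already lands in the journal, so the logging half could be as small as something like this sketch (the logger setup and the simulated call are just placeholders, not the service's real code):

```python
# Sketch: log the k8s failure to stderr so journald captures it and
# "journalctl -u flux-coral2-dws" shows it with a traceback.
import logging
import sys

import urllib3

logging.basicConfig(
    stream=sys.stderr,                    # journald captures stderr for the unit
    level=logging.INFO,
    format="%(levelname)s: %(message)s",  # journald adds timestamp and unit name
)
LOG = logging.getLogger("flux-coral2-dws")


def poll_storages():
    # Placeholder that just simulates the connection failure from the log above.
    raise urllib3.exceptions.MaxRetryError(None, "/apis/.../storages",
                                           "no route to host")


try:
    poll_storages()
except urllib3.exceptions.MaxRetryError:
    LOG.exception("cannot reach the k8s apiserver; "
                  "check apiserver health and the service's kubeconfig")
```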
I don't know enough to say whether some other solution would work, like placing some kind of alternate dependency on DWS jobs to alert admins when the connection is down. Even if that were possible, it sounds like a lot of work for not much gain.
You might also want to think through what happens if the service goes from up to down, e.g. the connection terminates, if that's even possible. The service should be able to shut down cleanly without affecting jobs, and the same goes for restarting 🤔
It can already restart without affecting jobs. It (in combination with the jobtap plugin) holds jobs with dependencies, prologs, and epilogs, and if the service is down those holds persist until the service comes back up.
Although I suppose if a user's job is being held in the prolog and then the service crashes (say the k8s server goes down), that user might exhaust their allocation walltime just waiting for the service to come back up, and that seems bad. Maybe it makes sense for the prolog to have a configurable time limit?
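If that seems worth pursuing, one possible shape is sketched below. Everything here is made up: the default timeout, the raise_job_exception placeholder, and the standalone timer; a real version would hook into the jobtap/prolog path instead.

```python
# Hypothetical sketch of a configurable prolog time limit: start a timer when
# the DWS prolog hold is placed, and fail the job if the service/k8s hasn't
# come back before it fires.
import threading

PROLOG_TIMEOUT = 3600.0  # seconds; imagined config knob, value made up


def raise_job_exception(jobid, note):
    """Placeholder for however the service would actually fail the job."""
    print(f"job {jobid}: prolog exception: {note}")


class PrologWatchdog:
    """Per-job timer armed when the DWS prolog starts, cancelled when it finishes."""

    def __init__(self, jobid, timeout=PROLOG_TIMEOUT):
        self.jobid = jobid
        self.timer = threading.Timer(timeout, self._expire)

    def _expire(self):
        raise_job_exception(self.jobid,
                            "DWS prolog timed out waiting for the k8s apiserver")

    def start(self):
        self.timer.start()

    def finish(self):
        # Prolog completed normally; stop the clock.
        self.timer.cancel()
```

The idea being that the timer is armed when the prolog hold is placed and cancelled when DWS reports the workflow is ready, so a healthy connection never hits it.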