flux-framework/flux-coral2

flux-coral2-dws service doesn't handle k8s connection failure gracefully

grondo opened this issue

The flux-coral2-dws service is restarting repeatedly on El Cap after an update. The k8s servers are having problems and are currently down:

```
Apr 01 13:48:51 elcap1 flux-coral2-dws[2684568]: urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='elcap-kube-apiserver', port=6443): Max retries exceeded with url: /apis/dataworkflowservices.github.io/v1alpha2/storages (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ffe4b96da20>: Failed to establish a new connection: [Errno 113] No route to host',))
```

The service keeps exiting and getting restarted by systemd. Perhaps the service should exit with a distinct status and the unit should set RestartPreventExitStatus= for that status to avoid restarting in this case, or the service should have a way to retry the HTTPS connection.
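For the RestartPreventExitStatus= route, here is a rough sketch of what the service side might look like (assuming the service talks to the apiserver through the official kubernetes Python client; the exit code value and function names are placeholders, not the actual flux-coral2-dws code):

```python
import logging
import sys

import kubernetes
import urllib3

LOG = logging.getLogger("flux-coral2-dws")

# Hypothetical, distinct exit status meaning "k8s apiserver unreachable".
# The systemd unit would set RestartPreventExitStatus= to the same value
# so systemd stops restarting the service and admins notice the failure.
EXIT_K8S_UNREACHABLE = 3


def list_storages(api):
    """List DWS Storage resources, bailing out cleanly if k8s is unreachable."""
    try:
        # Same resource the traceback above was fetching:
        # /apis/dataworkflowservices.github.io/v1alpha2/storages
        return api.list_cluster_custom_object(
            group="dataworkflowservices.github.io",
            version="v1alpha2",
            plural="storages",
        )
    except urllib3.exceptions.MaxRetryError as exc:
        # stderr is forwarded to the journal, so this shows up in
        # `journalctl -u flux-coral2-dws`.
        LOG.error("cannot reach the kubernetes apiserver: %s", exc)
        sys.exit(EXIT_K8S_UNREACHABLE)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    kubernetes.config.load_kube_config()
    list_storages(kubernetes.client.CustomObjectsApi())
```

The unit file would then carry RestartPreventExitStatus= with whatever code the service settles on.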

What do you think would be best for drawing admins' attention that something is misconfigured? My guess would be using RestartPreventExitStatus...

Yeah, that might be best. The error could be logged so admins could see it with journalctl.
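If retrying turns out to be preferable to bailing out, a minimal backoff loop (again just a sketch with made-up names, not the real service code) could log each failed attempt so the journal shows what's going on, and only give up after some bound:

```python
import logging
import time

import urllib3

LOG = logging.getLogger("flux-coral2-dws")


def with_retries(func, attempts=5, initial_delay=5):
    """Call func(), retrying with exponential backoff if k8s is unreachable."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except urllib3.exceptions.MaxRetryError as exc:
            LOG.warning(
                "kubernetes apiserver unreachable (attempt %d/%d): %s",
                attempt, attempts, exc,
            )
            if attempt == attempts:
                raise  # let the caller exit with the special status
            time.sleep(delay)
            delay *= 2
```

e.g. wrapping the calls that hit the apiserver with `with_retries(lambda: api.list_cluster_custom_object(...))`.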

I don't know enough to say whether some other solution would work, like placing some kind of alternate dependency on DWS jobs to alert admins when the connection is down. Even if that were possible, it sounds like a lot of work for not much gain.

You might also want to think through what happens if the service goes from up to down, e.g. the connection terminates, if that's even possible. The service should be able to shut down cleanly without affecting jobs - same with restarting 🤔

It can already restart without affecting jobs. It (in combination with the jobtap plugin) holds jobs with dependencies, prologs, and epilogs, and if the service is down those holds remain in place until the service comes back up.

Although I suppose if a user's job is being held in the prolog and then the service crashes (say the k8s server goes down), that user might exhaust their allocation walltime just waiting for the service to come back up, and that seems bad. Maybe it makes sense for the prolog to have a configurable time limit?
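A configurable limit could be as simple as a timer that raises an exception on the job once the prolog has been held too long, so the user isn't silently burning walltime. Purely illustrative sketch (the timeout value, the use of `flux job raise`, and the surrounding structure are all assumptions, not the existing plugin code):

```python
import subprocess
import threading


def start_prolog_timeout(jobid, limit=3600.0):
    """Raise a fatal job exception if the DWS prolog exceeds `limit` seconds.

    `limit` would come from a new, hypothetical config option rather than
    being hard-coded here.
    """
    def expire():
        # `flux job raise` is the existing CLI for raising a job exception;
        # severity 0 makes it fatal so the job is cleaned up instead of
        # waiting forever for the service to come back.
        subprocess.run(
            ["flux", "job", "raise", "--severity=0", "--type=dws",
             str(jobid), "DWS prolog timed out"],
            check=False,
        )

    timer = threading.Timer(limit, expire)
    timer.daemon = True
    timer.start()
    return timer  # call .cancel() when the prolog completes normally
```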