admiraltyio/admiralty

Argo workflows fail to terminate on EKS

fakeburst opened this issue · 5 comments

Greetings!

I've encountered an issue while trying to implement Admiralty for Argo workflows. Running parallel steps on multiple clusters works like a charm, but terminating workflows does not work. It seems to me that it boils down to the way the argo workflow-controller stops running pods. Simplified, the process is (ref):

  • workflow-controller annotates the pod (proxy pod in our case) with workflows.argoproj.io/execution: '{"deadline":"2022-XX-XXTXX:XX:XXZ"}' and sends SIGUSR2 to wait sidecar
  • upon receiving the signal wait checks the said annotation and issues a kill command
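The check the wait sidecar performs on receiving SIGUSR2 can be sketched roughly as below. This is a minimal illustration assuming the annotation shape shown above (a JSON object with a `deadline` field), not Argo's actual implementation:

```python
import json
from datetime import datetime

EXECUTION_ANNOTATION = "workflows.argoproj.io/execution"

def should_terminate(annotations: dict, now: datetime) -> bool:
    """Return True if the execution deadline annotation exists and has passed."""
    raw = annotations.get(EXECUTION_ANNOTATION)
    if raw is None:
        # No deadline annotation on the pod: the signal is effectively ignored.
        # This is the failure mode described below when the annotation never
        # syncs from the proxy pod to the delegate.
        return False
    deadline = datetime.strptime(json.loads(raw)["deadline"], "%Y-%m-%dT%H:%M:%SZ")
    return now >= deadline
```

If `should_terminate` returns True, the sidecar proceeds to kill the main containers; if the annotation is missing, nothing happens, which is exactly the symptom observed here.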

The problem is that the proxy pod gets instantly synced with its PodChaperon, which has the annotations of the delegated pod, so the annotation never reaches the delegate. This causes wait to fail the check and never issue the kill command to the containers in the pod.

I assume having the delegate be the only source of truth saves us from possible race conditions over which annotations should be considered "true", but it also means we are unable to stop workflows, which is a necessary feature when working with argo workflows.

Please let me know if you need any logs or additional info, and/or whether I'm misunderstanding the process of syncing the annotations.

Hi!

Admiralty doesn't currently support proxy-to-delegate pod updates, including annotation updates. However, as you noticed, Admiralty does support delegate-to-proxy pod annotation updates (so Argo can read step outputs).

To fully support Argo, especially stopping/terminating workflows and using daemon steps the way they worked before version 3.2 (i.e., via the workflows.argoproj.io/execution pod annotation), we'd need two-way updates. We could likely make them deterministic with a three-way merge algorithm and some priority rules.
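To make the idea concrete, here is a sketch of such a three-way merge over annotation maps: each side's changes relative to a common base (the last synced state) are applied, and on conflicting changes the delegate wins. The delegate-wins priority rule is an assumption for illustration, not Admiralty's actual algorithm:

```python
def three_way_merge(base: dict, proxy: dict, delegate: dict) -> dict:
    """Merge two annotation maps against a common base.

    Changes made by either side relative to `base` are applied; when both
    sides changed the same key, the delegate's change wins (assumed rule).
    A value of None / a missing key counts as a deletion.
    """
    merged = dict(base)
    for key in set(base) | set(proxy) | set(delegate):
        b, p, d = base.get(key), proxy.get(key), delegate.get(key)
        if d != b:
            # Delegate changed (or deleted) this key: delegate wins.
            if d is None:
                merged.pop(key, None)
            else:
                merged[key] = d
        elif p != b:
            # Only the proxy changed it: take the proxy's value.
            if p is None:
                merged.pop(key, None)
            else:
                merged[key] = p
    return merged
```

Under this rule, an execution-deadline annotation added only on the proxy would survive the merge and propagate to the delegate, while delegate-side annotation updates (step outputs) would still flow back.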

Luckily, you won't have to wait, because Argo v3.2 no longer uses annotations for execution control; it sends TERM signals directly instead: argoproj/argo-workflows#6022

Upgrading Argo should fix your issue.

Sorry for the delayed answer and thanks a lot for your advice!

We're using argo workflows as part of kubeflow pipelines, so we'll need to wait for the next release of pipelines to integrate argo v3.2.3. Meanwhile, I built a custom wait sidecar image with the annotation check removed, so that the SIGUSR2 signal triggers docker kill directly.

But, as EKS drops support for k8s 1.18 on March 31, 2022, I've upgraded to 1.20 and ran into the issue described in #120: the argo workflow controller executes kill commands via a pods/exec request.

Could you please tell me whether you have any ETAs for #120?
I've tried bumping the dependencies myself, but failed, having little to no experience with go 😅

Meanwhile, I built a custom wait sidecar image with the annotation check removed, so that the SIGUSR2 signal triggers docker kill directly.

Good idea.

Could you please tell me whether you have any ETAs for #120?

This month, hopefully next week.

I'm terribly sorry for such a delayed response once again.

Thanks for the update, I've upgraded my EKS clusters to k8s 1.21 and re-installed admiralty with v0.15.1.
Yet it seems I'm missing something, as I get this error in the controller-manager pods:

main.go:323] timed out waiting for virtual kubelet serving certificate to be signed, pod logs/exec won't be supported

I've checked the CSRs, and they get Approved pretty much the second the controller-manager pod launches, but the agent still times out with the error anyway. What could be the reason for this error?
UPD: it seems to be related to aws/containers-roadmap#1604 (comment)
Indeed, the CSRs are never issued, only approved. Will a custom signer help in this case?
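The approved-vs-issued distinction is visible on the CSR object itself: "Approved" is a status condition, while "issued" means the signer has populated status.certificate. A hypothetical helper illustrating the check (field names follow the Kubernetes CertificateSigningRequest API; the classification labels are my own):

```python
def csr_state(csr: dict) -> str:
    """Classify a CertificateSigningRequest (as a dict from the Kubernetes
    API) by whether it was approved and whether a certificate was issued."""
    status = csr.get("status", {})
    approved = any(
        cond.get("type") == "Approved" for cond in status.get("conditions", [])
    )
    # The signer writes the signed certificate into status.certificate;
    # approval alone does not fill this field.
    issued = bool(status.get("certificate"))
    if approved and issued:
        return "approved-and-issued"
    if approved:
        return "approved-but-not-issued"  # the EKS behavior described above
    return "pending"
```

On an affected EKS cluster, the virtual kubelet serving CSRs would sit in the "approved-but-not-issued" state, which matches the timeout in the controller-manager log.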

Please let me know if you need any logs or additional info.

Let's continue the conversation about EKS logs/exec support in #120.