Argo workflows fail to terminate on EKS
fakeburst opened this issue · 5 comments
Greetings!
I've encountered an issue while trying to use Admiralty with Argo Workflows. Running parallel steps on multiple clusters works like a charm, but terminating workflows does not. It seems to boil down to the way the Argo workflow-controller stops running pods. Simplified, the process is (ref):
- workflow-controller annotates the pod (the proxy pod in our case) with `workflows.argoproj.io/execution: '{"deadline":"2022-XX-XXTXX:XX:XXZ"}'` and sends SIGUSR2 to the `wait` sidecar
- upon receiving the signal, `wait` checks the said annotation and issues a kill command
The problem is that the proxy pod gets instantly synced with its PodChaperon, which carries the annotations of the delegate pod, so the annotation never reaches the delegate. This causes `wait` to fail the check and never issue the kill command for the containers in the pod.
I assume having the delegate be the only source of truth saves us from possible race conditions about which annotations should be considered "true", but this way we are unable to stop workflows, which is a needed feature for working with Argo Workflows.
Please let me know if you need any logs or additional info, and/or whether I'm misunderstanding the annotation syncing process.
Hi!
Admiralty doesn't currently support proxy-to-delegate pod updates, including annotation updates. However, as you noticed, Admiralty does support delegate-to-proxy pod annotation updates (so Argo can read step outputs).
To fully support Argo, especially stopping/terminating workflows and using daemon steps, as it used to work before version 3.2, i.e., using the `workflows.argoproj.io/execution` pod annotation, we'd need two-way updates. We could likely make them deterministic with a three-way merge algorithm and some priority rules.
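One possible shape for that merge, sketched under stated assumptions (the function, its parameters, and the priority map are all hypothetical, not Admiralty's actual API): compare each side against the last-synced base, take whichever side changed, and break conflicts with a per-key priority rule.

```go
package main

import "fmt"

// threeWayMerge merges proxy and delegate annotations against a common
// base. Keys listed in proxyPriority prefer the proxy on conflict;
// everything else prefers the delegate, matching the delegate-as-source-
// of-truth default. An empty string stands in for "absent" in this sketch.
func threeWayMerge(base, proxy, delegate map[string]string, proxyPriority map[string]bool) map[string]string {
	out := map[string]string{}
	keys := map[string]bool{}
	for _, m := range []map[string]string{base, proxy, delegate} {
		for k := range m {
			keys[k] = true
		}
	}
	for k := range keys {
		b, p, d := base[k], proxy[k], delegate[k]
		switch {
		case p == b && d == b: // unchanged on both sides
			if b != "" {
				out[k] = b
			}
		case p != b && d == b: // only the proxy changed it
			if p != "" {
				out[k] = p
			}
		case p == b && d != b: // only the delegate changed it
			if d != "" {
				out[k] = d
			}
		default: // both changed: apply the priority rule
			if proxyPriority[k] {
				if p != "" {
					out[k] = p
				}
			} else if d != "" {
				out[k] = d
			}
		}
	}
	return out
}

func main() {
	base := map[string]string{"step": "one"}
	proxy := map[string]string{
		"step":                            "one",
		"workflows.argoproj.io/execution": `{"deadline":"2022-03-01T00:00:00Z"}`,
	}
	delegate := map[string]string{"step": "two"}
	merged := threeWayMerge(base, proxy, delegate,
		map[string]bool{"workflows.argoproj.io/execution": true})
	fmt.Println(merged["step"])                            // two
	fmt.Println(merged["workflows.argoproj.io/execution"]) // proxy's new key propagates
}
```

With a rule like this, the controller-set execution annotation would flow proxy-to-delegate while step outputs kept flowing delegate-to-proxy.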
Luckily, you won't have to wait, because Argo v3.2 no longer uses annotations for execution control; it sends TERM signals directly instead: argoproj/argo-workflows#6022
Upgrading Argo should fix your issue.
Sorry for the delayed answer and thanks a lot for your advice!
We're using Argo Workflows as part of Kubeflow Pipelines, so we'll need to wait for the next release of Pipelines to have Argo v3.2.3 integrated. Meanwhile, I had a custom `wait` sidecar image built with the annotation check removed, so that the SIGUSR2 signal triggers `docker kill` directly.
But, as EKS drops support for k8s 1.18 on March 31, 2022, I've upgraded to 1.20 and encountered the issue described in #120: the Argo workflow-controller executes kill commands via a pods/exec request.
Could you please tell me whether you have any ETAs for #120?
I've tried bumping the dependencies myself, but failed, as I have little to no experience with Go.
> Meanwhile, I had a custom wait sidecar image built with annotation check removed, so that SIGUSR2 signal triggers docker kill directly.
Good idea.
> Could you please tell me whether you have any ETAs for #120?
This month, hopefully next week.
I'm terribly sorry for such a delayed response once again.
Thanks for the update. I've upgraded my EKS clusters to k8s 1.21 and re-installed Admiralty v0.15.1.
Yet it seems I'm missing something, as I get this error in controller-manager pods:

```
main.go:323] timed out waiting for virtual kubelet serving certificate to be signed, pod logs/exec won't be supported
```
I've checked the CSRs, and they get Approved pretty much the second the controller-manager pod launches, but the agent times out with the error anyway. What could be the reason for this error?
UPD: it seems to be related to aws/containers-roadmap#1604 (comment)
Indeed, the CSRs are never issued, only approved. Will a custom signer help in this case?
Please let me know if you need any logs or additional info.
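To clarify the "approved but never issued" distinction: in the Kubernetes CSR API, approval and issuance are separate steps, and a CSR can carry an Approved condition while `status.certificate` stays empty because no signer handles its `signerName`. A minimal sketch, using a pared-down stand-in struct rather than the real `certificates.k8s.io/v1` types:

```go
package main

import "fmt"

// csrStatus is a simplified stand-in for a CertificateSigningRequest's
// status: approval (a condition) and issuance (the signed cert) are
// recorded separately.
type csrStatus struct {
	Approved    bool
	Certificate []byte // populated only once a signer issues the certificate
}

// classify distinguishes the three states; "approved but never issued"
// is the failure mode reported above.
func classify(s csrStatus) string {
	switch {
	case !s.Approved:
		return "pending approval"
	case len(s.Certificate) == 0:
		return "approved but never issued"
	default:
		return "issued"
	}
}

func main() {
	fmt.Println(classify(csrStatus{Approved: true}))
	fmt.Println(classify(csrStatus{Approved: true, Certificate: []byte("PEM")}))
}
```

Under that reading, a custom signer (or anything that actually fills in `status.certificate` for the serving CSRs) is indeed the missing piece, since approval alone never produces a certificate.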