argoproj/argo-workflows

Windows-based pod runs indefinitely

Closed this issue · 20 comments

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

When I submit a workflow with a single task that runs in a windows/servercore:ltsc2022 container, the pod is stuck in the Running phase indefinitely, even though the command in the container succeeds.

Before we updated from v3.4.8, we used ltsc2019 images without problems. The Windows nodes were upgraded as well; they have the label node.kubernetes.io/windows-build=10.0.20348.

(The snippet below is an excerpt from our workflow, with modifications to hide sensitive data.)

Version

v3.4.11

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test
  namespace: dev
spec:
  templates:
    - name: whoami-template
      nodeSelector:
        kubernetes.io/os: windows
      metadata:
        annotations:
          sidecar.istio.io/inject: 'false'
      container:
        name: whoami-container
        image: >-
          mcr.microsoft.com/windows/servercore:ltsc2022
        command:
          - whoami
        resources: {}
        securityContext:
          capabilities:
            drop:
              - ALL
          runAsUser: 10001
          runAsNonRoot: true
          allowPrivilegeEscalation: false
          seccompProfile:
            type: RuntimeDefault
      tolerations:
        - key: windows
          operator: Exists
          effect: NoSchedule
    - name: main
      dag:
        tasks:
          - name: whoami
            template: whoami-template
  entrypoint: main
  arguments: {}
  serviceAccountName: workflow-execution
  podGC:
    strategy: OnWorkflowCompletion
  securityContext:
    seccompProfile:
      type: RuntimeDefault

Logs from the workflow controller

time="2023-09-13T12:38:16.903Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2023-09-13T12:38:16.909Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2023-09-13T12:38:16.909Z" level=info msg="Updated phase  -> Running" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=info msg="DAG node test initialized Running" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="was unable to obtain the node for test-2372842584, taskName whoami"
time="2023-09-13T12:38:16.909Z" level=warning msg="was unable to obtain the node for test-2372842584, taskName whoami"
time="2023-09-13T12:38:16.909Z" level=info msg="All of node test.whoami dependencies [] completed" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=dev workflow=test
time="2023-09-13T12:38:16.909Z" level=info msg="Pod node test-2372842584 initialized Pending" namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg="Created pod: test.whoami (test-whoami-template-2372842584)" namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:16.984Z" level=info msg="Workflow to be dehydrated" Workflow Size=2213
time="2023-09-13T12:38:17.002Z" level=info msg="Workflow update successful" namespace=dev phase=Running resourceVersion=93379873 workflow=test
time="2023-09-13T12:38:26.904Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:26.904Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:26.904Z" level=info msg="node changed" namespace=dev new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=test-2372842584 old.message= old.phase=Pending old.progress=0/1 workflow=test
time="2023-09-13T12:38:26.905Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:26.905Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:26.905Z" level=info msg="Workflow to be dehydrated" Workflow Size=2498
time="2023-09-13T12:38:26.928Z" level=info msg="Workflow update successful" namespace=dev phase=Running resourceVersion=93379981 workflow=test
time="2023-09-13T12:38:36.929Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:36.929Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:36.929Z" level=info msg="node unchanged" namespace=dev nodeID=test-2372842584 workflow=test
time="2023-09-13T12:38:36.930Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:36.930Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="node changed" namespace=dev new.message= new.phase=Running new.progress=0/1 nodeID=test-2372842584 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:48.121Z" level=info msg="Workflow to be dehydrated" Workflow Size=2482
time="2023-09-13T12:38:48.141Z" level=info msg="Workflow update successful" namespace=dev phase=Running resourceVersion=93380211 workflow=test
time="2023-09-13T12:38:56.430Z" level=info msg="Received Workflow Controller config map dev/argo-workflows-configmap-c9btt7bg2m update"
time="2023-09-13T12:38:56.435Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  azure:\n    accountKeySecret:\n      key: storage-account-access-key\n      name: storage-account-credentials\n    container: workflow-artifacts\n    endpoint: https://*******************.blob.core.windows.net\ninitialDelay: 0s\nmetricsConfig: {}\nnamespaceParallelism: 5\nnodeEvents: {}\nparallelism: 10\npodSpecLogStrategy: {}\nresourceRateLimit:\n  burst: 1\n  limit: 10\nsso:\n  clientId:\n    key: \"\"\n  clientSecret:\n    key: \"\"\n  issuer: \"\"\n  redirectUrl: \"\"\n  sessionExpiry: 0s\ntelemetryConfig: {}\n"
time="2023-09-13T12:38:56.435Z" level=info msg="Persistence configuration disabled"
time="2023-09-13T12:38:56.435Z" level=info executorImage="quay.io/argoproj/argoexec:v3.4.11" executorImagePullPolicy= managedNamespace=dev
time="2023-09-13T12:38:58.142Z" level=info msg="Processing workflow" namespace=dev workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg="Task-result reconciliation" namespace=dev numObjs=0 workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg="node unchanged" namespace=dev nodeID=test-2372842584 workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg="TaskSet Reconciliation" namespace=dev workflow=test
time="2023-09-13T12:38:58.142Z" level=info msg=reconcileAgentPod namespace=dev workflow=test
time="2023-09-13T12:38:59.505Z" level=info msg="Received Workflow Controller config map dev/argo-workflows-configmap-c9btt7bg2m update"
time="2023-09-13T12:38:59.509Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  azure:\n    accountKeySecret:\n      key: storage-account-access-key\n      name: storage-account-credentials\n    container: workflow-artifacts\n    endpoint: https://********************.blob.core.windows.net\ninitialDelay: 0s\nmetricsConfig: {}\nnamespaceParallelism: 5\nnodeEvents: {}\nparallelism: 10\npodSpecLogStrategy: {}\nresourceRateLimit:\n  burst: 1\n  limit: 10\nsso:\n  clientId:\n    key: \"\"\n  clientSecret:\n    key: \"\"\n  issuer: \"\"\n  redirectUrl: \"\"\n  sessionExpiry: 0s\ntelemetryConfig: {}\n"
time="2023-09-13T12:38:59.509Z" level=info msg="Persistence configuration disabled"
time="2023-09-13T12:38:59.509Z" level=info executorImage="quay.io/argoproj/argoexec:v3.4.11" executorImagePullPolicy= managedNamespace=dev

Logs from your workflow's wait container

time="2023-09-13T12:38:35.970Z" level=info msg="Starting Workflow Executor" version=latest+v3.4.11.dirty
time="2023-09-13T12:38:36.124Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-09-13T12:38:36.124Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=dev podName=test-whoami-template-2372842584 template="{\"name\":\"whoami-template\",\"inputs\":{},\"outputs\":{},\"nodeSelector\":{\"kubernetes.io/os\":\"windows\"},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"}},\"container\":{\"name\":\"whoami-container\",\"image\":\"mcr.microsoft.com/windows/servercore:ltsc2022\",\"command\":[\"whoami\"],\"resources\":{},\"securityContext\":{\"capabilities\":{\"drop\":[\"ALL\"]},\"runAsUser\":10001,\"runAsNonRoot\":true,\"allowPrivilegeEscalation\":false,\"seccompProfile\":{\"type\":\"RuntimeDefault\"}}},\"archiveLocation\":{\"archiveLogs\":true,\"azure\":{\"endpoint\":\"https://****************.blob.core.windows.net\",\"container\":\"workflow-artifacts\",\"accountKeySecret\":{\"name\":\"storage-account-credentials\",\"key\":\"storage-account-access-key\"},\"blob\":\"test/test-whoami-template-2372842584\"}},\"tolerations\":[{\"key\":\"windows\",\"operator\":\"Exists\",\"effect\":\"NoSchedule\"}]}" version="&Version{Version:latest+v3.4.11.dirty,BuildDate:2023-09-08T00:11:10Z,GitCommit:v3.4.11,GitTag:unknown,GitTreeState:,GoVersion:go1.20,Compiler:gc,Platform:windows/amd64,}"
time="2023-09-13T12:38:36.124Z" level=info msg="Starting deadline monitor"

Might be related to 1bcdba4 cc @cbuchli

That was also my first suspect when I went through the changelog between 3.4.8 and 3.4.11.

I'll look into that. Thanks!

@pizza-prosciutto

I can't reproduce this behaviour on my test system, and based on the code change I can't really imagine the fix caused it. Could you please try these things to help me debug:

  • Try with 2019 images and check whether the behaviour is the same
  • Downgrade Argo and try with 2022 images to check whether the behaviour is the same
  • Optional: remove the user from the securityContext so it runs as root and see if the behaviour is the same

Thank you!

@cbuchli

I am not close to my PC for one more week, so I’ll need a bit of time. Thank you!

I am having this exact same issue after upgrading our devtest environment from 3.3.7 to 3.4.10.

  • Prod version: 3.3.7
  • Test version: 3.4.10
  • Windows-based pods run indefinitely in test
  • All other workflows seem to not have any issues

I will attempt what was suggested above and report back. If memory serves, there are some compatibility issues between Argo/AKS/image OS combinations that limit my ability to do that, but I will verify.

Thanks for testing @Freddybob4244!
Could you share the Windows version of the container as well? Is it 2022 too?

Yes, it is. The Windows container is built from ltsc2022, and the K8s node pool OS is Windows 2022 as well.

Alright, I did some testing; here's what I found:

Use Case

Argo is deployed on Azure Kubernetes Service (AKS); I am testing with our devtest cluster, currently running Kubernetes version 1.26.6.

We're using Argo as a cron scheduler to run executables that used to be on-prem. The exes are pulled from an Azure file share and executed inside the Windows container. The workflows themselves are pretty basic. I know, they should be natively containerized, but we have to work with what we've got sometimes lol.

Node OS Versions

Windows 2022 node version - AKSWindows-2022-containerd-20348.1970.230914
Windows 2019 node version - AKSWindows-2019-containerd-17763.4851.230914

Version 3.4.8

  • Windows 2019 image / 2019 node OS - pods schedule, run, complete, and exit
  • Windows 2022 image / 2019 node OS - image pull error for argoexec 3.4.8
  • Windows 2019 image / 2022 node OS - Error: failed to create containerd task: failed to create shim task: hcs::CreateComputeSystem main: The container operating system does not match the host operating system.: unknown
  • Windows 2022 image / 2022 node OS - image pull error for argoexec 3.4.8

Version 3.4.9

  • Windows 2019 image / 2019 node OS - Failed to pull image "quay.io/argoproj/argoexec:v3.4.9": rpc error: code = NotFound desc = failed to pull and unpack image "quay.io/argoproj/argoexec:v3.4.9": no match for platform in manifest: not found
  • Windows 2022 image / 2019 node OS - Failed to pull image "quay.io/argoproj/argoexec:v3.4.9": rpc error: code = NotFound desc = failed to pull and unpack image "quay.io/argoproj/argoexec:v3.4.9": no match for platform in manifest: not found
  • Windows 2019 image / 2022 node OS - schedules but doesn't run the exe; the argo-server pod logs show: time="2023-09-30T05:00:21.780Z" level=info msg="finished streaming call with code Unauthenticated" error="rpc error: code = Unauthenticated desc = token not valid for running mode" grpc.code=Unauthenticated grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2023-09-30T05:00:21Z" grpc.time_ms=0.059 span.kind=server system=grpc
  • Windows 2022 image / 2022 node OS - schedules, runs, completes, and exits

Version 3.4.10

  • Windows 2022 image / 2022 node OS - schedules, runs, completes, but does not exit

It appears that the jump from 3.4.9 to 3.4.10 contains the breaking change; what that is, I haven't looked into yet this evening. I feel like all of this info might prove useful.

Let me know if you want me to run any more tests. We're on 3.4.9 for now, using 2022/2022 base image and node OS versions, and things seem to be running okay. I will keep an eye on our Windows-based workflows this weekend and report back. I don't anticipate any issues, but I will update this thread regardless.

@cbuchli
@terrytangyuan
@pizza-prosciutto

@Freddybob4244 Yes, this is very useful, thank you very much.

Note to self: investigate #11368

Alright, I've got an update: I am seeing some workflows intermittently continue to run forever, but for the most part it seems to be going okay.

Specifically, I am seeing this event each time this occurs:

402 FailedKillPod: error killing pod: failed to "KillPodSandbox" for "ea69430f-d929-405d-9c7d-b48b0fb3f003" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to remove network namespace for sandbox "208c7146b19805606a52ff0b420809cf800de99cf66d5d60336708eed6a67d9f": hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {"Success":false,"Error":"The specified request is unsupported. ","ErrorCode":2151350293}"

I was able to find what appear to be some similar events reported in the Calico project.

I'll look into these and see if I can mitigate this issue.

@Freddybob4244 your issue sounds more like a general containerd/Windows issue than this issue specifically. Good to know: Argo Workflows changed the default image build from the 1809 base to ltsc2022 in 3.4.9, which might have something to do with it.

@pizza-prosciutto your issue, however, I can confirm. I will look into it tomorrow and try to confirm which change broke Windows support.

I was able to fix this issue by commenting out the changes made in #11368. However, I'm not yet sure why.
Therefore, the fix provided in #11276 is not the cause of the issue.

We may fix this with an OS-specific change so that Windows does not wait, but I'd like to understand it more before doing that.
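
For context on how an OS-specific fix would be wired up: Go selects per-OS implementations via build constraints, so a Windows-only change can live in a file that is compiled only on Windows. A minimal, hypothetical sketch of that mechanism follows (the package name and function body are illustrative assumptions, not the actual Argo code):

//go:build windows

package osspecific // illustrative name, not the real Argo package

import (
	"fmt"
	"os"
)

// Wait waits for an already-started process exactly once and reports a
// non-zero exit as an error. The key idea behind the proposed fix is that
// on Windows only one place should ever call Wait() on the process.
func Wait(process *os.Process) error {
	stat, err := process.Wait()
	if err != nil {
		return err
	}
	if !stat.Success() {
		return fmt.Errorf("process exited unsuccessfully: %s", stat)
	}
	return nil
}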

@boiledfroginthewell Do you have any insights maybe?

@mweibel @cbuchli

Thanks for the insight! I am happy to do any testing with any patches or Windows image changes if it will help resolve the problem. I am following this issue, so I should be able to be fairly responsive.

I found that Wait() is called twice on Windows: once in the snippet below and once in #11368.

func Wait(process *os.Process) error {
	stat, err := process.Wait()

I suspect that the multiple Wait() calls cause this issue. The following script hung on Windows 8.1 running inside VirtualBox.

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	cmd := exec.Command("whoami")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// First wait, like the process.Wait() inside the os-specific Wait() above.
	cmd.Process.Wait()
	fmt.Printf("wait 1\n")

	// Second wait, like the extra Wait() introduced in #11368.
	// With both waits present, this script hung in my test.
	cmd.Wait()
	fmt.Printf("wait 2\n")

	fmt.Printf("done\n")
}

@boiledfroginthewell yeah, that's the issue indeed.

I contributed a fix in #11993, which simply removes the second Wait on Windows. I also opened #11994 to generally improve Windows CI runs, since they don't currently seem to run automatically.
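
For anyone following along, here is roughly what the wait-once pattern boils down to in plain Go (a minimal sketch, not the actual patch in #11993):

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	cmd := exec.Command("whoami")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// cmd.Wait() already waits on cmd.Process internally, so the process
	// is waited on exactly once - no separate cmd.Process.Wait() call
	// beforehand, unlike in the repro script above.
	if err := cmd.Wait(); err != nil {
		fmt.Printf("command failed: %v\n", err)
		return
	}
	fmt.Println("done")
}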

Thank you guys for the contribution, especially @mweibel for the fix! 👍

So far this seems to have resolved my issues as well - thanks all!