kubernetes-sigs/gcp-compute-persistent-disk-csi-driver

NodeStage on arm64 occasionally failing while running e2e tests

gnufied opened this issue · 7 comments

We have noticed that occasionally a few of our e2e tests fail with the following error when running against the gcp-pd CSI driver:

2023-09-17T11:45:11.746340654Z W0917 11:45:11.746267       1 device-utils.go:235] For disk pvc-9794d07f-947c-4218-b03b-b0917af27118 couldn't find a device path, calling udevadmTriggerForDiskIfExists

2023-09-17T11:45:11.752561113Z E0917 11:45:11.752536       1 device-utils.go:332] failed to get serial num for disk pvc-9794d07f-947c-4218-b03b-b0917af27118 at device path /dev/nvme-fabrics: google_nvme_id failed for device "/dev/nvme-fabrics" with output [91 50 48 50 51 45 48 57 45 49 55 84 49 49 58 52 53 58 49 49 43 48 48 48 48 93 58 32 80 97 115 115 101 100 32 100 101 118 105 99 101 32 119 97 115 32 110 111 116 32 97 110 32 78 86 77 101 32 100 101 118 105 99 101 46 32 32 40 89 111 117 32 109 97 121 32 110 101 101 100 32 116 111 32 114 117 110 32 116 104 105 115 32 115 99 114 105 112 116 32 97 115 32 114 111 111 116 47 119 105 116 104 32 115 117 100 111 41 46 10]: exit status 1

2023-09-17T11:45:11.765040132Z E0917 11:45:11.764999       1 device-utils.go:332] failed to get serial num for disk pvc-9794d07f-947c-4218-b03b-b0917af27118 at device path /dev/nvme0: google_nvme_id output cannot be parsed: "get-namespace-id: Inappropriate ioctl for device\nxxd: sorry cannot seek.\n[2023-09-17T11:45:11+0000]: NVMe Vendor Extension disk information not present\n"

2023-09-17T11:45:11.819252548Z W0917 11:45:11.819203       1 device-utils.go:340] udevadm --trigger running to fix disk at path /dev/nvme0n3 which has serial number pvc-9794d07f-947c-4218-b03b-b0917af27118
2023-09-17T11:45:11.848832605Z E0917 11:45:11.848793       1 utils.go:74] /csi.v1.Node/NodeStageVolume returned with error: rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-9794d07f-947c-4218-b03b-b0917af27118") is attached: failed to find and re-link disk pvc-9794d07f-947c-4218-b03b-b0917af27118 with udevadm after retrying for 3s: timed out waiting for the condition

The byte slice in the first error is just Go's default rendering of the script's raw output; decoded, it reads: "[2023-09-17T11:45:11+0000]: Passed device was not an NVMe device.  (You may need to run this script as root/with sudo)."

Once the disk enters this state it does not recover, and hence the test fails.
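The closing "timed out waiting for the condition" text is the standard error string from the k8s.io/apimachinery/pkg/util/wait package, which suggests the driver polls for the device symlink for a bounded window (the 3s in the message) after triggering udevadm and then gives up. A minimal sketch of that polling pattern, not the driver's actual code; findDevicePathBySerial is a hypothetical stand-in for its real lookup:

    package device

    import (
        "path/filepath"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // findDevicePathBySerial is a hypothetical stand-in for the driver's real
    // lookup. GCE PDs normally appear as /dev/disk/by-id/google-<name>, a
    // symlink to the real node (/dev/nvme0nX here); resolve that link.
    func findDevicePathBySerial(serial string) (string, error) {
        return filepath.EvalSymlinks("/dev/disk/by-id/google-" + serial)
    }

    // waitForDevicePath polls for up to 3s. wait.PollImmediate returns
    // wait.ErrWaitTimeout ("timed out waiting for the condition") if the
    // symlink never appears, which is exactly the error in the log above.
    func waitForDevicePath(serial string) (string, error) {
        var devicePath string
        err := wait.PollImmediate(100*time.Millisecond, 3*time.Second, func() (bool, error) {
            path, err := findDevicePathBySerial(serial)
            if err != nil {
                return false, nil // symlink missing or dangling: keep polling
            }
            devicePath = path
            return true, nil
        })
        return devicePath, err
    }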

We're encountering similar issues with arm64 (t2a) instances. We have a StatefulSet with 22 replicas that initially worked well on ARM instances. After a day of running normally, I found all the pods stuck in the Init state, with the same error described in the original post:

Events:
  Type     Reason                  Age                  From                                   Message
  ----     ------                  ----                 ----                                   -------
  Warning  FailedScheduling        6m16s                gke.io/optimize-utilization-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) were unschedulable. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
  Normal   TriggeredScaleUp        6m6s                 cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/.../zones/us-central1-a/instanceGroups/gke-t2a-standard-4-d-46033c64-grp 1->2 (max: 50)}]
  Warning  FailedScheduling        5m5s (x2 over 5m7s)  gke.io/optimize-utilization-scheduler  0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 1 node(s) had volume node affinity conflict. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
  Normal   Scheduled               5m1s                 gke.io/optimize-utilization-scheduler  Successfully assigned .../pod-0 to gke-t2a-standard-4-d-46033c64-wwxc
  Normal   SuccessfulAttachVolume  4m56s                attachdetach-controller                AttachVolume.Attach succeeded for volume "pvc-c2b0795e-570a-4385-8bb3-fd72f241a1f4"
  Warning  FailedMount             7s (x10 over 4m47s)  kubelet                                MountVolume.MountDevice failed for volume "pvc-c2b0795e-570a-4385-8bb3-fd72f241a1f4" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-c2b0795e-570a-4385-8bb3-fd72f241a1f4") is attached: failed to find and re-link disk pvc-c2b0795e-570a-4385-8bb3-fd72f241a1f4 with udevadm after retrying for 3s: timed out waiting for the condition

If I change the StatefulSet definition so the pods are scheduled onto amd64 instances (n2 in this case), the same disks mount and the same workload starts normally:

Events:
  Type     Reason                  Age    From                                   Message
  ----     ------                  ----   ----                                   -------
  Warning  FailedScheduling        2m28s  gke.io/optimize-utilization-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) had untolerated taint {kubernetes.io/arch: arm64}. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling..
  Normal   TriggeredScaleUp        2m22s  cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/.../zones/us-central1-a/instanceGroups/gke-n2-standard-4-23-aac8a5e0-grp 1->2 (max: 2)}]
  Warning  FailedScheduling        2m10s  gke.io/optimize-utilization-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
...
  Normal   NotTriggerScaleUp       90s    cluster-autoscaler                     pod didn't trigger scale-up: 1 node(s) had untolerated taint {kubernetes.io/arch: arm64}, 1 max node group size reached
  Normal   Scheduled               85s    gke.io/optimize-utilization-scheduler  Successfully assigned .../pod-0 to gke-n2-standard-4-23-aac8a5e0-lvr7
  Normal   SuccessfulAttachVolume  78s    attachdetach-controller                AttachVolume.Attach succeeded for volume "pvc-c2b0795e-570a-4385-8bb3-fd72f241a1f4"
  Normal   Pulling                 77s    kubelet                                Pulling image "us-west1-docker.pkg.dev/..."
  Normal   Pulled                  71s    kubelet                                Successfully pulled image "us-west1-docker.pkg.dev/..." (5.507763518s including waiting)
  Normal   Created                 71s    kubelet                                Created container init
  Normal   Started                 71s    kubelet                                Started container init
  Normal   Pulling                 64s    kubelet                                Pulling image "us-west1-docker.pkg.dev/..."
  ...

The issue has been root-caused to a bug in google_nvme_id introduced in GoogleCloudPlatform/guest-configs#49. The change has been reverted.
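For context on the parse failure in the logs: google_nvme_id emits udev-style KEY=value pairs, and the driver pulls the PD name out of the ID_SERIAL_SHORT line; when the buggy script printed nothing usable, that extraction failed. A rough sketch of the idea, with an illustrative regular expression rather than necessarily the driver's exact one:

    package device

    import (
        "fmt"
        "regexp"
    )

    // google_nvme_id (from guest-configs) prints lines such as
    //   ID_SERIAL_SHORT=pvc-9794d07f-947c-4218-b03b-b0917af27118
    // for a PD-backed NVMe device. No match means there is no serial to
    // link the device by, which surfaces as the error seen in the logs.
    var serialRe = regexp.MustCompile(`ID_SERIAL_SHORT=(.+)`)

    func parseSerial(output string) (string, error) {
        m := serialRe.FindStringSubmatch(output)
        if len(m) != 2 {
            return "", fmt.Errorf("google_nvme_id output cannot be parsed: %q", output)
        }
        return m[1], nil
    }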

The affected versions of guest-configs are:

  • 20230515.00
  • 20230522.00
  • 20230526.00

All versions after 20230626.00 should not contain the breaking change.
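To check whether a given node carries the fix, one quick probe is to run the script by hand against an attached PD and see whether it prints a serial. A sketch under two assumptions: that the script lives at /lib/udev/google_nvme_id (as on stock GCE images) and that /dev/nvme0n2 is an attached PD; adjust both for your node, and treat the -d flag as the stock script's device option:

    package main

    import (
        "fmt"
        "os/exec"
    )

    func main() {
        // On an affected guest-configs version this fails or prints nothing
        // useful; on a fixed version it prints ID_SERIAL_SHORT=<pd-name>.
        out, err := exec.Command("/lib/udev/google_nvme_id", "-d", "/dev/nvme0n2").CombinedOutput()
        fmt.Printf("err=%v\n%s", err, out)
    }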

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

/close

The comment by @msau42 should resolve this issue.

@mattcary: Closing this issue.

In response to this:

/close

The comment by @msau42 should resolve this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.