kubernetes-sigs/gcp-compute-persistent-disk-csi-driver

pdcsi 0.14.9 - failed to find and re-link disk with udevadm

devoxel opened this issue · 7 comments

We are hitting an issue similar to #608 in a GKE cluster running version 1.25.8-gke.1000 with a Spot VM node pool. The last comment on that issue suggested it's best to open a new bug when discussing this.

Some Background

  • Probably unimportant, but for completeness: the workload is ECK. Suffice to say, it deploys a StatefulSet which ended up having these issues.
  • We noticed unusual behavior with the cluster yesterday, July 4th. Lots of pods were encountering NetworkNotReady events and being rescheduled. We have seen this before, especially with GKE Spot VMs, and in some cases had to drain the node. That wasn't necessary here: the pods came back up, it was just happening periodically.
  • We didn't think too much about these issues. These are Spot VMs, so we expect to see some weirdness, and while this was worse than we've seen before, we just let it continue. Personally, I was more interested in how the workload was responding to it.
  • Below is a snapshot of the Events from a pod in the StatefulSet while this was ongoing.
 Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               56m                default-scheduler        Successfully assigned elasticsearch/eck-es-datanodes-1-2 to gke-cluster-1-spot-default-pool-19a36888-6cgl
  Warning  FailedAttachVolume      56m                attachdetach-controller  Multi-Attach error for volume "pvc-cbaea1e9-180d-4917-af81-3574b7f6b5dd" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  56m                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-cbaea1e9-180d-4917-af81-3574b7f6b5dd"
  Normal   Pulled                  56m                kubelet                  Container image "docker.elastic.co/elasticsearch/elasticsearch:7.17.7" already present on machine
  Normal   Created                 56m                kubelet                  Created container elastic-internal-init-filesystem
  <skipping>
  Warning  Unhealthy               55m                kubelet                  Readiness probe failed: {"timestamp": "2023-07-04T11:32:24+00:00", "message": "readiness probe failed", "curl_rc": "7"}
  Warning  Unhealthy               55m                kubelet                  Readiness probe failed: {"timestamp": "2023-07-04T11:32:29+00:00", "message": "readiness probe failed", "curl_rc": "7"}
  Normal   Killing                 24m                kubelet                  Stopping container elasticsearch
  Warning  ExceededGracePeriod     24m                kubelet                  Container runtime did not kill the pod within specified grace period.
  Warning  FailedMount             22m (x2 over 22m)  kubelet                  MountVolume.SetUp failed for volume "elastic-internal-xpack-file-realm" : object "elasticsearch"/"eck-es-xpack-file-realm" not registered
  Warning  NetworkNotReady         22m (x2 over 22m)  kubelet                  network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

The Bug

  • Fast forward to today: things seem mostly recovered from yesterday's noise, except one pod is stuck in PodInitializing.
  • We see the error from the bug title:
Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  10m (x178 over 6h29m)  kubelet  MountVolume.MountDevice failed for volume "pvc-1bdf012f-f199-4761-b71a-7929d19e3e44" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("gke-cluster-1-56ca57ad-pvc-1bdf012f-f199-4761-b71a-7929d19e3e44") is attached: failed to find and re-link disk gke-cluster-1-56ca57ad-pvc-1bdf012f-f199-4761-b71a-7929d19e3e44 with udevadm after retrying for 3s: for disk gke-cluster-1-56ca57ad-pvc-1bdf012f-f199-4761-b71a-7929d19e3e44 failed to trigger udevadm fix of non existent device path: udevadm --trigger requested to fix disk gke-cluster-1-56ca57ad-pvc-1bdf012f-f199-4761-b71a-7929d19e3e44 but no such disk was found in device path map[/dev/sda:persistent-disk-0 /dev/sda1:persistent-disk-0 /dev/sda10:persistent-disk-0 /dev/sda11:persistent-disk-0 /dev/sda12:persistent-disk-0 /dev/sda2:persistent-disk-0 /dev/sda3:persistent-disk-0 /dev/sda4:persistent-disk-0 /dev/sda5:persistent-disk-0 /dev/sda6:persistent-disk-0 /dev/sda7:persistent-disk-0 /dev/sda8:persistent-disk-0 /dev/sda9:persistent-disk-0]
  Warning  FailedMount  48s (x187 over 6h29m)  kubelet  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[elasticsearch-data], unattached volumes=[elastic-internal-remote-certificate-authorities elastic-internal-elasticsearch-plugins-local elastic-internal-unicast-hosts tmp-volume elastic-internal-xpack-file-realm elastic-internal-scripts elasticsearch-data elasticsearch-logs downward-api elastic-internal-probe-user elastic-internal-elasticsearch-config elastic-internal-elasticsearch-bin-local elastic-internal-http-certificates elastic-internal-transport-certificates elastic-internal-elasticsearch-config-local elastic-internal-secure-settings]: timed out waiting for the condition
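
As far as I can tell, the device path map in that error only contains the boot disk (every /dev/sda* partition maps to persistent-disk-0), so the node never surfaced a block device for the attached PD at all, which is why the udevadm trigger has nothing to act on. For anyone else debugging this, below is a rough node-side check I'd use. It is a hypothetical sketch, not driver code, and it assumes the GCE udev rules publish attached PDs as google-<disk-name> symlinks under /dev/disk/by-id:

```go
// Hypothetical diagnostic (not part of the driver): list the google-* symlinks
// under /dev/disk/by-id and report whether the PD name from the mount error
// resolves to a block device on this node.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	const byID = "/dev/disk/by-id"
	want := os.Args[1] // the disk name from the error message

	entries, err := os.ReadDir(byID)
	if err != nil {
		panic(err)
	}
	found := false
	for _, e := range entries {
		name := e.Name()
		if !strings.HasPrefix(name, "google-") {
			continue
		}
		// Resolve the symlink to the underlying block device (e.g. /dev/sdb).
		target, _ := filepath.EvalSymlinks(filepath.Join(byID, name))
		fmt.Printf("%s -> %s\n", name, target)
		if strings.TrimPrefix(name, "google-") == want {
			found = true
		}
	}
	if !found {
		fmt.Printf("no google-%s symlink: udev never published the attached PD\n", want)
	}
}
```

In our case I'd expect this to print only the boot disk entries, matching the map in the error above.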

Here's the json of the csidriver object on the cluster:

{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "storage.k8s.io/v1",
            "kind": "CSIDriver",
            "metadata": {
                "annotations": {
                    "components.gke.io/component-name": "pdcsi",
                    "components.gke.io/component-version": "0.14.9",
                    "components.gke.io/layer": "addon"
                },
                "creationTimestamp": "2023-04-11T15:47:20Z",
                "labels": {
                    "addonmanager.kubernetes.io/mode": "Reconcile",
                    "k8s-app": "gcp-compute-persistent-disk-csi-driver"
                },
                "name": "pd.csi.storage.gke.io",
                "resourceVersion": "1688210485",
                "uid": "0162fe97-be9e-4b1b-9bfc-b73c16e6b192"
            },
            "spec": {
                "attachRequired": true,
                "fsGroupPolicy": "ReadWriteOnceWithFSType",
                "podInfoOnMount": false,
                "requiresRepublish": false,
                "storageCapacity": false,
                "volumeLifecycleModes": [
                    "Persistent"
                ]
            }
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": ""
    }
}

I haven't deleted the pod yet. This is a playground cluster and this sts can tolerate a single failure. I expect deleting the pod will fix the issue - but if requested I can verify this. I thought it might be more useful to keep the pod / node for direct debugging (but it's a spot VM so it could just be yanked away).

I'm also opening a support issue for this, but these things probably will move faster if I report it directly here.

I'm seeing the same error with version 1.28.1-gke.1066000. We have different node pools running a similar workload, and we are hitting this very frequently on a node pool running ARM instances (t2a). In a StatefulSet of 22 replicas, most of them failed simultaneously with this error when the PVCs were expected to be expanded.

We've tried upgrading the GKE version (twice) with the same results. Replacing the VMs, deleting the pods, ... does not work either. The disks seem to be healthy and can be mounted on the VM.

Events:
  Type     Reason                  Age                  From                                   Message
  ----     ------                  ----                 ----                                   -------
  Warning  FailedScheduling        6m16s                gke.io/optimize-utilization-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) were unschedulable. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
  Normal   TriggeredScaleUp        6m6s                 cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/.../zones/us-central1-a/instanceGroups/gke-t2a-standard-4-d-46033c64-grp 1->2 (max: 50)}]
  Warning  FailedScheduling        5m5s (x2 over 5m7s)  gke.io/optimize-utilization-scheduler  0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 1 node(s) had volume node affinity conflict. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
  Normal   Scheduled               5m1s                 gke.io/optimize-utilization-scheduler  Successfully assigned .../pod-0 to gke-t2a-standard-4-d-46033c64-wwxc
  Normal   SuccessfulAttachVolume  4m56s                attachdetach-controller                AttachVolume.Attach succeeded for volume "pvc-c2b0795e-570a-4385-8bb3-fd72f241a1f4"
  Warning  FailedMount             7s (x10 over 4m47s)  kubelet                                MountVolume.MountDevice failed for volume "pvc-c2b0795e-570a-4385-8bb3-fd72f241a1f4" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-c2b0795e-570a-4385-8bb3-fd72f241a1f4") is attached: failed to find and re-link disk pvc-c2b0795e-570a-4385-8bb3-fd72f241a1f4 with udevadm after retrying for 3s: timed out waiting for the condition

I think in our case the issue is more closely related to #1398.

Hi,

We had the same issue this Saturday while we were upgrading GKE to v1.27.4-gke.900.
Three pods were failing to start because their PVs had this error:

Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  7m22s (x44 over 83m)  kubelet  MountVolume.MountDevice failed for volume "pvc-b2464e2e-b1f9-45e2-aef3-dba71ed061ce" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-b2464e2e-b1f9-45e2-aef3-dba71ed061ce") is attached: failed to find and re-link disk pvc-b2464e2e-b1f9-45e2-aef3-dba71ed061ce with udevadm after retrying for 3s: timed out waiting for the condition
  Warning  FailedMount  116s (x36 over 81m)   kubelet  Unable to attach or mount volumes: unmounted volumes=[storage-volume], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition

btw, @devoxel, restarting the pod does not help. We even tried cordoning the node where the pod was running and deleting the pod so that it was scheduled to another node, but that didn't help. Same error.

As this is a prod cluster, we had no choice but to delete the PVs and get new ones; luckily for us, the data was replicated in other PVs.

Hi,

Same issue: multiple nodes in the cluster started hitting this error when pods were rescheduled. We rescheduled the pods and provisioned new nodes, but these PVs will not mount.

  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Normal   Scheduled               16m                 default-scheduler        Successfully assigned keramik-smoke-basic-rust/cas-postgres-0 to gke-ipfs-ceramic-service-default-pool-70450b14-zmqx
  Warning  FailedAttachVolume      16m                 attachdetach-controller  Multi-Attach error for volume "pvc-edd6fb61-615b-4cd6-ad89-a0ec00febde4" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  16m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-edd6fb61-615b-4cd6-ad89-a0ec00febde4"
  Warning  FailedMount             59s (x15 over 16m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-edd6fb61-615b-4cd6-ad89-a0ec00febde4" : rpc error: code = Internal desc = Error when getting device path: rpc error: code = Internal desc = error verifying GCE PD ("pvc-edd6fb61-615b-4cd6-ad89-a0ec00febde4") is attached: failed to find and re-link disk pvc-edd6fb61-615b-4cd6-ad89-a0ec00febde4 with udevadm after retrying for 3s: timed out waiting for the condition
  Warning  FailedMount             44s (x7 over 14m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[postgres-data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition

There are errors about the serial numbers in the gce-pd-driver logs:

I1207 14:39:04.475652       1 device-utils.go:335] device path /dev/nvme0n12, serial number pvc-edd6fb61-615b-4cd6-ad89-a0ec00febde4
W1207 14:39:04.475662       1 device-utils.go:340] udevadm --trigger running to fix disk at path /dev/nvme0n12 which has serial number pvc-edd6fb61-615b-4cd6-ad89-a0ec00febde4
W1207 14:39:04.945260       1 device-utils.go:235] For disk pvc-edd6fb61-615b-4cd6-ad89-a0ec00febde4 couldn't find a device path, calling udevadmTriggerForDiskIfExists
E1207 14:39:04.951117       1 device-utils.go:332] failed to get serial num for disk pvc-edd6fb61-615b-4cd6-ad89-a0ec00febde4 at device path /dev/nvme0: google_nvme_id output cannot be parsed: "NVMe status: INVALID_NS: The namespace or the format of that namespace is invalid(0xb)\nxxd: sorry cannot seek.\n[2023-12-07T14:39:04+0000]: NVMe Vendor Extension disk information not present\n"

Just adding a +1 here: I've seen this, but curiously only in GKE ARM clusters.