kubernetes-sigs/gcp-compute-persistent-disk-csi-driver

Container csi-driver-registrar restarts when gce-pd-driver starts slower

awx-fuyuanchu opened this issue · 4 comments

We've met an issue on GKE when a new node is scaled up by the cluster-autoscaler.

The container csi-driver-registrar in the pod pdcsi-node could be restarted if gce-pd-driver starts slower. The csi-driver-registrar fails to call the csi driver to get the driver info since the container gce-pd-driver is not ready.

This is the log before csi-driver-registrar restarts

Defaulted container "csi-driver-registrar" out of: csi-driver-registrar, gce-pd-driver
I0219 10:31:33.712910       1 main.go:135] Version: v2.9.0-gke.3-0-g098349f5
I0219 10:31:33.712983       1 main.go:136] Running node-driver-registrar in mode=
I0219 10:31:33.712988       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0219 10:31:33.713005       1 connection.go:213] Connecting to unix:///csi/csi.sock
W0219 10:31:43.714057       1 connection.go:232] Still connecting to unix:///csi/csi.sock
W0219 10:31:53.713714       1 connection.go:232] Still connecting to unix:///csi/csi.sock
W0219 10:32:03.714076       1 connection.go:232] Still connecting to unix:///csi/csi.sock
E0219 10:32:03.714104       1 main.go:160] error connecting to CSI driver: context deadline exceeded

This is the log of container gce-pd-driver

I0219 10:32:06.282916       1 main.go:90] Sys info: NumCPU: 16 MAXPROC: 1
I0219 10:32:06.283058       1 main.go:95] Driver vendor version v1.8.16-gke.0
I0219 10:32:06.283255       1 mount_linux.go:208] Detected OS without systemd
I0219 10:32:06.285310       1 gce-pd-driver.go:96] Enabling volume access mode: SINGLE_NODE_WRITER
I0219 10:32:06.285322       1 gce-pd-driver.go:96] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0219 10:32:06.285325       1 gce-pd-driver.go:96] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0219 10:32:06.285328       1 gce-pd-driver.go:106] Enabling controller service capability: CREATE_DELETE_VOLUME
I0219 10:32:06.285332       1 gce-pd-driver.go:106] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0219 10:32:06.285334       1 gce-pd-driver.go:106] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0219 10:32:06.285337       1 gce-pd-driver.go:106] Enabling controller service capability: LIST_SNAPSHOTS
I0219 10:32:06.285339       1 gce-pd-driver.go:106] Enabling controller service capability: PUBLISH_READONLY
I0219 10:32:06.285342       1 gce-pd-driver.go:106] Enabling controller service capability: EXPAND_VOLUME
I0219 10:32:06.285345       1 gce-pd-driver.go:106] Enabling controller service capability: LIST_VOLUMES
I0219 10:32:06.285352       1 gce-pd-driver.go:106] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0219 10:32:06.285357       1 gce-pd-driver.go:106] Enabling controller service capability: CLONE_VOLUME
I0219 10:32:06.285360       1 gce-pd-driver.go:116] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0219 10:32:06.285363       1 gce-pd-driver.go:116] Enabling node service capability: EXPAND_VOLUME
I0219 10:32:06.285366       1 gce-pd-driver.go:116] Enabling node service capability: GET_VOLUME_STATS
I0219 10:32:06.285374       1 gce-pd-driver.go:167] Driver: pd.csi.storage.gke.io
I0219 10:32:06.285409       1 server.go:106] Start listening with scheme unix, addr /csi/csi.sock
I0219 10:32:06.285640       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0219 10:32:07.354375       1 utils.go:66] /csi.v1.Identity/GetPluginInfo called with request:
I0219 10:32:07.354472       1 utils.go:75] /csi.v1.Identity/GetPluginInfo returned with response: name:"pd.csi.storage.gke.io" vendor_version:"v1.8.16-gke.0"
I0219 10:32:07.997047       1 utils.go:66] /csi.v1.Node/NodeGetInfo called with request:
I0219 10:32:07.997153       1 utils.go:75] /csi.v1.Node/NodeGetInfo returned with response: node_id:"projects/platform-prod-f112f5ae/zones/asia-southeast1-b/instances/gke-platform-prod-sg-prod-node-pool-ca34813d-jmf9" max_volumes_per_node:127 accessible_topology:<segments:<key:"topology.gke.io/zone" value:"asia-southeast1-b" > >

This is the pod status

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:31:22Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:32:08Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:32:08Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-02-19T10:31:22Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://be82a448e9c769a6b6528b0b0aa74aecc29cc26991d8a901ec60c5522f914417
    image: sha256:93a8638b2f16576ce0312a90508183f8176fb4c047cfa3c4fadf82dd909fd136
    imageID: gke.gcr.io/csi-node-driver-registrar@sha256:cce2f1e7860a0996711f55c4bab363f0a1a7f3ef7083017471c08e65d9c9c6e1
    lastState:
      terminated:
        containerID: containerd://0d7b5b05876edc763372039cfb5a3932d70a399ee037044667efa81943abbf0d
        exitCode: 1
        finishedAt: "2024-02-19T10:32:03Z"
        reason: Error
        startedAt: "2024-02-19T10:31:33Z"
    name: csi-driver-registrar
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2024-02-19T10:32:07Z"
  - containerID: containerd://f7bbaf294d1ee2bbdbbb5cc774389be7be947f13528369ce42f263f9d77dc347
    image: sha256:238d43027da3411f1c9177be11d4e7b73e6f2dcaed31cbee69172386314e3899
    imageID: gke.gcr.io/gcp-compute-persistent-disk-csi-driver@sha256:fde2f273b43b9a5e075276a1b33a4199468dd7e4475e107e9a06a2f049e7a2cd
    lastState: {}
    name: gce-pd-driver
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-02-19T10:32:06Z"
  hostIP: 10.72.6.130
  phase: Running
  podIP: 10.72.6.130
  podIPs:
  - ip: 10.72.6.130
  qosClass: Burstable
  startTime: "2024-02-19T10:31:22Z"