Failed to recover after node took the drive (volume) offline
TL;DR
Hi there,
we had a Kubernetes volume autoscaling mechanism (devops-nirvana) active in our cluster. For a while, the autoscaling worked just fine. At some point the operating system of the Kubernetes node took the drive offline. The CSI driver failed to recognize the offline state and kept increasing the PVC.
Expected behavior
The CSI driver should be able to react to offline drives by bringing the drive back online, so that the subsequent resize2fs actually succeeds. Alternatively, the driver should refuse to expand the PVC while the corresponding drive is offline (a rough sketch of such a pre-expand check is shown below).
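For illustration only, here is a minimal sketch (not the actual driver code) of what a pre-expand guard could look like. It assumes a SCSI-backed block device like the /dev/sdb from the kernel log below, whose state the kernel exposes under /sys/block/<name>/device/state; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// deviceIsOnline checks the kernel's SCSI device state for a block device
// such as /dev/sdb by reading /sys/block/<name>/device/state, which reports
// "running" for a healthy device and "offline" after the kernel offlines it.
func deviceIsOnline(devicePath string) (bool, error) {
	name := filepath.Base(devicePath) // e.g. "sdb"
	raw, err := os.ReadFile(filepath.Join("/sys/block", name, "device/state"))
	if err != nil {
		return false, fmt.Errorf("reading device state for %s: %w", name, err)
	}
	return strings.TrimSpace(string(raw)) == "running", nil
}

// expandIfOnline is a hypothetical guard before running resize2fs during
// volume expansion: fail the expansion instead of reporting success while
// the drive is offline.
func expandIfOnline(devicePath string, resize func(string) error) error {
	online, err := deviceIsOnline(devicePath)
	if err != nil {
		return err
	}
	if !online {
		return fmt.Errorf("refusing to expand %s: kernel reports the device as offline", devicePath)
	}
	return resize(devicePath)
}

func main() {
	err := expandIfOnline("/dev/sdb", func(dev string) error {
		fmt.Println("would run resize2fs on", dev)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```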
Observed behavior
PVC got increased.
PV got increased.
resize2fs was triggered but failed, and the next iteration started (the autoscaler checks the existing quota, i.e. the actual filesystem capacity, which did not change, so the PVC got increased again and again; see the sketch after this list).
The CSI driver seems to be completely unaware of these problems (see the second log snippet).
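For context, a hedged sketch of the kind of check the autoscaler relies on: the filesystem capacity reported by statfs for the mount point. In this incident that value never grew, because resize2fs kept failing, which is why the expansion was retried indefinitely. The mount path /export is taken from the MinIO log below; everything else is illustrative and Linux-specific.

```go
package main

import (
	"fmt"
	"syscall"
)

// filesystemSizeBytes returns the total size of the filesystem mounted at
// path, as seen by statfs. After a successful resize2fs this value grows;
// in the failure described above it stayed constant, so the autoscaler
// kept requesting ever larger volumes.
func filesystemSizeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Blocks * uint64(st.Bsize), nil
}

func main() {
	size, err := filesystemSizeBytes("/export")
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	fmt.Printf("filesystem size: %d bytes\n", size)
}
```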
In the MinIO pod log we saw these messages:
Error: node(: taking drive /export offline: unable to write+read for 2m0.001s (*errors.errorString)
3: internal/logger/logger.go:248:logger.LogAlwaysIf()
2: cmd/xl-storage-disk-id-check.go:1066:cmd.(*xlStorageDiskIDCheck).monitorDiskWritable.func1.1()
1: cmd/xl-storage-disk-id-check.go:1085:cmd.(*xlStorageDiskIDCheck).monitorDiskWritable.func1.2()
Minimal working example
No response
Log output
Looked up via journalctl:
Apr 07 20:09:58 node-name kernel: sd 2:0:0:1: Capacity data has changed
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 abort
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: device reset
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 abort
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: Device offlined - not ready after error recovery
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=20s
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 CDB: Write(10) 2a 00 12 a7 d7 e0 00 00 20 00
Apr 07 20:10:19 node-name kernel: blk_update_request: I/O error, dev sdb, sector 312989664 op 0x1:(WRITE) flags 0x800 phys_seg 4 prio class 0
[... many more errors like that ... ]
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: rejecting I/O to offline device
=====
Apr 07 19:30:19 node-name kubelet[919]: I0407 19:30:19.030108 919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvc-id>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<pod-name>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Apr 07 19:45:49 node-name kernel: EXT4-fs (sdb): resizing filesystem from 26738688 to 34865152 blocks
Apr 07 19:45:49 node-name kernel: EXT4-fs (sdb): resized filesystem to 34865152
Apr 07 19:45:49 node-name kubelet[919]: I0407 19:45:49.435021 919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvcid>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<podname>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Apr 07 19:57:59 node-name kernel: EXT4-fs (sdb): resizing filesystem from 34865152 to 45350912 blocks
Apr 07 19:57:59 node-name kernel: EXT4-fs (sdb): resized filesystem to 45350912
And after the issue occurred:
Apr 07 19:57:59 node-name kubelet[919]: I0407 19:57:59.488576 919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvc-id>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<podname>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Apr 07 20:10:52 node-name kubelet[919]: I0407 20:10:52.301601 919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvc-id>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<podname>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Additional information
Kubernetes: 1.29.1
CSI driver: 2.6.0