Failed to recover after node took the drive (volume) offline
TL;DR
Hi there,
we had a Kubernetes volume autoscaling mechanism (devops-nirvana) active in our cluster. For a while, the autoscaling worked just fine. At some point the operating system of the Kubernetes node took the drive offline. The CSI driver failed to recognize the offline state and kept increasing the PVC.
Expected behavior
The CSI driver should be able to react to offline drives by bringing the drive back online, so that the subsequent resize2fs actually succeeds. Alternatively, the driver should refuse to expand the PVC while the corresponding drive is offline (a rough sketch of such a pre-expand check is shown below).
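For illustration only, here is a minimal sketch (not the actual driver code) of what a pre-expand guard could look like. It assumes a SCSI-backed block device like the /dev/sdb from the kernel log below, whose state the kernel exposes under /sys/block/<name>/device/state; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// deviceIsOnline checks the kernel's SCSI device state for a block device
// such as /dev/sdb by reading /sys/block/<name>/device/state, which reports
// "running" for a healthy device and "offline" after the kernel offlines it.
func deviceIsOnline(devicePath string) (bool, error) {
	name := filepath.Base(devicePath) // e.g. "sdb"
	raw, err := os.ReadFile(filepath.Join("/sys/block", name, "device/state"))
	if err != nil {
		return false, fmt.Errorf("reading device state for %s: %w", name, err)
	}
	return strings.TrimSpace(string(raw)) == "running", nil
}

// expandIfOnline is a hypothetical guard before running resize2fs during
// volume expansion: fail the expansion instead of reporting success while
// the drive is offline.
func expandIfOnline(devicePath string, resize func(string) error) error {
	online, err := deviceIsOnline(devicePath)
	if err != nil {
		return err
	}
	if !online {
		return fmt.Errorf("refusing to expand %s: kernel reports the device as offline", devicePath)
	}
	return resize(devicePath)
}

func main() {
	err := expandIfOnline("/dev/sdb", func(dev string) error {
		fmt.Println("would run resize2fs on", dev)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```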
Observed behavior
PVC got increased.
PV got increased.
resize2fs was triggered but failed, and the next iteration started (the autoscaler checks the existing quota, i.e. the actual filesystem capacity, which did not change, so the PVC got increased again and again; see the sketch after this list).
The CSI driver seems to be completely unaware of these problems (see the second log snippet).
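For context, a hedged sketch of the kind of check the autoscaler relies on: the filesystem capacity reported by statfs for the mount point. In this incident that value never grew, because resize2fs kept failing, which is why the expansion was retried indefinitely. The mount path /export is taken from the MinIO log below; everything else is illustrative and Linux-specific.

```go
package main

import (
	"fmt"
	"syscall"
)

// filesystemSizeBytes returns the total size of the filesystem mounted at
// path, as seen by statfs. After a successful resize2fs this value grows;
// in the failure described above it stayed constant, so the autoscaler
// kept requesting ever larger volumes.
func filesystemSizeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Blocks * uint64(st.Bsize), nil
}

func main() {
	size, err := filesystemSizeBytes("/export")
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	fmt.Printf("filesystem size: %d bytes\n", size)
}
```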
In the MinIO pod log we saw these messages:
Error: node(: taking drive /export offline: unable to write+read for 2m0.001s (*errors.errorString)
3: internal/logger/logger.go:248:logger.LogAlwaysIf()
2: cmd/xl-storage-disk-id-check.go:1066:cmd.(*xlStorageDiskIDCheck).monitorDiskWritable.func1.1()
1: cmd/xl-storage-disk-id-check.go:1085:cmd.(*xlStorageDiskIDCheck).monitorDiskWritable.func1.2()
Minimal working example
No response
Log output
Looked up via journalctl:
Apr 07 20:09:58 node-name kernel: sd 2:0:0:1: Capacity data has changed
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 abort
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: device reset
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 abort
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: Device offlined - not ready after error recovery
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=20s
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 CDB: Write(10) 2a 00 12 a7 d7 e0 00 00 20 00
Apr 07 20:10:19 node-name kernel: blk_update_request: I/O error, dev sdb, sector 312989664 op 0x1:(WRITE) flags 0x800 phys_seg 4 prio class 0
[... many more errors like that ... ]
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: rejecting I/O to offline device
=====
Apr 07 19:30:19 node-name kubelet[919]: I0407 19:30:19.030108 919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvc-id>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<pod-name>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Apr 07 19:45:49 node-name kernel: EXT4-fs (sdb): resizing filesystem from 26738688 to 34865152 blocks
Apr 07 19:45:49 node-name kernel: EXT4-fs (sdb): resized filesystem to 34865152
Apr 07 19:45:49 node-name kubelet[919]: I0407 19:45:49.435021 919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvcid>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<podname>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Apr 07 19:57:59 node-name kernel: EXT4-fs (sdb): resizing filesystem from 34865152 to 45350912 blocks
Apr 07 19:57:59 node-name kernel: EXT4-fs (sdb): resized filesystem to 45350912
And after the issue occurred:
Apr 07 19:57:59 node-name kubelet[919]: I0407 19:57:59.488576 919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvc-id>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<podname>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Apr 07 20:10:52 node-name kubelet[919]: I0407 20:10:52.301601 919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvc-id>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<podname>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Additional information
Kubernetes: 1.29.1
CSI driver: 2.6.0