kubernetes-sigs/gcp-compute-persistent-disk-csi-driver

DisableDevice not working as expected

saikat-royc opened this issue · 12 comments

The target path /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state that the DisableDevice function is attempting to write to does not exist.

I printed the path in my local setup PR:

I0216 06:18:40.987335       1 device-utils_linux.go:30] DisableDevice called with device path /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd device name google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd, target filepath /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state
E0216 06:18:40.987444       1 node.go:379] Failed to disabled device /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd for volume projects/saikatroyc-test/zones/us-central1-c/disks/pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd. Device may not be detached cleanly (error is ignored and unstaging is continuing): open /sys/block/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd/device/state: no such file or directory

Logging into the node, the actual target path is /sys/block/sdb/device/state (for a SCSI device, for example).

saikatroyc@gke-test-pd-default-pool-bd595302-9p4h ~ $ sudo /lib/udev/scsi_id -g -d /dev/sdb
0Google  PersistentDisk  pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd
saikatroyc@gke-test-pd-default-pool-bd595302-9p4h ~ $ cat /sys/block/sdb/device/state
running

What is of interest is the /dev/sd* device that the by-id path used in deviceFsPath actually resolves to here:

$ ls -l /dev/disk/by-id/* | grep pvc-843
lrwxrwxrwx 1 root root  9 Feb 16 06:20 /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd -> ../../sdb
lrwxrwxrwx 1 root root  9 Feb 16 06:20 /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd -> ../../sdb
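
To illustrate the gap (a hedged sketch, not the driver's actual fix): if the by-id symlink is resolved to the kernel device name first, the sysfs state file it points at does exist. Using the paths from this thread:

$ DEV=$(readlink -f /dev/disk/by-id/google-pvc-843cdc45-7cf7-43d6-801b-84d69722ebdd)
$ echo "$DEV"
/dev/sdb
$ cat /sys/block/$(basename "$DEV")/device/state
running
# a write (e.g. "offline") to this resolved path would reach a real file,
# unlike /sys/block/google-pvc-.../device/state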

Ah, good catch.

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

Oops, dropped this. I have a quick PR ready.

We are seeing the same issue (we are on 1.9.5 atm):

2023-06-27 00:15:46 | E0627 00:15:46.846784       1 node.go:379] Failed to disabled device /dev/disk/by-id/google-pv--860c5184-2736-4ea3-9f38-d819cde18a01 for volume projects/myproject/zones/asia-south1-a/disks/pv--860c5184-2736-4ea3-9f38-d819cde18a01. Device may not be detached cleanly (error is ignored and unstaging is continuing): open /sys/block/google-pv--860c5184-2736-4ea3-9f38-d819cde18a01/device/state: no such file or directory

It seems we also don't have these kinds of directories (/sys/block/google...) on our nodes.
What I don't understand, though: I only see this issue happening occasionally. Shouldn't it always result in the volume failing to detach from GCP, since these directories are never present?
(And by the way, is there already a release containing #1235?)

That's expected, we have not cherry-picked #1225 and have not cut a new release since it was merged.

The error message is benign. We had thought that disabling the device would prevent some race conditions, but on further investigation they don't come up. So the fact that the disabling isn't succeeding shouldn't cause any problems.

But I agree it's annoying --- k8s errors have enough noise that we shouldn't be adding to them :-)

I'll cherry-pick this back to 1.9 and it will get into releases in the next week or two (we're in the middle of pushing out patch releases and I don't know if the CPs will get into this cycle or the next one; there are some CVEs being fixed by updating golang versions / image bases that I don't want to delay).

The error message is benign.

Ah, so you mean it might not be the cause of what we observe?

Ahem, I haven't actually described what we observe :)
We occasionally see volumes failing to attach to node X (where a pod using that volume is scheduled), because the volume is actually still attached to node Y, so GCP rightfully refuses to attach it to X.
However, there is no volumeattachment resource for this particular pv that would point to Y (only the "new" volumeattachment pointing to X), hence the csi-attacher has no reason to detach it from Y. The aforementioned error msg was the only error I could spot around the unmounting & detaching of this pv from Y.

Yeah, I think what you're observing is unrelated to this error message. Sorry for the noise & confusion!

What is in node Y's volumesAttached at the time you see this error? (That's in the node status; you can see it with kubectl describe node.)
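
For reference, example commands to pull that out directly (node and PV names here are placeholders):

$ kubectl get node <node-Y> -o jsonpath='{.status.volumesAttached}'
$ kubectl get volumeattachment | grep <pv-name>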

If there's no volumeattachment on Y, then the attacher should try to reconcile the attachment away, but only if it thinks that the disk is attached to the node. So there is a dance happening between the pv controller and the attacher.

Another thing to try if this situation comes up again: kill the csi-attacher container and let it restart. We've seen some cases with the provisioner where it looks like its informer cache gets stale, although that seems unlikely and we don't have a consistent reproduction. But if a new csi-attacher instance starts working correctly, a stale cache seems more likely.

Happened again today:

If there's no volumeattachment on Y, then the attacher should try to reconcile the attachment away, but only if it thinks that the disk is attached to the node. So there is a dance happening between the pv controller and the attacher.

Again, no volumeattachment on Y (though the volume is still attached to Y according to gcloud); only one volumeattachment resource pointing to X (which of course fails due to the volume being in use).
Node Y has no mention of the volume in its .status.volumesAttached/volumesInUse.
Node X has the volume mentioned in .status.volumesInUse (and not in .status.volumesAttached, which seems correct as it's not attached to X).
So: no volumeattachment on Y (maybe with some deletionTimestamp) + no ref to that volume from node Y's .status.volumesAttached/volumesInUse => no controller will ever detach this volume from Y.

So I guess the important question is how the "old" volumeattachment on Y could have ever been successfully deleted/finalized?

Another thing to try is that if this situation comes up again, kill the csi-attacher container and let it restart.

Just tried it, and as expected it didn't change anything, since there is no hint anywhere that this volume should be detached from Y.

I'll try to dig up more logs; I'd be happy if you had some ideas on how the old volumeattachment on Y could have vanished. Are there maybe some timeouts after which the detachment is finalized even though the detach operation never successfully finished?

I believe the flow is: delete the volumeattachment, the csi-attacher notices a volume in node .status.volumesAttached without a volumeattachment, and then detaches it.

So I think the weird part is how the volumesAttached got cleared without the volume actually being detached?
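
If it helps, a way to cross-check that state from the outside (disk, zone, and node names are placeholders):

$ gcloud compute disks describe <pv-name> --zone <zone> --format='value(users)'
$ kubectl get node <node-Y> -o jsonpath='{range .status.volumesAttached[*]}{.name}{"\n"}{end}'
# if gcloud still lists node Y as a user but the node status has no matching entry,
# the attacher has nothing left to reconcile against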

Sorry for the initial confusion; of course you were right that the error msg I showed was not related (after all, it came from NodeUnstageVolume, which is about mounting/unmounting, while my actual issue is about attaching/detaching, which would translate to ControllerPublish/UnpublishVolume 😅).
I finally figured out the real cause and filed a dedicated issue: #1290
Thanks for your suggestions!