container-storage-interface/spec

Volume Lifecycle does not correspond to actual k8s behavior

YuikoTakada opened this issue · 8 comments

The current CSI volume lifecycle is not designed for the case when a Node is unreachable.

When a Node is shut down or in a non-recoverable state such as a hardware failure or a broken OS, the Node Plugin cannot issue NodeUnpublishVolume / NodeUnstageVolume. In this case, we want to move the volume to the CREATED state (the volume is detached from the node, and the pods are evicted to another node and running).
But in the current CSI volume lifecycle, there is no transition from PUBLISHED / VOL_READY / NODE_READY to CREATED.
As a result, k8s does not follow the CSI spec: the status moves directly from PUBLISHED to CREATED without going through the VOL_READY and/or NODE_READY states.

We need to update the CSI volume lifecycle to account for the case when a Node is unreachable.
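For reference, here is a minimal Go sketch of the teardown path the spec currently defines, assuming a plugin with both CONTROLLER_PUBLISH and STAGE_UNSTAGE capabilities. The state and RPC names come from the spec's lifecycle figure; everything else is illustrative. The point is that there is no edge from PUBLISHED (or VOL_READY / NODE_READY) straight to CREATED, which is exactly the transition a CO needs when the node is unreachable.

```go
// Minimal sketch (not spec text) of the teardown edges in the current CSI
// lifecycle. There is no PUBLISHED -> CREATED shortcut.
package main

import "fmt"

type State string

const (
	Created   State = "CREATED"
	NodeReady State = "NODE_READY"
	VolReady  State = "VOL_READY"
	Published State = "PUBLISHED"
)

// teardown maps each state to the RPC the spec expects next and the state
// that RPC leads to.
var teardown = map[State]struct {
	rpc  string
	next State
}{
	Published: {"NodeUnpublishVolume", VolReady},
	VolReady:  {"NodeUnstageVolume", NodeReady},
	NodeReady: {"ControllerUnpublishVolume", Created},
}

func main() {
	// Walk the spec-defined path from PUBLISHED back to CREATED.
	for s := Published; s != Created; {
		step := teardown[s]
		fmt.Printf("%s --%s--> %s\n", s, step.rpc, step.next)
		s = step.next
	}
	// When the node is unreachable, the first two RPCs cannot be issued,
	// yet the CO still has to reach CREATED; that edge does not exist today.
}
```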

jdef commented

At first glance, this seems like a breaking change to the spec. As this (CSI) is not a k8s project, it is not bound by the referenced KEP.

YuikoTakada commented

@jdef Thank you for your comments. I understand what you are saying.

I want to find a good solution which doesn't break existing drivers.
#477 also seems to try to solve this problem. WDYT?

This would trigger the deletion of the VolumeAttachment objects. For CSI drivers, this would allow ControllerUnpublishVolume to happen without NodeUnpublishVolume and/or NodeUnstageVolume being called first. Note that no additional code changes are required for this step; it happens automatically after the proposed change in the previous step to force detach right away.

Note that this "force detach" behavior is not introduced by the Non-Graceful Node Shutdown feature. Kubernetes already supports this behavior without Non-Graceful Node Shutdown. See "Test 2" in the description of the PR linked below. By forcefully deleting the Pods on the shutdown node manually, volumes will be force-detached after a 6-minute wait by the Attach/Detach Controller.

kubernetes/kubernetes#108486
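As a rough illustration of the timeout-driven decision described in the quoted text, here is a hedged Go sketch. It is not the Attach/Detach Controller's actual code; the names (volumeAttachment, maxWaitForUnmount) are stand-ins that only mirror the 6-minute wait mentioned above.

```go
// Hedged sketch of the "force detach after a fixed wait" decision: if the
// node never confirms the unmount, detach anyway after the wait expires.
package main

import (
	"fmt"
	"time"
)

// maxWaitForUnmount stands in for the 6-minute wait quoted above.
const maxWaitForUnmount = 6 * time.Minute

type volumeAttachment struct {
	volumeID        string
	nodeUnmounted   bool      // has NodeUnpublish/NodeUnstage completed?
	podDeletedSince time.Time // when the pod was (force-)deleted
}

// shouldForceDetach reports whether ControllerUnpublishVolume should be
// called even though the node-side cleanup never succeeded.
func shouldForceDetach(va volumeAttachment, now time.Time) bool {
	if va.nodeUnmounted {
		return true // normal path: node cleanup finished first
	}
	return now.Sub(va.podDeletedSince) > maxWaitForUnmount
}

func main() {
	va := volumeAttachment{
		volumeID:        "vol-1",
		nodeUnmounted:   false,
		podDeletedSince: time.Now().Add(-7 * time.Minute),
	}
	fmt.Println("force detach:", shouldForceDetach(va, time.Now())) // true
}
```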

Yes, Kubernetes already breaks the CSI spec and can call ControllerUnpublish without NodeUnpublish / NodeUnstage succeeding if Kubernetes thinks the node is broken - it can't really call NodeUnstage/NodeUnpublish in that case or get their results.

The last attempt to fix this officially in CSI is #477.

> Yes, Kubernetes already breaks the CSI spec and can call ControllerUnpublish without NodeUnpublish / NodeUnstage succeeding if Kubernetes thinks the node is broken - it can't really call NodeUnstage/NodeUnpublish in that case or get their results.

Implementor of Nomad's CSI support here! 👋 For what it's worth, we originally implemented the spec as written, and it turned out to cause our users a lot of grief. As of Nomad 1.3.0 (shipped in May of this year), we're doing something similar to what k8s has done, where we make a best-effort attempt to NodeUnpublish / NodeUnstage before ControllerUnpublish.

We drive this from the "client node" (our equivalent of the kubelet), so if the client node is merely disconnected and not dead, we can rely on the node unpublish/unstage having happened by the time we try to GC the claim from the control-plane side. The control plane ends up retrying NodeUnstage/NodeUnpublish anyway, but proceeds to ControllerUnpublish if it can't reach the node.
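A minimal sketch of that best-effort ordering, with hypothetical nodeRPCs / controllerRPCs interfaces standing in for the generated CSI gRPC clients. This is not Nomad's (or Kubernetes') actual implementation, just the shape of the flow: try the node-side RPCs with a short timeout, then proceed to ControllerUnpublishVolume regardless.

```go
// Best-effort teardown: attempt node cleanup, fall through to the
// controller when the node cannot be reached.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// nodeRPCs and controllerRPCs are stand-ins for the CSI gRPC clients.
type nodeRPCs interface {
	NodeUnpublishVolume(ctx context.Context, volumeID, targetPath string) error
	NodeUnstageVolume(ctx context.Context, volumeID, stagingPath string) error
}

type controllerRPCs interface {
	ControllerUnpublishVolume(ctx context.Context, volumeID, nodeID string) error
}

// releaseVolume is the CO-side garbage-collection step: best-effort node
// cleanup, then controller cleanup whether or not the node answered.
func releaseVolume(ctx context.Context, node nodeRPCs, ctrl controllerRPCs,
	volumeID, nodeID, targetPath, stagingPath string) error {

	// Give the (possibly dead) node only a short window to respond.
	nodeCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	if err := node.NodeUnpublishVolume(nodeCtx, volumeID, targetPath); err != nil {
		fmt.Printf("node unpublish failed, continuing: %v\n", err)
	}
	if err := node.NodeUnstageVolume(nodeCtx, volumeID, stagingPath); err != nil {
		fmt.Printf("node unstage failed, continuing: %v\n", err)
	}

	// Proceed to the controller even if the node-side calls never succeeded.
	return ctrl.ControllerUnpublishVolume(ctx, volumeID, nodeID)
}

// unreachableNode simulates a node plugin that never answers.
type unreachableNode struct{}

func (unreachableNode) NodeUnpublishVolume(context.Context, string, string) error {
	return errors.New("node unreachable")
}
func (unreachableNode) NodeUnstageVolume(context.Context, string, string) error {
	return errors.New("node unreachable")
}

type fakeController struct{}

func (fakeController) ControllerUnpublishVolume(_ context.Context, v, n string) error {
	fmt.Printf("ControllerUnpublishVolume(%s, %s) -> OK\n", v, n)
	return nil
}

func main() {
	_ = releaseVolume(context.Background(), unreachableNode{}, fakeController{},
		"vol-1", "node-1", "/mnt/target", "/mnt/staging")
}
```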

jdef commented

Thanks @tgross.

Are there any concerns from plugin providers that may be relying on CSI-as-written vs. the best-effort described herein?

Also, Mesos is another CO w/ CSI integration - does anyone on that side of the house have input to add here?

tgross commented

> Are there any concerns from plugin providers that may be relying on CSI-as-written vs. the best-effort described herein?

IIRC all the plugins I've tested that support ControllerPublish implement that step as "make the API call to the storage provider to attach the volume as a device on the target host" and then implement NodeStage as "call mount(8) to mount that device". So when unpublishing, if the unmount is missed, the API call will try to detach a mounted device.

I know that, for example, the AWS EBS provider just merrily returns OK to that API call, and then the device doesn't actually get detached until it's unmounted. (Or the user can "force detach" via the API out-of-band.) So in that case the provider is graceful and has behavior that's eventually correct, so long as the node unpublish happens eventually. If the node unpublish never happens (say the CO has crashed unrecoverably but the host is still live), I think you end up with a hung volume. But arguably that's the right behavior. I just don't know how prevalent that graceful treatment is across the ecosystem.
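To make that "eventually correct" behavior concrete, here is a hypothetical plugin-side sketch; the cloudAPI interface and its DetachVolume call are invented for illustration and are not the EBS driver's real code. ControllerUnpublishVolume reports success as soon as the detach is requested, even though the provider only completes the detach once the device is unmounted (or force-detached out-of-band).

```go
// Hedged illustration of an "eventually correct" detach: return OK once the
// asynchronous detach has been requested from the storage provider.
package main

import (
	"context"
	"fmt"
)

type cloudAPI interface {
	// DetachVolume starts an asynchronous detach; the provider finishes it
	// only once the guest has unmounted the device (or on a force detach).
	DetachVolume(ctx context.Context, volumeID, instanceID string) error
}

type fakeCloud struct{}

func (fakeCloud) DetachVolume(_ context.Context, v, i string) error {
	fmt.Printf("detach requested for %s on %s (completes once unmounted)\n", v, i)
	return nil
}

// ControllerUnpublishVolume mirrors the observed driver behavior: report
// success as soon as the detach request is accepted.
func ControllerUnpublishVolume(ctx context.Context, cloud cloudAPI, volumeID, nodeID string) error {
	if err := cloud.DetachVolume(ctx, volumeID, nodeID); err != nil {
		return err
	}
	return nil // OK, even if the node never unmounted the device
}

func main() {
	_ = ControllerUnpublishVolume(context.Background(), fakeCloud{}, "vol-1", "i-12345")
}
```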

@tgross Thank you for sharing this information.
I've updated the description to cover COs as a whole; this issue is not specific to Kubernetes.