ThinkParQ/beegfs-csi-driver

Failed mount after kubernetes worker node upgrade from v1.23.15 to v1.24.9

Closed this issue · 2 comments

Hi!

After upgrading half of my worker nodes to the new Kubernetes version (v1.24.9), I noticed that some of the pods got stuck with a FailedMount warning:

Warning  FailedMount  15s (x6 over 31s)  kubelet
MountVolume.MountDevice failed for volume "pvc-5bc91a74" : rpc error: code = Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/874cf8f302b0da66de76a4edb4ca3f7e0c5f7a6f25ad368e8ce8fda969225eb5/globalmount: no such file or directory

To get them up and running again, I forced them onto nodes that still run the old Kubernetes version (v1.23.15), and that works.

Versions:

  • BeeGFS: v7.3.2
  • CSI Driver: v1.3.0

For the CSI driver deployment, I am using the k8s one from the repo.
Config:

config:
  beegfsClientConf:
    connClientPortUDP: "8028"
    connDisableAuthentication: "true"
    logType: "helperd"

The only modification I had to make was in csi-beegfs-node.yaml, where I set plugins-mount-dir to /var/lib/kubelet/plugins/kubernetes.io/csi/pv instead of /var/lib/kubelet/plugins/kubernetes.io/csi.

Directory structure of /var/lib/kubelet/plugins/kubernetes.io/csi on a Kubernetes 1.23.15 worker node:

tree -L 4
.
└── pv
    ├── pvc-01ba9661
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
    ├── pvc-03357f3e
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
...

Directory structure of /var/lib/kubelet/plugins/kubernetes.io/csi on a Kubernetes 1.24.9 worker node:

tree -L 4
.
├── beegfs.csi.netapp.com
└── pv
    ├── pvc-090f23e1
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
    ├── pvc-14ba4b44
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
...

So for some reason, the node with the newer Kubernetes version has an empty beegfs.csi.netapp.com directory.
Why are the pods on the "new" nodes trying to mount this other location? Is v1.3.0 of the driver incompatible with Kubernetes 1.24.9? Should I upgrade the driver to v1.4.0?

Please let me know if you need any more info.

Thanks in advance!

Hi,

I upgraded beegfs-csi-driver to v1.4.0 and removed /pv from plugins-mount-dir, and it works now. I guess the version upgrade did the trick.

tree -L 4
.
├── beegfs.csi.netapp.com
│   └── 443016c18d78ef863d31e5904c346489c800bf6f3014713ce8a694a0fdce7bd6
│       ├── globalmount
│       │   ├── beegfs-client.conf
│       │   └── mount
│       └── vol_data.json
└── pv
    ├── pvc-0023367f
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
...

I will close the issue.

Thanks!

Thanks for opening this issue @mceronja. Apologies for the delayed response. It appears you have resolved the issue, but I thought I'd provide a bit of color anyway.

Kubernetes changed the staging paths for persistent volumes in kubernetes/kubernetes#107065. Because volumes are no longer staged under .../csi/pv (they are now staged under the more general .../csi), and our old deployment manifests only give the driver purview over .../csi/pv, our old manifests are not compatible with Kubernetes >=1.24. That being said, our updated manifests (v1.3.0 and later) use the more general .../csi and should be backwards compatible, as .../csi/pv is a subdirectory of .../csi.
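
To make the path relationship concrete, here is a minimal Python sketch (not driver code). It assumes, as a simplification, that the driver can only reach host paths at or below the directory given as plugins-mount-dir, and it models that as plain path containment. The helper name `is_visible` is hypothetical; the two staging paths are taken from the tree output and the error message earlier in this issue.

```python
from pathlib import PurePosixPath

CSI_ROOT = PurePosixPath("/var/lib/kubelet/plugins/kubernetes.io/csi")


def is_visible(staging_path: PurePosixPath, plugins_mount_dir: PurePosixPath) -> bool:
    """Return True if staging_path is plugins_mount_dir or lies below it.

    Assumption: a host path is reachable by the driver only when it sits
    under the directory mounted into the driver container.
    """
    return staging_path == plugins_mount_dir or plugins_mount_dir in staging_path.parents


# Staging path as seen on the Kubernetes 1.23.15 node (under .../csi/pv).
old_style = CSI_ROOT / "pv" / "pvc-01ba9661" / "globalmount"

# Staging path from the FailedMount error on the Kubernetes 1.24.9 node.
new_style = (
    CSI_ROOT
    / "beegfs.csi.netapp.com"
    / "874cf8f302b0da66de76a4edb4ca3f7e0c5f7a6f25ad368e8ce8fda969225eb5"
    / "globalmount"
)

narrow_mount = CSI_ROOT / "pv"  # plugins-mount-dir narrowed to .../csi/pv
general_mount = CSI_ROOT        # plugins-mount-dir set to .../csi

print(is_visible(old_style, narrow_mount))   # True
print(is_visible(new_style, narrow_mount))   # False
print(is_visible(old_style, general_mount))  # True
print(is_visible(new_style, general_mount))  # True
```

Under that assumption, only the combination of the new-style staging path with a mount directory of .../csi/pv fails, which matches the "no such file or directory" error reported above.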

It's hard to pin down the exact cause of your issue without diving in pretty deep, but it seems likely to be related to the upgrade flow. It would make a lot more sense to me if you were coming from v1.2.2 (which had incompatible base manifests), but that doesn't appear to be the case. To anyone else arriving here for some similar reason:

  1. Be sure you are using the base manifests and overlays as described in https://github.com/NetApp/beegfs-csi-driver/tree/master/deploy/k8s#basics so you will always pick up the latest base manifests when upgrading the driver.
  2. Do not attempt to use driver version <1.3.0 with Kubernetes version >=1.24.0, as there is a hard incompatibility related to the discussion here.
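
The second point can be written down as a small pre-upgrade check. This is only a sketch in Python: the function names `parse_version` and `is_known_incompatible` are hypothetical, and the check encodes nothing beyond the single rule stated above (driver <1.3.0 with Kubernetes >=1.24.0). A `False` result does not guarantee that a combination works.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as "v1.24.9" into a comparable tuple (1, 24, 9)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def is_known_incompatible(driver_version: str, kubernetes_version: str) -> bool:
    """Return True only for the combination called out in this issue:
    driver older than 1.3.0 together with Kubernetes 1.24.0 or newer."""
    return (
        parse_version(driver_version) < (1, 3, 0)
        and parse_version(kubernetes_version) >= (1, 24, 0)
    )


print(is_known_incompatible("v1.2.2", "v1.24.9"))   # True
print(is_known_incompatible("v1.3.0", "v1.24.9"))   # False
print(is_known_incompatible("v1.2.2", "v1.23.15"))  # False
```

Note that the setup reported in this issue (driver v1.3.0 on Kubernetes v1.24.9) passes this check; there the mount failure lines up with the plugins-mount-dir having been narrowed to .../csi/pv.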