metal-stack/csi-driver-lvm

PVCs pending with WaitForFirstConsumer on fresh install

jtackaberry opened this issue · 5 comments

Not sure if this is a bug report or a support request, but in any case I can't spot what's going awry.

Fresh install of microk8s 1.23 and csi-driver-lvm v0.4.1 via the Helm chart at https://github.com/metal-stack/helm-charts/tree/master/charts/csi-driver-lvm (which supports StorageClass under storage.k8s.io/v1).

# Deploy CSI driver
$ cat values.yaml
lvm:
  devicePattern: /dev/sdb
rbac:
  pspEnabled: false
$ helm upgrade --install --create-namespace -n storage -f values.yaml csi-driver-lvm ./helm-charts/charts/csi-driver-lvm/

# Storage classes created
$ kubectl get storageclass
NAME                              PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
csi-driver-lvm-striped            lvm.csi.metal-stack.io   Delete          WaitForFirstConsumer   true                   27m
csi-driver-lvm-mirror             lvm.csi.metal-stack.io   Delete          WaitForFirstConsumer   true                   27m
csi-driver-lvm-linear (default)   lvm.csi.metal-stack.io   Delete          WaitForFirstConsumer   true                   27m

# Create a test PVC
$ cat pvc-test.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
  namespace: default
spec:
  storageClassName: csi-driver-lvm-linear
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"

$ kubectl apply -f pvc-test.yaml
$ kubectl describe -n default pvc/test
Name:          test
Namespace:     default
StorageClass:  csi-driver-lvm-linear
Status:        Pending
Volume:
Labels:        <none>
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason                Age               From                         Message
  ----    ------                ----              ----                         -------
  Normal  WaitForFirstConsumer  4s (x4 over 42s)  persistentvolume-controller  waiting for first consumer to be created before binding

The first sign of trouble comes from the plugin pod, where it raises a couple of errors:

$ kubectl -n storage logs csi-driver-lvm-plugin-9bqb4 -c csi-driver-lvm-plugin
2022/02/05 20:02:01 unable to configure logging to stdout:no such flag -logtostderr
I0205 20:02:01.834133       1 lvm.go:108] pullpolicy: IfNotPresent
I0205 20:02:01.834139       1 lvm.go:112] Driver: lvm.csi.metal-stack.io
I0205 20:02:01.834142       1 lvm.go:113] Version: dev
I0205 20:02:01.873219       1 lvm.go:411] unable to list existing volumegroups:exit status 5
I0205 20:02:01.873250       1 nodeserver.go:51] volumegroup: csi-lvm not found
I0205 20:02:02.119070       1 nodeserver.go:58] unable to activate logical volumes:  Volume group "csi-lvm" not found
  Cannot process volume group csi-lvm
 exit status 5
I0205 20:02:02.120111       1 controllerserver.go:259] Enabling controller service capability: CREATE_DELETE_VOLUME
I0205 20:02:02.120295       1 server.go:95] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}

Over on the k8s node, /dev/sdb does exist per lvm.devicePattern:

$ blockdev --getsize64 /dev/sdb
32212254720

While the documentation doesn't say this is necessary, I didn't see any indication in the code that pvcreate is called. So I figured perhaps that was the problem and created the physical volume explicitly (which also demonstrates that the LVM command-line tools are functional on the host):

# On k8s host
$ pvcreate /dev/sdb
  Physical volume "/dev/sdb" successfully created.

# On client
$ kubectl -n storage rollout restart ds/csi-driver-lvm-plugin

No change: still the Volume group "csi-lvm" not found errors in the plugin pod logs. OK, this ostensibly shouldn't be necessary either, but let's create the volume group manually:

# On k8s host
$ vgcreate csi-lvm /dev/sdb
  Volume group "csi-lvm" successfully created
$ vgs
  VG      #PV #LV #SN Attr   VSize   VFree
  csi-lvm   1   0   0 wz--n- <30.00g <30.00g

# On client
$ kubectl -n storage rollout restart ds/csi-driver-lvm-plugin

This has addressed the errors from the plugin logs:

INFO: defaulting to container "csi-driver-lvm-plugin" (has: node-driver-registrar, csi-driver-lvm-plugin, liveness-probe)
2022/02/05 20:23:53 unable to configure logging to stdout:no such flag -logtostderr
I0205 20:23:53.656589       1 lvm.go:108] pullpolicy: IfNotPresent
I0205 20:23:53.656596       1 lvm.go:112] Driver: lvm.csi.metal-stack.io
I0205 20:23:53.656598       1 lvm.go:113] Version: dev
I0205 20:23:53.738596       1 controllerserver.go:259] Enabling controller service capability: CREATE_DELETE_VOLUME
I0205 20:23:53.738891       1 server.go:95] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}

But that didn't fix the pending PVC, even after recreating it:

$ kubectl describe -n default pvc/test
Name:          test
Namespace:     default
StorageClass:  csi-driver-lvm-linear
Status:        Pending
Volume:
Labels:        <none>
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason                Age               From                         Message
  ----    ------                ----              ----                         -------
  Normal  WaitForFirstConsumer  4s (x2 over 16s)  persistentvolume-controller  waiting for first consumer to be created before binding

Hopefully it's clear where things have gone wrong. :)

Thanks!

Hi, at first glance everything was done right.
LVM PVs don't need to be created beforehand; a VG can be created directly from a given block device or a list of block devices.

I guess your pod will mount the pvc when you delete it.

What OS is your worker node running?

I guess your pod will mount the pvc when you delete it.

This is actually the revelation, and what was missing from my reproduction steps above: the PV isn't actually provisioned until a pod mounts the PVC. I tried creating a pod while the PVC was Pending, and things are working: the VG is created, the PV is provisioned and bound, and the pod starts.
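
For completeness, this is roughly the shape of the pod I used to consume the PVC; the pod name, image, and mount path here are illustrative rather than copied from my actual manifest:

$ cat pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: default
spec:
  containers:
    - name: test
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data        # any path works; it just needs to mount the claim
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test           # the PVC created above
$ kubectl apply -f pod-test.yaml

As soon as the pod is scheduled, the PVC leaves Pending and binds.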

I never got as far as creating a pod because I figured: what was the point if the PVC was stuck in a Pending state? Every other CSI driver I have experience with so far immediately provisions a PV and binds it when a PVC is created, so I'm embarrassed to say I expected csi-driver-lvm to work the same way and never bothered with a pod.

Can I humbly suggest this as an improvement? IMO it's surprising behavior to defer PV creation until after some pod mounts the PVC.

What OS is your worker node running?

Apologies for not mentioning. Ubuntu 20.04.3.

No, it cannot create the PV unless the pod is created, because this CSI driver is a local-storage provider and therefore it needs to know on which node the pod gets scheduled.
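
That is what the volumeBindingMode: WaitForFirstConsumer on the installed StorageClasses expresses. Stripped down (going by the kubectl get storageclass output above; the real class in the chart may carry additional parameters), the default class looks roughly like this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-driver-lvm-linear
provisioner: lvm.csi.metal-stack.io
reclaimPolicy: Delete
allowVolumeExpansion: true
# Binding and provisioning are deferred until a pod using the PVC is scheduled,
# so the driver knows on which node to carve the LV out of the VG.
volumeBindingMode: WaitForFirstConsumer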

No, it cannot create the PV unless the pod is created, because this CSI driver is a local-storage provider and therefore it needs to know on which node the pod gets scheduled.

Hah. You're completely right of course, I have no explanation for my momentary demonstration of stupidity. :)

Perhaps a quick note in the README would help the absentminded like me remember that local-storage providers work differently from network-storage providers in this regard?

Thanks for your patience @majst01. Will close as this isn't a bug and I'm up and running.

No problem.