Pod and PVC stuck in Pending with WaitForFirstConsumer
maxnrb opened this issue · 7 comments
What steps did you take and what happened:
Hello,
I'm trying to set up ZFS LocalPV with volumeBindingMode: WaitForFirstConsumer
in the storage class, however my Pod and PVC are stuck in Pending and the PV is not created. I get the following message in the pod description:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18s default-scheduler 0/1 nodes are available: 1 node(s) did not have enough free storage. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Note that if I deploy the same elements with volumeBindingMode: Immediate
in the SC, the PV is created.
Here are the different elements I have deployed (SC, PVC and Pod):
StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-zfspv
allowVolumeExpansion: true
parameters:
  recordsize: "4k"
  compression: "off"
  dedup: "off"
  fstype: "zfs"
  poolname: "zfspv-pool"
provisioner: zfs.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
PersistentVolumeClaim:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: csi-zfspv
spec:
  storageClassName: openebs-zfspv
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
Pod:
apiVersion: v1
kind: Pod
metadata:
  name: fio
spec:
  restartPolicy: Never
  containers:
  - name: perfrunner
    image: openebs/tests-fio
    command: ["/bin/bash"]
    args: ["-c", "while true ;do sleep 50; done"]
    volumeMounts:
    - mountPath: /datadir
      name: fio-vol
    tty: true
  volumes:
  - name: fio-vol
    persistentVolumeClaim:
      claimName: csi-zfspv
Environment:
- ZFS-LocalPV version: v2.2.0
- Kubernetes version (use kubectl version): v1.27.2
- Kubernetes installer & version: microk8s v1.27.2
- Cloud provider or hardware configuration: Contabo Cloud VPS M (single node)
- OS (e.g. from /etc/os-release): Debian GNU/Linux 11 (bullseye)
How can I debug this?
Thanks for your help!
Hi @maxnrb. Can you please share the output of df -h?
Immediate mode means that volume binding and dynamic provisioning occur as soon as the PersistentVolumeClaim is created, while WaitForFirstConsumer mode delays the binding and provisioning of a PersistentVolume until a Pod using the PersistentVolumeClaim is created.
Knowing the capacity available on the node will surely help us narrow down the issue.
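For reference, one way to check what the driver has registered for a node is to look at the zfsnode custom resources (assuming ZFS-LocalPV is installed in the openebs namespace; adjust -n to your install):
kubectl get zfsnodes -n openebs
kubectl get zfsnodes <node-name> -n openebs -o yaml
The second command shows the pools and the free size the node plugin reported for that node.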
One more thing: do you have ZFS pools created? I see you are using zfspv-pool
in your storage class. Can you please send the pool description?
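For example, on the node (zfspv-pool being the pool name from the storage class above):
zpool status zfspv-pool
zpool list zfspv-pool
zfs list -r zfspv-pool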
How many nodes do you have in your cluster?
Can you also try with the latest version and see whether the issue is still reproducible?
@maxnrb Will close this issue in a week in case of no response.
Got the same issue.
Log from openebs-zfs-controller-0:
I1211 15:54:38.086575 1 controller.go:295] Started PVC processing "zaal/zaal-nats-js-zaal-nats-0"
I1211 15:54:38.086642 1 controller.go:318] PV bound to PVC "zaal/zaal-nats-js-zaal-nats-0" is not created yet
I1211 15:54:38.108148 1 controller.go:295] Started PVC processing "zaal/zaal-nats-js-zaal-nats-2"
I1211 15:54:38.108189 1 controller.go:318] PV bound to PVC "zaal/zaal-nats-js-zaal-nats-2" is not created yet
I1211 15:54:38.109170 1 controller.go:295] Started PVC processing "zaal/zaal-nats-js-zaal-nats-1"
I1211 15:54:38.109194 1 controller.go:318] PV bound to PVC "zaal/zaal-nats-js-zaal-nats-1" is not created yet
The node pods don't appear to print anything as a result of creating a PVC:
Defaulted container "csi-node-driver-registrar" out of: csi-node-driver-registrar, openebs-zfs-plugin
I1206 11:09:47.895804 1 main.go:167] Version: v2.8.0
I1206 11:09:47.895931 1 main.go:168] Running node-driver-registrar in mode=registration
I1206 11:09:47.897183 1 main.go:192] Attempting to open a gRPC connection with: "/plugin/csi.sock"
I1206 11:09:47.897856 1 connection.go:164] Connecting to unix:///plugin/csi.sock
I1206 11:09:48.902531 1 main.go:199] Calling CSI driver to discover driver name
I1206 11:09:48.902582 1 connection.go:193] GRPC call: /csi.v1.Identity/GetPluginInfo
I1206 11:09:48.902590 1 connection.go:194] GRPC request: {}
I1206 11:09:48.916816 1 connection.go:200] GRPC response: {"name":"zfs.csi.openebs.io","vendor_version":"2.3.0"}
I1206 11:09:48.916860 1 connection.go:201] GRPC error: <nil>
I1206 11:09:48.916882 1 main.go:209] CSI driver name: "zfs.csi.openebs.io"
I1206 11:09:48.916944 1 node_register.go:53] Starting Registration Server at: /registration/zfs.csi.openebs.io-reg.sock
I1206 11:09:48.917481 1 node_register.go:62] Registration Server started at: /registration/zfs.csi.openebs.io-reg.sock
I1206 11:09:48.917772 1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I1206 11:09:48.984451 1 main.go:102] Received GetInfo call: &InfoRequest{}
I1206 11:09:48.985176 1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/zfs-localpv/registration"
I1206 11:09:49.013492 1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
PVC:
Normal WaitForFirstConsumer 2m9s persistentvolume-controller waiting for first consumer to be created before binding
Normal WaitForPodScheduled 22s (x8 over 2m7s) persistentvolume-controller waiting for pod zaal-nats-2 to be scheduled
Pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m29s default-scheduler 0/3 nodes are available: 3 node(s) did not have enough free storage. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
storageclass:
allowVolumeExpansion: true
allowedTopologies:
- matchLabelExpressions:
  - key: kubernetes.io/hostname
    values:
    - uca1k
    - uca2k
    - uca3k
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"allowVolumeExpansion":true,"allowedTopologies":[{"matchLabelExpressions":[{"key":"kubernetes.io/hostname","values":["uca1k","uca2k","uca3k"]}]}],"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"name":"lv"},"parameters":{"compression":"off","dedup":"off","fstype":"zfs","poolname":"lv","recordsize":"128k"},"provisioner":"zfs.csi.openebs.io","volumeBindingMode":"WaitForFirstConsumer"}
  creationTimestamp: "2023-12-04T13:13:20Z"
  name: lv
  resourceVersion: "4198060"
  uid: 3f0fb9ec-073e-4cf9-bb58-f13f16a67031
parameters:
  compression: "off"
  dedup: "off"
  fstype: zfs
  poolname: lv
  recordsize: 128k
provisioner: zfs.csi.openebs.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
zpool on the nodes
[root@uca2k ~]$ zpool status
  pool: lv
 state: ONLINE
  scan: scrub repaired 0B in 00:00:02 with 0 errors on Sun Dec 10 00:24:03 2023
config:
        NAME        STATE     READ WRITE CKSUM
        lv          ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0
df:
[root@uca2k ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 7.8G 0 7.8G 0% /dev
tmpfs 1.6G 5.3M 1.6G 1% /run
/dev/vda1 98G 20G 74G 21% /
tmpfs 7.9G 0 7.9G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
lv 2.4T 128K 2.4T 1% /lv
The root cause might be this?
kubectl get zfsnodes -A
No resources found
Not sure how these are supposed to be created; the manual doesn't mention that.
Just ran into this again on a different cluster.
However, it's caused by the same deployment: cockroachdb is creating a StatefulSet with 3 replicas.
When switching to Immediate mode I get:
zfs.csi.openebs.io_openebs-zfs-controller-0_d1de8468-f6e7-4e62-bf75-f001a7229b53 failed to provision volume with StorageClass "lv-im": rpc error: code = Internal desc = scheduler failed, node list is empty for creating the PV
Possibly the root cause is still the same:
the openebs-zfs-node daemonset creates 3 node pods, but I only have 2 zfsnodes.
I tried creating the missing zfsnode by hand, but that didn't help.
Recreating the entire operator results in exactly those 2 zfsnodes coming back, missing the 3rd again.
So whatever it is, it's sticky.
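A quick way to redo that comparison (assuming the driver runs in the openebs namespace):
# node-plugin pods created by the daemonset
kubectl get pods -n openebs -o wide | grep openebs-zfs-node
# zfsnode custom resources registered by those pods
kubectl get zfsnodes -n openebs
Any node that appears in the first list but not the second is one whose plugin failed to register itself.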
Finally figured out the missing zfsnode!
I completely missed that kubectl logs gives you only the first container's logs, but the actually interesting log is from openebs-zfs-plugin.
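For anyone else landing here, the plugin container's logs can be selected with -c (the pod name here is illustrative):
kubectl logs -n openebs openebs-zfs-node-<id> -c openebs-zfs-plugin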
E0317 11:59:48.110538 1 zfsnode.go:279] error syncing 'zfs-localpv/ecl1h': cannot get free size for pool lv@syncoid_ecl1h_2024-03-12:03:14:55-GMT00:00: strconv.ParseInt: parsing "-": invalid syntax, requeuing
This is because the zpool has listsnapshots=on.
After setting it to off, the plugin stops erroring and has created the missing zfsnode.
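The check and fix, for the pool named lv from this thread (listsnapshots is a standard zpool property that makes snapshots show up in zfs list output; snapshot rows report "-" for free size, which matches the parse error above):
zpool get listsnapshots lv
zpool set listsnapshots=off lv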
Unfortunately this has nothing to do with the original issue here; the pods are still not scheduled.
None of the logs I looked at indicate that zfs-localpv even looked at the PVC. Likely it's waiting for the pod to bind, but the pod is waiting for the PV to appear before scheduling, so it's deadlocked.
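One more place worth checking for the "did not have enough free storage" message: that error text comes from the kube-scheduler's volume-binding capacity checks, so if the driver's external-provisioner publishes CSIStorageCapacity objects, they show what the scheduler believes each node has available (this is a general Kubernetes mechanism, not something confirmed from the zfs-localpv logs above):
# capacity objects the scheduler consults for WaitForFirstConsumer volumes
kubectl get csistoragecapacities -A
kubectl describe csistoragecapacities -n openebs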