awslabs/mountpoint-s3-csi-driver

Pod "Sometimes" cannot mount PVC in CSI version 1.4.0

zhushendhh opened this issue · 30 comments

Hello Team,

I am trying to test CSI driver 1.4.0 on K8s 1.27, but I found that "sometimes" the Pod cannot mount the PVC and the CSI driver Pod reports the error below:

I0331 13:53:37.687569       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-east-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-outputs-835894076989-us-east-2" > 
I0331 13:53:37.687634       1 node.go:112] NodePublishVolume: mounting comfyui-outputs-835894076989-us-east-2 at /var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount with options [--allow-delete --region=us-east-2]
E0331 13:53:37.687730       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "comfyui-outputs-835894076989-us-east-2" at "/var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/adaac085-e689-49f8-b9f5-d0467907d875/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument

Even when the volume does mount successfully, the CSI driver Pod still emits logs like the above. It seems the CSI driver keeps trying to mount the same PVC for the same Pod and failing. I am not sure whether something is misconfigured.

These problems only happen when "Karpenter" scales up worker nodes for a new deployment/pod in EKS v1.27.
All mount operations are normal on static K8s worker nodes.

Workaround:

  1. Delete the s3-csi-xxxxx Pod running on the Karpenter worker node
  2. Use S3 CSI driver 1.0.0

Looking forward to your support, thanks.

jjkr commented

What underlying operating system are your node hosts running? The driver uses a host mount to read /proc/mounts from the host operating system to determine what is mounted on the system and that error suggests there was an error reading that file. This has been known to cause compatibility issues in the past, though it is odd that the behavior is intermittent.
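
For context, roughly what that check looks like, assuming the host's /proc is bind-mounted at /host/proc inside the driver pod. This is only an illustrative sketch, not the driver's actual source:

```go
// Illustrative sketch only: determine whether a target path is already a
// mount point by scanning the host's /proc/mounts, which is exposed inside
// the pod at /host/proc/mounts.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func isMountPoint(mountsFile, target string) (bool, error) {
	f, err := os.Open(mountsFile)
	if err != nil {
		// This open is what fails in the reports above:
		// "open /host/proc/mounts: invalid argument"
		return false, fmt.Errorf("failed to read %s: %w", mountsFile, err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts lines look like: "<device> <mountpoint> <fstype> <options> 0 0"
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[1] == target {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	mounted, err := isMountPoint("/host/proc/mounts", "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv>/mount")
	fmt.Println(mounted, err)
}
```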

I'm facing the same issue on version 1.4.0.

E0421 17:49:31.700733       1 driver.go:97] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/1223b3bd-75c1-4ce8-ad56-f61ed72707e4/volumes/kubernetes.io~csi/s3-pv-appdata/mount": Failed to cat /proc/mounts: Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
I0421 17:49:31.804114       1 node.go:188] NodeGetCapabilities: called with args
I0421 17:49:31.804755       1 node.go:188] NodeGetCapabilities: called with args
I0421 17:49:31.805458       1 node.go:188] NodeGetCapabilities: called with args
I0421 17:49:31.806119       1 node.go:49] NodePublishVolume: called with args volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"allow-overwrite" mount_flags:"region eu-central-1" mount_flags:"cache /tmp" mount_flags:"metadata-ttl 1200" mount_flags:"max-cache-size 2000" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"groovit-app-data" >
I0421 17:49:31.806171       1 node.go:81] NodePublishVolume: creating dir /var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount
I0421 17:49:31.806248       1 node.go:108] NodePublishVolume: mounting groovit-app-data at /var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount with options [--allow-delete --allow-overwrite --cache=/tmp --max-cache-size=2000 --metadata-ttl=1200 --region=eu-central-1]
E0421 17:49:31.806445       1 driver.go:97] GRPC error: rpc error: code = Internal desc = Could not mount "groovit-app-data" at "/var/lib/kubelet/pods/acc27aed-9202-4e11-995c-b5f0fadb70a5/volumes/kubernetes.io~csi/s3-pv-appdata/mount": Mount failed: Failed to start transient systemd service: Failed StartTransientUnit with Call.Err: dbus: connection closed by user output:
I0421 17:49:31.906146       1 node.go:144] NodeUnpublishVolume: called with args volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/1223b3bd-75c1-4ce8-ad56-f61ed72707e4/volumes/kubernetes.io~csi/s3-pv-appdata/mount"

Same problem on version 1.6.0.

Can confirm this happened on version 1.6.0 on some Karpenter-managed nodes. Restarting the s3-csi pods did solve it, but it would be good to find the root cause of this (especially since it happens when a cluster scales its nodes).
I can't be 100% sure, but so far it seems to have happened on newly provisioned nodes.

Same here with 1.6.0, and the same workaround: restarting the s3 pod after the GPU EC2 machine was created by Karpenter.

I am experiencing the same issue with Kubernetes version 1.28 on EKS with p4d.24xlarge instances.

Are you still experiencing this on 1.7.0?

If so, we need logs to investigate this issue, please see https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/LOGGING.md for how to collect them

Thank you for your response. I just upgraded from version 1.6.0 to 1.7.0. I will need some time to observe the system for any recurrence of the issue. If the problem persists, I will collect and share the necessary logs per the guidelines you provided.

Thanks for taking some time to look into this. Unfortunately, I just updated to v1.7.0 and I still face the same issue.

Here are my CSI Driver logs with the default logLevel of 4:

I0708 09:01:01.678376       1 driver.go:59] Driver version: 1.7.0, Git commit: 53b62cb27036138b46e51f34ddef454fd0f89c6c, build date: 2024-06-18T11:10:59Z, nodeID: ip-10-0-92-227.ap-southeast-2.compute.internal, mount-s3 version: 1.7.2
I0708 09:01:01.679523       1 driver.go:79] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I0708 09:01:01.679708       1 driver.go:109] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0708 09:01:06.417348       1 node.go:222] NodeGetInfo: called with args 
I0708 09:01:35.412398       1 node.go:222] NodeGetInfo: called with args 
I0708 09:01:36.511799       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.513141       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.514163       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.515191       1 node.go:206] NodeGetCapabilities: called with args 
I0708 09:01:36.518005       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume" target_path:"/var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount" volume_capability:<mount:<mount_flags:"region ap-southeast-2" mount_flags:"uid=1000" mount_flags:"allow-other" > access_mode:<mode:MULTI_NODE_READER_ONLY > > volume_context:<key:"bucketName" value:"mdr3ivviap-hpx-ai" > 
I0708 09:01:36.518078       1 node.go:112] NodePublishVolume: mounting mdr3ivviap-hpx-ai at /var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount with options [--allow-other --read-only --region=ap-southeast-2 --uid=1000]
E0708 09:01:36.518202       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "mdr3ivviap-hpx-ai" at "/var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount": Could not check if "/var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount" is a mount point: stat /var/lib/kubelet/pods/e7d5d638-eaa7-4050-b7d2-775c73d22cd7/volumes/kubernetes.io~csi/ai-models-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument

I have also tried to retrieve the Mountpoint logs but have not been able to get any.
I could not find the MOUNT_PID as per the instructions you shared, and journalctl --boot -t mount-s3 did not return any logs (see below).

-- No entries --

Please feel free to let me know if I missed a step, or if you would like me to follow more specific instructions to get you any other relevant information.

Thanks for sharing this, @twellckhpx!

It's really unclear right now why the driver cannot read /proc/mounts. It is understandable that there are no Mountpoint logs, since we don't get as far as launching Mountpoint.

If you still have access to that node or are able to reproduce, could you please check dmesg on the node to see if there are any logs related to opening /proc/mounts. I'm hoping that will give us a clue about what's going wrong with more granularity than "invalid argument".

Could you also share what operating system you're using for your K8s nodes, and any other OS configuration (like SELinux) that may be interacting with the CSI driver.

I can reproduce the issue 100%: when Karpenter scales out a node, pods on the newly provisioned node cannot mount the S3 bucket.

The node OS is Amazon Linux 2, AMI ID amazon-eks-gpu-node-1.30-v20240703 (older versions of amazon-eks-gpu-node-xxx also have the same issue).

Here are some logs for your reference

Failed log

Defaulted container "s3-plugin" out of: s3-plugin, node-driver-registrar, liveness-probe, install-mountpoint (init)
I0718 16:22:01.176122       1 driver.go:59] Driver version: 1.7.0, Git commit: 53b62cb27036138b46e51f34ddef454fd0f89c6c, build date: 2024-06-18T11:10:59Z, nodeID: ip-10-2-126-165.us-west-2.compute.internal, mount-s3 version: 1.7.2                                                                                                                                                    
I0718 16:22:01.177147       1 driver.go:79] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I0718 16:22:01.177329       1 driver.go:109] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}                                                           
I0718 16:22:01.962501       1 node.go:222] NodeGetInfo: called with args
I0718 16:22:39.128126       1 node.go:222] NodeGetInfo: called with args                                                                                                                     
I0718 16:22:59.729019       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.729166       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.732021       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:22:59.732073       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.732799       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.732898       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:22:59.733592       1 node.go:206] NodeGetCapabilities: called with args                                                                                                             
I0718 16:22:59.733592       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:22:59.734762       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-outputs" target_path:"/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-outputs-930179054915-us-west-2" >
I0718 16:22:59.734824       1 node.go:112] NodePublishVolume: mounting comfyui-outputs-930179054915-us-west-2 at /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount with options [--allow-delete --region=us-west-2]
I0718 16:22:59.734840       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-inputs" target_path:"/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-inputs-930179054915-us-west-2" >
I0718 16:22:59.734891       1 node.go:112] NodePublishVolume: mounting comfyui-inputs-930179054915-us-west-2 at /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount with options [--allow-delete --region=us-west-2]
E0718 16:22:59.734961       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "comfyui-outputs-930179054915-us-west-2" at "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
E0718 16:22:59.735036       1 driver.go:96] GRPC error: rpc error: code = Internal desc = Could not mount "comfyui-inputs-930179054915-us-west-2" at "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount": Could not check if "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
I0718 16:23:00.333023       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.333022       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.333794       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.333796       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.334829       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:23:00.334832       1 node.go:206] NodeGetCapabilities: called with args

Log after restarting, showing a successful mount

Defaulted container "s3-plugin" out of: s3-plugin, node-driver-registrar, liveness-probe, install-mountpoint (init)
I0718 16:00:54.819571       1 driver.go:59] Driver version: 1.7.0, Git commit: 53b62cb27036138b46e51f34ddef454fd0f89c6c, build date: 2024-06-18T11:10:59Z, nodeID: ip-10-2-135-107.us-west-2.compute.internal, mount-s3 version: 1.7.2
I0718 16:00:54.820723       1 driver.go:79] Found AWS_WEB_IDENTITY_TOKEN_FILE, syncing token
I0718 16:00:54.821032       1 driver.go:109] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0718 16:00:55.267048       1 node.go:222] NodeGetInfo: called with args
I0718 16:02:30.532895       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.532984       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.533770       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.533824       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.534440       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.534787       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:02:30.535870       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-outputs" target_path:"/var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-outputs-930179054915-us-west-2" >
I0718 16:02:30.535929       1 node.go:112] NodePublishVolume: mounting comfyui-outputs-930179054915-us-west-2 at /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount with options [--allow-delete --region=us-west-2]
I0718 16:02:30.535927       1 node.go:65] NodePublishVolume: req: volume_id:"s3-csi-driver-volume-inputs" target_path:"/var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount" volume_capability:<mount:<mount_flags:"allow-delete" mount_flags:"region us-west-2" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"bucketName" value:"comfyui-inputs-930179054915-us-west-2" >
I0718 16:02:30.535966       1 node.go:112] NodePublishVolume: mounting comfyui-inputs-930179054915-us-west-2 at /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount with options [--allow-delete --region=us-west-2]
I0718 16:02:30.663455       1 node.go:132] NodePublishVolume: /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount was mounted
I0718 16:02:30.663558       1 node.go:132] NodePublishVolume: /var/lib/kubelet/pods/3a2feb00-3fc3-468c-bc4a-d1d1038c9d63/volumes/kubernetes.io~csi/comfyui-inputs-pv/mount was mounted
I0718 16:04:25.236179       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:04:25.237064       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:05:35.113343       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:05:35.114167       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:06:53.403359       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:06:53.404562       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:08:02.679829       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:08:02.680536       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:09:26.008572       1 node.go:206] NodeGetCapabilities: called with args
I0718 16:09:26.009406       1 node.go:206] NodeGetCapabilities: called with args

I can also access the newly booted node that cannot mount the S3 bucket.

The error log shows:

0s          Warning   FailedMount   pod/comfyui-54698dcb57-tzkp5   MountVolume.SetUp failed for volume "comfyui-outputs-pv" : rpc error: code = Internal desc = Could not mount "comfyui-outputs-930179054915-us-west-2" at "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument

It seems the directory /var/lib/kubelet/pods/5d662061-4f4b-454e-bac1-2a051503c3f4/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount does not exist on the node.

After I restarted the s3-csi-node-xxx pod, the directory /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount was created.

Still investigating. If you want to reproduce the bug, I'm willing to help.

It seems the created directory is removed by this line.

I added some debug messages to the mountpoint-s3-csi-driver v1.7.0 source code and rebuilt the image to replace the current daemonsets/s3-csi-node.

I added some debug messages around the suspected code: [screenshot 2024-07-24 00:57:14]

Here are my findings:

When s3-csi-node-xxx runs for the first time, cleanupDir = true and the created directory is cleaned up: [screenshot 2024-07-24 00:55:47]

But after restarting s3-csi-node-xxx, cleanupDir = false and the created directory is not cleaned up: [screenshot 2024-07-24 00:58:54]
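
To make the suspected flow concrete, here is a paraphrased sketch of it. The structure and names are assumptions based on the debug output above, not the driver's exact code:

```go
// Simplified stand-in for NodePublishVolume: create the target directory if
// needed, and remove it again (cleanupDir) when anything later fails,
// including the /host/proc/mounts read.
package main

import (
	"fmt"
	"os"
	"strings"
)

func publishVolume(target, hostMountsFile string) (err error) {
	cleanupDir := false
	if _, statErr := os.Stat(target); os.IsNotExist(statErr) {
		if mkErr := os.MkdirAll(target, 0o750); mkErr != nil {
			return mkErr
		}
		cleanupDir = true // we created it, so clean it up on failure
	}
	defer func() {
		if err != nil && cleanupDir {
			// Suspected cleanup: when the mounts read fails, the freshly
			// created directory is removed, so the next NodePublishVolume
			// attempt sees "no such file or directory".
			os.Remove(target)
		}
	}()

	data, readErr := os.ReadFile(hostMountsFile) // e.g. /host/proc/mounts
	if readErr != nil {
		return fmt.Errorf("failed to read %s: %w", hostMountsFile, readErr)
	}
	if strings.Contains(string(data), " "+target+" ") {
		return nil // already mounted, nothing to do
	}

	// ... this is where mount-s3 would be launched ...
	return nil
}

func main() {
	fmt.Println(publishVolume("/tmp/example-target/mount", "/proc/mounts"))
}
```

In the failing case the read of /host/proc/mounts errors out, so the cleanup step deletes the directory that was just created.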

Because the driver cannot read /host/proc/mounts, the created directory is then removed.

Retrying the read of /host/proc/mounts succeeds, so I have submitted a pull request; please kindly review it.

It is necessary to keep investigating the root cause, but this issue currently affects many Karpenter (and possibly other) users. To prevent users from having to manually restart s3-csi-node-xxx every time, it is better to solve the issue by retrying the read in the code.
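
As a rough illustration of the idea (attempt count, delay, and function name are placeholders here, not the actual pull request):

```go
// Retry the /host/proc/mounts read a few times before giving up, on the
// assumption that the "invalid argument" failure on freshly provisioned
// nodes is transient.
package main

import (
	"fmt"
	"os"
	"time"
)

func readHostProcMounts(path string, attempts int, delay time.Duration) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		data, err := os.ReadFile(path)
		if err == nil {
			return data, nil
		}
		lastErr = err // e.g. "open /host/proc/mounts: invalid argument"
		time.Sleep(delay)
	}
	return nil, fmt.Errorf("failed to read %s after %d attempts: %w", path, attempts, lastErr)
}

func main() {
	data, err := readHostProcMounts("/proc/mounts", 3, 100*time.Millisecond)
	fmt.Println(len(data), err)
}
```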

Thanks a lot for the deep-dive and the pull request @Shellmode!

I'm trying to reproduce the issue to understand the root cause. I've tried using karpenter.k8s.aws/instance-category IN ["p"] in my Karpenter node pool configuration to get a GPU node, and I was able to get a node with amazon-eks-gpu-node-1.30-* AMI and AL2 OS as you mentioned.

I tried spawning new nodes a couple of times but couldn't reproduce the issue. I didn't set up https://github.com/NVIDIA/k8s-device-plugin properly though, which might be why. I'll try to properly set up the NVIDIA plugin and reproduce the issue.

You can reproduce the issue by building this solution, but it may take some time.

Same issue here; just wondering whether there is any update on this.

Hey @longzhihun, we haven't been able to find the root cause yet, but meanwhile we'll merge @Shellmode's fix as a workaround.

When will the workaround be merged into a current release or a new release like 1.8.0?
Our customer is still being impacted.

We plan to make a new release this or next week and the fix will be included in that release.

I encountered the same problem. I checked and found that gpu-feature-discovery could not start normally; I tried restarting it, but it still would not run. Later, I installed k8s-device-plugin from https://github.com/NVIDIA/k8s-device-plugin and that fixed it.

I've had the NVIDIA/k8s-device-plugin installed since I first faced this issue, so I'm not sure this is fully related or a solution.

I'll update to their latest version and see if there is any impact on this particular issue.

Edit: Still facing the same issue with the latest k8s-device-plugin v0.16.2, obviously that's without the workaround.

I synced s3-csi, downgraded it to v1.6.0-eksbuild.1, and then installed k8s-device-plugin. I then restarted s3-csi with the following command, and that resolved it. You can try it:
kubectl get pods -A|grep s3-csi|awk '{print $2}'|xargs -n1 kubectl delete pod -n kube-system

Simply restarting the s3-csi-xxx pods will fix it.

v1.8.0 has been released with @Shellmode's potential fix for this issue. Could you please try upgrading to 1.8.0 to see if that fixes the problem for you?

The recently released v1.8 added a retry inside the ListMounts() function; however, I tried the new release, got the same error message, and still cannot mount S3. I found that if ListMounts() ever returns a nil result with an error, it won't work.

Instead, keep the error handling in the parseProcMounts() function and retry reading /proc/mounts by calling ListMounts() again from the calling function; that works.

The word "retry" may be somewhat misleading here: the issue may actually be fixed because some other function/module refreshes or restarts in the meantime (just like restarting the pod).

Experienced the same failure to mount the PVC with driver v1.9.0 and k8s 1.30. After noticing the original poster's downgrade workaround, I decremented the minor version to v1.8.0, which resolved it for me.

eksctl update addon --name aws-mountpoint-s3-csi-driver --version v1.8.0-eksbuild.1 --cluster <my-cluster> --region <region>

Is there any update on this issue, and has anyone solved the problem?

Have you encountered the "FailedMount" error? aws-samples/comfyui-on-eks#11

Any update?

Hi @John-Funcity and @dienhartd, thanks for reporting that you're having a similar issue to this. Given our changes in v1.8.0, we're interested in root causing the issue you're having.

Could you please open a new bug report on this repository? It would be helpful if you included the following:

  1. The version of the CSI Driver you're using
  2. Whether you're installing via Helm, Karpenter, the EKS plugin, or something else
  3. Mountpoint and CSI Driver logs following this runbook: https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/LOGGING.md

I'm closing this issue. Anyone else who has similar symptoms, please open a new issue so we can track it better.