CSI driver miscounting ENIs on instance?
pvlkov opened this issue · 3 comments
/kind bug
What happened?
After repeated attachment errors of EBS-backed PVCs in our EKS cluster with the error message of AttachVolume.Attach failed for volume "pvc-xxxx" : rpc error: code = Internal desc = Could not attach volume "vol-xxxx" to node "i-xxxx": attachment of disk "vol-xxxx" failed, expected device to be attached but was attaching
, we decided to investigate why the error kept occuring. Since we have a custom node template with an additional EBS volume mounted as a data volume (to /var/lib/containerd) we assumed, that the attachment limits might be wrong somehow.
We found that CSINode
reports the number of allocatable volumes to be 25, which is not correct for our nodes.
# kubectl describe csinodes.storage.k8s.io
...
Name: ip-redacted.ec2.internal
...
Spec:
Drivers:
ebs.csi.aws.com:
...
Allocatables:
Count: 25
Topology Keys: [kubernetes.io/os topology.ebs.csi.aws.com/zone topology.kubernetes.io/zone]
Since we have 2 ENIs per node + 2 EBS volumes already attached on a t3.2xlarge
type instance, the reported number should be 28 - 2 (ENI) - 2(EBS) = 24.
Digging through the node driver code, we had a look at how the limits are actually calculated in getVolumeLimits()
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v1.31.0/pkg/driver/node.go#L767
Baseline is set to 28 (since t3 is a Nitro instance type).
reservedVolumeAttachments
is unset in our case (-1), therefore it is calculated from instance metadata counting all volumes containing the "ebs" string in the name + 1 for the root instance. This is correct, as metadata of the instance reports
[root@ip-redacted ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/block-device-mapping
ami
ebs1
Therefore this is 2 (the same effect can be achieved by setting the cli flag for reserved-volume-attachments to 2)
Next, the number of available attachments is reduced by the number of ENIs and the number of NVME Instance store volumes for the instance type
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v1.31.0/pkg/driver/node.go#L798
The latter is zero for t3.2xlarge. However the former seems to NOT be 2, as would be expected from instance metadata.
[root@ip-redacted ~]# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/network/interfaces/macs
12:a7:xx:xx:xx:xx/
12:c0:xx:xx:xx:xx/[root@ip-redacted ~]#
because if it was, we would get 26 as the number for availableAttachments
, which once more reduced by a reservedVolumeAttachments
value of 2 should lead to a final total number of 24.
Taking a look at how the number of ENIs is calculated here
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v1.31.0/pkg/cloud/metadata/ec2.go#L78
it seems like it just counts the number of newline characters in the response from the metadata server, which as you can see from the above output is only 1 (there is no newline at the end). Funnily enough, looking at git blame, this USED to be accounted for in earlier driver versions, but was then removed for some reason.
What you expected to happen?
CSINode
object reports the correct number of attachable EBS volumes
How to reproduce it (as minimally and precisely as possible)?
I cannot test right now, but if my assumption is correct, CSINode
should report an allocatable number which is always one more than it should be, i.e. for a normal Node with 1 EBS volume and 2 ENI interfaces it should report 26 instead of 25.
Anything else we need to know?:
Environment
- Kubernetes version (use
kubectl version
): v1.28.9-eks-036c24b - Driver version: v1.31.0-eksbuild.1
- Additional info:
- EKS nodes with custom launch template, adding an additional block device mounted under /var/lib/containerd
- Node instance type t3.2xlarge
Thank you @pvlkov for diving deep on this.
I'll check in with the team, this lack of ENI + 1 does look like it could be the regression.
Will have more news later today.
Thank you for the quick response.
For now as a workaround we account for the missing ENI interface by manually setting the reservedVolumeAttachments
to 3, i.e. one more than is actually necessary.
This will be fixed in EBS CSI Driver v1.32.0 and we may back-port this to v1.31.1.
Thank you for root causing this issue @pvlkov!