Incorrect allocatable volumes count in csinode for AWS vt1*/g4* instance types
mpatlasov opened this issue · 4 comments
/kind bug
What happened?
```
kubectl get csinode <node-name> -o json | jq .spec.drivers
```
reports `allocatable.count` as 26 for vt1* instance types and 25 for g4* ones, while the actual number of volumes that can be attached to the node is smaller:

| type | reported | actual |
| --- | --- | --- |
| g4dn.xlarge | 25 | 24 |
| g4ad.xlarge | 25 | 24 |
| vt1.3xlarge | 26 | 24 |
| vt1.6xlarge | 26 | 22 |
There are many other g4* instance types mentioned here, but I verified the issue only for g4dn.xlarge and g4ad.xlarge. The reported number for vt1.24xlarge (26) is correct, while the numbers for the other vt1* types are not.
What you expected to happen?
`kubectl get csinode` must report the correct maximum number of volumes that can be attached.
How to reproduce it (as minimally and precisely as possible)?
Apply the following StatefulSet with 26 replicas:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-vol-limit
spec:
  serviceName: "my-svc"
  replicas: 26
  selector:
    matchLabels:
      app: my-svc
  template:
    metadata:
      labels:
        app: my-svc
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - <node-name>
      containers:
      - name: fedora
        image: registry.fedoraproject.org/fedora-minimal
        command:
        - "sleep"
        - "604800"
        volumeMounts:
        - name: data
          mountPath: /mnt/storage
      tolerations:
      - key: "node-role.kubernetes.io/master"
        effect: "NoSchedule"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
```
After a while, some pods get stuck in "ContainerCreating" status because their volumes are stuck in the attaching state and cannot be attached to the node. The error for a stuck pod looks like this:
```
$ oc describe po statefulset-vol-limit-22
...
Warning  FailedAttachVolume  19s (x4 over 8m30s)  attachdetach-controller  (combined from similar events): AttachVolume.Attach failed for volume "pvc-f789150e-ef53-4166-97b1-8b44b4aadd54" : rpc error: code = Internal desc = Could not attach volume "vol-0cf35f3bad4a3e6f1" to node "i-0f445b9bbcbfbeb10": WaitForAttachmentState AttachVolume error, expected device but be attached but was attaching, volumeID="vol-0cf35f3bad4a3e6f1", instanceID="i-0f445b9bbcbfbeb10", Device="/dev/xvdaw", err=operation error EC2: AttachVolume, https response error StatusCode: 400, RequestID: b8659146-ddff-4c65-84a8-1e36e55ff3ec, api error VolumeInUse: vol-0cf35f3bad4a3e6f1 is already attached to an instance
```
Anything else we need to know?:
The official doc "Amazon EBS volume limits for Amazon EC2 instances" states clearly that GPUs (or accelerators) must be counted:

> For accelerated computing instances, the attached accelerators count towards the shared volume limit. For example, for p4d.24xlarge instances, which have a shared volume limit of 28, 8 GPUs, and 8 NVMe instance store volumes, you can attach up to 11 Amazon EBS volumes (28 volume limit - 1 network interface - 8 GPUs - 8 NVMe instance store volumes).
However, `getVolumesLimit()` does not take this into account. It starts from `availableAttachments=28` for Nitro instances, then applies the following arithmetic:

```
availableAttachments - enis - nvmeInstanceStoreVolumes - reservedVolumeAttachments
```

e.g. 28 - 1 - 1 - 1 == 25 for g4ad.xlarge.
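The arithmetic above, and the correction the AWS docs imply, can be sketched as follows (a minimal illustration of the calculation described in this issue; the function names are mine, not the driver's actual identifiers):

```python
# Limit as the issue reports getVolumesLimit() currently computes it:
# GPUs/accelerators are not subtracted from the shared attachment slots.
def reported_volume_limit(available_attachments, enis,
                          nvme_instance_store_volumes,
                          reserved_volume_attachments):
    return (available_attachments - enis
            - nvme_instance_store_volumes
            - reserved_volume_attachments)

# Per the AWS doc quoted above, GPUs/accelerators should also be subtracted.
def expected_volume_limit(available_attachments, enis,
                          nvme_instance_store_volumes,
                          reserved_volume_attachments, gpus):
    return reported_volume_limit(available_attachments, enis,
                                 nvme_instance_store_volumes,
                                 reserved_volume_attachments) - gpus

# g4ad.xlarge: Nitro baseline of 28 shared slots, 1 ENI,
# 1 NVMe instance store volume, 1 reserved attachment, 1 GPU.
print(reported_volume_limit(28, 1, 1, 1))     # 25 (what the driver reports)
print(expected_volume_limit(28, 1, 1, 1, 1))  # 24 (what is actually attachable)
```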
There must be other contributors (other than GPUs), because for vt1* instance types the actual number doesn't decrease monotonically:

| type | reported | actual |
| --- | --- | --- |
| vt1.3xlarge | 26 | 24 |
| vt1.6xlarge | 26 | 22 |
| vt1.24xlarge | 26 | 26 |

I.e., the sequence <24, 22, 26> is hard to explain solely from number-of-accelerators considerations.
Environment
- Kubernetes version (use `kubectl version`):
```
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2
```
- Driver version:
Compiled manually (by `docker build -t quay.io/rh_ee_mpatlaso/misc:aws-ebs-csi-drv-upstream -f Dockerfile .`) from the head of the master branch of https://github.com/kubernetes-sigs/aws-ebs-csi-driver:
```
commit 93dd985300a0fc61fe5e4957d43c52bc590abd28 (HEAD -> master, tag: helm-chart-aws-ebs-csi-driver-2.33.0, origin/master, origin/HEAD)
Merge: 25e3222a dc71aec9
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Wed Jul 24 15:34:27 2024 -0700

    Merge pull request #2098 from kubernetes-sigs/release-1.33

    Finalize Release v1.33.0
```
Hey @mpatlasov, thank you for raising this issue up! We will add this count of accelerators for these instance types to node startup by next release (as well as any other devices that we are missing).
Really appreciate the detailed ramp up and resources on this!
/assign @ElijahQuinones
@AndrewSirenko: GitHub didn't allow me to assign the following users: ElijahQuinones.
Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide
In response to this:

> Hey @mpatlasov, thank you for raising this issue up! We will add this count of accelerators for these instance types to node startup by next release (as well as any other devices that we are missing).
> Really appreciate the detailed ramp up and resources on this!
> /assign @ElijahQuinones
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/priority important-soon
Hi @mpatlasov,
The PR for GPUs not being factored in has already been merged, and the PR for accelerators is in review right now.
As for your observation:
> There must be other contributors (other than GPUs) because for vt1* instance types actual number doesn't decrease monotonically
The VT instance type is special in that both the vt1.3xlarge and vt1.6xlarge have accelerators that take up two attachment slots each, while the vt1.24xlarge's accelerators do not take up any attachment slots at all. This is not well documented, and I have cut an internal documentation ticket to correct this.
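Taking that explanation together with the figures reported earlier in this thread, the vt1* numbers reconcile as follows (a sketch under my own assumptions: per-type accelerator counts and slot costs below are inferred from the numbers in this thread, not taken from the driver's tables):

```python
NITRO_BASELINE = 28  # shared attachment slots on Nitro instances

def vt1_actual_limit(accelerators, slots_per_accelerator, enis=1, reserved=1):
    # Subtract the ENI, the reserved attachment, and the slots the
    # accelerators consume from the shared Nitro baseline.
    return (NITRO_BASELINE - enis - reserved
            - accelerators * slots_per_accelerator)

print(vt1_actual_limit(1, 2))  # vt1.3xlarge:  28 - 1 - 1 - 2 = 24
print(vt1_actual_limit(2, 2))  # vt1.6xlarge:  28 - 1 - 1 - 4 = 22
print(vt1_actual_limit(8, 0))  # vt1.24xlarge: 28 - 1 - 1 - 0 = 26
```

This matches the observed <24, 22, 26> sequence exactly, which the GPU-count-only view could not explain.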
Please let me know if you have any further questions or concerns!