Incorrect allocatable volumes count in csinode for AWS vt1/g4 instance types

Question

mpatlasov opened this issue 4 months ago · 4 comments

/kind bug

What happened?

kubectl get csinode <node-name> -o json | jq .spec.drivers reports an allocatable.count of 26 for vt1* instance types and 25 for g4* ones, but the actual number of volumes that can be attached to the node is smaller:

type / reported / actual
g4dn.xlarge / 25 / 24
g4ad.xlarge / 25 / 24
vt1.3xlarge / 26 / 24
vt1.6xlarge / 26 / 22

There are many other g4* instance types mentioned here, but I verified the issue only for g4dn.xlarge and g4ad.xlarge. The reported number for vt1.24xlarge (26) is correct, while the numbers for the other vt1* types are not.
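The "reported" column above can also be read without jq. The following is a minimal sketch, assuming the JSON comes from kubectl get csinode <node-name> -o json; the helper name allocatable_counts and the sample document (node name, node ID) are hypothetical and only illustrate the shape of the CSINode object.

```python
import json


def allocatable_counts(csinode_json: str) -> dict:
    """Map each CSI driver name to its reported allocatable volume count."""
    node = json.loads(csinode_json)
    counts = {}
    for driver in node.get("spec", {}).get("drivers") or []:
        # "allocatable" is optional in the CSINode spec; None means no limit is reported.
        counts[driver["name"]] = (driver.get("allocatable") or {}).get("count")
    return counts


# Hypothetical sample shaped like `kubectl get csinode <node-name> -o json` output.
SAMPLE = """
{
  "kind": "CSINode",
  "metadata": {"name": "example-node"},
  "spec": {"drivers": [
    {"name": "ebs.csi.aws.com",
     "nodeID": "i-0123456789abcdef0",
     "allocatable": {"count": 25}}
  ]}
}
"""

print(allocatable_counts(SAMPLE))
```

In practice the JSON string would come from the kubectl command rather than from an inline sample.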

What you expected to happen?

kubectl get csinode must report the correct maximum number of volumes that can be attached.

How to reproduce it (as minimally and precisely as possible)?

Apply the following StatefulSet with 26 replicas:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-vol-limit
spec:
  serviceName: "my-svc"
  replicas: 26
  selector:
    matchLabels:
      app: my-svc
  template:
    metadata:
      labels:
        app: my-svc
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - <node-name>
      containers:
      - name: fedora
        image: registry.fedoraproject.org/fedora-minimal
        command:
        - "sleep"
        - "604800"
        volumeMounts:
        - name: data
          mountPath: /mnt/storage
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: "NoSchedule"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

After a while, some pods get stuck in the "ContainerCreating" status because their volumes remain in the attaching state and cannot be attached to the node. The error for a stuck pod looks like this:

$ oc describe po statefulset-vol-limit-22
...
  Warning  FailedAttachVolume  19s (x4 over 8m30s)  attachdetach-controller  (combined from similar events): AttachVolume.Attach failed for volume "pvc-f789150e-ef53-4166-97b1-8b44b4aadd54" : rpc error: code = Internal desc = Could not attach volume "vol-0cf35f3bad4a3e6f1" to node "i-0f445b9bbcbfbeb10": WaitForAttachmentState AttachVolume error, expected device but be attached but was attaching, volumeID="vol-0cf35f3bad4a3e6f1", instanceID="i-0f445b9bbcbfbeb10", Device="/dev/xvdaw", err=operation error EC2: AttachVolume, https response error StatusCode: 400, RequestID: b8659146-ddff-4c65-84a8-1e36e55ff3ec, api error VolumeInUse: vol-0cf35f3bad4a3e6f1 is already attached to an instance
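One possible way to see how many volumes actually attached during this reproduction is to count VolumeAttachment objects for the node. This is a sketch under assumptions: the issue does not say how the "actual" numbers were measured, the helper name attachment_states and the sample data are hypothetical, and the input is assumed to be the output of kubectl get volumeattachment -o json.

```python
import json


def attachment_states(volumeattachments_json: str, node_name: str,
                      attacher: str = "ebs.csi.aws.com") -> dict:
    """Count attached vs. pending VolumeAttachment objects for one node."""
    items = json.loads(volumeattachments_json).get("items", [])
    result = {"attached": 0, "pending": 0}
    for va in items:
        spec = va.get("spec", {})
        # Only consider attachments made by the given driver to the given node.
        if spec.get("nodeName") != node_name or spec.get("attacher") != attacher:
            continue
        attached = va.get("status", {}).get("attached", False)
        result["attached" if attached else "pending"] += 1
    return result


# Hypothetical sample: two attachments on node-a (one still pending), one on node-b.
SAMPLE = json.dumps({"items": [
    {"spec": {"nodeName": "node-a", "attacher": "ebs.csi.aws.com"},
     "status": {"attached": True}},
    {"spec": {"nodeName": "node-a", "attacher": "ebs.csi.aws.com"},
     "status": {"attached": False}},
    {"spec": {"nodeName": "node-b", "attacher": "ebs.csi.aws.com"},
     "status": {"attached": True}},
]})

print(attachment_states(SAMPLE, "node-a"))
```

The "attached" count covers only volumes attached through the CSI driver, so it excludes the root volume of the instance.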

Anything else we need to know?:

The official doc "Amazon EBS volume limits for Amazon EC2 instances" states clearly that GPUs (or other accelerators) must be counted:

For accelerated computing instances, the attached accelerators count towards the shared volume limit. For example, for p4d.24xlarge instances, which have a shared volume limit of 28, 8 GPUs, and 8 NVMe instance store volumes, you can attach up to 11 Amazon EBS volumes (28 volume limit - 1 network interface - 8 GPUs - 8 NVMe instance store volumes).

However, getVolumesLimit() does not take this into account. It starts from availableAttachments=28 for Nitro instances, then applies the following arithmetic:

availableAttachments - enis - nvmeInstanceStoreVolumes - reservedVolumeAttachments

e.g. 28 - 1 - 1 - 1 == 25 for g4ad.xlarge.
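The arithmetic can be sketched as follows. This is an illustration only, not the driver's actual code: the function volume_limit and its accelerator_slots parameter are hypothetical, and the assumption that g4ad.xlarge has a single GPU occupying one slot is mine, chosen because it reproduces the observed limit of 24.

```python
def volume_limit(available_attachments: int, enis: int,
                 nvme_instance_store_volumes: int,
                 reserved_volume_attachments: int,
                 accelerator_slots: int = 0) -> int:
    """Attachment slots left for EBS volumes after subtracting other consumers."""
    return (available_attachments
            - enis
            - nvme_instance_store_volumes
            - reserved_volume_attachments
            - accelerator_slots)  # hypothetical extra term for GPUs/accelerators


# Arithmetic as described in this issue for g4ad.xlarge (accelerators ignored).
print(volume_limit(28, 1, 1, 1))                        # 25

# Same instance, assuming one GPU that occupies one attachment slot.
print(volume_limit(28, 1, 1, 1, accelerator_slots=1))   # 24

# The p4d.24xlarge example from the AWS doc: 28 - 1 ENI - 8 GPUs - 8 NVMe volumes.
# The doc's arithmetic has no reserved attachment, so it is passed as 0 here.
print(volume_limit(28, 1, 8, 0, accelerator_slots=8))   # 11
```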

There must be other contributors (other than GPUs), because for vt1* instance types the actual number doesn't decrease monotonically:

type / reported / actual
vt1.3xlarge / 26 / 24
vt1.6xlarge / 26 / 22
vt1.24xlarge / 26 / 26

I.e., it's hard to explain <24, 22, 26> solely from the number of accelerators.

Environment

Kubernetes version (use kubectl version):

$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2

Driver version:
Compiled manually (by docker build -t quay.io/rh_ee_mpatlaso/misc:aws-ebs-csi-drv-upstream -f Dockerfile .) from the head of master branch of https://github.com/kubernetes-sigs/aws-ebs-csi-driver :

commit 93dd985300a0fc61fe5e4957d43c52bc590abd28 (HEAD -> master, tag: helm-chart-aws-ebs-csi-driver-2.33.0, origin/master, origin/HEAD)
Merge: 25e3222a dc71aec9
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Wed Jul 24 15:34:27 2024 -0700

    Merge pull request #2098 from kubernetes-sigs/release-1.33
    
    Finalize Release v1.33.0

Answer 1 · 2024-08-02T15:40:12.000Z

Hey @mpatlasov, thank you for raising this issue! We will add the count of accelerators for these instance types to node startup by the next release (as well as any other devices that we are missing).

Really appreciate the detailed write-up and resources on this!

/assign @ElijahQuinones

Answer 2 · 2024-08-02T15:40:14.000Z

@AndrewSirenko: GitHub didn't allow me to assign the following users: ElijahQuinones.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Answer 3 · 2024-08-02T16:25:22.000Z

/priority important-soon

Answer 4 · 2024-08-14T20:20:22.000Z

Hi @mpatlasov,

The PR for GPUs not being factored in has already been merged, and the PR for accelerators is in review right now.

As for your observation:

| There must be other contributors (other than GPUs), because for vt1* instance types the actual number doesn't decrease monotonically

The VT instance type is special in that both the vt1.3xlarge and vt1.6xlarge have accelerators that take up two attachment slots each, while the vt1.24xlarge's accelerators do not take up any attachment slots at all. This is not well documented, and I have cut an internal documentation ticket to correct this.
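This explanation can be checked against the numbers from the original report. The sketch below is illustrative only: the slots-per-accelerator values come from this comment, while the accelerator counts per instance type (1, 2 and 8) are an assumption not stated in this thread, and the names used are hypothetical.

```python
# Limit currently reported by the driver for all vt1* types, per the issue.
REPORTED_LIMIT = 26

# Attachment slots consumed by each accelerator, per the explanation above.
SLOTS_PER_ACCELERATOR = {"vt1.3xlarge": 2, "vt1.6xlarge": 2, "vt1.24xlarge": 0}

# Assumption (not stated in this thread): number of accelerators per instance type.
ACCELERATOR_COUNT = {"vt1.3xlarge": 1, "vt1.6xlarge": 2, "vt1.24xlarge": 8}


def effective_limit(instance_type: str) -> int:
    """Reported limit minus the slots taken by the instance's accelerators."""
    used = ACCELERATOR_COUNT[instance_type] * SLOTS_PER_ACCELERATOR[instance_type]
    return REPORTED_LIMIT - used


for itype in ("vt1.3xlarge", "vt1.6xlarge", "vt1.24xlarge"):
    print(itype, effective_limit(itype))
```

Under these assumptions the results are 24, 22 and 26, which matches the "actual" column in the report.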

Please let me know if you have any further questions or concerns!