NVIDIA/mig-parted

MIG partitioning leading to nvidia_a100_3g.39gb instead of 3g.40gb partition for NVIDIA driver versions 535.x and 545.x

berhane opened this issue · 7 comments

Hi,

I have a bunch of servers with 4 A100 GPUs each. I've MIG-partitioned each GPU using the 'all-balanced' profile and manage the partitions through Slurm.

$ cat /etc/nvidia-mig-manager/config.yaml
...
  all-balanced:
...
    # H100-80GB, H800-80GB, A100-80GB, A800-80GB
    - device-filter: ["0x233110DE", "0x233010DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"]
      devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 2
        "2g.20gb": 1
        "3g.40gb": 1
...
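
For reference, the profile is applied on each node with the usual mig-parted invocation:

$ sudo nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config.yaml -c all-balanced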

With NVIDIA driver 495.x, I could partition them as follows without any issues.

  • 1g.10gb : 2
  • 2g.20gb : 1
  • 3g.40gb : 1

However, with the latest drivers, namely 535.x and 545.x, each GPU gets partitioned into

  • 1g.10gb : 2
  • 2g.20gb : 1
  • 3g.39gb : 1

I use AutoDetect=nvml for Slurm to detect the types of MIG partitions and their CPU affinities. Slurm reports this discrepancy in the logs:
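
For context, the relevant Slurm configuration is the autodetect line plus typed GRES declarations along these lines (the NodeName line is a reconstruction, not our literal config; the counts follow from 4 GPUs × the all-balanced profile above):

# gres.conf (per node)
AutoDetect=nvml

# slurm.conf (node definition, abbreviated)
NodeName=gpu025 Gres=gpu:1g.10gb:8,gpu:2g.20gb:4,gpu:3g.40gb:4

The errors below are two sides of the same mismatch: the configured 3g.40gb GRES has no autodetected device to attach to, while the autodetected nvidia_a100_3g.39gb devices match nothing in the configuration.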

$ slurmd -G 

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: gres/gpu: _merge_system_gres_conf: WARNING: The following autodetected GPUs are being ignored:
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):24-31  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22 UniqueId:MIG-efa8d929-9af6-5083-af99-f1ceefb8b29a
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):8-15  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157 UniqueId:MIG-44b932cc-40b5-5e7b-b01b-7e342ecfcb64
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):56-63  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 UniqueId:MIG-e3ab25b5-7be9-5d4b-940d-63841fead660
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):40-47  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 UniqueId:MIG-60b3faac-01f1-5bc1-be5c-c53e2c4e0d82
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=31 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap30,/dev/nvidia-caps/nvidia-cap31 Cores=24-31 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=166 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 Cores=8-15 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=301 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap300,/dev/nvidia-caps/nvidia-cap301 Cores=56-63 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=436 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap435,/dev/nvidia-caps/nvidia-cap436 Cores=40-47 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=85 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap84,/dev/nvidia-caps/nvidia-cap85 Cores=24-31 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=220 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 Cores=8-15 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=355 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap354,/dev/nvidia-caps/nvidia-cap355 Cores=56-63 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=490 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap489,/dev/nvidia-caps/nvidia-cap490 Cores=40-47 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=94 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap93,/dev/nvidia-caps/nvidia-cap94 Cores=24-31 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=229 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap228,/dev/nvidia-caps/nvidia-cap229 Cores=8-15 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=364 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap363,/dev/nvidia-caps/nvidia-cap364 Cores=56-63 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=499 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap498,/dev/nvidia-caps/nvidia-cap499 Cores=40-47 CoreCnt=64 Links=0,0,0,-1

I have tried nvidia-mig-manager versions 0.5.3, 0.5.4.1, and 0.5.5, and I see the same behavior as long as the NVIDIA driver version is 535 or 545. I haven't tried 505, 515, or 525.

=== w/ NVIDIA driver 495.x ===

$ nvidia-smi
Wed Dec 13 00:32:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:01:00.0 Off |                   On |
| N/A   25C    P0    50W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                   On |
| N/A   25C    P0    51W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  Off  | 00000000:81:00.0 Off |                   On |
| N/A   25C    P0    48W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                   On |
| N/A   25C    P0    50W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+


+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

=== w/ NVIDIA driver 545.x ===

$  nvidia-smi
Wed Dec 13 00:29:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:01:00.0 Off |                   On |
| N/A   26C    P0              51W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:41:00.0 Off |                   On |
| N/A   26C    P0              49W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:81:00.0 Off |                   On |
| N/A   27C    P0              51W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:C1:00.0 Off |                   On |
| N/A   25C    P0              48W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Looking at the memory sizes of the different partitions, the 10GB and 20GB partitions are identical regardless of the NVIDIA driver version, but the "40GB" partition is slightly smaller under driver 545.x (40192 MiB) than under 495.x (40448 MiB):

=== w/ NVIDIA driver 495.x ===

$ nvidia-smi | grep " 2   0   0"
|  0    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|  1    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|  2    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|  3    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |

=== w/ NVIDIA driver 545.x ===

$ nvidia-smi | grep " 2   0   0"
|  0    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|  1    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|  2    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|  3    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |

This is perhaps why the partition is reported as 3g.39gb instead of 3g.40gb. Since we already have lots of GPUs with 3g.40gb partitions and our users are trained to request them, hacking around this by introducing a different label for the same Slurm GRES would create a lot of confusion and inconvenience. So, we would appreciate any guidance on resolving this issue.
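
For reference, one possible stopgap would be to switch off autodetection on these nodes and pin the type labels by hand in gres.conf (a sketch only, assuming Slurm's MultipleFiles support for MIG devices; the device paths are copied from the slurmd log above, and every slice would need an explicit entry once AutoDetect is off):

# gres.conf sketch -- first GPU of one node only, untested
AutoDetect=off
Name=gpu Type=3g.40gb Cores=24-31 MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22
# ...matching entries for the remaining 3g, 2g, and 1g slices...

Maintaining static device paths across many nodes defeats the point of AutoDetect=nvml, though, so a fix in the naming logic would be much preferred.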

Thanks a lot.

elezar commented

Just to confirm, which version of mig-parted is being used?

Hi @elezar ,

I've tried mig-parted versions 0.5.4 and 0.5.5 on two different servers.

  • gpu025,

    • Driver Version: 535.86.10
    • CUDA Version: 12.2
    • Mig-parted version: 0.5.4
  • gpu030,

    • Driver Version: 545.23.08
    • CUDA Version: 12.3
    • Mig-parted version: 0.5.5

klueska commented

This used to be a known issue and was supposed to be resolved by this commit:
ef1220b

I verified that this fix is part of v0.5.4 and v0.5.5, so maybe something has changed again and we need to sync up.

Can you run your memory sizes through this calculation manually to see if it gives the wrong value:
ef1220b#diff-7ec539aa7d394c923f08662184f76347a3760f25f64848107e6668c1bf21cf84R261
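
In case it helps, here is that calculation as a standalone sketch (paraphrased, with illustrative identifiers rather than the exact mig-parted names): the slice's memory is taken as a fraction of total device memory, the fraction is rounded to the nearest 1/8, and the result is scaled by the total memory rounded up to whole GBs.

package main

import (
	"fmt"
	"math"
)

// migMemoryGB paraphrases the rounding referenced above: round the slice's
// share of total device memory to the nearest 1/8, then scale by the total
// memory rounded up to whole GBs.
func migMemoryGB(totalDeviceMemoryBytes, migMemoryMiB uint64) uint64 {
	const fracDenominator = 8
	const oneMiB = uint64(1024 * 1024)
	const oneGiB = uint64(1024 * 1024 * 1024)

	fraction := float64(migMemoryMiB*oneMiB) / float64(totalDeviceMemoryBytes)
	fraction = math.Round(fraction*fracDenominator) / fracDenominator

	totalGiB := float64((totalDeviceMemoryBytes + oneGiB - 1) / oneGiB) // round up
	return uint64(math.Round(fraction * totalGiB))
}

func main() {
	const oneMiB = uint64(1024 * 1024)
	// nvidia-smi displays rounded MiB values; NVML reports exact byte counts,
	// so these inputs are approximations of what the code actually sees.
	fmt.Println(migMemoryGB(81920*oneMiB, 40192)) // 545.x figures -> 40
	fmt.Println(migMemoryGB(81251*oneMiB, 40448)) // 495.x figures -> 40
}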

Hi @klueska : I wasn't able to reproduce the reported memory size of the "40gb" partition manually using the logic in ef1220b. The full-GPU and MIG-partition memory sizes differ between driver versions:

  • Driver Version: 495.29.05

    • Full: 81251MiB
    • 3g.40gb: 40448MiB
    • 2g.20gb: 19968MiB
    • 1g.10gb: 9728MiB
  • Driver Version: 545.23.08

    • Full: 81920MiB (+669 MiB)
    • 3g.40gb: 40192MiB (-256 MiB)
    • 2g.20gb: 19968MiB
    • 1g.10gb: 9728MiB

The full GPU memory is reported as 669 MiB higher, but the "40GB" partition is 256MiB smaller.
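
For what it's worth, plugging these displayed values into the sketch above yields 40 in both cases: 40448 / 81251 = 0.498 and 40192 / 81920 = 0.491 both round to 4/8 = 0.5, and 0.5 × 80 GB = 40 GB. And since the fraction is rounded to eighths, that formula can only produce multiples of 10 GB on an 80 GB card, never 39. So the 3g.39gb label presumably comes from a different calculation, e.g. rounding the slice to the nearest whole GiB during autodetection: 40448 MiB / 1024 = 39.5 rounds up to 40, while 40192 MiB / 1024 = 39.25 rounds down to 39, which would explain why the 256 MiB reduction in the newer drivers flips the name.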

Hi @klueska , @elezar :
Do you have recommendations on how to move forward with this? I see v0.5.3 - v0.5.5 were tested with CUDA base image 12.2.2. Are there plans to release a new version tested with CUDA 12.3.x and the latest drivers? Thanks.

elezar commented

Hi @klueska , @elezar : Do you have recommendations on how to move forward with this? I see v0.5.3 - v0.5.5 were tested with CUDA base image 12.2.2. Are there plans to release a new version tested with CUDA 12.3.x and the latest drivers? Thanks.

Sorry for the late reply here. I am following up internally as to whether there were changes in how the profile names are calculated. We do qualify the mig manager as part of the GPU Operator releases on new driver versions, but it may be that we miss specific hardware-driver combinations.

Thanks, @elezar . If you are able to reproduce this behavior internally and come up with a patch, we would appreciate being able to use it until an updated release comes out.