MIG partitioning leading to nvidia_a100_3g.39gb instead of 3g.40gb partition for NVIDIA driver versions 535.x and 545.x
berhane opened this issue · 7 comments
Hi,
I have a number of servers with 4 A100 GPUs each. I've MIG-partitioned each GPU with the 'all-balanced' profile and manage them through Slurm.
$ cat /etc/nvidia-mig-manager/config.yaml
...
all-balanced:
...
# H100-80GB, H800-80GB, A100-80GB, A800-80GB
- device-filter: ["0x233110DE", "0x233010DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.10gb": 2
"2g.20gb": 1
"3g.40gb": 1
...
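The profile is applied at boot through nvidia-mig-manager; the equivalent manual invocation (assuming the usual mig-parted flags, shown here only for illustration) would be roughly:
$ nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config.yaml -c all-balanced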
With NVIDIA driver 495.x, I could partition them as follows without any issues.
- 1g.10gb : 2
- 2g.20gb : 1
- 3g.40gb : 1
However, with the latest drivers, namely 535.x and 545.x, each GPU gets partitioned into
- 1g.10gb : 2
- 2g.20gb : 1
- 3g.39gb : 1
I use AutoDetect=nvml for Slurm to detect the types of MIG partitions and their CPU affinities. Slurm reports this discrepancy in the logs:
$ slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: gres/gpu: _merge_system_gres_conf: WARNING: The following autodetected GPUs are being ignored:
slurmd: GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):24-31 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22 UniqueId:MIG-efa8d929-9af6-5083-af99-f1ceefb8b29a
slurmd: GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):8-15 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157 UniqueId:MIG-44b932cc-40b5-5e7b-b01b-7e342ecfcb64
slurmd: GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):56-63 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 UniqueId:MIG-e3ab25b5-7be9-5d4b-940d-63841fead660
slurmd: GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):40-47 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 UniqueId:MIG-60b3faac-01f1-5bc1-be5c-c53e2c4e0d82
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=31 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap30,/dev/nvidia-caps/nvidia-cap31 Cores=24-31 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=166 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 Cores=8-15 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=301 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap300,/dev/nvidia-caps/nvidia-cap301 Cores=56-63 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=436 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap435,/dev/nvidia-caps/nvidia-cap436 Cores=40-47 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=85 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap84,/dev/nvidia-caps/nvidia-cap85 Cores=24-31 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=220 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 Cores=8-15 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=355 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap354,/dev/nvidia-caps/nvidia-cap355 Cores=56-63 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=490 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap489,/dev/nvidia-caps/nvidia-cap490 Cores=40-47 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=94 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap93,/dev/nvidia-caps/nvidia-cap94 Cores=24-31 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=229 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap228,/dev/nvidia-caps/nvidia-cap229 Cores=8-15 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=364 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap363,/dev/nvidia-caps/nvidia-cap364 Cores=56-63 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=499 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap498,/dev/nvidia-caps/nvidia-cap499 Cores=40-47 CoreCnt=64 Links=0,0,0,-1
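For reference, the corresponding GRES setup on these nodes looks roughly like this (node names, counts, and the trailing node parameters are illustrative, not copied verbatim):

# gres.conf
AutoDetect=nvml

# slurm.conf (node definition)
NodeName=gpu[025,030] Gres=gpu:1g.10gb:8,gpu:2g.20gb:4,gpu:3g.40gb:4 ...

Since slurm.conf declares gpu:3g.40gb but NVML autodetects nvidia_a100_3g.39gb, the two apparently can no longer be matched up, which seems to be what produces the errors above.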
I have tried nvidia-mig-manager versions 0.5.3, 0.5.4.1, and 0.5.5, and I see the same behavior whenever the NVIDIA driver version is 535.x or 545.x. I haven't tried 505, 515, or 525.
=== w/ NVIDIA driver 495.x ===
$ nvidia-smi
Wed Dec 13 00:32:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:01:00.0 Off | On |
| N/A 25C P0 50W / 500W | 24MiB / 81251MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:41:00.0 Off | On |
| N/A 25C P0 51W / 500W | 24MiB / 81251MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:81:00.0 Off | On |
| N/A 25C P0 48W / 500W | 24MiB / 81251MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... Off | 00000000:C1:00.0 Off | On |
| N/A 25C P0 50W / 500W | 24MiB / 81251MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 3 0 1 | 6MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 10 0 3 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 3 0 1 | 6MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 9 0 2 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 10 0 3 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 3 0 1 | 6MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 9 0 2 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 10 0 3 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 3 0 1 | 6MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 9 0 2 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 10 0 3 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
=== w/ NVIDIA driver 545.x ===
$ nvidia-smi
Wed Dec 13 00:29:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:01:00.0 Off | On |
| N/A 26C P0 51W / 500W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:41:00.0 Off | On |
| N/A 26C P0 49W / 500W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:81:00.0 Off | On |
| N/A 27C P0 51W / 500W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:C1:00.0 Off | On |
| N/A 25C P0 48W / 500W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 3 0 1 | 25MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 9 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 10 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 3 0 1 | 25MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 9 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 10 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 2 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 2 3 0 1 | 25MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 2 9 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 2 10 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 3 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 3 3 0 1 | 25MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 3 9 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 3 10 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Looking at the memory of the different partitions: the 10GB and 20GB partitions are the same size regardless of the NVIDIA driver version, but the "40GB" partitions are slightly smaller with driver 545.x (40192 MiB) than with driver 495.x (40448 MiB):
=== w/ NVIDIA driver 495.x ===
$ nvidia-smi | grep " 2 0 0"
| 0 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| 1 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| 2 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| 3 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
=== w/ NVIDIA driver 545.x ===
$ nvidia-smi | grep " 2 0 0"
| 0 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| 1 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| 2 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| 3 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
This is perhaps the reason why the partition is reported as 3g.39gb instead of 3g.40gb. Since we already have lots of GPUs with 3g.40gb partitions and users are accustomed to requesting them, hacking around this by introducing a different label for the same Slurm GRES would create a lot of confusion and inconvenience, so we would appreciate any guidance on resolving this issue.
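For what it's worth, a naive MiB-to-GB rounding of the slice size would produce exactly this difference (just an illustration; I don't know whether this is how the name is actually derived):
- 495.x: 40448 MiB / 1024 = 39.5, which rounds up to 40 -> 3g.40gb
- 545.x: 40192 MiB / 1024 = 39.25, which rounds down to 39 -> 3g.39gb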
Thanks a lot.
Just to confirm, which version of mig-parted is being used?
Hi @elezar,
I've tried mig-parted versions 0.5.4 and 0.5.5 on two different servers:
- gpu025
  - Driver Version: 535.86.10
  - CUDA Version: 12.2
  - mig-parted version: 0.5.4
- gpu030
  - Driver Version: 545.23.08
  - CUDA Version: 12.3
  - mig-parted version: 0.5.5
This used to be a known issue and was supposed to be resolved by this commit:
ef1220b
I verified that this fix is part of v0.5.4 and v0.5.5, so maybe something has changed again and we need to sync up.
Can you run your memory sizes through this calculation manually to see if it gives the wrong value:
ef1220b#diff-7ec539aa7d394c923f08662184f76347a3760f25f64848107e6668c1bf21cf84R261
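From memory, the calculation does roughly the following (a simplified sketch, not the exact code, so please check it against the linked diff; the function name here is made up):

package main

import (
    "fmt"
    "math"
)

// Simplified version of the rounding: express the slice as the nearest
// 1/8 fraction of the full GPU memory, then multiply by the total memory
// rounded up to whole GB.
func migMemorySizeGB(totalMiB, sliceMiB float64) int {
    const fracDenominator = 8
    frac := sliceMiB / totalMiB
    frac = math.Ceil(frac*fracDenominator) / fracDenominator
    totalGB := math.Ceil(totalMiB / 1024)
    return int(math.Round(frac * totalGB))
}

func main() {
    // Plug in the sizes reported by nvidia-smi above.
    fmt.Println(migMemorySizeGB(81251, 40448)) // 3g slice, driver 495.x
    fmt.Println(migMemorySizeGB(81920, 40192)) // 3g slice, driver 545.x
}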
Hi @klueska: I wasn't able to get the memory of the "40gb" partition manually using the logic in ef1220b. The full GPU and MIG partition memory sizes differ depending on the driver version:
- Driver Version: 495.29.05
  - Full: 81251 MiB
  - 3g.40gb: 40448 MiB
  - 2g.20gb: 19968 MiB
  - 1g.10gb: 9728 MiB
- Driver Version: 545.23.08
  - Full: 81920 MiB (+669 MiB)
  - 3g.40gb: 40192 MiB (-256 MiB)
  - 2g.20gb: 19968 MiB
  - 1g.10gb: 9728 MiB
The full GPU memory is reported as 669 MiB higher with 545.x, but the "40GB" partition is 256 MiB smaller.
Hi @klueska, @elezar: Do you have recommendations on how to move forward with this? I see v0.5.3 - v0.5.5 were tested with CUDA base image 12.2.2. Are there plans to release a new version tested with CUDA 12.3.x and the latest drivers? Thanks.
Sorry for the late reply here. I am following up internally as to whether there were changes in how the profile names are calculated. We do qualify the mig manager as part of the GPU Operator releases on new driver versions, but it may be that we miss specific hardware-driver combinations.
Thanks, @elezar. If you are able to reproduce the same behavior internally and have a patch to fix it, we would appreciate being able to use it until an updated release comes out.