MIG devices not found after applying configuration
jungsdao commented
I'm using nvidia-mig-parted version 0.8.0 on a node with two NVIDIA A100 80GB PCIe GPUs.
This is the config.yaml file I applied:
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  all-enabled:
    - devices: all
      mig-enabled: true
      mig-devices: {}
  all-1g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7
  all-2g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "2g.20gb": 3
  all-3g.40gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.40gb": 2
  all-4g.40gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "4g.40gb": 1
  all-7g.80gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "7g.80gb": 1
  custom-config:
    - devices: [0,1,2,3]
      mig-enabled: false
    - devices: [4]
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7
    - devices: [5]
      mig-enabled: true
      mig-devices:
        "2g.20gb": 3
    - devices: [6]
      mig-enabled: true
      mig-devices:
        "3g.40gb": 2
    - devices: [7]
      mig-enabled: true
      mig-devices:
        "1g.10gb": 2
        "2g.20gb": 1
        "3g.40gb": 1
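As a sanity check on the config above, here is a minimal Python sketch that sums the compute slices each per-GPU layout requests. It assumes an A100 exposes 7 compute slices in MIG mode and that the leading number in a profile name (the "2" in "2g.20gb") is its compute-slice count. The helpers `compute_slices` and `fits` are hypothetical and not part of nvidia-mig-parted. Passing this check is necessary but not sufficient, since it ignores memory slices and placement rules.

```python
import re

# Assumption: an A100 exposes 7 compute slices per GPU in MIG mode.
MAX_COMPUTE_SLICES = 7


def compute_slices(mig_devices):
    """Sum the compute slices requested by one mig-devices mapping.

    The leading number in a profile name such as "2g.20gb" is taken as
    its compute-slice count, so {"2g.20gb": 3} requests 2 * 3 = 6 slices.
    """
    total = 0
    for profile, count in mig_devices.items():
        match = re.match(r"(\d+)g\.\d+gb", profile)
        if match is None:
            raise ValueError(f"unrecognized profile name: {profile!r}")
        total += int(match.group(1)) * count
    return total


def fits(mig_devices):
    """Return True if the layout stays within the compute-slice budget."""
    return compute_slices(mig_devices) <= MAX_COMPUTE_SLICES


# Per-GPU layouts copied from the config above.
layouts = {
    "all-1g.10gb": {"1g.10gb": 7},
    "all-2g.20gb": {"2g.20gb": 3},
    "all-3g.40gb": {"3g.40gb": 2},
    "custom-config, device 7": {"1g.10gb": 2, "2g.20gb": 1, "3g.40gb": 1},
}

for name, devices in layouts.items():
    print(f"{name}: {compute_slices(devices)} slices, fits={fits(devices)}")
```

By this count, none of the layouts in the config asks for more than 7 compute slices, so the config itself does not look oversubscribed.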
Running sudo nvidia-mig-parted export gives the following output, which probably means the all-1g.10gb profile has been selected:
2024/07/18 13:41:59 WARNING: unable to get device name: [failed to find device with id '20b5']
2024/07/18 13:41:59 WARNING: unable to get device name: [failed to find device with id '20b5']
version: v1
mig-configs:
  current:
    - devices: all
      mig-enabled: true
      mig-devices:
        1g.10gb: 7
But when I run nvidia-smi, the output shows no MIG devices:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02 Driver Version: 470.223.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:00:08.0 Off | On |
| N/A 40C P0 71W / 300W | 45MiB / 80994MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 00000000:00:0B.0 Off | On |
| N/A 41C P0 65W / 300W | 45MiB / 80994MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Also, running sudo nvidia-smi mig -lgip gives the following:
+-----------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|=============================================================================|
| 0 MIG 1g.10gb 19 0/7 9.50 No 14 0 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.10gb+me 20 0/1 9.50 No 14 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.20gb 14 0/3 19.50 No 28 1 0 |
| 2 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 3g.40gb 9 0/2 39.25 No 42 2 0 |
| 3 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 4g.40gb 5 0/1 39.25 No 56 2 0 |
| 4 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 7g.79gb 0 0/1 78.75 No 98 5 0 |
| 7 1 1 |
+-----------------------------------------------------------------------------+
| 1 MIG 1g.10gb 19 0/7 9.50 No 14 0 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 1 MIG 1g.10gb+me 20 0/1 9.50 No 14 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 1 MIG 2g.20gb 14 0/3 19.50 No 28 1 0 |
| 2 0 0 |
+-----------------------------------------------------------------------------+
| 1 MIG 3g.40gb 9 0/2 39.25 No 42 2 0 |
| 3 0 0 |
+-----------------------------------------------------------------------------+
| 1 MIG 4g.40gb 5 0/1 39.25 No 56 2 0 |
| 4 0 0 |
+-----------------------------------------------------------------------------+
| 1 MIG 7g.79gb 0 0/1 78.75 No 98 5 0 |
| 7 1 1 |
+-----------------------------------------------------------------------------+
I wonder why the MIG devices I expected were not created.
I also get the following error when I try to create the GPU instances manually:
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19
Unable to create a GPU instance on GPU 0 using profile 19: Insufficient Resources
Failed to create GPU instances: Insufficient Resources
Any help would be appreciated!