Insufficient Permissions error on NVIDIA Multi-Instance GPU (MIG)
j3soon opened this issue · 2 comments
j3soon commented
Using sacred on an A100 with MIG mode enabled produces the following error:
"model": child.find("product_name").text,
"total_memory": int(
---> child.find("fb_memory_usage").find("total").text.split()[0]
),
"persistence_mode": (child.find("persistence_mode").text == "Enabled"),
ValueError: invalid literal for int() with base 10: 'Insufficient'
This is caused by line 171 of sacred/host_info.py, where child.find("fb_memory_usage").find("total").text returns "Insufficient Permissions".
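The failing expression can be reproduced without a GPU. This is a minimal sketch, not sacred's actual code: the variable names are illustrative, and only the string value is taken from the nvidia-smi output in this report.

```python
# Value that nvidia-smi reports for the GPU-level <total> when MIG is enabled.
text = "Insufficient Permissions"

try:
    # Same pattern as the traceback: take the first token and convert it.
    int(text.split()[0])
except ValueError as exc:
    message = str(exc)

print(message)
```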
The output of nvidia-smi -q -x looks something like this:
<nvidia_smi_log>
<gpu>
<product_name>NVIDIA A100-SXM4-40GB</product_name>
<product_brand>NVIDIA</product_brand>
<display_mode>Enabled</display_mode>
<display_active>Disabled</display_active>
<persistence_mode>Enabled</persistence_mode>
<mig_mode>
<current_mig>Enabled</current_mig>
<pending_mig>Enabled</pending_mig>
</mig_mode>
<mig_devices>
<mig_device>
<index>0</index>
<gpu_instance_id>10</gpu_instance_id>
<compute_instance_id>0</compute_instance_id>
<fb_memory_usage>
<total>4864 MiB</total>
<used>0 MiB</used>
<free>4860 MiB</free>
</fb_memory_usage>
</mig_device>
</mig_devices>
<fb_memory_usage>
<total>Insufficient Permissions</total>
<used>Insufficient Permissions</used>
<free>Insufficient Permissions</free>
</fb_memory_usage>
</gpu>
</nvidia_smi_log>
This error does not occur on an A100 with MIG mode disabled, since the permissions are sufficient in that case:
<nvidia_smi_log>
<gpu>
<product_name>NVIDIA A100-SXM4-40GB</product_name>
<product_brand>NVIDIA</product_brand>
<display_mode>Enabled</display_mode>
<display_active>Disabled</display_active>
<persistence_mode>Enabled</persistence_mode>
<mig_mode>
<current_mig>Disabled</current_mig>
<pending_mig>Disabled</pending_mig>
</mig_mode>
<mig_devices>
None
</mig_devices>
<fb_memory_usage>
<total>40536 MiB</total>
<used>0 MiB</used>
<free>40536 MiB</free>
</fb_memory_usage>
</gpu>
</nvidia_smi_log>
This can be fixed by falling back to the fb_memory_usage of the MIG device when the GPU-level value reports that permission is denied.
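A rough sketch of that fallback, assuming the XML layout shown above. The helper names parse_mib and gpu_total_memory are hypothetical and not part of sacred. Summing over several MIG devices is also an assumption, since only one device appears in this report.

```python
import xml.etree.ElementTree as ET


def parse_mib(text):
    """Return the leading integer of a string such as '4864 MiB', else None."""
    try:
        return int(text.split()[0])
    except (AttributeError, IndexError, ValueError):
        # None, empty text, or a message such as 'Insufficient Permissions'.
        return None


def gpu_total_memory(gpu):
    """Total memory in MiB for one <gpu> element (hypothetical helper)."""
    # The path only matches the <fb_memory_usage> directly under <gpu>.
    total = parse_mib(gpu.findtext("fb_memory_usage/total"))
    if total is not None:
        return total
    # Fallback: read the MIG devices instead.
    # Assumption: the totals of all visible MIG devices are summed.
    mig_totals = [
        parse_mib(dev.findtext("fb_memory_usage/total"))
        for dev in gpu.iterfind("mig_devices/mig_device")
    ]
    mig_totals = [t for t in mig_totals if t is not None]
    return sum(mig_totals) if mig_totals else None


# Trimmed copy of the MIG-enabled output from this report.
MIG_XML = """
<nvidia_smi_log>
  <gpu>
    <product_name>NVIDIA A100-SXM4-40GB</product_name>
    <mig_devices>
      <mig_device>
        <fb_memory_usage><total>4864 MiB</total></fb_memory_usage>
      </mig_device>
    </mig_devices>
    <fb_memory_usage><total>Insufficient Permissions</total></fb_memory_usage>
  </gpu>
</nvidia_smi_log>
"""

gpu = ET.fromstring(MIG_XML).find("gpu")
mig_total = gpu_total_memory(gpu)
print(mig_total)
```

Returning None when no value can be parsed leaves it to the caller to decide how to report a GPU whose memory size is unknown.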
thequilo commented
j3soon commented
Multi-instance GPU (MIG) mode is a relatively new feature introduced in the NVIDIA Ampere architecture that allows partitioning a GPU into multiple isolated instances. You can refer to the MIG User Guide for more details.