IDSIA/sacred

Insufficient Permissions error on NVIDIA Multi-Instance GPU (MIG)

j3soon opened this issue · 2 comments

When using sacred on an A100 with MIG mode enabled, it fails with the following error:

     "model": child.find("product_name").text,
     "total_memory": int(
--->     child.find("fb_memory_usage").find("total").text.split()[0]
     ),
     "persistence_mode": (child.find("persistence_mode").text == "Enabled"),

ValueError: invalid literal for int() with base 10: 'Insufficient'

This is caused by line 171 of sacred/host_info.py, where child.find("fb_memory_usage").find("total").text returns "Insufficient Permissions". After .split()[0], int() is called on the token "Insufficient" and raises the ValueError.

The output of nvidia-smi -q -x looks something like this:

<nvidia_smi_log>
	<gpu>
		<product_name>NVIDIA A100-SXM4-40GB</product_name>
		<product_brand>NVIDIA</product_brand>
		<display_mode>Enabled</display_mode>
		<display_active>Disabled</display_active>
		<persistence_mode>Enabled</persistence_mode>
		<mig_mode>
			<current_mig>Enabled</current_mig>
			<pending_mig>Enabled</pending_mig>
		</mig_mode>
		<mig_devices>
		<mig_device>
			<index>0</index>
			<gpu_instance_id>10</gpu_instance_id>
			<compute_instance_id>0</compute_instance_id>
			<fb_memory_usage>
				<total>4864 MiB</total>
				<used>0 MiB</used>
				<free>4860 MiB</free>
			</fb_memory_usage>
		</mig_device>
		</mig_devices>
		<fb_memory_usage>
			<total>Insufficient Permissions</total>
			<used>Insufficient Permissions</used>
			<free>Insufficient Permissions</free>
		</fb_memory_usage>
	</gpu>
</nvidia_smi_log>

This error does not occur on an A100 with MIG mode disabled, since the GPU-level memory values are readable:

<nvidia_smi_log>
	<gpu>
		<product_name>NVIDIA A100-SXM4-40GB</product_name>
		<product_brand>NVIDIA</product_brand>
		<display_mode>Enabled</display_mode>
		<display_active>Disabled</display_active>
		<persistence_mode>Enabled</persistence_mode>
		<mig_mode>
			<current_mig>Disabled</current_mig>
			<pending_mig>Disabled</pending_mig>
		</mig_mode>
		<mig_devices>
			None
		</mig_devices>
		<fb_memory_usage>
			<total>40536 MiB</total>
			<used>0 MiB</used>
			<free>40536 MiB</free>
		</fb_memory_usage>
	</gpu>
</nvidia_smi_log>

This can be fixed by parsing the fb_memory_usage of the MIG device when permission is denied.

Hi @j3soon!

I didn't know this multi-instance mode even existed. The changes you suggested in #865 seem reasonable. Thank you!

Multi-instance GPU (MIG) mode is a relatively new feature introduced in the NVIDIA Ampere architecture that allows partitioning a GPU into multiple isolated instances. You can refer to the MIG User Guide for more details.