How to run NCCL tests in a VM without NVSwitch passthrough?
joydchh opened this issue · 2 comments
Hi,
We are trying to run 4 VMs on a host with 8 H100s, with 2 GPUs per VM.
We found that the NVSwitches can only be passed through to a single VM, so the remaining VMs get none. VMs without an NVSwitch cannot run the NCCL tests. The error is shown below.
It occurred to me that disabling NVLink might make NCCL fall back to a PCIe path, so I tried setting NCCL_P2P_DISABLE=1, but it still does not work.
Is there any way to make this work?
Any insights would be appreciated.
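For reference, the attempt described above can be reproduced with a small launcher helper like the one below. This is only a sketch: the binary path `./build/all_reduce_perf` and the size flags assume a default nccl-tests build and are not from the original report, and `NCCL_DEBUG=INFO` is an addition here so that NCCL logs which transport it selects.

```python
import os


def nccl_test_env(disable_p2p=True, debug="INFO", base_env=None):
    """Return a copy of the environment with NCCL settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    if disable_p2p:
        # Ask NCCL not to use direct GPU-to-GPU (P2P) transport.
        env["NCCL_P2P_DISABLE"] = "1"
    # Verbose logging; not part of the original attempt (assumption).
    env["NCCL_DEBUG"] = debug
    return env


def nccl_test_cmd(binary="./build/all_reduce_perf", gpus=2):
    """Build an nccl-tests command line (binary path is hypothetical)."""
    return [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(gpus)]


# Usage inside the VM, e.g.:
#   import subprocess
#   subprocess.run(nccl_test_cmd(), env=nccl_test_env(), check=False)
```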
You have to run nvidia-fabricmanager on the hypervisor itself.
That is, you pass through only the GPUs to your VMs; the NVSwitches stay attached to the hypervisor and bound to the NVIDIA driver. Then you can run nvidia-fabricmanager and configure the partitions:
https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html#shared-nvswitch-virtualization-model
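As a rough sketch of the hypervisor-side configuration step: to my understanding, the mode is selected with the `FABRIC_MODE` key in the Fabric Manager config file (commonly `/usr/share/nvidia/nvswitch/fabricmanager.cfg`), with `1` meaning the shared NVSwitch model. Both the path and the value are assumptions to verify against the guide linked above. The helper below only rewrites the config text; restarting the nvidia-fabricmanager service and activating a partition for each VM are separate steps and are not shown.

```python
import re

# Assumption: FABRIC_MODE=1 selects the shared NVSwitch virtualization model.
SHARED_NVSWITCH = 1


def set_fabric_mode(cfg_text, mode=SHARED_NVSWITCH):
    """Return cfg_text with FABRIC_MODE set to `mode`.

    An existing uncommented FABRIC_MODE line is replaced;
    otherwise a new line is appended at the end.
    """
    line = f"FABRIC_MODE={mode}"
    pattern = re.compile(r"^[ \t]*FABRIC_MODE[ \t]*=.*$", re.MULTILINE)
    if pattern.search(cfg_text):
        return pattern.sub(line, cfg_text)
    # Add a separating newline only if the text does not already end with one.
    sep = "" if cfg_text.endswith("\n") or not cfg_text else "\n"
    return cfg_text + sep + line + "\n"


# Usage on the hypervisor (path is an assumption, check your install):
#   path = "/usr/share/nvidia/nvswitch/fabricmanager.cfg"
#   with open(path) as f:
#       updated = set_fabric_mode(f.read())
#   with open(path, "w") as f:
#       f.write(updated)
```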