NVIDIA/caffe

ACS may cause P2P bandwidth problem

lukeyeager opened this issue · 2 comments

If training performance is lower than expected when DIGITS is configured to use multiple GPUs, verify that PCI Express Access Control Services (ACS) are disabled.

NVIDIA recommends that the system BIOS (SBIOS) disable ACS to ensure maximum P2P bandwidth between GPUs. The SBIOS should leave the ACS capability exposed but disabled on switch downstream ports and root ports, so that ACS-aware operating systems and hypervisors can choose to enable ACS when required.

Please verify with your motherboard manufacturer that the SBIOS correctly disables ACS and, if it does not, whether an updated SBIOS is available.

If an SBIOS that correctly disables ACS is not yet available from your motherboard manufacturer, you can attempt to disable ACS programmatically by running the following script, which uses the Linux lspci and setpci utilities. Note that this script must be run after every system boot or system reset.

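#!/usr/bin/env bash
# Disable ACS on PCIe devices with vendor ID 10b5 (PLX switches). Requires root.
for i in $(lspci -d "10b5:" | awk '{print $1}') ; do
       # The ACSCtl line is only present if the device exposes the ACS capability.
       o=$(lspci -vvv -s "$i" | grep ACSCtl)
       if [ $? -eq 0 ] ; then
               # A "+" flag means at least one ACS feature is still enabled.
               echo "$o" | grep "+"
               if [ $? -eq 0 ] ; then
                       # Zero the 16-bit word at offset f2a, where this script expects the ACS control register.
                       setpci -s "$i" f2a.w=0000
               fi
       fi
done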
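
To confirm that the script took effect, it can help to list which devices still report an enabled ACS control flag. The following is a minimal sketch, not part of the original instructions: the function name acs_enabled_devices is hypothetical, and it assumes the usual lspci -vvv layout, in which each device starts on an unindented line and its ACSCtl line is indented beneath it. It likely needs to run as root, since lspci hides capability details from unprivileged users.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `lspci -vvv` output on stdin and prints the
# address of every device whose ACSCtl line has at least one "+" flag.
acs_enabled_devices() {
    awk '
        # An unindented line starting with a hex digit begins a new device.
        /^[0-9a-fA-F]/ { dev = $1 }
        # An ACSCtl line containing "+" means some ACS feature is enabled.
        /ACSCtl:/ && index($0, "+") > 0 { print dev }
    '
}

# Only query the live system when lspci is actually installed.
if command -v lspci >/dev/null 2>&1; then
    enabled=$(lspci -vvv 2>/dev/null | acs_enabled_devices)
    if [ -n "$enabled" ]; then
        echo "ACS still enabled on:"
        echo "$enabled"
    else
        echo "No device reports an enabled ACS control flag."
    fi
else
    echo "lspci not found; install pciutils to run this check."
fi
```

Unlike the script above, this check is not restricted to vendor ID 10b5, so it also reports root ports and switches from other vendors. An empty result when run without root privileges is not conclusive.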

Alternatively, you can disable ACS directly in the server's BIOS.

How much of a performance drop did you observe? We have a similar issue. @lukeyeager