
GPU Admin Tools. Includes Confidential Computing controls for H100, and other functionality

NVIDIA GPU Admin Tools

This utility performs various GPU configuration tasks, including setting the Confidential Computing (CC) mode of supported GPUs, as well as some debug and test tasks. It is designed to be run as a privileged python3 command.

Supported CC modes are:

  • on
    • All supported GPU security features are enabled (e.g., bus encryption on, performance counters disabled)
  • devtools
    • All supported GPU security features are enabled, but the restrictions that prevent DevTools profiling and debugging are lifted
  • off
    • The GPU operates in its default mode; no supplementary confidential computing features are enabled
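
As a minimal sketch of how a wrapper script might validate a requested mode before calling the tool, the three modes above can be modeled as an enum. The `CCMode` class and the `cc_mode_args` helper are hypothetical names introduced here for illustration; only the `--set-cc-mode` and `--reset-after-cc-mode-switch` flags come from the tool itself.

```python
from enum import Enum


class CCMode(Enum):
    """The three CC modes listed above, keyed by their command-line spelling."""
    ON = "on"
    DEVTOOLS = "devtools"
    OFF = "off"


def cc_mode_args(mode: str, reset: bool = True) -> list[str]:
    """Return the tool arguments that select `mode` (hypothetical helper).

    Raises ValueError for anything other than on, devtools, or off.
    """
    try:
        cc_mode = CCMode(mode.strip().lower())
    except ValueError:
        valid = ", ".join(m.value for m in CCMode)
        raise ValueError(f"unknown CC mode {mode!r}; expected one of: {valid}") from None
    args = [f"--set-cc-mode={cc_mode.value}"]
    if reset:
        # The selected mode only becomes active after a GPU reset.
        args.append("--reset-after-cc-mode-switch")
    return args
```

Rejecting an unknown mode in the wrapper avoids launching a privileged command only to have it fail on argument parsing.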


sudo apt install patchelf python3-pip

Most Commonly Used Examples

Query the CC mode of all H100s in the system

sudo python3 ./nvidia_gpu_tools.py --gpu-name=H100 --query-cc-mode

Enable CC-On mode of all H100s in the system

sudo python3 ./nvidia_gpu_tools.py --gpu-name=H100 --set-cc-mode=on --reset-after-cc-mode-switch

Disable CC mode on a specific H100 in the system

sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=45:00.0 --set-cc-mode=off --reset-after-cc-mode-switch

Generic debug dump from GPU

sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=45:00.0 --debug-dump --log debug

Debug dump of NVLINK state

sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=45:00.0 --nvlink-debug-dump --log debug
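
The examples above all share one shape: `sudo python3 ./nvidia_gpu_tools.py`, a GPU selector, then one or more actions. A small Python sketch of scripting them follows. The `build_command` and `run` helpers are hypothetical, and requiring exactly one selector is a choice made by this sketch, not a rule enforced by the tool. By default the sketch only prints each command (a dry run) so nothing privileged is executed by accident.

```python
import shlex
import subprocess

TOOL = "./nvidia_gpu_tools.py"  # path used in the examples above


def build_command(*actions, gpu_bdf=None, gpu_name=None):
    """Build the argv for one invocation (hypothetical helper).

    This sketch insists on exactly one GPU selector, as in the examples.
    """
    if (gpu_bdf is None) == (gpu_name is None):
        raise ValueError("pass exactly one of gpu_bdf or gpu_name")
    if gpu_bdf is not None:
        selector = f"--gpu-bdf={gpu_bdf}"
    else:
        selector = f"--gpu-name={gpu_name}"
    return ["sudo", "python3", TOOL, selector, *actions]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(argv))
    if not dry_run:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(build_command("--query-cc-mode", gpu_name="H100"))
    run(build_command("--debug-dump", "--log", "debug", gpu_bdf="45:00.0"))
```

Passing the command as a list rather than a shell string avoids quoting problems when a BDF or GPU name is taken from user input.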


sudo python3 nvidia_gpu_tools.py --help

NVIDIA GPU Tools version v2024.01.11o
Command line arguments: ['./nvidia_gpu_tools.py', '--help']
Usage: nvidia_gpu_tools.py [options]

-h, --help            show this help message and exit
--gpu-bdf=GPU_BDF     Select a single GPU by providing a substring of the
                      BDF, e.g. '01:00'.
--gpu-name=GPU_NAME   Select a single GPU by providing a substring of the
                      GPU name, e.g. 'T4'. If multiple GPUs match, the first
                      one will be used.
--no-gpu              Do not use any of the GPUs; commands requiring one
                      will not work.
                      On Linux, specify whether to do MMIO through /dev/mem
                      or /sys/bus/pci/devices/.../resourceN
--recover-broken-gpu  Attempt recovering a broken GPU (unresponsive config
                      space or MMIO) by performing an SBR. If the GPU is
                      broken from the beginning and hence correct config
                      space wasn't saved, then re-enumerate it in the OS by
                      sysfs remove/rescan to restore BARs etc.
--reset-with-sbr      Reset the GPU with SBR and restore its config space
                      settings, before any other actions
--reset-with-flr      Reset the GPU with FLR and restore its config space
                      settings, before any other actions
--reset-with-os       Reset with OS through /sys/.../reset
--remove-from-os      Remove from OS through /sys/.../remove
--unbind-gpu          Unbind GPU
--unbind-gpus         Unbind GPUs
--bind-gpu=BIND_GPU   Bind GPUs to the specified driver
                      Bind GPUs to the specified driver
--query-ecc-state     Query the ECC state of the GPU
--query-cc-mode       Query the current Confidential Computing (CC) mode of
                      the GPU.
--query-cc-settings   Query the Confidential Computing (CC) settings of the
                      GPU. This prints the lower level setting knobs that
                      will take effect upon GPU reset.
--query-prc-knobs     Query all the Product Reconfiguration (PRC) knobs.
--set-cc-mode=SET_CC_MODE
                      Configure Confidential Computing (CC) mode. The
                      choices are off (disabled), on (enabled) or devtools
                      (enabled in DevTools mode). The GPU needs to be reset
                      to make the selected mode active. See
                      --reset-after-cc-mode-switch for one way of doing it.
--reset-after-cc-mode-switch
                      Reset the GPU after switching CC mode such that it is
                      activated immediately.
                      Test switching CC modes.
--query-module-name   Query the module name (aka physical ID and module ID).
                      Supported only on H100 SXM and NVSwitch_gen3
--clear-memory        Clear the contents of the GPU memory. Supported on
                      Pascal+ GPUs. Assumes the GPU has been reset with SBR
                      prior to this operation and can be combined with
                      --reset-with-sbr if not.
--debug-dump          Dump various state from the device for debug
--nvlink-debug-dump   Dump NVLINK debug state.
                      Force ECC to be enabled after a subsequent GPU reset
--test-ecc-toggle     Test toggling ECC mode.
--query-mig-mode      Query whether MIG mode is enabled.
                      Force MIG mode to be disabled after a subsequent GPU
                      reset
--test-mig-toggle     Test toggling MIG mode.
                      Block the specified NVLink. Can be specified multiple
                      times to block more NVLinks. NVLinks will be blocked
                      until an SBR. Supported on A100 only.
--block-all-nvlinks   Block all NVLinks. NVLinks will be blocked until a
                      subsequent SBR. Supported on A100 only.
--dma-test            Check that GPUs are able to perform DMA to all/most of
                      available system memory.
--test-pcie-p2p       Check that all GPUs are able to perform DMA to each
                      other.
                      Use GPU's DMA to read 32-bits from the specified
                      sysmem physical address
                      Use GPU's DMA to write specified 32-bits to the
                      specified sysmem physical address
                      Read 32-bits from device's config space at specified
                      offset
                      Write 32-bit to device's config space at specified
                      offset
                      Read 32-bits from GPU BAR0 at specified offset
                      Write 32-bit to GPU BAR0 at specified offset
                      Read 32-bits from GPU BAR1 at specified offset
                      Write 32-bit to GPU BAR1 at specified offset
                      Do not treat nvidia driver appearing to be loaded as an