Commands that will make you more comfortable with the ROCm toolkit.

Continuously being improved.

ROCm GPU Tradecraft

This guide is designed for engineers and developers seeking to migrate from Nvidia's CUDA to the open, community-driven environment provided by ROCm. It offers a comprehensive collection of ROCm commands, best practices, and performance tuning techniques to help you become proficient with the AMD ROCm toolkit.

Advanced Computational Infrastructure Engineering -- For Everyone

As the demand for high-performance computing continues to grow, many engineers are looking for alternatives to proprietary solutions like CUDA. ROCm provides an open ecosystem that empowers developers with greater control over their parallel programming environments. This guide aims to facilitate your transition by covering essential topics such as:

| Topic | Description |
| --- | --- |
| System Setup and Installation | Step-by-step instructions to get ROCm up and running on your hardware. |
| ROCm Component Packages | Detailed descriptions of key ROCm packages and their components. |
| Monitoring and Managing GPU Usage | Tools and commands to monitor GPU utilization, temperature, power consumption, and more. |
| GPU Performance Tuning | Techniques to optimize GPU performance, including setting clock levels, fan speeds, and performance levels. |
| Diagnostics and Debugging | Methods to check GPU health, reset GPUs, and manage logs. |
| GPU Memory Management | Commands to check and clear GPU memory usage. |
| Advanced GPU Configuration | Tips for overclocking, underclocking, and setting power caps on GPUs. |
| ROCm-SMI Commands Summary | A comprehensive summary of ROCm System Management Interface commands. |
| Development Tools | Instructions for installing and using HIP, ROCm's Heterogeneous-Compute Interface for Portability. |
| Performance Tuning and Benchmarking | Guides to installing and running performance tests using ROCm tools like rocprof and rocminfo. |
| RoCE and GPU Network Fabrics | Steps to set up and manage RDMA over Converged Ethernet for high-bandwidth GPU communication. |
| Performance and Benchmarking Cookbook | Practical recipes for benchmarking and optimizing ROCm, RDMA, and RoCE performance. |

Why Choose ROCm?

  • Open Ecosystem: Enjoy the freedom and flexibility of an open-source platform with a vibrant community of contributors.
  • Control and Customization: Gain granular control over your parallel programming environment, allowing for custom optimizations and enhancements.
  • Future-Ready: Leverage cutting-edge technologies and stay ahead in the rapidly evolving field of high-performance computing.

Welcome to the future of parallel programming. Let's get started with ROCm and unlock the full potential of your hardware!

Transitioning from Nvidia CUDA to AMD ROCm

Hardware Concepts

| CUDA (Nvidia) | ROCm (AMD) | Description |
| --- | --- | --- |
| Nvidia GPU | AMD GPU | Graphics Processing Unit (GPU) for parallel processing |
| Tensor Cores | Matrix Cores | Specialized cores for deep learning operations |
| NVLink | Infinity Fabric | High-bandwidth interconnect for communication between GPUs |
| CUDA Cores | Stream Processors | Fundamental processing units in the GPU |
| SM (Streaming Multiprocessor) | CU (Compute Unit) | Hardware block containing multiple processing units |
| Warp | Wavefront | A group of threads executed in lock-step |
| Unified Memory | Unified Address Space | Memory management allowing shared address space between CPU and GPU |
| nvcc (CUDA Compiler) | hipcc (HIP Compiler) | Compiler for converting code into executable for the GPU |
| CUDA Driver | ROCm Driver | Software component to manage GPU resources and execution |

CUDA vs ROCm General Terminology Concepts

| CUDA (Nvidia) | ROCm (AMD) | Description |
| --- | --- | --- |
| CUDA | HIP | ROCm's Heterogeneous-Compute Interface for Portability (HIP) |
| CUDA Toolkit | ROCm Toolkit | The software suite for developing applications for AMD GPUs |
| cuBLAS | rocBLAS | Library for basic linear algebra subprograms |
| cuDNN | MIOpen | Library for deep neural networks |
| cuFFT | rocFFT | Library for Fast Fourier Transform operations |
| cuRAND | rocRAND | Library for random number generation |
| cuSPARSE | rocSPARSE | Library for sparse matrix operations |
| nvprof | rocprof | Profiling tools for performance analysis |
| Nsight Compute | rocProfiler, CodeXL | Tools for performance analysis and debugging |
| Nsight Systems | rocTracer | Tools for system-wide tracing and performance optimization |

Programming Concepts

CUDA vs ROCm

| CUDA (Nvidia) | ROCm (AMD) | Description |
| --- | --- | --- |
| `__global__` | `__global__` | Qualifier to define a kernel function |
| `__device__` | `__device__` | Qualifier for functions executed on the device |
| `__host__` | `__host__` | Qualifier for functions executed on the host |
| `__shared__` | `__shared__` | Qualifier for shared memory |
| `__constant__` | `__constant__` | Qualifier for constant memory |
| `threadIdx` | `hipThreadIdx_x`, `hipThreadIdx_y`, `hipThreadIdx_z` | Variable for thread indices within a block |
| `blockIdx` | `hipBlockIdx_x`, `hipBlockIdx_y`, `hipBlockIdx_z` | Variable for block indices within a grid |
| `blockDim` | `hipBlockDim_x`, `hipBlockDim_y`, `hipBlockDim_z` | Variable for block dimensions |
| `gridDim` | `hipGridDim_x`, `hipGridDim_y`, `hipGridDim_z` | Variable for grid dimensions |
| `cudaMalloc` | `hipMalloc` | Function to allocate device memory |
| `cudaFree` | `hipFree` | Function to free device memory |
| `cudaMemcpy` | `hipMemcpy` | Function to copy memory between host and device |
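
The memory-function renames in the table are regular enough to illustrate with a mechanical substitution. The following is a toy sketch only: `rename_cuda_memory_calls` is a hypothetical helper name, and it covers just the three calls listed above. For real ports, ROCm provides dedicated translation tools (hipify-perl and hipify-clang).

```bash
#!/usr/bin/env bash
# Toy illustration of the CUDA -> HIP renames from the table above.
# rename_cuda_memory_calls is a hypothetical helper, not a ROCm tool.
rename_cuda_memory_calls() {
    # Reads CUDA source on stdin and writes the renamed source to stdout.
    sed -e 's/cudaMemcpy/hipMemcpy/g' \
        -e 's/cudaMalloc/hipMalloc/g' \
        -e 's/cudaFree/hipFree/g'
}

# Example: translate a single line of CUDA host code.
echo 'cudaMalloc(&d_a, n * sizeof(float));' | rename_cuda_memory_calls
```

Because the substitution is purely textual, it also rewrites identifiers that merely start with these names (for example the `cudaMemcpyHostToDevice` enum), which happens to match HIP's naming but is not something to rely on in general.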

System Setup and Installation

Adding the ROCm Repository:

Register the repository and its signing key first so that apt can locate the ROCm packages.

echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-key 609B6B9E
sudo apt update

Installing ROCm:

sudo apt install -y rocm-dkms

Installing ROCm Components:

sudo apt install -y rocm-dev rocm-utils rocm-libs miopen-hip
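
After installation it can help to confirm that the command-line tools used throughout this guide are actually on disk. The sketch below assumes the default `/opt/rocm/bin` location; `missing_tools` and the `ROCM_BIN` variable are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Sketch of a post-install check. ROCM_BIN defaults to the path used in this guide.
ROCM_BIN="${ROCM_BIN:-/opt/rocm/bin}"

# Print each named tool that is not an executable file in the given directory.
missing_tools() {
    local dir="$1"; shift
    local tool
    for tool in "$@"; do
        [ -x "$dir/$tool" ] || echo "$tool"
    done
}

missing=$(missing_tools "$ROCM_BIN" rocm-smi rocminfo hipcc)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Not found in $ROCM_BIN:" $missing
else
    echo "rocm-smi, rocminfo, and hipcc are present in $ROCM_BIN"
fi
```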

ROCm Component Packages

rocm-dev

  • Description: Development package for ROCm, including compiler and libraries needed for developing applications.
  • Components: Includes the ROCm compiler, ROCm runtime, and various ROCm libraries for development.

rocm-utils

  • Description: Utility package for ROCm, providing tools for monitoring and managing ROCm-enabled devices.
  • Components: Contains utilities like rocm-smi for system management and monitoring of ROCm devices.

rocm-libs

  • Description: Library package for ROCm, providing essential libraries for ROCm applications.
  • Components: Includes libraries like ROCm Math Libraries (rocBLAS, rocFFT, rocSPARSE), ROCm Communication Libraries, and others necessary for running ROCm applications.

miopen-hip

  • Description: Machine Intelligence library for deep learning frameworks on ROCm, optimized for AMD GPUs.
  • Components: Includes MIOpen, a GPU-accelerated library providing highly optimized implementations of standard deep learning operations.

Monitoring and Managing GPU Usage

Listing Available GPUs

To list all available GPUs:

/opt/rocm/bin/rocm-smi

Monitoring GPU Utilization

To monitor GPU utilization in real time:

watch -n 1 /opt/rocm/bin/rocm-smi --showuse

Checking GPU Temperature

To check the current temperature of the GPUs:

/opt/rocm/bin/rocm-smi --showtemp

Checking GPU Power Consumption

To check the power consumption of the GPUs:

/opt/rocm/bin/rocm-smi --showpower

Checking GPU Fan Speed

To check the fan speed of the GPUs:

/opt/rocm/bin/rocm-smi --showfan

Checking GPU Clock Frequencies

To check the clock frequencies of the GPUs:

/opt/rocm/bin/rocm-smi --showclocks
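
For longer runs, a timestamped log is often more useful than a live `watch` view. The sketch below appends temperature and power readings to a file at a fixed interval. `log_readings` and the `SMI` variable are hypothetical names, and the function simply records whatever text rocm-smi prints.

```bash
#!/usr/bin/env bash
# Sketch: append timestamped rocm-smi readings to a log file.
SMI="${SMI:-/opt/rocm/bin/rocm-smi}"

# log_readings FILE COUNT INTERVAL_SECONDS
log_readings() {
    local file="$1" count="$2" interval="$3" i
    for ((i = 0; i < count; i++)); do
        {
            echo "=== $(date '+%Y-%m-%dT%H:%M:%S') ==="
            "$SMI" --showtemp --showpower
        } >> "$file"
        # Skip the sleep after the final sample
        if ((i + 1 < count)); then sleep "$interval"; fi
    done
}

# Example (uncomment on a machine with ROCm installed):
# log_readings gpu_readings.log 60 1
```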

GPU Performance Tuning

Setting GPU Core Clock Level

To restrict the core clock (sclk) of a GPU to a specific level:

sudo /opt/rocm/bin/rocm-smi --setsclk 4 --device 0
  • Description: Pins the GPU core clock to one of the discrete clock levels the device supports.
  • Example: --setsclk 4 selects core clock level 4.
  • Device: --device 0 specifies the target GPU (device 0).
  • Real-World Reasons:
    • Energy Efficiency: Select a lower level to reduce power consumption and heat output in low-demand scenarios.
    • Performance Tuning: Select a higher level for high-performance tasks like deep learning model training.

Setting GPU Fan Speed

To set the fan speed of a GPU:

sudo /opt/rocm/bin/rocm-smi --setfan 100% --device 0
  • Description: Sets the GPU fan speed, given either as a raw level (0-255) or as a percentage with a trailing % sign.
  • Example: --setfan 100% sets the fan speed to 100% (full speed).
  • Device: --device 0 specifies the target GPU (device 0).
  • Real-World Reasons:
    • Overheating Prevention: Increase fan speed during intensive tasks to prevent overheating.
    • Noise Management: Lower fan speed in less demanding scenarios to reduce noise in quiet environments.

Setting GPU Memory Clock

To set the memory clock speed of a GPU:

sudo /opt/rocm/bin/rocm-smi --setmclk 2 --device 0
  • Description: Sets the memory clock speed of the GPU to a specific level.
  • Example: --setmclk 2 sets the memory clock to level 2.
  • Device: --device 0 specifies the target GPU (device 0).
  • Real-World Reasons:
    • Performance Optimization: Increase memory clock speed to boost performance in memory-intensive applications.
    • Power Saving: Reduce memory clock speed when high performance is not required to save energy.

Setting GPU Performance Level

To set the performance level of a GPU:

sudo /opt/rocm/bin/rocm-smi --setperflevel high --device 0
  • Description: Sets the GPU performance level.
  • Example: --setperflevel high sets the performance level to high.
  • Device: --device 0 specifies the target GPU (device 0).
  • Real-World Reasons:
    • High Performance: Set to high performance for demanding tasks like real-time rendering or complex simulations.
    • Automatic Mode: Use --setperflevel auto to let the driver balance performance and power consumption for typical workloads.
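
A forced performance level stays in effect until it is changed again, so it is easy to leave a GPU pinned at `high` after a job finishes. One way to avoid that is a small wrapper that raises the level, runs the workload, and then returns control to the driver. This is a sketch under stated assumptions: `with_high_perf`, `SMI`, and `SUDO` are hypothetical names, and `./my_program` stands in for your own workload.

```bash
#!/usr/bin/env bash
# Sketch: hold a GPU at the "high" performance level for one command,
# then hand clock management back to the driver with "auto".
SMI="${SMI:-/opt/rocm/bin/rocm-smi}"
SUDO="${SUDO-sudo}"   # set SUDO="" when already running as root

# with_high_perf DEVICE_ID COMMAND [ARGS...]
with_high_perf() {
    local device="$1"; shift
    local status=0
    $SUDO "$SMI" --setperflevel high --device "$device" || return 1
    "$@" || status=$?
    # Restore automatic management even if the workload failed
    $SUDO "$SMI" --setperflevel auto --device "$device"
    return "$status"
}

# Example (uncomment on a machine with ROCm installed):
# with_high_perf 0 ./my_program
```

The wrapper returns the workload's own exit status, so it can be dropped into existing job scripts without hiding failures.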

Diagnostics and Debugging

Checking GPU Health:

/opt/rocm/bin/rocm-smi --showhealth
  • Summary: Intended to display the overall health status of the GPU. If your rocm-smi release does not recognize this flag, the --showtemp, --showfan, and --showpower commands report the underlying metrics individually.

Resetting GPU:

sudo /opt/rocm/bin/rocm-smi --gpureset --device 0
  • Summary: This command resets the specified GPU (one device must be named). It can be useful for troubleshooting and recovering a GPU that is stuck in a bad state.

Saving GPU Settings:

/opt/rocm/bin/rocm-smi --save log.txt
  • Summary: This command saves the current clock, fan, and performance settings to the specified file (in this case, log.txt). The saved file records the tuning state for later diagnostics and can be reapplied with --load.

Clearing GPU Logs:

sudo /opt/rocm/bin/rocm-smi --clearlog
  • Summary: This command clears the GPU logs. It can help in managing disk space and removing old log data that is no longer needed for analysis.

GPU Memory Management

Checking GPU Memory Usage:

/opt/rocm/bin/rocm-smi --showmemuse
  • Summary: This command displays the current GPU memory usage, helping you monitor the memory allocation and identify any potential memory bottlenecks.

Clearing GPU Memory:

sudo /opt/rocm/bin/rocm-smi --clearmem
  • Summary: This command clears the GPU memory, which can be useful for freeing up memory resources and resolving memory-related issues.

Advanced GPU Configuration

Overclocking GPU:

sudo /opt/rocm/bin/rocm-smi --setsclkoc 1700 --device 0
  • Summary: This command overclocks the GPU to a specified frequency (1700 MHz in this case), potentially improving performance but also increasing power consumption and heat.

Underclocking GPU:

sudo /opt/rocm/bin/rocm-smi --setsclkoc 1400 --device 0
  • Summary: This command underclocks the GPU to a specified frequency (1400 MHz in this case), reducing power consumption and heat, which may be beneficial for stability and longevity.

Setting Power Cap:

sudo /opt/rocm/bin/rocm-smi --setpoweroverdrive 200 --device 0
  • Summary: This command sets a power cap (200W in this case) on the GPU, limiting its maximum power consumption to control thermal output and power usage.
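
Before applying a power cap, it is worth checking that the requested wattage falls inside the range the board supports. The sketch below clamps a requested value to a range and prints the resulting command instead of running it. `clamp_watts` is a hypothetical helper, and the 150-250 W range is a placeholder, not a real device limit.

```bash
#!/usr/bin/env bash
# clamp_watts REQUESTED MIN MAX -> prints the nearest value inside [MIN, MAX]
clamp_watts() {
    local requested="$1" min="$2" max="$3"
    if ((requested < min)); then
        echo "$min"
    elif ((requested > max)); then
        echo "$max"
    else
        echo "$requested"
    fi
}

# Example: 150-250 W is a placeholder range; take the real limits
# from your board's specification.
cap=$(clamp_watts 300 150 250)
echo "sudo /opt/rocm/bin/rocm-smi --setpoweroverdrive $cap --device 0"
```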

ROCm-SMI Commands Summary

General Information:

/opt/rocm/bin/rocm-smi --showhw
/opt/rocm/bin/rocm-smi --showallinfo
  • Summary: These commands provide detailed hardware information and all available information about the GPU, respectively, helping in understanding the system's configuration and capabilities.

Temperature and Fan:

/opt/rocm/bin/rocm-smi --showtemp
/opt/rocm/bin/rocm-smi --showfan
/opt/rocm/bin/rocm-smi --setfan <percentage>% --device <device_id>
  • Summary: These commands display the GPU temperature and fan speed, and allow setting the fan speed to a specific percentage, useful for thermal management.

Power and Performance:

/opt/rocm/bin/rocm-smi --showpower
/opt/rocm/bin/rocm-smi --setperflevel <level> --device <device_id>
/opt/rocm/bin/rocm-smi --setpoweroverdrive <value> --device <device_id>
  • Summary: These commands display power consumption, set performance levels, and configure power overdrive settings, crucial for optimizing power efficiency and performance.

Clock Speeds:

/opt/rocm/bin/rocm-smi --showclocks
/opt/rocm/bin/rocm-smi --setsclk <value> --device <device_id>
/opt/rocm/bin/rocm-smi --setmclk <value> --device <device_id>
/opt/rocm/bin/rocm-smi --setsclkoc <value> --device <device_id>
  • Summary: These commands show and set the GPU's clock speeds, including overclocking and underclocking, to adjust performance and power consumption.

Memory:

/opt/rocm/bin/rocm-smi --showmemuse
/opt/rocm/bin/rocm-smi --clearmem
  • Summary: These commands display current GPU memory usage and clear the GPU memory, aiding in memory management and troubleshooting.

Logs and Health:

/opt/rocm/bin/rocm-smi --showhealth
/opt/rocm/bin/rocm-smi --save <filename>
/opt/rocm/bin/rocm-smi --clearlog
  • Summary: These commands show the health status of the GPU, save the current settings to a file, and clear existing logs, useful for diagnostics and maintenance.

Reset:

sudo /opt/rocm/bin/rocm-smi --gpureset --device <device_id>
  • Summary: This command resets the GPU, which can be helpful for recovering from errors and ensuring the GPU is in a clean state.

RoCE (RDMA over Converged Ethernet) and GPU Network Fabrics

1. Installing RDMA Tools:

  • Command: sudo apt install -y rdma-core
  • Purpose: Installs the essential RDMA tools and libraries.

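2. Configuring RoCE:

  • Commands:
    sudo modprobe mlx5_ib
    sudo modprobe rdma_ucm
    sudo modprobe ib_ipoib
  • Purpose: Loads the necessary kernel modules for RDMA functionality. mlx5_ib matches the mlx5_0 device used in the tests in this guide; older ConnectX-3 adapters use mlx4_ib instead.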

3. Checking RDMA Devices:

  • Command: ibv_devices
  • Purpose: Lists the available RDMA devices on the system.

4. Displaying RDMA Configuration:

  • Command: ibv_devinfo
  • Purpose: Displays detailed information about RDMA devices.

5. Setting Up Soft-RoCE (Software RDMA over Ethernet):

  • Command: sudo rdma link add rxe0 type rxe netdev eth0
  • Purpose: Creates a software RoCE device (rxe0) on top of the specified network interface, which is useful when no RDMA-capable NIC is available.

6. Listing RDMA Interfaces:

  • Command: rdma link show
  • Purpose: Lists the configured RDMA interfaces.

7. Running RDMA Bandwidth Test:

  • Command: ib_send_bw -d mlx5_0 -i 1
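  • Purpose: Measures the bandwidth performance of the RDMA device. Run the command as shown on the server, then run it on the client with the server's address appended.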

8. RDMA Latency Test:

  • Command: ib_send_lat -d mlx5_0 -i 1
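  • Purpose: Measures the latency of the RDMA device. As with the bandwidth test, start the server side first and pass the server's address on the client.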
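
The device name passed to `-d` in the tests above comes from the `ibv_devices` listing. A small parser can pick it up automatically. This sketch assumes the usual `ibv_devices` layout (a header line, a dashed separator, then one device per line). `rdma_device_names` is a hypothetical helper, and the GUIDs in the sample are made up.

```bash
#!/usr/bin/env bash
# Print device names from `ibv_devices` output read on stdin.
# Assumes a header line, a dashed separator, then one device per line.
rdma_device_names() {
    awk 'NR > 2 && NF >= 1 { print $1 }'
}

# Example with captured sample output (the GUIDs are made up):
sample='    device                 node GUID
    ------              ----------------
    mlx5_0              0000000000000001
    mlx5_1              0000000000000002'
first=$(printf '%s\n' "$sample" | rdma_device_names | head -n 1)
echo "ib_send_bw -d $first -i 1"
```

On a live system, replace the sample with `ibv_devices | rdma_device_names`.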

9. Inspecting RDMA Resources:

  • Commands:
    rdma resource show qp
    rdma resource show pd
  • Purpose: Displays RDMA resources such as queue pairs and protection domains.

10. Using rping for RDMA Connectivity Testing:

  • Server-side:
    sudo rping -s -a <server_ip> -v -C 10
  • Client-side:
    sudo rping -c -a <server_ip> -v -C 10
  • Purpose: Tests RDMA connectivity between server and client.

11. Configuring QoS for RDMA Traffic:

  • Commands:
    sudo tc qdisc add dev eth0 root handle 1: htb default 10
    sudo tc class add dev eth0 parent 1: classid 1:10 htb rate 10Gbit
    sudo tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 match ip dport 4791 0xffff flowid 1:10
  • Purpose: Sets up Quality of Service (QoS) to prioritize RDMA traffic.

12. Setting up RDMA Multicast:

  • Command: sudo ib_mcjoin -v <multicast_group>
  • Purpose: Joins an RDMA multicast group.

13. Monitoring RDMA Traffic:

  • Command: sudo perfquery -a
  • Purpose: Monitors the performance and statistics of RDMA traffic.

Performance and Benchmarking Cookbook Using ROCm, RDMA, and RoCE

Basic Performance Benchmarking and Stress Testing

1. Installing ROCm and Performance Tools:

  • Commands:
    sudo apt update
    sudo apt install -y rocm-dkms rocm-utils rocm-libs rocm-bandwidth-test rocprof rdma-core
  • Purpose: Installs ROCm (Radeon Open Compute) and associated performance tools.

2. Running ROCm Bandwidth Test:

  • Command: /opt/rocm/bin/rocm-bandwidth-test
  • Purpose: Measures the bandwidth performance of GPU memory transfers.

3. Interpreting Bandwidth Test Results:

  • Key Metrics: Peak Bandwidth
  • Purpose: Shows the data transfer capability of the system. Compare the reported bandwidth for device-to-device, host-to-device, and device-to-host transfers.

Profiling Applications with rocprof

4. Profiling a HIP Program:

  • Command: rocprof --hip-trace ./my_program
  • Purpose: Profiles a HIP (Heterogeneous-Compute Interface for Portability) program to gather performance data.

5. Generating a Profile Report:

  • Command: rocprof --hsa-trace ./my_program
  • Purpose: Generates a profile report with detailed tracing of HSA (Heterogeneous System Architecture) activities.

6. Analyzing Profile Data:

  • Purpose: Analyzes rocprof output to identify kernel execution times, memory transfer times, and other performance metrics to find bottlenecks.
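
As one way to find the slowest kernels, the per-dispatch rows in a rocprof CSV can be summed by kernel name. The sketch below makes several assumptions that should be checked against your own output: that the CSV has a header row containing `KernelName` and `DurationNs` columns, and that no field contains an embedded comma (templated kernel names often do, in which case a real CSV parser is needed). `summarize_kernels` is a hypothetical helper, and the sample rows are invented.

```bash
#!/usr/bin/env bash
# Sum DurationNs per KernelName from a rocprof-style CSV on stdin and
# print "total_ns,kernel" lines, largest total first.
# Assumptions: a header row naming both columns, no commas inside fields.
summarize_kernels() {
    awk -F',' '
        NR == 1 {
            # Map each header name to its column index
            for (i = 1; i <= NF; i++) { gsub(/"/, "", $i); col[$i] = i }
            next
        }
        {
            name = $col["KernelName"]; gsub(/"/, "", name)
            total[name] += $col["DurationNs"]
        }
        END { for (k in total) printf "%.0f,%s\n", total[k], k }
    ' | sort -t',' -k1,1nr
}

# Example with invented rows:
sample='"Index","KernelName","DurationNs"
0,"vec_add",1200
1,"gemm",5000
2,"vec_add",800'
printf '%s\n' "$sample" | summarize_kernels
```

On real data, feed the function the CSV that rocprof wrote for the run, for example `summarize_kernels < results.csv`.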

Using rocminfo for System Information

7. Checking System Information:

  • Command: /opt/rocm/bin/rocminfo
  • Purpose: Provides detailed information about the GPUs in the system, including device IDs, memory sizes, and supported features.

RDMA and RoCE Performance Testing

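8. Configuring RoCE:

  • Commands:
    sudo modprobe mlx5_ib
    sudo modprobe rdma_ucm
    sudo modprobe ib_ipoib
  • Purpose: Loads the necessary kernel modules for RDMA over Converged Ethernet (RoCE) functionality. mlx5_ib matches the mlx5_0 device used in the tests below; older ConnectX-3 adapters use mlx4_ib instead.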

9. Setting Up Soft-RoCE (Software RDMA over Ethernet):

  • Commands:
    sudo rdma link add rxe0 type rxe netdev eth0
    sudo ip link set eth0 up
  • Purpose: Creates a software RoCE device (rxe0) on top of the specified network interface and makes sure that interface is up.

10. Checking RDMA Devices:

  • Command: ibv_devices
  • Purpose: Lists the available RDMA devices on the system.

11. Displaying RDMA Configuration:

  • Command: ibv_devinfo
  • Purpose: Displays detailed information about RDMA devices.

12. Running RDMA Bandwidth Test:

  • Command: ib_send_bw -d mlx5_0 -i 1
  • Purpose: Measures the bandwidth performance of the RDMA device.

13. RDMA Latency Test:

  • Command: ib_send_lat -d mlx5_0 -i 1
  • Purpose: Measures the latency of the RDMA device.

14. Using rping for RDMA Connectivity Testing:

  • Server-side:
    sudo rping -s -a <server_ip> -v -C 10
  • Client-side:
    sudo rping -c -a <server_ip> -v -C 10
  • Purpose: Tests RDMA connectivity between server and client.

15. Configuring QoS for RDMA Traffic:

  • Commands:
    sudo tc qdisc add dev eth0 root handle 1: htb default 10
    sudo tc class add dev eth0 parent 1: classid 1:10 htb rate 10Gbit
    sudo tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 match ip dport 4791 0xffff flowid 1:10
  • Purpose: Sets up Quality of Service (QoS) to prioritize RDMA traffic.

16. Setting up RDMA Multicast:

  • Command: sudo ib_mcjoin -v <multicast_group>
  • Purpose: Joins an RDMA multicast group.

17. Monitoring RDMA Traffic:

  • Command: sudo perfquery -a
  • Purpose: Monitors the performance and statistics of RDMA traffic.

Advanced Performance Benchmarking

Using rocblas-bench for BLAS Performance

1. Installing rocBLAS:

  • Command: sudo apt install -y rocblas
  • Purpose: Installs the rocBLAS library for BLAS (Basic Linear Algebra Subprograms) operations.

Running rocblas-bench:

  • Command:
    rocblas-bench -f gemm -r f32_r -m 4096 -n 4096 -k 4096 --a_type f32_r --b_type f32_r --c_type f32_r --d_type f32_r --alpha 1 --beta 0
  • Purpose: Benchmarks the General Matrix Multiply (GEMM) operation for single-precision floating-point numbers.

Using rocFFT for FFT Performance

2. Installing rocFFT:

  • Command: sudo apt install -y rocfft
  • Purpose: Installs the rocFFT library for Fast Fourier Transform (FFT) operations.

Running rocFFT Benchmarks:

  • Command:
    /opt/rocm/bin/rocfft-bench -n 1000 -b 2048 -p single -t complex_forward
  • Purpose: Benchmarks the FFT operation for single-precision complex numbers.

Using RVS (ROCm Validation Suite)

3. Installing RVS:

  • Command: sudo apt install -y rocm-validation-suite
  • Purpose: Installs the ROCm Validation Suite (RVS) for stress testing and validation.

Example Configuration for RVS Stress Test:

Creating stress.json:

  • Configuration:
    {
      "stress": {
        "device": ["all"],
        "count": 1,
        "duration": 5000,
        "metrics": ["gpu_busy", "mem_busy"]
      }
    }
  • Purpose: Specifies the stress test parameters for all available GPUs, running for 5000 milliseconds and measuring GPU and memory busy metrics.

Running the Stress Test:

  • Command: sudo rvs -c stress.json
  • Purpose: Executes a stress test based on the configuration in stress.json.

Custom Performance Scripts

4. Writing a Custom Benchmark Script:

Example Script:

  • Script:
    #!/bin/bash
    
    # Set environment variables for ROCm
    export ROCM_PATH=/opt/rocm
    export PATH=$ROCM_PATH/bin:$PATH
    export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
    
    # Run rocminfo
    echo "Running rocminfo..."
    /opt/rocm/bin/rocminfo
    
    # Run rocprof on a HIP program
    echo "Profiling HIP program..."
    rocprof --hip-trace ./my_program
    
    # Run rocBLAS benchmark
    echo "Running rocBLAS GEMM benchmark..."
    rocblas-bench -f gemm -r f32_r -m 4096 -n 4096 -k 4096 --a_type f32_r --b_type f32_r --c_type f32_r --d_type f32_r --alpha 1 --beta 0
    
    # Run rocFFT benchmark
    echo "Running rocFFT benchmark..."
    /opt/rocm/bin/rocfft-bench -n 1000 -b 2048 -p single -t complex_forward
    
    # Configure RDMA (mlx5_ib matches the mlx5_0 device used below)
    echo "Configuring RDMA..."
    sudo modprobe mlx5_ib
    sudo modprobe rdma_ucm
    sudo modprobe ib_ipoib
    sudo rdma link add rxe0 type rxe netdev eth0
    sudo ip link set eth0 up
    
    # Run RDMA bandwidth test (-i selects the device port; waits for a client to connect)
    echo "Running RDMA bandwidth test..."
    ib_send_bw -d mlx5_0 -i 1
    
    # Run RDMA latency test (-i selects the device port; waits for a client to connect)
    echo "Running RDMA latency test..."
    ib_send_lat -d mlx5_0 -i 1
    
    # Run RVS stress test
    echo "Running RVS stress test..."
    sudo rvs -c stress.json

Running the Custom Script:

  • Commands:
    chmod +x custom_benchmark.sh
    ./custom_benchmark.sh
  • Purpose: Grants execute permission to the custom script and runs it, executing a series of benchmarks and configurations.
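
A long benchmark script is easier to review when each step reports its own result and a failed step does not stop the rest. The helper below is one possible sketch: `run_step` and `LOG_DIR` are hypothetical names, each step's output goes to its own log file, and the helper always returns success so later steps still run.

```bash
#!/usr/bin/env bash
# Directory for per-step logs (defaults to the current directory)
LOG_DIR="${LOG_DIR:-.}"

# run_step LABEL COMMAND [ARGS...]
# Runs one step, reports PASS/FAIL with elapsed seconds, never aborts the caller.
run_step() {
    local label="$1"; shift
    local start=$SECONDS status=0
    "$@" > "${LOG_DIR}/${label}.log" 2>&1 || status=$?
    if ((status == 0)); then
        echo "PASS ${label} ($((SECONDS - start))s)"
    else
        echo "FAIL ${label} (exit ${status}, see ${LOG_DIR}/${label}.log)"
    fi
    return 0
}

# Example (uncomment on a machine with ROCm installed):
# run_step rocminfo /opt/rocm/bin/rocminfo
# run_step hip_trace rocprof --hip-trace ./my_program
```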