/GPUInspector.jl

Inspecting GPUs with Julia

Primary LanguageJuliaMIT LicenseMIT

GPUInspector.jl

Inspecting high-performance (multi-)GPU systems with Julia

Documentation Build Status Quality

Installation

The package is registered in the General registry and can readily be added by using the Pkg REPL mode.

] add GPUInspector

Example

The package allows you to do various tests and benchmarks. Below we show a demonstrate a little stress test which lets a few A100 GPUs "burn" (i.e. lets them perform computations at close to peak performance) and monitors a few key metrics, such as power usage, temperature, and utilization, at the same time.

julia> using GPUInspector

julia> using CUDA # loading a GPU backend

julia> monitoring_start()                                                           
[ Info: Spawning monitoring on Julia thread 20.

julia> stresstest(; devices=CUDA.devices(), duration=10) # all devices, 10 seconds
[ Info: Will try to run for approximately 10 seconds on each GPU.
[ Info: Running StressTest{Float32} on Julia thread 4 and CuDevice(2).
[ Info: Running StressTest{Float32} on Julia thread 2 and CuDevice(0).
[ Info: Running StressTest{Float32} on Julia thread 6 and CuDevice(4).
[ Info: Running StressTest{Float32} on Julia thread 3 and CuDevice(1).
[ Info: Running StressTest{Float32} on Julia thread 9 and CuDevice(7).
[ Info: Running StressTest{Float32} on Julia thread 7 and CuDevice(5).
[ Info: Running StressTest{Float32} on Julia thread 5 and CuDevice(3).
[ Info: Running StressTest{Float32} on Julia thread 8 and CuDevice(6).
[ Info: Ran 11215 iterations on CuDevice(2).
[ Info: Ran 11241 iterations on CuDevice(6).
[ Info: Ran 11261 iterations on CuDevice(1).
[ Info: Ran 11236 iterations on CuDevice(5).
[ Info: Ran 11263 iterations on CuDevice(4).
[ Info: Ran 11261 iterations on CuDevice(3).
[ Info: Ran 11270 iterations on CuDevice(7).
[ Info: Ran 11241 iterations on CuDevice(0).
[ Info: Clearing GPU memory.
[ Info: Took 10.0 seconds to run the tests.

julia> results = monitoring_stop();
[ Info: Stopping monitoring and fetching results...

julia> plot_monitoring_results(results)

             ⠀⠀⠀⠀⠀⠀⠀⠀⠀GPU Utilization (Compute)
             ┌────────────────────────────────────────┐        
         105 │⠀⠀⠀⠀⡤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⢤⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 0: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⢰⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 1: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 2: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 3: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⡸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 4: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 5: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 6: NVIDIA A100-SXM4-40GB
   U [%]     │⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 7: NVIDIA A100-SXM4-40GB
             │⠀⠀⢰⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                             
             │⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                             
             │⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                             
             │⠀⠀⡸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                             
             │⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                             
             │⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                             
           0 │⣀⣀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                             
             └────────────────────────────────────────┘                             
             ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀20⠀                             
             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀Time [s]⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀                                                          
                                                                                                                 
            ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀GPU Temperature⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀                             
            ┌────────────────────────────────────────┐                             
         63 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 0: NVIDIA A100-SXM4-40GB
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠤⠤⠤⠒⢒⣲⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 1: NVIDIA A100-SXM4-40GB
            │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠤⠒⠊⣁⠤⣤⠤⠶⠮⠛⠛⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 2: NVIDIA A100-SXM4-40GB
            │⠀⠀⠀⠀⠀⠀⠀⡠⣒⣉⣉⣭⣓⠭⠛⠋⠉⠉⠉⠀⠀⠀⠀⠀⠀⢻⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 3: NVIDIA A100-SXM4-40GB 
            │⠀⠀⠀⠀⠀⢠⣮⠮⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⢤⣤⠸⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 4: NVIDIA A100-SXM4-40GB 
            │⠀⠀⠀⠀⢠⡿⠁⠀⠀⠀⠀⠀⢀⡠⠤⣤⠤⠶⠮⠛⠋⣉⠭⠥⠤⡇⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 5: NVIDIA A100-SXM4-40GB 
            │⠀⠀⠀⢠⣷⠁⠀⣠⣒⠶⠒⢉⣉⢭⣛⣒⣒⡪⠝⠛⠛⠊⠉⠉⠉⣿⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 6: NVIDIA A100-SXM4-40GB 
   T [C]    │⠀⠀⠀⣼⠃⢠⠊⣁⠤⠔⠚⠓⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢻⡞⠲⡤⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 7: NVIDIA A100-SXM4-40GB 
            │⠀⠀⢠⡏⢠⣿⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣧⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
            │⠀⠀⣼⢡⣷⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
            │⠤⠤⡟⣸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⡳⡢⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
            │⠉⠛⢇⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠪⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
            │⠀⠀⣸⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
            │⠒⠒⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
         28 │⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
            └────────────────────────────────────────┘                              
            ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀20⠀                              
            ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀Time [s]⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀                              
                                                                                                                 
             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀GPU Power Usage⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀                              
             ┌────────────────────────────────────────┐                             
         340 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 0: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⠀⣤⣠⣤⣤⣶⡶⠶⠶⠶⠾⠿⠿⣶⣿⣷⣶⣶⣶⣒⣒⣺⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 1: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⢰⡯⡭⡿⠟⠛⠛⠛⠛⠛⠛⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠹⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 2: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⢸⠋⡜⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 3: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⢸⡰⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 4: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⣾⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢻⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 5: NVIDIA A100-SXM4-40GB
             │⠀⠀⠀⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 6: NVIDIA A100-SXM4-40GB
   P [W]     │⠀⠀⠀⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ GPU 7: NVIDIA A100-SXM4-40GB
             │⠀⠀⢠⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
             │⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
             │⠀⠀⢸⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
             │⠀⠀⣼⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
             │⠀⠀⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
             │⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣶⣶⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
          56 │⣶⣶⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠙⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│                              
             └────────────────────────────────────────┘                              
             ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀20⠀                              
             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀Time [s]⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀                              

Note that we can get even fancier and plot the monitoring results on-the-fly, i.e. while the stress test is running. See e.g. GPUInspector.livemonitor_temperature.

Documentation

For more information, please find the documentation here.