Monitor nvidia-smi output over SSH

GPU monitor over SSH

When you have access to multiple nodes, each with multiple GPUs, it can be tedious to find out which nodes have available GPUs and which do not.

This tool runs nvidia-smi over SSH on multiple nodes concurrently and reports the results in a single table.
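To illustrate the idea, here is a minimal sketch of how such a concurrent query could be written with only the standard library. It is not taken from this project's source: the names `GpuInfo`, `parse_gpu_line`, `query_host` and `query_all` are hypothetical, and the sketch assumes that `ssh` is on the local PATH, that each host accepts non-interactive logins, and that `nvidia-smi` is queried in its CSV mode (`--query-gpu=... --format=csv,noheader,nounits`), which reports memory in MiB.

```rust
use std::process::Command;
use std::thread;

/// One GPU row as reported by nvidia-smi (memory in MiB, utilization in percent).
/// Hypothetical type for this sketch; the real tool may model this differently.
#[derive(Debug, PartialEq)]
pub struct GpuInfo {
    pub index: u32,
    pub total_mem_mib: f64,
    pub used_mem_mib: f64,
    pub util_percent: f64,
    pub name: String,
}

/// Parse one CSV line such as "0, 12189, 81, 0, TITAN X (Pascal)".
/// Returns None if a field is missing or not numeric.
pub fn parse_gpu_line(line: &str) -> Option<GpuInfo> {
    // splitn(5, ..) keeps any commas inside the GPU name intact.
    let mut fields = line.splitn(5, ',').map(str::trim);
    Some(GpuInfo {
        index: fields.next()?.parse().ok()?,
        total_mem_mib: fields.next()?.parse().ok()?,
        used_mem_mib: fields.next()?.parse().ok()?,
        util_percent: fields.next()?.parse().ok()?,
        name: fields.next()?.to_string(),
    })
}

/// Run nvidia-smi on one host through the local `ssh` binary (assumed to be on PATH).
pub fn query_host(host: &str) -> Result<Vec<GpuInfo>, String> {
    let output = Command::new("ssh")
        .arg(host)
        .arg("nvidia-smi --query-gpu=index,memory.total,memory.used,utilization.gpu,name --format=csv,noheader,nounits")
        .output()
        .map_err(|e| format!("failed to run ssh: {}", e))?;

    if !output.status.success() {
        // Surface whatever ssh (or the remote command) printed on stderr.
        let stderr = String::from_utf8_lossy(&output.stderr);
        return Err(format!("SSH error: {}", stderr.trim()));
    }

    let stdout = String::from_utf8_lossy(&output.stdout);
    let gpus: Result<Vec<GpuInfo>, String> = stdout
        .lines()
        .map(|l| parse_gpu_line(l).ok_or_else(|| format!("unparsable line: {}", l)))
        .collect();
    gpus
}

/// Query all hosts concurrently, one thread per host, preserving input order.
pub fn query_all(hosts: &[String]) -> Vec<(String, Result<Vec<GpuInfo>, String>)> {
    // Spawn every worker first so the SSH sessions overlap, then join them in order.
    let handles: Vec<_> = hosts
        .iter()
        .cloned()
        .map(|host| {
            thread::spawn(move || {
                let result = query_host(&host);
                (host, result)
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

Calling `query_all` with the hostnames from the command line yields one entry per host, either a list of GPUs or an error string, so an unreachable host does not prevent the others from being reported. One thread per host is a simple choice that works well for a handful of nodes; a thread pool or async runtime would be a reasonable alternative for larger clusters.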

Installation

Installation requires Rust. From a checkout of this repository, run:

cargo install --path .

Usage

gpu-monitor HOSTNAME1 HOSTNAME2 ...

Example

$ gpu-monitor n1 n2 n3 foo
+----------+--------------------------------------------------------------------------+
| Hostname | GPUs                                                                     |
+----------+--------------------------------------------------------------------------+
| n1       | +-------+----------------+---------------+----------+------------------+ |
|          | | Index | Total mem (GB) | Used mem (GB) | Util (%) | Name             | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 0     | 12.20          | 0.08          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 1     | 12.20          | 0.01          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 2     | 12.20          | 0.01          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 3     | 12.19          | 0.01          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
+----------+--------------------------------------------------------------------------+
| n2       | +-------+----------------+---------------+----------+------------------+ |
|          | | Index | Total mem (GB) | Used mem (GB) | Util (%) | Name             | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 0     | 12.20          | 11.77         | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 1     | 12.20          | 11.77         | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 2     | 12.19          | 11.77         | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
+----------+--------------------------------------------------------------------------+
| n3       | +-------+----------------+---------------+----------+------------------+ |
|          | | Index | Total mem (GB) | Used mem (GB) | Util (%) | Name             | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 0     | 12.19          | 0.00          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 1     | 12.19          | 0.00          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 2     | 12.19          | 0.38          | 97.00    | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
+----------+--------------------------------------------------------------------------+
| foo      | SSH error: ssh: connect to host foo port 22: No route to host            |
+----------+--------------------------------------------------------------------------+