When you have access to multiple nodes, each with multiple GPUs, it is tedious to find out which nodes have GPUs available and which do not.
This tool runs nvidia-smi over SSH on multiple nodes concurrently and reports the results in a single table.
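As a rough illustration of the approach described above, the following Rust sketch runs one SSH command per host on its own thread and collects either the command output or an error message for each host. It is not the tool's actual source: the names query_host and query_all, the SSH options, and the particular nvidia-smi query fields are assumptions chosen for the example.

```rust
use std::process::Command;
use std::thread;

/// Run nvidia-smi on `host` over SSH (hypothetical helper, not the tool's own code).
/// Returns the raw CSV output on success, or an error message on failure.
fn query_host(host: &str) -> Result<String, String> {
    let output = Command::new("ssh")
        // Assumed options: never prompt for a password, give up after 5 seconds.
        .args(["-o", "BatchMode=yes", "-o", "ConnectTimeout=5"])
        .arg(host)
        // Assumed query: one CSV line per GPU, without header or units.
        .arg("nvidia-smi --query-gpu=index,memory.total,memory.used,utilization.gpu,name --format=csv,noheader,nounits")
        .output()
        .map_err(|e| format!("failed to start ssh: {e}"))?;
    if output.status.success() {
        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    } else {
        let stderr = String::from_utf8_lossy(&output.stderr);
        Err(format!("SSH error: {}", stderr.trim()))
    }
}

/// Call `query` for every host on its own thread.
/// Results come back in the same order as `hosts`.
fn query_all<F>(hosts: &[String], query: F) -> Vec<(String, Result<String, String>)>
where
    F: Fn(&str) -> Result<String, String> + Sync,
{
    thread::scope(|s| {
        // Start all queries first so that they run concurrently.
        let handles: Vec<_> = hosts
            .iter()
            .map(|h| {
                let q = &query;
                s.spawn(move || q(h.as_str()))
            })
            .collect();
        // Then wait for each one; a failing host does not affect the others.
        hosts
            .iter()
            .cloned()
            .zip(handles.into_iter().map(|h| {
                h.join()
                    .unwrap_or_else(|_| Err("worker thread panicked".to_string()))
            }))
            .collect()
    })
}

fn main() {
    let hosts: Vec<String> = std::env::args().skip(1).collect();
    for (host, result) in query_all(&hosts, query_host) {
        match result {
            Ok(csv) => println!("{host}:\n{csv}"),
            Err(e) => println!("{host}: {e}"),
        }
    }
}
```

Passing the query function as a parameter keeps query_all independent of SSH, so the concurrency logic can be exercised with a stub. Because each host gets its own Result, an unreachable host shows up as an error entry instead of aborting the whole run.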
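Installation requires Rust and Cargo. From a checkout of the repository, run:
cargo install
Newer Cargo versions need the path given explicitly: cargo install --path .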
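Usage:
gpu-monitor HOSTNAME1 HOSTNAME2 ...
Example (the host foo is unreachable, so its row shows the SSH error instead of a GPU table):
$ gpu-monitor n1 n2 n3 foo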
+----------+--------------------------------------------------------------------------+
| Hostname | GPUs                                                                     |
+----------+--------------------------------------------------------------------------+
| n1       | +-------+----------------+---------------+----------+------------------+ |
|          | | Index | Total mem (GB) | Used mem (GB) | Util (%) | Name             | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 0     | 12.20          | 0.08          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 1     | 12.20          | 0.01          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 2     | 12.20          | 0.01          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 3     | 12.19          | 0.01          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
+----------+--------------------------------------------------------------------------+
| n2       | +-------+----------------+---------------+----------+------------------+ |
|          | | Index | Total mem (GB) | Used mem (GB) | Util (%) | Name             | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 0     | 12.20          | 11.77         | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 1     | 12.20          | 11.77         | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 2     | 12.19          | 11.77         | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
+----------+--------------------------------------------------------------------------+
| n3       | +-------+----------------+---------------+----------+------------------+ |
|          | | Index | Total mem (GB) | Used mem (GB) | Util (%) | Name             | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 0     | 12.19          | 0.00          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 1     | 12.19          | 0.00          | 0.00     | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
|          | | 2     | 12.19          | 0.38          | 97.00    | TITAN X (Pascal) | |
|          | +-------+----------------+---------------+----------+------------------+ |
+----------+--------------------------------------------------------------------------+
| foo      | SSH error: ssh: connect to host foo port 22: No route to host            |
+----------+--------------------------------------------------------------------------+
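
The columns in the table above (index, total memory, used memory, utilization, name) correspond to fields that nvidia-smi can print as CSV. The sketch below shows one way such a line could be parsed into a typed record. The GpuInfo struct, the parse_gpu_line function, the field order, and the MiB-to-GB conversion factor are all assumptions made for illustration; the tool itself may parse and convert differently.

```rust
/// One GPU row as shown in the table (hypothetical type, not the tool's own).
#[derive(Debug, PartialEq)]
struct GpuInfo {
    index: u32,
    total_mem_gb: f64,
    used_mem_gb: f64,
    util_percent: f64,
    name: String,
}

/// Assumed conversion: nvidia-smi reports memory in MiB when run with `nounits`;
/// this sketch simply divides by 1000. The real tool may use another factor.
const MIB_PER_GB: f64 = 1000.0;

/// Parse one line of the assumed format
/// `index, memory.total, memory.used, utilization.gpu, name`.
fn parse_gpu_line(line: &str) -> Result<GpuInfo, String> {
    // The name is the last field, so splitting into at most 5 parts keeps it whole.
    let fields: Vec<&str> = line.splitn(5, ',').map(str::trim).collect();
    if fields.len() != 5 {
        return Err(format!("expected 5 fields, got {}", fields.len()));
    }
    let index = fields[0]
        .parse::<u32>()
        .map_err(|e| format!("bad index: {e}"))?;
    let total_mib = fields[1]
        .parse::<f64>()
        .map_err(|e| format!("bad total memory: {e}"))?;
    let used_mib = fields[2]
        .parse::<f64>()
        .map_err(|e| format!("bad used memory: {e}"))?;
    let util_percent = fields[3]
        .parse::<f64>()
        .map_err(|e| format!("bad utilization: {e}"))?;
    Ok(GpuInfo {
        index,
        total_mem_gb: total_mib / MIB_PER_GB,
        used_mem_gb: used_mib / MIB_PER_GB,
        util_percent,
        name: fields[4].to_string(),
    })
}

fn main() {
    // Made-up sample line in the assumed format.
    match parse_gpu_line("2, 12189, 380, 97, TITAN X (Pascal)") {
        Ok(gpu) => println!("{gpu:?}"),
        Err(e) => eprintln!("parse error: {e}"),
    }
}
```

Returning a Result per line mirrors how the table reports problems per host: a malformed line yields an error value that can be displayed instead of crashing the program.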