# NUS SoC cluster GPU status checker
## Sample output

```
========== xgpf3 ==========
------- 0 Tesla T4 -------
Load: 0.0
Memory Free: 15079.0MB
Memory Used: 0.0MB / 15079.0MB 0.00%
Auth fail for xgpf7, the host may be reserved.
========== xgpf8 ==========
------- 0 Tesla T4 -------
Load: 1.0
Memory Free: 680.0MB
Memory Used: 14399.0MB / 15079.0MB 95.49%
========== xgpf9 ==========
------- 0 Tesla T4 -------
Load: 0.07
Memory Free: 444.0MB
Memory Used: 14635.0MB / 15079.0MB 97.06%
========== xgpf10 ==========
------- 0 Tesla T4 -------
Load: 1.0
Memory Free: 49.0MB
Memory Used: 15030.0MB / 15079.0MB 99.68%
------- 1 Tesla T4 -------
Load: 1.0
Memory Free: 6.0MB
Memory Used: 15073.0MB / 15079.0MB 99.96%
========== xgpf11 ==========
------- 0 Tesla T4 -------
Load: 0.05
Memory Free: 265.0MB
Memory Used: 14814.0MB / 15079.0MB 98.24%
------- 1 Tesla T4 -------
Load: 0.43
Memory Free: 351.0MB
Memory Used: 14728.0MB / 15079.0MB 97.67%
********** Most Free Mem: xgpf3 **********
------- 0 Tesla T4 -------
Load: 0.0
Memory Free: 15079.0MB
Memory Used: 0.0MB / 15079.0MB 0.00%
********** Least Utilization: xgpf3 **********
------- 0 Tesla T4 -------
Load: 0.0
Memory Free: 15079.0MB
Memory Used: 0.0MB / 15079.0MB 0.00%
```
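The per-GPU fields above (load, free/used/total memory) match what the GPUtil package exposes, so a slave script along these lines could produce them. This is a minimal sketch under that assumption; the actual `mi_gpu_slave_script.py` may be implemented differently:

```python
# Hypothetical sketch of a per-host slave script; assumes GPUtil.
import GPUtil

for gpu in GPUtil.getGPUs():
    used, total = gpu.memoryUsed, gpu.memoryTotal
    print(f'------- {gpu.id} {gpu.name} -------')
    print(f'Load: {round(gpu.load, 2)}')  # GPU utilization as a 0-1 fraction
    print(f'Memory Free: {gpu.memoryFree}MB')
    print(f'Memory Used: {used}MB / {total}MB {used / total * 100:.2f}%')
```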
## Usage
- Upload `mi_gpu_slave_script.py` to your home directory. Uploading to one server is enough, as the home directory is shared across the cluster.
- Make sure the dependencies are installed:

  ```
  pip3 install -r req.txt
  ```
- Fill in your private key path in `mi_nus_soc_gpu_status_reader.py` if you authenticate with a key pair. If not, uncomment and use this line instead (see the sketch after this list):

  ```python
  client.connect(f'{host}.comp.nus.edu.sg', username='mingda', password='PASSWORD_HERE')
  ```
- Now run `mi_nus_soc_gpu_status_reader.py`. :)
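For reference, here is a minimal sketch of both authentication variants, assuming paramiko is the SSH library (which the `client.connect` call above suggests). Replace `mingda` with your own SoC username; the key path is a placeholder:

```python
import paramiko

host = 'xgpf3'  # any of the cluster hosts
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Key-pair authentication:
client.connect(f'{host}.comp.nus.edu.sg', username='mingda',
               key_filename='/path/to/your/private_key')

# ...or password authentication (the line shown above):
# client.connect(f'{host}.comp.nus.edu.sg', username='mingda', password='PASSWORD_HERE')

# Run the slave script remotely and print its report.
_, stdout, _ = client.exec_command('python3 ~/mi_gpu_slave_script.py')
print(stdout.read().decode())
client.close()
```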
## TODO
- Multi-thread?
- CLI args?
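Both items could be covered with the standard library alone. A rough sketch, where `query_host` is a hypothetical wrapper around the existing per-host SSH query in `mi_nus_soc_gpu_status_reader.py`:

```python
import argparse
from concurrent.futures import ThreadPoolExecutor, as_completed

def query_host(host):
    # Hypothetical: run the existing SSH query against one host and
    # return its formatted report string.
    raise NotImplementedError

def main():
    parser = argparse.ArgumentParser(description='NUS SoC cluster GPU status checker')
    parser.add_argument('--hosts', nargs='+',
                        default=['xgpf3', 'xgpf7', 'xgpf8', 'xgpf9', 'xgpf10', 'xgpf11'])
    parser.add_argument('--workers', type=int, default=8)
    args = parser.parse_args()

    # Query every host concurrently instead of one after another.
    with ThreadPoolExecutor(max_workers=args.workers) as pool:
        futures = {pool.submit(query_host, h): h for h in args.hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                print(future.result())
            except Exception:
                print(f'Auth fail for {host}, the host may be reserved.')

if __name__ == '__main__':
    main()
```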
Please star this repo if you find it useful! Thanks!