utkuozdemir/nvidia_gpu_exporter

Metric per process/pod

andlogreg opened this issue · 5 comments

Is it possible to see memory utilization per process instead of just the total memory usage on a specific GPU?

If not, this could be quite useful. Given that this information is already available through nvidia-smi, I imagine it should be doable.
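For reference, something along these lines already shows per-process memory usage (used_memory is reported in MiB):

# per-process GPU memory as reported by the driver
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv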

The feature does not exist yet, but if it is possible to get that data using nvidia-smi, it should be fairly straightforward to implement. I might have a look into it some time, but don't know when (and no promises).
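Until something like that lands, one possible stop-gap (my own sketch, not something this exporter provides; the metric name and output path below are assumptions) is to turn the nvidia-smi output into Prometheus text format and let node_exporter's textfile collector pick it up:

# Sketch: export per-process GPU memory via node_exporter's textfile collector.
# Metric name and output path are assumptions, not part of nvidia_gpu_exporter.
OUT=/var/lib/node_exporter/textfile_collector/nvidia_per_process.prom
{
  echo "# HELP nvidia_process_used_memory_mib GPU memory used per process (MiB)"
  echo "# TYPE nvidia_process_used_memory_mib gauge"
  # with nounits, used_memory is a plain MiB number (may be [N/A] on some setups)
  nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory,process_name \
             --format=csv,noheader,nounits |
  while IFS=', ' read -r uuid pid mem name; do
    echo "nvidia_process_used_memory_mib{gpu_uuid=\"$uuid\",pid=\"$pid\",process=\"$name\"} $mem"
  done
} > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"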

👍 for this functionality.

If it is of any use: a few months ago I needed to know which processes (and which users) were running on each GPU of a server, plus memory used and elapsed time. Since nvidia-smi alone did not provide all of those details, I ended up combining it with simple ps calls (which depend on the host OS, so this may not be a great option). Basically, it went roughly like this:

# get gpus
nvidia-smi --query-gpu=index,uuid,gpu_name --format=csv

# get running processes
nvidia-smi --query-compute-apps=pid,process_name,gpu_name,gpu_uuid,used_memory --format=csv

# for each process (add -o lstart=, -o etimes= or -o cmd= for other details;
# the trailing "=" suppresses the header)
ps -ww -o user= -p "$pid"

Initially I tried to get everything about each process in one go, but I had trouble parsing the output, so I hacked this together instead, which served the purpose at the time. The results were then turned into a decent table, taking into account that a process might be running on multiple GPUs and that several processes can share a single GPU.
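For completeness, a rough sketch of how those pieces can be stitched together (the column order and MiB unit come from the nounits CSV output; not tested against every driver version):

# list per-GPU processes, then ask ps for the owning user and start time
nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory,process_name \
           --format=csv,noheader,nounits |
while IFS=', ' read -r uuid pid mem name; do
  user=$(ps -ww -o user= -p "$pid")
  started=$(ps -ww -o lstart= -p "$pid")
  printf '%s\t%s\t%s\t%s MiB\t%s\n' "$uuid" "$pid" "$user" "$mem" "$started"
done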