andyljones/coolgpus

Limit GPU binding with CUDA_VISIBLE_DEVICES or so

Opened this issue · 3 comments

d355 commented

Hello, and first of all I'd like to thank you for the project; it's still the best way we've found to work around NVIDIA cooling issues.

To the point: since the latest NVIDIA driver updates, nvidia-smi lists every context created instead of only the usual primary contexts. So where we used to get output like this:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3541      G   /usr/libexec/Xorg                   8MiB |
|    1   N/A  N/A      3543      G   /usr/libexec/Xorg                   8MiB |
|    2   N/A  N/A      3544      G   /usr/libexec/Xorg                   8MiB |
|    3   N/A  N/A      3546      G   /usr/libexec/Xorg                   8MiB |
|    4   N/A  N/A      3548      G   /usr/libexec/Xorg                   8MiB |
|    5   N/A  N/A      3549      G   /usr/libexec/Xorg                   8MiB |
|    6   N/A  N/A      3550      G   /usr/libexec/Xorg                   8MiB |
|    7   N/A  N/A      3552      G   /usr/libexec/Xorg                   8MiB |
+-----------------------------------------------------------------------------+

...now we have:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    400553      G   /usr/libexec/Xorg                   8MiB |
|    0   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400554      G   /usr/libexec/Xorg                   8MiB |
|    1   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400555      G   /usr/libexec/Xorg                   8MiB |
|    2   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400556      G   /usr/libexec/Xorg                   8MiB |
|    3   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400557      G   /usr/libexec/Xorg                   8MiB |
|    4   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400558      G   /usr/libexec/Xorg                   8MiB |
|    5   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400559      G   /usr/libexec/Xorg                   8MiB |
|    6   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400560      G   /usr/libexec/Xorg                   8MiB |
+-----------------------------------------------------------------------------+

Is it possible to limit each Xorg process to a single GPU with something like the CUDA_VISIBLE_DEVICES environment variable (https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/)?

I guess some minor changes are needed somewhere around this line so that each Xorg instance is run like CUDA_VISIBLE_DEVICES=1 Xorg ... .

Lawd, that's a mess.

To triage this properly, are there any consequences other than nvidia-smi being very tall?

Also I'm unlikely to personally upgrade the drivers any time soon, and I don't like to fix bugs blind. I think the fix should be as simple as

p = Popen(xorgargs, env={'CUDA_VISIBLE_DEVICES': display[1:]})

Would you be able to make this change yourself and test it out for a few days? If this particular change fails, try adding a breakpoint() immediately before the line; it'll drop you into pdb and you can have a poke around.
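One thing to watch when trying the line above: passing `env=` to `Popen` replaces the child's whole environment rather than adding to it, so variables such as `PATH` are not inherited. Below is a minimal sketch that copies the parent environment and adds the one variable. The helper names `env_for_display` and `launch_with_visible_device` are hypothetical and not part of coolgpus, the display string is assumed to look like `:3`, and whether Xorg pays any attention to `CUDA_VISIBLE_DEVICES` is not something this sketch establishes.

```python
import os
import subprocess


def env_for_display(display, base=None):
    """Return a copy of the base environment with CUDA_VISIBLE_DEVICES set.

    Assumes `display` looks like ':3', so display[1:] is the GPU index.
    """
    env = dict(os.environ if base is None else base)
    env['CUDA_VISIBLE_DEVICES'] = display[1:]
    return env


def launch_with_visible_device(args, display, **kwargs):
    """Start `args` as a child process that inherits the parent environment
    plus the CUDA_VISIBLE_DEVICES value derived from `display`."""
    return subprocess.Popen(args, env=env_for_display(display), **kwargs)
```

In the suggested change this would stand in for the bare `Popen` call, e.g. `p = launch_with_visible_device(xorgargs, display)`.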

d355 commented

Thank you! Sure, I'll check it out and report the result here.

Hi, I tried the modification suggested above, but it didn't work. I found another workaround.
In the coolgpus source, just replace

buses = gpu_buses()

with the bus ID of the specific GPU you would like coolgpus to act on, e.g.

buses = ['00000000:65:00.0']

The bus ID can be read from the output of nvidia-smi. Hope this helps.
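Rather than hard-coding the list, the same workaround could be written as a filter over the detected buses. This is only a sketch: `parse_bus_ids`, `query_bus_ids` and `select_buses` are hypothetical helpers, not coolgpus functions, and it assumes `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` prints one bus ID per line.

```python
import subprocess


def parse_bus_ids(text):
    """Split nvidia-smi's one-ID-per-line output into a list of bus IDs."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def query_bus_ids():
    """Ask nvidia-smi for the PCI bus ID of every GPU (needs the NVIDIA driver)."""
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=pci.bus_id', '--format=csv,noheader'])
    return parse_bus_ids(out.decode())


def select_buses(all_buses, wanted):
    """Keep only the requested bus IDs, failing loudly on one that is not present."""
    missing = [bus for bus in wanted if bus not in all_buses]
    if missing:
        raise ValueError('unknown bus IDs: ' + ', '.join(missing))
    return [bus for bus in all_buses if bus in wanted]
```

With these, the replacement line would read something like `buses = select_buses(query_bus_ids(), ['00000000:65:00.0'])`, which raises an error on a mistyped ID instead of silently managing no GPUs.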