Limit GPU binding with CUDA_VISIBLE_DEVICES or so
Opened this issue · 3 comments
Hello, and first of all I'd like to thank you for this project; it's still the best way we have found to work around NVIDIA cooling issues.
To the point: since the latest NVIDIA driver updates, the nvidia-smi tool lists every context created instead of only the usual primary contexts. So where we previously got output like this:
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3541 G /usr/libexec/Xorg 8MiB |
| 1 N/A N/A 3543 G /usr/libexec/Xorg 8MiB |
| 2 N/A N/A 3544 G /usr/libexec/Xorg 8MiB |
| 3 N/A N/A 3546 G /usr/libexec/Xorg 8MiB |
| 4 N/A N/A 3548 G /usr/libexec/Xorg 8MiB |
| 5 N/A N/A 3549 G /usr/libexec/Xorg 8MiB |
| 6 N/A N/A 3550 G /usr/libexec/Xorg 8MiB |
| 7 N/A N/A 3552 G /usr/libexec/Xorg 8MiB |
+-----------------------------------------------------------------------------+
...now we have:
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 400553 G /usr/libexec/Xorg 8MiB |
| 0 N/A N/A 400554 G /usr/libexec/Xorg 0MiB |
| 0 N/A N/A 400555 G /usr/libexec/Xorg 0MiB |
| 0 N/A N/A 400556 G /usr/libexec/Xorg 0MiB |
| 0 N/A N/A 400557 G /usr/libexec/Xorg 0MiB |
| 0 N/A N/A 400558 G /usr/libexec/Xorg 0MiB |
| 0 N/A N/A 400559 G /usr/libexec/Xorg 0MiB |
| 0 N/A N/A 400560 G /usr/libexec/Xorg 0MiB |
| 1 N/A N/A 400553 G /usr/libexec/Xorg 0MiB |
| 1 N/A N/A 400554 G /usr/libexec/Xorg 8MiB |
| 1 N/A N/A 400555 G /usr/libexec/Xorg 0MiB |
| 1 N/A N/A 400556 G /usr/libexec/Xorg 0MiB |
| 1 N/A N/A 400557 G /usr/libexec/Xorg 0MiB |
| 1 N/A N/A 400558 G /usr/libexec/Xorg 0MiB |
| 1 N/A N/A 400559 G /usr/libexec/Xorg 0MiB |
| 1 N/A N/A 400560 G /usr/libexec/Xorg 0MiB |
| 2 N/A N/A 400553 G /usr/libexec/Xorg 0MiB |
| 2 N/A N/A 400554 G /usr/libexec/Xorg 0MiB |
| 2 N/A N/A 400555 G /usr/libexec/Xorg 8MiB |
| 2 N/A N/A 400556 G /usr/libexec/Xorg 0MiB |
| 2 N/A N/A 400557 G /usr/libexec/Xorg 0MiB |
| 2 N/A N/A 400558 G /usr/libexec/Xorg 0MiB |
| 2 N/A N/A 400559 G /usr/libexec/Xorg 0MiB |
| 2 N/A N/A 400560 G /usr/libexec/Xorg 0MiB |
| 3 N/A N/A 400553 G /usr/libexec/Xorg 0MiB |
| 3 N/A N/A 400554 G /usr/libexec/Xorg 0MiB |
| 3 N/A N/A 400555 G /usr/libexec/Xorg 0MiB |
| 3 N/A N/A 400556 G /usr/libexec/Xorg 8MiB |
| 3 N/A N/A 400557 G /usr/libexec/Xorg 0MiB |
| 3 N/A N/A 400558 G /usr/libexec/Xorg 0MiB |
| 3 N/A N/A 400559 G /usr/libexec/Xorg 0MiB |
| 3 N/A N/A 400560 G /usr/libexec/Xorg 0MiB |
| 4 N/A N/A 400553 G /usr/libexec/Xorg 0MiB |
| 4 N/A N/A 400554 G /usr/libexec/Xorg 0MiB |
| 4 N/A N/A 400555 G /usr/libexec/Xorg 0MiB |
| 4 N/A N/A 400556 G /usr/libexec/Xorg 0MiB |
| 4 N/A N/A 400557 G /usr/libexec/Xorg 8MiB |
| 4 N/A N/A 400558 G /usr/libexec/Xorg 0MiB |
| 4 N/A N/A 400559 G /usr/libexec/Xorg 0MiB |
| 4 N/A N/A 400560 G /usr/libexec/Xorg 0MiB |
| 5 N/A N/A 400553 G /usr/libexec/Xorg 0MiB |
| 5 N/A N/A 400554 G /usr/libexec/Xorg 0MiB |
| 5 N/A N/A 400555 G /usr/libexec/Xorg 0MiB |
| 5 N/A N/A 400556 G /usr/libexec/Xorg 0MiB |
| 5 N/A N/A 400557 G /usr/libexec/Xorg 0MiB |
| 5 N/A N/A 400558 G /usr/libexec/Xorg 8MiB |
| 5 N/A N/A 400559 G /usr/libexec/Xorg 0MiB |
| 5 N/A N/A 400560 G /usr/libexec/Xorg 0MiB |
| 6 N/A N/A 400553 G /usr/libexec/Xorg 0MiB |
| 6 N/A N/A 400554 G /usr/libexec/Xorg 0MiB |
| 6 N/A N/A 400555 G /usr/libexec/Xorg 0MiB |
| 6 N/A N/A 400556 G /usr/libexec/Xorg 0MiB |
| 6 N/A N/A 400557 G /usr/libexec/Xorg 0MiB |
| 6 N/A N/A 400558 G /usr/libexec/Xorg 0MiB |
| 6 N/A N/A 400559 G /usr/libexec/Xorg 8MiB |
| 6 N/A N/A 400560 G /usr/libexec/Xorg 0MiB |
| 7 N/A N/A 400553 G /usr/libexec/Xorg 0MiB |
| 7 N/A N/A 400554 G /usr/libexec/Xorg 0MiB |
| 7 N/A N/A 400555 G /usr/libexec/Xorg 0MiB |
| 7 N/A N/A 400556 G /usr/libexec/Xorg 0MiB |
| 7 N/A N/A 400557 G /usr/libexec/Xorg 0MiB |
| 7 N/A N/A 400558 G /usr/libexec/Xorg 0MiB |
| 7 N/A N/A 400559 G /usr/libexec/Xorg 0MiB |
| 7 N/A N/A 400560 G /usr/libexec/Xorg 8MiB |
+-----------------------------------------------------------------------------+
Is it possible to limit the Xorg processes with something like the CUDA_VISIBLE_DEVICES
environment variable ( https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/ )?
I guess some minor changes are needed somewhere around this line so that each Xorg instance is run like CUDA_VISIBLE_DEVICES=1 Xorg ...
Lawd, that's a mess.
To triage this properly, are there any consequences other than nvidia-smi being very tall?
Also I'm unlikely to personally upgrade the drivers any time soon, and I don't like to fix bugs blind. I think the fix should be as simple as
p = Popen(xorgargs, env={'CUDA_VISIBLE_DEVICES': display[1:]})
Would you be able to make this change yourself and test it out for a few days? If this particular change fails, try adding a breakpoint() immediately before the line; it'll drop you into pdb and you can have a poke around.
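One thing to watch with the suggested line: passing `env=` to `Popen` replaces the child's whole environment rather than adding to it, so the child would start without `PATH`, `HOME` and the rest. Below is a minimal sketch of a variant that copies the current environment first. The helper name `env_for_display` is hypothetical (it is not part of coolgpus), and the sketch assumes `display` is a string such as `':3'`, as the `display[1:]` slice suggests. Whether Xorg honours `CUDA_VISIBLE_DEVICES` at all is untested here; the variable is documented for CUDA applications.

```python
import os
import sys
from subprocess import Popen, PIPE


def env_for_display(display, base_env=None):
    """Copy the environment and pin CUDA_VISIBLE_DEVICES to the display index.

    Assumes `display` looks like ':3'; the leading ':' is dropped.
    `base_env` defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env['CUDA_VISIBLE_DEVICES'] = display[1:]
    return env


if __name__ == '__main__':
    # Stand-in child process (not Xorg) that echoes the variable it received.
    child = [sys.executable, '-c',
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    p = Popen(child, env=env_for_display(':3'), stdout=PIPE, text=True)
    out, _ = p.communicate()
    print(out.strip())
```

Under the same assumptions, the call in question would become `Popen(xorgargs, env=env_for_display(display))`.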
Thank you! Sure, I'll check it out and report result here.
Hi, I tried the modification above but it didn't work. I found another workaround.
In the coolgpus source, just replace
buses = gpu_buses()
with the specific GPU bus ID you would like coolgpus to act on, e.g.
buses = ['00000000:65:00.0']
The bus ID can be seen in the output of nvidia-smi. Hope this helps.
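If hard-coding the list feels brittle, the same idea can be expressed as a filter over whatever `gpu_buses()` returns. This is only a sketch: `select_buses` is a hypothetical helper, not something coolgpus provides, and the second bus ID in the usage lines is made up for illustration.

```python
def select_buses(all_buses, wanted):
    """Keep only the bus IDs listed in `wanted`, preserving detection order.

    An empty `wanted` means "no restriction" and returns every bus.
    Comparison is case-insensitive, since hex digits may differ in case.
    """
    wanted_set = {w.strip().lower() for w in wanted if w.strip()}
    if not wanted_set:
        return list(all_buses)
    selected = [b for b in all_buses if b.lower() in wanted_set]
    missing = wanted_set - {b.lower() for b in selected}
    if missing:
        # Fail loudly rather than silently managing no GPUs.
        raise ValueError(f"unknown bus id(s): {sorted(missing)}")
    return selected


# Hypothetical detected buses; in coolgpus this list would come from gpu_buses().
detected = ['00000000:17:00.0', '00000000:65:00.0']
buses = select_buses(detected, ['00000000:65:00.0'])
print(buses)
```

Raising on an unknown ID is a design choice for the sketch: a typo in the bus ID then stops the script instead of leaving every fan unmanaged.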