andyljones/coolgpus

Partially-headed servers

andyljones opened this issue · 4 comments

Servers where some GPUs have displays attached and some don't are trickier than the fully-headless case. The first, obvious issue is that some display IDs are already occupied. That's fixed by picking displays :10, :11, :12 etc., which aren't commonly used. Then the script will actually run!

Unfortunately, it also blanks your physical display, presumably because it's nicking the GPU off your primary X server. This is fixed by only looking at GPU buses that nvidia-smi reports as 'not displayed'. That leaves these problems:

  • How do you figure out which physical display IDs correspond to which PCI buses, so the script can manage those too? This seems easy but Google's failed me so far. xdpyinfo seemed promising, but the extension that presumably holds the bus info - NV-CONTROL - isn't supported.
  • When launching X servers for non-displayed GPUs, the monitor will blank and you need to hit Ctrl+Alt+F2 to get back to the desktop. I think this is something to do with X 'resetting' VTs, because the same problem was originally showing up every time nvidia-settings was called. That was suppressed by -novtswitch and passing a new VT ID, but the blank-on-launch persists.
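
For anyone picking this up, the 'not displayed' filter could be sketched roughly like this. It is not the code on the partial-head branch. It assumes nvidia-smi's `pci.bus_id` and `display_active` query fields and that `display_active` prints `Enabled` or `Disabled`; the helper names and the sample output are made up for illustration.

```python
import subprocess

# Assumed query: one CSV row per GPU, e.g. "00000000:01:00.0, Enabled".
QUERY = ['nvidia-smi', '--query-gpu=pci.bus_id,display_active', '--format=csv,noheader']

def undisplayed_buses(csv_text):
    """Return the PCI bus IDs of GPUs whose display_active field is not 'Enabled'."""
    buses = []
    for line in csv_text.strip().splitlines():
        bus, active = [field.strip() for field in line.split(',')]
        if active != 'Enabled':
            buses.append(bus)
    return buses

def query_undisplayed_buses():
    """Run nvidia-smi and filter its output (needs the NVIDIA driver installed)."""
    out = subprocess.check_output(QUERY, text=True)
    return undisplayed_buses(out)

# Made-up sample output: one GPU driving a monitor, two headless.
SAMPLE = """
00000000:01:00.0, Enabled
00000000:02:00.0, Disabled
00000000:03:00.0, Disabled
"""
print(undisplayed_buses(SAMPLE))
```

Keeping the parsing separate from the nvidia-smi call means the filter can be checked on a machine without any GPUs.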

So yeah, if you've got a partially headless box and want to fix this up

git clone https://github.com/andyljones/coolgpus.git
cd coolgpus
git checkout partial-head
sudo $(which coolgpus)

and Ctrl+Alt+F2 back to your desktop.

One possible solution is to start a single X server occupying all the GPUs. If you specify Option "AllowEmptyInitialConfiguration" "true" in the Device section, X will happily start on a device not attached to any display.
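
As a rough, untested sketch of what that might look like, here is one way to generate a Device section per GPU with the option set. The template, identifiers and helper names are assumptions for illustration, not coolgpus's actual config. A real single-server config would also need Screen and ServerLayout sections, which are left out here. The conversion step assumes X wants the bus ID as decimal bus:device:function, while nvidia-smi prints it in hex with a leading domain.

```python
import re

# Hypothetical template; only the Device section is covered.
DEVICE_TEMPLATE = '''Section "Device"
    Identifier "Device{index}"
    Driver "nvidia"
    BusID "PCI:{bus}"
    Option "AllowEmptyInitialConfiguration" "true"
EndSection
'''

def decimalize(bus_id):
    """Convert an nvidia-smi bus ID such as 00000000:0A:00.0 to decimal bus:device:function."""
    domain, bus, device, function = re.split('[:.]', bus_id)
    # The domain is dropped; each remaining hex component becomes decimal.
    return ':'.join(str(int(part, 16)) for part in (bus, device, function))

def device_sections(bus_ids):
    """Render one Device section per GPU, separated by blank lines."""
    return '\n'.join(
        DEVICE_TEMPLATE.format(index=i, bus=decimalize(b))
        for i, b in enumerate(bus_ids))

# Made-up bus IDs for two GPUs.
print(device_sections(['00000000:01:00.0', '00000000:0A:00.0']))
```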

Unfortunately my GPUs are now a long way away from my monitor, so I can't test that out. It sounds promising though!

One year update: my own machine is still entirely headless, and my partially-headless experiments were painful enough the first time round that I'm not willing to revisit them.

This will have to wait for a contributor with a partially-headless setup who's willing to put the time into figuring this out. It might even be better as a separate repo, as I suspect partially-headless would use almost entirely different code than entirely-headless - and I wouldn't be able to support it with my entirely-headless setup anyway.

Also, if I understand the suggestion (@akamaus) of adding Option "AllowEmptyInitialConfiguration" "true" correctly, I can confirm that it didn't work in my case.

More specifically, I added Option "AllowEmptyInitialConfiguration" "true" after this line, inside the section with Driver "nvidia":

BusID "PCI:{bus}"