andyljones/coolgpus

Couldn't connect to accessibility bus: Failed to connect to socket

Closed this issue · 13 comments

I run coolgpus on an 8-GPU machine (Ubuntu 18.04, NVIDIA-SMI 440.33.01, Driver Version: 440.33.01, CUDA Version: 10.2).

It reports some errors, and the main one is

(nvidia-settings:38841): dbind-WARNING **: 13:57:53.241: Couldn't connect to accessibility bus: Failed to connect to socket /tmp/dbus-OxChDaN4Rm: Connection refused

I tried the methods in https://unix.stackexchange.com/questions/230238/x-applications-warn-couldnt-connect-to-accessibility-bus-on-stderr, but they didn't work, so I'm looking for help here.

The full log is here: log.LOG

I've not seen this issue before, I'm afraid, so all I can do is give you some pointers on how to debug it.

These turn up earlier, during xserver creation:

_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
(EE)
Fatal server error:
(EE) Cannot establish any listening sockets - Make sure an X server isn't already running(EE)
(EE)
Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
(EE) Please also check the log file at "/var/log/Xorg.5.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.

It's the 8th server of 8, started with

Xorg :7 -once -config /tmp/cool-gpu-00000000:0F:00.0igxee6_5/xorg.conf

Googling around for the error message suggests that the most likely cause is that you've already got a display server running on :7. coolgpus checks for existing X servers by grepping the process list, and while I thought that was a good idea at the time, I realise now that there's a better way (checking for the lock file). I'm away from my server at the moment so I can't update coolgpus to fix that, but try finding X servers as the FAQ recommends and killing them yourself.

If that doesn't work, next thing to try is to isolate the problem. If you haven't restarted since this failed, then that xorg.conf file should still be hanging about. You can check by ls'ing /tmp/cool-gpu-00000000:0F:00.0igxee6_5.

If the file is still there, then try running the command manually and seeing if it fails on its own. If it does fail, then hooray! We've isolated the problem.

If the file is not still there - because you restarted or cleared your tmpdir - then run coolgpus again, see which server fails again, look for the tempdir the failed server uses, and ctrl+f for that in the logs to get a command like the above.

Thanks for your quick and detailed reply.

  1. I do kill the X server processes before I run coolgpus (ps aux | grep Xorg, then kill), but this doesn't work.

  2. I found the directory /tmp/cool-gpu-00000000:0F:00.0igxee6_5:


root@klfy-SYS-4028GR-TR2:/tmp/cool-gpu-00000000:0F:00.0igxee6_5# ls
edid.bin  xorg.conf
root@klfy-SYS-4028GR-TR2:/tmp/cool-gpu-00000000:0F:00.0igxee6_5# cat xorg.conf
Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"     0    0
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "VideoCard0"
    Monitor        "Monitor0"
    DefaultDepth   8
    Option         "UseDisplayDevice" "DFP-0"
    Option         "ConnectedMonitor" "DFP-0"
    Option         "CustomEDID" "DFP-0:/tmp/cool-gpu-00000000:0F:00.0igxee6_5/edid.bin"
    Option         "Coolbits" "20"
    SubSection "Display"
                Depth   8
                Modes   "160x200"
    EndSubSection
EndSection

Section "ServerFlags"
    Option         "AllowEmptyInput" "on"
    Option         "Xinerama"        "off"
    Option         "SELinux"         "off"
EndSection

Section "Device"
    Identifier  "Videocard0"
    Driver      "nvidia"
        Screen      0
        Option      "UseDisplayDevice" "DFP-0"
        Option      "ConnectedMonitor" "DFP-0"
        Option      "CustomEDID" "DFP-0:/tmp/cool-gpu-00000000:0F:00.0igxee6_5/edid.bin"
        Option      "Coolbits" "29"
        BusID       "PCI:15:0:0"
EndSection

Section "Monitor"
    Identifier      "Monitor0"
    Vendorname      "Dummy Display"
    Modelname       "160x200"
    #Modelname       "1024x768"
EndSection

If the file is still there, then try running the command manually and seeing if it fails on its own. If it does fail, then hooray! We've isolated the problem.

But I don't understand what that means or how to run the command manually 😕...

  1. Besides, I think the problem is related to the GPU with PCI bus ID 00000000:0D:00.0, not 0F. When I run coolgpus,
    it works well for the first 5 GPUs (IDs 04/06/07/08/0C), then the error occurs when it reaches the GPU with ID '0F'.
    Another piece of evidence is that there is no relevant process for the 0F GPU in the nvidia-smi result:

(screenshot of nvidia-smi output)

I do kill the X server processes before I run coolgpus (ps aux | grep Xorg, then kill), but this doesn't work.

Reading the FAQ, the way it says to look for lock files rather than process names makes me think that sometimes X servers might not be called Xorg in the ps list. Check whether you've got any anomalous lock files of the format /tmp/.X0-lock.
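
If it helps, here's a rough, untested sketch for spotting stale lock files; it assumes each lock file just contains the PID of the X server that owns it:

import glob
import os

# A running (or crashed) X server normally leaves a lock file like /tmp/.X5-lock
# holding the PID of the server process; if that process is gone, the lock is stale.
for lock in sorted(glob.glob('/tmp/.X*-lock')):
    pid = open(lock).read().strip()
    alive = os.path.exists('/proc/' + pid)
    print(lock, 'pid', pid, 'alive' if alive else 'stale')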

But I don't understand what that means or how to run the command manually 😕...

Pardon me, the command to run manually is

Xorg :7 -once -config /tmp/cool-gpu-00000000:0F:00.0igxee6_5/xorg.conf

following the pattern

Xorg $DISPLAY_ID -once -config /tmp/$CONFIG_DIR/xorg.conf

This is what coolgpus uses to create a display server. There should be one directory in /tmp/ for each GPU bus; try running the command for #5 if you think that's the one that's broken.
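
If it's useful, here's a rough, untested sketch that lists whatever cool-gpu config directories are lying around in /tmp and prints a manual command for each; the display number just needs to be one that isn't already in use:

import glob
import os

# Reconstruct a manual Xorg command for each config directory coolgpus left in /tmp.
for i, d in enumerate(sorted(glob.glob('/tmp/cool-gpu-*'))):
    conf = os.path.join(d, 'xorg.conf')
    if os.path.exists(conf):
        print('Xorg :' + str(i) + ' -once -config ' + conf)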

Cool!

  1. There are no .X*-lock files in /tmp.

Xorg :5 -once -config /tmp/cool-gpu-00000000:0D:00.0v07gty2v/xorg.conf did cause an error.

root@klfy-SYS-4028GR-TR2:/tmp# Xorg :5 -once -config /tmp/cool-gpu-00000000:0D:00.0v07gty2v/xorg.conf
_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
(EE)
Fatal server error:
(EE) Cannot establish any listening sockets - Make sure an X server isn't already running(EE)
(EE)
Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
(EE) Please also check the log file at "/var/log/Xorg.5.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.



root@klfy-SYS-4028GR-TR2:/tmp# Xorg :4 -once -config /tmp/cool-gpu-00000000:0C:00.0jgcto0z1/xorg.conf

X.Org X Server 1.19.6
Release Date: 2017-12-20
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.4.0-168-generic x86_64 Ubuntu
Current Operating System: Linux klfy-SYS-4028GR-TR2 4.15.0-109-generic #110-Ubuntu SMP Tue Jun 23 02:39:32 UTC 2020 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-109-generic root=UUID=c010ec2c-e172-49e3-bac0-f03ff4366b2d ro text
Build Date: 14 November 2019  06:20:00PM
xorg-server 2:1.19.6-1ubuntu4.4 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.34.0
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.4.log", Time: Fri Aug 14 19:35:26 2020
(++) Using config file: "/tmp/cool-gpu-00000000:0C:00.0jgcto0z1/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
  1. Is there any way to ignore the id-5/0D GPU so that coolgpus works well with the other GPUs?
    An even better feature would be being able to choose which GPUs' fans to control.
    It seems CUDA_VISIBLE_DEVICES="0,1,2,3,4,6,7" coolgpus does not work.

Again, thank you for your reply and help.

Xorg :5 -once -config /tmp/cool-gpu-00000000:0D:00.0v07gty2v/xorg.conf did cause an error.

Hooray - if you want to debug this further, then add a -logverbose 6 switch to that Xorg command.
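
For this case, that would be something like

Xorg :5 -once -logverbose 6 -config /tmp/cool-gpu-00000000:0D:00.0v07gty2v/xorg.conf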

Is there any way to ignore the id-5/0D GPU so that coolgpus works well with the other GPUs?

There's no supported way to restrict which GPUs are altered, but coolgpus is a pretty simple script and you can hack it in. Open the script with vim $(which coolgpus) (or whatever text editor you prefer) and edit the bottom of the file

def run():
    buses = gpu_buses() 

to look like

def run():
    buses = [b for b in gpu_buses() if b != '00000000:0D:00.0']

Or something like that. Again, I don't have access to my own server right now so I can't test this out for you. You might want to add a print(buses) line to check that it is actually a list of bus strings.

Anyway, save it and then it should run for all GPUs but the specified one.

I changed the code like this:


def run():
    #buses = gpu_buses()
    buses = [b for b in gpu_buses() if b not in ['00000000:0D:00.0', '00000000:0E:00.0', '00000000:0F:00.0']]
    print("The PCI BUS ID of GPUs:", buses)
    with xservers(buses) as displays:
        if args.debug:
            debug_loop(displays)
        else:
            manage_fans(displays)

if __name__ == '__main__':
    run()

It works well for the first 5 GPUs 😃 😃.
Only ignoring ['00000000:0D:00.0', '00000000:0E:00.0'] or ['00000000:0D:00.0'] is not enough; the same error occurs.
Pretty weird.

Hrm. If you skip the first three GPUs instead, do the last five then work?

Hrm. If you skip the first three GPUs instead, do the last five then work?

sudo coolgpus --temp 17 84 --speed 15 96
The PCI BUS ID of GPUs: ['00000000:08:00.0', '00000000:0C:00.0', '00000000:0D:00.0', '00000000:0E:00.0', '00000000:0F:00.0']

(==) Using system config directory "/usr/share/X11/xorg.conf.d"
GPU :0, 56C -> [62%-64%]. Setting speed to 62%
GPU :1, 60C -> [66%-69%]. Setting speed to 66%
GPU :2, 63C -> [70%-73%]. Setting speed to 70%
GPU :3, 68C -> [76%-79%]. Setting speed to 76%
GPU :4, 67C -> [75%-77%]. Setting speed to 75%
GPU :0, 60C -> [66%-69%]. Setting speed to 66%

Hmm, it also works well.

Well, it seems I can set the fan speeds separately this way, haha.
The issue is partly solved.

OK, it certainly looks like your machine is limited to 5 Xorg processes at a time.

If 18.04 has a newer version of xorg-server available, try upgrading and see if this all magically goes away?

I haven't gotten any previous bug reports like this. Either you're the first person to try coolgpus on an 8 GPU machine, or there's something special about your setup.

If you've got a friend with another 8 GPU machine, you could try asking them if they have the same problem.

Otherwise, I think you should gather up a bug report and post it to the xorg-server issue tracker or to the ubuntu stackexchange. Some stuff that'd be useful to include:

  • A short description of what's happening: you're trying to start eight headless Xorg instances on eight separate GPUs, and on the sixth launch you get this error message. It happens no matter which six GPUs you launch on.
  • A description of your operating system, nvidia drivers, Xorg version, motherboard and GPUs.
  • A copy of the -logverbose 6 log file you get off of the failed process.
  • A minimal example, if you can spare the time to put it together. I've made a quick attempt below.
#!/usr/bin/python
import os
import re
from subprocess import TimeoutExpired, Popen, PIPE, STDOUT
from tempfile import mkdtemp
from contextlib import contextmanager
import time

# EDID for an arbitrary display
EDID = b'\x00\xff\xff\xff\xff\xff\xff\x00\x10\xac\x15\xf0LTA5.\x13\x01\x03\x804 x\xee\x1e\xc5\xaeO4\xb1&\x0ePT\xa5K\x00\x81\x80\xa9@\xd1\x00qO\x01\x01\x01\x01\x01\x01\x01\x01(<\x80\xa0p\xb0#@0 6\x00\x06D!\x00\x00\x1a\x00\x00\x00\xff\x00C592M9B95ATL\n\x00\x00\x00\xfc\x00DELL U2410\n  \x00\x00\x00\xfd\x008L\x1eQ\x11\x00\n      \x00\x1d'

# X conf for a single screen server with fake CRT attached
XORG_CONF = """Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"     0    0
EndSection
Section "Screen"
    Identifier     "Screen0"
    Device         "VideoCard0"
    Monitor        "Monitor0"
    DefaultDepth   8
    Option         "UseDisplayDevice" "DFP-0"
    Option         "ConnectedMonitor" "DFP-0"
    Option         "CustomEDID" "DFP-0:{edid}"
    Option         "Coolbits" "20"
    SubSection "Display"
                Depth   8
                Modes   "160x200"
    EndSubSection
EndSection
Section "ServerFlags"
    Option         "AllowEmptyInput" "on"
    Option         "Xinerama"        "off"
    Option         "SELinux"         "off"
EndSection
Section "Device"
    Identifier  "Videocard0"
    Driver      "nvidia"
        Screen      0
        Option      "UseDisplayDevice" "DFP-0"
        Option      "ConnectedMonitor" "DFP-0"
        Option      "CustomEDID" "DFP-0:{edid}"
        Option      "Coolbits" "29"
        BusID       "PCI:{bus}"
EndSection
Section "Monitor"
    Identifier      "Monitor0"
    Vendorname      "Dummy Display"
    Modelname       "160x200"
    #Modelname       "1024x768"
EndSection
""" 

def log_output(command, ok=(0,)):
    output = []
    print('Command launched: ' + ' '.join(command))
    p = Popen(command, stdout=PIPE, stderr=STDOUT)
    try:
        p.wait(60)
        for line in p.stdout:
            output.append(line.decode().strip())
            print(line.decode().strip())
        print('Command finished')
    except TimeoutExpired:
        print('Command timed out: ' + ' '.join(command))
        raise
    finally:
        if p.returncode not in ok:
            print('\n'.join(output))
            raise ValueError('Command crashed with return code ' + str(p.returncode) + ': ' + ' '.join(command))
        return '\n'.join(output)

def decimalize(bus):
    """Converts a bus ID to an xconf-friendly format by dropping the domain and converting each hex component to 
    decimal"""
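    # e.g. '00000000:0F:00.0' -> '15:0:0', matching the BusID "PCI:15:0:0" in the xorg.conf above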
    return ':'.join([str(int('0x' + p, 16)) for p in re.split('[:.]', bus[9:])])

def gpu_buses():
    return log_output(['nvidia-smi', '--format=csv,noheader', '--query-gpu=pci.bus_id']).splitlines()

def config(bus):
    """Writes out the X server config for a GPU to a temporary directory"""
    tempdir = mkdtemp(prefix='cool-gpu-' + bus)
    edid = os.path.join(tempdir, 'edid.bin')
    conf = os.path.join(tempdir, 'xorg.conf')

    with open(edid, 'wb') as e, open(conf, 'w') as c:
        e.write(EDID)
        c.write(XORG_CONF.format(edid=edid, bus=decimalize(bus)))

    return conf

def xserver(display, bus):
    """Starts the X server for a GPU under a certain display ID""" 
    conf = config(bus)
    xorgargs = ['Xorg', display, '-once', '-config', conf]
    print('Starting xserver: '+' '.join(xorgargs))
    p = Popen(xorgargs)
    print('Started xserver')
    return p

@contextmanager
def xservers(buses):
    """A context manager for launching an X server for each GPU in a list. Yields the mapping from bus ID to 
    display ID, and cleans up the X servers on exit."""
    displays, servers = {}, {}
    try:
        for d, bus in enumerate(buses):
            displays[bus] = ':' + str(d)
            servers[bus] = xserver(displays[bus], bus)
        yield displays
    finally:
        for bus, server in servers.items():
            print('Terminating xserver for display ' + displays[bus])
            server.terminate()

if __name__ == '__main__':
    buses = gpu_buses() 
    with xservers(buses) as displays:
        time.sleep(5)

Sorry I couldn't resolve this for you! I'd appreciate it if you could link any bug reports you make back here, so the next person with this same issue can follow along.

If 18.04 has a newer version of xorg-server available, try upgrading and see if this all magically goes away?

It seems X.Org X Server 1.19.6 is the newest version available on Ubuntu 18.04. I will try a newer version if one comes out and report back here.

If you've got a friend with another 8 GPU machine, you could try asking them if they have the same problem.

I asked a friend for help. He has access to a 10-GPU machine, and coolgpus works well on it. That machine runs Ubuntu 18.04 with NVIDIA-SMI 450.51.06, Driver Version: 450.51.06, CUDA Version: 11.0.

I think the problem may be related to the version of the nvidia driver. I will try to update my driver and run coolgpus again, but I cannot do it now because the GPUs are in use.

Otherwise, I think you should gather up a bug report and post it to the xorg-server issue tracker or to the ubuntu stackexchange. Some stuff that'd be useful to include:

Thank you for your detailed instructions. I will try it.

For now, I first set the fan speed of the first 5 GPUs, then set the fan speed of the last 5 GPUs. When I interrupt (ctrl+c) coolgpus while it is running normally, the high fan speed it set stays in place. So I can make all 8 GPUs run at high fan speed and achieve my goal 😄.

Smart fix, nice job!

If anyone else comes across this and wants to use @mzhaoshuai's approach as a general solution, editing the script with vim $(which coolgpus) and substituting the run function with something like

def run():
    buses = gpu_buses() 
    for bus in buses:
        print('Setting speed of ' + str(bus))
        with xservers([bus]) as displays:
            set_speed(displays[bus], 99)

should work.

I'll close this issue for now, but reopen it if you have any further comments.

Well, I have some news.
I upgraded the nvidia driver on the machine to Driver Version: 450.36.06, and now coolgpus works perfectly with all 8 GPUs 😄.
(screenshot)

Great! And thanks for coming back and sharing the solution; it'll be a big help to the next person with the same issues.