rwth-i6/returnn

multiprocessing: OSError: AF_UNIX path too long

michelwi opened this issue · 11 comments

Yesterday I started a training with DistributeFilesDataset and file caching. Today it crashed, and it consistently crashes again after restarting, with what I think is OSError: AF_UNIX path too long in _TouchFilesThread.run.

The original files reside on a Ceph cluster and are cached to an LVM volume with an ext4 filesystem. The basenames should be of reasonable length, but the absolute paths can be quite long.

I attach one version of each flavor of trace, in order of appearance:

OSError
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
_TouchFilesThread
Unhandled exception <class 'EOFError'> in thread <_MainThread(MainThread, started 139642623541696)>, proc 96008.

Thread <_TouchFilesThread(Thread-1, started daemon 139638043833920)>:
Process SyncManager-1:
  File "/usr/lib/python3.10/threading.py", line 324, in Condition.wait
    line: gotit = waiter.acquire(True, timeout)
    locals:
      gotit = <local> False
      waiter = <local> <locked _thread.lock object at 0x7f6b43d94640>
      waiter.acquire = <local> <built-in method acquire of _thread.lock object at 0x7f6b43d94640>
      timeout = <local> 1.0
  File "/usr/lib/python3.10/threading.py", line 607, in Event.wait
    line: signaled = self._cond.wait(timeout)
    locals:
      signaled = <local> False
      self = <local> <threading.Event object at 0x7f6b43d661d0>
      self._cond = <local> <Condition(<unlocked _thread.lock object at 0x7f6b43d85e80>, 1)>
      self._cond.wait = <local> <bound method Condition.wait of <Condition(<unlocked _thread.lock object at 0x7f6b43d85e80>, 1)>>
      timeout = <local> 1.0
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/util/file_cache.py", line 418, in _TouchFilesThread.run
    line: if self.stop.wait(self.interval):
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self.stop = <local> <threading.Event object at 0x7f6b43d661d0>
      self.stop.wait = <local> <bound method Event.wait of <threading.Event object at 0x7f6b43d661d0>>
      self.interval = <local> 1.0
  File "/usr/lib/python3.10/threading.py", line 1016, in Thread._bootstrap_inner
    line: self.run()
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self.run = <local> <bound method _TouchFilesThread.run of <_TouchFilesThread(Thread-1, started daemon 140098675930688)>>
  File "/usr/lib/python3.10/threading.py", line 973, in Thread._bootstrap
    line: self._bootstrap_inner()
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self._bootstrap_inner = <local> <bound method Thread._bootstrap_inner of <_TouchFilesThread(Thread-1, started daemon 140098675930688)>>

Thread current, main, <_MainThread(MainThread, started 140103104491968)>:
(Excluded thread.)
Main
EXCEPTION
Traceback (most recent call last):
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py", line 11, in <module>
    line: main()
    locals:
      main = <local> <function main at 0x7fd88b3a27a0>
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 735, in main
    line: init(command_line_options=argv[1:])
    locals:
      init = <global> <function init at 0x7fd88b3a24d0>
      command_line_options = <not found>
      argv = <local> ['/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py', '/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/returnn/training/ReturnnTrainingJob.V7zBp10cMkXR/output/returnn.config'], _[0]: {len = 142}
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 487, in init
    line: init_engine()
    locals:
      init_engine = <global> <function init_engine at 0x7fd88b3a2320>
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 294, in init_engine
    line: engine = Engine(config=config)
    locals:
      engine = <global> None
      Engine = <local> <class 'returnn.torch.engine.Engine'>
      config = <global> <returnn.config.Config object at 0x7fd88b333fd0>
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/torch/engine.py", line 60, in Engine.__init__
    line: self._mp_manager = torch.multiprocessing.Manager()
    locals:
      self = <local> <returnn.torch.engine.Engine object at 0x7fd79384aa40>
      self._mp_manager = <local> !AttributeError: 'Engine' object has no attribute '_mp_manager'
      torch = <global> <module 'torch' from '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'>
      torch.multiprocessing = <global> <module 'torch.multiprocessing' from '/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/__init__.py'>
      torch.multiprocessing.Manager = <global> <bound method BaseContext.Manager of <multiprocessing.context.DefaultContext object at 0x7fd7a1d772e0>>
  File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in BaseContext.Manager
    line: m.start()
    locals:
      m = <local> <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>
      m.start = <local> <bound method BaseManager.start of <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>>
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 566, in BaseManager.start
    line: self._address = reader.recv()
    locals:
      self = <local> <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>
      self._address = <local> None
      reader = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
      reader.recv = <local> <bound method _ConnectionBase.recv of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in _ConnectionBase.recv
    line: buf = self._recv_bytes()
    locals:
      buf = <not found>
      self = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
      self._recv_bytes = <local> <bound method Connection._recv_bytes of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in Connection._recv_bytes
    line: buf = self._recv(4)
    locals:
      buf = <not found>
      self = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
      self._recv = <local> <bound method Connection._recv of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in Connection._recv
    line: raise EOFError
    locals:
      EOFError = <builtin> <class 'EOFError'>
EOFError
Unhandled exception <class 'EOFError'> in thread <_MainThread(MainThread, started 139836318601664)>, proc 96015.
torch.distributed
E0703 09:18:03.509000 139702419329472 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 96008) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 883, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 96009)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 96010)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 96011)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 96012)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 96013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 96014)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 96015)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 96008)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

EDIT: fixed torch.distributed trace

(The torch.distributed output is somehow messed up. Do you have a version which is not messed up?)

From your logs, it does not seem like the error is in _TouchFilesThread.run?

It seems the error occurred in some sub proc. Can you give details on what sub proc this is?

You posted the stack of _TouchFilesThread, but that one looks fine? It has not crashed and is just idling?

But the main thread of the RETURNN main proc crashed, specifically in Torch Engine.__init__, in this line:

self._mp_manager = torch.multiprocessing.Manager()

So, this looks very much outside of the scope of RETURNN? This is either pure Python related or PyTorch related (but I assume just pure Python; torch.multiprocessing is basically just the standard multiprocessing module).
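For reference, a minimal sketch (the long path here is purely illustrative) that should reproduce the same failure with plain multiprocessing, without any RETURNN or PyTorch involvement, by forcing the temp dir onto a very long path:

import multiprocessing
import os
import tempfile

if __name__ == "__main__":
    # Illustrative only: point the temp dir at an artificially deep path.
    long_dir = os.path.join("/tmp", "x" * 200)
    os.makedirs(long_dir, exist_ok=True)
    os.environ["TMPDIR"] = long_dir
    tempfile.tempdir = None  # drop any cached value so TMPDIR is re-read

    # The SyncManager child binds an AF_UNIX socket under tempfile.gettempdir(),
    # so this should fail in the child with "AF_UNIX path too long" while the
    # parent sees EOFError, just like in the traceback above.
    m = multiprocessing.Manager()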

I assume some bad environment configuration, maybe weird ulimits or the like, or a wrong tmp path. I wonder a bit where it tries to bind the socket (self._socket.bind(address)), i.e. what is address here? Can you get those local vars?
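One way to get at address without attaching a debugger is to wrap SocketListener.__init__ so it prints its arguments before calling the original; a rough sketch (this mirrors the print_addr wrapper that shows up in a traceback further down; it has to run early, e.g. from the config, so the forked SyncManager child inherits the patch):

# Hypothetical debugging patch: print the AF_UNIX address the SyncManager
# child tries to bind to.
import multiprocessing.connection as _mpc

_orig_init = _mpc.SocketListener.__init__

def print_addr(self, address, *args, **kwargs):
    print("Initializing SocketListener with", address, flush=True)
    return _orig_init(self, address, *args, **kwargs)

_mpc.SocketListener.__init__ = print_addr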

This is the relevant code from the stack trace where the read fails: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/managers.py#L554-L570

The read goes over a connection.Pipe(), which communicates over unix sockets: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L552-L556

and the address comes from a temp dir: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L70-L82

which is created like this: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/util.py#L133-L145

For my account on the node, the temp dir path does not look excessive, so maybe some strange user config, @michelwi? Apparently what fails here is the child process trying to bind to that socket.
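For context: on Linux an AF_UNIX socket path is limited to sizeof(sun_path) = 108 bytes (so roughly 107 usable characters), and multiprocessing builds the listener path as <tempdir>/pymp-XXXXXXXX/listener-XXXXXXXX, so a long temp dir quickly exceeds the limit. A rough sketch to estimate whether the current temp dir is already problematic:

# Rough estimate (assumes the Linux limit of ~107 usable bytes for sun_path
# and the "pymp-XXXXXXXX/listener-XXXXXXXX" suffix that multiprocessing adds).
import tempfile

AF_UNIX_MAX = 107
suffix_len = len("/pymp-XXXXXXXX/listener-XXXXXXXX")

tmpdir = tempfile.gettempdir()
estimated = len(tmpdir) + suffix_len
print(f"temp dir: {tmpdir!r}, estimated socket path length: {estimated}")
if estimated > AF_UNIX_MAX:
    print("likely to fail with 'AF_UNIX path too long'")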

Note the temp dir logic of Python (tempfile._candidate_tempdir_list), i.e. where it would create those temp files/dirs:

def _candidate_tempdir_list():
    """Generate a list of candidate temporary directories which
    _get_default_tempdir will try."""

    dirlist = []

    # First, try the environment.
    for envname in 'TMPDIR', 'TEMP', 'TMP':
        dirname = _os.getenv(envname)
        if dirname: dirlist.append(dirname)

    # Failing that, try OS-specific locations.
    if _os.name == 'nt':
        dirlist.extend([ _os.path.expanduser(r'~\AppData\Local\Temp'),
                         _os.path.expandvars(r'%SYSTEMROOT%\Temp'),
                         r'c:\temp', r'c:\tmp', r'\temp', r'\tmp' ])
    else:
        dirlist.extend([ '/tmp', '/var/tmp', '/usr/tmp' ])

    # As a last resort, the current directory.
    try:
        dirlist.append(_os.getcwd())
    except (AttributeError, OSError):
        dirlist.append(_os.curdir)

    return dirlist

So, is TMPDIR (or one of the other env vars) set? Where does it point to? If not, you should set it, and prefer /var/tmp.
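For example (a sketch; the env var only has an effect if it is set before tempfile caches its choice, so ideally set it in the job's shell environment before Python even starts):

# Sketch: prefer /var/tmp as temp dir if TMPDIR is not already set.
import os
import tempfile

os.environ.setdefault("TMPDIR", "/var/tmp")
tempfile.tempdir = None  # drop any cached value so TMPDIR is re-read
print("multiprocessing would put its sockets under:", tempfile.gettempdir())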

But in any case, what is address actually? Then we know for sure and don't need to assume.

I just got bitten by the same error in a training not using the new dataset or caching mechanism.

I just got bitten by the same error in a training not using the new dataset or caching mechanism.

So what is address in your case?

what is address actually?

Sorry, I could not figure it out. Then my node was rebooted and the error is gone again.
As I cannot reproduce it anymore, it would be fine by me to close this issue for now. But if @NeoLegends wants to keep debugging, we can keep it open.

Got the address. It seems reproducible for me on g-16. I don't yet know the root cause of why it's behaving strangely.

tempfile._get_default_tempdir returns the project folder, i.e. /home/mgunz/setups/2024-06-24--[redacted], leading to a path like:

Initializing SocketListener with /nas/models/asr/mgunz/2024-06-24--[redacted]/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/work/pymp-zwqk0y_y/listener-kxe8tmti
Process SyncManager-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/home/mgunz/setups/2024-06-24--[redacted]/work/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/output/returnn.config", line 12, in print_addr
    return orig_init(self, address, *args, **kwargs)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long

>>> _candidate_tempdir_list()
['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']

I'm not sure why it won't take e.g. /var/tmp. I checked from the container; it is writable for me.

>>> os.open("/var/tmp/test2", os.O_RDWR | os.O_CREAT | os.O_EXCL)
3

Or is this something w.r.t. the child processes lacking some rights the parent process has?
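One way to test that hypothesis would be to rerun the same probe from a forked child process; a sketch (the path is illustrative):

# Sketch: check whether a forked child can also create files in /var/tmp,
# i.e. whether the SyncManager child lacks rights the parent has.
import multiprocessing
import os

def probe(path):
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    os.close(fd)
    os.unlink(path)
    print(f"child pid {os.getpid()} could create {path!r}")

if __name__ == "__main__":
    p = multiprocessing.Process(target=probe, args=("/var/tmp/test-child-probe",))
    p.start()
    p.join()
    print("child exitcode:", p.exitcode)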

I'm not sure what this is yet.

tempfile._get_default_tempdir returns the project folder, i.e. /home/mgunz/setups/2024-06-24--[redacted]

Yea that's a problem. It should definitely not use that.

>>> _candidate_tempdir_list()
['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']

I'm not sure why it won't take e.g. /var/tmp.

Yea, that's strange. We should find out, e.g. just step-by-step debug through it.
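For instance, a sketch of such a step-by-step check: it redoes the probing that tempfile._get_default_tempdir performs, but prints the per-candidate error instead of silently skipping the directory (it uses private tempfile helpers, so treat it as a debugging aid only):

# Debugging aid: replicate tempfile._get_default_tempdir's probe and show why
# each candidate directory gets rejected in this environment.
import errno
import os
import tempfile

namer = tempfile._RandomNameSequence()
for candidate in tempfile._candidate_tempdir_list():
    path = os.path.join(candidate, next(namer))
    try:
        fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, b"probe")
        finally:
            os.close(fd)
            os.unlink(path)
        print(f"{candidate!r}: OK")
    except OSError as exc:
        code = errno.errorcode.get(exc.errno, exc.errno)
        print(f"{candidate!r}: rejected ({code}: {exc})")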