multiprocessing: OSError: AF_UNIX path too long
michelwi opened this issue · 11 comments
Yesterday I started a training with DistributeFilesDataset and file caching, which crashed today and now consistently crashes after restarting with what I think is OSError: AF_UNIX path too long in _TouchFilesThread.run.
The original files reside on a Ceph cluster and are cached to an LVM volume with an ext4 filesystem. The basenames should be of reasonable length, but the absolute paths can be quite long.
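For context: on Linux, the sun_path field of a Unix domain socket address is limited to about 107 bytes, so binding fails once the socket path gets longer than that. A minimal sketch to reproduce the error message, assuming Linux (the path below is made up):

import socket

addr = "/tmp/" + "x" * 200  # far beyond the ~107-byte sun_path limit
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    s.bind(addr)  # raises OSError: AF_UNIX path too long
finally:
    s.close()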
I'll attach one version of each flavor of trace, in order of appearance:
OSError
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
server = cls._Server(registry, address, authkey, serializer)
File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
self.listener = Listener(address=address, backlog=16)
File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
self._listener = SocketListener(address, family, backlog)
File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
self._socket.bind(address)
OSError: AF_UNIX path too long
_TouchFilesThread
Unhandled exception <class 'EOFError'> in thread <_MainThread(MainThread, started 139642623541696)>, proc 96008.
Thread <_TouchFilesThread(Thread-1, started daemon 139638043833920)>:
Process SyncManager-1:
File "/usr/lib/python3.10/threading.py", line 324, in Condition.wait
line: gotit = waiter.acquire(True, timeout)
locals:
gotit = <local> False
waiter = <local> <locked _thread.lock object at 0x7f6b43d94640>
waiter.acquire = <local> <built-in method acquire of _thread.lock object at 0x7f6b43d94640>
timeout = <local> 1.0
File "/usr/lib/python3.10/threading.py", line 607, in Event.wait
line: signaled = self._cond.wait(timeout)
locals:
signaled = <local> False
self = <local> <threading.Event object at 0x7f6b43d661d0>
self._cond = <local> <Condition(<unlocked _thread.lock object at 0x7f6b43d85e80>, 1)>
self._cond.wait = <local> <bound method Condition.wait of <Condition(<unlocked _thread.lock object at 0x7f6b43d85e80>, 1)>>
timeout = <local> 1.0
File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/util/file_cache.py", line 418, in _TouchFilesThread.run
line: if self.stop.wait(self.interval):
locals:
self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
self.stop = <local> <threading.Event object at 0x7f6b43d661d0>
self.stop.wait = <local> <bound method Event.wait of <threading.Event object at 0x7f6b43d661d0>>
self.interval = <local> 1.0
File "/usr/lib/python3.10/threading.py", line 1016, in Thread._bootstrap_inner
line: self.run()
locals:
self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
self.run = <local> <bound method _TouchFilesThread.run of <_TouchFilesThread(Thread-1, started daemon 140098675930688)>>
File "/usr/lib/python3.10/threading.py", line 973, in Thread._bootstrap
line: self._bootstrap_inner()
locals:
self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
self._bootstrap_inner = <local> <bound method Thread._bootstrap_inner of <_TouchFilesThread(Thread-1, started daemon 140098675930688)>>
Thread current, main, <_MainThread(MainThread, started 140103104491968)>:
(Excluded thread.)
Main
EXCEPTION
Traceback (most recent call last):
File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py", line 11, in <module>
line: main()
locals:
main = <local> <function main at 0x7fd88b3a27a0>
File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 735, in main
line: init(command_line_options=argv[1:])
locals:
init = <global> <function init at 0x7fd88b3a24d0>
command_line_options = <not found>
argv = <local> ['/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py', '/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/returnn/training/ReturnnTrainingJob.V7zBp10cMkXR/output/returnn.config'], _[0]: {len = 142}
File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 487, in init
line: init_engine()
locals:
init_engine = <global> <function init_engine at 0x7fd88b3a2320>
File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 294, in init_engine
line: engine = Engine(config=config)
locals:
engine = <global> None
Engine = <local> <class 'returnn.torch.engine.Engine'>
config = <global> <returnn.config.Config object at 0x7fd88b333fd0>
File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/torch/engine.py", line 60, in Engine.__init__
line: self._mp_manager = torch.multiprocessing.Manager()
locals:
self = <local> <returnn.torch.engine.Engine object at 0x7fd79384aa40>
self._mp_manager = <local> !AttributeError: 'Engine' object has no attribute '_mp_manager'
torch = <global> <module 'torch' from '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'>
torch.multiprocessing = <global> <module 'torch.multiprocessing' from '/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/__init__.py'>
torch.multiprocessing.Manager = <global> <bound method BaseContext.Manager of <multiprocessing.context.DefaultContext object at 0x7fd7a1d772e0>>
File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in BaseContext.Manager
line: m.start()
locals:
m = <local> <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>
m.start = <local> <bound method BaseManager.start of <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>>
File "/usr/lib/python3.10/multiprocessing/managers.py", line 566, in BaseManager.start
line: self._address = reader.recv()
locals:
self = <local> <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>
self._address = <local> None
reader = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
reader.recv = <local> <bound method _ConnectionBase.recv of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in _ConnectionBase.recv
line: buf = self._recv_bytes()
locals:
buf = <not found>
self = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
self._recv_bytes = <local> <bound method Connection._recv_bytes of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in Connection._recv_bytes
line: buf = self._recv(4)
locals:
buf = <not found>
self = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
self._recv = <local> <bound method Connection._recv of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in Connection._recv
line: raise EOFError
locals:
EOFError = <builtin> <class 'EOFError'>
EOFError
Unhandled exception <class 'EOFError'> in thread <_MainThread(MainThread, started 139836318601664)>, proc 96015.
torch.distributed
E0703 09:18:03.509000 139702419329472 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 96008) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 883, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-07-03_09:18:03
host : g-28.apptek.local
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 96009)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-07-03_09:18:03
host : g-28.apptek.local
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 96010)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-07-03_09:18:03
host : g-28.apptek.local
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 96011)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-07-03_09:18:03
host : g-28.apptek.local
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 96012)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-07-03_09:18:03
host : g-28.apptek.local
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 96013)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-07-03_09:18:03
host : g-28.apptek.local
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 96014)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2024-07-03_09:18:03
host : g-28.apptek.local
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 96015)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-03_09:18:03
host : g-28.apptek.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 96008)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
EDIT: fixed torch.distributed trace
(The torch.distributed output is somehow messed up. Do you have a version which is not messed up?)
From your logs, it does not seem like the error is in _TouchFilesThread.run?
It seems the error occurred in some sub proc. Can you give details on what sub proc this is?
You posted the stack of _TouchFilesThread, but it seems all fine? That thread has not crashed and is just idling?
But the main thread of the RETURNN main proc crashed, specifically in the Torch Engine.__init__, in this line:
self._mp_manager = torch.multiprocessing.Manager()
So, this looks very much outside of the scope of RETURNN? This is either pure Python related or PyTorch related (but I assume just pure Python; torch.multiprocessing is basically just the standard multiprocessing module).
I assume some bad environment configuration. Maybe weird ulimits or so. Or a wrong tmp path. I wonder a bit where it tries to bind the socket (self._socket.bind(address)), i.e. what is address here? Can you get those local vars?
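One way to get at the address is to wrap SocketListener.__init__ before the manager is started; with the default fork start method the patch is inherited by the SyncManager child process. A rough sketch (untested; putting it early into the RETURNN config should work, which is in fact what is done further down in this thread):

import multiprocessing.connection as _mpc

_orig_init = _mpc.SocketListener.__init__

def print_addr(self, address, *args, **kwargs):
    # Log the AF_UNIX socket path right before the bind() that crashes.
    print("Initializing SocketListener with", address)
    return _orig_init(self, address, *args, **kwargs)

_mpc.SocketListener.__init__ = print_addr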
This is the relevant code from the stack trace where the read fails: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/managers.py#L554-L570
The read goes over a connection.Pipe(), which communicates over Unix sockets: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L552-L556
The address comes from a temp dir: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L70-L82
which is created like this: https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/util.py#L133-L145
For my account on the node, the temp dir path does not look excessive, so maybe some strange user config, @michelwi? Apparently what is failing here is the child process trying to bind to that socket.
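You can also reproduce the path generation directly, without starting a manager; this exercises the same arbitrary_address/get_temp_dir code path linked above (a sketch, poking at private multiprocessing internals):

from multiprocessing import util, connection

print(util.get_temp_dir())  # the pymp-... dir under tempfile.gettempdir()
print(connection.arbitrary_address("AF_UNIX"))  # the listener-... path a manager would bind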
Note the temp dir logic of Python, i.e. where it would create those temp files/dirs:
def _candidate_tempdir_list():
    """Generate a list of candidate temporary directories which
    _get_default_tempdir will try."""

    dirlist = []

    # First, try the environment.
    for envname in 'TMPDIR', 'TEMP', 'TMP':
        dirname = _os.getenv(envname)
        if dirname: dirlist.append(dirname)

    # Failing that, try OS-specific locations.
    if _os.name == 'nt':
        dirlist.extend([ _os.path.expanduser(r'~\AppData\Local\Temp'),
                         _os.path.expandvars(r'%SYSTEMROOT%\Temp'),
                         r'c:\temp', r'c:\tmp', r'\temp', r'\tmp' ])
    else:
        dirlist.extend([ '/tmp', '/var/tmp', '/usr/tmp' ])

    # As a last resort, the current directory.
    try:
        dirlist.append(_os.getcwd())
    except (AttributeError, OSError):
        dirlist.append(_os.curdir)

    return dirlist
So, is the TMPDIR env var (or one of the others) set? Where does it point to? If not, you should set it, and prefer /var/tmp.
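A quick way to check what is set and what Python actually resolves (note that the result is cached in tempfile.tempdir after first use, so TMPDIR must be set before the process starts):

import os, tempfile

for envname in ("TMPDIR", "TEMP", "TMP"):
    print(envname, "=", os.environ.get(envname))
print("resolved:", tempfile.gettempdir())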
But in any case, what is address actually? So we just know and don't need to assume.
I just got bitten by the same error in a training not using the new dataset or caching mechanism.
> I just got bitten by the same error in a training not using the new dataset or caching mechanism.

So what is address in your case?
> what is address actually?
Sorry, I could not figure it out. Then my node was rebooted and the error is gone again.
As I cannot reproduce it any more, it would be fine by me to close this issue for now. But if @NeoLegends wants to keep debugging, we can keep it open.
Got the address. It seems reproducible for me on g-16. I don't know the root cause of why it's behaving strangely yet.
tempfile._get_default_tempdir returns the project folder, i.e. /home/mgunz/setups/2024-06-24--[redacted], leading to a path like:
Initializing SocketListener with /nas/models/asr/mgunz/2024-06-24--[redacted]/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/work/pymp-zwqk0y_y/listener-kxe8tmti
Process SyncManager-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
server = cls._Server(registry, address, authkey, serializer)
File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
self.listener = Listener(address=address, backlog=16)
File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
self._listener = SocketListener(address, family, backlog)
File "/home/mgunz/setups/2024-06-24--[redacted]/work/i6_core/returnn/training/ReturnnTrainingJob.mFRyGSpd0CuH/output/returnn.config", line 12, in print_addr
return orig_init(self, address, *args, **kwargs)
File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
self._socket.bind(address)
OSError: AF_UNIX path too long
>>> _candidate_tempdir_list()
['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']
I'm not sure why it won't take e.g. /var/tmp. I checked from the container, and it is writable for me.
>>> os.open("/var/tmp/test2", os.O_RDWR | os.O_CREAT | os.O_EXCL)
3
Or is this something w.r.t. the child processes lacking some rights that the parent process has? I'm not sure what this is yet.
> tempfile._get_default_tempdir returns the project folder, i.e. /home/mgunz/setups/2024-06-24--[redacted]

Yea, that's a problem. It should definitely not use that.

> >>> _candidate_tempdir_list()
> ['/tmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/mgunz/setups/2024-06-24--[redacted]']
>
> I'm not sure why it won't take e.g. /var/tmp.

Yea, that's strange. We should find out, e.g. just step-by-step debug through it.
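To narrow it down, one could replicate the probe that tempfile._get_default_tempdir runs per candidate directory and see which candidate raises what. A rough sketch of that logic (simplified from CPython 3.10; _candidate_tempdir_list is a private API, and the probe file name here is made up):

import os
from tempfile import _candidate_tempdir_list

for d in _candidate_tempdir_list():
    name = os.path.join(d, "tmp-probe-delete-me")
    try:
        # Same open flags the real probe uses: create exclusively, then write.
        fd = os.open(name, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, b"blat")
        finally:
            os.close(fd)
            os.unlink(name)
        print(d, "OK")
    except OSError as e:
        print(d, "FAILED:", e)

If a candidate prints FAILED here, that is presumably why _get_default_tempdir skips it and falls through to the current working directory.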