EleutherAI/gpt-neox

too many .bin files for dataloader, crashed

exnx opened this issue · 0 comments

exnx commented

Hello, I am training on a very large dataset (7T tokens) split across 45 .bin files. When I try to use more than 32 GPUs, the run crashes with an error saying that too many files are open. Has anyone else come across this? Here's the error I receive. Thanks so much!

GPUCA6E:     with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
GPUCA6E:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E:     fd, addr = self._accept()
GPUCA6E:     return recvfds(s, 1)[0]
GPUCA6E:                ^^^^^^^^^^^^^^
GPUCA6E:    OSError  : [Errno 24] Too many open files
GPUCA6E:        nfd = dup(fd)
GPUCA6E:           ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
GPUCA6E:    ^^^^    return recvfds(s, 1)[0]
GPUCA6E:           ^ ^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 159, in recvfds
GPUCA6E: ^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 164, in recvfds
GPUCA6E:     raise EOFError    
GPUCA6E: EOFErrorTraceback (most recent call last):
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 145, in _serve
GPUCA6E: 
GPUCA6E: Exception in thread raise RuntimeError('received %d items of ancdata' %
GPUCA6E: Thread-4 (_pin_memory_loop):
GPUCA6E: Traceback (most recent call last):
GPUCA6E: RuntimeError: received 0 items of ancdata
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
GPUCA6E:     send(conn, destination_pid)
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 50, in send
GPUCA6E:     reduction.send_handle(conn, new_fd, pid)
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 183, in send_handle
GPUCA6E:     with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
GPUCA6E:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^self.run()^
GPUCA6E: ^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 975, in run
GPUCA6E:     nfd = dup(fd)
GPUCA6E:             self._target(*self._args, **self._kwargs) 
GPUCA6E:  ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
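
For reference, `[Errno 24]` means the process has reached its per-process open-file-descriptor limit (the value reported by `ulimit -n`). Below is a minimal sketch of how that limit could be inspected and raised from Python before the data loader starts. It assumes the soft limit is the bottleneck and that the hard limit is higher. The helper name `raise_nofile_soft_limit` and the target of 65536 are hypothetical and not part of gpt-neox.

```python
import resource


def raise_nofile_soft_limit(target=65536):
    """Try to raise the soft RLIMIT_NOFILE to `target`, capped at the hard limit.

    Hypothetical helper for illustration only. Returns (old_soft, new_soft).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unlimited soft limit cannot be raised any further.
    if soft == resource.RLIM_INFINITY:
        return soft, soft

    # An unprivileged process may not exceed the hard limit.
    desired = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if desired <= soft:
        return soft, soft

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
    except (ValueError, OSError):
        # The OS refused the new value; leave the limit unchanged.
        return soft, soft
    return soft, desired


if __name__ == "__main__":
    old, new = raise_nofile_soft_limit()
    print(f"RLIMIT_NOFILE soft limit: {old} -> {new}")
```

Resource limits are inherited by child processes, so this would need to run in each training process before any data loader workers are created. Since the traceback goes through `recvfds` and `_pin_memory_loop`, another option that may be worth trying is `torch.multiprocessing.set_sharing_strategy('file_system')`, which avoids passing file descriptors between processes; whether either change resolves this particular crash is untested here.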