Too many open files error
whucdf opened this issue · 17 comments
Issue description
While using the DataLoader from PyTorch 0.4.1:
With num_workers > 0, the workers place returned tensors in shared memory, but they do not release the shared-memory file handles after the tensors have been passed to the main process and the handles are no longer needed. If the returned tensors are stored in a list, the process eventually runs out of file handles.
Code example
import torch
from torch.utils.data import Dataset

class testSet(Dataset):
    def __init__(self):
        super(testSet, self).__init__()

    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        return {"index": index}

test_data = testSet()
test_data_loader = torch.utils.data.DataLoader(dataset=test_data, batch_size=1, num_workers=1)
index = []
for sample in test_data_loader:
    index.append(sample['index'])
The error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-5-cf6ed576bc1c> in <module>()
----> 1 for sample in test_data_loader:
2 #print(sample['index'])
3 index.append(sample['index'])
~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
328 while True:
329 assert (not self.shutdown and self.batches_outstanding > 0)
--> 330 idx, batch = self._get_batch()
331 self.batches_outstanding -= 1
332 if idx != self.rcvd_idx:
~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _get_batch(self)
307 raise RuntimeError('DataLoader timed out after {} seconds'.format(self.timeout))
308 else:
--> 309 return self.data_queue.get()
310
311 def __next__(self):
~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/queues.py in get(self)
335 res = self._reader.recv_bytes()
336 # unserialize the data after having released the lock
--> 337 return _ForkingPickler.loads(res)
338
339 def put(self, obj):
~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/reductions.py in rebuild_storage_fd(cls, df, size)
149 fd = multiprocessing.reduction.rebuild_handle(df)
150 else:
--> 151 fd = df.detach()
152 try:
153 storage = storage_from_cache(cls, fd_id(fd))
~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/resource_sharer.py in detach(self)
56 '''Get the fd. This should only be called once.'''
57 with _resource_sharer.get_connection(self._id) as conn:
---> 58 return reduction.recv_handle(conn)
59
60
~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/reduction.py in recv_handle(conn)
180 '''Receive a handle over a local connection.'''
181 with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
--> 182 return recvfds(s, 1)[0]
183
184 def DupFd(fd):
~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/reduction.py in recvfds(sock, size)
159 if len(ancdata) != 1:
160 raise RuntimeError('received %d items of ancdata' %
--> 161 len(ancdata))
162 cmsg_level, cmsg_type, cmsg_data = ancdata[0]
163 if (cmsg_level == socket.SOL_SOCKET and
RuntimeError: received 0 items of ancdata
System Info
- OS: Ubuntu 16.04
- PyTorch version: 0.4.1
@whucdf Thanks for reporting this issue. This is expected: the default file_descriptor sharing strategy uses file descriptors as shared memory handles, and it will hit the open-file limit when too many batches from the DataLoader are kept alive at once. To get around this, you can switch to the file_system strategy by adding this to your script:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
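As a quick sanity check (a small sketch, not part of the original reply), you can list the strategies supported on your platform and confirm which one is active:
import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'} on Linux
mp.set_sharing_strategy('file_system')
print(mp.get_sharing_strategy())        # -> 'file_system'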
Let me know if there is still any issue.
Closing this now; please feel free to reopen it if needed.
Hi @weiyangfb,
thanks for your help. It does solve the problem.
By the way, will it slow down the training speed?
Hey!
I am still getting the same "too many open files" error.
Running on CPU on macOS.
traceback:
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
Traceback (most recent call last):
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-6-cc88ea5f8bd3>", line 2, in <module>
num_epochs=25)
File "<ipython-input-3-c38b0d739ba0>", line 23, in train_model
for inputs, labels in dataloaders[phase]:
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __iter__
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 545, in __init__
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/context.py", line 102, in Queue
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/queues.py", line 42, in __init__
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/context.py", line 67, in Lock
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 163, in __init__
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 60, in __init__
OSError: [Errno 24] Too many open files
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1863, in showtraceback
stb = value._render_traceback_()
AttributeError: 'OSError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/ultratb.py", line 1095, in get_records
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/ultratb.py", line 311, in wrapped
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/ultratb.py", line 345, in _fixed_getinnerframes
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 1483, in getinnerframes
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 1441, in getframeinfo
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 696, in getsourcefile
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 725, in getmodule
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 709, in getabsfile
File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/posixpath.py", line 376, in abspath
OSError: [Errno 24] Too many open files
I did include the proper configurations:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
thanks
Please use a deep copy when appending DataLoader output to a list. Take @whucdf's code as an example:
test_data = testSet()
test_data_loader = torch.utils.data.DataLoader(dataset=test_data, batch_size=1, num_workers=1)
index = []
for sample in test_data_loader:
    index.append(sample['index'])
Here index holds references to the DataLoader's output, so the shared-memory connections among the multiprocessing processes cannot be closed. A deepcopy is useful in this scenario:
import copy

test_data = testSet()
test_data_loader = torch.utils.data.DataLoader(dataset=test_data, batch_size=1, num_workers=1)
index = []
for sample in test_data_loader:
    sample_cp = copy.deepcopy(sample)
    del sample
    index.append(sample_cp['index'])
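An alternative that may work here (a minimal sketch, not from the original comment): clone the tensor before storing it, so the copy kept in the list lives in ordinary main-process memory rather than in the worker's shared-memory block.
index = []
for sample in test_data_loader:
    # .clone() allocates fresh, non-shared storage in the main process,
    # so the shared-memory file handle backing the batch can be released
    index.append(sample['index'].clone())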
Re: @weiyangfb's suggestion above to switch to the file_system sharing strategy via torch.multiprocessing.set_sharing_strategy('file_system'):
I get the error
torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory
Re: the suggestion above to switch to the file_system sharing strategy with torch.multiprocessing.set_sharing_strategy('file_system'):
Is this supposed to be run by the main process (the one doing mp.spawn), or should EVERY process run it inside their run function?
Thanks!
ref: https://pytorch.org/docs/stable/multiprocessing.html#file-descriptor-file-descriptor
I applied
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
yet still getting the same error
For anyone else seeing this error even after setting torch.multiprocessing.set_sharing_strategy('file_system') in their main process: the worker processes of the DataLoader apparently do not inherit this setting. I had to use a worker_init_fn such as:
import torch.multiprocessing
from torch.utils.data import DataLoader

sharing_strategy = "file_system"
torch.multiprocessing.set_sharing_strategy(sharing_strategy)

def set_worker_sharing_strategy(worker_id: int) -> None:
    torch.multiprocessing.set_sharing_strategy(sharing_strategy)

loader = DataLoader(dataset, num_workers=4, worker_init_fn=set_worker_sharing_strategy)
This finally fixed it for me.
@brando90 This relates to your earlier question. I could confirm that the strategy is not set to the same strategy as in the main process by printing the value of torch.multiprocessing.get_sharing_strategy() in worker_init_fn.
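For example, a small sketch of that check (reusing the worker_init_fn hook from above):
import torch.multiprocessing

def check_worker_sharing_strategy(worker_id: int) -> None:
    # Prints the strategy actually in effect inside this worker process;
    # without calling set_sharing_strategy here it reports the default,
    # not the value set in the main process.
    print(f"worker {worker_id}: {torch.multiprocessing.get_sharing_strategy()}")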
@schuhschuh did your solution require you to change the setup() function? (I'm assuming you are doing distributed training/inference.)
My current setup function looks like this:
def setup(rank, world_size, port):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = f'{port}'
    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
Does the solution of using the file_system sharing strategy mean that I must change
dist.init_process_group("nccl", rank=rank, world_size=world_size)
to something like
dist.init_process_group("nccl", init_method="file::/~/somefile", rank=rank, world_size=world_size)
Thanks!
@mdabbah No, I think the two settings (init_method and sharing_strategy) are not related.
I am using Ignite's launcher to start the distributed processes, so I am not calling dist.init_process_group() myself directly. But I had not changed the code that launches the distributed processes, and the sharing strategy is being set later on, in the main function of each process.
Re: the suggestion above to switch to the file_system sharing strategy with torch.multiprocessing.set_sharing_strategy('file_system'):
Is it safe to use this snippet? Are there any side effects of this that I should be worried about?
My issue is that even with torch.multiprocessing.set_sharing_strategy('file_system'), after some time (typically in the second half of training), my job crashes with RuntimeError: unable to open shared memory object </torch_2283204_110829360> in read-write mode. This is much more likely to happen whenever I'm training more than one model in parallel on different GPUs. I verified that there is more than enough RAM and disk space available. Is there any other fix? Thank you.
On a slightly related note, in my training script, if I don't use set_sharing_strategy('file_system'), I also get the "too many open files" error.
But if I add it, then it all runs fine, but at the very end of my script all the processes just hang and never terminate, even if I add a torch.distributed.barrier() or a torch.distributed.destroy_process_group().
I experience the same issue with the latest macOS nightly build. I am able to chew through a couple of epochs, but at some point the number of open file descriptors becomes too large -- they are simply not being closed properly. The set_sharing_strategy is not helping at all.
My dataset returns a dictionary with 3 keys: two float tensors and one string.
import json

import numpy as np
import torch
from torch.utils.data import Dataset

# KEYS: module-level dict mapping parameter names to their 'min'/'max' ranges, defined elsewhere

class PhysicsDataset(Dataset):
    def __init__(self, data_dir, transform=None):
        super().__init__()
        self.data_dir = data_dir
        self.transform = transform
        self.gt_spectra = list(self.data_dir.glob("*.npz"))
        self.gt_parameters = json.load(
            open(self.data_dir / "all_params.json", 'r'))

    def __len__(self):
        return len(self.gt_spectra)

    def __getitem__(self, index):
        with np.load(self.gt_spectra[index]) as data:
            pdata = data['spectrum']
        pdata = (pdata - pdata.min()) / (pdata.max() - pdata.min())
        pdata = torch.from_numpy(pdata).float()
        parameters = self.gt_parameters[self.gt_spectra[index].name.replace(".npz", "")]
        if self.transform:
            pdata = self.transform(pdata)
        # create output tensor with normalised weights
        gt_tensor = torch.from_numpy(
            np.asarray([(parameters[k] - KEYS[k]['min']) /
                        (KEYS[k]['max'] - KEYS[k]['min'])
                        for k in KEYS])).float()
        return {
            "spectrum": pdata,
            "gt_tensor": gt_tensor,
            "filename": self.gt_spectra[index].name
        }
Any ideas why the fds are not closed after each epoch terminates? I suspect this may be due to the np.load in __getitem__, but I have no idea how to fix that.
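One way to narrow this down (a diagnostic sketch, not from the thread; it assumes the optional psutil package is available, and num_epochs/loader are placeholder names) is to log the number of open file descriptors in the main process after every epoch and see whether it keeps growing:
import psutil

proc = psutil.Process()  # the current (main) process

for epoch in range(num_epochs):
    for batch in loader:
        ...  # training step
    # On Linux/macOS, num_fds() reports the open file descriptors of this process;
    # a count that grows across epochs points at handles that are never released.
    print(f"epoch {epoch}: {proc.num_fds()} open fds")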
Re: the comment above about processes hanging at the end of the script after switching to set_sharing_strategy('file_system'):
same here. Have you figured out how to solve it? Thank you!
same here. Have you figured out how to solve it? Thank you!
Not sure how relevant this would be for you. In my case, I have my training dataset in a JSON format (one that we've developed internally at our institute) similar to the COCO format. The dataset is opened through a wrapper class that provides an API for reading it, again similar to COCO.
In my earlier attempts at distributed training, each process ended up opening the same JSON file on its own and trying to read annotations from it with a bunch of workers (num_workers=16).
Something like this, basically:
dataset = JSONDataset("/datasets/coco/annotations/train.json")
train_data = torch.utils.data.Dataset(dataset, ...)
train_loader = torch.utils.data.dataloader.DataLoader(train_data, num_workers=16, ...)
Instead, I made sure to first parse the entire dataset, read the full list of image files and the corresponding labels, and then pass only a list of files and labels to the torch.utils.data.Dataset object, so the workers would only read the image files and not try to share the same JSON file.
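A rough sketch of that pattern (ImageListDataset and the JSON field names are placeholders for illustration, not the actual wrapper API mentioned above):
import json
from torch.utils.data import Dataset, DataLoader

# Parse the annotation file once, up front, in the main process.
with open("/datasets/coco/annotations/train.json") as f:
    annotations = json.load(f)
# "images", "file_name" and "label" are assumed field names for illustration only.
file_names = [img["file_name"] for img in annotations["images"]]
labels = [img["label"] for img in annotations["images"]]

class ImageListDataset(Dataset):
    """Workers receive plain Python lists, not a shared JSON reader."""
    def __init__(self, file_names, labels):
        self.file_names = file_names
        self.labels = labels

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, index):
        # load the image file here (e.g. with PIL) and return it with its label
        return self.file_names[index], self.labels[index]

train_loader = DataLoader(ImageListDataset(file_names, labels), num_workers=16)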
And then I don't touch the set_sharing_strategy
function at all, just leaving it at the default value, and just put a destroy_process_group()
at the end of the application.