pytorch/pytorch

Too many open files error

whucdf opened this issue · 17 comments

Issue description

While using the dataloader from pytorch 0.4.1:
With num_workers > 0, the workers store the tensors in shared memory, but do not release the shared-memory file handles after returning the tensors to the main process, even though the handles are no longer needed. If one keeps the returned tensors in a list, the process eventually runs out of file handles.

Code example


from torch.utils.data import Dataset
class testSet(Dataset):
    def __init__(self):
        super(testSet,self).__init__()
    def __len__(self):
        return 1000000
    def __getitem__(self,index):
        return {"index":index}

import torch

test_data = testSet()
test_data_loader = torch.utils.data.DataLoader( dataset=test_data, batch_size=1, num_workers=1)
index = []
for sample in test_data_loader:
    index.append(sample['index'])

The error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-cf6ed576bc1c> in <module>()
----> 1 for sample in test_data_loader:
      2     #print(sample['index'])
      3     index.append(sample['index'])

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    328         while True:
    329             assert (not self.shutdown and self.batches_outstanding > 0)
--> 330             idx, batch = self._get_batch()
    331             self.batches_outstanding -= 1
    332             if idx != self.rcvd_idx:

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _get_batch(self)
    307                 raise RuntimeError('DataLoader timed out after {} seconds'.format(self.timeout))
    308         else:
--> 309             return self.data_queue.get()
    310 
    311     def __next__(self):

~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/queues.py in get(self)
    335             res = self._reader.recv_bytes()
    336         # unserialize the data after having released the lock
--> 337         return _ForkingPickler.loads(res)
    338 
    339     def put(self, obj):

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/reductions.py in rebuild_storage_fd(cls, df, size)
    149         fd = multiprocessing.reduction.rebuild_handle(df)
    150     else:
--> 151         fd = df.detach()
    152     try:
    153         storage = storage_from_cache(cls, fd_id(fd))

~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/resource_sharer.py in detach(self)
     56             '''Get the fd.  This should only be called once.'''
     57             with _resource_sharer.get_connection(self._id) as conn:
---> 58                 return reduction.recv_handle(conn)
     59 
     60 

~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/reduction.py in recv_handle(conn)
    180         '''Receive a handle over a local connection.'''
    181         with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
--> 182             return recvfds(s, 1)[0]
    183 
    184     def DupFd(fd):

~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/reduction.py in recvfds(sock, size)
    159             if len(ancdata) != 1:
    160                 raise RuntimeError('received %d items of ancdata' %
--> 161                                    len(ancdata))
    162             cmsg_level, cmsg_type, cmsg_data = ancdata[0]
    163             if (cmsg_level == socket.SOL_SOCKET and

RuntimeError: received 0 items of ancdata

System Info

  • OS: Ubuntu 16.04
  • PyTorch version: 0.4.1

@whucdf Thanks for reporting this issue. It is expected because the default file_descriptor sharing strategy uses file descriptors as shared memory handles, and this hits the limit when the DataLoader keeps too many batches alive. To get around this, you can switch to the file_system strategy by adding this to your script.

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
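
For reference, you can also check which sharing strategies are available on your platform and what the current open-file limit is; a small sketch using only standard APIs (the exact numbers are system-dependent):

import resource

import torch.multiprocessing

# strategies supported on this platform, e.g. {'file_descriptor', 'file_system'}
print(torch.multiprocessing.get_all_sharing_strategies())
print(torch.multiprocessing.get_sharing_strategy())  # currently active strategy

# soft/hard limits on open file descriptors for this process; with the
# file_descriptor strategy every shared tensor that stays alive consumes one fd
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")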

Let me know if there is still any issue.

closing this now, please feel free to reopen it if needed

hi @weiyangfb
thanks for your help, it does solve the problem.
BTW, will it slow down the training speed?

Hey!
I am still getting the same "too many open files" error.
Running on CPU on macOS.

traceback:

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-cc88ea5f8bd3>", line 2, in <module>
    num_epochs=25)
  File "<ipython-input-3-c38b0d739ba0>", line 23, in train_model
    for inputs, labels in dataloaders[phase]:
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __iter__
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 545, in __init__
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/context.py", line 102, in Queue
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/queues.py", line 42, in __init__
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/context.py", line 67, in Lock
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 163, in __init__
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 60, in __init__
OSError: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1863, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'OSError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/ultratb.py", line 1095, in get_records
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/ultratb.py", line 311, in wrapped
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/ultratb.py", line 345, in _fixed_getinnerframes
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 1483, in getinnerframes
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 1441, in getframeinfo
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 696, in getsourcefile
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 725, in getmodule
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 709, in getabsfile
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/posixpath.py", line 376, in abspath
OSError: [Errno 24] Too many open files

I did include the suggested configuration:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

thanks

Please use a deep copy when appending DataLoader output to a list. Take @whucdf's code as an example:

test_data = testSet() 
test_data_loader = torch.utils.data.DataLoader( dataset=test_data, batch_size=1, num_workers=1)  
index = []  
for sample in test_data_loader:  
    index.append(sample['index'])

Here index keeps references to the output of the data_loader, so the shared-memory handles and the connections among the multiprocessing processes cannot be closed. Deep-copying the sample (and deleting the original) releases them, which is why deepcopy is useful in this scenario.

import copy
test_data = testSet()
test_data_loader = torch.utils.data.DataLoader(dataset=test_data, batch_size=1, num_workers=1)
index = []
for sample in test_data_loader:
    sample_cp = copy.deepcopy(sample)  # copy the batch out of shared memory
    del sample                         # drop the reference to the shared-memory tensors
    index.append(sample_cp['index'])

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

I get the error

torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

Is this supposed to be run by the main process (the one calling mp.spawn), or should every process run it inside its run function?

Thanks!

ref: https://pytorch.org/docs/stable/multiprocessing.html#file-descriptor-file-descriptor

https://discuss.pytorch.org/t/how-does-one-setp-up-the-set-sharing-strategy-strategy-for-multiprocessing/113302

I applied

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

yet I am still getting the same error.

For anyone else seeing this error even after setting torch.multiprocessing.set_sharing_strategy('file_system') in their main process: the DataLoader worker processes apparently do not inherit this setting. I had to use a worker_init_fn such as:

import torch.multiprocessing
from torch.utils.data import DataLoader

sharing_strategy = "file_system"
torch.multiprocessing.set_sharing_strategy(sharing_strategy)

def set_worker_sharing_strategy(worker_id: int) -> None:
    # re-apply the strategy inside each worker process
    torch.multiprocessing.set_sharing_strategy(sharing_strategy)

loader = DataLoader(dataset, num_workers=4, worker_init_fn=set_worker_sharing_strategy)

This finally fixed it for me.

@brando90 This relates to your earlier question. I confirmed that the workers do not end up with the same strategy as the main process by printing the value of torch.multiprocessing.get_sharing_strategy() in worker_init_fn.
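
For anyone who wants to reproduce that check, a minimal sketch (dataset here is a placeholder for your own Dataset object) that prints the strategy seen inside each worker:

import torch.multiprocessing
from torch.utils.data import DataLoader

torch.multiprocessing.set_sharing_strategy("file_system")  # set in the main process

def report_worker_sharing_strategy(worker_id: int) -> None:
    # per the observation above, without re-setting the strategy here this can
    # still print 'file_descriptor' even though the main process switched
    print(f"worker {worker_id}: {torch.multiprocessing.get_sharing_strategy()}")

loader = DataLoader(dataset, num_workers=4, worker_init_fn=report_worker_sharing_strategy)
for _ in loader:
    break  # one iteration is enough to spawn the workers and trigger the print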

@schuhschuh did your solution require you to change the setup() function? (I'm assuming you are doing distributed training/inference.)

my current setup function looks like this

def setup(rank, world_size, port):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = f'{port}'

    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

Does the solution of using the file_system sharing strategy mean that I must change
dist.init_process_group("nccl", rank=rank, world_size=world_size) to something like
dist.init_process_group("nccl", init_method="file::/~/somefile", rank=rank, world_size=world_size)?

Thanks!

@mdabbah No, I think the two settings (init_method and sharing_strategy) are not related.

I am using Ignite's launcher to start the distributed processes, so I am not calling dist.init_process_group() myself directly. I had not changed the code that launches the distributed processes, and the sharing strategy is set later on, in the main function of each process.
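
To illustrate that the two settings are orthogonal, a rough sketch based on the setup() above (not a prescribed recipe; rank, world_size and port come from your launcher):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing

def setup(rank, world_size, port):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = f'{port}'
    # process-group rendezvous: unchanged, still TCP-based via MASTER_ADDR/PORT
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # DataLoader tensor sharing: set independently of init_process_group
    torch.multiprocessing.set_sharing_strategy('file_system')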

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

Is it safe to use this snippet? Are there any side effects of this that I should be worried about?

My issue is that even with torch.multiprocessing.set_sharing_strategy('file_system'), after some time (typically in the second half of training), my job crashes with RuntimeError: unable to open shared memory object </torch_2283204_110829360> in read-write mode. This is much more likely to happen whenever I'm training more than one model in parallel on different GPUs. I verified that there is more than enough RAM and disk space available. Is there any other fix? Thank you.

Xonxt commented

On a slightly related note, in my training script, if I don't use the set_sharing_strategy('file_system'), I also get the "too many open files" error.

But if I add it, everything runs fine until the very end of my script, where all the processes just hang and never terminate, even if I add a torch.distributed.barrier() or a torch.distributed.destroy_process_group().

I experience the same issue with the latest macOS nightly build. I am able to chew through a couple of epochs, but at some point the number of open file descriptors becomes too large; they are simply not being closed properly. set_sharing_strategy is not helping at all.

My dataset returns a dictionary with 3 keys: two float tensors and one string.

import json

import numpy as np
import torch
from torch.utils.data import Dataset

# KEYS maps each parameter name to its 'min'/'max' range; defined elsewhere in the script.

class PhysicsDataset(Dataset):
    def __init__(self, data_dir, transform=None):
        super().__init__()
        self.data_dir = data_dir  # a pathlib.Path
        self.transform = transform
        self.gt_spectra = list(self.data_dir.glob("*.npz"))
        self.gt_parameters = json.load(
            open(self.data_dir / "all_params.json", 'r'))

    def __len__(self):
        return len(self.gt_spectra)

    def __getitem__(self, index):
        with np.load(self.gt_spectra[index]) as data:
            pdata = data['spectrum']
        pdata = (pdata - pdata.min()) / (pdata.max() - pdata.min())
        pdata = torch.from_numpy(pdata).float()
        parameters = self.gt_parameters[self.gt_spectra[index].name.replace(
            ".npz", "")]
        if self.transform:
            pdata = self.transform(pdata)

        # create output tensor with normalised weights
        gt_tensor = torch.from_numpy(
            np.asarray([(parameters[k] - KEYS[k]['min']) /
                        (KEYS[k]['max'] - KEYS[k]['min'])
                        for k in KEYS])).float()
        return {
            "spectrum": pdata,
            "gt_tensor": gt_tensor,
            "filename": self.gt_spectra[index].name
        }

Any ideas why the fds are not closed after each epoch ends? I suspect this may be due to the np.load in __getitem__, but I have no idea how to fix that.
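
One way to narrow this down is to watch the process's open file-descriptor count across epochs; a small diagnostic sketch, assuming psutil is installed (num_fds() is POSIX-only, and num_epochs/loader stand in for your own training loop):

import psutil

proc = psutil.Process()

for epoch in range(num_epochs):
    for batch in loader:
        ...  # training step
    # if this number keeps climbing epoch after epoch, something is holding on
    # to shared-memory handles (e.g. batches kept alive in a Python list)
    print(f"epoch {epoch}: open fds = {proc.num_fds()}")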

On a slightly related note, in my training script, if I don't use the set_sharing_strategy('file_system'), I also get the "too many open files" error.

But if I add it, everything runs fine until the very end of my script, where all the processes just hang and never terminate, even if I add a torch.distributed.barrier() or a torch.distributed.destroy_process_group().

same here. Have you figured out how to solve it? Thank you!

Xonxt commented

same here. Have you figured out how to solve it? Thank you!

Not sure how relevant this would be for you. In my case, I have my training dataset in a JSON format (one that we've developed internally at our institute), similar to the COCO format. The dataset is opened through a wrapper class that provides an API for reading it, again similar to COCO.

In my earlier attempts at distributed training, each process ended up opening the same JSON file on its own, and trying to read annotations from it with a bunch of workers (num_workers=16).

Something like this, basically:

dataset = JSONDataset("/datasets/coco/annotations/train.json")
train_data = torch.utils.data.Dataset(dataset, ...)
train_loader = torch.utils.data.dataloader.DataLoader(train_data, num_workers=16, ...)

Instead, I made sure to first parse the entire dataset, read the full list of image files and the corresponding labels, and then pass only the lists of files and labels to the torch.utils.data.Dataset object, so the workers would only read the image files and not try to share the same JSON file.
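
A rough sketch of that refactoring (ImageListDataset is a hypothetical name, and the JSON keys depend on the internal annotation format): the annotation file is parsed once in the main process, and only plain lists of paths and labels are handed to the Dataset, so the workers never touch the JSON file.

import json

from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class ImageListDataset(Dataset):
    """Holds only file paths and labels; each worker opens image files directly."""
    def __init__(self, image_paths, labels, transform=None):
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        image = Image.open(self.image_paths[index]).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, self.labels[index]

# parse the annotation file once, in the main process
with open("/datasets/coco/annotations/train.json") as f:
    annotations = json.load(f)
image_paths = [entry["file_name"] for entry in annotations["images"]]  # hypothetical keys
labels = [entry["label"] for entry in annotations["images"]]           # hypothetical keys

train_data = ImageListDataset(image_paths, labels, transform=transforms.ToTensor())
train_loader = DataLoader(train_data, num_workers=16)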

And then I don't touch the set_sharing_strategy function at all, just leaving it at the default value, and just put a destroy_process_group() at the end of the application.