pytorch/data

Calling __iter__ twice on DataLoader2 causes hang with MPRS

JohnHBrock opened this issue · 2 comments

๐Ÿ› Describe the bug

I'm aware torchdata isn't being maintained anymore, but thought I'd post this here for posterity:

When calling iter twice on the same instance of DataLoader2, trying to iterate over the second iterator results in a hang. One of the worker processes terminates due to the exception "Can not reset while we are still waiting response for previous request", although this isn't obvious unless you run a debugger. The exception is raised when one of the workers calls nonblocking_next() here. Once this worker dies, the data loader is deadlocked.

I noticed this when using Lightning with torchdata: Lightning's fit will run a few iterations of the validation loop as a sanity check before training, then do a training loop, followed by the validation loop again. This 2nd validation loop never finishes because of the hang.

Code to reproduce:

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

def main():
    dp = IterableWrapper([1, 2, 3, 4, 5, 6, 7] * 100).sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
    reading_service = MultiProcessingReadingService(num_workers=2, main_prefetch_cnt=0, worker_prefetch_cnt=0)

    dataloader = DataLoader2(dp, reading_service=reading_service)
    print(next(iter(dataloader)))
    print(next(iter(dataloader)))  # hangs here on the second iter()
    print("done")

if __name__ == "__main__":
    main()

This results in the output:

1

and nothing else. The data loader processes continue to run, except for the one terminating worker that I mentioned above.

Versions

Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.8.17 (default, Jul 19 2023, 14:02:02) [Clang 14.0.3 (clang-1403.0.22.14.1)] (64-bit runtime)
Python platform: macOS-13.4.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.0.1
[pip3] torchdata==0.6.1
[conda] Could not collect

A possible workaround is to wrap DataLoader2 so that the underlying instance is recreated from scratch on each call to __iter__, rather than resetting the existing DataLoader2 instance. For example:

from torchdata.dataloader2 import DataLoader2

class DataLoader2Workaround:
    def __init__(self, datapipe, reading_service):
        self.datapipe = datapipe
        self.reading_service = reading_service
        self.dataloader2 = None

    def _create_dataloader2(self):
        self.dataloader2 = DataLoader2(self.datapipe, reading_service=self.reading_service)

    def __getattr__(self, attr):
        if self.dataloader2 is None:
            self._create_dataloader2()
        return getattr(self.dataloader2, attr)

    def __iter__(self):
        self._create_dataloader2()
        return iter(self.dataloader2)
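To make the delegation pattern concrete without requiring torchdata, here is a minimal, self-contained sketch. FakeLoader is a hypothetical stand-in (not part of torchdata) that mimics the problematic behavior of reusing one underlying iterator, and the wrapper shows how recreating the loader on every __iter__ restores fresh iteration:

```python
class FakeLoader:
    """Hypothetical stand-in for DataLoader2: hands out the same
    underlying iterator on every __iter__, mimicking the bug."""

    def __init__(self, data):
        self._it = iter(data)

    def __iter__(self):
        # Returns the same iterator each time instead of a fresh one.
        return self._it


class LoaderWorkaround:
    """Same idea as DataLoader2Workaround above: build a brand-new
    loader on every __iter__ instead of reusing the old one."""

    def __init__(self, data):
        self._data = data

    def __iter__(self):
        return iter(FakeLoader(self._data))


broken = FakeLoader([1, 2, 3])
print(next(iter(broken)))  # 1
print(next(iter(broken)))  # 2 -- the second iter() did not restart

fixed = LoaderWorkaround([1, 2, 3])
print(next(iter(fixed)))   # 1
print(next(iter(fixed)))   # 1 -- a fresh loader each time
```

The stand-in only mirrors the reuse-one-iterator symptom, not the multiprocessing deadlock itself, but the wrapper's recreate-on-__iter__ structure is the same as in the workaround class above.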

Possibly related to #1148.