Calling __iter__ twice on DataLoader2 causes hang with MPRS
JohnHBrock opened this issue · 2 comments
🐛 Describe the bug
I'm aware torchdata isn't being maintained anymore, but thought I'd post this here for posterity:
When calling `__iter__` twice on the same instance of DataLoader2, trying to iterate over the 2nd iterator results in a hang. One of the worker processes terminates with the exception "Can not reset while we are still waiting response for previous request", although this isn't obvious unless you run a debugger. The exception is raised when one of the workers calls `nonblocking_next()` here. Once this worker dies, the data loader is deadlocked.
I noticed this when using Lightning with torchdata: Lightning's `fit` will run a few iterations of the validation loop as a sanity check before training, then do a training loop, followed by the validation loop again. This 2nd validation loop never finishes because of the hang.
Code to reproduce:

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

def main():
    dp = IterableWrapper([1, 2, 3, 4, 5, 6, 7] * 100).sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
    reading_service = MultiProcessingReadingService(num_workers=2, main_prefetch_cnt=0, worker_prefetch_cnt=0)
    dataloader = DataLoader2(dp, reading_service=reading_service)
    print(next(iter(dataloader)))  # first __iter__: works
    print(next(iter(dataloader)))  # second __iter__: hangs
    print("done")

if __name__ == "__main__":
    main()
```
This results in the output:

```
1
```

and nothing else. The data loader processes continue to run, except for the one terminated worker mentioned above.
Versions

```
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.8.17 (default, Jul 19 2023, 14:02:02) [Clang 14.0.3 (clang-1403.0.22.14.1)] (64-bit runtime)
Python platform: macOS-13.4.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.0.1
[pip3] torchdata==0.6.1
[conda] Could not collect
```
A possible workaround is to wrap `DataLoader2.__iter__` so that the underlying DataLoader2 is recreated from scratch on each call, rather than just resetting the existing instance. For example, something like this:
```python
from torchdata.dataloader2 import DataLoader2

class DataLoader2Workaround:
    def __init__(self, datapipe, reading_service):
        self.datapipe = datapipe
        self.reading_service = reading_service
        self.dataloader2 = None

    def _create_dataloader2(self):
        self.dataloader2 = DataLoader2(self.datapipe, reading_service=self.reading_service)

    def __getattr__(self, attr):
        # Delegate everything else to the underlying DataLoader2,
        # creating it lazily on first access.
        if self.dataloader2 is None:
            self._create_dataloader2()
        return getattr(self.dataloader2, attr)

    def __iter__(self):
        # Build a fresh DataLoader2 for each epoch instead of
        # resetting the existing one, sidestepping the hang.
        self._create_dataloader2()
        return iter(self.dataloader2)
```
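The recreate-on-`__iter__` pattern above can be sketched without torchdata at all. Here `FragileLoader` is a hypothetical stand-in (not a real torchdata class) that, like DataLoader2 with MPRS, only tolerates a single `__iter__` per instance:

```python
class FragileLoader:
    """Hypothetical stand-in for DataLoader2: supports only one __iter__ per instance."""
    def __init__(self, data):
        self.data = data
        self.consumed = False

    def __iter__(self):
        if self.consumed:
            # Mimics the worker exception that deadlocks the real loader.
            raise RuntimeError("Can not reset while we are still waiting response for previous request")
        self.consumed = True
        return iter(self.data)

class RecreatingWrapper:
    """Build a fresh loader on every __iter__, as in the workaround above."""
    def __init__(self, factory):
        self.factory = factory  # zero-arg callable that constructs a new loader

    def __iter__(self):
        return iter(self.factory())

wrapper = RecreatingWrapper(lambda: FragileLoader([1, 2, 3]))
print(next(iter(wrapper)))  # 1
print(next(iter(wrapper)))  # 1 again: each iter() call gets a fresh loader
```

Passing a factory rather than a prebuilt loader is what makes recreation possible; the `DataLoader2Workaround` class achieves the same effect by storing the datapipe and reading service and reconstructing DataLoader2 from them.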
Possibly related to #1148.