pytorch/data

Prefetcher shutdown hangs when running with multiprocessing + distributed reading service

zhengwy888 opened this issue · 1 comment

๐Ÿ› Describe the bug

Prefetcher will hang indefinitely on shutdown(). The faulthandler stack traces indicate that the main thread is blocked on https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/util/prefetcher.py#L113 while the child thread is blocked on https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/util/prefetcher.py#L81, but I don't know why time.sleep could block on exit.
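
For reference, a minimal sketch of how those stack traces can be captured from the hanging process using only the stdlib faulthandler module; the 60-second timeout and the SIGUSR1 choice are arbitrary, not anything torchdata-specific:

import faulthandler
import signal

# Dump all thread stacks to stderr when the process receives SIGUSR1,
# e.g. `kill -USR1 <pid>` from another shell while shutdown() is hanging.
faulthandler.register(signal.SIGUSR1)

# Alternatively, dump the stacks automatically if the process is still
# running after 60 seconds, without terminating it.
faulthandler.dump_traceback_later(60, exit=False)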

Repro:

import random

from torchdata.datapipes import functional_datapipe
from torchdata.datapipes.iter import IterableWrapper, IterDataPipe
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,
)


@functional_datapipe("frame_slicer")
class FrameSlicer(IterDataPipe):
    def __init__(self, source_datapipe) -> None:
        self.source_datapipe = source_datapipe

    def __iter__(self):
        for fields in self.source_datapipe:
            video_id, seg_start, seg_end = fields
            for i in range(int(seg_start), int(seg_end)+1):
                yield (video_id, i)

def generate_entries():
    lines = []
    # start with a prime number to make sure we have uneven dataloaders
    random.seed(10)
    for i in range(37):
        frame_count = random.randint(5, 10)
        lines.append([f'video-{i}', 10, 10 + frame_count])
    return lines

def build_one_datapipe():
    entries = generate_entries()
    total_frames = sum([x[2] - x[1] + 1 for x in entries])
    dp = IterableWrapper(entries)
    dp = dp.shuffle()
    dp = dp.sharding_filter()
    dp = dp.frame_slicer()
    return dp, total_frames

def build_dataloader2():
    dp, total_frames = build_one_datapipe()

    mp_rs = MultiProcessingReadingService(num_workers=2)
    dist_rs = DistributedReadingService()
    rs = SequentialReadingService(dist_rs, mp_rs)

    dl = DataLoader2(dp, reading_service=rs)
    dl.seed(2)
    counter = 0
    video_ids = set()
    for data in dl:
        video_ids.add(data[0])
        counter += 1

    dl.shutdown()  # hangs here
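
The repro assumes torch.distributed has already been initialized (e.g. via torchrun), since DistributedReadingService needs a process group. A sketch of a standalone single-rank launcher, with backend and addresses chosen arbitrarily:

import os
import torch.distributed as dist

if __name__ == "__main__":
    # Minimal single-rank process group so DistributedReadingService can
    # query rank/world size; MASTER_ADDR/PORT values are placeholders.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    build_dataloader2()
    dist.destroy_process_group()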

Versions

PyTorch version: 2.0.0a0+gite9ebda2
Is debug build: False
CUDA used to build PyTorch: 12.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 12.0.1 (https://github.com/conda-forge/clangdev-feedstock d44358f44aef33e9fa7c5f93e2481ee8f1a04ab6)
CMake version: version 3.19.1
Libc version: glibc-2.31

Python version: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-64-generic-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: 12.0.140
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] mypy-protobuf==3.3.0
[pip3] numpy==1.23.5
[pip3] pytorch3d==0.6.2
[pip3] torch==2.0.1+1684801906.cuda120.cudnn891.nccl218.ap
[pip3] torch-mlir==1684442443
[pip3] torch-scatter==2.1.0
[pip3] torch-tb-profiler==0.4.1
[pip3] torchdata==0.7.0.dev20230601
[pip3] torchfile==0.1.0
[pip3] torchvision==0.15.1a0+42759b1
[conda] magma-cuda121             2.6.1                         1    pytorch
[conda] mkl                       2020.4             h726a3e6_304    conda-forge
[conda] mkl-include               2023.1.0         h84fe81f_48680    conda-forge
[conda] numpy                     1.23.5           py38h7042d01_0    conda-forge
[conda] pytorch3d                 0.6.2                    pypi_0    pypi
[conda] torch                     2.0.1+1684801906.cuda120.cudnn891.nccl218.ap          pypi_0    pypi
[conda] torch-mlir                1684442443               pypi_0    pypi
[conda] torch-scatter             2.1.0                    pypi_0    pypi
[conda] torch-tb-profiler         0.4.1                    pypi_0    pypi
[conda] torchfile                 0.1.0                    pypi_0    pypi
[conda] torchvision               0.15.1a0+42759b1          pypi_0    pypi

I've run into the same problem. Are there any workarounds or suggestions?
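
Not a fix, but a sketch of a guard that avoids blocking the whole job on a hung shutdown: call dl.shutdown() in a daemon thread and stop waiting after a timeout. The timeout value and the give-up behaviour are arbitrary choices, not a torchdata API, and the worker processes may still leak until the interpreter exits.

import threading

def shutdown_with_timeout(dl, timeout=30.0):
    # Run shutdown in a daemon thread so a hang cannot block interpreter exit;
    # if it does not finish within `timeout` seconds, simply stop waiting.
    t = threading.Thread(target=dl.shutdown, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        print(f"DataLoader2.shutdown() did not finish within {timeout}s; "
              "continuing without a clean shutdown")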