Lightning-AI/torchmetrics

DataLoader worker is killed in Docker

Closed this issue · 4 comments

๐Ÿ› Bug

Under some very specific circumstances, when I use a classification metric during training inside a Docker image, my DataLoader workers are unexpectedly killed. I am not even sure whether this is a bug in torchmetrics, as very specific conditions are required to reproduce it. Because the problem appeared after updating torchmetrics to 1.4.0, I suspect it is related to the latest changes (there is no issue with torchmetrics 1.2 or 1.3). Even after a longer investigation I still have no idea what the issue really is, but since I can reliably reproduce it, I decided to report it.

To Reproduce

To reproduce the issue, run the following code snippet in a Docker container with a CUDA device available (CUDA drivers and the NVIDIA Container Toolkit installed):

import os

import torch
import torchmetrics
from torch.utils.data import DataLoader

# setup envs
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "55555"

# setup torch distributed
torch.distributed.init_process_group("nccl", init_method="env://")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# setup dataloaders
train_dl = DataLoader([1, 2, 3, 4, 5], batch_size=1, num_workers=1)
valid_dl = DataLoader([1, 2, 3, 4, 5], batch_size=1, num_workers=5)

# setup example metric
metric = torchmetrics.F1Score(task="multiclass", num_classes=3).cuda()

print("Iterate over train_dl")
# model.train()
for _ in train_dl:
    metric.update(
        torch.rand(1, 3, 40, 40, 40).cuda(),
        torch.randint(0, 3, (1, 40, 40, 40)).cuda()
    )

print("METRIC: ", metric.compute())

print("Iterate over valid_dl")
for _ in valid_dl:
    pass

print("Epoch end")

Example command to launch this script in Docker, assuming the above script is saved as error.py:

docker run --rm -it --gpus=all -v `pwd`:/error pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime /bin/bash -c "pip install torchmetrics==1.4.0 && python /error/error.py"

When tested as above with the official image pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime, the output is:

Iterate over train_dl
METRIC:  tensor(0.3338, device='cuda:0')
Iterate over valid_dl

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
error.py 36 <module>
for _ in valid_dl:

dataloader.py 628 __next__
data = self._next_data()

dataloader.py 1316 _next_data
idx, data = self._get_data()

dataloader.py 1282 _get_data
success, data = self._try_get_data()

dataloader.py 1133 _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e

RuntimeError:
DataLoader worker (pid(s) 104) exited unexpectedly

The same output has been observed with the image pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime.
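Presumably the same command with only the image tag changed reproduces it there as well (this exact invocation is an assumption, not quoted from the original report):

docker run --rm -it --gpus=all -v `pwd`:/error pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime /bin/bash -c "pip install torchmetrics==1.4.0 && python /error/error.py"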

When tested with a custom image based on PyTorch 1.12.1+cu113 and Python 3.10 (the official PyTorch 1.12.1 image uses Python 3.7, while torchmetrics requires at least Python 3.8), a CUDA error is also reported before the DataLoader workers die:

Iterate over train_dl
METRIC:  tensor(0.3327, device='cuda:0')
Iterate over valid_dl
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at ../c10/cuda/CUDACachingAllocator.cpp:1423 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f7d7263520e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x23af2 (0x7f7d9aceaaf2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7f7d9acef9a7 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x4637b8 (0x7f7dc42677b8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f7d7261c7a5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35f245 (0x7f7dc4163245 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x679b48 (0x7f7dc447db48 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f7dc447def5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: python() [0x596fce]
frame #9: python() [0x5bdb16]
frame #10: python() [0x4e3c13]
frame #11: python() [0x594cac]
<omitting python frames>
frame #13: python() [0x583769]
frame #16: python() [0x50c14e]
frame #18: python() [0x546b93]
frame #19: python() [0x583d8a]
frame #20: python() [0x5750bf]
frame #21: python() [0x500ab4]
frame #29: python() [0x509d08]
frame #37: python() [0x509d08]
frame #43: <unknown function> + 0xb45e (0x7f7ddf5ab45e in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #44: <unknown function> + 0xaae5 (0x7f7ddf5aaae5 in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #45: <unknown function> + 0x974e (0x7f7ddf5a974e in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #46: <unknown function> + 0xb6d7 (0x7f7ddf5ab6d7 in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #47: <unknown function> + 0x99ba (0x7f7ddf5a99ba in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #48: <unknown function> + 0x974e (0x7f7ddf5a974e in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #49: <unknown function> + 0xb6d7 (0x7f7ddf5ab6d7 in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #50: <unknown function> + 0x99ba (0x7f7ddf5a99ba in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #51: <unknown function> + 0x974e (0x7f7ddf5a974e in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #52: <unknown function> + 0x134f0 (0x7f7ddf5b34f0 in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #53: python() [0x50afcf]
frame #55: python() [0x50c14e]
frame #63: python() [0x50c3d7]


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
error.py 36 <module>
for _ in valid_dl:

dataloader.py 681 __next__
data = self._next_data()

dataloader.py 1359 _next_data
idx, data = self._get_data()

dataloader.py 1325 _get_data
success, data = self._try_get_data()

dataloader.py 1176 _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e

RuntimeError:
DataLoader worker (pid(s) 139) exited unexpectedly
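The custom image itself is not included in this report; one way to approximate that setup, assuming the official cu113 wheels on a plain Python 3.10 base image (this command is an illustration, not the image that was actually used), would be:

docker run --rm -it --gpus=all -v `pwd`:/error python:3.10-slim /bin/bash -c "pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113 torchmetrics==1.4.0 && python /error/error.py"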

This error does not appear when:

  • the torchmetrics version is < 1.4.0, for example:
docker run --rm -it --gpus=all -v `pwd`:/error pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime /bin/bash -c "pip install torchmetrics==1.3.0 && python /error/error.py"
  • num_workers is set to 0 for either train_dl or valid_dl (confirmed);
  • certain other non-zero combinations of num_workers are configured (e.g., 1 and 3; no clear pattern has been observed).

Expected behavior

There should be no errors and "Epoch end" should be printed when running this code.

Environment

  • OS: Ubuntu 18.04.5 LTS (Bionic Beaver)
  • NVIDIA driver 525.78.01, CUDA 12.0 (reported by nvidia-smi)
  • the Docker images mentioned above

Hi! Thanks for your contribution, great first issue!

Hi @karwojan, thanks for reporting this issue.
I cannot really comprehend what is going on here. My gut feeling is that this has nothing to do with torchmetrics and is more of a torch issue, but who knows. I tried it on my own machine and I could not really reproduce the issue.
Could you maybe report what happens if you run the code with the CUDA_LAUNCH_BLOCKING environment variable set, as suggested in the error message?
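For example, the variable could be passed into the container with Docker's -e flag (this exact invocation is only an illustration, not a command from the thread):

docker run --rm -it --gpus=all -e CUDA_LAUNCH_BLOCKING=1 -v `pwd`:/error pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime /bin/bash -c "pip install torchmetrics==1.4.0 && python /error/error.py"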

Hi @SkafteNicki, thanks for your answer.
Have you tried running this script in Docker with the following command?

docker run --rm -it --gpus=all -v `pwd`:/error pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime /bin/bash -c "pip install torchmetrics==1.4.0 && python /error/error.py"

Without Docker, I am also not able to reproduce the issue.
When I run the script with CUDA_LAUNCH_BLOCKING set, the result is exactly the same.

Hi @karwojan,
I tried some more and was finally able to reproduce the issue.
I then ran git bisect to narrow down when the bug was introduced, with v1.3.2 marked as a good commit and v1.4.0 as a bad commit. The bisect pointed to commit cd7ccfc, from the merge of PR #2468, as the starting point of the bug.
I am still not sure which change in that PR is actually causing this, but I will try to narrow it down.
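For reference, the bisect described above looks roughly like this (the per-step test command is an assumption; the actual checks were run with the Docker setup from this issue):

git clone https://github.com/Lightning-AI/torchmetrics.git
cd torchmetrics
git bisect start
git bisect bad v1.4.0
git bisect good v1.3.2
# At each step, install the checked-out revision and rerun the reproducer,
# e.g. pip install -e . && python /error/error.py,
# then mark it with `git bisect good` or `git bisect bad` until git
# reports the first bad commit.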