Bayer-Group/tiffslide

Tiffslide errors when used in pytorch dataloader with `num_workers>1`

ap-- opened this issue · 4 comments

ap-- commented

Unfortunately, tiffslide fails again in parallel mode, this time with pytorch dataloaders. This is a very common technique in WSI processing with pytorch; the only difference is that it uses process-based parallelisation (rather than threads, as in the original bug report).

The symptoms are exactly the same:

  • using tiffslide and one dataloader process (num_workers=1) everything works fine
  • using tiffslide and more dataloader processes (e.g. num_workers=4) the processing fails
  • using openslide everything works fine regardless of the num_workers value

Tested using tiffslide version 1.0.0 and tifffile version 2022.2.9. Please see the attached minimal example.

tiffslide-bug2.zip

Originally posted by @lukasii in #14 (comment)

ap-- commented

Thanks for the report @lukasii !

I created a new issue, because now it's multiprocessing related.

Please try moving the slide instantiation to a method that you call from worker_init_fn provided to the DataLoader, and report back if it solves your problem.

Cheers,
Andreas 😃

lukasii commented

Thanks Andreas, that did the trick! Would you know why openslide does not need this kind of special treatment?

For the record, in the dataset class I added:

def worker_init(self, *args):
    self.slide = tiffslide.TiffSlide(self.wsi_file)

and then created the dataloader as:

dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=32, shuffle=False, num_workers=4, pin_memory=True,
    worker_init_fn=dataset.worker_init
)
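
For anyone landing here later, a slightly more defensive variant of the same idea is to open the slide lazily and remember which process opened it, so that a handle inherited from the parent process is never reused. This is only a sketch under assumptions: `LazySlideDataset`, `opener`, `coords` and `tile_size` are hypothetical names, not part of tiffslide or pytorch, and `opener` is assumed to be a callable such as `tiffslide.TiffSlide` whose result offers the usual `read_region(location, level, size)` method.

```python
import os


class LazySlideDataset:
    """Map-style dataset sketch that opens its slide once per process.

    `opener` is any callable taking a path and returning a slide-like
    object (assumption: e.g. tiffslide.TiffSlide).
    """

    def __init__(self, wsi_file, coords, opener, tile_size=(256, 256)):
        self.wsi_file = wsi_file
        self.coords = list(coords)  # (x, y) top-left corners at level 0
        self.opener = opener
        self.tile_size = tile_size
        self._slide = None       # no handle is opened in the constructor
        self._owner_pid = None   # PID of the process that opened the handle

    @property
    def slide(self):
        # Open on first use, and re-open if the handle was created in a
        # different process (e.g. inherited from the parent via fork).
        pid = os.getpid()
        if self._slide is None or self._owner_pid != pid:
            self._slide = self.opener(self.wsi_file)
            self._owner_pid = pid
        return self._slide

    def __getstate__(self):
        # Drop the open handle when pickled (relevant for "spawn").
        state = self.__dict__.copy()
        state["_slide"] = None
        state["_owner_pid"] = None
        return state

    def __len__(self):
        return len(self.coords)

    def __getitem__(self, idx):
        # Level 0 is hard-coded here purely for illustration.
        return self.slide.read_region(self.coords[idx], 0, self.tile_size)
```

With something like this, no worker_init_fn should be needed, since each worker opens its own slide on the first `__getitem__` call. The returned regions would still have to be converted to arrays or tensors before the default batching can handle them.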

ap-- commented

Hi @lukasii

Great! I'm happy that it works for you now. ❤️

I'd have to investigate why exactly it doesn't work in the example you provided. It could be that pytorch uses fork instead of spawn to create new worker processes, and fsspec does not play nicely with fork under some circumstances. Or it could be something else related to multiprocessing and tifffile, or to the fact that tiffslide doesn't try to lock access to the zarr array.
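
As a general illustration of why fork can bite with open files (not a confirmed diagnosis of this issue): on POSIX systems, a file descriptor inherited through fork shares its read offset with the parent, so a read in one process moves the position seen by the other. A minimal stdlib sketch, Unix only, where `shared_offset_demo` is a made-up helper name:

```python
import os
import tempfile


def shared_offset_demo():
    """Show that a forked child and its parent share one file offset."""
    # Write ten known bytes to a temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"0123456789")
        path = f.name

    # Use a raw descriptor so no Python-level buffering hides the effect.
    fd = os.open(path, os.O_RDONLY)
    try:
        pid = os.fork()
        if pid == 0:
            # Child: read 4 bytes, which advances the shared offset.
            try:
                os.read(fd, 4)
            finally:
                os._exit(0)
        os.waitpid(pid, 0)
        # Parent: this read starts where the child stopped.
        return os.read(fd, 4)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(shared_offset_demo())  # b'4567'
```

Whether tifffile or fsspec actually run into this in the example above has not been checked. Opening the slide inside each worker, as in the workaround, gives every process its own handle and therefore its own offset.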

I'll keep this issue open until I've worked out the specific details, and then either make it work or have it crash with a verbose error message suggesting the fix above.

Have a great day, and happy training 🎉
Andreas

lukasii commented

Thanks for explaining. In case more work on this issue is planned in the future, I am attaching updated files. One is my original bug report file, which was incorrectly using a global variable ("self" was missing). Not a big deal, as self.slide was just a reference to that global slide object anyway, so the results are exactly the same. The other file in the archive is the full workaround code.

Cheers!
tiffslide-bug2-updated.zip