hristo-vrigazov/mmap.ninja

Compatibility with DataLoaders and Multi-GPU on DDP

Yusepp opened this issue · 2 comments

In our institution, datasets are currently stored on a hard disk drive (HDD), causing data loading to be a bottleneck even with a multi-GPU configuration. I would like to inquire about the compatibility of using a memory-mapped dataset with a DataLoader, and whether multiple workers can access it simultaneously without encountering any issues. Additionally, I would like to know if this approach generates batches of the specified batch size.

Thanks for the question!

  1. Multiple workers can access it simultaneously without issues, provided there is no concurrent modification of the dataset. It is therefore advisable to create the dataset once per project, and afterwards only read from it in your `Dataset` class.
  2. Regarding "whether this approach generates batches of the specified batch size": this approach just stores your dataset on disk in a format from which it is very fast to read a single sample. Batching should be handled by the DataLoader, via its collate function (`collate_fn`). If you are wondering about the `batch_size` parameter of `RaggedMmap.from_generator`, it is used only when creating the memory map, not when you read from it.
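To make point 1 concrete, here is a minimal sketch of the safe pattern (write once, then many concurrent readers). It uses `numpy.memmap` and a thread pool as stand-ins for the memory-mapped dataset and the DataLoader workers; the file path, shapes, and helper names are illustrative, not part of the mmap.ninja API:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def build_once(path, shape):
    """Write phase: create the memory-mapped file a single time."""
    arr = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
    arr[:] = np.arange(arr.size, dtype=np.float32).reshape(shape)
    arr.flush()
    del arr  # close the writable handle before any reader starts


def read_sample(path, shape, idx):
    """Read phase: every worker opens its own read-only view."""
    view = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
    return float(view[idx].sum())


def demo():
    path = os.path.join(tempfile.mkdtemp(), "data.bin")
    shape = (8, 4)
    build_once(path, shape)
    # Many simultaneous readers, no writer: this is the safe pattern.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda i: read_sample(path, shape, i), range(8)))
```

The same discipline applies with a `RaggedMmap`: create it once up front, then have each worker read from it without anyone modifying it.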

Feel free to ask additional questions or to request clarification.
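To make point 2 concrete, here is a minimal sketch of where batching actually happens. The `MmapDataset` class and its in-memory sample list are hypothetical placeholders (with mmap.ninja you would hold an opened `RaggedMmap` instead); the point is that `batch_size` on the DataLoader, not anything in the on-disk format, determines the batches you receive:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class MmapDataset(Dataset):
    """Map-style Dataset wrapping an already-created, read-only sample store."""

    def __init__(self, samples):
        # `samples` stands in for an opened memory-mapped dataset;
        # a plain list of numpy arrays keeps the sketch self-contained.
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Return one sample; the DataLoader's default collate stacks them.
        return torch.from_numpy(self.samples[idx])


samples = [np.full(3, i, dtype=np.float32) for i in range(10)]
loader = DataLoader(MmapDataset(samples), batch_size=4, num_workers=0)
shapes = [tuple(batch.shape) for batch in loader]
# 10 samples with batch_size=4 -> batches of shape (4, 3), (4, 3), (2, 3)
```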

Closing due to inactivity, feel free to open the issue again