libffcv/ffcv

Is there a get_batch(indices) method + custom collate function?


Hi everyone, I am wondering if it is possible for a user to create a custom batch of images with ffcv's speed.

  • Specifically, is there a method on ffcv.loader.Loader, something like get_batch(indices), that builds a batch from the given indices? I need this for an infinite random sampler: with a dataset of 5000 images, for example, I need to build batches of 100 images by random selection with replacement, so an image can be drawn multiple times.

Additionally, can a batch of images also carry other relevant information? Ideally the batch would be a Python dictionary with keys such as ['image', 'index'], where batch['image'] returns a list of tensors or something similar (my images are not all the same size) and batch['index'] returns the dataset index of each image.

  • I assume this is possible by writing a custom PyTorch dataset with these properties and then writing it to the FFCV dataset format with ffcv.writer.DatasetWriter?

Hi! The first feature (get_batch(indices)) is unfortunately not supported, but you can create a loader from a specific set of indices, so if the overhead isn't too big you can create a loader with one batch from those indices and iterate through that loader.
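As a rough illustration of that workaround, here is a minimal sketch of an infinite sampler that draws index batches with replacement. The function name infinite_index_batches, the file name data.beton, and my_pipelines are hypothetical. The FFCV part is left as comments and is an untested assumption: it relies on the indices argument of ffcv.loader.Loader, and I have not checked how a loader behaves when the same index appears twice.

```python
import numpy as np

def infinite_index_batches(dataset_size, batch_size, seed=0):
    """Yield index batches forever, sampling uniformly with replacement."""
    rng = np.random.default_rng(seed)
    while True:
        # integers() samples from [0, dataset_size), so repeats are possible
        yield rng.integers(0, dataset_size, size=batch_size)

sampler = infinite_index_batches(dataset_size=5000, batch_size=100)
batch_indices = next(sampler)

# Assumed FFCV usage (not run here; file name and pipelines are placeholders):
# from ffcv.loader import Loader, OrderOption
# loader = Loader('data.beton',
#                 batch_size=len(batch_indices),
#                 num_workers=1,
#                 order=OrderOption.SEQUENTIAL,
#                 indices=batch_indices,
#                 pipelines=my_pipelines)
# batch = next(iter(loader))
```

Building a new loader for every batch has a setup cost, so this only makes sense if that overhead is acceptable for your workload.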

For the second question, this is very easy! You can indeed just create a PyTorch dataset that serves indices and write that to a beton. If that's too expensive, you can design a custom transform to do this, for example see here: https://github.com/libffcv/ffcv/blob/3a12966b3afe3a81733a732e633317d747bfaac7/examples/docs_examples/transform_with_inds.py or the docs here: https://docs.ffcv.io/ffcv_examples/transform_with_inds.html
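A minimal sketch of the first option, assuming a plain indexed dataset (anything with __len__ and __getitem__): the wrapper below returns each sample together with its own index. The class name WithIndex, the toy images, and the file name data.beton are hypothetical. The writer call is left as comments and reflects my understanding of DatasetWriter.from_indexed_dataset with RGBImageField and IntField; treat it as an assumption rather than a tested recipe.

```python
import numpy as np

class WithIndex:
    """Wrap an indexed dataset so each sample also carries its dataset index."""

    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, i):
        # Return (image, index) so the index can be stored as its own field
        return self.base[i], i

# Toy stand-in for a real dataset: uint8 images of different sizes
toy_images = [np.zeros((h, h + 1, 3), dtype=np.uint8) for h in (4, 5, 6)]
dataset = WithIndex(toy_images)
image, index = dataset[2]

# Assumed FFCV usage (not run here; file name is a placeholder):
# from ffcv.writer import DatasetWriter
# from ffcv.fields import RGBImageField, IntField
# writer = DatasetWriter('data.beton', {
#     'image': RGBImageField(),
#     'index': IntField(),
# })
# writer.from_indexed_dataset(dataset)
```

When loading, the 'index' field would then come back as an extra element of each batch next to the images.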