pytorch/data

Does Collator need to exist?

Opened this issue · 1 comment

lendle commented

📚 The doc issue

Docs for Collator leave a lot of questions.

"Collates samples from DataPipe to Tensor(s) by a custom collate function"
What does "collate" mean in this context? What is the collate function applied to? In the torch DataLoader docs, it's clear that collate_fn is meant to be applied to a batch of samples, but that isn't explained here at all. Looking at the implementation, I think the input datapipe is supposed to be batched here too, but the docs don't say so.
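
As a concrete illustration of what I'd expect the docs to say: a collate function takes one batch (a list of samples) and merges it into Tensor(s). A minimal sketch, assuming a recent PyTorch where default_collate is exported from torch.utils.data:

```python
from torch.utils.data import default_collate

# One batch = a list of samples; here, three (int, float) pairs.
batch = [(1, 10.0), (2, 20.0), (3, 30.0)]

# default_collate transposes the batch and stacks each field into a Tensor.
print(default_collate(batch))
# -> [tensor([1, 2, 3]), tensor([10., 20., 30.], dtype=torch.float64)]
```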

What's the difference between this and Mapper? It sort of seems like the only difference is that the output of collate_fn is supposed to be a Tensor, or a collection of Tensors. But I have used it with a function that returns a list of ints, so nothing actually enforces that the output is Tensors.
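
Here's a sketch of the overlap, assuming torchdata's functional datapipe API (batch, map, and collate being the functional forms of Batcher, Mapper, and Collator):

```python
from torchdata.datapipes.iter import IterableWrapper

def double(batch):
    # Returns a plain list of ints, not Tensors, and Collator accepts it.
    return [x * 2 for x in batch]

def pipe():
    return IterableWrapper(range(6)).batch(3)  # yields [0, 1, 2], [3, 4, 5]

print(list(pipe().collate(double)))  # [[0, 2, 4], [6, 8, 10]]
print(list(pipe().map(double)))      # identical output
```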

Suggest a potential alternative/fix

Get rid of Collator if it doesn't add anything over Mapper; having both is confusing.

If keeping it:

  • If it's basically Mapper with a default mapping function that converts things to tensors, don't allow specifying the function.
  • Or explain how it differs from Mapper.
  • State that the input is expected to be batched.
  • Document the conversion argument.

I was under the impression that the DataLoader could look for a Collator in the pipeline and, if one doesn't exist, just use PyTorch's default_collate function, but I could be wrong.
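
If it helps, calling collate() with no arguments does seem to fall back to default_collate (the default of the conversion argument mentioned above). A sketch, under the same torchdata API assumptions as before:

```python
from torchdata.datapipes.iter import IterableWrapper

# With no arguments, Collator's `conversion` defaults to default_collate,
# so each batch of Python floats comes out as one stacked Tensor.
dp = IterableWrapper([0.0, 1.0, 2.0, 3.0]).batch(2).collate()
print(list(dp))
# -> [tensor([0., 1.], dtype=torch.float64), tensor([2., 3.], dtype=torch.float64)]
```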