Sort samples by length in `TextToEmbeddingModelPipeline`
avidale opened this issue · 0 comments
avidale commented
When encoding texts, the pipeline (https://github.com/facebookresearch/SONAR/blob/main/sonar/inference_pipelines/text.py#L169) currently reads them in the provided order, groups them into batches, and collates each batch by padding every text to the length of the longest text in that batch. This sometimes produces batches in which most tokens are padding, so computation is wasted on them.
To avoid this waste and speed up the pipeline, we could sort the texts by length before batching them, and restore the original order of the embeddings afterwards (see the sketch below).
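
A minimal sketch of the idea, outside the pipeline itself. Here `encode_batch` is a hypothetical stand-in for whatever the pipeline does with one collated batch, `batch_size` is an assumed parameter, and character length is used as a cheap proxy for token count; inside the actual pipeline the sort could instead use the tokenized lengths before bucketing.

```python
from typing import Callable, List, Optional, Sequence

import torch


def encode_sorted_by_length(
    texts: Sequence[str],
    encode_batch: Callable[[List[str]], torch.Tensor],  # hypothetical batch encoder
    batch_size: int = 32,
) -> torch.Tensor:
    """Encode texts in length-sorted batches, then restore the original order."""
    # Sort indices by text length so that texts of similar length
    # end up in the same batch and little padding is needed.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))

    embeddings: List[Optional[torch.Tensor]] = [None] * len(texts)
    for start in range(0, len(order), batch_size):
        batch_ids = order[start : start + batch_size]
        batch_emb = encode_batch([texts[i] for i in batch_ids])
        # Scatter each row back to the position of its original text.
        for row, i in zip(batch_emb, batch_ids):
            embeddings[i] = row

    return torch.stack(embeddings)
```

The key point is that the sorting is purely internal: callers still receive embeddings aligned with the order of the input texts.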