Why do we need to add size_mul here?

Hi, I don't quite understand the code here.

Lines 41 to 42 in 80d3602

    
    if (max_len * (len(batch_indices) + self._size_mul)
            > self._max_tok):

self._size_mul is used for partitioning, so why do we need to add it when checking whether the maximum token count is exceeded?

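Hi @VisualJoyce,

The sampler grows the batch it is forming in chunks of self._size_mul items, so the check asks whether the batch would still fit after the next chunk is added. Every example is padded to max_len, so the padded size would be max_len * (len(batch_indices) + self._size_mul). That must not exceed the maximum number of tokens, self._max_tok, which is the batch_size from your config file.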
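As a rough illustration of that check, here is a minimal, simplified sketch of token-budget batching. It is not the repository's sampler: the names `chunks` and `token_budget_batches` are hypothetical, bucketing and shuffling are left out, and the running maximum length is reset for each new batch, which may differ from the original code.

```python
from itertools import islice


def chunks(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def token_budget_batches(lens, max_tok, size_mul=8):
    """Group example indices so each padded batch stays within max_tok.

    lens: token length per example (hypothetical input, ideally sorted
    longest first so that similar lengths end up together).
    """
    batches, batch_indices, max_len = [], [], 0
    for indices in chunks(range(len(lens)), size_mul):
        chunk_max = max(lens[i] for i in indices)
        if chunk_max * size_mul > max_tok:
            raise ValueError("max_tok too small for a single chunk")
        new_max = max(max_len, chunk_max)
        # Padded size if this chunk joined the batch: every example is
        # padded to new_max, and the chunk counts as size_mul examples.
        if new_max * (len(batch_indices) + size_mul) > max_tok:
            batches.append(batch_indices)  # close the current batch
            batch_indices, max_len = list(indices), chunk_max
        else:
            batch_indices.extend(indices)
            max_len = new_max
    if batch_indices:
        batches.append(batch_indices)
    return batches


# Illustrative lengths only: 16 long examples followed by 16 short ones.
lens = [100] * 16 + [50] * 16
batches = token_budget_batches(lens, max_tok=1024)
print([len(b) for b in batches])
```

Counting the incoming chunk before it is added is what keeps a batch from going over the budget: without the `+ size_mul` term, the check would only look at the batch as it was before the chunk joined it.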

Thank you for the answer!

In my case, I am trying to select the best BUCKET_SIZE and self._max_tok. I guess the values are selected empirically, so I might need to change them for a different dataset, right?

Indeed, they are selected empirically, but I can share my example from the VQA task.
I am currently training the pretrained UNITER-large model on a 1080 with 12GB VRAM.

For batch_size 1024 (in config), the sampler provides batches of 8 examples (self._size_mul).
For batch_size 3072 (in config), the sampler provides batches of 24/32 examples (multiples of self._size_mul).
For batch_size 5120 (in config), the sampler provides batches of 40/48/54 examples, but training sometimes crashes because it cannot allocate extra memory on the GPU (we are training on a single GPU).

So for me, 3072 works best, and I imagine you can find your value the same way.
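To get a feel for how the token budget maps to batch sizes, here is a small arithmetic sketch. The helper `max_batch_size` is hypothetical, and the padded lengths 128 and 96 are assumed values chosen for illustration, not measurements from this setup.

```python
def max_batch_size(max_tok, max_len, size_mul=8):
    """Largest multiple of size_mul whose padded size max_len * n fits in max_tok."""
    return (max_tok // (max_len * size_mul)) * size_mul


# Assumed padded lengths (hypothetical): 128 and 96 tokens per example.
for max_tok in (1024, 3072, 5120):
    sizes = [max_batch_size(max_tok, max_len) for max_len in (128, 96)]
    print(max_tok, sizes)
# 1024 [8, 8]
# 3072 [24, 32]
# 5120 [40, 48]
```

Under these assumed lengths, shorter examples allow larger batches for the same budget, so peak GPU memory also depends on how long the examples in a given bucket are.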
