Inconsistency in Encode Results with Different Batch Sizes
hertz-pj opened this issue · 3 comments
I have noticed that when using different batch sizes during encode inference, the same data yields different results. Specifically, changing the batch_size parameter seems to affect the outcome even when the input data remains the same.
I am unsure if this behavior is expected or indicative of a bug. It would be greatly appreciated if you could provide some insights or guidance on this matter. Understanding the expected behavior when varying batch sizes would be crucial for my continued use and trust in the tool's reliability.
Thank you for your attention to this matter and for your continued support of the community with FunCodec.
In summary, with different batch sizes the output codecs should be very similar: fewer than 3 tokens should differ for each quantizer. In my case, I tested 10 utterances from the Librispeech test-clean subset with batch sizes of 1, 4 and 8, and the codec outputs were identical. Here are some insights that may help you figure out your problem:
- To enable batchified inference, utterances in a mini-batch are padded at the end with `wrap` mode in numpy (see the sketch after this list). You can find more details at https://github.com/alibaba-damo-academy/FunCodec/blob/master/funcodec/bin/codec_inference.py#L260 and https://github.com/alibaba-damo-academy/FunCodec/blob/master/funcodec/modules/nets_utils.py#L65
- To speed up data loading at the inference stage, a multi-worker torch DataLoader is employed. Therefore, if you set `num_workers` larger than 0 in the `encoding_decoding.sh` script (the default value is 4), the utterance order of the outputs may differ due to the randomness of the DataLoader workers. If you want to maintain the utterance order, please set the `num_workers` parameter to 0 (a generic DataLoader sketch is given after the closing note below).
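As a rough illustration, here is a minimal numpy sketch of what `wrap`-mode end padding for a mini-batch might look like; the utterance lengths are made up, and the actual helper lives in `funcodec/modules/nets_utils.py`:

```python
import numpy as np

# Toy illustration of "wrap" end-padding for batchified inference
# (lengths are invented; see funcodec/modules/nets_utils.py for the real helper).
utts = [np.random.randn(16000), np.random.randn(12000), np.random.randn(9000)]
max_len = max(len(u) for u in utts)

# Each utterance is padded at its end by repeating its own samples ("wrap" mode)
# until it reaches the length of the longest utterance in the mini-batch.
batch = np.stack([np.pad(u, (0, max_len - len(u)), mode="wrap") for u in utts])
print(batch.shape)  # (3, 16000)
```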
If your test cases still differ substantially after you have checked the points above, please provide a reproducible recipe and I will look into it. Thanks.
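For completeness, a generic sketch of the DataLoader ordering behaviour mentioned above; the toy dataset is a placeholder and not FunCodec code:

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Toy stand-in for an inference dataset; illustrative only, not FunCodec code.
class ToyUtterances(Dataset):
    def __init__(self, n: int = 8):
        self.wavs = [torch.randn(16000) for _ in range(n)]

    def __len__(self):
        return len(self.wavs)

    def __getitem__(self, idx):
        return idx, self.wavs[idx]

# With num_workers=0 everything is loaded in the main process, so batches arrive
# strictly in dataset order, which is the deterministic behaviour suggested above.
loader = DataLoader(ToyUtterances(), batch_size=4, num_workers=0)
for indices, wavs in loader:
    print(indices)  # tensor([0, 1, 2, 3]) then tensor([4, 5, 6, 7])
```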
The variations in the outputs are not noticeable in terms of their effect. I am just keen to understand the reason for the differences across batch sizes. From my understanding, proper masking should avoid inconsistencies caused by different batch sizes.
Since there are only convolutions and uni-directional LSTM layers in the VAE-RVQ model, I didn't implement batchified inference with masking; instead, I use proper padding (numpy's `wrap` mode). I think different padding lengths may cause very limited inconsistencies in the ending codes, while the other codes should be identical across batch sizes.
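To make this boundary effect concrete, here is a toy check with an ordinary Conv1d (not the actual VAE-RVQ model) showing that end padding can only perturb the last few frames whose receptive field reaches into the padded region:

```python
import torch
import torch.nn.functional as F

# Toy demonstration (not the VAE-RVQ model): with only convolutions, wrap-style
# end padding can change nothing except the last few output frames whose
# receptive field overlaps the padded region.
torch.manual_seed(0)
conv = torch.nn.Conv1d(1, 1, kernel_size=5, padding=2)

x = torch.randn(1, 1, 1000)                     # the original utterance
x_padded = F.pad(x, (0, 200), mode="circular")  # wrap-like padding at the end

with torch.no_grad():
    y = conv(x)
    y_padded = conv(x_padded)[..., :1000]       # keep only the original length

diff = (y - y_padded).abs().squeeze()
print(diff[:-2].max().item())  # 0.0: interior frames are bit-identical
print(diff[-2:].max().item())  # > 0: only the last kernel_size // 2 frames differ
```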