modelscope/FunCodec

Inconsistency in Encode Results with Different Batch Sizes

hertz-pj opened this issue · 3 comments

I have noticed that when using different batch sizes for the encode inference, the same data yields different results. Specifically, changing the batch_size parameter seems to affect the outcome even when the input data remains consistent.
I am unsure whether this behavior is expected or indicative of a bug. Any insight or guidance would be greatly appreciated, since understanding the expected behavior when varying batch sizes is important for my continued use of and trust in the tool.
Thank you for your attention to this matter and for your continued support of the community with FunCodec.

In summary, with different batch sizes the output codes should be very similar: fewer than 3 tokens should differ for each quantizer. In my case, I tested 10 utterances from the LibriSpeech test-clean subset with batch_size 1, 4 and 8, and the codec outputs were identical. Here are some insights that may help you figure out the problem:

  1. To enable batchified inference, utterances in a mini-batch are padded at the end with numpy's wrap mode (a minimal sketch follows this list). You can find more details at https://github.com/alibaba-damo-academy/FunCodec/blob/master/funcodec/bin/codec_inference.py#L260 and https://github.com/alibaba-damo-academy/FunCodec/blob/master/funcodec/modules/nets_utils.py#L65

  2. To speed up data loading at the inference stage, a multi-worker torch DataLoader is employed. Therefore, if you set num_workers larger than 0 in the encoding_decoding.sh script (the default value is 4), the utterance order of the outputs may differ due to the randomness of the DataLoader workers. If you want to maintain the utterance order, please set the num_workers parameter to 0.
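
For point 1, here is a minimal sketch of the wrap-mode padding; the helper below is only illustrative, the actual implementation is in nets_utils.py linked above:

```python
import numpy as np

def pad_batch_wrap(utterances):
    """Pad each 1-D waveform to the longest length in the mini-batch by
    wrapping it around from its own beginning (numpy 'wrap' mode), then
    stack the padded waveforms into a single (batch, max_len) array."""
    lengths = np.array([len(x) for x in utterances])
    max_len = lengths.max()
    padded = [np.pad(x, (0, max_len - len(x)), mode="wrap") for x in utterances]
    return np.stack(padded), lengths

# Example: two utterances of different lengths in one mini-batch.
batch, lengths = pad_batch_wrap([np.arange(5, dtype=np.float32),
                                 np.arange(8, dtype=np.float32)])
print(batch)    # the shorter utterance is wrapped at the tail: [0 1 2 3 4 0 1 2]
print(lengths)  # [5 8]
```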

If your test cases still differ significantly after checking the points above, please provide a reproducible recipe and I will look into it. Thanks.
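
For reference, one way such a comparison could be scripted (an illustrative sketch, not part of FunCodec; it assumes the codes of one utterance from two runs are loaded as integer arrays of shape (n_quantizers, n_frames), and the variable names are hypothetical):

```python
import numpy as np

def count_token_diffs(codes_a, codes_b):
    """Count differing tokens per quantizer between two encode runs.
    Both inputs: integer arrays of shape (n_quantizers, n_frames)."""
    n = min(codes_a.shape[1], codes_b.shape[1])   # tolerate a length mismatch at the tail
    return (codes_a[:, :n] != codes_b[:, :n]).sum(axis=1)

# e.g. diffs = count_token_diffs(codes_batch1, codes_batch8)
# With consistent batchified inference, every entry of `diffs`
# should be smaller than 3.
```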

The variations in the outputs are not really noticeable in terms of the final effect. I am just keen to understand why the results differ across batch sizes. From my understanding, proper masking should avoid any inconsistency caused by different batch sizes.

Since there are only convolutions and uni-directional LSTM layers in the VAE-RVQ model, I didn't implement batchified inference with masking; instead, I use proper padding with numpy's wrap mode. Different padding lengths may cause very limited inconsistencies in the ending codes, while the other codes should be identical across batch sizes.
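
To illustrate that argument with a toy example (this is not the FunCodec encoder, just a single strided convolution standing in for it): output frames whose receptive field stays inside the original signal do not depend on how much wrap padding is appended, so only the last few codes can differ.

```python
import numpy as np
import torch

torch.manual_seed(0)
# A single strided convolution standing in for the convolutional encoder stack.
conv = torch.nn.Conv1d(1, 4, kernel_size=16, stride=8)

x = np.random.randn(100).astype(np.float32)    # one utterance
tail_pads = [0, 28, 60]                        # different amounts of wrap padding,
                                               # as different mini-batches would produce
outs = []
for p in tail_pads:
    xp = np.pad(x, (0, p), mode="wrap")
    with torch.no_grad():
        outs.append(conv(torch.from_numpy(xp)[None, None, :]))

# Frames whose receptive field lies entirely inside the original 100 samples
# are identical no matter how much padding was appended.
n_full = (len(x) - 16) // 8 + 1
for o in outs[1:]:
    assert torch.allclose(outs[0][..., :n_full], o[..., :n_full])
print(f"first {n_full} frames identical for all padding lengths")
```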