Beam Search Decoding giving poor accuracy compared to Greedy Search Decoding
Gouranga95 opened this issue · 7 comments
Hello,
I have trained the EfficientConformerTransducerSmall model on my own data, using a sentencepiece tokenizer that is also trained on my data. When decoding with the best checkpoint, greedy search decoding gives around 35% WER, but beam search decoding gives a higher WER (around 40%) with beam sizes 5 and 16. With beam size 1 the WER is close to 35%, which is as expected. I ran the experiment both with and without an LM and the behaviour is the same (i.e. WER increases with any beam size higher than 1).
Can you please explain why the WER is increasing for beam search decoding with beam size higher than 1?
@burchim In the case of the CTC model (the checkpoint you provided in the notebook), the WER with beam size 16 and no LM is 7.86, while the greedy search WER is 7.93. So increasing the beam size does help accuracy for CTC, but not for the Efficient Conformer Transducer.
@burchim Any inputs regarding this peculiar observation?
Hi,
You may need to play with the model output temperature and the n-gram alpha/beta hyperparameters in the config. Note that beam search decoding is expected, but not guaranteed, to find a better solution than greedy search. Also, a beam size of 1 defaults to greedy search.
In our transducer experiments, where greedy search achieved around 3% WER, beam search gave a gain of 0.2%~0.3% WER without an LM, using beam sizes from 4 to 20 and temperatures from 1 to 2.5.
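For reference, here is a minimal sketch of how temperature and n-gram alpha/beta typically combine when scoring a beam hypothesis with shallow fusion. This is not the repository's actual decoding code; the function and argument names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def hypothesis_score(logits, token_ids, lm_logprob,
                     temperature=1.0, ngram_alpha=0.3, ngram_beta=1.0):
    """Illustrative score of one beam hypothesis (hypothetical helper).

    logits:     (T, vocab) model outputs gathered along the hypothesis path
    token_ids:  (T,) emitted token ids
    lm_logprob: total n-gram LM log-probability of the token sequence
    """
    # Temperature > 1 flattens the output distribution, letting beam search
    # explore alternatives instead of collapsing onto the greedy path.
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    am_score = log_probs.gather(1, token_ids.unsqueeze(1)).sum()

    # Shallow fusion: alpha weights the LM contribution, beta acts as a
    # length bonus to counteract the LM's bias toward short outputs.
    return am_score + ngram_alpha * lm_logprob + ngram_beta * token_ids.numel()
```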
@burchim Can you post a screenshot of the TensorBoard from the EfficientConformerTransducerSmall model training?
I can't post additional content but you should get similar results using the same setup with LibriSpeech.
What is the WER with attention decoding, without the n-gram and LM?
When using 12 encoder layers, what is the WER without the n-gram and LM?
Why is the BPE size 256, and not 5000 or larger?
Does this model need to be trained for 450 epochs? Do you have WER figures for fewer epochs?
Hi,
In the case of RNN-T, using a 1/2-layer Transformer/Conformer decoder instead of the LSTM achieved slightly better WER: ~0.1%/0.2% better after 100 epochs for the Efficient Conformer Small. This resulted in slower training/inference in our case, so we decided to keep the LSTM, but I encourage you to try.
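A minimal PyTorch sketch of the two predictor variants being compared; the dimensions and layer counts are illustrative assumptions, not the exact setup used here.

```python
import torch
import torch.nn as nn

class LSTMPredictor(nn.Module):
    """Standard RNN-T prediction network: embed previous non-blank tokens, run an LSTM."""
    def __init__(self, vocab_size, dim=320, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=num_layers, batch_first=True)

    def forward(self, tokens, state=None):
        out, state = self.lstm(self.embed(tokens), state)
        return out, state

class TransformerPredictor(nn.Module):
    """Alternative predictor: a shallow causal Transformer over previous tokens."""
    def __init__(self, vocab_size, dim=320, num_layers=2, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, num_heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tokens):
        x = self.embed(tokens)
        # Causal mask so each position only attends to previous tokens,
        # mirroring the autoregressive role of the LSTM predictor.
        T = tokens.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1).to(tokens.device)
        return self.layers(x, mask=mask)
```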
We experimented with 10-12 encoder layers and a wider feature dimension, but performance wasn't as good.
A lot of resources were used in previous works to come up with an optimal architecture design.
A larger BPE size also comes with more parameters. A BPE size of 128/256 has been shown to be optimal for CTC in previous work (§4.3).
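For example, a 256-token sentencepiece BPE model can be trained like this (file names are placeholders):

```python
import sentencepiece as spm

# Train a 256-token BPE model on the training transcripts (one utterance per line).
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",  # placeholder path
    model_prefix="bpe_256",
    vocab_size=256,
    model_type="bpe",
    character_coverage=1.0,
)
```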
No, you don't need to train for that many epochs to reach a good WER. 50 to 100 epochs may be sufficient depending on your needs.