loss for lf_mmi is high.
We observed similar results; here is the explanation.
During the training stage, the MMI loss is calculated at the word level. E.g., a Mandarin character sequence `ABC` may be considered as a word sequence `A BC` if the word `BC` is in the lexicon. We find that training the MMI loss at the word level, rather than at the naive character level, may provide a slight CER improvement since the polyphone problem of Mandarin is alleviated. However, we find that the validation MMI loss can spike if we do this (possibly because of over-fitting).
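To make the word-level segmentation concrete, here is a minimal sketch of one common way to do it: greedy forward maximum matching against a lexicon. The `lexicon`, `max_word_len`, and the function itself are illustrative assumptions, not the repo's actual segmentation code.

```python
def max_match_segment(chars, lexicon, max_word_len=4):
    """Greedily split a character sequence into the longest words found in the lexicon;
    characters not covered by any lexicon word fall back to single characters."""
    words, i = [], 0
    while i < len(chars):
        # Try the longest candidate first, shrinking down to a single character.
        for j in range(min(len(chars), i + max_word_len), i, -1):
            cand = "".join(chars[i:j])
            if j - i == 1 or cand in lexicon:
                words.append(cand)
                i = j
                break
    return words

# "ABC" is segmented as ["A", "BC"] when "BC" is in the lexicon.
print(max_match_segment(list("ABC"), {"BC"}))  # ['A', 'BC']
```

With an empty lexicon the same call degrades to pure character-level output, which mirrors the fallback used at decoding time.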
Note that, during decoding, MMI can only work at the character level since the word-segmentation information is missing. So it is reasonable to compute the validation MMI loss at the character level as well. To achieve this, modify the code as below and you should observe a validation MMI loss around 10.
Change `e2e_lfmmi/snowfall/warpper/warpper_mmi.py` (lines 134 to 136 at commit a359d4b) as follows:
```python
if self.training:
    assert self.P.is_cpu
    assert self.P.requires_grad is True
else:
    # Never use segmentation in evaluation: to approximate the decoding stage
    ys = [[self.char_list[c] for c in y if c != self.pad_id] for y in ys_pad]
    ys = [" ".join(y).replace("<eos>", "") for y in ys]
```
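To illustrate what the `else` branch above does, here is a toy run of the same two list comprehensions on made-up inputs; `char_list`, `pad_id`, and the label ids are placeholders, not the repo's real configuration:

```python
# Placeholder vocabulary and padding id (illustrative only).
char_list = ["甲", "乙", "丙", "<eos>"]
pad_id = -1
ys_pad = [[0, 1, 3, -1], [2, 3, -1, -1]]  # padded label-id sequences

# Same transformation as the else branch: drop padding, map ids to
# characters, join with spaces, and blank out the <eos> token.
ys = [[char_list[c] for c in y if c != pad_id] for y in ys_pad]
ys = [" ".join(y).replace("<eos>", "") for y in ys]
print(ys)  # ['甲 乙 ', '丙 ']
```

Note that every character is its own space-separated token, i.e. the supervision is purely character-level, with no word-segmentation information.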
You may also delete the file `data/dev/text_org` to exclude all word-segmentation info from the validation set, and then re-generate your `data.json`.
You may also simply ignore this problem; it has no impact on the decoding results.
You are very welcome to report any bugs or concerns to help us improve. We are writing a journal paper on this work and will release a revised version later. So far, some problems, including this one, have been solved but not yet updated on GitHub due to company policy. We are sorry for that.
Regards