githubharald/CTCDecoder

Language model at word level

marcoleewow opened this issue · 4 comments

Hi, did you add a word-level language model for beam search?

Currently it's easy to add a character-level bi-gram, but I find it much harder to add a word-level one. I tried the CTC token passing algorithm, but it's just way too slow compared to beam search.

  1. just checking if the words exist would be the easiest way to go:
    A. you could check if the words in a beam exist in your dictionary. Each time a labelling gets extended by a whitespace in function calcExtPr, you could check if the last word exists, if yes, assign a probability of 1 and 0 otherwise.
    B. or you could build a dictionary of prefixes of the dictionary words (e.g. Hello -> H, He, Hel, ...), by using a prefix tree. Then you know which beams can be extended by which characters.

  2. using a word-level bigram LM is not that easy. You can only score neighbouring words by a bigram after both words have been fully added to the beam. But you could give it a try. Score the two last words of a beam as soon as possible. This would at least remove beams that represent nonsense from a LM point of view, even if this scoring happens a bit late. I think a clever combination of word-level LM and a prefix tree could give good results and would be fast (it reduces the number of beams).
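Options 1.A and 1.B above can be sketched with a small prefix tree. This is only an illustration, not code from this repository: the class name `PrefixTree`, its methods `next_chars` and `is_word`, and the `END` marker are hypothetical names, and wiring it into `calcExtPr` is left out.

```python
END = "$end"  # hypothetical end-of-word marker; multi-character, so it cannot clash with a label


class PrefixTree:
    """Dictionary stored as nested dicts, one level per character."""

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = True  # mark that a complete word ends here

    def _find(self, prefix):
        """Return the node reached by prefix, or None if no word starts with it."""
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def next_chars(self, prefix):
        """Option 1.B: characters that may extend a beam ending in prefix."""
        node = self._find(prefix)
        if node is None:
            return set()
        return {ch for ch in node if ch != END}

    def is_word(self, prefix):
        """Option 1.A: check the last word once a whitespace is appended."""
        node = self._find(prefix)
        return node is not None and END in node


tree = PrefixTree(["milk", "the", "cows", "cous"])
print(sorted(tree.next_chars("co")))  # characters allowed after "co"
print(tree.is_word("cows"), tree.is_word("co"))
```

With such a tree, a beam whose current word is "co" would only be extended by the characters returned from `next_chars`, and by a whitespace only if `is_word` is true.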

I have done 1.A together with a long-word penalty, but this method has no word-level bi-gram prior knowledge, which means it only acts as an autocorrect.

Example: "milk the cous" are all words in the dictionary but it does not make sense, whereas the true label we want is "milk the cows".
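To make the difference concrete, here is a minimal sketch of how a word-level bi-gram would separate the two candidates. The counts are made-up toy numbers, and the function name `bigram_log_prob` and the add-one smoothing are assumptions chosen for illustration, not something taken from this repository.

```python
import math


def bigram_log_prob(words, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed log-probability of a word sequence under a bi-gram model."""
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        num = bigram_counts.get((prev, cur), 0) + 1
        den = unigram_counts.get(prev, 0) + vocab_size
        total += math.log(num / den)
    return total


# Toy counts, invented for this example only.
unigram_counts = {"milk": 3, "the": 10, "cows": 4, "cous": 1}
bigram_counts = {("milk", "the"): 2, ("the", "cows"): 3}
vocab_size = len(unigram_counts)

for text in ("milk the cows", "milk the cous"):
    score = bigram_log_prob(text.split(), bigram_counts, unigram_counts, vocab_size)
    print(text, score)
```

Both candidates pass a pure dictionary check, but "the cous" was never seen in the toy counts, so "milk the cous" gets the lower score.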

For 2, I have tried applying bi-gram scores whenever I see a space label, but this pushes those beams out of the beam width, so a lot of the time what I get is one long single word.

Currently I am reading up on WFSTs and trying to implement a CTC decoder using a WFST so that I can include a word-level bi-gram. Have you tried these methods?

no, I haven't tried WFST yet.

I've implemented an algorithm which uses beam search at word level (dictionary, unigrams/bigrams) and which runs faster than token passing: https://github.com/githubharald/CTCWordBeamSearch