Language model at word level

Question

marcoleewow opened this issue 7 years ago · 4 comments

Hi, did you add a word-level language model for beam search?

Currently it's easy to add a character-level bi-gram, but I find it much harder to add a word-level one. I tried the CTC token passing algorithm, but it's just way too slow compared to beam search.

Answer 1 · 2017-11-16T11:45:17.000Z

1. Just checking whether the words exist would be the easiest way to go:
   A. You could check whether the words in a beam exist in your dictionary. Each time a labelling gets extended by a whitespace in function calcExtPr, you could check whether the last word exists and assign a probability of 1 if it does and 0 otherwise.
   B. Or you could build a dictionary of prefixes of the dictionary words (e.g. Hello -> H, He, Hel, ...) by using a prefix tree. Then you know which beams can be extended by which characters.
2. Using a word-level bigram LM is not that easy. You can only score neighbouring words with a bigram after both words have been fully added to the beam. But you could give it a try: score the last two words of a beam as soon as it is possible. This would at least remove beams that represent nonsense from an LM point of view, even if this scoring happens a bit late. I think a clever combination of a word-level LM and a prefix tree could give good results and would be fast (it reduces the number of beams).
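
A minimal sketch of the two dictionary ideas above (the word-exists check and the prefix tree), written independently of the decoder code. The names `PrefixTree`, `is_word`, `next_chars` and `word_prob` are hypothetical and only for illustration; how this would be wired into calcExtPr is not shown.

```python
class TrieNode:
    """One node of the prefix tree."""

    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if the path from the root spells a dictionary word


class PrefixTree:
    """Prefix tree (trie) built from a list of dictionary words."""

    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def _find(self, prefix):
        """Return the node reached by following prefix, or None if it is not in the tree."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def is_word(self, text):
        """Option A: does text exist as a complete dictionary word?"""
        node = self._find(text)
        return node is not None and node.is_word

    def next_chars(self, prefix):
        """Option B: characters that may extend prefix towards a dictionary word."""
        node = self._find(prefix)
        return set() if node is None else set(node.children)


def word_prob(tree, last_word):
    """Probability 1 if the finished word is in the dictionary, 0 otherwise."""
    return 1.0 if tree.is_word(last_word) else 0.0
```

For example, with `tree = PrefixTree(["he", "hello", "help"])`, a beam whose current word is "hel" could only be extended by "l" or "p", and a beam that closes the word "hel" with a whitespace would get probability 0.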

Answer 2 · 2017-11-17T07:31:25.000Z

I have done 1.A together with a long-word penalty, but this method uses no word-level bi-gram prior knowledge, which means it is only an autocorrect.

Example: "milk the cous" consists only of dictionary words but does not make sense, whereas the true label we want is "milk the cows".

For 2, I have tried applying bi-gram scores whenever I see a space label, but this pushes the beam out of the beam width, and what I get a lot of the time is one long single word.
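
A rough sketch of scoring at a space label as described above, assuming the bigram probabilities are available as a plain dict keyed by word pairs. The function name `last_bigram_score`, the `bigram_probs` table, the `unseen` fallback value and the toy numbers are all hypothetical, not taken from any actual decoder.

```python
def last_bigram_score(text, bigram_probs, unseen=1e-6):
    """Return P(w2 | w1) for the last two complete words of a beam's text.

    The score is only applied right after a word has been closed by a space.
    Beams that do not end with a space, or that contain fewer than two words,
    are left unchanged (factor 1.0).
    """
    if not text.endswith(" "):
        return 1.0
    words = text.split()
    if len(words) < 2:
        return 1.0
    # Unseen word pairs get a small fallback probability (an assumption of this sketch).
    return bigram_probs.get((words[-2], words[-1]), unseen)


# Toy bigram table, for illustration only.
bigram_probs = {("milk", "the"): 0.5, ("the", "cows"): 0.2}

cows = last_bigram_score("milk the cows ", bigram_probs)  # pair is in the table
cous = last_bigram_score("milk the cous ", bigram_probs)  # pair is not in the table
```

In this sketch a beam that never emits a space always keeps the factor 1.0, so it is never penalized by the LM, which is consistent with the long single-word outputs reported above.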

Currently I am reading up on WFST and trying to implement a CTC decoder using WFST so that I can include a word-level bi-gram LM. Have you tried these methods?

Answer 3 · 2017-11-19T21:35:45.000Z

No, I haven't tried WFST yet.

Answer 4 · 2018-03-01T10:45:22.000Z

I've implemented an algorithm which uses beam search at the word level (dictionary, unigrams/bigrams) and which runs faster than token passing: https://github.com/githubharald/CTCWordBeamSearch