Is there a way to list all possible ngrams for a given string?
alvations opened this issue · 0 comments
The model's vocab only returns the unigrams:
>>> import arpa
>>> x = arpa.loadf('big.arpa')
>>> x[0].vocabulary()
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '</s>', '<s>', '<unk>', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~']
But the model contains 3- to 5-gram probabilities. Is it possible to list the 3- to 5-grams that are available for a given sentence?
E.g. given a string:
T h e _ P r o j e c t _ G u t e n b e r g _ E B o o k _ o f _ T h e _ A d v e n t u r e s _ o f _ S h e r l o c k _ H o l m e s
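
To make the request concrete, here is a rough sketch in plain Python of what I mean. It does not use anything from the arpa package: `candidate_ngrams` and `available_ngrams` are hypothetical helper names, and `known_ngrams` is a stand-in set for whatever n-gram lookup the model could expose.

```python
def candidate_ngrams(tokens, min_n=3, max_n=5):
    """Yield every contiguous n-gram (as a tuple) with min_n <= n <= max_n."""
    for n in range(min_n, max_n + 1):
        # A sequence of len(tokens) tokens has len(tokens) - n + 1 n-grams.
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def available_ngrams(tokens, known_ngrams, min_n=3, max_n=5):
    """Keep only the candidates found in known_ngrams.

    known_ngrams is a hypothetical stand-in (a set of tuples) for the
    n-grams stored in the model; it is not part of the arpa API.
    """
    return [ng for ng in candidate_ngrams(tokens, min_n, max_n)
            if ng in known_ngrams]


# The model here is character-level, so the string is split on spaces.
text = "T h e _ P r o j e c t"
tokens = text.split()
candidates = list(candidate_ngrams(tokens))
```

Ideally the model itself would do the filtering step, so that only the n-grams it actually holds probabilities for are returned.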