An NLP project that uses n-grams and Markov chains to generate text one word at a time. It implements Stupid Backoff: if no matching n-gram of a given size is found, it recurses to the (n-1)-gram, and so on down to the unigram.
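To make the idea concrete, here is a minimal, self-contained sketch of per-word generation with Stupid Backoff. This is not the project's actual implementation: the name `finish_sentence` mirrors the call shown below, but the internals, the `max_len` parameter, and the toy token list are assumptions made for illustration.

```python
import random

def matches(corpus, context):
    """Count every word that follows `context` in the corpus."""
    counts = {}
    n = len(context)
    for i in range(len(corpus) - n):
        if tuple(corpus[i:i + n]) == context:
            follower = corpus[i + n]
            counts[follower] = counts.get(follower, 0) + 1
    return counts

def next_word(corpus, context, deterministic=True):
    """Stupid Backoff: shorten the context until a continuation is found.
    An empty context degenerates to plain unigram frequencies."""
    while True:
        counts = matches(corpus, context)
        if counts:
            if deterministic:
                return max(counts, key=counts.get)  # most frequent follower
            words = list(counts)
            return random.choices(words, weights=[counts[w] for w in words])[0]
        context = context[1:]  # back off to the (n-1)-gram

def finish_sentence(sentence, n, corpus, deterministic=True, max_len=10):
    """Extend `sentence` word by word until punctuation or `max_len`."""
    sentence = list(sentence)
    while len(sentence) < max_len and sentence[-1] not in '.?!':
        context = tuple(sentence[-(n - 1):])  # last n-1 words
        sentence.append(next_word(corpus, context, deterministic))
    return sentence

toy_corpus = ["she", "was", "not", "in", "the", "room", ".",
              "she", "was", "happy", "."]
print(finish_sentence(["she", "was"], 3, toy_corpus))
# ['she', 'was', 'not', 'in', 'the', 'room', '.']
```

With `deterministic=True` the most frequent follower is always chosen, which makes the output reproducible; the stochastic path samples followers weighted by their counts.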
If we use the same NLTK corpus as in our test case
import nltk  # requires nltk.download('gutenberg') and nltk.download('punkt') on first use
corpus = nltk.corpus.gutenberg.raw('austen-sense.txt')
corpus = nltk.word_tokenize(corpus.lower())
... and call the function like so (because stochastic word generation is more fun 😉 )
finish_sentence(["she", "was", "not", "in"], 4, corpus, deterministic=False)
... we can observe how the function works (with a few additional print statements)
... in order to generate the following sentence, detokenized by our trusty TreebankWordDetokenizer class from the nltk.tokenize.treebank module ❤️
she was not in the neighbourhood to which you will not allow me to prove
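The project uses NLTK's TreebankWordDetokenizer for this step; purely to illustrate what detokenization does, here is a crude stdlib-only stand-in (an assumption for demonstration, far less capable than the real class):

```python
import re

def rough_detokenize(tokens):
    """Crude stand-in for TreebankWordDetokenizer: join tokens with
    spaces, then pull trailing punctuation back onto the previous word."""
    text = " ".join(tokens)
    return re.sub(r" ([.,;:!?])", r"\1", text)

print(rough_detokenize(["she", "was", "not", "in", "the", "room", "."]))
# she was not in the room.
```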
If we call the function with a word that does not exist in the corpus and/or with an n-gram size that is too large
finish_sentence(["ThisWordDoesNotExist", "unlike"], 4, corpus, deterministic=False)
... we can observe how the Backoff works
... in order to generate a sentence regardless
ThisWordDoesNotExist unlike yourself must be happy.
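What happens internally is that the out-of-vocabulary token rules out every trigram and bigram context, so the model falls back all the way to unigram frequencies. A small sketch (hypothetical helper, not the project's code) that reports which context length Stupid Backoff actually ends up using:

```python
toy_corpus = ["she", "was", "not", "in", "the", "room", "."]

def backoff_level(corpus, context):
    """Return the context length that Stupid Backoff settles on."""
    while context:
        n = len(context)
        found = any(tuple(corpus[i:i + n]) == context
                    for i in range(len(corpus) - n))
        if found:
            return n
        context = context[1:]  # drop the oldest word and retry
    return 0  # unigram level: no context left at all

# A known context matches at full length ...
print(backoff_level(toy_corpus, ("was", "not", "in")))  # 3
# ... but an unseen word forces backoff down to the unigram level.
print(backoff_level(toy_corpus, ("ThisWordDoesNotExist", "unlike")))  # 0
```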
Now, try it out yourself!