TextGenerator

NLP project that uses N-Grams and Markov Chains to generate text one word at a time. It implements Stupid Backoff: if no n-gram of the requested size is found, the lookup recurses to the (n-1)-gram until we reach a unigram.
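As a rough sketch of the per-word step (illustrative code, not the repository's actual implementation; next_from_ngram and everything inside it are made-up names for this sketch), picking the next word from n-gram counts might look like this:

import random
from collections import Counter

def next_from_ngram(history, n, corpus, deterministic=False):
    """Sample the next word from the continuations of the last
    (n - 1) tokens of `history` in `corpus`; None if no match."""
    context = tuple(history[-(n - 1):])
    # Count every word that follows this context in the token list.
    followers = Counter(
        corpus[i + n - 1]
        for i in range(len(corpus) - n + 1)
        if tuple(corpus[i:i + n - 1]) == context
    )
    if not followers:
        return None
    if deterministic:
        # Always take the most frequent continuation.
        return followers.most_common(1)[0][0]
    # Otherwise sample proportionally to the observed counts.
    words, counts = zip(*followers.items())
    return random.choices(words, weights=counts)[0]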

Visualize how it works

If we use the same NLTK corpus as in our test case

import nltk

corpus = nltk.corpus.gutenberg.raw('austen-sense.txt')
corpus = nltk.word_tokenize(corpus.lower())
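(If the Gutenberg texts or the Punkt tokenizer models aren't installed locally, NLTK raises a LookupError; depending on your NLTK version, fetching them looks like this:)

import nltk
nltk.download('gutenberg')  # Project Gutenberg texts, including austen-sense.txt
nltk.download('punkt')      # tokenizer models used by word_tokenize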

... and call the function like so (because stochastic word generation is more fun 😉 )

finish_sentence(["she", "was", "not", "in"], 4, corpus, deterministic=False)

... we can observe how the function works (with a few additional print statements)

[Screenshot: print trace of the intermediate generation steps]

... in order to generate the following sentence, detokenized by our trusty TreebankWordDetokenizer class from the nltk.tokenize.treebank package ❤️

she was not in the neighbourhood to which you will not allow me to prove
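For reference, the detokenization step is just:

from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ["she", "was", "not", "in", "the", "neighbourhood", "to",
          "which", "you", "will", "not", "allow", "me", "to", "prove"]
print(TreebankWordDetokenizer().detokenize(tokens))
# she was not in the neighbourhood to which you will not allow me to prove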

Stupid Backoff implementation

If we call the function with a word that does not exist in the corpus and/or with an n-gram size that is too large

finish_sentence(["ThisWordDoesNotExist", "unlike"], 4, corpus, deterministic=False)

... we can observe how the Backoff works

[Screenshot: print trace showing the backoff to shorter n-grams]

... in order to generate a sentence regardless

ThisWordDoesNotExist unlike yourself must be happy.
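Conceptually (again a sketch rather than the repository's code, reusing the hypothetical next_from_ngram helper and imports from the earlier snippet), the backoff is a loop from the full n-gram down to the unigram distribution:

def next_word(history, n, corpus, deterministic=False):
    """Stupid Backoff: try the full n-gram, then (n-1)-grams, and so
    on; fall back to the unigram distribution if nothing matches."""
    for size in range(n, 1, -1):
        word = next_from_ngram(history, size, corpus, deterministic)
        if word is not None:
            return word
    # Unigram fallback: every corpus token is a candidate, weighted
    # by its frequency (random.choice over the raw token list).
    return random.choice(corpus)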

Now, try it out yourself!