
N-gram language models

Language modeling — that is, predicting the probability of a word in a sentence — is a fundamental task in natural language processing. It underlies many NLP applications such as autocomplete, spelling correction, and text generation.

Currently, language models based on neural networks, and especially transformers, are the state of the art: they predict the next word very accurately from the previous words. In this project, however, I revisit the most classic family of language models: n-gram models.
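As a minimal sketch of the idea (using a made-up toy corpus, not the project's actual data or code), a bigram model — the n-gram model with n = 2 — estimates the probability of the next word from counts of adjacent word pairs:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, already tokenized; the project presumably
# trains on a real dataset instead.
corpus = "the cat sat on the mat . the cat ran .".split()

# Count how often each word follows each context word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """Maximum-likelihood estimate P(nxt | prev) = count(prev, nxt) / count(prev, ·)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

# In this corpus, "the" is followed by "cat" twice and "mat" once,
# so P(cat | the) = 2/3.
print(bigram_prob("the", "cat"))
```

Real n-gram models add smoothing to handle unseen word pairs, which a raw maximum-likelihood estimate like this assigns zero probability.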

Project structure