N-Gram-Language-Model

Includes:

  • Index words
  • Store ngrams in a Trie data structure (a sketch follows this list)
  • Efficiently extract ngrams and their frequencies
  • Compute out-of-vocabulary (OOV) rate
  • Compute smoothed ngram probabilities using absolute discounting with interpolation
  • Compute Perplexity
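
As a rough illustration of the trie-based count storage, here is a minimal Python sketch; the class and method names are assumptions for illustration, not the repository's actual API. Each node stores the count of the n-gram prefix ending at that node:

```python
from collections import defaultdict


class CountTrie:
    """Each path from the root spells an n-gram prefix; every node on the
    path stores how often that prefix occurred in the training corpus."""

    def __init__(self):
        self.children = defaultdict(CountTrie)
        self.count = 0

    def add(self, ngram):
        """Insert one n-gram (a tuple of words) and increment counts along its path."""
        node = self
        for word in ngram:
            node = node.children[word]
            node.count += 1

    def get_count(self, ngram):
        """Return the frequency of an n-gram, or 0 if it was never seen."""
        node = self
        for word in ngram:
            if word not in node.children:
                return 0
            node = node.children[word]
        return node.count
```

Storing counts this way makes an n-gram and its history prefix available on the same lookup path, which is exactly what the probability estimates below need.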

Introduction

A statistical language model assigns a probability to a sequence of words. Given a history context represented by the preceding words, it can predict the next word in the sequence.

The probability that we want to model can be factorized using the chain rule as follows:

$$P(w_1, \ldots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_0, w_1, \ldots, w_{i-1})$$

where $w_0 = \langle s \rangle$ is a special token denoting the start of the sentence.
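
For example, for a three-word sentence the factorization reads:

$$P(w_1, w_2, w_3) = P(w_1 \mid \langle s \rangle) \, P(w_2 \mid \langle s \rangle, w_1) \, P(w_3 \mid \langle s \rangle, w_1, w_2)$$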

In practice, we usually use what are called N-Gram models, which make a Markov assumption to limit the history context. Examples of N-Grams are:

  • Unigram: $P(w_i)$
  • Bigram: $P(w_i \mid w_{i-1})$
  • Trigram: $P(w_i \mid w_{i-2}, w_{i-1})$
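
A minimal sketch of n-gram extraction from a tokenized sentence (plain Python; the function name and the `<s>` padding convention are assumptions for illustration):

```python
def extract_ngrams(tokens, n):
    """Return all n-grams of a tokenized sentence as tuples,
    padding the history with start-of-sentence tokens."""
    padded = ["<s>"] * (n - 1) + tokens
    return [tuple(padded[i:i + n]) for i in range(len(tokens))]


# extract_ngrams(["the", "cat", "sat"], 2)
# -> [("<s>", "the"), ("the", "cat"), ("cat", "sat")]
```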

Training

Using the Maximum Likelihood criterion, these probabilities can be estimated from counts. For example, for the bigram model,

$$P(w_i \mid w_{i-1}) = \frac{N(w_{i-1}, w_i)}{N(w_{i-1})}$$

$$N(w_{i-1}) = \sum_{w} N(w_{i-1}, w)$$

where $N(\cdot)$ denotes the frequency of an event in the training corpus.
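
A short sketch of this maximum-likelihood estimate in code (assuming already-tokenized sentences; the names are illustrative, not the repository's API):

```python
from collections import Counter


def bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) = N(w_{i-1}, w_i) / N(w_{i-1}) from raw counts."""
    history_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens
        history_counts.update(padded[:-1])              # N(w_{i-1}): each token used as a history
        bigram_counts.update(zip(padded, padded[1:]))   # N(w_{i-1}, w_i)
    return {(h, w): c / history_counts[h] for (h, w), c in bigram_counts.items()}
```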

However, this is problematic for unseen data, because the counts will be 0 and the probability is then undefined. To solve this problem, we use smoothing techniques. There are different smoothing techniques; the one used here is called absolute discounting with interpolation.
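
In its standard form, absolute discounting with interpolation subtracts a fixed discount $d$ from every seen count and redistributes the freed mass to a lower-order model: $P(w \mid h) = \frac{\max(N(h, w) - d,\, 0)}{N(h)} + \frac{d \cdot N_{1+}(h, \cdot)}{N(h)} \, P_{\text{lower}}(w)$, where $N_{1+}(h, \cdot)$ is the number of distinct words seen after the history $h$. A minimal sketch of that formula (illustrative names, not the repository's API):

```python
def absolute_discounting(count_hw, count_h, distinct_followers, lower_order_prob, d=0.7):
    """Absolute discounting with interpolation for one n-gram.

    count_hw           -- N(h, w), count of the full n-gram
    count_h            -- N(h), count of the history
    distinct_followers -- N_1+(h, .), number of distinct words seen after h
    lower_order_prob   -- probability of w under the (n-1)-gram model
    d                  -- discount parameter, 0 < d < 1
    """
    if count_h == 0:
        # Unseen history: fall back entirely to the lower-order distribution.
        return lower_order_prob
    discounted = max(count_hw - d, 0.0) / count_h
    backoff_weight = d * distinct_followers / count_h
    return discounted + backoff_weight * lower_order_prob
```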

Perplexity

To measure the performance of a language model, we compute the perplexity of the test corpus using the trained m-Grams:

$$PP = \left( \prod_{i=1}^{N} P(w_i \mid w_{i-m+1}, \ldots, w_{i-1}) \right)^{-\frac{1}{N}}$$
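
A small sketch of this computation from per-token log probabilities (natural log assumed; the function name is illustrative):

```python
import math


def perplexity(log_probs):
    """Perplexity of a test corpus given the natural-log probability
    assigned by the model to each of its N tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))
```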

Results

The model was tested on the Europarl dataset (in the data directory):

Test PP with bigrams = 130.09

Test PP with trigrams = 94.82