N-Gram_Language_Modeling

Built N-gram language models for two different text corpora. Applied two smoothing techniques: Kneser-Ney and Witten-Bell. Calculated perplexity scores for each sentence of both corpora under each of the models, and also calculated the average perplexity score on the training corpus. Compared and analyzed the behaviour of the different LMs.
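Both smoothing methods operate on raw n-gram counts collected from the corpus. As a rough illustration, here is a minimal sketch of how such counts might be accumulated; the 4-gram order, the padding tokens, and the function name are illustrative assumptions, not taken from the assignment code.

```python
from collections import defaultdict

def ngram_counts(sentences, n=4):
    """Count every k-gram (1 <= k <= n) over a list of tokenized sentences.

    The 4-gram default and the <s>/</s> padding scheme are assumptions.
    """
    counts = defaultdict(int)
    for tokens in sentences:
        # Pad so every token has a full (n-1)-token history
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for k in range(1, n + 1):
            for i in range(len(padded) - k + 1):
                counts[tuple(padded[i:i + k])] += 1
    return counts
```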


Introduction to NLP - Assignment 1

Language-Modelling


Anishka Sachdeva (2018101112)
1st February, 2021

Steps to execute the code

python3 language_model.py <smoothing_type> <path_corpus>

smoothing_type = k selects Kneser-Ney smoothing and
smoothing_type = w selects Witten-Bell smoothing.
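For reference, here is a plausible sketch of how language_model.py might read these two arguments; the actual argument handling inside the script is an assumption.

```python
import sys

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: python3 language_model.py <smoothing_type> <path_corpus>")
    smoothing_type, path_corpus = sys.argv[1], sys.argv[2]
    if smoothing_type not in ("k", "w"):
        sys.exit("smoothing_type must be 'k' (Kneser-Ney) or 'w' (Witten-Bell)")
    # ... build the LM on path_corpus with the chosen smoothing ...
```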

Files generated

Perplexity is calculated as follows:

  1. The corpus is divided into a test set and a training set using random.shuffle.
  2. The language model is then built on the training set.
  3. Each sentence in the test set is then evaluated.
  4. The probability of each sentence is calculated under both smoothing methods.
  5. Each probability is written to the output file along with the tokenized sentence.
  6. Finally, the average perplexity score is appended to the file.
  7. Perplexity is calculated using the following formula (see the sketch after this list):
    1. perplexity = float(1)/float(math.exp(float(probability)/float(n)))
    2. Here probability is the log probability of the sentence under the smoothed model, accumulated in log space to avoid numerical underflow:
      1. probability = math.log(p1) + math.log(p2) + .... + math.log(pN)
    3. Here n = length(sentence) - 3.
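The formula in step 7 can be sketched directly in Python. The helper names below are hypothetical, as are the example probabilities; only the perplexity expression itself mirrors the step above.

```python
import math

def sentence_log_prob(ngram_probs):
    """Sum per-n-gram probabilities in log space to avoid underflow."""
    return sum(math.log(p) for p in ngram_probs)

def perplexity(log_prob, n):
    """Perplexity = 1 / exp(log_prob / n), i.e. exp(-log_prob / n)."""
    return float(1) / float(math.exp(float(log_prob) / float(n)))

# Hypothetical example: n-gram probabilities assigned to one test sentence
probs = [0.1, 0.02, 0.3, 0.05, 0.2]
print(perplexity(sentence_log_prob(probs), n=len(probs)))
```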