Introduction to NLP - Assignment 1
Language-Modelling
Anishka Sachdeva (2018101112)
1st February, 2021
Steps to execute the code
python3 language_model.py <smoothing_type> <path_corpus>
smoothing_type = k for Kneser-Ney smoothing
smoothing_type = w for Witten-Bell smoothing
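The two command-line arguments could be validated along the following lines. This is a minimal sketch and not the actual contents of language_model.py: the function name `parse_args`, the `SMOOTHING_NAMES` table, and the example path `corpus.txt` are all hypothetical.

```python
# Hypothetical mapping from the one-letter flag to the smoothing method name.
SMOOTHING_NAMES = {"k": "Kneser-Ney", "w": "Witten-Bell"}


def parse_args(argv):
    """Return (smoothing_type, path_corpus) from an argv-style list.

    Raises ValueError when the argument count or the smoothing flag is wrong.
    """
    if len(argv) != 3:
        raise ValueError(
            "usage: python3 language_model.py <smoothing_type> <path_corpus>"
        )
    smoothing_type, path_corpus = argv[1], argv[2]
    if smoothing_type not in SMOOTHING_NAMES:
        raise ValueError("smoothing_type must be 'k' or 'w'")
    return smoothing_type, path_corpus


# Example with a made-up corpus path (assumption, not from the assignment).
smoothing_type, path_corpus = parse_args(["language_model.py", "k", "corpus.txt"])
print(SMOOTHING_NAMES[smoothing_type], path_corpus)
```

In a real script the list passed in would be `sys.argv`.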
Files generated
Perplexity is calculated as follows:
- The corpus is divided into a test set and a training set using random.shuffle.
- The language model is then built on the training set.
- Each sentence in the test set is then evaluated.
- The probability of each sentence is calculated by the two smoothing methods.
- Each probability is written to the file along with the tokenized sentence.
- Finally, the average perplexity score is written to the file.
- Perplexity is calculated using the following formula:
- float(1)/float(math.exp(float(probability)/float(n)))
- Here probability = probability of each sentence in the test set.
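- Probability of each sentence = exp(math.log(p1) + math.log(p2) + math.log(p3) + ... + math.log(pN))
- Here n = length(sentence) - 3
- Note: 1/exp(x/n) equals the usual perplexity P^(-1/n) only when x is the log-probability math.log(p1) + ... + math.log(pN), that is, the sum before the outer exp is applied.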
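The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in language_model.py: the helper names `split_corpus`, `sentence_log_probability` and `perplexity` are hypothetical, the per-token probabilities are made-up numbers rather than output of either smoothing method, and the value passed as `probability` is assumed to be the summed log-probability. The caller supplies `n`; in the assignment it is length(sentence) - 3.

```python
import math
import random


def split_corpus(sentences, test_size, seed=0):
    """Shuffle a copy of the sentences and return (train, test).

    Uses a seeded random.Random so the split is repeatable; its shuffle
    method behaves like the module-level random.shuffle.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


def sentence_log_probability(token_probs):
    """Return log(p1) + log(p2) + ... + log(pN) for one sentence."""
    return sum(math.log(p) for p in token_probs)


def perplexity(log_probability, n):
    """Same expression as in the list above, applied to a log-probability."""
    return float(1) / float(math.exp(float(log_probability) / float(n)))


# Made-up corpus and probabilities, for illustration only.
corpus = ["s1", "s2", "s3", "s4", "s5"]
train, test = split_corpus(corpus, test_size=2)

# Pretend the model assigned these probabilities to two test sentences.
fake_token_probs = [[0.25, 0.25, 0.25, 0.25], [0.5, 0.1, 0.2]]
scores = [
    perplexity(sentence_log_probability(probs), len(probs))
    for probs in fake_token_probs
]
average_perplexity = sum(scores) / len(scores)
print(scores, average_perplexity)
```

Working with the sum of logs rather than the product of probabilities avoids floating-point underflow on long sentences, which is presumably why the sentence probability above is built from math.log terms.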