Implementation of the Kneser-Ney and Witten-Bell smoothing methods.
A good read: *An Empirical Study of Smoothing Techniques for Language Modeling*
The tokenizer is written from scratch and handles the following cases:
- Word Tokenizer
- Punctuation
- URLs: Replaced by <URL>
- Hashtags: Replaced by <HASHTAG>
- Mentions: Replaced by <MENTION>
- ". . .. . ." Replaced by "." (similarly for other symbols)
Example: the tokenizer has been run on a Twitter corpus (a set of 2000 tweets) and the output is stored in the file 2019111031_tokenize.txt.
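For illustration, here is a minimal regex-based sketch of the replacements listed above. The patterns and function name are assumptions for this sketch, not the exact ones used in the submitted tokenizer:

```python
import re

def tokenize(text):
    """Minimal sketch of the replacements described above; patterns are illustrative."""
    text = re.sub(r'https?://\S+|www\.\S+', '<URL>', text)   # URLs -> <URL>
    text = re.sub(r'#\w+', '<HASHTAG>', text)                # hashtags -> <HASHTAG>
    text = re.sub(r'@\w+', '<MENTION>', text)                # mentions -> <MENTION>
    text = re.sub(r'(?:\.\s*){2,}', '. ', text)              # ". . .. . ." -> "."
    text = re.sub(r'([!?,;:])\1+', r'\1', text)              # collapse repeated symbols
    # split into word tokens, placeholders, and individual punctuation marks
    return re.findall(r'<URL>|<HASHTAG>|<MENTION>|\w+|[^\w\s]', text)

# Example usage on a tweet-like string:
print(tokenize("Check this http://t.co/xyz out @user #nlp . . .. . ."))
# -> ['Check', 'this', '<URL>', 'out', '<MENTION>', '<HASHTAG>', '.']
```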
Language models 1 and 3 are smoothed with Kneser-Ney smoothing.
- LM1:
  - Corpus: Europarl corpus
  - Avg. Train Perplexity: 4.284909580178163
  - Avg. Test Perplexity: 13229.186428230216
- LM3:
  - Corpus: Medical corpus
  - Avg. Train Perplexity: 3.4184223445091204
  - Avg. Test Perplexity: 20906.023541314356
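For reference, below is a minimal bigram Kneser-Ney sketch with a fixed absolute discount. The function name and the discount value d = 0.75 are assumptions (a common default), not necessarily what the notebook uses, and the notebook may work with higher-order n-grams:

```python
from collections import Counter

def kneser_ney_bigram_prob(word, prev, bigram_counts, d=0.75):
    """Sketch of P_KN(word | prev) for bigrams with absolute discount d.

    `bigram_counts` is a Counter over (prev, word) pairs; d = 0.75 is an
    assumed default, not necessarily the value used in the notebook.
    """
    count_prev = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    count_bigram = bigram_counts[(prev, word)]
    distinct_follow = len({w2 for (w1, w2) in bigram_counts if w1 == prev})

    # continuation probability: in how many distinct contexts does `word` appear
    distinct_precede = len({w1 for (w1, w2) in bigram_counts if w2 == word})
    p_continuation = distinct_precede / len(bigram_counts)

    if count_prev == 0:                       # unseen context: back off entirely
        return p_continuation
    lam = (d / count_prev) * distinct_follow  # interpolation weight
    return max(count_bigram - d, 0) / count_prev + lam * p_continuation

# Example usage on a toy corpus:
tokens = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(tokens, tokens[1:]))
print(kneser_ney_bigram_prob("cat", "the", bigrams))
```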
Language models 2 and 4 are smoothed with Witten-Bell smoothing.
- LM2:
  - Corpus: Europarl corpus
  - Avg. Train Perplexity: 2.754752175818304
  - Avg. Test Perplexity: 331.37547330408955
- LM4:
  - Corpus: Medical corpus
  - Avg. Train Perplexity: 2.2151893632899857
  - Avg. Test Perplexity: 665.634549591011
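Similarly, a minimal bigram Witten-Bell sketch, where the interpolation weight comes from the number of distinct word types observed after the context. Function names and the bigram restriction are illustrative, not the notebook's exact code:

```python
from collections import Counter

def witten_bell_bigram_prob(word, prev, bigram_counts, unigram_counts):
    """Sketch of P_WB(word | prev): interpolate the MLE bigram estimate with
    the unigram estimate, weighted by how many distinct types follow `prev`."""
    p_unigram = unigram_counts[word] / sum(unigram_counts.values())  # backoff estimate

    count_prev = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    types_after_prev = len({w2 for (w1, w2) in bigram_counts if w1 == prev})
    if count_prev == 0:                                   # unseen context: unigram only
        return p_unigram

    lam = count_prev / (count_prev + types_after_prev)    # weight on the MLE bigram term
    p_bigram_mle = bigram_counts[(prev, word)] / count_prev
    return lam * p_bigram_mle + (1 - lam) * p_unigram

# Example usage on a toy corpus:
tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(witten_bell_bigram_prob("cat", "the", bigrams, unigrams))
```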
- The Jupyter notebook loads data from my GDrive; the same data can also be found in the directory `data`.
- Other than that, the code is self-explanatory.
- At the end, the notebook also prompts the user for the type of smoothing and the path of the corpus, followed by an input sentence.
- The output is the probability of that sentence using the two smoothing mechanisms.
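As a rough sketch of how a sentence probability (and the average perplexities reported above) can be assembled from any conditional probability function such as the Kneser-Ney or Witten-Bell sketches earlier; the `<s>` padding and bigram history are assumptions of this sketch:

```python
import math

def sentence_probability(tokens, cond_prob):
    """Sketch: chain-rule sentence probability P(w1..wn) = prod_i P(w_i | w_{i-1}).

    `cond_prob(word, prev)` is any conditional probability function, e.g. one of
    the smoothing sketches above; "<s>" padding is an assumption of this sketch.
    """
    padded = ["<s>"] + tokens
    log_p = sum(math.log(cond_prob(w, prev)) for prev, w in zip(padded, padded[1:]))
    return math.exp(log_p)

def perplexity(tokens, cond_prob):
    """Sketch: perplexity = exp(-(1/N) * sum_i log P(w_i | history))."""
    padded = ["<s>"] + tokens
    log_p = sum(math.log(cond_prob(w, prev)) for prev, w in zip(padded, padded[1:]))
    return math.exp(-log_p / len(tokens))

# Example usage with a dummy uniform model over a 1000-word vocabulary:
print(sentence_probability("the cat sat".split(), lambda w, prev: 1 / 1000))  # -> ~1e-09
print(perplexity("the cat sat".split(), lambda w, prev: 1 / 1000))            # -> 1000.0
```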