- The data directory contains two CSV files: the original, unfiltered dataset and the dataset produced after all pre-processing.
- pre-processing.py contains the code that pre-processes the comments in parallel.
- Utils.py contains utility functions used by the N-gram language model.
- ngrams.py contains the implementation of the N-gram language model class and its methods.
- models.py is the main experiment file: it instantiates the model for different values of n and computes the perplexity and log(perplexity) for each.
- plotting.py is used to plot the perplexity values for inference and analysis.
- Smoothing_Comparison.txt stores the output of models.py: a comparison of the perplexities obtained with different smoothing techniques on the N-gram models.
- The repo also contains the final documentation for the assignment in PDF format.
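
For readers unfamiliar with the quantities mentioned above, the following is a minimal sketch of how an N-gram model with add-k smoothing can compute perplexity and log(perplexity). It is not the code in ngrams.py or models.py: the names `extract_ngrams` and `AddKNgramModel`, the `<s>`/`</s>` padding tokens, and the choice of add-k smoothing are all assumptions made for illustration, and out-of-vocabulary tokens are not handled specially.

```python
import math
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all n-grams (as tuples) of a sentence padded with <s> and </s>."""
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


class AddKNgramModel:
    """Hypothetical N-gram model with add-k (Laplace when k=1) smoothing."""

    def __init__(self, n, k=1.0):
        self.n = n
        self.k = k
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-token contexts
        self.vocab = {"</s>"}            # </s> can be predicted, <s> cannot

    def fit(self, sentences):
        for tokens in sentences:
            self.vocab.update(tokens)
            for gram in extract_ngrams(tokens, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
        return self

    def prob(self, gram):
        """Smoothed P(last token | context) = (c(gram) + k) / (c(context) + k*V)."""
        vocab_size = len(self.vocab)
        numerator = self.ngram_counts[gram] + self.k
        denominator = self.context_counts[gram[:-1]] + self.k * vocab_size
        return numerator / denominator

    def log_perplexity(self, sentences):
        """Average negative log-probability per predicted token (natural log)."""
        total_log_prob, total_tokens = 0.0, 0
        for tokens in sentences:
            for gram in extract_ngrams(tokens, self.n):
                total_log_prob += math.log(self.prob(gram))
                total_tokens += 1
        return -total_log_prob / total_tokens

    def perplexity(self, sentences):
        return math.exp(self.log_perplexity(sentences))


if __name__ == "__main__":
    train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    held_out = [["the", "cat", "sat"]]
    for n in (1, 2, 3):
        model = AddKNgramModel(n=n, k=1.0).fit(train)
        print(n, model.log_perplexity(held_out), model.perplexity(held_out))
```

Working with log(perplexity) rather than perplexity itself keeps the computation a sum of log-probabilities, which avoids numerical underflow on long texts; the perplexity is recovered by exponentiating at the end.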