Language modeling is the task of predicting the next word or character in a document. Language models trained this way can be applied to a wide range of natural language tasks such as text generation, classification, and question answering. Common approaches include N-gram language models and neural language models. A model's language modeling capability is measured using cross-entropy and perplexity.
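The relationship between the two evaluation metrics is direct: perplexity is the exponentiation of cross-entropy. A minimal sketch (the function names and the toy probability list here are illustrative, not part of the assignment spec):

```python
import math

def cross_entropy(probs):
    """Average negative log-probability (base 2) that the model
    assigns to each token of a held-out sequence."""
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity(probs):
    """Perplexity = 2 ** cross-entropy (base must match the log base)."""
    return 2 ** cross_entropy(probs)

# A model that assigns uniform probability 1/4 to every token has
# cross-entropy of 2 bits and perplexity 4.
uniform = [0.25, 0.25, 0.25, 0.25]
```

Lower perplexity means the model is less "surprised" by the test data; a perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k options at each step.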
N-gram Language model:
Implement an N-gram language model with Laplace smoothing and sentence generation. Your implementation should include:
i) A smoothing function that applies Laplace smoothing to the n-gram frequency distribution.
ii) A model function that creates a probability distribution over the vocabulary of the training corpus. Note that for a unigram model, the probabilities are simply the relative frequencies of each token in the entire corpus; otherwise, they are Laplace-smoothed relative frequencies.
iii) A perplexity function that evaluates the language model on the test dataset.
iv) A generate-sentence function that generates a sentence using the language model.
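The four required functions can be sketched roughly as follows. This is a minimal reference sketch, not the required implementation; all function names, the add-k parameter, and the `<s>`/`</s>` boundary tokens are assumptions you may adapt:

```python
import math
import random
from collections import Counter

def smoothing(ngram_counts, context_counts, vocab_size, k=1):
    """i) Laplace (add-k) smoothed conditional probability:
    P(w | context) = (count(context, w) + k) / (count(context) + k*V)."""
    def prob(context, word):
        return (ngram_counts[context + (word,)] + k) / \
               (context_counts[context] + k * vocab_size)
    return prob

def model(tokens, n=2):
    """ii) Build count tables from a token list and return a smoothed
    conditional probability function plus the vocabulary. For n=1 the
    context is empty, so probabilities reduce to smoothed relative
    frequencies over the whole corpus."""
    vocab = set(tokens)
    ngram_counts = Counter(tuple(tokens[i:i+n])
                           for i in range(len(tokens) - n + 1))
    context_counts = Counter(tuple(tokens[i:i+n-1])
                             for i in range(len(tokens) - n + 1))
    return smoothing(ngram_counts, context_counts, len(vocab)), vocab

def perplexity(prob, tokens, n=2):
    """iii) exp of the average negative log-likelihood on test tokens."""
    log_sum, count = 0.0, 0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i-n+1:i])
        log_sum += -math.log(prob(context, tokens[i]))
        count += 1
    return math.exp(log_sum / count)

def generate_sentence(prob, vocab, n=2, max_len=15, seed=("<s>",)):
    """iv) Sample words from the smoothed distribution until an
    end-of-sentence token or the length limit is reached."""
    sent = list(seed)
    for _ in range(max_len):
        context = tuple(sent[-(n-1):]) if n > 1 else ()
        words = list(vocab)
        weights = [prob(context, w) for w in words]
        word = random.choices(words, weights=weights)[0]
        if word == "</s>":
            break
        sent.append(word)
    return " ".join(sent[len(seed):])
```

Because `Counter` returns 0 for unseen keys, smoothing handles unseen n-grams without special-casing them; the add-k mass `k*V` in the denominator keeps every conditional distribution summing to 1.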
Neural Language Model:
Implement a neural network-based language model and calculate the perplexity of each input sentence using the trained model. Split your training data 90%/10% to create a dev set, and tune the model on the dev set. Please use the PyTorch framework for the implementation, and use Word2Vec/GloVe embeddings to initialize the model.
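One possible skeleton for this part is sketched below. It assumes you have already built a `pretrained` tensor of Word2Vec/GloVe vectors aligned with your vocabulary (loading and aligning those vectors is left out); the class name, LSTM architecture, and hidden size are all illustrative choices, not requirements:

```python
import math
import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    def __init__(self, pretrained, hidden_size=128):
        super().__init__()
        vocab_size, embed_dim = pretrained.shape
        # Initialize the embedding layer from pretrained Word2Vec/GloVe
        # vectors; freeze=False lets them be fine-tuned during training.
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h)  # logits over the vocabulary at each step

def sentence_perplexity(model, ids):
    """Perplexity of one sentence (shape [1, seq_len] of token ids):
    exp of the mean cross-entropy of predicting ids[1:] from ids[:-1]."""
    model.eval()
    with torch.no_grad():
        x, y = ids[:, :-1], ids[:, 1:]
        logits = model(x)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    return math.exp(loss.item())

# 90/10 train/dev split (indices are a sketch; `data` is your list of
# encoded training sentences):
# split = int(0.9 * len(data))
# train_data, dev_data = data[:split], data[split:]
```

Training would use `nn.functional.cross_entropy` as the loss with a standard optimizer such as `torch.optim.Adam`, monitoring dev-set perplexity to select hyperparameters.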
Evaluation metric: Perplexity (for evaluating both models)
Dataset (English): https://drive.google.com/drive/folders/1lthDz3GKLxES130Ly6jdSyieOJZHe77R?usp=sharing
Submission Materials: Python file, a Google Drive link to the trained model, and a doc file with your results and observations.