This project is a simple implementation of an N-Gram Language Model using the Natural Language Toolkit (NLTK) in Python.
It predicts the next word of a given sentence using either a bigram or trigram model. The project includes the following features:
- Tokenization of text
- Counting word co-occurrence frequencies
- Transformation of counts to probabilities
- Prediction of next word
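In sketch form, the counting and normalization steps above might look like the following. This is illustrative only: plain `split()` stands in for NLTK tokenization, and `bigram_probabilities` is a hypothetical helper, not part of the project's API.

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    # Simplified tokenization: lowercase and split on whitespace
    # (the project uses NLTK; this is a stand-in).
    tokens = text.lower().split()
    # Count co-occurrences of adjacent word pairs.
    counts = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    # Transform counts into conditional probabilities P(w2 | w1).
    return {
        w1: {w2: c / sum(nexts.values()) for w2, c in nexts.items()}
        for w1, nexts in counts.items()
    }

probs = bigram_probabilities("this is a test this is fine")
print(probs["this"])  # {'is': 1.0} -- "this" is always followed by "is"
```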
- Clone the repository to your local machine.
- Install the required dependencies using the command pip install -r requirements.txt.
- To use the N-Gram Language Model, simply import the Ngram class from the ngram.py file and create an instance of it with a text string as a parameter.
- To predict the next word using the trigram model, call the predict_trigram() method with two words as parameters.
- To predict the next word using the bigram model, call the predict_bigram() method with one word as a parameter.
from ngram import Ngram
# create an instance of the Ngram class with a text string as a parameter
text = Ngram("This is a foobar.This is a bee bar. This is a 7 day trial. this is the option you have")
# predict the next word using the trigram model
text.predict_trigram('this', 'is') # output: probability of trigram: {'a': 0.6666666666666666, 'the': 0.3333333333333333}
# predict the next word using the bigram model
text.predict_bigram('this') # output: probability of next word: {'is': 0.6666666666666666, 'a': 0.3333333333333333}
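For intuition, the lookup behind a method like predict_trigram can be sketched as follows: collect every word that follows the given two-word context, then normalize the counts. `trigram_next_word_probs` is a hypothetical standalone helper; the project's actual implementation may differ.

```python
from collections import Counter

def trigram_next_word_probs(text, w1, w2):
    # Illustrative stand-in: count every (w1, w2, w3) triple in the text
    # and normalize the counts for the given two-word context.
    tokens = text.lower().split()
    context_counts = Counter(
        t3 for t1, t2, t3 in zip(tokens, tokens[1:], tokens[2:])
        if (t1, t2) == (w1, w2)
    )
    total = sum(context_counts.values())
    return {w: c / total for w, c in context_counts.items()} if total else {}

text = "this is a day this is the day this is a trial"
print(trigram_next_word_probs(text, "this", "is"))
# {'a': 0.6666666666666666, 'the': 0.3333333333333333}
```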
The N-Gram Language Model implemented in this project has the following known limitations:
- It is based on a small corpus, so its accuracy may be limited.
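In particular, any context that never appears in the corpus receives zero probability. A common mitigation, not implemented in this project, is add-one (Laplace) smoothing; `smoothed_bigram_prob` below is a hypothetical helper sketching the idea.

```python
from collections import Counter

def smoothed_bigram_prob(text, w1, w2, alpha=1.0):
    # Add-one (Laplace) smoothing: every vocabulary word gets a small
    # baseline count, so unseen pairs are not assigned zero probability.
    tokens = text.lower().split()
    vocab = set(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    w1_count = sum(c for (a, _), c in pair_counts.items() if a == w1)
    return (pair_counts[(w1, w2)] + alpha) / (w1_count + alpha * len(vocab))

text = "this is a test this is fine"
# ('this', 'test') never occurs, yet its smoothed probability is nonzero.
print(smoothed_bigram_prob(text, "this", "test"))  # 1/7 ~ 0.143
```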
Contributions to this project are welcome! If you find any bugs or want to add new features, feel free to submit a pull request.
This project is released under the MIT License. See LICENSE.txt for more information.