Implementation of an N-Gram language model from scratch.
The N-Gram language model is a statistical language model widely used in natural language processing and computational linguistics. It predicts the probability of a word based on the previous N-1 words in a text. The "N" in N-Gram is the number of tokens in each sequence, i.e. the context plus the predicted word; for example, a 3-Gram model predicts the next word based on the previous two words.
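For illustration, here is a minimal sketch of how 3-grams can be extracted from a tokenized sentence (`extract_ngrams` is a hypothetical helper for this README, not part of the repository):

```python
# Hypothetical helper for illustration: slide a window of size n
# over the token list and collect every n-gram.
def extract_ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the weather is nice today".split()
print(extract_ngrams(tokens, n=3))
# [('the', 'weather', 'is'), ('weather', 'is', 'nice'), ('is', 'nice', 'today')]
```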
The N-Gram model relies on the Markov assumption: the probability of a word depends only on a fixed number of preceding words, not on the entire history. This assumption keeps probability estimation tractable, since each conditional probability can be estimated from simple counts, P(word | context) ≈ count(context, word) / count(context).
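A minimal sketch of that count-based (maximum-likelihood) estimate for a trigram model, assuming plain whitespace tokenization (illustrative only, not the repository's implementation):

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    # Map each 2-word context to a counter over the words that follow it.
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def prob(counts, context, word):
    # Maximum-likelihood estimate: count(context, word) / count(context).
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

tokens = "to be or not to be that is the question".split()
counts = train_trigram(tokens)
print(prob(counts, ("to", "be"), "or"))  # 0.5: "to be" is followed by "or" and "that"
```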
- Clone the repository:
git clone https://github.com/rumbleFTW/generative-language-models.git
- Install the dependencies:
pip install -r requirements.txt
- Navigate to the model directory:
cd <model_name>
- Train the model:
python train.py --data <path_to_data_file> --epochs <num_epochs> [--char]
- Generate text (see the sketch after this list):
python generate.py --generate --seed_text "<seed_text>" --output_length <output_length> --l <level>
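Conceptually, generation repeatedly samples the next token from the conditional distribution learned during training. A minimal sketch using the trigram counts from the training sketch above (`generate` here is an assumed behaviour for illustration, not the repository's code):

```python
import random

def generate(counts, seed_tokens, output_length):
    out = list(seed_tokens)
    for _ in range(output_length):
        context = tuple(out[-2:])      # last N-1 = 2 tokens
        followers = counts.get(context)
        if not followers:              # unseen context: stop early
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        # Sample the next token in proportion to its observed count.
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)
```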
To train the model on a data file and generate text:
cd n-gram
# Optional --l char to train with character-level encoding
python main.py --data ./data/gita_chap1.txt --epochs 10 --l char
# Optional --l char to generate at the character level
python main.py --generate --seed_text "The weather" --output_length 100 --l char
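The character-level option changes only the tokenization step: the model sees individual characters instead of whitespace-separated words. A hedged sketch of the difference (`tokenize` is a hypothetical helper, not the repository's code):

```python
def tokenize(text, level="word"):
    # Word-level: split on whitespace; char-level: every character is a token.
    return list(text) if level == "char" else text.split()

print(tokenize("The weather", level="word"))  # ['The', 'weather']
print(tokenize("The weather", level="char"))  # ['T', 'h', 'e', ' ', 'w', ...]
```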