
Adaptive Language Models in Python

⚠️ A Python reinterpretation of the PPM JS library. The original is at https://github.com/google-research/google-research/tree/master/jslm; see it for more code comments.

This directory contains a collection of simple adaptive language models that are cheap enough, in both memory and processing, to train in a browser on the fly.

Language Models

Prediction by Partial Matching (PPM)

Prediction by Partial Matching (PPM) character language model. See the bibliography below.
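To give a feel for the idea, here is a minimal sketch of a PPM-style character model. It is not the implementation in this directory: the class name `TinyPPM`, its methods, and the choice of PPM method C escape estimates without symbol exclusion are all assumptions made for illustration. Each context of length 0 to `max_order` keeps symbol counts, and the prediction blends longer contexts with shorter ones, falling back to a uniform distribution.

```python
from collections import Counter, defaultdict


class TinyPPM:
    """Hypothetical, simplified PPM-style model (method C escapes, no exclusion)."""

    def __init__(self, alphabet, max_order=2):
        self.alphabet = sorted(set(alphabet))
        self.max_order = max_order
        # counts[context][symbol] = times `symbol` followed `context` in training
        self.counts = defaultdict(Counter)

    def train(self, text):
        for i, ch in enumerate(text):
            # Update every context length that fits before position i.
            for k in range(self.max_order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][ch] += 1

    def get_probs(self, history):
        # Start from the uniform "order -1" distribution ...
        p = {ch: 1.0 / len(self.alphabet) for ch in self.alphabet}
        # ... then blend in contexts from shortest to longest.
        for k in range(min(self.max_order, len(history)) + 1):
            ctx = history[len(history) - k:]
            seen = self.counts.get(ctx)
            if not seen:
                continue  # unseen context: escape with probability 1
            n = sum(seen.values())  # total count in this context
            d = len(seen)           # distinct symbols seen in this context
            escape = d / (n + d)
            p = {ch: seen[ch] / (n + d) + escape * p[ch] for ch in self.alphabet}
        return p


text = "hello world hello everyone hello there hello world"
model = TinyPPM(alphabet=text, max_order=2)
model.train(text)
probs = model.get_probs("he")
top5 = sorted(probs, key=probs.get, reverse=True)[:5]
print(top5)
```

Blending from the shortest context upwards keeps the distribution normalised at every step, so the final probabilities sum to one over the alphabet.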


Histogram Language Model

Very simple context-less histogram character language model.
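As a rough illustration of what "context-less" means here, the sketch below keeps one count per character and ignores the preceding text entirely. The class `HistogramCharModel`, its method names, and the add-one smoothing are assumptions for this example and may differ from the code in this directory.

```python
from collections import Counter


class HistogramCharModel:
    """Hypothetical context-less character model with add-one smoothing."""

    def __init__(self, alphabet):
        self.alphabet = sorted(set(alphabet))
        self.counts = Counter()
        self.total = 0

    def add_symbol(self, ch):
        # Adapt on the fly: one count update per observed character.
        if ch not in self.alphabet:
            raise ValueError(f"symbol {ch!r} is not in the alphabet")
        self.counts[ch] += 1
        self.total += 1

    def get_probs(self):
        # Add-one smoothing keeps unseen symbols at a non-zero probability.
        denom = self.total + len(self.alphabet)
        return {ch: (self.counts[ch] + 1) / denom for ch in self.alphabet}


hist_model = HistogramCharModel("abc")
for ch in "aab":
    hist_model.add_symbol(ch)
hist_probs = hist_model.get_probs()
print(hist_probs)
```

Because the model has no context, `get_probs` takes no history argument: the same distribution is returned whatever was typed before.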


Pólya Tree (PT) Language Model

Context-less predictive distribution based on balanced binary search trees. Tentative implementation is here.


Please see a simple example usage of the model API in example.py.

The example takes no command-line arguments. To run it with Python, invoke:

> python example.py


  • Known issue: something is wrong with the PPM library. It continually predicts the same ids no matter the context, and the cause has not yet been identified.

Test Utility (and Demo of Character or Word Prediction)

A simple test driver, language_model_driver.py, can be used to check that the model behaves as expected under Python 3+. The driver takes three parameters: the maximum order for the language model, the training file, and the test file in text format. Currently only the PPM model is supported.

The driver also shows how to do next-letter and next-word prediction; use a max_length of around 30. Be warned: training is fast, but running test_model can take a long time for the word models. Look at the code: you will need a larger max_length for words.


> python language_model_driver.py 30 training_small.txt training_small_test.txt
Results: numSymbols = 54, ppl = 13.268624243648365, entropy = 3.7299468876181376 bits/char
Top 5 character predictions for 'he': ['l', ' ', 'e', 't', 'o']
Results: numSymbols = 54, ppl = 9.575973715690846, entropy = 3.2594191923509106 bits/char
Top 5 word predictions for 'Hello ': ['<OOV>', 'everyone', 'sequence', 'test', 'world']
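The ppl and entropy figures above appear to be related by ppl = 2^entropy. The sketch below shows how both could be computed from the probabilities a model assigns to each test symbol, assuming entropy is the average negative log2 probability. The function name `entropy_and_perplexity` is hypothetical and is not part of the driver.

```python
import math


def entropy_and_perplexity(symbol_probs):
    """Return (bits per symbol, perplexity) for a list of per-symbol probabilities."""
    total_log2 = sum(math.log2(p) for p in symbol_probs)
    entropy = -total_log2 / len(symbol_probs)
    return entropy, 2.0 ** entropy


# A model that always assigns probability 1/4 scores 2 bits and perplexity 4.
uniform_entropy, uniform_ppl = entropy_and_perplexity([0.25] * 4)
print(uniform_entropy, uniform_ppl)

# Cross-check against the character-level figures reported above.
reported_entropy = 3.7299468876181376
reported_ppl = 13.268624243648365
print(2.0 ** reported_entropy, reported_ppl)
```

A lower perplexity means the model was, on average, less surprised by the test text.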

Example train and test files to use. The first block is the training text and the second is the test text:


hello world hello everyone hello there hello world
this is a test this is a trial this is a sequence
welcome to the model test welcome to the world
Gorgeous Doris Day is lovely. One day i went to the beach. 
Today I was at the shops. What day is it today?


hello world this is a test sequence
welcome to the test