Lossless Text Compression with Transformer-based Language Model

This repo demonstrates the concept of lossless compression with a Transformer-based language model as both encoder and decoder.

Contributors: Shangmin Guo (@Shawn-Guo-CN), Ze Peng (@Raphaelhpze)

The modules are:

  1. compress.py: The script to compress a text file by the arithmetic encoding algorithm with a Transformer model
  2. data_loader.py: The DataLoader class for loading the text file
  3. decompress.py: The script to decompress a binary file by the arithmetic decoding algorithm with a Transformer model identical to the compression model
  4. model.py: The Transformer model class
  5. trainer.py: The Trainer class for updating the model parameters and predicting the next token with a Transformer model (see the sketch after this list)
  6. tokenizer.py: The Tokenizer class
  7. utils.py: Utility functions
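
The interplay of model.py and trainer.py is the core of the scheme: at each position the model first supplies the next-token distribution to the coder and is then updated on the observed token, so both sides of the pipeline see identical distributions. Below is a minimal sketch of that predict-then-update step, assuming a PyTorch model; the function and argument names are illustrative, not the repo's actual API.

import torch
import torch.nn.functional as F

def predict_then_update(model, optimizer, context, next_token):
    # Forward pass on the prefix yields p(. | previous tokens),
    # which the arithmetic coder consumes
    logits = model(context)
    probs = F.softmax(logits[-1], dim=-1).detach()
    # Online update on the observed token; encoder and decoder
    # must perform the identical update to stay synchronized
    loss = F.cross_entropy(logits[-1:], next_token.view(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return probs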

TODOs

Many features in the current version are for demonstration purposes only. The following are planned as future work:

  • Implement I/O streams for large files; the current version reads the whole file into memory

  • Update compression/decompression and the training of the LLM to work batch-wise; the current version assumes a batch size of 1

  • Support tracking the progress of compression/decompression and the corresponding negative log-likelihood of the data, which determines the compression ratio (see the sketch below)
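
On the last point, the link between negative log-likelihood and compressed size is direct: an arithmetic coder spends roughly -log2(p) bits on a token the model predicted with probability p, so the total NLL of the data in bits is the theoretical size of the compressed output. A toy illustration (the probability list is made up for the example):

import math

def estimated_compressed_bits(token_probs):
    # Arithmetic coding spends about -log2(p) bits per token,
    # so the model's total NLL in bits is the theoretical
    # compressed size of the whole file
    return sum(-math.log2(p) for p in token_probs)

print(estimated_compressed_bits([0.5, 0.5, 0.5, 0.5]))  # 4.0 bits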

Usage

Compress

python compress.py --input_file <input_file> --output_file <output_file> --config_file <config_file>

Arguments

  • input_file: The path to the input text file, e.g. data/demo.txt
  • output_file: The path to the output compressed file, e.g. data/demo_encode_out.txt
  • config_file: The path to the configuration file in the YAML format, e.g. config/global/demo.yaml

Compression pipeline

  1. Tokenize the input text
  2. Calculate the probability of each token given the previous tokens via a forward pass of the Transformer model
  3. Encode the token with that probability using the arithmetic coding algorithm
  4. Output the arithmetic code to a text file for readability (a sketch of steps 2-3 follows this list)
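
The following is a minimal sketch of steps 2-3, with model_probs(prefix) standing in for the Transformer forward pass and returning the next-token distribution as a dict; the repo's actual API in model.py and trainer.py may differ. A production coder uses fixed-precision integers with renormalization; plain floats are used here only to keep the interval idea visible.

def arithmetic_encode(tokens, model_probs):
    low, high = 0.0, 1.0
    for i, token in enumerate(tokens):
        probs = model_probs(tokens[:i])  # p(. | previous tokens)
        # Narrow [low, high) to the sub-interval assigned to this token
        cum = 0.0
        for t, p in probs.items():
            if t == token:
                span = high - low
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    # Any number inside [low, high) identifies the whole sequence
    return (low + high) / 2

uniform = lambda prefix: {'a': 0.5, 'b': 0.5}  # toy stand-in model
code = arithmetic_encode(list('abba'), uniform)  # -> 0.40625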

Decompress

python decompress.py --input_file <input_file> --output_file <output_file> --config_file <config_file>

Arguments

  • input_file: The path to the compressed input file, e.g. data/demo_encode_out.txt
  • output_file: The path to the decompressed output text file, e.g. data/demo_decode_out.txt
  • config_file: The path to the configuration file in the YAML format, e.g. config/global/demo.yaml

Decompression pipeline

  1. Read the arithmetic code from the input file
  2. Decode the arithmetic code into tokens, obtaining each token's probability from the Transformer model (identical to the compression model) and updating the model parameters in the same way
  3. Detokenize the tokens
  4. Output the detokenized text to the output file (a sketch of step 2 follows this list)
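
The following is a matching sketch of step 2, the mirror of the encoder above: the decoder re-derives the same intervals, which is why it must query (and update) a model in exactly the same state as the encoder's at every step. The same caveats apply: floats instead of fixed-precision integers, and a made-up model_probs interface. The token count is passed in for simplicity; a real implementation would use an end-of-text symbol instead.

def arithmetic_decode(code, n_tokens, model_probs):
    low, high = 0.0, 1.0
    tokens = []
    for _ in range(n_tokens):
        probs = model_probs(tokens)  # must match the encoder's model state
        span = high - low
        cum = 0.0
        for t, p in probs.items():
            # Pick the token whose sub-interval contains the code
            if low + span * cum <= code < low + span * (cum + p):
                tokens.append(t)
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return tokens

uniform = lambda prefix: {'a': 0.5, 'b': 0.5}  # same toy model as above
print(arithmetic_decode(0.40625, 4, uniform))  # ['a', 'b', 'b', 'a']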