An implementation of LLMzip using GPT-2. LLMzip is an algorithm that compresses text far beyond conventional compressors by using an LLM's next-token predictions to encode the data.
Here is the official implementation, which uses LLaMA. You might also want to look at NNCP, a prior neural-network-based compressor.
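The core idea can be sketched in a few lines: each token is replaced by its rank in the model's predicted next-token ordering, and the resulting stream of mostly small ranks is handed to a conventional compressor. The sketch below uses a toy stand-in predictor rather than GPT-2 (the `ranked_predictions` helper is hypothetical, not from this repo); a real implementation would sort GPT-2's next-token logits instead.

```python
import zlib

def ranked_predictions(context, vocab):
    # Toy stand-in for an LLM: rank tokens by recency in the context.
    # A real implementation would rank by GPT-2's next-token probabilities.
    seen = list(dict.fromkeys(reversed(context)))
    rest = [t for t in vocab if t not in seen]
    return seen + rest

def encode(tokens, vocab):
    """Replace each token by its rank under the model, then compress the ranks."""
    ranks = []
    for i, tok in enumerate(tokens):
        order = ranked_predictions(tokens[:i], vocab)
        ranks.append(order.index(tok))
    # A good model makes most ranks 0 or near 0, so zlib shrinks them well.
    return zlib.compress(bytes(ranks))

def decode(blob, vocab):
    """Invert encode(): rebuild tokens one at a time from their ranks."""
    ranks = list(zlib.decompress(blob))
    tokens = []
    for r in ranks:
        # The decoder sees the same context, so it reproduces the same ranking.
        order = ranked_predictions(tokens, vocab)
        tokens.append(order[r])
    return tokens
```

Decoding works only because the decoder can rerun the exact same model on the tokens recovered so far, which is why the same model weights are needed to unzip.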
pip install transformers
Use these instructions to install pytorch with GPU support.
To zip:
python gpt_zip.py -z textfile.txt
To unzip:
python gpt_zip.py -u zipfile.gpz
@misc{ch2023llmzip,
  title={LLMZip: Lossless Text Compression using Large Language Models},
  author={Chandra Shekhara Kaushik Valmeekam and Krishna Narayanan and Dileep Kalathil and Jean-Francois Chamberland and Srinivas Shakkottai},
  year={2023},
  eprint={2306.04050},
  archivePrefix={arXiv},
  primaryClass={cs.IT}
}
This program achieves 1.75 bits/character on the book referenced in the paper, preprocessed as in the paper to contain only lowercase letters and spaces. That is 62% of the size of the same text compressed with zlib alone.
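For reference, bits/character is just compressed bits divided by original characters, and the 62% figure implies a zlib baseline near 2.82 bits/character (a sanity-check sketch with made-up sizes, not the paper's data):

```python
def bits_per_char(compressed_bytes: int, num_chars: int) -> float:
    # Each compressed byte is 8 bits, spread over the original characters.
    return 8 * compressed_bytes / num_chars

# Example with hypothetical sizes: 175 compressed bytes for 800 characters.
example_bpc = bits_per_char(175, 800)  # 1.75 bits/char

# Implied baseline from the figures above: 1.75 bits/char at 62% of
# zlib's size puts zlib alone near 1.75 / 0.62 ≈ 2.82 bits/char.
zlib_bpc = 1.75 / 0.62
```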