pySmaz
pySmaz is a Python port of the SMAZ short string text compression algorithm by Salvatore Sanfilippo.
SMAZ works best on small ASCII English strings up to about 100 bytes. Beyond that length it is outperformed by entropy coders (bz2, zlib). This makes SMAZ ideal for applications like English names and most URLs.
SMAZ throughput on small strings is roughly on par with bz2 and zlib, because the entropy coders pay a high setup cost on every call.
SMAZ by Salvatore Sanfilippo, Python port by Max Smith.
Installation
```
pip install pySmaz
```
Usage
```python
from smaz import compress, decompress

print(compress("Hello, world!"))
print(compress("https://github.com/antirez/smaz"))
print(decompress(compress("Salvatore")))
```
Versions
- 1.0.0 - original release (dict based tree structure)
- 1.0.1 - throughput improvements approx 10%
Background
For additional background, check out the original C implementation by Salvatore Sanfilippo at https://github.com/antirez/smaz.
Compatibility and Performance
pySmaz is Python 2.x, PyPy (and mostly Python 3.x) compatible. I've tested with the latest versions; if you do find an issue with an earlier version, please let me know and I'll address it.
The original C implementation used a table approach, along with some hashing to select the right entry. My first attempt used the original C-style approach and barely hit 170 KB/s on CPython on an i7.
The trie-based approach gets closer to one megabyte per second on the same setup. The difference in performance is largely due to the inner loop not always checking seven candidate characters per input character - i.e. O(7n) vs O(n). I've tried to balance readability with performance; hopefully it's clear what's going on.
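For illustration, here is a minimal sketch of the idea (the names `TRIE`, `trie_insert` and `longest_match` are hypothetical, and the real codebook and tree in smaz.py differ): each node maps a character to a child, and a node may carry the code for the string spelled so far, so one left-to-right walk finds the longest codebook entry at the current position.

```python
# Minimal sketch of trie-based longest-prefix matching (illustrative only;
# the real pySmaz trie and codebook are different).
TRIE = {}

def trie_insert(word, code):
    """Insert a codebook entry; each node is {char: child} plus a 'code' key."""
    node = TRIE
    for ch in word:
        node = node.setdefault(ch, {})
    node["code"] = code

def longest_match(text, pos):
    """Walk the trie once from text[pos]; return (length, code) of the
    longest codebook entry starting there, or (0, None) if none matches."""
    node, best = TRIE, (0, None)
    for i in range(pos, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if "code" in node:
            best = (i - pos + 1, node["code"])
    return best
```

With the codebook loaded into such a trie, the compressor makes a single downward walk per input position instead of testing several candidate substrings.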
Decompression performance is limited by the single-byte approach, reaching 4.0 megabytes per second. To squeeze out more performance, it might be worth considering a multi-byte table for decoding.
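As a sketch of what the single-byte loop looks like, assuming the stream format of the original C implementation (bytes 0-253 index the codebook, 254 escapes one verbatim byte, 255 escapes a verbatim run whose length is stored minus one); `DECODE` here is a stand-in for the real codebook table:

```python
DECODE = [""] * 254  # hypothetical codebook: code byte -> decoded string

def decompress_sketch(data):
    out, i = [], 0                    # pieces are joined once at the end
    while i < len(data):
        b = ord(data[i])
        if b == 254:                  # escape: one verbatim byte follows
            out.append(data[i + 1])
            i += 2
        elif b == 255:                # escape: verbatim run of data[i+1] + 1 bytes
            n = ord(data[i + 1]) + 1
            out.append(data[i + 2:i + 2 + n])
            i += 2 + n
        else:                         # ordinary code: codebook lookup
            out.append(DECODE[b])
            i += 1
    return "".join(out)
```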
After eliminating the O(n^2) string appends, PyPy performance is very impressive.
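The fix was the usual one: collect the pieces in a list and join once at the end, rather than appending to a string in a loop (a paraphrase of the pattern, not the exact pySmaz code). CPython can sometimes optimize in-place string concatenation, but PyPy does not, which is why eliminating the appends mattered most there.

```python
# Quadratic: each += may copy the whole buffer built so far.
def build_slow(pieces):
    out = ""
    for p in pieces:
        out += p
    return out

# Linear: append to a list, copy once at the end.
def build_fast(pieces):
    out = []
    for p in pieces:
        out.append(p)
    return "".join(out)
```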
How should you use it?
To recap: SMAZ works best on small ASCII English strings of up to about 100 bytes, and its throughput on small strings is competitive with bz2 and zlib because the entropy coders pay a high setup cost per call. The tables below put numbers on both claims.
Strings 1-8 Bytes
| | SMAZ (CPython) | SMAZ (PyPy) | bz2 | zlib |
|---|---|---|---|---|
| Comp throughput | 0.5 MB/s | 4.0 MB/s | 0.2 MB/s | 0.43 MB/s |
| Decomp throughput | 1.4 MB/s | 14.0 MB/s | 0.5 MB/s | 2.6 MB/s |
On larger strings the relative advantages drop away, and the entropy coders are a better bet. Interestingly, SMAZ isn't too far off bz2... but zlib crushes it.
Strings 5 Megabytes
| | SMAZ (CPython) | SMAZ (PyPy) | bz2 | zlib |
|---|---|---|---|---|
| Comp throughput | 0.9 MB/s | 2.0 MB/s | 2.0 MB/s | 74.0 MB/s |
| Decomp throughput | 4.0 MB/s | 19.5 MB/s | 30.3 MB/s | 454.6 MB/s |
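These figures are machine-dependent. A rough harness along these lines (an illustrative sketch, not the project's benchmark script; the sample string and call count are arbitrary) reproduces the shape of the small-string comparison:

```python
import bz2
import timeit
import zlib

from smaz import compress

SHORT = "See you at 8pm?"   # small-string regime, SMAZ's sweet spot
CALLS = 10000

# Python 2 strs throughout, matching the rest of this README.
for name, fn in [
    ("smaz", lambda: compress(SHORT)),
    ("zlib", lambda: zlib.compress(SHORT)),
    ("bz2",  lambda: bz2.compress(SHORT)),
]:
    secs = timeit.timeit(fn, number=CALLS)
    # Throughput = bytes pushed through per second, scaled to MB/s.
    print("%-4s %6.2f MB/s" % (name, len(SHORT) * CALLS / secs / 1e6))
```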
Compression varies, but a reduction to 60% of the original size is pretty typical. Here are some results from common text-compression corpora; the text messages and the URLs encoded individually are pretty strong. Everything else is dire.
Compression results (sizes in bytes)
| Corpus | Original | SMAZ* | bz2 | zlib | SMAZc** | SMAZcp*** |
|---|---|---|---|---|---|---|
| NUS SMS Messages | 2666533 | 1851173 | 4106666 | 2667754 | 1876762 | 1864025 |
| alice29.txt | 148481 | 91958 | 43102 | 53408 | 92405 | |
| asyoulik.txt | 125179 | 83762 | 39569 | 48778 | 84707 | |
| cp.html | 24603 | 19210 | 7624 | 7940 | 19413 | |
| fields.c | 11150 | 9511 | 3039 | 3115 | 10281 | |
| grammar.lsp | 3721 | 3284 | 1283 | 1222 | 3547 | |
| lcet10.txt | 419235 | 252085 | 107648 | 142604 | 254131 | |
| plrabn12.txt | 471162 | 283407 | 145545 | 193162 | 283515 | |
| ACT corpus (concat) | 4802130 | 3349766 | 1096139 | 1556366 | 3450138 | |
| Leeds URL corpus | 4629439 | 3454264 | 7246436 | 5011830 | 3528446 | 3527606 |
* SMAZ with back-tracking
** SMAZ classic (original algorithm)
*** SMAZ classic with pathological case detection
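To see where your own data falls relative to the typical reduction to ~60% of original size, the per-string ratio is easy to measure (a quick sketch using the public compress function; the expected values in the comments are rough guides from the discussion above, not guarantees):

```python
from smaz import compress

def ratio(s):
    """Compressed size as a fraction of the original (lower is better)."""
    return len(compress(s)) / float(len(s))

print(ratio("Hello, world!"))    # English text: expect roughly 0.6 or better
print(ratio("0123456789" * 2))   # digits/symbols: expect little gain, if any
```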
If you have a use case that involves keeping an enormous number of small (separate) strings, and it isn't going to be limited by pySmaz's throughput, then congratulations: pySmaz is a good fit.
The unit tests explore pySmaz's performance against a series of commonly compressible strings. You'll notice it does very well against bz2 and zlib on English text, URLs and paths. In the Moby Dick sample SMAZ is best out to 54 characters (see the unit tests) and is often number one on larger samples out to hundreds of bytes. Taking the first paragraph of Moby Dick as an example, SMAZ leads until 914 bytes of text have passed!

On non-English strings (numbers, symbols, nonsense) it still does better on everything under 10 bytes (see the unit tests), and, ignoring big wins for zlib such as repeating sub-strings, it is dominant out to 20 bytes. This is mostly thanks to the pathological-case detection and backtracking in the compress routine.

Backtracking buys modest improvements on larger strings (around 1%) and deals with pathological sub-strings; again, you are better off using zlib for strings longer than 100 bytes in most cases.
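In practice that guidance reduces to a tiny wrapper (a hypothetical helper, not part of the pySmaz API; the 100-byte threshold is just the rule of thumb above):

```python
import zlib

from smaz import compress as smaz_compress
from smaz import decompress as smaz_decompress

def pack(s, threshold=100):
    """Route by length: SMAZ under the threshold, zlib at or above it.
    One tag byte tells the reader which decoder to use. (Python 2 strs.)"""
    if len(s) < threshold:
        return "s" + smaz_compress(s)
    return "z" + zlib.compress(s)

def unpack(data):
    tag, payload = data[0], data[1:]
    return smaz_decompress(payload) if tag == "s" else zlib.decompress(payload)
```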
A note on Python 3.x compatibility: the split between bytes and strings is problematic for this codebase (in particular the unit tests!). I'll try to get it fixed up over the next few months.
Licenses
BSD license, per the original C implementation at https://github.com/antirez/smaz, except for the text samples, which are in the public domain and come from:

- The ACT Corpus, http://compression.ca/act/act-files.html, Jeff Gilchrist
- The Canterbury Corpus, http://corpus.canterbury.ac.nz/, maintained by Dr. Tim Bell, Matt Powell, Joffre Horlor and Ross Arnold
- The NUS SMS Corpus, http://wing.comp.nus.edu.sg:8080/SMSCorpus/overview.jsp, Tao Chen and Min-Yen Kan (2012). Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus. Language Resources and Evaluation, Aug 2012. [doi:10.1007/s10579-012-9197-9]
- The Leeds Collection of Internet Corpora, http://corpus.leeds.ac.uk/internet.html, Serge Sharoff