
Compression: Huffman encoding

Intro

In this repo I explore a basic approach to compression, Huffman coding, and test it on enwik5, a smaller version of the enwik9 dataset from the Hutter Prize competition.

Disclaimer: This is by no means an attempt to create a decent compressor, just an exploration.

Usage

For encoding and decoding enwik5:

python main.py

Outputs:

Orig text: 
 <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSch ... utism and as such resist it, dubbed ''anti-cure'', and the many people caught in the middle of the t
Decoded text 
 <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSch  ...  utism and as such resist it, dubbed ''anti-cure'', and the many people caught in the middle of the t
Is text equal: True
Size of orig text:  100.148 KB
Size of encoded text:  61.032125 KB (488257 bits)
Approx size of Huffman dict: 3.712 KB
Approx size of node dict: 8.775 KB
Compression ratio (no encoding dict):  1.640906325971773
Optimal entropy given the split 4.862219963642222
Encoding entropy 4.882569999999999
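The compression ratio reported above is simply the original size divided by the encoded size, ignoring the size of the encoding dictionaries: 100.148 / 61.032 ≈ 1.64.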

Run tests:

pytest test.py

Huffman coding and tree

https://en.wikipedia.org/wiki/Huffman_coding
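
The core idea: repeatedly merge the two least frequent symbols (or subtrees) into a single node until one tree remains; reading off the merge directions (0 for left, 1 for right) yields a prefix-free code in which frequent characters get short codewords. A minimal sketch of the construction (not the implementation in main.py, which may differ):

```python
import heapq
from collections import Counter

def build_huffman_codes(text: str) -> dict:
    """Build a {char: bitstring} Huffman code table from character frequencies."""
    # Heap entries: (frequency, tie_breaker, partial code table for the subtree).
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # least frequent subtree
        f2, _, right = heapq.heappop(heap)  # second least frequent
        # Prepend one bit to every codeword in each subtree: 0 = left, 1 = right.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = build_huffman_codes("abracadabra")
# e.g. {'a': '0', 'c': '100', 'd': '101', 'b': '110', 'r': '111'} (exact codes may vary)
encoded = "".join(codes[ch] for ch in "abracadabra")
```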

Shannon entropy

Shannon entropy is the theoretical minimum average code length that an encoding can achieve for a given tokenization. It is defined as $$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2{P(x_i)}$$ The entropy of the encoding is its expected codeword length, $$L(C) = \sum_{i=1}^{n} l_i \, P(x_i) = \sum_{i=1}^{n} l_i \, w_i$$ where $l_i$ is the length of the codeword ($l=3$ for $c_i=101$) and $w_i$ is the weight, i.e. the probability, of the character. It always holds that $L(C) \geq H(X)$.

https://en.wikipedia.org/wiki/Entropy_(information_theory)
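
Both quantities can be computed directly from the character frequencies. A small sketch (assuming a `codes` dict mapping each character to its bit string, as produced by the Huffman construction above; not necessarily the repo's exact implementation):

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """H(X) = -sum_i P(x_i) * log2(P(x_i)), in bits per character."""
    counts, n = Counter(text), len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def average_code_length(text: str, codes: dict) -> float:
    """L(C) = sum_i l_i * w_i: expected codeword length in bits per character."""
    counts, n = Counter(text), len(text)
    return sum(len(codes[ch]) * (c / n) for ch, c in counts.items())
```

In the sample run above, H(X) ≈ 4.862 and L(C) ≈ 4.883 bits per character, consistent with the bound L(C) ≥ H(X).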

Plan

TODO:

  • Implement Huffman encoding
  • Implement Huffman decoding
  • Implement tests (decoded text equals the original, and the Huffman tree is built correctly)
  • Check entropies are similar
  • Optimize tokenization
  • Use actual bits for the encoded text (currently they are chars); see the sketch below
  • Investigate why decoding takes so long
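
For the bit-packing item, one possible approach is to pad the bit string to a byte boundary and store the pad length in a leading byte. This is a sketch under the assumption that the encoded text is a Python string of '0'/'1' characters, as the TODO suggests; pack_bits/unpack_bits are hypothetical helpers, not existing repo code:

```python
def pack_bits(bitstring: str) -> bytes:
    """Pack a string of '0'/'1' chars into real bytes.

    A leading byte records the pad length so decoding knows
    how many trailing bits to discard. A sketch, not the repo's code.
    """
    pad = (8 - len(bitstring) % 8) % 8
    bitstring += "0" * pad
    packed = bytes(int(bitstring[i:i + 8], 2) for i in range(0, len(bitstring), 8))
    return bytes([pad]) + packed

def unpack_bits(data: bytes) -> str:
    """Inverse of pack_bits: recover the original '0'/'1' string."""
    pad, packed = data[0], data[1:]
    bits = "".join(f"{byte:08b}" for byte in packed)
    return bits[:len(bits) - pad] if pad else bits

assert unpack_bits(pack_bits("1011001")) == "1011001"
```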