# Compression

compression == intelligence


Compression = Coding + Modelling

Coding: How to store symbols.
Modelling: Which symbols are more probable.

We want to code the most probable symbols with the fewest bits, and the rarest symbols with the most.
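Shannon's source-coding bound makes this precise: a symbol with probability p should cost about -log2(p) bits. A quick sketch:

```python
from math import log2

# Ideal code length for a symbol of probability p is -log2(p) bits,
# so frequent symbols get short codes and rare symbols long ones.
for p in (0.5, 0.25, 0.01):
    print(f"p = {p:<4} -> {-log2(p):.2f} bits")
# p = 0.5  -> 1.00 bits
# p = 0.25 -> 2.00 bits
# p = 0.01 -> 6.64 bits
```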

Coding is solved: arithmetic coding emits within a couple of bits of the model's ideal output length. Modelling is provably unsolvable in general: the best possible model corresponds to Kolmogorov complexity, which is uncomputable.
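A toy illustration of why coding is the easy half. This float-precision arithmetic coder is a sketch, not the repo's implementation; it only works for short messages, since real coders use integer renormalisation to avoid exhausting precision:

```python
from math import log2

# Arithmetic coding: narrow [0, 1) to an interval whose width is the
# product of the symbol probabilities; any number inside that interval
# identifies the message and takes ~ -sum(log2 p) bits to write down.
PROBS = {'a': 0.5, 'b': 0.25, 'c': 0.25}

def ranges(probs):
    lo, out = 0.0, {}
    for sym, p in probs.items():
        out[sym] = (lo, lo + p)
        lo += p
    return out

def encode(msg, probs):
    lo, hi = 0.0, 1.0
    r = ranges(probs)
    for sym in msg:
        span = hi - lo
        lo, hi = lo + span * r[sym][0], lo + span * r[sym][1]
    return (lo + hi) / 2          # any point in [lo, hi) decodes to msg

def decode(x, n, probs):
    r = ranges(probs)
    out = []
    for _ in range(n):
        for sym, (s_lo, s_hi) in r.items():
            if s_lo <= x < s_hi:
                out.append(sym)
                x = (x - s_lo) / (s_hi - s_lo)
                break
    return ''.join(out)

msg = 'abcab'
x = encode(msg, PROBS)
assert decode(x, len(msg), PROBS) == msg
print(f"ideal size: {sum(-log2(PROBS[s]) for s in msg):.0f} bits")  # 8 bits
```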

On the first 1000 bytes of Wikipedia:

gzip :: 508 bytes
16-bit n-gram :: 522 bytes
16-bit n-gram + backoff :: 513 bytes
16-bit learned weighted n-gram :: 587 bytes
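For context, here is a minimal sketch of the kind of byte-level n-gram-with-backoff model the entries above refer to. The class name, the backoff rule, and the smoothing are assumptions (the repo's 16-bit counters are not modelled), and a proper Katz backoff would also discount the higher-order mass:

```python
from collections import defaultdict

class BackoffNgram:
    """Byte-level n-gram model that backs off to shorter contexts.
    Hypothetical sketch: omits the discounting a proper Katz backoff
    needs, so the distribution is only approximately normalised."""

    def __init__(self, order=3):
        self.order = order
        # counts[k][context][byte] = times `byte` followed `context`
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(order + 1)]

    def _ctx(self, history, k):
        return history[len(history) - k:] if k else b''

    def update(self, history, byte):
        for k in range(self.order + 1):
            self.counts[k][self._ctx(history, k)][byte] += 1

    def prob(self, history, byte):
        # Longest context that has actually seen this byte wins; the
        # order-0 fallback is add-one smoothed so p is never zero.
        for k in range(self.order, 0, -1):
            ctx = self.counts[k][self._ctx(history, k)]
            if ctx[byte]:
                return ctx[byte] / sum(ctx.values())
        ctx = self.counts[0][b'']
        return (ctx[byte] + 1) / (sum(ctx.values()) + 256)
```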

You can test your own models too! See the Docs.

TODO:

- Code with more bits to improve precision
- Test compression ratio with more data

Results:

Averaged over 3 random 1000-byte samples from enwik9; sizes in bytes.

| Model           | Compressed Size | Theoretical Compression |
|-----------------|-----------------|-------------------------|
| zip             | 532.33          | N/A                     |
| Ngram           | 722.33          | 717.48                  |
| Backoff         | 685.00          | 680.71                  |
| LearnedWeighted | 718.33          | 712.37                  |
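"Theoretical Compression" is presumably the model's cross-entropy on the data, in bytes: the size a perfect coder would reach, with the small gap to "Compressed Size" being coder overhead. A sketch of that measurement, assuming a model with the prob/update interface used above (the function name is hypothetical):

```python
from math import log2

def theoretical_bytes(model, data: bytes) -> float:
    """Cross-entropy of `model` on `data`, in bytes: the output size
    an ideal coder would achieve given the model's predictions."""
    bits = 0.0
    for i, byte in enumerate(data):
        bits += -log2(model.prob(data[:i], byte))  # adaptive: predict...
        model.update(data[:i], byte)               # ...then learn
    return bits / 8

# e.g. theoretical_bytes(BackoffNgram(order=3), sample) on a 1000-byte
# sample corresponds to the table's Theoretical Compression column.
```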