Bored during Covid, so I wrote a Huffman encoding routine for fun (yes, I'm an incorrigible nerd). Inspired by Tom Scott's video.
Wikipedia explanation here.
Essentially, you scan the text and count the number of occurrences of each character. Then, you create a tree where each leaf node is one of the characters. The tree is constructed in such a way that the most common characters are nearest the root node and the least common are furthest away.
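The counting step is the easy part. A minimal sketch in Python (the actual project is a .NET console app, so this is just an illustration):

```python
from collections import Counter

def count_frequencies(text: str) -> Counter:
    """Count how many times each character appears in the input."""
    return Counter(text)

freqs = count_frequencies("this is an example of a huffman tree")
# The most common characters will end up nearest the root,
# and therefore get the shortest codes.
print(freqs.most_common(3))
```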
You can then encode any character by using `0` and `1` to represent left or right when navigating the tree. In the simple example above, `w` is reached by going right, right, left, right, or `1101`. So the character is encoded in only 4 bits rather than the usual 8 (or more).
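To make the left/right walk concrete, here's a toy Python sketch. The four-leaf tree below is hypothetical (not the one from the example image); a common character like `e` sits near the root, so its path is short:

```python
# A node is either a leaf (a character) or a (left, right) pair.
# Hypothetical toy tree: 'e' is common, so it sits right under the root.
tree = ('e', (('w', 't'), 'a'))

def code_for(node, target, path=""):
    """Walk the tree; append '0' for left and '1' for right."""
    if node == target:
        return path
    if isinstance(node, tuple):
        return code_for(node[0], target, path + "0") or \
               code_for(node[1], target, path + "1")
    return None  # wrong leaf; backtrack

print(code_for(tree, 'e'))  # "0"   — one bit for the common character
print(code_for(tree, 'w'))  # "100" — three bits for a rarer one
```

Because every character lives at a leaf, no code is a prefix of another, which is what lets the decoder walk the tree bit by bit without any separators.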
For me, building the Huffman tree was much easier to reason about when using a priority queue.
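The priority-queue construction can be sketched like this, with Python's `heapq` standing in for whatever queue the C# code uses (the helper names here are my own): repeatedly pop the two least frequent nodes and merge them under a new parent until only the root remains.

```python
import heapq
from collections import Counter

def build_tree(text):
    """Merge the two least-frequent nodes until one tree remains."""
    # Heap entries are (frequency, tiebreaker, node); the unique
    # tiebreaker stops heapq from ever comparing nodes directly.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        count += 1
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
    return heap[0][2]

def make_codes(node, path="", table=None):
    """Assign '0' for left and '1' for right at every branch."""
    if table is None:
        table = {}
    if isinstance(node, tuple):
        make_codes(node[0], path + "0", table)
        make_codes(node[1], path + "1", table)
    else:
        table[node] = path
    return table

codes = make_codes(build_tree("abracadabra"))
# Frequent characters get shorter codes than rare ones.
assert len(codes['a']) < len(codes['d'])
```

The priority queue is what keeps the invariant obvious: at every step the two cheapest subtrees get pushed one level deeper, so the rarest characters naturally end up furthest from the root.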
I've been slowly optimising the code.
See Benchmarks.md for details of how various commits improved things. This uses the console app.
See Benchmark-dotnet-Results.md for initial and current best results using BenchmarkDotNet.
Downloaded some public domain novels from Project Gutenberg. Here's how they compress.
| Novel | Original (bytes) | Compressed (bytes) | Ratio |
| --- | ---: | ---: | ---: |
| A Tale of Two Cities | 792,968 | 454,568 | 57.32% |
| Frankenstein; or, The Modern Prometheus | 448,844 | 253,295 | 56.43% |
| Great Expectations | 1,035,187 | 595,235 | 57.50% |
| Les Misérables | 3,325,127 | 1,921,391 | 57.78% |
| Pride and Prejudice | 790,334 | 432,705 | 54.75% |
| The War of the Worlds | 363,744 | 207,483 | 57.04% |
The compressed size hovers pretty consistently around 57% of the original.
H. G. Wells closed all of his open brackets.