Bored during Covid, so I wrote a Huffman encoding routine for fun (yes, I'm an incorrigible nerd). Inspired by Tom Scott's video.
Wikipedia explanation here.
Essentially, you scan the text and count the number of occurrences of each character. Then, you create a tree where each leaf node is one of the characters. The tree is constructed in such a way that the most common characters are nearest the root node and the least common are furthest away.
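The counting step is the easy part. A minimal sketch in Python (the actual project is a .NET console app, so this is just an illustration):

```python
from collections import Counter

def count_frequencies(text: str) -> Counter:
    """Count how many times each character appears in the input."""
    return Counter(text)

freqs = count_frequencies("this is an example of a huffman tree")
# The most common characters will end up nearest the root,
# and therefore get the shortest codes.
print(freqs.most_common(3))
```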
You can then encode any character by using `0` and `1` to represent left or right when navigating the tree. In the simple example above, `w` is reached by going right, right, left, right, or `1101`. So the character is encoded in only 4 bits rather than the usual 8 (or more).
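To make the left/right walk concrete, here's a toy Python sketch. The four-leaf tree below is hypothetical (not the one from the example image); a common character like `e` sits near the root, so its path is short:

```python
# A node is either a leaf (a character) or a (left, right) pair.
# Hypothetical toy tree: 'e' is common, so it sits right under the root.
tree = ('e', (('w', 't'), 'a'))

def code_for(node, target, path=""):
    """Walk the tree; append '0' for left and '1' for right."""
    if node == target:
        return path
    if isinstance(node, tuple):
        return code_for(node[0], target, path + "0") or \
               code_for(node[1], target, path + "1")
    return None  # wrong leaf; backtrack

print(code_for(tree, 'e'))  # "0"   — one bit for the common character
print(code_for(tree, 'w'))  # "100" — three bits for a rarer one
```

Because every character lives at a leaf, no code is a prefix of another, which is what lets the decoder walk the tree bit by bit without any separators.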
For me, building the Huffman tree was much easier to reason about when using a priority queue.
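The priority-queue construction can be sketched like this, with Python's `heapq` standing in for whatever queue the C# code uses (the helper names here are my own): repeatedly pop the two least frequent nodes and merge them under a new parent until only the root remains.

```python
import heapq
from collections import Counter

def build_tree(text):
    """Merge the two least-frequent nodes until one tree remains."""
    # Heap entries are (frequency, tiebreaker, node); the unique
    # tiebreaker stops heapq from ever comparing nodes directly.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        count += 1
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
    return heap[0][2]

def make_codes(node, path="", table=None):
    """Assign '0' for left and '1' for right at every branch."""
    if table is None:
        table = {}
    if isinstance(node, tuple):
        make_codes(node[0], path + "0", table)
        make_codes(node[1], path + "1", table)
    else:
        table[node] = path
    return table

codes = make_codes(build_tree("abracadabra"))
# Frequent characters get shorter codes than rare ones.
assert len(codes['a']) < len(codes['d'])
```

The priority queue is what keeps the invariant obvious: at every step the two cheapest subtrees get pushed one level deeper, so the rarest characters naturally end up furthest from the root.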
I've been slowly optimising the code.
See Benchmarks.md for details of how various commits improved things. This uses the console app.
See Benchmark-dotnet-Results.md for initial and current best results using BenchmarkDotNet.
Downloaded some public domain novels from Project Gutenberg. Here's how they compress.
| Novel | Original (bytes) | Compressed (bytes) | Ratio |
| --- | ---: | ---: | ---: |
| A Tale of Two Cities | 792,968 | 454,568 | 57.32% |
| Frankenstein; or, The Modern Prometheus | 448,844 | 253,295 | 56.43% |
| Great Expectations | 1,035,187 | 595,235 | 57.50% |
| Les Misérables | 3,325,127 | 1,921,391 | 57.78% |
| Pride and Prejudice | 790,334 | 432,705 | 54.75% |
| The War of the Worlds | 363,744 | 207,483 | 57.04% |
The compressed size hovers pretty consistently around 57% of the original.
H. G. Wells closed all of his open brackets.