A compression tool is a piece of software that is used to store a file data in a compressed format. It is useful for space optimization i.e it helps to store the same amount of information using less memory space. In this application Huffman coding algorithm is used to compress the file content.
The main steps of the algorithm are explained below -
- Frequency Count: First and most important step of this algorithm is to count the frequency of each character present in the source file.
- Build Huffman Tree: Next step is to build a Huffman tree based on the frequency table.
- Encode Tree: We assign the left edge a value of
0
and right1
for each node from root node till we reach a child node. After this process we can obtain the encoded bit string for each character. Since characters are present in the child node only, the route from root to a particular child node is the encoded bit string for that particular character.
Let's understand the Algorithm with the help of an example 🧠
-
Suppose the file content is
halamadrid
-
The frequency table would look like this.
Character Frequency h
1 a
3 l
1 m
1 d
2 r
1 i
1 -
The Huffman Tree (🌲) for the corresponding table would be
- The following steps are used to build the Huffman Tree from the frequency table 🪜
- Initially insert all nodes into a
priority queue
. - A node have to have at least two fields
characters
- representing the characters the node holds.frequency count
- frequency count of the chracters.
- Select two characters with least frequencies, remove those two nodes from the list, add their frequency and insert the newly created node into the list.
- Step (iii) is repeated until the list contains only one character.
- Initially insert all nodes into a
- Once the
Huffman Tree
is built, not its time to prepare the tree for encoding. For each non-terminal node, assign the left child edge a value of0
and right child edge a value of1
.
-
After the tree is build and edges are assigned values, the bit encoding of a character is equal to the path from root node to a child node. The encodings(🪢) are shown below -
Character Encoding h
000 a
11 l
100 m
1011 d
01 r
1010 i
001 -
So, the encoded bit sequence for
halamadrid
will be000111001110111101101000101
Since we can't open a lock without it's key 🔑, here also if we only store the encoded bit sequence to a file, we wont be able to decode the file content later. To be able to decode the content we should either have access to the frequency table or the character encodings table directly. To store this indirect information, we could insert a header section in our target file, which will hold this importalt information. Obviously, the file size will increase depending on the size of the table compared to only storing the encoded bit sequence, but these are very much importalt to retrieve the original information.
Here are some variables that i used to identify Header section in my application.
var (
HeaderBodySeparator string = "\n~<=>-\n"
KeyValuePairSeparator string = "%->"
KeyValueSeparator string = "-<>~"
)
And the compressed file file content for the text halamadrid
is
000-<>~h%->001-<>~i%->01-<>~d%->100-<>~l%->1010-<>~r%->1011-<>~m%->11-<>~a
~<=>-
�ïh�
Where �ïh�
is the actual encoded content.
To decode the compressed file content, we would first need to parse the header part to retrieve the character encodings table. In the above example the header part contains
000-<>~h%->001-<>~i%->01-<>~d%->100-<>~l%->1010-<>~r%->1011-<>~m%->11-<>~a
In which each key-value pair is separated by %->
. Now if we store the k-v pairs in an array, the array would look like
[
000-<>~h,
001-<>~i,
01-<>~d,
100-<>~l,
1010-<>~r,
1011-<>~m,
11-<>~a
]
Now using the key-value separator i.e. -<>~
, we can build the character encodings table as following.
Character | Encoding |
---|---|
h |
000 |
a |
11 |
l |
100 |
m |
1011 |
d |
01 |
r |
1010 |
i |
001 |
Since, we have now access to both the encoded bit sequence and the character encodings table, we can now easily retrieve the original file content.
Just like all good things, it comes with its caveat. This application also has its caveat. The application only works great with large size files. The header overhead sometimes becomes much more than the original file content for small-sized files.
- First build the application by running
make
. - Run the application using
./gc
or by runninggo run main.go
directly.
In the above picture, it can be seen that the compressed file gc_comp.txt
occupies less memory than the source file gc_source.txt
.