Encoding and decoding are common operations on data. Given data in the form of symbols (e.g. text), it can be encoded by translating each symbol into a unique code, possibly consisting of many symbols. Decoding applies this process in reverse. The unique codes may be made of a different set of symbols (e.g. the original symbols may be characters, but the codes are bits). We call this set βcoding symbolsβ.
The main component of encoding and decoding can be thought of as a symbol -> code dictionary. The dictionary represents a coding scheme. For example, consider that all input messages are made of symbols {π,π,π,π,π} , and the coding symbols are {0,1}. An example dictionary could be {πβ10,πβ1100,πβ1101,πβ01,πβ00}
. Then, an input message βdabβ will be encoded as β01101100β by replacing the symbols βdβ, βaβ, βbβ with their respective codes from the dictionary. Similarly, an encoded string β11001001β will be decoded to βbadβ using the same dictionary.
Coding schemes may have some unique characteristics. For example, certain encoding schemes generate codes such that no code is a prefix of another code (e.g. 10 and 101 cannot both be codes, because the first is a prefix of the second). Such schemes are called prefix codes, and this property can be used to decode faster. You can verify that the dictionary in the above example shows a prefix coding scheme.
This repository containes prefix decoding called Huffman decoding scheme
symbol | code |
---|---|
a | 100 |
b | 00 |
c | 01 |
d | 11 |
e | 101 |
Formally, the original symbol set is {π,π,π,π,π} and the code symbols are {0,1}. We can see that the above code is a prefix code (no code is a prefix of another).
This coding scheme can be represented as a tree π:
All the original symbols are leaves in this tree. The others are transition nodes. Since there are only two code symbols, this is a binary tree. Each edge of the tree is annotated (for illustration) with a code symbol (left is 0, right is 1).
Given an encoded message 10001101 , we can decode it as follows:
-
Start at the root of the tree.
-
Read the next symbol.
-
Turn βleftβ or βrightβ depending on the read symbol.
-
If a leaf is reached, output the character at that leaf, restart at the root.
-
Go to step 2
This process is illustrated for the decoding 101 into βeβ:
Applying this process to the encoded message 10001101: we use the first three symbols 100 to arrive at βaβ, the next two symbols 01 to arrive at βcβ and the last three symbols 101 to arrive at βeβ. Thus the encoded message 10001101
decodes to the string βaceβ.
A complete coding tree is full: each transition node has children to its full capacity (2 above, since the tree is binary)
credit goes to Northeastern University