Padding and ordering
ctubbsii opened this issue ยท 5 comments
I noticed that every time I encode a number of bytes which is not 0 (mod 5)
, it is terminated by ๐. That makes sense since each each emoji encodes 10 bits, and that means it takes the same number of emojis (4) to encode 4 bytes as it does 5 bytes, and you need to distinguish between the two. However, it also means that the encoded form of 4 bytes is longer than that of 5 bytes, and that's unusual. Is ๐ always used to terminate the encoding of input sequences which do not align to 5 byte boundaries?
The boundary problem is one reason for padding in some encodings. It's not strictly needed in base64, because every input byte length always encodes to a unique output length, but padding is still useful in marking boundaries between two encodings stored sequentially without a delimiter. If you select another emoji as a terminator for inputs whose lengths are multiples of 5 bytes, it can serve this purpose and solve the surprising length issue I mentioned in the previous paragraph.
Another property of encoding to consider is sort order. It would be nice if the emoji sort order preserved the sort order of the input sequence. This is something that base64 originally did not have, but variants of base64 have added and is quite useful. Any terminator/padding character should not affect the sort order.
I think 5 special emoji characters can be used for padding. One emoji character that represents no data. Four emoji characters that are only used in the last position, when the last input byte is not present. The four chars represent the last two bits of the 4th input character.
What encoding would make the following three encoded strings sort properly (assuming there is padding)?
encString1 = concatenate(encode("33"),encode("222"))
encString2 = concatenate(encode("33"),encode("444"))
encString3 = encode("33333")
Maybe sorting concatenated is doable, but I wasn't thinking about that. I was thinking about:
- preserving sort order with terminating/padding characters (and the order of the mapping character set) without concatenation.
# this example works, but there might be a counter example which doesn't
echo -e 12345\\n1234 | sort
echo 1234 | ecoji
echo 12345 | ecoji
echo -e $(echo 12345 | ecoji)\\n$(echo 1234 | ecoji) | sort
- supporting padding to be able to concatenate and avoid garbled messages.
# works
echo $(echo -n 12345 | ecoji)$(echo 6789 | ecoji) | ecoji -d
# fails because no padding
echo $(echo -n 123456 | ecoji)$(echo 6789 | ecoji) | ecoji -d
I looked at the updated docs, and while it no longer has the garbling issue in version 1.0.0, as highlighted in my above comment, the docs still aren't clear as to how to tell the difference between 4 bytes of input vs. 5 bytes of input. How does the encoding ensure no collisions between these two possibilities?
Oh, never mind. I took a closer look and think I understand now. There are only 4 options (two bits) in the last emoji, in the case of 4 bytes of input. In that case, you pick one of 4 different emojis that is not in the normal set. In the case of 5 bytes of input, the emoji is selected from the larger set which does not include these 4.
The 4 emojis selected for the job of the last emoji encoding a 4-byte string are specifically chosen to preserve the sort order, so the encoding of any 4-byte input will sort immediately prior to any encoding of that same 4-byte input followed by a 5th input byte. Very cool. I haven't yet proven to myself that these 4 will serve the intended purpose, but I will have to dive deeper another time to prove to myself that the ordering works ๐บ