Binary keys
Closed this issue · 9 comments
GoogleCodeExporter commented
Hello,
It seems there is no way to store binary data containing 0 bytes in DAWG, at
least it seems that Completer doesn't support it (completer.key() and
completer.length() returns values truncated at first 0 byte).
My use case is to store binary payload as a part of keys. The keys are
<utf8-encoded unicode key> + chr(255) + <binary_value>
this will allow implementing mapping from unicode to a list of possible binary
values using DAWG where values will be compressed just like keys; in order to
get all values Completer class may be used.
Zero bytes are not an issue for the utf8 keys because utf8 text is guaranteed
to not include them. chr(255) also can't be a part of utf8-encoded string so it
is fine as a separator. But zeroes in binary data renders this scheme
non-working.
Is there a workaround? Thanks.
Original issue reported on code.google.com by kmik...@gmail.com
on 21 Aug 2012 at 10:19
GoogleCodeExporter commented
Small clarification: I was not correct about utf8: utf8 may include zero byte
because NUL unicode codepoint is represented as zero byte in utf8. There are no
other possible zero bytes in UTF8 (multibyte codepoints have some bits set in
all bytes) so this is usually not an issue.
Original comment by kmik...@gmail.com
on 25 Aug 2012 at 8:31
GoogleCodeExporter commented
In a Python wrapper I've resolved this by encoding <binary_value> to base64.
Encoding/decoding speed doesn't become bottleneck (when using
http://libb64.sourceforge.net/ decoding is orders of magnitude faster than
Completer iteration speed).
Original comment by kmik...@gmail.com
on 4 Sep 2012 at 9:38
GoogleCodeExporter commented
I'm sorry that I couldn't help you.
As you mentioned, Completer uses '\0' as the sentinel and it seems to be
impossible to fix this without changing the structure of its auxiliary data.
Also, the change will increase the size of the auxiliary data.
So, now I'm planning to implement a new Completer for binary keys.
Original comment by susumu.y...@gmail.com
on 7 Sep 2012 at 12:10
- Changed state: Accepted
GoogleCodeExporter commented
Thanks! This would be a very useful feature.
Original comment by kmik...@gmail.com
on 7 Sep 2012 at 12:42
GoogleCodeExporter commented
I'm really sorry.
It has turned out that it is impossible to support binary keys without changing
its core data structure (major update).
When I posted the last comment, I expected that dawgdic can support binary keys
without changing Dictionary.
However, I've found that not only Completer but also Dictionary require
modification.
In addition, the modification degrades the space efficiency and the time
efficiency of dawgdic.
This drawback may be unacceptable.
In short, base64-encoding may support binary values better than natively
supporting binary keys.
Original comment by susumu.y...@gmail.com
on 14 Sep 2012 at 12:16
GoogleCodeExporter commented
Thanks for all your work on this. Binary keys would be good but base64 works
good enough in practice. So maybe the limitation should be documented somewhere?
Original comment by kmik...@gmail.com
on 14 Sep 2012 at 12:20
GoogleCodeExporter commented
Thanks for your comments.
I'll update dawgdic to reject binary keys and then update documentation.
Original comment by susumu.y...@gmail.com
on 14 Sep 2012 at 12:40
GoogleCodeExporter commented
Looks like a good plan.
Original comment by kmik...@gmail.com
on 14 Sep 2012 at 12:52
GoogleCodeExporter commented
I've finished the update of dawgdic.
Also, the documentation has been updated.
Original comment by susumu.y...@gmail.com
on 15 Sep 2012 at 6:51
- Changed state: WontFix