vshymanskyy/muon

Consider using `prefix code` and `zigzag` encoding for variable-length integers

vshymanskyy opened this issue · 2 comments

Consider using `prefix code` and `zigzag` encoding for variable-length integers

Proposal: can we broaden this issue to discuss what variable-length integer encoding in general would be most appropriate?

I was unfamiliar with zigzag encoding and what advantages it would have over the current method, so I searched for "leb128 vs zigzag". I found these two related discussions about LEB128 being used in WASM, both of which raised a number of points that seemed relevant to the topic:

https://news.ycombinator.com/item?id=11263378

WebAssembly/design#601

A quick summary of what stood out to me:

  1. zigzag encoding/decoding is allegedly faster for signed integers than plain signed LEB128
  2. implementations for prefix-based varints are generally 2x-3x faster than LEB128
  3. LEB128 allows for locating the start and end with random access, prefix-based varints don't
  4. In the context of UTF8, having non-canonical representations has lead to security issues.
    • Technically LEB128 also lacks canonical encodings for numbers: zero for example would typically be encoded as 00, but technically 80 00, 80 80 00, ... 80 80 ... 80 00 also work. Should this be forbidden? (in practice this means disallowing LEB128 to have trailing 80s followed by 00 for positive values, and similarly FFs + 7F for negative values. Can't think of the top of my head if this applies to zigzag encoding)

I have no strong opinions on any of this, and I do not know what the most relevant criteria are to decide on a varint encoding for muon anyway. But these points they seemed relevant enough to bring up (even if muon sticks to plain LEB128 you can at least say that you looked at alternatives, right?)

PS: for anyone else unfamiliar with zigzag encoding, this blog and this SO question helped me out

I think i'll give it another try