vshymanskyy/muon

Question regarding numbers being passed between JS and Python versions and what it means for their types

JobLeonard opened this issue · 2 comments

Apologies for the wall of text, it's because I really like the idea of Muon :)

TL;DR: key questions are bolded in the text below

So, a bit of context: I'm trying to write a JavaScript implementation, since the format is so elegantly simple that I feel I can achieve a basic version of it. My first goal is to have "feature parity" with how JavaScript handles JSON. That is: being able to roundtrip any object that you could also send through JSON.stringify and get back from JSON.parse, while ignoring the things that JSON can't handle. After that I'll worry about the extra things Muon supports.

Having said that, the fact that Muon has more ways to encode numbers was too tempting not to play around with for size savings. The way I have handled numbers so far is to assume that everything is a double unless explicitly made a BigInt (so basically how JavaScript handles numbers), reserving i64, u64 and LEB128 for those BigInt values. This lets me use all the other number types in the AX and BX rows to always pick the minimum number of bytes necessary to encode a number, e.g.

[8, 16, 1/16, 0.1] => 90 A8 B4 10 B9 00 00 80 3D BA 9A 99 99 99 99 99 B9 3F 91
                       |  |  |     |              |                          |
                       |  |  |     |              |                          List end
                       |  |  |     |              f64 approximation of 0.1
                       |  |  |     f32 encoding of 1/16
                       |  |  u8 encoding of 16
                       |  direct encoding of 8
                       List start
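
For the curious, here's roughly the tag-selection logic I'm using, as a minimal sketch. The tag bytes are the ones visible in the example above (0xA0-0xA9 direct, 0xB4 u8, 0xB9 f32, 0xBA f64); the wider integer branches and the BigInt/LEB128 path are elided:

```js
// Minimal sketch of picking the smallest encoding for a JS number.
// Tag bytes taken from the example above; BigInt branches omitted.
function encodeNumber(out, value) {
  if (Number.isInteger(value)) {
    if (value >= 0 && value <= 9) {
      out.push(0xA0 + value);          // direct encoding, single byte
    } else if (value >= 0 && value <= 0xFF) {
      out.push(0xB4, value);           // u8: tag + one byte
    } else {
      // ...i16/u16, i32/u32, etc. - smallest integer type that fits
    }
  } else if (Math.fround(value) === value) {
    const f32 = new Uint8Array(4);
    new DataView(f32.buffer).setFloat32(0, value, true); // little-endian
    out.push(0xB9, ...f32);            // f32 round-trips this value exactly
  } else {
    const f64 = new Uint8Array(8);
    new DataView(f64.buffer).setFloat64(0, value, true);
    out.push(0xBA, ...f64);            // fall back to f64
  }
}
```

Running this on each number in the example list reproduces the bytes between the 90/91 list markers.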

This is not a problem when just roundtripping JS-to-JS, since all values just get promoted back to doubles in the end.

But now imagine we're sending data between Python and JS code through Muon. Python uses variable-sized numbers under the hood, right? One could say all integers are "BigInt" and all floats are doubles (I think), unless one is working with NumPy. The example Python encoder from the slides either directly encodes 0-9 or uses LEB128 for all other integers.

Imagine we have a list of integers between 0 and [some value bigger than Number.MAX_SAFE_INTEGER] in Python that we encode this way, then decode in my JS implementation. We would end up with an array of mixed doubles and BigInt values. So one number type gets converted into two different ones.
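
Concretely, with the slides' encoder on the Python side and my current decoding scheme on the JS side, something like this happens (illustrative values, hypothetical decoder):

```js
// Python encodes [5, 300, 2**60]: 5 gets the direct encoding,
// 300 and 2**60 get LEB128. My JS decoder then yields:
const decoded = [5, 300n, 1152921504606846976n];
typeof decoded[0]; // "number" - direct encoding decodes to a double
typeof decoded[1]; // "bigint" - LEB128, even though 300 fits a double
typeof decoded[2]; // "bigint" - LEB128, doesn't fit a double anyway
```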

One way to handle this would be to say that a JS implementation of Muon has to convert LEB128 back to a double if the value safely fits in one, but that potentially leads to its own issues. Say I start in JavaScript with a list that contains BigInts, some of which could also be safely converted to doubles. First we serialize this list; let's assume it uses LEB128 encoding because of the BigInt type, like my implementation does so far. Now we deserialize this list in JS. Because of the rule we just established, some of the BigInt values will turn into doubles - we change the number types again!
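
Spelled out with hypothetical encode/decode functions, the JS-to-JS roundtrip under that rule would do this:

```js
// "Convert LEB128 back to a double when it fits safely" rule:
const original = [10n, 2n ** 60n];   // both BigInt, so both encoded as LEB128
const result = decode(encode(original));
// result[0] === 10        -> now a plain Number (fits safely)
// result[1] === 2n ** 60n -> still a BigInt (doesn't fit)
// The roundtrip changed the type of the first element.
```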

So we basically have two needs that are a little bit at odds:

  1. serialization/deserialization within the same language should not result in type changes
  2. serialization in one language and deserialization in another should result in predictable number types

I think the best summary of this question is: **how should Muon handle the different ways languages handle number types when transmitting data between these languages?**

For now, for my own implementation I will prioritize 1 over 2 (because it's a toy implementation and I'm not planning to interact with Python in my own use-cases).

PS: I'm sure this question has come up with other encodings that have support for more than just doubles, so maybe it's worth looking up what the arguments + conclusions were in those situations?

TypedArrays mostly resolve this issue, and List is defined as a sequence of abstract objects, so its elements can have arbitrary types.
Muon only distinguishes ints vs floats, but JS has no such distinction. So I would suggest the following rules:

  • TypedArrays preserve their types whenever possible
  • Standalone integer values (with any encoding) in the range [Number.MIN_SAFE_INTEGER, Number.MAX_SAFE_INTEGER] are treated as Number. Outside this range they become BigInts (see the sketch after this list).
  • Floats/Doubles are treated as Number (this is a direct match)
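
For illustration, a minimal sketch of the standalone-integer rule, assuming the decoder first reads the raw value as a BigInt:

```js
// Suggested rule: decode every integer encoding to a BigInt first,
// then narrow to Number when the value is safely representable.
const MIN = BigInt(Number.MIN_SAFE_INTEGER);
const MAX = BigInt(Number.MAX_SAFE_INTEGER);

function integerToJsValue(raw /* BigInt */) {
  return raw >= MIN && raw <= MAX ? Number(raw) : raw;
}
```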

Ok! So that would mean that outside of typed arrays, BigInts in the [Number.MIN_SAFE_INTEGER, Number.MAX_SAFE_INTEGER] range can turn back into "plain" Number values after a roundtrip through Muon. That's probably an acceptable trade-off as long as we're explicit about it - it's the easiest rule to explain, for starters.

I'll just make sure to add that as a potential gotcha in the documentation of my implementation.