tc39/proposal-bigint

More precise parseInt and toString

thejoshwolfe opened this issue · 18 comments

BigInt seems to be mimicking the parseInt and toString semantics of Number, but I don't understand why so much implementation-defined behavior is allowed for BigInt. For Number there are practical issues with loss of precision and exponential notation, but these are not concerns for BigInt.

This particular excerpt is concerning (from here relevant to BigInt through here):

However, if R is 10 and Z contains more than 20 significant digits, every significant digit after the 20th may be replaced by a 0 digit, at the option of the implementation; and if R is not 2, 4, 8, 10, 16, or 32, then mathInt may be an implementation-dependent approximation to the mathematical integer value that is represented by Z in radix-R notation.

It seems to defeat the purpose of arbitrary-precision BigInt if implementations can discard precision during parseInt.

And here's a concerning exceprt regarding toString (from here when radix is not 10):

The precise algorithm is implementation-dependent, however the algorithm should be a generalization of that specified in 3.1.4.1.

Why are we not precisely defining an abstract algorithm?

That's a very good point. There is a related source of ambiguity in https://tc39.github.io/ecma262/#sec-runtime-semantics-mv-s

Otherwise, the rounded value must be the Number value for the MV (in the sense defined in 6.1.6), unless the literal includes a StrUnsignedDecimalLiteral and the literal has more than 20 significant digits, in which case the Number value may be either the Number value for the MV of a literal produced by replacing each significant digit after the 20th with a 0 digit or the Number value for the MV of a literal produced by replacing each significant digit after the 20th with a 0 digit and then incrementing the literal at the 20th digit position.

I think it'd be a good idea to make a uniform decision and fix all of these, for both Number and BigInt. I don't think I'll be able to do so by the September meeting, though.

For BigInt in particular, there should never be such a 20-digit limitation. The exact answer should be given regardless of the number of digits and the radix. I guess that's a separate spec text fix which will be easier than the Number one.

Not sure if this is directly related, but parseInt also has some funny behavior with respect to radix prefixes: it allows "0x" prefixes if radix is either not passed or 16, but disallows it (or rather, returns 0) otherwise. It's not clear to me that this legacy behavior is something we want to bring forward, unless the intention of this particular method is to match the behavior of the legacy parseInt.

@ajklein Thanks for raising that (rather separate) question. I don't see a big harm coming from that stripPrefix behavior, but maybe it'd make sense to clean it up. @cxielarko , you implemented this functionality and tests for it, do you have an opinion here?

I pushed a patch to fix @thejoshwolfe 's issue, but the issue from @ajklein might have different points of view, so I'll do that as a pull request to get more feedback.

As long as we're cleaning up parseInt, we could also do a few more things beyond what @ajklein is suggesting:

  • When the radix is invalid, throw a RangeError rather than a SyntaxError. (I believe @cxielarko suggested this in the past, but somehow I didn't understand at the time.)
  • Throw an exception if there is more non-whitespace text after the recognized numeric part, rather than ignoring it.

I'll put together all three of these changes into a PR and make a matching test262 PR.

We settled on removing the parseInt function at the November 2017 TC39 meeting.

What?! How is one supposed to parse BigInts with custom radixes? Or is that simply not a supported use case any more?

@jakobkummerow

How is one supposed to parse BigInts with custom radixes?

const alphabet = '0123456789abcdefghijklmnopqrstuvwxyz'.split('');
function parseBigInt(str, radix = 10) {
  if (radix < 2 || radix > alphabet.length || Math.floor(radix) !== radix) {
    throw new RangeError('radix out of range');
  }
  let val = 0n;
  for (const c of ('' + str).split('')) {
    const index = alphabet.indexOf(c);
    if (index < 0 || index >= radix) {
      throw new RangeError('character out of range');
    }
    val = val * radix + BigInt(index);
  }
  return val;
}

(Modulo casing, _ separators, etc)

Something like that can be added to the language later, of course, but I don't think it's that big of a deal to leave it out initially, given that it's doable in userland.

The idea would be to leave it for user libraries. This proposal already leaves several other things to user libraries, such as bitcasting Numbers to 64-bit ints, writing and reading arbitrarily large BigInts from ArrayBuffers, and bit instructions such as popcount, leftmost/rightmost set/clear bit, etc. All of those could be implemented by user code or be faster as built-ins, but are left out in the interest of minimalism; this proposal could join the club.

given that it's doable in userland.

Of course it is; with TypedArrays and | 0 arithmetic you can implement the entire BigInt proposal (except for the overloaded operators) in userland.

In a quick benchmark based on our unit test's inputs, @bakkot 's polyfill (with an added radix = BigInt(radix) to make it work) is 20-30x slower than the native version. I am surprised that the committee is now demanding that we delete that native implementation.

IMHO it's weird to have .toString(radix) but no reverse function.

You can get the native implementation to do the parsing for the 4 common radixes with the BigInt() constructor:

function parseBigInt(str, radix = 10) {
  switch (radix) {
    case 2:  return BigInt("0b" + str);
    case 8:  return BigInt("0o" + str);
    case 10: return BigInt(str);
    case 16: return BigInt("0x" + str);
  }
  // fallback to a userland implementation...
}

This neglects lots of corner cases like empty input and leading whitespace.

Python has a similar asymmetry between parsing and stringification, but amusingly it's the reverse situation. int() provides parsing with any radix, and str(), hex(), oct(), and bin() provide stringification for the 4 common radixes.

IIRC @ajklein was pushing more for removal for now; maybe you could clarify.

I discussed this offline with @jakobkummerow, filling him in on the discussion around a hypothetical fromString method at TC39. Given the context of that conversation, and his agreement that the use-case for arbitrary parsing-in-radix of BigInts is unclear, my impression was that he didn't strictly object to parseInt's removal. It seems his biggest complaint is the lack of symmetry with toString.

Yes, I think there should either be both toString(radix) and fromString(..., radix) or neither of them (by which I mean that toString() should not accept a radix parameter then). I don't feel strongly about how the latter function is called; "fromString" is a nice way to break free from the legacy behavior associated with the "parseInt" name.

(I find none of "it can be polyfilled in userland", "in the interest of minimalism", or "the use-case is unclear" particularly convincing arguments, because they could be used to shoot down the entire proposal.)

OK, the committee didn't seem necessarily opposed to adding fromString, except that it might be weird to have fromString on BigInt and not Number. Would it be OK to add a matching Number.fromString in this proposal? Cc @ljharb

I think it would be excellent to add both; and would address any consistency argument from only adding one.

After discussing further with the @jakobkummerow and @ajklein offline, we concluded to stick with the November 2017 committee decision and pursue BigInt.fromString together with Number.fromString as a follow-on proposal, and remove Decimal.parseInt from this proposal.

Decimal, or BigInt?

BigInt (edited the comment above)--sorry, silly typo. If we have a follow-on Decimal proposal, it will follow the pattern established by this fromString proposal.