whatwg/infra

Incorrect code unit/code point substring definitions

triple-underscore opened this issue · 6 comments

The code unit substring from index start to the end of a string string is the code unit substring from start to string’s length − 1 inclusive within string.

For example, given the empty string, the result of
“the code unit substring from index 0 to the end of the empty string”
should be the empty string, but by definition, which is:
“the code unit substring from 0 to − 1 inclusive within the empty string”,
wihch is by definition:
“the code unit substring from 0 with length − 1 within the empty string”,
violating the assertion "start and length are nonnegative".

I think there is a confusion in the use of the term “indices” that the “indices” in
“The code unit substring from indices start to end inclusive”
means indices between characters, while the other indices mean character position itself.

So this is kind of working as intended, but there is a bug in the definition of substring by indices inclusive.

The intent of the current text was:

  • "inclusive" is about character position. E.g. "substring from 1 to 2 inclusive within 'abc' is 'bc'", or "substring from 0 to 0 inclusive within 'a' is 'a'". This is buggy in the current spec as we actually have exclusive behavior; the previous examples give 'b' and the empty string.

  • It is a spec bug (i.e. should trigger an assert) to take a substring starting or ending at an index that doesn't exist. So e.g. "substring from 0 to the end of the empty string" is a spec bug since there is nothing at index 0 in the empty string.

I see a few options here:

  1. Just fix up substring-by-indices-inclusive to work as intended. I believe this means changing end - start to end - start + 1.

  2. Fix substring-by-indices and substring-to-end to be more forgiving of empty string cases, instead of having them assert whenever you reference index 0 of an empty string.

  3. Change substring-by-indices to be exclusive, not inclusive, matching most programming languages (e.g. JavaScript's String.prototype.substring()). And change substring-to-end to be more forgiving of empty string cases, returning the empty string.

The reason (3) might be more attractive than (2) is that "the substring from indices 0 to 0 inclusive within the empty string" seems to me like a spec bug; both "from 0" and "0 inclusive" are indicating that you really should get the code point/unit at index 0 (which doesn't exist). Whereas "the substring from indices 0 to 0 exclusive within the empty string" seems like it could return the empty string; in particular, the "0 exclusive" overrides the "from 0" and gives an empty string result.

(1) is probably bad because we can't really expect calling specs to always know if they have the empty string. They would be forced to branch before calling into Infra.

Any thoughts? /cc @annevk @wanderview as well.

What about substring to define more forgiving way? such as:
“the code unit substring from index start with length up to length within string string”,
— e.g. substring from 1 with length up to 10 within "foo" returns "oo".
This accepts any string shorter than start + length and might be more convenient in some cases.

I think that is another good option, but it doesn't solve all the problems. We still need to figure out what "substring from 0 to 0 within the empty string" should do, or "substring from 0 to the end of the empty string". In both cases those are referring to an index 0 that does not exist, as long as we stick with the interpretation of indices as talking about code points/units instead of talking about positions between code points/code units.

Would the ECMA-262 approach work?

The phrase "the substring of S from inclusiveStart to exclusiveEnd" (where S is a String value or a sequence of code units and inclusiveStart and exclusiveEnd are integers) denotes the String value consisting of the consecutive code units of S beginning at index inclusiveStart and ending immediately before index exclusiveEnd (which is the empty String when inclusiveStart = exclusiveEnd). If the "to" suffix is omitted, the length of S is used as the value of exclusiveEnd.

Yes, that's my (3). I'll take that as a vote in favor of (3) as opposed to the other two options? :)

3 sounds best. I actually hadn't appreciated that the current definition wouldn't work if you passed in the length of a string as the second argument.