whatwg/encoding

Should not prepend gb18030 second and third when range decode fails upon fourth

hsivonen opened this issue · 5 comments

Consider a run of byte pairs where each pair starts with a byte that's valid as first or third and the second byte is valid as second or fourth in a four-byte gb18030 sequence.

The first two pairs form a four-byte sequence that fails the gb18030 ranges lookup. The spec, as written, realigns the decoder such that the second pair of the first out-of-range four byte sequence is next treated as the first half of another four-byte sequence.

The test case linked to above decodes to four replacement characters in Firefox, Chromium and into four question marks in IE and Edge. (Didn't test Safari.) However, per spec, the run decodes into "�9¶ĭ¶�9".

One error causing the decoder to misalign such that subsequent non-error characters come into existence where there previously were none in browsers can't be good.

Suggested fix:
When processing the fourth byte in a sequence, first check it it is in the range 0x30 to 0x39 inclusive. If it isn't, prepend the second, third and fourth bytes in the sequence per current spec and return error.

Then set code point to the index gb18030 ranges code point. If code point is null, prepend only the fourth byte and return error.

This would deviate from the current browser behavior by unmasking the last ASCII byte in the sequence the way we always unmask failed ASCII in two-byte sequences. However, unlike in the current spec, the ASCII byte that's second in the sequence would get swallowed into a REPLACEMENT CHARACTER. This is fine, because we know that ASCII byte had non-ASCII bytes before and after.

(Trying to emit two errors with an ASCII character in between as one step would be a novel behavior for the spec and would totally mess up the assumptions I've made about error emission in implementation.)

(Trying to emit two errors with an ASCII character in between as one step would be a novel behavior for the spec and would totally mess up the assumptions I've made about error emission in implementation.)

Nevermind. This can be addressed by adding more state to the decoder.

Nevermind. This can be addressed by adding more state to the decoder.

On third thought: Having two errors would then be unprecedented in another way, which even with the added state from the previous comment would mess up encoding_rs's mechanism for designating which bytes a given U+FFFD corresponds to (not currently made use of in Firefox).

On third thought: Having two errors would then be unprecedented in another way, which even with the added state from the previous comment would mess up encoding_rs's mechanism for designating which bytes a given U+FFFD corresponds to (not currently made use of in Firefox).

Hmm. On fourth thought, maybe the machinery I put in place for ISO-2022-JP could work here.

Let's do what's best for the Web and I'll sort out the error reporting.

@jungshik any thoughts with regards to ICU?

For four byte sequences we're dealing with this:

  1. Non-ASCII
  2. ASCII digit
  3. Non-ASCII
  4. ASCII digit

Now if 4 is not an ASCII digit it sounds like @hsivonen is suggesting it's reasonable to unwind, U+FFFD 1, and reprocess from 2.

If 4 is an ASCII digit, but there's no code point, the question is what to do then:

  1. U+FFFD 1-2-3 and create a theoretical issue for formats where ASCII digits are delimiters of sorts.
  2. U+FFFD 1 and 3, and emit 2 and 4 (in their original order).
  3. U+FFFD 1-2-3-4 which is what browsers do today and don't worry about ASCII digits getting consumed.

90 31 90 40 yields "�1怈" in Chrome/Firefox/Safari.
FD 31 FD 7F yields "�1��" Chrome/Firefox/Safari (although the last code point does not display in Chrome and Safari, it does seem to copy-and-paste).
FD 31 FD 21 yields "�1�!" in Chrome/Firefox/Safari.

So given that they do seem to consistently unwind there. So the main question here is what to do when the pointer lookup goes astray.