whatwg/encoding

Add "replacement" as a label for the replacement encoding

hsivonen opened this issue · 21 comments

I've written "The name of an encoding is also one of its labels, except in the case of the replacement encoding whose name is not one of its labels." inconveniently many times when either documenting code that implements the Encoding Standard or when otherwise explaining the concepts.

Also, I've written code that does something like "if input string is 'replacement', don't run get an encoding, otherwise, run get an encoding" when working with interfaces that were designed (four links there, but GitHub's styling makes it unobvious) before there was clarity of what strings are labels and what strings are names used where an enum value or a reference to a singleton object representing an encoding would be more appropriate software design (when those interfaces potentially have callers in add-ons that I can't fix).

All this would become simpler if get an encoding for the name of an encoding always returned the encoding itself. However, it's kinda sad to expose another Web-exposed label just to make implementing and explaining stuff easier, so I'm not sure if I should request this.

But at least this deserves some discussion.

/cc @cdumez who I noticed recently updated WebKit's encoding labels in WebKit/WebKit@f203fd0

I assume it would be a bad option to rename "replacement" to something like, say, "csiso2022kr"?

@domenic Despite having done the update in WebKit recently, I am actually not familiar with this area enough to comment. I know that we do have the "replacement" label for the replacement encoding in WebKit though.

I know that we do have the "replacement" label for the replacement encoding in WebKit though.

Are you sure? Testing indicates otherwise.

I assume it would be a bad option to rename "replacement" to something like, say, "csiso2022kr"?

That would be pretty confusing, so I think that's a bad option.

@hsivonen: I wrote the patch very recently so you would need to try a WebKit nightly build.

@cdumez someone did try a nightly. It appears that WebKit (per spec and other browsers) does not alias "replacement" to the behavior of the replacement encoding.

Oh my bad, there was some last minute review feedback on my patch and it appears it killed the replacement alias. Sorry about the bad information.

We added replacement in https://www.w3.org/Bugs/Public/show_bug.cgi?id=21057. I don't think we really discussed adding it as a label before. I'm certainly open to adding it as it does simplify the system a little bit, but also not a whole lot.

@jungshik thoughts?

FWIW, I concur with the OP - frequent special cases in the code and tests, but sad about adding it to the web just to make implementations cleaner. So no vote either way.

I'm going to close this since everyone is on the fence. Feel free to reopen though if you feel strongly since it doesn't seem like it would be a hard sell.

I'm going to reopen this to add "replacement" as a label as the special cases apparently continue to cause problems (at least in Gecko while integrating encoding-rs) for no real gain.

As nobody objected and @hsivonen now favors this approach I hope that is acceptable, but I'll leave some time for feedback just in case.

I'm surprised this is such a big deal in code as there's nothing in the standard (or any standards that use this standard) to my knowledge that trips over this.

Nevertheless, I created the PR, review appreciated. Note that before landing it I should probably:

  1. Update tests.
  2. File bugs against Firefox, WebKit, and Chromium. Not sure about Edge as they haven't made an effort to comply thus far I think.

Shouldn't we get another implementation interested before landing?

I considered @inexorabletash's reply above as such, but happy to wait for something more explicit.

It involves deleting code on our side so I'm okay with the change.

FWIW, I put up a Blink change: https://chromium-review.googlesource.com/c/559973/ - I'll wait for test updates to hit WPT and roll into Blink, though.

Sanity check: we would now expect an HTML file with <meta http-equiv="content-type" content="text/html; charset=replacement"> to render as � yes?

@inexorabletash yeah, that wouldn't be any different from it saying iso-2022-kr or some such. Note that it would have to appear within the first 1024 bytes.

@hsivonen do you want to review the change?

Bugs:

To my knowledge Edge hasn't made an effort yet so I'm not including them for now.

https://developer.microsoft.com/en-us/microsoft-edge/platform/status/encodingstandard/ lists them as "In Development" so they might appreciate a bug.

FYI, Blink change landed.

Worth noting: @annevk's WPT changes didn't trip any failures when rolled into Blink's CI since the tests currently exercise the encoding labels via the API, and replacement encodings already threw just like unknown labels.

We've got blink-specific tests that use XHR and data: URLs to verify various encodings and specifically that the replacement ones yield U+FFFD. I was lazy and just added "replacement" to the list. We should probably tidy up and upstream those (among other fun cases like UTF-7).