tc39/ecmarkup

HTML character reference processing is inconsistent

gibson042 opened this issue · 4 comments

#481 introduced processing of named character references, but is inconsistent with the HTML spec (cf. https://html.spec.whatwg.org/multipage/parsing.html#tokenization and specifically https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state )—it matches without a terminal semicolon but also fails to handle numeric references or the prefix matching that renders e.g. I'm &notit; I tell you as I'm ¬it; I tell you, and the inner logic lowercasing ≤ (which is its own bug and should be <) and & only with a terminal semicolon is inconsistent with it being optional in the initial match and also inconsistent with the lack of e.g. AMp; in https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references .

I think it should instead be either unconditionally strict1 or else better aligned with HTML2, and in either case should probably also be updated for numeric references and correct case handling.

Footnotes

  1. e.g.,

    --- src/formatter/text.ts
    +++ src/formatter/text.ts
    @@ -9,1 +9,1 @@
    -  text = text.replace(/&[a-zA-Z0-9]+;?/g, m => {
    +  text = text.replace(/&[a-zA-Z0-9]+;/g, m => {
    
  2. e.g.,

    --- src/formatter/text.ts
    +++ src/formatter/text.ts
    @@ -11,2 +11,4 @@
    -    if ({}.hasOwnProperty.call(entities, m) && entities[m] !== null) {
    -      return entities[m];
    +    const prefix = Array(m.length).fill().map((_, n) => n ? m.slice(0, -n) : m)
    +      .find(prefix => ({}).hasOwnProperty.call(entities, prefix) && entities[prefix] !== null);
    +    if (prefix) {
    +      return entities[prefix] + m.slice(prefix.length);
    

matches without a terminal semicolon

Only for legacy entities, which is consistent with HTML. The list of entities matched is taken directly from the HTML spec.

also fails to handle numeric references

I don't think we have any uses of that form to check if the decision to use that form was deliberate (e.g. the numeric value of the code point is relevant to the reader), so I didn't implement it. If there's some usages out there which you think are definitely not deliberate, I'd be happy to accept a PR adding that.

the prefix matching that renders e.g. I'm &notit; I tell you as I'm ¬it; I tell you

Omitting that case was a deliberate decision, which I stand by. Rather than formatting those, I would prefer to have a lint rule for unknown entity-like things which will warn on that case.

and the inner logic lowercasing ≤ (which is its own bug and should be <) and & only with a terminal semicolon

Fixed in #489, thanks.

and also inconsistent with the lack of e.g. AMp;

I don't think this is literally ever going to come up, so I don't want to spend time implementing it.

I don't think we have any uses of that form to check if the decision to use that form was deliberate (e.g. the numeric value of the code point is relevant to the reader), so I didn't implement it. If there's some usages out there which you think are definitely not deliberate, I'd be happy to accept a PR adding that.

I think it makes as much sense to convert numeric references as it does to convert named references, so I'll try to submit that.

the prefix matching that renders e.g. I'm &notit; I tell you as I'm ¬it; I tell you

Omitting that case was a deliberate decision, which I stand by. Rather than formatting those, I would prefer to have a lint rule for unknown entity-like things which will warn on that case.

and also inconsistent with the lack of e.g. AMp;

I don't think this is literally ever going to come up, so I don't want to spend time implementing it.

Perfect, that's the kind of context I was looking for. It would incline me towards requiring the semicolon for replacement and catching mistakes by linting (i.e., the first suggestion above). WDYT?

It would incline me towards requiring the semicolon for replacement and catching mistakes by linting

I prefer errors to be automatically fixed when possible, which only the formatter is capable of at the moment. &amp is unambiguously supposed to be &, so it would be nice for the tooling to fix it rather than just pointing it out. This is not true of e.g. &notit;, where it's not clear what the user intended, so that needs to be a lint rule.

(It would be nice to make the linter support --fix, but that's a bunch more work.)

So I'm inclined to stick with the current strategy unless there's some improvement to the user experience from the strategy you propose.

Works for me.