commonmark/cmark

Making escaped characters first class citizens...

Closed this issue · 10 comments

When using certain characters, for example @ or # in GitHub or GitLab, they trigger special features, such as linking to a user or an issue.

But if I want to use the character literally, without the expansion, that's not possible, even when escaping it with a backslash:

  • Referring to point #2 --> Referring to point #2
  • Referring to point \#2 --> Referring to point #2

handle_backslash adds the escaped character to the AST as a normal text node. There is no way for us to tell from the output that this used to be an escaped character and should be ignored in any additional processing. commonmark.js leaves the text node in the AST, so you can guess that it might have been escaped, but cmark collapses text nodes, so even that hint is gone.

I outline a few ideas in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/45922

But I think the best option would be to make it a first-class citizen, meaning adding a new node type for it, like CMARK_NODE_ESCAPED or CMARK_NODE_LITERAL. Default rendering would just output the character, but a different renderer could decide for itself how to render that character, for example by wrapping it in a span tag, which would bypass a special-character scanner.
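
To make the proposal concrete, here is a minimal Ruby sketch of what a dedicated escaped node could enable. It uses a toy AST rather than cmark's real API: `Node`, the `:escaped` type, and both render methods are hypothetical names made up for illustration, and HTML escaping is skipped for brevity.

```ruby
# Toy AST node; NOT cmark's data structure. All names here are hypothetical.
Node = Struct.new(:type, :literal)

# Default rendering: an escaped character is output like ordinary text.
def render_default(nodes)
  nodes.map(&:literal).join
end

# Alternative renderer: wrap escaped characters in a span so that a later
# reference scanner no longer sees "#" and "2" as adjacent text.
def render_marked(nodes)
  nodes.map do |node|
    node.type == :escaped ? "<span>#{node.literal}</span>" : node.literal
  end.join
end

# 'Referring to point \#2' as a parser with such a node type might emit it.
ast = [
  Node.new(:text, "Referring to point "),
  Node.new(:escaped, "#"),
  Node.new(:text, "2"),
]

puts render_default(ast)
puts render_marked(ast)
```

With cmark's current behavior the three nodes would collapse into one text node, which is why the second renderer cannot be written today.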

I just read about Roundtripping issues with escaped entities, so I'm not sure yet how that fits in.

A couple currently open issues:

wdyt?

jgm commented

Con:

We're creating an abstract syntax tree, not a concrete syntax tree. In commonmark \# and # in the middle of a line are just two different ways to represent the same thing. So conceptually, I think the way we're doing it is correct. We don't represent *a* and _a_ differently in the syntax tree, so why represent \# and # differently? It seems to me that the special treatment of # in gitlab and github amounts to a syntax extension and might better be handled with an extension to the parser.

Pro:

Considerations of practicality might override the conceptual argument above. It would be fairly easy to do this.

For a related issue (regarding entities), see commonmark/commonmark-spec#442. I like the idea of having special AST nodes for entities, and one might make a similar case for escapes. There are some complications noted there, and some of them apply to escapes as well.

jgm commented

@kivikakk @nwellnhof I'd be interested in your thoughts on this issue.

Thanks for the quick response @jgm!

We don't represent *a* and _a_ differently in the syntax tree, so why represent \# and # differently?

I think because escaping, to me, really is behavior, like italics - I interpret it to mean "I don't care what context I'm in or what you think I should be doing, I am this character". It adds a behavior to that character. And you typically wouldn't use it unless you wanted that specific behavior, to override some other behavior.

It seems to me that the special treatment of # in gitlab and github amounts to a syntax extension and might better be handled with an extension to the parser.

As for the fact that we consider it special, and what to do about it, I agree. But I still think it would be helpful to let the users of the AST know that the user specifically escaped this character (they wanted that specific behavior), the same way we let them know the user marked something as italics.

I broadly agree with @jgm's assessment: the escape character feels like it doesn't belong in an abstract syntax tree, but, practically speaking, I think it'd enable library consumers to do something they want to do. I don't imagine GitHub or GitLab will be trying to move their HTML pipeline into cmark itself any time soon, so exposing \ in the AST would make "escaping" references like \#123 much more doable.

We don't represent *a* and _a_ differently in the syntax tree, so why represent \# and # differently?

You could also argue that *a* and _a_ should be represented differently, just like in other situations where the AST omits information, for example #225. Personally, I'm in favor of all changes that make the AST more faithful. In many cases, this shouldn't be difficult, except for things like whitespace or continuation lines.

But we should keep in mind that adding a new node type requires a new parser option to avoid breaking the API.

You could also argue that *a* and _a_ should be represented differently

Indeed. Creating new nodes and generating a new parser seems a lot of work and overhead. Nonetheless access to the markup flavour can be very useful and allow context-specific customisations:

  • *a* and _a_ could allow the differentiated use of <I> and <EM>
  • Using the backtick or the tilde characters for fenced code could allow differentiated inlined examples and source code
  • #255 and lists, etc.

The parser only collects and transmits the information. Knowledge of the flavour is irrelevant for parsing and is only intended for consumers.

In the best of worlds I would imagine this as an extension of the cmark_node struct with, for instance, a uint16_t flavour entry. Does this make sense?
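
As a rough illustration of the flavour idea, the Ruby sketch below models a node that carries a small bit field describing which concrete syntax produced it. Nothing here exists in cmark: the `Flavour` constants, `FlavouredNode`, and `render_emph` are hypothetical, and mapping `*a*` to `<em>` and `_a_` to `<i>` is an assumption, since the comment above does not say which marker should map to which tag.

```ruby
# Hypothetical flavour bits, mimicking a 16-bit field on a node.
module Flavour
  EMPH_ASTERISK   = 1 << 0
  EMPH_UNDERSCORE = 1 << 1
  FENCE_BACKTICK  = 1 << 2
  FENCE_TILDE     = 1 << 3
end

# Toy node; NOT cmark's cmark_node struct.
FlavouredNode = Struct.new(:type, :literal, :flavour)

# The parser would only record the flavour; a consumer decides what it means.
# Assumed mapping: underscore emphasis renders as <i>, anything else as <em>.
def render_emph(node)
  tag = (node.flavour & Flavour::EMPH_UNDERSCORE).zero? ? "em" : "i"
  "<#{tag}>#{node.literal}</#{tag}>"
end

puts render_emph(FlavouredNode.new(:emph, "a", Flavour::EMPH_ASTERISK))
puts render_emph(FlavouredNode.new(:emph, "a", Flavour::EMPH_UNDERSCORE))
```

A renderer that ignores the field behaves exactly as before, which is the appeal of a flag over a new node type.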

jgm commented

Maybe there could be another issue with a request to make more information about the concrete syntax available in the AST.

As for this particular topic, I think the OP's original issue has to do with GitLab and GitHub's treatment of # and @. Would adding a new AST node even help with this problem? If GitLab and GitHub are simply postprocessing the HTML produced by cmark (or a fork), then it won't really matter how the # is represented in the AST. They would have to implement the automatic linking of issue numbers and mentions at the level of the HTML renderer for this to be useful. Does anyone know how they do, in fact, do it?

You're right, we (GitLab) do post processing on the HTML using a set of pipelines and filters. For example, the UserReferenceFilter handles @user references.

The problem, of course, is that there is no way in the HTML to know whether an @ was escaped or not.

But using commonmarker, we're able to walk the AST and modify it, or replace elements of the renderer, very easily. We already do this in https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/banzai/renderer/common_mark/html.rb

So if we have the information in the AST (through a new node or a flavor on a node or whatever), then we could determine whether a user escaped a character. And I think it's important to be able to preserve that CommonMark feature through to the final rendering, no matter what post-processing is done on the HTML.

The easiest solution would be to surround an escaped character with <span>. At the moment this is enough to keep the filter from recognizing the reference, both on GitLab and GitHub:

  • Referring to point #2 --> Referring to point #2
  • Referring to point <span>#</span>2 --> Referring to point #2
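
The effect shown in the list can be reproduced with a toy reference filter. The Ruby sketch below is not the real GitLab or GitHub filter: `link_issue_refs`, the `/issues/` URL, and the regex-based splitting into tags and text segments are assumptions, standing in for a filter that only looks at one text node at a time.

```ruby
# Toy pattern for an issue reference such as "#2".
ISSUE_REF = /#(\d+)/

# Split the HTML into tags and text segments, then link references inside
# text segments only. Because of the capture group, split keeps the tags.
def link_issue_refs(html)
  html.split(/(<[^>]+>)/).map do |segment|
    next segment if segment.start_with?("<") # leave tags untouched
    segment.gsub(ISSUE_REF, '<a href="/issues/\1">#\1</a>')
  end.join
end

puts link_issue_refs("Referring to point #2")
puts link_issue_refs("Referring to point <span>#</span>2")
```

In the second call the `#` and the `2` end up in different text segments, so the pattern never matches and the text is left alone.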

I'm no longer at GitHub, but I think I'd be correct in saying they still use the same process they did when I was there working on it: they convert the GFM into HTML using Commonmarker, using cmark's builtin HTML renderer. They then parse that HTML into a DOM and transform it in successive stages using an internal library mostly based on Gumbo, which calls into Ruby when fragments of the DOM match filters.

Without other changes, only modifying how the escaped character was represented in the AST wouldn't do much, but they could possibly modify cmark-gfm to do something else with it at the output stage.

I think I can bring this issue to a close. We solved this by pre-processing the markdown, and then post-processing it. Here's the comment from the code explaining this:

    # In order to allow a user to short-circuit our reference shortcuts
    # (such as # or !), the user should be able to escape them, like \#.
    # CommonMark supports this, however it removes all information about
    # what was actually a literal.  In order to short-circuit the reference,
    # we must surround backslash escaped ASCII punctuation with a custom sequence.
    # This way CommonMark will properly handle the backslash escaped chars
    # but we will maintain knowledge (the sequence) that it was a literal.
    #
    # We need to surround the character, not just prefix it.  It could
    # get converted into an entity by CommonMark and we wouldn't know how many
    # characters there are.  The entire literal needs to be surrounded with
    # a `span` tag, which short-circuits our reference processing.
    #
    # We can't use a custom HTML tag since we could be initially surrounding
    # text in an href, and then CommonMark will not be able to parse links
    # properly.  So we use `cmliteral-` and `-cmliteral`
    #
    # https://spec.commonmark.org/0.29/#backslash-escapes
    #
    # This filter does the initial surrounding, and MarkdownPostEscapeFilter
    # does the conversion into span tags.
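
A minimal sketch of the two-stage approach that comment describes. This is not GitLab's actual filter code: the method names, the regular expressions, and the `fake_commonmark` stand-in for the real Markdown renderer are all assumptions made for illustration, and the sketch ignores code spans and code blocks, where backslash escapes do not apply.

```ruby
# A backslash followed by one ASCII punctuation character, i.e. the
# characters CommonMark allows to be backslash-escaped.
ESCAPED_PUNCT = /\\[!-\/:-@\[-`{-~]/

# Whatever the renderer left between the two marker sequences.
MARKED_LITERAL = /cmliteral-(.+?)-cmliteral/m

# Stage 1 (before Markdown rendering): surround each escape with markers.
# The backslash is kept so the renderer still treats it as an escape.
def pre_escape(markdown)
  markdown.gsub(ESCAPED_PUNCT) { |escape| "cmliteral-#{escape}-cmliteral" }
end

# Stage 2 (after Markdown rendering): turn the markers into span tags.
# The capture may be longer than one character, e.g. an entity like "&amp;".
def post_escape(html)
  html.gsub(MARKED_LITERAL, '<span>\1</span>')
end

# Stand-in for the real renderer: it only resolves backslash escapes.
def fake_commonmark(markdown)
  markdown.gsub(/\\([!-\/:-@\[-`{-~])/, '\1')
end

source = 'Referring to point \#2'
marked = pre_escape(source)
puts marked
puts post_escape(fake_commonmark(marked))
```

Surrounding the character, rather than only prefixing it, is what lets the second stage find the end of the literal even when the renderer has expanded it into a multi-character entity.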