jgm/commonmark-hs

[fuzz result] code span vanishes when link destination is `

notriddle opened this issue · 5 comments

This markdown:

[link](`)`x`

In most engines I've tried with, including GitHub, it does this:

linkx

commonmark-hs generates this:

<p><a href="`">link</a>`x`</p>

Events from pulldown-cmark:

"[^](`)`|`\n" -> [
  Start(Paragraph)
    Start(Link { link_type: Inline, dest_url: Borrowed("`"), title: Borrowed(""), id: Borrowed("") })
      Text(Borrowed("^"))
    End(Link)
    Code(Borrowed("|"))
  End(Paragraph)
]

Events from pandoc:

"[^](`)`|`\n" -> [
  Start(Paragraph)
    Start(Link { link_type: Inline, dest_url: Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }), title: Inlined(InlineStr { inner: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 0 }), id: Borrowed("") })
      Text(Boxed("^"))
    End(Link)
    Text(Boxed("`|`"))
  End(Paragraph)
]

Events from commonmark.js:

"[^](`)`|`\n" -> [
  Start(Paragraph)
    Start(Link { link_type: Inline, dest_url: Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }), title: Inlined(InlineStr { inner: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 0 }), id: Borrowed("") })
      Text(Boxed("^"))
    End(Link)
    Code(Boxed("|"))
  End(Paragraph)
]

The code that looks like the culprit is here:

else (Chunk (Parsed (ranged
(rangeFromToks missingtoks newpos)
(str (untokenize missingtoks))))
newpos missingtoks :)

In this case, the end paren is in the middle of a "parsed code" chunk, which gets split into the link destination and some plain text.

        v end paren
[link](`)`x`
      :^^^-- parsed chunk
      :| code span
      : suffix pos

The simplest solution is to make inline code bind tighter than link destinations, but this wouldn't match the reference implementation. The other simple options are to re-parse with the inline syntax parsers, or to interleave bracket matching with inline code span parsing.

jgm commented

None of those are simple and obvious fixes, unfortunately.
Reparsing seems ugly but maybe it's the best idea.
I'm not sure what the last option would involve -- it sounds like a pretty major architectural change, though!

The three options that come to my mind are:

  1. You can't have unescaped ` in link destinations. This is the easiest, and doesn't match cmark.c.
  2. Re-parsing is the least invasive, but still ugly, and I'm not convinced it's actually correct.
  3. Add another kind of Chunk, for ` runs, then turn it into code spans in processBs. This matches closest with how pulldown-cmark does it (MaybeCode, MaybeLinkOpen, and MaybeImage are all prepared in the tree building class, then handle_inline_pass1 turns them into spans, then handle_emphasis_and_hard_break ... handles emphasis and hard line breaks in a second inline pass), and since it means there's a clean separation between "definitely a code span" and "a valid delimiter that might become a code span", it seems more likely to be correct.
jgm commented
  1. You can't have unescaped ` in link destinations. This is the easiest, and doesn't match cmark.c.

This would require a spec change.

As for options 2 and 3, I'm not sure. I agree that 2 is ugly. So 3 has some appeal, but I'd have to see what is actually involved in going this way.

I implemented option 2 in #137, but this implementation has potentially quadratic behavior.

The trouble with option 3 is that the extensions also need redone, because $math$ is just as susceptible to this problem as `code` is.