jgm/djot.js

URLs with parenthesis break parser

Opened this issue · 11 comments

Wikipedia links often have ( in them, which breaks djot. Consider this example:

[Paxos](https://en.wikipedia.org/wiki/Paxos_(computer_science))

[Paxos](<https://en.wikipedia.org/wiki/Paxos_(computer_science)>)

doc
  para
    link destination="https://en.wikipedia.org/wiki/Paxos_(computer_science"
      str text="Paxos"
    str text=")"
  para
    link destination="<https://en.wikipedia.org/wiki/Paxos_(computer_science)>"
      str text="Paxos"

Three problems here:

  • Neither one works. It perhaps would be nice if the first one just worked.
  • We don’t document how to escape URLs in the syntax description docs
  • The behavior of two examples seems self-inconsistent. We parse angle bracketed structure, but then we use it literally as href.

Maybe just documenting that brackets need to be URL-escaped, i.e. () = %28%29?

Maybe just documenting that brackets need to be URL-escaped, i.e. () = %28%29?

I don't think we can reasonably expect this. My users would trip on this frequently. Thanks Wikipedia.

I don't like having to parenthesis count, but I can't think of a better solution.

jgm commented

I agree, it would be better to change "contain the link destination (URL) in parentheses" to "contain the link destination (URL) in balanced parentheses" (or something to that effect) in the syntax description and alter the parser to handle this. (Also add a test.)

jgm commented

Odd. This works with the current djot.js parser:

[hi](url_(u))
<p><a href="url_(u)">hi</a></p>
jgm commented

But:

[hi](url_(u_u))
<p><a href="url_(u_u">hi</a>)</p>
jgm commented

So, bottom line is that we've been trying to support balanced parens in URLs, but the potentially matching underscores are confusing the parser.
So this is just a bug in djot.js.

jgm commented

What's happening here is that we're using the delimiter stack to match parens, and when the two _ get matched as emphasis (prior to detecting that it's a link), the ( between them is removed from the delimiter stack.

I hope I'm not dragging in a dead horse to beat it, but it really bothers me that emphasis markers are somehow dragged into URLs. I wish djot had three clear kinds of elements (blocks, inline text, raw text) and URLs were in the latter where no inline parsing or matching happens.

Admittedly this worsens the current issue if URLs are raw to the point of not parsing parentheses, though that can be lifted at the syntax level (e.g. with (<…>) in the OP or escaping in the same way ticks are escaped in raw inline spans, by adding more consecutive delimiters (parentheses here).

jgm commented

Well, the issue is that in commonmark.js we try to do a one-pass parse without backtracking. So, when we hit [label](..., we still don't know if it's going to be a link until it is closed. Until then, we have to keep track of potential emphasis.

If we didn't care about backtracking, we could do things differently. In fact, I do do things differently in my Haskell djot parser: I just try to parse the link destination, and if it fails I backtrack.

Anyway, there are a lot of different parsing strategies, but as far as I can see this is not an issue with djot's syntax.

jgm commented

I think the best fix here would be not to abuse the delimiter stack to keep track of matching parens (since this really only has a purpose in links), but to create a new data structure for this.

Well, the issue is that in commonmark.js we try to do a one-pass parse without backtracking. So, when we hit [label](..., we still don't know if it's going to be a link until it is closed. Until then, we have to keep track of potential emphasis.

If we didn't care about backtracking, we could do things differently. In fact, I do do things differently in my Haskell djot parser: I just try to parse the link destination, and if it fails I backtrack.

Anyway, there are a lot of different parsing strategies, but as far as I can see this is not an issue with djot's syntax.

From where I sit, I see an issue with djot's syntax that manifests in this issue and a few others (which I discussed in detail in jdm/djot#247), and that I (maybe wrongly) linked to inline parsing leaking into raw text parsing.

In other words, we can have link parsing without backtracking, because we can know that [label](… starts a link, exactly like we currently know that ``… starts a inline code span, and it would be obvious if both these were of the same nature, distinct from inline text elements.

Anyway, I've already made my case in the discussion linked above, I promise I will try harder to not bother you with it again.