rexxars/commonmark-react-renderer

Inline HTML handling is sloppy

rexxars opened this issue · 18 comments

The fact that foo <strong>bar</strong> gets turned into foo <span><strong></span>bar<span></strong></span> is obviously horribly broken. It's tied to the fact that the walker gives <strong> as an inline HTML element, then gives bar as a text node, followed by </strong> as another inline html element. Not 100% how to best handle this yet.

Stumbled upon this exact issue in remarkjs/react-markdown#17.

Because of the way the commonmark parser is built, this one is pretty tricky to solve, unfortunately. Would love some help on this if someone has time.

I guessed that this comes from the parser. Any hint where I should look for this issue?

Basically, when encountering foo <strong>bar</strong>, the parser yields four tokens:

  1. Text with the literal foo
  2. InlineHtml with the literal <strong>
  3. Text with the literal bar
  4. InlineHtml with the literal </strong>

Because of how React works, we can't just inject "half a tag" somewhere. If the parser gave us <strong>bar</strong>, things would be a lot simpler, but this doesn't work because of things like the following: foo <strong>*bar*</strong>. In other words, markdown inside of inline HTML.

Yeah, I see. Seems like have to fall back to the plain Markdown processor and inject the output dangerously.

I also ran into this issue in remarkjs/react-markdown#67

Is this an issue in the commonmark parser, or the way it is used by commonmark-react-renderer? Has anyone looked into alternative commonmark parsers?

The issue is basically that you'd need to pull in an HTML parser in order to properly handle this.
Let's imagine you have this HTML:

<span id=unquoted class="foo">Some *bold* <span class='outlined'>text</span></span>

In an ideal world, it would have to convert this to the following tree:

{
  tag: 'span',
  attrs: {id: 'unquoted', className: 'foo'},
  children: [
    'Some ',
    {tag: 'strong', children: ['bold']},
    {tag: 'span', attrs: {className: 'outlined'}, children: ['text']}
  ]
}

Which is non-trivial to implement with a parser that only emits tokens without having any notion of "depth". You'd have to keep track of the structure you're inside of, so that for the opening <span>, you'd note that we're now entering a node and need to record every new token as a child, until you run into a new opening tag or a closing tag for the span. Then you'd also have to consider cases like optional closing tags (good old <p>, for instance), incorrect HTML structures (<span>Foo<em></span>Muh</em>) and the likes. It really is a nightmare.

What are the objections to pulling in an HTML parser? Is it the size? Does the commonmark parser not help with this? Would changing parsers solve this?

The way things are now, inline html doesn't work, even for very simple and properly formed html:

<span>Hello</span>

Here is what gets rendered:

<p><span><span></span></span><!-- react-text: 89 -->Hello<!-- /react-text --><span></span></p>

I suggest that the project README reflect this situation until it can be solved and it should also be mentioned in the react-markdown readme. I suspect this might save others a bunch of time figuring out why things aren't rendering right.

Well, even if you pull in an HTML parser (which will add significant weight), you'll still have to figure out how to deal with all the edge cases, as I outlined above (broken HTML, optional closing tags etc).

It's not a task I want to venture out on - my time is already stretched way too thin as it is. I'd be more than happy to accept a pull request that would improve the situation. Last time I thought about this problem, I came across html-to-react, which would solve much of this problem if only the commonmark parser returned blocks of HTML instead of just tokens for individual HTML fragments (start and end tags as individual tokens). So you'd still be left with trying to match up start and end tags manually.

I'll try to update the readmes to reflect this state.

Thanks for the explanation, and I understand being stretched too thin. I was hoping to get a clear picture of how you would solve this so that someone else could take on the work and create a PR you would potentially accept. It would also help to know what you don't want to see (adding significant weight to project, etc.).

I am stretched thin and don't have the time to solve this problem either. I'm working around the issue in my current project, but it keeps rearing it's ugly head and I may end up needing to work on a real solution.

Hi there! I'm evaluating this project right now. Honestly, I wouldn't expect the contents of HTML to be parsed as markdown. Couldn't things be drastically simplified if we say that once you get into HTML-land we no longer parse it as markdown? Kinda like switching context from JSX to expressions and back again using { and }?

Hi Kent!

For sure, things would be simpler if inline HTML was just emitted as one big chunk. I don't think this is consistent with other markdown parsers though, and might not necessarily be what you want. It also doesn't really solve the problem unless we switch to a different parser.

As outlined earlier, the parser itself outputs individual tokens. If it emitted a single token, it would be a different story.

Note that while <span>*Something*</span> is, as you say, parsed by Remarkable as three tokens (so Markdown can be used inside), this is not true for <div>*Something*</div>.

In that case, Remarkable just takes everything inside as pure html -- which is why that is output as a single htmlblock token, rather than htmltag, text, and htmltag.

So it's kind of odd to me why the authors of Remarkable did this -- it seems kind of inconsistent. They have span allowing Markdown inside it, whereas div does not.

EDIT: Just realized this project uses the Commonmark parser instead of the Remarkable one; however, I suspect they may have the same difference in handling of <div> and <span>.

I'm going to try to look into this more in the next two weeks. One question; if we could detect whether the node was an opening/closing tag of inline HTML with something like if (!props.isBlock && props.literal.match(/<([A-Z]+)[^>]+>/i)) how would we then skip creating a react element for that node? Or would it be better to group an opening tag, text node and closing tag together and then create a react element for that?

First of all, thanks a ton for showing interest in helping with this! <3

Been a while since I looked at this, so I don't have a clear picture of the issue in my head right now. Having said that, here's the first few thoughts that pops into my head when we're dealing with this stuff:

  • If you are using react-markdown, beware that I've been wanting to switch to a different parser for some time. I've got a semi-working prototype going that uses remark-parse as the parser. I'd focus my efforts on that branch. If you're using react-commonmark, or just this parser in and of itself, this is the right place to address it.

  • Simply matching the opening and closing tags might be a good start, but keep in mind that to properly create React elements from the HTML, you'll want to also extract the attributes in the opening tag and apply them to the created element.

  • Things I'm worried about being difficult to handle properly:

    • Invalid HTML (What happens if you put a <p> in there somewhere, and never close it? What if you open a <p>, then a <span> and close the tags in the wrong order?)
    • Properly parsing HTML without a real HTML parser
  • Having said that, my first suggestion (without having taken a stab at it) would be to:

    1. Keep an array of open tags/nodes that you append to every time you hit an InlineHtml node
      • If you hit an empty element, parse it and produce a react element right away
    2. Keep an array of all the other nodes produced since the last open html tag
    3. When you encounter an InlineHtml node, parse it, see if it's a closing tag. If so, check if you have a corresponding open tag. If so, close all the open nodes that have been encountered after that tag, and add all those elements along with the array of produced nodes previously recorded as children of the current tag.

As far as I know, once a React element has been created, there isn't a way to add children to it, so you'll have no choice but to group the children together as you suggested.

Good luck!

Oh, very interesting. I'm familiar with remark-parse. I used a lot of remark plugins for https://github.com/sparkartgroup/quality-docs

That md-ast branch on react-markdown is 17 ahead and 54 behind master. What are your thoughts on rebasing that and then opening a PR with the current progress that we could use to track discussion?

Sure, I'm super busy at work these days, but I'll see if I can find the time to get it up to speed sometime soon.

Huge divergence from master, so I decided to just do a clean branch from master.
Anyway, check out https://github.com/rexxars/react-markdown/tree/remark if you're interested.