bustle/mobiledoc-kit

Pasting HTML - wrappers cause unexpected results

Closed this issue · 7 comments

I've searched as best I could but can't figure out if this behaviour is a bug or expected.

This is reproducible for me both by copy-pasting into the demo editor and programatically using the DomParser and Mobiledoc Renderer.

Paste: <p>Hello</p><p>World</p>

Resulting Mobiledoc.sections:

[ [ 1, 'p', [ [ 0, [], 0, 'Hello' ] ] ], [ 1, 'p', [ [ 0, [], 0, 'World' ] ] ] ]

Paste: <div><p>Hello</p><p>World</p></div>

Resulting Mobiledoc.sections:

[ [ 1, 'p', [ [ 0, [], 0, 'HelloWorld' ] ] ] ]

I'll try to add a minimal test example to show this in action, but I wanted to raise the issue in advance to check if this is expected behaviour?

If not, I'd be very happy to work towards a fix but might need a little pointer in the right direction. I feel like 9b4231e might be related?

I'm working on a HTML -> Mobiledoc converter, which is where I'm running into this problem.

I've setup tests which show this problem with a very minimal reproduction case:

https://github.com/TryGhost/Ghost-SDK/blob/master/packages/html-to-mobiledoc/test/converter.test.js#L24

This is kind of a long, in depth comment. I'm committed to finding a solution here if I can, although I'm obviously lacking an awful lot of context around the what and why. I would ❤️ any input or feedback!


Basic problem as I understand it

I've stepped through what happens in each case with the debugger, and it seems like, with the exception of if there's a Google Docs wrapper, Mobiledoc assumes that what it is processing is always a NodeList that represents sections. As far as I can see, this is a fundamental assumption.

So, in the first instance, SectionParser is called twice, and in the second instance, we only ever enter the SectionParser once, and then recursively iterate over the nodes.

This means that copy-pasting HTML, or trying to convert it as in my case, will always end up in this kind of misinterpretation unless you preprocess it to pass only the sections.

In more detail

It seems to me that issue comes from the following logic:

  1. In the dom parser, iterate over all children and call parseSections
  2. parseSections assumes we are in a section, does no logic to check, and calls the sectionParser
  3. The sectionParser is then a little smarter: as long as the node isn't preprocessed by a plugin, and also isn't a list, it calls parseNode.
  4. parseNode is also smart, it checks whether it has an element or a text node, and either calls the recursive parseElementNode or directly calls parseTextNode, this allows iteration through section elements to get the actual text and keep that around.

When we wrap sections, the top level wrapper is treated as a section. Even though parseElementNode is recursive and can iterate through the underlying Nodes, we end up collecting text nodes underneath 1 section, not 2, and creating a single MarkupSection.

Because the section we end up with is a <div> and that's not valid, there's even logic called that ends up determining that we're not in a valid section, and converting the combined text into a <p>, which is the default tag: https://github.com/bustle/mobiledoc-kit/blob/master/src/js/parsers/section.js#L262.

It feels like there are so many pieces in place, but DomParser is missing an important piece. We know what sections should look like, but we don't pay attention to this until far too late. I realise completely that by assuming that the editor gets a list of Sections, it vastly simplifies the logic, but at the same time in terms of copy-pasting HTML, or converting HTML, this assumption almost never holds.

A potential approach

Again, I'm painfully aware that I'm lacking a lot of context, but implementing a solution into Mobiledoc's DOM Parser seems much more elegant and useful to the community than me adding DOM pre-processing into my HTML-to-mobiledoc utility, and either way, I need to understand the problem properly regardless of where the solution gets implemented.

My suggestion for an approach would be instead of assuming that the top level items are sections, do the following:

  • Walk the DOM of each top-level child until we find either a valid section or a text node.
  • If a valid section is found, call the SectionParser with that node.
  • If a text node is found, assume the parent node is the section, call the SectionParser with the parent node.
  • If no valid section and no text node is found, ignore that child, try the next one.

I think this approach would then support DOMs that include wrappers, multiple wrappers, sections with variable numbers of wrappers, and other structures E.g.

<div><p>Hello</p><p>World</p></div>

<div><div><p>Hello</p><p>World</p></div></div>

<div><section><p>Hello</p></section><div><p>World</p></div></div>

<div><p>Hello</p><div><p>World</p></div></div>

Implementation

Having got a little more familiar with the code base, I'm not entirely sure whether this logic should live in the DOM Parser or the Section Parser.

I think it makes sense to only call the Section Parser with something we expect to be parsed as a section, plus the DOM Parser already has a concept of walkMarkerableNodes. Therefore, I think it makes sense to add something like walkSectionNodes to do the detection.

I also think that doing this would allow for the removal of the detectRootElement and custom isGoogleDocsContainer logic.

So, from the conversation on my PR it's become apparent I should have included my original natural example here as well as the minimal test case. I started with a real world problem and then boiled it down to the minimal problem.

A good example of the real world issue is a medium post like this one:

https://medium.com/p/6dd1d4aa01ad

It's slightly trickier to reproduce than the minimal case, but the steps are:

  1. open my medium post scroll to the bottom, copy the last 3 paragraphs
  2. paste into the demo editor all is well - you'll get 3 paragraphs and the markup
  3. go back to the medium post this time select all the content including the feature image at the top and the author bio
  4. paste into the demo editor and you will get a big ol blob of text

The line between working/not working is difficult to find. Sometimes if you select just the feature image and not the bio it works, sometimes it doesn't, it's obviously dependent on whether or not you catch a wrapper div in your selection.

My proposal is essentially that no amount of wrapping divs should prevent me being able to copy and paste content into the editor successfully.

After playing with #651 some more, and realising it's really not possible to fix all the existing test cases and that I don't think there's really a way to move forward with that approach without making a fairly fundamental change to how section parsing works. So I've closed that PR and gone back to the drawing board.


I now have 2 new proposals for how to move forward, to explain them, I'm going to go back and re-explain the problems I'm seeing when running external HTML through the mobiledoc parser (the same problems appear in both a paste or a conversion context).

Problem 1: Wrapper divs.

This is the easier one to understand/communicate and to fix. Essentially, if you have any number of wrappers around your sections, the sections are not parsed correctly. I think there is agreement all round that this case should work.

<div><p>Hello</p><p>World</p></div> -> results in <p>HelloWorld</p>

To solve this, I have a simple, hopefully light touch fix, that looks for the innermost <div> or <section> element that has only one child, and uses this as the root element.

I have raised this fix as a PR here: #655

I believe this fix is entirely safe. It only solves a few cases, but is already a massive improvement.

Problem 2: Siblings / trees

The second set of problems surround siblings and nesting - i.e trees of markup. The simplest case that doesn't work as I would expect is:

<div><p>Hello</p></div><div><p>World</p></div> -> results again in <p>HelloWorld</p>

To fix this, I have the same approach as for problem 1, but I do the root element resolution inside the foreach child node function.

I have raised this fix as a PR here: #654

This fix slightly changes the behaviour, but I believe it is still safe - especially as there were no unexpected test fails - so all known existing cases are still working as expected.

With this change, a vast array of HTML structures will be correctly interpreted and paragraphs will be kept separate instead of accidentally merged.


The first fix can definitely be merged if we want to play it super safe, but IMO the second fix is significantly better.

I really hope this isn't confusing - I've raised 2 PRs to try to make it clearer the difference between the two proposals both in terms of the size of the change and the cases that are solved.

So @kevinansfield has pointed out that Proposal 2 currently has a very big impact on the parser plugins.

An example case is something like

<section class="article">
    <p>...</p>
    <div class="image">
        <img />
        <span class="caption">...</span>
    </div>
</section>

In mobiledoc right now, the parser plugin will get passed the <p> and then the <div>.

With proposal 1 this still holds.

With proposal 2, the parser plugin would be passed <p> and then <img> and then <span> having no chance to do anything with the <div> such as treat it as a figure.

In a very similar vein we discovered #656.


The main purpose of what I'm trying to achieve right now is reasonable handling of medium's markup. There is already some explicit handling in mobiledoc for Google Docs, but I didn't really want to end up having to do custom implementations per source.

It seems like the concepts of determining what is a section, and when to run a parser plugin are at war with each other right now.

@mixonic you mentioned having to write your own parser as well. Is there anything from that experience that we could use, or do you have any other thoughts on if there is a way to rework this section handling conceptually into something more robust?

For context, here's an example of what the structure of a medium document can look like, although there is a lot of flexibility:

<section class="section section--body section--first">
    <div class="section-divider"><hr class="section-divider"></div>
    <div class="section-content">
        <div class="section-inner sectionLayout--insetColumn">
            <h3>Title</h3>
        </div>
        <div class="section-inner sectionLayout--fullWidth">
            <figure><img></figure>
            <ul><li></li></ul>
        </div>
        <div class="section-inner sectionLayout--insetColumn">
           <h4>Subtitle</h4>
            <p></p>
            <p></p>
        </div>
    </div>
</section>
<section class="section section--body section--last">
    <div class="section-divider"><hr class="section-divider"></div>
    <div class="section-content">
        <div class="section-inner sectionLayout--insetColumn">
            <h4>Subtitle</h4>
            <p></p>
        </div>
    </div>
</section>
  • most posts are made of multiple sections, these are layout wrappers
  • within sections are divs, each of these is also a layout wrapper
  • inside the divs, we find what mobiledoc would consider a section.

There's still a logical structure to the document not dissimilar from what mobiledoc supports, there are just a bunch of section & div wrappers that need peeling away to get to the heart.

For the usecase of wanting to convert, of course I can pre-process, but this would prevent mobiledoc from being able to handle medium content on paste, which I think is desirable.

I wish I had understood that the section parser was meant to handle wrappers 😞 Turns out all of this bad behaviour is a bug / the section parser being incomplete.

#658 fixes this issue completely without any behavioural changes.