fb55/htmlparser2

Changed behaviour of parseDocument

sbruinsje opened this issue · 1 comment

Given the following example:

const { parseDOM } = require('htmlparser2');
const html = '<html><body><p><div><span>1</span></div></p></body></html>';
parseDOM(html, { xmlMode: false }); // note: parseDOM has since been replaced by parseDocument()

In versions prior to 4.0.0

The example HTML string is parsed into a DOM tree that looks like:

|- html
    |- body
        |- p
            |- div
                |- span
                    <textNode>

This is what I would expect. The body DOM element, printed out, looks like:

<ref *1> {
  type: 'tag',
  name: 'body',
  children: [
    {
      type: 'tag',
      name: 'p',
      children: [Array],
      parent: [Circular *1],
      ...
    }
  ],
  parent: {
    type: 'tag',
    name: 'html',
    children: [ [Circular *1] ],
    parent: null,
    ...
  },
  ...
}

In versions 4.0.0 or later

The example HTML string is parsed into a DOM tree that looks like:

|- html
    |- body
        |- p
        |- div
        |- p

The body DOM element looks like:

<ref *1> Element {
  type: 'tag',
  parent: Element {
    type: 'tag',
    parent: null,
    children: [ [Circular *1] ],
    name: 'html',
    ...
  },
  children: [
    Element {
      type: 'tag',
      parent: [Circular *1],
      children: [],
      name: 'p',
      ...
    },
    Element {
      type: 'tag',
      parent: [Circular *1],
      children: [Array],
      name: 'div',
      ...
    },
    Element {
      type: 'tag',
      parent: [Circular *1],
      children: [],
      name: 'p',
      ...
    }
  ],
  name: 'body',
  ...
}

What changed after version 3.9.2 to cause this difference in behaviour, and is the new behaviour correct?

fb55 commented

Unless the xmlMode option is enabled, htmlparser2 now has rudimentary support for tags implicitly closing other tags, similar to how browsers handle this. In your case, an opening div tag is supposed to close an open p tag; see here.
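
To make the effect concrete, here is a rough, self-contained sketch of the idea. It is a toy tree builder, not htmlparser2's actual code: the IMPLICIT_CLOSERS table, buildTree, and bodyOf are hypothetical names invented for this illustration, and the only rules modelled are "an opening div closes an open p" and "a stray closing p becomes an empty p element", which is what the output in the question suggests is happening.

```javascript
// Toy illustration only. NOT htmlparser2's implementation.
// Hypothetical rule table: opening the key tag implicitly closes these open tags.
const IMPLICIT_CLOSERS = new Map([["div", ["p"]]]);

function buildTree(html, { xmlMode = false } = {}) {
  const root = { name: "#root", children: [] };
  const stack = [root]; // currently open elements, innermost last
  const top = () => stack[stack.length - 1];
  const open = (name) => {
    const node = { name, children: [] };
    top().children.push(node);
    stack.push(node);
  };

  // Split the input into tags and text runs (no attributes or comments handled).
  const tokens = html.match(/<\/?[a-zA-Z][^>]*>|[^<]+/g) || [];
  for (const token of tokens) {
    if (!token.startsWith("<")) {
      top().children.push({ name: "#text", children: [] });
      continue;
    }
    const closing = token.startsWith("</");
    const name = /^<\/?([a-zA-Z][^\s>\/]*)/.exec(token)[1];

    if (!closing) {
      const closes = xmlMode ? undefined : IMPLICIT_CLOSERS.get(name);
      // In HTML mode, pop any open tag that this opening tag implicitly closes.
      while (closes && stack.length > 1 && closes.includes(top().name)) {
        stack.pop();
      }
      open(name);
    } else {
      const idx = stack.map((n) => n.name).lastIndexOf(name);
      if (idx > 0) {
        stack.length = idx; // close the match and everything nested inside it
      } else if (!xmlMode && name === "p") {
        // A stray </p> with no open <p> yields an empty <p> element.
        open("p");
        stack.pop();
      }
      // Any other unmatched closing tag is ignored.
    }
  }
  return root;
}

const html = "<html><body><p><div><span>1</span></div></p></body></html>";
const bodyOf = (tree) => tree.children[0].children[0];

// HTML mode: <div> closes <p>, and the stray </p> adds an empty <p>.
console.log(bodyOf(buildTree(html)).children.map((n) => n.name).join(", ")); // p, div, p
// XML mode: no implicit closing, so the nesting is kept as written.
console.log(bodyOf(buildTree(html, { xmlMode: true })).children.map((n) => n.name).join(", ")); // p
```

With the real library, the comment above implies that passing xmlMode: true would restore the nested result, but that option presumably changes other parsing behaviour as well, so it is worth checking against your own input before relying on it.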