Floki using the built in parser does not handle the optional closing p tag

Question

derek-zhou opened this issue 3 years ago · 5 comments

Description

According to the HTML5 spec, the closing </p> tag is optional, i.e.:

<p>p1
<p>p2

is equivalent to:

<p>p1</p>
<p>p2</p>

However, Floki with the built-in parser does not handle this correctly.

To Reproduce

Using Floki v0.32.0
Using Elixir v1.12.3
Using Erlang OTP v24
With this code:

iex> Floki.parse_document("<p>p1<p>p2")
{:ok, [{"p", [], ["p1", {"p", [], ["p2"]}]}]}
iex> Floki.parse_document("<p>p1</p><p>p2</p>")
{:ok, [{"p", [], ["p1"]}, {"p", [], ["p2"]}]}

It looks like Floki only fills in the missing </p> tags at the end of the document, so the second <p> ends up nested inside the first.

Expected behavior

A <p> element should not contain another <p>; the two paragraphs should be parsed as siblings.

Answer 1 · 2022-03-24T16:20:04.000Z

Yeah, this is a bug :/
It won't be fixed easily because of #37.
But at least we are halfway there: https://github.com/philss/floki/projects/2

Answer 2 · 2022-03-24T18:16:03.000Z

Do you mean that the mochiweb parser is too fragile to fix, and that a brand new parser is on the way?

Answer 3 · 2022-03-24T23:02:19.000Z

@derek-zhou It's not that it is too fragile, but I think the HTML parsing state machine is too damn complicated to fix when the parser never followed the specs 😅

I plan to finish the built-in parser one day. In the meantime, I suggest you give the html5ever parser a try (https://github.com/philss/floki#using-html5ever-as-the-html-parser). It now comes with precompiled NIFs, so you don't need Rust to use it anymore.
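
For reference, here is a minimal sketch of making html5ever the default parser for a whole application, along the lines of the README section linked above. The `">= 0.0.0"` version requirements are placeholders, not recommendations; use the ones the README gives for your Floki release.

```elixir
# mix.exs: add html5ever next to floki.
# The ">= 0.0.0" requirements are placeholders for illustration only.
defp deps do
  [
    {:floki, ">= 0.0.0"},
    {:html5ever, ">= 0.0.0"}
  ]
end

# config/config.exs: use html5ever for every Floki parse call by default.
import Config
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```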

Answer 4 · 2022-03-25T00:18:22.000Z

I am not afraid of a little Rust toolchain. However, I need to do some ad-hoc XML parsing in the same application, and I am afraid the html5ever parser could be too strict about things.

Answer 5 · 2022-03-25T00:58:58.000Z

@derek-zhou I see. You can use both if you need to. Just pass the parser as an option to parse_document.
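
A rough sketch of what that could look like, written as a standalone script so it can be run with `elixir`. It assumes the `:html_parser` option of `Floki.parse_document/2` and that the html5ever package is available; `Mix.install/1` is used here only to keep the example self-contained, and the `nested` helper is made up for illustration.

```elixir
# Standalone script sketch; in a Mix project, add the deps to mix.exs instead.
Mix.install([:floki, :html5ever])

html = "<p>p1<p>p2"

# Default call: uses the built-in (mochiweb-based) parser.
{:ok, builtin_tree} = Floki.parse_document(html)

# Same input, parsed with html5ever for this call only.
{:ok, html5ever_tree} =
  Floki.parse_document(html, html_parser: Floki.HTMLParser.Html5ever)

# Hypothetical helper: counts <p> elements nested inside another <p>.
# A spec-compliant parse of this input should have none.
nested = fn tree -> tree |> Floki.find("p p") |> length() end

IO.inspect(nested.(builtin_tree), label: "nested <p> (built-in)")
IO.inspect(nested.(html5ever_tree), label: "nested <p> (html5ever)")
```

This way the lenient built-in parser stays the default for the ad-hoc XML, and only the calls that need spec-compliant HTML5 handling opt in to html5ever.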