philss/floki

Floki doesn't parse nested comments properly

mohammedgad opened this issue ยท 8 comments

Description

When we use any parsing function like parse_document, parse_fragment, or parse.
for comments we expect to get a tuple {:comment, "this is a comment"} it works fine if we have a comment but doesn't work with nested comments.

for example:

html_body = """
              <!--  
                Hello, world! I am a comment and I am
                displayed in multiple lines!
              -->
    """

Floki.parse(html_body)

This will produce a tuple as expected

{:comment, "  \n            Hello, world! I am a comment and I am\n            displayed in multiple lines!\n          "}

it doesn't work well if we have nested comments

nested comments code example: (bug)

    html_body = """
              <!--  
                Hello, world! I am a comment and I am
                displayed in multiple lines!
                <!--  
                  Hello, world! I am a comment and I am
                  displayed in multiple lines!
                -->
              -->
"""

Floki.parse(html_body)

This will produce:

[
  {:comment,
   "  \n                Hello, world! I am a comment and I am\n                displayed in multiple lines!\n                <!--  \n                  Hello, world! I am a comment and I am\n                  displayed in multiple lines!\n                "},
  "\n              -->\n"
]

this generates an array instead of a tuple, as you can see there's an extra "\n -->\n " string after the comment tuple, which shouldn't exist.

To Reproduce

Steps to reproduce the behavior:

  • Using Floki v0.30.0
  • Using Elixir v1.9.0
  • Using Erlang OTP v23
  • With this code:
html_body = """
            <!--  
              Hello, world! I am a comment and I am
              displayed in multiple lines!
              <!--  
                Hello, world! I am a comment and I am
                displayed in multiple lines!
              -->
            -->
"""

Floki.parse(html_body)

Expected behavior

The code should generate a tuple {:comment, "comment body"}

@mohammedgad thank you for opening the issue! ๐Ÿ’œ
This is a problem in the current tokenizer. For some weird reason FastHTML and HTML5ever also don't parse it correctly.

I'm planning to introduce a transitional parser that uses the tokenizer from #204 (which is parsing it correctly) and the tree construction from the existing mochiweb parser. The idea of a transitional parser is to be more correct than the current one but still not fully in compliance with WHATWG specs. I'm not sure if it will work correctly tough.

This is probably going out in the next week. I will let you know.

The first --> closes the comment, so the second one is parsed as non-comment text. I don't believe you can nest comments.

https://www.w3.org/TR/html52/syntax.html#comments

Comments must have the following format:

  1. The string "<!--"

  2. Optionally, text, with the additional restriction that the text must not start with the string ">", nor start with the string "->", nor contain the strings "<!--", "-->", or "--!>", nor end with the string "<!-".

  3. The string "-->"

Thank you so much @philss I am really glad to hear about the transitional parser.

Hey @mischov, I didn't know about nested comments till week ago, the idea behind it, Microsoft wanted to handle conditional styling for outlook and IE. So we can have a comment with IE or outlook styling, inside it you can have other comments for specific IE or outlook versions.

Reference: https://msdn.microsoft.com/en-us/library/ms537512(v=vs.85).aspx

As your link indicates

Important As of Internet Explorer 10, conditional comments are no longer supported by standards mode.

Which makes sense because nested comments aren't supported by the HTML5 standard.

@philss said "For some weird reason FastHTML and HTML5ever also don't parse it correctly," but they are parsing it correctly by the HTML5 standard. I don't know that it makes a lot of sense for parsers to parse in a manner that is against the HTML5 standard to support a feature Microsoft stopped supporting (by default) in IE10 (which came out in 2012).

@mischov yea it's not supported anymore for IE 10 and for sure this has nothing to do with HTML standards, it's just a comment, but it's still in use for old browsers.
I believe we don't need to support nested comments by generating a node in the html_tree but we need to treat these nested comments like any other comments, the output will be the parent comment node and the string part of this node will contain all the nested comments.

but we need to treat these nested comments like any other comments

I disagree. Treating nested comments as a single comment is incorrect behavior for like 99.87% of browser users (caniuse suggests only 0.13% of browser users are on IE9 or earlier).

Floki does currently parse nested comments properly for the overwhelming majority of users, which is to say it parses them like the user's browser would.

It's @philss decision of course, but I don't think this is a bug.

@mischov you are correct this isn't a bug, it's exactly how the parsing should happen, the current behavior is exactly what the browser will do with the comments. it will print ---> as text,
Even the IE nested comments are not actually comments, they are just one big comment. for example:
<!--[if true]><![if IE 7]><p>This nested comment is displayed in IE 7.</p><![endif]><![endif]-->
So nested comment doesn't actually exist, I am gonna close this Issue.
Thank you @mischov and @philss

Yeah, @mischov is correct and I was wrong when I said that the new tokenizer is parsing as you expected. It is parsing just like the others. I think I misread it. Sorry about that, @mohammedgad :/

So, definitely this is not a bug ๐Ÿ‘
Thank you @mischov and @mohammedgad :)