philss/floki

Floki.parse differs when using html5ever

andyleclair opened this issue · 4 comments

Description

Floki with the default Mochiweb parser produces different output than Floki with html5ever: with html5ever, the output of Floki.parse is wrapped in <html><head></head><body>...</body></html>.
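For context, the parser is normally switched through application config. A minimal sketch, assuming the `:html_parser` setting described in the Floki README and that the `html5ever` package is already listed as a dependency in `mix.exs` (the file path is the conventional one, not something stated in this report):

```elixir
# config/config.exs
import Config

# Assumption: Floki reads this key to pick its parser.
# Without this line Floki falls back to its default Mochiweb parser.
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```

With this line present, the same `Floki.parse/1` call goes through html5ever instead of Mochiweb, which is the setup the output in this report comes from.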

To Reproduce

Steps to reproduce the behavior:

  • Using Floki v0.23.0
  • Using html5ever
  • Using Elixir v1.9.3
  • Using Erlang OTP v21.3.8.9
  • With this code:
defmodule TestCases do
  @test_cases [
    {
      ~s[<a href="javascript:alert('XSS');">Click here</a>],
      ~s[<a href="#">Click here</a>]
    },
    {
      ~s[<a href="whatever" onclick="alert('XSS');">Click here</a>],
      ~s[<a href="whatever">Click here</a>],
    },
    {
      ~s[<body onload="alert('XSS')"><p>Hello</p></body>],
      ~s[<body><p>Hello</p></body>],
    },
    {
      ~s[<img src="javascript:alert('XSS');">],
      ~s[<img src="#"/>],
    },
    {
      ~s[<script>alert('XSS');</script>],
      ~s[],
    },
    {
      ~s[<body background="javascript:alert('XSS');"><p>Hello</p></body>],
      ~s[<body background="#"><p>Hello</p></body>],
    },
    {
      ~s[<style>body { background-image: expression('alert("XSS")'); }</style>],
      ~s[<style>body { background-image: removed_by_strip_js('alert("XSS")'); }</style>],
    },
    {
      ~s[<style>body { background-image: url('javascript:alert("XSS")'); }</style>],
      ~s[<style>body { background-image: url('removed_by_strip_js:alert("XSS")'); }</style>],
    },
    {
      ~s[<style><script>alert('XSS')</script></style>],
      ~s[<style><script>alert('XSS')</script></style>],
    },
    {
      ~s[<style> h1 > a { color: red; } </style>],
      ~s[<style> h1 > a { color: red; } </style>],
    },
    {
      ~s[<],
      ~s[&lt;],
    },
    {
      ~s[>],
      ~s[&gt;],
    },
    {
      ~s[],
      ~s[],
    },
  ]

  def test_cases, do: @test_cases
end

TestCases.test_cases |> Enum.map(fn {ins, _outs} -> Floki.parse(ins) end)

[
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [],
        [{"a", [{"href", "javascript:alert('XSS');"}], ["Click here"]}]}
     ]}
  ],
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [],
        [
          {"a", [{"href", "whatever"}, {"onclick", "alert('XSS');"}],
           ["Click here"]}
        ]}
     ]}
  ],
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [{"onload", "alert('XSS')"}], [{"p", [], ["Hello"]}]}
     ]}
  ],
  [
    {"html", [],
     [
       {"head", [], []},
       {"body", [], [{"img", [{"src", "javascript:alert('XSS');"}], []}]}
     ]}
  ],
  [
    {"html", [],
     [{"head", [], [{"script", [], ["alert('XSS');"]}]}, {"body", [], []}]}
  ],
...
]

Expected behavior

I'd expect the output to match the output of calling this without the html5ever parser, namely, that it'd just be the fragments themselves.

@andyleclair Thank you for opening the issue.

This is a problem that we have because we don't treat parsing fragments as something different, when we should. html5ever parses fragments as full documents because we (Floki) don't distinguish between the two when calling it.

I'm planning to add a Floki.parse_fragment, distinct from the standard Floki.parse, because the HTML spec treats them as different algorithms, and with this we can call the correct functions on html5ever's side.

This should be fixed once I finish the work on the internal parser (#204).

I see that this report got closed. Was there any resolution? We are currently handling the specific case of a fragment wrapped in the default wrapper, but I'd love to tear that code out.
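A workaround of the kind described above might look roughly like the following. This is only a sketch: `FragmentUnwrap` and `unwrap/1` are hypothetical names, not part of Floki, and the clause assumes the wrapper has exactly the shape shown in the output earlier in this report (an attribute-less `html` containing an attribute-less `head` and `body`).

```elixir
defmodule FragmentUnwrap do
  # Hypothetical helper, not part of Floki.
  # When the tree is exactly the default html/head/body wrapper with no
  # attributes on any of the three elements, return the children of
  # head followed by the children of body.
  def unwrap([{"html", [], [{"head", [], head}, {"body", [], body}]}]) do
    head ++ body
  end

  # Anything else (for example a body that carries attributes) is
  # returned unchanged.
  def unwrap(tree), do: tree
end

wrapped = [
  {"html", [],
   [
     {"head", [], []},
     {"body", [], [{"a", [{"href", "whatever"}], ["Click here"]}]}
   ]}
]

IO.inspect(FragmentUnwrap.unwrap(wrapped))
# prints [{"a", [{"href", "whatever"}], ["Click here"]}]
```

Note that the third case in the report, where the input itself contains a `<body onload=...>`, is deliberately left untouched by this sketch, since the wrapper there carries an attribute that came from the input.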

@andyleclair it was not fixed. It's a known issue. I kept the issue fixed in the issues list, but I will leave it open too.

Is this really a problem in Floki? After reading the code, I'm starting to think it comes from html5ever_elixir.