philss/floki

traverse_and_update is not applied to text nodes

ppatrzyk opened this issue · 4 comments

Description

When using traverse_and_update, specified function is not applied to text nodes, these are passed as is:

def traverse_and_update(text, acc, _fun) when is_binary(text), do: {text, acc}

To Reproduce

My setup:

  • Ubuntu 20.04
  • Elixir 1.11.2
  • Erlang/OTP 23
  • floki 0.30.0
  • fast_html 2.0

config.exs:

config :floki,
  html_parser: Floki.HTMLParser.FastHtml
> html = """
<body>
    <div></div>
    <div>
         
    </div>
</body>
"""
"<body>\n    <div></div>\n    <div>\n         \n    </div>\n</body>\n"

> {:ok, doc} = Floki.parse_document(html)
{:ok,
 [
   {"html", [],
    [
      {"head", [], []},
      {"body", [],
       [
         "\n    ",
         {"div", [], []},
         "\n    ",
         {"div", [], ["\n         \n    "]},
         "\n\n"
       ]}
    ]}
 ]}

> Floki.traverse_and_update(doc, fn
  text when is_binary(text) ->
    case Regex.replace(~r/\n|[[:space:]]/, text, "") == "" do
      true -> nil
      false -> text
    end
  other -> other
end)
[
  {"html", [],
   [
     {"head", [], []},
     {"body", [],
      [
        "\n    ",
        {"div", [], []},
        "\n    ",
        {"div", [], ["\n         \n    "]},
        "\n\n"
      ]}
   ]}
]

Expected behavior

I'd like to remove all empty text nodes from given document, i.e. to get the following:

[
   {"html", [],
    [{"head", [], []}, {"body", [], [{"div", [], []}, {"div", [], []}]}]}
 ]

Is the current behavior a bug or is there some reason why it was designed like that? Alternatively, is there a better way to achieve what I want here? Thank you for any suggestions!

maybe as one additional comment: I'm aware that others have exactly the opposite needs (#75) and that I can get ok results automatically when using Floki.HTMLParser.Mochiweb, I'm curious if it's possible to go with fast_html in my case

@ppatrzyk thanks for opening the issue! 💜

I don't think this is a bug. I would say it was a decision from the time we added the feature, since you could update text nodes when capturing the HTML tags (there is an example below). Unfortunately this won't work for nodes that are in the root of the tree :/

The addition of this feature would be a breaking change, so I'm going to think about it.

For your case, you could do this:

Floki.traverse_and_update(doc, fn
  {tag, attrs, children} ->
    {tag, attrs,
     Enum.reject(children, fn child ->
       is_binary(child) && Regex.replace(~r/\n|[[:space:]]/, child, "") == ""
     end)}

  other ->
    other
end)

WDYT?

@philss thanks for your reply!

I'll adapt my code such that it works, thanks for your suggestion.

At minimum what would be helpful is mentioning this behavior in the documentation which currently reads:

This function returns a new tree structure that is the result of applying the given fun on all nodes

(emphasis mine)

@ppatrzyk good point! I changed in fd88a28. Thanks!