moovweb/gokogiri

Trim surrounding whitespace for Node.Content()

aggrolite opened this issue · 2 comments

Hi,

Does gokogiri currently support trimming leading and trailing whitespace from a Node's Content (text content) before it is returned?

For example, I've used a Perl library in the past that allows me to do something like $node->as_text and $node->as_trimmed_text. The latter function automatically removes surrounding whitespace (s/^\s+|\s+$//g) before returning the node's text content.

If this behavior is not currently built in, I would happily provide a patch. For example, a new method on type Node could be introduced, say ContentTrimmed().

Thanks!

You're better off simply calling strings.TrimSpace on the returned content unless you want to fight with XML whitespace rules and the idiosyncrasies of node aggregation. For example, do you still want to trim if an element is marked xml:space="preserve", or when dealing with CDATA nodes?
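A minimal sketch of that suggestion, assuming only that the node exposes a Content() string method as in the issue title. The contenter interface, the fakeNode stub, and the TrimmedContent helper are hypothetical names, used here so the example compiles without gokogiri installed:

```go
package main

import "strings"

// contenter is a hypothetical stand-in for any node type that exposes
// its text through a Content() method.
type contenter interface {
	Content() string
}

// fakeNode is a hypothetical stub used in place of a parsed node.
type fakeNode struct{ text string }

func (n fakeNode) Content() string { return n.text }

// TrimmedContent returns the node's text with leading and trailing
// Unicode whitespace removed. Interior whitespace is left untouched.
func TrimmedContent(n contenter) string {
	return strings.TrimSpace(n.Content())
}
```

Keeping the trim in the caller means each call site decides whether stripping is appropriate, instead of the library guessing on behalf of every document type.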

You have to be particularly careful when dealing with mixed-content document formats like HTML or DOCX - leading or trailing whitespace is generally significant and can cause issues if you try to strip it out. That's why XSLT forces you to use the normalize-space function instead of conveniently doing it for you.

Ah, I did not consider the points you raised. I think it's best if I read up on XML and whitespace before going any further. In the meantime I will use the strings package.

Closing issue, and thanks for the response.