Trim surrounding whitespace for Node.Content()
aggrolite opened this issue · 2 comments
Hi,
Does gokogiri currently support trimming leading and trailing whitespace from a Node's Content (text content) before it is returned?
For example, I've used a Perl library in the past that allows me to do something like $node->as_text and $node->as_trimmed_text. The latter function automatically removes surrounding whitespace (s/^\s+|\s+$//g) before returning the node's text content.
If this behavior is not currently built in, I would happily provide a patch. For example, a new method on type Node could be introduced, say ContentTrimmed.
Thanks!
You're better off simply calling strings.TrimSpace on the returned content unless you want to fight with XML whitespace rules and the idiosyncrasies of node aggregation. For example, do you still want to trim if an element is marked xml:space="preserve", or when dealing with CDATA nodes?
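A minimal sketch of that approach, assuming only that the node exposes a Content() string method. The contenter interface, fakeNode type, and trimmedContent helper are hypothetical names used for illustration and are not part of gokogiri:

```go
package main

import (
	"fmt"
	"strings"
)

// contenter is a hypothetical stand-in for any node type that
// exposes its text through a Content() string method.
type contenter interface {
	Content() string
}

// fakeNode is a hypothetical test double, not a gokogiri type.
type fakeNode string

func (n fakeNode) Content() string { return string(n) }

// trimmedContent returns the node's text content with leading and
// trailing whitespace removed. Interior whitespace is left untouched.
func trimmedContent(n contenter) string {
	return strings.TrimSpace(n.Content())
}

func main() {
	n := fakeNode("\n\t  Hello, world  \n")
	fmt.Printf("%q\n", trimmedContent(n))
}
```

Keeping the trim at the call site leaves the decision about xml:space="preserve" and CDATA with the caller, who knows the document format.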
You have to be particularly careful when dealing with mixed-content document formats like HTML or DOCX: leading or trailing whitespace is generally significant, and stripping it out can cause issues. That's why XSLT forces you to use the normalize-space function instead of conveniently doing it for you.
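To illustrate the mixed-content problem, here is a rough sketch that uses the standard library's encoding/xml rather than gokogiri. The joinText helper and the sample document are made up for this example. Trimming each text node separately glues adjacent words together:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// joinText concatenates every character-data token in doc.
// When trimEach is true, each text node is trimmed before it is appended.
func joinText(doc string, trimEach bool) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var sb strings.Builder
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return sb.String(), nil
		}
		if err != nil {
			return "", err
		}
		if cd, ok := tok.(xml.CharData); ok {
			s := string(cd)
			if trimEach {
				s = strings.TrimSpace(s)
			}
			sb.WriteString(s)
		}
	}
}

func main() {
	doc := `<p>Trim <b>with</b> care</p>`

	kept, _ := joinText(doc, false)
	trimmed, _ := joinText(doc, true)

	fmt.Printf("%q\n", kept)    // "Trim with care"
	fmt.Printf("%q\n", trimmed) // "Trimwithcare"
}
```

If collapsing whitespace is really what you want, strings.Join(strings.Fields(s), " ") is a rough Go analogue of normalize-space, with the caveat that strings.Fields splits on all Unicode whitespace rather than only the XML whitespace characters.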
Ah, I did not consider the points you raised. I think it's best if I read up on XML and whitespace before going any further. In the meantime I will use the strings package.
Closing issue, and thanks for the response.