trimWhiteSpace only trims newlines at the start of content
Closed this issue · 5 comments
The documentation for the trimWhiteSpace
option of HtmlExtractor indicates:
If set to true, white space at the very beginning and at the end of the content will get removed
The actual implementation, below, only trims newline characters from the start of string, but all whitespace characters from the end.
if (options.trimWhiteSpace) { content = content.replace(/^\n+|\s+$/g, ''); }
I believe that should be content.replace(/^\s+|\s+$/g, '')
.
I actually implemented it this way on purpose. This way a conflict with the preserveIndentation
option is avoided. Let's look at an example:
<p>
Lorem ipsum
dolor sit amet
</p>
The raw content of the element is:
\n Lorem ipsum\n dolor sit amet\n
With trimWhiteSpace = true
and preserveIndentation = false
(the default) we will get:
Lorem ipsum\ndolor sit amet
With trimWhiteSpace = true
and preserveIndentation = true
we get:
Lorem ipsum\n dolor sit amet
Now I agree that the documentation doesn't fully describe how the option works.
I would appreciate if you could share your use case with me so I can determine whether it makes sense to release a new version or if it is enough to update the documentation. Thanks!
I see what you mean. I think it would be nice if it used the current regex or the one I'm recommending above contextually based on whether trimWhiteSpace && preserveIndentation
or trimWhiteSpace && !preserveIndentation
, but doing that would probably cause a lot of headaches in the current userbase.
Maybe instead just a documentation update to clarify how trimWhitespace
behaves?
As for my use case -- I was just trying to generate "clean" message IDs. Once I discovered this quirk, I was able to work around it. The challenge is that I'm consuming my language files inside a web browser and I need my JS to extract message IDs from markup the same way gettext-extractor
does so they line up. I was trying to replicate trimWhiteSpace = true
in the browser by using theMarkup.trim()
and running into issues when the markup started with a tab. Once I encountered this problem, it was simple enough for me to switch to trimWhiteSpace = false
and preserveIndentation = true
and let both the browser and gettext-extractor
use the raw, untouched content.
Thanks for your answer. If your goal is to generate "clean" message IDs why not just use the same regex for removing indentation during runtime?
I'm still contemplating what to do here and having message ids like this doesn't seem very desirable to me:
Lorem ipsum
dolor sit
amet
Because I expected trimWhiteSpace
to behave the same as String.prototype.trim
. Once I dug into the code and realized it didn't, I felt it was better to just leave things untouched on both sides. Copying the regex out of gettext-extractor would have worked for me, too.
Sorry for my late response. I've now added a comment to the documentation that hopefully makes it clear enough that, in combination with preserveIndentation
, it doesn't exactly behave like .trim()
.