trimWhiteSpace only trims newlines at the start of content

Question

trimWhiteSpace only trims newlines at the start of content

Closed this issue 6 years ago · 5 comments

The documentation for the trimWhiteSpace option of HtmlExtractor indicates:

If set to true, white space at the very beginning and at the end of the content will get removed

The actual implementation, below, only trims newline characters from the start of string, but all whitespace characters from the end.

if (options.trimWhiteSpace) {
   content = content.replace(/^\n+|\s+$/g, '');
}

I believe that should be content.replace(/^\s+|\s+$/g, '').

Answer 1 · 2018-10-19T17:49:19.000Z

I actually implemented it this way on purpose. This way a conflict with the preserveIndentation option is avoided. Let's look at an example:

<p>
    Lorem ipsum
    dolor sit amet
</p>

The raw content of the element is:

\n    Lorem ipsum\n    dolor sit amet\n

With trimWhiteSpace = true and preserveIndentation = false (the default) we will get:

Lorem ipsum\ndolor sit amet

With trimWhiteSpace = true and preserveIndentation = true we get:

    Lorem ipsum\n    dolor sit amet

Now I agree that the documentation doesn't fully describe how the option works.

I would appreciate if you could share your use case with me so I can determine whether it makes sense to release a new version or if it is enough to update the documentation. Thanks!

Answer 2 · 2018-10-29T22:55:17.000Z

I see what you mean. I think it would be nice if it used the current regex or the one I'm recommending above contextually based on whether trimWhiteSpace && preserveIndentation or trimWhiteSpace && !preserveIndentation, but doing that would probably cause a lot of headaches in the current userbase.

Maybe instead just a documentation update to clarify how trimWhitespace behaves?

As for my use case -- I was just trying to generate "clean" message IDs. Once I discovered this quirk, I was able to work around it. The challenge is that I'm consuming my language files inside a web browser and I need my JS to extract message IDs from markup the same way gettext-extractor does so they line up. I was trying to replicate trimWhiteSpace = true in the browser by using theMarkup.trim() and running into issues when the markup started with a tab. Once I encountered this problem, it was simple enough for me to switch to trimWhiteSpace = false and preserveIndentation = true and let both the browser and gettext-extractor use the raw, untouched content.

Answer 3 · 2018-10-30T09:12:19.000Z

Thanks for your answer. If your goal is to generate "clean" message IDs why not just use the same regex for removing indentation during runtime?

I'm still contemplating what to do here and having message ids like this doesn't seem very desirable to me:

Lorem ipsum
    dolor sit
    amet

Answer 4 · 2018-10-30T14:17:41.000Z

Because I expected trimWhiteSpace to behave the same as String.prototype.trim. Once I dug into the code and realized it didn't, I felt it was better to just leave things untouched on both sides. Copying the regex out of gettext-extractor would have worked for me, too.

Answer 5 · 2019-01-04T23:56:44.000Z

Sorry for my late response. I've now added a comment to the documentation that hopefully makes it clear enough that, in combination with preserveIndentation, it doesn't exactly behave like .trim().