adams85/po

Parsing of \n is system-dependent

MKuckert opened this issue · 7 comments

POString.Decode uses Environment.NewLine to decode the escape sequence \n and is system-dependent as a result: The resulting string (or key to lookup the translation) contains a \n on unix systems and \r\n on windows systems. This can be problematic when the source to create the lookup keys always contains the \n (as in our case).

I suggest to change the TryDecodeEscapeSequence local function in POString.Decode to produce the correct newline. One can add a property to POParserSettings to change the behaviour and be backwards compatible.

See pull request #10 for a fix.

The PR adds a setting to use \n instead of Environment.NewLine. A college of mine suggested to pass the newline string directly to better customize for other scenarios. E.g. one would be able to pass \r\n to utilize the windows encoding even on unix systems.

What do you think?

In PO files line breaks are always represented in a platform-independent way as the \n escape sequence. (At least, my lib builds on this assumption.) Indeed, the parser decodes this to the platform-dependent line break character sequence. This is an internal representation of the data contained in the PO file for your application to work with internally. When it decides to write it back to PO format, the line endings are converted to the platform-independent escape sequences.

This can be problematic when the source to create the lookup keys always contains the \n (as in our case).

If I understand you correctly, you have issues when the key of an entry contains line breaks. Please confirm.

At the moment I don't see why your application cannot lookup such entries. In the case of multi-line entries, it just have to use keys containing platform-dependent line breaks and it will work. At a time your app runs either on Windows or on Linux but not on both.

Could you explain your use case in more detail?

Sure, let me elaborate some more.

We're using Karambolo.Po do read some po-files to display translated content in our mobile apps. Works really good. We like the API and the way the parser and the catalog works.

Our po-files contain some keys containing line breaks. We have another source, some project specific recipe description, for some of our apps content. And this descriptions include some translate texts, including multiline texts.
Those multiline texts are the current pitfall: They definitely use newline chars as line breaks, on all platforms. But the po parser is platform-agnostic and uses the newline chars for the local system. Text lookups are a miss on windows platforms as the line breaks in the lookup key from our recipe description (always \n) and in the po catalog (environment specific, \r\n for windows) differ.

We had code in place to replace every \n in the keys used to lookup translations with Environment.NewLine, when those differ, but that's just band-aid for the underlying problem. The parser could ignore environment specifics and always use a normalized line break, e.g. like \n. And it's creating some unnecessary new string objects just for the lookup.

My PR #10 introduces a new setting to instruct the parser to always parse newlines as \n instead of the environment-specific newline string. This works pretty well, fixes the root cause and feels elegant. The setting is also backwards-compatible as it's not set by default.

FYI, string.Replace is not that bad. It only allocates new string objects if there's at least one replacement:

var str = "A single line text.";
Console.WriteLine(ReferenceEquals(str, str.Replace("\n", Environment.NewLine)));
// prints True

So, if fixing up lookup keys is done only once (that is, not for every single lookup), it's not a big deal IMO. I cannot really think up a situation where this workaround is impossible.

However, I can imagine that in some cases this preliminary fix-up can be pretty inconvenient (e.g. when lookup keys must be compile-time constants). So I'm leaning toward accepting your proposal and add this option to the parser.

Just one more thing. As far as I can see, this change makes sense only for entry key strings (msgid, msgid_plural, msgctxt). Translation strings (msgstr) should always be decoded using Environment.NewLine. When printing these texts in the application, platform-dependent line breaks are expected. Do you agree?

Yes, I agree, the translated string should include the NewLine character – even if \n is OK in my special scenario (the newline is displayed just fine), but that's obviously technology dependent.

I'll revise my PR accordingly.

Giving some more thought to the problem I came to the conclusion that we should be as future-proof as possible regarding this change. Let's enable all the options which seem to make sense. These options would be:

  1. Decode all new line escape sequences to platform-dependent new lines. This should be the default behavior to maintain backward-compatibility.
  2. Decode all new line escape sequences to platform-dependent new lines except for key strings. Key strings are kept platform-independent.
  3. Decode all new line escape sequences to platform-independent new lines.

I don't see much point in supporting the "Decode all new line escape sequences to platform-dependent new lines except for translation strings." case.

So it seems our best bet is an enum called something like POStringNewLineHandling. At least, the name should make clear that this option is about the strings contained in the PO file and not about the line ending format of the PO file itself.

Please update your PR in the light of the above.

@MKuckert I've updated your PR and completed the feature. It will ship with v1.7 which will be available soon on NuGet.

To prevent platform-dependent decoding of \n sequences in your entry keys, use the following setup:

var parserSettings = new POParserSettings
{
    StringDecodingOptions = new POStringDecodingOptions { KeepKeyStringsPlatformIndependent = true }
};

var parser = new POParser(parserSettings);