elixir-gettext/expo

Support multiline msgid and msgstr

Closed this issue · 7 comments

I believe the current Gettext parser (and the Gettext 'standard') supports multiline messages. Currently Expo does not parse them:

iex(4)> Expo.Parser.Po.parse """
...(4)> msgid "hello
...(4)> beautiful"
...(4)> msgstr "ciao
...(4)> bella"
...(4)> """
{:error,
 {:parse_error, "did not expect newline inside string",
  "\nbeautiful\"\nmsgstr \"ciao\nbella\"\n", 1}}

This matters when parsing ICU-formatted messages, which are very often multiline.

@kipcole9 Hm. I'm not sure that is correct. Given the following file, msgfmt rejects the strings that span lines:

msgid ""
msgstr ""
"Project-Id-Version: \n"
"POT-Creation-Date: \n"
"PO-Revision-Date: \n"
"Last-Translator: \n"
"Language-Team: \n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"X-Generator: Poedit 3.0.1\n"

msgid "test
two lines"
msgstr "test message
two lines"

$ msgfmt -c test.po
test.po:16: end-of-line within string
test.po:15: missing 'msgstr' section
test.po:16:4: syntax error
test.po:16: keyword "lines" unknown
test.po:17: end-of-line within string
test.po:18: end-of-line within string
msgfmt: found 6 fatal errors

Making the Expo parser accept newlines inside strings would be rather simple though:

diff --git a/lib/expo/parser/po.ex b/lib/expo/parser/po.ex
index bcbf565..0c9f499 100644
--- a/lib/expo/parser/po.ex
+++ b/lib/expo/parser/po.ex
@@ -42,8 +42,7 @@ defmodule Expo.Parser.Po do
 
   defcombinatorp :string,
                  parsec(:double_quote)
-                 |> repeat(choice([parsec(:escaped_char), utf8_char(not: ?", not: ?\n)]))
-                 |> label(lookahead_not(parsec(:newline)), "newline inside string")
+                 |> repeat(choice([parsec(:escaped_char), utf8_char(not: ?")]))
                  |> concat(parsec(:double_quote))
                  |> reduce(:to_string)
 
diff --git a/test/expo/parser/po_test.exs b/test/expo/parser/po_test.exs
index 32d64bc..23f99ea 100644
--- a/test/expo/parser/po_test.exs
+++ b/test/expo/parser/po_test.exs
@@ -72,6 +72,21 @@ defmodule Expo.Parser.PoTest do
              """)
   end
 
+  test "parse/1 with strings spanning multiple lines" do
+    assert {:ok,
+            %Translations{
+              translations: [
+                %Translation.Singular{msgid: ["hello\nworld"], msgstr: ["ciao\nmondo"]}
+              ]
+            }} =
+             Po.parse("""
+             msgid "hello
+             world"
+             msgstr "ciao
+             mondo"
+             """)
+  end
+
   test "parse/1 with multiple translations" do
     assert {:ok,
             %Translations{

I apologise, I believe you are correct. The GNU gettext docs describe how a multiline string should be broken up. The example in the linked docs is:

msgid ""
"\n"
"\n"
"Hello,\n"
"world!\n"
"\n"
"\n"

This appears to parse correctly (as long as the \n is escaped as \\n in the string before parsing).
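To make the convention from the docs concrete, here is a minimal sketch of the write side: taking a single string with embedded newlines and emitting one quoted literal per line. `PoSplitSketch` and `render/2` are hypothetical names used for illustration only; this is not Expo's API, and the escaping covers only backslash, double quote, newline and tab.

```elixir
defmodule PoSplitSketch do
  # Hypothetical helper, NOT part of Expo: renders one PO field
  # (e.g. "msgid") from a string that may contain newlines.
  def render(keyword, string) do
    case split_lines(string) do
      # Empty string: a single empty literal.
      [] ->
        keyword <> " \"\"\n"

      # No line break inside the string: one quoted literal is enough.
      [single] ->
        keyword <> " " <> quote_literal(single) <> "\n"

      # Several lines: start with an empty literal, then one literal per line.
      chunks ->
        Enum.join([keyword <> " \"\"" | Enum.map(chunks, &quote_literal/1)], "\n") <> "\n"
    end
  end

  # Splits after each newline so every chunk keeps its own "\n".
  defp split_lines(string) do
    {init, [last]} = string |> String.split("\n") |> Enum.split(-1)
    chunks = Enum.map(init, &(&1 <> "\n"))
    if last == "", do: chunks, else: chunks ++ [last]
  end

  defp quote_literal(chunk), do: "\"" <> escape(chunk) <> "\""

  # Backslash must be escaped first so later escapes are not doubled.
  defp escape(chunk) do
    chunk
    |> String.replace("\\", "\\\\")
    |> String.replace("\"", "\\\"")
    |> String.replace("\n", "\\n")
    |> String.replace("\t", "\\t")
  end
end
```

With this sketch, `PoSplitSketch.render("msgid", "hello\nbeautiful")` produces three lines: `msgid ""`, `"hello\n"` and `"beautiful"`, which is the shape msgfmt accepts.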

Also noting this issue from 2019.

I suppose the question is: when presented with a message string with embedded newlines, who is responsible for the canonical storage format? My assumption is that this is the responsibility of the storage provider (in this case Expo). But I don't know how a single string with embedded newlines could round trip with Expo at the moment.
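As a rough illustration of the read side of that round trip, the sketch below joins adjacent quoted literals back into one string with embedded newlines. `PoJoinSketch` and `join/1` are hypothetical names, not Expo functions, and the unescaping only handles `\n`, `\t`, `\"` and `\\`. Whether the original line splitting is kept or discarded after joining is exactly the storage question raised above; this sketch discards it.

```elixir
defmodule PoJoinSketch do
  # Hypothetical helper, NOT part of Expo: takes the quoted literals that
  # follow a keyword in a .po file and returns the single logical string.
  def join(literals) do
    literals
    |> Enum.map(&unquote_literal/1)
    |> Enum.join()
  end

  # Strips the surrounding double quotes, then resolves escape sequences.
  defp unquote_literal(literal) do
    [_, body] = Regex.run(~r/\A"(.*)"\z/s, String.trim(literal))
    unescape(body)
  end

  # Handles one backslash sequence at a time, left to right.
  defp unescape(body) do
    Regex.replace(~r/\\(.)/s, body, fn
      _, "n" -> "\n"
      _, "t" -> "\t"
      # Covers \" and \\ by dropping the leading backslash.
      _, other -> other
    end)
  end
end
```

For example, `PoJoinSketch.join(["\"\"", "\"Hello,\\n\"", "\"world!\\n\""])` returns `"Hello,\nworld!\n"`.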

Moved into its own issue: #20

Right now, .po parse & compose should preserve all multiline string splitting.

The .mo format does not have multiline strings, so every translation is a single line on parse and is written as a single line.
