Umlauts and € symbol get mangled
delexi opened this issue · 16 comments
I am trying to convert my ledger file to beancount syntax and I cannot get umlauts and €-symbols to work correctly.
This is my input ledger file:
2020/11/13 * Täst
    Expenses:Draußen  €5.00
    Assets:Bar
This is the config file for ledger2beancount:
date_format: "%Y/%m/%d"
date_format_no_year: "%m/%d"
account_open_date: "2010-03-01"
commodities_date: "2010-03-01"
beancount_indent: 2
ledger_indent: 4
operating_currencies:
  - EUR
commodity_map:
  "$": USD
  "£": GBP
  "€": EUR
This is the output beancount file:
;----------------------------------------------------------------------
; ledger2beancount conversion notes:
;
; - Account Expenses:Draußen renamed to Expenses:Drau-�en
; - Commodity € renamed to OE1-4
;----------------------------------------------------------------------
option "operating_currency" "EUR"
2010-03-01 open Assets:Bar
2010-03-01 open Expenses:Drau-�en
2010-03-01 commodity OE1-4
2020-11-13 * "Täst"
  Expenses:Drau-�en  5.00 OE1-4
  Assets:Bar
I expected the output to not mangle the "ß" and not to rename the "€" symbol. I am on Windows running the script with Strawberry Perl. Does anyone have any pointer as to what I could be doing wrong here?
This looks like an encoding issue. What locale do you use? Is it not UTF-8?
And does make test work for you?
Running make test produces the error shown below. (It stumbles over the ./ syntax, presumably because I am on Windows.) It seems to me, like the test suite is also targeted at Linux / Mac OS only, so this would not run anyways. I will try to see what I can do with Cygwin or WSL or somesuch.
> make test
cd tests && ./runtests
The command "." is either misspelled or could not be found.
make: *** [Makefile:13: test-stamp] Error 1
This looks like an encoding issue. What locale do you use? Is it not UTF-8?
Which encoding do you mean? The input ledger file is UTF-8 encoded. I have no perl knowledge, do I have to set something in my perl environment? It is also interesting that the "ä" in "Täst" is handled correctly.
Which encoding do you mean? The input ledger file is UTF-8 encoded
What about your environment ("locale" in Linux).
It is also interesting that the "ä" in "Täst" is handled correctly
Yeah, I noticed that. The original account name is also displayed correctly but ledger2beancount thinks it's not a valid account, therefore it maps it to something else.
like the test suite is also targeted at Linux / Mac OS only
Yeah, I don't know much about Windows, but the test suite works on Windows with GitHub Actions, although of course that's a Windows with a lot of Unix tools: https://github.com/beancount/ledger2beancount/runs/1373888694?check_suite_focus=true
Can you edit ledger2beancount and comment out this line. It's on around line 1170.
$account =~ s/[^\p{letter}\p{number}:-]/-/g; # Replace disallowed characters
Just change it to
# $account =~ s/[^\p{letter}\p{number}:-]/-/g; # Replace disallowed characters
(i.e. add a # at the beginning of the line)
I just want to confirm that this is causing the replacement.
That still doesn't answer the encoding issue though, I think.
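For readers following along, the mangled account name can be reproduced outside Perl. The following is a minimal Python sketch, not ledger2beancount's code. It assumes the Windows console code page was CP850; the thread never states this, and it is inferred only because it yields exactly the reported Drau-�en. The helper name sanitize_account is hypothetical and only approximates the Perl substitution quoted above.

```python
def sanitize_account(name: str) -> str:
    # Rough stand-in for the Perl substitution quoted above (hypothetical helper):
    # keep letters, numbers, ':' and '-', replace everything else with '-'.
    return "".join(
        c if (c.isalpha() or c.isnumeric() or c in ":-") else "-"
        for c in name
    )

raw = "Expenses:Draußen".encode("utf-8")   # bytes as stored in the UTF-8 ledger file
misread = raw.decode("cp850")              # ASSUMPTION: input is decoded as code page 850
sanitized = sanitize_account(misread)      # the box-drawing character is not a letter
written = sanitized.encode("cp850")        # output is encoded with the same code page
shown = written.decode("utf-8", errors="replace")  # how a UTF-8 viewer renders it

print(shown)                                 # Expenses:Drau-�en
print(sanitize_account("Expenses:Draußen"))  # Expenses:Draußen (correctly decoded input is untouched)
```

Under that assumption, the first UTF-8 byte of "ß" is read as a box-drawing character and replaced, while the second byte is read as a letter and survives as a stray byte. Payees such as "Täst" are not passed through this substitution, which would explain why the "ä" round-trips unharmed.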
Can you add your testcase to a ZIP file and upload somewhere (I think you can attach ZIP files to this issue but not sure). Just in case it shows up on other platforms (i.e. the file isn't really UTF-8 or there's something weird about it). I don't think that's the case but might be worth a try.
Which encoding do you mean? The input ledger file is UTF-8 encoded
What about your environment ("locale" in Linux).
It is set to "Deutsch (Deutsch)" which should be de_DE.
Can you edit ledger2beancount and comment out this line. It's on around line 1170.
[...]
I just want to confirm that this is causing the replacement.
I commented out the line you mentioned in the /bin/ledger2beancount script and this fixes the Account name. Now the "ß" is left untouched in the account name. The € symbol, however, is still mangled as before (which was expected, I assume).
Can you add your testcase to a ZIP file and upload somewhere (I think you can attach ZIP files to this issue but not sure). Just in case it shows up on other platforms (i.e. the file isn't really UTF-8 or there's something weird about it). I don't think that's the case but might be worth a try.
Here you go:
l2b_issue_236.zip
Please let me know, if there is something missing. The commented line has since been uncommented again.
delexi:
It seems to me, like the test suite is also targeted at Linux / Mac OS only, so this would not run anyways. I will try to see what I can do with Cygwin or WSL or somesuch.
I finally got WSL to work. There the conversion works as expected (umlauts and symbols convert OK). This is the result of running make test under WSL:
$ make test
cd tests && ./runtests
Skipping ledger validation checks since ledger is not installed
Skipping hledger validation checks since hledger is not installed
Converting accounts.ledger... ok
Converting amounts-decimal-comma.ledger... ok
Converting amounts.ledger... ok
Converting aux-date.ledger... ok
Converting balance-assertion.ledger... ok
Converting bug214.ledger... ok
Converting code.ledger... ok
Converting comments.ledger... ok
Converting commodities.ledger... ok
Converting dates-month.ledger... ok
Converting dates.ledger... ok
Converting directives.ledger... ok
Converting fixated.ledger... ok
Converting flags.ledger... ok
Converting ignore.ledger... ok
Converting include1.ledger... ok
Converting include2.ledger... ok
Converting include3.ledger... ok
Converting lots.ledger... ok
Converting metadata.ledger... ok
Converting narration.ledger... ok
Converting no-config.ledger... ok
Converting non-standard-account-root.ledger... ok
Converting payee.ledger... ok
Converting prices.ledger... ok
Converting spacing.ledger... ok
Converting tags.ledger... ok
Converting transactions.ledger... ok
Converting virtual-postings.ledger... ok
Converting hledger.hledger... ok
Validating accounts.beancount... ok
Validating amounts-decimal-comma.beancount... ok
Validating amounts.beancount... ok
Validating aux-date.beancount... ok
Validating balance-assertion.beancount... ok
Validating bug214.beancount... ok
Validating code.beancount... ok
Validating comments.beancount... ok
Validating commodities.beancount... ok
Validating dates-month.beancount... ok
Validating dates.beancount... ok
Validating directives.beancount... ok
Validating fixated.beancount... ok
Validating flags.beancount... ok
Validating hledger.beancount... ok
Validating ignore.beancount... ok
Validating include1.beancount... ok
Validating include2.beancount... ok
Validating include3.beancount... ok
Validating lots.beancount... ok
Validating metadata.beancount... ok
Validating narration.beancount... ok
Validating no-config.beancount... ok
Validating non-standard-account-root.beancount... ok
Validating payee.beancount... ok
Validating prices.beancount... ok
Validating spacing.beancount... ok
Validating tags.beancount... ok
Validating transactions.beancount... ok
Validating virtual-postings.beancount... ok
touch test-stamp
I found something which makes the script output the correct text under the windows/perl environment. I added
use open ':encoding(utf8)';
near the top of the ledger2beancount script. I have no idea what this means or why it works (again no perl knowledge), I just got lucky with a combination of SO and google. Is that something you think is worth adding to the script?
Thanks @delexi for taking the time to investigate. It's also interesting to hear that it works with WSL but not without. (I used to test on Windows with AppVeyor in the past but that broke at some point and that wasn't "pure" Windows either.)
Anyway, there are two encoding issues:
- We always output a beancount file in the user encoding. But we recently learned that beancount files have to be UTF-8: see beancount/beancount2ledger#26
- I wanted to assume UTF-8 input in 2018 but @zacchiro was opposed:
15:44 <zack> re #54, precisely because it's 2018 i don't think we should impose any locale to anyone, there are places where you need UTF-16...
15:44 <zack> UTF-8 is mostly west world only, and i can assure you in france many configurations still do iso-8859 :)
15:45 <zack> but really, i think we can easily be encoding-agnostic here
I just looked at this again and I don't see a good solution. Either we add use open qw(:std :utf8) to the script, which would fix this bug and also the output issue, or we read the input stream as binary, try to guess the encoding (with Encode::Guess or Encode::Detect), and if that fails, use the user locale. But yuck...
@zacchiro do you have any idea how to solve this properly?
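The second option (read as binary, guess, fall back to the locale) could look roughly like the sketch below. It is written in Python rather than Perl, and it swaps Encode::Guess / Encode::Detect for a much cruder heuristic: try strict UTF-8 first, then fall back to the user's locale encoding. The function name decode_ledger_bytes and its fallback parameter are hypothetical.

```python
import locale
from typing import Optional, Tuple

def decode_ledger_bytes(data: bytes, fallback: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used) for raw ledger input (illustrative only)."""
    try:
        # Strict decoding: any invalid byte sequence raises UnicodeDecodeError.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: use the caller's choice or the user's locale encoding.
        enc = fallback or locale.getpreferredencoding(False)
        return data.decode(enc, errors="replace"), enc

text, used = decode_ledger_bytes("Draußen".encode("utf-8"))
text2, used2 = decode_ledger_bytes("Draußen".encode("iso-8859-1"), fallback="iso-8859-1")
print(used, used2)
```

The heuristic leans on the fact that text in a legacy 8-bit encoding containing non-ASCII characters is rarely valid UTF-8 by accident. It cannot tell two legacy encodings apart, which is part of why this route is unattractive.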
I found something which makes the script output the correct text under the windows/perl environment.
Sadly it seems I have spoken too soon. The "fix" I found outputs the contents correctly to the terminal, but when redirected to a file, the umlauts still get mangled in account names. The € symbol is fine, however.
15:44 <zack> re #54, precisely because it's 2018 i don't think we should impose any locale to anyone, there are places where you need UTF-16...
15:44 <zack> UTF-8 is mostly west world only, and i can assure you in france many configurations still do iso-8859 :)
15:45 <zack> but really, i think we can easily be encoding-agnostic here
I just looked at this again and I don't see a good solution. Either we add use open qw(:std :utf8) to the script, which would fix this bug and also the output issue, or we read the input stream as binary, try to guess the encoding (with Encode::Guess or Encode::Detect), and if that fails, use the user locale. But yuck...
@zacchiro do you have any idea how to solve this properly?
At the time I wasn't aware that beancount tools de facto expect input files to be UTF-8.
Given that's the case, I think the "proper" solution is to just assume that the ledger input is also UTF-8 and treat it as such, also outputting UTF-8 as a rule.
We can document this limitation and suggest that users whose ledger input is not UTF-8 should transcode it first (e.g., using iconv) before passing it to ledger2beancount.
It seems to me to be the KISS solution here.
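The transcoding step suggested above would normally be done with iconv on the command line. For anyone without iconv at hand (for instance on plain Windows), a minimal Python sketch of the same idea follows. The function name transcode_to_utf8, the file names, and the ISO-8859-1 source encoding are illustrative assumptions.

```python
import os
import tempfile

def transcode_to_utf8(src_path: str, dst_path: str, src_encoding: str) -> None:
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Demonstration on a throwaway file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    legacy = os.path.join(tmp, "legacy.ledger")
    converted = os.path.join(tmp, "utf8.ledger")
    with open(legacy, "wb") as f:
        f.write("2020/11/13 * Täst\n".encode("iso-8859-1"))
    transcode_to_utf8(legacy, converted, "iso-8859-1")
    with open(converted, "rb") as f:
        result = f.read()

print(result.decode("utf-8"), end="")
```

The source encoding has to be known and passed explicitly; that is the trade-off of the KISS approach compared with guessing.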
OK. It seems ledger2beancount is not at fault here after all. The console under Windows was the problem. Once I set its code page to UTF-8, everything works smoothly, even without my use open ':encoding(utf8)'; "fix".
Here is a nice explanation of what is happening and how to set the code page in CMD under Windows: https://keepass.info/help/kb/console_encoding.html
At the time I wasn't aware that beancount tools de facto expect input files to be UTF-8.
Sorry, re-reading my message I came across a bit as "I knew all along and now I was proven right" when as you say we recently obtained new information that influences the decision. (And you raised very good points when this came up two years ago!)
Anyway, I'll document the Windows code page issue (thanks for the link @delexi!) and ensure the output is UTF-8.