UTF-8 implementation and umlaut conversion

Question

xanathon opened this issue 3 years ago · 3 comments

Hello,

first let me thank you for your brilliant library that saved me a lot of time!

I just noticed that in your "implementation vs guidelines" you say that details about UTF-8 are missing from the official documentation.
You can take a look at this official PDF, which has more details about UTF-8 (page 24):

https://www.paymentstandards.ch/dam/downloads/ig-qr-bill-de.pdf

It's in German, but judging from your wiki I hope that's no problem.

There it says that the QR code is in UTF-8 but only uses the Latin character set. I interpreted that as the first 128 characters of UTF-8, which are identical to ANSI.

Is my conclusion correct that I need to provide the data without special characters and umlauts? Or are there conversion methods in your library that I cannot find in the JavaDocs?

Thanks!

Answer 1 · 2021-07-29T13:29:56.000Z

It looks as if chapter 4.1.1 mixes up two separate concepts: character set (the allowed set of characters) and encoding (the binary representation of characters). Unicode is a character set, UTF-8 is an encoding. ISO-8859-1 (also known as Latin-1) is both.
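
To illustrate the difference, here is a minimal sketch using only the Java standard library. The class and method names are made up for this example. The same text (the same characters) produces different byte sequences depending on the encoding:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the library discussed here.
final class EncodingSketch {

    // Number of bytes needed for the text in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed for the text in ISO-8859-1 (Latin-1).
    static int latin1Length(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String text = "Z\u00fcrich"; // "Zürich": 6 characters
        System.out.println(utf8Length(text));   // 7, the umlaut takes 2 bytes
        System.out.println(latin1Length(text)); // 6, the umlaut takes 1 byte
    }
}
```

The characters are identical in both cases; only their binary representation differs.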

The bill data for the QR code is a text with a well-specified, restricted set of characters. The character set is neither Unicode nor ISO-8859-1 nor ANSI, but happens to be a subset of each of them.

The encoding (binary representation) could be handled at one of two levels:

1. Either the text is encoded with a given encoding and the binary result is embedded in the QR code;
2. or the QR code takes care of the text encoding.

The QR code standard supports several encoding modes such as numeric, alphanumeric, Kanji and binary. The data can also be split into several segments, and each segment can use a different encoding mode.
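
As a rough sketch of how the modes differ, the following hypothetical helper picks the narrowest mode that can represent a whole text. It is not how this library or any particular QR code generator works; it ignores Kanji mode and does not split the text into segments. The alphanumeric set (digits, uppercase letters, space and `$%*+-./:`) is the one defined by the QR code standard.

```java
// Hypothetical helper for illustration only.
final class QrModeSketch {

    enum Mode { NUMERIC, ALPHANUMERIC, BYTE }

    // Characters representable in QR alphanumeric mode.
    private static final String ALPHANUMERIC_CHARS =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:";

    // Returns the narrowest mode able to hold the entire text.
    // An empty text is reported as NUMERIC.
    static Mode narrowestMode(String text) {
        boolean numeric = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (ALPHANUMERIC_CHARS.indexOf(c) < 0) {
                return Mode.BYTE; // e.g. lowercase letters or umlauts
            }
            if (c < '0' || c > '9') {
                numeric = false;
            }
        }
        return numeric ? Mode.NUMERIC : Mode.ALPHANUMERIC;
    }

    public static void main(String[] args) {
        System.out.println(narrowestMode("12345"));       // NUMERIC
        System.out.println(narrowestMode("CHF"));         // ALPHANUMERIC
        System.out.println(narrowestMode("Z\u00fcrich")); // BYTE
    }
}
```

Only text that falls into byte mode needs a text encoding such as UTF-8 at all; numeric and alphanumeric segments are packed by the QR code itself.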

So should the text be encoded at level 1 (a single binary segment) or at level 2 (one or more numeric, alphanumeric or binary segments)? The specification doesn't say.

It could be that the specification wants to tell us that the text should be encoded in UTF-8 and then added as a binary segment to the QR code (level 1). But in practice (real QR bills), all variations of level 2 work as well. And it's even more confusing: encoding at level 1 and 2 can result in the same QR code.

Most likely, the specification authors didn't fully understand the distinction.

Regarding the data you specify: You can either specify data within the allowed character set (which includes umlauts) or you can specify any text without restriction. In the latter case, invalid characters are replaced and the validation result will contain warnings.

This is mentioned in the wiki but is probably not obvious from the JavaDoc:
https://github.com/manuelbl/SwissQRBill/wiki/Bill-data-validation#data-modifications-with-warnings

Answer 2 · 2021-07-29T13:52:51.000Z

Thank you for the quick and detailed explanation. Since we are talking about Switzerland here, there are not only umlauts but also diacritics from other languages such as French.

I'll try to convert them to Basic Latin via the usual Java conversion routines that e.g. convert í to i to avoid validation warnings.
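
A minimal sketch of such a conversion, assuming the routine meant here is `java.text.Normalizer` followed by removal of combining marks (the class and method names are made up for this example):

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only.
final class DiacriticsSketch {

    // Decomposes accented letters (NFD) and then drops the combining marks,
    // leaving only the base letters.
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("\u00ed"));      // i
        System.out.println(stripDiacritics("Gen\u00e8ve")); // Geneve
    }
}
```

Note that this also turns "Zürich" into "Zurich", and that letters without a canonical decomposition, such as ß or ø, pass through unchanged.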

Answer 3 · 2021-07-29T14:06:02.000Z

You are probably better off not converting anything and simply ignoring the warnings. If you convert the data yourself, you will either write the same code that's already present in the library (good case) or code with fewer features (e.g. missing normalization, leading to unnecessarily removed characters).
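
To illustrate the normalization point, here is a sketch with a made-up allowed set and a made-up filter that simply drops disallowed characters. It is not the library's actual code or character set. An "ü" that arrives in decomposed form (u followed by the combining diaeresis U+0308) loses its diaeresis unless the text is normalized to NFC first:

```java
import java.text.Normalizer;

// Hypothetical filter for illustration only.
final class NormalizeFirstSketch {

    // Assumed allowed set: printable ASCII plus a few precomposed letters.
    private static final String EXTRA_ALLOWED =
            "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00e9\u00e8\u00e0";

    static boolean isAllowed(char c) {
        return (c >= 0x20 && c <= 0x7e) || EXTRA_ALLOWED.indexOf(c) >= 0;
    }

    // Drops every character that is not in the allowed set.
    static String filter(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isAllowed(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Composes characters (NFC) before filtering so that a decomposed
    // umlaut becomes a single allowed character.
    static String normalizeThenFilter(String text) {
        return filter(Normalizer.normalize(text, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String decomposed = "Zu\u0308rich"; // u + combining diaeresis
        System.out.println(filter(decomposed));              // Zurich
        System.out.println(normalizeThenFilter(decomposed)); // Zürich
    }
}
```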