What is the purpose of `forceUtf8`?

Question

What is the purpose of `forceUtf8`?

Shane32 opened this issue 6 months ago · 4 comments

Question

When generating a QR code with QRCodeGenerator.CreateQrCode, what is the purpose of forceUtf8 versus eciMode: EciMode.Utf8?

I would expect that specifying EciMode.Utf8 would force the generated code to use the UTF-8 character set when encoding the data. But in fact, it does so only when either (a) there are non-ISO-8859-1 characters present, or (b) forceUtf8 is true.

Consider the following:

// produces an unreadable code:
var qrData = new QRCodeGenerator().CreateQrCode("https://en.wikipedia.org/wiki/È", ECCLevel.L, eciMode: EciMode.Utf8);
// works fine:
var qrData = new QRCodeGenerator().CreateQrCode("https://en.wikipedia.org/wiki/È", ECCLevel.L, forceUtf8: true, eciMode: EciMode.Utf8);

// whereas this produces identical QR codes:
var qrData = new QRCodeGenerator().CreateQrCode("https://en.wikipedia.org/wiki/🍕", ECCLevel.L, eciMode: EciMode.Utf8);
var qrData = new QRCodeGenerator().CreateQrCode("https://en.wikipedia.org/wiki/🍕", ECCLevel.L, forceUtf8: true, eciMode: EciMode.Utf8);

A similar phenomenon occurs when specifying EciMode.Iso8859_2 with specific strings -- interestingly, forceUtf8 must be true for the code to function correctly.

If we assume that this is a bug, and EciMode.Utf8 should always encode with UTF-8, then what is the purpose for the forceUtf8 argument? My best guess is that it is some compatibility mode to use UTF-8 encoding when encoded with EciMode.Default (which normally encodes as ISO-8859-1).

Suggestion

I suggest that:

When specifying a EciMode besides EciMode.Default, the text is always encoded in the specified encoding.
For EciMode.Default with forceUtf8 == false, it uses ISO-8859-1 (per spec, EciMode.Iso8859_1 is default)
For EciMode.Default with forceUtf8 == true, it uses UTF-8 (against spec)

Spec

From ISO spec page 20:

The default interpretation for QR Code is ECI 000003 representing the ISO/IEC 8859-1 character set.

Answer 1 · 2024-05-02T07:40:06.000Z

Time for a little history lesson. Unfortunately, the question is not quite so easy to answer...

Chapter 1 (the year is 2013)

In the original implementation of the QRCoder, the CreateQrCode method had neither a forceUtf8 nor an eciMode parameter. The idea was to keep the interface as simple as possible, so that the QRCoder always selects the best ECI mode independently. (“The best” is defined as “the encoding with which the smallest QR code is generated or the most user data can be stored”).
Here I think I follow the ISO standard, which says: “If you only have numeric characters, encode as numeric. If you also have letters, use Alphanumeric, ...”

Chapter 2 (the year is 2016)

Unfortunately, the world is not always that simple and some QRCoder readers always expected the user data in ByteMode. If data came in ISO 8859-1 format, these readers displayed special characters incorrectly. (In my opinion, the fault lies with the readers, which do not fully map the ISO standard).
In order not to stubbornly insist on the standard and to give users of QRCoder the choice, I have introduced the parameter forceUtf8. (With the explicit note that this parameter breaks with the standard!)
This was done in:

#43

Chapter 3 (the year is 2018)

For mapping SlovenianUpn QR codes, it is necessary to encode QR codes in ISO 8859-2 mode. (This is a requirement from the UPN QR Form Standard)
However, the automatic ECI mode selection in the QRCoder has not yet taken this encoding into account. Although the QR code standard allows the encoding, it lists it as an encoding that should be set specifically at the user's request.
The eciMode parameter was introduced to implement the UPN QR code. This was done in:

#111

Chapter 4 (back to the here and now)

As you noted @Shane32, the forceUtf8 is actually obsolete. I can no longer say exactly why I didn't remove it in 2018. Either I simply didn't correctly recognize the functional duplication as such at the time or, as is so often the case, I was worried that removing the parameter would result in a breaking change that could annoy our loyal user base.

Long story short: The forceUtf8 parameter is a relic of old times and can probably be removed. However, this should be done carefully. (Possibly a legacy overload that still accepts the parameter and sets the ECI mode in the background. The overload can be marked as obsolete at the same time and then removed completely with version 2.0).

Or do you think we should make a hard cut here and remove the parameter completely?

Answer 2 · 2024-05-02T13:29:57.000Z

I want to think about it some more, but generally I'm thinking:

Don't make any (or few) changes for v1.x. (Consider if any changes are straight bugfixes or are likely to affect legacy code.)
For v2, make the optional parameters required and mark the method as obsolete. This retains binary compatibility, not source compatibility, for users. Include compatibility code within this method in some fashion.
For v2, match the QR Code spec more closely, and also rely on the Payload class's EciMode property to select the default encoding. If we need to add additional properties to Payload, such as ForceUtf8 then do so. So for SlovenianUpn QR codes and similar, it is all handled by properties set on Payload.
Configure default generation of QR codes to be based on modern technologies (e.g. assume UTF-8 support), with the assumption that some payload generators will select a specific target.

I'll have to think about it some more when I have additional time.

Answer 3 · 2024-05-03T05:07:57.000Z

I'll have to think about it some more when I have additional time.

So should I wait with merging #514 ?

Answer 4 · 2024-05-03T05:49:31.000Z

Yeah; I’ll mark it WIP. I’d rather have a consensus before we merge anything.