What is the purpose of `forceUtf8`?
Shane32 opened this issue · 4 comments
Question
When generating a QR code with QRCodeGenerator.CreateQrCode
, what is the purpose of forceUtf8
versus eciMode: EciMode.Utf8
?
I would expect that specifying EciMode.Utf8
would force the generated code to use the UTF-8 character set when encoding the data. But in fact, it does so only when either (a) there are non-ISO-8859-1 characters present, or (b) forceUtf8
is true
.
Consider the following:
// produces an unreadable code:
var qrData = new QRCodeGenerator().CreateQrCode("https://en.wikipedia.org/wiki/È", ECCLevel.L, eciMode: EciMode.Utf8);
// works fine:
var qrData = new QRCodeGenerator().CreateQrCode("https://en.wikipedia.org/wiki/È", ECCLevel.L, forceUtf8: true, eciMode: EciMode.Utf8);
// whereas this produces identical QR codes:
var qrData = new QRCodeGenerator().CreateQrCode("https://en.wikipedia.org/wiki/🍕", ECCLevel.L, eciMode: EciMode.Utf8);
var qrData = new QRCodeGenerator().CreateQrCode("https://en.wikipedia.org/wiki/🍕", ECCLevel.L, forceUtf8: true, eciMode: EciMode.Utf8);
A similar phenomenon occurs when specifying EciMode.Iso8859_2
with specific strings -- interestingly, forceUtf8
must be true
for the code to function correctly.
If we assume that this is a bug, and EciMode.Utf8
should always encode with UTF-8, then what is the purpose for the forceUtf8
argument? My best guess is that it is some compatibility mode to use UTF-8 encoding when encoded with EciMode.Default
(which normally encodes as ISO-8859-1).
Suggestion
I suggest that:
- When specifying a
EciMode
besidesEciMode.Default
, the text is always encoded in the specified encoding. - For
EciMode.Default
withforceUtf8 == false
, it uses ISO-8859-1 (per spec,EciMode.Iso8859_1
is default) - For
EciMode.Default
withforceUtf8 == true
, it uses UTF-8 (against spec)
Spec
From ISO spec page 20:
The default interpretation for QR Code is ECI 000003 representing the ISO/IEC 8859-1 character set.
Time for a little history lesson. Unfortunately, the question is not quite so easy to answer...
Chapter 1 (the year is 2013)
In the original implementation of the QRCoder, the CreateQrCode
method had neither a forceUtf8
nor an eciMode
parameter. The idea was to keep the interface as simple as possible, so that the QRCoder always selects the best ECI mode independently. (“The best” is defined as “the encoding with which the smallest QR code is generated or the most user data can be stored”).
Here I think I follow the ISO standard, which says: “If you only have numeric characters, encode as numeric. If you also have letters, use Alphanumeric, ...”
Chapter 2 (the year is 2016)
Unfortunately, the world is not always that simple and some QRCoder readers always expected the user data in ByteMode. If data came in ISO 8859-1 format, these readers displayed special characters incorrectly. (In my opinion, the fault lies with the readers, which do not fully map the ISO standard).
In order not to stubbornly insist on the standard and to give users of QRCoder the choice, I have introduced the parameter forceUtf8
. (With the explicit note that this parameter breaks with the standard!)
This was done in:
Chapter 3 (the year is 2018)
For mapping SlovenianUpn QR codes, it is necessary to encode QR codes in ISO 8859-2 mode. (This is a requirement from the UPN QR Form Standard)
However, the automatic ECI mode selection in the QRCoder has not yet taken this encoding into account. Although the QR code standard allows the encoding, it lists it as an encoding that should be set specifically at the user's request.
The eciMode
parameter was introduced to implement the UPN QR code. This was done in:
Chapter 4 (back to the here and now)
As you noted @Shane32, the forceUtf8
is actually obsolete. I can no longer say exactly why I didn't remove it in 2018. Either I simply didn't correctly recognize the functional duplication as such at the time or, as is so often the case, I was worried that removing the parameter would result in a breaking change that could annoy our loyal user base.
Long story short: The forceUtf8
parameter is a relic of old times and can probably be removed. However, this should be done carefully. (Possibly a legacy overload that still accepts the parameter and sets the ECI mode in the background. The overload can be marked as obsolete at the same time and then removed completely with version 2.0).
Or do you think we should make a hard cut here and remove the parameter completely?
I want to think about it some more, but generally I'm thinking:
- Don't make any (or few) changes for v1.x. (Consider if any changes are straight bugfixes or are likely to affect legacy code.)
- For v2, make the optional parameters required and mark the method as obsolete. This retains binary compatibility, not source compatibility, for users. Include compatibility code within this method in some fashion.
- For v2, match the QR Code spec more closely, and also rely on the Payload class's
EciMode
property to select the default encoding. If we need to add additional properties toPayload
, such asForceUtf8
then do so. So for SlovenianUpn QR codes and similar, it is all handled by properties set onPayload
. - Configure default generation of QR codes to be based on modern technologies (e.g. assume UTF-8 support), with the assumption that some payload generators will select a specific target.
I'll have to think about it some more when I have additional time.
I'll have to think about it some more when I have additional time.
So should I wait with merging #514 ?
Yeah; I’ll mark it WIP. I’d rather have a consensus before we merge anything.