The default ISO-8859-1 encoding is unsafe: it can be misdecoded by zbar as Big5 (was: Unicode-related test failures on musl libc)
mgorny opened this issue · 14 comments
When running the test suite on a system with musl libc, I'm getting the following test failures:
=================================== FAILURES ===================================
_________________ test_encode_decode[M\xe4rchenb\xfccher-byte] _________________
content = 'Märchenbücher', mode = 'byte'
@pytest.mark.parametrize('content, mode',
[('漢字', 'kanji'),
('続きを読む', 'kanji'),
('Märchenbücher', 'byte'),
('汉字', 'byte'),
])
def test_encode_decode(content, mode):
qr = segno.make_qr(content)
assert mode == qr.mode
> assert content == decode(qr)
E AssertionError: assert 'Märchenbücher' == 'M酺chenb𡡷her'
E
E - M酺chenb𡡷her
E ? ^ ^
E + Märchenbücher
E ? ^^ ^^
tests/test_encode_decode.py:50: AssertionError
_____________________________ test_issue_109_bytes _____________________________
def test_issue_109_bytes():
data = b'\xb8\xd6\x90\xaf'
qr_code = segno.make(data, micro=False, mode='byte')
assert qr_code
> decoded = decode(qr_code)
tests/test_issue_109_bytes.py:46:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
qrcode = <segno.QRCode object at 0x7fb8472487c0>
def decode(qrcode):
scale = 3
width, height = qrcode.symbol_size(scale=scale)
qr_bytes = qr_to_bytes(qrcode, scale)
decoded = zbardecode((qr_bytes, width, height))
assert 1 == len(decoded)
assert 'QRCODE' == decoded[0].type
> return decoded[0].data.decode('utf-8').encode('cp932')
E UnicodeEncodeError: 'cp932' codec can't encode character '\u9373' in position 1: illegal multibyte sequence
tests/test_issue_109_bytes.py:39: UnicodeEncodeError
__________________________ test_issue_109_bytes_auto ___________________________
def test_issue_109_bytes_auto():
data = b'\xb8\xd6\x90\xaf'
qr_code = segno.make(data, micro=False)
assert qr_code
> decoded = decode(qr_code)
tests/test_issue_109_bytes.py:54:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
qrcode = <segno.QRCode object at 0x7fb846d5dbc0>
def decode(qrcode):
scale = 3
width, height = qrcode.symbol_size(scale=scale)
qr_bytes = qr_to_bytes(qrcode, scale)
decoded = zbardecode((qr_bytes, width, height))
assert 1 == len(decoded)
assert 'QRCODE' == decoded[0].type
> return decoded[0].data.decode('utf-8').encode('cp932')
E UnicodeEncodeError: 'cp932' codec can't encode character '\u9373' in position 1: illegal multibyte sequence
tests/test_issue_109_bytes.py:39: UnicodeEncodeError
=========================== short test summary info ============================
FAILED tests/test_encode_decode.py::test_encode_decode[M\xe4rchenb\xfccher-byte]
FAILED tests/test_issue_109_bytes.py::test_issue_109_bytes - UnicodeEncodeErr...
FAILED tests/test_issue_109_bytes.py::test_issue_109_bytes_auto - UnicodeEnco...
======================= 3 failed, 1583 passed in 15.23s ========================
I can reproduce both on Gentoo and Alpine Linux, amd64.
An easy way to reproduce is to use a Dockerfile like this:
FROM alpine
RUN apk add git py3-nox zbar-dev
RUN git clone https://github.com/heuer/segno/
RUN cd segno && nox
Tested with 39c93b1.
I'm going to try investigating further.
Curiously enough, the generated QR code (at least according to the file produced by .save() with the .txt format) is the same. However, zbar decodes it differently.
On glibc:
[Decoded(data=b'M\xc3\xa4rchenb\xc3\xbccher', type='QRCODE', rect=Rect(left=12, top=12, width=63, height=63), polygon=[Point(x=12, y=12), Point(x=12, y=75), Point(x=75, y=75), Point(x=75, y=12)], quality=1, orientation='UP')]
On musl:
[Decoded(data=b'M\xe9\x85\xbachenb\xf0\xa1\xa1\xb7her', type='QRCODE', rect=Rect(left=12, top=12, width=63, height=63), polygon=[Point(x=12, y=12), Point(x=12, y=75), Point(x=75, y=75), Point(x=75, y=12)], quality=1, orientation='UP')]
However, the test suite of pyzbar itself passes all tests…
OK, using zbarimg I've confirmed that the problem apparently lies in zbar and not here. Sorry for the noise.
Actually, this may be a more significant bug. This only works on glibc because the ISO-8859-1 bytes for üc fall into the "user-defined" Big5 characters. If I shorten the string to Märchen, zbar is confused by the ISO-8859-1 encoding and decodes it as Big5. Perhaps it would be better to use UTF-8 after all?
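To see why the two substrings behave differently, it helps to look at the raw ISO-8859-1 bytes and where they would land in the Big5 code space. The following is a rough sketch, not zbar's code: `big5_zone` and `high_byte_pairs` are hypothetical helpers, and the ranges are a simplified version of the commonly documented Big5 layout, which is only assumed to match what the decoder's iconv backend uses.

```python
def big5_zone(lead, trail):
    """Classify a byte pair by a simplified Big5 layout (illustrative only).

    Returns 'standard', 'user-defined', or None when the pair does not
    fall into either group.
    """
    # Big5 trail bytes are 0x40-0x7E or 0xA1-0xFE.
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None
    code = (lead << 8) | trail
    standard = [(0xA140, 0xA3BF), (0xA440, 0xC67E), (0xC940, 0xF9D5)]
    user_defined = [(0x8140, 0xA0FE), (0xC6A1, 0xC8FE), (0xFA40, 0xFEFE)]
    if any(lo <= code <= hi for lo, hi in standard):
        return 'standard'
    if any(lo <= code <= hi for lo, hi in user_defined):
        return 'user-defined'
    return None


def high_byte_pairs(text):
    """List each non-ASCII ISO-8859-1 byte with its successor and Big5 zone."""
    data = text.encode('iso-8859-1')
    return [(data[i:i + 2], big5_zone(data[i], data[i + 1]))
            for i in range(len(data) - 1) if data[i] >= 0x80]


for pair, zone in high_byte_pairs('Märchenbücher'):
    print(pair, zone)
```

Under these assumptions, the pair for `är` (0xE4 0x72) lands in the standard range, while the pair for `üc` (0xFC 0x63) lands in the user-defined range, which is consistent with the observation above.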
Thanks for your report. According to ISO/IEC 18004 (3rd edition), the default encoding for QR codes should be ISO 8859-1.
The library tries to use ISO 8859-1 and, if the content does not fit, falls back to UTF-8. However, you may enforce UTF-8 by using the "encoding" parameter:
import segno
qr = segno.make("Märchen", encoding="utf-8")
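The fallback described above can be pictured with a few lines of plain Python. This is only a sketch of the idea and not Segno's actual implementation; `choose_byte_encoding` is a hypothetical name, and it ignores the kanji and other non-byte modes entirely.

```python
def choose_byte_encoding(content, encoding=None):
    """Return (encoding_name, encoded_bytes) for byte-mode content.

    Hypothetical helper: honours an explicit encoding, otherwise tries
    ISO 8859-1 first and falls back to UTF-8 if the text does not fit.
    """
    if encoding is not None:
        return encoding, content.encode(encoding)
    try:
        return 'iso-8859-1', content.encode('iso-8859-1')
    except UnicodeEncodeError:
        return 'utf-8', content.encode('utf-8')


print(choose_byte_encoding('Märchen'))            # fits ISO 8859-1
print(choose_byte_encoding('Märchen', 'utf-8'))   # explicit override
print(choose_byte_encoding('汉字'))                # does not fit, falls back
```

Note that the two encodings produce different byte sequences for the same text (one byte for `ä` in ISO 8859-1, two in UTF-8), so a decoder has to guess or be told which one was used.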
According to the following test case, it does not help, though. All tests fail (pyzbar-0.1.9, segno 1.6.1-dev, libzbar0 0.23.93-1, Python 3.11.7, PyPy 7.3.15, Debian trixie):
import io
import pytest
import segno
from pyzbar.pyzbar import decode as zbardecode


def qr_to_bytes(qrcode, scale):
    buff = io.BytesIO()
    for row in qrcode.matrix_iter(scale=scale):
        buff.write(bytearray(0x0 if b else 0xff for b in row))
    return buff.getvalue()


def decode(qrcode):
    scale = 3
    width, height = qrcode.symbol_size(scale=scale)
    qr_bytes = qr_to_bytes(qrcode, scale)
    decoded = zbardecode((qr_bytes, width, height))
    assert 1 == len(decoded)
    assert 'QRCODE' == decoded[0].type
    return decoded[0].data.decode('utf-8')


@pytest.mark.parametrize('encoding', [None, 'latin1', 'ISO-8859-1', 'utf-8'])
def test_issue134(encoding):
    # See <https://github.com/heuer/segno/issues/134>
    content = 'Märchen'
    qr = segno.make(content, encoding=encoding, micro=False)
    assert 'byte' == qr.mode
    assert content == decode(qr)


if __name__ == '__main__':
    pytest.main([__file__])
I don't know yet whether this is a Segno, pyzbar or zbar problem.
It's a zbar problem. It tries hard to decode QR codes as Big5 and Shift-JIS before attempting ISO-8859-1. The bugs about that date back to 2012 (and many have been filed since), so I don't think there's much chance of zbar ever being fixed (and I have to admit that the code is a horror).
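The effect of such a try-order can be mimicked with Python's own codecs. This is a sketch under assumptions: `guess_decode` is a hypothetical helper, Python's `big5` and `shift_jis` codecs stand in for whatever conversion tables zbar's iconv backend uses on a given libc, and the second candidate list is just an arbitrary alternative for contrast.

```python
def guess_decode(data, candidates):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding fits')


latin1_bytes = 'Märchen'.encode('iso-8859-1')

# CJK encodings tried first, roughly the order described above.
cjk_first = guess_decode(latin1_bytes, ['big5', 'shift_jis', 'iso-8859-1'])

# A hypothetical alternative order, for comparison only.
latin_first = guess_decode(latin1_bytes, ['utf-8', 'iso-8859-1'])

print(cjk_first)
print(latin_first)
```

Because a multi-byte CJK codec will happily consume a high byte plus the following ASCII letter as one character, the first order never reaches ISO-8859-1 for this input, whereas the second order recovers the original text.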
I don't know how significant zbar is, so I don't know if it's really worth caring for. But if you want things to work on zbar, I'm afraid you can't use ISO-8859-1 for non-ASCII characters.
If you don't care about zbar working, then I suppose it's either a matter of avoiding strings that are misdecoded by zbar (including "Märchenbücher"), or replacing zbar with something else (I don't know any library like that, though).
Thanks for the information. zbar is not that important for me, I only use it for some test cases as it is easy to use and has few dependencies.
If you don't mind, I would close the bug. You are welcome to reopen it if I can do anything on the library side.
Well, my point is that tests — as they are now — fail on musl systems.
Okay, got it.
I can replace zbar with OpenCV. Would that help? Although it creates more dependencies for the test cases, I wouldn't mind
Well, switching to OpenCV isn't a solution since all Kanji decoding test cases would fail.
I suppose you could try zxing-cpp but I haven't tried it; I just recall ZXing app was quite good.
That said, for my purposes it would be sufficient to change the test strings to avoid the problems. However, I'm not sure how hard it would be to find some that work everywhere. If you want to go this route, I could try to find some.
I think I have found a solution that is satisfactory. Please run the current test suite under musl and report any problems.
I'm afraid our OpenCV build does not support decoding QR Codes right now. I've requested changing that, and I'll try to remember to test it again when that's done.
I took out OpenCV again. That caused more problems than it helped.
Three tests under musl are now skipped. I think this is acceptable.
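For readers who want to do something similar in their own test suites, one possible way to wire up such a skip is sketched below. This is not necessarily how the skip is implemented here: `probably_not_glibc` is a hypothetical helper, and it relies on the assumption that `platform.libc_ver()` reports `'glibc'` on glibc systems and something else (typically an empty string) on musl, which is a heuristic rather than a guaranteed musl check.

```python
import platform
import sys


def probably_not_glibc():
    """Heuristic: True on Linux when CPython cannot identify glibc.

    Hypothetical helper; it does not positively detect musl, it only
    notices that the C library is not reported as glibc.
    """
    if not sys.platform.startswith('linux'):
        return False
    libc_name, _version = platform.libc_ver()
    return libc_name != 'glibc'


# Possible use with pytest (assumed to be installed):
#
#   @pytest.mark.skipif(probably_not_glibc(),
#                       reason='decoder misdetects ISO-8859-1 on this libc')
#   def test_issue_109_bytes():
#       ...

print(probably_not_glibc())
```

A stricter alternative would be to decode a known probe symbol once and skip only when the result is wrong, at the cost of running the decoder during test collection.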
I tried the test suite on Void Linux/musl and, apart from the three tests skipped, it runs.
I think it will work with Alpine Linux as well.
Ok, thanks a lot for your effort!