The default iso-8859-1 encoding is unsafe and can be misdecoded by zbar as Big5 (was: Unicode-related test failures on musl libc)

Question

mgorny opened this issue a year ago · 14 comments

When running the test suite on a system with musl libc, I'm getting the following test failures:

=================================== FAILURES ===================================
_________________ test_encode_decode[M\xe4rchenb\xfccher-byte] _________________

content = 'Märchenbücher', mode = 'byte'

    @pytest.mark.parametrize('content, mode',
                             [('漢字', 'kanji'),
                              ('続きを読む', 'kanji'),
                              ('Märchenbücher', 'byte'),
                              ('汉字', 'byte'),
                              ])
    def test_encode_decode(content, mode):
        qr = segno.make_qr(content)
        assert mode == qr.mode
>       assert content == decode(qr)
E       AssertionError: assert 'Märchenbücher' == 'M酺chenb𡡷her'
E         
E         - M酺chenb𡡷her
E         ?  ^     ^
E         + Märchenbücher
E         ?  ^^     ^^

tests/test_encode_decode.py:50: AssertionError
_____________________________ test_issue_109_bytes _____________________________

    def test_issue_109_bytes():
        data = b'\xb8\xd6\x90\xaf'
        qr_code = segno.make(data, micro=False, mode='byte')
        assert qr_code
>       decoded = decode(qr_code)

tests/test_issue_109_bytes.py:46: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

qrcode = <segno.QRCode object at 0x7fb8472487c0>

    def decode(qrcode):
        scale = 3
        width, height = qrcode.symbol_size(scale=scale)
        qr_bytes = qr_to_bytes(qrcode, scale)
        decoded = zbardecode((qr_bytes, width, height))
        assert 1 == len(decoded)
        assert 'QRCODE' == decoded[0].type
>       return decoded[0].data.decode('utf-8').encode('cp932')
E       UnicodeEncodeError: 'cp932' codec can't encode character '\u9373' in position 1: illegal multibyte sequence

tests/test_issue_109_bytes.py:39: UnicodeEncodeError
__________________________ test_issue_109_bytes_auto ___________________________

    def test_issue_109_bytes_auto():
        data = b'\xb8\xd6\x90\xaf'
        qr_code = segno.make(data, micro=False)
        assert qr_code
>       decoded = decode(qr_code)

tests/test_issue_109_bytes.py:54: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

qrcode = <segno.QRCode object at 0x7fb846d5dbc0>

    def decode(qrcode):
        scale = 3
        width, height = qrcode.symbol_size(scale=scale)
        qr_bytes = qr_to_bytes(qrcode, scale)
        decoded = zbardecode((qr_bytes, width, height))
        assert 1 == len(decoded)
        assert 'QRCODE' == decoded[0].type
>       return decoded[0].data.decode('utf-8').encode('cp932')
E       UnicodeEncodeError: 'cp932' codec can't encode character '\u9373' in position 1: illegal multibyte sequence

tests/test_issue_109_bytes.py:39: UnicodeEncodeError
=========================== short test summary info ============================
FAILED tests/test_encode_decode.py::test_encode_decode[M\xe4rchenb\xfccher-byte]
FAILED tests/test_issue_109_bytes.py::test_issue_109_bytes - UnicodeEncodeErr...
FAILED tests/test_issue_109_bytes.py::test_issue_109_bytes_auto - UnicodeEnco...
======================= 3 failed, 1583 passed in 15.23s ========================

I can reproduce both on Gentoo and Alpine Linux, amd64.

An easy way to reproduce is to use a Dockerfile like this:

FROM alpine

RUN apk add git py3-nox zbar-dev
RUN git clone https://github.com/heuer/segno/
RUN cd segno && nox

Tested with 39c93b1.

I'm going to try investigating further.

Answer 1 · 2024-02-03T16:21:15.000Z

Curiously enough, the generated QR code (at least according to the file produced by .save() with .txt format) is the same on both systems. However, zbar decodes it differently.

On glibc:

[Decoded(data=b'M\xc3\xa4rchenb\xc3\xbccher', type='QRCODE', rect=Rect(left=12, top=12, width=63, height=63), polygon=[Point(x=12, y=12), Point(x=12, y=75), Point(x=75, y=75), Point(x=75, y=12)], quality=1, orientation='UP')]

On musl:

[Decoded(data=b'M\xe9\x85\xbachenb\xf0\xa1\xa1\xb7her', type='QRCODE', rect=Rect(left=12, top=12, width=63, height=63), polygon=[Point(x=12, y=12), Point(x=12, y=75), Point(x=75, y=75), Point(x=75, y=12)], quality=1, orientation='UP')]

However, the test suite of pyzbar itself passes all tests…
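The two results can be compared directly from the byte strings quoted above. This is a minimal sketch that uses only Python's built-in codecs; the variable names are illustrative and not part of Segno or pyzbar.

```python
# Byte strings copied from the two pyzbar results above.
glibc_data = b'M\xc3\xa4rchenb\xc3\xbccher'
musl_data = b'M\xe9\x85\xbachenb\xf0\xa1\xa1\xb7her'

content = 'Märchenbücher'

# Both results are valid UTF-8, but only the glibc one matches the input.
glibc_text = glibc_data.decode('utf-8')
musl_text = musl_data.decode('utf-8')
print(glibc_text == content)
print(musl_text == content)

# For reference: the same content encoded as ISO-8859-1 and as UTF-8.
print(content.encode('iso-8859-1'))
print(content.encode('utf-8'))
```

The musl result is well-formed UTF-8, so the difference is in which characters zbar produced, not in a broken byte sequence.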

Answer 2 · 2024-02-03T16:26:09.000Z

Ok, using zbarimg I've confirmed that the problem apparently lies in zbar and not here. Sorry for the noise.

Answer 3 · 2024-02-03T17:12:14.000Z

Actually, this may be a more significant bug. This only works on glibc because the iso-8859-1 bytes for "üc" fall into the "user-defined" Big5 character range. If I shorten the string to Märchen, zbar is confused by the iso-8859-1 encoding and decodes it as Big5. Perhaps it would be better to use UTF-8 after all?
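As a byte-level illustration of the ambiguity described above, the sketch below checks whether the ISO-8859-1 byte for "ä" and the byte that follows it have the shape of a Big5 double-byte character. The byte ranges are the commonly documented Big5 layout including its extension areas, and `looks_like_big5_pair` is a hypothetical helper; this is not zbar's actual detection logic.

```python
def looks_like_big5_pair(lead: int, trail: int) -> bool:
    """Structural check only: is (lead, trail) shaped like a Big5 double-byte character?"""
    # Assumed ranges: lead byte 0x81-0xFE, trail byte 0x40-0x7E or 0xA1-0xFE.
    lead_ok = 0x81 <= lead <= 0xFE
    trail_ok = 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE
    return lead_ok and trail_ok


payload = 'Märchen'.encode('iso-8859-1')  # one byte per character

# 'ä' is the single byte 0xE4 in ISO-8859-1; the next byte is 'r' (0x72).
idx = payload.index(0xE4)
lead, trail = payload[idx], payload[idx + 1]
print(hex(lead), hex(trail), looks_like_big5_pair(lead, trail))
```

Because the pair is structurally plausible as Big5, a decoder that guesses the charset from the bytes alone cannot tell it apart from two ISO-8859-1 characters.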

Answer 4 · 2024-02-06T09:50:13.000Z

Thanks for your report. According to ISO/IEC 18004 (3rd edition), the default encoding for QR codes should be ISO 8859-1.
The library tries to use ISO 8859-1 and, if the content does not fit, falls back to UTF-8. However, you can enforce UTF-8 by using the "encoding" parameter:

import segno

qr = segno.make("Märchen", encoding="utf-8")

According to the following test case, it does not help, though. All tests fail (pyzbar-0.1.9, segno 1.6.1-dev, libzbar0 0.23.93-1, Python 3.11.7, PyPy 7.3.15, Debian trixie)

import io
import pytest
import segno
from pyzbar.pyzbar import decode as zbardecode


def qr_to_bytes(qrcode, scale):
    buff = io.BytesIO()
    for row in qrcode.matrix_iter(scale=scale):
        buff.write(bytearray(0x0 if b else 0xff for b in row))
    return buff.getvalue()


def decode(qrcode):
    scale = 3
    width, height = qrcode.symbol_size(scale=scale)
    qr_bytes = qr_to_bytes(qrcode, scale)
    decoded = zbardecode((qr_bytes, width, height))
    assert 1 == len(decoded)
    assert 'QRCODE' == decoded[0].type
    return decoded[0].data.decode('utf-8')


@pytest.mark.parametrize('encoding', [None, 'latin1', 'ISO-8859-1', 'utf-8'])
def test_issue134(encoding):
    # See <https://github.com/heuer/segno/issues/134>
    content = 'Märchen'
    qr = segno.make(content, encoding=encoding, micro=False)
    assert 'byte' == qr.mode
    assert content == decode(qr)


if __name__ == '__main__':
    pytest.main([__file__])

I don't know yet whether this is a Segno, pyzbar or zbar problem.
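The fallback described at the start of this comment (ISO 8859-1 first, UTF-8 if the content does not fit) can be sketched in plain Python. This only illustrates the described behaviour for byte mode; `pick_encoding` is a hypothetical name and this is not Segno's actual implementation.

```python
def pick_encoding(content: str) -> str:
    """Return 'iso-8859-1' if the content fits, otherwise 'utf-8'."""
    try:
        content.encode('iso-8859-1')
    except UnicodeEncodeError:
        # At least one character is outside ISO 8859-1.
        return 'utf-8'
    return 'iso-8859-1'


for text in ('Märchen', '汉字'):
    print(text, pick_encoding(text))
```

Under this rule "Märchen" is stored as ISO 8859-1 by default, which is the case the test above exercises with `encoding=None`.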

Answer 5 · 2024-02-06T11:17:50.000Z

It's a zbar problem. It tries hard to decode QR codes as Big5 and Shift-JIS before attempting ISO-8859-1. The bugs about that date back to 2012 (and many have been filed since), so I don't think there's much chance of zbar ever being fixed (and I have to admit that the code is a horror).

I don't know how significant zbar is, so I don't know if it's really worth caring for. But if you want things to work on zbar, I'm afraid you can't use ISO-8859-1 for non-ASCII characters.

If you don't care about zbar working, then I suppose it's either a matter of avoiding strings that are misdecoded by zbar (including "Märchenbücher"), or replacing zbar with something else (I don't know any library like that, though).
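When screening candidate test strings, a first filter could be whether the ISO-8859-1 payload contains any byte at or above 0x80, since only those bytes can be mistaken for part of a multi-byte charset. This is a rough sketch with a hypothetical helper name; it flags strings that might be affected and does not predict what zbar will actually decode.

```python
def has_high_bytes(content: str, encoding: str = 'iso-8859-1') -> bool:
    """Return True if the encoded content contains any non-ASCII byte."""
    return any(byte >= 0x80 for byte in content.encode(encoding))


candidates = ['Märchenbücher', 'Märchen', 'Maerchen']
for text in candidates:
    # True means the string needs a closer look before use in a zbar-based test.
    print(text, has_high_bytes(text))
```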

Answer 6 · 2024-02-06T17:06:38.000Z

Thanks for the information. zbar is not that important for me, I only use it for some test cases as it is easy to use and has few dependencies.

If you don't mind, I would close the bug. You are welcome to reopen it if I can do anything on the library side.

Answer 7 · 2024-02-06T18:42:40.000Z

Well, my point is that tests — as they are now — fail on musl systems.

Answer 8 · 2024-02-06T19:34:07.000Z

Okay, got it.

I can replace zbar with OpenCV. Would that help? Although it creates more dependencies for the test cases, I wouldn't mind.

Answer 9 · 2024-02-06T19:57:54.000Z

> I can replace zbar with OpenCV. Would that help? Although it creates more dependencies for the test cases, I wouldn't mind

Well, switching to OpenCV isn't a solution since all Kanji decoding test cases would fail.

Answer 10 · 2024-02-06T20:42:52.000Z

I suppose you could try zxing-cpp but I haven't tried it; I just recall ZXing app was quite good.

That said, for my purposes it would be sufficient to change the test strings to avoid the problems. However, I'm not sure how hard it would be to find some that work everywhere. If you want to go this route, I could try finding some.

Answer 11 · 2024-02-06T23:12:50.000Z

I think I have found a solution that is satisfactory. Please run the current test suite under musl and report any problems.

Answer 12 · 2024-02-07T12:18:09.000Z

I'm afraid our OpenCV build does not support decoding QR Codes right now. I've requested changing that, and I'll try to remember to test it again when that's done.

Answer 13 · 2024-02-07T14:10:21.000Z

I took out OpenCV again. It caused more problems than it solved.
Three tests under musl are now skipped. I think this is acceptable.
I tried the test suite on Void Linux/musl and, apart from the three tests skipped, it runs.
I think it will work with Alpine Linux as well.
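The thread does not show how the three tests are skipped. One possible approach, given here only as an assumption, is a heuristic that treats any Linux system whose libc is not reported as glibc as "probably musl". `is_probably_musl` is a hypothetical helper and the check may misclassify other non-glibc systems.

```python
import platform
import sys


def is_probably_musl() -> bool:
    """Heuristic: running on Linux, but the libc is not identified as glibc."""
    if not sys.platform.startswith('linux'):
        return False
    # libc_ver() returns a (name, version) tuple; name is 'glibc' on glibc systems.
    libc_name, _version = platform.libc_ver()
    return libc_name != 'glibc'


print(is_probably_musl())
```

The returned flag could then serve as the condition of a `pytest.mark.skipif` marker on the affected tests.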

Answer 14 · 2024-02-07T14:45:00.000Z

Ok, thanks a lot for your effort!