heuer/segno

The default ISO 8859-1 encoding is unsafe: zbar can misdecode it as Big5 (was: Unicode-related test failures on musl libc)

mgorny opened this issue · 14 comments

mgorny commented

When running the test suite on a system with musl libc, I'm getting the following test failures:

=================================== FAILURES ===================================
_________________ test_encode_decode[M\xe4rchenb\xfccher-byte] _________________

content = 'Märchenbücher', mode = 'byte'

    @pytest.mark.parametrize('content, mode',
                             [('漢字', 'kanji'),
                              ('続きを読む', 'kanji'),
                              ('Märchenbücher', 'byte'),
                              ('汉字', 'byte'),
                              ])
    def test_encode_decode(content, mode):
        qr = segno.make_qr(content)
        assert mode == qr.mode
>       assert content == decode(qr)
E       AssertionError: assert 'Märchenbücher' == 'M酺chenb𡡷her'
E         
E         - M酺chenb𡡷her
E         ?  ^     ^
E         + Märchenbücher
E         ?  ^^     ^^

tests/test_encode_decode.py:50: AssertionError
_____________________________ test_issue_109_bytes _____________________________

    def test_issue_109_bytes():
        data = b'\xb8\xd6\x90\xaf'
        qr_code = segno.make(data, micro=False, mode='byte')
        assert qr_code
>       decoded = decode(qr_code)

tests/test_issue_109_bytes.py:46: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

qrcode = <segno.QRCode object at 0x7fb8472487c0>

    def decode(qrcode):
        scale = 3
        width, height = qrcode.symbol_size(scale=scale)
        qr_bytes = qr_to_bytes(qrcode, scale)
        decoded = zbardecode((qr_bytes, width, height))
        assert 1 == len(decoded)
        assert 'QRCODE' == decoded[0].type
>       return decoded[0].data.decode('utf-8').encode('cp932')
E       UnicodeEncodeError: 'cp932' codec can't encode character '\u9373' in position 1: illegal multibyte sequence

tests/test_issue_109_bytes.py:39: UnicodeEncodeError
__________________________ test_issue_109_bytes_auto ___________________________

    def test_issue_109_bytes_auto():
        data = b'\xb8\xd6\x90\xaf'
        qr_code = segno.make(data, micro=False)
        assert qr_code
>       decoded = decode(qr_code)

tests/test_issue_109_bytes.py:54: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

qrcode = <segno.QRCode object at 0x7fb846d5dbc0>

    def decode(qrcode):
        scale = 3
        width, height = qrcode.symbol_size(scale=scale)
        qr_bytes = qr_to_bytes(qrcode, scale)
        decoded = zbardecode((qr_bytes, width, height))
        assert 1 == len(decoded)
        assert 'QRCODE' == decoded[0].type
>       return decoded[0].data.decode('utf-8').encode('cp932')
E       UnicodeEncodeError: 'cp932' codec can't encode character '\u9373' in position 1: illegal multibyte sequence

tests/test_issue_109_bytes.py:39: UnicodeEncodeError
=========================== short test summary info ============================
FAILED tests/test_encode_decode.py::test_encode_decode[M\xe4rchenb\xfccher-byte]
FAILED tests/test_issue_109_bytes.py::test_issue_109_bytes - UnicodeEncodeErr...
FAILED tests/test_issue_109_bytes.py::test_issue_109_bytes_auto - UnicodeEnco...
======================= 3 failed, 1583 passed in 15.23s ========================

I can reproduce this on both Gentoo and Alpine Linux (amd64).

An easy way to reproduce is to use a Dockerfile like this:

FROM alpine

RUN apk add git py3-nox zbar-dev
RUN git clone https://github.com/heuer/segno/
RUN cd segno && nox

Tested with 39c93b1.

I'm going to try investigating further.

mgorny commented

Curiously enough, the generated QR code (at least according to the file produced by .save() with the .txt format) is the same. However, zbar decodes it differently.

On glibc:

[Decoded(data=b'M\xc3\xa4rchenb\xc3\xbccher', type='QRCODE', rect=Rect(left=12, top=12, width=63, height=63), polygon=[Point(x=12, y=12), Point(x=12, y=75), Point(x=75, y=75), Point(x=75, y=12)], quality=1, orientation='UP')]

On musl:

[Decoded(data=b'M\xe9\x85\xbachenb\xf0\xa1\xa1\xb7her', type='QRCODE', rect=Rect(left=12, top=12, width=63, height=63), polygon=[Point(x=12, y=12), Point(x=12, y=75), Point(x=75, y=75), Point(x=75, y=12)], quality=1, orientation='UP')]

However, the test suite of pyzbar itself passes all tests…

mgorny commented

Ok, using zbarimg I've confirmed that the problem apparently lies in zbar and not here. Sorry for the noise.

mgorny commented

Actually, this may be a more significant bug. This only works on glibc because the ISO 8859-1 bytes for "üc" fall into the "user-defined" Big5 characters. If I shorten the string to "Märchen", zbar is confused by the ISO 8859-1 encoding and decodes it as Big5. Perhaps it would be better to use UTF-8 after all?
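To illustrate the ambiguity described above, here is a minimal sketch that decodes the same ISO 8859-1 bytes under several charsets. It uses Python's built-in codecs only; their tables may differ from whatever conversion tables zbar ends up using on glibc or musl, and `candidate_decodings` is a hypothetical helper name, not part of Segno or zbar.

```python
def candidate_decodings(data, encodings=("iso-8859-1", "utf-8", "big5", "shift_jis")):
    """Return {encoding: decoded text, or None if the bytes are invalid there}."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The payload a byte-mode QR code would carry for "Märchen" in ISO 8859-1.
payload = "Märchen".encode("iso-8859-1")  # b'M\xe4rchen'

for name, text in candidate_decodings(payload).items():
    print(f"{name:>10}: {text!r}")
```

The bytes 0xE4 0x72 ("är" in ISO 8859-1) also form a plausible two-byte sequence in multibyte CJK charsets, so a decoder that only sees the bytes cannot tell which reading was intended.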

heuer commented

Thanks for your report. According to ISO/IEC 18004 (3rd edition), the default encoding for QR codes should be ISO 8859-1.
The library tries to use ISO 8859-1 and, if the content does not fit, falls back to UTF-8. However, you may enforce UTF-8 by using the "encoding" parameter:

import segno

qr = segno.make("Märchen", encoding="utf-8")
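The fallback described above (ISO 8859-1 first, then UTF-8) can be sketched in a few lines. This is a simplified illustration of the idea only, not Segno's actual implementation; `encode_with_fallback` is a hypothetical name.

```python
def encode_with_fallback(text):
    """Encode text as ISO 8859-1 if possible, otherwise as UTF-8.

    Returns a (encoding name, encoded bytes) tuple.
    """
    try:
        return "iso-8859-1", text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # At least one character is outside ISO 8859-1.
        return "utf-8", text.encode("utf-8")


print(encode_with_fallback("Märchen"))
print(encode_with_fallback("汉字"))
```

With such a scheme, "Märchen" stays in ISO 8859-1 unless the caller explicitly asks for UTF-8.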

According to the following test case, enforcing UTF-8 does not help, though. All tests fail (pyzbar-0.1.9, segno 1.6.1-dev, libzbar0 0.23.93-1, Python 3.11.7, PyPy 7.3.15, Debian trixie):

import io
import pytest
import segno
from pyzbar.pyzbar import decode as zbardecode


def qr_to_bytes(qrcode, scale):
    buff = io.BytesIO()
    for row in qrcode.matrix_iter(scale=scale):
        buff.write(bytearray(0x0 if b else 0xff for b in row))
    return buff.getvalue()


def decode(qrcode):
    scale = 3
    width, height = qrcode.symbol_size(scale=scale)
    qr_bytes = qr_to_bytes(qrcode, scale)
    decoded = zbardecode((qr_bytes, width, height))
    assert 1 == len(decoded)
    assert 'QRCODE' == decoded[0].type
    return decoded[0].data.decode('utf-8')


@pytest.mark.parametrize('encoding', [None, 'latin1', 'ISO-8859-1', 'utf-8'])
def test_issue134(encoding):
    # See <https://github.com/heuer/segno/issues/134>
    content = 'Märchen'
    qr = segno.make(content, encoding=encoding, micro=False)
    assert 'byte' == qr.mode
    assert content == decode(qr)


if __name__ == '__main__':
    pytest.main([__file__])

I don't know yet whether this is a Segno, pyzbar or zbar problem.

mgorny commented

It's a zbar problem. It tries hard to decode QR codes as Big5 and Shift-JIS before attempting ISO-8859-1. The bugs about that date back to 2012 (and many have been filed since), so I don't think there's much chance of zbar ever being fixed (and I have to admit that the code is a horror).
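A rough sketch of why the order of attempts matters: if a decoder takes the first charset that accepts the bytes, trying CJK charsets before ISO-8859-1 changes the result. This is a deliberate simplification using Python's codecs, not a port of zbar's logic; `guess_decode` is a hypothetical helper.

```python
def guess_decode(data, order):
    """Return (encoding, text) for the first encoding in `order` that accepts `data`."""
    for name in order:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


payload = "Märchen".encode("iso-8859-1")

# CJK charsets first (the order described above), then ISO-8859-1 first.
cjk_first = guess_decode(payload, ["big5", "shift_jis", "iso-8859-1"])
latin_first = guess_decode(payload, ["iso-8859-1", "big5", "shift_jis"])

print(cjk_first)
print(latin_first)
```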

I don't know how significant zbar is, so I don't know if it's really worth caring for. But if you want things to work on zbar, I'm afraid you can't use ISO-8859-1 for non-ASCII characters.

If you don't care about zbar working, then I suppose it's either a matter of avoiding strings that are misdecoded by zbar (including "Märchenbücher"), or replacing zbar with something else (I don't know any library like that, though).

heuer commented

Thanks for the information. zbar is not that important for me, I only use it for some test cases as it is easy to use and has few dependencies.

If you don't mind, I would close the bug. You are welcome to reopen it if I can do anything on the library side.

mgorny commented

Well, my point is that tests — as they are now — fail on musl systems.

heuer commented

Okay, got it.

I can replace zbar with OpenCV. Would that help? Although it creates more dependencies for the test cases, I wouldn't mind.

heuer commented

Well, switching to OpenCV isn't a solution since all Kanji decoding test cases would fail.

mgorny commented

I suppose you could try zxing-cpp but I haven't tried it; I just recall ZXing app was quite good.

That said, for my purposes it would be sufficient to change the test strings to avoid the problems. However, I'm not sure how hard it would be to find some that work everywhere. If you want to go this route, I could try finding some.
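One way to screen candidate test strings could look like the sketch below. It flags a string when its ISO 8859-1 bytes also decode, to different text, under a CJK charset. The codec list and the helper name `is_ambiguous` are assumptions for illustration; Python's codecs only approximate what zbar sees on a given libc, so a string that passes here would still need to be checked against zbar itself.

```python
# Assumed set of charsets a guessing decoder might try; big5hkscs is included
# because some libc implementations accept a wider Big5 range than others.
CJK_CODECS = ("big5", "big5hkscs", "shift_jis")


def is_ambiguous(text):
    """True if the ISO 8859-1 bytes of `text` decode to different text in a CJK codec.

    Raises UnicodeEncodeError if `text` cannot be encoded as ISO 8859-1.
    """
    data = text.encode("iso-8859-1")
    for name in CJK_CODECS:
        try:
            if data.decode(name) != text:
                return True
        except UnicodeDecodeError:
            # The bytes are invalid in this charset, so it cannot be confused.
            pass
    return False


for candidate in ["hello", "Märchen", "Märchenbücher"]:
    print(candidate, is_ambiguous(candidate))
```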

heuer commented

I think I have found a solution that is satisfactory. Please run the current test suite under musl and report any problems.

mgorny commented

I'm afraid our OpenCV build does not support decoding QR Codes right now. I've requested changing that, and I'll try to remember to test it again when that's done.

heuer commented

I took out OpenCV again. It caused more problems than it solved.
Three tests under musl are now skipped. I think this is acceptable.
I tried the test suite on Void Linux/musl and, apart from the three tests skipped, it runs.
I think it will work with Alpine Linux as well.
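For reference, skipping tests on musl could be sketched roughly as follows. This is an assumption about one possible approach, not the change that was actually committed: it treats any Linux system where `platform.libc_ver()` does not report glibc as musl, which is only a heuristic.

```python
import platform
import sys


def is_probably_musl():
    """Heuristic: a Linux system whose libc is not reported as glibc."""
    if not sys.platform.startswith("linux"):
        return False
    libc_name, _version = platform.libc_ver()
    return libc_name != "glibc"


# Hypothetical usage in a pytest module:
#
#   @pytest.mark.skipif(is_probably_musl(),
#                       reason="zbar misdecodes ISO 8859-1 payloads on musl")
#   def test_encode_decode(content, mode):
#       ...

print(is_probably_musl())
```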

mgorny commented

Ok, thanks a lot for your effort!