Double-byte (or larger) UTF-8 strings are encoded with the wrong size.
cforger opened this issue · 3 comments
Hello,
Thanks for your work on this, it's been most useful for me.
I have encountered one error and fixed it, so I'm passing it on to see if you agree it's an error, and if it needs inclusion as a revision.
Currently I'm moving contact data between programs, and it's crashing on the decode of some French names.
The problem is in _pack_string. It's calculating the length of the string before it's encoded to UTF-8.
I think you must encode the string before you find the length of it, as some characters need to encode as double-byte or longer.
An example would be the French name Allagbé, or the French word précédent, where the 'e' carries an acute accent (http://www.fileformat.info/info/unicode/char/e9/index.htm)
Python's encoder makes this b'Allagb\xc3\xa9', which is 8 bytes, one byte longer than the original 7-character string.
u-msgpack encodes this as b'\xa7Allagb\xc3\xa9' - notice how the length prefix \xa7 declares only 7 bytes, so the final \xa9 byte is trimmed from the string when the msgpack is read back.
When you feed this trimmed string through a .decode('utf-8') method, you'll crash with a Python error: 'utf-8' codec can't decode byte 0xc3 in position 6: unexpected end of data
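The mismatch can be reproduced without u-msgpack at all. The following is only a sketch in plain Python 3; the variable names are illustrative, and the slice stands in for a decoder that trusts a 7-byte length prefix:

```python
# "Allagbé" written with an escape so the source stays ASCII-safe.
name = "Allagb\u00e9"
encoded = name.encode("utf-8")

char_count = len(name)      # counts code points
byte_count = len(encoded)   # counts UTF-8 bytes

# Simulate reading back only as many bytes as there are characters.
truncated = encoded[:char_count]

try:
    truncated.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # The two-byte sequence for 'é' was cut in half.
    error = exc
```

Here char_count and byte_count differ by one, and decoding the truncated bytes fails on the dangling 0xc3 lead byte.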
The solution is to encode to UTF-8 before calculating the string length, as detailed below:
def _pack_string(x):
    # Encode first so that len(x) counts UTF-8 bytes, not characters.
    x = x.encode('utf-8')
    if len(x) <= 31:
        return struct.pack("B", 0xa0 | len(x)) + x
    elif len(x) <= 2**8-1:
        return b"\xd9" + struct.pack("B", len(x)) + x
    elif len(x) <= 2**16-1:
        return b"\xda" + struct.pack(">H", len(x)) + x
    elif len(x) <= 2**32-1:
        return b"\xdb" + struct.pack(">I", len(x)) + x
    else:
        raise UnsupportedTypeException("huge string")
With this patch in place, I am able to pass all French names in u-msgpack without error.
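For anyone who wants to check the behaviour in isolation, here is a minimal standalone sketch of a round trip, assuming Python 3. The names pack_str and unpack_str are hypothetical helpers written for this illustration, not u-msgpack's API, and ValueError stands in for the library's own exception:

```python
import struct

def pack_str(text):
    """Pack a str as a msgpack string, sizing the header from the byte length."""
    data = text.encode("utf-8")
    n = len(data)
    if n <= 31:
        return struct.pack("B", 0xa0 | n) + data      # fixstr
    elif n <= 2**8 - 1:
        return b"\xd9" + struct.pack("B", n) + data   # str 8
    elif n <= 2**16 - 1:
        return b"\xda" + struct.pack(">H", n) + data  # str 16
    elif n <= 2**32 - 1:
        return b"\xdb" + struct.pack(">I", n) + data  # str 32
    raise ValueError("string too long")

def unpack_str(buf):
    """Read one msgpack string from the start of buf and decode it."""
    first = buf[0]
    if 0xa0 <= first <= 0xbf:
        n, offset = first & 0x1f, 1
    elif first == 0xd9:
        n, offset = buf[1], 2
    elif first == 0xda:
        n, offset = struct.unpack(">H", buf[1:3])[0], 3
    elif first == 0xdb:
        n, offset = struct.unpack(">I", buf[1:5])[0], 5
    else:
        raise ValueError("not a msgpack string")
    return buf[offset:offset + n].decode("utf-8")

packed = pack_str("Allagb\u00e9")  # header is now 0xa8, matching 8 payload bytes
restored = unpack_str(packed)
```

Because the header is computed from the encoded bytes, the declared length and the payload always agree, so accented names survive the round trip.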
-end-
Hi cforger,
Thank you for this detailed report and this fix! You are completely correct about the bug, and it slipped past my test cases. I have added "Allagbé" to the unit tests, along with your corresponding fix to address it. I'll be releasing 1.6 later today, primarily with this fix, but also with some other improvements (module docstrings, module version tuple). Thanks again!
Fixed with 3a0aa1b. New release under tag v1.6 and on PyPI (https://pypi.python.org/pypi/u-msgpack-python).
Thanks for your prompt attention.