Java identifiers in .class files containing characters outside of BMP can fail to decode
KonstantinShemyak opened this issue · 1 comments
The Java class file specification defines the encoding of the bytes in the CONSTANT_Utf8_info structure. Currently, python-javatools handles the zero byte in this "modified UTF-8". But there is one more difference: all code points in the supplementary planes are encoded using 6 bytes (a surrogate pair, with each surrogate encoded as 3 bytes), not 4 as in UTF-8. (Such a codec was proposed for Python in 2008, but rejected in 2012.)
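To make the 6-byte form concrete, here is a minimal sketch that encodes one supplementary code point the way the class file format describes: split it into a UTF-16 surrogate pair, then encode each surrogate as a 3-byte sequence. The function name `encode_supplementary` is made up for this illustration and is not part of python-javatools.

```python
def encode_supplementary(code_point: int) -> bytes:
    """Encode one supplementary-plane code point as 6 bytes of modified UTF-8.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")

    # Split into a UTF-16 surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)

    # Encode each surrogate with the ordinary 3-byte UTF-8 bit layout.
    out = bytearray()
    for surrogate in (high, low):
        out.append(0xE0 | (surrogate >> 12))
        out.append(0x80 | ((surrogate >> 6) & 0x3F))
        out.append(0x80 | (surrogate & 0x3F))
    return bytes(out)


# U+10400: 6 bytes in modified UTF-8, 4 bytes in standard UTF-8.
print(encode_supplementary(0x10400).hex())
print("\U00010400".encode("utf-8").hex())
```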
On such input, the UTF-8 decoder behaves differently in my Python 2 and Python 3 environments:
- Python 2.7 just decodes the consecutive surrogates and leaves them as is. (I do not yet see why, but `classinfo` even shows the original character, as though it's decoded with UTF-16... strange.)
- Python 3.6 raises `UnicodeDecodeError`, as expected. (python-javatools would catch it, replace `C0 80` with `00` in the data, and attempt to UTF-8-decode again. Two consecutive UTF-16 surrogates cannot contain the substring `C0 80`, thus nothing gets replaced, and the second decoding attempt fails at the same place.)
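The Python 3 failure path can be sketched roughly as follows. This is a simplified re-creation of the catch-replace-retry logic described above, not the actual python-javatools code, and the name `naive_decode` is hypothetical.

```python
def naive_decode(data: bytes) -> str:
    """Simplified stand-in for the retry logic: handles C0 80 only."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Replace the modified UTF-8 encoding of NUL and try again.
        patched = data.replace(b"\xc0\x80", b"\x00")
        return patched.decode("utf-8")


# The embedded NUL case is recovered by the retry.
assert naive_decode(b"a\xc0\x80b") == "a\x00b"

# The encoded surrogate pair contains no C0 80, so the retry sees
# exactly the same bytes and fails again.
try:
    naive_decode(b"\xed\xa0\x81\xed\xb0\x80")
except UnicodeDecodeError as exc:
    print("second attempt failed:", exc.reason)
```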
Illustration with the Wikipedia example U+10400:
Python 2.7.15rc1 (default, Apr 15 2018, 21:51:34) [GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xed\xa0\x81\xed\xb0\x80'.decode('utf-8')
u'\ud801\udc00'
Python 3.6.5 (default, Jun 1 2018, 18:28:15) [GCC 5.4.0 20160609] on linux
>>> b'\xed\xa0\x81\xed\xb0\x80'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
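One possible way to decode such data in Python 3 is sketched below. It assumes Python 3.4 or later, where the `surrogatepass` error handler is accepted by the UTF-16 codecs as well as by UTF-8. The function name `decode_modified_utf8` is hypothetical; this is only one approach, not a proposal for the exact fix.

```python
def decode_modified_utf8(data: bytes) -> str:
    """Decode Java "modified UTF-8" bytes (sketch, assumes valid input)."""
    # Step 1: C0 80 is the two-byte encoding of NUL.
    data = data.replace(b"\xc0\x80", b"\x00")

    # Step 2: decode, letting 3-byte encoded surrogates through as
    # individual surrogate code points instead of raising.
    text = data.decode("utf-8", "surrogatepass")

    # Step 3: round-trip through UTF-16 so that each high/low surrogate
    # pair is merged into a single supplementary code point.
    # surrogatepass on the decode side is meant to tolerate unpaired
    # surrogates, which Java strings may legally contain.
    raw = text.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")


result = decode_modified_utf8(b"\xed\xa0\x81\xed\xb0\x80")
print(len(result), hex(ord(result)))
```

The UTF-16 round trip is used here because the standard library has no built-in codec for this encoding; a hand-written decoder would avoid the extra pass over the data.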
It seems unlikely that anyone will use code points from outside the Basic Multilingual Plane in Java identifiers, so this is probably a low-priority issue.
Java source file with this character: CESU8.java.txt (GitHub forces the use of some "known extension")
To reproduce this issue in Python 3, #98 is needed; otherwise Python 3 raises an exception because of an attempt to search for a string in bytes.