obriencj/python-javatools

Java identifiers in .class files containing characters outside of BMP can fail to decode

KonstantinShemyak opened this issue · 1 comments

The Java class file specification defines how the bytes in a CONSTANT_Utf8_info structure are encoded. Currently, python-javatools handles the zero byte in this "modified UTF-8". But there is one more difference: every code point in the supplementary planes is encoded as a surrogate pair, with each surrogate taking 3 bytes, so 6 bytes in total rather than the 4 bytes used by standard UTF-8. (Such a codec was proposed for Python in 2008, but rejected in 2012.)
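To make the two differences concrete, here is a minimal Python 3 sketch of an encoder for this format. The name `encode_modified_utf8` is hypothetical and is not part of python-javatools; it assumes the input contains no unpaired surrogates.

```python
def encode_modified_utf8(text):
    """Encode str as class-file "modified UTF-8" (illustrative sketch only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Difference 1: NUL is written as the two-byte form C0 80.
            out += b'\xc0\x80'
        elif cp >= 0x10000:
            # Difference 2: split into a UTF-16 surrogate pair, then
            # write each surrogate as its own 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for surrogate in (high, low):
                out += bytes([
                    0xE0 | (surrogate >> 12),
                    0x80 | ((surrogate >> 6) & 0x3F),
                    0x80 | (surrogate & 0x3F),
                ])
        else:
            # Everything else in the BMP matches standard UTF-8.
            out += ch.encode('utf-8')
    return bytes(out)
```

For U+10400 this produces the six bytes ED A0 81 ED B0 80, where standard UTF-8 would produce the four bytes F0 90 90 80.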

On such input, UTF-8 decoder behaves differently in my Python 2 and Python 3 environments:

  • Python 2.7 simply decodes the two consecutive surrogates and leaves them as is. (I do not yet see why, but classinfo even shows the original character, as though it had been decoded as UTF-16... strange.)
  • Python 3.6 raises UnicodeDecodeError, as expected. (python-javatools would catch it, replace C0 80 with 00 in the data, and attempt to UTF-8-decode again. The 6-byte encoding of two consecutive UTF-16 surrogates cannot contain the byte sequence C0 80, so nothing gets replaced, and the second decoding attempt fails at the same place.)

Illustration with the Wikipedia example U+10400:

Python 2.7.15rc1 (default, Apr 15 2018, 21:51:34) [GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xed\xa0\x81\xed\xb0\x80'.decode('utf-8')
u'\ud801\udc00'
Python 3.6.5 (default, Jun  1 2018, 18:28:15) [GCC 5.4.0 20160609] on linux
>>> b'\xed\xa0\x81\xed\xb0\x80'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
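One possible workaround in Python 3, sketched here under assumptions rather than proposed as the actual fix, is to let the UTF-8 decoder pass surrogates through with the standard `surrogatepass` error handler and then recombine the pairs with a round trip through UTF-16. The function name `decode_modified_utf8` is hypothetical and not part of python-javatools.

```python
def decode_modified_utf8(data):
    """Decode class-file "modified UTF-8" bytes to str (Python 3 sketch)."""
    # C0 80 is the two-byte form of NUL; map it back to a plain zero byte.
    data = data.replace(b'\xc0\x80', b'\x00')
    # 'surrogatepass' lets the decoder accept the 3-byte surrogate
    # encodings, yielding a str that holds two separate surrogates.
    text = data.decode('utf-8', 'surrogatepass')
    # Round-tripping through UTF-16 merges each high/low surrogate pair
    # into the single supplementary-plane code point it represents.
    return text.encode('utf-16-le', 'surrogatepass').decode(
        'utf-16-le', 'surrogatepass')
```

With this sketch, the six bytes from the illustration decode to the single character U+10400 instead of raising UnicodeDecodeError.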

It seems unlikely that anyone would use code points outside the Basic Multilingual Plane in Java identifiers, so this is probably a low-priority issue.

Java source file with this character: CESU8.java.txt (GitHub requires attachments to have a "known extension")

To reproduce this issue in Python 3, #98 is needed; otherwise Python 3 raises an exception because of an attempt to search for a string in bytes.