Binary "printable characters" regex seems to contain non-printable characters.

Question

anerbe opened this issue 2 years ago · 2 comments

Affected tool:
olevba

Describe the bug

The binary "printable characters" regex seems to match non-printable characters. Specifically, lines 902-904 of olevba.py are:
902: # regex to extract printable strings (at least 5 chars) from VBA Forms:
903: # (must be bytes for Python 3)
904: re_printable_string = re.compile(b'[\t\r\n\x20-\xFF]{5,}')

However, it seems that only bytes in the range \x20-\x7F are guaranteed to be printable, while bytes in the range \x80-\xFF are often (always?) non-printable.

As a result, the form extraction code often finds strings with non-printable characters.
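
To make the mismatch concrete, here is a minimal sketch that lists which byte values the quoted character class accepts but Python's string.printable does not. The pattern is copied from the snippet above; the helper names (single_byte, accepted, extra) are made up for this illustration and are not part of olevba.

```python
import re
import string

# Pattern quoted above (olevba.py, line 904).
re_printable_string = re.compile(b'[\t\r\n\x20-\xFF]{5,}')

# Code points that string.printable treats as printable (ASCII only).
ascii_printable = {ord(c) for c in string.printable}

# Same character class without the length requirement, used to probe
# each possible byte value on its own (hypothetical helper).
single_byte = re.compile(b'[\t\r\n\x20-\xFF]')
accepted = {b for b in range(256) if single_byte.fullmatch(bytes([b]))}

# Bytes the regex accepts that string.printable does not contain.
extra = sorted(accepted - ascii_printable)
print(len(extra), hex(extra[0]), hex(extra[-1]))
```

Note that the difference starts at \x7F (DEL), so the gap is slightly wider than the \x80-\xFF range discussed above.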

How To Reproduce the bug
Run olevba on a macro containing long strings with bytes in the range \x80-\xFF.

Expected behavior
The regex should catch only strings with printable characters

Answer 1 · 2022-09-01T15:07:22.000Z

Hi @anerbe, thank you for raising the question. The range \x80-\xFF contains characters that are actually printable in some locales. For example, many accented letters used in European languages fall in that range, and they can appear in VBA macros and forms. If you look at the Latin-1 table, there are a lot of printable characters in the range A0 to FF: https://cs.stanford.edu/people/miles/iso8859.html
Do you have specific samples that are causing issues because of that regex?
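
As a rough illustration of the Latin-1 point, the sketch below decodes each byte in \x80-\xFF as ISO 8859-1 and uses the standard unicodedata module to find the ones classified as control characters. It assumes the data really is Latin-1, which the thread does not confirm for form streams; the function name latin1_controls is hypothetical.

```python
import unicodedata

def latin1_controls(lo, hi):
    """Return byte values in [lo, hi] that decode (as ISO 8859-1) to
    characters in Unicode category 'Cc' (control characters)."""
    return [
        b for b in range(lo, hi + 1)
        if unicodedata.category(bytes([b]).decode('latin-1')) == 'Cc'
    ]

controls = latin1_controls(0x80, 0xFF)
print([hex(b) for b in controls])

# Accented letters from the upper half decode to ordinary letters.
accented = b'\xe9\xe8\xfc'.decode('latin-1')
print(accented, accented.isalpha())
```

Under this assumption the upper half splits in two: \x80-\x9F are control codes, while \xA0-\xFF are mostly letters and symbols.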

Answer 2 · 2022-09-05T08:04:13.000Z

Hi @decalage2, thank you for replying so quickly.

First, this regex seems inconsistent with, for example, the "is_printable" function in the same file (which uses string.printable to decide on the printable set). Is there a reason forms should have a bigger printable set? If so, I would suggest noting in the comments preceding the regex that it assumes ISO 8859. (Also note that even under ISO 8859 the regex matches unprintable characters.)

Second, it should probably be explained in what sense it is "printable", since the "natural" ways to print it don't seem to work.

E.g., I ran the following code on the byte string b'\x80\x43\x65\x72\x74\x69\x66\x79\x31\xEC' (as part of the 'o' stream):

from oletools.olevba import VBA_Parser
from oletools.olevba import is_printable
...
for (_, _, form_string) in vbaparser.extract_form_strings():
    print(is_printable(form_string))      # prints False
    print(form_string+'\n')               # prints �Certify1�, which seems to contain non-printable characters
    with open('test.txt', 'w') as f:
        print(form_string, file=f)        # Throws exception: 'charmap' codec can't encode character '\ufffd' in position 0: character maps to <undefined>
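
The exception above can be reproduced without olevba. The sketch below assumes the form bytes were decoded with errors='replace', which would explain the '\ufffd' characters; how olevba actually decodes them is not shown in this thread. It also assumes the default file encoding was a Windows code page such as cp1252, as the 'charmap' message suggests.

```python
# Byte string from the report above.
raw = b'\x80\x43\x65\x72\x74\x69\x66\x79\x31\xEC'

# Assumption: undecodable bytes are replaced with U+FFFD on decoding.
text = raw.decode('utf-8', errors='replace')
print(repr(text))

# Encoding to a legacy code page fails, because U+FFFD has no mapping
# there. This is what open('test.txt', 'w') does implicitly when the
# platform default encoding is such a code page.
try:
    text.encode('cp1252')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

# An explicit UTF-8 encoding can represent U+FFFD.
encoded = text.encode('utf-8')
```

If that assumption holds, passing encoding='utf-8' to open() would avoid the exception, although the output would still contain the replacement characters.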