UnicodeDecodeError during cpp step

Question

Closed this issue 3 years ago · 2 comments

I'm using the standard parse_file(filepath, use_cpp=True) call to parse my C files. Some of my files contain non-ASCII Unicode characters, such as the curly quotes “ and ”. During the cpp step, parsing fails with a UnicodeDecodeError. I found that Python's subprocess.check_output() takes an optional parameter called encoding. When I set encoding='utf-8', parse_file() succeeds.

Specifically, making the following change to preprocess_file() in pycparser/__init__.py fixes my issue:

        # Note the use of universal_newlines to treat all newlines
        # as \n for Python's purpose
        text = check_output(path_list, universal_newlines=True, encoding='utf-8')

I am running my code from a vanilla Windows 10 Command Prompt.
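A likely explanation, sketched below, is that without an explicit encoding the preprocessor output is decoded with the locale's default code page rather than UTF-8. The sketch assumes that code page is cp1252, which is common on Western-locale Windows installs but is not confirmed for this machine. The UTF-8 encoding of ” ends in byte 0x9D, which Python's cp1252 codec leaves undefined, so decoding fails.

```python
# The two curly quotes mentioned above, encoded the way a UTF-8 source file stores them.
quotes = u"\u201c and \u201d"
raw = quotes.encode("utf-8")

# Assumption: the default locale encoding is cp1252 (not confirmed in this report).
try:
    raw.decode("cp1252")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("cp1252 decode failed:", exc)

# Decoding with the encoding the file was actually written in round-trips cleanly.
decoded = raw.decode("utf-8")
```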

I propose adding a new optional parameter to preprocess_file() so that a caller can specify this encoding, but there might be a better way; I don't do much Python dev.

Answer 1 · 2021-04-18T13:33:35.000Z

Thanks for opening the issue.

The encoding param was added to check_output in Python 3.6, AFAICS. pycparser currently works on many older Python versions (including Python 2), so I'm not sure it's worth breaking its backwards compatibility for such a minor feature.

After all, it's trivial to write your own parse_file, modeled after the one in pycparser - in fact, I recommend it.
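As a rough sketch of what such a custom parse_file could look like (not pycparser's actual implementation): the encoding parameter and its 'utf-8' default are assumptions added for illustration, and cpp_args is assumed to be a list of strings. The output is captured as bytes and decoded by hand, so the encoding argument of check_output is not needed.

```python
import io
import subprocess


def preprocess_file(filename, cpp_path="cpp", cpp_args=(), encoding="utf-8"):
    """Run the preprocessor on filename and return its output as text.

    The encoding parameter is hypothetical; it is not part of pycparser.
    """
    path_list = [cpp_path] + list(cpp_args) + [filename]
    # Capture raw bytes and decode them explicitly instead of relying on
    # the locale's default encoding.
    raw = subprocess.check_output(path_list)
    text = raw.decode(encoding)
    # Normalize line endings to \n, roughly what universal_newlines=True does.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def parse_file(filename, use_cpp=True, cpp_path="cpp", cpp_args=(),
               encoding="utf-8"):
    """Parse a C file into a pycparser AST, optionally preprocessing it first."""
    # Imported here so preprocess_file can be used without pycparser installed.
    from pycparser.c_parser import CParser

    if use_cpp:
        text = preprocess_file(filename, cpp_path, cpp_args, encoding)
    else:
        with io.open(filename, encoding=encoding) as f:
            text = f.read()
    return CParser().parse(text, filename)
```

A call might look like parse_file("example.c", cpp_path="gcc", cpp_args=["-E"]), where example.c is a placeholder file name.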

I'll leave this issue open with a label to revisit it later.

Answer 2 · 2021-04-18T17:39:22.000Z

Thanks for the response. I didn’t realize I could define a custom parse_file like that.