eliben/pycparser

UnicodeDecodeError during cpp step

Closed this issue · 2 comments

I'm using the standard parse_file(filepath, use_cpp=True) call to parse my C files. Some of my files contain special Unicode characters, and the cpp step fails on them with a UnicodeDecodeError. I found that Python's subprocess.check_output() takes an optional parameter called encoding. When I set encoding='utf-8', parse_file() succeeds.

Specifically, making the following change to preprocess_file() in pycparser/__init__.py fixes my issue:

        # Note the use of universal_newlines to treat all newlines
        # as \n for Python's purpose
        text = check_output(path_list, universal_newlines=True, encoding='utf-8')

I am running my code from a vanilla Windows 10 Command Prompt.
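The environment probably matters here: without an explicit encoding, check_output() in text mode decodes the preprocessor output with the locale's preferred encoding, which on a Windows console setup is often a legacy code page rather than UTF-8. The snippet below is only an illustration of that failure mode. It assumes cp1252 as the locale encoding and uses U+201D as a stand-in character; neither is confirmed by this report.

```python
# Illustration only: cp1252 and U+201D are assumptions, not details
# taken from the report above.
utf8_bytes = "\u201d".encode("utf-8")  # right double quote -> b'\xe2\x80\x9d'


def try_decode(data, encoding):
    """Decode data with the given codec, or describe the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "UnicodeDecodeError: " + exc.reason


# Decoding with the codec the file was written in round-trips cleanly.
print(repr(try_decode(utf8_bytes, "utf-8")))

# Byte 0x9d has no mapping in Python's cp1252 codec, so this decode fails.
print(try_decode(utf8_bytes, "cp1252"))
```

Not every non-ASCII character triggers the error under cp1252; many UTF-8 byte sequences decode without complaint into the wrong characters, which would explain why only some files fail.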

I would propose adding a new optional parameter to preprocess_file() to allow a caller to specify this encoding, but there might be a better way--I don't do much Python dev.

Thanks for opening the issue.

The encoding param was added to check_output in Python 3.6, AFAICS. pycparser currently works on many older Python versions (including Python 2), so I'm not sure it's worth breaking its backwards compatibility for such minor features.

After all, it's trivial to write your own parse_file, modeled after the one in pycparser - in fact, I recommend it.
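A minimal sketch of such a custom wrapper, assuming Python 3.6+ and UTF-8 source files. The names preprocess_file_utf8 and parse_file_utf8 are hypothetical and not part of pycparser; only CParser().parse() and subprocess.check_output() are existing APIs.

```python
import subprocess


def preprocess_file_utf8(filename, cpp_path="cpp", cpp_args=(), encoding="utf-8"):
    """Run the preprocessor on filename and return its output as str.

    Hypothetical helper modeled on pycparser's preprocess_file().
    """
    path_list = [cpp_path, *cpp_args, filename]
    # encoding= (Python 3.6+) puts the pipe in text mode and decodes with
    # the given codec instead of the locale default; line endings are
    # normalized to "\n" in text mode.
    return subprocess.check_output(path_list, encoding=encoding)


def parse_file_utf8(filename, cpp_path="cpp", cpp_args=(), encoding="utf-8"):
    """Preprocess filename, then parse the result into a pycparser AST."""
    # Imported here so the preprocessing helper is usable on its own.
    from pycparser.c_parser import CParser

    text = preprocess_file_utf8(filename, cpp_path, cpp_args, encoding)
    return CParser().parse(text, filename)
```

Usage would look something like parse_file_utf8("example.c", cpp_path="gcc", cpp_args=["-E"]), where example.c is a placeholder file name. Keeping the encoding in the caller's own wrapper leaves pycparser itself compatible with older Python versions.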

I'll leave this issue open with a label to revisit it later.

Thanks for the response. I didn’t realize I could define a custom parse_file like that.