joerick/pyinstrument

Bug: Artificial characters in outfile when using pstats renderer

f-allian opened this issue · 7 comments

Hi,

Thanks for making this work open source.

My issue relates to a bug when using the pstats renderer option while saving the profiling output (FYI pstats isn't mentioned as a valid option in the most recent documentation).

For instance, using the command python -m pyinstrument -r pstats -o output_file.pstats python_file.py an output file is generated and saved without any explicit errors, however, the output file contains several artificial non-readable characters (see attached image) and attempting to read the file runs into an error of the form: "ValueError: bad marshal data (invalid reference)".

example-file

From my quick search and reviewing this PR #236, it appears that the problem arises in the serialization step using the marshal module. The current pstats rendering implementation in pyinstrument/renderers/pstatsrenderer.py uses a method called render that returns the output in the form of: return marshal.dumps(stats).decode(encoding="utf-8", errors="surrogateescape"). It appears that the errors="surrogateescape" argument is the cause of the artificial characters, whereas the default errors="strict" option will raise a UnicodeDecodeError as expected for this kind of conversion.

Is there a reason why the marshal module is used for the rendering here? Would Pickle be a suitable alternative for non-binary serialization?

I've also pasted some toy examples below to further clarify this issue:

[In] data = {"fruit": 'apples', 'quantity': 30}

[In] marshal.dumps(data).decode('utf-8', errors='strict') # default

[Out] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 0: invalid start byte # this should be expected

whereas the current implementation yields:

[In] data = {"fruit": 'apples', 'quantity': 30}

[In] marshal.dumps(data).decode('utf-8', errors='surrogateescape')

[Out] '\udcfbZ\x05fruitZ\x06applesZ\x08quantity\udce9\x1e\x00\x00\x000' # outfile contains artificial characters

These tests were carried out on a Windows OS, using pyinstrument 4.5.1, on Python 3.9.13.

Thanks.

Hey there. Would you be able to test with the main branch? I noticed a bug with this recently, and it was fixed in the v4.5.2 release, however I've just realised that the publish to PyPI for that release failed.

i just released v4.5.3, which is available on PyPI.

Thanks for the quick response @joerick. Unfortunately the outfile still contains those characters using v4.5.3/the main branch. I can't see any differences in the source code of the pstats renderer in this update either, not sure if that was intended?

I can't see any differences in the source code of the pstats renderer in this update either, not sure if that was intended?

The change was in __main__.py, relating to how the renderer saves binary data on windows.

The root cause of this is that the Renderers output strings, but marshal data is in bytes. So I chose to use the surrogateescape handler, which round-trips any byte sequence to strings and back again. So the renderer converts to strings with surrogateescape, and __main__.py converts back to bytes.

It appears that the errors="surrogateescape" argument is the cause of the artificial characters, whereas the default errors="strict" option will raise a UnicodeDecodeError as expected for this kind of conversion.

Yeah, the purpose of surrogateescape is to encode non-unicode binary data in a string. It's not intended to look at the data as a string, but as bytes.

Is there a reason why the marshal module is used for the rendering here? Would Pickle be a suitable alternative for non-binary serialization?

The reason is that the Python pstats format uses marshal. This renderer is written to be compatible with that format.


As for the problem you're seeing above, I'm not sure what's happening here. You say that the output contains non-readable characters, but the output is a binary format, it's not intended to be read with a text editor. It is intended to be read by a tool like snakeviz or the python pstats module.

Could you post the exact commands/code you're using to read the file that pyinstrument outputs, and the exact error?

Perhaps you might be able to provide a minimal recreation of this bug? Otherwise I can't move forward here. There are tests in the test suite that check loading/saving of marshal data on windows, so I'm stumped about what's going wrong here.

Sorry for the delayed reply @joerick, I've been busy but I'll check/provide a working example as soon as I can on this

feel free to reopen if/when you have a recreation.