joshy/striprtf

rtf_to_text ignores the errors parameter

powo opened this issue · 2 comments

powo commented

The errors= parameter of rtf_to_text is documented in the docstrings and mentioned in several issues (#34, #27), but it is completely ignored: it is never passed to .decode(..), which leads to UnicodeDecodeErrors.

Do you have an example .rtf file that illustrates your problem?

powo commented

Here is an example:

>>> striprtf.rtf_to_text(r"{\rtf1\ansi\ansicpg0 T\'e4st}", encoding="utf-8", errors="replace")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/powo/Sync/dev/bat/.venv/lib/python3.11/site-packages/striprtf/striprtf.py", line 136, in rtf_to_text
    out += bytes.fromhex(hexes).decode(encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data

The expected behavior is that errors="replace" suppresses the exception and substitutes the replacement character for the undecodable byte, like this:

>>> b'T\xe4st'.decode("utf-8", errors="replace")
'T�st'
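
For illustration, here is a minimal sketch of the kind of change I would expect, assuming the hex escapes are decoded in a single place as the traceback suggests. The helper name `decode_hex_run` and its default values are hypothetical and not part of striprtf; the only point is that `errors` gets forwarded to `bytes.decode`.

```python
def decode_hex_run(hexes: str, encoding: str = "cp1252", errors: str = "strict") -> str:
    """Decode a run of hex digits collected from RTF hex escapes.

    Hypothetical helper, not striprtf's actual code: it only shows the
    caller's `errors` value being forwarded to bytes.decode().
    """
    return bytes.fromhex(hexes).decode(encoding=encoding, errors=errors)


# With errors="replace" the undecodable byte becomes U+FFFD instead of raising.
print(decode_hex_run("e4", encoding="utf-8", errors="replace"))

# With the default errors="strict" the original exception is still raised.
try:
    decode_hex_run("e4", encoding="utf-8")
except UnicodeDecodeError as exc:
    print("strict mode still raises:", exc.reason)
```

In striprtf itself this would presumably amount to adding errors=errors to the .decode(...) call shown in the traceback, but I have not checked whether there are other decode sites.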