joshy/striprtf

Error while decoding characters

ffreller opened this issue · 3 comments

I'm having issues while decoding some characters.
I get the following errors when decoding some rtf text:
"'charmap' codec can't encode character '\x96' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\x93' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\uf02d' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\x99' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\u25a1' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\u2234' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\x95' in position 0: character maps to <undefined>"

If necessary, I can send you the rtf files that resulted in those errors.

Thank you

@ffreller I have had similar issues and worked around most of them by opening the source file in binary mode and explicitly decoding it as UTF-8. I then pass each line to rtf_to_text in turn, rather than the whole file at once:

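from striprtf.striprtf import rtf_to_text

fileContent = ""
# fOpenPath is the path to the RTF file
with open(fOpenPath, 'rb') as rtfFile:
    # Read raw bytes and decode them explicitly as UTF-8
    rawFileContent = rtfFile.read().decode("utf-8")
    for line in rawFileContent.splitlines():
        # Characters that cannot be converted become backslash escapes
        fileContent += rtf_to_text(line, errors='backslashreplace')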

No guarantees, but HTH
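
For anyone trying to see what these messages mean, here is a minimal standard-library sketch. It assumes the codec involved is cp1252, which is a guess: the thread does not say which 'charmap' codec raised the error. It reproduces the message for '\x96', shows what the 'backslashreplace' error handler produces, and shows that the byte 0x96 is an en dash in cp1252. The last point suggests, though the thread does not confirm it, that the original byte was read with the wrong encoding somewhere along the way.

```python
# Sketch only: cp1252 is an assumed codec, not confirmed in this thread.
ch = "\x96"  # the code point U+0096 from the first error message

# Encoding U+0096 with cp1252 fails, because cp1252 has no byte for it.
try:
    ch.encode("cp1252")
    message = ""
except UnicodeEncodeError as exc:
    message = str(exc)
print(message)

# With errors='backslashreplace' the character becomes an escape sequence
# instead of raising an exception.
escaped = ch.encode("cp1252", errors="backslashreplace")
print(escaped)

# The byte 0x96 itself is valid cp1252 and decodes to an en dash (U+2013).
dash = b"\x96".decode("cp1252")
print(repr(dash))
```

The same 'errors' values accepted by str.encode and bytes.decode ('strict', 'ignore', 'replace', 'backslashreplace') are standard Python error handlers. Which of them rtf_to_text accepts depends on the installed striprtf version.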

joshy commented

Hi @ffreller, did you find a solution that works for you? Otherwise, you can send me the RTF files and I can have a look at them.