Error while decoding characters
ffreller opened this issue · 3 comments
I'm having issues while decoding some characters.
I get the following errors when decoding some rtf text:
"'charmap' codec can't encode character '\x96' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\x93' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\uf02d' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\x99' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\u25a1' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\u2234' in position 0: character maps to <undefined>",
"'charmap' codec can't encode character '\x95' in position 0: character maps to <undefined>"
If necessary, I can send you the rtf files that resulted in those errors.
Thank you
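For anyone trying to reproduce this without the original files: messages of this shape can be triggered with the standard library alone. The sketch below assumes the failing step encodes a single character with the cp1252 codec (one of Python's "charmap" codecs). That is a guess about where the error comes from, not something confirmed in this thread, and `try_encode` is a hypothetical helper written for illustration.

```python
# Minimal reproduction sketch using only the standard library.
# Assumption: the failing step encodes one character with cp1252.

def try_encode(ch, encoding="cp1252"):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)

# U+0096, U+25A1 and U+2234 have no cp1252 byte, so each one fails.
for ch in ["\x96", "\u25a1", "\u2234"]:
    print(repr(ch), "->", try_encode(ch))

# U+2013 (en dash) is the character cp1252 stores as byte 0x96, so it succeeds.
print(repr("\u2013"), "->", try_encode("\u2013"))
```

If this assumption holds, a character such as '\x96' in the output suggests that a raw cp1252 byte value was treated as a Unicode code point somewhere along the way.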
@ffreller I have had similar issues and worked around most of them by opening the source file in binary mode, decoding it explicitly as UTF-8, and then processing each line in turn rather than the whole file at once:
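from striprtf.striprtf import rtf_to_text  # assumes the striprtf package

fileContent = ""
with open(fOpenPath, 'rb') as rtfFile:
    # Read raw bytes and decode explicitly instead of relying on the platform default
    rawFileContent = rtfFile.read().decode("utf-8")
# Convert one line at a time; unencodable characters become backslash escapes
for line in rawFileContent.splitlines():
    fileContent += rtf_to_text(line, errors='backslashreplace')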
No guarantees, but HTH
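To illustrate what the errors='backslashreplace' argument changes, here is a small sketch using Python's built-in codec error handlers on a plain string. It assumes the cp1252 codec is the one involved and does not call the RTF library at all. `encode_with` is a hypothetical helper, and the sample string is made up for the example.

```python
# Sketch of Python's standard codec error handlers (standard library only).
# Assumption: cp1252 is the target encoding; the sample text is illustrative.

def encode_with(text, errors, encoding="cp1252"):
    """Encode text with the given error handler instead of the default 'strict'."""
    return text.encode(encoding, errors=errors)

sample = "a\u25a1b"  # U+25A1 has no cp1252 representation

# 'backslashreplace' keeps a readable escape for the unencodable character
print(encode_with(sample, "backslashreplace"))
# 'replace' substitutes a question mark
print(encode_with(sample, "replace"))
# 'ignore' silently drops the character
print(encode_with(sample, "ignore"))
```

With 'backslashreplace' nothing is lost silently: the offending character shows up as an escape such as \u25a1 in the output, which makes it easier to find in the source file afterwards.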
Hi @ffreller, did you find a solution that works for you? Otherwise, you can send me the rtf files and I can have a look at them.