garbled text
yoliax commented
I have a document, and after extracting the text, most of it is garbled. Is there a way to fix this? If not, is it possible to detect this problem before relying on the extracted text?
I'm using a C++ wrapper for MuPDF, and I'm seeing the following warnings:
warning: premature end of data in flate filter
warning: ... repeated 2 times...
library error: FT_New_Memory_Face(KTJ-PK748c3): broken table
warning: ignored error when loading embedded font; attempting to load system font
warning: unknown cid collection: Founder-PKU2
warning: non-embedded font using identity encoding: KTJ-PK748c3 (mapping via )
syntax error: expected generation number (3 ? obj)
warning: repairing PDF document
format error: object is not a stream
warning: ignored error when loading embedded font; attempting to load system font
warning: non-embedded font using identity encoding: SimHei+353 (mapping via )
syntax error: expected generation number (3 ? obj)
warning: repairing PDF document
format error: object is not a stream
warning: ignored error when loading embedded font; attempting to load system font
warning: non-embedded font using identity encoding: SimHei+353 (mapping via )
warning: found duplicate fz_icc_link in the store
JorjMcKie commented
This PDF is broken and contains multiple problems. Many PDF viewers (including Adobe) cannot even display it.
When you extract text, you will see error messages, and pymupdf.TOOLS.mupdf_warnings() will contain the full list of errors and warnings.
So simply check that output after extraction: if it contains messages like the ones above, you have enough indication to distrust the extracted text.
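The check described above could look roughly like the following in Python. This is a minimal sketch, not an official recipe: the helper names (suspicious_warnings, extract_with_check) and the marker list are assumptions made for illustration, with the markers taken from the log in the question. Only pymupdf.open, page.get_text, pymupdf.TOOLS.reset_mupdf_warnings and pymupdf.TOOLS.mupdf_warnings are PyMuPDF calls.

```python
# Assumption: these substrings, copied from the log in this issue, are treated
# as signs that extracted text may be unreliable. Adjust the list as needed.
SUSPICIOUS_MARKERS = (
    "broken table",
    "ignored error when loading embedded font",
    "non-embedded font using identity encoding",
    "unknown cid collection",
    "repairing PDF document",
)


def suspicious_warnings(warning_text: str) -> list[str]:
    """Return the warning lines that contain any of the markers above."""
    return [
        line
        for line in warning_text.splitlines()
        if any(marker in line for marker in SUSPICIOUS_MARKERS)
    ]


def extract_with_check(path: str) -> tuple[str, list[str]]:
    """Extract text from a PDF and report suspicious MuPDF warnings.

    Hypothetical helper: returns (text, suspicious_lines). A non-empty
    second element suggests the text should not be trusted.
    """
    import pymupdf  # imported here so the pure helper above has no dependency

    # Clear warnings left over from earlier operations.
    pymupdf.TOOLS.reset_mupdf_warnings()

    doc = pymupdf.open(path)
    try:
        text = "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()

    # Everything MuPDF reported while opening and extracting.
    warnings = pymupdf.TOOLS.mupdf_warnings()
    return text, suspicious_warnings(warnings)
```

A caller would then do something like text, problems = extract_with_check("input.pdf") and discard or flag the text when problems is not empty. Matching on fixed substrings is a deliberately simple choice; a stricter policy could treat any non-empty warning output as a reason for distrust.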