pymupdf/PyMuPDF

garbled text

I have a document, and after extracting the text, most of it is garbled. Is there a way to fix this? If not, is it possible to detect this problem beforehand?

I'm using a C++ wrapper for MuPDF, and I'm seeing the following warnings:

warning: premature end of data in flate filter
warning: ... repeated 2 times...
library error: FT_New_Memory_Face(KTJ-PK748c3): broken table
warning: ignored error when loading embedded font; attempting to load system font
warning: unknown cid collection: Founder-PKU2
warning: non-embedded font using identity encoding: KTJ-PK748c3 (mapping via )
syntax error: expected generation number (3 ? obj)
warning: repairing PDF document
format error: object is not a stream
warning: ignored error when loading embedded font; attempting to load system font
warning: non-embedded font using identity encoding: SimHei+353 (mapping via )
syntax error: expected generation number (3 ? obj)
warning: repairing PDF document
format error: object is not a stream
warning: ignored error when loading embedded font; attempting to load system font
warning: non-embedded font using identity encoding: SimHei+353 (mapping via )
warning: found duplicate fz_icc_link in the store

文本层乱码-1.pdf

This PDF is broken and contains multiple problems. Many PDF viewers (including Adobe) cannot even display it.

When you extract text, MuPDF emits error messages, and pymupdf.TOOLS.mupdf_warnings() returns the full list of errors and warnings.

So simply check that output after extraction: messages like the ones in your log are enough reason to distrust the extracted text.
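
For illustration, here is a minimal sketch of that check in Python. It assumes a PyMuPDF version that can be imported as `pymupdf` (older releases use `import fitz`). The helper names `suspicious_lines` and `extract_with_diagnostics` and the marker list are hypothetical. The markers are just substrings copied from the log in this issue, not an official classification, so adjust them to your own documents.

```python
# Substrings taken from the log in this issue. The list is an illustrative
# assumption, not an official MuPDF classification of "bad" messages.
FONT_OR_REPAIR_MARKERS = (
    "ignored error when loading embedded font",
    "non-embedded font using identity encoding",
    "unknown cid collection",
    "repairing PDF document",
    "broken table",
)


def suspicious_lines(warnings_text):
    """Return the message lines that contain any of the markers above."""
    return [
        line
        for line in warnings_text.splitlines()
        if any(marker in line for marker in FONT_OR_REPAIR_MARKERS)
    ]


def extract_with_diagnostics(path):
    """Extract text from all pages; return (text, suspicious message lines)."""
    import pymupdf  # imported lazily so suspicious_lines() works without PyMuPDF

    # Drop messages left over from previously processed documents.
    pymupdf.TOOLS.reset_mupdf_warnings()
    with pymupdf.open(path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    # Collect everything MuPDF reported while opening and extracting.
    return text, suspicious_lines(pymupdf.TOOLS.mupdf_warnings())
```

A caller could then treat any non-empty second return value as a signal to skip the document or fall back to OCR. Which messages should count as disqualifying is a judgment call for your own pipeline.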