Possible encoding issue in PDF metadata
Closed this issue · 12 comments
Description of the bug
I am not sure whether we hit an actual bug here, or whether the behavior is intended.
I have a PDF document from which we (among other things) extract some metadata. We noticed that our PyMuPDF-based extraction occasionally fails for specific metadata values, although this happens rarely.
In one such PDF file, I looked up the metadata using mutool info and got the following output:
> mutool info file.pdf
file.pdf:
PDF-1.4
Info object (1 0 R):
<</Creator(Canon iR-ADV C3325 PDF)/CreationDate(D:20200812164013+02'00')/Producer<FEFF00410064006F00620065002000500053004C00200031002E0033006500200066006F0072002000430061006E006F006E0000>>>
Pages: 1
Retrieving info from pages 1-1...
Mediaboxes (1):
...
Images (9):
...
The problematic value is the "Producer", which seems to be given as a UTF-16 encoded string with a Byte Order Mark ("FE FF" as the first two bytes). This encoded string is terminated by two NULL bytes ("00 00").
Opening this file with PyMuPDF and reading the metadata dictionary results in the following:
In [1]: import pymupdf
In [2]: doc = pymupdf.open("file.pdf")
In [3]: doc.metadata
Out[3]:
{'format': 'PDF 1.4',
'title': '',
'author': '',
'subject': '',
'keywords': '',
'creator': 'Canon iR-ADV C3325 PDF',
'producer': 'Adobe PSL 1.3e for Canon\udcc0\udc80',
'creationDate': "D:20200812164013+02'00'",
...}
We noticed that the decoded string has some UTF-16 (low) surrogate characters at the end, which caused our subsequent encoding step to misbehave. I know that Python has the "surrogateescape" error handler (see, e.g., https://peps.python.org/pep-0383/ ), which might also be used in PyMuPDF when decoding the bytes. However, I am wondering where the additional bytes come from in the first place.
Note that a normal UTF-16 decoding of the given bytes produces the following:
In [4]: b = b"\xFE\xFF\x00\x41\x00\x64\x00\x6F\x00\x62\x00\x65\x00\x20\x00\x50\x00\x53\x00\x4C\x00\x20\x00\x31
⋮ \x00\x2E\x00\x33\x00\x65\x00\x20\x00\x66\x00\x6F\x00\x72\x00\x20\x00\x43\x00\x61\x00\x6E\x00\x6F\x00\x
⋮ 6E\x00\x00"
In [5]: b.decode("utf-16")
Out[5]: 'Adobe PSL 1.3e for Canon\x00'
The string is intact and shows no surrogate characters. However, there is still the explicit NULL byte at the end, which is not desired (but easy to deal with).
A hex-editor view of the original file likewise shows nothing except the NULL bytes at the end of the encoded string.
So, my question is: Is the behavior desired? Or at least expected?
In the meantime, we try to sanitize the strings on our side, but I would be interested to know what happened here. And I apologize if everything is in order and to be expected like this.
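For reference, the sanitization we currently apply on our side looks roughly like the following (the helper name is ours, not part of PyMuPDF):

```python
# Sketch of the user-side workaround. The helper name is hypothetical
# (not part of PyMuPDF); it drops the surrogate pair that shows up for
# the encoded NUL, plus any literal NUL bytes and surrounding whitespace.
def sanitize_metadata_value(value: str) -> str:
    value = value.replace("\udcc0\udc80", "")  # surrogate-escaped NUL
    return value.replace("\x00", "").strip()   # stray literal NULs, whitespace

print(sanitize_metadata_value("Adobe PSL 1.3e for Canon\udcc0\udc80"))
# -> Adobe PSL 1.3e for Canon
```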
How to reproduce the bug
Since the actual document is customer data that I am not allowed to share, I unfortunately cannot provide a working example. But I have tried to include all the ingredients in the description above.
PyMuPDF version
1.26.0
Operating system
Linux
Python version
3.12
I think this is happening in the base library. To confirm this and to follow up, please let me have the PDF.
You can send confidential files to my personal e-mail address.
Without the file we will not be able to deal with the problem.
Never mind: I have just manually created a reproducing example and will pursue this with the MuPDF team.
This is the metadata content:
mutool info test.pdf
test.pdf:
PDF-1.7
Info object (5 0 R):
<</Creator(Canon iR-ADV C3325 PDF)/Producer<FEFF00410064006F00620065002000500053004C00200031002E0033006500200066006F0072002000430061006E006F006E0000>>>
Pages: 1
Not a ZUGFeRD file.
Retrieving info from pages 1-1...
Mediaboxes (1):
1 (4 0 R): [ 0 0 595 842 ]
Talked to the MuPDF team: when 0x0000 is encountered in a UTF16 string, it is intentionally converted to the two surrogate characters \udcc0\udc80 on UTF8 output.
There is no (easy) way to change this. We might consider enforcing a replacement in PyMuPDF - either with "\x00" or with "".
Thanks (as usual) for the lightning-speed response! ;-)
And thanks for investigating!
As I said, I was not sure this is an actual bug and I stated that it might well be that the behavior is completely expected. I just do not seem to understand where the additional bytes come from.
I also do not understand why the Null bytes are in the producer field in the first place. I did not (quickly) find any information in the PDF standard about whether this is good practice, valid, or legal, or about what should be done with it. Personally, I would say that most people have no use for additional Null bytes in metadata entries, but there must be some reason why they are there and why MuPDF converts them to the surrogate characters.
(I believe, but am not totally sure, that lone surrogate characters are not valid UTF-8, while an additional Null byte technically is. So handling the case might be a good idea.)
A replacement with \x00 would be sane in the sense that it can at least be encoded to UTF-8, whereas \udcc0\udc80 cannot (which is why the potential problem occurred to us in the first place). The \x00 version would also be the expected output of the producer bytes decoded using b.decode("utf-16").
The only question (for me) is whether such an additional Null byte is of any value to a user. But it would certainly be the technically correct thing to do.
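The difference in encodability can be checked with plain Python, no PyMuPDF involved:

```python
# A literal NUL encodes to UTF-8 without complaint, while the lone
# surrogates raise an error. Strings taken from the producer value above.
good = "Adobe PSL 1.3e for Canon\x00"
bad = "Adobe PSL 1.3e for Canon\udcc0\udc80"

print(good.encode("utf-8"))  # works; U+0000 is a valid code point

try:
    bad.encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)  # surrogates not allowed
```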
May I ask why the Null bytes are translated to the surrogate code points in MuPDF (only if there is an easy answer)?
@griai - thanks a lot for the compliments 😎!
[Since its appearance on PyPI back in August 2016, being responsive has always been among PyMuPDF's top priorities.]
As per your comments:
Those trailing Null bytes indeed make no sense at all. They were probably caused by some error during PDF creation. The UTF16 encoding is also unnecessary given the actual text content: none of the characters used requires more than one byte for encoding.
When talking to the MuPDF team, the generation of those surrogate characters upon encountering Nulls has some deeper technical reasons (among them supporting round trips between UTF8 and UTF16 on the C-level). But they are considering ways to be more supportive for outputs to Python.
Independently of this, I dislike adding a performance burden just to avoid corner cases like yours, which would be the case if we now checked every single string extracted from PDF source text. (Content text extraction would not fall under this anyway.)
Probably, we will just do a replace "\udcc0\udc80" -> "" within the relevant keys of the metadata dictionary.
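On the user side, such a replacement could be sketched like this (a hypothetical helper, not the actual PyMuPDF implementation):

```python
# Strip the surrogate pair that stands for an encoded NUL from every
# string value of a PyMuPDF-style metadata dictionary. Sample data only.
def clean_metadata(metadata: dict) -> dict:
    return {
        key: value.replace("\udcc0\udc80", "") if isinstance(value, str) else value
        for key, value in metadata.items()
    }

meta = {"format": "PDF 1.4", "producer": "Adobe PSL 1.3e for Canon\udcc0\udc80"}
print(clean_metadata(meta)["producer"])
# -> Adobe PSL 1.3e for Canon
```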
I guess this sounds like a valid approach. I know that per https://peps.python.org/pep-0383/ such round trips should be possible. I just do not understand it in the current case, since the Null bytes can be decoded as "utf-16" perfectly fine.
However, if it is feasible on the PyMuPDF side, such a replacement "\udcc0\udc80" -> "" would be very helpful in my eyes, in order to prevent surprises on the user's side. But I also agree that this is an edge case; indeed, I have only seen very few documents showing this behavior. It might also be viable to simply do nothing, since the problem seems to be rare.
> When talking to the MuPDF team, the generation of those surrogate characters upon encountering Nulls has some deeper technical reasons (among them supporting round trips between UTF8 and UTF16 on the C-level). But they are considering ways to be more supportive for outputs to Python.
It's to do with how we represent those strings as utf-8 in C, and the problem comes with how this translates into Python.
Regardless of why those 0 bytes are there, we need to be able to represent them accurately within MuPDF. MuPDF is written in C, and C uses zero-terminated strings, so when we convert from the format used by PDF into a UTF-8 encoded C string, we cannot simply use 0 for those zero bytes.
For instance, if the PDF contained "foo\x00bar" and we converted it into a plain UTF-8 string, C would see the result as just "foo".
So we take advantage of a slight nastiness within UTF-8 and allow the 0 byte to be encoded using an "overlong encoding".
That means 0 becomes two bytes: 0xC0, 0x80.
This means that when we convert that string to Unicode within MuPDF, we get the 0 value out as we expect.
The problem arises where PyMuPDF is built on top of MuPDF. The C strings are converted to Python strings automatically (using some code autogenerated by SWIG).
This code sees these overlong encodings (which are, strictly speaking, illegal) and chooses to represent them using the "surrogateescape" mechanism. Hence 0xC0, 0x80 turns into \udcc0 and \udc80.
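The whole mechanism can be reproduced in plain Python, since the "surrogateescape" error handler (PEP 383) maps each offending byte 0xNN to the code point U+DCNN:

```python
# The overlong two-byte encoding of NUL (0xC0 0x80) is rejected by a
# strict UTF-8 decode; "surrogateescape" instead smuggles each invalid
# byte through as a lone low surrogate, losslessly.
overlong_nul = b"\xc0\x80"

try:
    overlong_nul.decode("utf-8")
except UnicodeDecodeError:
    print("strict decode rejects the overlong encoding")

escaped = overlong_nul.decode("utf-8", errors="surrogateescape")
print(escaped == "\udcc0\udc80")  # True: the exact pair seen in the producer

# The escape is lossless: re-encoding with the same handler restores the bytes.
print(escaped.encode("utf-8", errors="surrogateescape") == overlong_nul)  # True
```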
I don't think we will be changing how MuPDF represents these strings. We need to talk to our SWIG expert about whether we can influence this conversion. Any fix will have to be in the Python wrappings or PyMuPDF, not in core MuPDF.
Thank you very much for the explanations!
Fixed in PyMuPDF-1.26.3.
