pymupdf/PyMuPDF

non-modified empty metadata fields are converted to 4 char length literal "null" strings

Closed this issue · 3 comments

Description of the bug

According to the docs, making a copy of pdf_doc.metadata, modifying the value of one field (say "author"), calling set_metadata() with the modified dict, and then saving should only affect the modified field (author, in the example). The actual behaviour is that all empty fields (metadata dictionary values that are empty strings) are changed from empty to the literal string "null", which is awkward to see; it should be "" (empty, as in the original). I attach a small script that adds keywords to a PDF and has this side effect, plus a screenshot of the Windows Explorer property sheet of a PDF file before and after being processed with the attached code.

pdf_keywords.py.txt

Image

How to reproduce the bug

With the attached file pdf_keywords.py.txt renamed to .py, and with a test PDF called kk.pdf that has only a few metadata fields set, run:

python pdf_keywords.py -l kk.log -p kk.pdf --set ebook

PyMuPDF version

1.26.3

Operating system

Windows

Python version

3.10

According to the specification, a key that is not present in the metadata should not be represented by an empty string "".

Image

To ease modifying the metadata, we always want to represent the metadata with a full set of standard keys. So we had to decide what to do if the PDF does not contain a certain key at all. The decision was to return "" in that case (as opposed to None) when the allowed value format for a key is string.
Therefore, when setting the metadata, the decision must be made again: what to put in the PDF if a value is "". We decided to use the PDF equivalent of None: null. This is the PDF null object, written as a bare keyword, not a string. As per the PDF specification, a value of null is to be treated as if its key is not present. So conforming readers return nothing (or "not available" etc.) for any PDF key whose value is null.
When a value for a key is "" in set_metadata, we cannot simply do nothing, because the user might want to erase any previous data there. Neither is it possible to physically erase a key like /Title from a PDF (in all cases).
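
The decision described above can be sketched in plain Python. This is an illustration only, not PyMuPDF's actual implementation: info_entries and KEYMAP are hypothetical names, only four keys are covered, and the string escaping is deliberately naive.

```python
# Hypothetical mapping from metadata keys to PDF /Info keys.
KEYMAP = {
    "title": "Title",
    "author": "Author",
    "subject": "Subject",
    "keywords": "Keywords",
}


def info_entries(meta):
    """Return the PDF source text for each /Info entry.

    Empty values become the PDF null object (the bare keyword null);
    non-empty values become PDF literal strings in parentheses.
    """
    entries = {}
    for key, pdf_key in KEYMAP.items():
        if key not in meta:
            continue  # key not passed at all: leave the entry untouched
        value = meta[key]
        if not value:
            entries[pdf_key] = "null"  # keyword, not a 4-character string
        else:
            # Naive escaping of backslashes and parentheses for this sketch.
            escaped = (
                value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
            )
            entries[pdf_key] = f"({escaped})"
    return entries


entries = info_entries({"title": "", "author": "author"})
for pdf_key, source in entries.items():
    print(f"/{pdf_key} {source}")
# prints:
# /Title null
# /Author (author)
```

Note the difference between null and (null): only the latter would be a literal four-character string in the PDF.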

Image

You are right. I inspected the generated PDF and there are no "null" strings in the metadata. I should have checked before reporting. Maybe the culprit of the ugly property sheet is a PDF property handler shell extension I use (this). Sorry for making you handle a non-issue.

No worries at all. We have been through this multiple times now. But you do have a point: when creating a brand-new PDF via pymupdf.open() and giving it metadata where only, e.g., the author is filled in, the created object is this:

<<
  /Title null
  /Author (author)
  /Subject null
  /Keywords null
  /Creator null
  /Producer null
  /CreationDate null
  /ModDate null
  /Trapped null
>>

Whereas this

<<
  /Author (author)
>>

is what it really could / should look like.
As I wrote, from a PDF perspective both are equivalent, and the first version shows a consistent behavior in round trips.
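
If the compact form matters, a caller could drop empty values before calling set_metadata, since the documentation describes all keys as optional. The helper below is a hypothetical sketch (prune_empty is not part of PyMuPDF). The trade-off follows from the explanation above: a key that is left out is presumably not written at all, so this would only suit new documents or cases where no old value needs erasing.

```python
# Keys that are reported in doc.metadata but cannot be set.
READ_ONLY_KEYS = {"format", "encryption"}


def prune_empty(meta):
    """Return a copy of meta without empty values and read-only keys."""
    return {
        key: value
        for key, value in meta.items()
        if value and key not in READ_ONLY_KEYS
    }


full = {
    "format": "PDF 1.7",
    "title": "",
    "author": "author",
    "subject": "",
    "keywords": "",
    "encryption": None,
}
compact = prune_empty(full)
print(compact)  # {'author': 'author'}
```

The result could then be passed to set_metadata in place of the full dictionary.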