mikehaertl/php-pdftk

Bug/missing feature: updateInfo() cannot parse the data generated by getData()

rotdrop opened this issue · 16 comments

updateInfo() is the counter-part of getData() (pdftk commands ump_data and update_info):

 update_info <info data filename | - | PROMPT>
                 Changes the bookmarks and metadata in a single PDF's Info dictionary to match the input data file.
                 The input data file uses the same syntax as the output from dump_data.

However, on the php-pdftk level updateInfo() is not able to parse the output of getData(). This was quite surprising. Of course, one can pass a suitably prepared file to updateInfo(), still it would be convenient if updateInfo() could just take the output of getData().

Kind thanks for this nice piece of software, BTW!

As already commented on your PR there's a deeper issue here with our implementation of InfoFile (used inside updateFields()) that we need to address first: It only allows to create Info blocks in this form:

InfoBegin
InfoKey: someKey
InfoValue: someValue
InfoBegin
InfoKey: ...

But if you look at the output of dump_data there's much more:

InfoBegin
InfoKey: Creator
InfoValue: Asciidoctor PDF 1.5.3, based on Prawn 2.2.2
...
PdfID0: 8b93f76a0b28b720d0dee9a6eb2a780a
PdfID1: 8b93f76a0b28b720d0dee9a6eb2a780a
NumberOfPages: 55
BookmarkBegin
BookmarkTitle: some bookmark
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: other bookmark
BookmarkLevel: 1
BookmarkPageNumber: 2
...
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 595.28 841.89
PageMediaDimensions: 595.28 841.89
...
PageLabelBegin
PageLabelNewIndex: 01
PageLabelStart: 1
PageLabelPrefix: i
PageLabelNumStyle: NoNumber
PageLabelBegin
...

To quote again the man page for pdftk:

Changes the bookmarks and metadata in a single PDF's Info dictionary to match the input data file.

So it says you can update 2 things:

  • metadata
  • bookmarks (we definititely missed those in our InfoFile implementation)

The question is: What exactly is metadata? I don't think it includes all the data from dump_data because this would include NumberOfPages. I doubt that you can update that. My assumption: It relates to the Info... lines on top of the output. That's what we already have. So maybe we only miss the bookmarks part.

Unfortunately I can't spend much time on this. If you can help out here and find out which fields exactly can be updated and which not then we could think about a proper fix.

If you do that, keep in mind that the above list is probably far from complete. There may be other keys we haven't seen yet but which can also appear in dump_data for certain PDFs.

The question is: What exactly is metadata? I don't think it includes all the data from dump_data because this would include NumberOfPages. I doubt that you can update that. My assumption: It relates to the Info... lines on top of the output. That's what we already have. So maybe we only miss the bookmarks part.

Well, you could probably also modify the media dimensions and their rotation.

Unfortunately I can't spend much time on this. If you can help out here and find out which fields exactly can be updated and which not then we could think about a proper fix.

Why not just take the proposal of my merge-request #292? There is also the following: if I call pdftk update_info without bookmarks, then all bookmarks are deleted. This is rather a PUT request than a PATCH request. So it would be more convenient to just pass back the entire data provided by dump_data. If some modifications make no: so be it. But you can then modify only parts and be sure that you do not accidentally erase things that you want to keep.

If you do that, keep in mind that the above list is probably far from complete. There may be other keys we haven't seen yet but which can also appear in dump_data for certain PDFs.

Well, the current implementation of your getData() just groups everything which starts with Begin into a sub-array (with the example of the Info part which is handled in a special way) and keeps all other keys just as plain KEY => VALUE pair in the top level array.

Well, you could probably also modify the media dimensions and their rotation.

Could you do some testing on this? What can and can not be updated? It would really help making a decision.

Why not just take the proposal of my merge-request #292?

Because this would break BC. People already using this will come here and complain and I have to find a fix. It's basically a bit what happened here: InfoFile was a contribution and I didn't look too closely what it does - with the problems we have now.

That's why I first want to get the full picture to maybe find a solution that doesn't break BC.

So it would be more convenient to just pass back the entire data provided by dump_data

Are you sure that pdftk does not complain in this case? And will it silently ignore those parts that can not be updated?

Well, you could probably also modify the media dimensions and their rotation.

Could you do some testing on this? What can and can not be updated? It would really help making a decision.

Why not just take the proposal of my merge-request #292?

Because this would break BC. People already using this will come here and complain and I have to find a fix. It's basically a bit what happened here: InfoFile was a contribution and I didn't look too closely what it does - with the problems we have now.

True. For this: InfoFile::__construct() could check whether or not its input contains an array key Info. If it is not present, then interpret the input as before, otherwise as a nested array.

That's why I first want to get the full picture to maybe find a solution that doesn't break BC.

So it would be more convenient to just pass back the entire data provided by dump_data

Are you sure that pdftk does not complain in this case? And will it silently ignore those parts that can not be updated?

It seems so. At least also passing the NumberOfPages and the PDF ids does not break pdftk. Still one could leave this experience to the user: one could document that the input array perhaps first has to be cleaned up. Still it would be more convenient if InfoFile could convert the output of getData() back to a string, so to say "unparse" it.

Anyhow, I am running a little bit out of time. I might eventually get back to this issue. I'll modify my merge request in order to keep backwards compatibility as explained at the start of this comment. Everything unfortunately has to wait, although I understand you comments and agree in most parts.

Thanks again

Claus

For this: InfoFile::__construct() could check whether or not its input contains an array key Info.

Yes, that's what I thought, too. It could just happen that someone wants to set only Bookmarks and doesn't care about setting Info - in which case the solution would break again. Hmm.

I need some time to consider this. I may also refactor your code a little e.g. to make it still work with PHP 5.x (which we still support).

More findings: The output format of pdftk dump_data seems to be proprietary to pdftk. I.e. it has nothing to do with how data is stored inside a PDF - which doesn't make things less confusing 🙄 .

The official PDF specs can be found here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf

Section 14.3.3 lists the possible entries for a Document Information Directory like Title, Author, etc. pdftk seems to convert these into the InfoBegin ... InfoKey ... InfoValue ... blocks. There is more data included in the dump, e.g. bookmarks which are called "outlines" in PDF lingo.

The related code for dump/update in pdftk-java seems to be this:

I didn't study the code in full detail but it seems like almost all data is encoded in SomethingBegin ... SomethingAbc ... SomethingXyz ... blocks. The only exceptions to this are:

  • PdfID0
  • PdfID1
  • NumberOfPages

All this should help us to improve InfoFile so that it can "eat" all kind of data:

$pdf->udpateInfo([
    // "Info" fields as used by the current implementation should still work to not break BC
    'Creator' => 'bla bla',
    'Producer' => 'xyz',

    // Standalone fields
    'NumberOfPages' => 17,
    'PdfId0' => 'abcde',
    'PdfId1' => 'abcde',

    'Bookmark' => [ ... ],
    'PageMedia' => [ ... ],
]);

$pdf->udpateInfo([
    // Nested Info as returned by getData
    'Info' => [
        'Creator' => 'bla bla',
        'Producer' => 'xyz',
    ],

    // Standalone fields
    'NumberOfPages' => 17,
    'PdfId0' => 'abcde',
    'PdfId1' => 'abcde',

    'Bookmark' => [ ... ],
    'PageMedia' => [ ... ],
]);

From tests it seems that the following can not be updated:

  • NumberOfPages
  • PdfId0
  • PdfId1
  • PageMedia...
  • PageLabel...

@rotdrop I did a slightly different implementation based on my findings above.. Could you help testing/reviewing this please (see PR above)? Also have a look at the included test cases which show the different input formats that you can pass to updateInfo() now.

Please correct me if I'm wrong but from my tests it really seems that you can only update the Info* fields and the bookmarks - nothing else. The implementation is now restricted to that. But it would be easy to extend this to more blocks.

From tests it seems that the following can not be updated:

* NumberOfPages

* PdfId0

* PdfId1

* PageMedia...

* PageLabel...

Are you sure about PageMedia? Just asking because pdftk --help gives

      update_info <info data filename | - | PROMPT>
             Changes the bookmarks, page labels, page sizes, page rota-
             tions, and metadata in a single PDF's Info dictionary to
             match the input data file.

At this point I would suggest to contact the authors of pdftk.

@rotdrop I did a slightly different implementation based on my findings above.. Could you help testing/reviewing this please (see PR above)? Also have a look at the included test cases which show the different input formats that you can pass to updateInfo() now.

Please correct me if I'm wrong but from my tests it really seems that you can only update the Info* fields and the bookmarks - nothing else. The implementation is now restricted to that. But it would be easy to extend this to more blocks.

At least passing non-updatable fields back to pdftk seems not to do any harm. But an opinion from the developers of pdftk would probably be really helpful here.

I hope that I find the time to test the thing with the page-media (as --help says it should be updatable). Kind thanks for taking my remarks serious!

One other point: I am interested in this issue, but unfortunately have to postpone any time-consuming activities until start of September. I am very sorry for this, as it shows now that my "complaints" burn quite a bit of your time. I actually need the next two weeks entirely for my pay-work and after that I am on holiday until mid of August.

Which version of pdftk are you using? pdftk --help for me gives this:

    update_info <info data filename | - | PROMPT>  
     Changes the bookmarks and metadata in a single PDF's Info 
     dictionary to match the input data file. The input data file 
     uses the same syntax as the output from dump_data. Non-ASCII 
     characters should be encoded as XML numerical entities. 
 
     This operation does not change the metadata stored in the 
     PDF's XMP stream, if it has one. (For this reason you should 
     include a ModDate entry in your updated info with a current 
     date/timestamp, format: D:YYYYMMDDHHmmSS, e.g. D:201307241346 
     -- omitted data after YYYY revert to default values.)
$ pdftk --version
pdftk port to java 3.0.9 a Handy Tool for Manipulating PDF Documents

I've now modified it again so that it basically accepts whatever is returned by getData(). It only ignores NumberOfPages, PdfId0 and PdfId1. They are not part of the PDF specs and probably something pdftk generates.

Let me know when you found time do test this. But I'm actually pretty sure that it now accepts pretty much everything.

@rotdrop Any news here?

@rotdrop Are you still interested in this?

@rotdrop At least some response would have been nice as I see that you are active on Github in other projects. Anyway: Merged and released now.

@mikehaertl I am sorry for my silence, any sort of response would have been better as mere silence. Please accept my apologies.