1.0.21 causes previously-consumable PDFs to fail now with RangeError
rdunlop opened this issue · 6 comments
I suspect that the input PDF I'm dealing with is invalid, but I wanted to mention that it worked in 1.0.20 and no longer works in 1.0.21.
The PDF appears to have an invalid stream defined near the end of the file (relevant part here):
8 0 obj\r<</Length 2200\r/Type\r/Metadata\r/Subtype \r/XML>>\rstream\rendstream\rendobj\r9 0 obj\r<< /Keywords()\r/Creator(HP Scan) \r/CreationDate(D:20210326163700-08'00')\r/ModDate(D:20210326163700-08'00')\r/Author ()\r/Producer (HP Scan Extended Application)\r/Title ()\r/Subject ()\r>>\rendobj\rxref\r0 10\r0000000000 65535 f \r0000000009 00000 n \r0000522282 00000 n \r0000522379 00000 n \r0000522588 00000 n \r0000522646 00000 n \r0000522697 00000 n \r0000522746 00000 n \r0000522892 00000 n \r0000522972 00000 n \rtrailer\r<<\r/Size 10\r/Root 5 0 R\r/Info 6 0 R\r/Info 7 0 R\r/Info 8 0 R\r/Info 9 0 R\r>>\rstartxref\r523171\r%%EOF\r
(pretty printed):
8 0 obj
<</Length 2200
/Type
/Metadata
/Subtype
/XML>>
stream
endstream
endobj
9 0 obj
<< /Keywords()
/Creator(HP Scan)
/CreationDate(D:20210326163700-08'00')
/ModDate(D:20210326163700-08'00')
/Author ()
/Producer (HP Scan Extended Application)
/Title ()
/Subject ()
>>
endobj
xref
0 10
0000000000 65535 f
0000000009 00000 n
0000522282 00000 n
0000522379 00000 n
0000522588 00000 n
0000522646 00000 n
0000522697 00000 n
0000522746 00000 n
0000522892 00000 n
0000522972 00000 n
trailer
<<
/Size 10
/Root 5 0 R
/Info 6 0 R
/Info 7 0 R
/Info 8 0 R
/Info 9 0 R
>>
startxref
523171
%%EOF
As you can see, the Length is 2200, but there are not 2200 bytes left in the file, and thus the statement `@scanner.pos += out.last[:Length].to_i - 2` ([here](https://github.com/boazsegev/combine_pdf/blob/b966e703fd897ff50832d3823e74791099b82ca3/lib/combine_pdf/parser.rb#L364)) causes a RangeError.
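The failure mode can be reproduced outside combine_pdf with Ruby's standard StringScanner. This is only a minimal sketch: it assumes the parser advances a StringScanner by the declared /Length, and the shortened sample data and the variable names are made up for illustration rather than taken from the library.

```ruby
require 'strscan'

# Hypothetical, shortened stand-in for the tail of the file: the dictionary
# declares 2200 bytes of stream data, but almost nothing follows "stream".
data = "8 0 obj\r<</Length 2200>>\rstream\rendstream\rendobj\r"
scanner = StringScanner.new(data)

# Move to just after the "stream" keyword and its end-of-line byte.
scanner.skip_until(/stream\r/)

declared_length = 2200
remaining = data.bytesize - scanner.pos

begin
  # Same shape as the statement in the parser: jump ahead by the declared length.
  scanner.pos += declared_length - 2
  outcome = :advanced
rescue RangeError
  # StringScanner refuses to place its pointer past the end of the string.
  outcome = :range_error
end

puts "remaining=#{remaining} outcome=#{outcome}"
```

Comparing `declared_length` against `remaining` before moving the pointer would be one way to detect this situation without relying on the exception.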
I'm 90% sure that this is an invalid PDF, but I am opening this ticket because the change introduced in 1.0.21 is (to me) a regression in capability. I recognize that #184 is a related issue.
For now, I've resolved my issue by reverting to 1.0.20. Not ideal, but sufficient for my purposes for now.
Hi @rdunlop ,
Thank you for opening this issue. I totally understand your concern, and I myself was debating this change for this very reason.
This isn't about a performance optimization. I would much rather be able to read malformed PDF files than run faster...
...however, as I explained in #185, this is required to accommodate properly authored PDF files that are allowed to contain PDF-like markers in their stream data (e.g., a PDF explaining how PDF data looks might contain the PDF endstream keyword). Issue #184 referenced such a valid PDF file as an example.
The choice was either to continue failing on valid PDF files or to patch in a way that limited support for malformed PDF files... I guess there's a way to support both variations, I just didn't see it at the time (though I see it now, it might have a performance penalty).
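One way the two goals might be reconciled is sketched below. It is not necessarily the approach the maintainer had in mind, nor the one taken in any PR: the helper `read_stream_body` and its arguments are hypothetical and are not part of combine_pdf. The idea is to trust the declared Length whenever it fits in the file and lands right before an endstream keyword, and to fall back to searching for endstream only when that check fails.

```ruby
# Hypothetical helper, not combine_pdf's API. Returns the stream body that
# starts at byte offset `start`, or nil if no endstream keyword is found.
def read_stream_body(data, start, declared_length)
  bytes = data.b # binary copy, so character indexes equal byte offsets
  finish = start + declared_length

  # Preferred path (valid files): the declared length fits in the file and is
  # followed, after at most two end-of-line bytes, by the endstream keyword.
  # An "endstream" that appears inside the stream data is skipped over.
  if declared_length >= 0 && finish <= bytes.bytesize
    tail = bytes.byteslice(finish, 11).to_s
    return bytes.byteslice(start, declared_length) if tail.match?(/\A[\r\n]{0,2}endstream/)
  end

  # Fallback (malformed files): take everything up to the first endstream
  # keyword and drop one trailing end-of-line marker.
  stop = bytes.index('endstream', start)
  return nil if stop.nil?

  bytes.byteslice(start, stop - start).sub(/(\r\n|\r|\n)\z/, '')
end

# A valid stream whose data happens to contain the keyword.
valid = "stream\nabc endstream xyz\nendstream\n"
# A malformed stream like the one in this issue: Length is far too large.
malformed = "stream\rendstream\rendobj\r"

valid_body = read_stream_body(valid, 7, 17)
malformed_body = read_stream_body(malformed, 7, 2200)
```

In this sketch the extra cost for well-formed files is a single short comparison per stream; the keyword search only runs when the declared Length cannot be trusted.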
I'm short on time, but if you want to submit a PR that prefers valid PDF files and supports some sort of handling for malformed PDF files, that would be great.
Cheers,
Boaz Segev.
This issue happened for me as well. The PR seems to fix it, @boazsegev.
Have there been any updates yet on this ticket or #205, in particular on whether it will be merged? @boazsegev
Thanks for the PR. Is there any way to get this fix merged, @boazsegev?