Parsing specific PDF in 1.0.21 - RangeError: index out of range (works in 1.0.20)
Laykou opened this issue · 7 comments
When trying to parse this PDF, rose_production_split_pages.pdf (file was removed), we get the following error:
RangeError: index out of range
# /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:364:in `pos='
# /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:364:in `_parse_'
# /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:79:in `parse'
# /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/pdf_public.rb:98:in `initialize'
# /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/api.rb:40:in `new'
# /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/api.rb:40:in `parse'
How we call it:
CombinePDF.parse(blob.download, allow_optional_content: true).pages
This happens on versions 1.0.21 and 1.0.22, but not on 1.0.20.
We now want to move to Ruby 3.1 and need the matrix fix that is in 1.0.22, but we cannot upgrade because of this failing PDF.
@boazsegev For some reason the fix in b966e70 broke it.
Hi @Laykou
Thank you for opening this issue.
Please note my comments: here for issue #185 and here for issue #191.
I usually prefer lax parsers that allow formatting errors to be ignored when possible. However, issue #185 showed that a specific type of error cannot be safely ignored, which required that the parser become more strict.
I strongly suspect, from the description of the issue, that the specific PDF file is malformed.
Testing the PDF @ https://www.datalogics.com/products/pdf-tools/pdf-checker/ fails ... the testing suite doesn't even recognize the file as a PDF, not to mention listing the errors.
I have been authoring and maintaining this gem by myself for over 7 years and have been looking for a new maintainer for over 2 years. The community is enjoying my work, but not really contributing, so... 🤷🏼♂️ ... please forgive me for not investing more time and effort to solve this issue.
Kindly,
Bo.
Hi @boazsegev ,
It appears that the Length property of a stream can be incorrect in more cases than just when the 'endstream' keyword appears within the content. Either way, preferring one method over the other for advancing the scanner position leads to issues.
Many of these issues are acceptable to end users, provided the result looks right. For example, swallowing the "index out of range" error would fix the parsing of the attached file, so it could then be combined and the work could go on.
Could we swallow the "index out of range" error and display a warning in this case? Would such a PR make sense?
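To make the proposal above concrete, here is a minimal sketch of what "warn instead of raise" could look like. It is not CombinePDF's actual parser code: `skip_stream_body` is a hypothetical helper, and the only library behavior it relies on is that Ruby's `StringScanner#pos=` raises `RangeError: index out of range` when asked to move past the end of the data. Note that the keyword fallback can stop too early when the stream content itself contains 'endstream', which is the kind of case the maintainer says cannot be safely ignored.

```ruby
require "strscan"

# Hypothetical helper, not CombinePDF's code: advance past a stream body,
# trusting the declared Length only when it fits inside the data.
def skip_stream_body(scanner, declared_length)
  target = scanner.pos + declared_length
  if declared_length >= 0 && target <= scanner.string.bytesize
    scanner.pos = target # normal path: the declared Length is usable
    return :length
  end

  warn "stream Length #{declared_length} overruns the data; searching for 'endstream'"
  # Fallback: move to just before the next 'endstream' keyword, if any.
  return nil unless scanner.skip_until(/endstream/)

  scanner.pos -= "endstream".bytesize
  :keyword
end

data = "stream\nBT /F1 12 Tf ET\nendstream\nendobj"
scanner = StringScanner.new(data)
scanner.skip(/stream\n/)
# Assigning scanner.pos + 5000 directly would raise RangeError here;
# the helper warns and falls back to the keyword search instead.
how = skip_stream_body(scanner, 5000)
```

The return value only records which strategy was used, so a caller could decide whether to surface the warning to the user.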
Do you think this could be fixed in a newer version?
Getting index out of range (RangeError) on a user-uploaded PDF in version 1.0.26 as well.
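Until the parser itself changes, an application that accepts user uploads can at least keep one bad file from aborting the whole job. The sketch below is a caller-side workaround, not a fix. `parse_or_skip` is a hypothetical helper name, and it takes the parsing step as a block so that it does not assume anything about CombinePDF beyond the `CombinePDF.parse` call already shown in this thread.

```ruby
require "logger"

# Hypothetical caller-side helper; not part of CombinePDF.
# Runs the given parsing block and turns a RangeError into a logged warning.
def parse_or_skip(data, logger: Logger.new($stderr))
  yield data
rescue RangeError => e
  logger.warn("skipping unparseable PDF: #{e.message}")
  nil
end

# Intended use with the gem (requires combine_pdf, which is not loaded here):
#   pdf = parse_or_skip(blob.download) do |d|
#     CombinePDF.parse(d, allow_optional_content: true)
#   end
#   pages = pdf ? pdf.pages : []
```

This does not recover any pages from the malformed file. It only lets the caller skip the file and report it.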
Some pull requests have been opened that could possibly solve this problem, but so far none have been merged, and the problem still occurs almost a year after they were submitted.
Can you take a look at them?
Hello, I'm getting 'RangeError: index out of range' on version 1.0.23 as well.