PDF-hul: various issues with parsing PDFs

Question

Opened this issue 3 months ago · 0 comments

petervwyatt commented 3 months ago

Some issues noted about parsing PDFs:

{ and } are not PDF delimiter tokens except within Type 4 PostScript functions (i.e. they are PS delimiters only), so treating them as delimiters elsewhere is incorrect. This was a long-standing error in PDF specifications.
The PDF-hul header check looks for %PDF-1, but the spec says the header is %PDF- followed by any digit (0-9), a period, and another digit. PDF 2.0 files should therefore be reported as PDF files, but with an unsupported PDF version, until such time as you support PDF 2.0. JHOVE currently reports PDF 2.0 files as a bytestream, which is incorrect.
PDF-hul crashes if a PDF hex-string contains EOL characters. This is permitted by the PDF spec: whitespace can occur in hex-strings, and EOLs are considered whitespace. (For what it is worth, hex-strings and literal strings are the only two types of PDF tokens or keywords that can span multiple lines.)
There seem to be assumptions in the PDF-hul-xx error codes that a key with an explicit null value is invalid, whereas the PDF spec states that such keys should be ignored (the same as not being present). An easy test is to set /Annots null on any page and compare the behaviour to not having an /Annots entry at all.
A Java exception is thrown if cross-reference subsection marker lines (two integers) start with a negative number (i.e. a negative object number).
FileSpecification.java does not account for the UF entry added with PDF 1.7. This was noticed from a code review.
Something strange happens when encountering empty names (i.e. just a '/' followed by nothing, which is a valid PDF name). PDump correctly lists it as a Name object with the empty string "", but if two empty names are appended to a trailer dictionary (i.e. a valid key/value dictionary entry), then JHOVE doesn't work properly...
Please consider adding support for UTF-8 text strings, introduced with PDF 2.0. This was noted from a code review. Also note that UTF-8 strings do occur in some pre-PDF 2.0 files...
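To illustrate the header point above, here is a minimal sketch of a check that accepts %PDF- followed by digit, period, digit, and then decides separately whether the version is supported. The class and method names (PdfHeaderSketch, headerVersion, isSupported) are hypothetical and not part of JHOVE, and the sketch assumes the header sits at the very start of the data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not JHOVE code: separates "is this a PDF?" from "is the version supported?". */
final class PdfHeaderSketch {
    // "%PDF-" followed by one ASCII digit, a period, and another ASCII digit.
    private static final Pattern HEADER = Pattern.compile("%PDF-([0-9])\\.([0-9])");

    /** Returns the version (e.g. "2.0") if the data starts with a PDF header, else empty. */
    static Optional<String> headerVersion(byte[] firstBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data cannot break decoding.
        String text = new String(firstBytes, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        // lookingAt() anchors the match at the start of the input only.
        return m.lookingAt()
                ? Optional.of(m.group(1) + "." + m.group(2))
                : Optional.empty();
    }

    /** Assumed policy for this sketch: only 1.x is supported until PDF 2.0 support exists. */
    static boolean isSupported(String version) {
        return version.startsWith("1.");
    }
}
```

With this split, a file starting with %PDF-2.0 would be identified as a PDF with an unsupported version rather than falling through to bytestream.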
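For the hex-string point above, the following is a rough sketch of a decoder that skips PDF whitespace (including CR and LF) between hex digits. It is not taken from the JHOVE source; HexStringSketch and its methods are hypothetical names, and the input is assumed to be the text between the < and > delimiters. An odd number of digits is padded with a trailing 0, as the PDF spec describes.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper, not JHOVE code: decodes the body of a PDF hex string. */
final class HexStringSketch {
    /** The six PDF whitespace characters: NUL, HT, LF, FF, CR, SP. */
    static boolean isPdfWhitespace(char c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /** Returns 0..15 for an ASCII hex digit, or -1 for anything else. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /** Decodes the text between '<' and '>' into bytes, ignoring whitespace. */
    static byte[] decodeHexBody(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 if none
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (isPdfWhitespace(c)) {
                continue; // EOLs and spaces may appear anywhere in a hex string
            }
            int digit = hexValue(c);
            if (digit < 0) {
                throw new IllegalArgumentException("Not a hex digit at index " + i);
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high << 4); // odd digit count: final digit is followed by an implied 0
        }
        return out.toByteArray();
    }
}
```

Reporting a malformed hex string through the module's normal error messages, rather than an uncaught exception as in this sketch, would presumably fit JHOVE better.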
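For the explicit-null point above, one possible approach is to route dictionary lookups through a single helper that treats a null value the same as a missing key. This is only a sketch: NullEntrySketch, PDF_NULL and lookup are hypothetical names, and a plain Map stands in for whatever dictionary type PDF-hul actually uses.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper, not JHOVE code: treats explicit null entries as absent. */
final class NullEntrySketch {
    /** Stand-in for the parsed PDF null object (an assumption of this sketch). */
    static final Object PDF_NULL = new Object();

    /** Returns the value for key, or empty if the key is missing or maps to PDF null. */
    static Optional<Object> lookup(Map<String, Object> dict, String key) {
        Object value = dict.get(key);
        if (value == null || value == PDF_NULL) {
            return Optional.empty(); // "/Annots null" behaves like no /Annots entry
        }
        return Optional.of(value);
    }
}
```

Under this scheme, a page with /Annots null and a page with no /Annots entry would take the same code path.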