yob/pdf-reader

PDF titles ending w/ NULL character

odinhb opened this issue · 5 comments

Hey

I'm extracting the document title from a PDF generated by ImageMagick, and pdf-reader returns the expected document title, except the string has an extra NULL character at the end.

It looks something like this:

pdf = PDF::Reader.new(file.path)
pdf.info[:Title] # => "the title as expected\u0000"

Is this a bug in ImageMagick or in pdf-reader? It seems like an encoding issue.

Both Chrome and KDE's Gwenview open these PDFs and display the title as expected.

Versions

Edit: OS/Architecture: x86_64 GNU/Linux w/ Kernel 5.11.0-19.1-liquorix-amd64
pdf-reader: 2.4.2
ruby: 2.6.6p146 (2020-03-31 revision 67876) [x86_64-linux]
ImageMagick: 6.9.10-23 Q16 x86_64 20190101

$ convert --version # the commandline version of imagemagick
Version: ImageMagick 6.9.10-23 Q16 x86_64 20190101 https://imagemagick.org
Copyright: © 1999-2019 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP
Delegates (built-in): bzlib djvu fftw fontconfig freetype jbig jng jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib

irb> puts Magick::Long_version # the ruby library wrapping it, compiled w/ a different version
This is RMagick 2.16.0 ($Date: 2009/12/20 02:33:33 $) Copyright (C) 2009 by Timothy P. Hunter
Built with ImageMagick 6.9.7-4 Q16 x86_64 20170114 http://www.imagemagick.org
Built for ruby 2.6.6
[..]

# (both versions of ImageMagick produce identical results for me)

Additional info

When inspecting PDFs from other sources using Sublime Text, I can see that the title is set like so:

/Title (the title as expected)

But when inspecting the PDFs generated by ImageMagick, it looks like this:

/Title (þÿ<0x00>t<0x00>h<0x00>e<0x00> <0x00>t<0x00>i<0x00>t<0x00>l<0x00>e [snip...] <0x00>e<0x00>d<0x00><0x00>)
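
You can also pull the raw /Title value out with Ruby instead of a text editor (a quick-and-dirty sketch; it only works because this info dictionary happens to be stored uncompressed):

data = File.binread("out.pdf")
p data[%r{/Title\s*\([^)]*\)}n]
# => "/Title (\xFE\xFF\x00o\x00u\x00t\x00\x00)" (roughly: the UTF-16BE BOM, then "out", then a NUL)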
yob commented

Is this a bug in ImageMagick or in pdf-reader? It seems like an encoding issue.

Unfortunately it's hard to say without looking at the PDF. Are you able to share it, or is it a private document?

When inspecting PDFs from other sources using Sublime Text, I can see that the title is set like so:
But when inspecting the PDFs generated by ImageMagick, it looks like this:

The raw content of the PDF file can be deceptive when viewed in a text editor. There are multiple ways to encode text, and different PDF-producing tools choose different encodings, so comparing the raw content visually isn't always possible.

/Title (þÿ<0x00>t<0x00>h<0x00>e<0x00> <0x00>t<0x00>i<0x00>t<0x00>l<0x00>e [snip...] <0x00>e<0x00>d<0x00><0x00>)

Storing metadata in a 2-byte format is valid, and I can see the final 2 bytes in this extract are <0x00><0x00>. I suspect the PDF title actually does have a NULL byte encoded into it. Maybe it's reasonable for PDF::Reader to strip that though? 🤔

is it a private document?

Not at all. They are created using ImageMagick's convert like so:

convert 1.jpg 2.jpg 3.jpg out.pdf # this pdf's title would be 'out'
# pdf-reader would return me the ruby string "out\u0000"

The raw content of the PDF file can be deceptive when viewed in a text editor.

Storing metadata in a 2-byte format is valid, and I can see the final 2 bytes in this extract are <0x00><0x00>.

I didn't think too hard about it when I posted this, but I see now that Sublime Text opens the file using the 'Western (Windows 1252)' encoding.

Conveniently, this Wikipedia article (scroll down to 'Byte order marks by encoding') shows what the BOMs for the different Unicode encodings look like when interpreted as Windows-1252. þÿ is UTF-16BE's BOM (0xFE 0xFF).

Edit: In Ruby you can do this:

# the UTF-16BE bytes of "test" (plus a trailing NUL byte), mislabelled as Windows-1252:
String.new("\u0000t\u0000e\u0000s\u0000t\u0000", encoding: "Windows-1252").force_encoding('UTF-16BE')
# => "test\x00"

I suspect the PDF title actually does have a NULL byte encoded into it. Maybe it's reasonable for PDF::Reader to strip that though?

It seems reasonable to me. Some users might want the raw string back if you start messing with it, but I certainly found the ending NULL to be unexpected. I believe NULL-terminated strings are the standard way of doing strings in C, so I would expect the ending NULL to be handled.

I have a similar yet different issue.

I work with @odinhb, but I'm running on a Mac with a newer version of ImageMagick. I think it's most likely an OS-dependent issue.

convert --version
Version: ImageMagick 6.9.12-11 Q16 x86_64 2021-05-04 https://imagemagick.org
Copyright: (C) 1999-2021 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules 
Delegates (built-in): bzlib freetype gslib jng jp2 jpeg lcms ltdl lzma png ps tiff webp xml zlib

When I run the same convert command I get a PDF with the title in hex. PDF::Reader manages to read it, but it doesn't understand that it's UTF-16BE. If I apply .force_encoding("UTF-16BE").encode("UTF-8") to the title, I get the same result as Odin (with a NULL at the end).

/Title <0049006E0076006F006900630065005400720061006E0073006300720069007000740069006F006E005F0032003100330037003100340000>

a_pile_of_memes_with_hex_title.pdf
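
Putting that together, a minimal workaround sketch (assuming, as above, that the title really is BOM-less UTF-16BE):

require 'pdf-reader'

pdf = PDF::Reader.new("a_pile_of_memes_with_hex_title.pdf")
raw = pdf.info[:Title]
# pdf-reader saw no BOM and assumed PDFDoc, so retag the bytes as
# UTF-16BE and transcode to UTF-8 ourselves:
title = raw.force_encoding("UTF-16BE").encode("UTF-8")
# => "InvoiceTranscription_213714\u0000" (the trailing NULL is still there)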

yob commented

On the trailing NULL character: I think I'm inclined to leave it in. I can see in the PDFs that the NULL character is included, and unlike in C there aren't really any security issues with leaving it in. There's no harm in you stripping it out if it's a problem for your system, though.
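
For example, stripping it on your side is a one-liner (a sketch; the regex is just one way to do it):

title = pdf.info[:Title]           # e.g. "out\u0000"
title.sub(/\u0000+\z/, "")         # => "out" (drops any trailing NULLs)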

The wrong encoding in a_pile_of_memes_with_hex_title.pdf seems like a legit issue though.

Strings in PDF metadata can have one of two encodings: "pdfdoc" or "utf16". We use the BOM to detect which one to use:

if obj[0,2].unpack("C*") == [254, 255]
  utf16_to_utf8(obj)
else
  pdfdoc_to_utf8(obj)
end

In this file there's no BOM, so we assume it's pdfdoc, which looks like an incorrect assumption here. The first character (0049) is I in UTF-16 🤔
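
A quick sanity check in irb:

"\x00\x49".force_encoding("UTF-16BE").encode("UTF-8")
# => "I"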

I wonder if the heuristics for deciding which encoding to use should be smarter?

I wonder if the heuristics for deciding which encoding to use should be smarter?

Just a note: when @stoivo opens his file, it also struggles with the encoding. The title looks like this: t h e t i t l e a s e x p e c t e d.

It looks like we should report that to the ImageMagick devs, or at least ask them about it.

If you want a smarter heuristic, an (imperfect) solution I saw described in the Wikipedia article was to read some bytes from the string and see if the byte pattern [NULL][ASCII-CHAR][NULL][ASCII-CHAR] keeps repeating, in which case it is most likely UTF-16. It's not a strong guarantee though, and it might not be viable here because titles are usually short.

From the Wikipedia article I linked:

If there is no BOM, it is possible to guess whether the text is UTF-16 and its byte order by searching for ASCII characters (i.e. a 0 byte adjacent to a byte in the 0x20-0x7E range, also 0x0A and 0x0D for CR and LF). A large number (i.e. far higher than random chance) in the same order is a very good indication of UTF-16 and whether the 0 is in the even or odd bytes indicates the byte order. However, this can result in both false positives and false negatives.
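
A rough sketch of that heuristic in Ruby (the method name and the 80% threshold are made up for illustration):

# Guess whether a BOM-less binary string is UTF-16BE by counting
# [NULL][printable ASCII] byte pairs, per the description above.
def probably_utf16be?(str)
  bytes = str.unpack("C*")
  return false if bytes.size < 4 || bytes.size.odd?
  hits = bytes.each_slice(2).count do |hi, lo|
    hi == 0 && (lo.between?(0x20, 0x7E) || lo == 0x0A || lo == 0x0D)
  end
  hits >= (bytes.size / 2) * 0.8   # mostly NULL+ASCII pairs
end

For "InvoiceTranscription_213714" plus the trailing NULL that's 27 matching pairs out of 28, so the check passes; a short all-ASCII PDFDoc title has no NULL bytes at all and would fail it.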

The first character (0049) is I in UTF-16 🤔

Maybe you missed it but the pdf @stoivo shared is titled
'InvoiceTranscription_213714'.

Curiously, Chrome displays the title correctly, but Gwenview displays an empty title. When I open Properties > Details in Dolphin, the string looks fine, but when I copy-paste it, it looks like this:
'�I�n�v�o�i�c�e�T�r�a�n�s�c�r�i�p�t�i�o�n�_�2�1�3�7�1�4��'

What a wonderful mess. It seems like most software relies on the UTF BOM.