kanzure/pdfparanoia

JSTOR watermark

rcallahan opened this issue · 5 comments

"This content downloaded from X at T" appears at the bottom of all pages.

JSTOR has been working since 0.0.10, can you show me a sample that it fails on?

http://diyhpl.us/~bryan/papers2/paperbot/The%20New%20England%20Origins%20of%20Mormonism.pdf

On Thu, Mar 28, 2013 at 9:45 PM, Bryan Bishop notifications@github.com wrote:

JSTOR has been working since 0.0.10, can you show me a sample that it
fails on?



gffa commented

I am still experiencing this issue. Having tested several JSTOR PDFs, I cannot scrub the watermark from them with pdfparanoia.

fmap commented

The existing JSTOR scrubber stopped working because JSTOR is now adding
watermarks with a different program, one that includes more information and
embeds it in a way that is harder to expunge.

The above patches remove watermark strings as before, but in the process, we're
corrupting the file. mupdf reports:

error: cannot recognize xref format
error: cannot read xref (ofs=2290213)
error: cannot read xref at offset 2290213

Here's what I think's happening:

A PDF document can be thought of as a hierarchy of objects; the most important of
these is the Root entry, which "contains references to other objects defining
the document’s contents, outline, article threads, named destinations, and
other attributes". In the old style generator, the index of the Root entry was
found by consulting the file trailer, which was guaranteed to be at a particular
position near the end of the file. With the new generator, this index is
instead contained in the dictionary of a cross-reference stream, the position
of which is referenced by byte offset at the end of the file.
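To make the distinction concrete, here is a rough sketch (not pdfparanoia code) that reads the byte offset stored after the final startxref keyword and checks what sits at that position: a classic "xref" table or an indirect object holding a cross-reference stream. The function names read_startxref and xref_kind are hypothetical, and the sketch assumes the startxref entry lies within the last 1024 bytes of the file.

```python
import re


def read_startxref(data: bytes) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    tail = data[-1024:]  # assumption: the entry sits in the last 1024 bytes
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref entry found near end of file")
    return int(matches[-1])


def xref_kind(data: bytes) -> str:
    """Classify what the startxref offset points at."""
    offset = read_startxref(data)
    if data[offset:offset + 4] == b"xref":
        return "table"  # classic cross-reference table followed by a trailer
    if re.match(rb"\d+\s+\d+\s+obj", data[offset:offset + 32]):
        return "stream"  # indirect object, presumably a cross-reference stream
    return "unknown"  # offset is stale or the file is damaged
```

If xref_kind returns "unknown" on a scrubbed file, that would be consistent with the stale-offset theory described here.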

When we remove watermarks, we're changing the length of objects within the
file, breaking that reference; the offset is no longer accurate. This stops the
root value from being retrieved, KABLAM!

We could solve this by determining the new location of the xref section after
manipulating objects within pdfparanoia.eraser, and updating the offset at the
end of the file accordingly. I'll probably get around to this tomorrow.
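A minimal sketch of that fix, under some loud assumptions: every edit happened before the xref section, the file has a single xref section (no incremental updates), and the startxref entry is the last thing in the file. The name patch_startxref is hypothetical and is not part of pdfparanoia. Note that this only repairs the trailing pointer; the per-object offsets stored inside the cross-reference section itself are left untouched.

```python
import re

# Matches the final "startxref <offset> %%EOF" at the very end of the file.
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")


def patch_startxref(original: bytes, edited: bytes) -> bytes:
    """Shift the startxref offset by the net change in file length.

    Assumes all edits were made before the xref section, so the section
    moved by exactly len(edited) - len(original) bytes.
    """
    match_old = STARTXREF_RE.search(original)
    match_new = STARTXREF_RE.search(edited)
    if match_old is None or match_new is None:
        raise ValueError("startxref entry not found at end of file")

    old_offset = int(match_old.group(1))
    delta = len(edited) - len(original)
    new_offset = old_offset + delta

    # Replace only the digits of the offset, leaving everything else as-is.
    start, end = match_new.span(1)
    return edited[:start] + str(new_offset).encode("ascii") + edited[end:]
```

Usage would be along the lines of patched = patch_startxref(original_bytes, scrubbed_bytes), called once after all watermark strings have been removed.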

fmap commented

Further errors, now. A sample:

error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: cannot find page -1 in page tree