/pymupdf-jbig2-extract

Convert masks in pdf to images

Primary LanguagePython

Some pdf's (those from the Internet Archive, apparently) consist of mostly scans. Structurally, every page is a background image with a mask, occasionally with an invisible OCR text layer, and sometimes a lower-res preview/thumbnail image.

This is mostly a script trying to convert the mask into an image losslessly. The background image (often just yellowing paper) makes the document less-readable. Smaller pdf with single image per page (without requiring compositing) also loads faster, too.

The final extract-all-mask.py script keeps the original front and back cover, but converts every page from 2 to N-1 from image+mask to just the mask.

Out of about 400 such pdf's:

  • There is one (also apparently the oldest by its creation/modification date) which has this strange problem (indented and spaced for readability from original) of every page of 308 refering to every image and mask. Hence the page.clean_contents() line and mutool clean -sg ....
4668 0 obj
<< /Type /Page
   /Parent 924 0 R
   ...
   /Resources<<...
               /XObject<</I1 4673 0 R
                         /I2 4674 0 R
                         /I3 4675 0 R
                         ...
                         /I308 4980 0 R
                         /Im001 4981 0 R
                         ...
                         /Im308 5595 0 R
                         >>
               >>
   /MediaBox[0 0 487 832]
   /StructParents 0
  >>
  • There is one other pdf which seems to have (blank) pages with mask only.

  • The script under gist.github.com has been modified in small ways. The original can be found at that location. (migrated to python 3, and changing resolution and color inversion; see below).

  • For about 90% (i.e. about 40) of the 400, the heuristics (some masks have a "DeviceGray" color space name, and need to be inverted to be black-text-on-white-background. Such images also have resolution 360 dpi instead of 300 dpi) seems to work. But 5 pdfs appear white-text-on-black-background, and about 40 have the wrong resolution with values 150dpi, 300dpi, 350 dpi, 360dpi, 500dpi, and ~642dpi being seen. Script convert-page2.py is intended to be run on the result pdf to quickly see the color of page 2 of result. resolution-page2.py is intended to be run on the source pdf quickly to tell the resolution, plus the every-page-references-every-image and mask-only-page problems.