Image extractor for PDFs using PubLayNet
Setup Environment
- Create a conda env
conda create -n image_ex python=3.7
conda activate image_ex
- install Pytorch and TorchVision. E.g.:
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
# or: pip install torch torchvision
- install opencv and imagemagick
conda install -c conda-forge opencv
conda install -c conda-forge imagemagick
# or: apt install imagemagick
- install Detectron2
- Prebuilds for Linux are easiest.
- For MacOS use:
CC=clang CXX=clang++ python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
- Download weights:
cd weights;source download.sh;cd ..
Extract Image from PDF
After extraction you should have the following files available for
a pdf example_file.pdf
:
example_file.png - the best representative image w.r.t. heuristic
(example_file_best_<number>.png) - [only if NOT shortcut] the images sorted by their representative value
From a set of PDFs:
python extract_image.py --pdf proceedings_*.pdf --cleanup all
Applies the image extraction to the PDF files, applies the histogram heuristic plus first-page priority. It cleans up all temporary files. It detects automatically if running on cuda or cpu.
From a set of PDFs - Even faster:
Uses a shortcut that if the first page contains an image, it uses this one and does not parse the rest.
python extract_image.py --firstpage shortcut --device cuda --pdf proceedings_*.pdf --cleanup
Extract Images from Paper images
No cleanup, no heuristic.. just plain image extraction:
python extract_image.py --input pages_*.png
Usage
usage: extract_image.py [-h] [--input INPUT [INPUT ...]] [--pdf PDF [PDF ...]]
[--confidence-threshold CONFIDENCE_THRESHOLD]
[--opts ...] [--config-file FILE] [--device DEVICE]
[--weights WEIGHTS] [--overwrite]
[--firstpage FIRSTPAGE] [--cleanup CLEANUP]
[--accept N [N ...]]
Extract interesting images from PDFs or paper page images - using PubLayNet
and Detectron2 (pytorch) and a simple histogram heuristic.
optional arguments:
-h, --help show this help message and exit
--input INPUT [INPUT ...]
A list of space separated input images; or a single
glob pattern such as 'directory/*.jpg' (default: None)
--pdf PDF [PDF ...] A list of space separated PDF files.or a single glob
pattern such as 'directory/*.pdf' (default: None)
--confidence-threshold CONFIDENCE_THRESHOLD
Minimum score for instance predictions to be shown
(default: 0.5)
--opts ... Modify config options using the command-line 'KEY
VALUE' pairs (default: [])
--config-file FILE path to config file (default:
configs/DLA_mask_rcnn_R_50_FPN_3x.yaml)
--device DEVICE run on device (cuda/cpu) (default: cpu)
--weights WEIGHTS run on device (default:
weights/DLA_mask_rcnn_R_50_FPN_3x_trimmed.pth)
--overwrite run on device (default: False)
--firstpage FIRSTPAGE
if image on first page give it high priority --
values: ['shortcut','prio','none'] (default: prio)
--cleanup CLEANUP remove tmp files [images, pages, all] (default: none)
--accept N [N ...] accepted classes (default: [4])
License
Apache 2.0
Acknowledgements
This code is heavily based on these repositories:
- https://github.com/ibm-aur-nlp/PubLayNet
- https://github.com/hpanwar08/detectron2
- https://github.com/Mini-Conf/image-extraction
Thank you dear contributors !! Please contact me if you like a different mention