bertsky/workflow-configuration

ocrd-import: resolution when doing pdf conversion

Closed · 1 comment

When ocrd-import converts PDF input files, it uses the default pixel density of 72 DPI (since vector graphics have no native pixel density). This is insufficient for OCR, but the exact value required may depend on the use case. So there should at least be a parameter (e.g. --render-dpi) specifying which density to use for all vector graphics formats. One could then instruct ImageMagick to run convert -density $((2*$render_dpi)) input.pdf -resample $render_dpi output.png, i.e. rasterize at twice the target density and then resample down to it.
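A minimal sketch of how such a parameter could be wired into the conversion step. The function name render_vector and the DRY_RUN switch are hypothetical and only serve the illustration; --render-dpi is the option proposed above, not an existing one. The ImageMagick call itself is the one quoted in this issue.

```bash
#!/usr/bin/env bash
# Hypothetical helper: render a vector input (e.g. PDF) at a chosen density.
# $1 = target density in DPI (would come from the proposed --render-dpi)
# $2 = input file, $3 = output file
render_vector() {
    local render_dpi=${1:?target DPI required}
    local input=${2:?input file required}
    local output=${3:?output file required}
    # Rasterize at twice the target density, then resample down to the target.
    local cmd=(convert -density "$((2 * render_dpi))" "$input"
               -resample "$render_dpi" "$output")
    if [[ -n ${DRY_RUN:-} ]]; then
        # Only print the command (DRY_RUN is an assumption of this sketch).
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}
```

Called as DRY_RUN=1 render_vector 300 input.pdf output.png, this only prints the command line it would run, which makes it easy to check the density arithmetic before converting anything.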

But what if the PDF itself contains raster images? They should not be re-rasterized (especially not by upsampling), but extracted raw (e.g. with pdfimages from poppler). Whether that is appropriate, however, depends on whether these images are full-page (i.e. representative of the page) or just embedded figures.
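One rough way to tell the two cases apart, sketched here as an assumption rather than as what the fix actually does: treat the PDF as a scan if the listing printed by pdfimages -list shows exactly one image per page. The function name one_image_per_page is hypothetical, and the heuristic does not check whether the image really covers the whole page.

```bash
#!/usr/bin/env bash
# Hypothetical heuristic: read the output of `pdfimages -list` on stdin and
# succeed only if every page 1..N carries exactly one entry of type "image".
# $1 = number of pages N (e.g. taken from the "Pages:" line of pdfinfo)
one_image_per_page() {
    local pages=${1:?page count required}
    awk -v pages="$pages" '
        # Skip the two header lines; column 1 is the page, column 3 the type.
        NR > 2 && $3 == "image" { count[$1]++ }
        END {
            if (pages < 1) exit 1
            for (p = 1; p <= pages; p++)
                if (count[p] != 1) exit 1
            exit 0
        }'
}

# Possible use (not run here):
#   pages=$(pdfinfo input.pdf | awk '/^Pages:/ {print $2}')
#   if pdfimages -list input.pdf | one_image_per_page "$pages"; then
#       pdfimages -png input.pdf page    # extract the embedded images
#   else
#       : # fall back to rendering the pages at the chosen density
#   fi
```

A PDF with one small figure on each page would also pass this test, so a real implementation would likely need to compare image and page dimensions as well.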

Fixed by 581332d