Usage example for converting page xml to searchable pdf?

Question

Aysoltan opened this issue 3 years ago · 1 comments

Hi,

Is it possible to convert PAGE-XML into a searchable PDF? Running ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -P textequiv_level word does produce a PDF, but it is not searchable. This is the workflow I used to produce the PAGE-XML:

ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-recognize -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P textequiv_level word -P segmentation_level word -P overwrite_segments true" \
  "cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"

I don't see where the problem is. Any idea what needs to be changed or added in the workflow? Maybe it has to do with the region recognizer?

Best
Aysoltan

Answer 1 · 2022-03-29T09:28:37.000Z

@Aysoltan, if the PDF files created from that workflow have no text layer, then probably the error happened earlier (before ocrd-pagetopdf). Have you checked that OCR-D-OCR does contain TextEquiv elements?
(For example, by running find OCR-D-OCR/ -exec page-extract-words {} ";", or by looking into the PAGE-XML, or by running OCRD-Browser...)
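If none of those tools are at hand, the same check can be sketched with the Python standard library alone. This is only a rough illustration, not part of OCR-D: the function name count_word_texts is made up here, and it assumes the PAGE-XML files sit directly in the OCR-D-OCR directory as *.xml. It counts Word elements and how many of them carry a non-empty TextEquiv/Unicode, since the PDF was requested with textequiv_level word.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def count_word_texts(root):
    """Return (number of Word elements, number of Words with non-empty text).

    Hypothetical helper for illustration; `root` is the parsed PcGts element.
    The `{*}` wildcard (Python 3.8+) matches any PAGE namespace version.
    """
    words = root.findall(".//{*}Word")
    with_text = 0
    for word in words:
        # Look only at the Word's own TextEquiv, not at Glyph-level children.
        unicode_el = word.find("{*}TextEquiv/{*}Unicode")
        if unicode_el is not None and (unicode_el.text or "").strip():
            with_text += 1
    return len(words), with_text


if __name__ == "__main__":
    # Assumed layout: one PAGE-XML file per page inside the file group directory.
    directory = Path(sys.argv[1] if len(sys.argv) > 1 else "OCR-D-OCR")
    if directory.is_dir():
        for path in sorted(directory.glob("*.xml")):
            total, filled = count_word_texts(ET.parse(path).getroot())
            print(f"{path.name}: {filled} of {total} words have text")
```

Run it from the workspace directory, optionally passing another file group directory as the first argument. If a page reports 0 words with text, the text layer is already missing in the PAGE-XML, so the problem lies before ocrd-pagetopdf.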