OCR does not generate output for empty pages, crashes

Question

peterdekker opened this issue 6 years ago · 2 comments

I ran the PICCL workflow on a number of page images from a book (356411-356419.tif from CirculaireBriefFranseNatie). For pages without text, no FoLiA file is created in the OCR step. Subsequent steps then fail to find this file, ultimately leading to a segmentation fault.

See here for the error.log: https://pastebin.ubuntu.com/p/kK2Q9YFPJy/

I think that ideally, OCR output should be generated for empty pages as well. Alternatively, subsequent steps should be able to work with missing files.

@JessedeDoes

Answer 1 · 2019-05-07T13:47:03.000Z

The current solution is indeed rather patchy and not sufficient: sometimes 'empty' hOCR files get fed in that won't produce a FoLiA file. It looks as if Nextflow produces an empty output file in that case, which is obviously not valid FoLiA, and FoLiA-correct stumbles on it. I'll add an extra check that weeds out those zero-byte files before FoLiA-correct (still not very elegant, though).
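
For illustration only, here is a minimal sketch of such a zero-byte check. It is written in Python rather than in the Nextflow that PICCL uses, and it is not PICCL's actual code: the function name `split_empty_files` and the `*.folia.xml` file pattern are assumptions made for this example.

```python
from pathlib import Path
from typing import List, Tuple


def split_empty_files(
    directory: str, pattern: str = "*.folia.xml"
) -> Tuple[List[Path], List[Path]]:
    """Split matching files into (non_empty, empty) by file size.

    Hypothetical helper: the name and the default pattern are
    assumptions for this sketch, not part of PICCL.
    """
    non_empty: List[Path] = []
    empty: List[Path] = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue  # ignore directories that happen to match
        if path.stat().st_size == 0:
            empty.append(path)  # zero-byte file, cannot be valid FoLiA
        else:
            non_empty.append(path)
    return non_empty, empty
```

Only the first list would be passed on to the next step; the second list can be logged so that the skipped pages remain traceable. Note that a size check only catches zero-byte files, not files that are non-empty but otherwise invalid.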

Answer 2 · 2019-06-14T21:02:32.000Z

Should be solved in v0.8.0.