OCR does not generate output for empty pages, crashes
peterdekker opened this issue · 2 comments
I ran the PICCL workflow on a number of page images from a book (356411-356419.tif from CirculaireBriefFranseNatie). For pages without text, no FoLiA file is created in the OCR step. The file is then missing in subsequent steps, ultimately leading to a segmentation fault.
See the error.log here: https://pastebin.ubuntu.com/p/kK2Q9YFPJy/
I think that ideally, OCR output should be generated for empty pages as well. Alternatively, subsequent steps should be able to work with missing files.
The current solution is indeed rather patchy and not sufficient: sometimes 'empty' hOCR files get fed in that won't produce a FoLiA file. It looks as if Nextflow produces an empty output file in that case, which is obviously not valid FoLiA, and FoLiA-correct stumbles on it. I'll add an extra check that weeds out those zero-byte files before FoLiA-correct (still not very elegant, though).
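For illustration only, the zero-byte check could look roughly like the sketch below. This is not the actual PICCL/Nextflow code: the function name `nonempty_files` and the `*.folia.xml` file pattern are assumptions made for the example.

```python
from pathlib import Path
from typing import List


def nonempty_files(directory: Path, pattern: str = "*.folia.xml") -> List[Path]:
    """Return the files in `directory` matching `pattern` that are larger than zero bytes.

    Hypothetical helper: zero-byte files are skipped so that a later step
    (such as FoLiA-correct) never receives an empty, invalid document.
    """
    return sorted(
        path
        for path in directory.glob(pattern)
        if path.is_file() and path.stat().st_size > 0
    )
```

Only the paths returned by such a filter would then be handed to the correction step. A size check does not catch files that are non-empty but still invalid FoLiA, so it is a stopgap rather than a full validation.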
This should be solved in v0.8.0.