LanguageMachines/PICCL

OCR does not generate output for empty pages, crashes

peterdekker opened this issue · 2 comments

I ran the PICCL workflow on a number of page images from a book (356411-356419.tif from CirculaireBriefFranseNatie). For pages without text, no FoLiA file is created in the OCR step. That file is then missing in subsequent steps, which ultimately leads to a segmentation fault.

See here for the error.log: https://pastebin.ubuntu.com/p/kK2Q9YFPJy/

I think that ideally, OCR output should be generated for empty pages as well. Alternatively, subsequent steps should be able to work with missing files.

@JessedeDoes

The current solution is indeed rather patchy and not sufficient: sometimes 'empty' hOCR files get fed in that won't produce a FoLiA file. It looks as if Nextflow produces an empty output file in that case, which is obviously not valid FoLiA, and FoLiA-correct stumbles on it. I'll add an extra check that weeds out those zero-byte files before FoLiA-correct (still not very elegant, though).
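To illustrate the kind of check meant here, a minimal sketch in Python follows. It is not the actual PICCL code (the pipeline itself is written in Nextflow). The function name `split_by_size` and the `*.folia.xml` glob pattern are assumptions made up for this example.

```python
from pathlib import Path


def split_by_size(directory, pattern="*.folia.xml"):
    """Split the files matching `pattern` into non-empty and zero-byte ones.

    Hypothetical helper: the name and the default pattern are assumptions
    for illustration, not part of PICCL.
    """
    keep, skipped = [], []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue  # ignore directories that happen to match the pattern
        # A zero-byte file cannot be valid FoLiA, so set it aside.
        if path.stat().st_size > 0:
            keep.append(path)
        else:
            skipped.append(path)
    return keep, skipped
```

Only the paths in `keep` would then be passed on to the next step, while `skipped` can be logged so that it stays visible which pages produced no OCR output. Note that this only catches zero-byte files; a non-empty file that is not valid FoLiA would still get through.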

Should be solved in v0.8.0