qurator-spk/dinglehopper

Warn if there is text missing in the ReadingOrder

mikegerber opened this issue · 1 comment

For 00451941.gt.xml, dinglehopper-extract does not extract the header text "DE L'ESPRIT DE L'HOMME".

The header is in TextRegion r3, but the ReadingOrder only includes the main text in r1, so dinglehopper extracts only the main text. In other words, the file is buggy, not dinglehopper.

However, we can do better by warning whenever a region is not included in the extracted text.
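
A rough sketch of such a check, assuming the input is PAGE XML where ReadingOrder entries point to regions via a `regionRef` attribute. This is not dinglehopper's actual code: the function names `local_name` and `regions_missing_from_reading_order` are made up for illustration, and the sample document is hand-written, not the contents of 00451941.gt.xml.

```python
import warnings
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def regions_missing_from_reading_order(root):
    """Return ids of TextRegions that no ReadingOrder entry references.

    Hypothetical helper. Returns [] if there is no ReadingOrder at all.
    Limitation: a TextRegion nested inside a referenced region is still
    reported unless it is referenced itself.
    """
    referenced = set()
    has_reading_order = False
    for elem in root.iter():
        if local_name(elem.tag) == "ReadingOrder":
            has_reading_order = True
            # Collect every regionRef below ReadingOrder, whatever the
            # group nesting (OrderedGroup, UnorderedGroup, ...).
            for ref in elem.iter():
                region_ref = ref.get("regionRef")
                if region_ref is not None:
                    referenced.add(region_ref)
    if not has_reading_order:
        return []
    return [
        elem.get("id")
        for elem in root.iter()
        if local_name(elem.tag) == "TextRegion"
        and elem.get("id") not in referenced
    ]


# Hand-written minimal sample mirroring the situation described above:
# r1 is in the ReadingOrder, r3 is not.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"/>
    <TextRegion id="r3"/>
  </Page>
</PcGts>"""

missing = regions_missing_from_reading_order(ET.fromstring(SAMPLE))
for region_id in missing:
    warnings.warn(f"TextRegion {region_id} is not in the ReadingOrder; its text is skipped")
```

Matching on local tag names keeps the sketch independent of the PAGE schema version in the namespace URI.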