Is "ocr_carea" obligatory for representing text blocks?
Does the sequence of ocr_* elements represent a strict hierarchy?
<body>
  <div class="ocr_page">
    <div class="ocr_carea">
      <p class="ocr_par">
        <span class="ocr_line">
          <span class="ocrx_word">
            Yield
          </span>
        </span>
      </p>
    </div>
  </div>
</body>
I.e., does every level of the hierarchy have to be present, or can some of them be omitted?
abbyy2hocr.xsl implements the latter, while alto2hocr.xsl implements the former (as included in https://github.com/UB-Mannheim/ocr-fileformat).
No, you cannot assume a strict hierarchy.
ocr_page is required.
ocr_line, while not required by the spec, probably should be. You can assume it is there.
ocr_carea should be used for the print space / columns, but is not used consistently.
ocr_par is not used consistently either.
If ocrx_word elements are used, they are within ocr_line. This holds not by definition but by experience.
I wish I could give you a more stringent answer, but the reality is that a lot of documents have been produced over a long time by implementations based on an underdefined specification.
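
For consumers of hOCR, the practical consequence is to look up elements by class rather than by fixed nesting depth. Here is a minimal sketch of that idea, assuming Python's standard html.parser module; the names HocrWords, words_with_context, FULL and SPARSE are hypothetical and not part of any hOCR tool. It records, for each ocrx_word, whichever ocr_* ancestors happen to be present.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class HocrWords(HTMLParser):
    """Collect (word, ancestor ocr classes) without assuming a fixed nesting."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one entry per open element: (tag, ocr class or None)
        self.words = []   # (text, [ancestor ocr classes, outermost first])
        self._buf = None  # text pieces of the word currently being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        ocr = next((c for c in classes if c.startswith(("ocr_", "ocrx_"))), None)
        self.stack.append((tag, ocr))
        if ocr == "ocrx_word":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        _, ocr = self.stack.pop()
        if ocr == "ocrx_word":
            # Whatever ocr_* levels are still open are this word's context.
            ancestors = [c for _, c in self.stack if c]
            self.words.append(("".join(self._buf).strip(), ancestors))
            self._buf = None


def words_with_context(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


# The full hierarchy from the question, and a sparse variant that omits
# ocr_carea and ocr_par.
FULL = ('<body><div class="ocr_page"><div class="ocr_carea">'
        '<p class="ocr_par"><span class="ocr_line">'
        '<span class="ocrx_word">Yield</span></span></p></div></div></body>')
SPARSE = ('<body><div class="ocr_page"><span class="ocr_line">'
          '<span class="ocrx_word">Yield</span></span></div></body>')
```

Both words_with_context(FULL) and words_with_context(SPARSE) find the word; only the list of ancestors differs. The sketch assumes well-formed markup with matching end tags, so real-world input may need a more forgiving parser.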