Add "ocr_carea" to hOCR output (of alto2hocr.xsl)
Closed this issue · 8 comments
Running alto2hocr.xsl on this ALTO file via ocr-fileformat produces hOCR that is probably invalid, since it lacks ocr_carea elements. According to the specification, all parts of the text should be contained in such an element.
Validating the resulting hOCR file yields:
$ ocr-validate hocr 00000011.html
[WARN] STDIN Recommended metadata field 'ocr-langs' missing
[ERROR] STDIN:13 Error parsing properties for "<div class="ocr_page" id="Page1" title="image ; bbox 0 0 ; ppageno 0">" : (property need more than 1 value to unpack)
(which is most likely a different problem).
I do not see the ocr_carea in your output - why do you think that its absence causes the error?
Well, your input ALTO file does not contain any ComposedBlock elements - so there is no content to transform into ocr_carea...
<xsl:template match="PrintSpace">
  <xsl:apply-templates select="ComposedBlock"/>
  <xsl:apply-templates select="TextBlock"/>
</xsl:template>

<xsl:template match="ComposedBlock">
  <div class="ocr_carea" id="{mf:getId(@ID,'block',.)}" title="...">
    <xsl:apply-templates select="TextBlock"/>
  </div>
</xsl:template>
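To check whether a given ALTO file falls into this case, a quick diagnostic along the following lines may help. It is only a sketch, not part of ocr-fileformat: the function names and the sample document are hypothetical, and it compares element names without their namespace so that it does not depend on a particular ALTO version. It counts ComposedBlock elements and the TextBlock elements that sit directly under PrintSpace, which are the ones the templates above emit without an ocr_carea wrapper.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' so names compare equal across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def summarize_blocks(alto_xml):
    """Count ComposedBlock elements and TextBlock elements directly under PrintSpace."""
    root = ET.fromstring(alto_xml)
    composed_blocks = 0
    direct_text_blocks = 0
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "ComposedBlock":
            composed_blocks += 1
        elif name == "PrintSpace":
            # Only direct children: these would not be wrapped in ocr_carea.
            direct_text_blocks += sum(
                1 for child in elem if local_name(child.tag) == "TextBlock"
            )
    return {
        "composed_blocks": composed_blocks,
        "direct_text_blocks": direct_text_blocks,
    }


# Hypothetical minimal ALTO document, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="Page1">
      <PrintSpace>
        <TextBlock ID="TB1"><TextLine><String CONTENT="Hello"/></TextLine></TextBlock>
        <TextBlock ID="TB2"><TextLine><String CONTENT="World"/></TextLine></TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

if __name__ == "__main__":
    print(summarize_blocks(SAMPLE))
```

If composed_blocks is 0 while direct_text_blocks is greater than 0, the input has no ComposedBlock elements, so no ocr_carea is to be expected in the output of the templates shown above.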
That's exactly the point! Sorry for not making this clear in the first place: as far as I understand the hOCR specs, every text segment has to be enclosed by an ocr_carea (not only composed blocks). Though maybe I am wrong about this... (@kba FYI)
The docs are a bit unclear. But IMHO you cannot create ocr_carea without the respective ALTO elements.
From my point of view, this is not a bug in the transformation, so I am closing this.
As far as I understand the hOCR specs, every text segment has to be enclosed by an ocr_carea
That should be the case, but as @filak said, it is at the very least underspecified. Do not rely on that for transformations :-(
@wrznr I am also getting an essentially empty hOCR file (when running ocr-transform) from an ALTO file generated by ABBYY. Did you manage to find a way to do the conversion? Thanks!
@jtlz2 No. But this is actually not my use case. I plan to go from ALTO to TEI, and since I have a method to convert hOCR to TEI, I thought I could use this script as an intermediate step. Because the hOCR documentation is unclear (cf. above), I dropped this idea.
For your use case, maybe https://gist.github.com/tfmorris/5977784 helps?