alto2hocr: Hyphenation sign is not handled correctly
Example (the dash would normally be an em dash):
...
<String WC="0.6233333349" CONTENT="con" HEIGHT="26" WIDTH="79" VPOS="132" HPOS="596" SUBS_TYPE="HypPart1" SUBS_CONTENT="conservation"/>
<HYP CONTENT="-"/>
</TextLine>
<TextLine HEIGHT="43" WIDTH="679" VPOS="175" HPOS="12">
<String WC="0.7411110997" CONTENT="servation" HEIGHT="43" WIDTH="194" VPOS="175" HPOS="12" SUBS_TYPE="HypPart2" SUBS_CONTENT="conservation"/>
...
will be transformed into
...
<span class="ocrx_word" id="word_d1e73" title="bbox 596 132 675 158">con</span>
</span>
<span class="ocr_line" id="line_d1e76" title="bbox 12 175 691 218">
<span class="ocrx_word" id="word_d1e77" title="bbox 12 175 206 218">servation</span>
...
and the dash is missing. For correct presentation (and error computation) it should be part of the first word here. Can this be easily done with the XSLT approach here?
Obviously ALTO uses a <HYP> tag which is currently not handled by the transformation. I did not find a similar tag in the hOCR specification.
http://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/hocr/417576986_0012.hocr is the result of an extended style sheet where I used a quick hack to produce something useful.
Hm... I see an extra node in your result, which introduces an extra space. Our goal is to get
<span class="ocrx_word" id="word_d1e73" title="bbox 596 132 675 158">con-</span>
I think we can try to replace these lines with something like
<xsl:template match="String">
  <span class="ocrx_word" id="{mf:getId(@ID,'word',.)}" title="{mf:getBox(@HEIGHT,@WIDTH,@VPOS,@HPOS)}">
    <xsl:value-of select="@CONTENT"/>
    <!-- In ALTO the HYP element follows the String it belongs to, so look at the next sibling -->
    <xsl:value-of select="following-sibling::*[1][self::HYP]/@CONTENT"/>
  </span>
</xsl:template>
But this is untested...
Ping @kba. Here are two examples of ALTO files with hyphenations (just search for the HYP node):
- official example from the ALTO community: http://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml
- one of our examples: http://digi.bib.uni-mannheim.de/fileadmin/digi/417576986/alto/417576986_0078.xml
It looks like CONTENT is empty in our HYP tags, i.e. we may have to add something like a conditional statement.
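A conditional of that kind might look roughly like the sketch below. It is not the change that was eventually merged, only an illustration under a few assumptions: the input elements are in no namespace (as in the excerpt at the top), the bounding box is computed inline instead of through mf:getBox so that the stylesheet is self-contained, the id comes from generate-id(), and a plain "-" is used as the fallback when HYP/@CONTENT is empty or missing. That fallback character is a guess; a different one may suit other data better.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Assumption: String and HYP are in no namespace, as in the excerpt. -->
  <xsl:template match="String">
    <!-- The element directly after this String, but only if it is a HYP. -->
    <xsl:variable name="hyp" select="following-sibling::*[1][self::HYP]"/>
    <span class="ocrx_word" id="word_{generate-id()}"
          title="bbox {@HPOS} {@VPOS} {@HPOS + @WIDTH} {@VPOS + @HEIGHT}">
      <xsl:value-of select="@CONTENT"/>
      <xsl:if test="$hyp">
        <xsl:choose>
          <!-- Use the hyphen recorded in the ALTO file when there is one. -->
          <xsl:when test="normalize-space($hyp/@CONTENT) != ''">
            <xsl:value-of select="$hyp/@CONTENT"/>
          </xsl:when>
          <!-- Assumed fallback for an empty or missing CONTENT attribute. -->
          <xsl:otherwise>-</xsl:otherwise>
        </xsl:choose>
      </xsl:if>
    </span>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the first String in the example, this should yield a single word span containing "con-" with the title "bbox 596 132 675 158", without an extra node or space for the hyphen.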
Here's a small change to the alto2hocr.xsl script that should accomplish this: kba/hOCR-to-ALTO@f447ace
kba/hOCR-to-ALTO@f447ace has been merged, so this issue can probably be closed.