UB-Mannheim/ocr-fileformat

alto2hocr: Hyphenation sign is not handled correctly

Closed this issue · 6 comments

Example (the dash is normally an em dash):

   ...
   <String WC="0.6233333349" CONTENT="con" HEIGHT="26" WIDTH="79" VPOS="132" HPOS="596" SUBS_TYPE="HypPart1" SUBS_CONTENT="conservation"/>
   <HYP CONTENT="­-­"/>
</TextLine>
<TextLine HEIGHT="43" WIDTH="679" VPOS="175" HPOS="12">
   <String WC="0.7411110997" CONTENT="servation" HEIGHT="43" WIDTH="194" VPOS="175" HPOS="12" SUBS_TYPE="HypPart2" SUBS_CONTENT="conservation"/>
   ...

will be transformed into

   ...
   <span class="ocrx_word" id="word_d1e73" title="bbox 596 132 675 158">con</span>
</span>
<span class="ocr_line" id="line_d1e76" title="bbox 12 175 691 218">
   <span class="ocrx_word" id="word_d1e77" title="bbox 12 175 206 218">servation</span>
   ...

and the dash is missing. For correct presentation (and error computation) it should be part of the first word here. Can this be done easily with the XSLT approach here?

Obviously ALTO uses a <HYP> tag which is currently not handled by the transformation. I did not find a similar tag in the hOCR specification.

http://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/hocr/417576986_0012.hocr is the result of an extended style sheet where I used a quick hack to produce something useful.

Hm... I see an extra node in your result, which produces an extra space. Our goal is to get

<span class="ocrx_word" id="word_d1e73" title="bbox 596 132 675 158">con-</span>

I think we can try to replace these lines with something like

 <xsl:template match="String">
    <span class="ocrx_word" id="{mf:getId(@ID,'word',.)}" title="{mf:getBox(@HEIGHT,@WIDTH,@VPOS,@HPOS)}">
        <xsl:value-of select="@CONTENT"/>
        <!-- HYP comes after the hyphenated String, so check the next sibling rather than a preceding one -->
        <xsl:value-of select="following-sibling::*[1][self::HYP]/@CONTENT"/>
    </span>
 </xsl:template>

But this is untested...

Ping @kba . Here are two examples of ALTO files with hyphenations (just search for HYP node):

It looks like CONTENT is empty in our HYP tags, so we may need something like a conditional statement.
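To make the intended behaviour concrete outside of XSLT, here is a minimal Python sketch of the conditional: a `HYP` that directly follows a `String` is appended to that word, and an empty `CONTENT` falls back to a plain hyphen. The fragment is a simplified, namespace-free version of the example above, and the helper name `words_with_hyphen` and the `fallback` parameter are assumptions for illustration only, not part of alto2hocr.xsl.

```python
import xml.etree.ElementTree as ET

# Assumption: simplified, namespace-free ALTO fragment with an empty HYP CONTENT.
ALTO_LINE = """
<TextLine>
  <String CONTENT="con" HEIGHT="26" WIDTH="79" VPOS="132" HPOS="596"/>
  <HYP CONTENT=""/>
</TextLine>
"""

def words_with_hyphen(text_line, fallback="-"):
    """Return the word texts of one TextLine; a HYP directly after a String is appended to it."""
    children = list(text_line)
    words = []
    for i, node in enumerate(children):
        if node.tag != "String":
            continue
        text = node.get("CONTENT", "")
        nxt = children[i + 1] if i + 1 < len(children) else None
        if nxt is not None and nxt.tag == "HYP":
            # Use the HYP's own CONTENT when it is non-empty, otherwise the fallback sign.
            text += nxt.get("CONTENT") or fallback
        words.append(text)
    return words

print(words_with_hyphen(ET.fromstring(ALTO_LINE)))
```

The same check in the stylesheet would presumably be an `xsl:if` or `xsl:choose` on the `HYP` element's `@CONTENT`. Whether a plain hyphen is the right fallback depends on the source material.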

kba commented

Here's a small change to the alto2hocr.xsl script that should accomplish this: kba/hOCR-to-ALTO@f447ace

filak commented

kba/hOCR-to-ALTO@f447ace has been merged, so this issue can probably be closed.

I wasn't sure anymore if this was already included. Thank you for the confirmation @filak !