filak/hOCR-to-ALTO

Add "ocr_carea" to hOCR output (of alto2hocr.xsl)

Closed this issue · 8 comments

wrznr commented

Running alto2hocr.xsl on this ALTO file via ocr-fileformat produces hOCR that is probably invalid, because it contains no ocr_carea elements. According to the specification, all parts of the text should be contained in such an element.

Validating the resulting hOCR file gives:

$ ocr-validate hocr 00000011.html 
[WARN] STDIN Recommended metadata field 'ocr-langs' missing
[ERROR] STDIN:13 Error parsing properties for "<div class="ocr_page" id="Page1" title="image ; bbox 0 0  ; ppageno 0">" : (property need more than 1 value to unpack)

(which is most likely a different problem).
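
For reference, the title attribute in the error has an `image` property with no value and a `bbox` with two numbers instead of the four (x0 y0 x1 y1) that hOCR defines. The following is a minimal Python sketch of that kind of check. It assumes properties are separated by `;` and that the first whitespace-delimited token is the property name; `check_title` is a hypothetical helper and not the logic ocr-validate actually uses.

```python
def check_title(title):
    """Return a list of problems found in an hOCR title attribute."""
    problems = []
    for raw in title.split(";"):
        tokens = raw.split()
        if not tokens:
            continue  # skip empty segments such as a trailing ';'
        name, values = tokens[0], tokens[1:]
        if not values:
            problems.append(f"{name}: no value")
        elif name == "bbox" and len(values) != 4:
            problems.append(f"bbox: expected 4 values, got {len(values)}")
    return problems


if __name__ == "__main__":
    # Title copied from the validator error above.
    print(check_title("image ; bbox 0 0  ; ppageno 0"))
```

On the title from the error message this reports the empty `image` property and the two-value `bbox`, which points at missing page metadata in the conversion rather than at ocr_carea.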

filak commented

I do not see the ocr_carea in your output - why do you think that its absence causes the error?

filak commented

Well, your input ALTO file does not contain any ComposedBlock elements - so there is no content to transform into ocr_carea...

    <xsl:template match="PrintSpace">
        <xsl:apply-templates select="ComposedBlock"/>
        <xsl:apply-templates select="TextBlock"/>
    </xsl:template>

    <xsl:template match="ComposedBlock">
        <div class="ocr_carea" id="{mf:getId(@ID,'block',.)}" title="...">
            <xsl:apply-templates select="TextBlock"/>
        </div>
    </xsl:template>
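
Whether a given ALTO file has any ComposedBlock elements for these templates to match can be checked quickly. The following is a minimal Python sketch using only the standard library; `count_blocks` is a hypothetical helper, the sample fragment is made up for illustration, and namespaces are ignored on the assumption that only the local element names matter here.

```python
import xml.etree.ElementTree as ET


def count_blocks(alto_xml):
    """Count ComposedBlock and TextBlock elements, ignoring the XML namespace."""
    counts = {"ComposedBlock": 0, "TextBlock": 0}
    for element in ET.fromstring(alto_xml).iter():
        # ElementTree names look like '{namespace}Local'; keep the local part.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in counts:
            counts[local_name] += 1
    return counts


if __name__ == "__main__":
    # Tiny hypothetical ALTO fragment with a TextBlock directly under PrintSpace.
    sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
      <Layout><Page><PrintSpace>
        <TextBlock ID="b1"/>
      </PrintSpace></Page></Layout>
    </alto>"""
    print(count_blocks(sample))
```

A ComposedBlock count of zero means the second template above never fires, so the output contains no ocr_carea.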

wrznr commented

That's exactly the point! Sorry for not making this clear in the first place: As far as I understand the hOCR specs, every text segment has to be enclosed by an ocr_carea (not only composed blocks). Though maybe I am wrong about this... (@kba FYI)

filak commented

The docs are a bit unclear. But IMHO you cannot create ocr_carea without the corresponding ALTO elements.

From my point of view this is not a bug in the transformation, so I am closing this.

kba commented

> As far as I understand the hOCR specs, every text segment has to be enclosed by an ocr_carea

That should be the case, but as @filak said, it is at least underspecified. Do not rely on that for transformations :-(

jtlz2 commented

@wrznr I am also getting an essentially empty hOCR file (when running ocr-transform) from an ALTO file produced by ABBYY. Did you manage to find a way to do the conversion? Thanks!

wrznr commented

@jtlz2 No. But this is actually not my use case. I plan to go from ALTO to TEI, and since I have a method to convert hOCR to TEI, I thought I could use this script as an intermediate step. Because the hOCR documentation is unclear (see above), I dropped this idea.

For your use case, maybe https://gist.github.com/tfmorris/5977784 helps?