UB-Mannheim/ocr-fileformat

Conversion to ALTO-2.0 is invalid

I tried converting hOCR to ALTO 2.0, and when I then ran ocr-validate on the resulting file I got:

mXSDFilename: /usr/local/share/ocr-fileformat/xsd/alto-2-0.xsd
mXMLFilename: /data/000144300/060.alto
/data/000144300/060.alto fails to validate because: 

cvc-pattern-valid: Value '' is not facet-valid with respect to pattern '([a-zA-Z]{1,8})(-[a-zA-Z0-9]{1,8})*' for type 'language'.
At: 1:934

I also tried converting to other ALTO versions, and those all failed as well, but that was only a test: I need version 2.0.
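
For anyone reproducing this, the failing facet can be checked in isolation. The validator message says that an empty string was found where a value of type 'language' is expected. Below is a minimal sketch, assuming a POSIX shell with `grep -E`; the pattern is copied from the message above, and `is_valid_lang` is a hypothetical helper name, not part of ocr-fileformat.

```shell
#!/bin/sh
# Pattern copied from the validator message (facet for type 'language'),
# anchored so that the whole value has to match.
LANG_PATTERN='^([a-zA-Z]{1,8})(-[a-zA-Z0-9]{1,8})*$'

# Hypothetical helper, not part of ocr-fileformat.
# Exit status 0 means the value satisfies the pattern.
is_valid_lang() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq "$LANG_PATTERN"
}

# The empty value from the error message is rejected...
is_valid_lang ''      || echo "empty value: rejected"
# ...while ordinary language codes are accepted.
is_valid_lang 'eng'   && echo "eng: accepted"
is_valid_lang 'en-US' && echo "en-US: accepted"
```

To see what sits at the reported position (`1:934`), something like `head -n 1 /data/000144300/060.alto | cut -c 900-970` should print the surrounding markup, although the validator may count columns slightly differently.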

Which hOCR did you use for that test? Could you please add it here to allow reproducing the problem?

I have attached the file, but every hOCR file I tried to convert causes the same problem. The hOCR was created by Tesseract v3.04.01.
060.hocr.zip

kba commented

Thanks for trying this, @FoxKyong, and for asking for ALTO support in Tesseract.

The problem is in https://github.com/kba/hOCR-to-ALTO/; I'll look into it.

kba commented

The problem was in the language mapping. It should be fixed in kba/hOCR-to-ALTO#1. Can you run

(cd vendor/hOCR-to-ALTO; git pull)

and try the transformation/validation again?

I have tried it and it works. Thanks.