Conversion to ALTO-2.0 is invalid
I converted an hOCR file to ALTO-2.0 and then ran ocr-validate on the result, which reported:
mXSDFilename: /usr/local/share/ocr-fileformat/xsd/alto-2-0.xsd
mXMLFilename: /data/000144300/060.alto
/data/000144300/060.alto fails to validate because:
cvc-pattern-valid: Value '' is not facet-valid with respect to pattern '([a-zA-Z]{1,8})(-[a-zA-Z0-9]{1,8})*' for type 'language'.
At: 1:934
I also tried converting it to the other ALTO versions, and all of those failed too. That was only a test, though; I need version 2.0.
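For readers unfamiliar with this validator message: it says the empty string does not match the pattern of the XSD `language` type. A minimal Python sketch, using only the pattern quoted in the error above, shows why an empty value is rejected while ordinary language tags pass. The helper name `is_valid_language` is made up for illustration and is not part of ocr-fileformat.

```python
import re

# Pattern copied verbatim from the validator message for the XSD 'language' type.
LANGUAGE_PATTERN = re.compile(r"([a-zA-Z]{1,8})(-[a-zA-Z0-9]{1,8})*")


def is_valid_language(value: str) -> bool:
    """Return True if the whole value matches the pattern (hypothetical helper)."""
    return LANGUAGE_PATTERN.fullmatch(value) is not None


# The empty string needs at least one letter to match, so it is rejected.
for candidate in ["", "en", "en-US", "deu"]:
    print(repr(candidate), is_valid_language(candidate))
```

Any language attribute written out with an empty value would therefore trigger this error, regardless of the rest of the file.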
Which hOCR did you use for that test? Could you please add it here to allow reproducing the problem?
I have attached the file, but every hOCR file I tried to convert causes the same problem. The hOCR was created by Tesseract v3.04.01.
060.hocr.zip
Thanks for trying, @FoxKyong, and for asking for ALTO support in Tesseract.
The problem is in https://github.com/kba/hOCR-to-ALTO/; I'll look into it.
The problem was with the language mapping. It should be fixed in kba/hOCR-to-ALTO#1. Can you run
(cd vendor/hOCR-to-ALTO; git pull)
and try the transformation/validation again?
I have tried it and it works. Thanks.
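For anyone who hits a similar validation error, a quick way to locate the offending element before running the full validator is to scan the converted file for empty language attributes. The following is only a sketch under assumptions: the function name `find_empty_language_attrs` and the sample document are invented for illustration, and the attribute names checked (`language` and `LANG`) are an assumption about what different ALTO versions use.

```python
import xml.etree.ElementTree as ET

# Attribute names to inspect; which one applies depends on the ALTO version (assumption).
LANGUAGE_ATTRS = ("language", "LANG")


def find_empty_language_attrs(xml_text):
    """Return (element name, attribute name) pairs whose language value is empty."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        for name in LANGUAGE_ATTRS:
            if name in elem.attrib and elem.attrib[name].strip() == "":
                # Strip the "{namespace}" prefix ElementTree puts on tag names.
                tag = elem.tag.rsplit("}", 1)[-1]
                hits.append((tag, name))
    return hits


# Made-up sample, not taken from the attached 060.hocr conversion.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="p1"><PrintSpace>
    <TextBlock ID="b1" language=""/>
    <TextBlock ID="b2" language="en"/>
  </PrintSpace></Page></Layout>
</alto>"""

print(find_empty_language_attrs(sample))
```

On the sample, only the first `TextBlock` is reported, since its `language` attribute is present but empty.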