Get ocrx_word classes for solr-ocrhighlighting in abbyy2hocr

Question

nemobis opened this issue 5 years ago · 2 comments

If I understand https://literarymachin.es/archiviiify/ correctly, it would be nice to amend https://github.com/OCR-D/format-converters/blob/master/abbyy2hocr.xsl so that it extracts ocrx_word, right? If it's useful, I could take a look at it (but I only use xsltproc).

Answer 1 · 2020-07-06T08:09:45.000Z

Yes, correct. I tried the first XSLT I found on GitHub, but I have not searched for further examples. Saxon is needed because the stylesheet is XSLT 2.0; xsltproc only supports 1.0.

The XSLT produces correct lines (ocrx_line) but not words. The FineReader XML (I had never looked at it before) annotates even individual characters!

ia list -l -f "Abbyy GZ" ITEM
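
Since the FineReader XML annotates individual characters, word boxes could in principle be rebuilt by grouping characters between whitespace and taking the union of their coordinates. The following is only a minimal Python sketch of that idea, not what abbyy2hocr.xsl does: it assumes each character is a `charParams` element with `l`, `t`, `r`, `b` attributes, uses a made-up sample line, and the function names `words_from_line` and `to_hocr_spans` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Made-up sample, assuming ABBYY-style <charParams l t r b> elements.
# Real files declare an XML namespace; local_name() below ignores it.
SAMPLE_LINE = """
<line l="10" t="20" r="200" b="41">
  <formatting>
    <charParams l="10" t="20" r="18" b="40">H</charParams>
    <charParams l="19" t="22" r="26" b="40">i</charParams>
    <charParams l="27" t="20" r="35" b="40"> </charParams>
    <charParams l="36" t="21" r="48" b="40">y</charParams>
    <charParams l="49" t="21" r="60" b="41">o</charParams>
  </formatting>
</line>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def words_from_line(line):
    """Group the characters of one line into (text, (l, t, r, b)) words."""
    groups, current = [], []
    for elem in line.iter():
        if local_name(elem.tag) != "charParams":
            continue
        text = elem.text or ""
        if text.strip() == "":
            # Whitespace (or an empty character) ends the current word.
            if current:
                groups.append(current)
                current = []
            continue
        box = tuple(int(elem.get(key)) for key in ("l", "t", "r", "b"))
        current.append((text, box))
    if current:
        groups.append(current)

    words = []
    for chars in groups:
        text = "".join(c for c, _ in chars)
        # The word box is the union of its character boxes.
        left = min(b[0] for _, b in chars)
        top = min(b[1] for _, b in chars)
        right = max(b[2] for _, b in chars)
        bottom = max(b[3] for _, b in chars)
        words.append((text, (left, top, right, bottom)))
    return words


def to_hocr_spans(words):
    """Render words as hOCR ocrx_word spans with a bbox title."""
    return [
        f'<span class="ocrx_word" title="bbox {l} {t} {r} {b}">{escape(text)}</span>'
        for text, (l, t, r, b) in words
    ]


if __name__ == "__main__":
    line = ET.fromstring(SAMPLE_LINE)
    for span in to_hocr_spans(words_from_line(line)):
        print(span)
```

The same grouping could probably be expressed in XSLT 2.0 with `for-each-group`, which would keep everything inside the stylesheet, but I have not tried that.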

Answer 2 · 2023-03-29T13:09:32.000Z

Closing this old ticket. In the meantime, Archive.org has made a transition from ABBYY to Tesseract.
I just rewrote everything (https://github.com/atomotic/archiviiify), and full-text search has not yet landed here. I will think of a more minimalistic solution rather than the full Solr stack.