Get ocrx_word classes for solr-ocrhighlighting in abbyy2hocr
nemobis opened this issue · 2 comments
If I understand https://literarymachin.es/archiviiify/ correctly, it would be nice to amend https://github.com/OCR-D/format-converters/blob/master/abbyy2hocr.xsl so that it extracts ocrx_word elements, right? If it's useful I could take a look at it (but I only use xsltproc).
Yes, correct. I tried the first XSLT I found on GitHub, but I haven't searched for further examples. Saxon is needed because the stylesheet is XSLT 2.0; xsltproc only supports 1.0.
The XSLT produces correct lines (ocrx_line) but not words. The FineReader XML (I had never looked at it before) annotates even single characters!
ia list -l -f "Abbyy GZ" ITEM
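As a rough illustration of what word extraction from the character-level FineReader annotation could look like, here is a minimal Python sketch (not XSLT, and not what abbyy2hocr.xsl does). It assumes each character sits in a `charParams` element with integer `l`, `t`, `r`, `b` attributes and that words are separated by whitespace characters; the function names and the sample line are made up for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample: one ABBYY-style <line> with per-character boxes.
SAMPLE = """<line xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" l="10" t="5" r="58" b="25">
  <formatting lang="EnglishUnitedStates">
    <charParams l="10" t="6" r="18" b="24">H</charParams>
    <charParams l="19" t="10" r="26" b="24">i</charParams>
    <charParams l="27" t="5" r="33" b="25"> </charParams>
    <charParams l="34" t="8" r="45" b="25">y</charParams>
    <charParams l="46" t="10" r="58" b="24">o</charParams>
  </formatting>
</line>"""


def local_name(tag):
    """Strip the XML namespace so the code works across schema versions."""
    return tag.rsplit("}", 1)[-1]


def words_from_line(line_el):
    """Group the charParams of one line into (text, (l, t, r, b)) words."""
    words = []
    chars = []  # characters of the word being built: (char, l, t, r, b)

    def flush():
        if chars:
            text = "".join(c[0] for c in chars)
            # The word box is the union of its character boxes.
            bbox = (
                min(c[1] for c in chars),
                min(c[2] for c in chars),
                max(c[3] for c in chars),
                max(c[4] for c in chars),
            )
            words.append((text, bbox))
            chars.clear()

    for el in line_el.iter():
        if local_name(el.tag) != "charParams":
            continue
        ch = el.text or ""
        if not ch.strip():
            flush()  # whitespace (or empty) character ends the current word
            continue
        chars.append((ch, int(el.get("l")), int(el.get("t")),
                      int(el.get("r")), int(el.get("b"))))
    flush()
    return words


def to_hocr(words):
    """Render words as hOCR ocrx_word spans with bbox titles."""
    return " ".join(
        "<span class='ocrx_word' title='bbox %d %d %d %d'>%s</span>"
        % (bbox + (escape(text),))
        for text, bbox in words
    )


if __name__ == "__main__":
    print(to_hocr(words_from_line(ET.fromstring(SAMPLE))))
```

The same grouping would have to be expressed in the stylesheet itself (for instance by grouping adjacent non-space characters) for abbyy2hocr.xsl to emit ocrx_word directly.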
Closing this old ticket. In the meantime, Archive.org has transitioned from ABBYY to Tesseract.
I just rewrote everything (https://github.com/atomotic/archiviiify), and full-text search has not landed there yet. I will think of a more minimalistic solution rather than the full Solr stack.