dbmdz/solr-ocrhighlighting

Remove "special characters" during byte offset calculation

jbaiter opened this issue · 1 comments

Currently, neither byte offset generator (Java or CLI) performs any post-processing on the tokens parsed from the OCR. This leads to terms like foobar: in the index (note the trailing colon), since the only tokenizer that can be used is the WhitespaceTokenizer, which, unlike the StandardTokenizer, does not trim special characters.

This can severely degrade search result quality: a query for foobar will not match the indexed term foobar:, so many occurrences of query terms will not be found.

This has been "fixed" with the inclusion of the NonAlphaTrimFilterFactory, which offers post-processing similar to the effects of using Lucene's StandardTokenizerFactory.
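
To illustrate the kind of post-processing meant here, below is a minimal sketch in plain Java. It is not the actual NonAlphaTrimFilterFactory code and does not use the Lucene API; the class and method names (NonAlphaTrimSketch, trimNonAlpha, tokenize) are made up for this example, and treating "special characters" as anything that is neither a letter nor a digit is an assumption. It also ignores the byte offsets that the real generators have to keep track of.

```java
import java.util.ArrayList;
import java.util.List;

public class NonAlphaTrimSketch {

    /**
     * Strips leading and trailing characters that are neither letters nor digits.
     * Characters inside the token are left untouched. Works on UTF-16 chars only,
     * so supplementary code points are not handled specially.
     */
    static String trimNonAlpha(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && !Character.isLetterOrDigit(token.charAt(start))) {
            start++;
        }
        while (end > start && !Character.isLetterOrDigit(token.charAt(end - 1))) {
            end--;
        }
        return token.substring(start, end);
    }

    /** Splits on whitespace, then trims each token; tokens that end up empty are dropped. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String trimmed = trimNonAlpha(raw);
            if (!trimmed.isEmpty()) {
                tokens.add(trimmed);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace splitting alone would yield "foobar:", "(baz)," and "qux!"
        System.out.println(tokenize("foobar: (baz), qux!")); // prints [foobar, baz, qux]
    }
}
```

With trimming applied on the indexing side, a query for foobar can match the term produced from foobar: in the OCR text.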