Remove "special characters" during byte offset calculation
jbaiter opened this issue · 1 comments
jbaiter commented
Currently, neither byte offset generator (Java or CLI) performs any post-processing on the tokens parsed from the OCR. This leads to terms like "foobar:" in the index (note the colon at the end), since the only tokenizer that can be used is the WhitespaceTokenizer, which, unlike the StandardTokenizer, does not trim special characters.
This can severely degrade search result quality, since many instances of query terms will not be found: a query for "foobar" does not match the indexed term "foobar:".
jbaiter commented
This has been "fixed" with the inclusion of the NonAlphaTrimFilterFactory, which offers post-processing similar to the effects of using Lucene's StandardTokenizerFactory.
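
To illustrate the kind of post-processing discussed here, the following is a minimal sketch in Python. It is not the actual NonAlphaTrimFilterFactory implementation; the function names `trim_non_alnum` and `tokens_with_byte_offsets` are hypothetical, and the assumption is that "special characters" means leading and trailing non-alphanumeric characters. It splits on whitespace, trims each token, and shifts the byte offset so that it points at the first retained character.

```python
import re


def trim_non_alnum(token):
    """Strip leading/trailing non-alphanumeric characters.

    Returns (trimmed_token, number_of_leading_characters_removed).
    Hypothetical helper, not part of any real library.
    """
    start, end = 0, len(token)
    while start < end and not token[start].isalnum():
        start += 1
    while end > start and not token[end - 1].isalnum():
        end -= 1
    return token[start:end], start


def tokens_with_byte_offsets(text, encoding="utf-8"):
    """Whitespace-tokenize `text`, trim each token, and return a list of
    (term, byte_offset) pairs. Tokens that are empty after trimming are dropped.
    """
    results = []
    for match in re.finditer(r"\S+", text):
        term, lead = trim_non_alnum(match.group())
        if not term:
            continue
        char_start = match.start() + lead
        # Convert the character position into a byte position in the encoded text.
        byte_start = len(text[:char_start].encode(encoding))
        results.append((term, byte_start))
    return results


print(tokens_with_byte_offsets("Größe, foobar: (baz)"))
# [('Größe', 0), ('foobar', 9), ('baz', 18)]
```

Note that the offset for "baz" skips the opening parenthesis, so a highlight based on it would start at the term itself rather than at the trimmed character.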