Remove all punctuation from index and query
Closed this issue · 6 comments
Hi, I asked a similar question recently, but I am struggling a bit with punctuation and looking for advice.
If I have something like this in my index
J.J. O'Hara's
I would like to match that on the query
jj oharas
Using fuzzy matching, I think this would involve allowing more edits than I would like, and it may increase the number of irrelevant results I get.
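As a rough illustration (a Python sketch, not part of the library): with punctuation left in place, each case-folded query term already needs two edits just to absorb the punctuation, which eats into any fuzzy-match budget before real typos are even considered.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Case-folded but with punctuation intact, each term needs extra edits:
print(levenshtein("jj", "j.j."))          # 2
print(levenshtein("oharas", "o'hara's"))  # 2
```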
Is there a way of just stripping out all punctuation in both the index and the query?
Is WithDefaultTokenization(options => options.SplitOnPunctuation(false))
ok for the index? Is there something similar for the query? Thanks
Hi @PetesBreenCoding - I think it's really going to depend on the source text you're indexing. If you disable splitting on all punctuation then, while it will mean that apostrophes are preserved in indexed words, you're also going to get full stops, commas, etc. all indexed as well.
I could look into the possibility of creating a new token pre-processor that's able to strip certain characters from a token, e.g. stripping ' and . would mean J.J. O'Hara's would be indexed against the tokens JJ and OHaras. This would be pretty brute-force and not context aware, i.e. all periods would be stripped, including those at the end of sentences, which could lead to some odd indexed terms if a space is omitted between sentences.
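The brute-force stripping described above can be sketched like this (a Python illustration of the idea, not LIFTI's implementation), including the caveat about a missing space between sentences:

```python
def strip_chars(token: str, chars: str = "'.") -> str:
    """Remove the given characters from a token, with no context awareness."""
    return "".join(c for c in token if c not in chars)

def tokenize(text: str) -> list[str]:
    """Split on whitespace, strip the target characters, then uppercase
    to mimic case-insensitive indexing."""
    return [strip_chars(t).upper() for t in text.split()]

print(tokenize("J.J. O'Hara's"))  # ['JJ', 'OHARAS']
# The caveat: a period with no following space merges two words into one odd term.
print(tokenize("end.Start"))      # ['ENDSTART']
```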
Is WithDefaultTokenization(options => options.SplitOnPunctuation(false)) ok for the index? Is there something similar for the query?
These tokenization rules are applied to query terms as well. This is necessary; otherwise, queries against indexes that use features like stemming and case insensitivity won't behave correctly.
Thanks @mikegoatly. I think what I'll do is strip punctuation from the values I enter into the index, and the same for the query text. The input text is all simple names/descriptions that are maintained by me, so I can watch out for possible odd indexed terms.
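That pre-processing plan can be sketched as a single normalization step applied identically on both sides (a hypothetical helper, shown in Python for illustration), so the indexed terms and the query terms always agree:

```python
import string

def normalize(text: str) -> str:
    """Strip all punctuation; apply the same function to indexed text
    and to query text so both sides produce matching terms."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(normalize("J.J. O'Hara's"))  # "JJ OHaras"
print(normalize("jj oharas"))      # "jj oharas" (unchanged)
# With a case-insensitive index, the two now match without any fuzzy edits.
```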
This is probably one of those times where users of the library will understand their source text the best, and having options like this will make life better. I'll keep this open and have a look to see how easy it would be to add in. (I suspect it won't be that hard)
So it turns out this isn't going to be possible without a breaking change in an interface somewhere along the way. The behavior of the query parser wasn't quite as I remembered: it currently follows its own rules for splitting tokens, which means we can end up with inconsistently split words. I'm starting to put together a list of things I'd like to add to the next major release, and this is one of the things I'll be adding.
Thanks, @mikegoatly. This would be a good addition for people who maintain the source data themselves and can make sure sentences are separated by spaces, etc.
V4 is now released, and you can configure the tokenizer to ignore characters.
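Conceptually (shown here in Python rather than the library's C# API), an "ignore characters" tokenizer option works like this: the configured characters are dropped from each token as it is produced, on both the indexing and the querying side.

```python
def make_tokenizer(ignore: set[str]):
    """Build a whitespace tokenizer that drops the configured characters
    from each token, analogous to an 'ignore characters' option."""
    def tokenize(text: str) -> list[str]:
        tokens = ("".join(c for c in t if c not in ignore) for t in text.split())
        return [t for t in tokens if t]  # drop tokens that become empty
    return tokenize

tokenize = make_tokenizer({"'", "."})
print(tokenize("J.J. O'Hara's"))  # ['JJ', 'OHaras']
```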
If you use the feature, let me know how you get on and feel free to raise an issue if something's not working for you.