Stop words and string comparison
Closed this issue · 2 comments
In spelunking through the code, I noticed that the StopWordFilterBase
is required to return a Lunr.ISet<string>
and that the default Lunr.Set<T>
implementation of that interface is backed by a HashSet<T>
(https://github.com/bleroy/lunr-core/blob/main/LunrCore/Set.cs#L22). However, the HashSet<T>
used by Lunr.Set<T>
does not have the option of specifying a StringComparer
.
If I'm understanding the flow correctly, that means that the StopWordFilterBase.IsStopWord(string word)
method, which calls StopWords.Contains(word)
by default, will be case-sensitive with no way to easily modify the behavior to be case-insensitive. That seems like it verges on a bug when it comes to stop words, which I think are generally expected to be case-insensitive.
In general, the requirement for StopWordFilterBase
implementations to expose a Lunr.ISet<string> StopWords
property seems like it should be an implementation detail given the actual determination of a stop word goes through the StopWordFilterBase.IsStopWord(string word)
method.
I've got two concrete suggestions:
- Add overloads to the
Lunr.Set<T>
implementation similar to theHashSet<T>
constructors that allow specifying aStringComparer
and flow that through to the backingHashSet<T>
. - Remove the
StopWordFilterBase.StopWords
property and make theStopWordFilterBase.IsStopWord(string word)
method entirely virtual.- That would also require some changes to
EnglishStopWordFilter
to store it's ownLunr.ISet<T>
, but those would be minor. An intermediate abstractStopWordSetFilterBase
could also be created to avoid changingEnglishStopWordFilter
if that's desired instead.
- That would also require some changes to
If you're amendable to either (or both) of these suggestions, I can work up a PR. I might go ahead and work up one anyway since they're fairly minor code changes and would get me familiar with the code base.
Thanks!
Good catch, thanks! Haven't looked at the details yet, but it sounds like option 1 should be easily doable and perfectly adequate.
Cool - PR incoming!