bleroy/lunr-core

Stop words and string comparison

Closed this issue · 2 comments

In spelunking through the code, I noticed that the StopWordFilterBase is required to return a Lunr.ISet<string> and that the default Lunr.Set<T> implementation of that interface is backed by a HashSet<T> (https://github.com/bleroy/lunr-core/blob/main/LunrCore/Set.cs#L22). However, the HashSet<T> used by Lunr.Set<T> does not have the option of specifying a StringComparer.

If I'm understanding the flow correctly, that means that the StopWordFilterBase.IsStopWord(string word) method, which calls StopWords.Contains(word) by default, will be case-sensitive with no way to easily modify the behavior to be case-insensitive. That seems like it verges on a bug when it comes to stop words, which I think are generally expected to be case-insensitive.

In general, the requirement for StopWordFilterBase implementations to expose a Lunr.ISet<string> StopWords property seems like it should be an implementation detail given the actual determination of a stop word goes through the StopWordFilterBase.IsStopWord(string word) method.

I've got two concrete suggestions:

  • Add overloads to the Lunr.Set<T> implementation similar to the HashSet<T> constructors that allow specifying a StringComparer and flow that through to the backing HashSet<T>.
  • Remove the StopWordFilterBase.StopWords property and make the StopWordFilterBase.IsStopWord(string word) method entirely virtual.
    • That would also require some changes to EnglishStopWordFilter to store it's own Lunr.ISet<T>, but those would be minor. An intermediate abstract StopWordSetFilterBase could also be created to avoid changing EnglishStopWordFilter if that's desired instead.

If you're amendable to either (or both) of these suggestions, I can work up a PR. I might go ahead and work up one anyway since they're fairly minor code changes and would get me familiar with the code base.

Thanks!

Good catch, thanks! Haven't looked at the details yet, but it sounds like option 1 should be easily doable and perfectly adequate.

Cool - PR incoming!