MihaiValentin/lunr-languages

Accented letter ê should be replaced by e in the french stemmer


Currently, "empêchaient" (the verb "empêcher" conjugated in the past tense) is indexed as "empêch" instead of "empech".

I'm not familiar with http://snowball.tartarus.org/ or with stemming algorithms, but according to http://snowball.tartarus.org/algorithms/french/stemmer.html this is the expected behavior.
For instance, maître will produce maîtr, not maitr. I find it odd, because most of the time French people do not type accented letters when searching (it's quicker to type without them, and most search engines replace accented letters anyway).

For reference, here's the Lucene implementation: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java

Hi, you can do this separately by doing Unicode folding, as detailed here: https://github.com/dhdaines/lunr.py/blob/fix_skip_docs/docs/languages.md#folding-to-ascii

Or by using lunr-folding.
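
For illustration, here is a minimal sketch of what accent folding could look like in plain JavaScript. It is not the code from lunr-folding or from the linked lunr.py docs. The names `foldAccents` and `foldToken` are made up for this example, and the sketch assumes a runtime that supports `String.prototype.normalize`. It only strips combining diacritical marks, so letters such as "œ" or "ß" are left unchanged.

```javascript
// Fold accented letters to their unaccented base letters.
// NFD decomposes "ê" into "e" + U+0302 (combining circumflex);
// the regex then removes every combining mark in U+0300..U+036F.
function foldAccents(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Hypothetical pipeline step (assumption: lunr 2.x, where pipeline
// functions receive a lunr.Token that exposes update(fn)).
function foldToken(token) {
  return token.update(foldAccents);
}

console.log(foldAccents('empêch')); // "empech"
console.log(foldAccents('maîtr'));  // "maitr"
```

If something like `foldToken` were added to the pipeline after the French stemmer, the same step would presumably also be needed in the search pipeline, so that a query typed with accents still matches the folded index terms.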