Accented letter ê should be replaced by e in the French stemmer
Currently, "empêchaient" (the verb "empêcher" conjugated in the past tense) is indexed as "empêch" (instead of "empech").
I'm not familiar with http://snowball.tartarus.org/ or with stemming algorithms, but according to http://snowball.tartarus.org/algorithms/french/stemmer.html this is the expected behavior.
For instance, maître will produce maîtr, not maitr. I find it odd, because most of the time French people will not type accented letters when searching (it's quicker to type, and most search engines will replace accented letters anyway).
For reference, here's the Lucene implementation: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java
Hi, you can handle this separately with Unicode folding, as detailed here: https://github.com/dhdaines/lunr.py/blob/fix_skip_docs/docs/languages.md#folding-to-ascii
Or by using lunr-folding.
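
To illustrate what folding means here, this is a minimal sketch using only Python's standard `unicodedata` module. It is not the code from the linked docs or from lunr-folding, and the helper name `strip_accents` is hypothetical. How to hook it into the indexing and search pipeline is left out; the idea is only that the same folding would be applied to both documents and queries so that "empechaient" matches "empêchaient".

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: remove diacritics from a string."""
    # NFKD decomposes "ê" into "e" followed by U+0302 (combining circumflex).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ("empêchaient", "maître"):
    print(word, "->", strip_accents(word))
# empêchaient -> empechaient
# maître -> maitre
```

Note that this only removes combining marks; letters that have no Unicode decomposition (for example ø) pass through unchanged, so a dedicated folding table may still be preferable.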