OrchardCMS/Orchard

Search / Lucene modules do not work with accented characters

Closed this issue · 17 comments

@jtkech created:
https://orchard.codeplex.com/workitem/20059

I propose other options here: https://orchard.codeplex.com/discussions/454927

But the fact that accented characters are not supported is a real issue for a French website. As a temporary solution I use a Lucene filter for special characters. Note that to use this filter I have to HtmlDecode each query term.
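For readers following along, here is a minimal sketch of the workaround described above: decode HTML entities in the query term, then strip accents. It is written in Python rather than against Lucene.NET, the function names are hypothetical, and the Unicode-decomposition approach is only an approximation of what a Lucene folding filter does (characters such as "ø" or "ß" have no decomposition and are left untouched).

```python
import html
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks after Unicode decomposition (approximate folding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_query_term(term: str) -> str:
    """Hypothetical helper: HTML-decode a term, fold accents, then lowercase."""
    return fold_accents(html.unescape(term)).lower()


if __name__ == "__main__":
    # An entity-encoded term and its plain spelling normalize to the same token.
    print(normalize_query_term("Fran&ccedil;ais"))
    print(normalize_query_term("français"))
```

The same normalization has to be applied to the indexed text, otherwise the folded query term will not match the stored tokens.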

Thanks

@sebastienros commented:

Agree. The solution is to let admin define the analyzer for each of the indexes, maybe using a setting. This could be extensible but the ones already provided by lucene should be sufficient.

As a workaround you can change it in the code manually. French people are really a pain.

@Jetski5822 commented:

One thing I would like is to push the analyser out as a provider. That way you can push in your own provider without having to override entire classes.

There is also the case of the Lucene highlighter, which might require other abstractions. I will investigate.

@Piedone commented:

As a Hungarian I can feel your pain!

hkui commented:

This issue seems to describe the same problem: http://orchard.codeplex.com/workitem/20265

@Codinlab commented:

I proposed a fix that HtmlDecodes the BodyPart text before indexing (I think it is the only part which needs that).

I also made a copy of the Lucene module with a customised French analyzer. It provides significantly better results than StandardAnalyzer on French content. The FrenchAnalyzer provided by Lucene lacks some filters.

Since working with a modified copy of a core module is not a good idea, I would like to find a clean way to use my analyser instead of the default one.

Sebastien and Jetski5822 proposed some solutions, but I don't see how to achieve this.

Is there someone who can give me some advice or who is interested in working on that?

Linked to #3105

#4094 was set to 1.9.x but it was a duplicate of this issue, so setting this to 1.9.x too.

Since recently you can implement a custom ILuceneAnalyzer provider. Having a setting to define which one to use by default should be easy.

Here is a sample implementation: Lucene.FrenchAnalyser

The StandardAnalyzer is now used by default. It filters the StandardTokenizer output with LowerCaseFilter, StandardFilter and StopFilter. The StopFilter uses a list of English stop words, so to be more generic we should use just StandardTokenizer with StandardFilter and LowerCaseFilter. In my opinion, adding an option in the dashboard to enable ASCIIFoldingFilter should solve this specific issue.
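To make the filter chain discussed above concrete, here is a rough Python sketch of a tokenizer followed by a lowercase step, an optional English stop-word step and an optional accent-folding step. None of this is Lucene code: the stop-word list is a tiny illustrative subset, the regex tokenizer and the decomposition-based folding only approximate StandardTokenizer and ASCIIFoldingFilter, and build_analyzer is a hypothetical name.

```python
import re
import unicodedata
from typing import Callable, List

# Illustrative subset only; not Lucene's actual English stop-word set.
ENGLISH_STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def fold_accents(token: str) -> str:
    """Approximate accent folding via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_analyzer(use_stop_words: bool, fold_ascii: bool) -> Callable[[str], List[str]]:
    """Return a function that turns text into a list of index tokens."""

    def analyze(text: str) -> List[str]:
        # Tokenize on runs of word characters, then lowercase.
        tokens = [t.lower() for t in re.findall(r"\w+", text)]
        if use_stop_words:
            tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
        if fold_ascii:
            tokens = [fold_accents(t) for t in tokens]
        return tokens

    return analyze


if __name__ == "__main__":
    generic = build_analyzer(use_stop_words=False, fold_ascii=True)
    print(generic("Le Café à Paris"))
```

With stop words disabled and folding enabled, "café" and "cafe" produce the same token, which is the behaviour the dashboard option would turn on.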

In the future, @sebastienros's suggestion would be the best: "The solution is to let admin define the analyzer for each of the indexes, maybe using a setting. This could be extensible but the ones already provided by lucene should be sufficient." That way, the admin could easily set up, for example, language-specific indexes with stop words and stemming.

Test whether @TFleury's module allows searching for accented characters.

@sebastienros @Piedone
So we need a set of indexName (textbox) -> ILuceneAnalyzerSelector.Name (dropdown) mappings.
What about adding a new Lucene settings menu with a LuceneSettingsPart which would store this set of mappings in a serialized form?
Then the ILuceneAnalyzerProvider implementation would filter the list of ILuceneAnalyzerSelectors by the given name.
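A language-neutral sketch of that proposal, in Python for brevity: the index-name-to-analyzer-name mappings are stored as a serialized string, and the provider looks up the analyzer by name with a fallback to the default. The registry contents, the JSON format and the function names are assumptions for illustration, not Orchard's actual types.

```python
import json
from typing import Callable, Dict, List

Analyzer = Callable[[str], List[str]]

# Hypothetical registry standing in for the available analyzer selectors.
ANALYZERS: Dict[str, Analyzer] = {
    "Default": lambda text: text.lower().split(),
    "Whitespace": lambda text: text.split(),
}


def load_mappings(serialized: str) -> Dict[str, str]:
    """Deserialize the stored index name -> analyzer name mappings."""
    return json.loads(serialized) if serialized else {}


def select_analyzer(index_name: str, mappings: Dict[str, str],
                    fallback: str = "Default") -> Analyzer:
    """Pick the analyzer mapped to the index, falling back to the default."""
    name = mappings.get(index_name, fallback)
    return ANALYZERS.get(name, ANALYZERS[fallback])


if __name__ == "__main__":
    mappings = load_mappings('{"Search": "Whitespace"}')
    print(select_analyzer("Search", mappings)("Été Chaud"))
    print(select_analyzer("Other", mappings)("Été Chaud"))
```

Falling back to the default for unmapped indexes or unknown analyzer names keeps existing indexes working when no setting has been saved yet.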

I think a site setting would be good for that (just list the analyzer names in the dropdown, this is a technical config so I think we can live without anything more user-friendly).

@sebastienros what do you think?

Your design sounds right.
Ensure that we are also correctly using the same analyzer when parsing the searched text (it has to match the indexing one).
We need this setting to be on a Search Index (not a global search setting) so each index can have different analyzers.
And display a message when the analyzer is changed to inform the user to rebuild the index.
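The requirements above can be illustrated with a toy in-memory index (Python, not Lucene.NET; all names are hypothetical). Indexing and searching go through the same per-index analyzer, and changing the analyzer returns a message asking for a rebuild.

```python
import unicodedata
from typing import Callable, Dict, List, Optional, Set


def plain(text: str) -> List[str]:
    return text.lower().split()


def folding(text: str) -> List[str]:
    # Lowercase, then strip combining marks after decomposition.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c)).split()


ANALYZERS: Dict[str, Callable[[str], List[str]]] = {"plain": plain, "folding": folding}


class TinyIndex:
    """Toy inverted index whose analyzer is a per-index setting."""

    def __init__(self, analyzer_name: str = "plain") -> None:
        self.analyzer_name = analyzer_name
        self.postings: Dict[str, Set[int]] = {}

    def add(self, doc_id: int, text: str) -> None:
        for token in ANALYZERS[self.analyzer_name](text):
            self.postings.setdefault(token, set()).add(doc_id)

    def search(self, query: str) -> Set[int]:
        # The query is parsed with the same analyzer used at indexing time.
        tokens = ANALYZERS[self.analyzer_name](query)
        if not tokens:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in tokens))

    def set_analyzer(self, name: str) -> Optional[str]:
        """Change the analyzer; return a warning if a rebuild is needed."""
        if name == self.analyzer_name:
            return None
        self.analyzer_name = name
        return "Analyzer changed: rebuild the index for the change to take effect."


if __name__ == "__main__":
    index = TinyIndex("folding")
    index.add(1, "Un été chaud")
    print(index.search("ete"), index.set_analyzer("plain"))
```

After switching to the plain analyzer without rebuilding, a search for "été" finds nothing because the stored tokens were folded, which is why the rebuild message matters.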

Demo on Tuesday

You missed an exclamation mark, a question mark or a ;-))) smiley from the end! Can do on the Tuesday after that when I won't have other obligations during that time.