elastic/elasticsearch

Merging the terms from multiple sub-analyzers

ofavre opened this issue · 24 comments

Multi-field is great, but searching with multiple analyzers against a single field is simpler and better.
If you have a multi-lingual index where each document carries its source language, you can analyze the text fields using a specialized analyzer based on the detected language (maybe even using the _analyzer.path functionality).
But what happens when the language was somehow misdetected, either at index- or at query-time? Some aggressive stemming can have devastating effects.

In such a scenario, having the original words indexed in parallel to the stemmed ones would help. Having them in the same field would even let phrase/slop queries work properly.
The only way to get multiple terms at the same position in ElasticSearch is through the synonym token filter, which is useless for stemming.
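
For reference, here is a minimal sketch of that synonym trick (the synonym list is made up): the filter does emit extra terms at the same position, but every variant has to be enumerated by hand, which is exactly why it cannot replace a stemmer.

# Hypothetical settings: hand-listed "stems" injected as synonyms at the same position
curl -XPUT 'localhost:9200/yakaz' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "hand_made_stems": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      },
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["example, exampl"]
        }
      }
    }
  }
}'
# "An example" now yields "an", "example" and "exampl" at the same position,
# but only because we listed that pair explicitly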

I've been working on a way to merge the terms that multiple analyzers output.
Say you want to use both the simple analyzer and any of the specialized language-specific analyzers, or any other combination.
My plugin can make it as simple as the following index setting:

index:
  analysis:
    analyzer:
      # An analyzer using both the "simple" analyzer and the sophisticated "english" analyzer, combining the resulting terms
      combo_en:
        type: combo
        sub_analyzers: [simple, english]

Here is a simple example of what it does:

# What the "simple" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=simple' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  } ]
}

# What the "english" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=english' -d 'An example'
{
  "tokens" : [ {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

# Now what our combined analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=combo_en' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

Terms are sorted by position, then by start/end offset, so that the output is easier to consume under the reasonable assumptions one would make of a classical analyzer.

Here is the good news! You can find my implementation here: https://github.com/ofavre/elasticsearch/tree/combo-analyzer-v0.16.4 (based on released ElasticSearch version 0.16.4).

EDIT: It is finally available as a plugin, thanks to jprante: https://github.com/yakaz/elasticsearch-analysis-combo.

Consider this issue a pre-pull-request.

The current implementation should be independent of the sub-analyzers used.
However, I used some tricks to clone a Reader in optimized ways. I think part of those hacks should be integrated into Lucene's core, with the combo analyzer being a contrib and ElasticSearch providing a wrapper.

Here is the patch I proposed to the Lucene community:
https://issues.apache.org/jira/browse/LUCENE-3392

We'll see how it goes.

This is the solution to a multilingual "_all" field. Can't wait for it.

This seems nice.

Is the analyzer field/path available with your combo plugin?

It is available as a standalone plugin now, see: https://github.com/yakaz/elasticsearch-analysis-combo.

@slorber: Of course! The analyzer used for a field can be controlled through the analyzer field; the named analyzer is then called and fed the field's content, so any analyzer can be used with this feature.
And this technique is especially useful when you have a language field and want to combine a language-dependent analyzer (english, spanish, french, etc.) with a language-agnostic analyzer (simple, whitespace, etc.), just in case the language was misdetected in the first place.
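
For instance, here is a hedged sketch of that setup using the old _analyzer mapping (the type and field names are made up): the analyzer to apply is read from a field of each document, which can name a per-language combo analyzer such as combo_en.

# Hypothetical mapping: each document names its own analyzer in "lang_analyzer"
curl -XPUT 'localhost:9200/yakaz/doc/_mapping' -d '{
  "doc": {
    "_analyzer": { "path": "lang_analyzer" },
    "properties": {
      "lang_analyzer": { "type": "string", "index": "not_analyzed" },
      "body":          { "type": "string" }
    }
  }
}'
# A document indexed as {"lang_analyzer": "combo_en", "body": "..."} gets combo_en analysis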

Hello,

Thanks, yes, it's obvious it can be used with the _analyzer field since your combo is... an analyzer... So I guess I just need to create a combo analyzer for each language instead of the classic "one analyzer per language".

Btw, I've had fairly decent results using multi-fields, but I think it's a pain, and you noticed that too.
Perhaps you can tell me how it works with multi-fields? It's not obvious how store and highlighting work on a multi-field.

Here's my mapping:
https://gist.github.com/3053540
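
Roughly, that kind of mapping looks like the following hedged sketch (field names and analyzers are made up, the actual gist may differ), using the multi_field type of that era:

# Hypothetical multi-field mapping: one source value indexed three ways
curl -XPUT 'localhost:9200/yakaz/doc/_mapping' -d '{
  "doc": {
    "properties": {
      "title": {
        "type": "multi_field",
        "fields": {
          "title":   { "type": "string", "analyzer": "standard",  "store": "yes" },
          "stemmed": { "type": "string", "analyzer": "english",   "store": "yes" },
          "ngrams":  { "type": "string", "analyzer": "my_ngrams", "store": "yes" }
        }
      }
    }
  }
}'
# Searching and highlighting must then target title, title.stemmed and title.ngrams separately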

The pain is:

  • I need to use a boolean/text search on these 3 fields
  • I need store=yes for all of them or I can't get any highlights
  • I need to request highlights for all 3 fields or I only get highlights when the match came from that field's analyzer
  • My highlight map now has 3 fields and I must select/merge the most appropriate one (exact match > stemming > ngrams for me)

Did you also notice that?
When using store=yes for all sub-fields, are they stored as duplicates in ES?

And how does your combo analyzer solve these problems?

  • I will have only 1 field, so only one store=true -> nice
  • But what will be the behavior of highlighting?
  • If 2 analyzers produce the same tokens, do they consume twice the token space in my index, or are they merged?
  • How will search behave? What kind of analysis will be performed on the search text for that field before trying to find matches?

And the most important:

  • Would you use this in production?
  • How "hacky" is your solution, and is an elegant integration with Lucene/ES planned?

Storing a field means storing the original content. This content is then available for display (highlighting). It doesn't have much to do with the combo analyzer.

Yes, if tokens get repeated by the combo analyzer they take more space, but only in the postings (document references, positions, frequencies for scoring, and the like), not in the dictionary (the index is inverted!), so this is negligible.

During a Lucene search, the field's analyzer transforms the query words into tokens used to match documents in the index. It is always recommended to use the same analyzer for indexing and for search; otherwise your search results become unpredictable. This also holds for the combo analyzer, though the situation is more relaxed: you will mostly get results even if you just use one of the sub-analyzers on the combo-analyzed field.
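
That relaxed setup can also be made explicit in the mapping; here is a hedged sketch using the old index_analyzer/search_analyzer parameters (the type and field names are made up):

# Hypothetical mapping: merge terms at index time, query with a single sub-analyzer
curl -XPUT 'localhost:9200/yakaz/doc/_mapping' -d '{
  "doc": {
    "properties": {
      "body": {
        "type": "string",
        "index_analyzer": "combo_en",
        "search_analyzer": "english"
      }
    }
  }
}'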

If you'd like to follow up, I would recommend asking questions on the Elasticsearch mailing list, because not everybody is able to monitor the GitHub issue tracker for interesting discussions. More info: https://groups.google.com/group/elasticsearch

Thanks.

By chance do you know if it's possible to embed your plugin in unit tests?

Plugins can be tested, sure, with testng/surefire/junit... the jar and the deps must be on the classpath.

Thanks, didn't know it was so easy, I thought we would have to deal with the plugin path property or something...

So, was this closed because it is never going to be implemented in elastic?
Or because it's solved via the plugin?

The proposed patch was never integrated into Lucene.
The feature has been implemented as a plugin instead; get it here: https://github.com/yakaz/elasticsearch-analysis-combo

@nickminutello the reason we never implemented it was that we think it is a bad idea to mix analysis chains like this.

@nickminutello note that we have been using the plugin in production since 2012, and it has worked well up to now.

The combo analyzer has also been in production here since 2012, and we could not live without it.

At least Elasticsearch uses the KeywordRepeatFilter #2753,
which is a kind of lightweight version of the combo analyzer, since it handles the combination of stemmed/unstemmed tokens. So combining token streams is not a bad idea.
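
A hedged sketch of that lightweight approach (the analyzer and filter names are made up): keyword_repeat emits every token twice, the stemmer leaves the keyword-marked copy untouched, and the unique filter drops the duplicates the stemmer did not change.

# Hypothetical settings combining keyword_repeat, a stemmer and the unique filter
curl -XPUT 'localhost:9200/yakaz' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "stemmed_and_unstemmed": {
          "tokenizer": "standard",
          "filter": ["lowercase", "keyword_repeat", "porter_stem", "unique_same_pos"]
        }
      },
      "filter": {
        "unique_same_pos": {
          "type": "unique",
          "only_on_same_position": true
        }
      }
    }
  }
}'
# "An example" yields "an", "example" and "exampl": the unchanged duplicate of "an" is dropped,
# while "example" and "exampl" are both kept at the same position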

I see the points, but there are workarounds:

  • boosting fields in the index is not the only boosting available; there is also query term boosting/weighting, and document boosting by function score
  • if tokens appear more than once in a field, they can be radically filtered out by the unique filter (why only_on_same_position? phrase search is no longer reliable anyway once token streams are mixed)
  • skewed IDF is also a challenge when using multiple fields instead of just one; the effect is small for short text input and with Okapi BM25, which has some tunables

So, since strategies exist to work around the effects, mixing tokens from multiple analyzers is still a good idea, especially for multi-language search. Many applications here use it with success.

@jprante what are you doing now in the 5.x versions, since that original yakaz plugin was never updated?

@apatrida in the meantime I was able to reorganize my simple use case into a more complex token filter chain, and I dropped multi-language analysis support in favor of ICU case folding, which is not a full substitute though.

After the language_to feature jprante/elasticsearch-langdetect#49 (comment), I plan to extend my langdetect plugin with a new query, similar to simple_query_string, which tries to detect the language of a query and set the appropriate language field before the query is executed on the cluster.

But if analyzer chaining is still the only possible method for some use cases, I may find time to try to implement such an über-analyzer for ES 5.x.

@jprante I'm in the same place now, filter chains, but I do run into issues like someone mentioned on one of your projects, where you might want to protect a keyword from the next link in the chain and yet want the rest of the chain to process that token (really, adding exception lists to some of the plugins, like the decompounder, would solve this). I'll hop over to your langdetect repo to see where you are headed and whether I can help out anywhere. Thanks.

Lucene has a KeywordAttribute that can be set by KeywordMarkerFilter and is respected by stemmers etc., for exactly that reason: to prevent certain terms from being modified by the next link in the chain. Maybe that is useful and already available.
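
In ElasticSearch this is exposed as the keyword_marker token filter; here is a hedged sketch (the keyword list is made up) that shields the listed terms from the stemmer while the rest of the chain still processes them:

# Hypothetical settings: protect selected terms from stemming
curl -XPUT 'localhost:9200/yakaz' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "protected_english": {
          "tokenizer": "standard",
          "filter": ["lowercase", "protect_terms", "porter_stem"]
        }
      },
      "filter": {
        "protect_terms": {
          "type": "keyword_marker",
          "keywords": ["elasticsearch", "yakaz"]
        }
      }
    }
  }
}'
# "yakaz" passes through porter_stem unchanged; other tokens are still stemmed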

@s1monw but that blocks all future items in the chain from processing it, not just the next link in the chain, right? The issue I was referring to would be better solved with an exclude list in his decompounder, because the rest of the chain needs to process the token, just not the decompounder.

The way token filters work is that you can chain them, so you can also add one that resets keyword attributes. I think stuff like this should be addressed in a pluggable fashion, otherwise you just end up with legacy issues. Also, it seems unrelated to ES, so I wonder if you want to discuss this on the repo where that langdetect plugin is maintained?

@s1monw sure, I was writing here to get alternatives written down that one might use instead of what was originally presented (sub-analyzers), then rejected, in this issue. Google leads here, and now this thread offers some alternatives from some of those who originally backed that idea.