Autocomplete suggestion no longer removes duplicate entries as in ES 2.3 !!
seme1 opened this issue · 37 comments
I relied on the autocomplete suggester in 2.3 to remove duplicate entries and provide unique suggestions of words/phrases. After upgrading to ES 5, I realized that the suggester is now document-oriented. This changes the logic of how it can be used with already-built systems. Also, it seems to me that a feature was removed from ES 2.3. In other words, this is not the upgrade I was expecting !!
It also seems that I'm not the only one suffering from this !!
#21676
http://stackoverflow.com/questions/41744712/word-oriented-completion-suggester-elasticsearch-5-x
The "Duplicate filtering" feature outlined in the following article is no more in ES 5
http://rea.tech/implementing-autosuggest-in-elasticsearch/
I'll stick with ES 2.3 till a resolution for this missing feature is found.
@seme1 this change was a requested feature. The design of the new completion suggester is significantly different to what existed before - a suggestion maps to a single document now. This isn't going to change.
I'm afraid your only option is to remain on 2.3 until you've had time to rework your application.
How to "rework my application" ? I am relying on elasticsearch to give users suggestions on what terms/phrases to type when they search. I don't want the auto complete suggestion to give results pointing to individual documents, but rather to the most common/important terms/phrases.
I don't want the auto complete suggestion to give results pointing to individual documents, but rather to the most common/important terms/phrases.
This was never the intention of the completion suggester - it has never favoured the most common words etc. It is a prefix suggester: it looks at the prefix you've typed in and finds completions from a finite list of strings with exactly that prefix.
If you want to use the suggester to suggest search terms which may be duplicated across documents, then you need to index those search terms in a separate index and handle the deduplication yourself.
Sorry, this is still a disaster! Is there no way to add deduplication back? This prevents some customers (and myself) from upgrading to Elasticsearch 5. Collecting all suggest items (deduplicated) into a separate index is not going to work, as it again prevents easy updates.
If anybody is interested: I wrote a plugin with a hack that restores the old Completion Suggester. The trick was to add a new field type "legacy_completion" that behaves identically to the 2.x version of this suggester. It is a bit of a hack, but it works. I will post a GIST soon to show what it does - it is included in my own plugin, so I have to extract it first.
Nevertheless: As all this works fine, why not keep the old completion suggester alive for people that are not interested in "document" suggestions, but want Google-like term autocompletion, but still want to have all in one index. The old suggester is perfect for that and the fact that deletions are not taken into account is not a problem at all (for this use case). In addition the deduplication works fine and is fast enough! Payloads are not required for this use case.
I'd suggest to add a field type like my plugin "legacy_completion", but without hacks. I'd suggest to remove payloads for this use-case, too.
Indeed the new suggester (called the document suggester in Lucene) is document based and does not have any ability to remove dups today.
There was some discussion early on about duplicates: #22912 (comment) but I don't think it led to any duplicate removal being added.
@areek can you confirm?
I suppose we (or users) could add a while loop to query for suggestions, and then iterate (with a larger N) if too many of the top N on the first try were duplicates.
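The retry loop suggested above can be sketched on the client side as follows. This is only an illustration: `fetch` is a hypothetical callable standing in for whatever code runs the completion query and returns the top-n suggestion texts in rank order, and `max_size` is an assumed safety cap, not an Elasticsearch setting.

```python
from typing import Callable, List


def unique_suggestions(
    fetch: Callable[[int], List[str]],
    wanted: int,
    max_size: int = 1000,
) -> List[str]:
    """Return up to `wanted` distinct suggestion texts, keeping rank order.

    `fetch(n)` is a hypothetical callable that returns the top-n suggestion
    texts from the suggester; the list may contain duplicates.
    """
    size = wanted
    while True:
        hits = fetch(size)
        # dict.fromkeys keeps the first occurrence of each text, in order.
        unique = list(dict.fromkeys(hits))
        # Fewer hits than requested means the suggester has nothing more.
        exhausted = len(hits) < size
        if len(unique) >= wanted or exhausted or size >= max_size:
            return unique[:wanted]
        # Too many duplicates in the top N: ask again with a larger N.
        size = min(size * 2, max_size)
```

With many duplicates per term this issues several round trips, which is the cost discussed further down in the thread.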
I'd be happy to keep the old suggester available with own field type, like my plugin is doing. I will post the plugin code tomorrow. It is quite simple, but a hack that breaks easily if internals change or the old suggester code is completely removed in ES 6. That is what I would like to prevent.
I think we can also add duplicate handling to this suggester as an option in Lucene? I'll open a Lucene issue to explore it; I think it may not be so difficult.
I think we can also add duplicate handling to this suggester as an option in Lucene? I'll open a Lucene issue to explore it; I think it may not be so difficult.
OK, I opened https://issues.apache.org/jira/browse/LUCENE-7686 to add optional deduplication to the document suggester, and iterated to a working patch I think.
Hi Mike,
thanks for opening the Lucene issue! It looks good to me. But nevertheless a Suggester that is document based is not always the best idea. Your patch on the Lucene issue already mentions it: If you have lots of duplicates, this slows down and the idea behind a suggester is broken (as it's slow, possibly horribly slow). Let me explain the 2 different types of Suggester "use cases":
- The new suggester is document based; that is fine if you really want to suggest documents (and also want to filter deleted documents). Basically this suggester just executes a query and returns TopDocs (not really TopDocs, because it uses another scoring, but basically it is the same). This type of suggester works fine if your index for suggestions is unique, e.g., the document title (that should be almost unique), and you suggest those in the drop down. If users click on those items, they are taken directly to the document. One example of this is the search engine on Elastic's home page. The risk of duplicates is small there.
- The old suggester was just "dictionary based" (a variant of the term dictionary that has payloads and some weights). This suggester does not suggest documents, it just suggests terms/phrases you could enter into the search field and execute. The same as Google is doing - and this is what I would need (and the user who opened the issue). Generally you can do the same with the new suggester, but you must take care of using a separate, deduplicated index and index the suggester phrases from there. But this makes maintenance hard! If you have structured data and you know that some fields in your documents are useful as auto-suggestions (e.g., names of authors, journal names,... anything that could also work as a facet), then you are done already. This helps users to enter such terms (which are real suggestions; documents are no suggestions, they are already results of a search). The problem with stuff like author names is that they may appear in thousands of documents. If you execute the search afterwards you get thousands of documents. This is how you would use a suggester like you see it on Google: just present phrases to search for which may return many documents. For this use case a document-based suggester like the new one here is not scalable. To make it work as it should, users would need to create a second index, then execute an aggregation on the primary index on a field used for suggestions and migrate the buckets of the terms aggregation as suggestions. The downside: This makes adoption hard, as you have to run this at regular intervals and reindex the suggester index. For users with a high frequency of coming/going documents this is not going to work. The old completion suggester had the deduplication "automatically", because it just worked on the in-memory terms dictionary, but never iterated over documents. The downside: deleted stuff does not disappear. But for this type of suggestion this does not necessarily need to happen in a timely manner.
If you get a suggestion which is no longer in the index as the "whole phrase", it is still likely to return results (for multi-word suggestions). So this was never a problem for the old suggester. I never had any problems with deletions not showing!
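To make the "second index" workaround described above concrete, here is a minimal sketch of its two halves: the terms aggregation request against the primary index, and the conversion of its buckets into one suggestion document per term. The aggregation name `suggest_terms`, the field name `suggest`, and the choice of `doc_count` as weight are assumptions for illustration; sending the request and indexing the result are left out.

```python
from typing import Any, Dict, List


def terms_agg_request(field: str, size: int = 1000) -> Dict[str, Any]:
    """Search body that returns no hits, only the most frequent terms."""
    return {
        "size": 0,
        "aggs": {"suggest_terms": {"terms": {"field": field, "size": size}}},
    }


def buckets_to_suggest_docs(
    response: Dict[str, Any], suggest_field: str = "suggest"
) -> List[Dict[str, Any]]:
    """Turn terms-aggregation buckets into one document per unique term.

    The term doubles as the document id, so re-running the job overwrites
    existing entries instead of creating duplicates.
    """
    buckets = response["aggregations"]["suggest_terms"]["buckets"]
    return [
        {
            "id": bucket["key"],
            "source": {
                # Assumption: document frequency is used as the weight.
                suggest_field: {
                    "input": bucket["key"],
                    "weight": bucket["doc_count"],
                }
            },
        }
        for bucket in buckets
    ]
```

Terms that have disappeared from the primary index are not removed by this step; that still requires rebuilding the suggestion index or deleting stale ids, which is the maintenance burden described above.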
My suggestion would be to use the new suggester's dictionary, but just allow it to be run without returning document suggestions: Just find the terms/phrases from the dictionary that match and return them as suggestions. The score could be document frequency. That would help both worlds. The alternative would be to keep the old suggester alive as an alternative (as said before). To me this would work like the still missing Solr "terms component" in ES.
@uschindler
Being able to delete suggestion entries with near-real time effect on the suggester is important for my application. In fact, it was the reason why I tried to upgrade to version 5.1. Of course, I had to abandon the upgrade after I discovered the missing de-duplication (the issue discussed here).
The way I have things work now to deal with the deletions is that every night I simply delete the suggestion index, recreate it and then re-index all the documents.
Thanks @uschindler for a nice summary of suggester use cases and pros/cons of the document based vs term based suggesters. Confusingly, there is also AnalyzingInfixSuggester (not yet exposed in ES) which is essentially another document based suggester.
I like your idea to use the new suggester's dictionary (its FST): this can make dedup w/ the new suggester very low cost, because the FST has effectively already dedup'd. I'll try to rework my patch to do this ... then we don't need the deduplicating collector.
Thanks @uschindler for the explanation. I have exactly the same problem. In my case the suggestion can be anything one could actually search for (tags, keywords, names, titles, cities, ...). Each document has those fields, so it's normal that there will be some duplicates. The old suggester was really helpful deduplicating the results for easy access.
Now having duplicates I am aggregating them myself, but this only works because I don't have that much data. As the data increases it would take too long to give suggestions.
OK https://issues.apache.org/jira/browse/LUCENE-7686 is now fixed for Lucene 6.5.0; once we upgrade then we can expose the option in ES.
Thanks Mike, looks great. We just need some DSL changes to allow to pass "dedup" to suggester. I am still thinking about a good solution for previous "outputs" (if you have multiple suggestions per document, it is not easy to correlate the correct "output" with the suggestion). But with some JSON tricks this might be easy to solve.
My suggestion would be to use the new suggester's dictionary, but just allow it to be run without returning document suggestions: Just find the terms/phrases from the dictionary that match and return them as suggestions. The score could be document frequency. That would help both worlds. The alternative would be to keep the old suggester alive as an alternative (as said before). To me this would work like the still missing Solr "terms component" in ES.
What you describe here would better fit in the phrase suggester, which works with an n-gram model and scores suggestions based on frequency and co-occurrence. The completion suggester is agnostic to frequency and is solely based on the provided weight to rank the suggestions. Not providing a weight and relying on duplication to find the best matches is not in the scope of this suggester. The de-duplication is helpful not only to please the suggester but also to generate the weight associated with your suggestion. For instance, a web search engine could normalize and de-duplicate the query logs to create a suggester where the weight could be the frequency of the query or a complex formula.
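As a rough sketch of the query-log idea, the snippet below normalizes raw queries, de-duplicates them, and uses the frequency as the weight. The normalization (lowercasing and collapsing whitespace) and the function name are assumptions chosen for illustration; a real system would likely use a more elaborate formula.

```python
from collections import Counter
from typing import Dict, Iterable, List, Union


def weighted_suggestions(
    query_log: Iterable[str],
) -> List[Dict[str, Union[str, int]]]:
    """De-duplicate a query log into completion inputs weighted by frequency."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = (" ".join(query.lower().split()) for query in query_log)
    counts = Counter(query for query in normalized if query)
    # One entry per distinct query; the weight is how often it was issued.
    return [
        {"input": query, "weight": count}
        for query, count in counts.most_common()
    ]
```

Each returned entry has the shape of a completion field value (`input` plus an integer `weight`), so it could be indexed as one document per suggestion.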
My point here is that it's dangerous to use this suggester as "a direct field suggester for my documents".
It's great if duplicates can be removed, but if you have a lot of them then maybe this suggester is not a good fit. For instance, if you index books and want to suggest author names, I don't think you should use the completion suggester in the same index even if the FST can de-duplicate. For this use case I think you're right: we don't have a good answer in ES unless you're able to do the de-duplication/ranking on your side. The phrase suggester might be one but autocomplete is not implemented (only "did you mean" behavior).
For this use case I think you're right we don't have a good answer in ES unless you're able to do the de-duplication/ranking on your side. The phrase suggester might be one but autocomplete is not implemented (only did you mean behavior).
That's exactly my problem! And the example to suggest author names of books is the exact use-case I (and many others) seem to have.
There is another related issue I'm facing. Is it possible to save more than one suggestion and display only one based on a certain attribute I pass to the suggestion engine ?
For example, the author names may be written differently in different languages. I can have the author names saved in different languages within the suggester engine. Based on the current app interface language, I want the suggested author name to match that of the user display language.
For example, the author names may be written differently in different languages. I can have the author names saved in different languages within the suggester engine. Based on the current app interface language, I want the suggested author name to match that of the user display language.
I have a similar problem. The old suggester allowed you to attach the "output" to the suggest input term. With the new one, you can no longer correlate the "input" term with the "output" in the form of the "_source" document (you would need something like a "highlighter" to do this!). Of course, you can use different suggest fields in parallel (one for each language), but it's then still hard to do the right suggestions if you do "term normalization", because you only get documents back. In my case, the autocompleter also suggests terms based on abbreviations (e.g., the user starts to type in an abbreviation like "Pb", the suggester autocompletes "Lead" -- maybe a bad example, but it should just explain). With the outputs on the previous completion suggester this was possible, but no longer with the new one. It is impossible to guess from the autocompleted term which output from "_source" you would choose. Especially if you have many suggested terms (authors example) per document.
IMHO, the whole thing should be solved like mentioned by @jimczi : Use the phrase suggester but allow to use it as an "autocompleter" instead of "did you mean". As a side-effect the old completion suggester offered that, including the "weights" (because it defaulted to Suggest-TF if no weights were given).
I would really like to keep the old completion suggester with a different field type (which is possible, otherwise ES could not read old indexes). And my plugin verifies it: I can still use the old suggester with a new field type "legacy_completion". HACK ALARM: I think I should publish it for download!
@uschindler I'm relying on the AWS Elasticsearch service. So, implementing the hack is not currently an option. Besides, I have abandoned the upgrade to ES 5.1, and I'm still using ES 2.3.
@jimczi
However, it would be nice if the comments mentioned here are considered in future versions of ES.
@TheFireCookie: I was just waiting for somebody to ask! I will post it on my GitHub account in the next few days. I just have to add license headers and extract it from my "Eierlegendewollmilchsau"-Plugin.
Uwe
Hi @TheFireCookie,
here is the plugin source code and readme file. It is set up for Elasticsearch 5.3.0, but it is easy to compile with any 5.x version. Just change the version number in the POM.
+1 Would love to see this fixed in an upcoming release.
Hi, no updates on this issue? Is it something that will eventually get addressed? Is there a more current discussion on this topic?
Thank you!
Maybe I'm missing something but this completely breaks (my) idea of the completion suggester.
Very common scenario; store an array of 'tags' on a document and use the completion suggester to provide autocomplete when entering/searching tags on a UI.
If a single document has multiple tags that are similar and match the query, only one result will be returned.
Example:
document: { tags: ['rick', 'roll'] }
suggest query term: 'r'
output options: ['rick']
expected options: ['rick', 'roll']
If I'm missing something, please let me know.
Any updates from the ES team on this issue ??
I'm unable to upgrade to any version above 5.x due to this issue. I'm also suffering from another issue with ES 2.3 (I reported it as a bug in ES 2.3, #26358, but it was closed by the ES team because it was fixed in ES 5.x). I feel really stuck now.
Any news? The new Lucene version mentioned above is already merged.
Are there any other requirements to solve this issue? It would be great if there were any updates.
The Lucene option mentioned above will be available in 6.1:
https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-completion.html#skip_duplicates
This doesn't change the design of the completion suggester which will remain document based.
So it is still recommended to maintain an index for the completion suggester which has one document per suggestion but this option can be used to remove duplicates that remain at query time.
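For reference, a request using this option might be built as in the sketch below. The suggestion name `term_suggest` and the field name `suggest` are placeholders chosen here; `skip_duplicates` is the option from the linked documentation and is only available from 6.1 on.

```python
from typing import Any, Dict, List


def completion_request(
    prefix: str,
    field: str = "suggest",
    size: int = 5,
    skip_duplicates: bool = True,
) -> Dict[str, Any]:
    """Build a completion suggester search body with duplicate filtering."""
    return {
        "suggest": {
            "term_suggest": {  # placeholder name for this suggestion
                "prefix": prefix,
                "completion": {
                    "field": field,
                    "size": size,
                    # Filters out suggestions with identical text that
                    # come from different documents.
                    "skip_duplicates": skip_duplicates,
                },
            }
        }
    }


def suggestion_texts(response: Dict[str, Any]) -> List[str]:
    """Extract the suggested texts from a suggest response."""
    entries = response["suggest"]["term_suggest"]
    return [option["text"] for entry in entries for option in entry["options"]]
```

As noted above, the suggester stays document based, so this only removes duplicates at query time and can get slow when a term is repeated across many documents.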
Nevertheless: My initial request in this issue was the following: Let the "new" completion suggester live as it is and keep it document based.
My proposal was to add "the old behaviour" available as a separate field type. My plugin located at https://github.com/uschindler/es-legacy-completion-plugin is exactly doing this. It allows to define a field with type "legacy-completion" and this is then indexed using the old codec. But it relies on the codec available inside Elasticsearch (the plugin is just a "hack" to make the old index format accessible to users, the query side is working automatically, because the existence of the codec automatically triggers the old suggester code). But to have it as 2 separate field types with 2 separate handlers would be way better.
Of course if the old codec goes away in Elasticsearch 6, this is fatal for the plugin. Are there any plans to just keep the old legacy completion codec available? E.g., I am unable to migrate to Elasticsearch 6 (I have not yet tried). Based on the usage/download statistics on the plugin, a lot of people use it now in their ES 5.x installations. It was already forked and adapted to several ES versions from the 5.x series.
So my only wish is: Keep the old codec available, so indexes using it can still be used and created. To allow this, add another field type to explicitly use the "legacy suggester". The current version of the legacy completion is perfectly fine for many use cases where you cannot use a separate index, e.g. if you are tagging your documents and want to deliver the tags as autocompletion. E.g., I don't care about deleted documents - and many others do not, either.
So my only wish is: Keep the old codec available, so indexes using it can still be used and created. To allow this, add another field type to explicitly use the "legacy suggester". The current version of the legacy completion is perfectly fine for many use cases where you cannot use a separate index, e.g. if you are tagging your documents and want to deliver the tags as autocompletion. E.g., I don't care about deleted documents - and many others do not, either:
I don't understand why you could not use a separate index. Just create a tags index, set the _id of the document to be the tag name and create one document per tag with a completion field? Why is this so complicated? In fact I am not even sure you need a completion field for that if your index is separated, just use the prefix and fuzzy rewrite options on a keyword field to match your needs.
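A minimal sketch of that separate tags index, assuming the source documents carry a `tags` array and that the tag is stored in a keyword field called `tag` (both names are illustrative). It produces the newline-delimited lines for a bulk request, using the tag itself as `_id` so repeated tags collapse into one document, plus a plain prefix query for the autocomplete lookup. Depending on the Elasticsearch version, the bulk action may also need a `_type`.

```python
import json
from typing import Any, Dict, Iterable, List


def tag_bulk_lines(
    documents: Iterable[Dict[str, Any]], index: str = "tags"
) -> List[str]:
    """Build bulk-request lines creating one document per distinct tag."""
    tags = sorted({tag for doc in documents for tag in doc.get("tags", [])})
    lines = []
    for tag in tags:
        # Action line: the tag is the document id, so re-indexing the
        # same tag overwrites instead of duplicating.
        lines.append(json.dumps({"index": {"_index": index, "_id": tag}}))
        # Source line: assumed keyword field holding the tag text.
        lines.append(json.dumps({"tag": tag}))
    return lines


def tag_prefix_query(
    prefix: str, field: str = "tag", size: int = 10
) -> Dict[str, Any]:
    """Search body matching tags that start with what the user typed."""
    return {"size": size, "query": {"prefix": {field: prefix}}}
```

For the example from earlier in the thread, a document tagged `rick` and `roll` yields two tag documents, and the prefix `r` matches both.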
It's also not true to think that the legacy suggester is perfectly fine for this use case.
The completion field is always indexed as a regular field and the postings are augmented with payloads that contain the surface form and the weight for every suggestion. This means that in 2.x the FST in each segment is de-duplicated but you still pay the duplication cost in the postings, and this cost is not negligible since payloads don't use compression. There are other hidden costs regarding duplication in the legacy suggester but my point here is that it's not efficient to use a completion field, legacy or not, for such a use case.
IMO we should not keep the old postings format around, the cost to maintain it is way bigger than what it provides. The new suggester has been added to circumvent the limitations of the legacy one, not to provide a new way to retrieve suggestions.
Don't get me wrong I am not saying that your use case should not be handled more easily than "you need another index for this", I am just saying that the completion suggester was not meant to solve this.
I'm having a problem with the Completion Suggester: I have indexed all my necessary fields as usual, following https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html . So I created an index called autoidx and had to manually update all my IDs with the indexing pattern given in the 5.5 docs. So first every item got indexed (autoidx); then how can we index the suggestions without doing it manually, since without that part no result comes back from the POST query? Moreover, can we do any boosting in this kind of autocompletion?
Thanks
@riemannzeta1191 could you please ask your question on the discuss forum. This issue is closed and should not be used to answer general questions about the completion suggester.
@uschindler @jimczi I went through this issue and, like many others, I am looking for an autocomplete solution that suggests terms/phrases from a few fields in my index (which is very big). The problem with making the fields unique offline and putting them in a separate index is that we might have to reindex all the documents every n days (to handle deletion and avoid stale terms), which turns out to be too expensive when the index size is very big. Another approach I can think of is to keep a count of the unique terms, which proves to be an expensive operation as well in a distributed indexing system. It would be ideal if we could handle both duplicates and deletion in an optimal manner.
IMHO, the whole thing should be solved like mentioned by @jimczi : Use the phrase suggester but allow to use it as an "autocompleter" instead of "did you mean". As a side-effect the old completion suggester offered that, including the "weights" (because it defaulted to Suggest-TF if no weights were given).
I am curious to know whether we can make the phrase/term suggester work for autocomplete (as mentioned in the comment above)? Can it be done with some code change? Does anyone have any pointers on how to approach that? I can try it out and create a PR.