Exact match does not function as described in the documentation
AetherUnbound opened this issue · 3 comments
Description
It would appear that using a quoted search (e.g. "this is my search" vs this is my search) does not behave as described in the documentation:
# Example 2: Search for audio that is an exact match of Giacomo Puccini
curl \
  -H "Authorization: Bearer DLBYIcfnKfolaXKcmMC8RIDCavc2hW" \
  "https://api.openverse.engineering/v1/audio/?q="Giacomo Puccini""
Firstly, this documentation doesn't escape the inner quotes correctly, so the shell ends the URL at the space and passes Puccini to curl as a separate argument.
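One way to avoid the quoting problem is to percent-encode the query before it reaches the shell or the HTTP client. The following is a minimal sketch, not the documented client: the helper name build_search_url and its exact flag are made up for illustration, and only the base URL comes from the example above.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the documentation example quoted above.
BASE_URL = "https://api.openverse.engineering/v1/audio/"


def build_search_url(query: str, exact: bool = False) -> str:
    """Hypothetical helper: wrap the query in double quotes when exact is True,
    then percent-encode it so quotes and spaces survive in the URL."""
    q = f'"{query}"' if exact else query
    # quote_via=quote encodes spaces as %20 instead of "+".
    return f"{BASE_URL}?{urlencode({'q': q}, quote_via=quote)}"


print(build_search_url("Giacomo Puccini", exact=True))
```

The printed URL carries the quotes as %22 and the space as %20, so it can be passed to curl inside a single pair of double quotes without further escaping.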
Additionally, the quoted search appears to do nothing to the results. From the search controller code, it looks like we're boosting exact matches in all cases, but ignoring quotes otherwise.
It seems that we will need to change this logic so that if a term is quoted, only an exact match search is made. Exact matches could still be boosted in regular searches, but non-exact matches should not be returned in quoted searches.
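The proposed behavior could be sketched roughly as follows. This is not the existing search controller code: the function name build_query, the title field, and the boost value are assumptions chosen for illustration. Only the Elasticsearch query types (match_phrase, simple_query_string, bool) are real.

```python
def build_query(raw: str, field: str = "title") -> dict:
    """Hypothetical sketch: quoted input yields a phrase-only query,
    unquoted input yields a regular query with a phrase boost."""
    text = raw.strip()
    is_quoted = (
        len(text) > 2
        and text.startswith('"')
        and text.endswith('"')
        and text.count('"') == 2
    )
    if is_quoted:
        # Quoted search: only documents containing the phrase are returned.
        return {"bool": {"must": [{"match_phrase": {field: text[1:-1]}}]}}
    # Regular search: all terms must match, phrase matches rank higher.
    return {
        "bool": {
            "must": [
                {
                    "simple_query_string": {
                        "query": text,
                        "fields": [field],
                        "default_operator": "AND",
                    }
                }
            ],
            "should": [{"match_phrase": {field: {"query": text, "boost": 2}}}],
        }
    }
```

Note that match_phrase still runs through the field's analyzer, so on a stemmed field it would match stemmed variants of the phrase as well.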
Reproduction
- Run just recreate
- Visit http://localhost:50280/v1/images/?q=food
- Visit http://localhost:50280/v1/images/?q=%22food%22 and observe that the list remains unchanged, even though some posts (those on page 6) are matched on the tag food
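The comparison in the steps above can also be scripted. This is a rough sketch that assumes the local API from the reproduction steps is running and that the response JSON has a "results" list whose items carry an "id". That response shape and all helper names are assumptions.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Local endpoint from the reproduction steps.
BASE_URL = "http://localhost:50280/v1/images/"


def search_url(query: str) -> str:
    """Percent-encode the query, including any double quotes."""
    return f"{BASE_URL}?q={quote(query, safe='')}"


def fetch(query: str) -> dict:
    """Fetch one page of results as parsed JSON (requires the local API)."""
    with urlopen(search_url(query)) as response:
        return json.load(response)


def result_ids(payload: dict) -> list:
    # Assumption: each item in "results" has an "id" key.
    return [item["id"] for item in payload.get("results", [])]


def quotes_change_results(plain: dict, quoted: dict) -> bool:
    """True when the quoted search returns a different list of ids."""
    return result_ids(plain) != result_ids(quoted)
```

With the stack running, quotes_change_results(fetch("food"), fetch('"food"')) would be expected to return False while the bug is present, at least for the first page.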
Resolution
- 🙋 I would be interested in resolving this bug.
I've labeled this as "high" because we've received direct user feedback about it and it is behavior that contradicts our documentation.
Update from @obulat:
I looked through the code, and it appears that we first remove invalid quotes (if there are 3 quotes, for example) and then use simple_query_string [Elastic docs], which should actually only return results that contain exact matches. simple_query_string applies its own search syntax to the query (including quotes, which mean an exact phrase match), and we use it with an AND operator, so it should only return results when all parts of the query match.
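As a rough illustration of the flow described here, the sketch below cleans up unbalanced quotes and builds a simple_query_string body with an AND default operator. The helper names and the choice to drop, rather than escape, stray quotes are assumptions. The actual controller code may handle this differently.

```python
def drop_unbalanced_quotes(query: str) -> str:
    """Hypothetical cleanup: if the number of double quotes is odd,
    remove them all so the query parser sees no dangling phrase."""
    return query.replace('"', "") if query.count('"') % 2 else query


def simple_query(query: str, fields: list) -> dict:
    """Build a simple_query_string clause where every term must match."""
    return {
        "simple_query_string": {
            "query": drop_unbalanced_quotes(query),
            "fields": fields,
            # AND: every term (or quoted phrase) has to match.
            "default_operator": "AND",
        }
    }
```

Balanced quotes are passed through untouched, so simple_query_string itself interprets "down by the river" as a phrase.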
I was actually able to find a query string which produced different results when quoted too!
- https://api.openverse.engineering/v1/audio/?q=down%20by%20the%20river - produces 94 audio results
- https://api.openverse.engineering/v1/audio/?q=%22down%20by%20the%20river%22 - produces 17 audio results
@sarayourfriend suggested this may also be the result of stemming (and potentially stop token filtering). E.g. even quoted searches for "skiing" would return results on "ski" and "skis". Elasticsearch has a guide on how we could mix both stemming and exact searches: https://www.elastic.co/guide/en/elasticsearch/reference/current/mixing-exact-search-with-stemming.html. Perhaps that would be something we could add!
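The approach in that guide could look roughly like this: index the text twice, once with a stemming analyzer and once with a non-stemming one, then let simple_query_string route quoted phrases to the non-stemmed subfield through quote_field_suffix. The field name title and the analyzer name english_exact are placeholders, not the actual Openverse mapping.

```python
# Index body: "title" is stemmed by the built-in english analyzer, while
# "title.exact" uses a custom analyzer that only tokenizes and lowercases.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "english_exact": {
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "english",
                "fields": {
                    "exact": {"type": "text", "analyzer": "english_exact"}
                },
            }
        }
    },
}


def search_body(query: str) -> dict:
    """Quoted parts of the query are matched against title.exact,
    unquoted parts against the stemmed title field."""
    return {
        "query": {
            "simple_query_string": {
                "query": query,
                "fields": ["title"],
                "quote_field_suffix": ".exact",
                "default_operator": "AND",
            }
        }
    }
```

With such a mapping, a search for skiing without quotes would still match ski and skis, while "skiing" in quotes would be checked against the unstemmed tokens only.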