WordPress/openverse-api

Exact match does not function as described in the documentation

AetherUnbound opened this issue · 3 comments

Description

It would appear that using a quoted search (e.g. "this is my search" vs this is my search) does not behave as described in the documentation:

# Example 2: Search for audio that is an exact match of Giacomo Puccini
curl \
  -H "Authorization: Bearer DLBYIcfnKfolaXKcmMC8RIDCavc2hW" \
  "https://api.openverse.engineering/v1/audio/?q="Giacomo Puccini""

Firstly, this documentation doesn't escape the quotes correctly, so the shell splits the URL at the space and passes Puccini to curl as a separate argument.
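One way to sidestep the shell quoting problem is to build the URL programmatically so the quotes are percent-encoded. A minimal sketch using only the Python standard library; the helper name build_search_url is hypothetical, and the base URL is taken from the documentation example above:

```python
from urllib.parse import urlencode

API_BASE = "https://api.openverse.engineering/v1"


def build_search_url(media_type: str, query: str, exact: bool = False) -> str:
    """Return a search URL, wrapping the query in double quotes for an exact match."""
    q = f'"{query}"' if exact else query
    # urlencode percent-encodes the quotes (%22) and encodes spaces as '+'.
    params = urlencode({"q": q})
    return f"{API_BASE}/{media_type}/?{params}"


print(build_search_url("audio", "Giacomo Puccini", exact=True))
# https://api.openverse.engineering/v1/audio/?q=%22Giacomo+Puccini%22
```

The resulting URL can be passed to curl inside single quotes without any further escaping.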

Additionally, the quoted search appears to do nothing to the results. From the search controller code, it looks like we're boosting exact matches in all cases, but ignoring quotes otherwise:

https://github.com/WordPress/openverse-api/blob/a44f8d19cc3b2612469bbf20a3aac54c44086712/api/catalog/api/controllers/search_controller.py#L340-L356

It seems that we will need to change this logic so that if a term is quoted, only an exact match search is made. Exact matches could still be boosted in regular searches, but non-exact matches should not be returned in quoted searches.
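The proposed behavior could look roughly like the sketch below. This is an illustration only, not the search controller's code: the function name build_query and the field names are hypothetical, and the queries are plain dicts in the Elasticsearch query DSL. Note that a phrase query against a stemmed field may still match stemmed variants of the phrase.

```python
SEARCH_FIELDS = ["title", "description", "tags.name"]  # hypothetical field names


def build_query(q: str) -> dict:
    """Return a phrase-only query for fully quoted input, a regular query otherwise."""
    q = q.strip()
    is_quoted = (
        len(q) >= 2
        and q.startswith('"')
        and q.endswith('"')
        and q.count('"') == 2
    )
    if is_quoted:
        # Quoted search: only documents containing the whole phrase should match.
        return {
            "multi_match": {
                "query": q[1:-1],
                "type": "phrase",
                "fields": SEARCH_FIELDS,
            }
        }
    # Regular search: all terms must match; boosting of exact matches
    # would be layered on top of this and is left out of the sketch.
    return {
        "simple_query_string": {
            "query": q,
            "fields": SEARCH_FIELDS,
            "default_operator": "AND",
        }
    }
```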

Reproduction

  1. just recreate
  2. Visit http://localhost:50280/v1/images/?q=food
  3. Visit http://localhost:50280/v1/images/?q=%22food%22 and observe that the list remains unchanged, even though some posts (those on page 6) are matched on the tag food

Resolution

  • 🙋 I would be interested in resolving this bug.

I've labeled this as "high" because we've received direct user feedback about it and it is behavior that contradicts our documentation.

Update from @obulat:

I looked through the code, and it appears that we first remove invalid quotes (if there are 3 quotes, for example) and then use simple_query_string (see the Elastic docs), which should in fact return only results that contain exact matches. simple_query_string applies its own search syntax to the query, and that syntax includes quotes for exact phrases. It also uses an AND operator, so it should only return results when all parts of the query match.
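The flow described here can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual controller code: the helper names sanitize_quotes and simple_query are hypothetical, and "invalid quotes" is assumed to mean an odd number of double quotes.

```python
def sanitize_quotes(q: str) -> str:
    """Drop all double quotes when they are unbalanced (odd count)."""
    if q.count('"') % 2 == 1:
        return q.replace('"', "")
    return q


def simple_query(q: str, fields: list[str]) -> dict:
    """Build a simple_query_string query that requires every term to match."""
    return {
        "simple_query_string": {
            # Balanced quotes are kept, so simple_query_string treats
            # the quoted part as a phrase.
            "query": sanitize_quotes(q),
            "fields": fields,
            "default_operator": "AND",
        }
    }
```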

I was actually able to find a query string which produced different results when quoted too!

@sarayourfriend suggested this may also be the result of stemming (and potentially stop token filtering). E.g. even quoted searches for "skiing" would return results on "ski" and "skis". Elasticsearch has a guide on how we could mix both stemming and exact searches: https://www.elastic.co/guide/en/elasticsearch/reference/current/mixing-exact-search-with-stemming.html. Perhaps that would be something we could add!
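For reference, the approach in the linked guide could be sketched like this: index the text with a stemming analyzer, add an unstemmed sub-field, and let simple_query_string route quoted phrases to that sub-field through quote_field_suffix. The field name title and the helper name exact_aware_query are hypothetical; the english_exact analyzer name follows the guide. The dicts below are request bodies only and are not wired to any client.

```python
# Index settings and mappings: "title" is stemmed by the built-in english
# analyzer, while "title.exact" only tokenizes and lowercases.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "english_exact": {
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {  # hypothetical field name
                "type": "text",
                "analyzer": "english",
                "fields": {
                    "exact": {"type": "text", "analyzer": "english_exact"}
                },
            }
        }
    },
}


def exact_aware_query(q: str) -> dict:
    """Quoted parts of q are matched against title.exact, the rest against title."""
    return {
        "query": {
            "simple_query_string": {
                "query": q,
                "fields": ["title"],
                "quote_field_suffix": ".exact",
                "default_operator": "AND",
            }
        }
    }
```

With a setup like this, a quoted search for "skiing" would be matched against the unstemmed sub-field, while an unquoted search would still benefit from stemming.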