Fuzzy matching not working
skrafft opened this issue · 6 comments
Hi,
The fuzzy matching parameter has no effect:
I queried https://api.opensanctions.org/search/default?q=Barrrack%20Obama&fuzzy=true, which should return a result since only one letter differs.
I checked the code at https://github.com/opensanctions/yente/blob/main/yente/search/queries.py#L85 and the Elasticsearch documentation. It should work, but in practice it does not.
Googling the problem mostly turns up cases caused by a wrong mapping, but I could not find any problem in the ES mapping either. I ended up changing the text_query function to this:
def text_query(
    dataset: Dataset,
    schema: Schema,
    query: str,
    filters: FilterDict = {},
    fuzzy: bool = False,
):
    if not len(query.strip()):
        should = {"match_all": {}}
    elif fuzzy and query.find('~') == -1:
        should = {
            "match": {
                "text": {
                    "query": query,
                    "fuzziness": "AUTO",
                    "lenient": True,
                    "operator": "AND"
                }
            }
        }
    else:
        should = {
            "query_string": {
                "query": query,
                "fields": ["names^3", "text"],
                "default_operator": "and",
            }
        }
    return filter_query([should], dataset=dataset, schema=schema, filters=filters)
The condition "fuzzy and query.find('~') == -1" is there to avoid mixing fuzziness with the ~ operator. If the query already contains ~, the fuzzy parameter is simply ignored.
@pudo any comment on this?
I can open a pull request if needed.
Hi. I'm just working on an integration test harness for the API, so this comes in useful. What I think is happening: "fuzziness": "AUTO" brings in a Levenshtein tolerance of only 1 for a string the length of "Barrrack Obama", and adding two extra "r" exceeds that threshold. So I guess the best option would be to make fuzziness default to something other than AUTO, e.g. 2. I don't want to do this on the public API that we operate, since it's a massive performance penalty, but we could introduce an environment setting?
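To make the environment-setting idea concrete, here is a minimal sketch. The variable name EXAMPLE_QUERY_FUZZINESS and both helper functions are hypothetical and are not existing yente options; the clause simply mirrors the match query posted earlier in this thread.

```python
import os
from typing import Mapping, Optional, Union

# Hypothetical setting name, used for illustration only (not a real yente option).
FUZZINESS_ENV = "EXAMPLE_QUERY_FUZZINESS"
ALLOWED = {"AUTO", "0", "1", "2"}


def get_fuzziness(environ: Optional[Mapping[str, str]] = None) -> Union[str, int]:
    """Read the fuzziness from the environment, falling back to AUTO."""
    env = os.environ if environ is None else environ
    value = str(env.get(FUZZINESS_ENV, "AUTO")).strip().upper()
    if value not in ALLOWED:
        # Unknown values fall back to the current default.
        return "AUTO"
    return int(value) if value.isdigit() else value


def fuzzy_match_clause(query: str, environ: Optional[Mapping[str, str]] = None) -> dict:
    """Build a match clause like the one above, with configurable fuzziness."""
    return {
        "match": {
            "text": {
                "query": query,
                "fuzziness": get_fuzziness(environ),
                "operator": "AND",
            }
        }
    }
```

With this shape the public API could keep AUTO while a self-hosted instance opts into a higher value and accepts the performance cost.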
Hi,
I don't think this is related to the AUTO value. I've tested multiple combinations directly on Elasticsearch with fuzziness set to AUTO, 1 or 2, and it does not change the results. As a matter of fact, the query https://api.opensanctions.org/search/default?q=Barrack%20Obama returns 1 result and https://api.opensanctions.org/search/default?q=Barrock%20Obama&fuzzy=true (changing one "a" to an "o") does not return anything.
I think there's something wrong with the mapping, but I could not figure out what, so I ended up rewriting the query.
Just to be clear: the guy is called Barack Obama (https://en.wikipedia.org/wiki/Barack_Obama). Barrack Obama is fuzziness=1, Barrock Obama is fuzziness=2. Am I totally confused here?
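For reference, these distances can be checked with a plain Levenshtein function. This is a minimal sketch with a made-up helper name. As far as I know, Elasticsearch counts a transposition of two adjacent characters as a single edit by default, which a plain Levenshtein distance does not, but none of the spellings in this thread involve a transposition.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


# Distances from the canonical spelling to the variants tried in this thread.
for variant in ("barrack", "barrock", "barock", "barrrack"):
    print(variant, levenshtein("barack", variant))
# barrack 1
# barrock 2
# barock 1
# barrrack 2
```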
That's true, but he also has aliases like Barrack Obama in the data, so Barrack Obama is an exact match as far as Elasticsearch is concerned (which makes the distance 1 when you replace an "a" with an "o"). Anyway, searching https://api.opensanctions.org/search/default?q=Barock%20Obama does not return any result either.
So for it to return a result for Barock%20Obama, is there something that can be configured or added?
Ok, so I've solved this question, but the answer is less than amazing. Basically: Elasticsearch never does fuzzy search on all the terms in a query_string query. That's something you have to actively indicate by adding a tilde to the fuzzy term: "barock~ obama" gives a result.
My take-away: it's probably a good idea to use /match in yente most of the time if you're trying to match entities. The search API is just that: a way for people to search on the web site...
cf. https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
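The tilde workaround can also be applied on the client side before the query is sent to /search. A minimal sketch, assuming simple whitespace-separated terms: the helper name is hypothetical and not part of yente, and it deliberately ignores quoted phrases (where a tilde means proximity rather than fuzziness), field prefixes and grouping.

```python
# Boolean keywords of the query_string syntax that must not get a tilde.
RESERVED = {"AND", "OR", "NOT"}


def add_fuzzy_operators(query: str) -> str:
    """Append ~ to every plain term so query_string treats each one as fuzzy.

    Hypothetical helper for illustration. Queries that already contain a
    tilde are returned unchanged, matching the rule proposed earlier in
    this thread.
    """
    if "~" in query:
        return query
    terms = []
    for term in query.split():
        terms.append(term if term in RESERVED else f"{term}~")
    return " ".join(terms)


print(add_fuzzy_operators("barock obama"))  # barock~ obama~
```

Marking every term as fuzzy is broader than the single-term example above and will cost more at query time, so it is probably only worth doing when the caller explicitly asks for fuzzy matching.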