skrub-data/skrub

Allow using a different distance for the nearest neighbors in fuzzy join


Problem Description

At the moment we use NearestNeighbors with the l2 distance.
If we could choose the distance, then using MinHash as the text encoder and "hamming" as the distance would approximate 1 - Jaccard similarity, which I believe is a common choice for fuzzy joining.
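
To illustrate the approximation mentioned above, here is a minimal standard-library sketch. It does not use skrub's MinHash encoder: the helpers `char_ngrams`, `minhash_signature` and `normalized_hamming` are hypothetical names written for this example, and the choice of character 3-grams and 512 hash components is an assumption.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string (assumes len(text) >= n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _seeded_hash(token, seed):
    """Hash a token to a 64-bit integer, using the seed as the hash salt."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, n_components=512):
    """For each seeded hash function, keep the minimum hash over the token set."""
    return [
        min(_seeded_hash(token, seed) for token in tokens)
        for seed in range(n_components)
    ]


def jaccard_distance(a, b):
    """Exact 1 - Jaccard similarity between two sets."""
    return 1.0 - len(a & b) / len(a | b)


def normalized_hamming(sig_a, sig_b):
    """Fraction of signature positions where the two signatures differ."""
    return sum(x != y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = char_ngrams("london borough of camden")
b = char_ngrams("london boro of camden")

exact = jaccard_distance(a, b)
approx = normalized_hamming(minhash_signature(a), minhash_signature(b))
print(f"1 - Jaccard: {exact:.3f}, hamming on MinHash: {approx:.3f}")
```

Two signatures agree at a given position with probability equal to the Jaccard similarity of the underlying sets, so the fraction of differing positions estimates 1 - Jaccard, with an error that shrinks as the number of components grows.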

Feature Description

The Joiner would have a "metric" or "distance" parameter that would be forwarded to the metric parameter of NearestNeighbors.
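
A rough sketch of the forwarding, assuming the rows of both tables have already been vectorized. `nearest_match` is a hypothetical helper, not the Joiner's actual code; only scikit-learn's `NearestNeighbors` and its `metric` parameter are real API here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nearest_match(main_vectors, aux_vectors, metric="euclidean"):
    """Find, for each main row, the closest auxiliary row under `metric`.

    Hypothetical helper: `metric` is passed through unchanged to
    NearestNeighbors, as the proposed Joiner parameter would be.
    """
    neighbors = NearestNeighbors(n_neighbors=1, metric=metric)
    neighbors.fit(aux_vectors)
    distances, indices = neighbors.kneighbors(main_vectors)
    return indices.ravel(), distances.ravel()


# Toy integer signatures standing in for MinHash output.
aux = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [1, 2, 7, 9]])
main = np.array([[1, 2, 3, 8], [5, 0, 7, 8]])

indices, distances = nearest_match(main, aux, metric="hamming")
print(indices, distances)
```

With "hamming", scikit-learn reports the proportion of differing components, so each main row above is matched to the auxiliary row that differs in a single position out of four.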

Alternative Solutions

No response

Additional Context

No response