Suggestion: Significant speedup for `difference_inner_join`
Hi!
The current implementation of `difference_inner_join` relies on the more general `fuzzy_join` and (conceptually) performs a Cartesian join followed by a filter that keeps the pairs within the given distance. This costs O(N*M) time and memory (where N is the length of the shorter data frame and M the length of the longer).
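As a minimal sketch of that conceptual algorithm (in Python on plain lists rather than data frames; the function and argument names are hypothetical, not the package's API):

```python
def naive_difference_inner_join(short_vals, long_vals, max_dist):
    # Cartesian product of the two columns, filtered by absolute
    # difference: O(N*M) pairs are examined and materialized.
    return [(s, l)
            for s in short_vals
            for l in long_vals
            if abs(s - l) <= max_dist]
```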
A more efficient approach is to build an interval tree over the longer join column, representing each value as the interval [value - dist, value + dist]. Each element of the shorter column can then be queried against this tree in O(log M) time, giving a total cost of O((N + M) log M) including the time to build the tree. With N fixed and M growing large (e.g. a small sample searched against a huge database), the cost is O(M log M) instead of O(M^2).
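One way to sketch this: because every interval here has the same half-width `dist`, a full interval tree simplifies to sorting the longer column once and binary-searching the window [value - dist, value + dist] for each query, which achieves the same O((N + M) log M) bound. A hedged Python illustration (hypothetical names, plain lists, not the package's API):

```python
import bisect

def fast_difference_inner_join(short_vals, long_vals, max_dist):
    # Sort the longer column once: O(M log M).
    sorted_long = sorted(long_vals)
    pairs = []
    for s in short_vals:
        # Binary-search the window [s - max_dist, s + max_dist]:
        # O(log M) per query, plus the size of the output.
        lo = bisect.bisect_left(sorted_long, s - max_dist)
        hi = bisect.bisect_right(sorted_long, s + max_dist)
        for j in range(lo, hi):
            pairs.append((s, sorted_long[j]))
    return pairs
```

A general interval tree would be needed if each row carried its own distance; with a single shared `dist`, the sorted-array form is simpler and cache-friendly.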
I'm not fluent enough in R and unfortunately have very little time to help out, should this be of interest, but if it helps, see my implementation for Pandas here. I was motivated to write my own because I needed this functionality for a project with big data, and sadly `difference_inner_join` wasn't feasible at that scale.