Suggestion: Significant speedup for `difference_inner_join`
Hi!
The current implementation of `difference_inner_join` relies on the more general `fuzzy_join` and (conceptually) performs a Cartesian join followed by a filter that keeps the pairs within the given distance. This costs O(N*M) time and memory (where N is the length of the shorter data frame and M the length of the longer).
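As a minimal sketch of that conceptual algorithm (in Python on plain lists rather than data frames; the function and argument names are hypothetical, not the package's API):

```python
def naive_difference_inner_join(short_vals, long_vals, max_dist):
    # Cartesian product of the two columns, filtered by absolute
    # difference: O(N*M) pairs are examined and materialized.
    return [(s, l)
            for s in short_vals
            for l in long_vals
            if abs(s - l) <= max_dist]
```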
A more efficient approach is to build an interval tree over the longer join column, representing each value as the interval [value - dist, value + dist]. Each element of the shorter column can then be queried against this tree in O(log M) time, giving a total cost of O((N + M) log M) including the time to build the tree. With N fixed and M growing large (e.g. a small sample searched against a huge database), the cost is O(M log M) instead of O(M^2).
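One way to sketch this: because every interval here has the same half-width `dist`, a full interval tree simplifies to sorting the longer column once and binary-searching the window [value - dist, value + dist] for each query, which achieves the same O((N + M) log M) bound. A hedged Python illustration (hypothetical names, plain lists, not the package's API):

```python
import bisect

def fast_difference_inner_join(short_vals, long_vals, max_dist):
    # Sort the longer column once: O(M log M).
    sorted_long = sorted(long_vals)
    pairs = []
    for s in short_vals:
        # Binary-search the window [s - max_dist, s + max_dist]:
        # O(log M) per query, plus the size of the output.
        lo = bisect.bisect_left(sorted_long, s - max_dist)
        hi = bisect.bisect_right(sorted_long, s + max_dist)
        for j in range(lo, hi):
            pairs.append((s, sorted_long[j]))
    return pairs
```

A general interval tree would be needed if each row carried its own distance; with a single shared `dist`, the sorted-array form is simpler and cache-friendly.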
I'm not fluent enough in R and unfortunately have very little time to help out, should this be of interest, but if it helps, see my implementation for Pandas here. I was motivated to write my own because I needed this functionality for a project with big data, and sadly `difference_inner_join` wasn't feasible at that scale.