Iterate rather than recurse
aquach opened this issue · 8 comments
Some of my datasets create trees of depth 600+, which when queried can trigger stack overflows. I could increase the stack size, but I think that querying iteratively rather than recursively would solve the issue more generally. I can take a stab at that in a few hours.
It appears that the reason the tree is so deep is that there are many duplicate elements (d(x, y) = 0), which create degenerate tree structures (sketched after the list below). When I deduplicate the elements before inserting them, the resulting tree only has depth 7, so I guess the recursion is fine as long as you take this into account, since the deduplicated depth is so low. Seems like there are a few approaches here which aren't mutually exclusive:
- change add and query to be iterative
- optimize tree to collapse duplicates within a single node
- push the responsibility for duplicates to the client of the tree
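To make the degenerate case concrete, here's a rough sketch of BK-tree insertion (a toy node layout, not this library's actual internals). Each node keys its children by distance, so a duplicate always matches the existing child at distance 0 and descends into it; inserting n copies of one element therefore builds a chain n nodes deep:

Node = Struct.new(:term, :children) # children: a hash of distance => Node

def add(node, term, metric)
  loop do
    d = metric.call(node.term, term)
    if (child = node.children[d])
      node = child # a duplicate always descends into the d = 0 child
    else
      node.children[d] = Node.new(term, {})
      return
    end
  end
end

(This sketch also happens to be iterative, in the spirit of the first option above.)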
It's not stated, but I think I started from the assumption that the elements would be unique. I can see why it might be more convenient as a user of the library not to have to worry about uniqueness (and I think that's valuable), but is there any reason beyond that that you'd need links with d(x, y) = 0?
Storing every element rather than just unique elements is important in cases where the metric distance is an incomplete picture of equality. For example, my case is clustering transactions by similar names:
Transaction = Struct.new(:name, :amount)

and my metric is ->(a, b) { lcs_distance(a.name, b.name) }. Two different transactions might be equal according to this metric, but that just means they have the same name. I guess this doesn't actually qualify as a real metric, since d(x, y) = 0 does not imply x = y, but I hope I've motivated why this construction is still useful.
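For instance, with made-up values:

metric = ->(x, y) { lcs_distance(x.name, y.name) }
a = Transaction.new("AMAZON MKTPLACE PMTS", 12.99)
b = Transaction.new("AMAZON MKTPLACE PMTS", 45.00)
metric.call(a, b) # => 0, yet a != b: same name, different amounts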
Replacing recursion with iteration alone won't solve this problem either, incidentally: I just tried it! (See https://github.com/threedaymonk/bktree/tree/iterate).
I now understand why you want to have multiple items with distance zero. I think it's valid, but I think it might be outside the scope of a BK tree, and I'm always wary of adding special cases.
I wonder if it might be better to achieve the same goal by wrapping the tree in something that can handle your requirements. A rough (untested) implementation could be something like this:
class WrappedBKTree
  def initialize
    # The underlying BK tree stores unique names; @objects maps each name
    # back to every transaction that shares it.
    @bk_tree = BK::Tree.new(method(:lcs_distance))
    @objects = {}
  end

  def add(object)
    if @objects.key?(object.name)
      # Duplicate name: record the object without touching the tree.
      @objects[object.name] << object
    else
      @objects[object.name] = [object]
      @bk_tree.add object.name
    end
  end

  def query(object, threshold)
    # Expand each matching name back into its full list of objects.
    @bk_tree.query(object.name, threshold).map { |n, d| [@objects[n], d] }
  end
end
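Used with your Transaction struct, that might look like this (the exact return shape depends on what BK::Tree#query yields):

tree = WrappedBKTree.new
tree.add Transaction.new("AMAZON MKTPLACE PMTS", 12.99)
tree.add Transaction.new("AMAZON MKTPLACE PMTS", 45.00) # same name: grouped, no new tree node
tree.query(Transaction.new("AMAZON MARKETPLACE", 0), 10)
# => pairs of [all transactions sharing a matching name, distance]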
Oh, thanks for trying it! I checked your branch, and I might be missing something, but it seems like you're still recursing in add and query, since they call the helper methods add_ and query_, which each call themselves. An iterative solution would probably push nodes onto a queue and visit them in a loop.
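Something like this rough sketch, reusing the toy node layout from above (not the branch's actual code):

def query(root, term, threshold, metric)
  results = []
  queue = [root]
  until queue.empty?
    node = queue.shift
    d = metric.call(node.term, term)
    results << [node.term, d] if d <= threshold
    # Triangle inequality: only children at distances within
    # threshold of d can contain matches.
    node.children.each do |dist, child|
      queue << child if (d - threshold..d + threshold).cover?(dist)
    end
  end
  results
end

The explicit queue grows on the heap instead of the call stack, so deep (even degenerate) trees can't overflow it.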
Totally valid point that it may be outside the scope of a BK tree. I also considered your wrapped solution, but at the time I thought it would be a better feature to build directly into the BK tree. After realizing that d(x, y) = 0 customarily means x == y, I feel like the wrapping solution is better, which means my PR isn't necessary. If we went down that route, it would probably still be useful to modify the add function to reject duplicates, or to state 'no duplicates' as a constraint, so that we don't hit this bad behavior when a user adds a bunch of duplicates.
Oh yes, you're right, that branch is still recursing, isn't it? It's just not using objects to do so, so it hangs instead of overflowing the stack.
I think rejecting duplicates would be a good idea, so I'll look into that. It should be straightforward.
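The straightforward shape, sketched against the toy node layout from earlier in the thread (not necessarily what I'll commit), would be to treat d = 0 as equality and bail out before inserting:

def add(node, term, metric)
  loop do
    d = metric.call(node.term, term)
    return false if d.zero? # distance 0 means "already present": reject it
    if (child = node.children[d])
      node = child
    else
      node.children[d] = Node.new(term, {})
      return true
    end
  end
end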
Thanks for your thoughts and contributions. I hope you feel that this is a good resolution!
👍