cdalitz/kdtree-cpp

Not all neighbors are found in range search

Closed this issue · 9 comments

Issue

I might have found a bug where not all neighbors are found when using range_nearest_neighbors, in a tree with distance of type Maximum (though the distance metric doesn't seem to matter).

Here's what happens:

I have two points p1 = {0.537667,1.83388} and p2 = {0.695205,1.64763}. The distance (maximum) between p1 and p2 is MAX(0.695205-0.537667, 1.83388-1.64763), which is 0.18625.
If i build a tree with these two nodes, then when I call range_nearest_neighbors(p1, 0.2, &result), I get both nodes as a result, as I should.

However, if the tree contains p1, p2 and a lot of other nodes (several thousand), then p2 doesn't appear in the neighbors of p1 anymore.

How to reproduce

I have just modified the main function of demo_kdtree.cpp with my testing. nodes.hpp contains p1, p2 and 4997 other points.

I have tested printing the points being compared in the distance function, and from what I've seen, p2 is never compared to p1.

Thanks for reporting this and for providing a test example. Hopefully, I will find some time to look into this in January or February. Should you already find a possible fix, patches are welcome.

reid-p commented

I'm seeing the same.
I shuffled my data using std::shuffle and get a different result, if that helps.

After some investigating, I've found how to solve the issue I was facing.

It seems to me that the function bounds_overlap_ball, by computing the sum of the coordinate_distance between the point and the lo/up bound of the tree, is reasonning in terms of Manhattan distance, no matter what type of distance is set up in the tree class members.

I've added if(distance_type == 0) distsum = 0.0; right before the end of the for loop in bounds_overlap_ball, which solves the problem for my use case without changing the code too much, but I guess a problem similar to mine could still occur with Euclidian distance. I think a fix for Euclidian distance would be

if(distance_type == 2) {
    distsum += pow(distance->coordinate_distance(...), 2);
    if (sqrt(distsum) > dist) return false;
}

but I haven't tested it. Also, there might be a cleaner way to do this.

Should I make a pull request to add these changes ?

@abouneau-inria This is strange. coordinate_distance() is overwritten for each distance type, so it should return already the squared distance for Euclidean distance. Can you verify that indeed the wrong implementation of coordinate_distance() is called?

Okay, I missed the overwritting.

I just tested, and the right implementation is called for each distance type. However, the function is still summing the distances along each dimension in distsum, and then comparing it to dist.

In the case of the Maximum distance, I think this is wrong because as long as coordinate_distance(point[i], node->lobound[i], i) < dist and coordinate_distance(point[i], node->upbound[i], i) < dist for all i, then point might have a neighbor within bounds. (That's why I fixed my issue by resetting the sum between each iteration).

And as far as Euclidian distances, I guess sqrt(distsum) (rather than distsum itslef) should be compared to dist, am I wrong ?

@abouneau-inria Concerning Euclidean distances, these are always stored as squared values (see the comment in line 137 of kdtree.cpp).

Concerning the Maximum distance, you are right that the implementation is wrong, and this affects not only range queries, but nearest neighbor queries, too. The implementation of the search functions follows https://dl.acm.org/doi/10.1145/355744.355745, and Friedman et al. made the assumption that the total distance is a sum of coordinate distances (eq. 14 in the article). Later they explicitly mention the Maximum distance as an example, but this is obviously wrong and their implementation (appendix II in the article) fails for this case.

Can you verify that a nearest neighbor search fails, too, for the maximum distance?

In principle, this is easy to fix by changing bounds_overlap_ball(), but as it requires some testing it will take some time for me.

Indeed, I missed the line where r is squared in the case of euclidian distance, before being given as parameter as dist to bounds_overlap_ball().

I verified, and nearest neighbor indeed fails for the maximum distance: when requesting the 25 nearest neighbors of p1, p2 is missed.

(Neighbors 0 through 18 are identical)

25 nearest neighbors of (0.537667,1.83388): (current behavior)
0: (0.537667,1.83388) // distance (maximum) = 0
...
19: (0.725338,1.68654) // distance (maximum) = 0.187671
20: (0.349156,1.85928) // distance (maximum) = 0.188511
21: (0.505154,2.02512) // distance (maximum) = 0.19124
22: (0.388609,1.63524) // distance (maximum) = 0.19864
23: (0.606733,1.6345) // distance (maximum) = 0.19938
24: (0.739363,1.71189) // distance (maximum) = 0.201696


25 nearest neighbors of (0.537667,1.83388): (with the fix)
0: (0.537667,1.83388) // distance (maximum) = 0
... 
19: (0.695205,1.64763) // distance (maximum) = 0.18625 <-----
20: (0.725338,1.68654) // distance (maximum) = 0.187671
21: (0.349156,1.85928) // distance (maximum) = 0.188511
22: (0.505154,2.02512) // distance (maximum) = 0.19124
23: (0.388609,1.63524) // distance (maximum) = 0.19864
24: (0.606733,1.6345) // distance (maximum) = 0.19938

@abouneau-inria This should now be fixed. Can you please verify this and let me know so that I can close the issue?

It works correctly now, thank you!