spotify/annoy

Dot index recall issue on last.fm dataset

pkorobov opened this issue · 6 comments

Hi!

I was getting pretty low recall with a dot index on my data, so I decided to test the last.fm benchmark manually.

I then realized that this benchmark actually uses an angular index with a manually computed 65th component, sqrt(max_norm^2 - ||x||^2).
I reproduced the results for the angular index, but for a dot index (with the 65th component removed) recall was again pretty low!

Here's the notebook with the calculations.

What's going on in the notebook:

  1. Calculate exact dot products using brute force, then calculate top candidates
  2. Construct an angular index with the additional component sqrt(max_norm^2 - ||x||^2), compute recalls for various search_k's
  3. Construct a dot index without the additional component sqrt(max_norm^2 - ||x||^2), again compute recalls for various search_k's
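For reference, the augmentation trick in steps 2 and 3 can be sketched as follows (random data standing in for the last.fm vectors; the variable names are illustrative, not annoy's internals). Appending sqrt(max_norm^2 - ||x||^2) makes every item vector the same length, so cosine similarity against a zero-padded query ranks items exactly by dot product:

```python
import numpy as np

# Augment each item vector with an extra component sqrt(max_norm^2 - ||x||^2)
# so that maximum inner product search reduces to angular nearest-neighbor search.
rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 64))

norms = np.linalg.norm(items, axis=1)
max_norm = norms.max()
extra = np.sqrt(max_norm**2 - norms**2)
augmented = np.hstack([items, extra[:, None]])  # shape (1000, 65)

# Every augmented vector now has norm max_norm, so cosine similarity against
# a zero-padded query orders items exactly by their dot product with the query.
query = rng.normal(size=64)
aug_query = np.append(query, 0.0)

dot_rank = np.argsort(-(items @ query))
cos = (augmented @ aug_query) / (np.linalg.norm(augmented, axis=1) * np.linalg.norm(aug_query))
cos_rank = np.argsort(-cos)
assert (dot_rank == cos_rank).all()
```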

The results are as follows (10 trees):

| search_k | 1000 | 10000 | 50000 |
|---|---|---|---|
| Angular index recall (65 components) | 0.563 | 0.736 | 0.783 |
| Dot index recall (64 components) | 0.003 | 0.034 | 0.175 |

But as I understand it, the results should be at least comparable.

Could this be a bug?
If my results are correct, I'd be glad to contribute some fixes. However, so far I don't see what could be wrong in the dot index source code.

That's very interesting! @psobot what do you think is going on?

Some thoughts:

  1. It's a bit strange that two_means doesn't use dot_factor at all. I don't know whether this is intended, but it seems to me that dot_factor should be involved in the computation in the same way as any other component of node->v. At least that would be equivalent to adding the extra component and then just running angular.
    So the DotProduct class should probably use its own version of two_means that takes the extra component into account.
  2. The same goes for the distance method: it should include the extra components in the dot product calculation.
  3. The dot tests seem weak, in my opinion. I would add some accuracy tests for the dot index in a PR.
  4. Also, can someone please explain why there is a dot_factor^2 term here in margin?
    As I see it, a tree node's plane is defined by the vector (n_1, ..., n_f, dot_factor) and a query vector is q = (q_1, ..., q_f, 0), so the margin between them is just the dot product of their first f components, isn't it?
    However, removing the dot_factor^2 term hurts performance.
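To make point 4 concrete, here is a tiny sketch of the margin I would expect under the augmentation view (toy numbers; aug_n and aug_q are illustrative names, not annoy's internals):

```python
import numpy as np

# If the split plane's normal is the augmented vector (n_1, ..., n_f, dot_factor)
# and the query gets a zero extra component, their dot product reduces to the
# plain f-dimensional dot product -- with no dot_factor^2 term in sight.
f = 4
n = np.array([0.2, -0.5, 0.1, 0.7])
dot_factor = 0.3
q = np.array([1.0, 0.5, -0.2, 0.4])

aug_n = np.append(n, dot_factor)
aug_q = np.append(q, 0.0)  # query's extra component is 0

margin_from_augmentation = aug_n @ aug_q  # == n @ q
assert np.isclose(margin_from_augmentation, n @ q)
```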

I managed to improve recall a bit by using dot_factor in the methods mentioned above. However, it is still lower than the angular index recall.

Results for dot fixes for now:

| search_k | 1000 | 10000 | 50000 |
|---|---|---|---|
| master version | 0.003 | 0.034 | 0.175 |
| Use dot_factor in two_means and distance | 0.032 | 0.367 | 0.757 |
| Use dot_factor in two_means and distance, remove dot_factor^2 from margin | 0.006 | 0.054 | 0.249 |

Will try to dig a bit more and make a PR.

After a little break, I've finally fixed this. I achieved [0.514, 0.692, 0.75] recall for the dot index.
PR is coming.

UPD: Closed the issue by mistake.

Apart from the newly added last.fm test, the notebook above now shows comparable scores for the two index types with annoy installed from master. Thank you for your review on the PR, @erikbern and @psobot! I'm closing the issue.

Awesome, glad we were able to close this out! Sorry it took forever for me to review.

Hi @erikbern! Just wanted to ask, is there going to be a new version of annoy with this PR?