Dot index recall issue on last.fm dataset
pkorobov opened this issue · 6 comments
Hi!
I had pretty low recall for the dot index on my data, so I decided to test the lastfm benchmark manually.
I realized that this benchmark actually uses an angular index with a manually computed 65th component, sqrt(max_norm^2 - ||x||^2).
I reproduced the results for an angular index, but for a dot index (with the 65th component removed) recall was again pretty low!
Here is the notebook with the calculations.
What's going on in the notebook (a rough sketch of these steps follows the list):
- Calculate exact dot products using brute force, then calculate top candidates
- Construct an angular index with the additional component sqrt(max_norm^2 - ||x||^2), compute recalls for various search_k's
- Construct a dot index without the additional component sqrt(max_norm^2 - ||x||^2), again compute recalls for various search_k's
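For reference, here is a minimal sketch of that comparison (not the actual notebook: `items`, `queries`, and the helper names are mine, and I assume 64-dimensional last.fm factors and Annoy's Python API):

```python
import numpy as np
from annoy import AnnoyIndex

def augment(items):
    """Append sqrt(max_norm^2 - ||x||^2) as a 65th component, which turns
    maximum-inner-product search into an angular (cosine) problem."""
    norms = np.linalg.norm(items, axis=1)
    extra = np.sqrt(norms.max() ** 2 - norms ** 2)
    return np.hstack([items, extra[:, None]])

def build_index(vectors, metric, n_trees=10):
    index = AnnoyIndex(vectors.shape[1], metric)
    for i, v in enumerate(vectors):
        index.add_item(i, v)
    index.build(n_trees)
    return index

def recall(index, items, queries, k=10, search_k=10000, pad_query=False):
    hits = 0
    for q in queries:
        exact = set(np.argsort(-(items @ q))[:k])       # brute-force top-k by dot product
        q_in = np.append(q, 0.0) if pad_query else q    # queries get a zero extra component
        approx = index.get_nns_by_vector(q_in, k, search_k=search_k)
        hits += len(exact & set(approx))
    return hits / (k * len(queries))

# items: (n, 64) last.fm item factors, queries: (m, 64) user factors (loading not shown)
# recall(build_index(augment(items), 'angular'), items, queries, pad_query=True)  # 65 components
# recall(build_index(items, 'dot'), items, queries)                               # 64 components
```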
The results are as follows (10 trees):
| search_k | 1000 | 10000 | 50000 |
|---|---|---|---|
| Angular index recall (65 components) | 0.563 | 0.736 | 0.783 |
| Dot index recall (64 components) | 0.003 | 0.034 | 0.175 |
But, as I understand it, the results should be at least comparable. Could there be a bug?
If my results are correct, I would be glad to contribute some fixes. However, so far I don't see what could be wrong in the source code of the dot index.
Some thoughts:
- It is a bit strange that `two_means` doesn't use `dot_factor` at all. I don't know if this is intended, but it seems to me that `dot_factor` should be involved in the computations in the same way as any other component of `node->v`. At least that would be equivalent to adding the extra component and then just running angular. So the `DotProduct` class should probably use its own version of `two_means` that takes the extra component into account.
- The same goes for the `distance` method: it should include the extra components in the dot-product calculation.
- The dot tests seem weak, in my opinion. I would add some accuracy tests for the dot index in the PR.
- Also, can someone please explain why there is a `dot_factor`^2 term here in `margin`? As I see it, we have a tree node with a plane defined by the vector (n_1, ..., n_f, dot_factor) and a query vector q = (q_1, ..., q_f, 0), so the margin based on them is just the dot product of their first f components, isn't it? (See the small sketch after this list.) However, removing the `dot_factor`^2 term hurts performance.
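To illustrate that last point, here is a tiny numpy sketch (my own variable names, not Annoy code) of why a query's zero extra component should make `dot_factor` drop out of the margin:

```python
import numpy as np

rng = np.random.default_rng(0)
f = 4
n = rng.normal(size=f)            # first f components of a split node's vector
dot_factor = 1.7                  # the node's extra component
q = rng.normal(size=f)            # a query vector; its extra component is 0

plane = np.append(n, dot_factor)  # (n_1, ..., n_f, dot_factor)
query = np.append(q, 0.0)         # (q_1, ..., q_f, 0)

margin_augmented = plane @ query  # full (f + 1)-dimensional dot product
margin_first_f = n @ q            # dot product of the first f components only

# The two agree: dot_factor contributes nothing for a query whose extra
# component is 0, which is why the dot_factor^2 term in margin is puzzling.
assert np.isclose(margin_augmented, margin_first_f)
```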
I managed to improve recall a bit by using `dot_factor` in the methods mentioned above. However, it is still lower than the angular index recall.
Results with the dot fixes so far:
| search_k | 1000 | 10000 | 50000 |
|---|---|---|---|
| master version | 0.003 | 0.034 | 0.175 |
| Use `dot_factor` in `two_means` and `distance` | 0.032 | 0.367 | 0.757 |
| Use `dot_factor` in `two_means` and `distance`, remove `dot_factor`^2 from `margin` | 0.006 | 0.054 | 0.249 |
Will try to dig a bit more and make a PR.
After a little break I've finally fixed this. I've achieved [0.514, 0.692, 0.75] recall for the dot index (for search_k = 1000, 10000, 50000).
PR is coming.
UPD: Closed the issue by mistake.
Awesome, glad we were able to close this out! Sorry it took forever for me to review.