guokr/simbase

is "Instant Similarity Query" not supported?

bwlim opened this issue · 4 comments

Recommendation query seems to be allowed to the only registered vectors.
I think, when registering new vector (vadd command) all comparison calculation is performed (O(n)), and when querying recommendations, pre-caculated results are outputted (Constant Time Complexity?)
But in my application, user input arbitrary vector to the system for "instant" recommendation results..
Please give me some guides~

Hi, @bwlim ,

We think two aspects need to be considered on this issue, one is performance, the other is API design.

When instant query is needed, the input of vector, comparison and the output of recommendation should be combined in one API call. Obviously, the cost of the call should be scale at O(n) level.

According to our experience, assuming a 1-k dimensional base and cosinesq score function, the time cost of the call would be:

  • 70k vectors: ~0.1s, this speed should be OK for instant input
  • 700k vectors: ~1s, this speed should be slow for instant input

And one possible way to improve is to change cosinesq score to eclidean or manhattan, but the performance data need to be re-tested.

On the aspect of API design, we think two API can be added:

  • qadd: sourceVKey, targetVKey, sourceVectorComponents...
  • qrec: targetVKey, vectorComponents...

Command qadd is designed for the scenario that we add a vector, register it and then get the recommedation result in only one api call. Before qadd call, you need vmk to create a vector set to hold the inputed vectors.

While command qrec dose not store the inputed vectors, you only test an inputed vector against some predefined vector set.

I think qadd or qrec should meet with your requirements.

Advice are welcomed!

Yes, qadd, qrec can be one solution using current implementation of simbase.
um.. but, isn't there another indexing structure which similarity query performance is less than O(N) (O(logN) would be great!).
Have you ever heard R-tree?
Using R-tree index structure, we can query spatial query(querying all points within a query 2D box(or 3D cube, ...) in O(logN) time complexity. R-tree or R-tree variants are the basic core technology for geospatial database. (Geospatial DB can be also considered as a low dimensional vector database)
I'm not super-expert in this area :D,
I'm wondering there are some similar technologies for similarity vector query system..

When new vector is registered to vector DB, new vector instance is indexed to R-tree or something index structure (maybe in O(logN)), and similarity vector querying is performed by traversing R-tree or something index structure in O(log(N))

For database, query performance of O(N) is not good situation...

How do you think about this? :D

yes, @bwlim , R-tree is one possible way to speed up simbase, but also introduce much complexity of the code. We should balance gain and loss.

For our own use cases:

  • document recommendation: the current implementation is OK for next several years
  • personalised recommendation: because the baseline of registered users is around 10M ~ 50M, we still need to improve performance

For personalised recommendations, currently I prefer take another approach of locally sensitive hashing. Without real experiments on recommendation quality, performance and the development cost, I am not sure which approach we will finally take.

In short term, we will keep current data structure, and will consider your advice in next big release of 0.2.0.

Thanks.

Sorry for closing it, I think we need more time to discuss this issue.