alexklibisz/elastiknn

Lucene benchmarks for Big-ANN challenge

alexklibisz opened this issue · 3 comments

Implement a benchmarking suite for the Big-ANN benchmarks challenge.

  • directly use Lucene (i.e., no Elasticsearch or other HTTP server in the loop)
  • support both of Elastiknn's LSH models
  • support Lucene's HNSW models
  • containerized and configured to utilize a single large node (~64 threads, ~128GB RAM) with indexing and search parallelism
  • support grid-searching over a set of model parameters and storing results, similar to the existing elastiknn-benchmarks module
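
The grid-search requirement could be sketched roughly as below. This is a minimal illustration, not the actual suite; the model names, parameter grid, `run_benchmark` placeholder, and the JSONL results file are all assumptions for illustration.

```python
import itertools
import json

# Hypothetical parameter grid; the real models and parameter names
# live in the elastiknn-benchmarks module.
grid = {
    "model": ["lsh", "hnsw"],
    "candidates": [100, 500, 1000],
}

def run_benchmark(params):
    """Placeholder: index the dataset, run queries, return metrics."""
    return {"params": params, "recall": None, "qps": None}

# Run every combination in the grid.
results = []
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    results.append(run_benchmark(params))

# Store results for later comparison, one JSON object per line.
with open("results.jsonl", "w") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")
```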

Despite the lack of checkmarks, there has been some progress here.

I wrote up a Gist describing the development setup for this project: https://gist.github.com/alexklibisz/d3e47e30d3c7468f3ee56844e5ee6f7a

The latest measurements show about 10 QPS on the deep-10M dataset, running on my laptop with the setup described in the Gist.

I'll continue experimenting with this as time permits.

One interesting avenue for optimization is Index Sorting. I'll document it here for posterity.

In September I attended Adrien Grand's talk "Speed up your Lucene queries by avoiding searching" at the 2021 Virtual ApacheCon.

Index Sorting was one of the methods that Adrien covered. In short, Index Sorting allows the user to customize how a Lucene index sorts documents stored in its segments. For example, an application might sort documents by date if they are always returned in chronological order.

How does Index Sorting apply to vector search? My current hypothesis is that Index Sorting can be used to more intelligently assign vectors to shards. Specifically, if the vector-to-shard assignment can roughly maintain spatial proximity of vectors, we should be able to minimize the number of segments that must be visited to find a query's neighbors.
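
A toy way to probe this hypothesis in pure Python (not the actual benchmark; the dataset, dimensions, and segment size are made up): sort random vectors by their distance to the origin, chunk the ordering into fixed-size "segments", and count how many segments contain a query's true top-k neighbors, compared to a random vector-to-segment assignment.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def segments_touched(vectors, query, k, segment_size):
    """Count the fixed-size segments holding the query's true top-k neighbors."""
    ranked = sorted(range(len(vectors)), key=lambda i: euclidean(vectors[i], query))
    top_k = ranked[:k]
    return len({i // segment_size for i in top_k})

random.seed(42)
dim, n, k, segment_size = 2, 2000, 10, 100
vectors = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
query = vectors[0]

# Baseline: segments are arbitrary slices of a shuffled list.
shuffled = vectors[:]
random.shuffle(shuffled)
random_touched = segments_touched(shuffled, query, k, segment_size)

# "Index-sorted" assignment: order vectors by distance to the origin first.
by_norm = sorted(vectors, key=lambda v: euclidean(v, [0.0] * dim))
sorted_touched = segments_touched(by_norm, query, k, segment_size)
```

If the hypothesis holds, `sorted_touched` should usually be smaller than `random_touched`, because a query's neighbors tend to have similar norms and therefore land in adjacent segments.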

The initial experiment for Index Sorting involved sorting the vectors by two values:

  1. Distance from the vector to the origin (the all-zeros vector).
  2. Cosine of the angle between the vector and the origin vector.
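
The two sort keys can be sketched in pure Python. One assumption here: the cosine term is computed against a fixed reference direction (the all-ones vector in this sketch), since the angle relative to the zero vector itself is undefined.

```python
import math

def sort_keys(vector, reference=None):
    """Compute the two index-sort keys for a vector:
    (1) Euclidean distance to the origin, i.e. the vector's norm, and
    (2) cosine of the angle to a fixed reference direction.
    The all-ones reference is an assumption for illustration."""
    if reference is None:
        reference = [1.0] * len(vector)
    norm_v = math.sqrt(sum(x * x for x in vector))
    norm_r = math.sqrt(sum(x * x for x in reference))
    dot = sum(x * y for x, y in zip(vector, reference))
    cosine = dot / (norm_v * norm_r) if norm_v > 0 else 0.0
    return norm_v, cosine

# Order a small set of vectors by the two keys before indexing.
vectors = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
ordered = sorted(vectors, key=sort_keys)
```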

Sorting the index by these two values led to a ~60% query-time speedup on one of the ten-million-vector datasets. More testing and optimization is needed, but it looks like a fruitful direction.

Closing this. Will open another, more narrowly scoped and achievable ticket. I transcribed the idea from my previous comment to a GitHub discussion: #375