Try quick select algorithm for KthGreatest implementation
alexklibisz opened this issue · 4 comments
Background
I think there could be an opportunity to speed up approximate queries by re-implementing the kthGreatest method using the quickselect algorithm.
At a high level, the kthGreatest method is used to find the kth greatest document frequency. We give it an array of counts, each one representing the number of times a distinct document was matched against a set of query terms. It returns the kth greatest count. Then we perform exact similarity scoring on each document whose count meets or exceeds this kth greatest count.
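To make that contract concrete, here is roughly what a sort-based baseline for this operation looks like. This is only an illustration of what the method computes, not the actual elastiknn implementation; the method name and the `short[]` counts type are my shorthand here.

```java
import java.util.Arrays;

final class KthGreatestSketch {

    /**
     * Return the kth greatest value in counts (k = 1 returns the maximum).
     * O(n log n) due to the sort; copies the array so the caller's counts stay untouched.
     */
    static short kthGreatestBySorting(short[] counts, int k) {
        short[] copy = Arrays.copyOf(counts, counts.length); // leave the original counts unmodified
        Arrays.sort(copy);                                   // ascending order
        return copy[copy.length - k];                        // kth greatest, counted from the end
    }
}
```

Exact similarity scoring would then be run on every document whose count is greater than or equal to the returned threshold.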
There are some good example implementations of quick select on Leet code:
- Top K Frequent Elements
- [Python] QuickSelect average O(n), explained - Kth Largest Element in an Array - LeetCode
Deliverables
- Implement and benchmark the kthGreatest method using quickselect
- Report the results on this ticket or a PR, if it's good enough to merge
Related Issues
Blocked by #525
I've partially implemented this in #603. I based much of the quickselect implementation on this excellent gist: https://gist.github.com/unnikked/14c19ba13f6a4bfd00a3
My latest iteration at time of writing is here:
The benchmark is here:
Unfortunately this particular implementation of the quickselect algorithm is somehow slower than just sorting. I would speculate that much of this comes from the fact that I have to make a full copy of the array on every call. The copy is necessary because quickselect modifies the array (swapping values around) in order to compute its result, whereas the ArrayHitCounter expects those counts to remain immutable.
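For reference, the shape of what I benchmarked is roughly the following: copy the counts, then recursively partition in the spirit of that gist. This is a simplified sketch with illustrative names and a fixed last-element pivot, not the exact code linked above.

```java
import java.util.Arrays;

final class QuickSelectSketch {

    /** kth greatest via quickselect; copies counts because partitioning mutates the array. */
    static short kthGreatestByQuickSelect(short[] counts, int k) {
        short[] copy = Arrays.copyOf(counts, counts.length); // required: the hit counter expects counts to stay immutable
        int targetIndex = copy.length - k;                   // kth greatest == (n - k)th smallest, 0-based
        return select(copy, 0, copy.length - 1, targetIndex);
    }

    /** Recursively narrow the search to the side of the partition containing the target index. */
    private static short select(short[] a, int lo, int hi, int target) {
        if (lo == hi) return a[lo];
        int p = partition(a, lo, hi);
        if (p == target) return a[p];
        else if (target < p) return select(a, lo, p - 1, target);
        else return select(a, p + 1, hi, target);
    }

    /** Lomuto partition using the last element as a fixed pivot. */
    private static int partition(short[] a, int lo, int hi) {
        short pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                swap(a, i, j);
                i++;
            }
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(short[] a, int i, int j) {
        short tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```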
[info] Benchmark Mode Cnt Score Error Units
[info] KthGreatestBenchmarks.kthGreatest thrpt 5 10796.563 ± 0.931 ops/s
[info] KthGreatestBenchmarks.sortBaseline thrpt 5 2965.035 ± 85.854 ops/s
[info] KthGreatestBenchmarks.unnikedRecursive thrpt 5 2171.902 ± 40.547 ops/s
[success] Total time: 153 s (02:33), completed Nov 26, 2023, 11:35:28 PM
Quickselect is about 30% faster when I switch from a fixed pivot to a random pivot (the change is at line 19); a sketch of the change follows the output:
[info] Benchmark Mode Cnt Score Error Units
[info] KthGreatestBenchmarks.unnikedRecursive thrpt 5 2788.150 ± 16.839 ops/s
[success] Total time: 51 s, completed Nov 26, 2023, 11:48:26 PM
But it still doesn't come close to the existing kthGreatest implementation.
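For clarity, the fixed-to-random pivot change amounts to roughly the following: pick a random index in [lo, hi], swap it into the pivot position, and partition as before. This is an illustrative sketch, not the exact diff from the implementation.

```java
import java.util.concurrent.ThreadLocalRandom;

final class RandomPivotSketch {

    /** Same Lomuto partition as in the sketch above, but around a randomly chosen pivot. */
    static int partitionRandomPivot(short[] a, int lo, int hi) {
        int pivotIndex = lo + ThreadLocalRandom.current().nextInt(hi - lo + 1);
        swap(a, pivotIndex, hi); // move the random pivot to the end, then partition as before
        short pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                swap(a, i, j);
                i++;
            }
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(short[] a, int i, int j) {
        short tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

A random pivot makes degenerate partitions much less likely on partially ordered input, which presumably accounts for the improvement, but it doesn't change the copy-per-call overhead.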
This feels similar to using hashmaps instead of arrays to count hits, summarized in this comment: #160 (comment)
Right now I'm benchmarking with a dataset of 60k vectors (Fashion-MNIST). Optimizations like quickselect and primitive hashmaps might make a positive impact when I'm working with far more vectors, but Fashion-MNIST is the benchmark I'm trying to optimize for now.
Closing this for now. Might re-open if/when I'm benchmarking on a larger dataset.