flaxsearch/luwak

Just a quick question

lookfwd opened this issue · 4 comments

Hello, I'm wondering a bit about the efficiency of this code. My understanding is that if we write often, those DocValues won't be cached across manager.acquire() calls here, beyond whatever the operating system might (potentially) be doing behind the scenes. I've also noticed that getSortedDocValues() does two reads: one for the delta-compressed ord, which is then used as an offset into the real ID table. This should make the memory accesses more random (i.e. less linear, forward-only).

  • I've been thinking that it might be a good idea to put the id inside QueryCacheEntry so that we don't have to look it up at that point.
  • We could also bundle them and store the hash and id together as a single BinaryDocValues field. By using BytesRef we won't need any extra copies there. That saves the sorting indirection and the second memory access (the hash, I think, is used nowhere else).
  • One last idea would be to use none of those and instead use the document id to retrieve the cache entry (although I'm sure this is not possible and I'm missing something very obvious here).
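To make the second idea concrete, here is a minimal sketch of packing a hash and an id into one byte array and decoding them from a (bytes, offset, length) slice, which is the same shape as the public fields of Lucene's BytesRef. The layout and the `PackedHashId` class are hypothetical and only for illustration; this is not luwak's actual storage format, and it uses plain Java so it runs without Lucene on the classpath.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical layout (an assumption, not luwak's format):
// [4-byte hash length][hash bytes][UTF-8 id bytes]
final class PackedHashId {

    // Build the single value that would be stored per document.
    static byte[] pack(byte[] hash, String id) {
        byte[] idBytes = id.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + hash.length + idBytes.length);
        buf.putInt(hash.length).put(hash).put(idBytes);
        return buf.array();
    }

    // Read the hash from a slice of a larger buffer.
    static byte[] hashOf(byte[] bytes, int offset, int length) {
        int hashLength = ByteBuffer.wrap(bytes, offset, length).getInt();
        int start = offset + Integer.BYTES;
        return Arrays.copyOfRange(bytes, start, start + hashLength);
    }

    // Read the id from the same slice; only the final String is materialized.
    static String idOf(byte[] bytes, int offset, int length) {
        int hashLength = ByteBuffer.wrap(bytes, offset, length).getInt();
        int skip = Integer.BYTES + hashLength;
        return new String(bytes, offset + skip, length - skip, StandardCharsets.UTF_8);
    }
}
```

With a layout like this, one read of the binary value yields both the hash and the id, so the ord lookup and the second access into the id table would not be needed.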

If we take some of those approaches, we maximize the use of cached data, which should be good for performance. What do you think?

We can't use the document id to retrieve cache entries, because document ids are not stable across internal index merges. But putting the id into QueryCacheEntry might work - do you want to try it and open a pull request?
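As a rough illustration of the instability, the following sketch simulates a merge in plain Java. It is a deliberately simplified model, not Lucene code: each segment is a list of query ids, a `null` slot stands for a deleted document, and the `MergeRenumbering` class is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of a segment merge (an assumption for illustration only).
final class MergeRenumbering {

    // Concatenate segments, dropping deleted (null) slots.
    // A document's new id is its position in the merged list.
    static List<String> merge(List<List<String>> segments) {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String queryId : segment) {
                if (queryId != null) {
                    merged.add(queryId);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Before the merge: "a" is doc 0, "c" is doc 2, "d" is doc 3.
        List<String> segment0 = Arrays.asList("a", null, "c");
        List<String> segment1 = Arrays.asList("d");
        List<String> merged = merge(Arrays.asList(segment0, segment1));
        // After the merge, doc 2 no longer refers to "c".
        System.out.println("doc 2 is now: " + merged.get(2));
    }
}
```

In this model a cache keyed on document id 2 would silently start returning the entry for a different query once the merge reclaims the deleted slot.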

I'll try it. It might be non-trivial in the case of duplicate queries (with different ids), I think.
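One way to picture the concern, assuming the cache is keyed by a hash of the query so that identical queries registered under different ids share one entry: the entry would then need to carry several ids rather than one. The sketch below is hypothetical throughout (`HashKeyedCache`, `Entry`, and the use of `String.hashCode()` as a stand-in hash are all assumptions) and does not reflect luwak's real classes.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical cache keyed by query hash (an assumption for illustration).
final class HashKeyedCache {

    // One entry per distinct query; it collects every id registered for it.
    static final class Entry {
        final String query;
        final Set<String> ids = new LinkedHashSet<>();

        Entry(String query) {
            this.query = query;
        }
    }

    private final Map<String, Entry> byHash = new HashMap<>();

    // Stand-in hash; a real implementation would use a stronger digest.
    private static String hash(String query) {
        return Integer.toHexString(query.hashCode());
    }

    void register(String id, String query) {
        byHash.computeIfAbsent(hash(query), h -> new Entry(query)).ids.add(id);
    }

    Set<String> idsFor(String query) {
        Entry entry = byHash.get(hash(query));
        return entry == null ? new LinkedHashSet<>() : entry.ids;
    }

    int size() {
        return byHash.size();
    }
}
```

Registering the same query under two ids leaves a single entry holding both ids, so any code that expects exactly one id per entry would need to handle the set.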

Hi @lookfwd, have you had a chance to look at this?

Closing this out for now - if you make any progress @lookfwd, feel free to re-open!