Reconstruct batch of non-sequential IDs
bfelbo opened this issue · 10 comments
Platform
Running on:
- CPU
- GPU
Interface:
- C++
- Python
Feature Request
The Index class contains methods for reconstructing a single vector and for reconstructing a sequential range of IDs (e.g. IDs 101-200). However, there's no method for batch-reconstructing a set of non-sequential IDs.
This would be a great addition. Right now we have to write a for-loop in Python, which makes one Python-to-C++ call per ID. Even a reconstruct method that simply runs the same loop in C++ would be a big improvement; index-specific implementations could follow later to improve performance further if needed.
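For illustration, here is a minimal sketch of the current workaround, using a placeholder flat index and an arbitrary list of IDs (the names and sizes are made up for the example):
import numpy as np
import faiss

d = 8
index = faiss.IndexFlatIP(d)
index.add(np.random.rand(1000, d).astype('float32'))

# arbitrary, non-sequential IDs to fetch
ids = [3, 42, 7, 999, 17]

# current workaround: one Python-to-C++ call per ID
batch = np.stack([index.reconstruct(i) for i in ids])
print(batch.shape)  # (5, 8)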
Should be simple to implement.
I think this would be quite useful.
Here's a benchmark showing that reconstruct_n is much faster than reconstruct: https://colab.research.google.com/drive/1EpJmlrY2i6DngHc4Ok2jhb4oNZEdavcE?usp=sharing
import faiss
import numpy as np
import time

faiss.omp_set_num_threads(1)

nb_vectors = 100000
dimension = 8
vectors = np.random.rand(nb_vectors, dimension).astype('float32')

flat_index = faiss.IndexFlatIP(dimension)
flat_index.add(vectors)

N = 10000

# reconstruct the first N vectors one at a time
start_time = time.perf_counter()
for i in range(N):
    flat_index.reconstruct(i)
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"-> flat reconstruct in {elapsed_time * 1000} ms")

# reconstruct the same N vectors in a single call
start_time = time.perf_counter()
flat_index.reconstruct_n(0, N)
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"-> flat reconstruct_n in {elapsed_time * 1000} ms")
Result:
-> flat reconstruct in 25.576860000001034 ms
-> flat reconstruct_n in 0.5635439999878145 ms
Reconstructing non-sequential IDs might be a bit slower than reconstruct_n for a flat index because the accessed memory is not contiguous, but it should still be much faster than a loop of reconstruct calls in Python.
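Until such a method exists, one possible workaround for a flat index (reusing flat_index from the benchmark above, and assuming the default sequential IDs and that materializing all vectors fits in memory) is to reconstruct the full contiguous range once and then select the wanted rows with NumPy fancy indexing:
# reconstruct everything in one contiguous call, then pick out the rows we need
all_vectors = flat_index.reconstruct_n(0, flat_index.ntotal)
ids = np.array([5, 123, 42, 9001])
batch = all_vectors[ids]  # fancy indexing copies only the selected, non-sequential rows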
Hello!
Any news on this feature request? Having this method would most likely speed up the reconstruction of n non-contiguous embeddings.
Just found out there is a search_and_reconstruct method that searches and reconstructs vectors in one call. It is much faster than first searching for nearest neighbors and then calling reconstruct N times.
Just to provide a quick comparison, given a simple IVF flat index, searching for and reconstructing the 200k nearest neighbors:
- Calling search and then calling reconstruct 200,000 times takes 45 seconds
- Calling search_and_reconstruct takes 1.5 seconds
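For reference, a minimal usage sketch of search_and_reconstruct on an IVF flat index; the dataset sizes, nlist, and k below are arbitrary placeholders:
import numpy as np
import faiss

d, nb, nq, k = 8, 10000, 100, 5
xb = np.random.rand(nb, d).astype('float32')
xq = np.random.rand(nq, d).astype('float32')

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 64, faiss.METRIC_INNER_PRODUCT)
index.train(xb)
index.add(xb)

# one call returns distances, neighbor IDs, and the reconstructed neighbor vectors
D, I, R = index.search_and_reconstruct(xq, k)
print(R.shape)  # (nq, k, d)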
Hey,
Any news regarding this feature? A batch_reconstruct method would really help me as well, to speed up the implementation of our ICML paper: https://arxiv.org/pdf/2201.12431
Thanks!
Thanks a lot @mdouze! Much appreciated.
Should we expect a release soon, or should we build from sources to use this?
We plan to release 1.7.3 in September.
Please, could you add this functionality (batch reconstruct and search_and_reconstruct) for binary indexes too?
Please open a new issue for this, or better, implement it as a PR yourself.
I was browsing through the closed PR and thought it had been closed without merging. It turns out the functionality was merged, at least for PQ+IVF indices (in a different PR?). See:
index.reconstruct_batch(ids)
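For completeness, a short usage sketch of reconstruct_batch in builds that include the merged change, shown here on a flat index where reconstruction needs no extra setup (for IVF indexes a direct map may be required, e.g. via make_direct_map()):
import numpy as np
import faiss

d = 8
index = faiss.IndexFlatIP(d)
index.add(np.random.rand(1000, d).astype('float32'))

# non-sequential IDs, reconstructed in a single batched call
ids = np.array([3, 42, 7, 999, 17], dtype='int64')
vectors = index.reconstruct_batch(ids)
print(vectors.shape)  # (5, 8)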