facebookresearch/faiss

Reconstruct batch of non-sequential IDs

bfelbo opened this issue · 10 comments

Platform

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Feature Request

The Index class contains methods for reconstructing a single observation and for reconstructing a sequential range of IDs (e.g. IDs 101-200). However, there is no method for reconstructing a batch of non-sequential IDs.

This would be a great addition. Right now we have to write a for-loop in Python, which makes many round trips from Python to C++. Simply adding a reconstruct method that runs the for-loop in C++ would already be a big improvement. Later on, index-specific methods could be implemented to improve performance further if needed.

Should be simple to implement.
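
For context, here is a minimal sketch of the current Python-side workaround described above (the reconstruct_ids helper name is illustrative, not part of the FAISS API):

import numpy as np

def reconstruct_ids(index, ids):
    # One Python -> C++ call per ID; crossing the language boundary
    # for every vector is what makes this slow for large batches.
    out = np.empty((len(ids), index.d), dtype='float32')
    for row, i in enumerate(ids):
        out[row] = index.reconstruct(int(i))
    return out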

I think this would be quite useful.
Here's a benchmark showing that reconstruct_n is much faster than a loop of reconstruct calls: https://colab.research.google.com/drive/1EpJmlrY2i6DngHc4Ok2jhb4oNZEdavcE?usp=sharing

import faiss
import numpy as np
import time

# Single-threaded so the comparison is not affected by OpenMP parallelism.
faiss.omp_set_num_threads(1)
nb_vectors = 100000
dimension = 8
vectors = np.random.rand(nb_vectors, dimension).astype('float32')

flat_index = faiss.IndexFlatIP(dimension)
flat_index.add(vectors)

N = 10000

# Reconstruct N vectors one at a time (one Python -> C++ call per vector).
start_time = time.perf_counter()
for i in range(N):
    flat_index.reconstruct(i)
end_time = time.perf_counter()
elapsed_time = end_time - start_time

print(f"-> flat reconstruct in {elapsed_time * 1000} ms")

# Reconstruct the same N vectors in a single call.
start_time = time.perf_counter()
flat_index.reconstruct_n(0, N)
end_time = time.perf_counter()
elapsed_time = end_time - start_time

print(f"-> flat reconstruct_n in {elapsed_time * 1000} ms")

Result:

-> flat reconstruct in 25.576860000001034 ms
-> flat reconstruct_n in 0.5635439999878145 ms

Reconstructing non-sequential IDs might be a bit slower than reconstruct_n for a flat index, because the memory accessed is not contiguous, but it would still be much faster than a Python loop of reconstruct calls.
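
As a stopgap for flat indexes only, a bulk reconstruct_n followed by a NumPy gather avoids the per-call overhead; this sketch reuses flat_index from the benchmark above and assumes the full set of reconstructed vectors fits in memory (the ids array is illustrative):

# Flat-index-only workaround: one bulk reconstruction, then a NumPy gather.
ids = np.array([5, 90210, 17, 42000], dtype='int64')         # example non-sequential IDs
all_vectors = flat_index.reconstruct_n(0, flat_index.ntotal)
subset = all_vectors[ids]                                     # shape (len(ids), dimension)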

Hello!

Any news on this feature request? Having this method would most likely speed up the reconstruction of n non-contiguous embeddings.

Just found out there is a method search_and_reconstruct which searches and reconstructs the result vectors in one call. This method is much faster than first searching for nearest neighbors and then calling reconstruct N times; a usage sketch follows the comparison below.
To provide a quick comparison, on a simple IVFFlat index, searching and reconstructing the 200k nearest neighbors:

  • Calling search and then calling 200000 times reconstruct takes 45 secs
  • Calling search_and_reconstruct takes 1.5 secs
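
For reference, a minimal usage sketch (reusing flat_index and vectors from the benchmark above; xq and k are illustrative, and the same call works on IVF indexes, where the comparison above was measured):

# search_and_reconstruct returns distances, neighbor IDs, and the
# reconstructed neighbor vectors in a single C++ call.
xq = vectors[:10]                                    # example query vectors
k = 5
D, I, R = flat_index.search_and_reconstruct(xq, k)   # R has shape (len(xq), k, dimension)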

Hey,
Any news regarding this feature? A batch_reconstruct would really help me as well, to speed up the implementation of our ICML paper: https://arxiv.org/pdf/2201.12431

Thanks!

Thanks a lot @mdouze ! Much appreciated.

Should we expect a release soon, or should we build from sources to use this?

We plan to release 1.7.3 in September.

Please, could you add this functionality (batch reconstruct and search_and_reconstruct) for binary indexes too?

Please open a new issue for this, or better, implement it as a PR yourself.

I was browsing through the closed PRs and thought it was closed without merging. It turns out that the functionality was merged, at least for PQ+IVF indices (in a different PR?). See:

index.reconstruct_batch(ids)
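
A minimal usage sketch, assuming a FAISS build that includes reconstruct_batch (1.7.3 or later); the index and ids below are illustrative:

import faiss
import numpy as np

d = 8
xb = np.random.rand(1000, d).astype('float32')
index = faiss.IndexFlatIP(d)
index.add(xb)

# Non-sequential IDs reconstructed in a single call instead of a Python loop.
ids = np.array([3, 999, 42, 7], dtype='int64')
recons = index.reconstruct_batch(ids)    # shape (len(ids), d)
assert np.allclose(recons, xb[ids])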