facebookresearch/faiss

Reconstruct batch of non-sequential IDs

bfelbo opened this issue · 10 comments

Platform

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Feature Request

The Index class contains methods for reconstructing a single observation and for reconstructing a sequential range of IDs (e.g. IDs 101-200). However, there is no method for reconstructing a batch of non-sequential IDs.

This would be a great addition. Right now we have to write a for-loop in Python, which makes many round trips from Python to C++. Simply adding a reconstruct method that runs the for-loop in C++ would already be a big improvement. Later on, index-specific methods could be implemented to improve performance further if needed.

Should be simple to implement.
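
For context, here is a minimal sketch of the current Python-side workaround described above (the reconstruct_ids helper name is illustrative, not part of the FAISS API):

import numpy as np

def reconstruct_ids(index, ids):
    # One Python -> C++ call per ID; crossing the language boundary
    # for every vector is what makes this slow for large batches.
    out = np.empty((len(ids), index.d), dtype='float32')
    for row, i in enumerate(ids):
        out[row] = index.reconstruct(int(i))
    return out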

I think this would be quite useful.
Here's a benchmark showing that reconstruct_n is much faster than a loop of reconstruct calls: https://colab.research.google.com/drive/1EpJmlrY2i6DngHc4Ok2jhb4oNZEdavcE?usp=sharing

import faiss
import numpy as np
import time

# Single-threaded so the comparison is not affected by OpenMP parallelism.
faiss.omp_set_num_threads(1)
nb_vectors = 100000
dimension = 8
vectors = np.random.rand(nb_vectors, dimension).astype('float32')

flat_index = faiss.IndexFlatIP(dimension)
flat_index.add(vectors)

N = 10000

# Reconstruct N vectors one at a time (one Python -> C++ call per vector).
start_time = time.perf_counter()
for i in range(N):
    flat_index.reconstruct(i)
end_time = time.perf_counter()
elapsed_time = end_time - start_time

print(f"-> flat reconstruct in {elapsed_time * 1000} ms")

# Reconstruct the same N vectors in a single call.
start_time = time.perf_counter()
flat_index.reconstruct_n(0, N)
end_time = time.perf_counter()
elapsed_time = end_time - start_time

print(f"-> flat reconstruct_n in {elapsed_time * 1000} ms")

Result:

-> flat reconstruct in 25.576860000001034 ms
-> flat reconstruct_n in 0.5635439999878145 ms

Reconstructing non-sequential IDs might be a bit slower than reconstruct_n for a flat index, because the memory accessed is not contiguous, but it would still be much faster than a Python loop of reconstruct calls.
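
As a stopgap for flat indexes only, a bulk reconstruct_n followed by a NumPy gather avoids the per-call overhead; this sketch reuses flat_index from the benchmark above and assumes the full set of reconstructed vectors fits in memory (the ids array is illustrative):

# Flat-index-only workaround: one bulk reconstruction, then a NumPy gather.
ids = np.array([5, 90210, 17, 42000], dtype='int64')         # example non-sequential IDs
all_vectors = flat_index.reconstruct_n(0, flat_index.ntotal)
subset = all_vectors[ids]                                     # shape (len(ids), dimension)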

Hello!

Any news on this feature request? Having this method would most likely speed up the reconstruction of n non-contiguous embeddings.

Just found out there is a method search_and_reconstruct which searches and reconstructs the result vectors in one call. This method is much faster than first searching for nearest neighbors and then calling reconstruct N times; a usage sketch follows the comparison below.
To provide a quick comparison, on a simple IVFFlat index, searching and reconstructing the 200k nearest neighbors:

  • Calling search and then calling 200000 times reconstruct takes 45 secs
  • Calling search_and_reconstruct takes 1.5 secs
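
For reference, a minimal usage sketch (reusing flat_index and vectors from the benchmark above; xq and k are illustrative, and the same call works on IVF indexes, where the comparison above was measured):

# search_and_reconstruct returns distances, neighbor IDs, and the
# reconstructed neighbor vectors in a single C++ call.
xq = vectors[:10]                                    # example query vectors
k = 5
D, I, R = flat_index.search_and_reconstruct(xq, k)   # R has shape (len(xq), k, dimension)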

Hey,
Any news regarding this feature? A batch_reconstruct would really help me as well, to speed up the implementation of our ICML paper: https://arxiv.org/pdf/2201.12431

Thanks!

Thanks a lot @mdouze ! Much appreciated.

Should we expect a release soon, or should we build from sources to use this?

We plan to release 1.7.3 in September.

Please, could you add this functionality (batch reconstruct and search_and_reconstruct) for binary indexes too?

Please open a new issue for this, or better, implement it as a PR yourself.

I was browsing through the closed PRs and thought it was closed without merging. It turns out that the functionality was merged, at least for PQ+IVF indices (in a different PR?). See:

index.reconstruct_batch(ids)
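
A minimal usage sketch, assuming a FAISS build that includes reconstruct_batch (1.7.3 or later); the index and ids below are illustrative:

import faiss
import numpy as np

d = 8
xb = np.random.rand(1000, d).astype('float32')
index = faiss.IndexFlatIP(d)
index.add(xb)

# Non-sequential IDs reconstructed in a single call instead of a Python loop.
ids = np.array([3, 999, 42, 7], dtype='int64')
recons = index.reconstruct_batch(ids)    # shape (len(ids), d)
assert np.allclose(recons, xb[ids])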