Aquila-Network/aquila

Faiss Indexer Problem

sopaoglu opened this issue · 12 comments

I indexed more than 10,000 vectors with FAISS. After indexing all of them, I queried with one of the indexed vectors, but the search did not return that vector. If I index the same vectors with Annoy, it works as expected. I am not sure whether this is a bug.

Hi @sopaoglu, here's what happens in the background when AquilaDB hits 10,000+ documents:

  1. AquilaDB starts with Annoy as the indexer when no documents are indexed and switches to FAISS as soon as it receives the 10,001st document.
  2. At that moment, all the documents stored in the document DB up to the transition are fetched to build the FAISS index.
  3. When the FAISS indexer is initialized, it takes a few seconds (depending on the hardware) to train the indexer on these 10,000+ documents (a one-time activity) and then to index the same documents.
  4. Because step 3 takes a few seconds, you should wait briefly before running a kNN query.
  5. I also ran a test on the database for this issue just now and found that kNN retrieves documents with the exact match at the top of the result. So there is no bug; just give it some time.
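
The wait in step 4 can be scripted as a poll instead of a fixed sleep. This is a minimal sketch that assumes nothing about the AquilaDB client itself: `wait_until` is a hypothetical helper (not part of AquilaDB), and in practice the predicate would wrap whatever readiness check you choose, for example whether getNearest already returns the document you just inserted. The stand-in predicate below only simulates an index that becomes ready on the third poll.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Hypothetical helper, not part of the AquilaDB client.
    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "is the index ready?": succeeds on the third poll.
polls = {"count": 0}


def index_ready():
    polls["count"] += 1
    return polls["count"] >= 3


ready = wait_until(index_ready, timeout=5.0, interval=0.01)
print(ready, polls["count"])
```

Polling with a timeout keeps the script from querying too early while still failing visibly if the index never becomes ready.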

Footnote: AquilaDB is intended to be an eventually consistent database that uses the Couch protocol. You will see more effects of this, and will appreciate the feature, once networking and replication are enabled in later versions of AquilaDB.

One more thing: code refactoring is happening in the background. Once it is done and ready to use, there will no longer be an Annoy index in AquilaDB. If you want to follow the progress, you can check out the refactor branch.

If I request the 10 nearest neighbors of a vector, the same vector is in the result set, but I expect it to be the first result. So the vector was fetched when the FAISS index was built.

Yes, the first result should be the same vector. I don't know why you are getting something different. Can you tell me the steps to reproduce the issue? If you are querying with multiple vectors, look at the top of each chunk of k results, where each chunk holds the results for one query vector.
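
To illustrate the chunking, here is a minimal sketch that assumes the results for several query vectors come back as one flat list in query order, with k entries per query. That layout and the helper name `top_hit_per_query` are assumptions for illustration, not the documented return format of getNearest.

```python
def top_hit_per_query(flat_results, k):
    """Return the first (nearest) entry of each consecutive chunk of k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [flat_results[i] for i in range(0, len(flat_results), k)]


# Two query vectors, k = 3: entries 0-2 belong to query 0, entries 3-5 to query 1.
flat = ["q0_a", "q0_b", "q0_c", "q1_a", "q1_b", "q1_c"]
tops = top_hit_per_query(flat, 3)
print(tops)  # ['q0_a', 'q1_a']
```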

First of all, I index vectors of images. We use the Python face_recognition library to create a vector for each image. After the vectors are created, we add vectors which are the size of 1000. The total number of vectors is more than 10,000. Then I take one of the images that was added, create a vector for it, and query with that vector.

Hello! I have a similar situation:

  • a set of 20 text paragraphs, each embedded into a 1x7268-dimensional vector (NumPy array)
  • each vector is stored in the DB
  • one of the paragraphs from the original set (s_query) is embedded again to query for the five closest
  • db.getNearest(s_query, 5) doesn't retrieve paragraph s_query among them

Hi @liya-gafurova , you need to index at least docs.vecount documents before starting k-NN retrieval. This value is configurable by modifying DB_config.yml.

Yes, thank you!
But I already had MIN_DOCS2INDEX=5, so I guess it should have been indexed.

I have the following case within a single Python script:

  1. Start with an empty DB. Generate 1000 random vectors and add them to the DB one by one.
  2. Add vec_a, defined by me, to the DB.
  3. Use vec_a as a search query (convertMatrix(), getNearest()).
    I expect vec_a to come first, but it doesn't.
    Code example below:
import numpy as np

# db is assumed to be an already-connected AquilaDB client instance

dimension = 7268    # dimensions of each vector
n = 1000    # number of vectors
np.random.seed(1)
db_vectors = np.random.random((n, dimension)).astype('float32')
_vec = np.random.random((1, dimension)).astype('float32')[0]
vec_query = np.copy(_vec)

# add the random vectors one at a time
for vec in db_vectors:
    sample = db.convertDocument(vec, {"text": 'vec'})
    db.addDocuments([sample])

# add the hand-picked vector, then immediately query with a copy of it
sample = db.convertDocument(_vec, {"text": 'added_manually_2'})
db.addDocuments([sample])
query_vec = db.convertMatrix(vec_query)
result = db.getNearest(query_vec, 5)

result: returns only 'vec' vectors

After that, in another python script:

  1. Have a DB with 1000 random vectors. Add one user-defined vector, vec_b.
  2. Use vec_b as a search query (convertMatrix(), getNearest()).
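import numpy as np

# db is assumed to be an already-connected AquilaDB client instance

dimension = 7268    # dimensions of each vector
n = 1000    # number of vectors already in the DB (not used below)
np.random.seed(1)
_vec = np.random.random((1, dimension)).astype('float32')[0]
vec_query = np.copy(_vec)

# add the single user-defined vector
sample = db.convertDocument(_vec, {"text": 'added_manually_3'})
db.addDocuments([sample])

# query with a copy of the vector that was just added
query_vec = db.convertMatrix(vec_query)
result = db.getNearest(query_vec, 5)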

result: the 'added_manually_3' vector comes first

Considering this case, could it be like this:

  • When we add a batch of data to an empty DB and query it right away, the DB doesn't have enough time to build the index, so the result is incorrect?
    Otherwise, why does it fail to retrieve the vector the first time and retrieve it correctly the second time?

I also have a question about the metric used to find the nearest neighbors.
As I understand it, the default metric is annoy.angular.

  • Is it the same as scipy.spatial.distance.cosine()?
  • When I compared the 5 closest results from getNearest() and from scipy.spatial.distance.cosine() on the same data, cosine gave different (and sometimes better) results.

@liya-gafurova Yes, AquilaDB is an eventually consistent database. It takes a few seconds to finish indexing the vectors. Eventual consistency is a good fit because you typically would not index data in bulk all at once; rather, you would generate it continuously and index it in a decentralized AquilaDB (Couch) cluster - wiki/AquilaDB---Couch-Replication .

Please check this official discussion regarding the annoy angular metric: spotify/annoy#363

It's a trade-off between data and accuracy. Feel free to modify the AquilaDB source code; that way you can try out different metrics in both Annoy and FAISS based on your requirements. It's easy and straightforward. For Annoy, modify https://github.com/a-mma/AquilaDB/blob/3af4f5532f7470be0e2bf935bce92972de1f591b/src/hannoy/index.py#L62 and for FAISS, modify https://github.com/a-mma/AquilaDB/blob/3af4f5532f7470be0e2bf935bce92972de1f591b/src/hfaiss/index.py#L53
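
On the metric question: Annoy's README describes its angular distance as the Euclidean distance between the normalized vectors, which equals sqrt(2(1 - cos(u, v))), whereas scipy.spatial.distance.cosine returns 1 - cos(u, v). The NumPy sketch below checks that relationship on random data. The helper names are made up for illustration, and neither Annoy nor SciPy is actually called. Because one value is a monotonic function of the other, an exact search ranks neighbors the same way under both, so differing results more likely come from the approximate index (or an index that is not ready yet) than from the metric itself.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cos(u, v), the quantity scipy.spatial.distance.cosine computes."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def angular_distance(u, v):
    """Euclidean distance between the unit-normalized vectors (Annoy's 'angular')."""
    return np.linalg.norm(u / np.linalg.norm(u) - v / np.linalg.norm(v))


rng = np.random.default_rng(0)
data = rng.random((50, 16))
query = data[7].copy()  # a vector that is already in the data set

d_cos = np.array([cosine_distance(query, x) for x in data])
d_ang = np.array([angular_distance(query, x) for x in data])

# angular = sqrt(2 * cosine distance), up to floating-point error
relation_holds = np.allclose(
    d_ang, np.sqrt(2.0 * np.clip(d_cos, 0.0, None)), atol=1e-6
)

# Exact (brute-force) rankings agree, and the query itself comes first.
same_ranking = np.array_equal(np.argsort(d_cos), np.argsort(d_ang))
print(relation_holds, same_ranking, int(np.argmin(d_ang)))
```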

The code has been rewritten. This bug is no longer relevant and is covered by the rewrite. Closed.