AnswerDotAI/byaldi

Using Existing Index Results in Empty Index

ncoop57 opened this issue · 4 comments

When trying to reuse an existing index I created, I found I got the following error:

/usr/local/lib/python3.10/dist-packages/colpali_engine/trainer/retrieval_evaluator.py in evaluate_colbert(self, qs, ps, batch_size)
     59                 ).to("cuda")
     60                 scores_batch.append(torch.einsum("bnd,csd->bcns", qs_batch, ps_batch).max(dim=3)[0].sum(dim=2))
---> 61             scores_batch = torch.cat(scores_batch, dim=1).cpu()
     62             scores.append(scores_batch)
     63         scores = torch.cat(scores, dim=0)

RuntimeError: torch.cat(): expected a non-empty list of Tensors

If I set overwrite=True when indexing my PDFs, this does not happen. Here is a Colab notebook to reproduce: https://colab.research.google.com/drive/1E7I9pki9SiwPs-TsyYIg9E_DIXsEYvy6?usp=sharing

Thanks for reporting!

This is actually an interesting edge-case, I'm not sure what the best behaviour would be here 🤔

The issue occurs because:

class ZoteroApp:
    def __init__(self, model_name, pdfs_folder):
        download_pdfs(pdfs_folder)
        self.rag_model = RAGMultiModalModel.from_pretrained(model_name)
        self.rag_model.index(input_path=pdfs_folder, index_name="zotero_papers", store_collection_with_index=True, overwrite=False)

This creates a new instance of rag_model and tries to create an index with it. Calling it twice in a row means the second call starts a new model instance and tries to create an index in the same location. Because overwrite is False, that second index() call doesn't do anything (hence the message:

An index named zotero_papers already exists.
Use overwrite=True to delete the existing index and build a new one.
Exiting indexing without doing anything...

)
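To make the failure mode concrete, here is a minimal toy sketch. ToyModel, its _disk dict and its string "embeddings" are all hypothetical stand-ins made up for illustration; this is not byaldi's actual implementation, just the same shape of behaviour: index() skips silently when the name already exists, so a fresh instance ends up with nothing loaded.

```python
class ToyModel:
    """Hypothetical stand-in for a retrieval model (not byaldi's real code)."""

    # Class-level dict plays the role of the on-disk index directory,
    # so it survives across instances the way files on disk would.
    _disk = {}

    def __init__(self):
        self.embeddings = []  # nothing is loaded into memory yet

    def index(self, docs, index_name, overwrite=False):
        if index_name in self._disk and not overwrite:
            print(f"An index named {index_name} already exists.")
            return None  # silent skip: self.embeddings stays empty
        self.embeddings = [doc.lower() for doc in docs]
        self._disk[index_name] = list(self.embeddings)

    def search(self, query):
        if not self.embeddings:
            # Analogous to torch.cat() being handed an empty list.
            raise RuntimeError("expected a non-empty list of embeddings")
        return [e for e in self.embeddings if query.lower() in e]


first = ToyModel()
first.index(["Paper A", "Paper B"], "zotero_papers")
first_hits = first.search("paper a")  # works: this instance built the index

second = ToyModel()
second.index(["Paper A", "Paper B"], "zotero_papers")  # skipped, overwrite=False

try:
    second.search("paper a")
    second_error = None
except RuntimeError as exc:
    second_error = str(exc)  # the second instance has nothing to search
```

The first instance can search because it built the index itself; the second one never gets any embeddings, which mirrors the empty-list error in the traceback.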

So when you try to query the index with the new instance, the search has nothing to work with, because that instance hasn't loaded an index. The best (and currently only) practice for re-using an index is to initialise RAG with the from_index() method, i.e. in your case modifying ZoteroApp to do this:

import os

from byaldi import RAGMultiModalModel


class ZoteroApp:
    def __init__(self, model_name, pdfs_folder):
        download_pdfs(pdfs_folder)  # helper from the original notebook
        index_name = "zotero_papers"
        index_path = os.path.join(".byaldi", index_name)
        if os.path.exists(index_path):
            # An index is already on disk: load it instead of re-indexing.
            self.rag_model = RAGMultiModalModel.from_index(index_path)
        else:
            # First run: load the base model and build the index.
            self.rag_model = RAGMultiModalModel.from_pretrained(model_name)
            self.rag_model.index(input_path=pdfs_folder, index_name=index_name, store_collection_with_index=True, overwrite=False)

    def query(self, user_query, k=3):
        results = self.rag_model.search(user_query, k=k)
        return results

should fix the issue (can't run now, I have to head out soon and I'm maxed out on open colab environments), since it'll load the index if it's present (and it is when the second initialisation is called).

I think just a simple error msg should suffice. I'll open a PR <3

Facing the same issue: I can't load an already computed index and use it, so I have to create the index over and over again.

This is addressed in #12 and the upcoming associated release, 0.0.3. This will now raise a ValueError rather than just returning None with a cutesy print().
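For anyone skimming, the change in behaviour is roughly the difference sketched below. This is a toy illustration only: ToyIndexer and the wording of the error message are hypothetical and not taken from the actual PR.

```python
class ToyIndexer:
    """Hypothetical stand-in showing 'raise instead of silently returning None'."""

    def __init__(self, existing_indexes):
        self.existing = set(existing_indexes)

    def index(self, index_name, overwrite=False):
        if index_name in self.existing and not overwrite:
            # Fail loudly so the caller cannot carry on with an empty index.
            raise ValueError(
                f"An index named {index_name} already exists. "
                "Use overwrite=True to rebuild it, or load the existing index."
            )
        self.existing.add(index_name)
        return index_name


indexer = ToyIndexer(existing_indexes=["zotero_papers"])

try:
    indexer.index("zotero_papers")
    raised = False
except ValueError:
    raised = True  # the duplicate name is now an explicit error

rebuilt = indexer.index("zotero_papers", overwrite=True)
```

With an exception, the mistake surfaces at the index() call rather than later, at search time, as a confusing torch.cat() error.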