Using Existing Index Results in Empty Index
ncoop57 opened this issue · 4 comments
When trying to reuse an existing index I created, I found I got the following error:
/usr/local/lib/python3.10/dist-packages/colpali_engine/trainer/retrieval_evaluator.py in evaluate_colbert(self, qs, ps, batch_size)
59 ).to("cuda")
60 scores_batch.append(torch.einsum("bnd,csd->bcns", qs_batch, ps_batch).max(dim=3)[0].sum(dim=2))
---> 61 scores_batch = torch.cat(scores_batch, dim=1).cpu()
62 scores.append(scores_batch)
63 scores = torch.cat(scores, dim=0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
If I set overwrite=True when indexing my PDFs, this does not happen. Here is a Colab to reproduce: https://colab.research.google.com/drive/1E7I9pki9SiwPs-TsyYIg9E_DIXsEYvy6?usp=sharing
Thanks for reporting!
This is actually an interesting edge-case, I'm not sure what the best behaviour would be here 🤔
The issue occurs because:
class ZoteroApp:
    def __init__(self, model_name, pdfs_folder):
        download_pdfs(pdfs_folder)
        self.rag_model = RAGMultiModalModel.from_pretrained(model_name)
        self.rag_model.index(input_path=pdfs_folder, index_name="zotero_papers", store_collection_with_index=True, overwrite=False)
This creates a new instance of rag_model and tries to create an index with it. Calling it twice in a row results in the second call starting a new model instance and trying to create an index in the same location. As overwrite is False, doing so doesn't do anything, hence the message:
An index named zotero_papers already exists.
Use overwrite=True to delete the existing index and build a new one.
Exiting indexing without doing anything...
So when you try to query the index with the new instance, there is nothing to search, because it hasn't loaded an index (which is why torch.cat() receives an empty list). The best (and currently only) practice for re-using an index is to initialise RAG with the from_index() method, i.e. in your case modifying ZoteroApp to do this:
import os

class ZoteroApp:
    def __init__(self, model_name, pdfs_folder):
        download_pdfs(pdfs_folder)
        index_name = "zotero_papers"
        index_path = os.path.join(".byaldi", index_name)
        # Load the existing index if it is already on disk, otherwise build it.
        if os.path.exists(index_path):
            self.rag_model = RAGMultiModalModel.from_index(index_path)
        else:
            self.rag_model = RAGMultiModalModel.from_pretrained(model_name)
            self.rag_model.index(input_path=pdfs_folder, index_name=index_name, store_collection_with_index=True, overwrite=False)

    def query(self, user_query, k=3):
        results = self.rag_model.search(user_query, k=k)
        return results
should fix the issue (can't run now, I have to head out soon and I'm maxed out on open colab environments), since it'll load the index if it's present (and it is when the second initialisation is called).
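For anyone who wants to see the mechanism without a GPU or the real library, here is a small dependency-free sketch. ToyModel, load_or_build and the JSON file layout are all hypothetical stand-ins invented for illustration; they are not byaldi's classes or its on-disk format. It only mimics the behaviour described above: index() with overwrite=False is a no-op when the index exists, so a fresh instance stays empty unless it is created through from_index().

```python
import json
import tempfile
from pathlib import Path


class ToyModel:
    """Hypothetical stand-in for a retrieval model; not byaldi's real class."""

    def __init__(self):
        self.docs = []  # in-memory index, empty until built or loaded

    def index(self, docs, index_path, overwrite=False):
        index_path = Path(index_path)
        if index_path.exists() and not overwrite:
            # Mirrors the reported behaviour: nothing is built and nothing is loaded.
            print("Index already exists, exiting without doing anything...")
            return
        index_path.write_text(json.dumps(docs))
        self.docs = list(docs)

    @classmethod
    def from_index(cls, index_path):
        # Create an instance and populate it from the stored index.
        model = cls()
        model.docs = json.loads(Path(index_path).read_text())
        return model


def load_or_build(docs, index_path):
    """Load the index if it is on disk, otherwise build it."""
    if Path(index_path).exists():
        return ToyModel.from_index(index_path)
    model = ToyModel()
    model.index(docs, index_path)
    return model


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_index.json"
    first = load_or_build(["paper_a", "paper_b"], path)

    # Second "session", naive approach: fresh instance plus index(overwrite=False).
    naive = ToyModel()
    naive.index(["paper_a", "paper_b"], path, overwrite=False)

    # Second "session", load-or-build approach.
    second = load_or_build(["paper_a", "paper_b"], path)

naive_count = len(naive.docs)    # the naive instance never loaded anything
second_count = len(second.docs)  # the loaded instance sees the stored documents
```

The naive instance ends up with an empty in-memory index even though the file exists on disk, which is the toy equivalent of the empty list that reaches torch.cat() in the traceback.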
I think just a simple error msg should suffice. I'll open a PR <3
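Something along these lines is what a guard could look like. This is only a rough sketch under assumptions: ToySearcher, EmptyIndexError and the message wording are made up for illustration and are not the library's actual code or the contents of any PR.

```python
class EmptyIndexError(RuntimeError):
    """Hypothetical error type for searching before an index is loaded."""


class ToySearcher:
    """Hypothetical stand-in for a model holding indexed embeddings."""

    def __init__(self, embeddings=None):
        self.embeddings = list(embeddings or [])

    def search(self, query, k=3):
        # Fail early with an actionable message instead of letting an
        # empty list reach the scoring code.
        if not self.embeddings:
            raise EmptyIndexError(
                "No index is loaded. Build one with index() or load an "
                "existing one with from_index() before searching."
            )
        # Placeholder ranking: return the first k entries.
        return self.embeddings[:k]


try:
    ToySearcher().search("what is late interaction?")
    raised = False
except EmptyIndexError as err:
    raised = True
    message = str(err)
```

Checking at the start of search() keeps the failure close to its cause, so the user sees a hint about from_index() rather than a torch.cat() error from deep inside the evaluator.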
Facing the same issue: I can't load an already computed index and use it, so I have to create the index over and over again.