illuin-tech/colpali

Re-indexing issue?

skyxiaobai opened this issue · 2 comments

Problem Description:
I attempted to use the Colpali model for PDF document question answering, aiming to index the PDF file only once when the program starts and then reuse this index for retrieval without re-indexing each time. However, I encountered the following issue:

After indexing the PDF file when the program starts, calling the RAG.search function for retrieval raises RuntimeError: torch.cat(): expected a non-empty list of Tensors.
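For reference, the index-once-then-reuse flow I am aiming for looks roughly like this. This is a sketch only: it assumes byaldi's RAGMultiModalModel API (from_pretrained, index, from_index) and its default .byaldi/ index directory; the load_rag name and the exact paths are my own.

```python
def load_rag(index_name: str, pdf_path: str):
    """Index the PDF once, then reuse the saved index on later runs.

    Sketch only: assumes byaldi's RAGMultiModalModel API
    (from_pretrained / index / from_index) and the default
    .byaldi/<index_name> on-disk index location.
    """
    import os
    from byaldi import RAGMultiModalModel  # imported lazily inside the sketch

    index_dir = os.path.join(".byaldi", index_name)
    if os.path.exists(index_dir):
        # Reload the stored embeddings from disk instead of re-indexing
        # on every program start.
        return RAGMultiModalModel.from_index(index_name)

    rag = RAGMultiModalModel.from_pretrained("vidore/colpali")
    rag.index(input_path=pdf_path, index_name=index_name, overwrite=True)
    return rag
```

The point of loading via from_index rather than re-instantiating the model alone is that it restores the document embeddings, so search() has something to score against.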
Problem Analysis:
Through investigation, I found that the issue might stem from the following aspects:

Empty or corrupted index file: The RAG.index function may not have successfully indexed the PDF file, resulting in an empty or corrupted index file.
Empty query vector: The retriever_evaluator.evaluate_colbert function might fail to generate query vectors, leading to an empty scores_batch variable.
Mismatch between index file and query vectors: The index file and query vectors might be generated using different models or parameters, causing a mismatch that prevents successful retrieval.
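The failure mode itself can be reproduced in isolation: torch.cat() raises exactly this RuntimeError when handed an empty list, which is what happens in evaluate_colbert when scores_batch never receives any scores (i.e. there are no document embeddings to score against).

```python
import torch

# Minimal reproduction of the error in the traceback: torch.cat() refuses
# an empty list of tensors, which is what evaluate_colbert hits when no
# document embeddings are available.
scores_batch = []  # what scores_batch looks like when the index is empty
try:
    torch.cat(scores_batch, dim=1)
except RuntimeError as e:
    print(e)  # torch.cat(): expected a non-empty list of Tensors
```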
Solution Attempts:
I attempted the following solutions, but none resolved the issue:

Check the index file: Ensured that the index file exists and its size is reasonable.
Check the query vector generation process: Examined the retriever_evaluator.evaluate_colbert function to ensure it correctly generates query vectors.
Check the consistency between the index file and query vectors: Ensured that the index file and query vectors use the same model and parameters.
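One additional check along these lines: the traceback shows the scores being computed against self.indexed_embeddings, so a guard that verifies this list is non-empty after loading the index would fail fast with a clearer message. The helper below is hypothetical (the check_index_nonempty name is mine); only the indexed_embeddings attribute comes from the traceback.

```python
def check_index_nonempty(indexed_embeddings):
    """Guard to run after loading an index and before calling search().

    An empty embeddings list is exactly what makes the downstream
    torch.cat() call fail with 'expected a non-empty list of Tensors'.
    """
    if not indexed_embeddings:
        raise ValueError(
            "Loaded index contains no embeddings; re-run indexing "
            "or reload the index from disk before searching."
        )
    return len(indexed_embeddings)

# Usage sketch (attribute name taken from the traceback):
#   check_index_nonempty(RAG.model.indexed_embeddings)
#   RAG.search(query, k=3)
```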

Logs:

Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/gradio/queueing.py", line 536, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/gradio/route_utils.py", line 322, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/gradio/blocks.py", line 1935, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/gradio/blocks.py", line 1518, in call_function
prediction = await fn(*processed_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/gradio/utils.py", line 793, in async_wrapper
response = await f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/sda/LLM/colpali/colpali_+_qwen2_vl.py", line 78, in process_query
relevant_results = RAG.search(query, k=3)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/byaldi/RAGModel.py", line 158, in search
return self.model.search(query, k, return_base64_results)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/byaldi/colpali.py", line 480, in search
scores = self._score(qs)
^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/byaldi/utils.py", line 9, in wrapper
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/byaldi/colpali.py", line 448, in _score
scores = retriever_evaluator.evaluate(qs, self.indexed_embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/colpali_engine/trainer/retrieval_evaluator.py", line 13, in evaluate
scores = self.evaluate_colbert(qs, ps)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/minicpm/lib/python3.11/site-packages/colpali_engine/trainer/retrieval_evaluator.py", line 62, in evaluate_colbert
scores_batch = torch.cat(scores_batch, dim=1).cpu()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: torch.cat(): expected a non-empty list of Tensors

Hello, this seems like a Byaldi issue: RAG.search() is a function in Byaldi. If you want, you can use the inference code in this repo to try out your documents without the wrapper and see whether the bug persists, or you can raise an issue in Byaldi?

Thank you. Yeah, I will raise an issue with the Byaldi team.