[BUG] Inability to search all documents or a group of documents with GraphRAG
jradikk commented
Description
Whenever you try to use all documents or a group of documents with GraphRAG, you get either
searching in doc_ids []
INFO:ktem.index.file.pipelines:Skip retrieval because of no selected files: DocumentRetrievalPipeline(
when all documents are selected, or something similar to
AssertionError: GraphRAG index not found for file_id: ["d6f18887-0b01-4df0-a30d-997c919d60f1", "08bad9e8-ef3f-4e89-abed-21da7d4f9611"]
when a group of documents is selected. However, searching any of these documents one by one works without problems. Since RAG is mostly used to query a large number of different documents, this makes Kotaemon unusable unless you stick with File Collections.
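To illustrate the behaviour I would expect, here is a minimal sketch of how a selection could be turned into a flat list of file ids before retrieval. It is only a guess based on the messages above: the double-quoted list in the AssertionError looks like a JSON-encoded list being treated as a single file_id, and the empty `doc_ids []` looks like "Select All" not being expanded. The function name `resolve_file_ids`, the mode string `"all"`, and the argument names are hypothetical and are not taken from Kotaemon's code.

```python
import json
from typing import Iterable, List, Optional, Union


def resolve_file_ids(
    mode: str,
    selected: Optional[Union[str, Iterable[str]]],
    all_ids: Iterable[str],
) -> List[str]:
    """Hypothetical helper: normalize a UI selection into a flat list of file ids."""
    # Assumption: "all" is the mode sent when "Select All" is chosen.
    if mode == "all":
        return list(all_ids)
    if selected is None:
        return []
    if isinstance(selected, str):
        # A group selection may arrive as one JSON-encoded list string,
        # e.g. '["id-1", "id-2"]'; decode it instead of using it as one id.
        try:
            decoded = json.loads(selected)
        except json.JSONDecodeError:
            return [selected]  # a plain single id
        if isinstance(decoded, list):
            return [str(item) for item in decoded]
        return [selected]
    return [str(item) for item in selected]
```

With something like this in place, each id in the result could be looked up individually, which matches the case that already works (one document at a time).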
Reproduction steps
1. Go to Files, upload more than one document
2. Go to Chat, click on 'Graph Collection', choose "Select All"
3. Send any kind of message
4. Observe that the response references no documents and is completely unrelated to them
Screenshots
No response
Logs
use_quick_index_mode False
reader_mode default
Using reader <kotaemon.loaders.pdf_loader.PDFThumbnailReader object at 0x7f1fdbdd19f0>
Page numbers: 4
Got 4 page thumbnails
Adding documents to doc store
indexing step took 0.2428741455078125
Initializing project at
/app/ktem_app_data/user_data/files/graphrag/da6da42a-8eb2-4c0b-afb2-a8a56fc509d7
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/datashaper/engine/verbs/convert.py:72: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
datetime_column = pd.to_datetime(column, errors="ignore")
/usr/local/lib/python3.10/site-packages/datashaper/engine/verbs/convert.py:72: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
datetime_column = pd.to_datetime(column, errors="ignore")
User-id: None, can see public conversations: False
User-id: 1, can see public conversations: True
User-id: 1, can see public conversations: True
Session reasoning type None use mindmap (default) use citation (default) language (default)
Session LLM
Reasoning class <class 'ktem.reasoning.simple.FullQAPipeline'>
Reasoning state {'app': {'regen': False}, 'pipeline': {}}
Thinking ...
Retrievers [DocumentRetrievalPipeline(DS=<kotaemon.storages.docstores.lancedb.LanceDBDocumentStore object at 0x7f2004dca7d0>, FSPath=PosixPath('/app/ktem_app_data/user_data/files/index_1'), Index=<class 'ktem.index.file.index.IndexTable'>, Source=<class 'ktem.index.file.index.Source'>, VS=<kotaemon.storages.vectorstores.chroma.ChromaVectorStore object at 0x7f2004dcaef0>, get_extra_table=False, llm_scorer=LLMTrulensScoring(concurrent=True, normalize=10, prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x7f1fd8106290>, system_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x7f1fd8106170>, top_k=3, user_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x7f1fd8106080>), mmr=False, rerankers=[TeiFastReranking(endpoint_url='http://proxy:3000/v1/rerank', is_truncated=True, model_name='jina')], retrieval_mode='hybrid', top_k=10, user_id=1), GraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x7f20fe849300>, FSPath=<theflow.base.unset_ object at 0x7f20fe849300>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x7f20fe849300>, VS=<theflow.base.unset_ object at 0x7f20fe849300>, file_ids=[], user_id=<theflow.base.unset_ object at 0x7f20fe849300>), LightRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x7f20fe849300>, FSPath=<theflow.base.unset_ object at 0x7f20fe849300>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x7f20fe849300>, VS=<theflow.base.unset_ object at 0x7f20fe849300>, file_ids=[], user_id=<theflow.base.unset_ object at 0x7f20fe849300>), NanoGraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x7f20fe849300>, FSPath=<theflow.base.unset_ object at 0x7f20fe849300>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x7f20fe849300>, VS=<theflow.base.unset_ object at 0x7f20fe849300>, file_ids=[], user_id=<theflow.base.unset_ object at 0x7f20fe849300>)]
searching in doc_ids []
INFO:ktem.index.file.pipelines:Skip retrieval because of no selected files: DocumentRetrievalPipeline(
(vector_retrieval): <function Function._prepare_child.<locals>.exec at 0x7f1fbbfe31c0>
(embedding): <function Function._prepare_child.<locals>.exec at 0x7f1fbbfe32e0>
)
Got 0 retrieved documents
len (original) 0
Got 0 images
Trying LLM streaming
INFO:httpx:HTTP Request: POST http://vllm:8000/v1/chat/completions "HTTP/1.1 200 OK"
Got 0 cited docs
INFO:httpx:HTTP Request: POST http://vllm:8000/v1/chat/completions "HTTP/1.1 200 OK"
Browsers
No response
OS
Linux
Additional information
No response