[BUG] Inability to search all documents or a group of documents with GraphRAG

Question

Opened this issue a month ago · 1 comment

Description

Whenever you try to use all documents or a group of documents with GraphRAG, you get either

searching in doc_ids []
INFO:ktem.index.file.pipelines:Skip retrieval because of no selected files: DocumentRetrievalPipeline(

for all documents or something similar to

AssertionError: GraphRAG index not found for file_id: ["d6f18887-0b01-4df0-a30d-997c919d60f1", "08bad9e8-ef3f-4e89-abed-21da7d4f9611"]

for a group of documents. However, there is no problem searching any of these documents one by one. Considering that RAG is mostly used to access a large number of different documents, this makes Kotaemon unusable with GraphRAG unless you stick with File Collections.
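The AssertionError above reports the whole list of selected file IDs as a single missing key, which suggests the lookup treats the selection as one unit. As a minimal sketch of the behavior I would expect instead, each file ID could be resolved on its own. This is not Kotaemon's actual code: `resolve_graph_indexes`, `graph_index_by_file`, and the graph ID values are hypothetical names made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def resolve_graph_indexes(
    file_ids: Iterable[str],
    graph_index_by_file: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Map each selected file ID to its graph index ID, one file at a time.

    Returns (found_graph_ids, missing_file_ids) instead of asserting on the
    whole selection, so one unindexed file does not fail the entire query.
    Both the function and the mapping are hypothetical, not Kotaemon's API.
    """
    found: List[str] = []
    missing: List[str] = []
    for file_id in file_ids:
        graph_id = graph_index_by_file.get(file_id)
        if graph_id is None:
            missing.append(file_id)
        else:
            found.append(graph_id)
    return found, missing


# File IDs taken from the error message; graph IDs are placeholders.
selected = [
    "d6f18887-0b01-4df0-a30d-997c919d60f1",
    "08bad9e8-ef3f-4e89-abed-21da7d4f9611",
]
known = {"d6f18887-0b01-4df0-a30d-997c919d60f1": "graph-a"}
found, missing = resolve_graph_indexes(selected, known)
```

With a per-file lookup like this, the retriever could query the indexes it found and report the missing files, rather than failing on the full selection.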

Reproduction steps

1. Go to Files, upload more than one document
2. Go to Chat, click on 'Graph Collection', choose "Select All"
3. Send any kind of message
4. Observe that the response references no documents and is completely unrelated to them
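In the logs for these steps, "Select All" reaches the retrievers as an empty list (`searching in doc_ids []`, `file_ids=[]`), so retrieval is skipped. A minimal sketch of what I would expect to happen instead is below. It assumes the selector passes a mode alongside the ID list; `expand_selection` and the mode names `"all"` and `"select"` are assumptions for illustration, not confirmed Kotaemon identifiers.

```python
from typing import Iterable, List


def expand_selection(
    mode: str,
    selected_ids: Iterable[str],
    all_file_ids: Iterable[str],
) -> List[str]:
    """Turn a selector state into the concrete list of file IDs to search.

    Hypothetical helper: "all" expands to every indexed file, "select" keeps
    the explicit choice, and any other mode searches nothing.
    """
    if mode == "all":
        return list(all_file_ids)
    if mode == "select":
        return list(selected_ids)
    return []


# With "Select All", the explicit selection is empty but the result is not.
indexed_files = ["file-1", "file-2"]  # placeholder IDs
to_search = expand_selection("all", [], indexed_files)
```

The point of the sketch is only that an empty explicit selection combined with "all" should expand to every indexed file before the retrievers run.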

Screenshots

No response

Logs

use_quick_index_mode False
reader_mode default
Using reader <kotaemon.loaders.pdf_loader.PDFThumbnailReader object at 0x7f1fdbdd19f0>
Page numbers: 4
Got 4 page thumbnails
Adding documents to doc store
indexing step took 0.2428741455078125
Initializing project at 
/app/ktem_app_data/user_data/files/graphrag/da6da42a-8eb2-4c0b-afb2-a8a56fc509d7

/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/datashaper/engine/verbs/convert.py:72: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
  datetime_column = pd.to_datetime(column, errors="ignore")
/usr/local/lib/python3.10/site-packages/datashaper/engine/verbs/convert.py:72: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  datetime_column = pd.to_datetime(column, errors="ignore")
User-id: None, can see public conversations: False
User-id: 1, can see public conversations: True
User-id: 1, can see public conversations: True
Session reasoning type None use mindmap (default) use citation (default) language (default)
Session LLM 
Reasoning class <class 'ktem.reasoning.simple.FullQAPipeline'>
Reasoning state {'app': {'regen': False}, 'pipeline': {}}
Thinking ...
Retrievers [DocumentRetrievalPipeline(DS=<kotaemon.storages.docstores.lancedb.LanceDBDocumentStore object at 0x7f2004dca7d0>, FSPath=PosixPath('/app/ktem_app_data/user_data/files/index_1'), Index=<class 'ktem.index.file.index.IndexTable'>, Source=<class 'ktem.index.file.index.Source'>, VS=<kotaemon.storages.vectorstores.chroma.ChromaVectorStore object at 0x7f2004dcaef0>, get_extra_table=False, llm_scorer=LLMTrulensScoring(concurrent=True, normalize=10, prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x7f1fd8106290>, system_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x7f1fd8106170>, top_k=3, user_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x7f1fd8106080>), mmr=False, rerankers=[TeiFastReranking(endpoint_url='http://proxy:3000/v1/rerank', is_truncated=True, model_name='jina')], retrieval_mode='hybrid', top_k=10, user_id=1), GraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x7f20fe849300>, FSPath=<theflow.base.unset_ object at 0x7f20fe849300>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x7f20fe849300>, VS=<theflow.base.unset_ object at 0x7f20fe849300>, file_ids=[], user_id=<theflow.base.unset_ object at 0x7f20fe849300>), LightRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x7f20fe849300>, FSPath=<theflow.base.unset_ object at 0x7f20fe849300>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x7f20fe849300>, VS=<theflow.base.unset_ object at 0x7f20fe849300>, file_ids=[], user_id=<theflow.base.unset_ object at 0x7f20fe849300>), NanoGraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x7f20fe849300>, FSPath=<theflow.base.unset_ object at 0x7f20fe849300>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x7f20fe849300>, VS=<theflow.base.unset_ object at 0x7f20fe849300>, file_ids=[], user_id=<theflow.base.unset_ object at 0x7f20fe849300>)]
searching in doc_ids []
INFO:ktem.index.file.pipelines:Skip retrieval because of no selected files: DocumentRetrievalPipeline(
  (vector_retrieval): <function Function._prepare_child.<locals>.exec at 0x7f1fbbfe31c0>
  (embedding): <function Function._prepare_child.<locals>.exec at 0x7f1fbbfe32e0>
)
Got 0 retrieved documents
len (original) 0
Got 0 images
Trying LLM streaming
INFO:httpx:HTTP Request: POST http://vllm:8000/v1/chat/completions "HTTP/1.1 200 OK"
Got 0 cited docs
INFO:httpx:HTTP Request: POST http://vllm:8000/v1/chat/completions "HTTP/1.1 200 OK"

Browsers

No response

OS

Linux

Additional information

No response

Answer 1 · 2024-12-03T19:18:13.000Z

Additionally, based on this commit, it seems that groups are intentionally unsupported. Maybe the "Select All" option is intentionally unsupported as well?