BaranziniLab/KG_RAG

Error in retrieving context for some diseases

Closed this issue · 7 comments

Hi @karthiksoman,

I am trying to run the true_false_generation notebook and came across an error: context cannot be retrieved from SPOKE for some diseases.

for index, row in question_df.iterrows():
    question = row["text"]
    context = retrieve_context(row["text"], vectorstore, embedding_function_for_context_retrieval, node_context_df, CONTEXT_VOLUME, QUESTION_VS_CONTEXT_SIMILARITY_PERCENTILE_THRESHOLD, QUESTION_VS_CONTEXT_MINIMUM_SIMILARITY)
    # print few context lines
    context_lines = context.split("\n")[:3]
    print(context_lines)

E.g., for the question "Neurofibromatosis 2 is not associated with Gene NF2" (disease: Neurofibromatosis 2), it fails with the following error:


IndexError Traceback (most recent call last)
File ~/miniconda3/envs/kg_rag/lib/python3.10/site-packages/tenacity/__init__.py:382, in Retrying.__call__(self, fn, *args, **kwargs)
381 try:
--> 382 result = fn(*args, **kwargs)
383 except BaseException: # noqa: B902

File ~/sulab_projects/KG_RAG/kg_rag/utility.py:125, in get_context_using_spoke_api(node_value)
124 context = merge_2['context'].str.cat(sep=' ')
--> 125 context += node_value + " has a " + node_context[0]["data"]["properties"]["source"] + " identifier of " + node_context[0]["data"]["properties"]["identifier"] + " and Provenance of this association is " + node_context[0]["data"]["properties"]["source"] + "."
126 return context

IndexError: list index out of range

The above exception was the direct cause of the following exception:

RetryError Traceback (most recent call last)
Cell In[132], line 3
1 for index, row in question_df.iterrows():
2 question = row["text"]
----> 3 context = retrieve_context(row["text"], vectorstore, embedding_function_for_context_retrieval, node_context_df, CONTEXT_VOLUME, QUESTION_VS_CONTEXT_SIMILARITY_PERCENTILE_THRESHOLD, QUESTION_VS_CONTEXT_MINIMUM_SIMILARITY)
4 # find context first few lines and last few lines
5 context_lines = context.split("\n")[:3]

Cell In[79], line 15
...
--> 326 raise retry_exc from fut.exception()
328 if self.wait:
329 sleep = self.wait(retry_state)

RetryError: RetryError[<Future at 0x7fa2361b66e0 state=finished raised IndexError>]

@janjoy can you post the link to the notebook that you mentioned? I couldn't locate the notebook named 'true_false_generation' in the notebooks directory of KG-RAG.

@karthiksoman I was trying to run https://github.com/BaranziniLab/KG_RAG/blob/main/kg_rag/rag_based_generation/GPT/run_true_false_generation.py and it raised errors, so I checked which questions in https://github.com/BaranziniLab/KG_RAG/blob/main/data/benchmark_data/true_false_questions.csv fail to retrieve context.
One example that fails is the statement "Neurofibromatosis 2 is not associated with Gene NF2" in the CSV file. It raises an error because no context is retrieved from SPOKE.
I hope this is clear. Please let me know if you have more questions. Thank you =)

@karthiksoman I checked again and found that context cannot be retrieved from SPOKE for two diseases in the list: Neurofibromatosis 2 and Familial Mediterranean Fever.
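
A check like the one described above can be sketched as a loop that records which questions fail instead of stopping at the first exception. This is only a minimal sketch: `find_failing_questions` and `fake_retrieve` are hypothetical names introduced here, and `fake_retrieve` is a stand-in for `retrieve_context` so that the example runs without SPOKE access.

```python
from typing import Callable, Iterable, List, Tuple


def find_failing_questions(
    questions: Iterable[str],
    retrieve: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (question, reason) pairs for questions whose context retrieval fails."""
    failures = []
    for question in questions:
        try:
            context = retrieve(question)
        except Exception as exc:  # in the real run this surfaces as tenacity's RetryError
            failures.append((question, type(exc).__name__))
            continue
        if not context.strip():
            failures.append((question, "empty context"))
    return failures


# Hypothetical stand-in for retrieve_context (assumption, not the KG-RAG function).
def fake_retrieve(question: str) -> str:
    if "Neurofibromatosis 2" in question:
        raise IndexError("list index out of range")
    return "placeholder context line"


demo_questions = [
    "Neurofibromatosis 2 is not associated with Gene NF2",
    "placeholder question that retrieves context",
]
failing = find_failing_questions(demo_questions, fake_retrieve)
print(failing)
```

In the actual run, `retrieve` could be a small lambda that forwards the question to `retrieve_context` with the same arguments as in the loop from the first post, and `questions` could be `question_df["text"]`.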

@janjoy Apologies for the delay! I was on vacation :)

KG-RAG cannot fetch context for these two diseases because SPOKE was updated: the names of these two diseases changed in SPOKE and no longer match the names stored in the vector database. As a result, no context is returned for them.

For example:
When you ask 'Neurofibromatosis 2 is not associated with Gene NF2', KG-RAG extracts 'Neurofibromatosis 2' from the query. However, 'Neurofibromatosis 2' is no longer a node in the SPOKE graph (it was before the update), so no context is returned for it.
This happened because the underlying Disease Ontology database (https://disease-ontology.org/) updated its data, and the change was reflected in SPOKE (SPOKE always synchronizes its data with the underlying parent database, in this case Disease Ontology).
I presume this update affected only a handful of disease nodes. If you encounter more such cases, please let me know so that I can share the file containing the disease names from the current version of SPOKE. You may then need to re-create the vectorDB so that it is in sync with the current version of SPOKE.
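
To make the mismatch concrete, here is a rough sketch of two checks, under stated assumptions. `stale_names` compares the names stored in the vector database against a list of current SPOKE names, and `identifier_sentence` shows how the failing line from the traceback could return an empty string instead of raising `IndexError` when SPOKE returns no node. Both function names are hypothetical and are not part of `kg_rag/utility.py`; the name sets and the identifier value are placeholders, not real SPOKE data.

```python
from typing import Any, Dict, Iterable, List


def stale_names(vector_db_names: Iterable[str], current_spoke_names: Iterable[str]) -> List[str]:
    """Names stored in the vector DB that no longer exist as nodes in SPOKE."""
    return sorted(set(vector_db_names) - set(current_spoke_names))


def identifier_sentence(node_value: str, node_context: List[Dict[str, Any]]) -> str:
    """Build the provenance sentence, or return '' when SPOKE returned no node."""
    if not node_context:  # an empty list is what triggers the IndexError above
        return ""
    props = node_context[0]["data"]["properties"]
    return (
        f'{node_value} has a {props["source"]} identifier of {props["identifier"]}'
        f' and Provenance of this association is {props["source"]}.'
    )


# Placeholder data (assumption): real names would come from the vector DB and from SPOKE.
vector_db_names = {"Neurofibromatosis 2", "Familial Mediterranean Fever", "placeholder disease"}
current_spoke_names = {"placeholder disease", "placeholder renamed disease"}

print(stale_names(vector_db_names, current_spoke_names))
print(repr(identifier_sentence("Neurofibromatosis 2", [])))
```

Any name reported by `stale_names` would need to be refreshed when the vectorDB is re-created; the guard only avoids the crash and does not recover the missing context.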

@janjoy I am closing this issue since the reason behind your question has been addressed. Feel free to re-open it if you have follow-up questions.

Hi @karthiksoman, I would like to request the file that contains the disease names based on the current version of SPOKE. I am having trouble retrieving context for many such diseases, especially while running the MCQ test questions. Thank you!