FactScore Inference Fails with KeyError: 'original_splitted_sentences'
hideaki-j opened this issue · 2 comments
Hello, thanks for your amazing work!
I want to ask about a `KeyError: 'original_splitted_sentences'` error that I encountered when trying to generate results for FactScore.
Error
When I run run_long_form_static.py
for FactScore following the command shown in "Run inference using pre-retrieved passages" in README.md, I encounter:
KeyError: 'original_splitted_sentences'
The error originates from the following line:

```python
"cat": item["cat"], "intermediate": intermediate["original_splitted_sentences"][0]})
```
This error seems to be the same as issue #76. However, since that issue was retracted, I am reposting it here.
Culprit?
The error occurs when `do_retrieve == False`, and the culprit seems to be:

```python
if do_retrieve is False:
    ...
    prediction_tree = {}
    return preds[0], prediction_tree
```

at this point, since it always returns `prediction_tree = {}`, resulting in the `KeyError`.
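A sketch of one possible fix at that early-return site: populate the tree with an empty but well-formed entry instead of `{}`. Note that the exact shape expected by the FactScore aggregation is my assumption here:

```python
# Sketch of a fix (assumed shape): return a well-formed prediction_tree so
# that intermediate["original_splitted_sentences"][0] succeeds downstream.
def no_retrieval_return(preds):
    prediction_tree = {"original_splitted_sentences": [[]]}  # was: {}
    return preds[0], prediction_tree

pred, tree = no_retrieval_return(["<generated answer>"])
print(tree["original_splitted_sentences"][0])  # -> [] instead of KeyError
```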
Another issue: always no retrieval
Upon investigating this error, I also found that no retrieval occurs unless `--mode always_retrieve` is used (i.e., `do_retrieve` is always `False`, even with `adaptive_retrieval` or `default`). Therefore, when I run `run_long_form_static.py` with the flags specified in README.md, it always takes the `if do_retrieve is False` path, causing the error above.
Adding the `--mode always_retrieve` flag resolves the error, but I'm not sure whether it was accidentally omitted from the command in the instructions. Also, I am not sure that `do_retrieve` always being `False` is the expected behavior here; it seems not to be.
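For context, here is my understanding of how adaptive retrieval is supposed to decide `do_retrieve`: per the Self-RAG paper, retrieval is triggered when the probability of the special `[Retrieval]` token crosses a threshold. The log-prob dict and the 0.2 default below are illustrative assumptions, not the exact code in `run_long_form_static.py`:

```python
import math

def decide_retrieve(mode: str, token_logprobs: dict, threshold: float = 0.2) -> bool:
    if mode == "always_retrieve":
        return True
    if mode == "no_retrieval":
        return False
    # "adaptive_retrieval" / "default": compare the [Retrieval]-token probability
    ret_logprob = token_logprobs.get("[Retrieval]", float("-inf"))
    return math.exp(ret_logprob) > threshold

# If the special-token log probs are never extracted (empty dict), this
# returns False for every mode except always_retrieve, which matches the
# behavior I am seeing.
print(decide_retrieve("adaptive_retrieval", {}))  # -> False
```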
Questions
Q1. Is the `--mode always_retrieve` flag missing from the command instructions for FactScore, or is the command correct and the cause of the error lies elsewhere?

Q2. With `mode == "adaptive_retrieval"` and `mode == "default"`, it appears to always take the `do_retrieve == False` path. Is this the expected behavior?
Thanks!
> Adding the `--mode always_retrieve` flag resolves the error, but I'm not sure whether it was accidentally omitted from the command in the instructions.

I believe this is the case. 🍻
answer1:
I encountered numerous difficulties during the evaluation of FactScore as well. The Self-RAG repository only provides a script for the always-retrieval mode. I evaluated always retrieval, adaptive retrieval, and no retrieval myself, and found that adaptive retrieval and no retrieval produced the same errors you did. I spent some time resolving this issue; you can refer to my script for evaluating FactScore:
https://github.com/fate-ubw/RAGLAB/blob/main/run/Factscore/2-eval_fact-raglab-selfrag-selfrag_8B-adaptive_retrieval-GPT.sh
answer2:
I encountered the same problem as you: the long-form logic has discrepancies from the Self-RAG paper, and the code around the construction of `prediction_tree = {}` is difficult to understand. I have rewritten the code for the three modes of Self-RAG long-form inference (always retrieval, adaptive retrieval, no retrieval) in a clearer manner; you can refer to it to understand the reasoning process:
https://github.com/fate-ubw/RAGLAB/blob/main/raglab/rag/infer_alg/self_rag_reproduction/selfrag_reproduction.py#L216
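For a quick picture of the control flow, here is a compact, self-contained sketch of the three modes. All names are illustrative stand-ins, not RAGLAB's or self-rag's actual API:

```python
from typing import Callable, List

def run_long_form(
    query: str,
    mode: str,
    generate: Callable[[str, List[str]], str],  # model call: (query, passages) -> text
    retrieve: Callable[[str], List[str]],       # retriever: query -> passages
    wants_retrieval: Callable[[str], bool],     # adaptive decision (e.g. [Retrieval]-token probability)
) -> str:
    if mode == "always_retrieve":
        return generate(query, retrieve(query))
    if mode == "no_retrieval":
        return generate(query, [])
    # "adaptive_retrieval" / "default": decide first, then branch
    if wants_retrieval(query):
        return generate(query, retrieve(query))
    return generate(query, [])
```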