We reduce Entailer errors caused by incorrect beliefs by using LLaMA2-7B to score the truthfulness of the premises. We also change the reasoned score of a hypothesis to use the geometric mean of the premises' truthfulness scores. A short code walkthrough is available in `walkthrough.ipynb`.
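As a rough illustration of the scoring change, the sketch below computes a reasoned score from the geometric mean of the premises' truthfulness scores. Multiplying by the faithfulness score of the entailment is an assumption about how the scores are combined, and `geometric_mean` and `reasoned_score` are hypothetical names rather than the repository's API.

```python
import math


def geometric_mean(scores):
    """Geometric mean of scores in (0, 1], computed in log space."""
    if not scores:
        raise ValueError("need at least one score")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def reasoned_score(premise_truth_scores, faithfulness):
    """Hypothetical reasoned score: faithfulness times the geometric mean
    of the premises' truthfulness scores (the combination rule is assumed)."""
    return faithfulness * geometric_mean(premise_truth_scores)


print(reasoned_score([0.9, 0.4], faithfulness=0.8))
```

Unlike a plain product of scores, the geometric mean does not shrink simply because an explanation has more premises.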
- Entailer is defined in `entailer.py`. It uses the T5 model from Entailer to generate one-step explanations and to score the truthfulness and faithfulness of premises.
- LLaMA Entailer is defined in `llama_entailer.py`. It uses the LLaMA2-7B model to score the truthfulness and faithfulness of the premises.
- The following methods are defined in these classes:
  - `truthfulness_score`: scores the truthfulness of the prompt.
  - `faithfulness_score`: scores the faithfulness of the premises to the hypothesis.
  - `one_step`: generates a one-step explanation for the given hypothesis.
  - `generate_entailment_tree`: generates an entailment tree for the hypothesis up to a given maximum depth.
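To show how the four methods might fit together, here is a minimal sketch of a shared interface. Only the method names come from this README; the argument types, return types, the dictionary layout of a tree node, and the `EntailerLike` and `ToyEntailer` classes are assumptions made for illustration and may differ from `entailer.py` and `llama_entailer.py`.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class EntailerLike(ABC):
    """Hypothetical interface mirroring the four method names above."""

    @abstractmethod
    def truthfulness_score(self, statement: str) -> float:
        """Score in [0, 1] for how likely the statement is true."""

    @abstractmethod
    def faithfulness_score(self, premises: List[str], hypothesis: str) -> float:
        """Score in [0, 1] for how well the premises entail the hypothesis."""

    @abstractmethod
    def one_step(self, hypothesis: str) -> List[str]:
        """Premises of a one-step explanation (empty if none)."""

    def generate_entailment_tree(self, hypothesis: str, max_depth: int) -> Dict:
        """Recursively expand the hypothesis, scoring every node on the way."""
        node = {
            "text": hypothesis,
            "truth": self.truthfulness_score(hypothesis),
            "faith": None,  # set only when the node is expanded
            "premises": [],
        }
        premises = self.one_step(hypothesis) if max_depth > 0 else []
        if premises:
            node["faith"] = self.faithfulness_score(premises, hypothesis)
            node["premises"] = [
                self.generate_entailment_tree(p, max_depth - 1) for p in premises
            ]
        return node


class ToyEntailer(EntailerLike):
    """Stand-in with fixed outputs, used only to exercise the interface."""

    def truthfulness_score(self, statement):
        return 0.9

    def faithfulness_score(self, premises, hypothesis):
        return 0.8

    def one_step(self, hypothesis):
        return [hypothesis + " (a)", hypothesis + " (b)"]


tree = ToyEntailer().generate_entailment_tree("H", max_depth=2)
print(len(tree["premises"]), tree["premises"][0]["premises"][0]["text"])
```

With this layout, swapping the T5 scorer for a LLaMA2-7B scorer would only require overriding the two scoring methods.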
- Since we are improving Entailer errors caused by incorrect beliefs, we generate a full-depth score tree for every hypothesis in the dataset so that the premises do not have to be generated again and again. `truth_faith_score.py` generates the score tree for a hypothesis.
- We generate and store the score trees for OBQA, QuaRTz, and TruthfulQA in `results/`.
- The fine-tuning script is available at `truthfulqa_reeval/scripts/finetune_judge.sh`. We modified the code from yizhongw/truthfulqa_reeval.
- We created a dataset from the ARC and WorldTree datasets; it is located at `truthfulqa_reeval/data`.
- We use the fine-tuned model to modify the scores in the score trees generated earlier. `modify_scores.py` modifies the truthfulness and faithfulness scores using a custom model.
- We store the modified score trees in `results/`.
- We evaluate the following approaches using the score trees obtained:
  - Direct: Choosing the answer corresponding to the highest-scoring hypothesis.
  - Entailer: Expanding the entailment tree up to a certain depth, then backpropagating the reasoned score using only the truthfulness scores at the leaf nodes and the faithfulness scores at all levels.
  - Entailer+Direct: A node is expanded only if its reasoned score, which is based on the truthfulness scores of the child premises and the faithfulness score of the entailment, is higher than the score of the node itself.
- We observe that applying a logit transform and taking the geometric mean of the truthfulness scores works well.
- We show a comparison across different parameters in `results/results.ods`.
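The three answer-selection approaches described above can be sketched over a stored score tree as follows. This is one possible reading, not the repository's implementation: the node layout (`truth`, `faith`, `premises`), the function names, and the rule that combines faithfulness with the geometric mean of the child scores are all assumptions.

```python
import math


def reasoned(node, child_scores):
    """Assumed rule: faithfulness times the geometric mean of child scores.
    Child scores must be strictly positive."""
    gm = math.exp(sum(math.log(s) for s in child_scores) / len(child_scores))
    return node["faith"] * gm


def direct(node, depth=0):
    """Direct: the node's own truthfulness score, ignoring the tree."""
    return node["truth"]


def entailer(node, depth):
    """Entailer: truthfulness only at the leaves, faithfulness at every level."""
    if depth == 0 or not node["premises"]:
        return node["truth"]
    return reasoned(node, [entailer(p, depth - 1) for p in node["premises"]])


def entailer_direct(node, depth):
    """Entailer+Direct: expand a node only if its reasoned score beats its own."""
    own = node["truth"]
    if depth == 0 or not node["premises"]:
        return own
    shallow = reasoned(node, [p["truth"] for p in node["premises"]])
    if shallow <= own:
        return own  # the node is not expanded
    return reasoned(node, [entailer_direct(p, depth - 1) for p in node["premises"]])


def choose(trees, scorer, depth):
    """Pick the answer option whose hypothesis gets the highest score."""
    return max(trees, key=lambda option: scorer(trees[option], depth))


def leaf(truth):
    return {"truth": truth, "faith": None, "premises": []}


# Toy score trees for two answer options (made-up numbers).
trees = {
    "A": {"truth": 0.3, "faith": 0.9, "premises": [leaf(0.9), leaf(0.8)]},
    "B": {"truth": 0.5, "faith": 0.9, "premises": [leaf(0.2), leaf(0.3)]},
}

for scorer in (direct, entailer, entailer_direct):
    print(scorer.__name__, choose(trees, scorer, depth=1))
```

In this toy example, Direct prefers option B because its hypothesis has the higher stand-alone score, while the two tree-based approaches prefer option A because its premises are scored as more truthful.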