Unable to reproduce Counterfactual Robustness result with ChatGPT
baichuan-assistant opened this issue · 5 comments
Here's what I did:
Step1:
python evalue.py --dataset zh --noise_rate 0.0 --modelname chatgpt
Step2:
python fact_evalue.py --dataset zh --modelname chatgpt
I got file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_result
with content:
{
"all_rate": 0.9473684210526315,
"noise_rate": 0.0,
"tt": 270,
"nums": 285
}
And file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_chatgptresult.json
with content:
{
"reject_rate": 0.0,
"all_rate": 0.9385245901639344,
"correct_rate": 0,
"tt": 229,
"rejecttt": 0,
"correct_tt": 0,
"nums": 244,
"noise_rate": 0.0
}
I failed to see how this matches the results in the paper:
Any ideas?
Hello, you can use --dataset zh_fact, since the zh_fact.json file is used to evaluate counterfactual robustness. I will update the readme. Thanks for your issue.
Cool. I reran both commands with --dataset zh_fact. I can now get:
{
"reject_rate": 0.011235955056179775,
"all_rate": 0.1797752808988764,
"correct_rate": 1.0,
"tt": 16,
"rejecttt": 1,
"correct_tt": 1,
"nums": 89,
"noise_rate": 0.0
}
So I guess all_rate matches Acc[doc] in the paper (17.98% here vs 17.9%), and reject_rate matches ED (both about 1%). Correct?
Also, where do I get ED* and correct_rate?
Thanks.
Hello, rejecttt: 1 is ED* (ED* = 1/100 = 1%). The ED is obtained from evalue.py; its output 'fact_check_rate' is the ED.
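Putting this mapping together with the JSON above, the rates appear to be plain count/nums ratios: reject_rate = rejecttt / nums and all_rate = tt / nums. This relationship is an assumption inferred from the numbers shown, not taken from the paper or the scripts, but it checks out exactly:

```python
# Verify that each rate in the zh_fact result JSON equals count / nums.
# Assumption: the count/nums relationship is inferred from the values
# shown above, not from the evaluation scripts themselves.
result = {
    "reject_rate": 0.011235955056179775,
    "all_rate": 0.1797752808988764,
    "tt": 16,
    "rejecttt": 1,
    "nums": 89,
}

assert abs(result["rejecttt"] / result["nums"] - result["reject_rate"]) < 1e-12
assert abs(result["tt"] / result["nums"] - result["all_rate"]) < 1e-12
print("reject_rate = rejecttt / nums =", result["rejecttt"] / result["nums"])
print("all_rate    = tt / nums       =", result["tt"] / result["nums"])
```

So on this run ED* = 1/89 ≈ 1.1% (the maintainer's 1/100 presumably reflects the paper's query count), and all_rate is the accuracy over the 89 evaluated queries.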
I see. Running evalue.py with ChatGPT I got:
{
"all_rate": 0.17391304347826086,
"noise_rate": 0.0,
"tt": 16,
"nums": 92,
"fact_check_rate": 0.0,
"correct_rate": 0,
"fact_tt": 0,
"correct_tt": 0
}
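The same count/nums relationship (again an assumption read off the numbers, not the code) holds for this evalue.py output, with fact_check_rate = fact_tt / nums:

```python
# Sanity-check the evalue.py output above: rates look like count / nums.
# Assumption: the count/nums relationship is inferred from the values shown.
result = {
    "all_rate": 0.17391304347826086,
    "tt": 16,
    "nums": 92,
    "fact_check_rate": 0.0,
    "fact_tt": 0,
}

assert abs(result["tt"] / result["nums"] - result["all_rate"]) < 1e-12
assert result["fact_tt"] / result["nums"] == result["fact_check_rate"]
```

With fact_tt = 0, ED = 0% on this run, even though the zh_fact run above produced a nonzero ED*, which is why the output looks unstable.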
Seems like GPT-3.5 produces unstable results?
Yes, you can set a lower temperature, like --temp 0.2, to make the results more stable.