Unable to reproduce Counterfactual Robustness result with ChatGPT
baichuan-assistant opened this issue · 5 comments
Here's what I did:
Step1:
python evalue.py --dataset zh --noise_rate 0.0 --modelname chatgpt
Step2:
python fact_evalue.py --dataset zh --modelname chatgpt
I got file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_result
with content:
{
"all_rate": 0.9473684210526315,
"noise_rate": 0.0,
"tt": 270,
"nums": 285
}
And file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_chatgptresult.json
with content:
{
"reject_rate": 0.0,
"all_rate": 0.9385245901639344,
"correct_rate": 0,
"tt": 229,
"rejecttt": 0,
"correct_tt": 0,
"nums": 244,
"noise_rate": 0.0
}
I failed to see how this matches the results in the paper:
Any ideas?
Hello, you can use --dataset zh_fact, since the zh_fact.json file is used to evaluate counterfactual robustness. I will update the readme. Thanks for your issue.
Cool. I reran both commands with --dataset zh_fact. I can now get:
{
"reject_rate": 0.011235955056179775,
"all_rate": 0.1797752808988764,
"correct_rate": 1.0,
"tt": 16,
"rejecttt": 1,
"correct_tt": 1,
"nums": 89,
"noise_rate": 0.0
}
So I guess all_rate matches Acc[doc] in the paper (17.98% here vs 17.9%), and reject_rate matches ED (both about 1%). Correct?
Also, where do I get ED* and correct_rate?
Thanks.
Hello, rejecttt: 1 is ED* (ED* = 1/100 = 1%). The ED is obtained from evalue.py; its output 'fact_check_rate' is the ED.
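Putting this mapping together with the JSON above, the rates appear to be plain count/nums ratios: reject_rate = rejecttt / nums and all_rate = tt / nums. This relationship is an assumption inferred from the numbers shown, not taken from the paper or the scripts, but it checks out exactly:

```python
# Verify that each rate in the zh_fact result JSON equals count / nums.
# Assumption: the count/nums relationship is inferred from the values
# shown above, not from the evaluation scripts themselves.
result = {
    "reject_rate": 0.011235955056179775,
    "all_rate": 0.1797752808988764,
    "tt": 16,
    "rejecttt": 1,
    "nums": 89,
}

assert abs(result["rejecttt"] / result["nums"] - result["reject_rate"]) < 1e-12
assert abs(result["tt"] / result["nums"] - result["all_rate"]) < 1e-12
print("reject_rate = rejecttt / nums =", result["rejecttt"] / result["nums"])
print("all_rate    = tt / nums       =", result["tt"] / result["nums"])
```

So on this run ED* = 1/89 ≈ 1.1% (the maintainer's 1/100 presumably reflects the paper's query count), and all_rate is the accuracy over the 89 evaluated queries.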
I see. Running evalue.py with ChatGPT I got:
{
"all_rate": 0.17391304347826086,
"noise_rate": 0.0,
"tt": 16,
"nums": 92,
"fact_check_rate": 0.0,
"correct_rate": 0,
"fact_tt": 0,
"correct_tt": 0
}
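The same count/nums relationship (again an assumption read off the numbers, not the code) holds for this evalue.py output, with fact_check_rate = fact_tt / nums:

```python
# Sanity-check the evalue.py output above: rates look like count / nums.
# Assumption: the count/nums relationship is inferred from the values shown.
result = {
    "all_rate": 0.17391304347826086,
    "tt": 16,
    "nums": 92,
    "fact_check_rate": 0.0,
    "fact_tt": 0,
}

assert abs(result["tt"] / result["nums"] - result["all_rate"]) < 1e-12
assert result["fact_tt"] / result["nums"] == result["fact_check_rate"]
```

With fact_tt = 0, ED = 0% on this run, even though the zh_fact run above produced a nonzero ED*, which is why the output looks unstable.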
Seems like GPT-3.5 produces unstable results?
Yes, you can set a lower temperature, like --temp 0.2, to make the results more stable.