chen700564/RGB

Rejection Rate Evaluation Issue

jhshen95 opened this issue · 4 comments

I am confused by the calculation of the rejection rate:

tt = 0
for i in results:
    label = i['label']
    # case 1: under full noise, label[0] == -1 marks a rejection
    if noise_rate == 1 and label[0] == -1:
        tt += 1
    # case 2: every answer is labeled correct (all 1s, no 0s)
    elif 0 not in label and 1 in label:
        tt += 1

When noise_rate is set to 1, there are two cases where tt += 1:

  • label[0] == -1, which means the prediction is a rejection
  • label contains only 1s, which means the prediction is entirely correct

So the final "rejection rate" metric includes two parts: one for actual rejections and another for correct predictions (covering all true answers). Do I get this right?

I reproduced the rejection rate experiment with Qwen-7B-chat. It turns out that 70 out of 300 samples are counted toward tt, but only 17 of them have label[0] == -1; the other 53 are correct predictions (labels are all 1s). Does this mean that even when noise_rate is 1 the model can still make correct predictions (from its own knowledge) instead of rejecting? If so, a model with a high rejection rate metric may not be a good rejecter, but simply a powerful model that already knows the answers.
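
For reference, a minimal sketch of how I split the count (assuming each item in results is a dict with a 'label' list, as in evalue.py):

rejections = 0            # case 1: label[0] == -1, an actual rejection
correct_by_knowledge = 0  # case 2: all labels are 1, a correct prediction

for i in results:
    label = i['label']
    if label[0] == -1:
        rejections += 1
    elif 0 not in label and 1 in label:
        correct_by_knowledge += 1

# with noise_rate == 1 this reproduces tt = rejections + correct_by_knowledge
print(f'tt = {rejections + correct_by_knowledge} '
      f'({rejections} rejections + {correct_by_knowledge} correct predictions)')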

Yes, you are right. RGB is based on recent news (roughly 2022.10 - 2023.7), so the latest LLMs (such as Qwen) may already know some answers. You can use reject_evalue.py to get the rejection rate evaluated by ChatGPT (which does not consider whether the answer is correct).
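
In case it helps, here is a rough, hypothetical sketch of what such a ChatGPT-based judge can look like (not the actual reject_evalue.py implementation; the model name and prompt wording here are assumptions):

# Hypothetical sketch of a ChatGPT-based rejection judge; not the actual
# reject_evalue.py code. Model name and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_rejection(answer):
    prompt = ('Does the following response refuse to answer because the '
              'documents contain insufficient information? Reply yes or no.\n\n'
              'Response: ' + answer)
    resp = client.chat.completions.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith('yes')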

The prompt does not provide a doc, and the provided examples seem inappropriate. For example, in the second demonstration, no doc is provided at all, yet the response is yes, and most of the examples are negative, which could have an impact during testing. I don't know if my understanding is correct.
[screenshot of the rejection evaluation prompt]

The same applies to the prompt used for factual error detection:

[screenshot of the factual error detection prompt]

Hello, the prompt is used to determine whether the answer generated by the LLM is a rejection (or identifies factual errors). The Answer or Response in the demonstrations is an answer generated by the LLM. We use ChatGPT to judge the Answer or Response and determine whether it is a rejection or identifies factual errors; this judgment does not require the document.
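
To illustrate the shape of such a judging prompt (a hypothetical example for clarity, not the exact prompt from the repo), the demonstrations pair an Answer with a yes/no verdict and include no document:

# Hypothetical illustration of the judging prompt's shape; the actual wording
# in the repo differs. Note that no document appears in the demonstrations.
JUDGE_TEMPLATE = (
    'Decide whether the answer is a rejection, i.e. the model says it cannot '
    'answer from the given information. Reply yes or no.\n\n'
    'Answer: I can not answer the question because of the insufficient '
    'information in documents.\n'
    'Rejection: yes\n\n'
    'Answer: The company released the product in March 2023.\n'
    'Rejection: no\n\n'
    'Answer: {answer}\n'
    'Rejection:'
)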