chen700564/RGB

confused by the calculation of accuracy

wzp21142 opened this issue · 6 comments

Is the calculation method for the accuracy metric in the paper consistent with the code in the repository? I'm a bit confused by this piece of code:

tt = 0
for i in results:
    label = i["label"]
    if noise_rate == 1 and label[0] == -1:
        tt += 1
    elif 0 not in label and 1 in label:
        tt += 1
print(tt/len(results))
scores = {
    "all_rate": tt / len(results),
    "noise_rate": noise_rate,
    "tt": tt,
    "nums": len(results),
}

Benchmarking on the zh.json dataset with a noise_rate of 0.2 implies that out of 1500 prompts, 300 are missing supplementary knowledge. The accuracy calculated from this code will be much lower than the value indicated in the paper's table. Am I misunderstanding something, or was this piece of code previously modified? Thank you!

Hello, in the code results is a list of prediction results. Each i in results is the prediction for one question and contains a label key; i['label'] is the label of the predicted result. For example, the question "Who won the 2022 Nobel Prize for chemistry?" has three answer parts: ["Carolyn R. Bertozzi", "Morten Meldal", "K. Barry Sharpless"]. If the model predicts all of them, i['label'] will be [1,1,1].
If noise_rate != 1, the code executes:

elif 0 not in label and 1 in label:
    tt += 1

If i['label'] is [1,1,1], tt is incremented by 1.
If any answer part is not predicted, i.e., 0 is in i['label'], tt is not incremented.
Finally, tt/len(results) is the accuracy. For example:
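
Here is a minimal worked example of the same loop with a hypothetical results list (the labels below are made up for demonstration):

results = [
    {"label": [1, 1, 1]},  # all three answer parts predicted -> counted
    {"label": [1, 0, 1]},  # one answer part missed (0 in label) -> not counted
    {"label": [1, 1]},     # both answer parts predicted -> counted
]
noise_rate = 0.2

tt = 0
for i in results:
    label = i["label"]
    if noise_rate == 1 and label[0] == -1:
        tt += 1
    elif 0 not in label and 1 in label:
        tt += 1

print(tt / len(results))  # 2 correct out of 3 -> 0.666...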

A noise_rate of 0.2 means that if there are 5 external documents in the context, 5*0.2=1 of them is a noise document that does not contain the correct answer.
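
In other words, the number of noise documents is just the document count times the noise_rate, roughly like this (a sketch; the actual variable names in processdata may differ):

num_docs = 5                             # external documents per question
noise_rate = 0.2
noise_num = int(num_docs * noise_rate)   # 5 * 0.2 = 1 noise document
pos_num = num_docs - noise_num           # 4 positive documents
print(noise_num, pos_num)                # 1 4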

Yes, I understand the parameters and the calculation process. What actually confused me is that without external information, the noise data will cause a significant drop in the tt/len(results) value. For example, with a noise_rate of 0.2, that leads to 300 negative samples if the model does not remember the answers to the questions, which is roughly in line with my own test results. However, the table in the paper shows that an increased noise_rate only causes a moderate decrease in accuracy. So why is the accuracy value in the paper's table still mostly above 80?

What actually confused me is that without external information, the noise data will cause a significant drop in the tt/len(results) value.

Sorry, I do not understand what "without external information" means, since external information should be included. If there is no external information, noise_rate does not take effect. Also, which model are you testing?

Sorry for the misunderstanding. What I meant was that in the dataset constructed based on the noise_rate (the docs built by the processdata function), the model typically cannot get the correct answer from those noise samples that do not contain the correct answer. The test model is a private proprietary model.

the model typically cannot get the correct answer from those noise samples that do not contain the correct answer

In processdata, if the noise_rate < 1, positive documents and noise documents are all included in the input. For example, if there are 5 external documents to be input and the noise_rate is 0.2, the number of noise documents is 1. The input looks like: {instruction}\n Documents: posdoc1\n posdoc2\n noisedoc1\n posdoc3\n posdoc4\n Question: {question}. The model should answer the question based on the positive documents and ignore the noise documents. You may adjust the instruction in config/instruction.yaml to improve performance.
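
For clarity, here is a minimal sketch of how the documents end up merged into a single prompt string following the format above (the variable names and instruction text are illustrative, not the actual processdata code):

instruction = "Answer the question based on the given documents."  # placeholder; see config/instruction.yaml
pos_docs = ["posdoc1", "posdoc2", "posdoc3", "posdoc4"]
noise_docs = ["noisedoc1"]
question = "Who won the 2022 Nobel Prize for chemistry?"

# Positive and noise documents are concatenated into one context string,
# not sent to the model as separate fragments.
docs = pos_docs[:2] + noise_docs + pos_docs[2:]
prompt = instruction + "\nDocuments: " + "\n".join(docs) + "\nQuestion: " + question
print(prompt)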

I see now. During replication I mistakenly provided the model with 5 separate document fragments instead of merging them into one context, which led to the discrepancy. Thanks a lot for your patient response!