UCSC-VLAA/MedTrinity-25M

Issues on reproducing the result on VQA-RAD

Kelvinz-89757 opened this issue · 2 comments

Thanks for the excellent work!

The issue is that when I used the checkpoint LLaVA-Med++ (VQA-RAD) to run inference on the VQA-RAD dataset, I followed the code provided as follows:

cd MedTrinity-25M
bash ./scripts/med/llava3_med_eval_batch_vqa_rad.sh

I get results like this:

Metric                   Performance (%)
---------------------  -----------------
Exact Match Score              36.3222
F1 Score                       33.741
Precision                      36.3222
Recall                         32.7501
BLEU Score                      0.088399
BLEU Score (Weight 1)          30.4938
BLEU Score (Weight 2)           6.31218
BLEU Score (Weight 3)           5.47034
yes/no accuracy                72.0588
Closed F1 Score                72.0588
Closed Precision               71.875
Closed Recall                  72.4265

In the paper you have:
[screenshot of the paper's VQA-RAD results table]
For the results in the paper, could you confirm whether 'Open' and 'Closed' correspond to their respective accuracies?
Also, my results above don't seem to reproduce the reported numbers. Could you advise where I might have gone wrong, and which of the metrics above 'Open' should correspond to?

Looking forward to your reply! Thank you.

Hi Kangyu,

Thank you for your feedback and for testing our model. To clarify, 'Closed' in the paper refers to the yes/no accuracy on closed-set questions, while 'Open' refers to recall on open-set questions, in line with the LLaVA-Med metrics. We will update the README accordingly.
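
For concreteness, here is a minimal sketch of how those two numbers are typically computed in LLaVA-Med-style evaluation. This is an illustration only, not the repository's evaluation script, and the field names ("gt", "pred", "answer_type") are hypothetical:

# Illustration of LLaVA-Med-style VQA scoring (not the repo's actual code).
# Each item is assumed to hold the ground-truth answer, the model prediction,
# and whether the question is closed-set (yes/no) or open-set.
def normalize(text: str) -> list[str]:
    # Lowercase and tokenize an answer string.
    return text.lower().strip().split()

def score(predictions: list[dict]) -> dict:
    closed_correct, closed_total = 0, 0
    open_recalls = []
    for item in predictions:
        gt, pred = normalize(item["gt"]), normalize(item["pred"])
        if item["answer_type"] == "closed":      # yes/no questions -> accuracy
            closed_total += 1
            closed_correct += int(gt == pred)
        else:                                    # open-set questions -> recall
            hits = sum(1 for tok in gt if tok in pred)
            open_recalls.append(hits / max(len(gt), 1))
    return {
        "closed (yes/no accuracy, %)": 100 * closed_correct / max(closed_total, 1),
        "open (recall, %)": 100 * sum(open_recalls) / max(len(open_recalls), 1),
    }

Under this reading, the 'yes/no accuracy' line in the script's output maps to the paper's 'Closed' column.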

I apologize for the discrepancy in results. I'll re-run the inference on the VQA-RAD dataset to investigate and will update you with findings and guidance.

I have solved the problem. It came from the data processing: I had been using data from another source, but after switching to the data the author provided in
#6, the results look fine. Many thanks for the quick feedback!