Issues on reproducing the result on VQA-RAD
Kelvinz-89757 opened this issue · 2 comments
Thanks for the excellent work!
The issue is that when I used the LLaVA-Med++ (VQA-RAD) checkpoint to run inference on the VQA-RAD dataset, I followed the code provided as follows:
cd MedTrinity-25M
bash ./scripts/med/llava3_med_eval_batch_vqa_rad.sh
I got the following results:
| Metric | Performance (%) |
| --------------------- | --------------- |
| Exact Match Score | 36.3222 |
| F1 Score | 33.741 |
| Precision | 36.3222 |
| Recall | 32.7501 |
| BLEU Score | 0.088399 |
| BLEU Score (Weight 1) | 30.4938 |
| BLEU Score (Weight 2) | 6.31218 |
| BLEU Score (Weight 3) | 5.47034 |
| yes/no accuracy | 72.0588 |
| Closed F1 Score | 72.0588 |
| Closed Precision | 71.875 |
| Closed Recall | 72.4265 |
Regarding the results reported in the paper: do 'open' and 'closed' each correspond to an accuracy? Also, my results above don't seem to reproduce the reported numbers. Could you advise where I might have gone wrong, and which of the metrics above 'open' should correspond to?
Looking forward to your reply! Thank you.
Hi Kangyu,
Thank you for your feedback and for testing our model. To clarify, 'Closed' in the paper refers to yes/no accuracy for closed-set questions, while 'Open' refers to recall for open-set questions. This aligns with LLaVA-Med metrics. We will update this in the readme.
I apologize for the discrepancy in results. I'll re-run the inference on the VQA-RAD dataset to investigate and will update you with findings and guidance.
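For concreteness, here is a minimal sketch of what those two numbers measure. It follows the LLaVA-Med convention described above; the helper functions are illustrative, not copied from our evaluation script:

```python
import re

def tokens(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def open_recall(prediction: str, answer: str) -> float:
    """Token-level recall: fraction of ground-truth tokens that
    appear in the model's answer (LLaVA-Med open-set metric)."""
    gt = tokens(answer)
    pred = set(tokens(prediction))
    return sum(t in pred for t in gt) / len(gt) if gt else 0.0

def closed_correct(prediction: str, answer: str) -> bool:
    """Closed-set correctness: the ground-truth 'yes'/'no' label
    appears in the prediction."""
    return answer.strip().lower() in prediction.lower()

# 'Open' in the paper = mean open_recall over open-set questions;
# 'Closed' = mean closed_correct (yes/no accuracy) over closed-set questions.
```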
I have solved the problem. It came from the data processing: I had previously used data from another source, but after switching to the data the author provided in #6, the results look fine. Many thanks for the quick feedback!
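In case it helps anyone else, a quick sanity check on the test file before running the eval script would have caught my mistake. The path and field names below are assumptions; adjust them to match the file from #6:

```python
import json
from collections import Counter

# Hypothetical path and field names; adjust to the test file from #6.
with open("data/vqa_rad/test.json") as f:
    data = json.load(f)

# Verify the open/closed split has the expected counts, since a wrong
# split is exactly the kind of data-processing error that bit me here.
print(Counter(str(item.get("answer_type", "unknown")).lower() for item in data))
```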