UCSC-VLAA/MedTrinity-25M

Issues on reproducing the result on VQA-RAD

Kelvinz-89757 opened this issue · 2 comments

Thanks for the excellent work!

The issue is that when I used the checkpoint LLaVA-Med++ (VQA-RAD) to run inference on the VQA-RAD dataset, I followed the code provided as follows:

cd MedTrinity-25M
bash ./scripts/med/llava3_med_eval_batch_vqa_rad.sh

I get results like this:

Metric                   Performance (%)
---------------------  -----------------
Exact Match Score              36.3222
F1 Score                       33.741
Precision                      36.3222
Recall                         32.7501
BLEU Score                      0.088399
BLEU Score (Weight 1)          30.4938
BLEU Score (Weight 2)           6.31218
BLEU Score (Weight 3)           5.47034
yes/no accuracy                72.0588
Closed F1 Score                72.0588
Closed Precision               71.875
Closed Recall                  72.4265

In the paper you have:
[screenshot of the paper's VQA-RAD results table]
For the results in the paper, could you confirm whether 'Open' and 'Closed' correspond to their respective accuracies?
Also, my results above don't seem to reproduce the reported numbers. Could you advise where I might have gone wrong, and which of the metrics above 'Open' should correspond to?

Looking forward to your reply! Thank you.

Hi Kangyu,

Thank you for your feedback and for testing our model. To clarify, 'Closed' in the paper refers to the yes/no accuracy on closed-set questions, while 'Open' refers to recall on open-set questions, in line with the LLaVA-Med metrics. We will update the README accordingly.
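
For concreteness, here is a minimal sketch of how those two numbers are typically computed in LLaVA-Med-style evaluation. This is an illustration only, not the repository's evaluation script, and the field names ("gt", "pred", "answer_type") are hypothetical:

# Illustration of LLaVA-Med-style VQA scoring (not the repo's actual code).
# Each item is assumed to hold the ground-truth answer, the model prediction,
# and whether the question is closed-set (yes/no) or open-set.
def normalize(text: str) -> list[str]:
    # Lowercase and tokenize an answer string.
    return text.lower().strip().split()

def score(predictions: list[dict]) -> dict:
    closed_correct, closed_total = 0, 0
    open_recalls = []
    for item in predictions:
        gt, pred = normalize(item["gt"]), normalize(item["pred"])
        if item["answer_type"] == "closed":      # yes/no questions -> accuracy
            closed_total += 1
            closed_correct += int(gt == pred)
        else:                                    # open-set questions -> recall
            hits = sum(1 for tok in gt if tok in pred)
            open_recalls.append(hits / max(len(gt), 1))
    return {
        "closed (yes/no accuracy, %)": 100 * closed_correct / max(closed_total, 1),
        "open (recall, %)": 100 * sum(open_recalls) / max(len(open_recalls), 1),
    }

Under this reading, the 'yes/no accuracy' line in the script's output maps to the paper's 'Closed' column.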

I apologize for the discrepancy in results. I'll re-run the inference on the VQA-RAD dataset to investigate and will update you with findings and guidance.

I have solved the problem. It came from the data processing: I had been using data from another source, but after switching to the data the author provided in
#6, the results look fine. Many thanks for the quick feedback!