Correlations on Composite
akskuchi opened this issue · 5 comments
Hello,
Thanks for the nice contribution.
I am trying to understand how you calculated the correlations with the Composite caption-level likert judgements.
You mention in the paper that Composite contains 12K judgements, covering F8K (997 imgs), F30K (991 imgs), and MSCOCO (2007 imgs).
In the judgement .csv files at the AMT eval link you provided, there are 3 judgements for each F8K img, 4 for each F30K img, and 4 for each MSCOCO img. This adds up to ~15K judgements.
Is there a reason why you considered only 12K or am I missing something?
Hi there! Thanks for your interest in our work. Great question! I don't know the answer offhand. It's been a while since I've run these correlations, so I am a bit out of date on the specifics.
From what I remember, there are 3995 images * 3 judgments per image = 11985 judgments, which is what we used. This is the standard number of judgments used for Composite; see, for example, https://aclanthology.org/2021.findings-emnlp.395.pdf and https://arxiv.org/pdf/2106.14019.pdf . We have some details about Composite in the paper/appendix; perhaps the answer is in there somewhere. What do you think?
Happy to chat more about it; curious to see if you find anything :-)
Hello, thanks for your response :)
I will look into the work you've linked.
Closing for now, feel free to re-open if I can be helpful.
Hello, I'm confused too. When I download the Composite dataset from https://imagesdg.wordpress.com/image-to-scene-description-graph/, there are 3 human judgment scores for Flickr8k and 4 human judgment scores each for Flickr30k and MSCOCO, just as @akskuchi said.
Could it be that one of the 4 human judgments is made on the reference? As described in the paper, I remember removing human judgments made over the reference captions (which were used to compute the reference-backed metrics).
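For what it's worth, the counts quoted in this thread are at least arithmetically consistent with that explanation. Below is a minimal sketch using only the image and judgment counts mentioned above. The dictionary names are made up for illustration, and the idea that exactly one judgment per Flickr30K and MSCOCO image is on a reference caption is an unconfirmed assumption, not something checked against the actual files.

```python
# Image counts per source dataset, as quoted earlier in this thread.
IMAGES = {"flickr8k": 997, "flickr30k": 991, "mscoco": 2007}

# Judgments per image reported for the downloaded .csv files.
JUDGMENTS_IN_CSV = {"flickr8k": 3, "flickr30k": 4, "mscoco": 4}

# ASSUMPTION (unconfirmed): for Flickr30K and MSCOCO, one of the four
# judgments per image rates a reference caption and would be dropped.
REFERENCE_JUDGMENTS = {"flickr8k": 0, "flickr30k": 1, "mscoco": 1}


def total_judgments(images, per_image):
    """Sum images * judgments-per-image over all source datasets."""
    return sum(images[name] * per_image[name] for name in images)


# Everything present in the files.
raw_total = total_judgments(IMAGES, JUDGMENTS_IN_CSV)

# What remains if the assumed reference-caption judgments are removed.
candidate_only = {
    name: JUDGMENTS_IN_CSV[name] - REFERENCE_JUDGMENTS[name] for name in IMAGES
}
filtered_total = total_judgments(IMAGES, candidate_only)

print(sum(IMAGES.values()))  # 3995 images in total
print(raw_total)             # 14983, i.e. the ~15K in the files
print(filtered_total)        # 11985, i.e. the ~12K in the paper
```

This only shows that the numbers line up; it does not tell you which column, if any, holds the reference-caption judgments. That would still need to be checked against the files themselves.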