COCO challenge result reproduction
ForJadeForest opened this issue · 16 comments
I want to reproduce the COCO challenge results, but I got very different results.
My idea is to calculate the average CLIPScore over the predictions of each of the 12 models, then use the resulting 12-dimensional vector to compute the correlation with the human metrics (M1, M2).
The data I use is MSCOCO val2014.
The leaderboard file is downloaded from https://cocodataset.org/#captions-leaderboard.
The predictions of the 12 models are from a BERTScore issue describing a similar experiment: Tiiiger/bert_score#79 (comment)
Here is my reproduction code:
import json
from pathlib import Path

import clip
import pandas as pd
from scipy import stats

# filename2id and init_image_features are my own helpers;
# get_clip_score is the function from clipscore.py
device = 'cuda'


def cal_metric(model_name, clip_model, images_filename, device, image_features):
    root_dir = Path('test_dataset/coco_captioning_challenge') / model_name
    data_path = list(root_dir.glob('captions_val2014*_results.json'))[0]
    with open(data_path, 'r') as f:
        data = json.load(f)
    id2text = {d['image_id']: d['caption'] for d in data}
    text = [id2text[filename2id(k)] for k in images_filename]
    res = get_clip_score(clip_model, image_features, text, device)
    return res


model2company = {
    'kolarmartin': 'Brno University',
    'karpathy': 'NeuralTalk',
    'rakshithShetty': 'PicSOM',
    'junhua.mao': 'm-RNN',
    'OriolVinyals': 'Google',
    'myamaguchi': 'MIL',
    'Q.Wu': 'ACVT',
    'jeffdonahue': 'Berkeley LRCN',
    'mRNN_share.JMao': 'm-RNN',
    'TsinghuaBigeye': 'Tsinghua Bigeye',
    'ryank': 'MLBL',
    'kelvin_xu': 'Montreal/Toronto'
}
model_list = model2company.keys()

clip_model, transform = clip.load("ViT-B/32", device=device, jit=False)
clip_model.eval()

human_metric = pd.read_csv('./test_dataset/coco_captioning_challenge/leaderboard.csv')
# images_root_dir points at the val2014 images
images_filename, images_features = init_image_features(clip_model, images_root_dir, device)

clip_score_res = []
human_metric_res_m1 = []
human_metric_res_m2 = []
for model_name in model_list:
    mean_score, per_score, _ = cal_metric(model_name, clip_model, images_filename, device, images_features)
    clip_score_res.append(mean_score)
    human_metric_res_m1.append(human_metric[model2company[model_name]]['M1'])
    human_metric_res_m2.append(human_metric[model2company[model_name]]['M2'])

m1_spearmanr, m1_p_value = stats.spearmanr(clip_score_res, human_metric_res_m1)
print(f'CLIPScore for M1 Spearmanr: {m1_spearmanr}, p-value: {m1_p_value}')
m2_spearmanr, m2_p_value = stats.spearmanr(clip_score_res, human_metric_res_m2)
print(f'CLIPScore for M2 Spearmanr: {m2_spearmanr}, p-value: {m2_p_value}')
The results I got:
CLIPScore for M1 Spearmanr: 0.6984224745578604, p-value: 0.011522247104936925
CLIPScore for M2 Spearmanr: 0.6984224745578604, p-value: 0.011522247104936925
Is there anything wrong?
Hi there!
Thanks for your interest in our work. I haven't looked at this code in a while, but can dig back through things if it's helpful.
- Are these results really that different from what was reported? From the paper, we report CLIPScore Spearman ρM1/ρM2 = .59/.63, and RefCLIP-S Spearman ρM1/ρM2 = .69/.74 (all p < .05).
- Appendix A details some concerns we have with this setup --- have you had a chance to take a look at our thoughts there?
Jack
Yes, I posted my results at the end of my original post; they give the same correlation for M1 and M2. I think my test method is wrong.
CLIPScore for M1 Spearmanr: 0.6984224745578604, p-value: 0.011522247104936925
CLIPScore for M2 Spearmanr: 0.6984224745578604, p-value: 0.011522247104936925
I have read Appendix A; it's very helpful for the other experiments. However, it doesn't describe the method for the MSCOCO experiment; it is more about why the evaluation has outlived its usefulness.
Do you take the mean CLIPScore of each of the 12 models and compare those means with the human metrics?
"""
clip_score_res and human_metric_res_m1 are each lists of length 12
"""
stats.spearmanr(clip_score_res, human_metric_res_m1)
It does seem unlikely (though not impossible, given the small number of data points) that you would get exactly the same Spearman for both M1 and M2. It's difficult for me to debug partial code, but at a high level nothing about your method seems incorrect. The high-level procedure, if I recall correctly, is: 1) for each of the 12 submissions, compute the mean evaluation metric over all instances; 2) calculate the Spearman correlation between those per-submission means and M1/M2.
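In pseudocode, that boils down to something like the following (a rough sketch with illustrative container names, not the exact replication script):
import numpy as np
from scipy import stats

# per_system_scores maps submission name -> array of per-caption CLIPScores over
# val2014; leaderboard_m1 / leaderboard_m2 map submission name -> human score.
# These containers are illustrative, not part of the released code.
systems = sorted(per_system_scores.keys())

# 1) mean evaluation metric over all instances, per submission
mean_clip = [float(np.mean(per_system_scores[s])) for s in systems]

# 2) Spearman between the per-submission means and M1 / M2
m1 = [leaderboard_m1[s] for s in systems]
m2 = [leaderboard_m2[s] for s in systems]
print(stats.spearmanr(mean_clip, m1))
print(stats.spearmanr(mean_clip, m2))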
Great! But I still get the same results. :(
Could you share the CLIPScore results for the 12 models, or the experiment script, to help me debug? I can share my full code if that's not too much trouble.
Just to walk through:
This is the raw csv that I used:
"","","M1","M2","M3","M4","M5","date"
"","Human","0.638","0.675","4.836","3.428","0.352","2015-03-23"
"","Google","0.273","0.317","4.107","2.742","0.233","2015-05-29"
"","MSR","0.268","0.322","4.137","2.662","0.234","2015-04-08"
"","Montreal/Toronto","0.262","0.272","3.932","2.832","0.197","2015-05-14"
"","MSR Captivator","0.250","0.301","4.149","2.565","0.233","2015-05-28"
"","Berkeley LRCN","0.246","0.268","3.924","2.786","0.204","2015-04-25"
"","m-RNN","0.223","0.252","3.897","2.595","0.202","2015-05-30"
"","Nearest Neighbor","0.216","0.255","3.801","2.716","0.196","2015-05-15"
"","PicSOM","0.202","0.250","3.965","2.552","0.182","2015-05-26"
"","Brno University","0.194","0.213","3.079","3.482","0.154","2015-05-29"
"","m-RNN (Baidu/ UCLA)","0.190","0.241","3.831","2.548","0.195","2015-05-26"
"","MIL","0.168","0.197","3.349","2.915","0.159","2015-05-29"
"","MLBL","0.167","0.196","3.659","2.420","0.156","2015-04-10"
"","NeuralTalk","0.166","0.192","3.436","2.742","0.147","2015-04-15"
"","ACVT","0.154","0.190","3.516","2.599","0.155","2015-05-26"
"","Tsinghua Bigeye","0.100","0.146","3.510","2.163","0.116","2015-04-23"
"","Random","0.007","0.020","1.084","3.247","0.013","2015-05-29"
I then, for each system, computed the CLIPScore for all generated captions and images, keeping in mind that we defined CLIPScore to be max(cos(v, t), 0). Then I averaged all of the scores for each algorithm, and took the Spearman correlation of those averages.
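If it helps for lining things up, here is one way to read that raw CSV into a frame keyed by system name (a sketch only, not necessarily how I actually loaded it):
import pandas as pd

# the raw leaderboard CSV above has two unnamed leading columns; the second one
# holds the system name (adjust the path/column handling as needed)
lb = pd.read_csv('leaderboard.csv')
lb = lb.rename(columns={lb.columns[1]: 'system'}).set_index('system')
print(lb.loc['Google', ['M1', 'M2']])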
I do find it a bit odd that your m1/m2 are exactly the same. I can dig more (there are many small details for exact number replication: e.g., making sure you're computing clipscore on GPU so that it uses float16, numerical precision issues when loading resized images, etc.), but maybe you can double-check that your m1/m2 spearmans being the same is correct first?
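Two quick things worth checking on your end (a sketch; images_features is whatever array you precomputed):
import numpy as np

# on CUDA, clip.load("ViT-B/32") yields fp16 weights, so features extracted on
# GPU come back as float16; CPU extraction gives float32 and shifts things slightly
print(images_features.dtype)

# clipscore.py also normalizes differently for numpy >= 1.21 (see the warning in
# clipscore.py's get_clip_score), so the numpy version matters for exact replication
print(np.__version__)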
I notice that the eval results of junhua.mao and mRNN_share.JMao are the same. However, their scores on the leaderboard are different: m-RNN is 0.223 for M1 and m-RNN (Baidu/ UCLA) is 0.190 for M1. I want to know whether both eval results should use the m-RNN M1/M2 scores, both should use the m-RNN (Baidu/ UCLA) M1/M2 scores, or one should use m-RNN and the other m-RNN (Baidu/ UCLA).
That is true --- we also noted that as one of the reasons why this evaluation has arguably outlived its utility (from appendix):
There's reason to believe that some teams give dev set predictions with different models vs. test set predictions. For example, the dev set predictions are identical between the two submissions: m-RNN and m-RNN (Baidu/ UCLA), but the test set predictions differ (and achieve significantly different scores).
So which M1/M2 scores (m-RNN or m-RNN (Baidu/ UCLA)) do you use to calculate the Spearman correlation for these identical dev set predictions? Since the predictions are the same, it seems more suitable to pair the same M1/M2 scores with the shared CLIPScore.
By the way, I use the original val2014 data, and my image preprocessing is the same as in your clipscore.py:
import torch
from PIL import Image
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize


class CLIPImageDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data
        # only 224x224 ViT-B/32 supported for now
        self.preprocess = self._transform_test(224)

    def _transform_test(self, n_px):
        return Compose([
            Resize(n_px, interpolation=Image.BICUBIC),
            CenterCrop(n_px),
            lambda image: image.convert("RGB"),
            ToTensor(),
            Normalize((0.48145466, 0.4578275, 0.40821073),
                      (0.26862954, 0.26130258, 0.27577711)),
        ])

    def __getitem__(self, idx):
        c_data = self.data[idx]
        image = Image.open(c_data)
        image = self.preprocess(image)
        return {'image': image}

    def __len__(self):
        return len(self.data)
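For reference, extracting the features with that dataset looks roughly like this (an illustrative sketch, similar in spirit to extract_all_images in clipscore.py):
import numpy as np
import torch

def extract_image_features(image_paths, model, device, batch_size=64):
    # batch the images through CLIP's visual encoder
    loader = torch.utils.data.DataLoader(
        CLIPImageDataset(image_paths), batch_size=batch_size, shuffle=False)
    feats = []
    with torch.no_grad():
        for batch in loader:
            images = batch['image'].to(device)
            if device == 'cuda':
                images = images.to(torch.float16)  # match CLIP's fp16 weights on GPU
            feats.append(model.encode_image(images).cpu().numpy())
    return np.vstack(feats)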
That is true --- we also noted that as one of the reasons why this evaluation has arguably outlived its utility (from appendix):
There’s reason to believe that some teams give dev set predictions with different models vs. test set predictions. For example, the dev set predictions are identical between the two submissions: m-RNN and m-RNN (Baidu/ UCLA), but the test set predictions differ (and achieve significantly different scores).
I used the data as shown, i.e., the same CLIPScore for both despite their different test set scores (even though this probably doesn't make sense), mostly because this was the setup used by prior work.
That preprocessing looks good --- how do you compute the clipscore?
I understand the CLIPScore is the same for both, but the M1/M2 scores can differ, which could lead to different Spearman correlations.
These results use the m-RNN (Baidu/ UCLA) M1/M2 scores for both submissions:
CLIPScore for M1 Spearmanr: 0.5855461150333579, p-value: 0.04546014116529277
CLIPScore for M2 Spearmanr: 0.6702033846767349, p-value: 0.017084995998985886
Ref CLIPScore for M1 Spearmanr: 0.6490390672658904, p-value: 0.02239391465507145
Ref CLIPScore for M2 Spearmanr: 0.7478058818498304, p-value: 0.0051653772841870225
These results use the m-RNN M1/M2 scores for both submissions:
CLIPScore for M1 Spearmanr: 0.6984224745578604, p-value: 0.011522247104936925
CLIPScore for M2 Spearmanr: 0.6984224745578604, p-value: 0.011522247104936925
Ref CLIPScore for M1 Spearmanr: 0.7478058818498304, p-value: 0.0051653772841870225
Ref CLIPScore for M2 Spearmanr: 0.7478058818498304, p-value: 0.0051653772841870225
I think the first setting (using the m-RNN (Baidu/ UCLA) M1/M2 scores) is more reasonable, but the results are still a little different from the paper's.
I use the get_clip_score() function from your clipscore.py:
import warnings

import numpy as np
import sklearn.preprocessing
from packaging import version

# extract_all_images / extract_all_captions are defined elsewhere in clipscore.py

def get_clip_score(model, images, candidates, device, w=2.5):
    '''
    get standard image-text clipscore.
    images can either be:
    - a list of strings specifying filepaths for images
    - a precomputed, ordered matrix of image features
    '''
    if isinstance(images, list):
        # need to extract image features
        images = extract_all_images(images, model, device)

    candidates = extract_all_captions(candidates, model, device)

    # as of numpy 1.21, normalize doesn't work properly for float16
    if version.parse(np.__version__) < version.parse('1.21'):
        images = sklearn.preprocessing.normalize(images, axis=1)
        candidates = sklearn.preprocessing.normalize(candidates, axis=1)
    else:
        warnings.warn(
            'due to a numerical instability, new numpy normalization is slightly different than paper results. '
            'to exactly replicate paper results, please use numpy version less than 1.21, e.g., 1.20.3.')
        images = images / np.sqrt(np.sum(images**2, axis=1, keepdims=True))
        candidates = candidates / np.sqrt(np.sum(candidates**2, axis=1, keepdims=True))

    per = w * np.clip(np.sum(images * candidates, axis=1), 0, None)
    return np.mean(per), per, candidates
What happens if you pretend you didn't notice that "m-RNN" and "m-RNN (Baidu/ UCLA)" were the same dev set predictions? I.e., use the same mean CLIPScore for both and use their differing M1/M2 scores? I believe that is the "standard" setting, which, I agree, does not make sense (as we mention in the paper).
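Concretely, the vectors fed to spearmanr would then look something like this (an illustrative sketch; mean_clip_by_submission and leaderboard are hypothetical containers, and model2company maps junhua.mao to m-RNN and mRNN_share.JMao to m-RNN (Baidu/ UCLA)):
# the two m-RNN submissions share a mean CLIPScore (identical dev-set predictions),
# but each contributes its own leaderboard M1/M2 value
clip_vec, m1_vec, m2_vec = [], [], []
for submission, company in model2company.items():
    clip_vec.append(mean_clip_by_submission[submission])  # tied for the two m-RNN rows
    m1_vec.append(float(leaderboard.loc[company, 'M1']))
    m2_vec.append(float(leaderboard.loc[company, 'M2']))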
Here are the results:
{
    ......
    'junhua.mao': 'm-RNN',
    'mRNN_share.JMao': 'm-RNN (Baidu/ UCLA)',
    ......
}
CLIPScore for M1 Spearmanr: 0.6408609618356389, p-value: 0.02473585790362511
CLIPScore for M2 Spearmanr: 0.6831155307478788, p-value: 0.014338998433725771
Ref CLIPScore for M1 Spearmanr: 0.6972003870519587, p-value: 0.01173063611371156
Ref CLIPScore for M2 Spearmanr: 0.7464973841162387, p-value: 0.00528801957619195
{
    'junhua.mao': 'm-RNN (Baidu/ UCLA)',
    'mRNN_share.JMao': 'm-RNN',
}
CLIPScore for M1 Spearmanr: 0.6408609618356389, p-value: 0.02473585790362511
CLIPScore for M2 Spearmanr: 0.6831155307478788, p-value: 0.014338998433725771
Ref CLIPScore for M1 Spearmanr: 0.6972003870519587, p-value: 0.01173063611371156
Ref CLIPScore for M2 Spearmanr: 0.7464973841162387, p-value: 0.00528801957619195
Thanks! It looks like the results are fairly close for refclipscore at least ... Are you using a GPU?
All of the results are computed on GPU with fp16.
OK --- I see, thanks for the info! If that's the case, I am not sure of the reason for the difference right now, but am confident it can be tracked down. On the one hand, I'm somewhat disinclined to invest time into this particular evaluation because I think it's statistically unstable due to the fact that the correlation is only over 12 datapoints (e.g., the disparity between the reported numbers and the numbers you're getting could be due to a small difference). And, the numbers you're seeing are actually more favorable than the ones in the paper, so, if anything, these results suggest under-reporting. On the other hand, I do want to help out if you (like me in the past) need to run on this setup for comparison-to-prior-work purposes. I will see what I can dig up.
That's great! Thanks for your help!
If you find something useful, can you post it in this issue?
For now, I've decided to use the results where both junhua.mao and mRNN_share.JMao use the m-RNN (Baidu/ UCLA) scores.
Thanks again for your excellent work!