Cannot reproduce MM-Vet score
TideDra opened this issue · 4 comments
Hi, I tried to reproduce your results. The MME and MMHal-Bench scores I got are roughly consistent with the results you report in the paper, but the MM-Vet score I got is 48.2, while your reported result is 49.9. Moreover, the MM-Vet score I got for the raw Qwen-VL-Chat baseline is also 48.2, which means the score does not improve after DPO, while your baseline score is 45.7.
I'm using the latest Qwen-VL-Chat checkpoint and your unmodified codebase. I wonder what causes the difference in the MM-Vet scores of both the baseline model and the DPO model. Thanks!
Hi, thanks for your question. Our evaluation results are based on the Qwen-VL-Chat checkpoint, and the results we obtained using the `calculator.ipynb` provided by MM-Vet are attached:
| Model | rec | ocr | know | gen | spat | math | total | std | runs |
|---|---|---|---|---|---|---|---|---|---|
| QwenVL-Chat | 52.3 | 34.6 | 43.1 | 39.7 | 34.7 | 18.8 | 45.7 | 0.0 | [45.7] |
| Silkie | 55.4 | 37.8 | 46.3 | 42.0 | 42.1 | 22.7 | 49.9 | 0.0 | [49.9] |
As for the score difference, which version of the GPT evaluator are you using? We are using "gpt-4-0613" as the evaluator.
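For reference, here is a minimal sketch of how the grader model can be pinned when calling the OpenAI API directly; the actual `calculator.ipynb` and the Hugging Face Space wrap this call, so the exact code may differ:

```python
# Sketch only (not the MM-Vet evaluation script itself): pin the grader snapshot.
# Assumes OPENAI_API_KEY is set in the environment and openai>=1.0 is installed.
from openai import OpenAI

client = OpenAI()

def grade_with_gpt4(grading_prompt: str) -> str:
    # Different GPT-4 snapshots (e.g. "gpt-4-0314" vs "gpt-4-0613") can give
    # slightly different MM-Vet totals, so the evaluator version matters.
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": grading_prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content
```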
I'm using the Hugging Face Space provided by MM-Vet. Here are my results:
| Model | rec | ocr | know | gen | spat | math | total | std | runs |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat_raw_eval_code | 51.9 | 41.7 | 41.2 | 38.2 | 44.4 | 30.0 | 48.5 | 0.0 | [48.5] |
| silkie_merged_raw_eval_code | 54.5 | 36.8 | 46.1 | 44.5 | 38.4 | 18.8 | 48.3 | 0.0 | [48.3] |
My evaluation code is modified from the inference code provided in the official Qwen-VL codebase:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
import json
import os
from torch.utils.data import Dataset
from tqdm import tqdm

torch.manual_seed(1234)


class MMVetDataset(Dataset):
    def __init__(self, data_root) -> None:
        super().__init__()
        self.data_root = data_root
        with open(os.path.join(data_root, "mm-vet.json"), "r") as f:
            data = json.load(f)
        self.data = [(k, v) for k, v in data.items()]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return {'id': self.data[index][0],
                'image': os.path.join(self.data_root, 'images', self.data[index][1]['imagename']),
                'question': self.data[index][1]['question']}


# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("/mnt/gozhang/code/VLFeedback/ckpts/silkie_merged", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("/mnt/gozhang/code/VLFeedback/ckpts/silkie_merged", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("/mnt/gozhang/code/VLFeedback/ckpts/silkie_merged", trust_remote_code=True)

dataset = MMVetDataset("/mnt/gozhang/code/VLFeedback/data_dir/mm-vet")
results = {}
bar = tqdm(total=len(dataset))
for data in iter(dataset):
    # 1st dialogue turn
    image = data['image']
    question = data['question']
    query = tokenizer.from_list_format([
        {'image': image},  # Either a local path or an url
        {'text': question},
    ])
    response, history = model.chat(tokenizer, query=query, history=None)
    results[data['id']] = response
    bar.update(1)

with open('mmvet_results.json', 'w') as f:
    json.dump(results, f, indent=4)
```
Can you share your scores evaluated by this Hugging Face Space? Maybe there is a difference between the Space and the notebook script.
Below is the decoding script I am using:
```python
# some code for preparing the Qwen-VL-Chat Model and Tokenizer

test_set = json.load(open("mm-vet.json"))
ret = {}
for sample_id in tqdm(test_set):
    image_file = os.path.join("images", test_set[sample_id]["imagename"])
    query = f'<img>{image_file}</img>\n{test_set[sample_id]["question"]}'
    response, _ = model.chat(tokenizer, query=query, history=None)
    ret[sample_id] = response

# save results
with open(f"results/{ckpt_name}.json", "w") as f:
    json.dump(ret, f, indent=4)
```
Decoded Results
Results from the Model Space
| Model | rec | ocr | know | gen | spat | math | total | std | runs |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | 52.2 | 34.1 | 43.5 | 39.5 | 34.0 | 18.8 | 45.6 | 0.0 | [45.6] |
| Silkie | 55.7 | 37.0 | 46.8 | 42.4 | 42.0 | 18.8 | 49.5 | 0.0 | [49.5] |
The results are consistent with the results from the notebook.
Thanks! I tried your decoding script and got the same results as yours, so the difference comes from the prompt format: `tokenizer.from_list_format` adds a `Picture 1:<img>{img_path}</img>\n` prefix to the prompt, while your prompt has no `Picture 1:` prefix. This is really weird, because the training data actually uses this prefix, and inference should follow the same setting to get the best performance. Anyway, the score of 49.5 shows that your work is effective.
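For concreteness, here is a minimal sketch of the two prompt formats, assuming the Qwen-VL-Chat `tokenizer` loaded above; the image path and question are made-up placeholders (in the real evaluation they come from `mm-vet.json`):

```python
# Placeholder example; the path and question below are hypothetical.
image_file = "images/example.png"
question = "What is in the image?"

# My evaluation used tokenizer.from_list_format, which prepends a prefix, so the
# query becomes: "Picture 1:<img>images/example.png</img>\nWhat is in the image?"
query_with_prefix = tokenizer.from_list_format([
    {'image': image_file},
    {'text': question},
])

# Your decoding script builds the query directly, without the "Picture 1:" prefix:
query_without_prefix = f'<img>{image_file}</img>\n{question}'
```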