MMStar-Benchmark/MMStar

How are the values of MG and ML calculated?

mary-0830 opened this issue · 3 comments

Hi, author,
For example, for the Yi-34B model, my understanding is Swv = 21.9 and Sv = 36.1, is that right? Then MG = Sv - Swv = 14.2.

How is the value of St obtained?

Could you give an example? I ran the `mm` and `to` scores with VLMEvalKit, but I don't quite understand how to calculate these two metrics.

Thanks!!!

We have updated the evaluation guidelines, which showcase the calculation of MG and ML for LLaVA-Next-34B. By the way, we would be glad if you could submit your own results to our leaderboard! Enjoy!

Hi, author,
Thanks for your reply! I understand the calculation method now, but I still have a question: why is the model changed in the third step? Is its base model the same one?

When the multi-modal large model does not see the image (i.e., `--gen-mode to`), isn't that already an LLM score?

Thanks again for your answer.

We change the LVLM into its corresponding language model in the third step. Although in the `to` mode the LVLM operates solely through its language model (LLM), note that most LVLMs unlock the parameters of their LLMs during multi-modal training. Therefore, we calculate the performance difference on the same benchmark between the original LLM and the LLM after multi-modal training. This is done to reflect, to a certain extent, data leakage during the multi-modal training process.
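To make the two metrics concrete, here is a minimal sketch assuming the definitions discussed above: MG = Sv - Swv and ML = max(0, Swv - St), where Sv is the LVLM's score with images, Swv its score without images, and St the score of the original base LLM. The Sv and Swv values below are the Yi-34B numbers quoted in the question; the St value is hypothetical, purely for illustration.

```python
def multimodal_gain(s_v: float, s_wv: float) -> float:
    """MG: the LVLM's score with images minus its score without images."""
    return s_v - s_wv


def multimodal_leakage(s_wv: float, s_t: float) -> float:
    """ML: how much the image-blind LVLM beats its original base LLM.

    Clamped at zero: a negative difference suggests no leakage from
    multi-modal training data rather than "negative leakage".
    """
    return max(0.0, s_wv - s_t)


# Sv and Swv from the question; St = 20.0 is a made-up placeholder.
s_v, s_wv, s_t = 36.1, 21.9, 20.0
print(f"MG = {multimodal_gain(s_v, s_wv):.1f}")
print(f"ML = {multimodal_leakage(s_wv, s_t):.1f}")
```

This matches the arithmetic in the question (MG = 36.1 - 21.9 = 14.2); to get a real ML value you would substitute the base LLM's actual benchmark score for St.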