MMStar-Benchmark/MMStar

How are the values of MG and ML calculated?

mary-0830 opened this issue · 3 comments

Hi, author,
For example, for the Yi-34B model, my understanding is Swv = 21.9 and Sv = 36.1, is that right? Then MG = Sv - Swv = 14.2.

How is the value of St obtained?

Could you give an example? I ran the `mm` and `to` scores with VLMEvalKit, but I don't quite understand how to calculate these two metrics.

Thanks!!!

We have updated the evaluation guidelines, which showcase the calculation of MG and ML for LLaVA-Next-34B. By the way, we would be glad if you could submit your own results to our leaderboard! Enjoy!

Hi, author,
Thanks for your reply! I understand the calculation method now, but I still have a question: why is the model changed in the third step? Is its base model the same one?

When the multi-modal large model does not see the image (i.e., `--gen-mode to`), isn't that already an LLM score?

Thanks again for your answer.

We change the LVLM into its corresponding language model in the third step. Although in the `to` mode the LVLM operates solely through its language model (LLM), note that most LVLMs unlock the parameters of their LLMs during multi-modal training. Therefore, we calculate the performance difference on the same benchmark between the original LLM and the LLM after multi-modal training. This is done to reflect, to a certain extent, data leakage during the multi-modal training process.
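To make the two metrics concrete, here is a minimal sketch assuming the definitions discussed above: MG = Sv - Swv and ML = max(0, Swv - St), where Sv is the LVLM's score with images, Swv its score without images, and St the score of the original base LLM. The Sv and Swv values below are the Yi-34B numbers quoted in the question; the St value is hypothetical, purely for illustration.

```python
def multimodal_gain(s_v: float, s_wv: float) -> float:
    """MG: the LVLM's score with images minus its score without images."""
    return s_v - s_wv


def multimodal_leakage(s_wv: float, s_t: float) -> float:
    """ML: how much the image-blind LVLM beats its original base LLM.

    Clamped at zero: a negative difference suggests no leakage from
    multi-modal training data rather than "negative leakage".
    """
    return max(0.0, s_wv - s_t)


# Sv and Swv from the question; St = 20.0 is a made-up placeholder.
s_v, s_wv, s_t = 36.1, 21.9, 20.0
print(f"MG = {multimodal_gain(s_v, s_wv):.1f}")
print(f"ML = {multimodal_leakage(s_wv, s_t):.1f}")
```

This matches the arithmetic in the question (MG = 36.1 - 21.9 = 14.2); to get a real ML value you would substitute the base LLM's actual benchmark score for St.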