OpenGVLab/LAMM

What are the metrics of `Omnibenchmark`, `ScienceQA`, `MMBench`, `SEED`, and `MME` benchmarks?

zhimin-z opened this issue · 3 comments

We no longer use the LAMM benchmark for evaluating MLLMs. Instead, we have adopted our latest work, ChEF, as the evaluation benchmark. In this framework, scenarios such as Omnibenchmark, ScienceQA, MMBench, SEED, and MME use the PPL inferencer, with Accuracy as the metric. For more details, please refer to our ChEF paper. For the usage of ChEF, please refer to the Tutorial. Of course, if you wish to use the original LAMM evaluation method, we have also fully implemented the LAMM evaluation pipeline within the ChEF framework; please refer to the LAMM Recipes for details. Note, however, that the LAMM evaluation method is no longer recommended.
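
For intuition, here is a minimal sketch of what a PPL (perplexity) inferencer with an Accuracy metric looks like in principle for multiple-choice scenarios: each candidate option is scored by the perplexity the model assigns to it given the question, the lowest-perplexity option is taken as the prediction, and Accuracy is the fraction of correct predictions. This is not ChEF's actual implementation; the model name, prompt format, and helper names below are placeholders, and a real MLLM would also condition on the image. See the ChEF Tutorial for the actual usage.

```python
# Conceptual sketch only (not ChEF code): PPL-based multiple-choice scoring.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder text-only model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def option_ppl(question: str, option: str) -> float:
    """Perplexity of a candidate answer conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the answer tokens
    loss = model(full_ids, labels=labels).loss  # mean NLL over answer tokens
    return torch.exp(loss).item()

def ppl_accuracy(samples):
    """samples: iterable of (question, options, gt_index) tuples."""
    correct = 0
    total = 0
    for question, options, gt_index in samples:
        pred = min(range(len(options)), key=lambda i: option_ppl(question, options[i]))
        correct += int(pred == gt_index)
        total += 1
    return correct / total
```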

Thanks for your replies. Currently, do the evaluation results on the LAMM website's leaderboard come from ChEF? @Coach257

Yes, all the evaluation results on the leaderboard are from ChEF.