MMMU-Benchmark/MMMU

Question about "Text as Input"

fxmeng opened this issue · 2 comments

fxmeng commented

Thank you for your valuable MMMU benchmark.
I have a question regarding your paper. You mention that each data point contains at least one image. How, then, were the results obtained for the text-only models (Llama2-7B, FLAN-T5-XXL, Vicuna-13B, and GPT-4) when no OCR text or LLaVA caption was provided?

Thank you for reaching out and for your interest in our MMMU benchmark. I understand your question regarding how the results for models like Llama2 7B, FLAN-T5-XXL, Vicuna-13B, and GPT-4 Text were obtained, especially since they do not incorporate OCR or LLaVA Caption capabilities.

In our benchmark, when we refer to 'text-only' evaluations, we mean that these evaluations do not consider image inputs. This is similar to other benchmarks that focus exclusively on text. Essentially, we ignore the image information in the questions. For example, a prompt in our benchmark might be:

    Question: What architectural style is the building in <image 1>?

    Options:

    (A) Gothic
    (B) Baroque
    (C) Modernist
    (D) Brutalist

    Answer:
In this instance, even though the question refers to <image 1>, the image itself is never provided to the model.
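The text-only setup above can be sketched in a few lines. This is a hypothetical illustration (the function and variable names are ours, not from the MMMU codebase): the `<image 1>` placeholder stays in the question text, and no image data is ever passed to the model.

```python
def build_text_only_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question as a plain-text prompt.

    The image placeholder (e.g. "<image 1>") is left in the question
    as-is; the image itself is simply not supplied to the model.
    """
    letters = "ABCDEFGH"
    lines = [f"Question: {question}", "", "Options:", ""]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines += ["", "Answer:"]
    return "\n".join(lines)

prompt = build_text_only_prompt(
    "What architectural style is the building in <image 1>?",
    ["Gothic", "Baroque", "Modernist", "Brutalist"],
)
print(prompt)
```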

In our paper, we explain this as follows:

For text-only LLMs (Large Language Models), we consider some of the most capable models available, including GPT-4, Llama2-7B, FLAN-T5-XXL, and Vicuna-13B. These LLMs are adopted either as the text encoder or decoder in the selected LMMs. To evaluate whether an external image-to-text tool can enhance the performance of these text-only LLMs on the MMMU benchmark, we employ OCR (Optical Character Recognition) via MMOCR, or captioning via LLaVA-1.5. This provides the recognized text information to the text-only LLMs.

Therefore, for these specific models, our evaluation approach involved testing their performance on tasks that typically involve images, but without providing the image data. This allowed us to focus on the text-processing capabilities of these models.
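The augmented variants can be sketched the same way. The stub functions below stand in for the external tools (the paper uses MMOCR for OCR and LLaVA-1.5 for captioning; these are not their real APIs): the image is converted to text by the external tool, and that text is prepended to the otherwise text-only prompt.

```python
def run_ocr(image_path: str) -> str:
    # Stand-in for a real OCR tool (the paper uses MMOCR).
    return "ST. PAUL'S CATHEDRAL"

def run_captioner(image_path: str) -> str:
    # Stand-in for a real captioner (the paper uses LLaVA-1.5).
    return "A large domed cathedral with a baroque facade."

def augment_prompt(prompt: str, image_path: str, mode: str = "ocr") -> str:
    """Prepend image-derived text so a text-only LLM can use it."""
    if mode == "ocr":
        header = f"OCR text from <image 1>: {run_ocr(image_path)}"
    elif mode == "caption":
        header = f"Caption of <image 1>: {run_captioner(image_path)}"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return header + "\n\n" + prompt

base = "Question: What architectural style is the building in <image 1>?"
print(augment_prompt(base, "img1.png", mode="caption"))
```

The text-only results you asked about are simply the same prompts evaluated without this augmentation step.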

I hope this clarifies your query. Please feel free to reach out if you have any more questions or need further information.

fxmeng commented

Thank you for your reply. It is very insightful work.