Evaluation Prompt for mPLUG-Owl2
vateye opened this issue · 8 comments
Could I possibly know the evaluation prompt for mPLUG-Owl2? It seems that the subpar performance might be a result of an inappropriate prompt.
For example, the prompt for short-answer generation would be:
<|image|>{QUESTION}\nAnswer the question using a single word or phrase.
For multiple-choice questions, the prompt would be:
<|image|>{QUESTION}\n{OPTIONS}\nAnswer with the option’s letter from the given choices directly.
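For reference, here is a minimal sketch of how these two templates could be assembled from an MMMU sample; the `build_prompt` helper and the assumed field layout (`question`, `question_type`, a list of `options`) are ours, not part of the official script:

```python
def build_prompt(sample):
    """Hypothetical helper: format one MMMU sample into the prompts above."""
    question = sample["question"]
    if sample["question_type"] == "multiple-choice":
        # e.g. ["foo", "bar"] -> "A. foo\nB. bar"
        options = "\n".join(
            f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(sample["options"])
        )
        return (
            f"<|image|>{question}\n{options}\n"
            "Answer with the option’s letter from the given choices directly."
        )
    # open / short-answer questions
    return f"<|image|>{question}\nAnswer the question using a single word or phrase."
```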
Yes, we have added the scripts for evaluating MMMU correctly. See the script here.
Here are the results on the validation split:
{"Overall": 32.666666666666664, "Accounting": 6.666666666666667, "Agriculture": 50.0, "Architecture_and_Engineering": 26.666666666666668, "Art": 50.0, "Art_Theory": 60.0, "Basic_Medical_Science": 26.666666666666668, "Biology": 20.0, "Chemistry": 13.333333333333334, "Clinical_Medicine": 36.666666666666664, "Computer_Science": 26.666666666666668, "Design": 50.0, "Diagnostics_and_Laboratory_Medicine": 33.33333333333333, "Economics": 40.0, "Electronics": 16.666666666666664, "Energy_and_Power": 36.666666666666664, "Finance": 13.333333333333334, "Geography": 26.666666666666668, "History": 43.333333333333336, "Literature": 76.66666666666667, "Manage": 26.666666666666668, "Marketing": 36.666666666666664, "Materials": 26.666666666666668, "Math": 33.33333333333333, "Mechanical_Engineering": 33.33333333333333, "Music": 23.333333333333332, "Pharmacy": 36.666666666666664, "Physics": 20.0, "Psychology": 20.0, "Public_Health": 26.666666666666668, "Sociology": 43.333333333333336}
validation_231130084216_s0_metrics.json
validation_231130084216_s0_prediction_groupby_category.json
validation_231130084216_s0_results.json
@xiangyue9607 @drogozhang @NipElement Hope you can re-evaluate the results and revise them in the paper.
Thanks for pointing out this issue and implementing the evaluation script.
We used the correct text prompt structure.
After checking your code implementation, we found that the main difference between your prompt and ours is that you replaced all the <image 1> <image 2> placeholders with <|image|>.
However, we only prepended the image token to the beginning of the prompt, using the same code as in the mPLUG-Owl2 README.md (see below), and kept all the <image 1> <image 2>... placeholders (the same as what we did for the prompts of all other models).
inp = DEFAULT_IMAGE_TOKEN + query
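To make the difference concrete, here is a small illustration; the example query is made up, and `DEFAULT_IMAGE_TOKEN` is assumed to be mPLUG-Owl2's `<|image|>` token:

```python
import re

DEFAULT_IMAGE_TOKEN = "<|image|>"  # assumed value of the mPLUG-Owl2 image token

# A made-up MMMU-style query with a numbered image placeholder.
query = ("Which structure is shown in <image 1>?\nA. arch\nB. truss\n"
         "Answer with the option’s letter from the given choices directly.")

# (a) Your script: replace every <image N> placeholder with the image token.
inp_replaced = re.sub(r"<image \d+>", DEFAULT_IMAGE_TOKEN, query)

# (b) Our setup: prepend a single image token and keep the <image N> placeholders,
#     following the mPLUG-Owl2 README snippet quoted above.
inp_prepended = DEFAULT_IMAGE_TOKEN + query
```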
Sorry for the prompt error. We will run the test set with your script and update the leaderboard/paper very soon.
We are trying to evaluate on the test split, but some images are missing. For example, <image 6> appears in the options while at most five images are provided. E.g.:
{'id': 'test_Architecture_and_Engineering_39', 'question': 'Suggest and apply the suitable image transformation (arithmetic) operation on Image 1 and Image 2 of an area in order to reduce the overall noise contribution for Image 1.<image 1><image 2>', 'options': "['<image 3>', '<image 4>', '<image 5>', '<image 6>']", 'explanation': '?', 'image_1': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=317x133 at 0x7F4F912EC410>, 'image_2': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=294x126 at 0x7F4F912ED810>, 'image_3': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=450x182 at 0x7F4F912ECB90>, 'image_4': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=452x189 at 0x7F4F912EDFD0>, 'image_5': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=429x162 at 0x7F4F912EEB10>, 'img_type': "['Tables']", 'answer': '?', 'topic_difficulty': 'Medium', 'question_type': 'multiple-choice', 'subfield': 'Surveying and Mapping'}
{'id': 'test_Architecture_and_Engineering_109', 'question': ' Match List I with List II and select the correct answer using the codes given below the lists:<image 1><image 2>', 'options': "['<image 3>', '<image 4>', '<image 5>', '<image 6>']", 'explanation': '?', 'image_1': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=174x169 at 0x7F4F912D5C90>, 'image_2': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=188x172 at 0x7F4F912D4BD0>, 'image_3': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=197x21 at 0x7F4F912D69D0>, 'image_4': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=194x51 at 0x7F4F912D49D0>, 'image_5': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=194x42 at 0x7F4F912D42D0>, 'img_type': "['Diagrams']", 'answer': '?', 'topic_difficulty': 'Medium', 'question_type': 'multiple-choice', 'subfield': 'Surveying and Mapping'}
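For reference, a small sketch of one way to spot such samples; the helper name is ours and not part of the MMMU tooling:

```python
import re

def missing_image_refs(sample, max_images=7):
    """Return the <image N> indices referenced in the question/options but not
    provided as image_N fields. Illustrative helper, not part of the MMMU code."""
    text = sample["question"] + " " + str(sample["options"])
    referenced = {int(n) for n in re.findall(r"<image (\d+)>", text)}
    provided = {i for i in range(1, max_images + 1) if sample.get(f"image_{i}") is not None}
    return sorted(referenced - provided)

# For test_Architecture_and_Engineering_39 above, this reports [6]: the options
# mention <image 6> while only image_1 .. image_5 are present.
```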
Thank you for pointing out the issue with the test set.
We have now re-uploaded the data for the 12 questions where the number of images exceeded five.
We are not releasing the answers for the test set at the moment. Currently, we are building the evaluation server for the test set.
If you encounter any further issues, please feel free to let us know.
Here are the prediction results of mPLUG-Owl2 on the latest test set.
test_231203182439_s0_results.json
Hi, here are the results. It now works much better, thanks for your prompt engineering!
{'Overall-Art and Design': {'num': 1163, 'acc': 0.485}, 'Art': {'num': 231, 'acc': 0.576}, 'Art_Theory': {'num': 429, 'acc': 0.534}, 'Design': {'num': 169, 'acc': 0.598}, 'Music': {'num': 334, 'acc': 0.302}, 'Overall-Business': {'num': 1428, 'acc': 0.256}, 'Accounting': {'num': 380, 'acc': 0.287}, 'Economics': {'num': 267, 'acc': 0.292}, 'Finance': {'num': 355, 'acc': 0.203}, 'Manage': {'num': 245, 'acc': 0.224}, 'Marketing': {'num': 181, 'acc': 0.287}, 'Overall-Science': {'num': 2426, 'acc': 0.249}, 'Biology': {'num': 345, 'acc': 0.272}, 'Chemistry': {'num': 603, 'acc': 0.239}, 'Geography': {'num': 565, 'acc': 0.297}, 'Math': {'num': 505, 'acc': 0.188}, 'Physics': {'num': 408, 'acc': 0.252}, 'Overall-Health and Medicine': {'num': 1752, 'acc': 0.328}, 'Basic_Medical_Science': {'num': 326, 'acc': 0.399}, 'Clinical_Medicine': {'num': 325, 'acc': 0.323}, 'Diagnostics_and_Laboratory_Medicine': {'num': 162, 'acc': 0.34}, 'Pharmacy': {'num': 430, 'acc': 0.312}, 'Public_Health': {'num': 509, 'acc': 0.297}, 'Overall-Humanities and Social Science': {'num': 947, 'acc': 0.467}, 'History': {'num': 278, 'acc': 0.46}, 'Literature': {'num': 112, 'acc': 0.741}, 'Sociology': {'num': 252, 'acc': 0.444}, 'Psychology': {'num': 305, 'acc': 0.39}, 'Overall-Tech and Engineering': {'num': 2784, 'acc': 0.296}, 'Agriculture': {'num': 287, 'acc': 0.324}, 'Architecture_and_Engineering': {'num': 551, 'acc': 0.294}, 'Computer_Science': {'num': 371, 'acc': 0.318}, 'Electronics': {'num': 256, 'acc': 0.145}, 'Energy_and_Power': {'num': 432, 'acc': 0.394}, 'Materials': {'num': 458, 'acc': 0.266}, 'Mechanical_Engineering': {'num': 429, 'acc': 0.282}, 'Overall': {'num': 10500, 'acc': 0.321}}
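As a sanity check, the 'Overall' and group numbers above are the example-count-weighted averages of the per-subject accuracies; a small sketch of that aggregation (the helper name is ours):

```python
def overall_accuracy(results):
    """Recompute the overall accuracy as the example-count-weighted mean of the
    per-subject entries (keys not starting with 'Overall')."""
    subjects = [v for k, v in results.items() if not k.startswith("Overall")]
    total = sum(v["num"] for v in subjects)
    return sum(v["num"] * v["acc"] for v in subjects) / total

# With the dict above: 10500 examples in total and an accuracy of ~0.321.
```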
We will update the leaderboard and paper soon.
We have updated the leaderboard, and the updated paper will be online soon.
Feel free to re-open the issue if you have any further questions.