Evaluation Prompt for mPLUG-Owl2
vateye opened this issue · 8 comments
Could I possibly know the evaluation prompt for mPLUG-Owl2? It seems that the subpar performance might be a result of an inappropriate prompt.
For example, the prompt for short-answer generation would be:
<|image|>{QUESTION}\nAnswer the question using a single word or phrase.
For multiple-choice questions, the prompt would be:
<|image|>{QUESTION}\n{OPTIONS}\nAnswer with the option’s letter from the given choices directly.
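For reference, here is a minimal sketch of how these two templates could be assembled from an MMMU sample; the `build_prompt` helper and the assumed field layout (`question`, `question_type`, a list of `options`) are ours, not part of the official script:

```python
def build_prompt(sample):
    """Hypothetical helper: format one MMMU sample into the prompts above."""
    question = sample["question"]
    if sample["question_type"] == "multiple-choice":
        # e.g. ["foo", "bar"] -> "A. foo\nB. bar"
        options = "\n".join(
            f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(sample["options"])
        )
        return (
            f"<|image|>{question}\n{options}\n"
            "Answer with the option’s letter from the given choices directly."
        )
    # open / short-answer questions
    return f"<|image|>{question}\nAnswer the question using a single word or phrase."
```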
Yes, we have added the scripts for evaluating MMMU correctly. See the script here.
Here are the results on the validation split:
{"Overall": 32.666666666666664, "Accounting": 6.666666666666667, "Agriculture": 50.0, "Architecture_and_Engineering": 26.666666666666668, "Art": 50.0, "Art_Theory": 60.0, "Basic_Medical_Science": 26.666666666666668, "Biology": 20.0, "Chemistry": 13.333333333333334, "Clinical_Medicine": 36.666666666666664, "Computer_Science": 26.666666666666668, "Design": 50.0, "Diagnostics_and_Laboratory_Medicine": 33.33333333333333, "Economics": 40.0, "Electronics": 16.666666666666664, "Energy_and_Power": 36.666666666666664, "Finance": 13.333333333333334, "Geography": 26.666666666666668, "History": 43.333333333333336, "Literature": 76.66666666666667, "Manage": 26.666666666666668, "Marketing": 36.666666666666664, "Materials": 26.666666666666668, "Math": 33.33333333333333, "Mechanical_Engineering": 33.33333333333333, "Music": 23.333333333333332, "Pharmacy": 36.666666666666664, "Physics": 20.0, "Psychology": 20.0, "Public_Health": 26.666666666666668, "Sociology": 43.333333333333336}
validation_231130084216_s0_metrics.json
validation_231130084216_s0_prediction_groupby_category.json
validation_231130084216_s0_results.json
@xiangyue9607 @drogozhang @NipElement Hope you can re-evaluate the results and revise them in the paper.
Thanks for pointing out this issue and implementing the evaluation script.
We used the correct text prompt structure.
After checking your code implementation, we found that the main difference between your prompt and ours is that you replaced all the <image 1> <image 2> placeholders with <|image|>.
However, we only prepended the image token to the beginning of the prompt, using the same code as in the mPLUG-Owl2 README.md (see below), and kept all the <image 1> <image 2>... placeholders (the same as what we did for the prompts of all other models).
inp = DEFAULT_IMAGE_TOKEN + query
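To make the difference concrete, here is a small illustration; the example query is made up, and `DEFAULT_IMAGE_TOKEN` is assumed to be mPLUG-Owl2's `<|image|>` token:

```python
import re

DEFAULT_IMAGE_TOKEN = "<|image|>"  # assumed value of the mPLUG-Owl2 image token

# A made-up MMMU-style query with a numbered image placeholder.
query = ("Which structure is shown in <image 1>?\nA. arch\nB. truss\n"
         "Answer with the option’s letter from the given choices directly.")

# (a) Your script: replace every <image N> placeholder with the image token.
inp_replaced = re.sub(r"<image \d+>", DEFAULT_IMAGE_TOKEN, query)

# (b) Our setup: prepend a single image token and keep the <image N> placeholders,
#     following the mPLUG-Owl2 README snippet quoted above.
inp_prepended = DEFAULT_IMAGE_TOKEN + query
```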
Sorry for the prompt error. We will run the test set with your script and update the leaderboard/paper very soon.
We are trying to evaluate on the test split, but some images are missing. For example, <image 6> appears in the options while at most five images are provided. E.g.:
{'id': 'test_Architecture_and_Engineering_39', 'question': 'Suggest and apply the suitable image transformation (arithmetic) operation on Image 1 and Image 2 of an area in order to reduce the overall noise contribution for Image 1.<image 1><image 2>', 'options': "['<image 3>', '<image 4>', '<image 5>', '<image 6>']", 'explanation': '?', 'image_1': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=317x133 at 0x7F4F912EC410>, 'image_2': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=294x126 at 0x7F4F912ED810>, 'image_3': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=450x182 at 0x7F4F912ECB90>, 'image_4': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=452x189 at 0x7F4F912EDFD0>, 'image_5': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=429x162 at 0x7F4F912EEB10>, 'img_type': "['Tables']", 'answer': '?', 'topic_difficulty': 'Medium', 'question_type': 'multiple-choice', 'subfield': 'Surveying and Mapping'}
{'id': 'test_Architecture_and_Engineering_109', 'question': ' Match List I with List II and select the correct answer using the codes given below the lists:<image 1><image 2>', 'options': "['<image 3>', '<image 4>', '<image 5>', '<image 6>']", 'explanation': '?', 'image_1': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=174x169 at 0x7F4F912D5C90>, 'image_2': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=188x172 at 0x7F4F912D4BD0>, 'image_3': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=197x21 at 0x7F4F912D69D0>, 'image_4': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=194x51 at 0x7F4F912D49D0>, 'image_5': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=194x42 at 0x7F4F912D42D0>, 'img_type': "['Diagrams']", 'answer': '?', 'topic_difficulty': 'Medium', 'question_type': 'multiple-choice', 'subfield': 'Surveying and Mapping'}
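For reference, a small sketch of one way to spot such samples; the helper name is ours and not part of the MMMU tooling:

```python
import re

def missing_image_refs(sample, max_images=7):
    """Return the <image N> indices referenced in the question/options but not
    provided as image_N fields. Illustrative helper, not part of the MMMU code."""
    text = sample["question"] + " " + str(sample["options"])
    referenced = {int(n) for n in re.findall(r"<image (\d+)>", text)}
    provided = {i for i in range(1, max_images + 1) if sample.get(f"image_{i}") is not None}
    return sorted(referenced - provided)

# For test_Architecture_and_Engineering_39 above, this reports [6]: the options
# mention <image 6> while only image_1 .. image_5 are present.
```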
Thank you for pointing out the issue with the test set.
We have now re-uploaded the data for the 12 questions where the number of images exceeded five.
We are not releasing the answers for the test set at the moment. Currently, we are building the evaluation server for the test set.
If you encounter any further issues, please feel free to let us know.
Here are the prediction results of mPLUG-Owl2 on the latest test set.
test_231203182439_s0_results.json
Hi, here are the results. It now works much better, thanks for your prompt engineering!
{'Overall-Art and Design': {'num': 1163, 'acc': 0.485}, 'Art': {'num': 231, 'acc': 0.576}, 'Art_Theory': {'num': 429, 'acc': 0.534}, 'Design': {'num': 169, 'acc': 0.598}, 'Music': {'num': 334, 'acc': 0.302}, 'Overall-Business': {'num': 1428, 'acc': 0.256}, 'Accounting': {'num': 380, 'acc': 0.287}, 'Economics': {'num': 267, 'acc': 0.292}, 'Finance': {'num': 355, 'acc': 0.203}, 'Manage': {'num': 245, 'acc': 0.224}, 'Marketing': {'num': 181, 'acc': 0.287}, 'Overall-Science': {'num': 2426, 'acc': 0.249}, 'Biology': {'num': 345, 'acc': 0.272}, 'Chemistry': {'num': 603, 'acc': 0.239}, 'Geography': {'num': 565, 'acc': 0.297}, 'Math': {'num': 505, 'acc': 0.188}, 'Physics': {'num': 408, 'acc': 0.252}, 'Overall-Health and Medicine': {'num': 1752, 'acc': 0.328}, 'Basic_Medical_Science': {'num': 326, 'acc': 0.399}, 'Clinical_Medicine': {'num': 325, 'acc': 0.323}, 'Diagnostics_and_Laboratory_Medicine': {'num': 162, 'acc': 0.34}, 'Pharmacy': {'num': 430, 'acc': 0.312}, 'Public_Health': {'num': 509, 'acc': 0.297}, 'Overall-Humanities and Social Science': {'num': 947, 'acc': 0.467}, 'History': {'num': 278, 'acc': 0.46}, 'Literature': {'num': 112, 'acc': 0.741}, 'Sociology': {'num': 252, 'acc': 0.444}, 'Psychology': {'num': 305, 'acc': 0.39}, 'Overall-Tech and Engineering': {'num': 2784, 'acc': 0.296}, 'Agriculture': {'num': 287, 'acc': 0.324}, 'Architecture_and_Engineering': {'num': 551, 'acc': 0.294}, 'Computer_Science': {'num': 371, 'acc': 0.318}, 'Electronics': {'num': 256, 'acc': 0.145}, 'Energy_and_Power': {'num': 432, 'acc': 0.394}, 'Materials': {'num': 458, 'acc': 0.266}, 'Mechanical_Engineering': {'num': 429, 'acc': 0.282}, 'Overall': {'num': 10500, 'acc': 0.321}}
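As a sanity check, the 'Overall' and group numbers above are the example-count-weighted averages of the per-subject accuracies; a small sketch of that aggregation (the helper name is ours):

```python
def overall_accuracy(results):
    """Recompute the overall accuracy as the example-count-weighted mean of the
    per-subject entries (keys not starting with 'Overall')."""
    subjects = [v for k, v in results.items() if not k.startswith("Overall")]
    total = sum(v["num"] for v in subjects)
    return sum(v["num"] * v["acc"] for v in subjects) / total

# With the dict above: 10500 examples in total and an accuracy of ~0.321.
```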
We will update the leaderboard and paper soon.
We have updated the leaderboard, and the updated paper will be online soon.
Feel free to re-open the issue if you have any further questions.