THUDM/AgentBench

dbbench-std: Task Output Seems Correct But MD5 Mismatches

Opened this issue · 1 comment

I looked into one particular DbBench task. GPT-4 seems to have given the right answer, but the MD5 doesn't match.

Steps to reproduce the behavior:

  1. Run a task with line #106 of dbbench/standard.jsonl:
    {"description": "The film titled 'New Movie' will be added to the Filmography table with the lead actor role and a note of '-' for the year 2019.", "label": ["INSERT INTO Filmography (Year, Title, Role, Notes) VALUES ('2019', 'New Movie', 'Lead Actor', '-')"], "create": {"database": "fetaqa", "init": "fetaqa_init.sql"}, "table": {"table_name": "Filmography", "table_info": {"columns": [{"name": "Year", "type": "INT"}, {"name": "Title", "type": "TEXT"}, {"name": "Role", "type": "TEXT"}, {"name": "Notes", "type": "TEXT"}], "rows": [["1985", "Back to the Future", "Jennifer Parker", "-"], ["2008", "Still Waters Burn", "Laura Harper", "-"], ["2011", "Alien Armageddon", "Eileen Daly", "-"], ["2013", "You Are Not Alone", "Cristina's Mom", "Short film"], ["2013", "Max", "Mom", "Short film"], ["2014", "Starship: Rising", "Captain Savage", "-"], ["2015", "EP/Executive Protection", "Pam Travis", "-"], ["2015", "Back in Time", "Herself", "Back to the Future documentary"], ["2015", "Back to the 2015 Future", "Jennifer Parker", "Short film"], ["2017", "Vitals", "Margaret Parks", "-"], ["2018", "Groove Street", "Julie", "-"], ["1999", "The Matrix", "Trinity", "-"], ["2005", "Batman Begins", "Rachel Dawes", "-"], ["2010", "Inception", "Mal", "-"], ["2012", "The Avengers", "Black Widow/Natasha Romanoff", "-"], ["2014", "Interstellar", "Brand", "-"], ["2016", "La La Land", "Mia Dolan", "-"], ["2017", "Wonder Woman", "Wonder Woman/Diana Prince", "-"], ["2019", "Avengers: Endgame", "Black Widow/Natasha Romanoff", "-"], ["2021", "The Suicide Squad", "Harley Quinn", "-"], ["2022", "Black Panther: Wakanda Forever", "Okoye", "-"]]}}, "evaluation": "", "example": "", "type": ["INSERT"], "heads": ["Year", "Title", "Role", "Notes"], "add_description": "The name of this table is Filmography, and the headers of this table are Year,Title,Role,Notes.", "source": "fetaqa", "answer_md5": "[('ae2213ddbcb907c43fd757035b363328',)]"}

  2. Get the output SQL command and MD5 from the output/runs.jsonl file:

image

  3. Print out the modified table in dbbench.interaction.execute:

image

  4. Compare the MD5 from the dataset with the one in the output:

image
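The comparison in the last step can be sketched roughly as follows. This is only an illustration: the answer_md5 string is copied from the dataset entry above, while parse_answer_md5, md5_matches, and the observed digest are hypothetical names and placeholder values, not part of AgentBench or of the actual run output.

```python
import ast


def parse_answer_md5(field: str) -> str:
    """Extract the hex digest from a string shaped like "[('<digest>',)]"."""
    rows = ast.literal_eval(field)  # safely parses the list-of-tuples literal
    return rows[0][0]


def md5_matches(field: str, observed: str) -> bool:
    """Compare the dataset digest with a digest observed in a run (case-insensitive hex)."""
    return parse_answer_md5(field).lower() == observed.strip().lower()


# The answer_md5 field exactly as it appears in the dataset entry.
dataset_field = "[('ae2213ddbcb907c43fd757035b363328',)]"

# Placeholder standing in for the digest read from output/runs.jsonl (hypothetical value).
observed_digest = "00000000000000000000000000000000"

print(parse_answer_md5(dataset_field))
print(md5_matches(dataset_field, observed_digest))
```

A mismatch here only says that the final table contents differ from the ones the label produces; it does not say which cell differs.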

  • OS: Ubuntu 22.04
  • Python: 3.9

This is only one example I collected. There are many errors of a similar kind. Can you help me identify the issue I am facing, please?

zhc7 commented

Hi, @wchen-github. The answer MD5 is calculated based on the label field in the data entry. As you can see, the correct answer is assumed to be INSERT INTO Filmography (Year, Title, Role, Notes) VALUES ('2019', 'New Movie', 'Lead Actor', '-'). The capitalized 'Lead Actor' is probably causing the difference in hash: the description gives the role in lowercase ("lead actor"), so an inserted value with different casing produces a different table hash. We'll try to do better in data filtering and validation. There shouldn't be many similar exceptions. Thank you for your report!
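To illustrate why a single capitalization difference is enough to change the result, here is a minimal sketch. It assumes a simple hashing scheme (hash each row, sort the row hashes, hash the concatenation); table_md5 is a hypothetical helper and is not necessarily how the benchmark computes answer_md5. The rows are taken from the dataset entry above.

```python
import hashlib


def table_md5(rows):
    """Hash a table: MD5 each row, sort the row digests, then MD5 the concatenation.

    Assumed scheme for illustration only; the benchmark may hash tables differently.
    """
    row_hashes = sorted(
        hashlib.md5(",".join(row).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.md5("".join(row_hashes).encode("utf-8")).hexdigest()


# Two rows from the Filmography table, enough to show the effect.
base_rows = [
    ("1985", "Back to the Future", "Jennifer Parker", "-"),
    ("2008", "Still Waters Burn", "Laura Harper", "-"),
]

# Row as written in the label versus the same row with a lowercase role.
label_row = ("2019", "New Movie", "Lead Actor", "-")
lowercase_row = ("2019", "New Movie", "lead actor", "-")

expected = table_md5(base_rows + [label_row])
observed = table_md5(base_rows + [lowercase_row])

print(expected == observed)  # False: the digests differ
```

Because the digest covers the full cell contents, any exact-match check of this kind treats 'Lead Actor' and 'lead actor' as different answers even though both satisfy the task description.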