neulab/gemini-benchmark

Common-sense QA checklist


  • Output generation code that supports litellm checked into the repo (see the litellm sketch after this list)
  • System outputs for OpenAI models checked into the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group (see the Zeno upload sketch after this list)
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked into the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis of the results done, with text and examples added to the paper
  • (Optional) Results also created for Mixtral (through Together)
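For reference, here is a minimal sketch of what the litellm-based output generation could look like. The dataset, prompt format, output file layout, and helper names are hypothetical placeholders (not the repo's actual code), and the model list is just the candidates discussed below; Vertex AI credentials are assumed to be configured in the environment.

```python
# Minimal sketch of litellm-based output generation.
# Dataset, prompt format, and output paths are hypothetical placeholders.
import json
from litellm import completion

# litellm routes OpenAI and Vertex AI (Gemini) models through one interface.
MODELS = ["gpt-4-1106-preview", "gpt-3.5-turbo-1106", "vertex_ai/gemini-pro"]

def answer(model: str, question: str) -> str:
    """Query one model for a single common-sense QA question."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Placeholder questions; in practice these would come from the QA dataset.
    questions = ["Where would you keep a frozen pizza before cooking it?"]
    for model in MODELS:
        outputs = [
            {"id": i, "question": q, "output": answer(model, q)}
            for i, q in enumerate(questions)
        ]
        # One output file per model, mirroring the per-model folders used elsewhere in the repo.
        with open(f"{model.replace('/', '_')}.json", "w") as f:
            json.dump(outputs, f, indent=2)
```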

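And a rough sketch of the Zeno upload step, assuming the zeno_client Python package; the API key, file paths, column names, view, and project name are all placeholders, not the project's actual configuration.

```python
# Rough sketch of uploading system outputs to a Zeno project.
# API key, paths, column names, and project settings are placeholders.
import pandas as pd
from zeno_client import ZenoClient, ZenoMetric

client = ZenoClient("YOUR_ZENO_API_KEY")  # hypothetical key

# Dataset: one row per question, with the gold answer.
df = pd.read_json("commonsense_qa.json")  # hypothetical path with id/question/answer columns
project = client.create_project(
    name="Benchmarking Gemini: Common-sense QA",
    view="text-classification",
    metrics=[ZenoMetric(name="accuracy", type="mean", columns=["correct"])],
)
project.upload_dataset(df, id_column="id", data_column="question", label_column="answer")

# One system per model, with a boolean "correct" column backing the accuracy metric.
for model in ["gpt-4-1106-preview", "gemini-pro"]:
    df_sys = pd.read_json(f"{model}.json")  # hypothetical per-model output file
    df_sys["answer"] = df["answer"]
    df_sys["correct"] = df_sys["output"] == df_sys["answer"]
    project.upload_system(df_sys, name=model, id_column="id", output_column="output")
```
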
Which GPT models should we include results for (gpt-4-1106-preview, gpt-3.5-turbo-1106, or gpt-3.5-turbo)?

You can put the output files in individual folders, as is currently done for the math_reasoning problems.

Yeah, I just want to confirm which models we choose to compare, since evaluating each one may require a lot of time/money.

Ahh, sorry, I misunderstood you. That's probably something everyone should be aware of.