neulab/gemini-benchmark

Common-sense QA checklist


  • Output generation code that supports litellm checked into the repo (see the litellm sketch after this list)
  • System outputs for OpenAI models checked into the repo
  • Zeno visualization code checked into the repo
  • Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group (see the Zeno upload sketch after this list)
  • Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
  • System outputs for Gemini (through Vertex AI) checked into the repo and uploaded to the Zeno project
  • Overall numerical results added to the paper
  • Analysis of the results done, with text and examples added to the paper
  • (Optional) Results also created for Mixtral (through Together)
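For reference, here is a minimal sketch of what the litellm-based output generation could look like. The dataset, prompt format, output file layout, and helper names are hypothetical placeholders (not the repo's actual code), and the model list is just the candidates discussed below; Vertex AI credentials are assumed to be configured in the environment.

```python
# Minimal sketch of litellm-based output generation.
# Dataset, prompt format, and output paths are hypothetical placeholders.
import json
from litellm import completion

# litellm routes OpenAI and Vertex AI (Gemini) models through one interface.
MODELS = ["gpt-4-1106-preview", "gpt-3.5-turbo-1106", "vertex_ai/gemini-pro"]

def answer(model: str, question: str) -> str:
    """Query one model for a single common-sense QA question."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Placeholder questions; in practice these would come from the QA dataset.
    questions = ["Where would you keep a frozen pizza before cooking it?"]
    for model in MODELS:
        outputs = [
            {"id": i, "question": q, "output": answer(model, q)}
            for i, q in enumerate(questions)
        ]
        # One output file per model, mirroring the per-model folders used elsewhere in the repo.
        with open(f"{model.replace('/', '_')}.json", "w") as f:
            json.dump(outputs, f, indent=2)
```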

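And a rough sketch of the Zeno upload step, assuming the zeno_client Python package; the API key, file paths, column names, view, and project name are all placeholders, not the project's actual configuration.

```python
# Rough sketch of uploading system outputs to a Zeno project.
# API key, paths, column names, and project settings are placeholders.
import pandas as pd
from zeno_client import ZenoClient, ZenoMetric

client = ZenoClient("YOUR_ZENO_API_KEY")  # hypothetical key

# Dataset: one row per question, with the gold answer.
df = pd.read_json("commonsense_qa.json")  # hypothetical path with id/question/answer columns
project = client.create_project(
    name="Benchmarking Gemini: Common-sense QA",
    view="text-classification",
    metrics=[ZenoMetric(name="accuracy", type="mean", columns=["correct"])],
)
project.upload_dataset(df, id_column="id", data_column="question", label_column="answer")

# One system per model, with a boolean "correct" column backing the accuracy metric.
for model in ["gpt-4-1106-preview", "gemini-pro"]:
    df_sys = pd.read_json(f"{model}.json")  # hypothetical per-model output file
    df_sys["answer"] = df["answer"]
    df_sys["correct"] = df_sys["output"] == df_sys["answer"]
    project.upload_system(df_sys, name=model, id_column="id", output_column="output")
```
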
Which GPT models should we include results for (gpt-4-1106-preview, gpt-3.5-turbo-1106, or gpt-3.5-turbo)?

You can put the output files in individual folders, as is currently done for the math_reasoning problems.

Yeah, I just want to confirm which models we choose to compare, since evaluating each one may require a lot of time/money.

Ahh, sorry, I misunderstood you. That's probably something everyone should be aware of.