Common-sense QA checklist
Closed this issue · 4 comments
neubig commented
- Output generation code that supports litellm checked into the repo
- System outputs for OpenAI models checked into the repo
- Zeno visualization code checked into the repo
- Zeno project shared on Slack and with the "Benchmarking Gemini" Zeno group
- Confirmation that the results using OpenAI models are reasonable and more-or-less match previous work
- System outputs for Gemini (through Vertex AI) checked into the repo and uploaded to the Zeno project
- Overall numerical results added to the paper
- Analysis of the results is done, and text and examples are added to the paper
- (Optional) Results also created for Mixtral (through Together)
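For the first checklist item, a minimal sketch of what litellm-based output generation could look like. This is only an illustration, not the code the checklist refers to: the prompt format and model names here are assumptions, and the actual generation script in the repo may differ.

```python
# Hypothetical sketch of litellm-based output generation for common-sense QA.
# The single-turn prompt format and example model name are assumptions.

def build_messages(question: str) -> list:
    # litellm uses the OpenAI chat format: a list of role/content dicts.
    return [{"role": "user", "content": question}]

def generate_output(model: str, question: str) -> str:
    # Imported lazily so build_messages() works even without litellm installed.
    from litellm import completion

    # litellm routes the same call to OpenAI, Vertex AI, Together, etc.,
    # depending on the model string, which is why it fits this checklist.
    resp = completion(model=model, messages=build_messages(question))
    return resp.choices[0].message.content

# Example usage (requires an API key for the chosen provider):
# print(generate_output("gpt-3.5-turbo", "Where would you put a clean cup?"))
```

Because litellm exposes one interface across providers, the same script could cover the OpenAI, Gemini, and Mixtral rows of the checklist by swapping the model string.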
yuzc19 commented
Which GPT models' results should we include (gpt-4-1106-preview, gpt-3.5-turbo-1106, or gpt-3.5-turbo)?
Sparkier commented
You can put the output files in individual folders, as is currently done for the math_reasoning problems.
yuzc19 commented
Yeah, I just want to confirm which models we choose to compare, since evaluating each one may require a lot of time/money.
Sparkier commented
Ahh, sorry, I misunderstood you. That's probably something everyone should be aware of.