yuchenlin/ZeroEval

Difference with Lighteval and LM-Eval-Harness


Hi Bill,

Thanks for your great work! I am evaluating LMs and happened to come across your repo. Just curious: what is the goal of this project, and what are the key differences compared with lighteval and lm-eval-harness?

Thank you!

Hey Shizhe,

Thanks for the question. Here are a few points:

  • We want to unify evaluation across all LLMs, including API-only ones such as GPTs, Gemini, etc. I believe lighteval and lm-eval-harness may not support them well, especially the new models coming out every week.
  • We want to add a few more tasks and focus on the zero-shot CoT + structured-output setting (see the sketch after this list). lighteval might support this as well, but it isn't easy for us to customize. We find our codebase relatively easy to modify to support new eval tasks such as CRUX and ZebraLogic.
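
To make the second point concrete, here is a minimal sketch of what the zero-shot CoT + structured-output setting looks like with an OpenAI-compatible client. The prompt wording, model name, and JSON keys are illustrative assumptions, not ZeroEval's actual templates or API:

```python
# Sketch: zero-shot CoT with a structured (JSON) final answer, via an
# OpenAI-compatible API. Names here are illustrative, not ZeroEval's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "A farmer has 17 sheep; all but 9 run away. How many are left?"

prompt = (
    "Answer the question below. First reason step by step, "
    "then output a JSON object on the last line with the keys "
    '"reasoning" and "answer".\n\n'
    f"Question: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

# The reply ends with a JSON line that can be parsed for automatic scoring.
print(response.choices[0].message.content)
```

Because the model is only asked for a parseable JSON object at the end, the same harness can score both open-weight and API-only models without task-specific answer extraction.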

OpenAI's simple-evals is very close to our goal, but it seems they do not plan to maintain that codebase for new models. We aim to update regularly to support new models, and we welcome contributions from the community.

Thanks again for the question! I hope this helps! :D

Hi Bill,

Thank you so much for the explanation! It is very clear to me.
Thank you for your contribution!