Add Claude and Google models into benchmark

Question

Opened this issue 6 months ago · 5 comments

Thanks a lot for such a unique benchmark! Could you please add the Claude models to the benchmark: Haiku, Sonnet, and Opus?
And also Google's Gemini Pro and Gemini Flash.

Answer 1 · 2024-06-15T07:04:35.000Z

Thanks for your interest in our work! So far we have had the time and resources to evaluate only some of the main models in the field. We'll try to extend the model selection in the future, and any help from the community is highly appreciated! New models can be added to the leaderboard by submitting a pull request with evaluation results to this repository.

Answer 2 · 2024-07-28T17:00:09.000Z

Yes, thanks so much for all your hard work. I just wanted to mention that the Claude and Google models are looking great at the moment. And the new Llama 405b is pretty impressive too. When you get a chance, I'd love to see your benchmarks for all of them. Thanks again!

Answer 3 · 2024-07-29T11:45:20.000Z

We have just added results for Llama-3.1-Instruct (8B and 70B). Llama-3.1 70B shows very strong results on long contexts (32k+), even outperforming GPT-4.

Answer 4 · 2024-07-29T11:50:41.000Z

That's great! Thank you.

May I ask what prevents you from adding Claude, Gemini, and Llama 405B to the benchmark?

If you wish, I can help by providing an OpenRouter API key for that purpose. You can contact me via Telegram: https://t.me/rodion_m_tg
@yurakuratov

Answer 5 · 2024-11-20T08:28:24.000Z

We have evaluated Gemini 1.5 Pro 002 on BABILong tasks:

It is currently the strongest LLM we have evaluated.

These results are included in the updated paper.