booydar/babilong

Add Claude and Google models into benchmark

Opened this issue · 5 comments

Thanks a lot for such a unique benchmark! Could you please add the Claude models to the benchmark: Haiku, Sonnet, and Opus?
And also Google's Gemini Pro and Gemini Flash.

Thanks for your interest in our work! So far we have had the time and resources to evaluate only some of the main models in the field. We'll try to extend the model selection in the future, and any help from the community is highly appreciated! New models can be added to the leaderboard by submitting a pull request with evaluation results to this repository.

Yes, thanks so much for all your hard work. I just wanted to mention that the Claude and Google models are looking great at the moment. And the new Llama 405b is pretty impressive too. When you get a chance, I'd love to see your benchmarks for all of them. Thanks again!

We have just added results for Llama-3.1-Instruct (8B and 70B). Llama-3.1 70B shows very strong results on long contexts (32k+), even outperforming GPT-4.

image

That's great! Thank you.

May I ask what prevents you from adding Claude, Gemini, and Llama 405B to the benchmark?

If you wish, I can help by providing an OpenRouter API key for that purpose. You can contact me via Telegram: https://t.me/rodion_m_tg
@yurakuratov

We have evaluated Gemini 1.5 Pro 002 on BABILong tasks:
image
image
It is currently the strongest LLM we have evaluated.

These results are included in the updated paper.