Benchmark agent workflows: try the models of your choice on the framework that you want

This repo is the engine for the evaluations displayed in our Agents v2.0 announcement post.

You can use it to test agents on different frameworks:

On different benchmarks:

GAIA
Our custom agent reasoning benchmark, which includes tasks from GSM8K, HotpotQA and GAIA

And with different models (cf. the benchmark below).

We also implement LLM-judge evaluation, with parallel processing for faster results.
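As a rough illustration of the idea (not the repo's actual code), LLM-judge evaluation with parallel processing could be sketched as below. The prompt template, the 1 to 5 scale, and the names `judge_one`, `judge_all`, `fake_llm` and `call_llm` are all assumptions made for this example. Threads are used because judge calls are typically network-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

# Hypothetical prompt template; the repo's actual judge prompt may differ.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Agent answer: {prediction}\n"
    "Rate the agent answer from 1 (wrong) to 5 (fully correct). "
    "Reply with the number only."
)


def judge_one(example: dict, call_llm: Callable[[str], str]) -> Optional[int]:
    """Score one example; return None if the reply is not an integer from 1 to 5."""
    reply = call_llm(JUDGE_PROMPT.format(**example)).strip()
    try:
        score = int(reply)
    except ValueError:
        return None
    return score if 1 <= score <= 5 else None


def judge_all(examples: list, call_llm: Callable[[str], str], max_workers: int = 8) -> list:
    """Judge all examples concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda ex: judge_one(ex, call_llm), examples))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: "5" on exact match with the reference, else "1"."""
    lines = prompt.splitlines()
    reference = lines[1].split(": ", 1)[1]
    prediction = lines[2].split(": ", 1)[1]
    return "5" if reference == prediction else "1"


examples = [
    {"question": "2 + 2?", "reference": "4", "prediction": "4"},
    {"question": "Capital of France?", "reference": "Paris", "prediction": "Lyon"},
]
scores = judge_all(examples, fake_llm)
print(scores)
```

In practice `fake_llm` would be replaced by a function that sends the prompt to the judge model and returns its text reply. Unparseable replies come back as `None` so they can be retried or excluded rather than silently counted as a score.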

aymeric-roucher/agent_reasoning_benchmark