ai-benchmark

There are 5 repositories under the ai-benchmark topic.

microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
Language: Python 765 8 3781
TheAgentCompany/TheAgentCompany
An agent benchmark with tasks in a simulated software company.
Language: Python 546 12 32079
kaykycampos/gta-benchmark
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities
8 1 00
Habitante/gta-benchmark
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities
Language: Python 7 1 10
petmal/MindTrial
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI), custom tasks in YAML, and HTML/CSV reports.
Language: Go 10