ai-benchmark

There are 5 repositories under the ai-benchmark topic.

  • microsoft/WindowsAgentArena

    Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.

    Language: Python
  • TheAgentCompany/TheAgentCompany

    An agent benchmark with tasks in a simulated software company.

    Language: Python
  • kaykycampos/gta-benchmark

    GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities

  • Habitante/gta-benchmark

    GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities

    Language: Python
  • petmal/MindTrial

    MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI), custom tasks in YAML, and HTML/CSV reports.

    Language: Go