
DRBench – The Benchmark for Enterprise Deep Research Agents

🚀 Coming Soon

⭐ Star & Watch this repo to be notified about the release

DRBench is a first-of-its-kind benchmark designed to evaluate deep research agents on complex, open-ended enterprise research tasks.

It tests an agent's ability to conduct multi-hop, insight-driven research across public and private data sources, just as a real enterprise analyst would.


🧠 Why DRBench?

  • 🔎 Real Deep Research Tasks
    Not simple fact lookups: tasks such as "What changes should we make to our product roadmap to ensure compliance?" require multi-step reasoning, synthesis, and reporting.

  • 🏢 Enterprise Context Grounding
    Each task is grounded in a realistic user persona (e.g., Product Developer) and organizational setting (e.g., ServiceNow), so agents must understand and act on enterprise context.

  • 🧩 Multi-Modal, Multi-Source Reasoning
    Agents must search, retrieve, and reason across:

    • Internal chat logs 💬
    • Cloud file systems 📂
    • Spreadsheets 📊
    • PDFs 📄
    • Websites 🌐
    • Emails 📧
  • 🧠 Insight-Centric Evaluation
    Reports are scored on whether the agent surfaces the most critical insights and properly cites their sources (a toy illustration follows this list).
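
Since the evaluation toolkit is not yet released, the snippet below is only a rough sketch of what insight-centric scoring could look like. Everything in it (the Insight record, score_report, the exact-substring matching) is illustrative and is not DRBench's actual API.

```python
# Hypothetical sketch only: DRBench's real evaluator is not yet released,
# and all names here (Insight, score_report, ...) are made up for illustration.
from dataclasses import dataclass

@dataclass
class Insight:
    text: str    # a key finding the report is expected to surface
    source: str  # the document, chat, or email the finding comes from

def score_report(report: str, reference_insights: list[Insight]) -> dict:
    """Toy scorer: fraction of reference insights mentioned, and how many of those cite their source."""
    body = report.lower()
    covered = [i for i in reference_insights if i.text.lower() in body]
    cited = [i for i in covered if i.source.lower() in body]
    total = len(reference_insights)
    return {
        "insight_coverage": len(covered) / total if total else 0.0,
        "citation_rate": len(cited) / len(covered) if covered else 0.0,
    }

# Example: two reference insights; the report covers one and cites its source.
refs = [
    Insight("new data-residency rules take effect in Q3", "compliance_memo.pdf"),
    Insight("the current roadmap lacks an audit-log feature", "roadmap.xlsx"),
]
report = "Per compliance_memo.pdf, new data-residency rules take effect in Q3, so the roadmap should add..."
print(score_report(report, refs))  # {'insight_coverage': 0.5, 'citation_rate': 1.0}
```

A real scorer would need fuzzy or semantic matching rather than exact substrings; the sketch only illustrates the two signals being measured, insight coverage and citation rate.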


📦 What You'll Get

✅ The first benchmark for deep research across hybrid enterprise environments
✅ A suite of real-world tasks across enterprise use cases such as CRM (see the illustrative task sketch below)
✅ A realistic simulated enterprise stack (chat, docs, email, web, etc.)
✅ A task generation framework blending web-based facts and local context
✅ A lightweight, scalable evaluation mechanism for insightfulness and citations
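
For a concrete sense of what such a task might look like, here is a purely hypothetical sketch; the actual DRBench task schema, field names, and file paths may differ.

```python
# Purely illustrative: the real DRBench task schema is not yet published,
# and every field name and path below is made up for this example.
example_task = {
    "task_id": "crm-compliance-0001",
    "persona": "Product Developer",
    "organization": "ServiceNow",
    "question": "What changes should we make to our product roadmap to ensure compliance?",
    "sources": {
        "chat":  ["#product-dev channel export"],              # internal chat logs
        "files": ["roadmap.xlsx", "compliance_memo.pdf"],      # cloud files, spreadsheets, PDFs
        "email": ["legal-update-thread.eml"],
        "web":   ["https://example.com/regulation-overview"],  # public web facts
    },
    "deliverable": "A cited report recommending roadmap changes",
}
```

The key design point is that the answer cannot be found in any single source: the agent has to combine the public web material with the private files, chat, and email listed above.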


🧪 Project Status

We're putting the final polish on the benchmark, evaluation tools, and baseline agents.
Public release coming soon!


🀝 Get Involved

Interested in early access, collaboration, or feedback?


🀝 Core Contributers