A new benchmark for measuring LLMs' ability to detect bugs in large codebases.
Primary Language: Jupyter Notebook