llm-as-a-judge

There are 42 repositories under the llm-as-a-judge topic.

  • Agenta-AI/agenta

    The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

    Language: Python
  • prometheus-eval/prometheus-eval

    Evaluate your LLM's response with Prometheus and GPT-4 💯

    Language: Python
  • metauto-ai/agent-as-a-judge

    👩‍⚖️ Coding Agent-as-a-Judge

    Language: Python
  • haizelabs/verdict

    Inference-time scaling for LLMs-as-a-judge.

    Language: Jupyter Notebook
  • IAAR-Shanghai/xFinder

    [ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

    Language: Python
  • IAAR-Shanghai/xVerify

    xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

    Language: Python
  • martin-wey/CodeUltraFeedback

    CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)

    Language: Python
  • KID-22/LLM-IR-Bias-Fairness-Survey

    The repository for a survey of Bias and Fairness in Information Retrieval (IR) with LLMs.

  • lupantech/ineqmath

    Solving Inequality Proofs with Large Language Models.

    Language: Python
  • MJ-Bench/MJ-Bench

    Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

    Language: Jupyter Notebook
  • whitecircle-ai/circle-guard-bench

    First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

    Language: Python
  • docling-project/docling-sdg

    A set of tools for creating synthetically generated data from documents

    Language: Python
  • zhaochen0110/Timo

    Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

    Language: Python
  • minnesotanlp/cobbler

    Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

    Language: Jupyter Notebook
  • PKU-ONELab/Themis

    The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.

    Language: Python
  • OussamaSghaier/CuREV

    Harnessing Large Language Models for Curated Code Reviews

    Language: Python
  • OtherVibes/mcp-as-a-judge

    MCP as a Judge is a behavioral MCP (Model Context Protocol) server that strengthens AI coding assistants by requiring explicit LLM evaluations

    Language: Python
  • root-signals/rs-sdk

    Root Signals SDK

    Language: Python
  • root-signals/root-signals-mcp

    MCP for Root Signals Evaluation Platform

    Language: Python
  • aws-samples/genai-system-evaluation

    A set of examples demonstrating how to evaluate Generative AI-augmented systems using traditional information retrieval and LLM-as-a-Judge validation techniques (a generic sketch of this pattern appears after the list)

    Language: Jupyter Notebook
  • HillPhelmuth/LlmAsJudgeEvalPlugins

    LLM-as-judge evals as Semantic Kernel Plugins

    Language: C#
  • PKU-ONELab/LLM-evaluator-reliability

    The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?

    Language: Python
  • UMass-Meta-LLM-Eval/llm_eval

    A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup, revealing new findings about its strengths and weaknesses.

    Language: Python
  • Alab-NII/llm-judge-extract-qa

    LLM-as-a-judge for Extractive QA datasets

    Language: Python
  • romaingrx/llm-as-a-jailbreak-judge

    Explores techniques for using small models as jailbreak judges

    Language: Python
  • emory-irlab/conqret-rag

    Controversial Questions for Argumentation and Retrieval

    Language: Python
  • fcn06/swarm

    A multi-agent systems framework written in Rust. Domain agents (specialists) can use tools; workflow agents can load or define a workflow and monitor its execution; LLM-as-a-Judge is used for evaluation; and Discovery and Memory services support agent interactions.

    Language: Rust
  • trotacodigos/Rubric-MQM

    The code for the ACL 2025 paper "RUBRIC-MQM: Span-Level LLM-as-judge in Machine Translation for High-End Models"

    Language: Python
  • aws-samples/model-as-a-judge-eval

    Notebooks for evaluating LLM-based applications using the model-as-a-judge (LLM-as-a-judge) pattern.

    Language: Jupyter Notebook
  • djokester/groqeval

    Use Groq for evaluations

    Language: Python
  • trustyai-explainability/vllm_judge

    A tiny, lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.

    Language: Python
  • noy-sternlicht/Debatable-Intelligence

    Official implementation for the paper "Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation"

    Language: Python
  • PKU-ONELab/NLG-DualEval

    The official repository for our ACL 2025 paper: A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability

    Language: Python
  • Francesco-Sovrano/PROB-SWE

    Replication package for PROBE-SWE: a dynamic benchmark to generate, validate, and analyze data-induced cognitive biases in general-purpose AI (GPAI) on typical software-engineering dilemmas.

    Language: Jupyter Notebook
  • nluninja/studentsbot

    An intelligent chatbot to provide information about courses, exams, services and procedures of the Catholic University using RAG (Retrieval-Augmented Generation) technologies

    Language: Python
  • sshh12/perf-review

    What if AI models were judging your performance review or resume? This system reveals the hidden biases and preferences of AI judges by running competitive tournaments between different writing styles and optimization strategies.

    Language: Python
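
Most of the repositories above implement some variant of the same core pattern: prompt a capable LLM with a rubric, the item under test, and a candidate response, then parse a structured verdict. The sketch below is a minimal, generic illustration of that pattern, not code taken from any of the listed repositories; it assumes the `openai` Python package and an OpenAI-compatible chat endpoint, and the model name, rubric, and 1-5 scale are illustrative choices.

```python
# Minimal LLM-as-a-judge sketch (illustrative; not from any repository above).
# Assumes the `openai` package and an OpenAI-compatible chat endpoint; model
# name, rubric, and scoring scale are arbitrary choices for the example.
import json
from openai import OpenAI

# base_url could instead point at a self-hosted OpenAI-compatible server
# (for example a local vLLM instance) rather than the OpenAI API.
client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Candidate answer: {answer}
Rubric: {rubric}
Return JSON with keys "score" (integer 1-5) and "reason" (one sentence)."""

def judge(question: str, answer: str, rubric: str, model: str = "gpt-4o-mini") -> dict:
    """Score one candidate answer against a rubric using an LLM judge."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variance in the verdict
        response_format={"type": "json_object"},  # request parseable JSON
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer=answer, rubric=rubric)}],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    verdict = judge(
        question="What does HTTP status 404 mean?",
        answer="The server could not find the requested resource.",
        rubric="Factual correctness and completeness.",
    )
    print(verdict["score"], verdict["reason"])
```

A sketch like this is only the scaffolding; the reliability questions around it (judge bias, verifier accuracy, meta-evaluation of the judges themselves) are what many of the benchmarks and papers listed above investigate.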