llm-as-a-judge
There are 42 repositories under the llm-as-a-judge topic.
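Most of the repositories below share one core pattern: prompt a strong model to grade another model's output against a rubric. A minimal sketch of that pattern, using the openai client (the model name and rubric here are illustrative, not taken from any specific project):

```python
# Minimal sketch of the LLM-as-a-judge pattern shared by these repositories.
# Model name and rubric are illustrative assumptions, not from any repo below.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator. Rate the response below
on a 1-5 scale for factual accuracy and helpfulness.
Question: {question}
Response: {response}
Reply with only the integer score."""

def judge(question: str, response: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to score a response; returns the parsed 1-5 score."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response=response)}],
    )
    return int(completion.choices[0].message.content.strip())

# Example: judge("What is 2+2?", "The answer is 4.") should return a high score.
```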
Agenta-AI/agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
prometheus-eval/prometheus-eval
Evaluate your LLM's response with Prometheus and GPT-4 💯
metauto-ai/agent-as-a-judge
👩‍⚖️ Coding Agent-as-a-Judge
haizelabs/verdict
Inference-time scaling for LLMs-as-a-judge.
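Inference-time scaling for a judge typically means sampling the judge multiple times and aggregating the verdicts. The sketch below shows one common form, a self-consistency majority vote; `judge_once` is a hypothetical callable, and this is an illustration of the general idea, not Verdict's actual API:

```python
# Hypothetical self-consistency sketch: sample the judge k times (at nonzero
# temperature) and take the majority verdict. Illustrates the general idea of
# inference-time scaling for judges; not the Verdict library's API.
from collections import Counter
from typing import Callable

def scaled_verdict(judge_once: Callable[[], str], k: int = 5) -> str:
    """Run the judge k times and return the most common verdict."""
    votes = Counter(judge_once() for _ in range(k))
    return votes.most_common(1)[0][0]

# Example with a stubbed judge:
# import random
# print(scaled_verdict(lambda: random.choice(["pass", "pass", "fail"]), k=7))
```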
IAAR-Shanghai/xFinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
IAAR-Shanghai/xVerify
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
martin-wey/CodeUltraFeedback
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
KID-22/LLM-IR-Bias-Fairness-Survey
This is the repo for the survey of Bias and Fairness in IR with LLMs.
lupantech/ineqmath
Solving Inequality Proofs with Large Language Models.
MJ-Bench/MJ-Bench
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
whitecircle-ai/circle-guard-bench
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
docling-project/docling-sdg
A set of tools to create synthetically generated data from documents
zhaochen0110/Timo
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
minnesotanlp/cobbler
Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
PKU-ONELab/Themis
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
OussamaSghaier/CuREV
Harnessing Large Language Models for Curated Code Reviews
OtherVibes/mcp-as-a-judge
MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations
root-signals/rs-sdk
Root Signals SDK
root-signals/root-signals-mcp
MCP for Root Signals Evaluation Platform
aws-samples/genai-system-evaluation
A set of examples demonstrating how to evaluate generative AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques
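For context, the "traditional information retrieval" half of such an evaluation usually comes down to metrics like recall@k computed over retrieved document IDs, with the judge validating answer quality separately. A minimal recall@k helper (hypothetical, not taken from the aws-samples repository):

```python
# Hypothetical helper illustrating the traditional-IR half of such an
# evaluation: recall@k over retrieved document ids. Not from aws-samples.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Example: recall_at_k(["d1", "d3", "d9"], {"d1", "d2"}, k=3) -> 0.5
```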
HillPhelmuth/LlmAsJudgeEvalPlugins
LLM-as-judge evals as Semantic Kernel Plugins
PKU-ONELab/LLM-evaluator-reliability
The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?
UMass-Meta-LLM-Eval/llm_eval
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
Alab-NII/llm-judge-extract-qa
LLM-as-a-judge for Extractive QA datasets
romaingrx/llm-as-a-jailbreak-judge
Explore techniques to use small models as jailbreaking judges
emory-irlab/conqret-rag
Controversial Questions for Argumentation and Retrieval
fcn06/swarm
A multi-agent systems framework written in Rust. Domain agents (specialists) can use tools; workflow agents can load or define a workflow and monitor its execution. LLM-as-a-judge is used for evaluation, and a Discovery Service and Memory Service empower agent interactions.
trotacodigos/Rubric-MQM
The code for ACL 2025 "RUBRIC-MQM: Span-Level LLM-as-judge in Machine Translation for High-End Models"
aws-samples/model-as-a-judge-eval
Notebooks for evaluating LLM-based applications using the model-as-a-judge (LLM-as-a-judge) pattern.
djokester/groqeval
Use Groq for evaluations
trustyai-explainability/vllm_judge
A tiny, lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.
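vLLM serves an OpenAI-compatible HTTP endpoint, so a judge call against a self-hosted model can reuse the standard openai client. The sketch below shows that underlying pattern; the URL and model name are placeholders, and this is the generic vLLM-serving pattern, not the vllm_judge library's own API:

```python
# Sketch of judging against a vLLM-hosted model via its OpenAI-compatible
# server (started with `vllm serve <model>`). URL and model are placeholders;
# this is the generic pattern, not the vllm_judge library's API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model vLLM is serving
    temperature=0,
    messages=[{"role": "user", "content":
               "Score this answer 1-5 for correctness: '2+2=5'. "
               "Reply with the integer only."}],
)
print(resp.choices[0].message.content)
```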
noy-sternlicht/Debatable-Intelligence
Official implementation for the paper "Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation"
PKU-ONELab/NLG-DualEval
The official repository for our ACL 2025 paper: A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
Francesco-Sovrano/PROB-SWE
Replication package for PROBE-SWE: a dynamic benchmark to generate, validate, and analyze data-induced cognitive biases in GPAI on typical software-engineering dilemmas.
nluninja/studentsbot
An intelligent chatbot that provides information about courses, exams, services, and procedures at the Catholic University, using RAG (Retrieval-Augmented Generation) technologies
sshh12/perf-review
What if AI models were judging your performance review or resume? This system reveals the hidden biases and preferences of AI judges by running competitive tournaments between different writing styles and optimization strategies.