llm-as-a-judge

There are 42 repositories under the llm-as-a-judge topic.

  • Agenta-AI/agenta

    The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

    Language: Python
  • prometheus-eval/prometheus-eval

    Evaluate your LLM's response with Prometheus and GPT-4 💯

    Language: Python
  • metauto-ai/agent-as-a-judge

    👩‍⚖️ Coding Agent-as-a-Judge

    Language: Python
  • haizelabs/verdict

    Inference-time scaling for LLMs-as-a-judge.

    Language: Jupyter Notebook
  • IAAR-Shanghai/xFinder

    [ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

    Language: Python
  • IAAR-Shanghai/xVerify

    xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

    Language: Python
  • martin-wey/CodeUltraFeedback

    CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)

    Language: Python
  • KID-22/LLM-IR-Bias-Fairness-Survey

    The repository for a survey of Bias and Fairness in Information Retrieval (IR) with LLMs.

  • lupantech/ineqmath

    Solving Inequality Proofs with Large Language Models.

    Language: Python
  • MJ-Bench/MJ-Bench

    Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

    Language: Jupyter Notebook
  • whitecircle-ai/circle-guard-bench

    First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

    Language: Python
  • docling-project/docling-sdg

    A set of tools for creating synthetically generated data from documents

    Language: Python
  • zhaochen0110/Timo

    Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

    Language: Python
  • minnesotanlp/cobbler

    Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

    Language: Jupyter Notebook
  • PKU-ONELab/Themis

    The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.

    Language: Python
  • OussamaSghaier/CuREV

    Harnessing Large Language Models for Curated Code Reviews

    Language: Python
  • OtherVibes/mcp-as-a-judge

    MCP as a Judge is a behavioral MCP (Model Context Protocol) server that strengthens AI coding assistants by requiring explicit LLM evaluations

    Language: Python
  • root-signals/rs-sdk

    Root Signals SDK

    Language: Python
  • root-signals/root-signals-mcp

    MCP for Root Signals Evaluation Platform

    Language: Python
  • aws-samples/genai-system-evaluation

    A set of examples demonstrating how to evaluate Generative AI-augmented systems using traditional information retrieval and LLM-as-a-Judge validation techniques (a generic sketch of this pattern appears after the list)

    Language: Jupyter Notebook
  • HillPhelmuth/LlmAsJudgeEvalPlugins

    LLM-as-judge evals as Semantic Kernel Plugins

    Language: C#
  • PKU-ONELab/LLM-evaluator-reliability

    The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?

    Language: Python
  • UMass-Meta-LLM-Eval/llm_eval

    A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup, revealing new findings about its strengths and weaknesses.

    Language: Python
  • Alab-NII/llm-judge-extract-qa

    LLM-as-a-judge for Extractive QA datasets

    Language: Python
  • romaingrx/llm-as-a-jailbreak-judge

    Explores techniques for using small models as jailbreak judges

    Language: Python
  • emory-irlab/conqret-rag

    Controversial Questions for Argumentation and Retrieval

    Language: Python
  • fcn06/swarm

    A multi-agent systems framework written in Rust. Domain agents (specialists) can use tools; workflow agents can load or define a workflow and monitor its execution; LLM-as-a-Judge is used for evaluation; and Discovery and Memory services support agent interactions.

    Language: Rust
  • trotacodigos/Rubric-MQM

    The code for the ACL 2025 paper "RUBRIC-MQM: Span-Level LLM-as-judge in Machine Translation for High-End Models"

    Language: Python
  • aws-samples/model-as-a-judge-eval

    Notebooks for evaluating LLM-based applications using the model-as-a-judge (LLM-as-a-judge) pattern.

    Language: Jupyter Notebook
  • djokester/groqeval

    Use Groq for evaluations

    Language: Python
  • trustyai-explainability/vllm_judge

    A tiny, lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.

    Language: Python
  • noy-sternlicht/Debatable-Intelligence

    Official implementation for the paper "Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation"

    Language: Python
  • PKU-ONELab/NLG-DualEval

    The official repository for our ACL 2025 paper: A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability

    Language: Python
  • Francesco-Sovrano/PROB-SWE

    Replication package for PROBE-SWE: a dynamic benchmark to generate, validate, and analyze data-induced cognitive biases in general-purpose AI (GPAI) on typical software-engineering dilemmas.

    Language: Jupyter Notebook
  • nluninja/studentsbot

    An intelligent chatbot to provide information about courses, exams, services and procedures of the Catholic University using RAG (Retrieval-Augmented Generation) technologies

    Language: Python
  • sshh12/perf-review

    What if AI models were judging your performance review or resume? This system reveals the hidden biases and preferences of AI judges by running competitive tournaments between different writing styles and optimization strategies.

    Language: Python
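
Most of the repositories above implement some variant of the same core pattern: prompt a capable LLM with a rubric, the item under test, and a candidate response, then parse a structured verdict. The sketch below is a minimal, generic illustration of that pattern, not code taken from any of the listed repositories; it assumes the `openai` Python package and an OpenAI-compatible chat endpoint, and the model name, rubric, and 1-5 scale are illustrative choices.

```python
# Minimal LLM-as-a-judge sketch (illustrative; not from any repository above).
# Assumes the `openai` package and an OpenAI-compatible chat endpoint; model
# name, rubric, and scoring scale are arbitrary choices for the example.
import json
from openai import OpenAI

# base_url could instead point at a self-hosted OpenAI-compatible server
# (for example a local vLLM instance) rather than the OpenAI API.
client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Candidate answer: {answer}
Rubric: {rubric}
Return JSON with keys "score" (integer 1-5) and "reason" (one sentence)."""

def judge(question: str, answer: str, rubric: str, model: str = "gpt-4o-mini") -> dict:
    """Score one candidate answer against a rubric using an LLM judge."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variance in the verdict
        response_format={"type": "json_object"},  # request parseable JSON
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer=answer, rubric=rubric)}],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    verdict = judge(
        question="What does HTTP status 404 mean?",
        answer="The server could not find the requested resource.",
        rubric="Factual correctness and completeness.",
    )
    print(verdict["score"], verdict["reason"])
```

A sketch like this is only the scaffolding; the reliability questions around it (judge bias, verifier accuracy, meta-evaluation of the judges themselves) are what many of the benchmarks and papers listed above investigate.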