A Collection of Automated Language Model Assessment Tools
Project Name | GitHub Link | Description |
---|---|---|
PandaLM | PandaLM | A judge language model for reproducible, automated evaluation, designed to compare the outputs of both foundation models and instruction-tuned chat models. |
MiniCheck | MiniCheck | A lightweight fact-checking toolkit for verifying whether model outputs are factually consistent with their grounding documents. |
ChatEval | ChatEval | An evaluation framework designed specifically for chatbots, including both automatic and human evaluation methods. |
auto-j | auto-j | A generative judge model that evaluates LLM responses through pairwise comparison and single-response critiques across a broad range of real-world scenarios. |
LLMBar | LLMBar | A meta-evaluation benchmark for testing how reliably LLM-based evaluators can detect instruction-following in model outputs. |
JudgeLM | JudgeLM | Fine-tuned judge language models for scoring and comparing the outputs of other LLMs on open-ended benchmarks. |
LAMM | LAMM | A tool that focuses on evaluating the factual consistency of models, providing test datasets and evaluation metrics. |
Prometheus | Prometheus | An open-source evaluator language model that scores responses against fine-grained, user-defined rubrics, released with its training data and evaluation code. |
PCRM | PCRM | A Prompt-based Chat Model evaluation method that provides evaluation metrics and code. |
TIGERScore | TIGERScore | A trained, reference-free metric that produces explainable error analyses for model-generated text across a variety of generation tasks. |
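
Most of the judge-style projects above share the same core loop: format the candidate responses into a judging prompt, query a judge model, and parse its verdict. The sketch below illustrates that generic pairwise-judging pattern only; `call_judge_model` and the prompt wording are hypothetical placeholders, not the API of any project in this list, so swap in the actual inference call and prompt template of whichever framework you use.

```python
from typing import Literal

# Illustrative prompt template; real frameworks (PandaLM, JudgeLM, auto-j, ...)
# ship their own carefully tuned templates.
JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the instruction.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Answer with exactly one of: A, B, or TIE."""


def call_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for the judge model (a local checkpoint or a hosted
    API). Replace this with a real inference call."""
    raise NotImplementedError("plug in your judge model here")


def pairwise_verdict(
    instruction: str, response_a: str, response_b: str
) -> Literal["A", "B", "TIE"]:
    """Build the judging prompt, query the judge, and parse its verdict."""
    prompt = JUDGE_PROMPT.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
    raw = call_judge_model(prompt).strip().upper()
    # Fall back to TIE when the judge's output cannot be parsed.
    if raw.startswith("A"):
        return "A"
    if raw.startswith("B"):
        return "B"
    return "TIE"
```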