A Collection of Automated Language Model Assessment Tools
Project Name | GitHub Link | Description |
---|---|---|
PandaLM | PandaLM | A judge language model for reproducible, automated evaluation, designed to compare the outputs of both foundation models and instruction-tuned chat models. |
MiniCheck | MiniCheck | A lightweight fact-checking toolkit for verifying whether model outputs are factually consistent with their grounding documents. |
ChatEval | ChatEval | An evaluation framework designed specifically for chatbots, including both automatic and human evaluation methods. |
auto-j | auto-j | A generative judge model that evaluates LLM responses through pairwise comparison and single-response critiques across a broad range of real-world scenarios. |
LLMBar | LLMBar | A meta-evaluation benchmark for testing how reliably LLM-based evaluators can detect instruction-following in model outputs. |
JudgeLM | JudgeLM | Fine-tuned judge language models for scoring and comparing the outputs of other LLMs on open-ended benchmarks. |
LAMM | LAMM | A tool that focuses on evaluating the factual consistency of models, providing test datasets and evaluation metrics. |
Prometheus | Prometheus | An open-source evaluator language model that scores responses against fine-grained, user-defined rubrics, released with its training data and evaluation code. |
PCRM | PCRM | A Prompt-based Chat Model evaluation method that provides evaluation metrics and code. |
TIGERScore | TIGERScore | A trained, reference-free metric that produces explainable error analyses for model-generated text across a variety of generation tasks. |
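
Most of the judge-style projects above share the same core loop: format the candidate responses into a judging prompt, query a judge model, and parse its verdict. The sketch below illustrates that generic pairwise-judging pattern only; `call_judge_model` and the prompt wording are hypothetical placeholders, not the API of any project in this list, so swap in the actual inference call and prompt template of whichever framework you use.

```python
from typing import Literal

# Illustrative prompt template; real frameworks (PandaLM, JudgeLM, auto-j, ...)
# ship their own carefully tuned templates.
JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the instruction.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Answer with exactly one of: A, B, or TIE."""


def call_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for the judge model (a local checkpoint or a hosted
    API). Replace this with a real inference call."""
    raise NotImplementedError("plug in your judge model here")


def pairwise_verdict(
    instruction: str, response_a: str, response_b: str
) -> Literal["A", "B", "TIE"]:
    """Build the judging prompt, query the judge, and parse its verdict."""
    prompt = JUDGE_PROMPT.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
    raw = call_judge_model(prompt).strip().upper()
    # Fall back to TIE when the judge's output cannot be parsed.
    if raw.startswith("A"):
        return "A"
    if raw.startswith("B"):
        return "B"
    return "TIE"
```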