llm-evaluation-framework
There are 16 repositories under the llm-evaluation-framework topic.
promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
confident-ai/deepeval
The LLM Evaluation Framework
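As a rough illustration of what a deepeval check looks like, here is a minimal sketch assuming deepeval's documented LLMTestCase and AnswerRelevancyMetric interfaces and a configured judge model (e.g. via OPENAI_API_KEY); the question, answer, and threshold are placeholders.

```python
# Minimal deepeval-style check: score one model answer for relevancy.
# Assumes deepeval is installed and an LLM judge is configured.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
)

# Fail the case if the judged relevancy score falls below 0.7.
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```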
Psycoy/MixEval
The official evaluation suite and dynamic data release for MixEval.
parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
zhuohaoyu/KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
aws-samples/fm-leaderboarder
FM-Leaderboard-er allows you to create a leaderboard to find the best LLM/prompt for your own business use case based on your data, tasks, and prompts
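To sketch the general idea (this is not FM-Leaderboard-er's actual API): score each candidate prompt against your own examples and sort by average score. `run_model` is a hypothetical placeholder for a real LLM call.

```python
# Generic leaderboard sketch: rank candidate prompts by average exact-match
# accuracy on your own examples. Names and data are illustrative only.
from statistics import mean

def run_model(prompt_template: str, example: dict) -> str:
    """Hypothetical placeholder for a real LLM call."""
    return example["expected"]  # stub so the sketch runs end to end

def build_leaderboard(prompt_templates, examples):
    """Return (template, average_score) pairs, best first."""
    rows = []
    for template in prompt_templates:
        scores = [
            1.0 if run_model(template, ex) == ex["expected"] else 0.0
            for ex in examples
        ]
        rows.append((template, mean(scores)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

examples = [{"input": "refund request", "expected": "billing"}]
templates = ["Classify this ticket: {input}", "Route this ticket: {input}"]
print(build_leaderboard(templates, examples))
```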
honeyhiveai/realign
Realign is an evaluation and experimentation framework for AI applications.
Networks-Learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", arXiv 2024.
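For context, prediction-powered ranking combines a small set of human pairwise judgments with a much larger set of LLM-judge predictions. The sketch below shows the basic prediction-powered mean estimator applied to a pairwise win rate; the variable names and synthetic data are illustrative and not taken from the repository.

```python
# Sketch of a prediction-powered estimate of a model's pairwise win rate.
import numpy as np

def ppi_win_rate(human_labels, judge_on_labeled, judge_on_unlabeled):
    """Prediction-powered mean estimate of a win rate: the judge's mean over
    the large unlabeled set, corrected by the judge-vs-human gap measured on
    the small labeled set."""
    rectifier = np.mean(np.asarray(human_labels) - np.asarray(judge_on_labeled))
    return float(np.mean(judge_on_unlabeled) + rectifier)

# Synthetic placeholder data: 1 = model A preferred over model B.
rng = np.random.default_rng(0)
judge_on_unlabeled = rng.binomial(1, 0.62, size=5000)  # cheap LLM-judge verdicts
judge_on_labeled = rng.binomial(1, 0.62, size=200)     # judge verdicts on the labeled items
human_labels = rng.binomial(1, 0.55, size=200)         # scarce human verdicts on the same items

print(ppi_win_rate(human_labels, judge_on_labeled, judge_on_unlabeled))
```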
parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
stair-lab/melt
Multilingual Evaluation Toolkits
yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
yuzu-ai/ShinRakuda
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
jaaack-wang/multi-problem-eval-llm
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
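The paradigm here is to pack several problems into a single prompt and grade the answers jointly. A generic sketch follows; the prompt wording and exact-match scoring are illustrative, not the paper's exact protocol.

```python
# Generic multi-problem prompting sketch: bundle questions, grade answers together.
def build_multi_problem_prompt(problems: list[str]) -> str:
    """Bundle several problems into one prompt, one numbered question per line."""
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(problems))
    return (
        "Answer every problem below. Reply with one numbered answer per line.\n\n"
        + numbered
    )

def score(answers: list[str], references: list[str]) -> float:
    """Fraction of problems answered correctly (exact match, for simplicity)."""
    correct = sum(a.strip() == r.strip() for a, r in zip(answers, references))
    return correct / len(references)

print(build_multi_problem_prompt(["2 + 2 = ?", "What is the capital of Japan?"]))
print(score(["4", "Tokyo"], ["4", "Tokyo"]))
```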
yukinagae/genkit-promptfoo-sample
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo
yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models