llm-evaluation-framework
There are 16 repositories under the llm-evaluation-framework topic.
promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
confident-ai/deepeval
The LLM Evaluation Framework
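As a rough illustration of what a deepeval check looks like, here is a minimal sketch assuming deepeval's documented LLMTestCase and AnswerRelevancyMetric interfaces and a configured judge model (e.g. via OPENAI_API_KEY); the question, answer, and threshold are placeholders.

```python
# Minimal deepeval-style check: score one model answer for relevancy.
# Assumes deepeval is installed and an LLM judge is configured.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
)

# Fail the case if the judged relevancy score falls below 0.7.
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```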
Psycoy/MixEval
The official evaluation suite and dynamic data release for MixEval.
parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
zhuohaoyu/KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
aws-samples/fm-leaderboarder
FM-Leaderboard-er allows you to create a leaderboard to find the best LLM/prompt for your own business use case based on your data, tasks, and prompts
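To sketch the general idea (this is not FM-Leaderboard-er's actual API): score each candidate prompt against your own examples and sort by average score. `run_model` is a hypothetical placeholder for a real LLM call.

```python
# Generic leaderboard sketch: rank candidate prompts by average exact-match
# accuracy on your own examples. Names and data are illustrative only.
from statistics import mean

def run_model(prompt_template: str, example: dict) -> str:
    """Hypothetical placeholder for a real LLM call."""
    return example["expected"]  # stub so the sketch runs end to end

def build_leaderboard(prompt_templates, examples):
    """Return (template, average_score) pairs, best first."""
    rows = []
    for template in prompt_templates:
        scores = [
            1.0 if run_model(template, ex) == ex["expected"] else 0.0
            for ex in examples
        ]
        rows.append((template, mean(scores)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

examples = [{"input": "refund request", "expected": "billing"}]
templates = ["Classify this ticket: {input}", "Route this ticket: {input}"]
print(build_leaderboard(templates, examples))
```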
honeyhiveai/realign
Realign is an evaluation and experimentation framework for AI applications.
Networks-Learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", arXiv 2024.
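For context, prediction-powered ranking combines a small set of human pairwise judgments with a much larger set of LLM-judge predictions. The sketch below shows the basic prediction-powered mean estimator applied to a pairwise win rate; the variable names and synthetic data are illustrative and not taken from the repository.

```python
# Sketch of a prediction-powered estimate of a model's pairwise win rate.
import numpy as np

def ppi_win_rate(human_labels, judge_on_labeled, judge_on_unlabeled):
    """Prediction-powered mean estimate of a win rate: the judge's mean over
    the large unlabeled set, corrected by the judge-vs-human gap measured on
    the small labeled set."""
    rectifier = np.mean(np.asarray(human_labels) - np.asarray(judge_on_labeled))
    return float(np.mean(judge_on_unlabeled) + rectifier)

# Synthetic placeholder data: 1 = model A preferred over model B.
rng = np.random.default_rng(0)
judge_on_unlabeled = rng.binomial(1, 0.62, size=5000)  # cheap LLM-judge verdicts
judge_on_labeled = rng.binomial(1, 0.62, size=200)     # judge verdicts on the labeled items
human_labels = rng.binomial(1, 0.55, size=200)         # scarce human verdicts on the same items

print(ppi_win_rate(human_labels, judge_on_labeled, judge_on_unlabeled))
```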
parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
stair-lab/melt
Multilingual Evaluation Toolkits
yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
yuzu-ai/ShinRakuda
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
jaaack-wang/multi-problem-eval-llm
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
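The paradigm here is to pack several problems into a single prompt and grade the answers jointly. A generic sketch follows; the prompt wording and exact-match scoring are illustrative, not the paper's exact protocol.

```python
# Generic multi-problem prompting sketch: bundle questions, grade answers together.
def build_multi_problem_prompt(problems: list[str]) -> str:
    """Bundle several problems into one prompt, one numbered question per line."""
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(problems))
    return (
        "Answer every problem below. Reply with one numbered answer per line.\n\n"
        + numbered
    )

def score(answers: list[str], references: list[str]) -> float:
    """Fraction of problems answered correctly (exact match, for simplicity)."""
    correct = sum(a.strip() == r.strip() for a, r in zip(answers, references))
    return correct / len(references)

print(build_multi_problem_prompt(["2 + 2 = ?", "What is the capital of Japan?"]))
print(score(["4", "Tokyo"], ["4", "Tokyo"]))
```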
yukinagae/genkit-promptfoo-sample
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo
yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models