llm-eval
There are 37 repositories under the llm-eval topic.
promptfoo/promptfoo
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
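To give a sense of the declarative style, a minimal promptfooconfig.yaml might look like the sketch below; the model identifiers and the test case are illustrative assumptions, not taken from the project.

```yaml
# Minimal sketch of a promptfooconfig.yaml; model names and the test case
# are illustrative assumptions, not copied from the promptfoo repository.
prompts:
  - "Answer the question concisely: {{question}}"

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-haiku-20241022

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: icontains
        value: "paris"
```

Running npx promptfoo@latest eval against such a config compares every provider on every test case, which is also the entry point typically wired into CI/CD.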
Arize-ai/phoenix
AI Observability & Evaluation
Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
iterative/datachain
ETL, Analytics, Versioning for Unstructured Data
uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
fiddlecube/fiddlecube-sdk
Generate ideal question-answer pairs for testing RAG
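As a rough illustration of what synthetic QA generation for RAG testing involves, here is a generic sketch using the OpenAI Python client; it is not the FiddleCube SDK, and the model name, prompt, and document are assumptions.

```python
# Generic sketch of synthetic question-answer generation for RAG testing.
# This is NOT the fiddlecube-sdk API; model name, prompt, and document are
# illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pairs(document: str, n: int = 3) -> list[dict]:
    """Ask an LLM to produce question-answer pairs grounded in `document`."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Return JSON: {\"pairs\": [{\"question\": ..., \"answer\": ...}]}"},
            {"role": "user",
             "content": f"Create {n} question-answer pairs answerable only from:\n{document}"},
        ],
    )
    return json.loads(response.choices[0].message.content)["pairs"]

if __name__ == "__main__":
    doc = "The Eiffel Tower was completed in 1889 and is 330 metres tall."
    for pair in generate_qa_pairs(doc):
        print(pair["question"], "->", pair["answer"])
```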
Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
kuk/rulm-sbs2
A benchmark comparing Russian ChatGPT analogs: Saiga, YandexGPT, Gigachat
multinear/multinear
Develop reliable AI apps
Auto-Playground/ragrank
🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
alan-turing-institute/prompto
An open source library for asynchronous querying of LLM endpoints
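Asynchronous querying of this kind usually boils down to something like the following generic asyncio sketch; it is not prompto's own API, and the model name and prompts are assumptions.

```python
# Generic sketch of asynchronous LLM querying; not prompto's API.
# Model name and prompts are illustrative assumptions.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    prompts = ["Define recall.", "Define precision.", "Define F1 score."]
    # Fire all requests concurrently instead of one at a time.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(prompt, "->", answer[:60])

if __name__ == "__main__":
    asyncio.run(main())
```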
Supahands/llm-comparison-backend
An open-source project for comparing two LLMs head to head on a given prompt. This repository covers the backend, which integrates LLM APIs for use by the front-end.
honeyhiveai/realign
Realign is a testing and simulation framework for AI applications.
genia-dev/vibraniumdome
LLM Security Platform.
Networks-Learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
pyladiesams/eval-llm-based-apps-jan2025
Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.
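A minimal version of "incorporate it into your test suite" can look like the pytest sketch below; the application function, test cases, and keyword checks are hypothetical, not the workshop's code.

```python
# Generic sketch of an LLM eval inside a pytest suite; not the workshop's code.
# `generate_answer` stands in for the LLM-backed application under test, and
# the cases and keyword checks are illustrative assumptions.
import pytest

def generate_answer(question: str) -> str:
    # Placeholder for the real LLM-backed application function.
    return "Paris is the capital of France."

CASES = [
    ("What is the capital of France?", ["paris"]),
    ("Which city hosts the Louvre?", ["paris"]),
]

@pytest.mark.parametrize("question,required_keywords", CASES)
def test_answer_contains_expected_keywords(question, required_keywords):
    answer = generate_answer(question).lower()
    missing = [kw for kw in required_keywords if kw not in answer]
    assert not missing, f"answer missing keywords {missing}: {answer!r}"
```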
prompt-foundry/python-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Python
prompt-foundry/typescript-sdk
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
harlev/eva-l
LLM Evaluation Framework
harshagrawal523/GenerativeAgents
Generative agents: computational software agents that simulate believable human behavior using OpenAI LLM models. Our main focus was developing a game, "Werewolves of Miller's Hollow", that aims to replicate human-like behavior.
parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
jaaack-wang/multi-problem-eval-llm
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
yuzu-ai/ShinRakuda
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
genia-dev/vibraniumdome-docs
LLM Security Platform Docs
prompt-foundry/go-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Go.
prompt-foundry/ruby-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Ruby.
yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
prompt-foundry/dotnet-sdk
The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET
prompt-foundry/kotlin-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Kotlin.
cuiyuheng/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
daqh/llm-eval
This project applies the LLM-Eval framework to the PersonaChat dataset to assess response quality in a conversational context. Using GPT-4o-mini via the OpenAI API, the system generates scores (on a 0-5 or 0-100 scale) for four evaluation metrics: context, grammar, relevance, and appropriateness.
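A stripped-down version of that scoring step could look like the following sketch; it assumes the OpenAI Python client and a 0-5 scale, and the prompt wording is an assumption rather than the project's.

```python
# Minimal sketch of LLM-as-judge scoring on four metrics (0-5 scale).
# The prompt wording is an assumption; this is not the daqh/llm-eval code.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
METRICS = ["context", "grammar", "relevance", "appropriateness"]

def score_response(dialogue: str, response: str) -> dict[str, int]:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Rate the response to the dialogue on a 0-5 scale for each of "
                f"{METRICS}. Return JSON with exactly those keys.\n\n"
                f"Dialogue:\n{dialogue}\n\nResponse:\n{response}"
            ),
        }],
    )
    return json.loads(completion.choices[0].message.content)

if __name__ == "__main__":
    print(score_response("A: Do you like hiking?", "Yes, I hike every weekend."))
```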
kdcyberdude/punjabi-llm-eval
First Punjabi LLM Eval.
yukinagae/genkit-promptfoo-sample
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo