llm-eval
There are 37 repositories under the llm-eval topic.
promptfoo/promptfoo
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
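To give a sense of the declarative style, a minimal promptfooconfig.yaml might look like the sketch below; the model identifiers and the test case are illustrative assumptions, not taken from the project.

```yaml
# Minimal sketch of a promptfooconfig.yaml; model names and the test case
# are illustrative assumptions, not copied from the promptfoo repository.
prompts:
  - "Answer the question concisely: {{question}}"

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-haiku-20241022

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: icontains
        value: "paris"
```

Running npx promptfoo@latest eval against such a config compares every provider on every test case, which is also the entry point typically wired into CI/CD.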
Arize-ai/phoenix
AI Observability & Evaluation
Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
iterative/datachain
ETL, Analytics, Versioning for Unstructured Data
uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
fiddlecube/fiddlecube-sdk
Generate ideal question-answer pairs for testing RAG
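As a rough illustration of what synthetic QA generation for RAG testing involves, here is a generic sketch using the OpenAI Python client; it is not the FiddleCube SDK, and the model name, prompt, and document are assumptions.

```python
# Generic sketch of synthetic question-answer generation for RAG testing.
# This is NOT the fiddlecube-sdk API; model name, prompt, and document are
# illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pairs(document: str, n: int = 3) -> list[dict]:
    """Ask an LLM to produce question-answer pairs grounded in `document`."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Return JSON: {\"pairs\": [{\"question\": ..., \"answer\": ...}]}"},
            {"role": "user",
             "content": f"Create {n} question-answer pairs answerable only from:\n{document}"},
        ],
    )
    return json.loads(response.choices[0].message.content)["pairs"]

if __name__ == "__main__":
    doc = "The Eiffel Tower was completed in 1889 and is 330 metres tall."
    for pair in generate_qa_pairs(doc):
        print(pair["question"], "->", pair["answer"])
```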
Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
kuk/rulm-sbs2
A benchmark comparing Russian ChatGPT analogs: Saiga, YandexGPT, Gigachat
multinear/multinear
Develop reliable AI apps
Auto-Playground/ragrank
🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
alan-turing-institute/prompto
An open source library for asynchronous querying of LLM endpoints
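Asynchronous querying of this kind usually boils down to something like the following generic asyncio sketch; it is not prompto's own API, and the model name and prompts are assumptions.

```python
# Generic sketch of asynchronous LLM querying; not prompto's API.
# Model name and prompts are illustrative assumptions.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    prompts = ["Define recall.", "Define precision.", "Define F1 score."]
    # Fire all requests concurrently instead of one at a time.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(prompt, "->", answer[:60])

if __name__ == "__main__":
    asyncio.run(main())
```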
Supahands/llm-comparison-backend
An open-source project for comparing two LLMs head to head on a given prompt. This repository covers the backend, which integrates LLM APIs for use by the front-end.
honeyhiveai/realign
Realign is a testing and simulation framework for AI applications.
genia-dev/vibraniumdome
LLM Security Platform.
Networks-Learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
pyladiesams/eval-llm-based-apps-jan2025
Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.
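A minimal version of "incorporate it into your test suite" can look like the pytest sketch below; the application function, test cases, and keyword checks are hypothetical, not the workshop's code.

```python
# Generic sketch of an LLM eval inside a pytest suite; not the workshop's code.
# `generate_answer` stands in for the LLM-backed application under test, and
# the cases and keyword checks are illustrative assumptions.
import pytest

def generate_answer(question: str) -> str:
    # Placeholder for the real LLM-backed application function.
    return "Paris is the capital of France."

CASES = [
    ("What is the capital of France?", ["paris"]),
    ("Which city hosts the Louvre?", ["paris"]),
]

@pytest.mark.parametrize("question,required_keywords", CASES)
def test_answer_contains_expected_keywords(question, required_keywords):
    answer = generate_answer(question).lower()
    missing = [kw for kw in required_keywords if kw not in answer]
    assert not missing, f"answer missing keywords {missing}: {answer!r}"
```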
prompt-foundry/python-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Python
prompt-foundry/typescript-sdk
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
harlev/eva-l
LLM Evaluation Framework
harshagrawal523/GenerativeAgents
Generative agents: computational software agents that simulate believable human behavior using OpenAI LLM models. Our main focus was developing a game, "Werewolves of Miller's Hollow", that aims to replicate human-like behavior.
parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
jaaack-wang/multi-problem-eval-llm
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
yuzu-ai/ShinRakuda
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
genia-dev/vibraniumdome-docs
LLM Security Platform Docs
prompt-foundry/go-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Go.
prompt-foundry/ruby-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Ruby.
yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
prompt-foundry/dotnet-sdk
The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET
prompt-foundry/kotlin-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Kotlin.
cuiyuheng/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
daqh/llm-eval
This project applies the LLM-Eval framework to the PersonaChat dataset to assess response quality in a conversational context. Using GPT-4o-mini via the OpenAI API, the system generates scores (on a 0-5 or 0-100 scale) for four evaluation metrics: context, grammar, relevance, and appropriateness.
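A stripped-down version of that scoring step could look like the following sketch; it assumes the OpenAI Python client and a 0-5 scale, and the prompt wording is an assumption rather than the project's.

```python
# Minimal sketch of LLM-as-judge scoring on four metrics (0-5 scale).
# The prompt wording is an assumption; this is not the daqh/llm-eval code.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
METRICS = ["context", "grammar", "relevance", "appropriateness"]

def score_response(dialogue: str, response: str) -> dict[str, int]:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Rate the response to the dialogue on a 0-5 scale for each of "
                f"{METRICS}. Return JSON with exactly those keys.\n\n"
                f"Dialogue:\n{dialogue}\n\nResponse:\n{response}"
            ),
        }],
    )
    return json.loads(completion.choices[0].message.content)

if __name__ == "__main__":
    print(score_response("A: Do you like hiking?", "Yes, I hike every weekend."))
```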
kdcyberdude/punjabi-llm-eval
First Punjabi LLM Eval.
yukinagae/genkit-promptfoo-sample
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo