llm-evaluation

There are 60 repositories under the llm-evaluation topic.

  • langfuse/langfuse

    🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

    Language: TypeScript · 4.1k stars
  • Giskard-AI/giskard

    🐢 Open-Source Evaluation & Testing for LLMs and ML models

    Language: Python · 3.5k stars
  • promptfoo/promptfoo

    Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration (a config-style sketch of the idea appears after this list).

    Language: TypeScript · 3.2k stars
  • confident-ai/deepeval

    The LLM Evaluation Framework

    Language: Python · 2.1k stars
  • Agenta-AI/agenta

    The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment in one place.

    Language: Python
  • relari-ai/continuous-eval

    Open-Source Evaluation for GenAI Application Pipelines

    Language: Python
  • onejune2018/Awesome-LLM-Eval

    Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of foundation LLMs, aiming to explore the technical boundaries of generative AI.

  • Value4AI/Awesome-LLM-in-Social-Science

    Awesome papers involving LLMs in Social Science.

  • athina-ai/athina-evals

    Python SDK for running evaluations on LLM-generated responses

    Language: Python
  • Psycoy/MixEval

    The official evaluation suite and dynamic data release for MixEval.

    Language: Python
  • villagecomputing/superpipe

    Superpipe - optimized LLM pipelines for structured data

    Language: Python
  • allenai/CommonGen-Eval

    Evaluating LLMs with CommonGen-Lite

    Language: Python
  • raga-ai-hub/raga-llm-hub

    Framework for LLM evaluation, guardrails and security

    Language: Python
  • PetroIvaniuk/llms-tools

    A list of LLM tools & projects

  • Re-Align/just-eval

    A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

    Language: Python
  • rungalileo/hallucination-index

    Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.

  • loganrjmurphy/LeanEuclid

    LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.

    Language: Lean
  • deshwalmahesh/PHUDGE

    Official repo for the paper "PHUDGE: Phi-3 as Scalable Judge". Evaluate your LLMs with or without a custom rubric or reference answer, using absolute or relative grading, and more. It also collects available tools, methods, repos, and code for hallucination detection, LLM evaluation, and grading (a generic LLM-as-judge sketch appears after this list).

    Language: Jupyter Notebook
  • parea-ai/parea-sdk-py

    Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

    Language: Python
  • AntonioGr7/pratical-llms

    A collection of hands-on notebooks for LLM practitioners

    Language: Jupyter Notebook
  • ChanLiang/CONNER

    The implementation for the EMNLP 2023 paper "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators"

    Language: Python
  • LLM-Evaluation-s-Always-Fatiguing/leaf-playground

    A framework for building scenario-simulation projects in which both human and LLM-based agents can participate, with a user-friendly web UI for visualizing simulations and support for automatic evaluation at the agent-action level.

    Language: Python
  • intuit-ai-research/DCR-consistency

    DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

    Language: Python
  • Babelscape/ALERT

    Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"

    Language: Python
  • minnesotanlp/cobbler

    Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

    Language: Jupyter Notebook
  • azminewasi/Awesome-LLMs-ICLR-24

    A comprehensive resource hub compiling all the LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.

  • aws-samples/fm-leaderboarder

    FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts (a minimal win-rate ranking sketch appears after this list).

    Language: Python
  • Chainlit/literal-cookbook

    Cookbooks and tutorials on Literal AI

    Language: Jupyter Notebook
  • VITA-Group/llm-kick

    [ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. "Compressing LLMs: The Truth Is Rarely Pure and Never Simple."

    Language: Python
  • Praful932/llmsearch

    Find better generation parameters for your LLM

    Language: Python
  • evaluation-tools/nutcracker

    Large Model Evaluation Experiments

    Language: Python
  • kwinkunks/promptly

    A prompt collection for testing and evaluation of LLMs.

    Language: Jupyter Notebook
  • yandex-research/mind-your-format

    Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements

    Language: Jupyter Notebook
  • euskoog/openai-assistants-link

    Link your OpenAI Assistants to a custom store and evaluate Assistant responses

    Language: Python
  • Networks-Learning/prediction-powered-ranking

    Code for the paper "Prediction-Powered Ranking of Large Language Models", arXiv 2024.

    Language: Python
  • zhuohaoyu/KIEval

    [ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

    Language: Python
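
A few of the tools above lend themselves to short illustrations. promptfoo, for example, is driven by declarative configs and a CLI; the Python sketch below (referenced from the promptfoo entry) mimics the same idea of a config that pairs prompts, models, and assertions. It is a generic, hypothetical sketch, not promptfoo's actual API: `call_model` is a placeholder you would replace with a real client call (OpenAI SDK, LiteLLM, etc.), and the model names are made up.

```python
# Hypothetical declarative prompt/model comparison, in the spirit of promptfoo.
# Nothing here is promptfoo's API; call_model and the model names are placeholders.

CONFIG = {
    "prompts": [
        "Summarize in one sentence: {text}",
        "TL;DR: {text}",
    ],
    "models": ["model-a", "model-b"],  # placeholder identifiers
    "tests": [
        {"vars": {"text": "The cat sat on the mat."}, "must_contain": "cat"},
        {"vars": {"text": "Water boils at 100 C at sea level."}, "must_contain": "100"},
    ],
}

def call_model(model: str, prompt: str) -> str:
    """Placeholder: swap in a real API call."""
    return f"[{model}] echo: {prompt}"

def run_suite(config: dict) -> None:
    # prompts x models x assertions: run every combination and count passes.
    for model in config["models"]:
        passed, total = 0, 0
        for template in config["prompts"]:
            for test in config["tests"]:
                prompt = template.format(**test["vars"])
                output = call_model(model, prompt)
                ok = test["must_contain"].lower() in output.lower()
                passed += ok
                total += 1
                print(f"{model} | {'PASS' if ok else 'FAIL'} | {prompt[:40]}")
        print(f"{model}: {passed}/{total} checks passed\n")

if __name__ == "__main__":
    run_suite(CONFIG)
```

In promptfoo itself the equivalent configuration lives in a YAML file and the comparison runs from the CLI; the point here is only the shape of the workflow, not its interface.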
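Several other entries (PHUDGE, just-eval, deepeval, athina-evals) are built around the LLM-as-judge pattern: a grader model scores a candidate answer against a rubric and, optionally, a reference answer. The sketch below, referenced from the PHUDGE entry, shows the pattern generically; `judge` is a stub for whatever judge-model call you use, and the rubric wording is illustrative rather than taken from any of these projects.

```python
import json

RUBRIC = """Score the ANSWER from 1 (poor) to 5 (excellent) on each criterion:
- helpfulness: does it address the question?
- factuality: is it consistent with the REFERENCE, if one is given?
Return JSON such as {"helpfulness": 4, "factuality": 5, "rationale": "..."}."""

def judge(prompt: str) -> str:
    """Stub for a judge-model call (e.g. GPT-4 or Phi-3); returns JSON text."""
    return '{"helpfulness": 4, "factuality": 3, "rationale": "stubbed verdict"}'

def grade(question: str, answer: str, reference: str | None = None) -> dict:
    # Assemble the grading prompt: rubric, question, answer, optional reference.
    prompt = (
        f"{RUBRIC}\n\nQUESTION: {question}\nANSWER: {answer}\n"
        + (f"REFERENCE: {reference}\n" if reference else "")
    )
    try:
        return json.loads(judge(prompt))
    except json.JSONDecodeError:
        # Judge output that fails to parse is recorded rather than silently dropped.
        return {"helpfulness": None, "factuality": None, "rationale": "unparseable"}

print(grade(
    "What is the boiling point of water at sea level?",
    "About 100 degrees Celsius.",
    reference="100 °C (212 °F) at 1 atm.",
))
```

Absolute grading (score one answer against a rubric) and relative grading (ask the judge to pick between two answers) are the two variants most of these frameworks expose; the stub above covers only the absolute case.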
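Finally, leaderboard-oriented projects such as fm-leaderboarder and hallucination-index aggregate many per-example judgments into a ranking. The sketch below, referenced from the fm-leaderboarder entry, is a generic illustration (not any listed project's actual methodology): it turns pairwise judge verdicts into per-model win rates.

```python
from collections import defaultdict

# Hypothetical pairwise verdicts (model_x, model_y, winner) produced by a judge model.
verdicts = [
    ("model-a", "model-b", "model-a"),
    ("model-a", "model-c", "model-c"),
    ("model-b", "model-c", "model-c"),
    ("model-a", "model-b", "model-a"),
]

wins: dict[str, int] = defaultdict(int)
games: dict[str, int] = defaultdict(int)
for x, y, winner in verdicts:
    games[x] += 1
    games[y] += 1
    wins[winner] += 1

# Rank models by win rate; more comparisons make the estimate more stable.
leaderboard = sorted(games, key=lambda m: wins[m] / games[m], reverse=True)
for model in leaderboard:
    rate = wins[model] / games[model]
    print(f"{model}: {rate:.0%} win rate over {games[model]} comparisons")
```

A production leaderboard would typically add uncertainty estimates or an Elo/Bradley-Terry-style fit on top of the raw win rates.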