llm-evaluation
There are 144 repositories under the llm-evaluation topic.
langfuse/langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
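Langfuse instruments applications through SDKs; below is a minimal tracing sketch assuming the v2-era Python decorator API (`langfuse.decorators.observe`), which may differ in newer SDK versions. The `call_llm` stub is hypothetical and stands in for a real model call.

```python
# Hedged sketch of Langfuse tracing via the decorator API; verify the
# import path against the installed SDK version. Requires
# LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST to be set.
from langfuse.decorators import observe

def call_llm(question: str) -> str:
    # Placeholder for a real model call (OpenAI SDK, LiteLLM, LlamaIndex, ...),
    # which Langfuse can also auto-instrument through its integrations.
    return f"(stub answer to: {question})"

@observe()  # nested calls appear as spans under the parent trace
def answer(question: str) -> str:
    return call_llm(question)

@observe()  # records this function call as a trace in Langfuse
def pipeline(question: str) -> str:
    return answer(question)

if __name__ == "__main__":
    print(pipeline("What does Langfuse do?"))
```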
promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
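Giskard's Python library centers on wrapping a model and running an automated scan. The sketch below assumes the documented `giskard.Model` / `giskard.scan` pattern for text-generation models and uses a stubbed prediction function instead of a real LLM; check names and arguments against the installed version.

```python
# Hedged sketch of a Giskard LLM scan. The LLM-assisted detectors
# typically require a judge model to be configured (e.g. an OpenAI key).
import giskard
import pandas as pd

def predict(df: pd.DataFrame) -> list:
    # Stub predictor: replace with calls to your real LLM or RAG pipeline.
    return [f"(stub answer to: {q})" for q in df["question"]]

model = giskard.Model(
    model=predict,
    model_type="text_generation",
    name="Demo QA bot",
    description="Toy question-answering bot used to illustrate a scan.",
    feature_names=["question"],
)

# Runs Giskard's automated quality/vulnerability detectors on the model.
report = giskard.scan(model)
report.to_html("giskard_scan.html")
```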
confident-ai/deepeval
The LLM Evaluation Framework
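DeepEval structures evaluation around test cases and metrics. A minimal sketch following its documented quickstart pattern is below; the threshold value is illustrative, and LLM-judged metrics assume a judge model is configured (e.g. `OPENAI_API_KEY` in the environment).

```python
# Minimal DeepEval sketch: one test case scored by one LLM-judged metric.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the return policy?",
    actual_output="You can return any item within 30 days for a full refund.",
)

metric = AnswerRelevancyMetric(threshold=0.7)  # threshold is illustrative
evaluate(test_cases=[test_case], metrics=[metric])
```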
NVIDIA/garak
the LLM vulnerability scanner
Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Helicone/helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
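Helicone's "one line of code" integration is essentially a base-URL swap on the provider client plus an auth header. The sketch below assumes the OpenAI Python SDK (v1+) routed through Helicone's OpenAI proxy; endpoint and header names follow Helicone's documented setup but should be verified against current docs.

```python
# Hedged sketch: route OpenAI traffic through Helicone for observability.
# Requires OPENAI_API_KEY and HELICONE_API_KEY in the environment.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # Helicone proxy for OpenAI
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Hello from Helicone!"}],
)
print(resp.choices[0].message.content)
```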
PacktPublishing/LLM-Engineers-Handbook
A practical guide to LLMs: from the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
Agenta-AI/agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM Observability all in one place.
lmnr-ai/lmnr
Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, evals, datasets, labels. YC S24.
microsoft/prompty
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
onejune2018/Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for LLM evaluation, aiming to explore the technical boundaries of generative AI.
kimtth/awesome-azure-openai-llm
A curated list of 🌌 Azure OpenAI, 🦙 Large Language Models, and references, with notes.
palico-ai/palico-ai
Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework
Value4AI/Awesome-LLM-in-Social-Science
Awesome papers involving LLMs in Social Science.
athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Psycoy/MixEval
The official evaluation suite and dynamic data release for MixEval.
iMeanAI/WebCanvas
Connect agents to live web environments for evaluation.
PetroIvaniuk/llms-tools
A list of LLMs Tools & Projects
villagecomputing/superpipe
Superpipe - optimized LLM pipelines for structured data
rungalileo/hallucination-index
Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.
kolenaIO/autoarena
Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation
raga-ai-hub/raga-llm-hub
Framework for LLM evaluation, guardrails and security
allenai/CommonGen-Eval
Evaluating LLMs with CommonGen-Lite
hkust-nlp/dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
loganrjmurphy/LeanEuclid
LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.
Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
alopatenko/LLMEvaluation
A comprehensive guide to LLM evaluation methods, designed to help identify the most suitable evaluation techniques for various use cases, promote best practices in LLM assessment, and critically examine the effectiveness of these methods.
Addepto/contextcheck
MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.
deshwalmahesh/PHUDGE
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative mode, and more. It also contains a list of available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more.
azminewasi/Awesome-LLMs-ICLR-24
A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) 2024.
AntonioGr7/pratical-llms
A collection of hands-on notebooks for LLM practitioners
Babelscape/ALERT
Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Chainlit/literalai-cookbooks
Cookbooks and tutorials on Literal AI