evaluation-framework
There are 230 repositories under the evaluation-framework topic.
confident-ai/deepeval
The LLM Evaluation Framework
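A minimal sketch of how an evaluation might look with deepeval's documented Python API (LLMTestCase, AnswerRelevancyMetric, and evaluate); the question and answer strings below are illustrative assumptions, and the relevancy metric needs an LLM-judge backend configured (e.g. an OpenAI API key).

# Minimal deepeval sketch: score one response for answer relevancy.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Wrap a single model response as a test case (strings are illustrative).
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
)

# threshold is the pass/fail cutoff for the metric score.
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])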
EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of language models.
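A minimal sketch of a harness run via its Python entry point (lm_eval.simple_evaluate); the checkpoint and task names are illustrative assumptions, and the lm_eval command-line interface exposes the same options.

# Minimal lm-evaluation-harness sketch: zero-shot HellaSwag on a small HF model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any causal LM checkpoint (illustrative)
    tasks=["hellaswag"],           # benchmark task(s) to run
    num_fewshot=0,                 # number of in-context examples
    batch_size=8,
)
print(results["results"])          # per-task metric scores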
promptfoo/promptfoo
Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MaurizioFD/RecSys2019_DeepLearning_Evaluation
The repository for our RecSys 2019 article "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and for several follow-up studies.
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
ServiceNow/AgentLab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
TonicAI/tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
athina-ai/athina-evals
Python SDK for running evaluations on LLM-generated responses
aiverify-foundation/moonshot
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
diningphil/PyDGN
A research library for automating experiments on Deep Graph Networks
zeno-ml/zeno
AI Data Management & Evaluation Platform
lartpang/PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework for comparing and evolving the code-generation quality of LLMs.
bijington/expressive
Expressive is a cross-platform expression parsing and evaluation framework. Its cross-platform nature is achieved by compiling for .NET Standard, so it runs on practically any platform.
microsoft/eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
empirical-run/empirical
Test and evaluate LLMs and model configurations across all the scenarios that matter for your application
alibaba-damo-academy/MedEvalKit
MedEvalKit: A Unified Medical Evaluation Framework
HKUSTDial/NL2SQL360
🔥 [VLDB'24] Official repository for the paper "The Dawn of Natural Language to SQL: Are We Fully Ready?"
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
AI21Labs/lm-evaluation
Evaluation suite for large-scale language models.
kaiko-ai/eva
Evaluation framework for oncology foundation models (FMs)
EuroEval/EuroEval
The robust European language model benchmark.
tsenst/CrowdFlow
Optical Flow Dataset and Benchmark for Visual Crowd Analysis
X-PLUG/WritingBench
WritingBench: A Comprehensive Benchmark for Generative Writing
codefuse-ai/codefuse-evaluation
Industrial-grade evaluation benchmarks for coding LLMs covering the full life cycle of AI-native software development. An enterprise-grade evaluation suite for code LLMs, with more benchmarks released on an ongoing basis.
haeyeoni/lidar_slam_evaluator
LiDAR SLAM comparison and evaluation framework
Borda/BIRL
BIRL: Benchmark on Image Registration methods with Landmark validations
hpclab/rankeval
Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.
pyrddlgym-project/pyRDDLGym
A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.
jinzhuoran/RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models (NeurIPS 2024)
OPTML-Group/Diffusion-MU-Attack
The official implementation of the ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces a fast and effective attack method for evaluating the harmful-content generation ability of safety-driven unlearned diffusion models.
nouhadziri/DialogEntailment
The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"
pentoai/vectory
Vectory provides a collection of tools to track and compare embedding versions.
ashafaei/OD-test
OD-test: A Less Biased Evaluation of Out-of-Distribution (Outlier) Detectors (PyTorch)