evaluation-framework
There are 150 repositories under the evaluation-framework topic.
EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of language models.
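A minimal sketch of driving the harness from Python, assuming a recent release that exposes `lm_eval.simple_evaluate`; the model checkpoint, task name, and batch size are placeholders, not recommendations:

```python
import lm_eval

# Few-shot evaluation of a small Hugging Face checkpoint on one registered task.
# Any HF model string and task name known to the harness should work the same way.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=5,
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, etc.) are keyed by task name.
print(results["results"]["hellaswag"])
```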
promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
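A hedged sketch of Giskard's automated vulnerability scan on a tabular classifier, built around the `giskard.Model` / `giskard.Dataset` / `giskard.scan` entry points; the dataframe, prediction function, and column names below are made up for illustration:

```python
import giskard
import numpy as np
import pandas as pd

# Hypothetical data: three rows with two features and a binary target.
df = pd.DataFrame({
    "age": [25, 40, 33],
    "income": [30_000, 80_000, 52_000],
    "churn": [1, 0, 0],
})

def predict_proba(batch: pd.DataFrame) -> np.ndarray:
    # Stand-in for a real model: constant class probabilities per row.
    return np.tile([0.4, 0.6], (len(batch), 1))

wrapped_model = giskard.Model(
    model=predict_proba,
    model_type="classification",
    classification_labels=[0, 1],
    feature_names=["age", "income"],
)
wrapped_dataset = giskard.Dataset(df, target="churn")

# Automated scan for performance bias, robustness, and other issues.
report = giskard.scan(wrapped_model, wrapped_dataset)
report.to_html("scan_report.html")  # assumed export helper; check the Giskard docs
```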
confident-ai/deepeval
The LLM Evaluation Framework
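A brief sketch of deepeval's test-case-plus-metric pattern, assuming an LLM judge is configured (for example, an OPENAI_API_KEY in the environment); the input/output strings and threshold are placeholders:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case pairing a user input with the model's actual output.
test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="Items can be returned within 30 days with a receipt.",
)

# LLM-as-a-judge metric; scores below the threshold fail the test case.
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric against the test case and reports pass/fail results.
evaluate(test_cases=[test_case], metrics=[metric])
```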
MaurizioFD/RecSys2019_DeepLearning_Evaluation
Repository for our RecSys 2019 article "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and several follow-up studies.
huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
TonicAI/tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
athina-ai/athina-evals
Python SDK for running evaluations on LLM-generated responses
Psycoy/MixEval
The official evaluation suite and dynamic data release for MixEval.
diningphil/PyDGN
A research library for automating experiments on Deep Graph Networks
zeno-ml/zeno
AI Data Management & Evaluation Platform
aiverify-foundation/moonshot
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
lartpang/PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
bijington/expressive
Expressive is a cross-platform expression parsing and evaluation framework. Cross-platform support is achieved by targeting .NET Standard, so it runs on practically any platform.
empirical-run/empirical
Test and evaluate LLMs and model configurations across all the scenarios that matter for your application
symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
AI21Labs/lm-evaluation
Evaluation suite for large-scale language models.
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
tsenst/CrowdFlow
Optical Flow Dataset and Benchmark for Visual Crowd Analysis
microsoft/eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
Borda/BIRL
BIRL: Benchmark on Image Registration methods with Landmark validations
haeyeoni/lidar_slam_evaluator
LiDAR SLAM comparison and evaluation framework
hpclab/rankeval
Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.
codefuse-ai/codefuse-evaluation
Industrial-level evaluation benchmarks for coding LLMs across the full life-cycle of AI-native software development. An enterprise-grade evaluation system for code LLMs, with more benchmarks released on an ongoing basis.
BMW-InnovationLab/SORDI-AI-Evaluation-GUI
Evaluate a trained computer vision model and obtain general information and evaluation metrics with minimal configuration.
nouhadziri/DialogEntailment
The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"
pyrddlgym-project/pyRDDLGym
A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.
kaiko-ai/eva
Evaluation framework for oncology foundation models (FMs)
pentoai/vectory
Vectory provides a collection of tools to track and compare embedding versions.
jinzhuoran/RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
ashafaei/OD-test
OD-test: A Less Biased Evaluation of Out-of-Distribution (Outlier) Detectors (PyTorch)
OPTML-Group/Diffusion-MU-Attack
The official implementation of the ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces a fast and effective attack method for evaluating the harmful-content generation ability of safety-driven unlearned diffusion models.
powerflows/powerflows-dmn
Power Flows DMN - A powerful decision and rules engine
sb-ai-lab/Sim4Rec
Simulator for training and evaluation of Recommender Systems
SpikeInterface/spiketoolkit
Python-based tools for pre-processing, post-processing, validating, and curating spike-sorting datasets.