evaluation-framework
There are 230 repositories under the evaluation-framework topic.
confident-ai/deepeval
The LLM Evaluation Framework
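A minimal sketch of how an evaluation might look with deepeval's documented Python API (LLMTestCase, AnswerRelevancyMetric, and evaluate); the question and answer strings below are illustrative assumptions, and the relevancy metric needs an LLM-judge backend configured (e.g. an OpenAI API key).

# Minimal deepeval sketch: score one response for answer relevancy.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Wrap a single model response as a test case (strings are illustrative).
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
)

# threshold is the pass/fail cutoff for the metric score.
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])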
EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of language models.
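A minimal sketch of a harness run via its Python entry point (lm_eval.simple_evaluate); the checkpoint and task names are illustrative assumptions, and the lm_eval command-line interface exposes the same options.

# Minimal lm-evaluation-harness sketch: zero-shot HellaSwag on a small HF model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any causal LM checkpoint (illustrative)
    tasks=["hellaswag"],           # benchmark task(s) to run
    num_fewshot=0,                 # number of in-context examples
    batch_size=8,
)
print(results["results"])          # per-task metric scores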
promptfoo/promptfoo
Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MaurizioFD/RecSys2019_DeepLearning_Evaluation
The repository for our RecSys 2019 article "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and for several follow-up studies.
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
ServiceNow/AgentLab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
TonicAI/tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
athina-ai/athina-evals
Python SDK for running evaluations on LLM-generated responses
aiverify-foundation/moonshot
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
diningphil/PyDGN
A research library for automating experiments on Deep Graph Networks
zeno-ml/zeno
AI Data Management & Evaluation Platform
lartpang/PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework for comparing and evolving the code-generation quality of LLMs.
bijington/expressive
Expressive is a cross-platform expression parsing and evaluation framework. Its cross-platform nature is achieved by compiling for .NET Standard, so it runs on practically any platform.
microsoft/eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
empirical-run/empirical
Test and evaluate LLMs and model configurations across all the scenarios that matter for your application
alibaba-damo-academy/MedEvalKit
MedEvalKit: A Unified Medical Evaluation Framework
HKUSTDial/NL2SQL360
🔥 [VLDB'24] Official repository for the paper "The Dawn of Natural Language to SQL: Are We Fully Ready?"
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
AI21Labs/lm-evaluation
Evaluation suite for large-scale language models.
kaiko-ai/eva
Evaluation framework for oncology foundation models (FMs)
EuroEval/EuroEval
The robust European language model benchmark.
tsenst/CrowdFlow
Optical Flow Dataset and Benchmark for Visual Crowd Analysis
X-PLUG/WritingBench
WritingBench: A Comprehensive Benchmark for Generative Writing
codefuse-ai/codefuse-evaluation
Industrial-grade evaluation benchmarks for coding LLMs covering the full life cycle of AI-native software development. An enterprise-grade evaluation suite for code LLMs, with more benchmarks released on an ongoing basis.
haeyeoni/lidar_slam_evaluator
LiDAR SLAM comparison and evaluation framework
Borda/BIRL
BIRL: Benchmark on Image Registration methods with Landmark validations
hpclab/rankeval
Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.
pyrddlgym-project/pyRDDLGym
A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.
jinzhuoran/RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models (NeurIPS 2024)
OPTML-Group/Diffusion-MU-Attack
The official implementation of the ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces a fast and effective attack method for evaluating the harmful-content generation ability of safety-driven unlearned diffusion models.
nouhadziri/DialogEntailment
The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"
pentoai/vectory
Vectory provides a collection of tools to track and compare embedding versions.
ashafaei/OD-test
OD-test: A Less Biased Evaluation of Out-of-Distribution (Outlier) Detectors (PyTorch)