evaluation-framework
There are 122 repositories under the evaluation-framework topic.
EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of language models.
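As a rough illustration of how such a harness is typically invoked, the sketch below uses the simple_evaluate entry point documented by the project; the exact model string, checkpoint, and task name are illustrative assumptions and may differ between releases.

    import lm_eval

    # Score a Hugging Face model on a benchmark task in a 5-shot setting.
    # "hf" selects the Hugging Face backend; the pretrained checkpoint and
    # task name below are placeholder choices, not recommendations.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=EleutherAI/pythia-160m",
        tasks=["hellaswag"],
        num_fewshot=5,
    )
    print(results["results"])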
promptfoo/promptfoo
Test your prompts, models, and RAG pipelines. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local and private models, with CI/CD integration.
confident-ai/deepeval
The LLM Evaluation Framework
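A minimal sketch of deepeval's test-case pattern, following its documented quickstart; the threshold value and example strings are illustrative assumptions, and the relevancy metric relies on an LLM judge, so an API key is typically required.

    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    # A test case pairs an input prompt with the model's actual output;
    # the metric scores answer relevancy and fails the test if the score
    # falls below the threshold.
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])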
MaurizioFD/RecSys2019_DeepLearning_Evaluation
This is the repository of our article published at RecSys 2019, "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches", and of several follow-up studies.
huggingface/lighteval
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside its recently released LLM data processing library datatrove and LLM training library nanotron.
relari-ai/continuous-eval
Open-Source Evaluation for GenAI Application Pipelines
TonicAI/tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
diningphil/PyDGN
A research library for automating experiments on Deep Graph Networks
zeno-ml/zeno
AI Data Management & Evaluation Platform
bijington/expressive
Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved by compiling for .NET Standard, so it runs on practically any platform.
lartpang/PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
empirical-run/empirical
Test and evaluate LLMs and model configurations, across all the scenarios that matter for your application
AI21Labs/lm-evaluation
Evaluation suite for large-scale language models.
tsenst/CrowdFlow
Optical Flow Dataset and Benchmark for Visual Crowd Analysis
Borda/BIRL
BIRL: Benchmark on Image Registration methods with Landmark validations
hpclab/rankeval
Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.
haeyeoni/lidar_slam_evaluator
LiDAR SLAM comparison and evaluation framework
BMW-InnovationLab/SORDI-AI-Evaluation-GUI
This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.
nouhadziri/DialogEntailment
The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"
pentoai/vectory
Vectory provides a collection of tools to track and compare embedding versions.
codefuse-ai/codefuse-evaluation
Industrial-grade evaluation benchmarks for coding LLMs across the full life cycle of AI-native software development; an enterprise-level evaluation system for code LLMs, with more benchmarks continuously being released.
ashafaei/OD-test
OD-test: A Less Biased Evaluation of Out-of-Distribution (Outlier) Detectors (PyTorch)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
powerflows/powerflows-dmn
Power Flows DMN - Powerful decisions and rules engine
aiverify-foundation/moonshot
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
SpikeInterface/spiketoolkit
Python-based tools for pre-, post-processing, validating, and curating spike sorting datasets.
kaiko-ai/eva
Evaluation framework for oncology foundation models (FMs)
sb-ai-lab/Sim4Rec
Simulator for training and evaluation of Recommender Systems
yupidevs/pactus
Framework to evaluate Trajectory Classification Algorithms
kolenaIO/kolena
Python client for Kolena's machine learning testing platform
symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the code generation quality of LLMs.
srcclr/efda
Evaluation Framework for Dependency Analysis (EFDA)
cowjen01/repsys
Framework for Interactive Evaluation of Recommender Systems
GAIR-NLP/scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
OPTML-Group/Diffusion-MU-Attack
The official implementation of the paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces a fast and effective attack method to evaluate the harmful-content generation ability of safety-driven unlearned diffusion models.