evaluation-framework

There are 230 repositories under the evaluation-framework topic.

  • confident-ai/deepeval

    The LLM Evaluation Framework (see the usage sketch after this list)

    Language: Python · 10.8k stars
  • EleutherAI/lm-evaluation-harness

    A framework for few-shot evaluation of language models (see the usage sketch after this list).

    Language: Python · 10.1k stars
  • promptfoo/promptfoo

    Test your prompts, agents, and RAGs. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.

    Language: TypeScript · 8.4k stars
  • huggingface/lighteval

    Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

    Language: Python · 1.9k stars
  • MaurizioFD/RecSys2019_DeepLearning_Evaluation

    This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.

    Language: Python · 985 stars
  • relari-ai/continuous-eval

    Data-Driven Evaluation for LLM-Powered Applications

    Language: Python · 505 stars
  • ServiceNow/AgentLab

    AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

    Language: Python · 405 stars
  • TonicAI/tonic_validate

    Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

    Language: Python · 318 stars
  • athina-ai/athina-evals

    Python SDK for running evaluations on LLM-generated responses

    Language: Python · 292 stars
  • aiverify-foundation/moonshot

    Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

    Language: Python · 269 stars
  • JinjieNi/MixEval

    The official evaluation suite and dynamic data release for MixEval.

    Language: Python · 245 stars
  • diningphil/PyDGN

    A research library for automating experiments on Deep Graph Networks

    Language: Python · 223 stars
  • zeno-ml/zeno

    AI Data Management & Evaluation Platform

    Language: Svelte · 216 stars
  • lartpang/PySODEvalToolkit

    PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection

    Language: Python · 180 stars
  • symflower/eval-dev-quality

    DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

    Language: Go · 179 stars
  • bijington/expressive

    Expressive is a cross-platform expression parsing and evaluation framework. It targets .NET Standard, so it runs on practically any platform.

    Language: C# · 173 stars
  • microsoft/eureka-ml-insights

    A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.

    Language: Python · 168 stars
  • empirical-run/empirical

    Test and evaluate LLMs and model configurations, across all the scenarios that matter for your application

    Language: TypeScript · 160 stars
  • alibaba-damo-academy/MedEvalKit

    MedEvalKit: A Unified Medical Evaluation Framework

    Language: Python · 144 stars
  • HKUSTDial/NL2SQL360

    🔥[VLDB'24] Official repository for the paper “The Dawn of Natural Language to SQL: Are We Fully Ready?”

    Language: Python · 131 stars
  • nlp-uoregon/mlmm-evaluation

    Multilingual Large Language Models Evaluation Benchmark

    Language: Python · 131 stars
  • AI21Labs/lm-evaluation

    Evaluation suite for large-scale language models.

    Language: Python · 128 stars
  • kaiko-ai/eva

    Evaluation framework for oncology foundation models (FMs)

    Language: Python · 124 stars
  • EuroEval/EuroEval

    The robust European language model benchmark.

    Language: Python · 123 stars
  • tsenst/CrowdFlow

    Optical Flow Dataset and Benchmark for Visual Crowd Analysis

    Language: Python · 122 stars
  • X-PLUG/WritingBench

    WritingBench: A Comprehensive Benchmark for Generative Writing

    Language: Python · 118 stars
  • codefuse-ai/codefuse-evaluation

    Industrial-level evaluation benchmarks for coding LLMs across the full life cycle of AI-native software development (an enterprise-grade evaluation suite for code LLMs, with releases ongoing).

    Language: Python · 101 stars
  • haeyeoni/lidar_slam_evaluator

    LiDAR SLAM comparison and evaluation framework

    Language: Python · 98 stars
  • Borda/BIRL

    BIRL: Benchmark on Image Registration methods with Landmark validations

    Language: Python · 93 stars
  • hpclab/rankeval

    Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.

    Language: Python · 88 stars
  • pyrddlgym-project/pyRDDLGym

    A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.

    Language: Python · 83 stars
  • jinzhuoran/RWKU

    RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024

    Language: Python · 82 stars
  • OPTML-Group/Diffusion-MU-Attack

    The official implementation of ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces one fast and effective attack method to evaluate the harmful-content generation ability of safety-driven unlearned diffusion models.

    Language: Python · 80 stars
  • nouhadziri/DialogEntailment

    The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"

    Language: Python · 74 stars
  • pentoai/vectory

    Vectory provides a collection of tools to track and compare embedding versions.

    Language: Python · 71 stars
  • ashafaei/OD-test

    OD-test: A Less Biased Evaluation of Out-of-Distribution (Outlier) Detectors (PyTorch)

    Language: Python · 62 stars
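
Illustrative usage sketches for the two most-starred entries above follow. They are minimal sketches written against each library's documented Python API, not code taken from the repositories, so class, function, and parameter names should be checked against the installed versions.

confident-ai/deepeval: a single-metric check of one model response, assuming an OPENAI_API_KEY is available for the metric's judge model and that the names below match the installed deepeval release.

    # Minimal deepeval check: one test case scored by one metric.
    from deepeval import evaluate
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    # The prompt sent to the application and the response it actually produced.
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    )

    # Scores how relevant the answer is to the question; fails below the threshold.
    metric = AnswerRelevancyMetric(threshold=0.7)

    # Runs the metric against the test case and reports pass/fail per case.
    evaluate(test_cases=[test_case], metrics=[metric])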
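
EleutherAI/lm-evaluation-harness: a zero-shot run through the harness's Python entry point; the model id, task, and batch size here are illustrative choices rather than defaults.

    # Zero-shot evaluation of a small Hugging Face model on one bundled task.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                                      # Hugging Face backend
        model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal LM id
        tasks=["hellaswag"],                             # a bundled task name
        num_fewshot=0,                                   # zero-shot setting
        batch_size=8,
    )

    # Per-task metrics (e.g., accuracy) live under results["results"].
    print(results["results"]["hellaswag"])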