evaluation-framework

There are 122 repositories under the evaluation-framework topic.

  • EleutherAI/lm-evaluation-harness

    A framework for few-shot evaluation of language models.

    Language: Python
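
    A minimal usage sketch, not the project's canonical example: the call below assumes the harness's v0.4+ Python API (lm_eval.simple_evaluate), and the model and task names are purely illustrative.

        # Sketch: programmatic few-shot evaluation with lm-evaluation-harness.
        # Assumes the v0.4+ Python API; model and task names are illustrative.
        import lm_eval

        results = lm_eval.simple_evaluate(
            model="hf",                                      # Hugging Face transformers backend
            model_args="pretrained=EleutherAI/pythia-160m",  # any HF checkpoint
            tasks=["lambada_openai"],                        # any registered task name
            num_fewshot=0,
            batch_size=8,
        )
        print(results["results"])                            # per-task metric dictionary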
  • promptfoo/promptfoo

    Test your prompts, models, and RAG pipelines. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models, with CI/CD integration.

    Language: TypeScript
  • confident-ai/deepeval

    The LLM Evaluation Framework

    Language: Python
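
    A minimal usage sketch, assuming deepeval's documented test-case/metric interface; the strings, threshold, and metric choice are illustrative, and the metric relies on an LLM judge you configure separately.

        # Sketch: scoring one response with deepeval's AnswerRelevancyMetric.
        # Assumes the documented LLMTestCase / evaluate() interface; all values
        # here are placeholders.
        from deepeval import evaluate
        from deepeval.metrics import AnswerRelevancyMetric
        from deepeval.test_case import LLMTestCase

        test_case = LLMTestCase(
            input="What is the refund window?",
            actual_output="You can request a refund within 30 days of purchase.",
        )
        metric = AnswerRelevancyMetric(threshold=0.7)
        evaluate(test_cases=[test_case], metrics=[metric])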
  • MaurizioFD/RecSys2019_DeepLearning_Evaluation

    This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.

    Language: Python
  • huggingface/lighteval

    LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside its recently released LLM data processing library datatrove and LLM training library nanotron.

    Language: Python
  • relari-ai/continuous-eval

    Open-Source Evaluation for GenAI Application Pipelines

    Language: Python
  • TonicAI/tonic_validate

    Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

    Language: Python
  • diningphil/PyDGN

    A research library for automating experiments on Deep Graph Networks

    Language: Python
  • zeno-ml/zeno

    AI Data Management & Evaluation Platform

    Language: Svelte
  • bijington/expressive

    Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.

    Language: C#
  • lartpang/PySODEvalToolkit

    PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection

    Language: Python
  • athina-ai/athina-evals

    Python SDK for running evaluations on LLM generated responses

    Language: Python
  • empirical-run/empirical

    Test and evaluate LLMs and model configurations, across all the scenarios that matter for your application

    Language: TypeScript
  • AI21Labs/lm-evaluation

    Evaluation suite for large-scale language models.

    Language: Python
  • tsenst/CrowdFlow

    Optical Flow Dataset and Benchmark for Visual Crowd Analysis

    Language: Python
  • Borda/BIRL

    BIRL: Benchmark on Image Registration methods with Landmark validations

    Language: Python
  • hpclab/rankeval

    Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.

    Language: Python
  • haeyeoni/lidar_slam_evaluator

    LiDAR SLAM comparison and evaluation framework

    Language: Python
  • BMW-InnovationLab/SORDI-AI-Evaluation-GUI

    This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.

    Language: Python
  • nouhadziri/DialogEntailment

    The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"

    Language: Python
  • pentoai/vectory

    Vectory provides a collection of tools to track and compare embedding versions.

    Language: Python
  • codefuse-ai/codefuse-evaluation

    Industrial-grade evaluation benchmarks for coding LLMs across the full life cycle of AI-native software development (enterprise-level evaluation suite for code LLMs, continuously being expanded).

    Language: Python
  • ashafaei/OD-test

    OD-test: A Less Biased Evaluation of Out-of-Distribution (Outlier) Detectors (PyTorch)

    Language: Python
  • nlp-uoregon/mlmm-evaluation

    Multilingual Large Language Models Evaluation Benchmark

    Language: Python
  • powerflows/powerflows-dmn

    Power Flows DMN - Powerful decisions and rules engine

    Language: Java
  • aiverify-foundation/moonshot

    Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

    Language: Python
  • SpikeInterface/spiketoolkit

    Python-based tools for pre-, post-processing, validating, and curating spike sorting datasets.

    Language: Python
  • kaiko-ai/eva

    Evaluation framework for oncology foundation models (FMs)

    Language: Python
  • sb-ai-lab/Sim4Rec

    Simulator for training and evaluation of Recommender Systems

    Language: Jupyter Notebook
  • yupidevs/pactus

    Framework to evaluate Trajectory Classification Algorithms

    Language: Python
  • kolenaIO/kolena

    Python client for Kolena's machine learning testing platform

    Language: Python
  • symflower/eval-dev-quality

    DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

    Language: Go
  • srcclr/efda

    Evaluation Framework for Dependency Analysis (EFDA)

    Language: C
  • cowjen01/repsys

    Framework for Interactive Evaluation of Recommender Systems

    Language: JavaScript
  • GAIR-NLP/scaleeval

    Scalable Meta-Evaluation of LLMs as Evaluators

    Language: Python
  • OPTML-Group/Diffusion-MU-Attack

    The official implementation of the paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces one fast and effective attack method to evaluate the harmful-content generation ability of safety-driven unlearned diffusion models.

    Language: Python