evaluation

There are 1,363 repositories under the evaluation topic.

  • mrgloom/awesome-semantic-segmentation

    :metal: awesome-semantic-segmentation

  • langfuse/langfuse

    🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

    Language: TypeScript · 9.9k stars
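    (a hedged Python SDK sketch follows this list)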
  • explodinggradients/ragas

    Supercharge Your LLM Application Evaluations 🚀

    Language: Python · 8.6k stars
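    (a hedged usage sketch follows this list)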
  • promptfoo/promptfoo

    Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

    Language: TypeScript · 6k stars
  • open-compass/opencompass

    OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.

    Language: Python · 5.1k stars
  • Knetic/govaluate

    Arbitrary expression evaluation for golang

    Language: Go · 3.9k stars
  • Marker-Inc-Korea/AutoRAG

    AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

    Language: Python · 3.7k stars
  • MichaelGrupp/evo

    Python package for the evaluation of odometry and SLAM

    Language: Python · 3.7k stars
  • Helicone/helicone

    🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

    Language: TypeScript · 3.5k stars
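    (a proxy-integration sketch follows this list)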
  • sdiehl/write-you-a-haskell

    Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)

    Language: Haskell · 3.4k stars
  • Kiln-AI/Kiln

    The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.

    Language: Python · 3.3k stars
  • CLUEbenchmark/SuperCLUE

    SuperCLUE: A comprehensive benchmark for Chinese general-purpose large models | A Benchmark for Foundation Models in Chinese

  • viebel/klipse

    Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.

    Language: HTML · 3.1k stars
  • zzw922cn/Automatic_Speech_Recognition

    End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow

    Language: Python · 2.8k stars
  • microsoft/promptbench

    A unified evaluation framework for large language models

    Language: Python · 2.6k stars
  • ianarawjo/ChainForge

    An open-source visual programming environment for battle-testing prompts to LLMs.

    Language: TypeScript · 2.6k stars
  • EvolvingLMMs-Lab/lmms-eval

    Accelerating the development of large multimodal models (LMMs) with lmms-eval, a one-click evaluation module.

    Language: Python · 2.3k stars
  • uptrain-ai/uptrain

    UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use cases), perform root-cause analysis on failure cases, and give insights on how to resolve them.

    Language: Python · 2.3k stars
  • huggingface/evaluate

    🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

    Language: Python · 2.2k stars
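    (a usage sketch follows this list)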
  • open-compass/VLMEvalKit

    Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

    Language: Python · 2.1k stars
  • ContinualAI/avalanche

    Avalanche: an End-to-End Library for Continual Learning based on PyTorch.

    Language: Python · 1.9k stars
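    (a training-loop sketch follows this list)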
  • Cloud-CV/EvalAI

    :cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of the art in AI

    Language: Python · 1.8k stars
  • lmnr-ai/lmnr

    Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

    Language: TypeScript · 1.7k stars
  • xinshuoweng/AB3DMOT

    (IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"

    Language: Python · 1.7k stars
  • tatsu-lab/alpaca_eval

    An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

    Language: Jupyter Notebook · 1.7k stars
  • MLGroupJLU/LLM-eval-survey

    The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".

  • sepandhaghighi/pycm

    Multi-class confusion matrix library in Python

    Language: Python · 1.5k stars
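    (a usage sketch follows this list)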
  • Maluuba/nlg-eval

    Evaluation code for various unsupervised automated metrics for Natural Language Generation.

    Language: Python · 1.4k stars
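    (a hedged usage sketch follows this list)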
  • Xnhyacinth/Awesome-LLM-Long-Context-Modeling

    📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥

  • huggingface/lighteval

    Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

    Language: Python · 1.4k stars
  • langwatch/langwatch

    The ultimate LLM Ops platform - Monitoring, Analytics, Evaluations, Datasets and Prompt Optimization ✨

    Language: TypeScript · 1.3k stars
  • lunary-ai/lunary

    The production toolkit for LLMs. Observability, prompt management and evaluations.

    Language: TypeScript · 1.2k stars
  • abo-abo/lispy

    Short and sweet LISP editing

    Language: Emacs Lisp · 1.2k stars
  • EthicalML/xai

    XAI - An eXplainability toolbox for machine learning

    Language: Python · 1.2k stars
  • google/fuzzbench

    FuzzBench - Fuzzer benchmarking as a service.

    Language: Python · 1.1k stars
  • huggingface/evaluation-guidebook

    Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!

    Language: Jupyter Notebook · 1.1k stars
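
To make a few of the Python-accessible entries above more concrete, the sketches below show typical usage. They are illustrative rather than authoritative; verify names and signatures against each project's current documentation.

For langfuse/langfuse, a minimal tracing sketch assuming the v2-era Python SDK, where an @observe decorator is imported from langfuse.decorators (newer SDK versions expose it directly from langfuse); the answer function is a hypothetical stand-in for a real LLM call.

```python
# Minimal Langfuse tracing sketch (assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY
# and LANGFUSE_HOST are set in the environment).
from langfuse.decorators import observe  # newer SDKs: from langfuse import observe

@observe()  # records this call as a trace in Langfuse, including inputs, outputs and timing
def answer(question: str) -> str:
    # Hypothetical placeholder -- replace with a real LLM call.
    return f"Echo: {question}"

if __name__ == "__main__":
    print(answer("What does Langfuse do?"))
```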
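
For explodinggradients/ragas, a sketch assuming the 0.1-style API, in which evaluate() takes a Hugging Face Dataset with question, answer, contexts, and ground_truth columns; newer releases restructure this around an EvaluationDataset, and the LLM-as-judge metrics expect an OpenAI (or similar) API key.

```python
# Small RAG evaluation sketch with ragas (0.1-style API; column names have
# shifted between releases, so treat them as assumptions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and largest city of France."]],
    "ground_truth": ["Paris"],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the dataset
```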
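
For Helicone/helicone, the "one line of code" is, as I read their docs, a proxy integration: point an existing OpenAI client at Helicone's gateway and pass the Helicone key in a header. The endpoint and header name below are assumptions to check against the current docs.

```python
# Route OpenAI calls through Helicone's proxy so each request is logged and measurable.
# Endpoint and header name are assumptions based on Helicone's documented proxy setup.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # Helicone gateway in front of the OpenAI API
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(resp.choices[0].message.content)
```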
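
For huggingface/evaluate, metrics are loaded by name and share a compute() interface, so swapping in "bleu", "rouge", or another metric is a one-line change:

```python
# Compute classification metrics with huggingface/evaluate.
import evaluate

accuracy = evaluate.load("accuracy")  # downloads the metric script on first use
f1 = evaluate.load("f1")

preds = [0, 1, 1, 0, 1]
refs = [0, 1, 0, 0, 1]

print(accuracy.compute(predictions=preds, references=refs))  # {'accuracy': 0.8}
print(f1.compute(predictions=preds, references=refs))        # {'f1': 0.8}
```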
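
For ContinualAI/avalanche, the canonical pattern is: build a benchmark, wrap a model and optimizer in a training strategy, then train over the stream of experiences and evaluate on the test stream. Import paths have moved between releases (avalanche.training.supervised here; older versions used avalanche.training.strategies), so treat them as assumptions.

```python
# Continual-learning sketch with Avalanche: sequential tasks from SplitMNIST,
# trained with the Naive (plain fine-tuning) strategy.
import torch
from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP
from avalanche.training.supervised import Naive  # older releases: avalanche.training.strategies

benchmark = SplitMNIST(n_experiences=5)  # MNIST classes split across 5 experiences
model = SimpleMLP(num_classes=10)        # MNIST has 10 classes
strategy = Naive(
    model,
    torch.optim.SGD(model.parameters(), lr=0.001),
    torch.nn.CrossEntropyLoss(),
    train_mb_size=64, train_epochs=1, eval_mb_size=64,
)

for experience in benchmark.train_stream:  # tasks arrive one after another
    strategy.train(experience)
    strategy.eval(benchmark.test_stream)   # evaluating on all tasks exposes forgetting
```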
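
For sepandhaghighi/pycm, a confusion matrix is built directly from label vectors and exposes per-class and overall statistics; the attribute names below follow my recollection of the pycm docs.

```python
# Multi-class confusion matrix with pycm.
from pycm import ConfusionMatrix

actual = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
predict = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]

cm = ConfusionMatrix(actual_vector=actual, predict_vector=predict)
cm.print_matrix()      # class-by-class count table
print(cm.Overall_ACC)  # overall accuracy
print(cm.PPV)          # per-class precision (positive predictive value), keyed by class label
```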
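
For Maluuba/nlg-eval, a sketch based on my recollection of the README: the package needs a one-time `nlg-eval --setup` to download metric data, and the constructor flags below (which skip the heavyweight embedding-based metrics) are assumptions.

```python
# Reference-based NLG metrics with nlg-eval (word-overlap metrics only, to keep it light).
from nlgeval import NLGEval

nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)  # assumed flags to disable embedding metrics
references = ["The cat sat on the mat.", "A cat was sitting on the mat."]  # multiple references per sample
hypothesis = "The cat is sitting on the mat."

scores = nlgeval.compute_individual_metrics(references, hypothesis)
print(scores)  # e.g. Bleu_1..Bleu_4, METEOR, ROUGE_L, CIDEr
```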