evaluation
There are 1,363 repositories under the evaluation topic.
mrgloom/awesome-semantic-segmentation
:metal: awesome-semantic-segmentation
langfuse/langfuse
Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. YC W23
explodinggradients/ragas
Supercharge Your LLM Application Evaluations
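A minimal sketch of how a ragas evaluation run looks, assuming the classic API where samples come in as a Hugging Face `datasets.Dataset` with `question`, `answer`, and `contexts` columns (column names and metric imports vary between ragas releases):

```python
# Hedged sketch: score a tiny RAG sample with ragas metrics.
# Column names and import paths follow older ragas releases and may differ today.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    "question": ["When was the Eiffel Tower completed?"],
    "answer": ["It was completed in 1889."],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair."]],
}

# Requires an LLM judge to be configured (e.g. an OpenAI API key in the environment).
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the sample
```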
promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Knetic/govaluate
Arbitrary expression evaluation for golang
Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
MichaelGrupp/evo
Python package for the evaluation of odometry and SLAM
Helicone/helicone
Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23
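The "one line of code" is typically a proxy swap; a hedged sketch with the OpenAI Python SDK, assuming Helicone's `oai.helicone.ai` gateway and a `Helicone-Auth` header (check the Helicone docs for the current endpoint and header names):

```python
# Hedged sketch: route OpenAI traffic through Helicone's proxy for observability.
# Endpoint and header names are assumptions based on Helicone's gateway pattern.
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # proxy instead of api.openai.com
    default_headers={"Helicone-Auth": "Bearer <HELICONE_API_KEY>"},
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```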
sdiehl/write-you-a-haskell
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
Kiln-AI/Kiln
The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.
CLUEbenchmark/SuperCLUE
SuperCLUE: A comprehensive benchmark for general-purpose Chinese foundation models | A Benchmark for Foundation Models in Chinese
viebel/klipse
Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
zzw922cn/Automatic_Speech_Recognition
End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
microsoft/promptbench
A unified evaluation framework for large language models
ianarawjo/ChainForge
An open-source visual programming environment for battle-testing prompts to LLMs.
EvolvingLMMs-Lab/lmms-eval
Accelerating the development of large multimodal models (LMMs) with the one-click evaluation module lmms-eval.
uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and gives insights on how to resolve them.
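A hedged sketch of running a couple of those preconfigured checks; the class and enum names follow UpTrain's open-source quickstart and should be treated as assumptions:

```python
# Hedged sketch: run preconfigured UpTrain checks on a single sample.
# EvalLLM / Evals names are assumptions based on the quickstart docs.
from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key="sk-...")  # an LLM acts as the grader

data = [{
    "question": "What is the capital of France?",
    "context": "Paris is the capital and most populous city of France.",
    "response": "The capital of France is Paris.",
}]

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_RELEVANCE],
)
print(results)  # per-check scores and explanations
```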
huggingface/evaluate
Evaluate: A library for easily evaluating machine learning models and datasets.
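For a sense of the API, a minimal sketch that loads the `accuracy` metric from the Hub and scores a few predictions:

```python
# Minimal sketch of the evaluate API: load a metric and score predictions.
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(
    predictions=[0, 1, 1, 0],
    references=[0, 1, 0, 0],
)
print(result)  # e.g. {'accuracy': 0.75}
```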
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
ContinualAI/avalanche
Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
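A hedged sketch of a continual-learning loop with Avalanche, assuming the SplitMNIST benchmark and the Naive strategy (import paths have moved between Avalanche releases):

```python
# Hedged sketch: train and evaluate over a stream of experiences with Avalanche.
# Import paths follow recent releases; older ones use avalanche.training.strategies.
import torch
from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP
from avalanche.training.supervised import Naive

benchmark = SplitMNIST(n_experiences=5)       # MNIST split into 5 tasks
model = SimpleMLP(num_classes=10)
strategy = Naive(
    model,
    torch.optim.SGD(model.parameters(), lr=0.001),
    torch.nn.CrossEntropyLoss(),
    train_mb_size=32,
    train_epochs=1,
    eval_mb_size=32,
)

for experience in benchmark.train_stream:
    strategy.train(experience)                # learn the new task
    strategy.eval(benchmark.test_stream)      # measure forgetting on all tasks
```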
Cloud-CV/EvalAI
:cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating the state of the art in AI
lmnr-ai/lmnr
Laminar - an open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
xinshuoweng/AB3DMOT
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
MLGroupJLU/LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
sepandhaghighi/pycm
Multi-class confusion matrix library in Python
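A minimal sketch of building a confusion matrix from two label vectors with pycm:

```python
# Sketch of pycm: build a multi-class confusion matrix from label vectors.
from pycm import ConfusionMatrix

actual = ["cat", "dog", "cat", "bird", "dog"]
predicted = ["cat", "dog", "bird", "bird", "cat"]

cm = ConfusionMatrix(actual_vector=actual, predict_vector=predicted)
print(cm)              # matrix plus per-class and overall statistics
print(cm.Overall_ACC)  # overall accuracy
```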
Maluuba/nlg-eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
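A hedged sketch of the nlg-eval Python API, assuming the `NLGEval` class and its `compute_individual_metrics` method (the flags that skip the heavy skip-thoughts and GloVe models are assumptions from the README-era API):

```python
# Hedged sketch: score one generated sentence against references with nlg-eval.
# Constructor flags and method names follow the README-era API and may have changed.
from nlgeval import NLGEval

nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)  # skip heavy embedding models
metrics = nlgeval.compute_individual_metrics(
    ["the cat sat on the mat"],       # list of reference strings
    "a cat was sitting on the mat",   # single hypothesis string
)
print(metrics)  # BLEU, METEOR, ROUGE_L, CIDEr, ...
```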
Xnhyacinth/Awesome-LLM-Long-Context-Modeling
Must-read papers and blogs on LLM-based Long Context Modeling
huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
langwatch/langwatch
The ultimate LLM Ops platform - Monitoring, Analytics, Evaluations, Datasets and Prompt Optimization
lunary-ai/lunary
The production toolkit for LLMs. Observability, prompt management and evaluations.
abo-abo/lispy
Short and sweet LISP editing
EthicalML/xai
XAI - An eXplainability toolbox for machine learning
google/fuzzbench
FuzzBench - Fuzzer benchmarking as a service.
huggingface/evaluation-guidebook
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!