evaluation
There are 1,115 repositories under the evaluation topic.
mrgloom/awesome-semantic-segmentation
:metal: awesome-semantic-segmentation
langfuse/langfuse
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Knetic/govaluate
Arbitrary expression evaluation for golang
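govaluate parses and evaluates expressions supplied at runtime rather than compile time. The same idea can be sketched in Python by walking an `ast` parse tree and whitelisting operators (a simplified stand-in for illustration, not govaluate's actual Go API):

```python
import ast
import operator

# Whitelisted binary operators; any other syntax is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expression):
    """Evaluate an arithmetic expression string without exec/eval."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("disallowed syntax")
    return walk(ast.parse(expression, mode="eval").body)
```

Rejecting everything outside the whitelist is what makes this safe for untrusted input, unlike a raw `eval`.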
sdiehl/write-you-a-haskell
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
MichaelGrupp/evo
Python package for the evaluation of odometry and SLAM
promptfoo/promptfoo
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
viebel/klipse
Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
open-compass/opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
zzw922cn/Automatic_Speech_Recognition
End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
CLUEbenchmark/SuperCLUE
SuperCLUE: A Comprehensive Benchmark for General-Purpose Chinese Foundation Models
microsoft/promptbench
A unified evaluation framework for large language models
ianarawjo/ChainForge
An open-source visual programming environment for battle-testing prompts against LLMs.
uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use cases), perform root cause analysis on failure cases, and give insights on how to resolve them.
huggingface/evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
ContinualAI/avalanche
Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
Cloud-CV/EvalAI
:cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of the art in AI
xinshuoweng/AB3DMOT
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
sepandhaghighi/pycm
Multi-class confusion matrix library in Python
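pycm's statistics are all derived from a multi-class confusion matrix. The core structure can be sketched in pure Python (this is the underlying idea only, not pycm's actual API):

```python
from collections import defaultdict

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs into a nested dict of dicts."""
    matrix = defaultdict(lambda: defaultdict(int))
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy(matrix):
    """Overall accuracy: diagonal mass divided by total mass."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total
```

Per-class precision, recall, and the many other statistics pycm reports are likewise just ratios of cells in this matrix.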
Maluuba/nlg-eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
MLGroupJLU/LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
abo-abo/lispy
Short and sweet LISP editing
EthicalML/xai
XAI - An eXplainability toolbox for machine learning
google/fuzzbench
FuzzBench - Fuzzer benchmarking as a service.
lunary-ai/lunary
The production toolkit for LLMs. Observability, prompt management and evaluations.
toshas/torch-fidelity
High-fidelity performance metrics for generative models in PyTorch
PRBonn/semantic-kitti-api
SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
PaesslerAG/gval
Expression evaluation in golang
dbolya/tide
A General Toolbox for Identifying Object Detection Errors
bochinski/iou-tracker
Python implementation of the IOU Tracker
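The IOU Tracker associates detections across frames purely by bounding-box overlap. A minimal sketch of the intersection-over-union computation it depends on (the `(x1, y1, x2, y2)` box format is an assumption for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle corners
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A tracker then extends a track with the detection whose IoU against the track's last box exceeds a threshold, which keeps the method fast enough to run without any appearance model.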
CBLUEbenchmark/CBLUE
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark for medical information processing
codingseb/ExpressionEvaluator
A Simple Math and Pseudo C# Expression Evaluator in One C# File. Can also execute small C# like scripts
ucinlp/autoprompt
AutoPrompt: Automatic Prompt Construction for Masked Language Models.
tecnickcom/tcexam
TCExam is a Computer-Based Assessment (CBA/CBT) system (e-exam, computer-based testing) for universities, schools, and companies that enables educators and trainers to author, schedule, deliver, and report on surveys, quizzes, tests, and exams.
open-compass/VLMEvalKit
An open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 50+ Hugging Face models, and 20+ benchmarks.
jkkummerfeld/text2sql-data
A collection of datasets that pair questions with SQL queries.