evaluation-metrics
There are 388 repositories under the evaluation-metrics topic.
confident-ai/deepeval
The LLM Evaluation Framework
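A minimal sketch of how deepeval is typically used, following the pattern in its README; metric names and signatures vary across versions, and AnswerRelevancyMetric needs an LLM judge (e.g. an OpenAI API key) configured:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Assumption: names follow the README; an LLM-judge API key must be set up.
test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="We ship within 3-5 business days.",
)
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate([test_case], [metric])  # prints a pass/fail report per metric
```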
xinshuoweng/AB3DMOT
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
AgentOps-AI/agentops
Python SDK for agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks like CrewAI, Langchain, and Autogen
google-research/rliable
[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
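A hedged sketch of rliable's interval-estimate workflow, assuming the score-dict layout (a runs × tasks matrix per algorithm) and the get_interval_estimates/aggregate_iqm names from the project README:

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Assumption: each entry is a (num_runs, num_tasks) matrix of normalized scores.
score_dict = {
    "algo_a": np.random.rand(5, 10),
    "algo_b": np.random.rand(5, 10),
}

# Interquartile mean (IQM) with stratified bootstrap confidence intervals.
iqm = lambda scores: np.array([metrics.aggregate_iqm(scores)])
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, iqm, reps=2000
)
print(point_estimates, interval_estimates)
```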
MIND-Lab/OCTIS
OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)
jitsi/jiwer
Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
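A quick example with jiwer's core documented functions, wer and cer:

```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / words in the reference
print(jiwer.wer(reference, hypothesis))  # 2 substitutions / 9 words ≈ 0.222
print(jiwer.cer(reference, hypothesis))  # character error rate
```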
up42/image-similarity-measures
📈 Implementation of eight evaluation metrics to assess the similarity between two images: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
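A small sketch, assuming the quality_metrics module path, the (org_img, pred_img) signatures, and the max_p keyword shown in the project README:

```python
import numpy as np
# Assumption: module path and signatures as documented in the project README.
from image_similarity_measures.quality_metrics import rmse, psnr

org = np.random.randint(0, 256, (128, 128, 3)).astype(np.uint8)
pred = np.clip(org + np.random.randint(-10, 10, org.shape), 0, 255).astype(np.uint8)

print(rmse(org, pred))
print(psnr(org, pred, max_p=255))  # max_p: maximum possible pixel value (assumption)
```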
proycon/pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build a simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL), as well as clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
huggingface/lighteval
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside its recently released LLM data processing library datatrove and LLM training library nanotron.
Unbabel/COMET
A Neural Framework for MT Evaluation
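A hedged sketch of scoring machine translation with COMET, following the README's download_model/load_from_checkpoint pattern; the checkpoint name is an example and the shape of the prediction output varies across versions:

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # example checkpoint name
model = load_from_checkpoint(model_path)

data = [{
    "src": "Dem Feuer konnte Einhalt geboten werden",
    "mt":  "The fire could be stopped",
    "ref": "They were able to control the fire",
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level score; output.scores is per-segment
```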
AmenRa/ranx
⚡️ A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
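ranx evaluates a run against qrels in a few lines; Qrels and Run take nested dicts, and evaluate accepts metric names like ndcg@k:

```python
from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"doc_a": 1, "doc_b": 2}})                 # graded relevance judgments
run = Run({"q_1": {"doc_a": 0.9, "doc_b": 0.8, "doc_c": 0.4}})   # retrieval scores

print(evaluate(qrels, run, ["ndcg@10", "map@100", "mrr"]))
# {'ndcg@10': ..., 'map@100': ..., 'mrr': ...}
```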
relari-ai/continuous-eval
Open-Source Evaluation for GenAI Application Pipelines
v-iashin/SpecVQGAN
Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
salesforce/factCC
Resources for the "Evaluating the Factual Consistency of Abstractive Text Summarization" paper
bheinzerling/pyrouge
A Python wrapper for the ROUGE summarization evaluation package
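pyrouge wraps the original Perl ROUGE-1.5.5 toolkit, which must be installed separately; its documented usage points the wrapper at directories of system and reference summaries:

```python
from pyrouge import Rouge155  # requires the Perl ROUGE-1.5.5 toolkit on disk

r = Rouge155()
r.system_dir = "systems/"                      # one machine summary per file
r.model_dir = "references/"                    # reference ("model") summaries
r.system_filename_pattern = r"doc.(\d+).txt"   # regex group captures the doc ID
r.model_filename_pattern = "doc.#ID#.txt"      # #ID# is filled from the group above

output = r.convert_and_evaluate()
print(r.output_to_dict(output))                # ROUGE-1/2/L precision, recall, F1
```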
clovaai/generative-evaluation-prdc
Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
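The repo installs as the prdc package and exposes a single compute_prdc entry point; a sketch assuming pre-extracted feature embeddings of shape (n_samples, dim):

```python
import numpy as np
from prdc import compute_prdc  # pip install prdc

# Assumption: features are pre-extracted embeddings, shape (n_samples, dim).
real_features = np.random.normal(size=(1000, 64))
fake_features = np.random.normal(size=(1000, 64))

metrics = compute_prdc(real_features=real_features,
                       fake_features=fake_features,
                       nearest_k=5)
print(metrics)  # {'precision': ..., 'recall': ..., 'density': ..., 'coverage': ...}
```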
FuxiaoLiu/LRV-Instruction
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
TonicAI/tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
davidsbatista/NER-Evaluation
An implementation of full named-entity evaluation metrics based on SemEval'13 Task 9: evaluation is not at the tag/token level but considers all the tokens that are part of the named entity
sharmaroshan/Twitter-Sentiment-Analysis
A natural language processing problem in which sentiment analysis is performed by classifying positive and negative tweets with machine learning models, covering classification, text mining, text analysis, data analysis, and data visualization
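As a generic illustration of that kind of pipeline (not this repo's code), a TF-IDF plus logistic-regression baseline in scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["loved the new update!", "worst service ever",
          "pretty good overall", "never again"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tweets, labels)
print(clf.predict(["really enjoyed this"]))  # e.g. [1]
```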
clovaai/CLEval
CLEval: Character-Level Evaluation for Text Detection and Recognition Tasks
tagucci/pythonrouge
Python wrapper for evaluating summarization quality with the ROUGE package
feralvam/easse
Easier Automatic Sentence Simplification Evaluation
lartpang/PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
MantisAI/nervaluate
Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
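A sketch of nervaluate's documented Evaluator interface, assuming the prodigy-style span format from its README (nested BIO tag lists are also accepted via loader="list"); the return shape may differ across versions:

```python
from nervaluate import Evaluator

true = [[{"label": "PER", "start": 2, "end": 4}]]  # one document, one gold entity
pred = [[{"label": "PER", "start": 2, "end": 4}]]  # predicted spans

evaluator = Evaluator(true, pred, tags=["PER"])
results, results_by_tag = evaluator.evaluate()
print(results["strict"])  # other schemes: 'exact', 'partial', 'ent_type'
```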
athina-ai/athina-evals
Python SDK for running evaluations on LLM-generated responses
fakufaku/fast_bss_eval
A fast implementation of bss_eval metrics for blind source separation
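A hedged example of fast_bss_eval on NumPy arrays (it also accepts torch tensors), assuming the (n_sources, n_samples) layout and the bss_eval_sources entry point from the README:

```python
import numpy as np
import fast_bss_eval

rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 16000))             # (n_sources, n_samples) references
est = ref + 0.1 * rng.standard_normal(ref.shape)  # noisy source estimates

# Assumption: returns SDR/SIR/SAR per source plus the best channel permutation.
sdr, sir, sar, perm = fast_bss_eval.bss_eval_sources(ref, est)
print(sdr)
```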
om-ai-lab/VL-CheckList
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations.
YuanXinCherry/Person-reID-Evaluation
GOM: a new metric for re-identification. 👉 GOM explicitly balances the effects of retrieval and verification in a single unified metric.
tohinz/semantic-object-accuracy-for-generative-text-to-image-synthesis
Code for "Semantic Object Accuracy for Generative Text-to-Image Synthesis" (TPAMI 2020)
LAIT-CVLab/TopPR
Official code for "TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models" (NeurIPS 2023)
msmsajjadi/precision-recall-distributions
Assessing Generative Models via Precision and Recall (official repository)
Muhtasham/summarization-eval
📝 Reference-free automatic summarization evaluation with potential hallucination detection
tanyuqian/ctc-gen-eval
EMNLP 2021 - CTC: A Unified Framework for Evaluating Natural Language Generation
hpclab/rankeval
Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.
Coldmist-Lu/ErrorAnalysis_Prompt
🎁 [ChatGPT4MTevaluation] ErrorAnalysis Prompt for MT Evaluation in ChatGPT