evaluation-metrics
There are 388 repositories under the evaluation-metrics topic.
confident-ai/deepeval
The LLM Evaluation Framework
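A minimal sketch of how deepeval is typically used, following the pattern in its README; metric names and signatures vary across versions, and AnswerRelevancyMetric needs an LLM judge (e.g. an OpenAI API key) configured:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Assumption: names follow the README; an LLM-judge API key must be set up.
test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="We ship within 3-5 business days.",
)
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate([test_case], [metric])  # prints a pass/fail report per metric
```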
xinshuoweng/AB3DMOT
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
AgentOps-AI/agentops
Python SDK for agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks like CrewAI, Langchain, and Autogen
google-research/rliable
[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
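A hedged sketch of rliable's interval-estimate workflow, assuming the score-dict layout (a runs × tasks matrix per algorithm) and the get_interval_estimates/aggregate_iqm names from the project README:

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Assumption: each entry is a (num_runs, num_tasks) matrix of normalized scores.
score_dict = {
    "algo_a": np.random.rand(5, 10),
    "algo_b": np.random.rand(5, 10),
}

# Interquartile mean (IQM) with stratified bootstrap confidence intervals.
iqm = lambda scores: np.array([metrics.aggregate_iqm(scores)])
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, iqm, reps=2000
)
print(point_estimates, interval_estimates)
```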
MIND-Lab/OCTIS
OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)
jitsi/jiwer
Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
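A quick example with jiwer's core documented functions, wer and cer:

```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / words in the reference
print(jiwer.wer(reference, hypothesis))  # 2 substitutions / 9 words ≈ 0.222
print(jiwer.cer(reference, hypothesis))  # character error rate
```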
up42/image-similarity-measures
📈 Implementation of eight evaluation metrics to assess the similarity between two images: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
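A small sketch, assuming the quality_metrics module path, the (org_img, pred_img) signatures, and the max_p keyword shown in the project README:

```python
import numpy as np
# Assumption: module path and signatures as documented in the project README.
from image_similarity_measures.quality_metrics import rmse, psnr

org = np.random.randint(0, 256, (128, 128, 3)).astype(np.uint8)
pred = np.clip(org + np.random.randint(-10, 10, org.shape), 0, 255).astype(np.uint8)

print(rmse(org, pred))
print(psnr(org, pred, max_p=255))  # max_p: maximum possible pixel value (assumption)
```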
proycon/pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build a simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL), as well as clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
huggingface/lighteval
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside its recently released LLM data processing library datatrove and LLM training library nanotron.
Unbabel/COMET
A Neural Framework for MT Evaluation
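A hedged sketch of scoring machine translation with COMET, following the README's download_model/load_from_checkpoint pattern; the checkpoint name is an example and the shape of the prediction output varies across versions:

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # example checkpoint name
model = load_from_checkpoint(model_path)

data = [{
    "src": "Dem Feuer konnte Einhalt geboten werden",
    "mt":  "The fire could be stopped",
    "ref": "They were able to control the fire",
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level score; output.scores is per-segment
```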
AmenRa/ranx
⚡️ A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
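ranx evaluates a run against qrels in a few lines; Qrels and Run take nested dicts, and evaluate accepts metric names like ndcg@k:

```python
from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"doc_a": 1, "doc_b": 2}})                 # graded relevance judgments
run = Run({"q_1": {"doc_a": 0.9, "doc_b": 0.8, "doc_c": 0.4}})   # retrieval scores

print(evaluate(qrels, run, ["ndcg@10", "map@100", "mrr"]))
# {'ndcg@10': ..., 'map@100': ..., 'mrr': ...}
```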
relari-ai/continuous-eval
Open-Source Evaluation for GenAI Application Pipelines
v-iashin/SpecVQGAN
Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
salesforce/factCC
Resources for the "Evaluating the Factual Consistency of Abstractive Text Summarization" paper
bheinzerling/pyrouge
A Python wrapper for the ROUGE summarization evaluation package
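pyrouge wraps the original Perl ROUGE-1.5.5 toolkit, which must be installed separately; its documented usage points the wrapper at directories of system and reference summaries:

```python
from pyrouge import Rouge155  # requires the Perl ROUGE-1.5.5 toolkit on disk

r = Rouge155()
r.system_dir = "systems/"                      # one machine summary per file
r.model_dir = "references/"                    # reference ("model") summaries
r.system_filename_pattern = r"doc.(\d+).txt"   # regex group captures the doc ID
r.model_filename_pattern = "doc.#ID#.txt"      # #ID# is filled from the group above

output = r.convert_and_evaluate()
print(r.output_to_dict(output))                # ROUGE-1/2/L precision, recall, F1
```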
clovaai/generative-evaluation-prdc
Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
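The repo installs as the prdc package and exposes a single compute_prdc entry point; a sketch assuming pre-extracted feature embeddings of shape (n_samples, dim):

```python
import numpy as np
from prdc import compute_prdc  # pip install prdc

# Assumption: features are pre-extracted embeddings, shape (n_samples, dim).
real_features = np.random.normal(size=(1000, 64))
fake_features = np.random.normal(size=(1000, 64))

metrics = compute_prdc(real_features=real_features,
                       fake_features=fake_features,
                       nearest_k=5)
print(metrics)  # {'precision': ..., 'recall': ..., 'density': ..., 'coverage': ...}
```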
FuxiaoLiu/LRV-Instruction
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
TonicAI/tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
davidsbatista/NER-Evaluation
An implementation of full named-entity evaluation metrics based on SemEval'13 Task 9: evaluation is not at the tag/token level but considers all the tokens that are part of the named entity
sharmaroshan/Twitter-Sentiment-Analysis
A natural language processing problem in which sentiment analysis is performed by classifying positive and negative tweets with machine learning models, covering classification, text mining, text analysis, data analysis, and data visualization
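As a generic illustration of that kind of pipeline (not this repo's code), a TF-IDF plus logistic-regression baseline in scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["loved the new update!", "worst service ever",
          "pretty good overall", "never again"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tweets, labels)
print(clf.predict(["really enjoyed this"]))  # e.g. [1]
```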
clovaai/CLEval
CLEval: Character-Level Evaluation for Text Detection and Recognition Tasks
tagucci/pythonrouge
Python wrapper for evaluating summarization quality with the ROUGE package
feralvam/easse
Easier Automatic Sentence Simplification Evaluation
lartpang/PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
MantisAI/nervaluate
Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
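A sketch of nervaluate's documented Evaluator interface, assuming the prodigy-style span format from its README (nested BIO tag lists are also accepted via loader="list"); the return shape may differ across versions:

```python
from nervaluate import Evaluator

true = [[{"label": "PER", "start": 2, "end": 4}]]  # one document, one gold entity
pred = [[{"label": "PER", "start": 2, "end": 4}]]  # predicted spans

evaluator = Evaluator(true, pred, tags=["PER"])
results, results_by_tag = evaluator.evaluate()
print(results["strict"])  # other schemes: 'exact', 'partial', 'ent_type'
```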
athina-ai/athina-evals
Python SDK for running evaluations on LLM-generated responses
fakufaku/fast_bss_eval
A fast implementation of bss_eval metrics for blind source separation
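A hedged example of fast_bss_eval on NumPy arrays (it also accepts torch tensors), assuming the (n_sources, n_samples) layout and the bss_eval_sources entry point from the README:

```python
import numpy as np
import fast_bss_eval

rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 16000))             # (n_sources, n_samples) references
est = ref + 0.1 * rng.standard_normal(ref.shape)  # noisy source estimates

# Assumption: returns SDR/SIR/SAR per source plus the best channel permutation.
sdr, sir, sar, perm = fast_bss_eval.bss_eval_sources(ref, est)
print(sdr)
```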
om-ai-lab/VL-CheckList
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations.
YuanXinCherry/Person-reID-Evaluation
GOM: a new metric for re-identification. 👉 GOM explicitly balances the effects of retrieval and verification in a single unified metric.
tohinz/semantic-object-accuracy-for-generative-text-to-image-synthesis
Code for "Semantic Object Accuracy for Generative Text-to-Image Synthesis" (TPAMI 2020)
LAIT-CVLab/TopPR
Official code for "TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models" (NeurIPS 2023)
msmsajjadi/precision-recall-distributions
Assessing Generative Models via Precision and Recall (official repository)
Muhtasham/summarization-eval
📝 Reference-free automatic summarization evaluation with potential hallucination detection
tanyuqian/ctc-gen-eval
EMNLP 2021 - CTC: A Unified Framework for Evaluating Natural Language Generation
hpclab/rankeval
Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.
Coldmist-Lu/ErrorAnalysis_Prompt
🎁 [ChatGPT4MTevaluation] ErrorAnalysis Prompt for MT Evaluation in ChatGPT