evaluation

There are 1,363 repositories under the evaluation topic.

  • mrgloom/awesome-semantic-segmentation

    :metal: awesome-semantic-segmentation

  • langfuse/langfuse

    🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

    Language: TypeScript · 9.9k stars
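    (a hedged Python SDK sketch follows this list)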
  • explodinggradients/ragas

    Supercharge Your LLM Application Evaluations 🚀

    Language: Python · 8.6k stars
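    (a hedged usage sketch follows this list)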
  • promptfoo/promptfoo

    Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

    Language: TypeScript · 6k stars
  • open-compass/opencompass

    OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.

    Language: Python · 5.1k stars
  • Knetic/govaluate

    Arbitrary expression evaluation for golang

    Language: Go · 3.9k stars
  • Marker-Inc-Korea/AutoRAG

    AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

    Language: Python · 3.7k stars
  • MichaelGrupp/evo

    Python package for the evaluation of odometry and SLAM

    Language: Python · 3.7k stars
  • Helicone/helicone

    🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

    Language: TypeScript · 3.5k stars
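    (a proxy-integration sketch follows this list)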
  • sdiehl/write-you-a-haskell

    Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)

    Language: Haskell · 3.4k stars
  • Kiln-AI/Kiln

    The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.

    Language: Python · 3.3k stars
  • CLUEbenchmark/SuperCLUE

    SuperCLUE: A comprehensive benchmark for Chinese general-purpose large models | A Benchmark for Foundation Models in Chinese

  • viebel/klipse

    Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.

    Language: HTML · 3.1k stars
  • zzw922cn/Automatic_Speech_Recognition

    End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow

    Language: Python · 2.8k stars
  • microsoft/promptbench

    A unified evaluation framework for large language models

    Language: Python · 2.6k stars
  • ianarawjo/ChainForge

    An open-source visual programming environment for battle-testing prompts to LLMs.

    Language: TypeScript · 2.6k stars
  • EvolvingLMMs-Lab/lmms-eval

    Accelerating the development of large multimodal models (LMMs) with lmms-eval, a one-click evaluation module.

    Language: Python · 2.3k stars
  • uptrain-ai/uptrain

    UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use cases), perform root-cause analysis on failure cases, and give insights on how to resolve them.

    Language: Python · 2.3k stars
  • huggingface/evaluate

    🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

    Language: Python · 2.2k stars
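    (a usage sketch follows this list)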
  • open-compass/VLMEvalKit

    Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

    Language: Python · 2.1k stars
  • ContinualAI/avalanche

    Avalanche: an End-to-End Library for Continual Learning based on PyTorch.

    Language: Python · 1.9k stars
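    (a training-loop sketch follows this list)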
  • Cloud-CV/EvalAI

    :cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of the art in AI

    Language: Python · 1.8k stars
  • lmnr-ai/lmnr

    Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

    Language: TypeScript · 1.7k stars
  • xinshuoweng/AB3DMOT

    (IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"

    Language: Python · 1.7k stars
  • tatsu-lab/alpaca_eval

    An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

    Language: Jupyter Notebook · 1.7k stars
  • MLGroupJLU/LLM-eval-survey

    The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".

  • sepandhaghighi/pycm

    Multi-class confusion matrix library in Python

    Language: Python · 1.5k stars
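    (a usage sketch follows this list)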
  • Maluuba/nlg-eval

    Evaluation code for various unsupervised automated metrics for Natural Language Generation.

    Language: Python · 1.4k stars
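    (a hedged usage sketch follows this list)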
  • Xnhyacinth/Awesome-LLM-Long-Context-Modeling

    📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥

  • huggingface/lighteval

    Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

    Language: Python · 1.4k stars
  • langwatch/langwatch

    The ultimate LLM Ops platform - Monitoring, Analytics, Evaluations, Datasets and Prompt Optimization ✨

    Language: TypeScript · 1.3k stars
  • lunary-ai/lunary

    The production toolkit for LLMs. Observability, prompt management and evaluations.

    Language: TypeScript · 1.2k stars
  • abo-abo/lispy

    Short and sweet LISP editing

    Language: Emacs Lisp · 1.2k stars
  • EthicalML/xai

    XAI - An eXplainability toolbox for machine learning

    Language: Python · 1.2k stars
  • google/fuzzbench

    FuzzBench - Fuzzer benchmarking as a service.

    Language: Python · 1.1k stars
  • huggingface/evaluation-guidebook

    Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!

    Language: Jupyter Notebook · 1.1k stars
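
To make a few of the Python-accessible entries above more concrete, the sketches below show typical usage. They are illustrative rather than authoritative; verify names and signatures against each project's current documentation.

For langfuse/langfuse, a minimal tracing sketch assuming the v2-era Python SDK, where an @observe decorator is imported from langfuse.decorators (newer SDK versions expose it directly from langfuse); the answer function is a hypothetical stand-in for a real LLM call.

```python
# Minimal Langfuse tracing sketch (assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY
# and LANGFUSE_HOST are set in the environment).
from langfuse.decorators import observe  # newer SDKs: from langfuse import observe

@observe()  # records this call as a trace in Langfuse, including inputs, outputs and timing
def answer(question: str) -> str:
    # Hypothetical placeholder -- replace with a real LLM call.
    return f"Echo: {question}"

if __name__ == "__main__":
    print(answer("What does Langfuse do?"))
```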
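
For explodinggradients/ragas, a sketch assuming the 0.1-style API, in which evaluate() takes a Hugging Face Dataset with question, answer, contexts, and ground_truth columns; newer releases restructure this around an EvaluationDataset, and the LLM-as-judge metrics expect an OpenAI (or similar) API key.

```python
# Small RAG evaluation sketch with ragas (0.1-style API; column names have
# shifted between releases, so treat them as assumptions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and largest city of France."]],
    "ground_truth": ["Paris"],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the dataset
```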
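
For Helicone/helicone, the "one line of code" is, as I read their docs, a proxy integration: point an existing OpenAI client at Helicone's gateway and pass the Helicone key in a header. The endpoint and header name below are assumptions to check against the current docs.

```python
# Route OpenAI calls through Helicone's proxy so each request is logged and measurable.
# Endpoint and header name are assumptions based on Helicone's documented proxy setup.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # Helicone gateway in front of the OpenAI API
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(resp.choices[0].message.content)
```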
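
For huggingface/evaluate, metrics are loaded by name and share a compute() interface, so swapping in "bleu", "rouge", or another metric is a one-line change:

```python
# Compute classification metrics with huggingface/evaluate.
import evaluate

accuracy = evaluate.load("accuracy")  # downloads the metric script on first use
f1 = evaluate.load("f1")

preds = [0, 1, 1, 0, 1]
refs = [0, 1, 0, 0, 1]

print(accuracy.compute(predictions=preds, references=refs))  # {'accuracy': 0.8}
print(f1.compute(predictions=preds, references=refs))        # {'f1': 0.8}
```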
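
For ContinualAI/avalanche, the canonical pattern is: build a benchmark, wrap a model and optimizer in a training strategy, then train over the stream of experiences and evaluate on the test stream. Import paths have moved between releases (avalanche.training.supervised here; older versions used avalanche.training.strategies), so treat them as assumptions.

```python
# Continual-learning sketch with Avalanche: sequential tasks from SplitMNIST,
# trained with the Naive (plain fine-tuning) strategy.
import torch
from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP
from avalanche.training.supervised import Naive  # older releases: avalanche.training.strategies

benchmark = SplitMNIST(n_experiences=5)  # MNIST classes split across 5 experiences
model = SimpleMLP(num_classes=10)        # MNIST has 10 classes
strategy = Naive(
    model,
    torch.optim.SGD(model.parameters(), lr=0.001),
    torch.nn.CrossEntropyLoss(),
    train_mb_size=64, train_epochs=1, eval_mb_size=64,
)

for experience in benchmark.train_stream:  # tasks arrive one after another
    strategy.train(experience)
    strategy.eval(benchmark.test_stream)   # evaluating on all tasks exposes forgetting
```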
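
For sepandhaghighi/pycm, a confusion matrix is built directly from label vectors and exposes per-class and overall statistics; the attribute names below follow my recollection of the pycm docs.

```python
# Multi-class confusion matrix with pycm.
from pycm import ConfusionMatrix

actual = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
predict = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]

cm = ConfusionMatrix(actual_vector=actual, predict_vector=predict)
cm.print_matrix()      # class-by-class count table
print(cm.Overall_ACC)  # overall accuracy
print(cm.PPV)          # per-class precision (positive predictive value), keyed by class label
```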
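
For Maluuba/nlg-eval, a sketch based on my recollection of the README: the package needs a one-time `nlg-eval --setup` to download metric data, and the constructor flags below (which skip the heavyweight embedding-based metrics) are assumptions.

```python
# Reference-based NLG metrics with nlg-eval (word-overlap metrics only, to keep it light).
from nlgeval import NLGEval

nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)  # assumed flags to disable embedding metrics
references = ["The cat sat on the mat.", "A cat was sitting on the mat."]  # multiple references per sample
hypothesis = "The cat is sitting on the mat."

scores = nlgeval.compute_individual_metrics(references, hypothesis)
print(scores)  # e.g. Bleu_1..Bleu_4, METEOR, ROUGE_L, CIDEr
```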