llm-evaluation

There are 144 repositories under the llm-evaluation topic.

  • langfuse/langfuse

    🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

    Language: TypeScript · ⭐ 7.5k
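
    To make the integration concrete, here is a minimal tracing sketch using the Langfuse Python SDK's v2-style @observe decorator; the summarize function and model name are illustrative, and the SDK is assumed to pick up LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment:

```python
# Minimal Langfuse tracing sketch (v2-style decorator API; illustrative, not
# the only integration path -- Langfuse also ships OpenAI/Langchain wrappers).
from langfuse.decorators import observe
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

@observe()  # records inputs, output, and latency of this call as a Langfuse trace
def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": f"Summarize in one line: {text}"}],
    )
    return response.choices[0].message.content

print(summarize("Langfuse is an open-source LLM engineering platform."))
```
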
  • promptfoo/promptfoo

    Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

    Language: TypeScript · ⭐ 5k
  • Giskard-AI/giskard

    🐢 Open-Source Evaluation & Testing for AI & LLM systems

    Language: Python · ⭐ 4.2k
  • confident-ai/deepeval

    The LLM Evaluation Framework

    Language: Python · ⭐ 4.1k
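
    A minimal check in deepeval's documented quickstart style: wrap an input/output pair in an LLMTestCase and score it with a metric (AnswerRelevancyMetric is LLM-as-judge, so it needs an OPENAI_API_KEY by default); the strings and threshold here are illustrative:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case: a user input and the model answer under evaluation.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You can return them within 30 days for a full refund.",
)

# LLM-as-judge metric: how relevant is the answer to the input?
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric over the test cases and reports pass/fail per threshold.
evaluate(test_cases=[test_case], metrics=[metric])
```
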
  • NVIDIA/garak

    The LLM vulnerability scanner

    Language: Python · ⭐ 3.1k
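
    For a sense of usage: per garak's README, a scan is launched as a Python module, e.g. `python -m garak --model_type openai --model_name gpt-3.5-turbo --probes encoding` (flags as documented upstream; verify against the current README before relying on them).
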
  • Marker-Inc-Korea/AutoRAG

    AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

    Language: Python · ⭐ 3k
  • Helicone/helicone

    🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

    Language: TypeScript · ⭐ 2.8k
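
    The "one line of code" here is pointing your OpenAI client at Helicone's proxy; a sketch following their documented base-URL pattern (URL and header name as commonly documented, worth verifying against current docs):

```python
import os
from openai import OpenAI

# Route requests through Helicone's gateway so every call is logged and
# available for monitoring/evals in the Helicone dashboard.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```
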
  • PacktPublishing/LLM-Engineers-Handbook

    The LLM engineer's practical guide: from the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

    Language: Python · ⭐ 2.2k
  • Agenta-AI/agenta

    The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM Observability all in one place.

    Language: Python · ⭐ 1.8k
  • lmnr-ai/lmnr

    Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

    Language: TypeScript · ⭐ 1.4k
  • microsoft/prompty

    Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

    Language: Python · ⭐ 617
  • relari-ai/continuous-eval

    Data-Driven Evaluation for LLM-Powered Applications

    Language: Python · ⭐ 457
  • onejune2018/Awesome-LLM-Eval

    Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to explore the technical boundaries of generative AI.

  • kimtth/awesome-azure-openai-llm

    A curated list of 🌌 Azure OpenAI, 🦙 Large Language Models, and references, with notes.

    Language: Python · ⭐ 336
  • palico-ai/palico-ai

    Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework

    Language: TypeScript · ⭐ 335
  • Value4AI/Awesome-LLM-in-Social-Science

    Awesome papers involving LLMs in Social Science.

  • athina-ai/athina-evals

    Python SDK for running evaluations on LLM generated responses

    Language: Python · ⭐ 244
  • Psycoy/MixEval

    The official evaluation suite and dynamic data release for MixEval.

    Language: Python · ⭐ 230
  • iMeanAI/WebCanvas

    Connect agents to live web environments for evaluation.

    Language: Python · ⭐ 225
  • PetroIvaniuk/llms-tools

    A list of LLMs Tools & Projects

  • villagecomputing/superpipe

    Superpipe - optimized LLM pipelines for structured data

    Language: Python · ⭐ 107
  • rungalileo/hallucination-index

    Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.

  • kolenaIO/autoarena

    Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation

    Language: TypeScript · ⭐ 101
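
    For readers new to the technique: automated head-to-head evaluation asks a judge which of two answers is better, then folds the verdicts into a rating such as Elo. The sketch below shows that general mechanism only; it is not AutoArena's API, and judge_prefers_a is a stub you would back with an LLM judge:

```python
from collections import defaultdict

K = 32  # Elo step size: how much one verdict moves a rating

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, a: str, b: str, a_won: bool) -> None:
    # Standard zero-sum Elo update after one head-to-head comparison.
    e_a = expected_score(ratings[a], ratings[b])
    s_a = 1.0 if a_won else 0.0
    ratings[a] += K * (s_a - e_a)
    ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))

def judge_prefers_a(prompt: str, answer_a: str, answer_b: str) -> bool:
    # Stub judge; a real harness would query an LLM with a grading prompt.
    return len(answer_a) > len(answer_b)

ratings = defaultdict(lambda: 1000.0)  # every system starts at 1000
matchups = [
    ("model-x", "model-y", "Explain Elo briefly.", "a longer answer", "short"),
]
for a, b, prompt, ans_a, ans_b in matchups:
    update_elo(ratings, a, b, judge_prefers_a(prompt, ans_a, ans_b))
print(dict(ratings))
```
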
  • raga-ai-hub/raga-llm-hub

    Framework for LLM evaluation, guardrails and security

    Language: Python · ⭐ 100
  • allenai/CommonGen-Eval

    Evaluating LLMs with CommonGen-Lite

    Language: Python · ⭐ 87
  • hkust-nlp/dart-math

    [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*

    Language: Jupyter Notebook · ⭐ 86
  • loganrjmurphy/LeanEuclid

    LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.

    Language: Lean · ⭐ 80
  • Re-Align/just-eval

    A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

    Language: Python · ⭐ 77
  • parea-ai/parea-sdk-py

    Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

    Language: Python · ⭐ 75
  • alopatenko/LLMEvaluation

    A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.

    Language: HTML · ⭐ 74
  • Addepto/contextcheck

    MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

    Language: Python · ⭐ 54
  • deshwalmahesh/PHUDGE

    Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative grading modes, and more. Also catalogs available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more.

    Language: Jupyter Notebook · ⭐ 49
  • azminewasi/Awesome-LLMs-ICLR-24

    A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) 2024.

  • AntonioGr7/pratical-llms

    A collection of hands-on notebooks for LLM practitioners.

    Language: Jupyter Notebook · ⭐ 39
  • Babelscape/ALERT

    Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"

    Language: Python · ⭐ 35
  • Chainlit/literalai-cookbooks

    Cookbooks and tutorials on Literal AI

    Language: Jupyter Notebook · ⭐ 35