llm-evaluation

There are 60 repositories under the llm-evaluation topic.

  • langfuse/langfuse

    🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

    Language: TypeScript · 4.1k stars
  • Giskard-AI/giskard

    🐢 Open-Source Evaluation & Testing for LLMs and ML models

    Language: Python · 3.5k stars
  • promptfoo/promptfoo

    Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration (a config-style sketch of the idea appears after this list).

    Language: TypeScript · 3.2k stars
  • confident-ai/deepeval

    The LLM Evaluation Framework

    Language: Python · 2.1k stars
  • Agenta-AI/agenta

    The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment in one place.

    Language: Python
  • relari-ai/continuous-eval

    Open-Source Evaluation for GenAI Application Pipelines

    Language: Python
  • onejune2018/Awesome-LLM-Eval

    Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of foundation LLMs, aiming to explore the technical boundaries of generative AI.

  • Value4AI/Awesome-LLM-in-Social-Science

    Awesome papers involving LLMs in Social Science.

  • athina-ai/athina-evals

    Python SDK for running evaluations on LLM-generated responses

    Language: Python
  • Psycoy/MixEval

    The official evaluation suite and dynamic data release for MixEval.

    Language: Python
  • villagecomputing/superpipe

    Superpipe - optimized LLM pipelines for structured data

    Language: Python
  • allenai/CommonGen-Eval

    Evaluating LLMs with CommonGen-Lite

    Language: Python
  • raga-ai-hub/raga-llm-hub

    Framework for LLM evaluation, guardrails and security

    Language: Python
  • PetroIvaniuk/llms-tools

    A list of LLM tools & projects

  • Re-Align/just-eval

    A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

    Language: Python
  • rungalileo/hallucination-index

    Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.

  • loganrjmurphy/LeanEuclid

    LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.

    Language: Lean
  • deshwalmahesh/PHUDGE

    Official repo for the paper "PHUDGE: Phi-3 as Scalable Judge". Evaluate your LLMs with or without a custom rubric or reference answer, using absolute or relative grading, and more. It also collects available tools, methods, repos, and code for hallucination detection, LLM evaluation, and grading (a generic LLM-as-judge sketch appears after this list).

    Language: Jupyter Notebook
  • parea-ai/parea-sdk-py

    Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

    Language: Python
  • AntonioGr7/pratical-llms

    A collection of hands-on notebooks for LLM practitioners

    Language: Jupyter Notebook
  • ChanLiang/CONNER

    The implementation for the EMNLP 2023 paper "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators"

    Language: Python
  • LLM-Evaluation-s-Always-Fatiguing/leaf-playground

    A framework for building scenario-simulation projects in which both human and LLM-based agents can participate, with a user-friendly web UI for visualizing simulations and support for automatic evaluation at the agent-action level.

    Language: Python
  • intuit-ai-research/DCR-consistency

    DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

    Language: Python
  • Babelscape/ALERT

    Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"

    Language: Python
  • minnesotanlp/cobbler

    Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

    Language: Jupyter Notebook
  • azminewasi/Awesome-LLMs-ICLR-24

    A comprehensive resource hub compiling all the LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.

  • aws-samples/fm-leaderboarder

    FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts (a minimal win-rate ranking sketch appears after this list).

    Language: Python
  • Chainlit/literal-cookbook

    Cookbooks and tutorials on Literal AI

    Language: Jupyter Notebook
  • VITA-Group/llm-kick

    [ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. "Compressing LLMs: The Truth Is Rarely Pure and Never Simple."

    Language: Python
  • Praful932/llmsearch

    Find better generation parameters for your LLM

    Language: Python
  • evaluation-tools/nutcracker

    Large Model Evaluation Experiments

    Language: Python
  • kwinkunks/promptly

    A prompt collection for testing and evaluation of LLMs.

    Language: Jupyter Notebook
  • yandex-research/mind-your-format

    Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements

    Language: Jupyter Notebook
  • euskoog/openai-assistants-link

    Link your OpenAI Assistants to a custom store and evaluate Assistant responses

    Language: Python
  • Networks-Learning/prediction-powered-ranking

    Code for the paper "Prediction-Powered Ranking of Large Language Models", arXiv 2024.

    Language: Python
  • zhuohaoyu/KIEval

    [ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

    Language: Python
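
A few of the tools above lend themselves to short illustrations. promptfoo, for example, is driven by declarative configs and a CLI; the Python sketch below (referenced from the promptfoo entry) mimics the same idea of a config that pairs prompts, models, and assertions. It is a generic, hypothetical sketch, not promptfoo's actual API: `call_model` is a placeholder you would replace with a real client call (OpenAI SDK, LiteLLM, etc.), and the model names are made up.

```python
# Hypothetical declarative prompt/model comparison, in the spirit of promptfoo.
# Nothing here is promptfoo's API; call_model and the model names are placeholders.

CONFIG = {
    "prompts": [
        "Summarize in one sentence: {text}",
        "TL;DR: {text}",
    ],
    "models": ["model-a", "model-b"],  # placeholder identifiers
    "tests": [
        {"vars": {"text": "The cat sat on the mat."}, "must_contain": "cat"},
        {"vars": {"text": "Water boils at 100 C at sea level."}, "must_contain": "100"},
    ],
}

def call_model(model: str, prompt: str) -> str:
    """Placeholder: swap in a real API call."""
    return f"[{model}] echo: {prompt}"

def run_suite(config: dict) -> None:
    # prompts x models x assertions: run every combination and count passes.
    for model in config["models"]:
        passed, total = 0, 0
        for template in config["prompts"]:
            for test in config["tests"]:
                prompt = template.format(**test["vars"])
                output = call_model(model, prompt)
                ok = test["must_contain"].lower() in output.lower()
                passed += ok
                total += 1
                print(f"{model} | {'PASS' if ok else 'FAIL'} | {prompt[:40]}")
        print(f"{model}: {passed}/{total} checks passed\n")

if __name__ == "__main__":
    run_suite(CONFIG)
```

In promptfoo itself the equivalent configuration lives in a YAML file and the comparison runs from the CLI; the point here is only the shape of the workflow, not its interface.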
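Several other entries (PHUDGE, just-eval, deepeval, athina-evals) are built around the LLM-as-judge pattern: a grader model scores a candidate answer against a rubric and, optionally, a reference answer. The sketch below, referenced from the PHUDGE entry, shows the pattern generically; `judge` is a stub for whatever judge-model call you use, and the rubric wording is illustrative rather than taken from any of these projects.

```python
import json

RUBRIC = """Score the ANSWER from 1 (poor) to 5 (excellent) on each criterion:
- helpfulness: does it address the question?
- factuality: is it consistent with the REFERENCE, if one is given?
Return JSON such as {"helpfulness": 4, "factuality": 5, "rationale": "..."}."""

def judge(prompt: str) -> str:
    """Stub for a judge-model call (e.g. GPT-4 or Phi-3); returns JSON text."""
    return '{"helpfulness": 4, "factuality": 3, "rationale": "stubbed verdict"}'

def grade(question: str, answer: str, reference: str | None = None) -> dict:
    # Assemble the grading prompt: rubric, question, answer, optional reference.
    prompt = (
        f"{RUBRIC}\n\nQUESTION: {question}\nANSWER: {answer}\n"
        + (f"REFERENCE: {reference}\n" if reference else "")
    )
    try:
        return json.loads(judge(prompt))
    except json.JSONDecodeError:
        # Judge output that fails to parse is recorded rather than silently dropped.
        return {"helpfulness": None, "factuality": None, "rationale": "unparseable"}

print(grade(
    "What is the boiling point of water at sea level?",
    "About 100 degrees Celsius.",
    reference="100 °C (212 °F) at 1 atm.",
))
```

Absolute grading (score one answer against a rubric) and relative grading (ask the judge to pick between two answers) are the two variants most of these frameworks expose; the stub above covers only the absolute case.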
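Finally, leaderboard-oriented projects such as fm-leaderboarder and hallucination-index aggregate many per-example judgments into a ranking. The sketch below, referenced from the fm-leaderboarder entry, is a generic illustration (not any listed project's actual methodology): it turns pairwise judge verdicts into per-model win rates.

```python
from collections import defaultdict

# Hypothetical pairwise verdicts (model_x, model_y, winner) produced by a judge model.
verdicts = [
    ("model-a", "model-b", "model-a"),
    ("model-a", "model-c", "model-c"),
    ("model-b", "model-c", "model-c"),
    ("model-a", "model-b", "model-a"),
]

wins: dict[str, int] = defaultdict(int)
games: dict[str, int] = defaultdict(int)
for x, y, winner in verdicts:
    games[x] += 1
    games[y] += 1
    wins[winner] += 1

# Rank models by win rate; more comparisons make the estimate more stable.
leaderboard = sorted(games, key=lambda m: wins[m] / games[m], reverse=True)
for model in leaderboard:
    rate = wins[model] / games[model]
    print(f"{model}: {rate:.0%} win rate over {games[model]} comparisons")
```

A production leaderboard would typically add uncertainty estimates or an Elo/Bradley-Terry-style fit on top of the raw win rates.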