llm-eval

There are 37 repositories under the llm-eval topic.

  • promptfoo/promptfoo

    Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.

    Language: TypeScript
  • Arize-ai/phoenix

    AI Observability & Evaluation

    Language: Jupyter Notebook
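
    As a rough illustration of the observability workflow this entry describes, a minimal sketch of starting a local Phoenix instance looks something like the following (assuming the `arize-phoenix` package; entry points may differ across versions):

    ```python
    # Minimal sketch: start the local Phoenix UI for LLM observability.
    # Assumes `pip install arize-phoenix`; API details may vary by version.
    import phoenix as px

    session = px.launch_app()  # launches the local observability/eval UI
    print(session.url)         # open this URL in a browser to inspect traces and evals
    ```
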
  • Giskard-AI/giskard

    🐢 Open-Source Evaluation & Testing for AI & LLM systems

    Language: Python
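
    A hedged sketch of the scan workflow Giskard advertises; the wrapper arguments below follow its text-generation model interface as documented, but treat names and defaults as assumptions that may differ by version:

    ```python
    # Hedged sketch: wrap an LLM app and run Giskard's vulnerability scan.
    # Assumes giskard>=2; the LLM-assisted detectors also need an OpenAI key configured.
    import giskard
    import pandas as pd

    def answer(df: pd.DataFrame) -> list:
        # Replace this stub with a call to your real LLM application.
        return ["stub answer" for _ in df["question"]]

    model = giskard.Model(
        model=answer,
        model_type="text_generation",
        name="demo QA bot",                       # hypothetical name
        description="Answers questions about internal docs",
        feature_names=["question"],
    )
    report = giskard.scan(model)                  # probes for hallucination, injection, etc.
    report.to_html("scan_report.html")            # write the findings to an HTML report
    ```
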
  • iterative/datachain

    ETL, Analytics, Versioning for Unstructured Data

    Language: Python
  • uptrain-ai/uptrain

    UpTrain is an open-source, unified platform to evaluate and improve generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and gives insights on how to resolve them.

    Language: Python
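
    A hedged sketch of what running UpTrain's preconfigured checks might look like; the `EvalLLM` and `Evals` names are taken from its Python API as commonly shown, but treat the details as assumptions:

    ```python
    # Hedged sketch: score a response with a couple of UpTrain's preconfigured checks.
    # Assumes `pip install uptrain` and an OpenAI key; names may differ by version.
    from uptrain import EvalLLM, Evals

    data = [{
        "question": "What is the capital of France?",
        "context": "France is a country in Europe. Its capital is Paris.",
        "response": "The capital of France is Paris.",
    }]

    eval_llm = EvalLLM(openai_api_key="sk-...")   # placeholder key
    results = eval_llm.evaluate(
        data=data,
        checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY],
    )
    print(results)  # each item carries a score and an explanation for the check
    ```
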
  • athina-ai/athina-evals

    Python SDK for running evaluations on LLM generated responses

    Language: Python
  • fiddlecube/fiddlecube-sdk

    Generate ideal question-answer pairs for testing RAG

    Language: Python
  • Re-Align/just-eval

    A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

    Language: Python
  • parea-ai/parea-sdk-py

    Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

    Language: Python
  • kuk/rulm-sbs2

    A benchmark comparing Russian ChatGPT analogues: Saiga, YandexGPT, GigaChat

    Language: Jupyter Notebook
  • multinear/multinear

    Develop reliable AI apps

    Language: Python
  • Auto-Playground/ragrank

    🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.

    Language: Python
  • alan-turing-institute/prompto

    An open source library for asynchronous querying of LLM endpoints

    Language: Python
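
    prompto's own interface isn't shown here; purely as a generic illustration of the asynchronous-querying pattern it addresses, requests can be fanned out with `asyncio` and the OpenAI async client (this is not prompto's API):

    ```python
    # Generic illustration of asynchronously querying an LLM endpoint.
    # Not prompto's API; uses asyncio plus the OpenAI async client as an example.
    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def ask(prompt: str) -> str:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    async def main() -> None:
        prompts = ["Define recall.", "Define precision.", "Define F1 score."]
        answers = await asyncio.gather(*(ask(p) for p in prompts))  # run queries concurrently
        for prompt, answer in zip(prompts, answers):
            print(prompt, "->", answer)

    asyncio.run(main())
    ```
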
  • Supahands/llm-comparison-backend

    An open-source project that lets you compare two LLMs head to head on a given prompt. This repository covers the backend, which integrates LLM APIs for use by the front-end.

    Language: Python
  • honeyhiveai/realign

    Realign is a testing and simulation framework for AI applications.

    Language: Python
  • genia-dev/vibraniumdome

    LLM Security Platform.

    Language: Python
  • Networks-Learning/prediction-powered-ranking

    Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

    Language: Jupyter Notebook
  • pyladiesams/eval-llm-based-apps-jan2025

    Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

    Language: Jupyter Notebook
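
    The workshop description points at wiring LLM evals into a test suite; a generic pytest-style sketch of that idea (not taken from the workshop materials, and with `summarize` as a hypothetical stand-in for your app) might look like:

    ```python
    # Generic sketch: an LLM-output check that runs as part of a pytest suite.
    # The `summarize` function is a hypothetical stand-in for your LLM app.
    import pytest

    def summarize(text: str) -> str:
        # Placeholder for a real LLM call.
        return "Paris is the capital of France."

    @pytest.mark.parametrize("text,required", [
        ("France is a country in Europe. Its capital is Paris.", "Paris"),
    ])
    def test_summary_keeps_key_fact(text: str, required: str) -> None:
        summary = summarize(text)
        assert required in summary          # cheap deterministic check
        assert len(summary.split()) <= 50   # guard against runaway verbosity
    ```
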
  • prompt-foundry/python-sdk

    The prompt engineering, prompt management, and prompt evaluation tool for Python

    Language: Python
  • prompt-foundry/typescript-sdk

    The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.

    Language: TypeScript
  • harlev/eva-l

    LLM Evaluation Framework

    Language: Python
  • harshagrawal523/GenerativeAgents

    Generative agents: computational software agents that simulate believable human behavior using OpenAI LLM models. The main focus was to develop a game, "Werewolves of Miller's Hollow", aiming to replicate human-like behavior.

    Language: Python
  • parea-ai/parea-sdk-ts

    TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

    Language: TypeScript
  • yukinagae/genkitx-promptfoo

    Community Plugin for Genkit to use Promptfoo

    Language: TypeScript
  • jaaack-wang/multi-problem-eval-llm

    Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities

    Language: Jupyter Notebook
  • yuzu-ai/ShinRakuda

    Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.

    Language: Python
  • genia-dev/vibraniumdome-docs

    LLM Security Platform Docs

    Language: MDX
  • prompt-foundry/go-sdk

    The prompt engineering, prompt management, and prompt evaluation tool for Go.

  • prompt-foundry/ruby-sdk

    The prompt engineering, prompt management, and prompt evaluation tool for Ruby.

  • yukinagae/promptfoo-sample

    Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models

  • prompt-foundry/dotnet-sdk

    The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET

  • prompt-foundry/kotlin-sdk

    The prompt engineering, prompt management, and prompt evaluation tool for Kotlin.

  • cuiyuheng/opencompass

    OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.

    Language: Python
  • daqh/llm-eval

    This project applies the LLM-Eval framework to the PersonaChat dataset to assess response quality in a conversational context. Using GPT-4o-mini via the OpenAI API, the system generates scores (on a 0-5 or 0-100 scale) for four evaluation metrics: context, grammar, relevance, and appropriateness.

    Language: Python
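
    A minimal sketch of the scoring idea described above, using GPT-4o-mini through the OpenAI API; the prompt wording and response parsing are assumptions, not the repository's own code:

    ```python
    # Hedged sketch: ask GPT-4o-mini to grade a dialogue response on four metrics.
    # Prompt wording and JSON schema are illustrative, not the repo's own.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def score_response(context: str, response: str) -> dict:
        prompt = (
            "Rate the response to the dialogue context on a 0-5 scale for "
            "context, grammar, relevance, and appropriateness. "
            "Reply with JSON only, e.g. {\"context\": 4, \"grammar\": 5, "
            "\"relevance\": 4, \"appropriateness\": 5}.\n\n"
            f"Context: {context}\nResponse: {response}"
        )
        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return json.loads(completion.choices[0].message.content)

    print(score_response("Hi! How was your weekend?", "Pretty good, I went hiking."))
    ```
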
  • kdcyberdude/punjabi-llm-eval

    First Punjabi LLM Eval.

    Language: Python
  • yukinagae/genkit-promptfoo-sample

    Sample implementation demonstrating how to use Firebase Genkit with Promptfoo

    Language: TypeScript