evals

There are 33 repositories under the evals topic.

  • mastra-ai/mastra

    The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.

    Language: TypeScript
  • Arize-ai/phoenix

    AI Observability & Evaluation

    Language: Jupyter Notebook
  • AgentOps-AI/agentops

    Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, LangChain, Autogen, AG2, and CamelAI.

    Language: Python
  • lmnr-ai/lmnr

    Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

    Language: TypeScript
  • superlinear-ai/raglite

    🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL

    Language: Python
  • mattpocock/evalite

    Evaluate your LLM-powered apps with TypeScript

    Language: TypeScript
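
    To give a flavor of the tool, here is a minimal eval file in the style of evalite's quickstart; the exact API may differ between versions, and the Levenshtein scorer is assumed to come from the companion autoevals package. Run it with npx evalite.

    ```ts
    // hello.eval.ts - a minimal evalite sketch (API per evalite's quickstart; verify against current docs)
    import { evalite } from "evalite";
    import { Levenshtein } from "autoevals";

    evalite("Hello eval", {
      // Dataset: pairs of inputs and expected outputs
      data: async () => [{ input: "Hello", expected: "Hello World!" }],
      // Task under test; in a real eval this would call your LLM
      task: async (input: string) => input + " World!",
      // Scorers grade each output against `expected`
      scorers: [Levenshtein],
    });
    ```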
  • keshik6/HourVideo

    [NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

    Language: Jupyter Notebook
  • METR/vivaria

    Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

    Language: TypeScript
  • dustalov/evalica

    Evalica, your favourite evaluation toolkit

    Language: Python
  • flexpa/llm-fhir-eval

    Benchmarking Large Language Models for FHIR

    Language: TypeScript
  • AIAnytime/rag-evaluator

    A library for evaluating Retrieval-Augmented Generation (RAG) systems (the traditional way).

    Language: Python
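
    As an illustration of what "traditional" RAG evaluation means (lexical-overlap metrics rather than LLM judges), here is a token-level F1 sketch in TypeScript; this is not rag-evaluator's API (that library is Python), just the style of metric it covers.

    ```ts
    // Token-level F1 between a generated answer and a reference answer.
    function tokenF1(generated: string, reference: string): number {
      const toTokens = (s: string) => s.toLowerCase().match(/\w+/g) ?? [];
      const gen = toTokens(generated);
      const ref = toTokens(reference);
      // Count overlapping tokens (multiset intersection)
      const refCounts = new Map<string, number>();
      for (const t of ref) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);
      let overlap = 0;
      for (const t of gen) {
        const c = refCounts.get(t) ?? 0;
        if (c > 0) { overlap++; refCounts.set(t, c - 1); }
      }
      if (overlap === 0) return 0;
      const precision = overlap / gen.length;
      const recall = overlap / ref.length;
      return (2 * precision * recall) / (precision + recall);
    }

    console.log(tokenF1("Paris is the capital of France", "The capital of France is Paris"));
    ```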
  • NirantK/rag-to-riches

    Language: Jupyter Notebook
  • maragudk/gai

    Go Artificial Intelligence (GAI) helps you work with foundation models, large language models, and other AI models.

    Language: Go
  • The-Swarm-Corporation/StatisticalModelEvaluator

    An implementation of Anthropic's paper "A statistical approach to model evaluations"

    Language: Python
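
    The core idea of that paper is to report eval scores with error bars rather than as bare point estimates. A language-agnostic sketch of that idea (in TypeScript here, though the repo itself is Python): treat per-question scores as samples and attach a CLT-based standard error and 95% confidence interval to the mean.

    ```ts
    // Per-question eval scores -> mean with standard error and 95% CI (assumes n > 1).
    function summarize(scores: number[]): { mean: number; sem: number; ci95: [number, number] } {
      const n = scores.length;
      const mean = scores.reduce((a, b) => a + b, 0) / n;
      // Sample variance (n - 1 denominator), then standard error of the mean
      const variance = scores.reduce((a, s) => a + (s - mean) ** 2, 0) / (n - 1);
      const sem = Math.sqrt(variance / n);
      // 95% confidence interval under the normal approximation
      return { mean, sem, ci95: [mean - 1.96 * sem, mean + 1.96 * sem] };
    }

    // Example: 0/1 correctness scores on a five-question eval
    console.log(summarize([1, 0, 1, 1, 0]));
    ```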
  • root-signals/rs-python-sdk

    Root Signals Python SDK

    Language: Python
  • openlayer-ai/templates

    Our curated collection of templates. Use these patterns to set up your AI projects for evaluation with Openlayer.

    Language: Python
  • BBischof/mindAgents

    AI Agents play The Mind card game

    Language: Python
  • nstankov-bg/oaievals-collector

    The OAIEvals Collector: a robust, Go-based metric collector for OpenAI Evals data. Supports Kafka, Elastic, Loki, InfluxDB, and TimescaleDB integrations, and containerized deployment with Docker. Streamlines OpenAI Evals data management with a low barrier to entry!

    Language: Go
  • noah-art3mis/crucible

    Develop better LLM apps by testing different models and prompts in bulk.

    Language: Python
  • Shard-AI/Shard

    Open Source Video Understanding API and Large Vision Model Observability Platform.

    Language: Python
  • zeus-fyi/mockingbird

    Mockingbird Front End Code | Zeus + SciFi = Power of the gods (cloud + ai | Zeus) Meets the power of SciFi (human ingenuity | SfYi) At the intersection of intelligent design (systems engineering excellence) For your intelligence —ZeusFYI.

    Language: TypeScript
  • gokayfem/dspy-ollama-colab

    DSPy with Ollama and llama.cpp on Google Colab

    Language: Jupyter Notebook
  • lennart-finke/picturebooks

    Which objects are visible through the holes in a picture book? This visual task is easy for adults, doable for primary schoolers, but hard for vision transformers.

    Language: Jupyter Notebook
  • mandoline-ai/mandoline-node

    Official Node.js client for the Mandoline API

    Language: TypeScript
  • mandoline-ai/mandoline-python

    Official Python client for the Mandoline API

    Language: Python
  • maragudk/evals-action

    A GitHub Action to parse, display, and aggregate LLM eval results.

  • maragudk/gai-starter-kit

    Get started with LLMs, full-text search (FTS) and vector search, RAG, and more, in Go!

    Language: Go
  • modelmetry/modelmetry-sdk-js

    The Modelmetry JS/TS SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.

    Language: TypeScript
  • modelmetry/modelmetry-sdk-python

    The Modelmetry Python SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.

    Language: Python
  • camronh/ContextLength-Experiment

    Gemini 1.5 Million Token Context Experiment

    Language: Jupyter Notebook
  • jancervenka/czech-simpleqa

    How well can language models answer questions in Czech?

    Language: Python
  • josephwilk/disability-justice-eval

    Automated eval of various LLMs against disability justice statements

    Language: Python
  • jtmuller5/vibe-checker

    The TypeScript LLM Evaluation File

    Language: TypeScript