evals
There are 33 repositories under the evals topic.
mastra-ai/mastra
The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.
Arize-ai/phoenix
AI Observability & Evaluation
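Phoenix can be tried locally in a couple of lines; a minimal sketch assuming the arize-phoenix package is installed:

```python
# Minimal local Phoenix session; assumes `pip install arize-phoenix`.
import phoenix as px

session = px.launch_app()  # starts the local observability UI
print(session.url)         # open this URL to inspect traces and evals
```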
AgentOps-AI/agentops
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
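A minimal sketch of instrumenting a run with AgentOps; `init`/`end_session` match older SDK versions and may differ in newer ones:

```python
# Hedged sketch: assumes `pip install agentops` and AGENTOPS_API_KEY
# set in the environment; newer SDK versions may change this surface.
import agentops

agentops.init()  # start a monitored session; supported LLM calls are traced

# ... run your agent or LLM calls here ...

agentops.end_session("Success")  # close the session and flush metrics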
lmnr-ai/lmnr
Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
superlinear-ai/raglite
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
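A hedged sketch recalled from RAGLite's README; the names RAGLiteConfig, insert_document, and hybrid_search are assumptions to verify against the repo:

```python
# Assumed API (check the RAGLite README before relying on these names).
from raglite import RAGLiteConfig, insert_document, hybrid_search

config = RAGLiteConfig(db_url="duckdb:///raglite.db")  # or a PostgreSQL URL
insert_document("manual.pdf", config=config)           # chunk + embed into the DB
chunk_ids, scores = hybrid_search("How do I reset it?", config=config)
```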
mattpocock/evalite
Evaluate your LLM-powered apps with TypeScript
keshik6/HourVideo
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
METR/vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
dustalov/evalica
Evalica, your favourite evaluation toolkit
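Evalica turns pairwise "A beat B" judgments into model rankings; a short sketch of its Elo scorer, with the signature recalled from its README:

```python
# Rank models from pairwise outcomes; the elo() signature is recalled
# from the Evalica README and may differ between versions.
from evalica import Winner, elo

xs = ["gpt-4", "gpt-4", "claude"]
ys = ["claude", "llama", "llama"]
winners = [Winner.X, Winner.X, Winner.Draw]

result = elo(xs, ys, winners)
print(result.scores)  # per-model Elo ratings
```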
flexpa/llm-fhir-eval
Benchmarking Large Language Models for FHIR
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems the traditional way.
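The "traditional way" here means n-gram overlap metrics such as ROUGE; a sketch using Google's rouge-score package (not this repo's own API) shows the style of metric involved:

```python
# ROUGE-style overlap between a reference and a generated answer,
# using the rouge-score package rather than rag-evaluator itself.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The capital of France is Paris.",       # reference answer
    "Paris is the capital city of France.",  # generated answer
)
print(scores["rougeL"].fmeasure)
```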
maragudk/gai
Go Artificial Intelligence (GAI) helps you work with foundational models, large language models, and other AI models.
The-Swarm-Corporation/StatisticalModelEvaluator
An implementation of Anthropic's paper "A statistical approach to model evaluations"
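The paper's core move is to treat each eval question as a sample and report a standard error alongside the score; a plain-numpy illustration, independent of this repo's API:

```python
# Mean eval score with a 95% confidence interval (the paper's core idea),
# shown with plain numpy rather than this repo's own API.
import numpy as np

scores = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # per-question pass/fail
mean = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(len(scores))    # standard error of the mean
print(f"accuracy = {mean:.2f} ± {1.96 * sem:.2f} (95% CI)")
```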
root-signals/rs-python-sdk
Root Signals Python SDK
openlayer-ai/templates
Our curated collection of templates. Use these patterns to set up your AI projects for evaluation with Openlayer.
BBischof/mindAgents
AI Agents play The Mind card game
nstankov-bg/oaievals-collector
The OAIEvals Collector: a robust, Go-based metric collector for EVALS data. Supports Kafka, Elastic, Loki, InfluxDB, and TimescaleDB integrations, and containerized deployment with Docker. Streamlines OAI-Evals data management with a low barrier to entry!
noah-art3mis/crucible
Develop better LLM apps by testing different models and prompts in bulk.
Shard-AI/Shard
Open Source Video Understanding API and Large Vision Model Observability Platform.
zeus-fyi/mockingbird
Mockingbird front-end code, part of the ZeusFYI platform combining cloud + AI (Zeus) with human ingenuity (SciFi).
gokayfem/dspy-ollama-colab
DSPy with Ollama and llama.cpp on Google Colab
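Wiring DSPy to a local Ollama server takes a few lines; a minimal sketch where the model name and port are example values, not taken from this repo:

```python
# DSPy over a local Ollama server; model name and port are examples.
import dspy

lm = dspy.LM("ollama_chat/llama3", api_base="http://localhost:11434")
dspy.configure(lm=lm)

qa = dspy.Predict("question -> answer")
print(qa(question="What is an eval?").answer)
```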
lennart-finke/picturebooks
Which objects are visible through the holes in a picture book? This visual task is easy for adults, doable for primary schoolers, but hard for vision transformers.
mandoline-ai/mandoline-node
Official Node.js client for the Mandoline API
mandoline-ai/mandoline-python
Official Python client for the Mandoline API
maragudk/evals-action
A GitHub Action to parse, display, and aggregate LLM eval results.
maragudk/gai-starter-kit
Get started with LLMs, full-text search (FTS) and vector search, RAG, and more, in Go!
modelmetry/modelmetry-sdk-js
The Modelmetry JS/TS SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.
modelmetry/modelmetry-sdk-python
The Modelmetry Python SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.
camronh/ContextLength-Experiment
Gemini 1.5 Million Token Context Experiment
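A needle-in-a-haystack probe like this experiment runs can be sketched with the google-generativeai package; the model name and filler size are assumptions:

```python
# Hedged sketch of a long-context retrieval probe; scale the filler up
# toward the 1M-token limit for the real experiment.
import google.generativeai as genai

genai.configure(api_key="...")  # your Gemini API key
model = genai.GenerativeModel("gemini-1.5-pro")

needle = "The secret passphrase is 'blue-falcon-42'."
haystack = ("Lorem ipsum dolor sit amet. " * 20_000) + needle
response = model.generate_content(haystack + "\n\nWhat is the secret passphrase?")
print(response.text)
```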
jancervenka/czech-simpleqa
How well can language models answer questions in Czech?
josephwilk/disability-justice-eval
Automated eval of various LLMs against disability justice statements
jtmuller5/vibe-checker
The TypeScript LLM Evaluation File