evals
There are 15 repositories under the evals topic.
AgentOps-AI/agentops
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks such as CrewAI, LangChain, and AutoGen.
METR/vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
superlinear-ai/raglite
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems using traditional metrics.
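"Traditional" RAG metrics typically include token-overlap scores such as SQuAD-style F1 between a generated answer and a reference. A minimal, generic sketch of that metric (not rag-evaluator's actual API):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a generated answer and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection gives the multiset of shared tokens.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat", "the cat sat on the mat"))  # 0.666...
```

Real toolkits usually add normalization (punctuation and article stripping) before tokenizing; this sketch keeps only the core overlap computation.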
dustalov/evalica
Evalica, your favourite evaluation toolkit
openlayer-ai/templates
Our curated collection of templates. Use these patterns to set up your AI projects for evaluation with Openlayer.
nstankov-bg/oaievals-collector
The OAIEvals Collector: a robust, Go-based metric collector for evals data. Supports Kafka, Elastic, Loki, InfluxDB, and TimescaleDB integrations, and containerized deployment with Docker. Streamlines OAI-Evals data management efficiently with a low barrier to entry!
zeus-fyi/mockingbird
Front-end code for Mockingbird, ZeusFYI's platform combining cloud and AI (Zeus) with systems engineering (SciFi).
gokayfem/dspy-ollama-colab
DSPy with Ollama and llama.cpp on Google Colab.
lennart-finke/picturebooks
Which objects are visible through the holes in a picture book? This visual task is easy for adults, doable for primary schoolers, but hard for vision transformers.
modelmetry/modelmetry-sdk-js
The Modelmetry JS/TS SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.
modelmetry/modelmetry-sdk-python
The Modelmetry Python SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.
noah-art3mis/crucible
Develop better LLM apps by testing different models and prompts in bulk.
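Bulk testing of this kind amounts to running every model/prompt combination and collecting the outputs for comparison. A generic sketch of the idea (hypothetical helper names, not crucible's actual API):

```python
from itertools import product

def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; a real harness would hit an API.
    return f"[{model}] response to: {prompt}"

def run_grid(models: list[str], prompts: list[str]) -> dict:
    """Run every (model, prompt) combination and collect the outputs."""
    return {
        (model, prompt): call_model(model, prompt)
        for model, prompt in product(models, prompts)
    }

results = run_grid(["model-a", "model-b"], ["Summarize X.", "Translate Y."])
print(len(results))  # one result per (model, prompt) pair -> 4
```

A real harness would also record latency and cost per call and score each output, but the cross-product loop is the core of the pattern.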
camronh/ContextLength-Experiment
An experiment with Gemini 1.5's 1-million-token context window.