evals
There are 15 repositories under the evals topic.
AgentOps-AI/agentops
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks such as CrewAI, LangChain, and AutoGen.
METR/vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
superlinear-ai/raglite
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems using traditional metrics.
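"Traditional" RAG metrics typically include token-overlap scores such as SQuAD-style F1 between a generated answer and a reference. A minimal, generic sketch of that metric (not rag-evaluator's actual API):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a generated answer and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection gives the multiset of shared tokens.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat", "the cat sat on the mat"))  # 0.666...
```

Real toolkits usually add normalization (punctuation and article stripping) before tokenizing; this sketch keeps only the core overlap computation.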
dustalov/evalica
Evalica, your favourite evaluation toolkit
openlayer-ai/templates
Our curated collection of templates. Use these patterns to set up your AI projects for evaluation with Openlayer.
nstankov-bg/oaievals-collector
The OAIEvals Collector: a robust, Go-based metric collector for evals data. Supports Kafka, Elastic, Loki, InfluxDB, and TimescaleDB integrations, and containerized deployment with Docker. Streamlines OAI-Evals data management efficiently with a low barrier to entry!
zeus-fyi/mockingbird
Front-end code for Mockingbird, ZeusFYI's platform combining cloud and AI (Zeus) with systems engineering (SciFi).
gokayfem/dspy-ollama-colab
DSPy with Ollama and llama.cpp on Google Colab.
lennart-finke/picturebooks
Which objects are visible through the holes in a picture book? This visual task is easy for adults, doable for primary schoolers, but hard for vision transformers.
modelmetry/modelmetry-sdk-js
The Modelmetry JS/TS SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.
modelmetry/modelmetry-sdk-python
The Modelmetry Python SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.
noah-art3mis/crucible
Develop better LLM apps by testing different models and prompts in bulk.
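Bulk testing of this kind amounts to running every model/prompt combination and collecting the outputs for comparison. A generic sketch of the idea (hypothetical helper names, not crucible's actual API):

```python
from itertools import product

def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; a real harness would hit an API.
    return f"[{model}] response to: {prompt}"

def run_grid(models: list[str], prompts: list[str]) -> dict:
    """Run every (model, prompt) combination and collect the outputs."""
    return {
        (model, prompt): call_model(model, prompt)
        for model, prompt in product(models, prompts)
    }

results = run_grid(["model-a", "model-b"], ["Summarize X.", "Translate Y."])
print(len(results))  # one result per (model, prompt) pair -> 4
```

A real harness would also record latency and cost per call and score each output, but the cross-product loop is the core of the pattern.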
camronh/ContextLength-Experiment
An experiment with Gemini 1.5's 1-million-token context window.