evals

There are 20 repositories under the evals topic.

  • AgentOps-AI/agentops

    Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks like CrewAI, Langchain, and Autogen

    Language: Python
  • lmnr-ai/lmnr

    Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

    Language: TypeScript
  • superlinear-ai/raglite

    🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite

    Language: Python
  • METR/vivaria

    Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

    Language: TypeScript
  • keshik6/HourVideo

    [NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

    Language: Jupyter Notebook
  • AIAnytime/rag-evaluator

    A library for evaluating Retrieval-Augmented Generation (RAG) systems (The traditional ways).

    Language: Python
  • dustalov/evalica

    Evalica, your favourite evaluation toolkit

    Language: Python
  • flexpa/llm-fhir-eval

    Benchmarking Large Language Models for FHIR

  • NirantK/rag-to-riches

    Language: Jupyter Notebook
  • openlayer-ai/templates

    Our curated collection of templates. Use these patterns to set up your AI projects for evaluation with Openlayer.

    Language: Python
  • The-Swarm-Corporation/StatisticalModelEvaluator

    An implementation of Anthropic's paper and essay on "A statistical approach to model evaluations"

    Language: Python
  • nstankov-bg/oaievals-collector

    The OAIEvals Collector: A robust, Go-based metric collector for EVALS data. Supports Kafka, Elastic, Loki, InfluxDB, and TimescaleDB integrations, as well as containerized deployment with Docker. Streamlines OAI-Evals data management efficiently with a low barrier to entry!

    Language: Go
  • noah-art3mis/crucible

    Develop better LLM apps by testing different models and prompts in bulk.

    Language: Python
  • VikramxD/pixelupbench

    Benchmarking Pixel based AI Upscaling Models for Video Upscaling

    Language: Python
  • zeus-fyi/mockingbird

    Mockingbird Front End Code | Zeus + SciFi = Power of the gods (cloud + ai | Zeus) Meets the power of SciFi (human ingenuity | SfYi) At the intersection of intelligent design (systems engineering excellence) For your intelligence —ZeusFYI.

    Language: TypeScript
  • gokayfem/dspy-ollama-colab

    DSPy with Ollama and llama.cpp on Google Colab

    Language: Jupyter Notebook
  • lennart-finke/picturebooks

    Which objects are visible through the holes in a picture book? This visual task is easy for adults, doable for primary schoolers, but hard for vision transformers.

    Language: Jupyter Notebook
  • modelmetry/modelmetry-sdk-js

    The Modelmetry JS/TS SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.

    Language: TypeScript
  • modelmetry/modelmetry-sdk-python

    The Modelmetry Python SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.

    Language: Python
  • camronh/ContextLength-Experiment

    Gemini 1.5 Million Token Context Experiment

    Language: Jupyter Notebook