rdnfn's Stars
jonathan-roberts1/GRAB
dmg-illc/JUDGE-BENCH
BenTenmann/bio-data-harmoniser
Automatically ingest and harmonise biological data from different sources.
google-deepmind/dangerous-capability-evaluations
cambridgeltl/zepo
Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments (Zhou et al.)
confident-ai/deepeval
The LLM Evaluation Framework
UKGovernmentBEIS/inspect_ai
Inspect: A framework for large language model evaluations
HannahKirk/prism-alignment
The Prism Alignment Project
bminixhofer/zett
Code for Zero-Shot Tokenizer Transfer
signalstickers/signalstickers
🖥📱 An unofficial gallery of stickers for Signal, the secure messenger!
google-deepmind/long-form-factuality
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
HowieHwong/MetaTool
[ICLR 2024] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
swe-bench/experiments
Open-sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
lm-sys/arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
thunlp/ChatEval
Code for our paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate"
meta-llama/llama3
The official Meta Llama 3 GitHub site
s-orellana/UKB_CM_Brain
Publication code
princeton-nlp/SWE-agent
SWE-agent takes a GitHub issue and tries to automatically fix it, using GPT-4, or your LM of choice. It solves 12.47% of bugs in the SWE-bench evaluation set and takes just 1 minute to run.
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
getcursor/cursor
The AI Code Editor
allenai/WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
huggingface/lighteval
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
justinchiu/openlogprobs
Extract full next-token probabilities via language model APIs
killiansheriff/LovelyPlots
Matplotlib style sheets to nicely format figures for scientific papers, theses, and presentations while keeping them fully editable in Adobe Illustrator.
segment-any-text/wtpsplit
Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
HeyPuter/puter
🌐 The Internet OS! Free, Open-Source, and Self-Hostable.
mlcommons/modelgauge
Make it easy to automatically and uniformly measure the behavior of many AI Systems.
marqo-ai/marqo
Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
beartype/beartype
Unbearably fast near-real-time hybrid runtime-static type-checking in pure Python.
citadel-ai/langcheck
Simple, Pythonic building blocks to evaluate LLM applications.