stanford-crfm/helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).

Python · Apache-2.0

Issues

Add new model (upstage/solar-pro-preview-instruct)
#3040 opened 6 days ago by remy-rec
11
Switch AnthropicTokenizer to use client.beta.messages.count_tokens()
#3212 opened 7 days ago by yifanmai
0
Issue with Model Execution on the HumanEval Benchmark
#3201 opened 19 days ago by lucas-s-p
3
Add new model NECTEC (nectec/OpenThaiLLM-Prebuilt-7B, nectec/Pathumma-llm-text-1.0.0)
#3188 opened a month ago by JackJessada
3
Reproducing WMT14
#3175 opened a month ago by MaxHahnbueck
1
At least one of model_deployment and model must be specified
#3184 opened a month ago by thallysonjsa
5
Is there any way to load GGUF models?
#3141 opened 2 months ago by InAnYan
3
VHELM setup
#3183 opened a month ago by smsarov
1
Installation Guide: Required Dependencies for Supported Operating Systems and Distributions
#3160 opened a month ago by shakatoday
5
pip needs to be temporarily downgraded to 24.1.2
#2855 opened 5 months ago by yifanmai
1
Issue with running HEIM
#3080 opened 2 months ago by sudhir-mcw
3
Add MMLU-Pro
#3018 opened 3 months ago by yifanmai
0
Optimum Intel OpenVino fails with segmentation fault
#3066 opened a month ago by yifanmai
2
Help me understand the HELM Classic Leaderboard's missing results
#2994 opened 2 months ago by PaulJoeMaliakel
2
Proposal: Add free-form data fields to Instance
#3057 opened 2 months ago by yifanmai
2
Add metadata field to HELM predictions
#2886 opened 2 months ago by farzaank
3
Update tutorial for helm-server
#2805 opened 2 months ago by farzaank
1
Add metric tooltip to Predictions
#2987 opened 2 months ago by farzaank
1
Add wrapper with consistent padding for MiniLeaderboards
#2933 opened 2 months ago by farzaank
1
VHELM images do not scale when window is resized
#2885 opened 2 months ago by yifanmai
1
RAFT evaluation
#2909 opened 2 months ago by divyashan
2
Breaking change in nltk 3.8.2
#2926 opened 2 months ago by yifanmai
1
Support few-shot chain-of-thought in GPQA / MMLU
#3088 opened 2 months ago by yifanmai
0
Add GPQA scenario
#3017 opened 2 months ago by yifanmai
0
flake8 in Python 3.12 produces a large number of errors
#3072 opened 2 months ago by yifanmai
0
HatefulMemesScenario get_instances returning error
#3056 opened 2 months ago by dxwu2
3
Add Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002
#3021 opened 2 months ago by yifanmai
1
Not able to install install-heim-extras.sh for heim leaderboard
#3060 opened 2 months ago by snehith3195
1
How to use this package when I have the prompts and images across models
#3062 opened 2 months ago by snehith3195
1
Proposal: Multi-turn support
#3059 opened 2 months ago by yifanmai
0
Proposal: User-provided custom adapters
#3058 opened 2 months ago by yifanmai
0
Add Llama 3 Instruct Lite / Turbo family
#3022 opened 3 months ago by yifanmai
1
Run.tsx fetches all run specs but we only need one
#2954 opened 3 months ago by farzaank
0
o1 series models cannot take system prompt
#3019 opened 3 months ago by bryanzhou008
2
Win rate bug when scores are tied
#2991 opened 3 months ago by yifanmai
1
Make ModelAsJudgeAnnotator fail when annotation fails
#2936 opened 3 months ago by farzaank
1
Error when running OpenAI o1 series models
#2995 opened 3 months ago by bryanzhou008
7
Add Amazon Bedrock 3p models to the model configuration.
#2967 opened 4 months ago by subhaviv
2
Question - Latest Version of results
#2965 opened 4 months ago by subhaviv
2
Perturbations are incorrectly displayed in frontend
#2940 opened 4 months ago by yifanmai
0
run_specs returns null when using expand
#2948 opened 4 months ago by xvanQ
4
Incorrect scoring due to answer format mismatch in MMLU evaluation
#2939 opened 4 months ago by DerryChan
1
Annotator block in predictions page should be collapsed by default
#2884 opened 4 months ago by yifanmai
1
Add Windows support
#2907 opened 5 months ago by yifanmai
0
Official Llama 3.1 Evals
#2851 opened 5 months ago by YLGH
2
HEIMHumanEvalScenario requires permissions to download data from codalab
#2865 opened 5 months ago by slymane
1
Add sorting/filtering functionality to predictions pages
#2887 opened 5 months ago by farzaank
0
Fix issue with clicking first col in leaderboards
#2806 opened 5 months ago by farzaank
5
Predictions page downloads run_specs.json when search query is modified
#2838 opened 5 months ago by yifanmai
0
Is there a way to select a subset of models to compare?
#2849 opened 5 months ago by YLGH
0