stanford-crfm/helm
Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
Python · Apache-2.0
Issues
Add new model (upstage/solar-pro-preview-instruct)
#3040 opened by remy-rec - 0
Issue with Model Execution on the HumanEval Benchmark
#3201 opened by lucas-s-p - 3
Add new model NECTEC (nectec/OpenThaiLLM-Prebuilt-7B, nectec/Pathumma-llm-text-1.0.0)
#3188 opened by JackJessada - 1
Reproducing WMT14
#3175 opened by MaxHahnbueck - 5
Is there any way to load GGUF models?
#3141 opened by InAnYan - 1
VHELM setup
#3183 opened by smsarov - 5
Installation Guide: Required Dependencies for Supported Operating Systems and Distributions
#3160 opened by shakatoday - 1
pip needs to be temporarily downgraded to 24.1.2
#2855 opened by yifanmai - 3
Issue with running HEIM
#3080 opened by sudhir-mcw - 0
Add MMLU-Pro
#3018 opened by yifanmai - 2
Optimum Intel OpenVino fails with segmentation fault
#3066 opened by yifanmai - 2
Proposal: Add free-form data fields to Instance
#3057 opened by yifanmai - 3
Add metadata field to HELM predictions
#2886 opened by farzaank - 1
Update tutorial for helm-server
#2805 opened by farzaank - 1
Add metric tooltip to Predictions
#2987 opened by farzaank - 1
Add wrapper with consistent padding for MiniLeaderboards
#2933 opened by farzaank - 1
VHELM images do not scale when window is resized
#2885 opened by yifanmai - 2
RAFT evaluation
#2909 opened by divyashan - 1
Breaking change in nltk 3.8.2
#2926 opened by yifanmai - 0
Support few-shot chain-of-thought in GPQA / MMLU
#3088 opened by yifanmai - 0
Add GPQA scenario
#3017 opened by yifanmai - 0
flake8 in Python 3.12 produces a large number of errors
#3072 opened by yifanmai - 3
HatefulMemesScenario get_instances returning error
#3056 opened by dxwu2 - 1
Add Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002
#3021 opened by yifanmai - 1
Proposal: Multi-turn support
#3059 opened by yifanmai - 0
Proposal: User-provided custom adapters
#3058 opened by yifanmai - 1
Add Llama 3 Instruct Lite / Turbo family
#3022 opened by yifanmai - 0
Run.tsx fetches all run specs but we only need one
#2954 opened by farzaank - 2
o1 series models cannot take system prompt
#3019 opened by bryanzhou008 - 1
Win rate bug when scores are tied
#2991 opened by yifanmai - 1
Make ModelAsJudgeAnnotator fail when annotation fails
#2936 opened by farzaank - 7
Error when running OpenAI o1 series models
#2995 opened by bryanzhou008 - 2
Add amazon bedrock 3p models to the model configuration.
#2967 opened by subhaviv - 2
Question - Latest Version of results
#2965 opened by subhaviv - 0
Perturbations are incorrectly displayed in frontend
#2940 opened by yifanmai - 4
run_specs returns null when using expand
#2948 opened by xvanQ - 1
Add Windows support
#2907 opened by yifanmai - 2
Official Llama 3.1 Evals
#2851 opened by YLGH - 1
Add sorting/filtering functionality to predictions pages
#2887 opened by farzaank - 5
Fix issue with clicking first col in leaderboards
#2806 opened by farzaank - 0
Is there a way to select a subset of models to compare?
#2849 opened by YLGH