EleutherAI/lm-evaluation-harness

A framework for few-shot evaluation of language models.

Python · MIT

Pinned issues

[Discussion] Add Major Code Benchmarks

#1157 opened 6 months ago by haileyschoelkopf

Open · 2

Issues

Is there something wrong with 'google/gemma-1.1-2b-it' ?
#1854 opened 21 days ago by rangehow
3
IndexError: list index out of range when running benchmark on gguf model
#1768 opened a month ago by fherrmannsdoerfer
2
Task description newline characters removed by Jinja templating, affecting generated requests and performance
#1817 opened a month ago by ma0li
1
llama3 baseline reproduction problem
#1799 opened 19 days ago by fmm170
5
I get this error whenever I try to run an eval: ImportError: cannot import name 'HfApi' from 'huggingface_hub'
#1826 opened a month ago by menhguin
8
Getting error on lm-evaluation for merged models deployed on HF
#1823 opened 19 days ago by tolgakurtuluss
2
Error when evaluating math.
#1819 opened 19 days ago by SefaZeng
2
Evaluate encoder-decoder-models
#1840 opened 19 days ago by Bachstelze
1
How to evaluate a large model like llama-65B?
#1804 opened a month ago by fayuge
1
`--tasks list` does not work
#1850 opened 21 days ago by Sunt-ing
4
SyntaxError when importing lm_eval
#1820 opened 22 days ago by mxjmtxrm
3
How to use Zeno
#1842 opened 25 days ago by DavidAdamczyk
0
how to select the --model parameter for the meta format checkpoints
#1845 opened 23 days ago by MaxwelsDonc
3
Errors when loading exact_match.py
#1830 opened a month ago by twxin
2
--device cuda:3 not honored when using --model vllm
#1846 opened 24 days ago by LGLG42
1
Multi Label Classification
#1814 opened a month ago by IsraelAbebe
0
MPS backend out of memory evaluating fine-tuned Mixtral-8x7B-Instruct-v0.1 on a machine with 100+ GB
#1835 opened 24 days ago by chimezie
2
Bug: wrong `until` default value for chat based model
#1837 opened 25 days ago by YilunZhou
2
Inconsistent evaluation results with Chat Template
#1841 opened 25 days ago by shiweijiezero
0
Evaluation results of llama2 with lm-evaluation-harness using wikitext-2
#1833 opened a month ago by l2002924700
1
AssertionError: aggregation named 'mean' conflicts with existing registered aggregation!
#1839 opened 25 days ago by hunter2009pf
0
Using Language Models as Evaluators
#1831 opened a month ago by lintangsutawika
3
how to run all the bigbench tasks at once?
#1809 opened a month ago by kbmlcoding
0
sha256 for datasets or samples
#1836 opened 25 days ago by artemorloff
0
Multi-round evaluation for chat models
#1816 opened a month ago by YilunZhou
1
The input format for XNLI seems weird?
#1822 opened a month ago by SefaZeng
2
Exclude all current tasks
#1801 opened a month ago by YilunZhou
2
Avoid slow testing due to network issues.
#1824 opened a month ago by pixeli99
2
eval gsm8k from local dataset folder with the bug info "ValueError: BuilderConfig 'main' not found."
#1829 opened a month ago by Jp-17
0
Add More Tests
#1827 opened a month ago by haileyschoelkopf
0
Support Mamba based models for evaluation tasks
#1812 opened a month ago by NamburiSrinath
1
when MMLU eval, num_few_shot=5, more GPU overhead
#1818 opened a month ago by chunniunai220ml
1
TypeError: 'NoneType' object is not iterable when using cache and loglikelihood_rolling
#1821 opened a month ago by mdocekal
0
how to evaluate on boolq? incorrect results
#1813 opened a month ago by sidhantls
4
Out-Of-Memory Error for same batch size but different dataset
#1811 opened a month ago by richardzhuang0412
7
Hugging Face: Open LLM Leaderboard: how do I reproduce results for details_gpt2 repository
#1802 opened a month ago by CoconutJJ
1
Gemini 1.5/Ultra support
#1808 opened a month ago by notrichardren
0
Is caching large evaluation dataset like MMLU supported?
#1805 opened a month ago by richardzhuang0412
1
Add NPU support for huggingface.py
#1797 opened a month ago by jiaqiw09
2
Math or minerva_math not generating any samples via scripts.write_out
#1795 opened a month ago by xksteven
1
Same results - different models
#1771 opened a month ago by aleksoren
4
Sorting task output alphabetically
#1774 opened a month ago by ad8e
2
error in eval-tracker : 'Namespace' object has no attribute 'push_results_to_hub'
#1778 opened a month ago by abgoswam
1
Support loading slices of a split from a dataset
#1788 opened a month ago by alexrs
0
Data preprocess is slow for mmlu
#1781 opened a month ago by ThisisBillhe
1
Error when limit is not specified (possibly issue with requirements?)
#1782 opened a month ago by hammoudhasan
2
openai.InternalServerError: the model generated invalid Unicode output
#1783 opened a month ago by djstrong
0
Support OpenAI's Batch API
#1770 opened a month ago by djstrong
1
How to filter to see only generate_until: lm-eval --tasks list
#1772 opened a month ago by chigkim
0
Cannot have both a group list and task list
#1767 opened a month ago by steven-basart
3