bigscience-workshop/lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
Python · MIT License
Issues
- #158 Number of fewshot examples · opened by lpq29743 · 0 comments (sketch below)
- #159 Single reference target · opened by lpq29743 · 0 comments
- #161 AssertionError · opened by lpc-eol · 2 comments
- #148 max_length not set correctly · opened by hatimbr · 1 comment
- #106 OverflowError: math range error · opened by Muennighoff · 1 comment (sketch below)
- #157 Rouge score · opened by Muennighoff · 1 comment (sketch below)
- #156 lm_eval.list_model_apis() not found · opened by robertLiuLinFeng · 1 comment
- #155 Python 3.8 Support · opened by vrunm · 0 comments
- #154 translation evaluation error · opened by laozhanghahaha · 0 comments
- #147 Unknown issue with loading object · opened by pku-yao-cheng · 1 comment
- #146 cache not storing predictions · opened by rbawden · 0 comments
- #142 How is this evaluation done? · opened by a-cavalcanti · 1 comment
- #135 Space prepended for Seq2Seq · opened by Muennighoff · 1 comment (sketch below)
- #50 Implement COMET · opened by StellaAthena · 1 comment (sketch below)
- #51 Create a nice API for getting (response, label) pairs to plug into external libraries · opened by StellaAthena · 0 comments
- #59 BLEURT or BERTScore added to NLG datasets · opened by jordiclive · 1 comment (sketch below)
- #65 "copa+…As a result, C1 or C2?" prompting error · opened by StellaAthena · 9 comments
- #114 Bloom-tested dataset does not exist in this repo · opened by switiz · 1 comment
- #119 different score ranges are confusing · opened by Muennighoff · 2 comments
- #107 Selecting prompts · opened by Muennighoff · 3 comments
- #105 Multilingual prompts · opened by Muennighoff · 1 comment
- #35 Clean up interface for HF models · opened by StellaAthena · 6 comments
- #64 FLORES bugged with T5 · opened by StellaAthena · 0 comments
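
#158 asks how the number of few-shot examples is controlled. A minimal sketch, assuming the upstream EleutherAI-style evaluator API (`lm_eval.evaluator.simple_evaluate` with a `num_fewshot` argument) that this fork derives from; this fork's entry point, model names, and argument names may differ:

```python
# Sketch under assumptions: upstream EleutherAI-style API; this fork may differ.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",                  # a model type registered in the harness
    model_args="pretrained=gpt2",  # forwarded to the model constructor
    tasks=["copa"],                # registered task names
    num_fewshot=5,                 # number of in-context examples per test query
)
print(results["results"])
```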
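#106's OverflowError: math range error is the classic failure mode of exponentiating a large loss when converting a negative log-likelihood to perplexity. A generic guard, not code from this repo:

```python
import math

def safe_perplexity(total_neg_log_likelihood: float) -> float:
    """Convert a negative log-likelihood (natural log) to perplexity,
    returning inf instead of raising OverflowError on extreme inputs."""
    try:
        return math.exp(total_neg_log_likelihood)
    except OverflowError:
        return float("inf")

print(safe_perplexity(10.0))    # ~22026.47
print(safe_perplexity(1000.0))  # inf; math.exp(1000.0) would raise OverflowError
```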
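#157 concerns ROUGE. A minimal sketch using Google's `rouge_score` package, one common backend; the issue does not specify which implementation the harness should adopt:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cat sat on the mat",     # reference text
    prediction="the cat is on the mat",  # model output
)
for name, s in scores.items():
    print(name, round(s.fmeasure, 3))  # each entry has precision/recall/fmeasure
```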
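#135's question about a prepended space matters because subword tokenizers encode "word" and " word" as different tokens, which shifts the log-likelihoods being compared. The effect with a GPT-2 tokenizer (the issue is about seq2seq models, whose tokenizers have their own whitespace conventions, but the mechanism is the same):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# The same word with and without a leading space maps to different token ids,
# so whether the harness prepends a space changes what the model is scored on.
print(tok.encode("Hello"))   # [15496] -> "Hello"
print(tok.encode(" Hello"))  # [18435] -> " Hello"
```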
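#50 asks for COMET. A minimal sketch of Unbabel's `unbabel-comet` package; the checkpoint name is illustrative, and the exact download/predict API varies slightly across releases:

```python
from comet import download_model, load_from_checkpoint

# Checkpoint name is illustrative; different releases ship different checkpoints.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Bonjour le monde.",  # source sentence
    "mt":  "Hello world.",       # machine translation to score
    "ref": "Hello, world.",      # human reference
}]
print(model.predict(data, batch_size=8, gpus=0))  # segment and system scores
```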
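#59 proposes BLEURT or BERTScore. BERTScore's `bert-score` package has a small surface area; BLEURT would look analogous through its own library:

```python
from bert_score import score

candidates = ["the cat is on the mat"]   # model outputs
references = ["the cat sat on the mat"]  # gold references

# Returns per-sentence precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```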