[Question] How to have no preset values sent into `.compute()`
alvations opened this issue · 0 comments
We have a use case, https://huggingface.co/spaces/alvations/llm_harness_mistral_arc/blob/main/llm_harness_mistral_arc.py, where the default feature input types for `evaluate.Metric` are empty, and we end up with something like this in our `llm_harness_mistral_arc/llm_harness_mistral_arc.py`:
```python
import evaluate
import datasets
import lm_eval


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class llm_harness_mistral_arc(evaluate.Metric):
    def _info(self):
        # TODO: Specifies the evaluate.EvaluationModuleInfo object
        return evaluate.MetricInfo(
            # This is the description that will appear on the modules page.
            module_type="metric",
            description="",
            citation="",
            inputs_description="",
            # This defines the format of each prediction and reference
            features={},
        )

    def _compute(self, pretrained=None, tasks=[]):
        outputs = lm_eval.simple_evaluate(
            model="hf",
            model_args={"pretrained": pretrained},
            tasks=tasks,
            num_fewshot=0,
        )
        results = {}
        for task in outputs['results']:
            results[task] = {'acc': outputs['results'][task]['acc,none'],
                             'acc_norm': outputs['results'][task]['acc_norm,none']}
        return results
```
And the expected user behavior is something like, [in]:
```python
import evaluate

module = evaluate.load("alvations/llm_harness_mistral_arc")
module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"])
```
And the expected output, as per our tests.py (https://huggingface.co/spaces/alvations/llm_harness_mistral_arc/blob/main/tests.py), [out]:

```python
{'arc_easy': {'acc': 0.8131313131313131, 'acc_norm': 0.7680976430976431}}
```
But `evaluate.Metric.compute()` somehow expects a default batch, and `module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"])` throws an error:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-bd94e5882ca5> in <cell line: 1>()
----> 1 module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2",
      2                tasks=["arc_easy"])

2 frames
/usr/local/lib/python3.10/dist-packages/evaluate/module.py in _get_all_cache_files(self)
    309         if self.num_process == 1:
    310             if self.cache_file_name is None:
--> 311                 raise ValueError(
    312                     "Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` "
    313                     "at least once before calling `compute`."

ValueError: Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` at least once before calling `compute`.
```
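For comparison, a stock metric populates that batch through inputs matching its declared features (or through explicit `add`/`add_batch` calls before `compute()`), e.g. with the built-in `accuracy` metric:

```python
import evaluate

accuracy = evaluate.load("accuracy")

# Inputs matching the declared features (predictions/references) are
# batched internally, then scored.
print(accuracy.compute(predictions=[0, 1, 1], references=[0, 1, 0]))

# Equivalently, add a batch first, then call compute() with no arguments.
accuracy.add_batch(predictions=[0, 1, 1], references=[0, 1, 0])
print(accuracy.compute())
```

In our module, though, there are no such features to add.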
Q: Is it possible for `.compute()` to expect no features?

I've also tried the following, but somehow `evaluate.Metric.compute` still looks for some sort of `predictions` variable:
```python
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class llm_harness_mistral_arc(evaluate.Metric):
    def _info(self):
        # TODO: Specifies the evaluate.EvaluationModuleInfo object
        return evaluate.MetricInfo(
            # This is the description that will appear on the modules page.
            module_type="metric",
            description="",
            citation="",
            inputs_description="",
            # This defines the format of each prediction and reference
            features=[
                datasets.Features(
                    {
                        "pretrained": datasets.Value("string", id="sequence"),
                        "tasks": datasets.Sequence(datasets.Value("string", id="sequence"), id="tasks"),
                    }
                )
            ],
        )

    def _compute(self, pretrained, tasks):
        outputs = lm_eval.simple_evaluate(
            model="hf",
            model_args={"pretrained": pretrained},
            tasks=tasks,
            num_fewshot=0,
        )
        results = {}
        for task in outputs['results']:
            results[task] = {'acc': outputs['results'][task]['acc,none'],
                             'acc_norm': outputs['results'][task]['acc_norm,none']}
        return results
```
then:
```python
import evaluate

module = evaluate.load("alvations/llm_harness_mistral_arc")
module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"])
```
[out]:
```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-36-bd94e5882ca5> in <cell line: 1>()
----> 1 module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2",
      2                tasks=["arc_easy"])

3 frames
/usr/local/lib/python3.10/dist-packages/evaluate/module.py in _infer_feature_from_example(self, example)
    606             f"Predictions and/or references don't match the expected format.\n"
    607             f"Expected format:\n{feature_strings},\n"
--> 608             f"Input predictions: {summarize_if_long_list(example['predictions'])},\n"
    609             f"Input references: {summarize_if_long_list(example['references'])}"
    610             )

KeyError: 'predictions'
```
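One workaround we are considering is just a sketch, not a confirmed pattern: the dummy `predictions`/`references` features below are our own addition, and it assumes that keyword arguments which don't match a declared feature name get forwarded to `_compute()` (as with other metrics' extra kwargs). The idea is to declare throwaway features so that `compute()` has something to batch, and ignore them in `_compute()`:

```python
import datasets
import evaluate
import lm_eval


class llm_harness_mistral_arc(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description="",
            citation="",
            inputs_description="",
            # Dummy features: present only so compute() has inputs to batch.
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                    "references": datasets.Value("string"),
                }
            ),
        )

    def _compute(self, predictions, references, pretrained=None, tasks=None):
        # predictions/references are placeholders and are ignored here.
        outputs = lm_eval.simple_evaluate(
            model="hf",
            model_args={"pretrained": pretrained},
            tasks=tasks or [],
            num_fewshot=0,
        )
        return {
            task: {
                "acc": outputs["results"][task]["acc,none"],
                "acc_norm": outputs["results"][task]["acc_norm,none"],
            }
            for task in outputs["results"]
        }
```

which would then be called with dummy inputs alongside the real arguments:

```python
module = evaluate.load("alvations/llm_harness_mistral_arc")
module.compute(
    predictions=[""], references=[""],  # dummies to satisfy the feature check
    pretrained="mistralai/Mistral-7B-Instruct-v0.2",
    tasks=["arc_easy"],
)
```

But we'd prefer a way to let `.compute()` expect no features at all.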