A petty issue with MetricsEnsemble's global score that surfaces a general problem
As seen in the unit test at `unitxt/tests/library/test_metrics.py`, line 1712 (commit `a0c39a5`):

In the global score, as expected, `"score_name": "ensemble_score"`, `"score": 0.44`, and `"ensemble_score": 0.44`. However, the reported CIs, `"score_ci_high": 0.56` and `"score_ci_low": 0.0`, are left over from `ensemble_1_recall_macro` (the last of `self.metrics` for which a CI is computed). For `MetricsEnsemble` itself, no CI is computed under the settings of the unit test, as is evident from the absence of `ensemble_score_ci_..` from the global score.

The association of these values (`ci_high = 0.56` and `ci_low = 0.0`) with `"score"` is misleading, since it is not consistent with `"score_name"`.
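Concretely, the reported global score mixes keys from two different metrics. A minimal sketch of the problem (key/value pairs taken from the test output described above; the full dict is abridged, and `expected_global` is only one reasonable expectation):

```python
# Abridged global score as reported in the unit test: the score_ci_* values
# belong to ensemble_1_recall_macro, not to ensemble_score.
actual_global = {
    "score_name": "ensemble_score",
    "score": 0.44,
    "ensemble_score": 0.44,
    "score_ci_low": 0.0,    # leftover from ensemble_1_recall_macro
    "score_ci_high": 0.56,  # leftover from ensemble_1_recall_macro
}

# Since no CI is computed for MetricsEnsemble in this test (there is no
# ensemble_score_ci_* key), a consistent global score would simply omit
# the score_ci_* keys:
expected_global = {
    "score_name": "ensemble_score",
    "score": 0.44,
    "ensemble_score": 0.44,
}
```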
This inconsistency (of `score_ci_high`/`score_ci_low` with `score_name`) can also be reproduced with a task that has several metrics (each with its own `main_score`). The last metric to be applied (which is the first one in `task.metrics`) gives its name to `score_name`, but `score_ci_low` and `score_ci_high` are simply the last ones computed in the task:
```python
from unitxt.artifact import fetch_artifact
from unitxt.operators import ApplyMetric, CastFields, Copy, Perturb
from unitxt.settings_utils import get_settings
from unitxt.standard import StandardRecipe
from unitxt.text_utils import print_dict

settings = get_settings()
settings.allow_unverified_code = True

card, _ = fetch_artifact("cards.coedit_error_detection")
print(card.task.metrics)

recipe = StandardRecipe(card=card, loader_limit=50, demos_pool_size=0, num_demos=0, template_card_index=0)
ms = recipe()

# Copy the gold target into the prediction field, cast types, and perturb
# 60% of the predictions so the metrics have something nontrivial to score.
copyoperator = Copy(field="target", to_field="prediction")
ms = copyoperator(ms)
castfieldoperator = CastFields(fields={"prediction": "float"}, failure_defaults={})
ms = castfieldoperator(ms)
castfieldoperator = CastFields(fields={"references": "int"}, failure_defaults={}, process_every_value=True)
ms = castfieldoperator(ms)
perturboperator = Perturb(field="prediction", select_from=[0, 1], percentage_to_perturb=60)
ms = perturboperator(ms)

# Apply all of the task's metrics, with confidence intervals enabled.
applymetricoperator = ApplyMetric(metric_field="metrics", calc_confidence_intervals=True)
ms = applymetricoperator(ms)
trains = list(ms["train"])
print(len(trains))
print_dict(trains[0]["score"]["global"])

print("**** no CI in last metric applied, the metric that is marked by 'score_name'*******")

# Repeat the run, but disable resampling (hence no CI) for the metric that
# ends up providing 'score_name' (the first one in task.metrics).
card.task.metrics[0] = "metrics.accuracy[n_resamples=0]"
print(card.task.metrics)
recipe = StandardRecipe(card=card, loader_limit=50, demos_pool_size=0, num_demos=0, template_card_index=0)
ms = recipe()
copyoperator = Copy(field="target", to_field="prediction")
ms = copyoperator(ms)
castfieldoperator = CastFields(fields={"prediction": "float"}, failure_defaults={})
ms = castfieldoperator(ms)
castfieldoperator = CastFields(fields={"references": "int"}, failure_defaults={}, process_every_value=True)
ms = castfieldoperator(ms)
perturboperator = Perturb(field="prediction", select_from=[0, 1], percentage_to_perturb=60)
ms = perturboperator(ms)
applymetricoperator = ApplyMetric(metric_field="metrics", calc_confidence_intervals=True)
ms = applymetricoperator(ms)
trains = list(ms["train"])
print_dict(trains[0]["score"]["global"])
```
This prints out:

```
Loading filtered by: lambda x: x['task'] == 'gec';
Loading limited to 50 instances by setting LoadHF.loader_limit;
90
f1_binary (float64):
0.7619047619047619
f1_binary_neg (float64):
0.7916666666666666
recall_binary (float64):
0.7111111111111111
recall_binary_neg (float64):
0.8444444444444444
precision_binary (float64):
0.8205128205128205
precision_binary_neg (float64):
0.7450980392156863
score (float64): <-----------from here
0.15555555555555556
score_name (str):
accuracy
score_ci_low (float64):
0.08888888888888889
score_ci_high (float64):
0.25555555555555554 <------ down to here, consistent with accuracy
recall_binary_ci_low (float64):
0.5463282189392394
recall_binary_ci_high (float64):
0.8462029114420383
precision_binary_ci_low (float64):
0.711368184735433
precision_binary_ci_high (float64):
0.9028650762467041
f1_binary_ci_low (float64):
0.6528563264107466
f1_binary_ci_high (float64):
0.8550395441291192
accuracy (float64):
0.15555555555555556
accuracy_ci_low (float64):
0.08888888888888889
accuracy_ci_high (float64):
0.25555555555555554
**** no CI in last metric applied, the metric that is marked by 'score_name'*******
['metrics.accuracy[n_resamples=0]', 'metrics.f1_binary', 'metrics.precision_binary', 'metrics.recall_binary']
Loader line limit was set to 50
Loading filtered by: lambda x: x['task'] == 'gec';
Loading limited to 50 instances by setting LoadHF.loader_limit;
f1_binary (float64):
0.7619047619047619
f1_binary_neg (float64):
0.7916666666666666
recall_binary (float64):
0.7111111111111111
recall_binary_neg (float64):
0.8444444444444444
precision_binary (float64):
0.8205128205128205
precision_binary_neg (float64):
0.7450980392156863
score (float64): <------ from here
0.15555555555555556
score_name (str):
accuracy <-------- only to here, consistent with accuracy
score_ci_low (float64): <-------but from here
0.6528563264107466
score_ci_high (float64):
0.8550395441291192 <------ down to here, these are left-overs from f1_binary_ci_..
recall_binary_ci_low (float64):
0.5463282189392394
recall_binary_ci_high (float64):
0.8462029114420383
precision_binary_ci_low (float64):
0.711368184735433
precision_binary_ci_high (float64):
0.9028650762467041
f1_binary_ci_low (float64):
0.6528563264107466
f1_binary_ci_high (float64):
0.8550395441291192
accuracy (float64):
0.15555555555555556
```
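One way to keep the published CIs consistent with `score_name` would be to re-derive (or drop) the `score_ci_*` entries from the CI keys of the metric named by `score_name`, after all metrics have been applied. A minimal post-processing sketch (the helper `sync_score_ci` is hypothetical, not part of unitxt):

```python
def sync_score_ci(global_score: dict) -> dict:
    """Keep score_ci_low/score_ci_high consistent with score_name.

    Hypothetical post-processing sketch: copies the CI of the metric named
    by 'score_name' into the 'score_ci_*' keys, or removes them when that
    metric has no CI (e.g. it was run with n_resamples=0).
    """
    name = global_score.get("score_name")
    for bound in ("ci_low", "ci_high"):
        source_key = f"{name}_{bound}"
        target_key = f"score_{bound}"
        if source_key in global_score:
            global_score[target_key] = global_score[source_key]
        else:
            # No CI was computed for the main metric: drop any leftover.
            global_score.pop(target_key, None)
    return global_score
```

Applied to the second run above, this would drop `score_ci_low`/`score_ci_high` entirely (since `accuracy` was run with `n_resamples=0`), instead of leaving the `f1_binary` CIs attached to `score`.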
Resolved by #1065