ROC AUC Score
monilouise opened this issue · 2 comments
Hi,
Did you implement any way to measure ROC AUC score for NER? If not, why?
I'm trying to figure out how to add this metric to the code...
Thanks in advance.
Hi @monilouise,
Unfortunately, we did not implement ROC AUC because it is not used by the evaluation dataset we used, but it would be an interesting metric to have.
Regarding how to implement it, I believe the major change is to add a way to gather the tag probability distribution for every token instead of the predicted class index with argmax as we currently do. For that we can use the OutputComposer class to "undo" the windowing that is performed in the preprocessing and combine the predictions of many windows into a single tensor for each input example.
The evaluate function receives an output_composer that combines the predicted class indices into y_pred. One way is to add another OutputComposer that does the same thing for the probabilities:
# Create an OutputComposer similar to the existing validation/evaluation composers
probs_output_composer = OutputComposer(
    eval_examples,
    eval_features,
    output_transform_fn=None)  # <--- We do not want to modify the outputs

# Add new arguments and pass them to the evaluate function
def evaluate(..., probs_output_composer, roc_auc_computer):
    (...)
    outs = model(...)
    (...)
    logits = outs['logits']  # this only works for models without a CRF layer
    probs = F.softmax(logits, dim=-1)  # (batch_size, max_length, num_classes)
    probs_output_composer.insert_batch(example_ixs, doc_span_ixs, probs)
    # Now we can get a list of probability tensors by calling the `get_outputs()` method:
    # a list of N tensors, each of shape (example_length, num_classes)
    all_probs = probs_output_composer.get_outputs()
    # Compute the ROC AUC score and add it to the metrics output dict
    metrics['roc_auc'] = roc_auc_computer(y_true, all_probs)  # key name is illustrative
    return metrics
Another problem is that inside evaluate the labels are tag strings instead of class indices. Assuming you need class indices for the labels, you would have to use NERTagsEncoder to convert them to indices (that is why I suggested adding the roc_auc_computer argument to evaluate as well). The other metrics use tags directly, so evaluate relies on OutputComposer.output_transform_fn to convert y_pred into tags.
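As a rough illustration of the conversion step, here is a minimal sketch of what a roc_auc_computer might do before scoring. It assumes y_true is a list of per-example tag-string lists and all_probs is a matching list of per-token probability rows. The tag_to_index dict and the flatten_for_roc_auc helper are hypothetical names used only for this example; the dict stands in for whatever mapping NERTagsEncoder provides, since its actual API is not shown in this thread.

```python
from typing import Dict, List, Sequence, Tuple


def flatten_for_roc_auc(
    y_true: Sequence[Sequence[str]],
    all_probs: Sequence[Sequence[Sequence[float]]],
    tag_to_index: Dict[str, int],
) -> Tuple[List[int], List[List[float]]]:
    """Turn per-example tags and probabilities into flat token-level lists."""
    flat_labels: List[int] = []
    flat_probs: List[List[float]] = []
    for tags, probs in zip(y_true, all_probs):
        # Each example must have exactly one probability row per token tag.
        if len(tags) != len(probs):
            raise ValueError("each example needs one probability row per tag")
        for tag, row in zip(tags, probs):
            flat_labels.append(tag_to_index[tag])  # tag string -> class index
            flat_probs.append(list(row))
    return flat_labels, flat_probs


# Hypothetical stand-in for the mapping that NERTagsEncoder would provide.
tag_to_index = {"O": 0, "B-PER": 1, "I-PER": 2}
y_true = [["O", "B-PER"], ["B-PER", "I-PER", "O"]]
all_probs = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],
    [[0.1, 0.6, 0.3], [0.1, 0.2, 0.7], [0.9, 0.05, 0.05]],
]
flat_labels, flat_probs = flatten_for_roc_auc(y_true, all_probs, tag_to_index)
```

If all_probs holds tensors rather than nested lists, each one would need a .tolist() call first. The flat labels and probability rows are the shape that most multiclass ROC AUC implementations expect.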
Could you please share how you plan to compute the ROC AUC score? I haven't used ROC AUC for multiclass problems myself, so I'm curious how it's done.
I plan to compute ROC AUC for each class using a one-vs-rest strategy. There's an implementation available for multiclass problems at https://huggingface.co/spaces/evaluate-metric/roc_auc. One-vs-one is another possibility.
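To make the one-vs-rest idea concrete, here is a dependency-free sketch, assuming token-level class indices and one probability row per token. The function names are made up for this example and it is not the implementation behind the linked metric. For each class, tokens of that class are the positives, all other tokens are the negatives, and the AUC is the fraction of positive/negative pairs in which the positive token gets the higher score (ties count as half).

```python
from typing import Dict, List, Sequence


def binary_auc(is_positive: Sequence[bool], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:  # O(P * N) pairwise comparison, fine for a small sketch
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(
    labels: Sequence[int], probs: Sequence[Sequence[float]]
) -> Dict[int, float]:
    """Per-class AUC, skipping classes for which it is undefined."""
    per_class: Dict[int, float] = {}
    for c in range(len(probs[0])):
        is_pos = [y == c for y in labels]
        if all(is_pos) or not any(is_pos):
            continue  # class absent (or the only class): AUC is undefined
        per_class[c] = binary_auc(is_pos, [row[c] for row in probs])
    return per_class


labels = [0, 1, 1, 2, 0, 2]
probs = [
    [0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7], [0.9, 0.05, 0.05], [0.3, 0.5, 0.2],
]
per_class = one_vs_rest_auc(labels, probs)
macro_auc = sum(per_class.values()) / len(per_class)
```

Skipping undefined classes matters for NER, where a rare tag may not appear in the evaluation set at all. For real data, a library routine such as scikit-learn's roc_auc_score with multi_class="ovr" would be the more practical choice; the pairwise loop here is only meant to show what is being computed.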