g8a9/ferret

Conflicting attribution scores when comparing with transformers_interpret

racia opened this issue · 5 comments

racia commented
  • ferret version: 0.4.0
  • Python version: 3.9.2
  • Operating System: Linux Debian

Description

When comparing ferret's feature attribution scores for Integrated Gradients (plain) with those produced by the transformers_interpret library (MultiLabelClassificationExplainer), I get significantly different results. For example, a token may receive a high score of 0.5 with transformers_interpret but a negative attribution with ferret.
Why could that be?
Of course, I tested this under the same conditions for both transformers_interpret and ferret (e.g. the same pretrained local multi-label BertForSequenceClassification, the same bert-base-german-cased tokenizer, and the same sample).

What I Did

  • transformers_interpret:
from transformers_interpret import MultiLabelClassificationExplainer

cls_explainer = MultiLabelClassificationExplainer(model, tokenizer, custom_labels=labels)
word_attrib = cls_explainer(<SAMPLE>)
pred = cls_explainer.predicted_class_name
print(word_attrib[pred])
  • ferret:
from ferret import Benchmark

bench = Benchmark(model, tokenizer)
score = bench.score(<SAMPLE>)
metr = bench.explain(sent, target=target)[4] ### IG (plain) ###
print(metr.scores)
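Side note: the hard-coded [4] is meant to pick plain Integrated Gradients out of the list of Explanation objects that explain returns. The order can be confirmed by printing each explainer's name (only a sketch; it assumes each returned Explanation exposes an explainer attribute):

explanations = bench.explain(sent, target=target)
for i, expl in enumerate(explanations):
    print(i, expl.explainer)  # check which index corresponds to plain Integrated Gradients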
g8a9 commented

Hi, thank you for reaching out. Can you give me the version of transformers_interpret you are using and the sentence you are explaining, so that I can set up a quick Colab?

EDIT: also, I notice you are specifying 0.8.0 as the ferret version, but that cannot be correct. Are you using the latest, v0.4.1?

racia commented

Hey, thank you for your fast answer! Yes, I'm sorry, the ferret version should have read 0.4.0, and transformers_interpret is at 0.9.6. I talked to my co-worker about this, and it seems to be the case for any sample, e.g. "Röntgenologisch wird kein V.a. eine Krankheit gestellt.", where "eine" receives a score of -0.04 with ferret's plain IG and 0.48 with transformers_interpret.

g8a9 commented

Perfect! Can you also tell me which model checkpoint you are using? (If it's a private model, I can try a similar public one if you point me to it.)
In the meantime, you could try setting normalization to False in the explain method. By default, we apply L1 normalization across all tokens to make attribution scores comparable across different explainers.

metr = bench.explain(sent, target=target, normalize_scores=False)[4] ### IG (plain) ###
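The L1 normalization essentially rescales the raw attributions by the sum of their absolute values, so you can move between the two views yourself along these lines (a rough sketch, not the exact ferret internals):

import numpy as np

raw = np.array(metr.scores)          # unnormalized attributions from the call above
l1_scores = raw / np.abs(raw).sum()  # roughly what the default L1 normalization produces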
racia commented

Hey,
please excuse the long wait, and thanks very much for your tip regarding the normalization; unfortunately, it doesn't solve the deviation in scores.
I was now able to reproduce the case with the publicly available nlptown/bert-base-multilingual-uncased-sentiment model from the Hugging Face Hub (where the label "5 stars" corresponds to target=4).

transformers_interpret:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import MultiLabelClassificationExplainer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
cls_explainer = MultiLabelClassificationExplainer(model, tokenizer)
word_attrib = cls_explainer("Sehr gutes Essen hier..")
pred = cls_explainer.predicted_class_name
print(pred, word_attrib[pred])
>>> 5 stars [('[CLS]', 0.0), ('sehr', 0.29032204519214533), ('gute', 0.9150404689960208), ('##s', 0.025730148572368404), ('essen', 0.0850625593805736), ('hier', 0.22078816729135264), ('.', -0.08066158021678087), ('.', 0.12354215993789357), ('[SEP]', 0.0)]

ferret:

from ferret import Benchmark

bench = Benchmark(model, tokenizer)
bench.explain("Sehr gutes Essen hier..", target=4)
>>> Explanation(text='Sehr gutes Essen hier..', tokens=['[CLS]', 'sehr', 'gute', '##s', 'essen', 'hier', '.', '.', '[SEP]'], scores=array([ 0.        ,  0.16889089,  0.55689445,  0.04138098,  0.07297469,
         0.07667237, -0.03699176,  0.04619486,  0.        ]), explainer='Integrated Gradient (x Input)', target=4)]

While some attributions deviate only slightly between the two (<0.02, e.g. for '##s'), others differ by more than 0.4 attribution points.
Please also note that, unlike earlier, the ferret explainer here is IG x Input, since I learned that this is the default in transformers_interpret as well.
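As a rough cross-check (only a sketch, and assuming the L1 normalization you mentioned amounts to dividing by the sum of absolute scores), the transformers_interpret attributions above can be rescaled to the same L1 scale before comparing:

import numpy as np

tokens = [t for t, _ in word_attrib[pred]]
ti = np.array([s for _, s in word_attrib[pred]])
ti_l1 = ti / np.abs(ti).sum()            # rescale to unit L1 norm, like ferret's default output
print(list(zip(tokens, ti_l1.round(3))))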

Thanks a lot for looking into this and happy to hear from you!

g8a9 commented

Hey, sorry for the long radio silence.
In your last code example, normalization is still happening, so that might be one issue. Another factor I see is the baseline used for computing Integrated Gradients. We are currently using a [CLS] + token count * [PAD] + [SEP] scheme, which might differ from what transformers_interpret is doing.
(There is no community consensus on how to construct the baseline, so they might be following another strategy.)
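Schematically, the baseline looks like this (a rough sketch of the scheme described above, not the exact ferret code; it assumes standard Hugging Face tokenizer attributes):

# build a [CLS] + n_tokens * [PAD] + [SEP] baseline for a given input
input_ids = tokenizer("Sehr gutes Essen hier..", return_tensors="pt")["input_ids"]
n_inner = input_ids.shape[1] - 2  # number of tokens between [CLS] and [SEP]
baseline_ids = (
    [tokenizer.cls_token_id]
    + [tokenizer.pad_token_id] * n_inner
    + [tokenizer.sep_token_id]
)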