g8a9/ferret

Inconsistent predictions (confidence values) across multiple runs


  • ferret version: 0.4.1
  • Python version: 3.10.9
  • Operating System: Ubuntu 20.04.5 LTS

Description

I am loading ferret's explainer with my pretrained model (for a classification task on an NLP dataset), but:

(1) every run of the explainer gives me different confidence values
(2) [possibly a consequence of problem 1] the explainer's prediction is often inconsistent with the pretrained model's prediction

What I Did

import torch
from transformers import AutoModelForSequenceClassification, BertweetTokenizer
from ferret import Benchmark

device = torch.device("cuda:2") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3, ignore_mismatched_sizes=True).to(device)
model.load_state_dict(torch.load(model_load_path))
model.eval()
tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base", normalization=True, is_fast=True)

bench = Benchmark(model, tokenizer)
tweet = "#god is utterly powerless without human intervention . . . </s> atheism"
bench.score(tweet)

Output across two runs (illustrates problem 1: different confidence values for the same input)

{'LABEL_0': 0.3069733679294586,
 'LABEL_1': 0.35715219378471375,
 'LABEL_2': 0.33587440848350525}
# Prediction: **LABEL_1**

{'LABEL_0': 0.3356691002845764,
 'LABEL_1': 0.3353104293346405,
 'LABEL_2': 0.3290204405784607}
# Prediction: **LABEL_0**
Evaluating the same tweet directly with the pretrained model:

import torch.nn.functional as F

model.eval()
sample = tokenizer.encode_plus(tweet)
sample['labels'] = [0]
with torch.no_grad():
    input_ids = torch.tensor(sample['input_ids']).unsqueeze(0).to(device)  # add batch dimension
    attention_mask = torch.tensor(sample['attention_mask']).unsqueeze(0).to(device)
    labels = torch.tensor(sample['labels']).to(device)
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    preds = outputs.logits
    probs = F.softmax(preds, dim=1)
    _, indices = torch.max(probs, 1)

# Output: tensor([[-0.0779, -0.0418,  0.1261]], device='cuda:2')
# Prediction: **LABEL_2** (different from explainer's prediction - illustrates problem #2)

More insights into the problem:

(1) The problem is not specific to my pretrained model: whenever ferret loads any pretrained model, it produces completely different explanations and prediction values across runs.

(2) Reloading ferret every time gives consistent results, so it seems some randomness is introduced at the time ferret loads the model (a diagnostic sketch follows below).
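One way to narrow down where the nondeterminism enters (a sketch, continuing from the snippet above and assuming bench.score returns the dictionary shown earlier) is to compare repeated calls on a single Benchmark instance against calls on a freshly constructed one:

from ferret import Benchmark

bench_a = Benchmark(model, tokenizer)
bench_b = Benchmark(model, tokenizer)

# Repeated calls on the same instance
print(bench_a.score(tweet))
print(bench_a.score(tweet))

# A fresh instance wrapping the same model and tokenizer: if this differs,
# the randomness is introduced at instantiation/loading time
print(bench_b.score(tweet))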

g8a9 commented

Hi, thank you for reaching out.

ferret uses model.config.label2id and model.config.id2label to associate each logit's positional index with a label, which might be inconsistent with your class labels. We should notify the user of this immediately after the Benchmark class is instantiated.
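For example (a sketch; the label names below are placeholders, replace them with the classes used during fine-tuning), the mapping can be set explicitly when loading the model so that the positional indices line up with your own classes:

from transformers import AutoModelForSequenceClassification

# Hypothetical class names for a 3-class task
id2label = {0: "negative", 1: "neutral", 2: "positive"}
label2id = {v: k for k, v in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base",
    num_labels=3,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

With the mapping in place, model.config.id2label carries your own class names instead of the default LABEL_0/LABEL_1/LABEL_2 seen in the output above.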

About the randomness of the results: I've been trying ferret with two models and tasks (sentiment analysis and language detection), and I get consistent results (identical predictions for every inference). You can see it here: https://colab.research.google.com/drive/14rSS8RZx45vZIrKdds4rSR2hRhHV3-1z?usp=sharing

Is there any model checkpoint that you can share so that I can try it with yours?

g8a9 commented

Hi @kgarg8, any news here? :)

kgarg8 commented

Yes, I do have updates.

I think the problem was that I evaluated my pretrained model with, say, X steps (loading the pretrained model, setting the seeds, loading the dataloader, etc.), but with ferret I skipped those steps, since evaluating a given sample is a two-liner.

The way I resolved it was to perform the same X steps first and only then work with ferret. This made ferret's prediction consistent with my original model evaluation; a rough sketch of the setup is below.
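For reference, this is roughly the kind of setup I mean (a sketch; set_seed is an illustrative helper, and the exact steps and seed value come from my own evaluation script):

import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed every RNG the evaluation pipeline touches, mirroring the original
    # evaluation script, before loading the model and instantiating ferret.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
# ... then load the pretrained model, the tokenizer, and Benchmark exactly as
# in the snippet above, and only afterwards call bench.score(tweet)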