Inconsistent predictions (confidence values) with multiple runs
- ferret version: 0.4.1
- Python version: 3.10.9
- Operating System: Ubuntu 20.04.5 LTS
Description
Describe what you were trying to get done.
I am loading ferret's explainer with my pretrained model (for a classification task on an NLP dataset).
Tell us what happened, what went wrong, and what you expected to happen.
When I load ferret's explainer with my pretrained model, I run into two problems:
(1) Every run of the explainer gives me different confidence values.
(2) [Could be a consequence of the first problem] The explainer's prediction is often inconsistent with the pretrained model's prediction.
What I Did
import torch
from transformers import AutoModelForSequenceClassification, BertweetTokenizer
from ferret import Benchmark
device = torch.device("cuda:2") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3, ignore_mismatched_sizes=True).to(device)
model.load_state_dict(torch.load(model_load_path))
model.eval()
tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base", normalization=True, is_fast=True)
bench = Benchmark(model, tokenizer)
tweet = "#god is utterly powerless without human intervention . . . </s> atheism"
bench.score(tweet)
Output of two separate runs (illustrates problem 1: different confidence values across runs)
{'LABEL_0': 0.3069733679294586,
'LABEL_1': 0.35715219378471375,
'LABEL_2': 0.33587440848350525}
# Prediction: **LABEL_1**
{'LABEL_0': 0.3356691002845764,
'LABEL_1': 0.3353104293346405,
'LABEL_2': 0.3290204405784607}
# Prediction: **LABEL_0**
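To make the gap between the two runs above concrete, here is a minimal sketch that compares the reported scores directly. The helper names `top_label` and `max_abs_diff` are illustrative and not part of ferret; the numbers are copied from the output above.

```python
# Scores copied from the two runs reported above.
run_1 = {
    "LABEL_0": 0.3069733679294586,
    "LABEL_1": 0.35715219378471375,
    "LABEL_2": 0.33587440848350525,
}
run_2 = {
    "LABEL_0": 0.3356691002845764,
    "LABEL_1": 0.3353104293346405,
    "LABEL_2": 0.3290204405784607,
}


def top_label(scores):
    """Return the label with the highest confidence."""
    return max(scores, key=scores.get)


def max_abs_diff(a, b):
    """Return the largest per-label absolute difference between two score dicts."""
    return max(abs(a[label] - b[label]) for label in a)


print(top_label(run_1), top_label(run_2))
print(max_abs_diff(run_1, run_2))
```

For a deterministic model in eval mode, the same input should give a difference of zero, up to floating-point noise.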
# Direct inference with the same model, bypassing ferret
import torch.nn.functional as F

model.eval()
sample = tokenizer.encode_plus(tweet)
sample['labels'] = [0]
with torch.no_grad():
    # Wrap the token lists in an outer list to add the batch dimension the model expects.
    input_ids = torch.tensor([sample['input_ids']]).to(device)
    attention_mask = torch.tensor([sample['attention_mask']]).to(device)
    labels = torch.tensor(sample['labels']).to(device)
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    preds = outputs.logits
    # Softmax over the class dimension turns logits into probabilities.
    rounded_preds = F.softmax(preds, dim=-1)
    _, indices = torch.max(rounded_preds, 1)
# Output (logits): tensor([[-0.0779, -0.0418, 0.1261]], device='cuda:2')
# Prediction: **LABEL_2** (different from explainer's prediction - illustrates problem #2)
More insights into the problem:
(1) The problem is not specific to my pretrained model: whenever ferret loads any pretrained model, it produces completely different explanations and prediction values.
(2) Just reloading ferret every time gives consistent results, so it seems that some randomness is introduced while ferret loads the model.
Hi, thank you for reaching out.
ferret uses model.config.label2id and model.config.id2label to associate each logit's positional index with a label, and these may not match your class labels. We should notify the user of this immediately after the Benchmark class is instantiated.
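A minimal sketch of that index-to-label mapping, using the logits from the report. The class names below are hypothetical placeholders for the reporter's labels, and a plain dictionary stands in for `model.config.id2label`, which falls back to `LABEL_0`, `LABEL_1`, ... when no names are configured.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits reported in the issue for the direct model call.
logits = [-0.0779, -0.0418, 0.1261]

# Hypothetical class names. With a real model, this dict would be assigned to
# model.config.id2label and its inverse to model.config.label2id.
id2label = {0: "against", 1: "favor", 2: "neutral"}
label2id = {name: idx for idx, name in id2label.items()}

probs = softmax(logits)
# The positional index of each probability selects the label name.
scores = {id2label[i]: p for i, p in enumerate(probs)}
predicted = max(scores, key=scores.get)

print(scores)
print(predicted)
```

If the dictionary order does not match the order used during training, the scores are attached to the wrong names even though the underlying logits are unchanged.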
About the randomness of the results: I've been trying ferret with two models and tasks (sentiment analysis and language detection), and I get consistent results (the exact same prediction for every inference). You can see it here: https://colab.research.google.com/drive/14rSS8RZx45vZIrKdds4rSR2hRhHV3-1z?usp=sharing
Is there any model checkpoint that you can share so that I can try it with yours?
Yes, I do have updates.
I think the problem was that I used, let's say, X steps to evaluate my pretrained model (loading the pretrained model, seeding, loading the dataloader, etc.), but with ferret I skipped these steps because evaluating a given sample is a two-liner.
I resolved it by running the same X steps first and then working with the model in ferret. This made ferret's predictions consistent with my original model evaluation.
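The seeding part of those steps might look like the sketch below. The actual steps are not shown in this thread, so the `seed_everything` helper and the seed value are assumptions; NumPy and PyTorch are seeded only if they are installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the common random number generators (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.


# Re-seeding with the same value reproduces the same random draw.
seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```

Calling such a helper before loading the model and before `Benchmark(model, tokenizer)` mirrors the order described above.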