Low loss during fine-tuning, but generated answers are not correct
Hi, I am fine-tuning the Hugging Face UnifiedQA v2 T5-large model on a QA dataset, and my sample code is like below:
# training
model_inputs = self.tokenizer(
    questions,
    padding=True, truncation=True,
    max_length=self.tokenizer.model_max_length,
    return_tensors="pt",
).to(device)
with self.tokenizer.as_target_tokenizer():
    labels = self.tokenizer(
        answers,
        padding=True, truncation=True,
        max_length=self.tokenizer.model_max_length,
        return_tensors="pt",
    ).to(device)
# replace pad tokens in the labels with -100 so they are ignored by the loss
labels["input_ids"][labels["input_ids"] == self.tokenizer.pad_token_id] = -100
model_inputs["labels"] = labels["input_ids"]
outputs = self.model(**model_inputs)
loss = outputs.loss
# generate
model_inputs = self.tokenizer(
    questions,
    padding=True, truncation=True,
    max_length=self.tokenizer.model_max_length,
    return_tensors="pt",
).to(device)
sampled_outputs = self.model.generate(
    **model_inputs,
    num_beams=4, max_length=50, early_stopping=True,
)
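For reference, the accuracy numbers below are computed as (roughly) lowercase exact match on the decoded outputs, something like this simplified sketch (where answers is assumed to be the list of gold answer strings):
# decode the beam-search outputs back to strings (drops <pad> and </s>)
predictions = self.tokenizer.batch_decode(sampled_outputs, skip_special_tokens=True)
# lowercase exact match against the gold answer strings
n_correct = sum(p.strip().lower() == a.strip().lower() for p, a in zip(predictions, answers))
accuracy = n_correct / len(answers)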
I can get a fairly low loss (0.41) after fine-tuning for around 5 epochs, yet the generated answers are mostly wrong (0.23 accuracy). According to the T5 docs, it seems that generate handles prepending the pad token (as the decoder start token) on its own. Also, the generated answers do belong to one of the choices; they are just not the correct ones.
I am wondering what might be the issue. Thanks!
I am not sure; this is certainly not a common issue. Do you observe similar issues when you use older models (e.g., https://huggingface.co/allenai/unifiedqa-t5-large) or the vanilla T5?
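Swapping checkpoints should only require changing the name passed to from_pretrained; a minimal sketch (compare against whichever v2 checkpoint you are currently using):
from transformers import T5Tokenizer, T5ForConditionalGeneration

# try the older v1 checkpoint (or plain "t5-large" for vanilla T5) in place of the v2 model
model_name = "allenai/unifiedqa-t5-large"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)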
@danyaljj Thanks for the reply. I am currently training an older UnifiedQA model and will update the results when it's ready. Also, I found that even generating answers on the training set gives pretty poor results (0.4 accuracy). My questions look like this (similar to the RACE example in the demo):
What is ... \\n (A) answer A (B) answer B ... \\n context
while the answer is "answer A", and everything is mapped to lowercase. Although the BART example has a flag that prepends the BOS token to both question and answer, I chose not to prepend it since T5 does not have a BOS token (a simplified sketch of my preprocessing is below). Do you think I have made a mistake here? Thanks again!
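Concretely, each input is built roughly like this (simplified sketch of my preprocessing):
def format_example(question, choices, context):
    # "question \n (A) choice (B) choice ... \n context", then everything lowercased
    letters = "ABCDEFGH"
    options = " ".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return f"{question} \\n {options} \\n {context}".lower()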
Edit: after digging into examples of how to fine-tune a T5 (for example here), it seems that to fine-tune a vanilla T5 we need to append </s> to both the input and the label. I am wondering whether that is still required for UnifiedQA?
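(A quick way to check whether the tokenizer already appends it for me, as a sketch:)
# check whether the tokenizer already appends the EOS token (</s>) to encoded inputs
ids = self.tokenizer("some question").input_ids
print(ids[-1] == self.tokenizer.eos_token_id)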
> I am wondering whether that is still required for UnifiedQA?
I am not sure -- our models were originally trained with TensorFlow, so I am not aware of any HF-specific details. There might also be issues/bugs in HF, so you may want to try different versions.
One thing that I should add is that the v2 models are pretty new and might have issues that I am unaware of. So I would strongly recommend starting your experiments with the older models.
You can also compare the predictions of the "large" model here: https://unifiedqa.apps.allenai.org/
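For a single example, something along these lines lets you compare the local output against the demo (a self-contained sketch; the input string is just an illustration):
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("allenai/unifiedqa-t5-large")
model = T5ForConditionalGeneration.from_pretrained("allenai/unifiedqa-t5-large")

# run one formatted question and compare the decoded answer with the demo's prediction
input_string = "which is best conductor? \\n (a) iron (b) feather"
input_ids = tokenizer(input_string, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, num_beams=4, max_length=50, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))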
Thanks @danyaljj! After a week of attempts I think I have somehow solved this problem. In my case, fine-tuning for more epochs works: previously I fine-tuned for either 5 or 10 epochs and got 0.23 accuracy, but when fine-tuning for 50 epochs I can get 0.72 accuracy.
I wonder, in your paper did you also fine-tune for a large number of epochs? Thanks!!
We did not track "epochs". We trained the models for several hundred "steps", but our data was extremely large (on the order of millions of examples).
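As a rough sense of scale (purely illustrative numbers, not our exact configuration):
# back-of-the-envelope steps-to-epochs conversion with assumed values
steps = 500               # "several hundred" optimizer steps
batch_size = 64           # assumed batch size
dataset_size = 2_000_000  # "on the order of millions" of examples
epochs = steps * batch_size / dataset_size
print(epochs)  # 0.016 -- a small fraction of one pass over the data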