Unable to understand the labels in preprocess_logits_for_metrics
Closed this issue · 3 comments
Somehow, I am seeing -100 appended to the ground-truth labels inside preprocess_logits_for_metrics, which cannot be decoded back to a string by tokenizer.batch_decode(). Just to be sure: train_on_inputs = True, so the following block of code doesn't run -
if not train_on_inputs:
    user_prompt = generate_prompt({**data_point, "output": ""})
    tokenized_user_prompt = tokenize(user_prompt, add_eos_token=False)
    user_prompt_len = len(tokenized_user_prompt["input_ids"])
    # mask the prompt tokens so only the response contributes to the loss
    tokenized_full_prompt["labels"] = [-100] * user_prompt_len + tokenized_full_prompt["labels"][
        user_prompt_len:
    ]  # could be sped up, probably
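For reference, here is a toy sketch of what that masking produces (the token ids are made up for illustration):

```python
# toy illustration: mask the prompt portion of the labels with -100
full_ids = [101, 102, 103, 104, 105]  # prompt tokens + response tokens
user_prompt_len = 3                   # the prompt occupies the first 3 tokens
labels = [-100] * user_prompt_len + full_ids[user_prompt_len:]
# labels == [-100, -100, -100, 104, 105]
```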
I tested this by commenting out that block, and the labels still contain -100. Could anyone explain, please? I can remove them manually, but I don't understand why they appear in the first place.
(screenshot: tokenized labels containing -100 values)
I am using meta-llama/Llama-2-7b-chat-hf as the base model, in case that sheds some light.
I got it... It is automatically added by transformers.DataCollatorForSeq2Seq(), which is ignored by PyTorch loss functions but creates a problem when trying to decode back. Can you shed some light on whether we can convert the logits in preprocess_logits_for_metrics to labels, and then to text, using the following code -
# softmax is monotonic, so argmax over the raw logits gives the same ids,
# but applying it first does no harm
logits = logits.softmax(dim=-1)
predicted_labels = torch.argmax(logits, dim=-1)
print("Predicted:", tokenizer.batch_decode(predicted_labels, skip_special_tokens=False, clean_up_tokenization_spaces=True))
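Note that the argmax ids decode fine; it is the label tensors that still contain -100 and trip up batch_decode. A minimal sketch of mapping them back first, assuming the tokenizer's pad token is an acceptable stand-in (the helper name is mine, not part of any library):

```python
import numpy as np

def restore_labels(labels, pad_token_id):
    """Replace the -100 ignore markers (added by DataCollatorForSeq2Seq)
    with a real token id so tokenizer.batch_decode can handle them."""
    labels = np.asarray(labels)
    return np.where(labels == -100, pad_token_id, labels)

# usage sketch (pad_token_id and label_ids are assumed to exist):
# decoded = tokenizer.batch_decode(
#     restore_labels(label_ids, tokenizer.pad_token_id),
#     skip_special_tokens=True,
# )
```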
-100 is the default ignore_index of PyTorch's CrossEntropyLoss: targets equal to -100 are excluded from the loss computation. Therefore, if you fill in -100, the loss at that position will not be computed.
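To illustrate the effect, here is a plain-NumPy sketch (not the torch implementation) of a cross-entropy that skips the ignore index: positions whose target is -100 contribute nothing to the mean loss.

```python
import numpy as np

def cross_entropy_ignore(logits, targets, ignore_index=-100):
    """Mean cross-entropy over positions whose target != ignore_index,
    mimicking the behavior of torch.nn.CrossEntropyLoss(ignore_index=-100)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    mask = targets != ignore_index
    # numerically stable log-softmax
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # use 0 as a dummy index at masked positions, then drop them
    losses = -log_probs[np.arange(len(targets)), np.where(mask, targets, 0)]
    return losses[mask].mean()
```

Appending a -100 position leaves the loss unchanged, which is exactly why the collator can use it for padding without affecting training.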