Text-to-speech data collator exhibits weird batching behavior with Seq2SeqTrainer
GinUTE opened this issue · 9 comments
System Info
- transformers version: 4.37.0.dev0
- platform: Linux-6.1.58+-x86_64-with-glibc2.35 (Colaboratory free accelerated runtime)
- python version: 3.10.12
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am currently fine-tuning SpeechT5 for Vietnamese TTS, following the official fine-tuning guide.
The only difference is that I replaced the tokenizer wrapped in SpeechT5Processor with my own Vietnamese SentencePiece character-level tokenizer. I made sure to add the same special tokens as in the original tokenizer, and it works as expected. I used the following code snippet:
from transformers import (
    SpeechT5ForTextToSpeech,
    SpeechT5Processor,
    SpeechT5Tokenizer,
)

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tokenizer = SpeechT5Tokenizer("spm-char.model")
processor.tokenizer = tokenizer

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
model.resize_token_embeddings(new_num_tokens=len(tokenizer), pad_to_multiple_of=8)
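As a quick sanity check (a sketch; "xin chào" is just arbitrary sample text), the resized embedding matrix lines up with the new vocabulary:

print(len(tokenizer))  # new vocab size
print(model.get_input_embeddings().weight.shape[0])  # >= len(tokenizer), padded to a multiple of 8
print(tokenizer("xin chào", return_tensors="pt"))  # returns input_ids and attention_mask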
The issue arises when I get to the training phase at trainer.train(). It throws the following error:
Sizes of tensors must match except in dimension 2. Expected size 16 but got size 256 for tensor number 1 in the list.
I found that the error changes with the batch size. Specifically, the second sentence of the message always reads:
Expected size <batch size> but got size <batch size squared> for tensor number 1 in the list.
Any batch size other than 1 throws this error.
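For context, that wording is characteristic of torch.cat receiving tensors that disagree in a non-concatenated dimension. An illustrative snippet (not the actual model code; the shapes are chosen only to mirror the batch sizes above) that reproduces the exact message:

import torch

a = torch.zeros(16, 100, 768)
b = torch.zeros(256, 100, 512)  # 256 == 16 ** 2
torch.cat([a, b], dim=2)
# RuntimeError: Sizes of tensors must match except in dimension 2.
# Expected size 16 but got size 256 for tensor number 1 in the list.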
I made no changes to the original data collator; here is the code snippet:
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch

@dataclass
class TTSDataCollatorWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
        label_features = [{"input_values": feature["labels"]} for feature in features]
        speaker_features = [feature["speaker_embeddings"] for feature in features]

        # Pad the tokenized inputs and the log-mel spectrogram targets together.
        # (Use the dataclass field rather than the global processor.)
        batch = self.processor.pad(
            input_ids=input_ids, labels=label_features, return_tensors="pt"
        )

        # Replace label padding with -100 so it is ignored by the loss.
        batch["labels"] = batch["labels"].masked_fill(
            batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100
        )

        # Not used during fine-tuning.
        del batch["decoder_attention_mask"]

        # Round target lengths down to a multiple of the reduction factor
        # (relies on the global model, as in the official guide).
        if model.config.reduction_factor > 1:
            target_lengths = torch.tensor(
                [len(feature["input_values"]) for feature in label_features]
            )
            target_lengths = target_lengths.new(
                [
                    length - length % model.config.reduction_factor
                    for length in target_lengths
                ]
            )
            max_length = max(target_lengths)
            batch["labels"] = batch["labels"][:, :max_length]

        batch["speaker_embeddings"] = torch.tensor(speaker_features)

        return batch
data_collator = TTSDataCollatorWithPadding(processor=processor)
I checked the batch returned by the data collator with 16 examples, and the shapes seem to check out:
{'input_ids': torch.Size([16, 188]),
'attention_mask': torch.Size([16, 188]),
'labels': torch.Size([16, 628, 80]),
'speaker_embeddings': torch.Size([16, 512])}
I suspect it has something to do with the DataLoader, or something else obvious that I just cannot wrap my head around. Any help is appreciated.
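For what it's worth, here is a sketch of how one could take Seq2SeqTrainer out of the picture and push a single collated batch through the model directly (train_dataset is assumed to be the processed dataset from the guide):

from torch.utils.data import DataLoader

# Pull one batch through the same collator the Trainer would use,
# then run a plain forward pass to see where the mismatch occurs.
loader = DataLoader(train_dataset, batch_size=16, collate_fn=data_collator)
batch = next(iter(loader))
with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
        speaker_embeddings=batch["speaker_embeddings"],
    )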
Expected behavior
The fine-tuning should proceed as usual. I fine-tuned SpeechT5 on Vietnamese TTS once before, but not with a custom tokenizer.