huggingface/transformers

Text-to-speech data collator exhibits weird batching behavior with Seq2SeqTrainer

GinUTE opened this issue · 9 comments

GinUTE commented

System Info

  • transformers version: 4.37.0.dev0
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35 (Colaboratory free accelerated runtime)
  • Python version: 3.10.12

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am currently fine-tuning SpeechT5 for Vietnamese TTS, following the official fine-tuning guide here.

The only change I made is replacing the tokenizer wrapped in SpeechT5Processor with my own Vietnamese character-level SentencePiece tokenizer. I made sure to add the same special tokens as in the original tokenizer, and the tokenizer itself works as expected. I used the following code snippet:

from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5Tokenizer

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
# Swap in the custom Vietnamese character-level SentencePiece tokenizer
tokenizer = SpeechT5Tokenizer("spm-char.model")
processor.tokenizer = tokenizer
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
# Resize the input embeddings to match the new vocabulary
model.resize_token_embeddings(new_num_tokens=len(tokenizer), pad_to_multiple_of=8)
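
As a sanity check that the special tokens really do line up, something like the following can be used (a sketch; the reference tokenizer here is the stock one from the checkpoint):

# Sketch: compare special tokens between the stock tokenizer and the custom one
reference = SpeechT5Tokenizer.from_pretrained("microsoft/speecht5_tts")
print(reference.special_tokens_map)     # e.g. {'bos_token': '<s>', 'eos_token': '</s>', ...}
print(tokenizer.special_tokens_map)     # should match the map above
print(tokenizer("xin chào").input_ids)  # per-character IDs for a sample Vietnamese string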

The issue arises when I get to the training phase, at trainer.train(). It throws the following error:
Sizes of tensors must match except in dimension 2. Expected size 16 but got size 256 for tensor number 1 in the list.

I found that the error changes with the batch size. Specifically, the second sentence of the error message always reads:
Expected size <batch size> but got size <batch size squared> for tensor number 1 in the list.

Any batch size other than 1 throws this error; with batch size 16 above, 256 = 16².
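
To narrow it down, a single forward pass on one collated batch shows whether the Trainer is involved at all. A minimal sketch, assuming batch comes from the data collator shown below:

# Sketch: call the model directly, bypassing the Trainer and its DataLoader
with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        speaker_embeddings=batch["speaker_embeddings"],
        labels=batch["labels"],
    )
print(outputs.loss)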

I made no changes to the original data collator; here is the code snippet:

from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch


@dataclass
class TTSDataCollatorWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # Split each example into text inputs, spectrogram targets, and speaker embeddings
        input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
        label_features = [{"input_values": feature["labels"]} for feature in features]
        speaker_features = [feature["speaker_embeddings"] for feature in features]

        # Pad text inputs and spectrogram targets to the longest example in the batch
        batch = self.processor.pad(
            input_ids=input_ids, labels=label_features, return_tensors="pt"
        )

        # Mask padded target frames with -100 so they are ignored by the loss
        batch["labels"] = batch["labels"].masked_fill(
            batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100
        )
        del batch["decoder_attention_mask"]

        # Round target lengths down to a multiple of the reduction factor
        # (this reads the global `model`, as in the guide)
        if model.config.reduction_factor > 1:
            target_lengths = torch.tensor(
                [len(feature["input_values"]) for feature in label_features]
            )
            target_lengths = target_lengths.new(
                [
                    length - length % model.config.reduction_factor
                    for length in target_lengths
                ]
            )
            max_length = max(target_lengths)
            batch["labels"] = batch["labels"][:, :max_length]

        batch["speaker_embeddings"] = torch.tensor(speaker_features)
        return batch


data_collator = TTSDataCollatorWithPadding(processor=processor)

I checked the batch returned by the data collator on 16 examples, and the shapes seem to check out:

{'input_ids': torch.Size([16, 188]),
 'attention_mask': torch.Size([16, 188]),
 'labels': torch.Size([16, 628, 80]),
 'speaker_embeddings': torch.Size([16, 512])}
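
For reference, those shapes came from calling the collator directly. A minimal sketch, assuming dataset is the processed dataset from the guide:

# Sketch: feed 16 processed examples straight into the collator and print shapes
examples = [dataset[i] for i in range(16)]
batch = data_collator(examples)
for key, value in batch.items():
    print(key, tuple(value.shape))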

I suspect it has something to do with the DataLoader, or something else obvious that I just cannot wrap my head around. Any help is appreciated.
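
One way to check that suspicion is to compare the batch above with what the Trainer's own dataloader actually yields. A minimal sketch, assuming trainer is the Seq2SeqTrainer instance from the guide:

# Sketch: inspect the first batch the Trainer actually produces
train_dataloader = trainer.get_train_dataloader()
first_batch = next(iter(train_dataloader))
for key, value in first_batch.items():
    print(key, tuple(value.shape) if hasattr(value, "shape") else type(value))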

Expected behavior

Fine-tuning should proceed as usual. I have fine-tuned SpeechT5 on Vietnamese TTS once before, just not with a custom tokenizer.