microsoft/DeepSpeedExamples

Question: Why not pad to the same sequence length within the batch during the SFT training phase?

LKLKyy opened this issue · 0 comments

Question: In the SFT training phase of dschat, I found that the function create_dataset_split in data_utils.py pads every sample to the maximum sequence length (max_seq_len). Why not pad dynamically to the length of the longest sample in each batch during training, which could significantly speed it up? (See the sketch after the snippet below.)
if chosen_sentence is not None:
    chosen_sentence += end_of_conversation_token
    # Every sample is padded to max_seq_len, regardless of its actual length.
    chosen_token = tokenizer(chosen_sentence,
                             max_length=max_seq_len,
                             padding="max_length",
                             truncation=True,
                             return_tensors="pt")
    chosen_token["input_ids"] = chosen_token["input_ids"].squeeze(0)
    chosen_token["attention_mask"] = chosen_token["attention_mask"].squeeze(0)
    chosen_dataset.append(chosen_token)
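
For reference, a minimal sketch of what per-batch (dynamic) padding could look like, assuming samples are tokenized with truncation only (no padding="max_length") and collated with a custom collate_fn. The helper name make_dynamic_collate_fn is illustrative and not part of dschat:

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def make_dynamic_collate_fn(pad_token_id):
    def collate_fn(batch):
        # Each sample is a dict holding unpadded 1-D "input_ids" / "attention_mask" tensors.
        input_ids = [sample["input_ids"] for sample in batch]
        attention_mask = [sample["attention_mask"] for sample in batch]
        return {
            # Pad only up to the longest sequence in this batch.
            "input_ids": pad_sequence(input_ids, batch_first=True,
                                      padding_value=pad_token_id),
            "attention_mask": pad_sequence(attention_mask, batch_first=True,
                                           padding_value=0),
        }
    return collate_fn


# Usage sketch: tokenize with truncation only, then let the DataLoader pad per batch.
# chosen_token = tokenizer(chosen_sentence, max_length=max_seq_len,
#                          truncation=True, return_tensors="pt")
# loader = DataLoader(chosen_dataset, batch_size=8,
#                     collate_fn=make_dynamic_collate_fn(tokenizer.pad_token_id))

For causal-LM SFT the labels would need the same per-batch padding (typically with -100 so padded positions are ignored by the loss); transformers' DataCollatorWithPadding implements the same idea for input_ids and attention_mask.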