chaoyi-wu/Finetune_LLAMA

attention mask for different documents in dataset chunk

waterhorse1 opened this issue · 3 comments

Hi chaoyi,

Thanks for your great work. I have a question about dataset tokenization in the following code.

all_tokens = [1] + [
    tok
    for row in all_tokenized
    for tok in row + [tokenizer.eos_token_id, tokenizer.bos_token_id]
]
truncated_tokens = all_tokens[:(len(all_tokens) // args.max_seq_length) * args.max_seq_length]
arr = np.array(truncated_tokens).reshape(-1, args.max_seq_length)
ds = datasets.Dataset.from_dict({"input_ids": arr})
ds.save_to_disk(args.save_path)

From my understanding, this preprocessing means that different documents can end up in the same data chunk. For example, the first document might take 512 tokens and the second document 128 tokens of a 640-token chunk. In that case, generation for the second document should not attend to the first one, so we might need an attention mask that hides the first document while generating the second. Am I correct?
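To make this concrete, here is a rough sketch of the kind of per-document (block-diagonal) causal mask I have in mind. This is my own toy code, not something from this repo; build_document_attention_mask and the toy token ids are purely illustrative:

import numpy as np

def build_document_attention_mask(input_ids, eos_token_id):
    # Assign each position a document id, incremented after every eos token,
    # so documents packed into the same chunk get different ids.
    seq_len = len(input_ids)
    doc_ids = np.zeros(seq_len, dtype=np.int64)
    current = 0
    for i, tok in enumerate(input_ids):
        doc_ids[i] = current
        if tok == eos_token_id:
            current += 1
    # Causal mask restricted to positions that belong to the same document.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    same_doc = doc_ids[None, :] == doc_ids[:, None]
    return causal & same_doc

# Two tiny "documents" packed into one chunk (bos = 1, eos = 2).
chunk = [1, 5, 6, 2, 1, 7, 2]
print(build_document_attention_mask(chunk, eos_token_id=2).astype(int))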

Thanks for the kind words.

Yes, your understanding is correct.

Since this project is a tutorial, the code here is written to keep the main scripts simple, avoid messy padding logic, and make the overall training flow more readable.

In practice, this kind of preprocessing is only suitable for large, unstructured corpora used for pre-training. In most cases, you should replace the dataset script with your own and add the correct attention and padding masks based on the characteristics of your data.
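As a rough illustration only (the dataset and column names below are placeholders, not code from this repo), per-document tokenization with padding and an explicit attention mask could look like this:

def tokenize_with_masks(examples, tokenizer, max_seq_length):
    # Each document is tokenized on its own, so there is no cross-document
    # leakage; shorter documents are padded, and pad positions are masked out
    # via the returned attention_mask.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
        padding="max_length",
    )

# Usage sketch: assumes a dataset with a "text" column and a tokenizer whose
# pad token is set (e.g. tokenizer.pad_token = tokenizer.eos_token).
# ds = datasets.load_dataset("json", data_files="my_corpus.jsonl")["train"]
# ds = ds.map(lambda ex: tokenize_with_masks(ex, tokenizer, 512), batched=True)
# ds now carries "input_ids" and "attention_mask" columns for the Trainer.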

@chaoyi-wu Thanks for your answer! I also ran into a problem when running finetune_pp_peft_trainer_lora.sh:

ValueError: FlatParameter requires uniform requires_grad

Any idea why this happens?

Yes, FSDP with LoRA has this bug and we are going to fix it. You may use DeepSpeed instead if you are working with LoRA.
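As a rough pointer (the config path and values below are placeholders, not files shipped with this repo), DeepSpeed can be enabled through the Hugging Face TrainingArguments instead of the FSDP flags:

from transformers import TrainingArguments

# Sketch only: point the Trainer at a DeepSpeed config (e.g. ZeRO stage 2);
# the json path and hyper-parameters here are placeholders.
training_args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    deepspeed="ds_config_zero2.json",
)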