Train with attention mask
germanjke opened this issue · 1 comment
germanjke commented
Hi,
Llama 3 was trained like this:
"We trained the models on sequences of 8,192 tokens, using a mask to ensure self-attention does not cross document boundaries."
I see you have something like this in mpt_modeling.py here.
Could you please tell me how to define this in the train config?
Thanks
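For context, the idea is to build a block-diagonal (per-document) attention mask from per-token sequence IDs, so each token can only attend to tokens from the same packed document, in addition to the usual causal constraint. Here is a minimal PyTorch sketch of that idea only, not llm-foundry's actual code; the function name sequence_id_attn_mask is made up for illustration:

import torch

def sequence_id_attn_mask(sequence_id: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask from per-token document IDs.

    sequence_id: (batch, seq_len) tensor where tokens from the same packed
    document share the same ID. Returns a (batch, seq_len, seq_len) mask
    where True means "may attend" (causal and same-document only).
    """
    batch, seq_len = sequence_id.shape
    # Same-document mask: position i may attend to position j only if
    # both tokens carry the same sequence (document) ID.
    same_doc = sequence_id.unsqueeze(-1) == sequence_id.unsqueeze(-2)
    # Standard causal mask: no attending to future positions.
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=sequence_id.device)
    )
    return same_doc & causal

# Example: two packed documents (lengths 3 and 2) in one 5-token sequence.
seq_id = torch.tensor([[0, 0, 0, 1, 1]])
mask = sequence_id_attn_mask(seq_id)
print(mask[0].int())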
dakinggg commented
Hey, we have not implemented the attention masking you are describing for models other than MPT variants.