Is dataloader making optimal batches?

Question

Closed this issue 6 months ago · 1 comment

I have a question: with the current way of making batches, aren't we throwing away information? For example, we take an input and reshape it into a B × T matrix. For each row, the first token is blind to the tokens that came before it in the original text, because that sequence never goes into the training loop. Wouldn't a better way to build the dataloader be something like a moving window?
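To make the two layouts concrete, here is a minimal sketch in plain Python. It is not the repository's dataloader: the function names `chunked_batch` and `sliding_windows`, the `stride` parameter, and the toy token ids are all assumptions made up for illustration.

```python
def chunked_batch(tokens, B, T, pos=0):
    """Non-overlapping layout: take B*T+1 tokens and cut them into B rows of length T."""
    buf = tokens[pos : pos + B * T + 1]
    x = [buf[r * T : (r + 1) * T] for r in range(B)]          # inputs
    y = [buf[r * T + 1 : (r + 1) * T + 1] for r in range(B)]  # targets, shifted by one
    return x, y


def sliding_windows(tokens, T, stride=1):
    """Overlapping layout: one (input, target) pair starting every `stride` tokens."""
    pairs = []
    for start in range(0, len(tokens) - T, stride):
        inputs = tokens[start : start + T]
        targets = tokens[start + 1 : start + T + 1]
        pairs.append((inputs, targets))
    return pairs


tokens = list(range(13))  # hypothetical stand-in token ids 0..12

x, y = chunked_batch(tokens, B=3, T=4)
# Row 1 starts at token 4, so token 5 is predicted from the context [4] only.
print(x[1], y[1])

windows = sliding_windows(tokens, T=4, stride=1)
# The window starting at token 1 predicts token 5 from the full context [1, 2, 3, 4].
print(windows[1])
print(len(windows))
```

With `stride=1` the moving window gives every target a full-length context somewhere in the data, but it also yields roughly T times as many sequences per pass, since each token is reprocessed in up to T different windows.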

Answer 1 · 2024-06-21T06:44:24.000Z

I got the answer here: https://www.youtube.com/watch?v=l8pRSuU81PU&lc=UgxBEJSMh2LngUmeJiR4AaABAg