Is dataloader making optimal batches?
Closed · 1 comment
paraschopra commented
I have a question: with the current way of making batches, aren't we throwing away information? E.g. we take an input and reshape it into a B * T matrix. For each row, the first token is blind to the tokens that came before it, because we never put a sequence containing that context into the training loop. Wouldn't a better way to build the dataloader be something like a moving window?
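To make the two schemes concrete, here is a minimal sketch in plain Python. It is not the repo's actual dataloader: `chunked_batches` and `sliding_windows` are hypothetical names, the token stream is just the integers 0..16, and the chunked version assumes each batch reads `B * T + 1` tokens and then advances by `B * T`.

```python
# Illustrative sketch only; function names and the toy token stream are assumptions.

def chunked_batches(tokens, B, T):
    """Non-overlapping batches: each step reads B*T + 1 tokens, then advances by B*T."""
    batches = []
    pos = 0
    while pos + B * T + 1 <= len(tokens):
        buf = tokens[pos : pos + B * T + 1]
        # Inputs are rows of length T; targets are the same rows shifted by one token.
        x = [buf[i * T : (i + 1) * T] for i in range(B)]
        y = [buf[i * T + 1 : (i + 1) * T + 1] for i in range(B)]
        batches.append((x, y))
        pos += B * T
    return batches


def sliding_windows(tokens, T, stride=1):
    """Overlapping windows: one (input, target) pair per start offset."""
    return [
        (tokens[s : s + T], tokens[s + 1 : s + T + 1])
        for s in range(0, len(tokens) - T, stride)
    ]


tokens = list(range(17))  # toy "token ids" 0..16
batches = chunked_batches(tokens, B=2, T=4)
windows = sliding_windows(tokens, T=4)

# Chunked: token 4 opens the second row, so it predicts 5 with no earlier context.
print(batches[0][0])  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Moving window: the window starting at offset 1 ends on token 4,
# so the same prediction (4 -> 5) is also seen with tokens 1, 2, 3 as context.
print(windows[1])  # ([1, 2, 3, 4], [2, 3, 4, 5])

print(len(batches), len(windows))  # 2 13
```

In this toy case the moving window with stride 1 produces 13 sequences of length 4 where the chunked version produces 2 batches of 2 rows, so each pass over the data costs several times more compute. That is the trade-off the question is really asking about.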
paraschopra commented
I got the answer here: https://www.youtube.com/watch?v=l8pRSuU81PU&lc=UgxBEJSMh2LngUmeJiR4AaABAg