Chunking method in the original GPT-2 training dataset
rasbt opened this issue · 2 comments
The data loader prepares the input batches in chunks. Say the chunk size is 6; then you have a sliding-window approach where each chunk advances by 6 tokens, as you show in the video:
Tokenized text: [ 5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502, 2740, 13, 198, 198, 3237, 25, 1081, 5248, 461, 11, 2740, 13, 99]
Batch inputs:
tensor([[ 5962, 22307, 25, 198, 8421, 356],
[ 5120, 597, 2252, 11, 3285, 502],
[ 2740, 13, 198, 198, 3237, 25],
[ 1081, 5248, 461, 11, 2740, 13]])
Batch targets (inputs shifted by +1):
tensor([[22307, 25, 198, 8421, 356, 5120],
[ 597, 2252, 11, 3285, 502, 2740],
[ 13, 198, 198, 3237, 25, 1081],
[ 5248, 461, 11, 2740, 13, 99]])
This works well, and this is usually also how I do it.
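For reference, here is a minimal sketch of that non-overlapping batching (my own illustration, not code from the repo; `make_batches` is just a hypothetical name), using the token IDs from above:

```python
# Hypothetical sketch: sliding window that advances by the full chunk size,
# so consecutive input chunks do not overlap (matches the tensors above).
import torch

def make_batches(token_ids, chunk_size=6, stride=None):
    stride = chunk_size if stride is None else stride  # default: non-overlapping
    inputs, targets = [], []
    # stop early enough that the +1-shifted target still fits
    for i in range(0, len(token_ids) - chunk_size, stride):
        inputs.append(token_ids[i : i + chunk_size])
        targets.append(token_ids[i + 1 : i + chunk_size + 1])
    return torch.tensor(inputs), torch.tensor(targets)

token_ids = [5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502,
             2740, 13, 198, 198, 3237, 25, 1081, 5248, 461, 11, 2740, 13, 99]

x, y = make_batches(token_ids, chunk_size=6)  # stride defaults to the chunk size
print(x)  # reproduces the batch inputs above
print(y)  # reproduces the batch targets (inputs shifted by +1)
```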
However, for exactly reproducing the original GPT-2 model, I think they used overlapping inputs, i.e., each new chunk starts just one token after the previous one:
Batch inputs:
tensor([[ 5962, 22307, 25, 198, 8421, 356],
[22307, 25, 198, 8421, 356, 5120],
[ 25, 198, 8421, 356, 5120, 597],
[ 198, 8421, 356, 5120, 597, 2252],
...])
Batch targets:
tensor([[22307, 25, 198, 8421, 356, 5120],
[ 25, 198, 8421, 356, 5120, 597],
[ 198, 8421, 356, 5120, 597, 2252],
[ 8421, 356, 5120, 597, 2252, 11],
...])
In other words, instead of advancing the position by the chunk size, they advanced it by only 1 token. Please correct me if I'm wrong.
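To make the comparison concrete, here is a small sketch of that overlapping variant (again my own illustration, not code from any GPT-2 repo), using torch.Tensor.unfold to slide the window one token at a time:

```python
# Sketch of the overlapping variant: the window advances by 1 token
# instead of by the full chunk size, so consecutive chunks share 5 tokens.
import torch

token_ids = torch.tensor([5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11,
                          3285, 502, 2740, 13, 198, 198, 3237, 25, 1081, 5248,
                          461, 11, 2740, 13, 99])

chunk_size = 6
inputs  = token_ids[:-1].unfold(0, chunk_size, 1)  # all windows over tokens[:-1], step 1
targets = token_ids[1:].unfold(0, chunk_size, 1)   # the same windows, shifted by +1

print(inputs[:4])   # first four overlapping chunks, as in the example above
print(targets[:4])
```

Note that with a stride of 1, a text of N tokens yields roughly N - chunk_size chunks instead of about N / chunk_size, so each token would appear in up to chunk_size different training examples.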
I would be quite surprised about that, I think. Is there any indication of this somewhere?
Sorry, this is mainly based on hearsay ... When I shared an implementation of GPT pretraining code that uses the first approach a few months ago, someone pointed out to me that the overlapping sliding-window approach was used. Then again, there is no hard evidence, because the original GPT-2 repo doesn't share the training code. Sorry for bothering you; I'll reopen this if I find concrete evidence.