Chunking method in the original GPT-2 training dataset
rasbt opened this issue · 2 comments
The data loader prepares the input batches in chunks. Say the chunk size is 6; then you have a sliding-window approach where each chunk advances by 6 tokens, as you show in the video:
Tokenized text: [ 5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502, 2740, 13, 198, 198, 3237, 25, 1081, 5248, 461, 11, 2740, 13, 99]
Batch inputs:
tensor([[ 5962, 22307, 25, 198, 8421, 356],
[ 5120, 597, 2252, 11, 3285, 502],
[ 2740, 13, 198, 198, 3237, 25],
[ 1081, 5248, 461, 11, 2740, 13]])
Batch targets (inputs shifted by +1):
tensor([[22307, 25, 198, 8421, 356, 5120],
[ 597, 2252, 11, 3285, 502, 2740],
[ 13, 198, 198, 3237, 25, 1081],
[ 5248, 461, 11, 2740, 13, 99]])
This works well, and this is usually also how I do it.
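For reference, here is a minimal sketch of that non-overlapping batching (my own illustration, not code from the repo; `make_batches` is just a hypothetical name), using the token IDs from above:

```python
# Hypothetical sketch: sliding window that advances by the full chunk size,
# so consecutive input chunks do not overlap (matches the tensors above).
import torch

def make_batches(token_ids, chunk_size=6, stride=None):
    stride = chunk_size if stride is None else stride  # default: non-overlapping
    inputs, targets = [], []
    # stop early enough that the +1-shifted target still fits
    for i in range(0, len(token_ids) - chunk_size, stride):
        inputs.append(token_ids[i : i + chunk_size])
        targets.append(token_ids[i + 1 : i + chunk_size + 1])
    return torch.tensor(inputs), torch.tensor(targets)

token_ids = [5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502,
             2740, 13, 198, 198, 3237, 25, 1081, 5248, 461, 11, 2740, 13, 99]

x, y = make_batches(token_ids, chunk_size=6)  # stride defaults to the chunk size
print(x)  # reproduces the batch inputs above
print(y)  # reproduces the batch targets (inputs shifted by +1)
```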
However, for exactly reproducing the original GPT-2 model, I think they used overlapping inputs, i.e., each new chunk starts just one token after the previous one:
Batch inputs:
tensor([[ 5962, 22307, 25, 198, 8421, 356],
[22307, 25, 198, 8421, 356, 5120],
[ 25, 198, 8421, 356, 5120, 597],
[ 198, 8421, 356, 5120, 597, 2252],
...])
Batch targets:
tensor([[22307, 25, 198, 8421, 356, 5120],
[ 25, 198, 8421, 356, 5120, 597],
[ 198, 8421, 356, 5120, 597, 2252],
[ 8421, 356, 5120, 597, 2252, 11],
...])
In other words, instead of advancing the position by the chunk size, they advanced it by only 1 token. Please correct me if I'm wrong.
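To make the comparison concrete, here is a small sketch of that overlapping variant (again my own illustration, not code from any GPT-2 repo), using torch.Tensor.unfold to slide the window one token at a time:

```python
# Sketch of the overlapping variant: the window advances by 1 token
# instead of by the full chunk size, so consecutive chunks share 5 tokens.
import torch

token_ids = torch.tensor([5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11,
                          3285, 502, 2740, 13, 198, 198, 3237, 25, 1081, 5248,
                          461, 11, 2740, 13, 99])

chunk_size = 6
inputs  = token_ids[:-1].unfold(0, chunk_size, 1)  # all windows over tokens[:-1], step 1
targets = token_ids[1:].unfold(0, chunk_size, 1)   # the same windows, shifted by +1

print(inputs[:4])   # first four overlapping chunks, as in the example above
print(targets[:4])
```

Note that with a stride of 1, a text of N tokens yields roughly N - chunk_size chunks instead of about N / chunk_size, so each token would appear in up to chunk_size different training examples.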
I would be quite surprised about that, I think. Is there any indication of this somewhere?
Sorry, this is mainly based on hearsay ... When I shared an implementation of GPT pretraining code that uses the first approach a few months ago, someone pointed out to me that the overlapping sliding-window approach was used. Then again, there is no hard evidence, because the original GPT-2 repo doesn't share the training code. Sorry for bothering you; I'll reopen this if I find concrete evidence.