rowanz/grover

Is the model structure exactly the same as GPT-2?

northfoxz opened this issue · 6 comments

Hi there, great work!
I'm trying to port the Grover model into the huggingface/transformers repo.
Is the model structure exactly the same as GPT-2's?
Thanks for your reply!

After reading the code, I found some structural differences between your implementation and OpenAI's, specifically in the normalization process:

OpenAI's implementation:

def block(x, scope, *, past, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        # ln_1: normalize the block input before attention
        a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
        x = x + a
        # ln_2: normalize before the MLP; the residual stream itself stays un-normalized
        m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
        x = x + m
        return x, present

ln_1 of each block: applies norm to the input before attention

ln_2 of each block: applies norm to the input before the fully-connected (MLP) layer

Grover's implementation:

def residual_mlp_layer(x_flat, intermediate_size, initializer_range=0.02, hidden_dropout_prob=0.1):
    batch_size_seq_length, hidden_size = get_shape_list(x_flat, expected_rank=2)
    # mlp_ln0: normalize the input before the MLP
    x_norm = layer_norm(x_flat, name='mlp_ln0')

    intermediate_output = tf.layers.dense(
        x_norm,
        intermediate_size,
        activation=gelu,
        kernel_initializer=create_initializer(initializer_range),
        name='intermediate',
    )

    output_for_residual = tf.layers.dense(
        intermediate_output,
        hidden_size,
        name='output',
        kernel_initializer=create_initializer(initializer_range))
    output_for_residual = dropout(output_for_residual, hidden_dropout_prob)

    # mlp_ln1: normalize again, after the residual connection
    layer_output = layer_norm(x_flat + output_for_residual, name='mlp_ln1')
    return layer_output

Grover applies two layer normalizations in the fully-connected layer: mlp_ln0 before the intermediate dense layer, and mlp_ln1 after the residual addition.
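
Schematically, the two per-sublayer patterns differ as follows. This is a minimal, runnable NumPy sketch; layer_norm and mlp below are simplified stand-ins for the real learned layers, not code from either repo:

import numpy as np

def layer_norm(x, eps=1e-5):
    # stand-in for a learned layer norm (no gain/bias here)
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x):
    # stand-in for dense -> gelu -> dense
    return np.tanh(x)

x = np.random.randn(4, 8)

# GPT-2 (pre-norm): only the branch input is normalized; the residual
# stream is left un-normalized, with a single ln_f after the last block.
gpt2_out = x + mlp(layer_norm(x))

# Grover: norm before the MLP (mlp_ln0) AND after the residual add
# (mlp_ln1), so the stream is re-normalized at the end of every block.
grover_out = layer_norm(x + mlp(layer_norm(x)))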

That makes the structure different from OpenAI's implementation, so I'm unable to transfer this model to Hugging Face's repo.

Sorry for taking a while to get to this one! I believe it's actually the same, since IIRC there's an extra layer normalization somewhere else in the OpenAI code. That said, the layer normalizations might not match up in terms of naming...
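
For reference, the extra normalization being referred to is most likely the final ln_f in openai/gpt-2's model.py, applied once after the last block (abridged excerpt):

def model(hparams, X, past=None, scope='model', reuse=False):
    ...
    for layer, past in enumerate(pasts):
        h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
        presents.append(present)
    h = norm(h, 'ln_f')  # the extra layer norm, after the final block
    ...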

Hi @northfoxz. Were you able to determine whether the difference is only in naming, or whether it is structural?
If the difference is only in the names, maybe a Grover model can be converted to make it compatible with Hugging Face.
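
If it did turn out to be naming only, the conversion would mostly be a rename pass over the TF checkpoint, along these lines. This is a hypothetical sketch: the newslm scope, the checkpoint path, and the substitution table are illustrative guesses, not a verified mapping:

import tensorflow as tf

# Illustrative renames only; a real mapping table would have to be
# worked out variable by variable against the GPT-2 state dict.
RENAMES = [
    ('newslm/layer', 'h.'),    # block scopes -> transformer.h.<i>
    ('/mlp_ln0/', '.ln_2.'),   # pre-MLP norm -> GPT-2's ln_2, if it lines up
    ('/beta', '.bias'),
    ('/gamma', '.weight'),
]

reader = tf.train.load_checkpoint('grover_base/model.ckpt')  # path is illustrative
converted = {}
for tf_name in reader.get_variable_to_shape_map():
    hf_name = tf_name
    for old, new in RENAMES:
        hf_name = hf_name.replace(old, new)
    converted[hf_name] = reader.get_tensor(tf_name)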

@EibrielInv Well, it is structural, with slight differences; you will have to modify the GPT-2 model code a bit to make it work.
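
In practice the adjustment might look something like this hypothetical PyTorch sketch of a GPT-2-style block rearranged to Grover's norm placement. Here attn and mlp are placeholders for the attention and MLP submodules; how Grover wires attention internally is not shown in the snippet quoted above:

import torch.nn as nn

class GroverStyleBlock(nn.Module):
    """Hypothetical: a GPT-2-style block with Grover's norm placement."""
    def __init__(self, hidden_size, attn, mlp, eps=1e-5):
        super().__init__()
        self.attn = attn                                      # causal self-attention module
        self.mlp = mlp                                        # dense -> gelu -> dense
        self.ln_mlp_in = nn.LayerNorm(hidden_size, eps=eps)   # plays the role of mlp_ln0
        self.ln_mlp_out = nn.LayerNorm(hidden_size, eps=eps)  # plays the role of mlp_ln1

    def forward(self, x):
        x = x + self.attn(x)             # residual around attention
        h = self.mlp(self.ln_mlp_in(x))  # norm before the MLP, as in mlp_ln0
        return self.ln_mlp_out(x + h)    # norm after the residual add, as in mlp_ln1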

@northfoxz ~ Did you ever attempt to port it across into the huggingface/transformers repo by adjusting the GPT-2 code?

Have you ever made any progress on this one? @northfoxz
The only thing I have found is this: https://huggingface.co/gagan3012/distilbert-fakenews-model-grover
But nothing else seems to be out there.