lopuhin/transformer-lm

"state_dict" Mismatch

nitinnairk opened this issue · 1 comment

Building a base GPT-2 model with the parameters below yields a model whose state_dict differs in the number of parameters.
{ "initializer_range": 0.02, "layer_norm_epsilon": 1e-05, "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_layer": 12, "n_positions": 1024, "vocab_size": 100 }
I'm trying to load the model trained here into https://github.com/huggingface/transformers.
The layers missing from the model built using this repo are listed below.
{'blocks.0.attn.bias', 'blocks.1.attn.bias', 'blocks.10.attn.bias', 'blocks.11.attn.bias', 'blocks.2.attn.bias', 'blocks.3.attn.bias', 'blocks.4.attn.bias', 'blocks.5.attn.bias', 'blocks.6.attn.bias', 'blocks.7.attn.bias', 'blocks.8.attn.bias', 'blocks.9.attn.bias'}
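For context, this list was produced by diffing the two sets of state_dict keys. A minimal sketch of that comparison (the model argument and checkpoint path are placeholders, not the actual API of either repo):

```python
import torch

def diff_state_dicts(model, checkpoint_path):
    """Compare the keys of a built model against a saved checkpoint.

    `model` is assumed to be the GPT-2 module built with this repo, and
    `checkpoint_path` a state_dict saved from huggingface/transformers
    (both names are placeholders for illustration).
    """
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    own = set(model.state_dict().keys())
    theirs = set(ckpt.keys())
    print("missing from this repo's model:", sorted(theirs - own))
    print("only in this repo's model:", sorted(own - theirs))
    print("counts:", len(own), "vs", len(theirs))
```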

Layers from the model built using this repo
['wpe.weight', 'wte.weight', 'blocks.0.ln_1.g', 'blocks.0.ln_1.b', 'blocks.0.ln_2.g', 'blocks.0.ln_2.b', 'blocks.0.mlp.c_fc.weight', 'blocks.0.mlp.c_fc.bias', 'blocks.0.mlp.c_proj.weight', 'blocks.0.mlp.c_proj.bias', 'blocks.0.attn.c_attn.weight', 'blocks.0.attn.c_attn.bias', 'blocks.0.attn.c_proj.weight', 'blocks.0.attn.c_proj.bias', 'blocks.1.ln_1.g', 'blocks.1.ln_1.b', 'blocks.1.ln_2.g', 'blocks.1.ln_2.b', 'blocks.1.mlp.c_fc.weight', 'blocks.1.mlp.c_fc.bias', 'blocks.1.mlp.c_proj.weight', 'blocks.1.mlp.c_proj.bias', 'blocks.1.attn.c_attn.weight', 'blocks.1.attn.c_attn.bias', 'blocks.1.attn.c_proj.weight', 'blocks.1.attn.c_proj.bias', 'blocks.2.ln_1.g', 'blocks.2.ln_1.b', 'blocks.2.ln_2.g', 'blocks.2.ln_2.b', 'blocks.2.mlp.c_fc.weight', 'blocks.2.mlp.c_fc.bias', 'blocks.2.mlp.c_proj.weight', 'blocks.2.mlp.c_proj.bias', 'blocks.2.attn.c_attn.weight', 'blocks.2.attn.c_attn.bias', 'blocks.2.attn.c_proj.weight', 'blocks.2.attn.c_proj.bias', 'blocks.3.ln_1.g', 'blocks.3.ln_1.b', 'blocks.3.ln_2.g', 'blocks.3.ln_2.b', 'blocks.3.mlp.c_fc.weight', 'blocks.3.mlp.c_fc.bias', 'blocks.3.mlp.c_proj.weight', 'blocks.3.mlp.c_proj.bias', 'blocks.3.attn.c_attn.weight', 'blocks.3.attn.c_attn.bias', 'blocks.3.attn.c_proj.weight', 'blocks.3.attn.c_proj.bias', 'blocks.4.ln_1.g', 'blocks.4.ln_1.b', 'blocks.4.ln_2.g', 'blocks.4.ln_2.b', 'blocks.4.mlp.c_fc.weight', 'blocks.4.mlp.c_fc.bias', 'blocks.4.mlp.c_proj.weight', 'blocks.4.mlp.c_proj.bias', 'blocks.4.attn.c_attn.weight', 'blocks.4.attn.c_attn.bias', 'blocks.4.attn.c_proj.weight', 'blocks.4.attn.c_proj.bias', 'blocks.5.ln_1.g', 'blocks.5.ln_1.b', 'blocks.5.ln_2.g', 'blocks.5.ln_2.b', 'blocks.5.mlp.c_fc.weight', 'blocks.5.mlp.c_fc.bias', 'blocks.5.mlp.c_proj.weight', 'blocks.5.mlp.c_proj.bias', 'blocks.5.attn.c_attn.weight', 'blocks.5.attn.c_attn.bias', 'blocks.5.attn.c_proj.weight', 'blocks.5.attn.c_proj.bias', 'blocks.6.ln_1.g', 'blocks.6.ln_1.b', 'blocks.6.ln_2.g', 'blocks.6.ln_2.b', 'blocks.6.mlp.c_fc.weight', 'blocks.6.mlp.c_fc.bias', 'blocks.6.mlp.c_proj.weight', 'blocks.6.mlp.c_proj.bias', 'blocks.6.attn.c_attn.weight', 'blocks.6.attn.c_attn.bias', 'blocks.6.attn.c_proj.weight', 'blocks.6.attn.c_proj.bias', 'blocks.7.ln_1.g', 'blocks.7.ln_1.b', 'blocks.7.ln_2.g', 'blocks.7.ln_2.b', 'blocks.7.mlp.c_fc.weight', 'blocks.7.mlp.c_fc.bias', 'blocks.7.mlp.c_proj.weight', 'blocks.7.mlp.c_proj.bias', 'blocks.7.attn.c_attn.weight', 'blocks.7.attn.c_attn.bias', 'blocks.7.attn.c_proj.weight', 'blocks.7.attn.c_proj.bias', 'blocks.8.ln_1.g', 'blocks.8.ln_1.b', 'blocks.8.ln_2.g', 'blocks.8.ln_2.b', 'blocks.8.mlp.c_fc.weight', 'blocks.8.mlp.c_fc.bias', 'blocks.8.mlp.c_proj.weight', 'blocks.8.mlp.c_proj.bias', 'blocks.8.attn.c_attn.weight', 'blocks.8.attn.c_attn.bias', 'blocks.8.attn.c_proj.weight', 'blocks.8.attn.c_proj.bias', 'blocks.9.ln_1.g', 'blocks.9.ln_1.b', 'blocks.9.ln_2.g', 'blocks.9.ln_2.b', 'blocks.9.mlp.c_fc.weight', 'blocks.9.mlp.c_fc.bias', 'blocks.9.mlp.c_proj.weight', 'blocks.9.mlp.c_proj.bias', 'blocks.9.attn.c_attn.weight', 'blocks.9.attn.c_attn.bias', 'blocks.9.attn.c_proj.weight', 'blocks.9.attn.c_proj.bias', 'blocks.10.ln_1.g', 'blocks.10.ln_1.b', 'blocks.10.ln_2.g', 'blocks.10.ln_2.b', 'blocks.10.mlp.c_fc.weight', 'blocks.10.mlp.c_fc.bias', 'blocks.10.mlp.c_proj.weight', 'blocks.10.mlp.c_proj.bias', 'blocks.10.attn.c_attn.weight', 'blocks.10.attn.c_attn.bias', 'blocks.10.attn.c_proj.weight', 'blocks.10.attn.c_proj.bias', 'blocks.11.ln_1.g', 'blocks.11.ln_1.b', 'blocks.11.ln_2.g', 'blocks.11.ln_2.b', 'blocks.11.mlp.c_fc.weight', 
'blocks.11.mlp.c_fc.bias', 'blocks.11.mlp.c_proj.weight', 'blocks.11.mlp.c_proj.bias', 'blocks.11.attn.c_attn.weight', 'blocks.11.attn.c_attn.bias', 'blocks.11.attn.c_proj.weight', 'blocks.11.attn.c_proj.bias', 'ln_f.g', 'ln_f.b']
Total count: 148

Layers from the huggingface repo
["wte.weight", "wpe.weight", "h.0.ln_1.weight", "h.0.ln_1.bias", "h.0.attn.bias", "h.0.attn.c_attn.weight", "h.0.attn.c_attn.bias", "h.0.attn.c_proj.weight", "h.0.attn.c_proj.bias", "h.0.ln_2.weight", "h.0.ln_2.bias", "h.0.mlp.c_fc.weight", "h.0.mlp.c_fc.bias", "h.0.mlp.c_proj.weight", "h.0.mlp.c_proj.bias", "h.1.ln_1.weight", "h.1.ln_1.bias", "h.1.attn.bias", "h.1.attn.c_attn.weight", "h.1.attn.c_attn.bias", "h.1.attn.c_proj.weight", "h.1.attn.c_proj.bias", "h.1.ln_2.weight", "h.1.ln_2.bias", "h.1.mlp.c_fc.weight", "h.1.mlp.c_fc.bias", "h.1.mlp.c_proj.weight", "h.1.mlp.c_proj.bias", "h.2.ln_1.weight", "h.2.ln_1.bias", "h.2.attn.bias", "h.2.attn.c_attn.weight", "h.2.attn.c_attn.bias", "h.2.attn.c_proj.weight", "h.2.attn.c_proj.bias", "h.2.ln_2.weight", "h.2.ln_2.bias", "h.2.mlp.c_fc.weight", "h.2.mlp.c_fc.bias", "h.2.mlp.c_proj.weight", "h.2.mlp.c_proj.bias", "h.3.ln_1.weight", "h.3.ln_1.bias", "h.3.attn.bias", "h.3.attn.c_attn.weight", "h.3.attn.c_attn.bias", "h.3.attn.c_proj.weight", "h.3.attn.c_proj.bias", "h.3.ln_2.weight", "h.3.ln_2.bias", "h.3.mlp.c_fc.weight", "h.3.mlp.c_fc.bias", "h.3.mlp.c_proj.weight", "h.3.mlp.c_proj.bias", "h.4.ln_1.weight", "h.4.ln_1.bias", "h.4.attn.bias", "h.4.attn.c_attn.weight", "h.4.attn.c_attn.bias", "h.4.attn.c_proj.weight", "h.4.attn.c_proj.bias", "h.4.ln_2.weight", "h.4.ln_2.bias", "h.4.mlp.c_fc.weight", "h.4.mlp.c_fc.bias", "h.4.mlp.c_proj.weight", "h.4.mlp.c_proj.bias", "h.5.ln_1.weight", "h.5.ln_1.bias", "h.5.attn.bias", "h.5.attn.c_attn.weight", "h.5.attn.c_attn.bias", "h.5.attn.c_proj.weight", "h.5.attn.c_proj.bias", "h.5.ln_2.weight", "h.5.ln_2.bias", "h.5.mlp.c_fc.weight", "h.5.mlp.c_fc.bias", "h.5.mlp.c_proj.weight", "h.5.mlp.c_proj.bias", "h.6.ln_1.weight", "h.6.ln_1.bias", "h.6.attn.bias", "h.6.attn.c_attn.weight", "h.6.attn.c_attn.bias", "h.6.attn.c_proj.weight", "h.6.attn.c_proj.bias", "h.6.ln_2.weight", "h.6.ln_2.bias", "h.6.mlp.c_fc.weight", "h.6.mlp.c_fc.bias", "h.6.mlp.c_proj.weight", "h.6.mlp.c_proj.bias", "h.7.ln_1.weight", "h.7.ln_1.bias", "h.7.attn.bias", "h.7.attn.c_attn.weight", "h.7.attn.c_attn.bias", "h.7.attn.c_proj.weight", "h.7.attn.c_proj.bias", "h.7.ln_2.weight", "h.7.ln_2.bias", "h.7.mlp.c_fc.weight", "h.7.mlp.c_fc.bias", "h.7.mlp.c_proj.weight", "h.7.mlp.c_proj.bias", "h.8.ln_1.weight", "h.8.ln_1.bias", "h.8.attn.bias", "h.8.attn.c_attn.weight", "h.8.attn.c_attn.bias", "h.8.attn.c_proj.weight", "h.8.attn.c_proj.bias", "h.8.ln_2.weight", "h.8.ln_2.bias", "h.8.mlp.c_fc.weight", "h.8.mlp.c_fc.bias", "h.8.mlp.c_proj.weight", "h.8.mlp.c_proj.bias", "h.9.ln_1.weight", "h.9.ln_1.bias", "h.9.attn.bias", "h.9.attn.c_attn.weight", "h.9.attn.c_attn.bias", "h.9.attn.c_proj.weight", "h.9.attn.c_proj.bias", "h.9.ln_2.weight", "h.9.ln_2.bias", "h.9.mlp.c_fc.weight", "h.9.mlp.c_fc.bias", "h.9.mlp.c_proj.weight", "h.9.mlp.c_proj.bias", "h.10.ln_1.weight", "h.10.ln_1.bias", "h.10.attn.bias", "h.10.attn.c_attn.weight", "h.10.attn.c_attn.bias", "h.10.attn.c_proj.weight", "h.10.attn.c_proj.bias", "h.10.ln_2.weight", "h.10.ln_2.bias", "h.10.mlp.c_fc.weight", "h.10.mlp.c_fc.bias", "h.10.mlp.c_proj.weight", "h.10.mlp.c_proj.bias", "h.11.ln_1.weight", "h.11.ln_1.bias", "h.11.attn.bias", "h.11.attn.c_attn.weight", "h.11.attn.c_attn.bias", "h.11.attn.c_proj.weight", "h.11.attn.c_proj.bias", "h.11.ln_2.weight", "h.11.ln_2.bias", "h.11.mlp.c_fc.weight", "h.11.mlp.c_fc.bias", "h.11.mlp.c_proj.weight", "h.11.mlp.c_proj.bias", "ln_f.weight", "ln_f.bias"]
Total count: 160
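Besides the extra `attn.bias` entries (which in huggingface/transformers are the constant causal-mask buffers of the attention module, not learned weights), the two state_dicts also use different naming conventions: `h.N` vs `blocks.N`, and LayerNorm `weight`/`bias` vs `g`/`b`. A hedged sketch of the renaming a manual conversion would need, assuming the remaining tensors are otherwise laid out the same (Conv1D weight transposes, if any, are not handled here):

```python
import re

def hf_to_this_repo_keys(hf_state_dict):
    """Rename huggingface/transformers GPT-2 keys to this repo's convention.

    Assumptions (not verified against either codebase): `h.N` -> `blocks.N`,
    LayerNorm `weight`/`bias` -> `g`/`b`, and the constant `attn.bias`
    causal-mask buffers are simply dropped.
    """
    converted = {}
    for name, tensor in hf_state_dict.items():
        if re.fullmatch(r"h\.\d+\.attn\.bias", name):
            continue  # constant causal mask, not a learned parameter
        name = re.sub(r"^h\.", "blocks.", name)
        name = re.sub(r"(ln_\d|ln_f)\.weight$", r"\1.g", name)
        name = re.sub(r"(ln_\d|ln_f)\.bias$", r"\1.b", name)
        converted[name] = tensor
    return converted
```

Even with such a rename, the tokenizers differ (see the reply below), so the embeddings would not line up; this is illustrative only.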

This is expected: this model is not meant to be compatible with https://github.com/huggingface/transformers. Even if the parameters were compatible, the tokenizers used are different (a custom tokenizer is the main point of this repo), so transferring weights would be tricky.