huggingface/peft

Saved weights differ from the original model

bezir opened this issue · 14 comments

bezir commented

System Info

transformers 4.40.1
peft 0.10.0

Who can help?

@pacman100 @younesbelkada @BenjaminBossan

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

I have fine-tuned a GPT2 model using SFTTrainer and also extended the vocabulary. I merge the base model and the trained adapters with the code below.

peft_model_path = "checkpoint"
tokenizer = AutoTokenizer.from_pretrained(peft_model_path, device_map="auto")
base_model_path = "openai-community/gpt2"
base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")

base_model.resize_token_embeddings(len(tokenizer))
    
model = PeftModel.from_pretrained(base_model, peft_model_path, device_map="auto")
merged_model = model.merge_and_unload()

I test this model and the results are okay. Then I want to save this merged_model using the code below.

tokenizer.save_pretrained(save_path)
merged_model.save_pretrained(save_path) 

Lastly, I open the saved model with the code below.

tokenizer = AutoTokenizer.from_pretrained(save_path)
model = AutoModelForCausalLM.from_pretrained(save_path)

The model that I load from save_path does not work well: it repeats the same token or produces random tokens from the base vocabulary.

Model before Save

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(66156, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=66156, bias=False)
)

Loaded Model:

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(66156, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=66156, bias=False)
)

Now let's look at the weights.

Model Before Save

OrderedDict([('transformer.wte.weight',
              tensor([[-0.2500,  0.2324,  0.0162,  ...,  0.0013, -0.4432,  0.2431],
                      [ 0.3332, -0.1894, -0.2949,  ...,  0.2883, -0.0411,  0.3148],
                      [-0.2403, -0.1975,  0.4091,  ..., -0.3482,  0.5244,  0.1759],
                      ...,
                      [ 0.3374, -0.1371, -0.2627,  ..., -0.6586, -0.5067, -0.0226],
                      [-0.1031,  0.1453, -0.9022,  ..., -0.3682,  0.4504,  0.3242],
                      [-0.5442, -0.6574, -0.0881,  ..., -0.2370, -0.3048,  0.7317]],
                     device='cuda:0')),
             ('transformer.wpe.weight',
              tensor([[-1.2274e-01, -1.5239e-01,  1.9312e-01,  ..., -2.8421e-03,
                       -5.4589e-02,  3.1110e-02],
                      [-1.1096e-02, -1.4371e-01, -5.5172e-02,  ...,  1.8058e-01,
                        4.9604e-02,  4.3034e-02],
                      [ 6.0074e-02, -2.2665e-01,  2.2515e-01,  ..., -1.2376e-02,
                        1.3002e-01, -1.2887e-02],
                      ...,
                      [ 3.5125e-01, -1.1077e+00,  1.3558e-01,  ..., -2.0376e-01,
                       -3.5680e-01,  3.1987e-01],
                      [ 4.0268e-01, -4.6439e-01, -8.0140e-04,  ...,  1.4744e-01,
                        1.2033e-01, -8.1738e-02],
                      [ 2.6610e-04,  3.0272e-03, -1.7086e-03,  ..., -4.6506e-03,
                       -2.3541e-03, -5.7855e-03]], device='cuda:0')),
             ('transformer.h.0.ln_1.weight',
              tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467, 0.1865, 0.2143,
                      0.1956, 0.2118, 0.2153, 0.1882, 0.2074, 0.1871, 0.2040, 0.2044, 0.1900,
                      0.1952, 0.0475, 0.1909, 0.2115, 0.1971, 0.2202, 0.1998, 0.2108, 0.2303,
                      ...
                      0.1662, 0.1982, 0.1582, 0.1935, 0.2182, 0.2067, 0.1855, 0.1778, 0.1900,
                      0.2124, 0.1215, 0.2092, 0.1929, 0.2434, 0.1936, 0.1948, 0.0622, 0.1852,
                      0.1868, 0.2035, 0.2310, 0.1794, 0.1655, 0.1756, 0.2074, 0.2194, 0.2152,
                      0.0502, 0.2294, 0.1950, 0.2149, 0.2024, 0.1727, 0.0657, 0.1919, 0.1847,
                      0.1900, 0.1825, 0.1898], device='cuda:1')), ...]

Loaded Model


OrderedDict([('transformer.wte.weight',
              tensor([[-0.0952, -0.0785,  0.0155,  ..., -0.1458, -0.0334,  0.0052],
                      [ 0.0234, -0.0811,  0.0049,  ...,  0.1172, -0.0847,  0.0343],
                      [-0.1060,  0.0711,  0.1621,  ..., -0.0243, -0.1103, -0.0732],
                      ...,
                      [-0.0358,  0.0035,  0.0351,  ..., -0.0633, -0.0200, -0.0084],
                      [-0.1555,  0.0488,  0.0125,  ..., -0.0582,  0.0440, -0.1661],
                      [-0.0095,  0.1273, -0.0158,  ...,  0.0115, -0.1641, -0.0303]],
                     device='cuda:0')),
             ('transformer.wpe.weight',
              tensor([[-1.2274e-01, -1.5239e-01,  1.9312e-01,  ..., -2.8421e-03,
                       -5.4589e-02,  3.1110e-02],
                      [-1.1096e-02, -1.4371e-01, -5.5172e-02,  ...,  1.8058e-01,
                        4.9604e-02,  4.3034e-02],
                      [ 6.0074e-02, -2.2665e-01,  2.2515e-01,  ..., -1.2376e-02,
                        1.3002e-01, -1.2887e-02],
                      ...,
                      [ 3.5125e-01, -1.1077e+00,  1.3558e-01,  ..., -2.0376e-01,
                       -3.5680e-01,  3.1987e-01],
                      [ 4.0268e-01, -4.6439e-01, -8.0140e-04,  ...,  1.4744e-01,
                        1.2033e-01, -8.1738e-02],
                      [ 2.6610e-04,  3.0272e-03, -1.7086e-03,  ..., -4.6506e-03,
                       -2.3541e-03, -5.7855e-03]], device='cuda:0')),
             ('transformer.h.0.ln_1.weight',
              tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467, 0.1865, 0.2143,
                      0.1956, 0.2118, 0.2153, 0.1882, 0.2074, 0.1871, 0.2040, 0.2044, 0.1900,
                      0.1952, 0.0475, 0.1909, 0.2115, 0.1971, 0.2202, 0.1998, 0.2108, 0.2303,
                      ...
                      0.1662, 0.1982, 0.1582, 0.1935, 0.2182, 0.2067, 0.1855, 0.1778, 0.1900,
                      0.2124, 0.1215, 0.2092, 0.1929, 0.2434, 0.1936, 0.1948, 0.0622, 0.1852,
                      0.1868, 0.2035, 0.2310, 0.1794, 0.1655, 0.1756, 0.2074, 0.2194, 0.2152,
                      0.0502, 0.2294, 0.1950, 0.2149, 0.2024, 0.1727, 0.0657, 0.1919, 0.1847,
                      0.1900, 0.1825, 0.1898], device='cuda:0')), ...]

I only call two functions, save_pretrained and then from_pretrained, so why are the weights different? I tried changing the weights after loading the model, and it started to work fine again. Then I tried to save that model, and the same problem appeared: the saved model is different from the loaded model.

Expected behavior

The model weights are supposed to be the same.
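
For completeness, a minimal sketch (reusing merged_model and save_path from the snippets above) of how the mismatch can be checked key by key; this is only an illustration, not the exact check I ran:

import torch
from transformers import AutoModelForCausalLM

# Compare the in-memory merged weights with the reloaded ones, key by key.
reloaded = AutoModelForCausalLM.from_pretrained(save_path)

before = merged_model.state_dict()
after = reloaded.state_dict()

for key, tensor in before.items():
    if key not in after:
        print(f"missing after reload: {key}")
    elif not torch.allclose(tensor.cpu(), after[key].cpu()):
        print(f"mismatch: {key}")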

bezir commented

Another way to look at it is this:

# save the raw state dict and load it back directly
torch.save(merged_model.state_dict(), "save_path/pytorch_model.bin")
state_dict = torch.load("save_path/pytorch_model.bin")

# load the same directory through transformers for comparison
model = GPT2LMHeadModel.from_pretrained("save_path")

I get two different sets of weight tensors from these two loading methods.

BenjaminBossan commented

I tried to reproduce your issue but was not successful. Since the code you provided was not complete, this is what I worked with, based on your example:

import torch
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, PeftModel

torch.manual_seed(0)
base_model_path = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", device_map="auto")
tokenizer.add_special_tokens(special_tokens_dict = {"cls_token": "<CLS>"})

lora_config = LoraConfig(init_lora_weights=False)

# first let's create a LoRA adapter that we can use
base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")
base_model.resize_token_embeddings(len(tokenizer))
model = get_peft_model(base_model, lora_config)

print("show some weights BEFORE MERGING")
print("wte")
print(model.base_model.model.transformer.wte.weight[0, :7])
print("wpe")
print(model.base_model.model.transformer.wpe.weight[0, :7])
print("ln_1")
print(model.base_model.model.transformer.h[0].ln_1.weight[:7])
print("c_attn 0")
print(model.base_model.model.transformer.h[0].attn.c_attn.base_layer.weight[0, :7])

merged_model = model.merge_and_unload()
del base_model, model

print("show some weights AFTER MERGING")
print("wte")
print(merged_model.transformer.wte.weight[0, :7])
print("wpe")
print(merged_model.transformer.wpe.weight[0, :7])
print("ln_1")
print(merged_model.transformer.h[0].ln_1.weight[:7])
print("c_attn 0 (note: these weights should change because we apply LoRA (with init_lora_weights=False))")
print(merged_model.transformer.h[0].attn.c_attn.weight[0, :7])

merged_model.save_pretrained("/tmp/issue-1689-merged")
del merged_model

loaded = AutoModelForCausalLM.from_pretrained("/tmp/issue-1689-merged")
print("show some weights AFTER LOADING")
print("wte")
print(loaded.transformer.wte.weight[0, :7])
print("wpe")
print(loaded.transformer.wpe.weight[0, :7])
print("ln_1")
print(loaded.transformer.h[0].ln_1.weight[:7])
print("c_attn 0")
print(loaded.transformer.h[0].attn.c_attn.weight[0, :7])

The outputs for me are:

show some weights BEFORE MERGING
wte
tensor([-0.1101, -0.0393,  0.0331,  0.1338, -0.0485, -0.0789, -0.2398],
       device='cuda:0')
wpe
tensor([-0.0188, -0.1974,  0.0040,  0.0113,  0.0638, -0.1050,  0.0369],
       device='cuda:0')
ln_1
tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467],
       device='cuda:0')
c_attn 0
tensor([-0.4738, -0.2614, -0.0978, -0.3499,  0.2243, -0.0429,  0.4187],
       device='cuda:0')

show some weights AFTER MERGING
wte
tensor([-0.1101, -0.0393,  0.0331,  0.1338, -0.0485, -0.0789, -0.2398],
       device='cuda:0')
wpe
tensor([-0.0188, -0.1974,  0.0040,  0.0113,  0.0638, -0.1050,  0.0369],
       device='cuda:0')
ln_1
tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467],
       device='cuda:0')
c_attn 0 (note: these weights should change because we apply LoRA (with init_lora_weights=False))
tensor([-0.4912, -0.2313, -0.0803, -0.3547,  0.1991, -0.0385,  0.4098],
       device='cuda:0')

show some weights AFTER LOADING
wte
tensor([-0.1101, -0.0393,  0.0331,  0.1338, -0.0485, -0.0789, -0.2398],
       grad_fn=<SliceBackward0>)
wpe
tensor([-0.0188, -0.1974,  0.0040,  0.0113,  0.0638, -0.1050,  0.0369],
       grad_fn=<SliceBackward0>)
ln_1
tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467],
       grad_fn=<SliceBackward0>)
c_attn 0
tensor([-0.4912, -0.2313, -0.0803, -0.3547,  0.1991, -0.0385,  0.4098],
       grad_fn=<SliceBackward0>)

Does that make sense? Any idea where your code differs?

bezir commented

Hello,
Thanks for the answer. The first problem is that method 1 produces good results when generating answers, whereas method 2 ends up repeating and emitting random tokens.

base_model_path = "openai-community/gpt2"
base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")

lora_config = LoraConfig(init_lora_weights=False)
base_model.resize_token_embeddings(len(tokenizer))
    
# Method 1
#model = PeftModel.from_pretrained(base_model, peft_model_path, lora_config=lora_config, device_map="auto")
#merged_model = model.merge_and_unload()

# Method 2
model = get_peft_model(base_model, lora_config)
merged_model = model.merge_and_unload()

Method 1: [screenshot of generation output]

Method 2: [screenshot of generation output]

When I execute your code I get exactly the same result, so I believe it is not an environment issue. I updated the code based on method 1. Here are the results.

import torch
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, PeftModel

torch.manual_seed(0)
base_model_path = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(peft_model_path, device_map="auto")  # CHANGE: load my peft_model tokenizer (see screenshot below)
# tokenizer.add_special_tokens(special_tokens_dict={"cls_token": "<CLS>"})  # CHANGE: removed, causes a size mismatch

lora_config = LoraConfig(init_lora_weights=False)

base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")
base_model.resize_token_embeddings(len(tokenizer)) # NEW SIZE
merged_model = PeftModel.from_pretrained(base_model, peft_model_path, lora_config=lora_config, device_map="auto") # CHANGE

print("show some weights BEFORE MERGING")
print("wte")
print(base_model.transformer.wte.weight[0, :7])
print("wpe")
print(base_model.transformer.wpe.weight[0, :7])
print("ln_1")
print(base_model.transformer.h[0].ln_1.weight[:7])
print("c_attn 0 (note: these weights should change because we apply LoRA (with init_lora_weights=False))")
print(base_model.transformer.h[0].attn.c_attn.weight[0, :7])

merged_model = merged_model.merge_and_unload()
del base_model  # CHANGE: "model" is not defined in this version of the script

print("show some weights AFTER MERGING")
print("wte")
print(merged_model.transformer.wte.weight[0, :7])
print("wpe")
print(merged_model.transformer.wpe.weight[0, :7])
print("ln_1")
print(merged_model.transformer.h[0].ln_1.weight[:7])
print("c_attn 0 (note: these weights should change because we apply LoRA (with init_lora_weights=False))")
print(merged_model.transformer.h[0].attn.c_attn.weight[0, :7])

merged_model.save_pretrained("/tmp/issue-1689-merged")
del merged_model

loaded = AutoModelForCausalLM.from_pretrained("/tmp/issue-1689-merged")
print("show some weights AFTER LOADING")
print("wte")
print(loaded.transformer.wte.weight[0, :7])
print("wpe")
print(loaded.transformer.wpe.weight[0, :7])
print("ln_1")
print(loaded.transformer.h[0].ln_1.weight[:7])
print("c_attn 0")
print(loaded.transformer.h[0].attn.c_attn.weight[0, :7])

Output: [screenshot]

PEFT Model File: [screenshot]

BenjaminBossan commented

model = get_peft_model(base_model, lora_config)
merged_model = model.merge_and_unload()

Note that when you call get_peft_model, you create a fresh PEFT model that is not based on any learned adapter. Therefore, it's not surprising that you don't get good results.

When I execute your code I get the exact same result so I believe it is not because of environment. I updated the code based on the method 1. Here is the results.

I cannot really interpret the results, because your script requires peft_model_path, which I guess points to your custom adapter, which I don't have. Which weights will change after merging depends on the settings of that adapter, e.g. if the embedding was adapted or not.

bezir commented

But still the question remains.

model = PeftModel.from_pretrained(base_model, peft_model_path, lora_config=lora_config, device_map="auto")
merged_model = model.merge_and_unload()

The weights of the merged model here are not the same after I save it and load it back. What is wrong? Is there anything I can do so that you can reproduce the problem?

bezir commented

The weights after merging and after loading should be the same. But as you can see above, they are different. Is there something wrong with the script or the adapter config?

BenjaminBossan commented

As I said, I can't help you diagnose your issue without knowing what's in your adapter. Can you share it, together with the adapter_config.json? If you can't share the weights, could you at least share the config and show the keys of the state_dict and the shapes of its values?
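
Something along these lines would be enough, just as a sketch (the path is a placeholder; older checkpoints store adapter_model.bin instead of adapter_model.safetensors):

from peft import LoraConfig
from safetensors.torch import load_file

# Print the adapter config plus every key and tensor shape in the adapter file.
config = LoraConfig.from_pretrained("path/to/checkpoint")
print(config)

state_dict = load_file("path/to/checkpoint/adapter_model.safetensors")
for key, tensor in state_dict.items():
    print(f"Key: {key}  Shape: {tuple(tensor.shape)}")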

furkantrky commented

I believe the problem here is that the lm_head and wte weights are tied for the GPT2 model. I have a solution for this; can you evaluate it, @BenjaminBossan? Here is my solution:

base_model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, device_map="cuda", trust_remote_code=True, tie_word_embeddings=False)
assert id(base_model.transformer.wte.weight) != id(base_model.lm_head.weight)
base_model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(base_model, peft_model_path, device_map="cuda")
model = model.merge_and_unload()

This unties the weights when the base model and the adapter are first loaded.
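
As a rough sanity check (just a sketch; save_path is a placeholder), the merged model should now survive a save/load round trip:

import torch
from transformers import AutoModelForCausalLM

# Save the merged (untied) model, reload it, and compare the two weights that
# previously came back wrong.
model.save_pretrained(save_path)
reloaded = AutoModelForCausalLM.from_pretrained(save_path)

print(torch.allclose(model.transformer.wte.weight.cpu().float(),
                     reloaded.transformer.wte.weight.float()))
print(torch.allclose(model.lm_head.weight.cpu().float(),
                     reloaded.lm_head.weight.float()))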

bezir commented

adapter_config.json: [screenshot]

Key: transformer.wte.weight
Shape: torch.Size([66156, 768])
Key: transformer.wpe.weight
Shape: torch.Size([1024, 768])
Key: transformer.h.0.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.0.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.0.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.0.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.0.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.0.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.0.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.0.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.0.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.0.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.0.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.0.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.1.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.1.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.1.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.1.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.1.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.1.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.1.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.1.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.1.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.1.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.1.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.1.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.2.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.2.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.2.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.2.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.2.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.2.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.2.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.2.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.2.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.2.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.2.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.2.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.3.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.3.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.3.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.3.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.3.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.3.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.3.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.3.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.3.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.3.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.3.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.3.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.4.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.4.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.4.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.4.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.4.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.4.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.4.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.4.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.4.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.4.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.4.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.4.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.5.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.5.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.5.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.5.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.5.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.5.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.5.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.5.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.5.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.5.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.5.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.5.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.6.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.6.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.6.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.6.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.6.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.6.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.6.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.6.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.6.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.6.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.6.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.6.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.7.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.7.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.7.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.7.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.7.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.7.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.7.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.7.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.7.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.7.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.7.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.7.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.8.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.8.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.8.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.8.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.8.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.8.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.8.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.8.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.8.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.8.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.8.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.8.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.9.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.9.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.9.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.9.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.9.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.9.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.9.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.9.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.9.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.9.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.9.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.9.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.10.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.10.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.10.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.10.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.10.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.10.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.10.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.10.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.10.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.10.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.10.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.10.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.11.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.11.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.11.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.11.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.11.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.11.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.11.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.11.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.11.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.11.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.11.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.11.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.ln_f.weight
Shape: torch.Size([768])
Key: transformer.ln_f.bias
Shape: torch.Size([768])
Key: lm_head.weight
Shape: torch.Size([66156, 768])

bezir commented

@BenjaminBossan The solution by @furkantrky seems to work. However, I would love some elaboration on it. Thanks for your help.

BenjaminBossan commented

I'm not 100% sure, but when weights are tied, they share the same underlying data, so merging into one could affect the other in an unintended way.
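
To illustrate the tying, a quick sketch on the plain GPT2 checkpoint (not your fine-tuned one):

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

# The input embedding and the LM head point at the same storage...
print(model.transformer.wte.weight.data_ptr() == model.lm_head.weight.data_ptr())  # True

# ...so an in-place change to one is visible through the other.
with torch.no_grad():
    model.transformer.wte.weight[0, 0] += 1.0
print(model.lm_head.weight[0, 0])  # reflects the change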

Regarding your config, your modules_to_save looks odd. Those are modules that are supposed to be fully fine-tuned. E.g. if you have an LLM with a classification head added on top (which is initialized randomly), this head should be added to modules_to_save. Similarly, you could add the embeddings if you add new tokens, since those are also initialized randomly. However, it makes no sense to include layers targeted with LoRA as well.
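
For illustration only, a config along those lines could look like this for GPT2 (r, lora_alpha and the exact module names are just example values, not a recommendation for your use case):

from peft import LoraConfig

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],   # layers adapted with LoRA
    modules_to_save=["wte"],     # fully fine-tuned, e.g. the resized embedding
)
# Note: no module appears in both target_modules and modules_to_save.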

bezir commented

@BenjaminBossan Thank you. I could not find enough information about modules_to_save and target_modules. Let's say I use LoRA to fine-tune a model, but I also want to update the embedding layer since I extended it. What should be included in modules_to_save and target_modules, and why? I know it depends on the model and the use case, but I would appreciate any elaboration or documentation on this.

BenjaminBossan commented

Yes, we can probably do a better job of explaining what this setting does, as the name is not fully self-explanatory. We have a bit of documentation on modules_to_save here.

For your concrete example: When you extend the embedding layer, a few extra vectors are initialized randomly and concatenated to the pretrained embeddings. These extra vectors must be trained. When you add the embedding layer to modules_to_save, the whole embedding layer will be fine-tuned. This requires more memory but will probably work best. You could instead try adding LoRA on top of the embedding layer by listing it in target_modules and see if this works well enough. It's hard to say in general, but it can work. What you should not do is add a layer to both modules_to_save and target_modules at the same time.

Note that when you modify the embedding layer, PEFT tries to detect that automatically and to ensure that it is saved together with the other adapter weights when calling save_pretrained. If you cannot or don't want to rely on auto-detection, pass save_embedding_layers=True when calling save_pretrained.
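
For example (a sketch; model is assumed to be your PeftModel and the path a placeholder):

# Explicitly include the (resized) embedding weights in the adapter checkpoint
# instead of relying on auto-detection.
model.save_pretrained("path/to/checkpoint", save_embedding_layers=True)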

bezir commented

Awesome, you're great! Thank you again.