Saved weights differ from the original model
bezir opened this issue · 14 comments
System Info
transformers 4.40.1
peft 0.10.0
Who can help?
@pacman100 @younesbelkada @BenjaminBossan
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder
- My own task or dataset (give details below)
Reproduction
I fine-tuned a GPT2 model using SFTTrainer, with an extended vocabulary. I merge the base model and the trained adapter with the code below.
peft_model_path = "checkpoint"
tokenizer = AutoTokenizer.from_pretrained(peft_model_path, device_map="auto")
base_model_path = "openai-community/gpt2"
base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")
base_model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(base_model, peft_model_path, device_map="auto")
merged_model = model.merge_and_unload()
I test this model and the results are okay. Then I want to save this merged_model using the code below.
tokenizer.save_pretrained(save_path)
merged_model.save_pretrained(save_path)
Lastly, I open the saved model with the code below.
tokenizer = AutoTokenizer.from_pretrained(save_path)
model = AutoModelForCausalLM.from_pretrained(save_path)
The model that I load from the save_path does not work well. It repeats the same token or gives random tokens from the base vocabulary.
Model before Save
GPT2LMHeadModel(
(transformer): GPT2Model(
(wte): Embedding(66156, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=66156, bias=False)
)
Loaded Model:
GPT2LMHeadModel(
(transformer): GPT2Model(
(wte): Embedding(66156, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=66156, bias=False)
)
Now let's look at the weights.
Model Before Save
OrderedDict([('transformer.wte.weight',
tensor([[-0.2500, 0.2324, 0.0162, ..., 0.0013, -0.4432, 0.2431],
[ 0.3332, -0.1894, -0.2949, ..., 0.2883, -0.0411, 0.3148],
[-0.2403, -0.1975, 0.4091, ..., -0.3482, 0.5244, 0.1759],
...,
[ 0.3374, -0.1371, -0.2627, ..., -0.6586, -0.5067, -0.0226],
[-0.1031, 0.1453, -0.9022, ..., -0.3682, 0.4504, 0.3242],
[-0.5442, -0.6574, -0.0881, ..., -0.2370, -0.3048, 0.7317]],
device='cuda:0')),
('transformer.wpe.weight',
tensor([[-1.2274e-01, -1.5239e-01, 1.9312e-01, ..., -2.8421e-03,
-5.4589e-02, 3.1110e-02],
[-1.1096e-02, -1.4371e-01, -5.5172e-02, ..., 1.8058e-01,
4.9604e-02, 4.3034e-02],
[ 6.0074e-02, -2.2665e-01, 2.2515e-01, ..., -1.2376e-02,
1.3002e-01, -1.2887e-02],
...,
[ 3.5125e-01, -1.1077e+00, 1.3558e-01, ..., -2.0376e-01,
-3.5680e-01, 3.1987e-01],
[ 4.0268e-01, -4.6439e-01, -8.0140e-04, ..., 1.4744e-01,
1.2033e-01, -8.1738e-02],
[ 2.6610e-04, 3.0272e-03, -1.7086e-03, ..., -4.6506e-03,
-2.3541e-03, -5.7855e-03]], device='cuda:0')),
('transformer.h.0.ln_1.weight',
tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467, 0.1865, 0.2143,
0.1956, 0.2118, 0.2153, 0.1882, 0.2074, 0.1871, 0.2040, 0.2044, 0.1900,
0.1952, 0.0475, 0.1909, 0.2115, 0.1971, 0.2202, 0.1998, 0.2108, 0.2303,
...
0.1662, 0.1982, 0.1582, 0.1935, 0.2182, 0.2067, 0.1855, 0.1778, 0.1900,
0.2124, 0.1215, 0.2092, 0.1929, 0.2434, 0.1936, 0.1948, 0.0622, 0.1852,
0.1868, 0.2035, 0.2310, 0.1794, 0.1655, 0.1756, 0.2074, 0.2194, 0.2152,
0.0502, 0.2294, 0.1950, 0.2149, 0.2024, 0.1727, 0.0657, 0.1919, 0.1847,
0.1900, 0.1825, 0.1898], device='cuda:1')), ...]
Loaded Model
OrderedDict([('transformer.wte.weight',
tensor([[-0.0952, -0.0785, 0.0155, ..., -0.1458, -0.0334, 0.0052],
[ 0.0234, -0.0811, 0.0049, ..., 0.1172, -0.0847, 0.0343],
[-0.1060, 0.0711, 0.1621, ..., -0.0243, -0.1103, -0.0732],
...,
[-0.0358, 0.0035, 0.0351, ..., -0.0633, -0.0200, -0.0084],
[-0.1555, 0.0488, 0.0125, ..., -0.0582, 0.0440, -0.1661],
[-0.0095, 0.1273, -0.0158, ..., 0.0115, -0.1641, -0.0303]],
device='cuda:0')),
('transformer.wpe.weight',
tensor([[-1.2274e-01, -1.5239e-01, 1.9312e-01, ..., -2.8421e-03,
-5.4589e-02, 3.1110e-02],
[-1.1096e-02, -1.4371e-01, -5.5172e-02, ..., 1.8058e-01,
4.9604e-02, 4.3034e-02],
[ 6.0074e-02, -2.2665e-01, 2.2515e-01, ..., -1.2376e-02,
1.3002e-01, -1.2887e-02],
...,
[ 3.5125e-01, -1.1077e+00, 1.3558e-01, ..., -2.0376e-01,
-3.5680e-01, 3.1987e-01],
[ 4.0268e-01, -4.6439e-01, -8.0140e-04, ..., 1.4744e-01,
1.2033e-01, -8.1738e-02],
[ 2.6610e-04, 3.0272e-03, -1.7086e-03, ..., -4.6506e-03,
-2.3541e-03, -5.7855e-03]], device='cuda:0')),
('transformer.h.0.ln_1.weight',
tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467, 0.1865, 0.2143,
0.1956, 0.2118, 0.2153, 0.1882, 0.2074, 0.1871, 0.2040, 0.2044, 0.1900,
0.1952, 0.0475, 0.1909, 0.2115, 0.1971, 0.2202, 0.1998, 0.2108, 0.2303,
...
0.1662, 0.1982, 0.1582, 0.1935, 0.2182, 0.2067, 0.1855, 0.1778, 0.1900,
0.2124, 0.1215, 0.2092, 0.1929, 0.2434, 0.1936, 0.1948, 0.0622, 0.1852,
0.1868, 0.2035, 0.2310, 0.1794, 0.1655, 0.1756, 0.2074, 0.2194, 0.2152,
0.0502, 0.2294, 0.1950, 0.2149, 0.2024, 0.1727, 0.0657, 0.1919, 0.1847,
0.1900, 0.1825, 0.1898], device='cuda:0')), ...]
I only call two functions, save_pretrained and then from_pretrained, so why are the weights different? I tried changing the weights after loading the model, and it started to work fine again. Then I tried to save that model and hit the same problem: the reloaded model differs from the one I saved.
Expected behavior
The model weights are supposed to be the same.
Another way to look at it:
torch.save(merged_model.state_dict(), "save_path/pytorch_model.bin")
state_dict = torch.load("save_path/pytorch_model.bin")  # raw state dict, kept under its own name
model = GPT2LMHeadModel.from_pretrained("save_path")
I get two different weight tensors from these two loading methods.
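A comparison like the one above is easier to read when it lists exactly which keys differ instead of printing whole tensors. The following is a minimal sketch in plain PyTorch; `diff_state_dicts` is a hypothetical helper (not part of transformers or PEFT), and the two toy dictionaries merely stand in for `merged_model.state_dict()` and the reloaded model's `state_dict()`.

```python
import torch


def diff_state_dicts(a, b, atol=0.0):
    """Return sorted keys that are missing on one side, differ in shape, or differ in value."""
    differing = []
    for key in sorted(set(a) | set(b)):
        if key not in a or key not in b:
            differing.append(key)
            continue
        # Move to CPU so tensors living on different devices can be compared.
        ta, tb = a[key].detach().cpu(), b[key].detach().cpu()
        if ta.shape != tb.shape:
            differing.append(key)
        elif not torch.allclose(ta.float(), tb.float(), atol=atol, rtol=0.0):
            differing.append(key)
    return differing


# Toy stand-ins for the state dicts before saving and after loading.
before = {"wte.weight": torch.ones(3, 2), "wpe.weight": torch.zeros(4, 2)}
after = {"wte.weight": torch.ones(3, 2) * 0.5, "wpe.weight": torch.zeros(4, 2)}

changed = diff_state_dicts(before, after)
print(changed)
```

With real models, a result that contains only the embedding key would point at the embedding layer specifically rather than at the save and load round trip as a whole.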
I tried to reproduce your issue but was not successful. Since your provided code was not fully complete, this is what I was working with based on your example:
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, PeftModel
torch.manual_seed(0)
base_model_path = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", device_map="auto")
tokenizer.add_special_tokens(special_tokens_dict = {"cls_token": "<CLS>"})
lora_config = LoraConfig(init_lora_weights=False)
# first let's create a LoRA adapter that we can use
base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")
base_model.resize_token_embeddings(len(tokenizer))
model = get_peft_model(base_model, lora_config)
print("show some weights BEFORE MERGING")
print("wte")
print(model.base_model.model.transformer.wte.weight[0, :7])
print("wpe")
print(model.base_model.model.transformer.wpe.weight[0, :7])
print("ln_1")
print(model.base_model.model.transformer.h[0].ln_1.weight[:7])
print("c_attn 0")
print(model.base_model.model.transformer.h[0].attn.c_attn.base_layer.weight[0, :7])
merged_model = model.merge_and_unload()
del base_model, model
print("show some weights AFTER MERGING")
print("wte")
print(merged_model.transformer.wte.weight[0, :7])
print("wpe")
print(merged_model.transformer.wpe.weight[0, :7])
print("ln_1")
print(merged_model.transformer.h[0].ln_1.weight[:7])
print("c_attn 0 (note: these weights should change because we apply LoRA (with init_lora_weights=False))")
print(merged_model.transformer.h[0].attn.c_attn.weight[0, :7])
merged_model.save_pretrained("/tmp/issue-1689-merged")
del merged_model
loaded = AutoModelForCausalLM.from_pretrained("/tmp/issue-1689-merged")
print("show some weights AFTER LOADING")
print("wte")
print(loaded.transformer.wte.weight[0, :7])
print("wpe")
print(loaded.transformer.wpe.weight[0, :7])
print("ln_1")
print(loaded.transformer.h[0].ln_1.weight[:7])
print("c_attn 0")
print(loaded.transformer.h[0].attn.c_attn.weight[0, :7])
The outputs for me are:
show some weights BEFORE MERGING
wte
tensor([-0.1101, -0.0393, 0.0331, 0.1338, -0.0485, -0.0789, -0.2398],
device='cuda:0')
wpe
tensor([-0.0188, -0.1974, 0.0040, 0.0113, 0.0638, -0.1050, 0.0369],
device='cuda:0')
ln_1
tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467],
device='cuda:0')
c_attn 0
tensor([-0.4738, -0.2614, -0.0978, -0.3499, 0.2243, -0.0429, 0.4187],
device='cuda:0')
show some weights AFTER MERGING
wte
tensor([-0.1101, -0.0393, 0.0331, 0.1338, -0.0485, -0.0789, -0.2398],
device='cuda:0')
wpe
tensor([-0.0188, -0.1974, 0.0040, 0.0113, 0.0638, -0.1050, 0.0369],
device='cuda:0')
ln_1
tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467],
device='cuda:0')
c_attn 0 (note: these weights should change because we apply LoRA (with init_lora_weights=False))
tensor([-0.4912, -0.2313, -0.0803, -0.3547, 0.1991, -0.0385, 0.4098],
device='cuda:0')
show some weights AFTER LOADING
wte
tensor([-0.1101, -0.0393, 0.0331, 0.1338, -0.0485, -0.0789, -0.2398],
grad_fn=<SliceBackward0>)
wpe
tensor([-0.0188, -0.1974, 0.0040, 0.0113, 0.0638, -0.1050, 0.0369],
grad_fn=<SliceBackward0>)
ln_1
tensor([0.2232, 0.1820, 0.1534, 0.1917, 0.2036, 0.1948, 0.1467],
grad_fn=<SliceBackward0>)
c_attn 0
tensor([-0.4912, -0.2313, -0.0803, -0.3547, 0.1991, -0.0385, 0.4098],
grad_fn=<SliceBackward0>)
Does that make sense? Any idea where your code differs?
Hello,
Thanks for the answer. The first problem is that Method 1 gives good results when generating an answer, whereas Method 2 ends up repeating tokens and emitting random ones.
base_model_path = "openai-community/gpt2"
base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")
lora_config = LoraConfig(init_lora_weights=False)
base_model.resize_token_embeddings(len(tokenizer))
# Method 1
#model = PeftModel.from_pretrained(base_model, peft_model_path, lora_config=lora_config, device_map="auto")
#merged_model = model.merge_and_unload()
# Method 2
model = get_peft_model(base_model, lora_config)
merged_model = model.merge_and_unload()
When I execute your code I get exactly the same result, so I believe it is not caused by the environment. I updated the code to use Method 1. Here are the results.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, PeftModel
torch.manual_seed(0)
base_model_path = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(peft_model_path, device_map="auto") # CHANGE, loading my peft_model tokenizer, below image
#tokenizer.add_special_tokens(special_tokens_dict = {"cls_token": "<CLS>"}) CHANGE, SIZE MISMATCH
lora_config = LoraConfig(init_lora_weights=False)
base_model = GPT2LMHeadModel.from_pretrained(base_model_path, device_map="auto")
base_model.resize_token_embeddings(len(tokenizer)) # NEW SIZE
merged_model = PeftModel.from_pretrained(base_model, peft_model_path, lora_config=lora_config, device_map="auto") # CHANGE
print("show some weights BEFORE MERGING")
print("wte")
print(base_model.transformer.wte.weight[0, :7])
print("wpe")
print(base_model.transformer.wpe.weight[0, :7])
print("ln_1")
print(base_model.transformer.h[0].ln_1.weight[:7])
print("c_attn 0 (note: these weights should change because we apply LoRA (with init_lora_weights=False))")
print(base_model.transformer.h[0].attn.c_attn.weight[0, :7])
merged_model = merged_model.merge_and_unload()
del base_model  # "model" is never defined in this script, so deleting it would raise NameError
print("show some weights AFTER MERGING")
print("wte")
print(merged_model.transformer.wte.weight[0, :7])
print("wpe")
print(merged_model.transformer.wpe.weight[0, :7])
print("ln_1")
print(merged_model.transformer.h[0].ln_1.weight[:7])
print("c_attn 0 (note: these weights should change because we apply LoRA (with init_lora_weights=False))")
print(merged_model.transformer.h[0].attn.c_attn.weight[0, :7])
merged_model.save_pretrained("/tmp/issue-1689-merged")
del merged_model
loaded = AutoModelForCausalLM.from_pretrained("/tmp/issue-1689-merged")
print("show some weights AFTER LOADING")
print("wte")
print(loaded.transformer.wte.weight[0, :7])
print("wpe")
print(loaded.transformer.wpe.weight[0, :7])
print("ln_1")
print(loaded.transformer.h[0].ln_1.weight[:7])
print("c_attn 0")
print(loaded.transformer.h[0].attn.c_attn.weight[0, :7])
model = get_peft_model(base_model, lora_config)
merged_model = model.merge_and_unload()
Note that when you call get_peft_model, you get a fresh new PEFT model, not something based on any learned adapter. Therefore, it's not surprising that you don't get any good results.
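To make the point about a fresh adapter concrete, here is a minimal sketch of the LoRA merge arithmetic in plain PyTorch. It is not PEFT's implementation: `merge`, `W`, `A`, and `B` are illustrative names, and the scaling factor is fixed to 1 for simplicity. It assumes the usual LoRA convention that the default initialization sets `B` to zeros, while `init_lora_weights=False` makes both factors random.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r = 8, 6, 2
scaling = 1.0  # stands in for lora_alpha / r

W = torch.randn(d_out, d_in)  # pretend this is a pretrained weight


def merge(W, A, B, scaling):
    """Fold the low-rank update B @ A into the base weight."""
    return W + scaling * (B @ A)


A = torch.randn(r, d_in)

# Default-style initialization: B is zero, so merging an untrained adapter is a no-op.
B_default = torch.zeros(d_out, r)
merged_default = merge(W, A, B_default, scaling)

# Random initialization (as with init_lora_weights=False): merging perturbs W randomly.
B_random = torch.randn(d_out, r)
merged_random = merge(W, A, B_random, scaling)

unchanged = torch.equal(merged_default, W)
perturbed = not torch.allclose(merged_random, W)
print(unchanged, perturbed)
```

Neither case carries anything learned during fine-tuning; only an adapter loaded from a trained checkpoint does.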
When I execute your code I get the exact same result so I believe it is not because of environment. I updated the code based on the method 1. Here is the results.
I cannot really interpret the results, because your script requires peft_model_path, which I guess points to your custom adapter, which I don't have. Which weights will change after merging depends on the settings of that adapter, e.g. if the embedding was adapted or not.
But still the question remains.
model = PeftModel.from_pretrained(base_model, peft_model_path, lora_config=lora_config, device_map="auto")
merged_model = model.merge_and_unload()
The weights of the merged model here are not the same after I save it and load it back. What is wrong? Is there anything I can do so that you can reproduce the problem?
The weights after merging and after loading should be the same, but as you can see above they are different. Is there something wrong with the script or the adapter config?
As I said, I can't help you diagnose your issue without knowing what's in your adapter. Can you share it, together with the adapter_config.json? If you can't share the weights, could you at least share the config and show the keys of the state_dict and the shape of its values?
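Listing the keys and shapes of a state_dict takes only a few lines. This is a small sketch in plain PyTorch; `describe_state_dict` is a hypothetical helper name, and the toy `nn.Sequential` model only stands in for the real adapter or merged model.

```python
import torch
from torch import nn


def describe_state_dict(state_dict):
    """Return (key, shape) pairs in the order the state_dict stores them."""
    return [(key, tuple(tensor.shape)) for key, tensor in state_dict.items()]


# Toy model standing in for the real one.
toy = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 10, bias=False))

summary = describe_state_dict(toy.state_dict())
for key, shape in summary:
    print(f"Key: {key}")
    print(f"Shape: {shape}")
```

This output contains no weight values, so it can be shared even when the weights themselves cannot.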
I believe the problem here is that the lm_head and wte weights are tied in the GPT2 model. I have a solution for this; could you evaluate it, @BenjaminBossan? Here is my solution:
base_model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, device_map="cuda", trust_remote_code=True, tie_word_embeddings=False)
assert id(base_model.transformer.wte.weight) != id(base_model.lm_head.weight)
base_model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(base_model, peft_model_path, device_map="cuda")
model = model.merge_and_unload()
This unties the weights when the base model and adapter are first loaded.
The keys and shapes of the state_dict are:
Key: transformer.wte.weight
Shape: torch.Size([66156, 768])
Key: transformer.wpe.weight
Shape: torch.Size([1024, 768])
Key: transformer.h.0.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.0.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.0.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.0.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.0.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.0.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.0.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.0.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.0.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.0.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.0.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.0.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.1.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.1.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.1.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.1.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.1.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.1.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.1.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.1.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.1.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.1.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.1.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.1.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.2.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.2.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.2.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.2.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.2.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.2.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.2.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.2.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.2.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.2.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.2.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.2.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.3.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.3.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.3.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.3.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.3.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.3.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.3.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.3.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.3.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.3.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.3.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.3.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.4.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.4.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.4.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.4.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.4.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.4.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.4.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.4.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.4.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.4.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.4.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.4.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.5.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.5.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.5.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.5.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.5.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.5.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.5.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.5.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.5.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.5.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.5.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.5.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.6.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.6.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.6.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.6.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.6.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.6.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.6.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.6.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.6.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.6.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.6.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.6.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.7.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.7.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.7.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.7.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.7.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.7.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.7.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.7.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.7.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.7.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.7.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.7.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.8.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.8.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.8.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.8.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.8.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.8.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.8.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.8.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.8.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.8.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.8.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.8.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.9.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.9.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.9.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.9.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.9.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.9.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.9.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.9.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.9.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.9.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.9.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.9.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.10.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.10.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.10.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.10.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.10.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.10.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.10.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.10.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.10.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.10.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.10.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.10.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.11.ln_1.weight
Shape: torch.Size([768])
Key: transformer.h.11.ln_1.bias
Shape: torch.Size([768])
Key: transformer.h.11.attn.c_attn.weight
Shape: torch.Size([768, 2304])
Key: transformer.h.11.attn.c_attn.bias
Shape: torch.Size([2304])
Key: transformer.h.11.attn.c_proj.weight
Shape: torch.Size([768, 768])
Key: transformer.h.11.attn.c_proj.bias
Shape: torch.Size([768])
Key: transformer.h.11.ln_2.weight
Shape: torch.Size([768])
Key: transformer.h.11.ln_2.bias
Shape: torch.Size([768])
Key: transformer.h.11.mlp.c_fc.weight
Shape: torch.Size([768, 3072])
Key: transformer.h.11.mlp.c_fc.bias
Shape: torch.Size([3072])
Key: transformer.h.11.mlp.c_proj.weight
Shape: torch.Size([3072, 768])
Key: transformer.h.11.mlp.c_proj.bias
Shape: torch.Size([768])
Key: transformer.ln_f.weight
Shape: torch.Size([768])
Key: transformer.ln_f.bias
Shape: torch.Size([768])
Key: lm_head.weight
Shape: torch.Size([66156, 768])
@BenjaminBossan The solution by @furkantrky seems to work. However, I would appreciate any elaboration on it. Thanks for your help.
I'm not 100% sure, but when weights are tied, they share the same underlying data, so merging into one could affect the other in an unintended way.
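The sharing described here can be reproduced with two small layers in plain PyTorch. This is only a toy sketch, not GPT2 and not what PEFT does internally: `wte` and `lm_head` are stand-ins named after the layers in this thread, and the in-place `add_` merely imitates an update that is applied through one of the two names.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab, dim = 10, 4

wte = nn.Embedding(vocab, dim)
lm_head = nn.Linear(dim, vocab, bias=False)
lm_head.weight = wte.weight  # tie: both modules now hold the same Parameter

tied = lm_head.weight.data_ptr() == wte.weight.data_ptr()

# An in-place update through lm_head also changes the embedding.
before = wte.weight.detach().clone()
with torch.no_grad():
    lm_head.weight.add_(1.0)
changed_via_tie = not torch.equal(wte.weight, before)

# Untie by giving lm_head its own copy of the data.
lm_head.weight = nn.Parameter(wte.weight.detach().clone())
snapshot = wte.weight.detach().clone()
with torch.no_grad():
    lm_head.weight.add_(1.0)
independent = torch.equal(wte.weight, snapshot)

print(tied, changed_via_tie, independent)
```

Once the two layers hold separate tensors, an update to one no longer leaks into the other, which is consistent with the untying workaround posted above.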
Regarding your config, your modules_to_save looks odd. Those are modules that are supposed to be fully fine-tuned. E.g. if you have an LLM with a classification head added on top (which is initialized randomly), this head should be added to modules_to_save. Similarly, you could add the embeddings if you add new tokens, since those are also initialized randomly. However, it makes no sense to include layers targeted with LoRA as well.
@BenjaminBossan Thank you. I could not find enough information about modules_to_save and target_modules. Let's say I use LoRA to fine-tune a model but also want to update the embedding layer, since I extended it. What should be included in modules_to_save and target_modules, and why? I know it depends on the model and the case, but I would appreciate any elaboration or documentation on this.
Yes, we can probably do a better job explaining what this setting does, as the name is not fully self-explanatory. We have a bit of documentation on modules_to_save here.
For your concrete example: When you extend the embedding layer, a few extra vectors will be initialized randomly and concatenated to the pretrained embeddings. These extra vectors must be trained. When you add the embedding layer to modules_to_save, the whole embedding layer will be fine-tuned. This requires more memory but will probably work the best. You could instead also try to add LoRA on top of the embedding layer by listing it in target_modules and see if this works well enough. It's hard to say in general but it can work. What you should not do is to add a layer to both modules_to_save and target_modules at the same time.
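The two options, and the combination to avoid, can be written down as plain keyword dictionaries of the kind one would pass to LoraConfig. The sketch below is pure Python and does not import PEFT: `overlapping_modules` is a hypothetical helper that only compares exact names, whereas PEFT's own matching of module names is more flexible, so treat it as a rough sanity check. The module names `c_attn` and `wte` are taken from the GPT2 printout earlier in this thread.

```python
def overlapping_modules(target_modules, modules_to_save):
    """Return names listed in both settings (exact-name comparison only)."""
    return sorted(set(target_modules) & set(modules_to_save or []))


# Option 1: fully fine-tune the resized embedding, LoRA on the attention projection.
full_embedding = {"target_modules": ["c_attn"], "modules_to_save": ["wte"]}

# Option 2: LoRA on both the embedding and the attention projection.
lora_embedding = {"target_modules": ["c_attn", "wte"], "modules_to_save": None}

# To avoid: the same layer appears in both settings.
conflicting = {"target_modules": ["c_attn", "wte"], "modules_to_save": ["wte"]}

for name, kwargs in [
    ("full_embedding", full_embedding),
    ("lora_embedding", lora_embedding),
    ("conflicting", conflicting),
]:
    print(name, overlapping_modules(**kwargs))
```

An empty list means no layer is claimed by both settings; any name that is printed should be removed from one of them.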
Note that when you modify the embedding layer, PEFT tries to detect that automatically and to ensure that it is saved together with the other adapter weights when calling save_pretrained. If you cannot or don't want to rely on auto-detection, pass save_embedding_layers=True when calling save_pretrained.
Awesome, you're great! Thank you again.