I got KeyError: 'state_dict' in step 3.
Closed this issue · 16 comments
After I did step 2 (Combine the autoencoder with the diffusion model), I created updated_ldm.ckpt.
Then I did step 3 (Finetune the latent diffusion model) using updated_ldm.ckpt.
My command was:
CUDA_VISIBLE_DEVICES=0 python main.py --base ldm/models/ldm/inpainting_big/config.yaml --resume updated_ldm.ckpt --stage 1 -t --gpus 0,
Is it correct to use updated_ldm.ckpt in step 3?
If that is correct, I still got an error.
After "Summoning checkpoint." it failed with:
File "main.py", line 764, in
trainer.fit(model, data)
File "/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 152, in load_model_state_dict
self.lightning_module.load_state_dict(checkpoint["state_dict"])
KeyError: 'state_dict'
How can I solve the error?
Thank you for your attention.
What keys are in your model?
Thank you for your attention.
I just used your code as-is and your 5 image pairs in Kvasir-SEG/images/. I did not change anything in your code or files.
So the keys are whatever keys your model has.
After I ran the command for step 3,
CUDA_VISIBLE_DEVICES=0 python main.py --base ldm/models/ldm/inpainting_big/config.yaml --resume updated_ldm.ckpt --stage 1 -t --gpus 0,
I got this error:
Global seed set to 23
Running on GPUs 0,
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 387.25 M params.
Keeping EMAs of 418.
making attention of type 'none' with 512 in_channels
Working with z of shape (1, 3, 64, 64) = 12288 dimensions.
making attention of type 'none' with 512 in_channels
Using first stage also as cond stage.
Monitoring val/loss as checkpoint metric.
Merged modelckpt-cfg:
{'target': 'pytorch_lightning.callbacks.ModelCheckpoint', 'params': {'dirpath': '/latent-diffusion-inpainting/logs/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': 'val/loss', 'save_top_k': 3}}
/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:446: UserWarning: Checkpoint directory /latent-diffusion-inpainting/logs/checkpoints exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
accumulate_grad_batches = 1
Setting learning rate to 1.00e-06 = 1 (accumulate_grad_batches) * 1 (num_gpus) * 1 (batchsize) * 1.00e-06 (base_lr)
training......
Restoring states from the checkpoint file at updated_ldm.ckpt
/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:101: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
rank_zero_warn(f"you defined a {step_name} but have no {loader_name}. Skipping {stage} loop")
Global seed set to 23
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
Summoning checkpoint.
Traceback (most recent call last):
File "main.py", line 764, in
trainer.fit(model, data)
File "/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
self._run(model)
File "/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 868, in _run
self.checkpoint_connector.restore_model()
File "/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 142, in restore_model
self.trainer.training_type_plugin.load_model_state_dict(self._loaded_checkpoint)
File "/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 152, in load_model_state_dict
self.lightning_module.load_state_dict(checkpoint["state_dict"])
KeyError: 'state_dict'
My training dataset was your 5 image pairs in Kvasir-SEG/images/.
And I used your code as-is; I did not change anything in your code or files.
So the keys are whatever keys your model has.
How can I check the keys for the model?
Thank you for your help.
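A minimal way to list a checkpoint's keys (just a sketch, assuming the file was written with torch.save; updated_ldm.ckpt is used here as an example path) is to load it on the CPU and print both the top-level entries and the parameter names:

import torch

# Load the checkpoint (path is an example).
ckpt = torch.load("updated_ldm.ckpt", map_location="cpu")

# A PyTorch Lightning checkpoint has top-level entries such as 'state_dict',
# 'epoch' and 'global_step'; a bare weights file has parameter names here instead.
print("top-level keys:", list(ckpt.keys()))

# Print the parameter names, whether or not they are wrapped in 'state_dict'.
state_dict = ckpt.get("state_dict", ckpt)
for name in state_dict.keys():
    print("key :", name)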
@nickyisadog When I checked the keys during step 2 (Combine the autoencoder with the diffusion model),
I got this printout:
key : first_stage_model.encoder.conv_in.weight
key : first_stage_model.encoder.conv_in.bias
key : first_stage_model.encoder.down.0.block.0.norm1.weight
key : first_stage_model.encoder.down.0.block.0.norm1.bias
key : first_stage_model.encoder.down.0.block.0.conv1.weight
key : first_stage_model.encoder.down.0.block.0.conv1.bias
key : first_stage_model.encoder.down.0.block.0.norm2.weight
key : first_stage_model.encoder.down.0.block.0.norm2.bias
key : first_stage_model.encoder.down.0.block.0.conv2.weight
key : first_stage_model.encoder.down.0.block.0.conv2.bias
key : first_stage_model.encoder.down.0.block.1.norm1.weight
key : first_stage_model.encoder.down.0.block.1.norm1.bias
key : first_stage_model.encoder.down.0.block.1.conv1.weight
key : first_stage_model.encoder.down.0.block.1.conv1.bias
key : first_stage_model.encoder.down.0.block.1.norm2.weight
key : first_stage_model.encoder.down.0.block.1.norm2.bias
key : first_stage_model.encoder.down.0.block.1.conv2.weight
key : first_stage_model.encoder.down.0.block.1.conv2.bias
key : first_stage_model.encoder.down.0.downsample.conv.weight
key : first_stage_model.encoder.down.0.downsample.conv.bias
key : first_stage_model.encoder.down.1.block.0.norm1.weight
key : first_stage_model.encoder.down.1.block.0.norm1.bias
key : first_stage_model.encoder.down.1.block.0.conv1.weight
key : first_stage_model.encoder.down.1.block.0.conv1.bias
key : first_stage_model.encoder.down.1.block.0.norm2.weight
key : first_stage_model.encoder.down.1.block.0.norm2.bias
key : first_stage_model.encoder.down.1.block.0.conv2.weight
key : first_stage_model.encoder.down.1.block.0.conv2.bias
key : first_stage_model.encoder.down.1.block.0.nin_shortcut.weight
key : first_stage_model.encoder.down.1.block.0.nin_shortcut.bias
key : first_stage_model.encoder.down.1.block.1.norm1.weight
key : first_stage_model.encoder.down.1.block.1.norm1.bias
key : first_stage_model.encoder.down.1.block.1.conv1.weight
key : first_stage_model.encoder.down.1.block.1.conv1.bias
key : first_stage_model.encoder.down.1.block.1.norm2.weight
key : first_stage_model.encoder.down.1.block.1.norm2.bias
key : first_stage_model.encoder.down.1.block.1.conv2.weight
key : first_stage_model.encoder.down.1.block.1.conv2.bias
key : first_stage_model.encoder.down.1.downsample.conv.weight
key : first_stage_model.encoder.down.1.downsample.conv.bias
key : first_stage_model.encoder.down.2.block.0.norm1.weight
key : first_stage_model.encoder.down.2.block.0.norm1.bias
key : first_stage_model.encoder.down.2.block.0.conv1.weight
key : first_stage_model.encoder.down.2.block.0.conv1.bias
key : first_stage_model.encoder.down.2.block.0.norm2.weight
key : first_stage_model.encoder.down.2.block.0.norm2.bias
key : first_stage_model.encoder.down.2.block.0.conv2.weight
key : first_stage_model.encoder.down.2.block.0.conv2.bias
....
key : cond_stage_model.decoder.up.1.block.2.conv2.bias
key : cond_stage_model.decoder.up.1.upsample.conv.weight
key : cond_stage_model.decoder.up.1.upsample.conv.bias
key : cond_stage_model.decoder.up.2.block.0.norm1.weight
key : cond_stage_model.decoder.up.2.block.0.norm1.bias
key : cond_stage_model.decoder.up.2.block.0.conv1.weight
key : cond_stage_model.decoder.up.2.block.0.conv1.bias
key : cond_stage_model.decoder.up.2.block.0.norm2.weight
key : cond_stage_model.decoder.up.2.block.0.norm2.bias
key : cond_stage_model.decoder.up.2.block.0.conv2.weight
key : cond_stage_model.decoder.up.2.block.0.conv2.bias
key : cond_stage_model.decoder.up.2.block.1.norm1.weight
key : cond_stage_model.decoder.up.2.block.1.norm1.bias
key : cond_stage_model.decoder.up.2.block.1.conv1.weight
key : cond_stage_model.decoder.up.2.block.1.conv1.bias
key : cond_stage_model.decoder.up.2.block.1.norm2.weight
key : cond_stage_model.decoder.up.2.block.1.norm2.bias
key : cond_stage_model.decoder.up.2.block.1.conv2.weight
key : cond_stage_model.decoder.up.2.block.1.conv2.bias
key : cond_stage_model.decoder.up.2.block.2.norm1.weight
key : cond_stage_model.decoder.up.2.block.2.norm1.bias
key : cond_stage_model.decoder.up.2.block.2.conv1.weight
key : cond_stage_model.decoder.up.2.block.2.conv1.bias
key : cond_stage_model.decoder.up.2.block.2.norm2.weight
key : cond_stage_model.decoder.up.2.block.2.norm2.bias
key : cond_stage_model.decoder.up.2.block.2.conv2.weight
key : cond_stage_model.decoder.up.2.block.2.conv2.bias
key : cond_stage_model.decoder.up.2.upsample.conv.weight
key : cond_stage_model.decoder.up.2.upsample.conv.bias
key : cond_stage_model.decoder.norm_out.weight
key : cond_stage_model.decoder.norm_out.bias
key : cond_stage_model.decoder.conv_out.weight
key : cond_stage_model.decoder.conv_out.bias
key : cond_stage_model.quantize.embedding.weight
key : cond_stage_model.quant_conv.weight
key : cond_stage_model.quant_conv.bias
key : cond_stage_model.post_quant_conv.weight
key : cond_stage_model.post_quant_conv.bias
May I know the size of your autoencoder after part 1?
@nickyisadog I compared my key list with the key list from your combine.ipynb. They were the same.
The size of my autoencoder after part 1 is 3.08 GB (3,311,335,440 bytes).
I used last.ckpt in latent-diffusion-inpainting\logs\checkpoints.
Hmmm... that's strange, because the autoencoder should only be about 600 MB; the latent diffusion model is 3 GB.
I will check my code again.
@nickyisadog Thank you for your help.
I cannot continue my process,
so I am waiting for your help.
Thank you very much.
@nickyisadog I had confused it with last.ckpt in the logs\checkpoints folder after step 3.
The size I reported was the model size after I ran step 3 and got the KeyError: 'state_dict' error.
That file had grown to 3.08 GB (3,311,335,440 bytes) even though training failed.
So I redid step 1 and checked the model size again: it was 697 MB (730,908,770 bytes).
I got this error too.
This error may be related to a newer PyTorch version.
For me, I load the pre-trained model by modifying the code just before the instantiate_from_config call in main.py at stage 3, around line 560, as the screenshot below shows.
After that, I encountered a new error, "RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)", which I solved by referring to this issue: ygtxr1997/CelebBasis#8.
After solving these two problems, you can start training!
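The screenshot itself is not visible in this thread, so here is a rough sketch of the idea only (the variable names and the checkpoint path below are assumptions, not the exact code from the screenshot): after main.py builds the model, load the checkpoint yourself, unwrap 'state_dict' if it is present, and copy the weights in with load_state_dict, so the trainer no longer has to find a 'state_dict' entry when restoring.

import torch

# Added right after main.py builds the model, e.g. after
#   model = instantiate_from_config(config.model)
ckpt = torch.load("updated_ldm.ckpt", map_location="cpu")
# Accept either a full Lightning checkpoint or a bare weights dict.
state_dict = ckpt.get("state_dict", ckpt)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"manually loaded weights: {len(missing)} missing / {len(unexpected)} unexpected keys")

If the weights are loaded this way, you would presumably start step 3 without --resume, so that the trainer does not try to restore from the same file again.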
This did not solve the KeyError: 'state_dict' problem for me. How did you guys get around this?
Hi, what is the size of the trained ldm model after step 3?
In reply to the "This did not solve the problem for me" comment above: I wonder which line of code the error appears at for you? You can check whether your main.py file has been fixed with the code I wrote in the screenshot above.
@Srutarshi Did you solve this issue? I also couldn't resolve it with the methods provided above, but I fixed it by checking the format of the model weights. For training you need more than just the model's state_dict, because we are using PyTorch Lightning: you also need additional information like 'epoch', 'global_step', and more. I suspect you might have used the model obtained from combine.ipynb. If so, you can fix it by saving the model as follows at the very end:
import torch

# ldm_model_path points to the original (Lightning) LDM checkpoint, which still
# carries the training metadata; `model` is the combined model built in combine.ipynb.
ldm_model = torch.load(ldm_model_path)

# Re-save the combined weights together with the Lightning bookkeeping entries,
# so that the trainer can find checkpoint['state_dict'] when resuming in step 3.
torch.save({
    'epoch': ldm_model['epoch'],
    'global_step': ldm_model['global_step'],
    'pytorch-lightning_version': ldm_model['pytorch-lightning_version'],
    'state_dict': model.state_dict(),
    'callbacks': ldm_model['callbacks'],
    'optimizer_states': ldm_model['optimizer_states'],
    'lr_schedulers': ldm_model['lr_schedulers'],
}, './updated_ldm.ckpt')
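With the checkpoint re-saved in this format, the step 3 command above (--resume updated_ldm.ckpt) should then find the 'state_dict' entry it expects.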