nickyisadog/latent-diffusion-inpainting

I got KeyError: 'state_dict' in step 3.

Closed this issue · 16 comments


After I did step 2 (Combine the autoencoder with the diffusion model), I created updated_ldm.ckpt.
Then I ran step 3 (Finetune Latent diffusion model) using updated_ldm.ckpt.
My command was
CUDA_VISIBLE_DEVICES=0 python main.py --base ldm/models/ldm/inpainting_big/config.yaml --resume updated_ldm.ckpt --stage 1 -t --gpus 0,

Is it correct to use updated_ldm.ckpt in step 3?
If so, I still get an error.
After "Summoning checkpoint." is printed:

File "main.py", line 764, in
trainer.fit(model, data)

File "/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 152, in load_model_state_dict
self.lightning_module.load_state_dict(checkpoint["state_dict"])
KeyError: 'state_dict'

How can I solve the error?

Thank you for your attention.

What keys are in your model?

@nickyisadog

Thank you for your attention.

I used exactly your code and your 5 image pairs in Kvasir-SEG/images/. I did not change anything in your code or files.
So the keys are whatever your model produces.

After I ran the command for step 3,

CUDA_VISIBLE_DEVICES=0 python main.py --base ldm/models/ldm/inpainting_big/config.yaml --resume updated_ldm.ckpt --stage 1 -t --gpus 0,

I got the following error:

Global seed set to 23
Running on GPUs 0,
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 387.25 M params.
Keeping EMAs of 418.
making attention of type 'none' with 512 in_channels
Working with z of shape (1, 3, 64, 64) = 12288 dimensions.
making attention of type 'none' with 512 in_channels
Using first stage also as cond stage.
Monitoring val/loss as checkpoint metric.
Merged modelckpt-cfg:
{'target': 'pytorch_lightning.callbacks.ModelCheckpoint', 'params': {'dirpath': '/latent-diffusion-inpainting/logs/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': 'val/loss', 'save_top_k': 3}}
/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:446: UserWarning: Checkpoint directory /latent-diffusion-inpainting/logs/checkpoints exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
accumulate_grad_batches = 1
Setting learning rate to 1.00e-06 = 1 (accumulate_grad_batches) * 1 (num_gpus) * 1 (batchsize) * 1.00e-06 (base_lr)
training......
Restoring states from the checkpoint file at updated_ldm.ckpt
/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:101: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
rank_zero_warn(f"you defined a {step_name} but have no {loader_name}. Skipping {stage} loop")
Global seed set to 23
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes

Summoning checkpoint.

Traceback (most recent call last):
File "main.py", line 764, in
trainer.fit(model, data)
File "/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
self._run(model)
File "/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 868, in _run
self.checkpoint_connector.restore_model()
File "/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 142, in restore_model
self.trainer.training_type_plugin.load_model_state_dict(self._loaded_checkpoint)
File "/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 152, in load_model_state_dict
self.lightning_module.load_state_dict(checkpoint["state_dict"])
KeyError: 'state_dict'

@nickyisadog

My training data set was your 5 image pairs in Kvasir-SEG/images/.
I used your code without any changes, so the keys are whatever your model produces.

How can I check the keys for the model?
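For reference, a minimal sketch of how the keys can be inspected (assuming torch.load on the checkpoint passed to --resume):

import torch

ckpt = torch.load("updated_ldm.ckpt", map_location="cpu")
# A full PyTorch Lightning checkpoint has top-level keys such as
# 'epoch', 'global_step', 'state_dict', 'optimizer_states', ...
print(list(ckpt.keys()))
# If 'state_dict' is present, these are the model weight keys:
if "state_dict" in ckpt:
    for k in ckpt["state_dict"]:
        print("key :", k)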

Thank you for your help.

@nickyisadog When I checked the keys during step 2 (Combine the autoencoder with the diffusion model),
I got this printout:

key : first_stage_model.encoder.conv_in.weight
key : first_stage_model.encoder.conv_in.bias
key : first_stage_model.encoder.down.0.block.0.norm1.weight
key : first_stage_model.encoder.down.0.block.0.norm1.bias
key : first_stage_model.encoder.down.0.block.0.conv1.weight
key : first_stage_model.encoder.down.0.block.0.conv1.bias
key : first_stage_model.encoder.down.0.block.0.norm2.weight
key : first_stage_model.encoder.down.0.block.0.norm2.bias
key : first_stage_model.encoder.down.0.block.0.conv2.weight
key : first_stage_model.encoder.down.0.block.0.conv2.bias
key : first_stage_model.encoder.down.0.block.1.norm1.weight
key : first_stage_model.encoder.down.0.block.1.norm1.bias
key : first_stage_model.encoder.down.0.block.1.conv1.weight
key : first_stage_model.encoder.down.0.block.1.conv1.bias
key : first_stage_model.encoder.down.0.block.1.norm2.weight
key : first_stage_model.encoder.down.0.block.1.norm2.bias
key : first_stage_model.encoder.down.0.block.1.conv2.weight
key : first_stage_model.encoder.down.0.block.1.conv2.bias
key : first_stage_model.encoder.down.0.downsample.conv.weight
key : first_stage_model.encoder.down.0.downsample.conv.bias
key : first_stage_model.encoder.down.1.block.0.norm1.weight
key : first_stage_model.encoder.down.1.block.0.norm1.bias
key : first_stage_model.encoder.down.1.block.0.conv1.weight
key : first_stage_model.encoder.down.1.block.0.conv1.bias
key : first_stage_model.encoder.down.1.block.0.norm2.weight
key : first_stage_model.encoder.down.1.block.0.norm2.bias
key : first_stage_model.encoder.down.1.block.0.conv2.weight
key : first_stage_model.encoder.down.1.block.0.conv2.bias
key : first_stage_model.encoder.down.1.block.0.nin_shortcut.weight
key : first_stage_model.encoder.down.1.block.0.nin_shortcut.bias
key : first_stage_model.encoder.down.1.block.1.norm1.weight
key : first_stage_model.encoder.down.1.block.1.norm1.bias
key : first_stage_model.encoder.down.1.block.1.conv1.weight
key : first_stage_model.encoder.down.1.block.1.conv1.bias
key : first_stage_model.encoder.down.1.block.1.norm2.weight
key : first_stage_model.encoder.down.1.block.1.norm2.bias
key : first_stage_model.encoder.down.1.block.1.conv2.weight
key : first_stage_model.encoder.down.1.block.1.conv2.bias
key : first_stage_model.encoder.down.1.downsample.conv.weight
key : first_stage_model.encoder.down.1.downsample.conv.bias
key : first_stage_model.encoder.down.2.block.0.norm1.weight
key : first_stage_model.encoder.down.2.block.0.norm1.bias
key : first_stage_model.encoder.down.2.block.0.conv1.weight
key : first_stage_model.encoder.down.2.block.0.conv1.bias
key : first_stage_model.encoder.down.2.block.0.norm2.weight
key : first_stage_model.encoder.down.2.block.0.norm2.bias
key : first_stage_model.encoder.down.2.block.0.conv2.weight
key : first_stage_model.encoder.down.2.block.0.conv2.bias

....

key : cond_stage_model.decoder.up.1.block.2.conv2.bias
key : cond_stage_model.decoder.up.1.upsample.conv.weight
key : cond_stage_model.decoder.up.1.upsample.conv.bias
key : cond_stage_model.decoder.up.2.block.0.norm1.weight
key : cond_stage_model.decoder.up.2.block.0.norm1.bias
key : cond_stage_model.decoder.up.2.block.0.conv1.weight
key : cond_stage_model.decoder.up.2.block.0.conv1.bias
key : cond_stage_model.decoder.up.2.block.0.norm2.weight
key : cond_stage_model.decoder.up.2.block.0.norm2.bias
key : cond_stage_model.decoder.up.2.block.0.conv2.weight
key : cond_stage_model.decoder.up.2.block.0.conv2.bias
key : cond_stage_model.decoder.up.2.block.1.norm1.weight
key : cond_stage_model.decoder.up.2.block.1.norm1.bias
key : cond_stage_model.decoder.up.2.block.1.conv1.weight
key : cond_stage_model.decoder.up.2.block.1.conv1.bias
key : cond_stage_model.decoder.up.2.block.1.norm2.weight
key : cond_stage_model.decoder.up.2.block.1.norm2.bias
key : cond_stage_model.decoder.up.2.block.1.conv2.weight
key : cond_stage_model.decoder.up.2.block.1.conv2.bias
key : cond_stage_model.decoder.up.2.block.2.norm1.weight
key : cond_stage_model.decoder.up.2.block.2.norm1.bias
key : cond_stage_model.decoder.up.2.block.2.conv1.weight
key : cond_stage_model.decoder.up.2.block.2.conv1.bias
key : cond_stage_model.decoder.up.2.block.2.norm2.weight
key : cond_stage_model.decoder.up.2.block.2.norm2.bias
key : cond_stage_model.decoder.up.2.block.2.conv2.weight
key : cond_stage_model.decoder.up.2.block.2.conv2.bias
key : cond_stage_model.decoder.up.2.upsample.conv.weight
key : cond_stage_model.decoder.up.2.upsample.conv.bias
key : cond_stage_model.decoder.norm_out.weight
key : cond_stage_model.decoder.norm_out.bias
key : cond_stage_model.decoder.conv_out.weight
key : cond_stage_model.decoder.conv_out.bias
key : cond_stage_model.quantize.embedding.weight
key : cond_stage_model.quant_conv.weight
key : cond_stage_model.quant_conv.bias
key : cond_stage_model.post_quant_conv.weight
key : cond_stage_model.post_quant_conv.bias

May I know the size of your autoencoder after part 1?

@nickyisadog I compared my keys with the keys from your combine.ipynb; they were the same.
The size of my autoencoder after part 1 is 3.08 GB (3,311,335,440 bytes).
I used last.ckpt in latent-diffusion-inpainting\logs\checkpoints.

> @nickyisadog I compared my keys with the keys from your combine.ipynb; they were the same. The size of my autoencoder after part 1 is 3.08 GB (3,311,335,440 bytes). I used last.ckpt in latent-diffusion-inpainting\logs\checkpoints.

Hmmm... that's strange, because the autoencoder should be only about 600 MB; the latent diffusion model is 3 GB.
I will check my code again.

@nickyisadog Thank you for your help.
I cannot continue my process, so I am waiting for your help.
Thank you very much.

@nickyisadog I was confused by last.ckpt in the logs\checkpoints folder after step 3.
The model size I reported was measured after I ran step 3 and got the KeyError: 'state_dict' error.
The file had grown to 3.08 GB (3,311,335,440 bytes) even though training failed.

So I ran step 1 again and checked the size of the model: it was 697 MB (730,908,770 bytes).
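For reference, a rough sketch of how the size difference can be broken down per top-level module (assuming the checkpoints are standard Lightning files; the path below is a placeholder):

import os
import torch
from collections import defaultdict

path = "logs/checkpoints/last.ckpt"  # placeholder path
print("file size (GB):", os.path.getsize(path) / 1e9)

ckpt = torch.load(path, map_location="cpu")
sd = ckpt.get("state_dict", ckpt)

# Sum parameter bytes per top-level module, e.g. 'model' (diffusion UNet),
# 'first_stage_model' (autoencoder), 'cond_stage_model'.
sizes = defaultdict(int)
for name, tensor in sd.items():
    sizes[name.split(".")[0]] += tensor.numel() * tensor.element_size()
for prefix, nbytes in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{prefix}: {nbytes / 1e9:.2f} GB")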

I get this error too.

This error may be caused by a newer PyTorch version.

In my case, I load the pre-trained model by modifying the code just before the "instantiate_from_config" call in main.py for stage 3, around line 560, as the screenshot below shows.

[screenshot: modified code in main.py]

After that, I encountered a new error, "RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)", which I solved by referring to [ygtxr1997/CelebBasis#8](https://github.com/ygtxr1997/CelebBasis/issues/8).

After solving these two problems, you can start training!
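The screenshot is not reproduced here, so the exact change is unknown; a hedged sketch of the general idea (loading the weights into the model directly, so Lightning does not need to find a 'state_dict' key when restoring) might look roughly like this near the instantiate_from_config call in main.py. The opt.resume_from_checkpoint name and strict=False are assumptions, not the code from the screenshot:

import torch
from ldm.util import instantiate_from_config  # as imported in main.py

model = instantiate_from_config(config.model)

# Hypothetical manual restore: accept either a full Lightning checkpoint
# (with a 'state_dict' key) or a bare state_dict.
ckpt = torch.load(opt.resume_from_checkpoint, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")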

> This error may be caused by a newer PyTorch version.
>
> In my case, I load the pre-trained model by modifying the code just before the "instantiate_from_config" call in main.py for stage 3, around line 560, as the screenshot shows.
>
> [screenshot] After that, I encountered a new error, "RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)", which I solved by referring to [ygtxr1997/CelebBasis#8](https://github.com/ygtxr1997/CelebBasis/issues/8).
>
> After solving these two problems, you can start training!

This did not solve the KeyError: 'state_dict' problem for me. How did you guys get around this?

> This error may be caused by a newer PyTorch version.
>
> In my case, I load the pre-trained model by modifying the code just before the "instantiate_from_config" call in main.py for stage 3, around line 560, as the screenshot shows.
>
> [screenshot] After that, I encountered a new error, "RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)", which I solved by referring to [ygtxr1997/CelebBasis#8](https://github.com/ygtxr1997/CelebBasis/issues/8).
>
> After solving these two problems, you can start training!

Hi, what is the size of the trained ldm model after step 3?

> This error may be caused by a newer PyTorch version.
>
> In my case, I load the pre-trained model by modifying the code just before the "instantiate_from_config" call in main.py for stage 3, around line 560, as the screenshot shows.
>
> [screenshot] After that, I encountered a new error, "RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)", which I solved by referring to [ygtxr1997/CelebBasis#8](https://github.com/ygtxr1997/CelebBasis/issues/8).
>
> After solving these two problems, you can start training!

> This did not solve the KeyError: 'state_dict' problem for me. How did you guys get around this?

Which line of code does this error appear on? You can check whether your main.py has been patched with the code I showed in the screenshot above.

> This error may be caused by a newer PyTorch version.
>
> In my case, I load the pre-trained model by modifying the code just before the "instantiate_from_config" call in main.py for stage 3, around line 560, as the screenshot shows.
>
> [screenshot] After that, I encountered a new error, "RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)", which I solved by referring to [ygtxr1997/CelebBasis#8](https://github.com/ygtxr1997/CelebBasis/issues/8).
>
> After solving these two problems, you can start training!

> This did not solve the KeyError: 'state_dict' problem for me. How did you guys get around this?

@Srutarshi Did you solve this issue? I also couldn't resolve it with the methods provided above, but I fixed it by checking the format of the model weights. For training you need more than just the model's state_dict, because PyTorch Lightning also expects additional information such as 'epoch', 'global_step', and more when resuming. I suspect you might have used the model obtained from combine.ipynb. If so, you can fix it by saving the model as follows at the very end:

import torch

# `ldm_model_path` is the original pre-trained LDM checkpoint loaded in combine.ipynb,
# and `model` is the combined model built there. Copy the Lightning bookkeeping
# entries from the original checkpoint and save the combined weights under 'state_dict'.
ldm_model = torch.load(ldm_model_path)
torch.save({
    'epoch': ldm_model['epoch'],
    'global_step': ldm_model['global_step'],
    'pytorch-lightning_version': ldm_model['pytorch-lightning_version'],
    'state_dict': model.state_dict(),
    'callbacks': ldm_model['callbacks'],
    'optimizer_states': ldm_model['optimizer_states'],
    'lr_schedulers': ldm_model['lr_schedulers'],
}, './updated_ldm.ckpt')
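With the checkpoint written in this full Lightning format, the step 3 command quoted earlier (python main.py --base ldm/models/ldm/inpainting_big/config.yaml --resume updated_ldm.ckpt --stage 1 -t --gpus 0,) should find checkpoint['state_dict'] when restoring states.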

@gotjd709: As @star4s mentioned in his comment above, I used last.ckpt in latent-diffusion-inpainting\logs\checkpoints.
This solved the issue for me.