UCSB-NLP-Chang/CoPaint

Inference error on ImageNet512: ValueError: not enough values to unpack (expected 4, got 3)

Closed this issue · 3 comments

Hi,
I am getting this error while running inference with the ImageNet512 model.
Inference command:
python main.py --config_file configs/imagenet512.yaml --input_image examples/celeb-sample.jpg --mask examples/celeb-mask.jpg --outdir images/example --n_samples 1 --algorithm o_ddim

This is the error I am getting:

(copaint) CoPaint$ python main.py --config_file configs/imagenet512.yaml --input_image examples/celeb-sample.jpg --mask examples/celeb-mask.jpg --outdir images/example --n_samples 1 --algorithm o_ddim

/home/styldod/anaconda3/envs/copaint/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /home/styldod/anaconda3/envs/copaint/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZNK3c1010TensorImpl36is_contiguous_nondefault_policy_implENS_12MemoryFormatE
  warn(f"Failed to load image Python extension: {e}")
WARNING:root:Tensorflow not installed!
WARNING:root:Scikit-learn not installed!
WARNING:root:Logging level higher than INFO!

{
    "algorithm": "o_ddim",
    "attention_resolutions": "32,16,8",
    "channel_mult": "",
    "class_cond": true,
    "classifier_attention_resolutions": "32,16,8",
    "classifier_depth": 2,
    "classifier_path": "./checkpoints/512x512_classifier.pt",
    "classifier_pool": "attention",
    "classifier_resblock_updown": true,
    "classifier_scale": 1.0,
    "classifier_use_fp16": false,
    "classifier_use_scale_shift_norm": true,
    "classifier_width": 128,
    "clip_denoised": true,
    "cond_y": null,
    "dataset_ending_index": -1,
    "dataset_name": "imagenet",
    "dataset_starting_index": -1,
    "ddim": {
        "ddim_sigma": 0.0,
        "schedule_params": {
            "ddpm_num_steps": 250,
            "jump_length": 1,
            "jump_n_sample": 1,
            "num_inference_steps": 100,
            "schedule_type": "linear",
            "time_travel_filter_type": "none",
            "use_timetravel": false
        }
    },
    "ddnm": {
        "schedule_jump_params": {
            "jump_length": 1,
            "jump_n_sample": 1,
            "n_sample": 1,
            "t_T": 250
        }
    },
    "ddrm": {
        "schedule_jump_params": {
            "jump_length": 1,
            "jump_n_sample": 1,
            "n_sample": 1,
            "t_T": 250
        }
    },
    "debug": false,
    "diffusion_steps": 1000,
    "dps": {
        "eta": 1.0,
        "schedule_jump_params": {
            "jump_length": 1,
            "jump_n_sample": 1,
            "n_sample": 1,
            "t_T": 250
        },
        "step_size": 0.5
    },
    "dropout": 0.0,
    "image_size": 512,
    "input_image": "examples/celeb-sample.jpg",
    "learn_sigma": true,
    "lr_kernel_n_std": 2,
    "mask": "examples/celeb-mask.jpg",
    "mask_type": "half",
    "mode": "inpaint",
    "model_path": "./checkpoints/512x512_diffusion.pt",
    "n_iter": 1,
    "n_samples": 1,
    "noise_schedule": "linear",
    "num_channels": 256,
    "num_head_channels": 64,
    "num_heads": 4,
    "num_heads_upsample": -1,
    "num_res_blocks": 2,
    "num_samples": 100,
    "optimize_xt": {
        "coef_xt_reg": 0.01,
        "coef_xt_reg_decay": 1.0,
        "filter_xT": false,
        "lr_xt": 0.0025,
        "lr_xt_decay": 1.05,
        "mid_interval_num": 1,
        "num_iteration_optimize_xt": 5,
        "optimize_before_time_travel": false,
        "optimize_xt": true,
        "use_adaptive_lr_xt": true,
        "use_smart_lr_xt_decay": false
    },
    "outdir": "images/example",
    "predict_xstart": false,
    "repaint": {
        "inpa_inj_sched_prev": true,
        "inpa_inj_sched_prev_cumnoise": false,
        "schedule_jump_params": {
            "jump_length": 10,
            "jump_n_sample": 10,
            "n_sample": 1,
            "t_T": 250
        }
    },
    "resample": {
        "keep_n_samples": 2
    },
    "resblock_updown": true,
    "rescale_learned_sigmas": false,
    "rescale_timesteps": false,
    "respace_interpolate": false,
    "resume": false,
    "scale": 0,
    "seed": 42,
    "show_progress": true,
    "timestep_respacing": "250",
    "use_checkpoint": false,
    "use_ddim": false,
    "use_fp16": true,
    "use_git": false,
    "use_kl": false,
    "use_new_attention_order": false,
    "use_scale_shift_norm": true
}
2023-05-19-12:03:23-root-INFO: Prepare model...
2023-05-19-12:03:25-root-INFO: Loading model from ./checkpoints/512x512_diffusion.pt...
2023-05-19-12:03:27-root-INFO: Prepare classifier...
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
Loading model from: /home/styldod/anaconda3/envs/copaint/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
2023-05-19-12:03:28-root-INFO: Start sampling
  0%|                                                                                                                                       | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 334, in <module>
    main()
  File "main.py", line 203, in main
    image, mask, image_name, class_id = data
ValueError: not enough values to unpack (expected 4, got 3)

Hi Vineet,

Thank you for your interest in our paper.

By default, the ImageNet model that we use requires a class_id for classifier guidance, and the error is caused by that missing class_id. One solution is to use the CelebA model instead, by simply running:

python main.py --config_file configs/celebahq.yaml --input_image examples/celeb-sample.jpg --mask examples/celeb-mask.jpg --outdir images/example --n_samples 1 --algorithm o_ddim 

If you insist on using the ImageNet model, you can remove the classifier guidance by running:

python main.py --config_file configs/imagenet512.yaml --input_image examples/celeb-sample.jpg --mask examples/celeb-mask.jpg --outdir images/example --n_samples 1 --algorithm o_ddim --no-class_cond

Or you could manually set a class id by modifying the code a little (see the sketch below). Honestly, I do not recommend using the ImageNet model for inpainting a human face (celeb-sample.jpg), as the distributions of CelebA and ImageNet differ a lot, and thus the inpainting results might be bad.
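
For illustration, a minimal sketch of what that modification could look like, assuming the unpack is the `image, mask, image_name, class_id = data` line from your traceback; the fallback value is hypothetical and can be any ImageNet class id:

    # Hypothetical sketch, not the exact code in the repo:
    if len(data) == 4:
        image, mask, image_name, class_id = data
    else:
        # The dataset yielded only 3 values; supply a class id manually.
        image, mask, image_name = data
        class_id = 207  # any ImageNet class id of your choice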

Let me know if you have any other questions.

Regards,
Guanhua

Agreed, the CelebA-trained model is more suitable for face-related inpainting. I am trying to evaluate the ImageNet-trained model for object removal in images.
I was first checking whether inference works at all, and used the only image/mask pair present in the repo.
As per your suggestion, I ran inference using this command:
python main.py --config_file configs/imagenet512.yaml --input_image examples/celeb-sample.jpg --mask examples/celeb-mask.jpg --outdir images/example --n_samples 1 --algorithm o_ddim --no-class_cond

I am getting another error, this time while loading the checkpoint:

UserWarning: Failed to load image Python extension: /home/styldod/anaconda3/envs/copaint/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZNK3c1010TensorImpl36is_contiguous_nondefault_policy_implENS_12MemoryFormatE
  warn(f"Failed to load image Python extension: {e}")
WARNING:root:Tensorflow not installed!
WARNING:root:Scikit-learn not installed!
WARNING:root:Logging level higher than INFO!

{
    "algorithm": "o_ddim",
    "attention_resolutions": "32,16,8",
    "channel_mult": "",
    "class_cond": false,
    "classifier_attention_resolutions": "32,16,8",
    "classifier_depth": 2,
    "classifier_path": "./checkpoints/512x512_classifier.pt",
    "classifier_pool": "attention",
    "classifier_resblock_updown": true,
    "classifier_scale": 1.0,
    "classifier_use_fp16": false,
    "classifier_use_scale_shift_norm": true,
    "classifier_width": 128,
    "clip_denoised": true,
    "cond_y": null,
    "dataset_ending_index": -1,
    "dataset_name": "imagenet",
    "dataset_starting_index": -1,
    "ddim": {
        "ddim_sigma": 0.0,
        "schedule_params": {
            "ddpm_num_steps": 250,
            "jump_length": 1,
            "jump_n_sample": 1,
            "num_inference_steps": 100,
            "schedule_type": "linear",
            "time_travel_filter_type": "none",
            "use_timetravel": false
        }
    },
    "ddnm": {
        "schedule_jump_params": {
            "jump_length": 1,
            "jump_n_sample": 1,
            "n_sample": 1,
            "t_T": 250
        }
    },
    "ddrm": {
        "schedule_jump_params": {
            "jump_length": 1,
            "jump_n_sample": 1,
            "n_sample": 1,
            "t_T": 250
        }
    },
    "debug": false,
    "diffusion_steps": 1000,
    "dps": {
        "eta": 1.0,
        "schedule_jump_params": {
            "jump_length": 1,
            "jump_n_sample": 1,
            "n_sample": 1,
            "t_T": 250
        },
        "step_size": 0.5
    },
    "dropout": 0.0,
    "image_size": 512,
    "input_image": "examples/celeb-sample.jpg",
    "learn_sigma": true,
    "lr_kernel_n_std": 2,
    "mask": "examples/celeb-mask.jpg",
    "mask_type": "half",
    "mode": "inpaint",
    "model_path": "./checkpoints/512x512_diffusion.pt",
    "n_iter": 1,
    "n_samples": 1,
    "noise_schedule": "linear",
    "num_channels": 256,
    "num_head_channels": 64,
    "num_heads": 4,
    "num_heads_upsample": -1,
    "num_res_blocks": 2,
    "num_samples": 100,
    "optimize_xt": {
        "coef_xt_reg": 0.01,
        "coef_xt_reg_decay": 1.0,
        "filter_xT": false,
        "lr_xt": 0.0025,
        "lr_xt_decay": 1.05,
        "mid_interval_num": 1,
        "num_iteration_optimize_xt": 5,
        "optimize_before_time_travel": false,
        "optimize_xt": true,
        "use_adaptive_lr_xt": true,
        "use_smart_lr_xt_decay": false
    },
    "outdir": "images/example",
    "predict_xstart": false,
    "repaint": {
        "inpa_inj_sched_prev": true,
        "inpa_inj_sched_prev_cumnoise": false,
        "schedule_jump_params": {
            "jump_length": 10,
            "jump_n_sample": 10,
            "n_sample": 1,
            "t_T": 250
        }
    },
    "resample": {
        "keep_n_samples": 2
    },
    "resblock_updown": true,
    "rescale_learned_sigmas": false,
    "rescale_timesteps": false,
    "respace_interpolate": false,
    "resume": false,
    "scale": 0,
    "seed": 42,
    "show_progress": true,
    "timestep_respacing": "250",
    "use_checkpoint": false,
    "use_ddim": false,
    "use_fp16": true,
    "use_git": false,
    "use_kl": false,
    "use_new_attention_order": false,
    "use_scale_shift_norm": true
}
2023-05-22-15:59:50-root-INFO: Prepare model...
2023-05-22-15:59:51-root-INFO: Loading model from ./checkpoints/512x512_diffusion.pt...
Traceback (most recent call last):
  File "main.py", line 334, in <module>
    main()
  File "main.py", line 164, in main
    unet, sampler = prepare_model(config.algorithm, config, device)
  File "main.py", line 59, in prepare_model
    unet.load_state_dict(
  File "/home/styldod/anaconda3/envs/copaint/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1667, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNetModel:
        Unexpected key(s) in state_dict: "label_emb.weight". 

Hi Vineet,

By default, the UNet we use takes a class_id as input. With class_cond=False, the corresponding layers are removed from the model, so checkpoint weights such as the "label_emb.weight" you reported have nowhere to load. I have made some modifications to the code to enable non-strict model weight loading; you can pull the repo and use the following command to run the experiment:

python main.py --config_file configs/imagenet.yaml --input_image examples/celeb-sample.jpg --mask examples/celeb-mask.jpg --outdir images/example --n_samples 1 --algorithm o_ddim --no-class_cond
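
For reference, the core of that change is PyTorch's standard non-strict loading; a sketch, assuming main.py loads the UNet roughly like this (not the exact diff):

    state_dict = torch.load(config.model_path, map_location="cpu")
    # strict=False ignores checkpoint keys that the class-unconditional
    # UNet does not have, such as "label_emb.weight".
    unet.load_state_dict(state_dict, strict=False)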

However, in my trial, the resulting image was of poor quality due to the image distribution shift and the lack of class conditioning, so I suggest using this command only for debugging purposes.

Let me know if there are any other problems.

Regards,
Guanhua