Questions about training big-lama and the full-checkpoint
Closed this issue · 17 comments
Hi, thanks again for your excellent work.
Is the big-lama model trained on the places-challenge dataset? Does it perform significantly better than a big-lama trained on places2-standard?
Is it possible to release the full checkpoints of the big-lama model, so we can finetune it on other data? Thanks.
Could you also share the training log or time of big-lama? Thanks so much.
Is the big-lama model trained on places-challenge dataset?
Not exactly Places Challenge - it was trained on a subset of 157 categories from Places Challenge. Please refer to the supplementary material for the exact list of these categories.
Does it perform significantly better than a big-lama trained on places2-standard?
The difference is quite noticeable to the naked eye, but the improvement from standard -> subset-of-challenge is smaller than that from the most important contributions of our paper (e.g. masks, architecture and segm-pl).
Could you also share the training log or time of big-lama? Thanks so much.
It took approximately 12 days to train this big-lama on 8x V100 32GB with a total batch size of 120 (8 GPUs x 15 samples each).
Is it possible to release the full checkpoints of the big-lama model, so we can finetune it on other data?
I've just uploaded the full checkpoint to https://disk.yandex.ru/d/wJ2Ee0f1HvasDQ, subfolder big-lama-with-discr
- unlike the other checkpoints, this one includes the discriminator and SegmPL weights.
Please share your experience with finetuning - whether it helps and how much.
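If you want to check what a given checkpoint actually contains before fine-tuning, here is a small sketch. It assumes the usual PyTorch Lightning layout, where the weights live under a top-level "state_dict" key; the file path is only an example.

```python
def top_level_modules(state_dict):
    # Parameter names look like "generator.model.1.weight"; the first
    # dotted component identifies the submodule they belong to.
    return sorted({name.split(".", 1)[0] for name in state_dict})

# With a real checkpoint you would pass the loaded dict, e.g.:
#   import torch
#   top_level_modules(torch.load("big-lama-with-discr/best.ckpt",
#                                map_location="cpu")["state_dict"])
demo = {
    "generator.model.1.weight": None,
    "discriminator.model0.conv.weight": None,
    "loss_segm_pl.impl.conv1.weight": None,
}
print(top_level_modules(demo))  # ['discriminator', 'generator', 'loss_segm_pl']
```

If the discriminator and SegmPL prefixes show up in the list, you have the full training checkpoint rather than a weights-only export.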
Thanks so much! That is super helpful!
I'll close this issue for now - feel free to reopen if you have any issues with fine-tuning.
Hello,
I am having some issues loading big-lama-with-discr for finetuning. Please correct me if I am wrong, but I notice that the SegmPL weights are named loss_segm_pl.impl... in the .ckpt, while the current trainer loads them as loss_resnet_pl.impl...
https://github.com/saic-mdal/lama/blob/ede702b19b027ad2c0380419b2b71a90fe90a14f/saicinpainting/training/trainers/base.py#L110
After modifying this, I get the following error:
KeyError: 'Trying to restore training state but checkpoint contains only the model. This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`.'
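For context, PyTorch Lightning raises this KeyError when the loaded checkpoint lacks optimizer/scheduler state, which is exactly what save_weights_only=True produces. A sketch of that check, based on my reading of the 1.2.x sources (not an official API, so verify against your installed version):

```python
def looks_weights_only(ckpt):
    # Lightning 1.2.x refuses to restore training state when either of
    # these keys is missing from the loaded checkpoint dict (assumption
    # based on the checkpoint_connector sources).
    return "optimizer_states" not in ckpt or "lr_schedulers" not in ckpt

print(looks_weights_only({"state_dict": {}}))    # True: model weights only
print(looks_weights_only({"state_dict": {},
                          "optimizer_states": [],
                          "lr_schedulers": []}))  # False: full training state
```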
@yzhouas did you have any success with this? I am wondering if it is just me.
Apparently, this is a known issue in PyTorch Lightning; with the suggested PyTorch Lightning 1.2.9, the problem seems to be here:
# restore training state
self.restore_training_state(checkpoint)
So, a very ugly hack would be to bypass it like this:
# restore training state
try:
    self.restore_training_state(checkpoint)
except KeyError:
    rank_zero_warn(
        "File at `resume_from_checkpoint` contains only the model; "
        "skipping restore of the training state."
    )
Hi @affromero !
Yeah, I forgot that we changed the name of this variable after training big-lama... Another possible solution is to just strip loss_segm_pl.impl...
from the checkpoint altogether - it is initialized from a fixed ade20k checkpoint anyway.
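That stripping step can be done with a few lines. A minimal sketch, assuming the standard Lightning layout with the weights under ckpt["state_dict"] (the file names are examples; loading/saving the real file goes through torch.load/torch.save):

```python
def strip_segm_pl(ckpt, prefix="loss_segm_pl."):
    # Drop every entry whose name starts with the stale prefix; the loss
    # is re-initialized from the fixed ade20k checkpoint at startup anyway.
    ckpt["state_dict"] = {
        k: v for k, v in ckpt["state_dict"].items()
        if not k.startswith(prefix)
    }
    return ckpt

# With the real checkpoint:
#   import torch
#   ckpt = torch.load("big-lama-with-discr/best.ckpt", map_location="cpu")
#   torch.save(strip_segm_pl(ckpt), "big-lama-with-discr-stripped.ckpt")
demo = {"state_dict": {"generator.w": 1, "loss_segm_pl.impl.conv1.weight": 2}}
print(sorted(strip_segm_pl(demo)["state_dict"]))  # ['generator.w']
```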
Trying to restore training state but checkpoint contains only the model.
I have not faced this issue yet. Have you resolved it?
Hi @windj007,
I looked into the supplementary material but was not able to find which categories from Places Challenge were used for training Big-LaMa. Could you please list these categories? Also, why didn't you use the entire Places Challenge set for training Big-LaMa?
Thank you
Hi @windj007 ,
I am having the same issue loading big-lama-with-discr for finetuning; please correct me if I made any mistake.
I ran this command:
python bin/train.py -cn big-lama location=my_dataset data.batch_size=10 +trainer.kwargs.resume_from_checkpoint=path\\to\\big-lama-with-discr\\best.ckpt
and got this error message:
RuntimeError: Error(s) in loading state_dict for DefaultInpaintingTrainingModule:
Missing key(s) in state_dict: "loss_resnet_pl.impl.conv1.weight", "loss_resnet_pl......
Unexpected key(s) in state_dict: "loss_segm_pl.impl.conv1.weight", "loss_segm_pl.impl....
I modified base.py line 109 from:
if self.config.losses.get("resnet_pl", {"weight": 0})['weight'] > 0:
    self.loss_resnet_pl = ResNetPL(**self.config.losses.resnet_pl)
to:
if self.config.losses.get("segm_pl", {"weight": 0})['weight'] > 0:
    self.loss_segm_pl = ResNetPL(**self.config.losses.segm_pl)
The Missing key error disappeared, but I still get the Unexpected key error:
Unexpected key(s) in state_dict: "loss_segm_pl.impl.conv1.weight", "loss_segm_pl.impl...
Do you have any suggestion for this?
@marcelsan The list is there, on page 5.
why haven't you used the entire Places Challenge for training Big-Lama?
Bigger datasets need bigger models - and smaller models work better when the dataset is more focused. And Big-LaMa is not that big in terms of number of trainable parameters.
The Missing key error disappeared, but I still get the Unexpected key error:
The quick solution is a couple of comments above:
Another possible solution is to just strip loss_segm_pl.impl... from the checkpoint altogether - anyway it is initialized from a fixed ade20k checkpoint.
I should fix and re-upload the checkpoint, but I have not found time yet...
@windj007
Thanks for your reply.
I just removed "loss_segm_pl" from the checkpoint and it worked.
Sharing the stripped checkpoint here:
https://drive.google.com/file/d/1YTiKZ1hQnKvTEbXIxFXjGg61pBAch_N7/view?usp=sharing
@Liang-Sen thank you!
I summarized the experience above and trained big-lama as follows. If I made any mistakes, please correct me.
1. Modified pytorch_lightning/trainer/connectors/checkpoint_connector.py line 106:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f9f4853f3663404362c7de8614a504b0403c25b8/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L106
from:
# restore training state
self.restore_training_state(checkpoint)
to:
# restore training state
try:
    self.restore_training_state(checkpoint)
except KeyError:
    rank_zero_warn(
        "File at `resume_from_checkpoint` contains only the model; "
        "skipping restore of the training state."
    )
2. Modified lama-main/saicinpainting/training/trainers/base.py line 109 from:
if self.config.losses.get("resnet_pl", {"weight": 0})['weight'] > 0:
    self.loss_resnet_pl = ResNetPL(**self.config.losses.resnet_pl)
to:
if self.config.losses.get("segm_pl", {"weight": 0})['weight'] > 0:
    self.loss_segm_pl = ResNetPL(**self.config.losses.segm_pl)
3. Ran:
python bin/train.py -cn big-lama location=my_dataset data.batch_size=10 +trainer.kwargs.resume_from_checkpoint=abspath\\to\\big-lama-with-discr-remove-loss_segm_pl.ckpt
Checkpoint with loss_segm_pl removed, shared by @Liang-Sen:
https://drive.google.com/file/d/1YTiKZ1hQnKvTEbXIxFXjGg61pBAch_N7/view?usp=sharing
@windj007 I just need to run inference with lama-fourier-with-discr. As mentioned, I downloaded the weights from https://drive.google.com/file/d/1YTiKZ1hQnKvTEbXIxFXjGg61pBAch_N7/view?usp=sharing shared by @Liang-Sen. Could you please provide the config file for lama-fourier-with-discr?