explainingai-code/StableDiffusion-PyTorch

Unable to run

mognc opened this issue · 29 comments

Hello there, I am trying to run the text-conditional part and have followed all the instructions, but at the end I am facing the following error (screenshot attached below): "Model checkpoint celebhq/ddpm_ckpt_text_cond_clip.pth not found".
[Screenshot 2024-03-28 144732]

Hello @mognc,
When you ran train_ddpm_cond.py, what configuration file did you use? If it was config/celebhq_text_cond.yaml, then training would have created a checkpoint at celebhq/ddpm_ckpt_text_cond_clip.pth.
Can you let me know which config you ran the training with, and if you have changed any parameters, could you attach that config as well?
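For reference, the checkpoint location comes from the config itself. Roughly, as a sketch based on the task_name and ldm_ckpt_name fields in the YAML (not the exact repo code):

```python
# Sketch: the checkpoint path is assembled from two config fields,
# so task_name 'celebhq' + ldm_ckpt_name 'ddpm_ckpt_text_cond_clip.pth'
# resolves to celebhq/ddpm_ckpt_text_cond_clip.pth.
import os
import yaml

with open('config/celebhq_text_cond.yaml', 'r') as f:
    config = yaml.safe_load(f)

train_config = config['train_params']
ckpt_path = os.path.join(train_config['task_name'], train_config['ldm_ckpt_name'])

if not os.path.exists(ckpt_path):
    # This is the situation behind the "Model checkpoint ... not found" message:
    # Stage II training has not produced the checkpoint yet.
    print(f'Model checkpoint {ckpt_path} not found')
```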

I didn't change any parameters, and yes, I used config/celebhq_text_cond.yaml.

Okay, then can you check the name of the checkpoint file that was created in the celebhq folder?

Sorry, but there is no celebhq folder here.
[Screenshot 2024-03-28 161325]

But there should be a celebhq folder after you run the autoencoder. I am assuming you ran train_vqvae with the same config file, right?

Sorry for troubling you, I missed that block. It's downloading the different models at the moment, and I hope my error will be resolved now. Thanks for helping. I am new to this image generation thing, so I make silly mistakes.

No problem at all @mognc :)
So basically you first train the autoencoder (train_vqvae.py), and then you can choose to train the unconditional or conditional diffusion model.
Just make sure that you use the same config file for both stages (autoencoder and LDM).

Will keep the issue open for now, and you can close it once you have successfully run the text-conditional training.
Feel free to comment here if you run into any further problems along the way.

Unfortunately the error did not get resolved. I am sharing attachments showing the commands I used and the contents of my celebhq folder.
[Screenshot 2024-03-28 180307]
[Screenshot 2024-03-28 180248]
[Screenshot 2024-03-28 180203]

After running train_vqvae, did you run train_ddpm_cond? Did that fail?

After running the command "!python -m tools.train_vqvae --config config/celebhq_text_cond.yaml" it displayed that training was completed. Then I ran "!python -m tools.sample_ddpm_text_cond --config config/celebhq_text_cond.yaml", which failed. I have also attached pictures above for your reference.

Yes, but train_vqvae is only for Stage I. It only trains the autoencoder, not the diffusion model.
Once the autoencoder is trained, we need to run train_ddpm_cond for Stage II training, i.e. training a conditional latent diffusion model.
Only after that is trained will you be able to generate data using the sample_ddpm_text_cond script. Roughly, the full sequence looks like the sketch below.
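A minimal sketch of the three stages run in order from a notebook or script (using the same commands quoted above; the ordering is the key point):

```python
# Sketch: run the three stages in order, all with the same config file.
# Stage II needs the autoencoder checkpoint from Stage I, and sampling
# needs the LDM checkpoint from Stage II.
import subprocess

CONFIG = 'config/celebhq_text_cond.yaml'

for module in [
    'tools.train_vqvae',            # Stage I: train the VQ-VAE autoencoder
    'tools.train_ddpm_cond',        # Stage II: train the conditional latent diffusion model
    'tools.sample_ddpm_text_cond',  # generate images with text conditioning
]:
    subprocess.run(['python', '-m', module, '--config', CONFIG], check=True)
```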

Hey there, my model got into a running state yesterday, but the output was not what I desired, and I am a bit confused about what changes I have to make to improve it. I have a dataset of 100 images with a 100-caption file. The dimensions of the pictures are 600×800. I have made a dataset and YAML file similar to "celebhq.yaml" and the celeb_dataset.py file, and I have modified the targets to point to my own files. I want to generate pictures similar to my dataset but with variations in them. I figured that to achieve this I should use the text-conditional LDM training method and create a file similar to "celebhq_text_cond.yaml", right? But now I am unsure what additional changes I have to make in the three new files I have created to help me achieve my goal. Can you point out the parameters I have to change, like the number of epochs to train, etc.? Also, I have enabled save_latents in those files, as you mentioned in the readme, to speed up the training process.

How many epochs/steps did you train the autoencoder for? And could you add some example outputs from the autoencoder?
Same for the LDM. That will help me understand which stage is not generating the desired output.

The first thing I would suggest is to have more images, maybe 2K to start with.
Second, is there a reason you want to formulate this as a text-conditional problem and not a class-conditional one? For text-conditional problems you would be training additional cross-attention layers, and training will also be slower. So if you can achieve your goal by formulating it as a class-conditional problem, I would suggest trying that first.

Well, I didn't change the epochs or samples. At the moment I don't have access to those outputs, as Colab erases all the data after the session terminates. But the LDM stage was not producing correct output; it was just a blurry picture.
I assumed that to add variations to a dataset through a prompt I would need to use text conditioning and not class conditioning. I might be wrong, as I am no expert, but the final goal is to add variations according to the user's prompt. I have a dataset of walls, and the user will enter a prompt like "snow on walls" or "shadow on the walls" and the relevant picture will be generated.

If you didn't change any parameters, then the autoencoder ran for only 20 epochs and the discriminator didn't even start, because the config starts the discriminator at 15000 steps.
So you should train the autoencoder again anyway and change the disc_start parameter to the number of steps after which you start seeing decent but blurry outputs from the autoencoder.

For the conditioning: if all you have are texts of the form '<obj> on walls', where obj can be one of K things, then you can use class conditioning with K classes rather than text conditioning.

Ok, I will change that parameter. And I will try class conditioning too, but I just want to make sure: my dataset is simple and doesn't include the types of variations I want, like snow or dust. This will not be a problem, right?

Also, the LDM epochs are set at 100, but that was for the celebhq dataset with 30000 images.
In the current setting with 100 images, I would suggest training the LDM (Stage II) for 1000 epochs to validate the quality of the LDM outputs (more if you see that the quality is still improving).

"but I just want to make sure like my dataset is simple it don't include type of variations I want like snow or dust"
I didnt get this part. Could you clarify a bit ? Do you mean that you want the model to generate variations for which you don't have images ?

Yes, I don't have pictures of the variations I want.

But if the model has never seen what 'snow' looks like at any point during training, it will not be able to generate 'snow on walls', right?

Well, my friend used some pre-trained models and those were producing results, so I am not sure how this model works. Should I add simple pictures of snow, dust, and other variations and merge them with the walls dataset?

Yes, a pre-trained model would work because it has seen what 'snow' looks like, but this model will be trained from scratch.
So I would suggest either using a pre-trained model and fine-tuning it with libraries like diffusers,
OR, if you want to train from scratch using this repo, adding those images to the training data.

Well, I will stick with this repo, add the variation pictures along with captions, and merge everything into one dataset. Thanks for clearing up all my confusion, I really appreciate your time.

I can't figure out this part of the guide, like what to change and where.
[screenshot of the readme section]

This part of the readme is just saying that the dataset class must return a tuple of the image tensor and a dictionary of conditional inputs.
For the class-conditional case, that dictionary only needs one key, 'class', whose value is the integer class of the item.
Example - https://github.com/explainingai-code/StableDiffusion-PyTorch/blob/main/dataset/mnist_dataset.py#L75-L77
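A minimal sketch of what such a dataset could look like (hypothetical WallDataset class; everything except the 'class' key convention is illustrative):

```python
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class WallDataset(Dataset):
    """Hypothetical class-conditional dataset: __getitem__ returns (image_tensor, cond_input)."""

    def __init__(self, image_paths, labels, im_size=256):
        self.image_paths = image_paths  # list of image file paths
        self.labels = labels            # one integer class id per image
        self.transform = transforms.Compose([
            transforms.Resize((im_size, im_size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        im = self.transform(Image.open(self.image_paths[index]).convert('RGB'))
        im = 2 * im - 1  # scale to [-1, 1], the convention the repo's datasets use
        # Conditional inputs go in a dict; for class conditioning it has a single
        # 'class' key whose value is the integer class of this item.
        return im, {'class': self.labels[index]}
```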

Hello again. I was unable to train my model class-conditionally; I couldn't resolve all the errors after changing some files. So I tried training it text-conditionally, and here are some outputs.
[current_autoencoder_sample_193]
This is the final picture generated while training the autoencoder for 500 epochs with disc_start at 100.
[x0_996]
This is the sample generated at x0_996.
[x0_0]
And this is the sample generated at x0_0. I trained the model for 1000 epochs; can this result be improved if I train for more epochs? And a final question: my checkpoint file ddpm_ckpt_text_cond_clip.pth is overwritten every time I run the "!python -m tools.train_ddpm_cond --config config/celebhq_text_cond.yaml" cell, and a new file is saved, right?

This is my config file which I edited:
```yaml
dataset_params:
  im_path: 'data/Cracks_data'
  im_channels : 3
  im_size : 256
  name: 'crack'

diffusion_params:
  num_timesteps : 1000
  beta_start : 0.00085
  beta_end : 0.012

ldm_params:
  down_channels: [ 256, 384, 512, 768 ]
  mid_channels: [ 768, 512 ]
  down_sample: [ True, True, True ]
  attn_down : [True, True, True]
  time_emb_dim: 512
  norm_channels: 32
  num_heads: 16
  conv_out_channels : 128
  num_down_layers : 2
  num_mid_layers : 2
  num_up_layers : 2
  condition_config:
    condition_types: [ 'text' ]
    text_condition_config:
      text_embed_model: 'clip'
      train_text_embed_model: False
      text_embed_dim: 512
      cond_drop_prob: 0.1

autoencoder_params:
  z_channels: 3
  codebook_size : 8192
  down_channels : [64, 128, 256, 256]
  mid_channels : [256, 256]
  down_sample : [True, True, True]
  attn_down : [False, False, False]
  norm_channels: 32
  num_heads: 4
  num_down_layers : 2
  num_mid_layers : 2
  num_up_layers : 2

train_params:
  seed : 1111
  task_name: 'crack'
  ldm_batch_size: 16
  autoencoder_batch_size: 4
  disc_start: 100
  disc_weight: 0.5
  codebook_weight: 1
  commitment_beta: 0.2
  perceptual_weight: 1
  kl_weight: 0.000005
  ldm_epochs: 1000
  autoencoder_epochs: 500
  num_samples: 1
  num_grid_rows: 1
  ldm_lr: 0.000005
  autoencoder_lr: 0.00001
  autoencoder_acc_steps: 4
  autoencoder_img_save_steps: 64
  save_latents : True
  cf_guidance_scale : 1.0
  vae_latent_dir_name: 'vae_latents'
  vqvae_latent_dir_name: 'vqvae_latents'
  ldm_ckpt_name: 'ddpm_ckpt_text_cond_clip.pth'
  vqvae_autoencoder_ckpt_name: 'vqvae_autoencoder_ckpt.pth'
  vae_autoencoder_ckpt_name: 'vae_autoencoder_ckpt.pth'
  vqvae_discriminator_ckpt_name: 'vqvae_discriminator_ckpt.pth'
  vae_discriminator_ckpt_name: 'vae_discriminator_ckpt.pth'
```

I think it would benefit from training the autoencoder more. Specifically, two changes:

  1. autoencoder_epochs: 1000
  2. disc_start: 200 x (number of steps in one epoch)

Basically, train for longer and start the discriminator only after your autoencoder generates the best reconstructions it can. disc_start is the number of steps after which the discriminator starts.
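As a concrete illustration with the numbers from this thread (100 images, autoencoder_batch_size of 4), and assuming one step means one batch:

```python
import math

num_images = 100            # dataset size mentioned earlier in the thread
autoencoder_batch_size = 4  # from train_params in the config above

steps_per_epoch = math.ceil(num_images / autoencoder_batch_size)  # 25 batches per epoch
disc_start = 200 * steps_per_epoch                                 # 200 epochs' worth of steps

print(steps_per_epoch, disc_start)  # 25 5000
```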

Yes, the ddpm_ckpt_text_cond_clip.pth checkpoint is overwritten every time you run the training.
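If you want to keep the weights from an earlier run, copy the checkpoint aside before retraining; a small sketch (the backup filename is just an example):

```python
# Copy the existing LDM checkpoint before re-running train_ddpm_cond,
# since training writes to the same path and overwrites it.
import shutil
from pathlib import Path

ckpt = Path('crack') / 'ddpm_ckpt_text_cond_clip.pth'        # task_name / ldm_ckpt_name
backup = ckpt.with_name('ddpm_ckpt_text_cond_clip.bak.pth')  # example backup name

if ckpt.exists():
    shutil.copy2(ckpt, backup)
    print(f'Backed up {ckpt} -> {backup}')
```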