forever208/DDPM-IP

A Theoretical Question

Closed this issue · 15 comments

In the diffusion model papers, we all assume the real image $\textbf{x}_0 \sim q(\textbf{x}_0)$, but I haven't seen an exact definition of $q(\textbf{x}_0)$. So I wonder what exactly $q(\textbf{x}_0)$ is. Is it the distribution function of $\textbf{x}_0$? If so, how do we calculate the distribution function of a single image? Thank you!

q(x_0) stands for the whole data distribution, i.e., the distribution of your training dataset.
We cannot explicitly express q(x_0) (we do not know whether it is Gaussian or some other distribution); we can only draw samples from the data distribution q(x_0).
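
In practice, "drawing a sample from q(x_0)" simply means picking a random image from the training set, i.e. sampling from the empirical data distribution. A minimal sketch of this idea (the file name and array key below are illustrative assumptions, not part of the repo):

import numpy as np

# Hypothetical training set: N images of shape 32x32x3, e.g. scaled to [-1, 1]
train_images = np.load("cifar10_train.npz")["arr_0"]

def sample_x0(batch_size):
    # Draw x_0 ~ q(x_0): pick random images from the empirical data distribution
    idx = np.random.randint(0, len(train_images), size=batch_size)
    return train_images[idx]

x0 = sample_x0(16)  # a batch of real images, i.e. samples from q(x_0)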

I see, thank you! And in $\textbf{x}_t \sim q(\textbf{x}_t|\textbf{x}_{t-1})=\mathcal{N}(\textbf{x}_t;\sqrt{1-\beta_t}\textbf{x}_{t-1},\beta_t \textbf{I})$, the function $q(\textbf{x}_t|\textbf{x}_{t-1})$ is the conditional distribution of $\textbf{x}_t$ given $\textbf{x}_{t-1}$. Am I right? Thanks.

yes
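
For concreteness, a single forward step drawn from that Gaussian can be sketched in NumPy as follows (illustrative only, not the repo's actual implementation):

import numpy as np

def forward_step(x_prev, beta_t):
    # Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    noise = np.random.randn(*x_prev.shape)  # epsilon ~ N(0, I)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

# Example: a dummy 32x32x3 "image" scaled to [-1, 1]
x_prev = np.random.uniform(-1, 1, size=(32, 32, 3))
x_t = forward_step(x_prev, beta_t=0.02)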

Thank you. Also, for example, I'm working on the CIFAR-10 dataset. Then the dimension of $\textbf{x}_0, \cdots, \textbf{x}_t$ is 32×32×3, right?

yes

Thank you! I wonder what you mean by "each FID value is computed using T = 1000 sampling steps". Does it refer to --diffusion_steps 1000 in the command below? Thanks.

mpirun python scripts/image_sample.py \
--image_size 32 --timestep_respacing 100 \
--model_path PATH_TO_CHECKPOINT \
--num_channels 128 --num_head_channels 32 --num_res_blocks 3 --attention_resolutions 16,8 \
--resblock_updown True --use_new_attention_order True --learn_sigma True --dropout 0.3 \
--diffusion_steps 1000 --noise_schedule cosine --use_scale_shift_norm True --batch_size 256 --num_samples 50000

In Figure 3 of your paper, you calculated FID scores using T = 1000 sampling steps.

The command above uses 100 sampling steps. Figure 3 will be updated in the paper later.

Got it. Could you tell me which parameter determines the number of sampling steps in the code below? Thank you.

mpirun python scripts/image_sample.py \
--image_size 32 --timestep_respacing 100 \
--model_path PATH_TO_CHECKPOINT \
--num_channels 128 --num_head_channels 32 --num_res_blocks 3 --attention_resolutions 16,8 \
--resblock_updown True --use_new_attention_order True --learn_sigma True --dropout 0.3 \
--diffusion_steps 1000 --noise_schedule cosine --use_scale_shift_norm True --batch_size 256 --num_samples 50000

--timestep_respacing 100

Thank you. I'm still a little confused about the notation in the paper. You mentioned in the paper "When training, we always use $T = 1000$ steps for all the models. At inference time, the results reported with $T^{\prime} < T$ sampling steps have been obtained using the respacing technique." So here, $T = 1000$ refers to diffusion_steps 1000, and $T^{\prime}$ refers to the parameter timestep_respacing. Am I right? Thanks.

yes
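
As a rough illustration of the respacing idea (a simplified sketch, not necessarily the exact timestep-selection logic used in the codebase), sampling with $T^{\prime} < T$ keeps roughly $T^{\prime}$ evenly spaced timesteps out of the original $T$ and runs the reverse process only on that subset:

import numpy as np

def respaced_timesteps(T=1000, T_prime=100):
    # Pick ~T' evenly spaced timesteps out of the original T;
    # T corresponds to --diffusion_steps and T' to --timestep_respacing.
    return np.linspace(0, T - 1, T_prime).round().astype(int)

steps = respaced_timesteps(1000, 100)
print(steps[:5], steps[-5:])  # first and last few of the kept timesteps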

Thanks! By the way, I'm trying to train a new model on the MNIST dataset using your code, and I've created a notebook to download MNIST and convert the training set to an npz file. The only issue I have is that the dimension of the images in my npz file is 28×28×3, but the default dimension of MNIST images is 28×28×1.

So I wonder if this discrepancy will influence the training of DDPM-IP. Here is the Colab notebook. Thank you.

The dimension of my npz file: (screenshot showing shape 28×28×3)

The default dimension: (screenshot showing shape 28×28×1)
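
If the extra channel dimension turns out to matter, one simple workaround (a hedged sketch, assuming the three channels are identical copies of the grayscale values and that the array was saved under NumPy's default key; the file names are illustrative) is to collapse the npz back to a single channel before training:

import numpy as np

data = np.load("mnist_train_rgb.npz")["arr_0"]  # assumed shape (N, 28, 28, 3)

# Verify the three channels are duplicates, then keep only one of them
assert np.array_equal(data[..., 0], data[..., 1]) and np.array_equal(data[..., 0], data[..., 2])
gray = data[..., :1]  # shape (N, 28, 28, 1)

np.savez("mnist_train_gray.npz", gray)
print(gray.shape)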

Hi, I also wonder what total_batch_size was when you trained on the CelebA dataset. I guess total_batch_size is 8*16=128 since there are two nodes. And how long does the training process take? Thank you.

The code for CelebA 64x64 training:

mpiexec -n 16  python scripts/image_train.py --input_pertub 0.1 \
--data_dir PATH_TO_DATASET \
--image_size 64 --use_fp16 True --num_channels 192 --num_head_channels 64 --num_res_blocks 3 \
--attention_resolutions 32,16,8 --resblock_updown True --use_new_attention_order True \
--learn_sigma True --dropout 0.1 --diffusion_steps 1000 --noise_schedule cosine --use_scale_shift_norm True \
--rescale_learned_sigmas True --schedule_sampler loss-second-moment --lr 1e-4 --batch_size 16

Yes, total_batch_size = 8*16 = 128 is correct. Training on CelebA takes 4-5 days using 16 V100 GPUs.