rohitgandikota/sliders

A question about the textsliders code in "train_lora_xl.py"

Opened this issue · 0 comments

I noticed that during training, the code first uses the network with the LoRA structure applied to infer denoised_latents from randomly initialized latents.
Then, using denoised_latents and the frozen SD network, it predicts noise again. If denoised_latents is already the denoised image, what is the principle behind predicting noise on it a second time? Why not predict noise directly on the randomly initialized latents?
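To make the flow I am asking about concrete, here is a toy sketch of the two stages as I understand them. It is not the code from train_lora_xl.py: `toy_unet`, `denoise`, `training_step`, the scalar latent, and the numeric constants are all hypothetical stand-ins, and stopping stage 1 at a random intermediate step (`steps_to`) is my assumption about what the script does.

```python
import random


def toy_unet(latent, t, lora_scale=0.0):
    """Hypothetical stand-in for a UNet noise predictor on a scalar latent."""
    base = 0.1 * latent * t                   # prediction of the "frozen" model
    return base + lora_scale * 0.05 * latent  # LoRA adds a small offset when active


def denoise(latent, timesteps, lora_scale):
    """Subtract the predicted noise once per timestep in `timesteps`."""
    for t in timesteps:
        latent = latent - toy_unet(latent, t, lora_scale)
    return latent


def training_step(seed=0, max_steps=50):
    rng = random.Random(seed)
    latents = rng.gauss(0.0, 1.0)             # randomly initialized latents
    # Assumption: stage 1 stops at a random intermediate step, not at the end.
    steps_to = rng.randint(1, max_steps - 1)
    schedule = [1.0 - i / max_steps for i in range(max_steps)]

    # Stage 1: LoRA active, denoise from step 0 up to steps_to.
    denoised_latents = denoise(latents, schedule[:steps_to], lora_scale=1.0)

    # Stage 2: LoRA off (frozen model), predict noise at the timestep reached.
    t = schedule[steps_to]
    target = toy_unet(denoised_latents, t, lora_scale=0.0)

    # The LoRA prediction on the same input is what a loss would compare to target.
    pred = toy_unet(denoised_latents, t, lora_scale=1.0)
    return steps_to, denoised_latents, target, pred


if __name__ == "__main__":
    print(training_step())
```

If that assumption holds, denoised_latents in stage 2 would be a partially denoised latent at an intermediate timestep rather than a finished image, but I would like to confirm whether that is the intended reading.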