Image degradation/artifacts when scaling SDXL latents
city96 opened this issue · 14 comments
I figured this is the easiest place to open this issue, but I can probably reproduce it on the reference implementation (or diffusers) and post my issue there, if required.
I've been working on building an interposer to convert the latents generated by v1.X and v2.X models into the latent space that SDXL models use. While training, I noticed that XL-to-v1.5 conversion worked almost perfectly, while v1.5->XL conversion produced nasty digital artifacts. This also resulted in the NN never actually converging properly[1].
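For context, the interposer is just a small network that maps one 4-channel latent space onto the other, trained on paired latents of the same images. A minimal sketch of the idea (not my exact architecture) would be something like this:

```python
import torch
import torch.nn as nn

class LatentInterposer(nn.Module):
    """Toy v1.5 -> SDXL latent converter. Both latent spaces are 4-channel,
    so this is just a shallow conv stack; the real model differs in the details."""
    def __init__(self, channels=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, latent):  # latent: [B, 4, H/8, W/8] from the v1.5 VAE encoder
        return self.net(latent)
```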
After some digging, I found that these same artifacts appear any time the SDXL latent is changed in some way between the encode and decode stages. The simplest way to trigger it is up- or downscaling the latent by any amount. Downscaling a v1.5 latent produces a blurry image (as expected of bilinear scaling)[2]. Downscaling an XL latent produces weird corruption that almost looks like digital artifacts[2]. The effect is even worse when using bislerp. SBS output comparison for 768->512 downscale.
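If anyone wants to reproduce it outside ComfyUI, a quick diffusers round-trip shows the same thing (rough sketch; the image path and sizes are just examples):

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").eval()

img = load_image("test_768.png")             # any 768x768 test image
x = to_tensor(img).unsqueeze(0) * 2.0 - 1.0  # -> [1, 3, 768, 768] in [-1, 1]

with torch.no_grad():
    lat = vae.encode(x).latent_dist.sample()           # [1, 4, 96, 96]
    lat = F.interpolate(lat, scale_factor=512 / 768,    # 768 -> 512 downscale
                        mode="bilinear", align_corners=False)
    out = vae.decode(lat).sample                        # the artifacts show up here
```

Doing the same round-trip with the v1.5 VAE just gives the expected blur.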
My current hypothesis is that the SDXL VAE is over-trained in some way. It seems a lot less capable of compensating for the worsened signal-to-noise ratio caused by scaling - or in my case converting - the latents. This might also explain why the v1.0 VAE had odd "scanline" issues.
As for fixing it, I have no clue - unless I overlooked something. Maybe @comfyanonymous can forward this to someone at SAI.
I guess in the meantime I'll see if I can train an XL VAE from scratch by pinning the encoder to the current one.
[1] - Interposer training outputs. v1->xl performed worse on the evaluation, despite all training runs sharing the same preprocessed latents as the inputs/targets.
Left graph is the eval loss, the right two are the training loss.
[2] "Digital noise" from scaling the latent. Present on both v0.9 and v1.0 SDXL VAE but absent from the v1.5 VAE.
As a follow-up on this, I did some testing and trained a VAE with the encoder held constant (copying the weights from the v0.9 VAE). This let me train just the decoder part from scratch.
The outcome mostly confirms what I suspected - the cause of the problem is most likely the SDXL VAE, specifically the decoder stage.
I have uploaded the new VAE on my HF page, although the output quality is mediocre at best due to my limited hardware.
(I have two VAE Encode/Upscale Latent nodes in the image below, but the encoder is the same for both, so the result would be identical even if you used v0.9 as the input on both; I just reused the workflow from the v1.5 test.)
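The core of the setup is just freezing the encode side and only optimizing the decode side. Roughly, in diffusers terms (a simplified sketch; the repo path is a placeholder for the v0.9 weights):

```python
import itertools
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")  # load the v0.9 VAE weights here instead
# (re-initialize vae.decoder / vae.post_quant_conv if training the decoder from scratch)

# Freeze the encode side so the latent space stays identical...
vae.encoder.requires_grad_(False)
vae.quant_conv.requires_grad_(False)

# ...and only optimize the decode side.
params = itertools.chain(vae.post_quant_conv.parameters(), vae.decoder.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)

def train_step(images):  # images in [-1, 1], shape [B, 3, H, W]
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample()
    recon = vae.decode(latents).sample
    loss = F.mse_loss(recon, images)  # the CompVis trainer also adds LPIPS / GAN losses
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss
```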
Interesting stuff!
although the output quality is mediocre at best due to my limited hardware
What hardware x time would be required for a high quality model?
What hardware x time would be required for a high quality model?
Good question. I don't have much experience with distributed/large scale training. I just used a slightly-modified version of the original CompVis code for training and froze the encoder in place. Again, this was more of a proof of concept. The VAE isn't nearly as large as the actual UNET model so I'd imagine the time required would be manageable.
I’m curious - have you tried using your sd1.5->sdxl latent interposer in the middle of denoising? So, for instance, do 50% of the steps on sd1.5, then convert the latent to sdxl, and do the remaining 50% of steps? If so, were the artifacts present on the final sdxl output, or are they only present when you convert a fully denoised sd1.5 latent?
I guess, the initial question would be, does the interposer convert noised latents?
IIRC the UNET did a pretty good job at getting rid of the weird artifacts. It forces the latent back into a format the VAE actually expects, so there usually aren't many noticeable artifacts in the output.
As for the "leftover noise" in the advanced ksampler, that gets converted together with the rest of the image, so that works most of the time. Keep in mind most of my experiments are just hobby-grade at best so take everything with a grain of salt...
I did some testing a while ago and upscaling latents properly seems at least somewhat feasible.
Here's my attempt as well as a more sophisticated version by @Ttl - I haven't had time to test the latter but it looks promising.
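One way to set this up (a rough sketch - the two linked repos differ in architecture and training details) is to train a small net to map low-res latents to the latents of the same image at a higher resolution:

```python
import torch
import torch.nn.functional as F

# upscaler: any small net mapping [B, 4, h, w] -> [B, 4, 2h, 2w],
# e.g. bilinear interpolation followed by a few conv layers.

def train_step(vae, upscaler, opt, images_hi):
    # images_hi: [-1, 1] images at the target (high) resolution
    images_lo = F.interpolate(images_hi, scale_factor=0.5,
                              mode="bilinear", align_corners=False)
    with torch.no_grad():
        lat_lo = vae.encode(images_lo).latent_dist.sample()  # input latent
        lat_hi = vae.encode(images_hi).latent_dist.sample()  # target latent
    loss = F.l1_loss(upscaler(lat_lo), lat_hi)
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss
```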
Do you share your work on a blog or discord or anything? Would love to keep up with your experiments.
@mturnshek
I haven't set up anything like that. I considered it before but wasn't sure if there was any interest in it.
I'm not even sure what people use for blogs nowadays. Custom blogs w/ RSS are easy to set up but aren't really popular anymore. Stuff like Discord isn't properly accessible without an account. Twitter has a short character limit + also requires an account to view posts now. Medium/Substack are fine for long writeups but they don't easily allow for frequent progress updates...
I'm open to suggestions for what platform to use though.
Mastodon for short content
@city96 github.io is good enough if you can't Substack (don't pick Medium tho), or CivitAI's article section for something built-in.
I'll be honest, I completely forgot about this. Made a twitter but was too lazy to post any of the tech-related stuff I'm working on there. It would feel more like yelling into the void than anything, so I never really felt the incentive to post that kind of stuff there. Also can't stand the UI.
I did look up Mastodon as per @Razunter's suggestion, but it just looked like GNU Social with added infighting and instance blocking. It has the same problem as twitter too - you have to actively work towards getting posts 'noticed' - probably worse, since most instances don't even get indexed on search engines.
I do have a domain, mostly for email but I have it linked to github sites. I guess I could set that up properly instead of putting everything in the readme for random repos...
Warmed up to the idea of setting up a Discord server for progress updates a bit, since at least there I'd be posting to the people who actually want to follow whatever I'm working on. I guess moderation could be a problem...
Anyway, I'm rambling. To give a short update, I'm currently training classifiers to prune a dataset. I was testing it with ESRGAN for UNET/VAE artifact reduction (img1 | img2). Eventually, I'll be using it to finetune PixArt with LLaVA (etc) captions, possibly SD as well since I finally upgraded my GPU.
I guess let me know with a reaction if you guys want to follow along, I'll set up a discord server + the domain for posts.
A month later and I actually remembered to do this lol. Here's the site that links out to the discord/blog/etc.
Any update on upscaling SDXL latents? I'm upscaling as an image (decode, upscale, re-encode) for the 2nd pass; this workaround is okay but it seems like a bottleneck in my workflow. What do you guys suggest?
@myusf01
There are a few solutions that upscale directly in latent space. There might be newer ones, but I haven't really been keeping up due to work/personal reasons. Anyway, here are the ones I know of:
SD-Latent-Upscaler is mine, but it can only scale by fixed ratios and the quality isn't the best for XL, especially for realism.
ComfyUi_NNLatentUpscale should be better since it lets you use any scaling ratio, but I haven't tested it for XL. This is what most people use, IIRC.
ComfyUI-sudo-latent-upscale is decent depending on your use case but only has models for SDv1. Not sure if @styler00dollar has any plans for other models.
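For reference, the decode -> upscale -> re-encode roundtrip you're doing now looks roughly like this in diffusers terms (a sketch; the latent-space nodes above exist to skip the two VAE passes):

```python
import torch
import torch.nn.functional as F

# vae: the SDXL AutoencoderKL; latent: [B, 4, h, w] first-pass latent.
# If the latent comes straight from a sampler, divide it by vae.config.scaling_factor
# before decoding and multiply again after re-encoding.

with torch.no_grad():
    img = vae.decode(latent).sample                # decode to pixel space
    img = F.interpolate(img, scale_factor=2.0,     # or run an ESRGAN-style upscaler here
                        mode="bicubic", align_corners=False)
    latent_hi = vae.encode(img.clamp(-1, 1)).latent_dist.sample()
# latent_hi then goes into the 2nd (hires) pass
```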