Comparison discussion
x-legion opened this issue · 92 comments
Hello, would you please provide your weights (including the checkpoint & lora needed if you use lora) for your original image? I need them to reproduce your results in an oil-painting fashion. The MultiDiffusion results can be severely affected by the model checkpoints & lora you used.
But generally speaking, an extraordinarily high CFG Scale and a slightly higher denoising value will give you satisfying details. Example positive prompts are "highres, masterpiece, best quality, ultra-detailed unity 8k wallpaper, extremely clear, very clear, ultra-clear". You don't need anything concrete in the positive prompts; then drag the CFG Scale to an extra-large value. Denoising values between 0.1 and 0.4 are all OK, but the content will change accordingly.
Here is my result of CFG=20, Sampler=DPM++ SDE Karras, denoising strength=0.3 for example. As I use the protogenX34 checkpoint, my painting style will be wildly different from yours:
Please comment on this issue if you find your results have significantly improved after you use a proper model and CFG values.
Hi there, I will write here so as not to create a new issue about a similar thing.
Would it be possible to write down or screenshot all the settings that were used to upscale the picture attached in the extension description? I think I tested everything, but all I get is a blurry upscaled picture. Here is one example result that shows how blurry it is (not to mention the lack of extra details with denoise at 0.3 and CFG at 20, for example). At the moment I want to copy everything 1:1 to see if the issue is on my side. Thanks for creating this extension, I have high hopes.
Example picture.
Hello, as you wish, here is the PNG info:
Here is the text version for your convenience. All resources are publicly available, but I'm quite busy and cannot provide you with links.
masterpiece, best quality, highres, extremely detailed 8k unity wallpaper, ultra-detailed
Negative prompt: EasyNegative
Steps: 24, Sampler: DPM++ SDE Karras, CFG scale: 7, Seed: 1614054406, Size: 4096x3200, Model hash: 2ccfc34fe3, Model: 0.9(Gf_style2) + 0.1(abyssorangemix2_Hard), Denoising strength: 0.4, Clip skip: 3, Mask blur: 4, MultiDiffusion upscaler: 4x_foolhardy_Remacri, MultiDiffusion scale factor: 4, MultiDiffusion tile width: 128, MultiDiffusion tile height: 128, MultiDiffusion overlap: 64
If you don't know any of them, you can Google them. But your result likely comes from poor positive and negative prompts; I use a Textual Inversion called EasyNegative from civitai.com.
Click Here for Better Comparison View
masterpiece, best quality, portrait,
blue fire, silver hair, fox girl, mage, arm extended, holding blue fire, by jordan grimmer and greg rutkowski and pine ハイネ wlop, intricate, beautiful, trending artstation, pixiv, digital art, anime, no torch,
<lora:Noise:1.75>
Negative prompt: EasyNegative, lowres, ((bad anatomy)), ((bad hands)), text, missing finger, extra digits, fewer digits, blurry, ((mutated hands and fingers)), (poorly drawn face), ((mutation)), ((deformed face)), (ugly), ((bad proportions)), ((extra limbs)), extra face, (double head), (extra head), ((extra feet)), monster, logo, cropped, worst quality, low quality, normal quality, jpeg, humpbacked, long body, long neck, ((jpeg artifacts))
Steps: 20, Sampler: DPM++ 2M Karras, CFG scale: 7, Seed: 3857533696, Size: 640x960, Model: dreamniji3fp16, Clip skip: 2, ENSD: 31337, Discard penultimate sigma: True
masterpiece, best quality, portrait,
blue fire, silver hair, fox girl, mage, arm extended, holding blue fire, by jordan grimmer and greg rutkowski and pine ハイネ wlop, intricate, beautiful, trending artstation, pixiv, digital art, anime, no torch,
<lora:Noise:1.75>
Negative prompt: EasyNegative, lowres, ((bad anatomy)), ((bad hands)), text, missing finger, extra digits, fewer digits, blurry, ((mutated hands and fingers)), (poorly drawn face), ((mutation)), ((deformed face)), (ugly), ((bad proportions)), ((extra limbs)), extra face, (double head), (extra head), ((extra feet)), monster, logo, cropped, worst quality, low quality, normal quality, jpeg, humpbacked, long body, long neck, ((jpeg artifacts))
Steps: 20, Sampler: DPM++ 2M Karras, CFG scale: 14, Seed: 3857533696, Size: 1280x1920, Model: dreamniji3fp16, Denoising strength: 0.4, Clip skip: 2, ENSD: 31337, Mask blur: 4, Ultimate SD upscale upscaler: 4x_foolhardy_Remacri, Ultimate SD upscale tile_width: 768, Ultimate SD upscale tile_height: 768, Ultimate SD upscale mask_blur: 8, Ultimate SD upscale padding: 32, Discard penultimate sigma: True
masterpiece, best quality, portrait,
blue fire, silver hair, fox girl, mage, arm extended, holding blue fire, by jordan grimmer and greg rutkowski and pine ハイネ wlop, intricate, beautiful, trending artstation, pixiv, digital art, anime, no torch,
<lora:Noise:1.75>
Negative prompt: EasyNegative, lowres, ((bad anatomy)), ((bad hands)), text, missing finger, extra digits, fewer digits, blurry, ((mutated hands and fingers)), (poorly drawn face), ((mutation)), ((deformed face)), (ugly), ((bad proportions)), ((extra limbs)), extra face, (double head), (extra head), ((extra feet)), monster, logo, cropped, worst quality, low quality, normal quality, jpeg, humpbacked, long body, long neck, ((jpeg artifacts))
Steps: 20, Sampler: DPM++ 2M Karras, CFG scale: 14, Seed: 3857533696, Size: 1280x1920, Model: dreamniji3fp16, Denoising strength: 0.4, Clip skip: 2, ENSD: 31337, Mask blur: 4, MultiDiffusion upscaler: 4x_foolhardy_Remacri, MultiDiffusion scale factor: 2, Discard penultimate sigma: True
https://imgsli.com/MTYwOTcx same here again
Hello, thanks for your interest in this work. I tried for several minutes on your image and here is my result with no tuning:
https://imgsli.com/MTYxMDI5.
It's hard to tell which is better; if you like illustration-style sharpness and faithfulness to the original image, maybe Ultimate SD Upscaler + 4x Ultra Sharp is your best choice. But personally I'd like to see some fabricated details on a realistic human face, so I prefer this tool.
It's noteworthy that the biggest difference between MultiDiffusion and other upscalers is that it currently doesn't support any concrete content in the prompt when you upscale an image; otherwise each tile will contain a small character and your image ends up blurry and messy.
The correct prompts are as follows. I don't even use a lora:
And my configurations, FYI:
Update: Oh, I just noticed that EasyNegative is a textual inversion from civitai.com, not a plain word. Please download that textual inversion.
Here is the link: https://civitai.com/models/7808/easynegative
The Upscalers are important too. I personally use two: 4x-UltraSharp and 4x-remacri. Here is the link:
https://upscale.wiki/wiki/Model_Database
There you can find the two upscalers; put them in your ESRGAN folder.
4x-remacri
I used it with the image above
EasyNegative is a textual inversion
Already downloaded this embedding
Do you use EasyNegative embeddings?
You mean you have used it in the above images?
It may not be as easy to use as the Ultimate Upscaler, as it's essentially a complete redraw without post-processing. Personally, I have some rules of thumb for using it:
- No concrete positive prompts. Just something like clear, very clear, ultra clear
- Don't use too large a tile size, as SD 1.4 is only good at 512 - 768 pixels (divide by 8 and you get a latent tile size of 64 - 96).
- Large CFG Scales, Euler a & DPM++ SDE Karras, Denoising=0.2-0.4
- Try both 4x-UltraSharp and 4x-Remacri
- Clip Skip=2 or 3 is worth trying.
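The tile-size rule of thumb in the list above follows from the 8x downscale between pixel space and latent space. A minimal sketch of that conversion, assuming a hypothetical helper name (`latent_tile_size` is for illustration only and is not part of the extension):

```python
LATENT_DOWNSCALE = 8  # the SD VAE maps each 8x8 pixel patch to one latent cell


def latent_tile_size(pixel_size: int) -> int:
    """Convert a pixel edge length into the matching latent tile size."""
    if pixel_size % LATENT_DOWNSCALE != 0:
        raise ValueError("pixel size should be a multiple of 8")
    return pixel_size // LATENT_DOWNSCALE


if __name__ == "__main__":
    # The 512 - 768 pixel range quoted above maps to tile sizes 64 - 96.
    for px in (512, 768):
        print(px, "->", latent_tile_size(px))
```

The same arithmetic explains why 128 corresponds to 1024-pixel tiles.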
please try to reproduce using my params
I just did it and it's a lot better
Settings (Even seed is the same):
But it still can't generate a result as good as yours
I know it depends heavily on hardware, but there's a very large difference in details
No optimizations used (such as xformers, opt-split-attention, etc.)
I'm also confused. Are you using this model?
https://civitai.com/models/3666/protogen-x34-photorealism-official-release
I see our model hashes are different. Apart from that, I couldn't find anything else.
Yes, I used protogen_x3.4, but pruned
Now I downloaded the 5GB version with the same hash as yours and THAT'S AMAZING
A huge improvement in details:
It still doesn't produce exactly the same result as yours, I guess it depends on hardware, but the details are unbelievable. I can clearly see the stitch seam on the sleeve
Oh, thanks for your feedback. I didn't know that a pruned model could affect the details too until you tested it.
Ohh! To be honest, I think not many people know that o_O As far as I understand pruning, it should not affect a task such as upscaling via small tiles? I'm going to try with a non-pruned model as well and let you know.
Edit. No clue why, but today everything works as it should. Maybe it's necessary to turn everything off and on, not just restart the UI, just like when installing Dreambooth
Tried it and, to be honest, ESRGAN upscalers do 99% of the lifting; it barely does anything when used with Lanczos. Are there going to be examples of it with Lanczos where it introduces new details? The best bet is to just upscale with ESRGAN by 2 and then go to inpaint, masking the parts one by one to upscale them, since you have more pixel area to resolve detail. So unless someone automates that, it's going to stay the best way to upscale
More tests. ControlNet does not work, or it needs a much lower denoise than I used.
The attached upscale was done in two passes plus the dynamic CFG script. Agreed, it is way too far off from the original picture, but now that I know what and where, it's time for fine-tuning (and hopefully to figure out the issue with ControlNet).
Indeed, it's essential to test a couple of upscalers because the differences are huge, even bigger than the SD model used.
This is basically a tile-by-tile img2img SD redraw. So if you don't give it high strength, it doesn't work as you expect. However, one of its weaknesses is that it currently cannot automatically map your prompts to different areas... If you could use stronger prompts, it should be way better.
But I'm working on Automatic Prompt Mapping. In img2img, it works by first estimating the attention map of your prompt on the original picture, and then re-applying it to the MultiDiffusion tiles. In txt2img this may be similar, but I need time to do so.
https://github.com/dustysys/ddetailer.git try this one
I'm sorry for the accidental wrong edit.
The key point is that I need a user interface for drawing bboxes, so that you can draw rectangles and control MultiDiffusion with different prompts. In this way the result should get way better.
Why? Because in this way you can just select the woman's face and tell SD to draw a beautiful woman's face. Then SD will try its best, using its 512 * 512 resolution to draw ONLY a face. The resolution will be unprecedentedly high for SD models, as it is dedicated to drawing only one part of the image to the best of its capabilities.
However, when I was adding features I saw this f**king issue:
gradio-app/gradio#2316
Someone opened a PR for a bbox tool, but the maintainers declined to merge it:
gradio-app/gradio#3220
I don't know what they have in mind, rejecting such a good PR (from my perspective) without providing their own solution. It has been half a year since it was first proposed.
So it will be hard to draw rectangles on images directly. I must find another way to draw rectangles. Do you have any other idea?
Check out this extension: https://github.com/hnmr293/sd-webui-llul
It fakes it by having you move around a rectangle in a separate window.
New Feature: "ZOOM ENHANCE" for the A111 WebUI. Automatically fix small details like faces and hands!
Hello, fellow Stable Diffusion users! I'm excited to share with you a new feature that I've added to the Unprompted extension: it's the [zoom_enhance] shortcode.
If you're not familiar with Unprompted, it's a powerful extension that lets you use various shortcodes in your prompts to enhance your text generation experience. You can learn more about it here.
The [zoom_enhance] shortcode is inspired by the fictional technology from CSI, where they can magically zoom in on any pixelated image and reveal crisp details. Of course, this is not possible in real life, but we can get pretty close with Stable Diffusion and some clever tricks.
The shortcode allows you to automatically upscale small details within your image where Stable Diffusion tends to struggle. It is particularly good at fixing faces and hands in long-distance shots.
How does it work?
The [zoom_enhance] shortcode searches your image for specified target(s), crops out the matching regions and processes them through [img2img]. It then blends the result back into your original image. All of this happens behind-the-scenes without adding any unnecessary steps to your workflow. Just set it and forget it.
Features and Benefits
- Great in both txt2img and img2img modes.
- The shortcode is powered by the [txt2mask] implementation of clipseg, which means you can search for literally anything as a replacement target, and you get access to the full suite of [txt2mask] settings, such as "padding" and "negative_mask."
- It's also pretty good at deepfakes. Set mask="face" and replacement="another person's face" and check out the results.
- It applies a gaussian blur to the boundaries of the upscaled image which helps it blend seamlessly with the original.
- It is equipped with Dynamic Denoising Strength which is based on a simple idea: the smaller your replacement target, the worse it probably looks. Think about it: when you generate a character who's far away from the camera, their face is often a complete mess. So, the shortcode will use a high denoising strength for small objects and a low strength for larger ones.
- It is significantly faster than Hires Fix and won't mess up the rest of your image.
- Compatible with A111's color correction setting.
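The Dynamic Denoising Strength idea from the list above (smaller targets get a higher strength) could be sketched roughly as follows. This is not Unprompted's actual code: the function name, the linear mapping and the 0.2 to 0.6 bounds are all assumptions chosen for illustration.

```python
def dynamic_denoise(target_area: int, image_area: int,
                    min_strength: float = 0.2,
                    max_strength: float = 0.6) -> float:
    """Pick a denoising strength from how much of the image the target covers.

    Hypothetical linear rule: a tiny target gets max_strength,
    a target filling the whole image gets min_strength.
    """
    if image_area <= 0:
        raise ValueError("image_area must be positive")
    # Clamp the covered fraction into [0, 1] to guard against bad masks.
    fraction = min(max(target_area / image_area, 0.0), 1.0)
    return max_strength - (max_strength - min_strength) * fraction


if __name__ == "__main__":
    full = 512 * 512
    print(dynamic_denoise(32 * 32, full))    # small, far-away face
    print(dynamic_denoise(256 * 256, full))  # large close-up face
```

A real implementation may well use a non-linear curve or fixed thresholds; the point is only that strength falls as the masked region grows.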
How to use it?
To use this feature, you need to have Unprompted installed on your WebUI. If you don't have it yet, you can get it from here.
Once you have Unprompted, simply add this line anywhere in your prompt:
I have investigated a new technology DDNM (https://github.com/wyhuai/DDNM) that is very powerful in super-resolution. And it is also compatible with MultiDiffusion. Through initial test I found it is amazing. I believe this can beat their new feature in a compelling way.
The automatic mask technology seems not very compatible with multi-diffusion txt2img but I will try in img2img
Really impressive. Do you know about a user-friendly UI for the DDNM?
multi-diffusion is a great idea btw.
So while this is great for faces it completely destroys my fingers in anything but 0.1 denoise. Still it's a great improvement from ultimate upscaler in speed
What's next? What's the prompt?
You say this is supposed to be faster than UltimateSD Upscale, but it feels about the same if not worse, to me? I don't know what's set wrong, because a 3090 should at least be comparable to a v100... but I can't even use the same settings (if I set the tiling to 128x128, I get an out of memory error--which is even weirder, I have more VRAM), and a 20-step batch to double an image size takes 10 minutes. The performance seems outright bad.
Then there's the issue where if you try to upscale anyone with a tan it will remove it in the process, but that's just issues with not being able to use specific prompts.
+1 to @RainehDaze comment on performance - I have an RTX3090 as well and I can't get anywhere close to the supposed ~1 minute to upscale that the V100 got. Mine also takes close to 10 minutes for upscaling to 2160p/4K @ 20 steps.
10 minutes is absolutely crazy. There must be some incorrect params. What are your params?
I give you detailed screenshots of mine.
For example, when you set overlap = 48, tile size = 96 and tile batch size = 8, it takes 2m38s in total.
Here are my parameters
And the input image is here:
The test for overlap=32 and tile size=96 is here, this time it only takes 1 min 49s. If you reduce the overlap more it will become even faster:
Is it something about the tiling VAE? It set itself by default to size 3092 and doesn't kick in at all.
Everything else seems pretty similar. hell, it was taking 10 minutes to double a 768x768 image.
Please refer to my params. The problem is in MultiDiffusion.
The key factors are overlap and tile batch size. A larger overlap will significantly slow you down, and a larger tile batch size (which costs VRAM) will significantly speed you up.
Normally, tile size should be set to 96 in most cases.
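To see why overlap and tile batch size matter so much, here is a rough sketch that counts sliding-window tiles and UNet batches per sampling step. It assumes a plain sliding window with stride = tile size - overlap; the extension's real tiling logic may differ, and the 512x400 latent (a 4096x3200 pixel image) is only an example input.

```python
import math


def tiles_along(length: int, tile: int, overlap: int) -> int:
    """Number of tiles needed to cover one axis of the latent."""
    if length <= tile:
        return 1
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    return math.ceil((length - tile) / stride) + 1


def count_tiles_and_batches(latent_w: int, latent_h: int, tile: int,
                            overlap: int, batch_size: int) -> tuple[int, int]:
    """Return (total tiles, UNet forward batches per sampling step)."""
    tiles = tiles_along(latent_w, tile, overlap) * tiles_along(latent_h, tile, overlap)
    return tiles, math.ceil(tiles / batch_size)


if __name__ == "__main__":
    for overlap in (48, 32):
        tiles, batches = count_tiles_and_batches(512, 400, 96, overlap, 8)
        print(f"overlap={overlap}: {tiles} tiles, {batches} batches per step")
```

Under these assumptions, dropping the overlap from 48 to 32 cuts the tile count from 80 to 48, and raising the batch size from 1 to 8 cuts the number of forward passes per step by about 8x.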
Update: the regional prompt control is almost complete. Once it lands, MultiDiffusion should perform better with your custom per-region prompts!
And this is with tiling VAE on, yes?
It's still taking about 8 minutes, even with all settings the same (Except mask blur, because I'm not on inpainting and I'm not sure where that's even coming into this)
Really weird. May I have a screenshot of your params? Mine is like this (denoise=0.3)
Okay, comparing it and trying again, the decoder tile size was lower, the latent tile batch size was at 1, and the overlap (following the parameters in the png info above) was 48. Matching settings exactly, it goes down to about 3:30 total. Which is definitely an improvement.
Which still seems a little odd (shouldn't a 3090 be outperforming a Tesla v100?); I wonder if it's having half VAE disabled to stop overflow errors?
I'm happy to see your great improvements (10min->3m30s)!
Theoretically you should be faster than me. But there are too many params and variables in the whole SD process, like half VAE, samplers, upscalers, denoise strength, other extensions, Lora and ControlNet, so I'm also not sure what the root cause is.
If you have some findings, please also tell me and I will be happy to find room for optimization.
I guess if the other guy with a 3090 chips in, we might have a point for comparison if the numbers wind up massively different.
Although, can you go to a higher latent tile size? Because that definitely suggests something might be off on my end, a 3090 has more VRAM.
However, a larger tile size won't increase your image quality; on the contrary, it may lower it. As most models are not trained on images larger than 1024*1024, 128 (just 1024/8) would be the largest size I would recommend.
You can try; I believe a larger size can give you faster speed at the cost of image quality. Also, a tile batch size of 8 is the best choice, where the UNet achieves its highest performance. Going above 8 won't be faster.
See, I'm not questioning whether 128x128 would lead to better quality, I'm just asking if you can run it. If I try running on greater than 96x96, I get an OoM error, so I'm wondering if that's an environment thing.
Oh I'm on 32 GB NVIDIA V100. The UNet itself will consume more than 24 GB when you use 128 latent tile size together with 8 tile batch size, therefore OOM is natural. The details are as follows:
Basically, with MultiDiffusion you are drawing tile_batch_size images simultaneously, each with an edge of 8 x tile size pixels. For example, if you set tile size = 128 and batch size = 8, then in each forward pass the VRAM usage of the UNet will be identical to drawing 8 1024*1024 images without MultiDiffusion.
Hence, if you get OOM errors in normal generation with batch size = 8 and image height & width = 1024, you should get errors with MultiDiffusion too.
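The equivalence described above can be written down as a tiny calculation. This is only a sketch of the arithmetic in the comment; `equivalent_workload` is a hypothetical name and it does not measure actual VRAM.

```python
def equivalent_workload(latent_tile: int, tile_batch_size: int) -> tuple[int, int, int]:
    """Return (images, height_px, width_px) that one UNet forward pass resembles."""
    pixel_edge = latent_tile * 8  # latent cells are 8x8 pixel patches
    return tile_batch_size, pixel_edge, pixel_edge


def pixels_per_pass(latent_tile: int, tile_batch_size: int) -> int:
    """Total pixels processed per forward pass, a crude proxy for memory load."""
    images, height, width = equivalent_workload(latent_tile, tile_batch_size)
    return images * height * width


if __name__ == "__main__":
    print(equivalent_workload(128, 8))  # like 8 images of 1024x1024
    print(equivalent_workload(96, 8))   # like 8 images of 768x768
    ratio = pixels_per_pass(128, 8) / pixels_per_pass(96, 8)
    print(f"tile 128 processes {ratio:.2f}x the pixels of tile 96")
```

So a quick way to predict an OOM is to try a normal generation with the equivalent batch size and resolution first.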
Ah, that explains it, I thought it would be a 16GB Tesla.
For anyone on a 3090 just up the batch size to 8 and leave everything else default. Mine can make a 2000x2000 image in 30 seconds
Secret: I have been digging into the topic of upscaling for a month, so the answer is not as simple as "click here, that and there" (sampler and model have a gigantic impact on results, and then you tinker with CFG and denoise).
3090/4090 works best at batch size 9 (learned that from Dreambooth training), but indeed 8 divides nicely, and in my case I use 16, because I can and it's 5% faster. Personally, I go with the following settings.
thank you!
Maybe this is common knowledge but I don't see it mentioned here. Automatic1111 allows you to change the initial noise value. At 0.5 it will barely change the initial image which allows for higher denoise values without substantially changing the image. At 1.5 it will introduce a lot of noise even at low denoise values and requires a high CFG but will add more texture and detail to the resulting image.
10 minutes is absolutely crazy. There must be incorrect params. What's your params?
For example, when you set overlap = 48 and tile size = 96, tile batch size=8, it will consume 2m38s in total.
Getting back to test this more on my end, but from that screenshot alone, I'm already seeing a difference in what my console outputs during the upscaling process - I get multiple progress bars when performing any img2img, maybe there are some settings affecting how Multidiffusion behaves?
Additionally, there's a huge difference too if denoising steps are set to be the same as sampling, which I had on in my previous tests.
Currently, with the latest commit (117bbf1a) I can upscale 512x512 @ 4x with these settings in about 1m30s now:
Of course changing the denoising steps to not be the same as sampling makes a huge difference; you're running fewer steps in total.
Are you saying that I should be getting those posted times even with this enabled? (I believe it's off by default) That's what I wanted to confirm if it should be on or not for the times I get.
Pretty sure, yeah.
It's not a good comparison if you toggle that off, because then you're running a different number of steps with and without multidiffusion.
Your biggest limitation seems to be that you're only running a batch size of 1; setting that to 8 should speed things up drastically (as it'll do 8 small tiles at once rather than sequentially).
Thanks for catching that, 1m30s is for batch size of 8, I probably screencapped before adjusting the settings.
@pkuliyi2015 This looks to have resolved my performance issues. The denoising steps in img2img were resulting in higher than normal times. Once I compensate for that setting (e.g. multiplying the 20 steps by 0.3 to get ~6-7 steps with the setting enabled), I get ~1 minute to render a 4K output.
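The step arithmetic discussed above can be sketched like this. It is a simplified model of what the thread describes, not the WebUI's code: `exact_steps` is a hypothetical flag standing in for the "denoising steps same as sampling" setting, and the rounding rule is an assumption.

```python
def effective_steps(steps: int, denoise: float, exact_steps: bool) -> int:
    """Estimate how many sampling steps an img2img pass actually runs."""
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in (0, 1]")
    if exact_steps:
        # Every step on the slider is executed.
        return steps
    # Otherwise only the denoised fraction of the schedule runs.
    return max(1, round(steps * denoise))


if __name__ == "__main__":
    print(effective_steps(20, 0.3, exact_steps=True))
    print(effective_steps(20, 0.3, exact_steps=False))
```

With 20 steps and denoise 0.3, this gives 20 versus 6 executed steps, which is why timings are only comparable when both runs use the same setting.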
@pkuliyi2015 do you see any use for the new sd2.1 unclip model? if somehow implemented there might be no need for regional prompts..
I know about it and tried their demo right after it was published. But in fact, it is far from satisfying to me.
By the way, I have completed the major update on regional prompt control. Currently, it may be the most competitive extension for training-free compositional large image drawing.
Does Tiled VAE need to be enabled when using multidiffusion? That was not clear to me.
Tiled VAE is not a must, it just saves your VRAM when you want to get larger images. It won't affect your outcomes and is compatible with almost everything (like Highres).
Hello everyone,
I'm fairly new to this whole topic and I was trying to replicate some of your settings in A1111.
I'd like to achieve the detailed upscale looks with ultrasharp but the images just get regularly upscaled like esrgan, no details at all and sometimes even worse,
I only have 4GB Vram, so I can max. upscale to 1024x1024 otherwise the system tells me that I don't have enough memory.
I don't know if there's anything in the settings that I must change.
I have the regular UI, no Multidiffusion or anything like that.
I'd want to just upscale an image that I made with Midjourney, to bring back some details, especially in the face area.
But as I said, img2img or extras didn't do anything for me but just upscale it without any details.
I made a few comparisons with Ultimate upscaler (default settings, CFG 10, DDIM, denoise 0.23) and Mixture of Diffusers.
The original image
vs Denoise 0.23, DDIM:
https://imgsli.com/MTY2Njkw
vs Denoise 0.35 DDIM
https://imgsli.com/MTY2Njkx
vs Denoise 0.35 Euler A - cfg 14
https://imgsli.com/MTY2Njky
MD is good at adding extra details without overcooking the image, so you can go with high denoise and CFG. As far as upscaling goes, though, Ultimate SD upscaler still has less pixelated texture when you zoom in, especially on the hands and face.
Parameters for MD:
Tiled Diffusion upscaler: 4x-UltraSharp, Tiled Diffusion scale factor: 2, Tiled Diffusion: "{'Method': 'Mixture of Diffusers', 'Latent tile width': 64, 'Latent tile height': 64, 'Overlap': 48, 'Tile batch size': 4, 'Upscaler': '4x-UltraSharp', 'Scale factor': 2, 'Keep input size': True}"
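For anyone comparing settings, the quoted "Tiled Diffusion" value above is a Python-style dict literal, so it can be read back programmatically. A minimal sketch, assuming the usual factor of 8 between latent and pixel sizes for Stable Diffusion 1.x/2.x; the variable names are made up for this example.

```python
import ast

# Value of the "Tiled Diffusion" field, copied from the parameters above.
raw = (
    "{'Method': 'Mixture of Diffusers', 'Latent tile width': 64, "
    "'Latent tile height': 64, 'Overlap': 48, 'Tile batch size': 4, "
    "'Upscaler': '4x-UltraSharp', 'Scale factor': 2, 'Keep input size': True}"
)

# literal_eval only accepts plain literals, so it is safe for pasted text.
params = ast.literal_eval(raw)

LATENT_TO_PIXEL = 8  # assumption: one latent cell covers 8x8 pixels

tile_px = params["Latent tile width"] * LATENT_TO_PIXEL
stride_latent = params["Latent tile width"] - params["Overlap"]

print(f"method: {params['Method']}")
print(f"tile size: {tile_px} px")
print(f"stride between tiles: {stride_latent} latent cells")
```

With these settings each tile is 512 px wide and neighbouring tiles are shifted by only 16 latent cells, so the tiles overlap heavily.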
Got very bad results with the recommended settings. Maybe I'm doing something wrong.
I inpainted a bit before upscaling; here is the actual original image if anyone wants to try it out:
https://files.catbox.moe/wek7ed.png
Hm, I've been using the region prompt, and I've noticed that if anything, it seems even worse about concatenating random people on the boundaries--even if there's nothing in the main prompt about people.
This sort of thing is nearly constant:
In many regards, it's performing even worse than just straight generating a 1024x1024 image. 20 images in a batch, and only one didn't have extra people (instead, it duplicated the entire horizon):
Because most models are trained at 512x512, this is more likely to occur the larger the images you try to create. I would first check that your batch size in multidiffusion is 1. Increasing the tile size and/or overlap may decrease the likelihood of this occurring. Try 80x80 with an overlap of 16 and a latent tile batch size of 8, or 96x96 with an overlap of 32, or even 128x128 with an overlap of 64 and a batch size of 4. If that does not fix it, then create the images at 768x768 or even 512x512 and upscale them.
The greater the overlap the more context each tile has from its surrounding tiles.
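To see what these tile and overlap values mean for a 1024x1024 image, here is a rough sketch that counts tiles along one axis. It assumes tile sizes are given in latent units (8 pixels per latent cell) and that tiles are placed with a stride of tile minus overlap; the extension's own tiling code may place tiles differently, and `tiles_along` is a name made up for this example.

```python
import math

LATENT_SCALE = 8  # assumption: one latent cell per 8x8 pixel block


def tiles_along(axis_px: int, tile: int, overlap: int) -> int:
    """Count tiles needed to cover one image axis.

    axis_px is in pixels; tile and overlap are in latent units.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    size = axis_px // LATENT_SCALE
    if tile >= size:
        return 1  # a single tile already covers the whole axis
    stride = tile - overlap
    return math.ceil((size - overlap) / stride)


for tile, overlap in [(80, 16), (96, 32), (128, 64)]:
    n = tiles_along(1024, tile, overlap)
    print(f"tile {tile}, overlap {overlap}: {n} x {n} tiles")
```

Under these assumptions a 1024 px axis is 128 latent cells, so the 80 and 96 settings give a 2 x 2 grid, while a 128 tile covers the whole image with a single tile.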
The point is, the region control is supposed to be used to help avoid such things while composing larger images. This is literally the point of it, and what the demonstration pictures were showing.
Have you tried what I suggested?
You realise your suggestions are totally irrelevant, right?
Like, the point of region prompting is that you can have a larger image (with a background prompt using the usual MD merging), and then a specific foreground region (or regions) that are meant to contain specific things. It's even spelled out on the main page, including that tile size doesn't really matter for this one.
Creating an image at 512 or 768 and upscaling also completely defeats the point, which is that your standard SD-sized generation would only be a component of an image with a different aspect ratio, and not full of body part concatenation.
(I think it might actually be that some quality-related things tend to act as catalysts for drawing people; I'm not sure and I'm going to keep poking away)
Image height is 1024. 8 tiles are in a batch so a height of 128 prevents the problem you are having. You can see my settings in the screenshot.
It doesn't, actually, because 128x128 tiles were what I was using when I was testing, and the concatenation kept happening. I'm pretty sure, after some more tests, that random tokens were actually prompting for people (for some unspeakable reason). Getting to this sort of thing consistently was a matter of changing the prompt settings, not messing with the tiles:
If you check PowerShell or the command line, it shows that at 128x128 MultiDiffusion does not take effect because the image is too small. You have to use a tile width smaller than 128.
Looking at the command line, multidiffusion was doing its thing, probably because of region control. Which, again, to reference the MAIN PAGE FOR THIS REPO, says (with regard to region prompt):
"The tile size parameters become useless; just ignore them"
Seriously, do you think the person maintaining this knows less about how it works than you do?
He is probably referring to it in the context of img2img, not txt2img.
And yes, it is possible to know more about how to use a tool than the person that made it. Musicians are better at their instruments than the people that made them.
That's like saying a guitar player knows more about how an amplifier works.
No. It's like the thing I said. You can't just come up with a different analogy to discredit my first one.
In fact, that is the exact opposite of my original analogy, which is that the artist can utilize the tool better than the creator. It does not imply that the artist has the ability to design or create the tool.
Your analogy was flawed, because I said how it works. The creator of something is more likely to know whether a certain setting actually does anything for a given setting than a user, even if the user is extremely good at it.
Anyway, I did more testing. It was the prompt causing humans to be generated where they really shouldn't be (like the entire half of an image that was only supposed to be scenery) and concatenating things when adjacent. Seriously, it was doing things like this:
or this:
When there was supposed to be only scenery to either side (and obviously nothing was describing those particular people). As I noted, it seems that a lot of tags that describe image quality are actually tied really strongly to generating people.
Thank you for making attempts on this. This is a classical noise pollution problem where the foreground noises triggered the undesirable multi-character change in the background, when your model is not that good for high resolution image generation.
This can be partly mitigated by adding some negative prompts in the background regions. However, this may not solve the problem completely. I am considering a much more powerful merging strategy and a corresponding UI that lets you fuse images better.
You will definitely like it.
It wasn't too bad once there were no triggering tags in the general prompt (only 5 or 6 out of 100), and I got this out of it all with region control and the noise inversion:
But anything that would make for better image composition is great (only about 9 of the 100 had reasonable background coherency).
Hello, I am trying to use MultiDiffusion to place kemono characters in the background, but the checkpoint I am using requires Hires fix and hypernet to be enabled by default, otherwise it will generate humans.
The overall prompt only describes the camera and background, and enables the hypernet. I enter character prompts for the foreground and no prompts for the background. The first few steps of the denoising process generate kemono normally, but in the end, Hires fix transforms the character into a human. I tried reducing the denoising value of Hires fix, but that results in fewer and blurrier image details. Increasing the denoising makes the character more like a human.
I don't know if this is due to an incompatibility between Hires fix and MultiDiffusion, or if the hypernet did not start properly.
I made a trial fix. Please switch to the dev branch and test it. If it works, please let me know promptly.
It doesn't work well. The first image uses MultiDiffusion with Hires fix (denoising = 0.7), while the second image does not use MultiDiffusion.
You can see that using MultiDiffusion generates completely different characters; the third image is a screenshot of the denoising process.
I tried turning off Hires fix when using MultiDiffusion in txt2img and moving the generated blurry image to img2img, but the background details did not increase. To be honest, the image only became higher-definition, whereas Hires fix can add things that were not in the original image.
I also tried the other three models proposed by the checkpoint author, none of which require Hypernet to be enabled. However, two of these models also had the problem of the character changing when both Hires fix and MultiDiffusion were enabled, while the other model was able to generate kemono characters normally.
If you are interested, the model address is below.
https://civitai.com/models/11888?modelVersionId=32830
I have verified that the model that works normally with MultiDiffusion is crossfemono2.0, while the models that do not are G, G2, F, and D.
"RuntimeError: Invalid buffer size: 6.89 GB" How to solve it?
I get 'min and input tensors must be of the same shape' with Tiled VAE.
Can the author show how to generate a realistic-style Qingming River painting step by step in the interface? That would make it easier for me to understand Tiled Diffusion, region prompts, and drawing full-canvas backgrounds. Thank you very much.
To those of you asking questions on a closed discussion, you need to take some lessons from an old master at the art of asking questions online.
Is there a setting that works with Intel 16-inch high-end model with 16g of RAM and AMD Radeon Pro 5500M with 6g of vram?
And is there a distinction between Python and PyTorch versions that work? Currently, the desired image size cannot be created in Python 3.10.12 and PyTorch Nightly 2.1.0. If R-ESRGAN 4x+ scale exceeds 1.7 in 512 size, cmd will exit with an mps shortage error.
I followed the settings as described in the description, but it fails.