img2img pipeline
realimposter opened this issue · 3 comments
Any chance you can add an example of using ControlNet with img2img to the Colab doc (without inpainting)?
I followed the instructions and tried adding the StableDiffusionControlNetInpaintImg2ImgPipeline class, without any luck:
```python
import torch
from diffusers.utils import load_image
from diffusers import StableDiffusionInpaintPipeline, StableDiffusionControlNetInpaintImg2ImgPipeline, ControlNetModel
# we have downloaded models locally, you can also load from huggingface
# control_openjourney-v2_depth was converted from a ControlNet .safetensors checkpoint using the instructions above
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
pipe_control = StableDiffusionControlNetInpaintImg2ImgPipeline.from_pretrained(
    "C:/Users/User/Desktop/cnet_img2img/diffusers/control_openjourney-v2_depth",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe_inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "prompthero/openjourney-v2", torch_dtype=torch.float16
).to("cuda")
# yes, we can directly replace the UNet
pipe_control.unet = pipe_inpaint.unet
pipe_control.unet.in_channels = 4
# we also use the same example as stable-diffusion-inpainting
image = load_image("https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png")
mask = load_image("https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png")
# the depth map is generated from https://huggingface.co/spaces/hysts/ControlNet
control_image = load_image("https://raw.githubusercontent.com/haofanwang/ControlNet-for-Diffusers/main/images/desk_depth.png")
image = pipe_control(prompt="Face of a yellow cat, high resolution, sitting on a park bench",
negative_prompt="lowres, bad anatomy, worst quality, low quality",
controlnet_hint=control_image,
image=image,
mask_image=mask,
width=448,
height=640,
num_inference_steps=100).images[0]
image.save("inpaint_seg.jpg")
```
This gives me the error:
```Incorrect configuration settings! The config of `pipeline.unet`: FrozenDict([('sample_size', 64), ('in_channels', 4), ('out_channels', 4), ('center_input_sample', False), ('flip_sin_to_cos', True), ('freq_shift', 0), ('down_block_types', ['CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D']), ('mid_block_type', 'UNetMidBlock2DCrossAttn'), ('up_block_types', ['UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D']), ('only_cross_attention', False), ('block_out_channels', [320, 640, 1280, 1280]), ('layers_per_block', 2), ('downsample_padding', 1), ('mid_block_scale_factor', 1), ('act_fn', 'silu'), ('norm_num_groups', 32), ('norm_eps', 1e-05), ('cross_attention_dim', 768), ('attention_head_dim', 8), ('dual_cross_attention', False), ('use_linear_projection', False), ('class_embed_type', None), ('num_class_embeds', None), ('upcast_attention', False), ('resnet_time_scale_shift', 'default'), ('time_embedding_type', 'positional'), ('timestep_post_act', None), ('time_cond_proj_dim', None), ('conv_in_kernel', 3), ('conv_out_kernel', 3), ('projection_class_embeddings_input_dim', None), ('_class_name', 'UNet2DConditionModel'), ('_diffusers_version', '0.11.0.dev0'), ('_name_or_path', 'C:\\Users\\User\\.cache\\huggingface\\hub\\models--prompthero--openjourney-v2\\snapshots\\32e0aa8629c1d5ed82ff19f1543017bceb5f84d6\\unet')]) expects 4 but received `num_channels_latents`: 4 + `num_channels_mask`: 1 + `num_channels_masked_image`: 4 = 9. Please verify the config of `pipeline.unet` or your `mask_image` or `image` input.```
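A note on the error (my reading, not something confirmed in the thread): the inpaint-img2img code path concatenates 4 noisy-latent channels, 1 mask channel, and 4 masked-image-latent channels into a 9-channel UNet input, while openjourney-v2 is a plain text-to-image model whose UNet only takes 4; overwriting `unet.in_channels` just edits the config value the pipeline checks and does not change the first conv layer. A quick way to see the difference is to inspect `conv_in` directly; the sketch below pulls in `runwayml/stable-diffusion-inpainting` (not used anywhere above) purely as an example of a 9-channel inpainting UNet.

```python
import torch
from diffusers import UNet2DConditionModel

# Plain text-to-image UNet (openjourney-v2): conv_in accepts 4 channels.
unet_txt2img = UNet2DConditionModel.from_pretrained(
    "prompthero/openjourney-v2", subfolder="unet", torch_dtype=torch.float16
)
print(unet_txt2img.conv_in.in_channels)  # 4

# Inpainting UNet (example checkpoint, not part of the snippet above):
# conv_in accepts 9 channels = 4 noisy latents + 1 mask + 4 masked-image latents.
unet_inpaint = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-inpainting", subfolder="unet", torch_dtype=torch.float16
)
print(unet_inpaint.conv_in.in_channels)  # 9
```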
It should be easy to implement if you are familiar with diffusers pipelines; I just don't want to make this project too redundant. I can share some guidance soon when I have free time!
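In the meantime, here is a minimal sketch of ControlNet img2img without inpainting. It assumes a diffusers release that ships the upstream `StableDiffusionControlNetImg2ImgPipeline` (not the custom pipeline classes used above), so the exact argument names may differ from the code in this repo:

```python
import torch
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "prompthero/openjourney-v2", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

init_image = load_image("https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png")
control_image = load_image("https://raw.githubusercontent.com/haofanwang/ControlNet-for-Diffusers/main/images/desk_depth.png")

image = pipe(
    prompt="Face of a yellow cat, high resolution, sitting on a park bench",
    negative_prompt="lowres, bad anatomy, worst quality, low quality",
    image=init_image,             # init image to be re-drawn (img2img)
    control_image=control_image,  # depth map that conditions the ControlNet
    strength=0.8,                 # lower = stay closer to the init image
    num_inference_steps=50,
).images[0]
image.save("img2img_depth.jpg")
```

Depending on the version, you may need to resize the control image to the same resolution as the init image, and `controlnet_conditioning_scale` can be used to tune how strongly the depth map constrains the result.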
Thanks so much! I'm new to diffusers and have been struggling a bit to figure it out.
@haofanwang I'd also really like to know how to do this. I don't really get why I can't prepare my own latents for the pipeline's `latents=` option by using the VAE to encode an init image the way the img2img or inpaint pipelines do; I keep getting mismatched tensor sizes when the pipeline tries to add the ControlNet output to the sample in the timestep loop.
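For reference, the img2img pipelines derive their starting latents roughly like the sketch below (a sketch only: `pipe`, a normalized `(1, 3, H, W)` init tensor in `[-1, 1]`, `strength`, and `num_inference_steps` are assumed to already exist, and 0.18215 is the SD 1.x VAE scaling factor). One common cause of mismatched tensor sizes inside the ControlNet timestep loop is an init image whose height/width do not match the control hint, since the ControlNet residuals are produced at a fixed spatial size before being added to the latents.

```python
import torch

device = pipe.device
init_tensor = init_tensor.to(device=device, dtype=pipe.vae.dtype)

# 1. Encode the init image into latent space and apply the SD 1.x scaling factor.
latents = pipe.vae.encode(init_tensor).latent_dist.sample() * 0.18215

# 2. Pick a starting point in the noise schedule from `strength`.
pipe.scheduler.set_timesteps(num_inference_steps, device=device)
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)
timesteps = pipe.scheduler.timesteps[t_start:]

# 3. Noise the encoded latents up to that timestep; the denoising loop then
#    runs only over `timesteps` instead of the full schedule.
noise = torch.randn_like(latents)
latents = pipe.scheduler.add_noise(latents, noise, timesteps[:1])
```

Passing such latents through a txt2img-style `latents=` argument is not equivalent, because that code path still runs the full schedule (and typically multiplies the provided latents by `scheduler.init_noise_sigma`); the img2img pipelines also shorten the timestep loop as in step 2.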