Enhancement Suggestions: Mask for Stable Diffusion, Dynamic Resizing Based on VRAM, and Clarification on Diffuser Integration
wladradchenko opened this issue · 12 comments
Hello,
Today I found out about your project and have been using it with the example config files, and I have some feedback and suggestions for possible improvements.
Mask for Stable Diffusion:
I'm curious whether there is a way to use masks so that Stable Diffusion only inpaints a target object. It would be a beneficial feature, especially for more complex scenarios.
Dynamic Resizing Based on VRAM:
The README mentions a requirement of 24GB VRAM, which might not be feasible for all users.
Suggestion: Offer a way to control the resolution/size dynamically based on the user's available GPU VRAM. This way, users with lower VRAM can still utilize the project without running into memory issues.
As a reference, I've implemented an example function in the project to dynamically resize images based on available VRAM. Here's a snippet for reference:
```python
import torch

device = torch.device('cuda')
# get max limit target size: available VRAM in GB
gpu_vram = torch.cuda.get_device_properties(device).total_memory / (1024 ** 3)
# table of gpu memory: minimum VRAM (GB) -> maximum resolution
gpu_table = {19: 1280, 7: 640, 6: 512, 2: 320}
# get the largest resolution whose VRAM requirement fits the user's GPU
target_size = max(val for key, val in gpu_table.items() if key <= gpu_vram)
```
and then I use this resize in video_util.py:
```python
import math

import cv2


def resize_image(input_image, target_size, use_limit=True):
    H, W, C = input_image.shape
    # Calculate aspect ratio
    aspect_ratio = W / H
    if use_limit:
        if H > W:
            coeff = math.ceil(H / 64)
        else:
            coeff = math.ceil(W / 64)
        # If the frame is already smaller than the limit, do not upscale it:
        # cap the target at the frame's own 64-aligned size
        if H < target_size and W < target_size:
            target_size = 64 * coeff
            if H < W and target_size > W:
                target_size = W
            elif W > H and target_size > H:
                target_size = H
    if H < W:
        W_new = target_size
        H_new = int(target_size / aspect_ratio)
    else:
        H_new = target_size
        W_new = int(aspect_ratio * target_size)
    # Ensure dimensions are divisible by 64
    H_new = math.ceil(H_new / 64) * 64
    W_new = math.ceil(W_new / 64) * 64
    img = cv2.resize(input_image, (W_new, H_new), interpolation=cv2.INTER_LINEAR)
    return img


def prepare_frames(input_path: str, output_dir: str, target_size: int, crop):
    def crop_func(frame):
        return resize_image(frame, target_size)

    video_to_frame(input_path, output_dir, '%04d.png', False, crop_func)
```
Note: I've tested this on an RTX 3090 with 8 GB of VRAM, with torch==2.0.1 (CUDA 11.8) and xformers==0.0.21, and it seems to work as intended.
Clarification on Diffuser Integration:
The project mentions integration with diffusers. Does this mean that ebsynth will be packaged as a library in the virtual environment? And if the project is moved into diffusers, will it be possible to use `from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline, UniPCMultistepScheduler` without cloning the ControlNet repo? Some clarity on this would be appreciated.
Thanks for the hard work on this project, and I look forward to future updates!
Thank you for your suggestions.
Mask for Stable Diffusion
This function is easy to implement based on blended latent diffusion.
Given a mask `M` and an encoded original video `x0_ori`, you just need to add one line

```python
img = img * M + (1. - M) * self.model.q_sample(x0_ori, ts)
```
after
Rerender_A_Video/src/ddim_v_hacked.py, lines 322 to 323 (commit fcb7431).
You can apply blending on specific steps or on all steps to find better results.
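For illustration, here is a minimal, self-contained sketch of that blend (the names `model`, `img`, `x0_ori`, and `ts` are stand-ins for the corresponding objects in ddim_v_hacked.py, and the mask is assumed to already be at latent resolution):

```python
import torch


def blend_with_original(model, img, x0_ori, M, ts):
    """Blended-latent-diffusion step: keep the edit only inside the mask.

    M is a 0/1 tensor broadcastable to img; 1 marks the region to edit,
    0 marks the region that should stay identical to the original video.
    """
    # Re-noise the original latent to the current timestep, then paste it
    # back into the unmasked region of the partially denoised latent.
    noised_original = model.q_sample(x0_ori, ts)
    return img * M + (1. - M) * noised_original
```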
The difficulty lies in how to obtain the masks in a video.
It requires tracking the target object, which is out of scope of this project.
Dynamic Resizing Based on VRAM
Thank you for your suggestions! Maybe you can submit a pull request?
Clarification on Diffuser Integration
Ebsynth will not be integrated into diffusers.
We only integrate the diffusion part (key frame translation) into diffusers,
and will propose a new pipeline, something like StableDiffusionRerenderPipeline,
combining inpainting and ControlNet.
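For context on the diffusers side, the inpainting + ControlNet combination mentioned in the question can already be assembled from public diffusers APIs, roughly like this (a sketch only; the model IDs and parameters are illustrative, not the project's final StableDiffusionRerenderPipeline):

```python
import torch
from diffusers import (ControlNetModel,
                       StableDiffusionControlNetInpaintPipeline,
                       UniPCMultistepScheduler)

# Illustrative checkpoints; the actual pipeline may use different weights.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

# init_image, mask_image and canny_image are PIL images of the same size,
# prepared by the caller (placeholders here).
result = pipe(prompt="a watercolor painting of a dog",
              image=init_image,
              mask_image=mask_image,
              control_image=canny_image,
              num_inference_steps=20).images[0]
```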
I went through your recommendations for implementing inpainting using a mask, and I have a couple of questions:
- From my observation of the code, it seems that x0_ori is equivalent to x0. Should there be a distinction between the two, or are they intended to be the same?
- Could you specify the expected format of the mask M? I currently have two versions of the mask image: one where the values are either True or False, and another where it is a tensor. Which version is more suitable for the code? And what should the mask look like in an example file?
(Two example mask images attached: Example 1 with a transparent background, Example 2 with a black background.)
- Given that I have a mask image at "Downloads/test1.png", how can I load and preprocess it into the required format for M? If I use `img = img * M + (1. - M) * self.model.q_sample(x0_ori, ts)`, I understand that M needs the same format as img for the element-wise multiplication; img is a tensor with 4 dimensions, for example (1, 4, 72, 64), so M must have been processed beforehand to get the correct format.
> From my observation of the code, it seems that x0_ori is equivalent to x0. Should there be a distinction between the two, or are they intended to be the same?
Yes, they are the same.
> Could you specify the expected format of the mask M? I currently have two versions of the mask image: one where the values are either True or False, and another where it is a tensor. Which version is more suitable for the code? And what should the mask look like in an example file?
The mask should be a tensor with values of 1 and 0, corresponding to your example 2.
> Given that I have a mask image at "Downloads/test1.png", how can I load and preprocess it into the required format for M? If I use `img = img * M + (1. - M) * self.model.q_sample(x0_ori, ts)`, I understand that M needs the same format as img for the element-wise multiplication; img is a tensor with 4 dimensions, for example (1, 4, 72, 64), so M must have been processed beforehand to get the correct format.
The size of the mask should be [1, 1, 72, 64] if img is of shape [1, 4, 72, 64].
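For example, one way to get a black-and-white mask file into that shape (a sketch; the path and the 72x64 latent size are placeholders taken from the example above):

```python
import cv2
import torch
import torch.nn.functional as F

# Load the mask, binarize it to 0/1, and resize it to the latent resolution,
# producing a [1, 1, 72, 64] tensor that broadcasts against a [1, 4, 72, 64] latent.
mask = cv2.imread("Downloads/test1.png", cv2.IMREAD_GRAYSCALE)  # uint8, 0..255
M = torch.from_numpy(mask).float().div(255.0)                   # float, 0..1
M = (M > 0.5).float()[None, None]                               # binarize -> [1, 1, H, W]
M = F.interpolate(M, size=(72, 64), mode="nearest").cuda()      # nearest keeps it exactly 0/1
```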
You need to modify
Rerender_A_Video/src/ddim_v_hacked.py, line 158 (commit fcb7431)
and
Rerender_A_Video/src/ddim_v_hacked.py, line 239 (commit fcb7431)
to add a new input parameter named `M`, load `M` when you load the video frames (which looks like `frame = cv2.imread(imgs[i])` in the code), and feed `M` into `ddim_v_sampler.sample(...)` in the code.
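A rough sketch of what that threading could look like is below; the class and method signatures here are simplified placeholders, not the actual ones in ddim_v_hacked.py:

```python
import torch


class DDIMVSamplerSketch:
    """Placeholder sampler: only the mask plumbing is meaningful here."""

    def sample(self, steps, cond, shape, x0=None, inpaint_mask=None):
        # forward the new argument down to the sampling loop
        return self.ddim_sampling(cond, shape, x0=x0, inpaint_mask=inpaint_mask)

    def ddim_sampling(self, cond, shape, x0=None, inpaint_mask=None):
        img = torch.randn(shape, device="cuda")
        for i, ts in enumerate(self.timesteps):       # self.timesteps: the DDIM schedule
            img = self.p_sample_ddim(img, cond, ts)   # one ordinary denoising step
            if inpaint_mask is not None and x0 is not None:
                # blended-latent-diffusion step: lock the unmasked region to the original
                img = img * inpaint_mask + (1. - inpaint_mask) * self.model.q_sample(x0, ts)
        return img


# The mask itself would be loaded per frame next to `frame = cv2.imread(imgs[i])`
# and handed to the sampler, e.g. ddim_v_sampler.sample(..., inpaint_mask=inpaint_mask).
```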
Thank you very much. I have the same need. Can this functionality be added to the project, requiring only file input parameters? As for the mask information, we can export it as a video or an image sequence from professional software.
> I have the same need. Can this functionality be added to the project, requiring only file input parameters?
I have created this pull request; if the project owner approves the new changes, then the project can be run on devices with less VRAM.
> I have the same need. Can this functionality be added to the project, requiring only file input parameters?
> I have created this pull request; if the project owner approves the new changes, then the project can be run on devices with less VRAM.
The pull request is under review by @SingleZombie. We are under other deadline pressure, so it will take time.
I think @lymanzhao's need here is the masking functionality. However, I'm busy with another project and have no time to add masking functionality recently.
I examined the results of applying a mask in the project and noticed that while the intended region within the mask was altered significantly, the area outside the mask also underwent slight changes. How can we address this to ensure only the masked region is affected?
(Attached images: Result | Mask | Original.)
I have these params in the config:
interval = 10
control_strength = 0.7
loose_cfattn = True
freeu_args = (1, 1, 1, 1)
use_limit_device_resolution = True
control_type = "canny"
canny_low = 50
canny_high = 100
scale = 8.5
x0_strength = 0.96
style_update_freq = 10
mask_strength = 0.5
color_preserve = True
mask_period = (0.5, 0.8)
inner_strength = 0.9
cross_period = (0, 1)
ada_period = (0.8, 1)
warp_period = (0, 0.1)
smooth_boundary = True
Code in Rerender_A_Video/src/ddim_v_hacked.py line 323:
```python
if inpaint_mask is not None:
    img = img * inpaint_mask + (1. - inpaint_mask) * self.model.q_sample(x0, ts)
```
I read the mask in as 0s and 1s:

```python
import cv2
import torch

inpaint_mask_path = "/home/user/Downloads/test1.png"
# Read the mask image
inpaint_mask_frame = cv2.imread(inpaint_mask_path, cv2.IMREAD_GRAYSCALE)
# Binarize the image
_, binary_mask = cv2.threshold(inpaint_mask_frame, 128, 1, cv2.THRESH_BINARY)
# Resize to the desired shape (`shape` is the latent shape from the surrounding code)
resized_mask = cv2.resize(binary_mask, (shape[2], shape[1]))
# Convert to a tensor and add batch/channel dimensions
inpaint_mask = torch.tensor(resized_mask, dtype=torch.float32).unsqueeze(0).unsqueeze(0).cuda()
```
Currently I am running a 1280x720 video.
It is using the 24 GB of an RTX 3090 Ti plus 12 GB of shared VRAM.
Is this expected?
default settings
> mask_period = (0.5, 0.8)
> if inpaint_mask is not None: img = img * inpaint_mask + (1. - inpaint_mask) * self.model.q_sample(x0, ts)
@wladradchenko Did you add your code inside the `if` block at
Rerender_A_Video/src/ddim_v_hacked.py, line 309 (commit fcb7431)
or the `if` block at
Rerender_A_Video/src/ddim_v_hacked.py, line 315 (commit fcb7431)?
If so, the original x0 will only be applied during the mask_period = (0.5, 0.8) steps.
To ensure consistency, you can add a hyperparameter like inpainting_mask_period = (0, 1) that keeps your added code applied until the final step (the begin step 0 can be tuned). That means in line 324, outside the `if` block, you can add:

```python
if i >= inpainting_mask_period[0] * total_steps and i <= inpainting_mask_period[1] * total_steps:
    img = img * inpaint_mask + (1. - inpaint_mask) * self.model.q_sample(x0, ts)
```
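For concreteness, here is how such a period maps to actual step indices (a toy example with 20 steps; the variable names mirror the snippet above):

```python
total_steps = 20
for inpainting_mask_period in [(0.5, 0.8), (0, 1)]:
    active = [i for i in range(total_steps)
              if inpainting_mask_period[0] * total_steps <= i <= inpainting_mask_period[1] * total_steps]
    print(inpainting_mask_period, "->", active)
# (0.5, 0.8) -> steps 10..16: the unmasked region is only re-locked on those steps
# (0, 1)     -> every step: the unmasked region stays identical to the original video
```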
Also, the original blended latent diffusion paper further optimizes the decoder to fit the input image in the unmasked region, which is time-consuming for video processing.
@williamyang1991 Thank you. It works fine now.
Regarding the suggested dynamic sizing by VRAM, @FurkanGozukara: with 24 GB of VRAM the resolution limit is 1280x1280 (for that video it would be 1280x720) if the option use_limit_device_resolution = True is set
and the user chooses a resolution greater than the limit for their VRAM. With 12 GB of VRAM the limit is 640x640 (for that video it would be 640x384). I don't have data on how the limit should behave for more than 24 GB of VRAM.
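As a worked example (a sketch reusing the gpu_table and resize logic from my earlier snippet), this is where the 1280x720 and 640x384 numbers come from:

```python
gpu_table = {19: 1280, 7: 640, 6: 512, 2: 320}  # minimum VRAM (GB) -> resolution limit


def vram_limit(gpu_vram_gb):
    # largest resolution whose VRAM requirement fits the available VRAM
    return max(val for key, val in gpu_table.items() if key <= gpu_vram_gb)


print(vram_limit(24))  # 1280 -> a 1280x720 video is kept at 1280x720
print(vram_limit(12))  # 640  -> a 1280x720 video is resized to 640x384
```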
> with 24 GB of VRAM the resolution limit is 1280x1280
So 24 GB is working for you at 1280x1280?
How is that possible? It is using 24 GB + 12 GB shared on Windows for me at 1280x720.
What is your pip freeze?
> How is that possible? It is using 24 GB + 12 GB shared on Windows for me at 1280x720.
I use CUDA 11.8, xformers, and torch 2.0.0 built for CUDA 11.8. I also have cuDNN. I don't use pip freeze because I think it is bad practice: I pin only the libraries whose versions need to be fixed and I don't include sub-dependencies. And since I have experimented with different approaches, my pip freeze would not be representative anyway.
I would also like to note that we are talking about VRAM, not RAM. If you have multiple GPUs in a machine, you can select a GPU via the device, for example cuda:0.
That means you need less VRAM per card if one model is placed on the first GPU and another model is loaded on the second GPU. I don't quite understand what "shared VRAM" means here.