Enhancement Suggestions: Mask for Stable Diffusion, Dynamic Resizing Based on VRAM, and Clarification on Diffuser Integration
wladradchenko opened this issue · 12 comments
Hello,
Today I found out about your project and have been using it with the example config files, and I have some feedback and suggestions for possible improvements.
Mask for Stable Diffusion:
I'm curious whether there is a way to use masks so that Stable Diffusion only inpaints a target object. It would be a beneficial feature, especially for more complex scenarios.
Dynamic Resizing Based on VRAM:
The README mentions a requirement of 24GB VRAM, which might not be feasible for all users.
Suggestion: Offer a way to control the resolution/size dynamically based on the user's available GPU VRAM. This way, users with lower VRAM can still utilize the project without running into memory issues.
As a reference, I've implemented an example function in the project to dynamically resize images based on available VRAM. Here's a snippet for reference:
```python
import torch

device = torch.device('cuda')
# get max limit target size: available VRAM in GB
gpu_vram = torch.cuda.get_device_properties(device).total_memory / (1024 ** 3)
# table of gpu memory: minimum VRAM (GB) -> maximum resolution
gpu_table = {19: 1280, 7: 640, 6: 512, 2: 320}
# get the largest resolution whose VRAM requirement fits the user's GPU
target_size = max(val for key, val in gpu_table.items() if key <= gpu_vram)
```
and then I use this resize in video_util.py:
```python
import math

import cv2


def resize_image(input_image, target_size, use_limit=True):
    H, W, C = input_image.shape
    # Calculate aspect ratio
    aspect_ratio = W / H
    if use_limit:
        if H > W:
            coeff = math.ceil(H / 64)
        else:
            coeff = math.ceil(W / 64)
        # If the frame is already smaller than the limit, do not upscale it:
        # cap the target at the frame's own 64-aligned size
        if H < target_size and W < target_size:
            target_size = 64 * coeff
            if H < W and target_size > W:
                target_size = W
            elif W > H and target_size > H:
                target_size = H
    if H < W:
        W_new = target_size
        H_new = int(target_size / aspect_ratio)
    else:
        H_new = target_size
        W_new = int(aspect_ratio * target_size)
    # Ensure dimensions are divisible by 64
    H_new = math.ceil(H_new / 64) * 64
    W_new = math.ceil(W_new / 64) * 64
    img = cv2.resize(input_image, (W_new, H_new), interpolation=cv2.INTER_LINEAR)
    return img


def prepare_frames(input_path: str, output_dir: str, target_size: int, crop):
    def crop_func(frame):
        return resize_image(frame, target_size)

    video_to_frame(input_path, output_dir, '%04d.png', False, crop_func)
```
Note: I've tested this on an RTX 3090 with 8 GB of VRAM, with torch==2.0.1 (CUDA 11.8) and xformers==0.0.21, and it seems to work as intended.
Clarification on Diffuser Integration:
The project mentions integration with diffusers. Does this mean that ebsynth will be packaged as a library in the virtual environment? And if the project is moved into diffusers, will it be possible to use `from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline, UniPCMultistepScheduler` without cloning the ControlNet repo? Some clarity on this would be appreciated.
Thanks for the hard work on this project, and I look forward to future updates!
Thank you for your suggestions.
Mask for Stable Diffusion
This function is easy to implement based on blended latent diffusion.
Given a mask `M` and an encoded original video `x0_ori`, you just need to add one line

```python
img = img * M + (1. - M) * self.model.q_sample(x0_ori, ts)
```
after
Rerender_A_Video/src/ddim_v_hacked.py, lines 322 to 323 (commit fcb7431).
You can apply blending on specific steps or on all steps to find better results.
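For illustration, here is a minimal, self-contained sketch of that blend (the names `model`, `img`, `x0_ori`, and `ts` are stand-ins for the corresponding objects in ddim_v_hacked.py, and the mask is assumed to already be at latent resolution):

```python
import torch


def blend_with_original(model, img, x0_ori, M, ts):
    """Blended-latent-diffusion step: keep the edit only inside the mask.

    M is a 0/1 tensor broadcastable to img; 1 marks the region to edit,
    0 marks the region that should stay identical to the original video.
    """
    # Re-noise the original latent to the current timestep, then paste it
    # back into the unmasked region of the partially denoised latent.
    noised_original = model.q_sample(x0_ori, ts)
    return img * M + (1. - M) * noised_original
```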
The difficulty lies in how to obtain the masks in a video.
It requires tracking the target object, which is out of scope of this project.
Dynamic Resizing Based on VRAM
Thank you for your suggestions! Maybe you can submit a pull request?
Clarification on Diffuser Integration
Ebsynth will not be integrated into diffusers.
We only integrate the diffusion part (key frame translation) into diffusers,
and will propose a new pipeline, something like StableDiffusionRerenderPipeline,
combining inpainting and ControlNet.
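For context on the diffusers side, the inpainting + ControlNet combination mentioned in the question can already be assembled from public diffusers APIs, roughly like this (a sketch only; the model IDs and parameters are illustrative, not the project's final StableDiffusionRerenderPipeline):

```python
import torch
from diffusers import (ControlNetModel,
                       StableDiffusionControlNetInpaintPipeline,
                       UniPCMultistepScheduler)

# Illustrative checkpoints; the actual pipeline may use different weights.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

# init_image, mask_image and canny_image are PIL images of the same size,
# prepared by the caller (placeholders here).
result = pipe(prompt="a watercolor painting of a dog",
              image=init_image,
              mask_image=mask_image,
              control_image=canny_image,
              num_inference_steps=20).images[0]
```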
I went through your recommendations for implementing inpainting using a mask, and I have a couple of questions:
- From my observation of the code, it seems that x0_ori is equivalent to x0. Should there be a distinction between the two, or are they intended to be the same?
- Could you specify the expected format of the mask M? I currently have two versions of the mask image: one where the values are either True or False, and another where it is a tensor. Which version is more suitable for the code? And what should the mask look like in an example file?
(Two example mask images attached: Example 1 with a transparent background, Example 2 with a black background.)
- Given that I have a mask image at "Downloads/test1.png", how can I load and preprocess it into the required format for M? If I use `img = img * M + (1. - M) * self.model.q_sample(x0_ori, ts)`, I understand that M needs the same format as img for the element-wise multiplication; img is a tensor with 4 dimensions, for example (1, 4, 72, 64), so M must have been processed beforehand to get the correct format.
> From my observation of the code, it seems that x0_ori is equivalent to x0. Should there be a distinction between the two, or are they intended to be the same?
Yes, they are the same.
> Could you specify the expected format of the mask M? I currently have two versions of the mask image: one where the values are either True or False, and another where it is a tensor. Which version is more suitable for the code? And what should the mask look like in an example file?
The mask should be a tensor with values of 1 and 0, corresponding to your example 2.
> Given that I have a mask image at "Downloads/test1.png", how can I load and preprocess it into the required format for M? If I use `img = img * M + (1. - M) * self.model.q_sample(x0_ori, ts)`, I understand that M needs the same format as img for the element-wise multiplication; img is a tensor with 4 dimensions, for example (1, 4, 72, 64), so M must have been processed beforehand to get the correct format.
The size of the mask should be [1, 1, 72, 64] if img is of shape [1, 4, 72, 64].
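For example, one way to get a black-and-white mask file into that shape (a sketch; the path and the 72x64 latent size are placeholders taken from the example above):

```python
import cv2
import torch
import torch.nn.functional as F

# Load the mask, binarize it to 0/1, and resize it to the latent resolution,
# producing a [1, 1, 72, 64] tensor that broadcasts against a [1, 4, 72, 64] latent.
mask = cv2.imread("Downloads/test1.png", cv2.IMREAD_GRAYSCALE)  # uint8, 0..255
M = torch.from_numpy(mask).float().div(255.0)                   # float, 0..1
M = (M > 0.5).float()[None, None]                               # binarize -> [1, 1, H, W]
M = F.interpolate(M, size=(72, 64), mode="nearest").cuda()      # nearest keeps it exactly 0/1
```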
You need to modify
Rerender_A_Video/src/ddim_v_hacked.py, line 158 (commit fcb7431)
and
Rerender_A_Video/src/ddim_v_hacked.py, line 239 (commit fcb7431)
to add a new input parameter named `M`, load `M` when you load the video frames (which looks like `frame = cv2.imread(imgs[i])` in the code), and feed `M` into `ddim_v_sampler.sample(...)` in the code.
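A rough sketch of what that threading could look like is below; the class and method signatures here are simplified placeholders, not the actual ones in ddim_v_hacked.py:

```python
import torch


class DDIMVSamplerSketch:
    """Placeholder sampler: only the mask plumbing is meaningful here."""

    def sample(self, steps, cond, shape, x0=None, inpaint_mask=None):
        # forward the new argument down to the sampling loop
        return self.ddim_sampling(cond, shape, x0=x0, inpaint_mask=inpaint_mask)

    def ddim_sampling(self, cond, shape, x0=None, inpaint_mask=None):
        img = torch.randn(shape, device="cuda")
        for i, ts in enumerate(self.timesteps):       # self.timesteps: the DDIM schedule
            img = self.p_sample_ddim(img, cond, ts)   # one ordinary denoising step
            if inpaint_mask is not None and x0 is not None:
                # blended-latent-diffusion step: lock the unmasked region to the original
                img = img * inpaint_mask + (1. - inpaint_mask) * self.model.q_sample(x0, ts)
        return img


# The mask itself would be loaded per frame next to `frame = cv2.imread(imgs[i])`
# and handed to the sampler, e.g. ddim_v_sampler.sample(..., inpaint_mask=inpaint_mask).
```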
Thank you very much. I have the same need. Can this functionality be added to the project, requiring only file input parameters? As for the mask information, we can export it as a video or an image sequence from professional software.
> I have the same need. Can this functionality be added to the project, requiring only file input parameters?
I have created this pull request; if the project owner approves the new changes, then the project can be run on devices with less VRAM.
> I have the same need. Can this functionality be added to the project, requiring only file input parameters?
> I have created this pull request; if the project owner approves the new changes, then the project can be run on devices with less VRAM.
The pull request is under review by @SingleZombie. We are under other deadline pressure, so it will take time.
I think @lymanzhao's need here is the masking functionality. However, I'm busy with another project and have no time to add masking functionality recently.
I examined the results of applying a mask in the project and noticed that while the intended region within the mask was altered significantly, the area outside the mask also underwent slight changes. How can we address this to ensure only the masked region is affected?
(Attached images: Result | Mask | Original.)
I have these params in the config:
interval = 10
control_strength = 0.7
loose_cfattn = True
freeu_args = (1, 1, 1, 1)
use_limit_device_resolution = True
control_type = "canny"
canny_low = 50
canny_high = 100
scale = 8.5
x0_strength = 0.96
style_update_freq = 10
mask_strength = 0.5
color_preserve = True
mask_period = (0.5, 0.8)
inner_strength = 0.9
cross_period = (0, 1)
ada_period = (0.8, 1)
warp_period = (0, 0.1)
smooth_boundary = True
Code in Rerender_A_Video/src/ddim_v_hacked.py line 323:
```python
if inpaint_mask is not None:
    img = img * inpaint_mask + (1. - inpaint_mask) * self.model.q_sample(x0, ts)
```
I read the mask in as 0s and 1s:

```python
import cv2
import torch

inpaint_mask_path = "/home/user/Downloads/test1.png"
# Read the mask image
inpaint_mask_frame = cv2.imread(inpaint_mask_path, cv2.IMREAD_GRAYSCALE)
# Binarize the image
_, binary_mask = cv2.threshold(inpaint_mask_frame, 128, 1, cv2.THRESH_BINARY)
# Resize to the desired shape (`shape` is the latent shape from the surrounding code)
resized_mask = cv2.resize(binary_mask, (shape[2], shape[1]))
# Convert to a tensor and add batch/channel dimensions
inpaint_mask = torch.tensor(resized_mask, dtype=torch.float32).unsqueeze(0).unsqueeze(0).cuda()
```
Currently I am running a 1280x720 video.
It is using the 24 GB of an RTX 3090 Ti plus 12 GB of shared VRAM.
Is this expected?
default settings
> mask_period = (0.5, 0.8)
> if inpaint_mask is not None: img = img * inpaint_mask + (1. - inpaint_mask) * self.model.q_sample(x0, ts)
@wladradchenko Did you add your code inside the `if` block at
Rerender_A_Video/src/ddim_v_hacked.py, line 309 (commit fcb7431)
or the `if` block at
Rerender_A_Video/src/ddim_v_hacked.py, line 315 (commit fcb7431)?
If so, the original x0 will only be applied during the mask_period = (0.5, 0.8) steps.
To ensure consistency, you can add a hyperparameter like inpainting_mask_period = (0, 1) that keeps your added code applied until the final step (the begin step 0 can be tuned). That means in line 324, outside the `if` block, you can add:

```python
if i >= inpainting_mask_period[0] * total_steps and i <= inpainting_mask_period[1] * total_steps:
    img = img * inpaint_mask + (1. - inpaint_mask) * self.model.q_sample(x0, ts)
```
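For concreteness, here is how such a period maps to actual step indices (a toy example with 20 steps; the variable names mirror the snippet above):

```python
total_steps = 20
for inpainting_mask_period in [(0.5, 0.8), (0, 1)]:
    active = [i for i in range(total_steps)
              if inpainting_mask_period[0] * total_steps <= i <= inpainting_mask_period[1] * total_steps]
    print(inpainting_mask_period, "->", active)
# (0.5, 0.8) -> steps 10..16: the unmasked region is only re-locked on those steps
# (0, 1)     -> every step: the unmasked region stays identical to the original video
```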
Also, the original blended latent diffusion paper further optimizes the decoder to fit the input image in the unmasked region, which is time-consuming for video processing.
@williamyang1991 Thank you. It works fine now.
Regarding the suggested dynamic sizing by VRAM, @FurkanGozukara: with 24 GB of VRAM the resolution limit is 1280x1280 (for that video it would be 1280x720) if the option use_limit_device_resolution = True is set
and the user chooses a resolution greater than the limit for their VRAM. With 12 GB of VRAM the limit is 640x640 (for that video it would be 640x384). I don't have data on how the limit should behave for more than 24 GB of VRAM.
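As a worked example (a sketch reusing the gpu_table and resize logic from my earlier snippet), this is where the 1280x720 and 640x384 numbers come from:

```python
gpu_table = {19: 1280, 7: 640, 6: 512, 2: 320}  # minimum VRAM (GB) -> resolution limit


def vram_limit(gpu_vram_gb):
    # largest resolution whose VRAM requirement fits the available VRAM
    return max(val for key, val in gpu_table.items() if key <= gpu_vram_gb)


print(vram_limit(24))  # 1280 -> a 1280x720 video is kept at 1280x720
print(vram_limit(12))  # 640  -> a 1280x720 video is resized to 640x384
```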
> with 24 GB of VRAM the resolution limit is 1280x1280
So 24 GB is working for you at 1280x1280?
How is that possible? It is using 24 GB + 12 GB shared on Windows for me at 1280x720.
What is your pip freeze?
> How is that possible? It is using 24 GB + 12 GB shared on Windows for me at 1280x720.
I use CUDA 11.8, xformers, and torch 2.0.0 built for CUDA 11.8. I also have cuDNN. I don't use pip freeze because I think it is bad practice: I pin only the libraries whose versions need to be fixed and I don't include sub-dependencies. And since I have experimented with different approaches, my pip freeze would not be representative anyway.
I would also like to note that we are talking about VRAM, not RAM. If you have multiple GPUs in a machine, you can select a GPU via the device, for example cuda:0.
That means you need less VRAM per card if one model is placed on the first GPU and another model is loaded on the second GPU. I don't quite understand what "shared VRAM" means here.