Improvements in the generation
Hi @HeliosZhao
After going through the complete code and experiments, I see the following issues:
- Issues with the generation quality and with long video generation.
- Some of the background from the source video is missing in the target video, even though we use the masks and edit only the protagonist.
Plans to improve:
- Can we use TemporalNet from ControlNet as guidance to improve the consistency?
- Can we use pretrained text-to-video models, or train this architecture on a video dataset, to better learn the patterns? Since the current backbone is T2I (text-to-image), the frame-to-frame consistency is low compared to the source videos.
- Can we use any other additional guidance for better generation?
- Can we use weighted temporal attention? I see we calculate the attention with a single frame. Can we use a moving weighted average so that the information is preserved (an RNN-like architecture)?
Can you help answer these? Thanks in advance.
Following up: can we use motion vectors for consistent video generation?
https://arxiv.org/pdf/2306.02018.pdf - mentioned in this paper.
Hi @rakesh-reddy95 , thanks a lot for your questions and suggestions.
- Can we use TemporalNet from ControlNet as guidance to improve the consistency?
Yes, you can include a ControlNet with temporal layers in the training.
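For illustration, here is a minimal sketch of a temporal attention layer that could be inserted after the spatial blocks. It assumes frames are batched as (batch × frames, channels, height, width), which is how most T2I-based video methods process clips; `TemporalAttention` and `num_frames` are illustrative names, not part of the released code.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention across the frame axis, inserted after a spatial block.

    Assumes the network processes a clip as a (batch * frames, C, H, W) tensor.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Zero-init the output projection so the layer starts as an identity
        # and does not disturb the pretrained spatial weights.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        bf, c, h, w = x.shape
        b = bf // num_frames
        # (b*f, c, h, w) -> (b*h*w, f, c): attend over frames at each spatial location
        hidden = self.norm(x).reshape(b, num_frames, c, h * w)
        hidden = hidden.permute(0, 3, 1, 2).reshape(b * h * w, num_frames, c)
        out, _ = self.attn(hidden, hidden, hidden)
        out = out.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1).reshape(bf, c, h, w)
        return x + out  # residual connection
```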
- Can we use pretrained text-to-video models, or train this architecture on a video dataset, to better learn the patterns? Since the current backbone is T2I (text-to-image), the frame-to-frame consistency is low compared to the source videos.
The pre-trained generative model should be able to take visual information (a CLIP image embedding) as input. Currently, Stable UnCLIP is the only public model with this ability. If there is a T2V model that can leverage visual information, you can also use it as initialization.
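For reference, a hedged sketch of loading the public Stable UnCLIP weights from diffusers as such an initialization; the model ID and pipeline class are the ones published for recent diffusers releases, so adjust them to your version.

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline

# Stable UnCLIP conditions the UNet on a CLIP image embedding in addition to text,
# which is the "visual information" ability mentioned above.
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to("cuda")

unet = pipe.unet                    # UNet that receives the image embedding as a class-style embedding
image_encoder = pipe.image_encoder  # CLIP vision model producing the image embedding
text_encoder = pipe.text_encoder
```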
- Can we use any other additional guidance for better generation?
More ControlNet models (e.g., sketch) can be used.
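As an illustration of stacking guidance signals, a sketch using diffusers' multi-ControlNet support is below; the model IDs and image paths are placeholders, and it assumes a diffusers version that accepts a list of ControlNets.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholder conditioning images prepared beforehand (e.g., a depth map and a sketch/scribble map).
depth_image = Image.open("depth.png")
sketch_image = Image.open("scribble.png")

# Combine depth and sketch guidance; replace the IDs with the ControlNets you actually trained.
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16),
]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

# One conditioning image per ControlNet, with per-model weights.
image = pipe(
    "a man surfing, best quality",
    image=[depth_image, sketch_image],
    controlnet_conditioning_scale=[1.0, 0.5],
).images[0]
```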
- Can we use weighted temporal attention? I see we calculate the attention with a single frame. Can we use a moving weighted average so that the information is preserved (an RNN-like architecture)?
I am not familiar with the method you mentioned. Can you provide a paper using this?
- Motion vector
Sure. Motion vectors can be used as another control signal. You can train a ControlNet model on this signal and use it during training or inference.
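For example, a rough sketch of turning per-frame motion into a ControlNet-style conditioning image, using dense optical flow (OpenCV Farneback) as a stand-in for codec motion vectors:

```python
import cv2
import numpy as np

def flow_to_condition(prev_frame: np.ndarray, frame: np.ndarray) -> np.ndarray:
    """Compute dense optical flow between two BGR frames and encode it as a
    BGR image (angle -> hue, magnitude -> value), usable as a control image."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(frame)
    hsv[..., 0] = ang * 180 / np.pi / 2                              # hue encodes direction
    hsv[..., 1] = 255                                                # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # value encodes speed
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```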
@HeliosZhao
Thank you for the responses.
4. I didn't see any paper on this; it was just an idea.
Also, can you please provide the training scripts for ControlNet training and for saving only the ControlNet weights? I see that the weights are provided and that the training scripts are said to be released soon in the README.md. May I also know what data you used to train the depth ControlNet and pose ControlNet models?
We are cleaning the code for ControlNet training and will release it soon.
We use the COCO2017 training set to train the depth and pose models.
@HeliosZhao Hello, I'm getting the error below while training the ControlNet from diffusers. (https://github.com/huggingface/diffusers/tree/v0.15.0/examples/controlnet)
ValueError: class_labels should be provided when num_class_embeds > 0
I see we are not passing any class_labels in the ControlNet training script, while I do see them in your inference pipeline. May I know what I should pass for class_labels when training the depth/pose/edge models? Or should I use "class_embed_type": null instead of "projection"?
Hi @rakesh-reddy95 , we input zeros as class_labels to ControlNet and input the CLIP image embedding as class_labels to UNet.
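A minimal sketch of that step, assuming the diffusers ControlNetModel/UNet2DConditionModel forward signatures and a UNet configured with "class_embed_type": "projection"; latents, timesteps, text_embeddings, cond_image, and image_embeds are placeholders from the training loop, not names from the released code.

```python
import torch

# Assumed to come from the surrounding training loop:
# latents, timesteps, text_embeddings, cond_image (depth/pose/edge map),
# image_embeds (CLIP image embedding), plus the loaded controlnet and unet.

# ControlNet gets an all-zero class embedding.
zero_labels = torch.zeros(
    latents.shape[0],
    controlnet.config.projection_class_embeddings_input_dim,
    device=latents.device,
    dtype=latents.dtype,
)

down_res, mid_res = controlnet(
    latents,
    timesteps,
    encoder_hidden_states=text_embeddings,
    controlnet_cond=cond_image,
    class_labels=zero_labels,          # zeros for ControlNet
    return_dict=False,
)

# UNet gets the CLIP image embedding as class_labels, plus the ControlNet residuals.
noise_pred = unet(
    latents,
    timesteps,
    encoder_hidden_states=text_embeddings,
    class_labels=image_embeds,         # CLIP image embedding for UNet
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample
```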
@HeliosZhao Can you provide just that part of the script (initializing zeros for ControlNet and the CLIP image embedding for UNet) so that I know what to pass?
The ControlNet training code is released HERE. Feel free to use it.
Awesome, that helps with further experiments. Thank you.