pixeli99/SVD_Xtend

Question about the encoder_hidden_states

WayneML opened this issue · 4 comments

When I tried to run the script, I found that encoder_hidden_states was all zeros.

if args.conditioning_dropout_prob is not None:
    random_p = torch.rand(
        bsz, device=latents.device, generator=generator)
    # Sample masks for the edit prompts.
    prompt_mask = random_p < 2 * args.conditioning_dropout_prob
    prompt_mask = prompt_mask.reshape(bsz, 1, 1)
    # Final text conditioning.
    null_conditioning = torch.zeros_like(encoder_hidden_states)
    encoder_hidden_states = torch.where(
        prompt_mask, null_conditioning.unsqueeze(1), encoder_hidden_states.unsqueeze(1))
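For reference, this block appears to implement conditioning dropout (as used for classifier-free guidance): each sample's conditioning is randomly replaced with zeros. A standalone sketch of the same masking logic, with dummy shapes and a dummy dropout probability that are not taken from the repo, shows what happens to a batch:

import torch

# Dummy shapes and probability for illustration only.
bsz, dim = 4, 1024
conditioning_dropout_prob = 0.1

encoder_hidden_states = torch.randn(bsz, dim)
random_p = torch.rand(bsz)

# Each sample is independently dropped with probability 2 * p.
prompt_mask = (random_p < 2 * conditioning_dropout_prob).reshape(bsz, 1, 1)
null_conditioning = torch.zeros_like(encoder_hidden_states)

dropped = torch.where(
    prompt_mask, null_conditioning.unsqueeze(1), encoder_hidden_states.unsqueeze(1))

print(prompt_mask.flatten())           # e.g. tensor([False,  True, False, False])
print(dropped.abs().sum(dim=(1, 2)))   # zero only for the masked samples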

I found something strange in this code block: "random_p = torch.rand(bsz, device=latents.device, generator=generator)" always makes random_p a one-dimensional tensor, and when you choose a batch size of 1 it holds only a single value. As a result, prompt_mask is a single Boolean rather than one Boolean per sample.
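To illustrate, a tiny standalone snippet (the dropout probability here is arbitrary, not from the repo):

import torch

# With batch size 1, random_p is a one-element tensor in [0, 1), so the
# comparison yields a single Boolean flag instead of one flag per sample.
bsz = 1
conditioning_dropout_prob = 0.1

random_p = torch.rand(bsz)                               # shape (1,)
prompt_mask = random_p < 2 * conditioning_dropout_prob   # shape (1,)
prompt_mask = prompt_mask.reshape(bsz, 1, 1)             # shape (1, 1, 1)

print(random_p, prompt_mask.flatten())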

Also, is this block still appropriate for the image-to-video task? It looks like it was written for text-to-image.

Hi, I didn't quite understand what you meant. Are you asking why the encoder_hidden_states need to be replaced with zeros?

Can the encoder_hidden_states be replaced with a text embedding for text-to-video tasks?
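For example, something along these lines (a rough sketch only; the checkpoint name, the use of the projected CLIP text embedding, and whether its dimension matches what the UNet expects are all assumptions, not something the repo provides):

import torch
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

# Hypothetical sketch: build a (bsz, 1, dim) text embedding and feed it to the
# UNet as encoder_hidden_states in place of the CLIP image embedding.
model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"  # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModelWithProjection.from_pretrained(model_id)

prompts = ["a car driving along a coastal road"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = text_encoder(**inputs).text_embeds   # (bsz, dim)

encoder_hidden_states = text_embeds.unsqueeze(1)        # (bsz, 1, dim)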