Weifeng-Chen/control-a-video

The size of tensor a (4) must match the size of tensor b (8)

G-force78 opened this issue · 2 comments

Using these arguments:

```shell
!python3 /content/control-a-video/inference.py --prompt "a bear practicing kungfu, with a background of mountains" --input_video /content/kungfubear.mp4 --control_mode depth --num_sample_frames 24 --inference_step 10 --guidance_scale 5 --init_noise_thres 0.75
```

FPS 8 output demo.gif

```
/content/control-a-video/inference.py:119 in <module>

  116
  117 out = []
  118 for i in range(num_sample_frames//each_sample_frame):
❱ 119     out1 = video_controlnet_pipe(
  120             # controlnet_hint= control_maps[:,:,:each_sample_frame,:,:
  121             # images= v2v_input_frames[:,:,:each_sample_frame,:,:],
  122             controlnet_hint=control_maps[:,:,i*each_sample_frame-1:(i+

/usr/local/lib/python3.10/dist-packages/torch/autograd/grad_mode.py:27 in
decorate_context

   24     @functools.wraps(func)
   25     def decorate_context(*args, **kwargs):
   26         with self.clone():
❱  27             return func(*args, **kwargs)
   28     return cast(F, decorate_context)
   29
   30 def wrap_generator(self, func):

/content/control-a-video/model/video_diffusion/pipelines/pipeline_stable_diffusion_controlnet3d.py:418 in __call__

  415                     if controlhint_in_uncond:
  416                         control_maps_single_frame = control_maps_singl
  417
❱ 418                     down_block_res_samples_single_frame, mid_block_res
  419                             latent_model_input_single_frame,
  420                             t,
  421                             encoder_hidden_states=text_embeddings

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1194 in
_call_impl

  1191         # this function, and just call forward.
  1192         if not (self._backward_hooks or self._forward_hooks or self._
  1193                 or _global_forward_hooks or _global_forward_pre_hooks
❱ 1194             return forward_call(*input, **kwargs)
  1195         # Do not call functions when jit is used
  1196         full_backward_hooks, non_full_backward_hooks = [], []
  1197         if self._backward_hooks or _global_backward_hooks:

/content/control-a-video/model/video_diffusion/models/controlnet3d.py:464 in
forward

  461         controlnet_cond = self.controlnet_cond_embedding(controlnet_co
  462         # print(sample.shape, controlnet_cond.shape)
  463
❱ 464         sample += controlnet_cond
  465         # 3. down
  466
  467         down_block_res_samples = (sample,)

RuntimeError: The size of tensor a (4) must match the size of tensor b (8)
```
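The crash itself is an elementwise-add shape mismatch: in `controlnet3d.py`, the latent `sample` and the embedded `controlnet_cond` disagree on one dimension (4 vs 8, presumably the frame axis, possibly related to the slice arithmetic visible in `inference.py`). A minimal NumPy sketch of the same failure mode; the shapes below are illustrative guesses, only the mismatched counts 4 vs 8 come from the error message:

```python
import numpy as np

# Hypothetical tensor layout (batch, channels, frames, height, width).
# Only the frame counts 4 vs 8 are taken from the error message.
sample = np.zeros((1, 320, 4, 32, 32))
controlnet_cond = np.zeros((1, 320, 8, 32, 32))

try:
    sample += controlnet_cond  # frame dims 4 vs 8 cannot broadcast
except ValueError as err:
    print("shape mismatch:", err)
```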

What is the relationship between fps, num_sample_frames, and the length of the output video? Also, what does `--sampling_rate` ("skip sampling from the input video") actually mean? I notice the default value is 3; what does this do?
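For what it's worth, my reading of the flags (an assumption, not confirmed by the repo authors) is that `--sampling_rate N` keeps every Nth frame of the input, and the output length is simply `num_sample_frames` divided by the output fps. The helper functions below are hypothetical, just to make the arithmetic concrete:

```python
# Assumed semantics (not confirmed by the authors):
#   --sampling_rate N     -> keep every Nth frame of the input video
#   --num_sample_frames M -> number of frames actually generated

def input_seconds_consumed(num_sample_frames: int,
                           sampling_rate: int,
                           input_fps: float) -> float:
    """Seconds of source video covered: sampled frames are spaced
    sampling_rate frames apart in the input."""
    return num_sample_frames * sampling_rate / input_fps

def output_seconds(num_sample_frames: int, output_fps: float) -> float:
    """Duration of the generated clip at the output frame rate."""
    return num_sample_frames / output_fps

# e.g. 24 generated frames, sampling_rate=3, 24 fps input, 8 fps output:
print(input_seconds_consumed(24, 3, 24.0))  # 3.0 seconds of input used
print(output_seconds(24, 8.0))              # 3.0 second output clip
```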

Cool setup, by the way. It's like an open-source version of Runway Gen-1; I imagine they used similar tricks and just have many GPUs to run it.

The name should be fixed; the current one may not be understood.