[Bug]: Tensor Dimension mismatch when attempting to render movie
scottlegrand opened this issue · 1 comment
Is there an existing issue for this?
- I have searched the existing issues and checked the recent builds/commits of both this extension and the webui
Are you using the latest version of the extension?
- I have updated the modelscope text2video extension to the latest version and I still have the issue.
What happened?
Tensor dimension mismatch? But why, and how do I fix it?
Data shape for DDIM sampling is (1, 4, 48, 42, 64), eta 0.0 | 0/30 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/t2v_helpers/render.py", line 30, in run
vids_pack = process_modelscope(args_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/process_modelscope.py", line 214, in process_modelscope
samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_pipeline.py", line 278, in infer
x0 = self.diffusion.sample_loop(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/samplers_common.py", line 190, in sample_loop
x0 = self.sampler.sample(
^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/ddim/sampler.py", line 90, in sample
samples = self.ddim_sampling(conditioning, size,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/ddim/sampler.py", line 154, in ddim_sampling
outs, _ = self.p_sample_ddim(img, c, ts, index=index, use_original_steps=ddim_use_original_steps,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/ddim/sampler.py", line 178, in p_sample_ddim
noise = self.model(x, t, c)
^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 444, in forward
x = torch.cat([x, xs.pop()], dim=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 12 but got size 11 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 12 but got size 11 for tensor number 1 in the list.
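The error is consistent with a skip-connection shape conflict in the UNet when the latent height is not divisible by 2^(number of downsamples). At 512×340, the 8× VAE latent is 64×42 (matching the logged data shape `(1, 4, 48, 42, 64)`; 340/8 = 42.5, apparently floored to 42). Three stride-2 downsamples take 42 → 21 → 11 → 6, but the first 2× upsample then produces 12 where the stored skip tensor has 11, which is exactly the reported `Expected size 12 but got size 11`. A minimal sketch of that arithmetic (the kernel-3/padding-1 stride-2 conv and nearest-neighbour 2× upsample are assumptions about the UNet's layers, not taken from the extension's code):

```python
def down(n, k=3, p=1, s=2):
    """Output length of a stride-2 conv (kernel 3, padding 1) --
    the downsampling assumed for a ModelScope-style UNet."""
    return (n + 2 * p - k) // s + 1

def first_skip_mismatch(latent, depth=3):
    """Simulate encoder/decoder skip connections over `depth` 2x
    down/upsamples; return the first (decoder, skip) size pair that
    torch.cat would reject, or None if all levels line up."""
    skips, n = [], latent
    for _ in range(depth):
        skips.append(n)
        n = down(n)
    for skip in reversed(skips):
        n *= 2                      # nearest-neighbour 2x upsample
        if n != skip:
            return (n, skip)        # mismatch: cat() raises here
        # after cat, the spatial size equals the skip's (they match)
    return None

print(first_skip_mismatch(42))  # latent height for 340 px -> (12, 11), the reported error
print(first_skip_mismatch(40))  # latent height for 320 px -> None, no mismatch
```

Running it reproduces the 12-vs-11 pair from the traceback for latent height 42, while 40 (i.e. a 320 px height) survives every level.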
Steps to reproduce the problem
Any attempt to render a movie with the zeroscope model fails with this error.
What should have happened?
text2video should render the requested video.
WebUI and Deforum extension Commit IDs
webui commit id -
txt2vid commit id -
Torch version
2.0.1
What GPU were you using for launching?
RTX 4090, 24 GB
On which platform are you launching the webui backend with the extension?
Local PC setup (Linux)
Settings
Unclear what's relevant here; I just typed "The matrix but with muppets" as a prompt.
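If the cause is the non-divisible height (the logged latent height 42 comes from 340 px / 8, rounded), choosing dimensions that are multiples of 64 (8× VAE scaling times three 2× UNet downsamples; an assumption, not confirmed by the extension's docs) should sidestep the crash. A hypothetical helper to snap a requested size:

```python
def snap_to_multiple(n: int, m: int = 64) -> int:
    """Round n to the nearest multiple of m (never below m)."""
    return max(m, round(n / m) * m)

print(snap_to_multiple(340))  # 320 -- nearest height the UNet can handle
print(snap_to_multiple(512))  # 512 -- already a multiple of 64
```

Under that assumption, rendering at 512×320 (or 512×384) instead of 512×340 should avoid the `torch.cat` size mismatch.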
Console logs
text2video — The model selected is: <modelscope> (ModelScope-like)
text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
0%| | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'The Matrix with muppets', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 48, 'seed': 2020820483, 'scale': 17, 'width': 512, 'height': 340, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM'}
Sampling random noise.
Data shape for DDIM sampling is (1, 4, 48, 42, 64), eta 0.0 | 0/30 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/t2v_helpers/render.py", line 30, in run
vids_pack = process_modelscope(args_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/process_modelscope.py", line 214, in process_modelscope
samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_pipeline.py", line 278, in infer
x0 = self.diffusion.sample_loop(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/samplers_common.py", line 190, in sample_loop
x0 = self.sampler.sample(
^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/ddim/sampler.py", line 90, in sample
samples = self.ddim_sampling(conditioning, size,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/ddim/sampler.py", line 154, in ddim_sampling
outs, _ = self.p_sample_ddim(img, c, ts, index=index, use_original_steps=ddim_use_original_steps,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/ddim/sampler.py", line 178, in p_sample_ddim
noise = self.model(x, t, c)
^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 444, in forward
x = torch.cat([x, xs.pop()], dim=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 12 but got size 11 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 12 but got size 11 for tensor number 1 in the list.
Sampling using DDIM for 30 steps.: 0%| | 0/30 [00:00<?, ?it/s]
text2video — The model selected is: zeroscope (ModelScope-like)
text2video extension for auto1111 webui
Git commit: 3f4a109a
Starting text2video
Pipeline setup
config namespace(framework='pytorch', task='text-to-video-synthesis', model={'type': 'latent-text-to-video-synthesis', 'model_args': {'ckpt_clip': 'open_clip_pytorch_model.bin', 'ckpt_unet': 'text2video_pytorch_model.pth', 'ckpt_autoencoder': 'VQGAN_autoencoder.pth', 'max_frames': 16, 'tiny_gpu': 1}, 'model_cfg': {'unet_in_dim': 4, 'unet_dim': 320, 'unet_y_dim': 768, 'unet_context_dim': 1024, 'unet_out_dim': 4, 'unet_dim_mult': [1, 2, 4, 4], 'unet_num_heads': 8, 'unet_head_dim': 64, 'unet_res_blocks': 2, 'unet_attn_scales': [1, 0.5, 0.25], 'unet_dropout': 0.1, 'temporal_attention': 'True', 'num_timesteps': 1000, 'mean_type': 'eps', 'var_type': 'fixed_small', 'loss_type': 'mse'}}, pipeline={'type': 'latent-text-to-video-synthesis'})
device cuda
Working in txt2vid mode
0%| | 0/1 [00:00<?, ?it/s]Making a video with the following parameters:
{'prompt': 'The Matrix with muppets', 'n_prompt': 'text, watermark, copyright, blurry, nsfw', 'steps': 30, 'frames': 48, 'seed': 2773172104, 'scale': 17, 'width': 512, 'height': 340, 'eta': 0.0, 'cpu_vae': 'GPU (half precision)', 'device': device(type='cuda'), 'skip_steps': 0, 'strength': 1, 'is_vid2vid': 0, 'sampler': 'DDIM'}
Sampling random noise.
Data shape for DDIM sampling is (1, 4, 48, 42, 64), eta 0.0 | 0/30 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/t2v_helpers/render.py", line 30, in run
vids_pack = process_modelscope(args_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/process_modelscope.py", line 214, in process_modelscope
samples, _ = pipe.infer(args.prompt, args.n_prompt, args.steps, args.frames, args.seed + batch if args.seed != -1 else -1, args.cfg_scale,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_pipeline.py", line 278, in infer
x0 = self.diffusion.sample_loop(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/samplers_common.py", line 190, in sample_loop
x0 = self.sampler.sample(
^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/ddim/sampler.py", line 90, in sample
samples = self.ddim_sampling(conditioning, size,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/ddim/sampler.py", line 154, in ddim_sampling
outs, _ = self.p_sample_ddim(img, c, ts, index=index, use_original_steps=ddim_use_original_steps,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/samplers/ddim/sampler.py", line 178, in p_sample_ddim
noise = self.model(x, t, c)
^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/miniconda3/envs/sd/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/slegrand/stable-diffusion-webui/extensions/sd-webui-text2video/scripts/modelscope/t2v_model.py", line 444, in forward
x = torch.cat([x, xs.pop()], dim=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 12 but got size 11 for tensor number 1 in the list.
Exception occurred: Sizes of tensors must match except in dimension 1. Expected size 12 but got size 11 for tensor number 1 in the list.
Sampling using DDIM for 30 steps.: 0%|
Additional information
No response
This issue has been closed due to incorrect formatting. Please address the following mistakes and reopen the issue:
- Include THE FULL LOG FROM THE START OF THE WEBUI in the issue description.
- Make sure the issue title has at least 3 words.