InterDigitalInc/CompressAI

Unable to train video compression model on Vimeo90K dataset

Zayn-Rekhi opened this issue · 6 comments

Hey guys, I am trying to train the SSF2020 model on the Vimeo90K dataset.

Whenever I try to train the model, I am stopped by this error:
Learning rate: 0.0001
Traceback (most recent call last):
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/examples/train_video.py", line 497, in <module>
    main(sys.argv[1:])
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/examples/train_video.py", line 467, in main
    train_one_epoch(
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/examples/train_video.py", line 260, in train_one_epoch
    out_net = model(d)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/models/video/google.py", line 217, in forward
    x_hat, likelihoods = self.forward_keyframe(frames[0])
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/models/video/google.py", line 235, in forward_keyframe
    y_hat, likelihoods = self.img_hyperprior(y)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/models/video/google.py", line 158, in forward
    z_hat, z_likelihoods = self.entropy_bottleneck(z)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/entropy_models/entropy_models.py", line 501, in forward
    likelihood = self._likelihood(outputs)
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/entropy_models/entropy_models.py", line 462, in _likelihood
    lower = self._logits_cumulative(v0, stop_gradient=False)
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/entropy_models/entropy_models.py", line 443, in _logits_cumulative
    logits = torch.matmul(F.softplus(matrix), logits)
RuntimeError: The size of tensor a (192) must match the size of tensor b (2) at non-singleton dimension 0

To Reproduce

Steps to reproduce the behavior:

python3 examples/train_video.py -m ssf2020 -d /Users/zaynrekhi/Desktop/BeepBoopBap/data/vimeo_triplet90K/vimeo_triplet --batch-size 16 -lr 1e-4 --save

Expected behavior

SSF2020 trains without raising an error.

Environment

Please copy and paste the output from python3 -m torch.utils.collect_env
PyTorch version: 1.13.0.dev20220924
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.0.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)] (64-bit runtime)
Python platform: macOS-13.0.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.3
[pip3] pytorch-msssim==0.2.1
[pip3] torch==1.13.0.dev20220924
[pip3] torchac==0.9.3
[pip3] torchaudio==0.13.0.dev20220924
[pip3] torchvision==0.14.0.dev20220924
[conda] Could not collect

Additional context

Related issue: #183

Please show us the output of:

COMPRESSAI_PATH="$(python -c 'import compressai; print(compressai.__path__[0])')"
echo "$COMPRESSAI_PATH"
cd "$COMPRESSAI_PATH"
git rev-parse HEAD

Hello,

Thank you so much for your response. Here is the output of running those commands:

/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai
14ac02c5182cbfee596abdfea98886be6247479a

According to python3 -m torch.utils.collect_env, the training is being done on a CPU rather than a GPU. Is this correct? Note that CPU training is currently unsupported.
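A quick way to confirm which device PyTorch would pick is sketched below. The helper name `pick_device` is hypothetical and not part of CompressAI; it only relies on `torch.cuda.is_available()`, and it falls back to "cpu" when PyTorch is not installed.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable CUDA GPU, else "cpu"."""
    # Guard the import so the check also works where PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

On the environment reported above (CUDA not available), this would print `cpu`.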

Nonetheless, I'm not sure exactly why there's a mismatch in the number of elements. It might be device (CPU/CUDA) related, or the image batches that the data loader feeds into the model might not have the correct shape or be on the correct device.
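One way to test the batch-shape guess is to validate the batch before it reaches the model. The first traceback shows `forward` indexing `frames[0]`, so the sketch below assumes the model wants a list of frames, each a 4-D `(B, C, H, W)` tensor. If the loader yields a single batched image tensor instead, `frames[0]` would select one sample rather than the first frame, which might explain the mismatch, though that is unconfirmed. `check_video_batch` is a hypothetical helper, and `SimpleNamespace` objects stand in for tensors since only `.shape` is inspected.

```python
from types import SimpleNamespace


def check_video_batch(batch):
    """Raise ValueError unless batch is a list/tuple of 4-D (B, C, H, W) frames."""
    if not isinstance(batch, (list, tuple)):
        shape = tuple(getattr(batch, "shape", ()))
        raise ValueError(f"expected a list of frames, got one object with shape {shape}")
    for i, frame in enumerate(batch):
        shape = tuple(frame.shape)
        if len(shape) != 4:
            raise ValueError(f"frame {i} has shape {shape}, expected (B, C, H, W)")
    return len(batch)


# Stand-ins for tensors: only the .shape attribute matters here.
frames = [SimpleNamespace(shape=(16, 3, 256, 256)) for _ in range(3)]
print(check_video_batch(frames))  # number of frames in the sequence

# A single batched image tensor (what an image dataset would yield) is rejected.
single = SimpleNamespace(shape=(16, 3, 256, 256))
try:
    check_video_batch(single)
except ValueError as err:
    print(err)
```

Calling such a check on the first batch from the data loader, before `model(d)`, would show quickly whether the dataset class is returning frame sequences or single images.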

I was previously using the Vimeo90K dataset class to load the Vimeo90K data for training SSF2020, but I ran into issues with the dimensions not aligning. I am now using the VideoFolder dataset class to load the Vimeo90K dataset, but I am running into the following issue:

Traceback (most recent call last):
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/examples/train_video.py", line 500, in <module>
    main(sys.argv[1:])
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/examples/train_video.py", line 470, in main
    train_one_epoch(
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/examples/train_video.py", line 270, in train_one_epoch
    aux_loss = compute_aux_loss(model.aux_loss(), backward=True)
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/models/video/google.py", line 386, in aux_loss
    aux_loss_list.append(m.aux_loss())
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/models/video/google.py", line 386, in aux_loss
    aux_loss_list.append(m.aux_loss())
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/models/video/google.py", line 386, in aux_loss
    aux_loss_list.append(m.aux_loss())
  [Previous line repeated 991 more times]
  File "/Users/zaynrekhi/Desktop/BeepBoopBap/python/CompressAI/compressai/models/video/google.py", line 384, in aux_loss
    for m in self.modules():
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1838, in modules
    for _, module in self.named_modules():
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1877, in named_modules
    memo.add(self)
RecursionError: maximum recursion depth exceeded while calling a Python object

Please update via git pull --rebase.

This was a recent regression, fixed in 2156f2b.
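The RecursionError traceback shows `aux_loss` looping over `self.modules()` and calling `m.aux_loss()` on each entry. Since `modules()` yields the module itself first, that pattern can re-enter the same method indefinitely. The sketch below reproduces the pattern with a tiny stand-in for `torch.nn.Module`; the class names are hypothetical, and the "fixed" variant is only one possible way to avoid the loop, not necessarily what commit 2156f2b does.

```python
class Module:
    """Tiny stand-in for torch.nn.Module: modules() yields self, then descendants."""

    def __init__(self, *children):
        self.children = list(children)

    def modules(self):
        yield self
        for child in self.children:
            yield from child.modules()


class Leaf(Module):
    """Plays the role of an entropy bottleneck with its own auxiliary loss."""

    def aux_loss(self):
        return 1.0


class BuggyModel(Module):
    def aux_loss(self):
        # modules() yields self first, so m.aux_loss() re-enters this method.
        return sum(m.aux_loss() for m in self.modules())


class FixedModel(Module):
    def aux_loss(self):
        # Only leaf modules contribute, so the method never calls itself.
        return sum(m.aux_loss() for m in self.modules() if isinstance(m, Leaf))


try:
    BuggyModel(Leaf()).aux_loss()
except RecursionError:
    print("BuggyModel: RecursionError")

print(FixedModel(Leaf(), FixedModel(Leaf())).aux_loss())  # 2.0
```

Filtering on the leaf type also avoids double counting when one model is nested inside another, because each leaf is visited exactly once by `modules()`.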

That fixed it. Thanks :)