Unable to resume training
danishnazir opened this issue · 6 comments
Bug
Hi,
I am trying to resume training from a pretrained model (https://github.com/micmic123/QmapCompression), which is based on CompressAI. The pretrained model uses the Hyperprior architecture, with some additions.
To Reproduce
import torch

def load_checkpoint(path, model, optimizer=None, aux_optimizer=None, scaler=None, only_net=False):
    snapshot = torch.load(path)
    itr = snapshot['itr']
    print(f'Loaded from {itr} iterations')
    model.load_state_dict(snapshot['model'])
    if not only_net:
        if 'optimizer' in snapshot:
            optimizer.load_state_dict(snapshot['optimizer'])
        if 'aux_optimizer' in snapshot:
            aux_optimizer.load_state_dict(snapshot['aux_optimizer'])
        if scaler is not None and 'scaler' in snapshot:
            scaler.load_state_dict(snapshot['scaler'])
    return itr, model
RuntimeError: Error(s) in loading state_dict for CustomDataParallel:
size mismatch for module.entropy_bottleneck._offset: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for module.entropy_bottleneck._quantized_cdf: copying a param with shape torch.Size([192, 45]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for module.entropy_bottleneck._cdf_length: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for module.gaussian_conditional._offset: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for module.gaussian_conditional._quantized_cdf: copying a param with shape torch.Size([64, 3133]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for module.gaussian_conditional._cdf_length: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for module.gaussian_conditional.scale_table: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).
Expected behavior
The checkpoint should load into the model without errors.
Environment
Please copy and paste the output from python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.7.1+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.15.5
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB
Nvidia driver version: 470.141.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-gradcam==0.2.1
[pip3] pytorch-msssim==0.2.0
[pip3] pytorch-transformers==1.0.0
[pip3] torch==1.7.1+cu101
[pip3] torch-tb-profiler==0.4.0
[pip3] torchaudio==0.7.2
[pip3] torchvision==0.8.2+cu101
[conda] _pytorch_select 0.1 cpu_0 anaconda
[conda] blas 1.0 mkl anaconda
[conda] cudatoolkit 10.1.243 h6bb024c_0 anaconda
[conda] libmklml 2019.0.5 h06a4308_0 anaconda
[conda] mkl 2020.2 256 anaconda
[conda] numpy 1.20.1 pypi_0 pypi
[conda] pytorch-gradcam 0.2.1 pypi_0 pypi
[conda] pytorch-msssim 0.2.0 pypi_0 pypi
[conda] pytorch-transformers 1.0.0 pypi_0 pypi
[conda] torch 1.7.1+cu101 pypi_0 pypi
[conda] torch-tb-profiler 0.4.0 pypi_0 pypi
[conda] torchaudio 0.7.2 pypi_0 pypi
[conda] torchvision 0.8.2+cu101 pypi_0 pypi
What is the version and commit hash for your local compressAI repository?
Thanks for your response. Can you please tell me how I can find the commit hash for my local CompressAI repo? I installed CompressAI using pip install compressai.
The library version is 1.2.2; I am not sure where to find the commit hash.
Can you please show us the output of:
COMPRESSAI_PATH="$(python -c 'import compressai; print(compressai.__path__[0])')"
echo "$COMPRESSAI_PATH"
cd "$COMPRESSAI_PATH"
git rev-parse HEAD
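For a package installed from PyPI, the install directory is usually not a git checkout, so git rev-parse may not return anything useful there. In that case the installed version is the most specific identifier available. A minimal sketch of reading it from Python, assuming Python 3.8 or newer (the helper name installed_version is made up for this example):

```python
from importlib import metadata  # in the standard library since Python 3.8


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version if compressai is installed, otherwise None.
    print("compressai:", installed_version("compressai"))
```

pip show compressai on the command line reports the same version string.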
It sounds like you installed compressai from PyPI, so that means my recent commits b64b0da and 14ac02c are probably not the cause of the problem. The issue is that the module.entropy_bottleneck buffers are not being pre-allocated with enough space, since the code is expecting entropy_bottleneck directly. Good news: the recent commits might actually fix the problem! Consider installing compressai from source instead:
cd ~
git clone https://github.com/InterDigitalInc/CompressAI compressai
cd compressai
pip install -U pip && pip install -e .
Alternatively, you can also just copy-paste the new load_state_dict function into CompressionModel, defined here:
CompressAI/compressai/models/base.py
Lines 62 to 142 in 14ac02c
Hi,
Thank you for your detailed answer.
Yes, you are right, I am not building CompressAI from source. The requested output is as follows:
COMPRESSAI_PATH = /anaconda/envs/azureml_py38/lib/python3.8/site-packages/compressai
As for the proposed solution: my CompressionModel class already looks the same as the one you mentioned. I copied it earlier, since there were some issues with multi-GPU training and copying it worked for me.
Please look at my project over here: Entropy Models/ Hyperprior Files
I think the issue may come from using multiple versions at the same time. I install compressai from PyPI, but I redefine some files, e.g. entropy_models.py, in my own code, and these might differ from the PyPI version. Could this be a problem?
DataParallel adds a module. prefix by default to every key in the parallel_model.state_dict().
Solutions:
1. Save the "non-parallel" model:
   module = model.module if isinstance(model, DataParallel) else model
   state_dict = module.state_dict()
   torch.save(state_dict, "output.pth")
2. Load the checkpoint, rename all the keys, and save a new checkpoint:
   ckpt = torch.load("input.pth")
   print(ckpt.keys())
   sd = "state_dict"  # I forgot what it was called.
   print("\n".join(ckpt[sd].keys()))
   # Note: str.removeprefix requires Python 3.9 or newer.
   ckpt[sd] = {k.removeprefix("module."): v for k, v in ckpt[sd].items()}
   torch.save(ckpt, "output.pth")
3. Same as (2), but do it before loading the state_dict instead.
4. Load the model weights before wrapping it in DataParallel.
I would say (1) is the best and least likely to cause problems in the future, and maybe do (4) as well.
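Since the environment above reports Python 3.8 and str.removeprefix only exists from Python 3.9, the key renaming can be done with a small helper instead. The following is a minimal sketch on a plain dict; the helper names and the example keys are made up for illustration, and a real state_dict would map the same kind of keys to tensors:

```python
def strip_prefix(key, prefix="module."):
    """Return key without one leading occurrence of prefix (works on Python 3.8)."""
    return key[len(prefix):] if key.startswith(prefix) else key


def strip_module_prefix(state_dict, prefix="module."):
    """Return a new dict whose keys have the DataParallel-style prefix removed."""
    return {strip_prefix(k, prefix): v for k, v in state_dict.items()}


# Illustrative keys only; the integer values stand in for tensors.
example = {
    "module.g_a.0.weight": 1,
    "module.entropy_bottleneck._offset": 2,
    "h_s.0.bias": 3,  # keys without the prefix are left unchanged
}
cleaned = strip_module_prefix(example)
print(sorted(cleaned))
```

Only one leading prefix is removed per key, so a key from a doubly wrapped model (module.module. ...) would need two passes.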
Yeah, you were right. Everything works now. I am attaching my code in case someone else faces a similar problem.
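def load_checkpoint(path, model):
    snapshot = torch.load(path)
    itr = snapshot['itr']
    print(f'Loaded from {itr} iterations')
    # Strip the "module." prefix that DataParallel adds to every key.
    dict_ = {}
    for k, v in snapshot["model"].items():
        k = remove_prefix(k, "module.")  # remove_prefix: helper defined elsewhere in the project
        dict_[k] = v
    snapshot["model"] = dict_
    model.load_state_dict(snapshot['model'])
    # train.py unpacks two return values, so return both.
    return itr, model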
and in train.py, we have
model = model.to(device)
optimizer, aux_optimizer = configure_optimizers(model, config)
# Load the weights into the bare model first, then wrap it for multi-GPU training.
if args.resume:
    itr, model = load_checkpoint(args.resume, model)
    logger.load_itr(itr)
if torch.cuda.device_count() > 1:
    model = CustomDataParallel(model)