Unable to resume training

Bug

Hi,
I am trying to resume training from a pretrained model (https://github.com/micmic123/QmapCompression), which is based on compressAI. The pretrained model uses the Hyperprior architecture, with some additions.

To Reproduce

def load_checkpoint(path, model, optimizer=None, aux_optimizer=None, scaler=None, only_net=False):
    snapshot = torch.load(path)
    itr = snapshot['itr']
    print(f'Loaded from {itr} iterations')

    model.load_state_dict(snapshot['model'])

    if not only_net:
        if 'optimizer' in snapshot:
            optimizer.load_state_dict(snapshot['optimizer'])
        if 'aux_optimizer' in snapshot:
            aux_optimizer.load_state_dict(snapshot['aux_optimizer'])
        if scaler is not None and 'scaler' in snapshot:
            scaler.load_state_dict(snapshot['scaler'])

    return itr, model


RuntimeError: Error(s) in loading state_dict for CustomDataParallel:
        size mismatch for module.entropy_bottleneck._offset: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.entropy_bottleneck._quantized_cdf: copying a param with shape torch.Size([192, 45]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.entropy_bottleneck._cdf_length: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional._offset: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional._quantized_cdf: copying a param with shape torch.Size([64, 3133]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional._cdf_length: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional.scale_table: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).

Expected behavior

The model should load from the checkpoint without errors.

Environment

Please copy and paste the output from python3 -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.7.1+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.15.5

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB

Nvidia driver version: 470.141.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-gradcam==0.2.1
[pip3] pytorch-msssim==0.2.0
[pip3] pytorch-transformers==1.0.0
[pip3] torch==1.7.1+cu101
[pip3] torch-tb-profiler==0.4.0
[pip3] torchaudio==0.7.2
[pip3] torchvision==0.8.2+cu101
[conda] _pytorch_select           0.1                       cpu_0    anaconda
[conda] blas                      1.0                         mkl    anaconda
[conda] cudatoolkit               10.1.243             h6bb024c_0    anaconda
[conda] libmklml                  2019.0.5             h06a4308_0    anaconda
[conda] mkl                       2020.2                      256    anaconda
[conda] numpy                     1.20.1                   pypi_0    pypi
[conda] pytorch-gradcam           0.2.1                    pypi_0    pypi
[conda] pytorch-msssim            0.2.0                    pypi_0    pypi
[conda] pytorch-transformers      1.0.0                    pypi_0    pypi
[conda] torch                     1.7.1+cu101              pypi_0    pypi
[conda] torch-tb-profiler         0.4.0                    pypi_0    pypi
[conda] torchaudio                0.7.2                    pypi_0    pypi
[conda] torchvision               0.8.2+cu101              pypi_0    pypi

What is the version and commit hash for your local compressAI repository?

Thanks for your response. Can you please tell me how I can find the commit hash for my local compressAI repo? I installed compressAI using pip install compressai.
The library version is 1.2.2, but I am not sure where to find the commit hash.

Can you please show us the output of:

COMPRESSAI_PATH="$(python -c 'import compressai; print(compressai.__path__[0])')"
echo "$COMPRESSAI_PATH"
cd "$COMPRESSAI_PATH"
git rev-parse HEAD

It sounds like you installed compressai from PyPI, so that means my recent commits b64b0da and 14ac02c are probably not the cause of the problem. The issue is that module.entropy_bottleneck buffers are not being pre-allocated with enough space since it's expecting entropy_bottleneck directly. Good news: the recent commits might actually fix the problem! Consider installing compressai from source instead:

cd ~
git clone https://github.com/InterDigitalInc/CompressAI compressai
cd compressai
pip install -U pip && pip install -e .

Alternatively, you can just copy and paste the new load_state_dict function into CompressionModel, defined here:

CompressAI/compressai/models/base.py

Lines 62 to 142 in 14ac02c

    
class CompressionModel(nn.Module):
    """Base class for constructing an auto-encoder with any number of
    EntropyBottleneck or GaussianConditional modules.
    """

    def load_state_dict(self, state_dict, strict=True):
        for name, module in self.named_modules():
            if not any(x.startswith(name) for x in state_dict.keys()):
                continue

            if isinstance(module, EntropyBottleneck):
                update_registered_buffers(
                    module,
                    name,
                    ["_quantized_cdf", "_offset", "_cdf_length"],
                    state_dict,
                )

            if isinstance(module, GaussianConditional):
                update_registered_buffers(
                    module,
                    name,
                    ["_quantized_cdf", "_offset", "_cdf_length", "scale_table"],
                    state_dict,
                )

        return nn.Module.load_state_dict(self, state_dict, strict=strict)

    def update(self, scale_table=None, force=False):
        """Updates EntropyBottleneck and GaussianConditional CDFs.

        Needs to be called once after training to be able to later perform the
        evaluation with an actual entropy coder.

        Args:
            scale_table (torch.Tensor): table of scales (i.e. stdev)
                for initializing the Gaussian distributions
                (default: 64 logarithmically spaced scales from 0.11 to 256)
            force (bool): overwrite previous values (default: False)

        Returns:
            updated (bool): True if at least one of the modules was updated.
        """
        if scale_table is None:
            scale_table = get_scale_table()
        updated = False
        for _, module in self.named_modules():
            if isinstance(module, EntropyBottleneck):
                updated |= module.update(force=force)
            if isinstance(module, GaussianConditional):
                updated |= module.update_scale_table(scale_table, force=force)
        return updated

    def aux_loss(self) -> Tensor:
        """Returns the total auxiliary loss over all `EntropyBottleneck`s.

        In contrast to the primary "net" loss used by the "net"
        optimizer, the "aux" loss is only used by the "aux" optimizer to
        update *only* the `EntropyBottleneck.quantiles` parameters. In
        fact, the "aux" loss does not depend on image data at all.

        The purpose of the "aux" loss is to determine the range within
        which most of the mass of a given distribution is contained, as
        well as its median (i.e. 50% probability). That is, for a given
        distribution, the "aux" loss converges towards satisfying the
        following conditions for some chosen `tail_mass` probability:
        - `cdf(quantiles[0]) = tail_mass / 2`,
        - `cdf(quantiles[1]) = 0.5`, and
        - `cdf(quantiles[2]) = 1 - tail_mass / 2`.
        This ensures that the concrete `_quantized_cdf`s operate
        primarily within a finitely supported region. Any symbols
        outside this range must be coded using some alternative method
        that does *not* involve the `_quantized_cdf`s. Luckily, one may
        choose a `tail_mass` probability that is sufficiently small so
        that this rarely occurs. It is important that we work with
        `_quantized_cdf`s that have a small finite support; otherwise,
        entropy coding runtime performance would suffer. Thus,
        `tail_mass` should not be too small, either!
        """
        loss = sum(m.loss() for m in self.modules() if isinstance(m, EntropyBottleneck))
        return cast(Tensor, loss)
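
To illustrate why the traceback reports torch.Size([0]) for the current model, here is a minimal sketch in plain PyTorch. TinyEntropyModel is a hypothetical toy module, not part of CompressAI; it only assumes that a buffer is registered empty and filled in later, as the error message suggests. The resize step stands in for the kind of pre-allocation that update_registered_buffers is meant to perform and is not the library's actual implementation.

```python
import torch
import torch.nn as nn


class TinyEntropyModel(nn.Module):
    """Hypothetical stand-in for a module whose buffers start out empty."""

    def __init__(self):
        super().__init__()
        self.register_buffer("_offset", torch.zeros(0, dtype=torch.int32))


# "Trained" model: the buffer was filled in after construction.
trained = TinyEntropyModel()
trained._offset = torch.zeros(192, dtype=torch.int32)
state_dict = trained.state_dict()

# Fresh model: the buffer still has shape [0], so a plain load fails.
fresh = TinyEntropyModel()
try:
    fresh.load_state_dict(state_dict)
    plain_load_failed = False
except RuntimeError as err:
    plain_load_failed = True
    print(str(err).splitlines()[0])

# Resizing the empty buffer to the checkpoint's shape lets the load succeed.
fresh._offset.resize_(state_dict["_offset"].size())
fresh.load_state_dict(state_dict)
print(fresh._offset.shape)
```

If the buffer names in the checkpoint carry an extra prefix, a lookup by the unprefixed name would not find them, so the resize would never happen and the plain load would fail as in the first half of the sketch.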

Hi,
Thank you for your detailed answer.
Yes you are right, I am not building compressAI from the source. The requested output is as follows:
COMPRESSAI_PATH = /anaconda/envs/azureml_py38/lib/python3.8/site-packages/compressai.

As for the proposed solution, my CompressionModel class already looks the same as the one you mentioned. I copied it earlier because there were some issues with multi-GPU training, and copying it worked for me.
Please look at my project over here: Entropy Models / Hyperprior Files
I think the issue might arise from using multiple versions at the same time. I install compressai from PyPI, but I also redefine some files (e.g. entropy_models.py) in my own code, and these might differ from the PyPI version. Could this be a problem?

DataParallel adds a module. prefix by default to every key in the parallel_model.state_dict().
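
The prefix is easy to see on a small example. This sketch uses torch.nn.DataParallel with a toy nn.Linear layer as a stand-in for the real model; the CustomDataParallel wrapper from the traceback is assumed to behave the same way with respect to key names.

```python
import torch.nn as nn

# A toy layer stands in for the compression model.
model = nn.Linear(4, 2)
parallel_model = nn.DataParallel(model)

plain_keys = list(model.state_dict().keys())
parallel_keys = list(parallel_model.state_dict().keys())
print(plain_keys)     # ['weight', 'bias']
print(parallel_keys)  # ['module.weight', 'module.bias']

# The original model is still reachable through .module, without the prefix.
unwrapped_keys = list(parallel_model.module.state_dict().keys())
print(unwrapped_keys)  # ['weight', 'bias']
```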

Solutions:

1. Save the "non-parallel" model:

module = model.module if isinstance(model, DataParallel) else model
state_dict = module.state_dict()
torch.save(state_dict, "output.pth")  # torch.save takes the object first, then the path

2. Load checkpoint, rename all the keys, save new checkpoint:

ckpt = torch.load("input.pth")
print(ckpt.keys())
sd = "state_dict"  # I forgot what it was called.
print("\n".join(ckpt[sd].keys()))
# Note: str.removeprefix requires Python 3.9+.
ckpt[sd] = {k.removeprefix("module."): v for k, v in ckpt[sd].items()}
torch.save(ckpt, "output.pth")

3. Same as (2), but do it before loading the state_dict instead.
4. Load the model weights before wrapping it in DataParallel.

I would say (1) is the best and least likely to cause problems in the future, and maybe do (4) as well.
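
Since the environment above reports Python 3.8, where str.removeprefix is not available, option (3) can be sketched with a small helper instead. The name strip_prefix and the example keys are hypothetical; plain integers stand in for tensors so the snippet runs without PyTorch.

```python
def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }


# Plain values stand in for tensors here; the key names are illustrative.
checkpoint = {
    "module.entropy_bottleneck._offset": 1,
    "module.g_a.0.weight": 2,
    "itr_counter": 3,
}
cleaned = strip_prefix(checkpoint)
print(sorted(cleaned))
# ['entropy_bottleneck._offset', 'g_a.0.weight', 'itr_counter']
```

Keys without the prefix pass through unchanged, so the helper is safe to apply to checkpoints that were saved from either a wrapped or an unwrapped model.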

Yeah, you were right. Everything works now. I am attaching my code in case someone else faces a similar problem.

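def remove_prefix(key, prefix):
    # Helper not shown in the original post; assumed to behave like
    # str.removeprefix, which is unavailable on Python 3.8.
    return key[len(prefix):] if key.startswith(prefix) else key


def load_checkpoint(path, model):
    snapshot = torch.load(path)
    itr = snapshot['itr']
    print(f'Loaded from {itr} iterations')

    # Drop the "module." prefix that DataParallel adds to every key.
    dict_ = {}
    for k, v in snapshot["model"].items():
        k = remove_prefix(k, "module.")
        dict_[k] = v
    snapshot["model"] = dict_
    model.load_state_dict(snapshot['model'])

    # train.py unpacks both values, so return them as the original function did.
    return itr, model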

and in train.py, we have

model = model.to(device)
optimizer, aux_optimizer = configure_optimizers(model, config)
if args.resume:
    itr, model = load_checkpoint(args.resume, model)
    logger.load_itr(itr)
    
if torch.cuda.device_count() > 1:
    model = CustomDataParallel(model)

class CompressionModel(nn.Module):
    """Base class for constructing an auto-encoder with any number of
    EntropyBottleneck or GaussianConditional modules.
    """

    def load_state_dict(self, state_dict, strict=True):
        for name, module in self.named_modules():
            if not any(x.startswith(name) for x in state_dict.keys()):
                continue

            if isinstance(module, EntropyBottleneck):
                update_registered_buffers(
                    module,
                    name,
                    ["_quantized_cdf", "_offset", "_cdf_length"],
                    state_dict,
                )

            if isinstance(module, GaussianConditional):
                update_registered_buffers(
                    module,
                    name,
                    ["_quantized_cdf", "_offset", "_cdf_length", "scale_table"],
                    state_dict,
                )

        return nn.Module.load_state_dict(self, state_dict, strict=strict)

    def update(self, scale_table=None, force=False):
        """Updates EntropyBottleneck and GaussianConditional CDFs.

        Needs to be called once after training to be able to later perform the
        evaluation with an actual entropy coder.

        Args:
            scale_table (torch.Tensor): table of scales (i.e. stdev)
                for initializing the Gaussian distributions
                (default: 64 logarithmically spaced scales from 0.11 to 256)
            force (bool): overwrite previous values (default: False)

        Returns:
            updated (bool): True if at least one of the modules was updated.
        """
        if scale_table is None:
            scale_table = get_scale_table()
        updated = False
        for _, module in self.named_modules():
            if isinstance(module, EntropyBottleneck):
                updated |= module.update(force=force)
            if isinstance(module, GaussianConditional):
                updated |= module.update_scale_table(scale_table, force=force)
        return updated

    def aux_loss(self) -> Tensor:
        """Returns the total auxiliary loss over all `EntropyBottleneck`s.

        In contrast to the primary "net" loss used by the "net"
        optimizer, the "aux" loss is only used by the "aux" optimizer to
        update *only* the `EntropyBottleneck.quantiles` parameters. In
        fact, the "aux" loss does not depend on image data at all.

        The purpose of the "aux" loss is to determine the range within
        which most of the mass of a given distribution is contained, as
        well as its median (i.e. 50% probability). That is, for a given
        distribution, the "aux" loss converges towards satisfying the
        following conditions for some chosen `tail_mass` probability:
        - `cdf(quantiles[0]) = tail_mass / 2`,
        - `cdf(quantiles[1]) = 0.5`, and
        - `cdf(quantiles[2]) = 1 - tail_mass / 2`.
        This ensures that the concrete `_quantized_cdf`s operate
        primarily within a finitely supported region. Any symbols
        outside this range must be coded using some alternative method
        that does *not* involve the `_quantized_cdf`s. Luckily, one may
        choose a `tail_mass` probability that is sufficiently small so
        that this rarely occurs. It is important that we work with
        `_quantized_cdf`s that have a small finite support; otherwise,
        entropy coding runtime performance would suffer. Thus,
        `tail_mass` should not be too small, either!
        """
        loss = sum(m.loss() for m in self.modules() if isinstance(m, EntropyBottleneck))
        return cast(Tensor, loss)