InterDigitalInc/CompressAI

Unable to resume training

danishnazir opened this issue · 6 comments

Bug

Hi,
I am trying to resume training on a pretrained model (https://github.com/micmic123/QmapCompression), which is based on CompressAI. The pretrained model uses the Hyperprior architecture, with some additions.

To Reproduce

import torch

def load_checkpoint(path, model, optimizer=None, aux_optimizer=None, scaler=None, only_net=False):
    snapshot = torch.load(path)
    itr = snapshot['itr']
    print(f'Loaded from {itr} iterations')

    model.load_state_dict(snapshot['model'])

    # Optionally restore the optimizer, aux optimizer, and AMP scaler states.
    if not only_net:
        if 'optimizer' in snapshot:
            optimizer.load_state_dict(snapshot['optimizer'])
        if 'aux_optimizer' in snapshot:
            aux_optimizer.load_state_dict(snapshot['aux_optimizer'])
        if scaler is not None and 'scaler' in snapshot:
            scaler.load_state_dict(snapshot['scaler'])

    return itr, model

RuntimeError: Error(s) in loading state_dict for CustomDataParallel:
        size mismatch for module.entropy_bottleneck._offset: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.entropy_bottleneck._quantized_cdf: copying a param with shape torch.Size([192, 45]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.entropy_bottleneck._cdf_length: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional._offset: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional._quantized_cdf: copying a param with shape torch.Size([64, 3133]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional._cdf_length: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional.scale_table: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).

Expected behavior

The model should load without errors.

Environment

Please copy and paste the output from python3 -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.7.1+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.15.5

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB

Nvidia driver version: 470.141.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-gradcam==0.2.1
[pip3] pytorch-msssim==0.2.0
[pip3] pytorch-transformers==1.0.0
[pip3] torch==1.7.1+cu101
[pip3] torch-tb-profiler==0.4.0
[pip3] torchaudio==0.7.2
[pip3] torchvision==0.8.2+cu101
[conda] _pytorch_select           0.1                       cpu_0    anaconda
[conda] blas                      1.0                         mkl    anaconda
[conda] cudatoolkit               10.1.243             h6bb024c_0    anaconda
[conda] libmklml                  2019.0.5             h06a4308_0    anaconda
[conda] mkl                       2020.2                      256    anaconda
[conda] numpy                     1.20.1                   pypi_0    pypi
[conda] pytorch-gradcam           0.2.1                    pypi_0    pypi
[conda] pytorch-msssim            0.2.0                    pypi_0    pypi
[conda] pytorch-transformers      1.0.0                    pypi_0    pypi
[conda] torch                     1.7.1+cu101              pypi_0    pypi
[conda] torch-tb-profiler         0.4.0                    pypi_0    pypi
[conda] torchaudio                0.7.2                    pypi_0    pypi
[conda] torchvision               0.8.2+cu101              pypi_0    pypi

What is the version and commit hash for your local compressAI repository?

Thanks for your response. Can you please tell me how I can find the commit hash for my local compressAI repo? I installed compressAI using pip install compressai.
The library version is 1.2.2; I am not sure where to find the commit hash.

Can you please show us the output of:

COMPRESSAI_PATH="$(python -c 'import compressai; print(compressai.__path__[0])')"
echo "$COMPRESSAI_PATH"
cd "$COMPRESSAI_PATH"
git rev-parse HEAD

It sounds like you installed compressai from PyPI, so my recent commits b64b0da and 14ac02c are probably not the cause of the problem. The issue is that the module.entropy_bottleneck buffers are not being pre-allocated with enough space, since load_state_dict expects keys starting with entropy_bottleneck directly (a short sketch after the install commands below illustrates this). Good news: the recent commits might actually fix the problem! Consider installing compressai from source instead:

cd ~
git clone https://github.com/InterDigitalInc/CompressAI compressai
cd compressai
pip install -U pip && pip install -e .
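
For illustration, here is a minimal sketch, assuming the name-matching logic of the load_state_dict shown further below, of why the prefixed keys defeat the buffer resizing (key names taken from the error above):

# CompressionModel.load_state_dict matches modules by name before resizing
# the zero-sized CDF buffers. With a DataParallel checkpoint, every key
# carries a "module." prefix, so the match never fires.
name = "entropy_bottleneck"  # as reported by self.named_modules()
keys = ["module.entropy_bottleneck._quantized_cdf",
        "module.entropy_bottleneck._offset"]
print(any(k.startswith(name) for k in keys))  # False -> buffers stay at size 0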

Alternatively, you can also just copy-paste the new load_state_dict method into CompressionModel, defined here:

class CompressionModel(nn.Module):
    """Base class for constructing an auto-encoder with any number of
    EntropyBottleneck or GaussianConditional modules.
    """

    def load_state_dict(self, state_dict, strict=True):
        for name, module in self.named_modules():
            if not any(x.startswith(name) for x in state_dict.keys()):
                continue

            if isinstance(module, EntropyBottleneck):
                update_registered_buffers(
                    module,
                    name,
                    ["_quantized_cdf", "_offset", "_cdf_length"],
                    state_dict,
                )

            if isinstance(module, GaussianConditional):
                update_registered_buffers(
                    module,
                    name,
                    ["_quantized_cdf", "_offset", "_cdf_length", "scale_table"],
                    state_dict,
                )

        return nn.Module.load_state_dict(self, state_dict, strict=strict)

    def update(self, scale_table=None, force=False):
        """Updates EntropyBottleneck and GaussianConditional CDFs.

        Needs to be called once after training to be able to later perform the
        evaluation with an actual entropy coder.

        Args:
            scale_table (torch.Tensor): table of scales (i.e. stdev)
                for initializing the Gaussian distributions
                (default: 64 logarithmically spaced scales from 0.11 to 256)
            force (bool): overwrite previous values (default: False)

        Returns:
            updated (bool): True if at least one of the modules was updated.
        """
        if scale_table is None:
            scale_table = get_scale_table()
        updated = False
        for _, module in self.named_modules():
            if isinstance(module, EntropyBottleneck):
                updated |= module.update(force=force)
            if isinstance(module, GaussianConditional):
                updated |= module.update_scale_table(scale_table, force=force)
        return updated

    def aux_loss(self) -> Tensor:
        """Returns the total auxiliary loss over all `EntropyBottleneck`s.

        In contrast to the primary "net" loss used by the "net"
        optimizer, the "aux" loss is only used by the "aux" optimizer to
        update *only* the `EntropyBottleneck.quantiles` parameters. In
        fact, the "aux" loss does not depend on image data at all.

        The purpose of the "aux" loss is to determine the range within
        which most of the mass of a given distribution is contained, as
        well as its median (i.e. 50% probability). That is, for a given
        distribution, the "aux" loss converges towards satisfying the
        following conditions for some chosen `tail_mass` probability:

        - `cdf(quantiles[0]) = tail_mass / 2`,
        - `cdf(quantiles[1]) = 0.5`, and
        - `cdf(quantiles[2]) = 1 - tail_mass / 2`.

        This ensures that the concrete `_quantized_cdf`s operate
        primarily within a finitely supported region. Any symbols
        outside this range must be coded using some alternative method
        that does *not* involve the `_quantized_cdf`s. Luckily, one may
        choose a `tail_mass` probability that is sufficiently small so
        that this rarely occurs. It is important that we work with
        `_quantized_cdf`s that have a small finite support; otherwise,
        entropy coding runtime performance would suffer. Thus,
        `tail_mass` should not be too small, either!
        """
        loss = sum(m.loss() for m in self.modules() if isinstance(m, EntropyBottleneck))
        return cast(Tensor, loss)
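
As a usage note, here is a minimal sketch (model and checkpoint names are placeholders) of calling update() after loading weights, so that evaluation with an actual entropy coder can run:

import torch

net = MyHyperpriorModel()  # placeholder: your CompressionModel subclass
state = torch.load("checkpoint.pth")  # placeholder path
net.load_state_dict(state["model"])
net.update(force=True)  # rebuilds _quantized_cdf, _offset, _cdf_length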

Hi,
Thank you for your detailed answer.
Yes, you are right, I am not building compressAI from source. The requested output is as follows:
COMPRESSAI_PATH = /anaconda/envs/azureml_py38/lib/python3.8/site-packages/compressai

As for the proposed solution: my CompressionModel class already looks the same as the one you posted. I copied it earlier, since there were some issues with multi-GPU training, and copying it worked for me.
Please look at my project here: Entropy Models / Hyperprior Files
I think the issue might arise from using multiple versions at the same time. I use PyPI to install compressai, but I redefine some files, e.g. entropy_models.py, in my code, which might differ from the original PyPI version. Could this be a problem?

DataParallel adds a module. prefix by default to every key in the parallel_model.state_dict().

Solutions:

  1. Save the "non-parallel" model:

module = model.module if isinstance(model, DataParallel) else model
state_dict = module.state_dict()
torch.save(state_dict, "output.pth")  # torch.save takes the object first, then the path

  2. Load the checkpoint, rename all the keys, and save a new checkpoint:

ckpt = torch.load("input.pth")
print(ckpt.keys())
sd = "state_dict"  # I forgot what it was called.
print("\n".join(ckpt[sd].keys()))
# Note: str.removeprefix requires Python 3.9+.
ckpt[sd] = {k.removeprefix("module."): v for k, v in ckpt[sd].items()}
torch.save(ckpt, "output.pth")

  3. Same as (2), but strip the prefix right before loading the state_dict, instead of saving a new checkpoint.

  4. Load the model weights before wrapping the model in DataParallel (a combined sketch of (3) and (4) follows below).

I would say (1) is the best and least likely to cause problems in the future, and maybe do (4) as well.
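
Here is a minimal sketch of (3) and (4) combined, assuming the checkpoint stores the weights under a "model" key as in the snippets in this thread (the model class is a placeholder):

import torch
from torch.nn import DataParallel

model = MyModel()  # placeholder: your un-wrapped network

# (3): strip the DataParallel prefix in memory, just before loading.
ckpt = torch.load("input.pth", map_location="cpu")
ckpt["model"] = {k[len("module."):] if k.startswith("module.") else k: v
                 for k, v in ckpt["model"].items()}
model.load_state_dict(ckpt["model"])

# (4): only wrap in DataParallel after the weights have been loaded.
if torch.cuda.device_count() > 1:
    model = DataParallel(model)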

Yeah, you were right. Everything works now. I am attaching my code in case someone else faces a similar problem.

def remove_prefix(text, prefix):
    # str.removeprefix is only available in Python 3.9+, so use a small helper.
    return text[len(prefix):] if text.startswith(prefix) else text

def load_checkpoint(path, model):
    snapshot = torch.load(path)
    itr = snapshot['itr']
    print(f'Loaded from {itr} iterations')

    # Strip the "module." prefix that DataParallel added to every key.
    dict_ = {}
    for k, v in snapshot["model"].items():
        k = remove_prefix(k, "module.")
        dict_[k] = v
    snapshot["model"] = dict_

    model.load_state_dict(snapshot['model'])
    return itr, model

and in train.py, we have

model = model.to(device)
optimizer, aux_optimizer = configure_optimizers(model, config)
if args.resume:
    itr, model = load_checkpoint(args.resume, model)
    logger.load_itr(itr)
    
if torch.cuda.device_count() > 1:
    model = CustomDataParallel(model)