open-mmlab/mmengine

[Bug] load_from pretrained checkpoint fails using FlexibleRunner and DeepSpeed



Environment

PyTorch 2.3.0 installed using conda

OrderedDict([
('sys.platform', 'linux'), 
('Python', '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]'), 
('CUDA available', True), 
('MUSA available', False), 
('numpy_random_seed', 2147483648), 
('GPU 0,1', 'NVIDIA RTX A6000'), 
('CUDA_HOME', '/usr'), 
('NVCC', 'Cuda compilation tools, release 11.5, V11.5.119'), 
('GCC', 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0'), 
('PyTorch', '2.3.0.dev20240211+cu118'), 
('PyTorch compiling details', 'PyTorch built with:\n  - GCC 9.3\n  - C++ Version: 201703\n  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)\n  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n  - LAPACK is enabled (usually provided by MKL)\n  - NNPACK is enabled\n  - CPU capability usage: AVX512\n  - CUDA Runtime 11.8\n  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90\n  - CuDNN 8.7\n  - Magma 2.6.1\n  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, \n'), ('TorchVision', '0.18.0.dev20240211+cu118'), 
('OpenCV', '4.9.0'), 
('MMEngine', '0.10.3')])

Reproduces the problem - code sample

This is examples/distributed_training_with_flexible_runner.py with a load_from argument added to FlexibleRunner

# Copyright (c) OpenMMLab. All rights reserved.
import argparse

import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

from mmengine.evaluator import BaseMetric
from mmengine.model import BaseModel
from mmengine.runner._flexible_runner import FlexibleRunner


class MMResNet50(BaseModel):

    def __init__(self):
        super().__init__()
        self.resnet = torchvision.models.resnet50()

    def forward(self, imgs, labels, mode):
        x = self.resnet(imgs)
        if mode == 'loss':
            return {'loss': F.cross_entropy(x, labels)}
        elif mode == 'predict':
            return x, labels


class Accuracy(BaseMetric):

    def process(self, data_batch, data_samples):
        score, gt = data_samples
        self.results.append({
            'batch_size': len(gt),
            'correct': (score.argmax(dim=1) == gt).sum().cpu(),
        })

    def compute_metrics(self, results):
        total_correct = sum(item['correct'] for item in results)
        total_size = sum(item['batch_size'] for item in results)
        return dict(accuracy=100 * total_correct / total_size)


def parse_args():
    parser = argparse.ArgumentParser(description='Distributed Training')
    parser.add_argument('--local_rank', '--local-rank', type=int, default=0)
    parser.add_argument('--use-fsdp', action='store_true')
    parser.add_argument('--use-deepspeed', action='store_true')
    parser.add_argument('--use-colossalai', action='store_true')
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    norm_cfg = dict(mean=[0.491, 0.482, 0.447], std=[0.202, 0.199, 0.201])
    train_set = torchvision.datasets.CIFAR10(
        'data/cifar10',
        train=True,
        download=True,
        transform=transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(**norm_cfg)
        ]))
    valid_set = torchvision.datasets.CIFAR10(
        'data/cifar10',
        train=False,
        download=True,
        transform=transforms.Compose(
            [transforms.ToTensor(),
             transforms.Normalize(**norm_cfg)]))
    train_dataloader = dict(
        batch_size=128,
        dataset=train_set,
        sampler=dict(type='DefaultSampler', shuffle=True),
        collate_fn=dict(type='default_collate'))
    val_dataloader = dict(
        batch_size=128,
        dataset=valid_set,
        sampler=dict(type='DefaultSampler', shuffle=False),
        collate_fn=dict(type='default_collate'))

    if args.use_deepspeed:
        strategy = dict(
            type='DeepSpeedStrategy',
            fp16=dict(
                enabled=True,
                fp16_master_weights_and_grads=False,
                loss_scale=0,
                loss_scale_window=500,
                hysteresis=2,
                min_loss_scale=1,
                initial_scale_power=15,
            ),
            inputs_to_half=[0],
            # bf16=dict(
            #     enabled=True,
            # ),
            zero_optimization=dict(
                stage=3,
                allgather_partitions=True,
                reduce_scatter=True,
                allgather_bucket_size=50000000,
                reduce_bucket_size=50000000,
                overlap_comm=True,
                contiguous_gradients=True,
                cpu_offload=False),
        )
        optim_wrapper = dict(
            type='DeepSpeedOptimWrapper',
            optimizer=dict(type='AdamW', lr=1e-3))
    elif args.use_fsdp:
        from functools import partial

        from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
        size_based_auto_wrap_policy = partial(
            size_based_auto_wrap_policy, min_num_params=1e7)
        strategy = dict(
            type='FSDPStrategy',
            model_wrapper=dict(auto_wrap_policy=size_based_auto_wrap_policy))
        optim_wrapper = dict(
            type='AmpOptimWrapper', optimizer=dict(type='AdamW', lr=1e-3))
    elif args.use_colossalai:
        from colossalai.tensor.op_wrapper import colo_op_impl

        # ColossalAI overwrite some torch ops with their custom op to
        # make it compatible with `ColoTensor`. However, a backward error
        # is more likely to happen if there are inplace operation in the
        # model.
        # For example, layers like `conv` + `bn` + `relu` is OK when `relu` is
        # inplace since PyTorch builtin ops `batch_norm` could handle it.
        # However, if `relu` is an `inplaced` op while `batch_norm` is an
        # custom op, an error will be raised since PyTorch thinks the custom op
        # could not handle the backward graph modification caused by inplace
        # op.
        # In this example, the inplace op `add_` in resnet could raise an error
        # since PyTorch consider the custom op before it could not handle the
        # backward graph modification
        colo_op_impl(torch.Tensor.add_)(torch.add)
        strategy = dict(type='ColossalAIStrategy')
        optim_wrapper = dict(optimizer=dict(type='HybridAdam', lr=1e-3))
    else:
        strategy = None
        optim_wrapper = dict(
            type='AmpOptimWrapper', optimizer=dict(type='AdamW', lr=1e-3))

    runner = FlexibleRunner(
        model=MMResNet50(),
        work_dir='./work_dirs',
        strategy=strategy,
        train_dataloader=train_dataloader,
        optim_wrapper=optim_wrapper,
        param_scheduler=dict(type='LinearLR'),
        train_cfg=dict(by_epoch=True, max_epochs=10, val_interval=1),
        val_dataloader=val_dataloader,
        val_cfg=dict(),
        val_evaluator=dict(type=Accuracy),
        load_from="https://download.pytorch.org/models/resnet50-19c8e357.pth"),

    runner.train()


if __name__ == '__main__':
    # torchrun --nproc-per-node 2 distributed_training_with_flexible_runner.py --use-fsdp  # noqa: 501
    # torchrun --nproc-per-node 2 distributed_training_with_flexible_runner.py --use-deepspeed  # noqa: 501
    # torchrun --nproc-per-node 2 distributed_training_with_flexible_runner.py
    # python distributed_training_with_flexible_runner.py
    main()

Reproduces the problem - command or script

torchrun --nproc-per-node 2 examples/distributed_training_with_flexible_runner.py --use-deepspeed

Reproduces the problem - error message

02/20 10:02:16 - mmengine - INFO - Load checkpoint from https://download.pytorch.org/models/resnet50-19c8e357.pth
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/randd/projects/mmengine/examples/distributed_training_with_flexible_runner.py", line 168, in <module>
[rank0]:     main()
[rank0]:   File "/home/randd/projects/mmengine/examples/distributed_training_with_flexible_runner.py", line 160, in main
[rank0]:     runner.train()
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1195, in train
[rank0]:     self.load_or_resume()
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1144, in load_or_resume
[rank0]:     self.load_checkpoint(self._load_from)
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1528, in load_checkpoint
[rank0]:     self.strategy.load_checkpoint(
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 434, in load_checkpoint
[rank0]:     _, extra_ckpt = self.model.load_checkpoint(
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2740, in load_checkpoint
[rank0]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2792, in _load_checkpoint
[rank0]:     sd_loader = SDLoaderFactory.get_sd_loader(ckpt_list, checkpoint_engine=self.checkpoint_engine)
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 43, in get_sd_loader
[rank0]:     return MegatronSDLoader(ckpt_list, version, checkpoint_engine)
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 193, in __init__
[rank0]:     super().__init__(ckpt_list, version, checkpoint_engine)
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 55, in __init__
[rank0]:     self.check_ckpt_list()
[rank0]:   File "/home/randd/miniconda3/envs/mmengine/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 168, in check_ckpt_list
[rank0]:     assert len(self.ckpt_list) > 0
[rank0]: AssertionError

Additional information

I am trying to train a CNN model using MMEngine with the DeepSpeed strategy, following the example examples/distributed_training_with_flexible_runner.py on a 2-GPU machine. I would like to start training from a pretrained checkpoint, so I am using the load_from argument to specify it; I have tried both a URL and a downloaded local file.

Expected behaviour:
I expected to be able to point the load_from argument at a 'normal' pretrained model checkpoint and have it loaded into the model successfully.

It looks like the DeepSpeed engine first tries to find sharded checkpoint files to load; when loading from a plain pretrained checkpoint these don't exist. Perhaps there is some way to generate them from the pretrained checkpoint, but I can't find any documentation or clues on how to start from a base pretrained checkpoint.
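
To illustrate the mismatch, here is a small hedged sketch (the URL is the same checkpoint as above; the description of DeepSpeed's behaviour is my reading of the traceback, not confirmed against its source):

import torch

# A torchvision checkpoint is a single flat state dict saved with torch.save:
sd = torch.hub.load_state_dict_from_url(
    'https://download.pytorch.org/models/resnet50-19c8e357.pth')
print(type(sd), list(sd)[:3])  # e.g. ['conv1.weight', 'bn1.weight', 'bn1.bias']

# DeepSpeed's engine.load_checkpoint(load_dir, tag) instead looks for the
# per-rank shard files that a previous DeepSpeed run wrote into a checkpoint
# directory; with a single .pth file that candidate list stays empty, which is
# the `assert len(self.ckpt_list) > 0` failure shown in the traceback.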

Hello, the load_from interface is generally used to resume experiments that have been interrupted, so the weights it expects are in the same format as those produced during training.

If you want to initialize the model with pretrained weights (such as the 'normal' torchvision weights), you can use the model's init_cfg. This is covered in the "Initialize the model with pretrained model" section of https://mmengine.readthedocs.io/en/latest/advanced_tutorials/initialize.html
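
For reference, a minimal sketch of that suggestion applied to the model from the code sample above (my own illustration, not necessarily the exact fix): init_cfg=dict(type='Pretrained', checkpoint=...) tells mmengine's weight initialization to load the given checkpoint when the runner calls init_weights(), instead of going through the strategy's load_checkpoint. Note that the torchvision state dict keys ('conv1.weight', ...) do not carry the 'resnet.' prefix this wrapper introduces, so the keys may need remapping before they match; that caveat is an assumption, adjust as needed.

import torchvision
from mmengine.model import BaseModel


class MMResNet50(BaseModel):

    def __init__(self):
        super().__init__(
            init_cfg=dict(
                type='Pretrained',
                # any checkpoint whose keys match this module's state dict;
                # the torchvision URL above works once its keys are remapped
                # to carry the 'resnet.' prefix (assumption, see note above)
                checkpoint='path/or/url/to/pretrained_weights.pth'))
        self.resnet = torchvision.models.resnet50()

    # forward() is unchanged from the code sample above

With this in place, the load_from argument can be dropped from the FlexibleRunner call.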

Thanks I was able to get this working with init_cfg.

This would be good information to add to the documentation.