open-mmlab/mmengine

[Bug] DeepSpeed2 does not automatically exclude frozen parameters

Baboom-l opened this issue · 1 comment

Prerequisite

Environment

mmdetection 3.3.0
mmengine 0.8.4
torch 1.13.1

Reproduces the problem - code sample

runner_type = 'FlexibleRunner'
strategy = dict(
    type='DeepSpeedStrategy',
    gradient_clipping=0.1,
    fp16=dict(
        enabled=True,
        fp16_master_weights_and_grads=False,
        loss_scale=0,
        loss_scale_window=500,
        hysteresis=2,
        min_loss_scale=1,
        initial_scale_power=15,
    ),
    inputs_to_half=['inputs'],
    zero_optimization=dict(
        stage=2,
        allgather_partitions=True,
        reduce_scatter=True,
        allgather_bucket_size=50000000,
        reduce_bucket_size=50000000,
        overlap_comm=True,
        contiguous_gradients=True,
        cpu_offload=False),
)

optim_wrapper = dict(
    type='DeepSpeedOptimWrapper',
    optimizer=dict(
        type='AdamW',
        lr=0.0001,  # 0.0002 for DeformDETR
        weight_decay=0.0001),
    # clip_grad=dict(max_norm=0.1, norm_type=2),
    paramwise_cfg=dict(
        custom_keys={
            'memory_trans_norm': dict(decay_mult=0),
            'in_proj_bias': dict(decay_mult=0),
            'backbone': dict(lr_mult=0.01)},
        norm_decay_mult=0,
        bias_decay_mult=0,
        bypass_duplicate=True))

# To debug

default_hooks.update(dict(logger=dict(interval=1))) # noqa
log_processor.update(dict(window_size=1)) # noqa
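
For readers unfamiliar with `paramwise_cfg`, the sketch below illustrates how the `custom_keys` in the config above would translate into per-parameter learning rates and weight decay. It assumes a key matches when it is a substring of the parameter name and that longer keys are tried first; the helper `resolve_custom_key` and the parameter names are hypothetical, and the `norm_decay_mult` / `bias_decay_mult` handling is left out. It is an approximation for illustration, not mmengine's actual implementation.

```python
def resolve_custom_key(name, base_lr, base_wd, custom_keys):
    """Return (lr, weight_decay) for a parameter name under custom_keys.

    Hypothetical helper: assumes substring matching, longest key first.
    """
    for key in sorted(sorted(custom_keys), key=len, reverse=True):
        if key in name:
            cfg = custom_keys[key]
            return (base_lr * cfg.get('lr_mult', 1.0),
                    base_wd * cfg.get('decay_mult', 1.0))
    # No custom key matched: fall back to the optimizer defaults.
    return base_lr, base_wd


custom_keys = {
    'memory_trans_norm': dict(decay_mult=0),
    'in_proj_bias': dict(decay_mult=0),
    'backbone': dict(lr_mult=0.01),
}

# Hypothetical parameter names, used only to exercise the matching rule.
backbone_lr, backbone_wd = resolve_custom_key(
    'backbone.layer1.0.conv1.weight', 1e-4, 1e-4, custom_keys)
bias_lr, bias_wd = resolve_custom_key(
    'encoder.self_attn.in_proj_bias', 1e-4, 1e-4, custom_keys)
```

Under these assumptions a backbone weight trains at 1% of the base learning rate with the default weight decay, while `in_proj_bias` keeps the base learning rate with weight decay disabled.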

Reproduces the problem - command or script

After I freeze some of the weights in the model, running training raises an error.
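
A possible workaround can be sketched as follows, assuming the failure comes from frozen parameters (`requires_grad=False`) still being present in the optimizer's parameter groups. The helper `drop_frozen_params` is hypothetical and is not necessarily what #1441 changed; it only shows the idea of filtering the groups before the optimizer is built. The stand-in objects mimic the one attribute of `torch.nn.Parameter` that the helper reads.

```python
from types import SimpleNamespace


def drop_frozen_params(param_groups):
    """Return new param groups that keep only trainable parameters.

    Hypothetical helper: works on any objects exposing `requires_grad`.
    """
    kept = []
    for group in param_groups:
        trainable_params = [p for p in group['params'] if p.requires_grad]
        if trainable_params:  # skip groups that would end up empty
            kept.append({**group, 'params': trainable_params})
    return kept


# Stand-ins for torch.nn.Parameter objects; only requires_grad is used here.
frozen = SimpleNamespace(requires_grad=False)
trainable = SimpleNamespace(requires_grad=True)

groups = [
    {'params': [frozen], 'lr': 1e-6},     # e.g. a frozen backbone weight
    {'params': [trainable], 'lr': 1e-4},  # e.g. a trainable head weight
]
filtered = drop_frozen_params(groups)
```

The per-group options such as `lr` are copied unchanged, so the multipliers from `paramwise_cfg` would still apply to the parameters that remain.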

Reproduces the problem - error message

[image: 83dc58d816fced727fd2399df4d3015]

Additional information

No response

resolved by #1441