Model loading fails when training yolov7 on the dataset provided by the tutorial.

Question

Model loading fails when training yolov7 on the dataset provided by the tutorial.

Closed this issue 4 months ago · 4 comments

Living190711 commented 9 months ago

1. Environment: ModelArts (image mindspore_1.10.0-cann_6.0.1-py_3.7-euler_2.8.3), EulerOS 2.0 (SP8), CANN-6.0.1, MindSpore 1.10, mindyolo r0.1.
2. Dataset preparation and training guide: https://github.com/mindspore-lab/mindyolo/blob/master/examples/finetune_SHWD/README.md
3. The following error is raised during training:
RuntimeError: For 'load_param_into_net', model.model.77.m.0.weight in the argument 'net' should have the same shape as model.model.77.m.0.weight in the argument 'parameter_dict'. But got its shape (21, 128, 1, 1) in the argument 'net' and shape (255, 128, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
How can this be resolved?
Note: the log file is attached.
outputlog.txt

Answer 1 · 2024-03-02T07:26:07.000Z

1. Environment: ModelArts (mindspore_1.9.0-cann_6.0.0-py_3.7-euler_2.8.3), EulerOS 2.0 (SP8), CANN-6.0.RC1, mindyolo r0.1.

2. I built my own dataset following the approach described on the master branch (https://github.com/mindspore-lab/mindyolo/tree/master/examples/finetune_SHWD) and trained the model with the mindyolo r0.1 branch.

3. Configuration file:
__BASE__: [
  '/home/ma-user/work/mindyolo-r0.1/configs/yolov8/yolov8n.yaml',
]

per_batch_size: 16 # 16 * 8 = 128
img_size: 640 # image sizes
weight: /home/ma-user/work/mindyolo-r0.1/pre-ckpt/yolov8-n_500e_mAP372-cc07f5bd.ckpt
strict_load: False

data:
  dataset_name: shwd
  train_set: /home/ma-user/work/mindyolo-r0.1/dataset-test/SHWD/train.txt
  val_set: /home/ma-user/work/mindyolo-r0.1/dataset-test/SHWD/val.txt
  test_set: /home/ma-user/work/mindyolo-r0.1/dataset-test/SHWD/val.txt
  nc: 3
  names: [ 'helmet', 'gloves', 'shawl' ]

optimizer:
  lr_init: 0.001 # initial learning rate

3. Training fails at the model-loading stage. I tried the yolov8n, yolov7-tiny, and yolov5n pretrained checkpoints, and every one of them raises a load error:

yolov8n:
[CRITICAL] ME(22510:281472828627520,MainProcess):2024-03-02-14:35:39.167.353 [mindspore/train/serialization.py:112] Failed to combine the net and the parameters for param model.model.22.cv3.0.0.conv.weight.
Traceback (most recent call last):
  File "train.py", line 290, in <module>
    train(args)
  File "train.py", line 128, in train
    load_pretrain(network, args.weight, ema, args.ema_weight) # load pretrain
  File "/home/ma-user/work/mindyolo-r0.1/mindyolo/utils/utils.py", line 91, in load_pretrain
    ms.load_param_into_net(network, param_dict)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 703, in load_param_into_net
    _load_dismatch_prefix_params(net, parameter_dict, param_not_load, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 742, in _load_dismatch_prefix_params
    _update_param(param, new_param, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 118, in _update_param
    raise RuntimeError(msg)
RuntimeError: For 'load_param_into_net', model.model.22.cv3.0.0.conv.weight in the argument 'net' should have the same shape as model.model.22.cv3.0.0.conv.weight in the argument 'parameter_dict'. But got its shape (64, 64, 3, 3) in the argument 'net' and shape (80, 64, 3, 3) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.

yolov7-tiny:
[CRITICAL] ME(44701:281473522788928,MainProcess):2024-03-02-14:51:05.733.431 [mindspore/train/serialization.py:112] Failed to combine the net and the parameters for param model.model.77.m.0.weight.
Traceback (most recent call last):
  File "train.py", line 290, in <module>
    train(args)
  File "train.py", line 128, in train
    load_pretrain(network, args.weight, ema, args.ema_weight) # load pretrain
  File "/home/ma-user/work/mindyolo-r0.1/mindyolo/utils/utils.py", line 91, in load_pretrain
    ms.load_param_into_net(network, param_dict)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 703, in load_param_into_net
    _load_dismatch_prefix_params(net, parameter_dict, param_not_load, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 742, in _load_dismatch_prefix_params
    _update_param(param, new_param, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 118, in _update_param
    raise RuntimeError(msg)
RuntimeError: For 'load_param_into_net', model.model.77.m.0.weight in the argument 'net' should have the same shape as model.model.77.m.0.weight in the argument 'parameter_dict'. But got its shape (24, 128, 1, 1) in the argument 'net' and shape (255, 128, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.

yolov5n:
[CRITICAL] ME(49608:281473261058624,MainProcess):2024-03-02-14:53:49.428.28 [mindspore/train/serialization.py:112] Failed to combine the net and the parameters for param model.model.24.m.0.weight.
Traceback (most recent call last):
  File "train.py", line 290, in <module>
    train(args)
  File "train.py", line 128, in train
    load_pretrain(network, args.weight, ema, args.ema_weight) # load pretrain
  File "/home/ma-user/work/mindyolo-r0.1/mindyolo/utils/utils.py", line 91, in load_pretrain
    ms.load_param_into_net(network, param_dict)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 703, in load_param_into_net
    _load_dismatch_prefix_params(net, parameter_dict, param_not_load, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 742, in _load_dismatch_prefix_params
    _update_param(param, new_param, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 118, in _update_param
    raise RuntimeError(msg)
RuntimeError: For 'load_param_into_net', model.model.24.m.0.weight in the argument 'net' should have the same shape as model.model.24.m.0.weight in the argument 'parameter_dict'. But got its shape (24, 64, 1, 1) in the argument 'net' and shape (255, 64, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.

4. When no pretrained checkpoint is loaded, the training program runs:
2024-03-02 15:21:02,162 [INFO] Epoch 6/300, Step 39/39, step time: 1896.07 ms
2024-03-02 15:21:02,871 [INFO] Saving model to ./runs/2024.03.02-15:10:50/weights/yolov5n_shwd-6_39.ckpt
2024-03-02 15:21:02,872 [INFO] Epoch 6/300, epoch time: 1.24 min.
2024-03-02 15:22:16,137 [WARNING] overflow, still update, loss scale adjust to 1024.0
2024-03-02 15:22:16,147 [INFO] Epoch 7/300, Step 39/39, imgsize (640, 640), loss: 0.2346, lbox: 0.0723, lobj: 0.0548, lcls: 0.1075, cur_lr: 0.0009768999880179763
2024-03-02 15:22:16,149 [INFO] Epoch 7/300, Step 39/39, step time: 1878.87 ms
2024-03-02 15:22:16,761 [INFO] Saving model to ./runs/2024.03.02-15:10:50/weights/yolov5n_shwd-7_39.ckpt
2024-03-02 15:22:16,762 [INFO] Epoch 7/300, epoch time: 1.23 min.
2024-03-02 15:23:31,967 [WARNING] overflow, still update, loss scale adjust to 1024.0
2024-03-02 15:23:31,977 [INFO] Epoch 8/300, Step 39/39, imgsize (640, 640), loss: 0.2195, lbox: 0.0681, lobj: 0.0481, lcls: 0.1034, cur_lr: 0.0009736000210978091
2024-03-02 15:23:31,979 [INFO] Epoch 8/300, Step 39/39, step time: 1928.60 ms
2024-03-02 15:23:32,630 [INFO] Saving model to ./runs/2024.03.02-15:10:50/weights/yolov5n_shwd-8_39.ckpt
2024-03-02 15:23:32,631 [INFO] Epoch 8/300, epoch time: 1.26 min.

Could you advise how to resolve the failure to load the pretrained model?

Answer 2 · 2024-03-02T08:43:07.000Z

yolov7-tiny:
RuntimeError: For 'load_param_into_net', model.model.77.m.0.weight in the argument 'net' should have the same shape as model.model.77.m.0.weight in the argument 'parameter_dict'. But got its shape (24, 128, 1, 1) in the argument 'net' and shape (255, 128, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.

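My training run uses 3 classes (3*(3+5)=24), while the pretrained model has 80 classes (3*(80+5)=255), which is what causes the shape mismatch. I did change the training configuration, though, and that setting is supposed to allow the last layer's weights (the mismatched shape) to be dropped.

The error is raised anyway. How should I fix this?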
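
The arithmetic in this comment can be checked with a small sketch. It assumes an anchor-based head with 3 anchors per scale and 5 box/objectness values per anchor, which is what the 3*(nc+5) formula above implies; the function names are hypothetical and not part of mindyolo. The yolov8n mismatch quoted earlier (64 vs 80) does not follow this formula, so the sketch only covers the yolov7-tiny and yolov5n cases.

```python
# Sketch: output channels of an anchor-based YOLO detection head.
# Assumption: 3 anchors per scale, each predicting nc class scores
# plus 4 box coordinates and 1 objectness score.

def head_channels(num_classes: int, anchors_per_scale: int = 3) -> int:
    """Channels of the final 1x1 conv: anchors * (classes + 5)."""
    return anchors_per_scale * (num_classes + 5)


def classes_from_channels(channels: int, anchors_per_scale: int = 3) -> int:
    """Invert head_channels; raise if the channel count does not fit."""
    per_anchor, remainder = divmod(channels, anchors_per_scale)
    if remainder or per_anchor <= 5:
        raise ValueError(
            f"{channels} channels do not fit {anchors_per_scale} anchors"
        )
    return per_anchor - 5


if __name__ == "__main__":
    print(head_channels(3))           # 24: network built with nc: 3
    print(head_channels(80))          # 255: 80-class pretrained checkpoint
    print(classes_from_channels(21))  # 2: shape reported in the original question
```

Read this way, the 21-channel shape in the original question corresponds to a 2-class network, and the 24-channel shape here to a 3-class one; both differ from the 255 channels stored in the checkpoint.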

Answer 3 · 2024-03-12T06:10:41.000Z

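Judging from the error, the model structure and the checkpoint weight shapes do not match, probably because the number of classes in the last layer was changed.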
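
To confirm which dimension disagrees, the two shapes can be pulled straight out of the error text. The snippet below is only a sketch: the regular expression is written against the wording of the messages quoted in this thread, and the helper names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches fragments such as: shape (24, 128, 1, 1) in the argument 'net'
SHAPE_RE = re.compile(
    r"shape \(([\d,\s]+)\) in the argument '(net|parameter_dict)'"
)

# Tail of the yolov7-tiny error quoted in this thread.
MESSAGE = (
    "But got its shape (24, 128, 1, 1) in the argument 'net' and shape "
    "(255, 128, 1, 1) in the argument 'parameter_dict'."
)


def shapes_from_error(message: str) -> Dict[str, Tuple[int, ...]]:
    """Map 'net' / 'parameter_dict' to the shape reported for each."""
    return {
        owner: tuple(int(d) for d in dims.split(","))
        for dims, owner in SHAPE_RE.findall(message)
    }


def differing_axes(a: Tuple[int, ...], b: Tuple[int, ...]) -> List[int]:
    """Indices of the axes where two same-rank shapes disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    shapes = shapes_from_error(MESSAGE)
    print(shapes)
    print(differing_axes(shapes["net"], shapes["parameter_dict"]))
```

For the message above only axis 0 differs, which for a convolution weight is the output-channel dimension, consistent with a changed class count.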


Answer 4 · 2024-03-12T06:12:08.000Z

The weight-loading logic lives at the location below; try debugging inside that function to see what happens:
https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/utils.py#L113
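
While debugging there, one thing worth checking is whether mismatched entries are filtered out before loading. The tracebacks in this thread show the r0.1 `load_pretrain` calling `ms.load_param_into_net(network, param_dict)` directly at utils.py line 91, so the checkpoint may be reaching MindSpore unfiltered. The sketch below shows the filtering idea on plain dictionaries of shapes so that it runs without MindSpore; `split_loadable` and the `model.model.0.conv.weight` entry are hypothetical, and the MindSpore usage in the trailing comment is an untested assumption rather than mindyolo's actual implementation.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def split_loadable(
    net_shapes: Dict[str, Shape], ckpt_shapes: Dict[str, Shape]
) -> Tuple[List[str], List[str]]:
    """Return (names safe to load, names to skip) by comparing shapes."""
    loadable, skipped = [], []
    for name, ckpt_shape in ckpt_shapes.items():
        if net_shapes.get(name) == ckpt_shape:
            loadable.append(name)
        else:
            skipped.append(name)  # absent from the net, or shape differs
    return loadable, skipped


if __name__ == "__main__":
    # Head shapes taken from the yolov7-tiny error in this thread;
    # the backbone entry is a made-up placeholder.
    net = {
        "model.model.0.conv.weight": (32, 3, 3, 3),
        "model.model.77.m.0.weight": (24, 128, 1, 1),
    }
    ckpt = {
        "model.model.0.conv.weight": (32, 3, 3, 3),
        "model.model.77.m.0.weight": (255, 128, 1, 1),
    }
    keep, skip = split_loadable(net, ckpt)
    print("load:", keep)
    print("skip:", skip)

# With MindSpore this would roughly become (assumption, not verified on r0.1):
#   param_dict = ms.load_checkpoint(weight)
#   net_shapes = {k: tuple(v.shape) for k, v in network.parameters_dict().items()}
#   ckpt_shapes = {k: tuple(v.shape) for k, v in param_dict.items()}
#   keep, _ = split_loadable(net_shapes, ckpt_shapes)
#   ms.load_param_into_net(network, {k: param_dict[k] for k in keep})
```

Skipped layers keep their random initialization, so the detection head would be trained from scratch while the rest of the network starts from the pretrained weights.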