Model loading fails when training YOLOv7 with the dataset provided by the tutorial
1. Environment: ModelArts (mindspore_1.10.0-cann_6.0.1-py_3.7-euler_2.8.3 image), EulerOS 2.0 (SP8), CANN 6.0.1, MindSpore 1.10, mindyolo r0.1.
2. Dataset preparation and training guide: https://github.com/mindspore-lab/mindyolo/blob/master/examples/finetune_SHWD/README.md
3. The following error is raised during training:
RuntimeError: For 'load_param_into_net', model.model.77.m.0.weight in the argument 'net' should have the same shape as model.model.77.m.0.weight in the argument 'parameter_dict'. But got its shape (21, 128, 1, 1) in the argument 'net' and shape (255, 128, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
How can this be resolved?
[Note: the log file is attached: outputlog.txt]
1. Environment: ModelArts (mindspore_1.9.0-cann_6.0.0-py_3.7-euler_2.8.3), EulerOS 2.0 (SP8), CANN-6.0.RC1, mindyolo r0.1.
2. I built my own dataset following the guide on the master branch (https://github.com/mindspore-lab/mindyolo/tree/master/examples/finetune_SHWD) and trained the model with the mindyolo r0.1 branch.
3. Configuration file:
__BASE__: [
  '/home/ma-user/work/mindyolo-r0.1/configs/yolov8/yolov8n.yaml',
]
per_batch_size: 16 # 16 * 8 = 128
img_size: 640 # image sizes
weight: /home/ma-user/work/mindyolo-r0.1/pre-ckpt/yolov8-n_500e_mAP372-cc07f5bd.ckpt
strict_load: False
data:
  dataset_name: shwd
  train_set: /home/ma-user/work/mindyolo-r0.1/dataset-test/SHWD/train.txt
  val_set: /home/ma-user/work/mindyolo-r0.1/dataset-test/SHWD/val.txt
  test_set: /home/ma-user/work/mindyolo-r0.1/dataset-test/SHWD/val.txt
  nc: 3
  names: [ 'helmet', 'gloves', 'shawl' ]
optimizer:
  lr_init: 0.001 # initial learning rate
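4. An error occurs during training, while the model is being loaded. Training from the yolov8n, yolov7-tiny and yolov5n pretrained checkpoints all fails with the same kind of loading error: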
yolov8n:
[CRITICAL] ME(22510:281472828627520,MainProcess):2024-03-02-14:35:39.167.353 [mindspore/train/serialization.py:112] Failed to combine the net and the parameters for param model.model.22.cv3.0.0.conv.weight.
Traceback (most recent call last):
  File "train.py", line 290, in <module>
    train(args)
  File "train.py", line 128, in train
    load_pretrain(network, args.weight, ema, args.ema_weight) # load pretrain
  File "/home/ma-user/work/mindyolo-r0.1/mindyolo/utils/utils.py", line 91, in load_pretrain
    ms.load_param_into_net(network, param_dict)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 703, in load_param_into_net
    _load_dismatch_prefix_params(net, parameter_dict, param_not_load, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 742, in _load_dismatch_prefix_params
    _update_param(param, new_param, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 118, in _update_param
    raise RuntimeError(msg)
RuntimeError: For 'load_param_into_net', model.model.22.cv3.0.0.conv.weight in the argument 'net' should have the same shape as model.model.22.cv3.0.0.conv.weight in the argument 'parameter_dict'. But got its shape (64, 64, 3, 3) in the argument 'net' and shape (80, 64, 3, 3) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
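For yolov8n, the first mismatch reported is not a final 1×1 classifier but the first 3×3 conv of the class branch `cv3`. A minimal sketch of why, assuming mindyolo's YOLOv8 head mirrors the Ultralytics `Detect` head, where the class-branch width is `max(ch[0], min(nc, 100))` (an assumption, not taken from this thread). The helper names below are hypothetical. With 64 input channels (the second dimension in the log), 3 classes give a width of 64 and 80 classes give 80, which matches (64, 64, 3, 3) vs (80, 64, 3, 3). In other words, in YOLOv8 more than one layer of the head depends on `nc`.

```python
def cv3_width(ch0: int, nc: int) -> int:
    """Hidden width of the YOLOv8 class branch (cv3), assuming the
    Ultralytics rule max(ch[0], min(nc, 100))."""
    return max(ch0, min(nc, 100))


def cv3_first_conv_shape(ch0: int, nc: int) -> tuple:
    """Weight shape (out, in, kH, kW) of the first 3x3 conv in cv3[0]."""
    return (cv3_width(ch0, nc), ch0, 3, 3)


if __name__ == "__main__":
    print(cv3_first_conv_shape(64, 3))   # (64, 64, 3, 3): this 3-class network
    print(cv3_first_conv_shape(64, 80))  # (80, 64, 3, 3): 80-class checkpoint
```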
yolov7-tiny:
[CRITICAL] ME(44701:281473522788928,MainProcess):2024-03-02-14:51:05.733.431 [mindspore/train/serialization.py:112] Failed to combine the net and the parameters for param model.model.77.m.0.weight.
Traceback (most recent call last):
  File "train.py", line 290, in <module>
    train(args)
  File "train.py", line 128, in train
    load_pretrain(network, args.weight, ema, args.ema_weight) # load pretrain
  File "/home/ma-user/work/mindyolo-r0.1/mindyolo/utils/utils.py", line 91, in load_pretrain
    ms.load_param_into_net(network, param_dict)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 703, in load_param_into_net
    _load_dismatch_prefix_params(net, parameter_dict, param_not_load, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 742, in _load_dismatch_prefix_params
    _update_param(param, new_param, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 118, in _update_param
    raise RuntimeError(msg)
RuntimeError: For 'load_param_into_net', model.model.77.m.0.weight in the argument 'net' should have the same shape as model.model.77.m.0.weight in the argument 'parameter_dict'. But got its shape (24, 128, 1, 1) in the argument 'net' and shape (255, 128, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
yolov5n:
[CRITICAL] ME(49608:281473261058624,MainProcess):2024-03-02-14:53:49.428.28 [mindspore/train/serialization.py:112] Failed to combine the net and the parameters for param model.model.24.m.0.weight.
Traceback (most recent call last):
  File "train.py", line 290, in <module>
    train(args)
  File "train.py", line 128, in train
    load_pretrain(network, args.weight, ema, args.ema_weight) # load pretrain
  File "/home/ma-user/work/mindyolo-r0.1/mindyolo/utils/utils.py", line 91, in load_pretrain
    ms.load_param_into_net(network, param_dict)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 703, in load_param_into_net
    _load_dismatch_prefix_params(net, parameter_dict, param_not_load, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 742, in _load_dismatch_prefix_params
    _update_param(param, new_param, strict_load)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/serialization.py", line 118, in _update_param
    raise RuntimeError(msg)
RuntimeError: For 'load_param_into_net', model.model.24.m.0.weight in the argument 'net' should have the same shape as model.model.24.m.0.weight in the argument 'parameter_dict'. But got its shape (24, 64, 1, 1) in the argument 'net' and shape (255, 64, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
5. Without a pretrained checkpoint, training runs normally:
2024-03-02 15:21:02,162 [INFO] Epoch 6/300, Step 39/39, step time: 1896.07 ms
2024-03-02 15:21:02,871 [INFO] Saving model to ./runs/2024.03.02-15:10:50/weights/yolov5n_shwd-6_39.ckpt
2024-03-02 15:21:02,872 [INFO] Epoch 6/300, epoch time: 1.24 min.
2024-03-02 15:22:16,137 [WARNING] overflow, still update, loss scale adjust to 1024.0
2024-03-02 15:22:16,147 [INFO] Epoch 7/300, Step 39/39, imgsize (640, 640), loss: 0.2346, lbox: 0.0723, lobj: 0.0548, lcls: 0.1075, cur_lr: 0.0009768999880179763
2024-03-02 15:22:16,149 [INFO] Epoch 7/300, Step 39/39, step time: 1878.87 ms
2024-03-02 15:22:16,761 [INFO] Saving model to ./runs/2024.03.02-15:10:50/weights/yolov5n_shwd-7_39.ckpt
2024-03-02 15:22:16,762 [INFO] Epoch 7/300, epoch time: 1.23 min.
2024-03-02 15:23:31,967 [WARNING] overflow, still update, loss scale adjust to 1024.0
2024-03-02 15:23:31,977 [INFO] Epoch 8/300, Step 39/39, imgsize (640, 640), loss: 0.2195, lbox: 0.0681, lobj: 0.0481, lcls: 0.1034, cur_lr: 0.0009736000210978091
2024-03-02 15:23:31,979 [INFO] Epoch 8/300, Step 39/39, step time: 1928.60 ms
2024-03-02 15:23:32,630 [INFO] Saving model to ./runs/2024.03.02-15:10:50/weights/yolov5n_shwd-8_39.ckpt
2024-03-02 15:23:32,631 [INFO] Epoch 8/300, epoch time: 1.26 min.
Could you advise how to fix the problem of the pretrained checkpoint failing to load?
yolov7-tiny:
RuntimeError: For 'load_param_into_net', model.model.77.m.0.weight in the argument 'net' should have the same shape as model.model.77.m.0.weight in the argument 'parameter_dict'. But got its shape (24, 128, 1, 1) in the argument 'net' and shape (255, 128, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
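I am training on 3 classes, so the head outputs 3*(3+5)=24 channels, while the pretrained model was trained on 80 classes, giving 3*(80+5)=255. That is what causes the shape mismatch. I did change the configuration for training (strict_load: False), which is supposed to drop the last-layer weights whose shapes differ, but the error still occurs. How should I handle this case?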
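The channel arithmetic above can be checked with a short sketch. It assumes the standard anchor-based YOLOv5/YOLOv7 head, where each of the `na = 3` anchors per scale predicts 4 box offsets, 1 objectness score and `nc` class scores, so the 1×1 output conv has `na * (nc + 5)` filters. The function names are illustrative only, not mindyolo APIs.

```python
def head_out_channels(nc: int, na: int = 3) -> int:
    """Output channels of an anchor-based YOLO detection conv:
    na anchors x (4 box + 1 objectness + nc classes)."""
    return na * (nc + 5)


def head_weight_shape(nc: int, in_ch: int, na: int = 3) -> tuple:
    """Weight shape (out, in, 1, 1) of the 1x1 output conv."""
    return (head_out_channels(nc, na), in_ch, 1, 1)


if __name__ == "__main__":
    print(head_weight_shape(3, 128))   # (24, 128, 1, 1): yolov7-tiny, 3 classes
    print(head_weight_shape(80, 128))  # (255, 128, 1, 1): 80-class checkpoint
    print(head_weight_shape(3, 64))    # (24, 64, 1, 1): yolov5n, 3 classes
```

The same formula accounts for the (21, 128, 1, 1) shape in the first report, which corresponds to 2 classes.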
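Judging from the error, the model structure and the weight shapes don't match, probably because the number of classes in the last layer was changed.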
The weight-loading logic is in the function below; you could try debugging inside it:
https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/utils.py#L113
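One way to make fine-tuning work regardless of how `strict_load` is handled is to remove checkpoint entries whose shapes differ from the network before calling `load_param_into_net`. The tracebacks show the `RuntimeError` being raised from `_update_param` even with `strict_load: False`, which suggests that in this MindSpore version the flag does not skip shape mismatches. The sketch below is a minimal, framework-independent helper (`mismatched_keys` is a hypothetical name, not part of mindyolo). The commented lines show how it might be wired into a loader such as `load_pretrain`, assuming MindSpore's `load_checkpoint`, `Cell.parameters_dict()` and `load_param_into_net`; they are not run here.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def mismatched_keys(net_shapes: Dict[str, Shape],
                    ckpt_shapes: Dict[str, Shape]) -> List[str]:
    """Return checkpoint parameter names that also exist in the network
    but with a different shape (e.g. an 80-class head vs a 3-class head).
    Names absent from the network are left alone."""
    return [
        name
        for name, shape in ckpt_shapes.items()
        if name in net_shapes and tuple(net_shapes[name]) != tuple(shape)
    ]


# Hypothetical wiring inside a pretrained-weight loader (assumption):
#
# import mindspore as ms
# param_dict = ms.load_checkpoint(ckpt_path)
# net_shapes = {n: p.shape for n, p in network.parameters_dict().items()}
# ckpt_shapes = {n: p.shape for n, p in param_dict.items()}
# for name in mismatched_keys(net_shapes, ckpt_shapes):
#     print(f"skip {name}: {ckpt_shapes[name]} -> {net_shapes[name]}")
#     param_dict.pop(name)
# ms.load_param_into_net(network, param_dict, strict_load=False)
```

The skipped head layers keep their fresh initialization and are learned during fine-tuning, while the rest of the network still loads from the checkpoint.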