PPYOLOE _bbox_loss raises "ValueError: Target -6 is out of lower bound" when computing the loss while training on a custom dataset

Question

PPYOLOE _bbox_loss raises "ValueError: Target -6 is out of lower bound" when computing the loss while training on a custom dataset

Closed this issue 17 days ago · 5 comments

YJH1108 commented 22 days ago

Search before asking

I have searched the issues and found no similar bug report.

Bug Component

Training

Describe the Bug

When training PPYOLOE on my own dataset, the following error is raised while computing bbox_loss:
"""
Traceback (most recent call last):
  File ".\tools\train.py", line 211, in <module>
    main()
  File ".\tools\train.py", line 207, in main
    run(FLAGS, cfg)
  File ".\tools\train.py", line 160, in run
    trainer.train(FLAGS.eval)
  File "E:\jingsai\PaddleDetection\ppdet\engine\trainer.py", line 577, in train
    outputs = model(data)
  File "E:\jingsai\PaddleDetection\venv_pd\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "E:\jingsai\PaddleDetection\venv_pd\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\jingsai\PaddleDetection\ppdet\modeling\architectures\meta_arch.py", line 60, in forward
    out = self.get_loss()
  File "E:\jingsai\PaddleDetection\ppdet\modeling\architectures\yolo.py", line 147, in get_loss
    return self._forward()
  File "E:\jingsai\PaddleDetection\ppdet\modeling\architectures\yolo.py", line 93, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "E:\jingsai\PaddleDetection\venv_pd\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "E:\jingsai\PaddleDetection\venv_pd\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\jingsai\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 264, in forward
    return self.forward_train(feats, targets, aux_pred)
  File "E:\jingsai\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 198, in forward_train
    return self.get_loss([
  File "E:\jingsai\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 455, in get_loss
    assign_out_dict = self.get_loss_from_assign(
  File "E:\jingsai\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 500, in get_loss_from_assign
    self._bbox_loss(pred_distri, pred_bboxes, anchor_points_s,
  File "E:\jingsai\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 364, in _bbox_loss
    loss_dfl = self._df_loss(pred_dist_pos, assigned_ltrb_pos,
  File "E:\jingsai\PaddleDetection\ppdet\modeling\heads\ppyoloe_head.py", line 319, in _df_loss
    loss_left = F.cross_entropy(
  File "E:\jingsai\PaddleDetection\venv_pd\lib\site-packages\paddle\nn\functional\loss.py", line 1719, in cross_entropy
    raise ValueError("Target {} is out of lower bound.".format(
ValueError: Target -1 is out of lower bound.
"""

The failing line, in ppyoloe_head.py, is:
"""
loss_dfl = self._df_loss(pred_dist_pos, assigned_ltrb_pos,
                         self.reg_range[0]) * bbox_weight
"""

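I printed the two variables pred_dist_pos and assigned_ltrb_pos and found that assigned_ltrb_pos frequently contains unusually large values.

I am not sure whether this is a bug or whether I am missing some setting needed for training on my own dataset.
Also, what do pred_dist_pos and assigned_ltrb_pos actually represent?

Any help would be appreciated.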
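
On the last question: the traceback shows pred_dist_pos being passed to _df_loss as the prediction and assigned_ltrb_pos as the target. In Distribution Focal Loss, each target is a left/top/right/bottom distance from an anchor point to a box edge, and each prediction is a set of logits over integer distance bins. The pure-Python sketch below illustrates that idea and is not PaddleDetection's code: the helpers `cross_entropy` and `df_loss`, the bin count of 17, and the explicit bounds checks are assumptions made for this example. It shows why a target outside the bin range ends in an "out of lower bound" error.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample; rejects labels outside the bins."""
    if label < 0:
        raise ValueError(f"Target {label} is out of lower bound.")
    if label >= len(logits):
        raise ValueError(f"Target {label} is out of upper bound.")
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def df_loss(logits, target, lower_bound=0):
    """Distribution Focal Loss for a single distance value.

    The continuous target is split between its two neighbouring integer
    bins, weighted by how close it lies to each of them.
    """
    left = math.floor(target)
    right = left + 1
    weight_left = right - target
    weight_right = 1.0 - weight_left
    return (cross_entropy(logits, left - lower_bound) * weight_left
            + cross_entropy(logits, right - lower_bound) * weight_right)

logits = [0.0] * 17                 # hypothetical: 17 bins for distances 0..16
in_range = df_loss(logits, 3.25)    # valid target between bins 3 and 4

try:
    df_loss(logits, -0.5)           # corrupted target: floor(-0.5) = -1 is not a bin
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Under these assumptions, any target that is negative or too large has no bin to land in, so corrupted values in assigned_ltrb_pos would surface as exactly this kind of ValueError.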

Environment

nothing

Bug description confirmation

I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

I'd like to help by submitting a PR!

Answer 1 · 2024-05-09T13:57:01.000Z

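After further debugging, I found that the values in assigned_ltrb are normal and lie within reg_range (0 to 17 by default). After masked_select, however, out-of-range values appear: assigned_ltrb_pos contained values such as 28, 60, 92, ... as well as negative numbers.

My understanding is that masked_select only picks values from the original tensor according to the mask. I am not sure whether I have misunderstood it.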
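
That understanding matches the documented behaviour of masked_select: the result holds exactly the input elements whose mask entry is true, in their original order, so it cannot contain a value that is absent from the input. A minimal pure-Python sketch of that contract follows. `masked_select_reference` and `violates_contract` are hypothetical helpers written for this illustration, not Paddle APIs, and the sample values are invented.

```python
def masked_select_reference(values, mask):
    """Return the elements of `values` whose mask entry is True, in order."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [v for v, keep in zip(values, mask) if keep]

def violates_contract(values, mask, observed):
    """True if `observed` differs from what masked selection must return."""
    return list(observed) != masked_select_reference(values, mask)

# Invented sample: distances within 0..17, as assigned_ltrb was reported to be.
values = [2.5, 7.0, 0.0, 16.9, 4.2]
mask = [True, False, True, True, False]

expected = masked_select_reference(values, mask)
ok = violates_contract(values, mask, [2.5, 0.0, 16.9])
# 28.0 and -6.0 never occur in `values`, so this output breaks the contract.
bad = violates_contract(values, mask, [2.5, 28.0, -6.0])
```

A check like this, applied to flattened tensors converted to Python lists, could help confirm whether a particular build returns corrupted output rather than the inputs being wrong.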

Answer 2 · 2024-05-10T03:07:57.000Z

On the CPU build, masked_select returns the correct result.
My environment is:
paddlepaddle-gpu 2.3.2
CUDA 11.2
cuDNN 8.2

code:
"""
import paddle

print(paddle.__version__)
x = paddle.randn((10,))
mask = x >= 0
y = paddle.masked_select(x, mask)
print(x)
print(mask)
print(y)
"""

Answer 3 · 2024-05-10T03:48:15.000Z

Which GPU are you using?

Answer 4 · 2024-05-10T07:20:45.000Z

The GPU is a 3050 Ti, driver version 546.80.

I installed the CPU build of paddlepaddle with:
python -m pip install paddlepaddle==2.3.2 -i https://mirror.baidu.com/pypi/simple

I installed paddlepaddle-gpu 2.3 with:
python -m pip install paddlepaddle-gpu==2.3.2.post112 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html

Later I found that the problem does not occur with Paddle 2.6.
I installed paddlepaddle-gpu 2.6 with:
python -m pip install paddlepaddle-gpu==2.6.1.post112 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html

However, I am currently taking part in a competition that allows version 2.3 at most.

Answer 5 · 2024-05-11T06:49:54.000Z

This is most likely a bug in older versions of Paddle that was fixed in later releases. Try changing the DFL range to [0-17].