Problem training on my own dataset

Question

li2jin opened this issue 3 years ago · 15 comments

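Hello author,
When I use the datasets from the paper, training and testing work normally. After switching to my own dataset, the first round of training runs, but during the minimization step the code raises the following error. Any help would be appreciated. QAQ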
  File "tools/train.py", line 267, in <module>
    main()
  File "tools/train.py", line 203, in main
    distributed=distributed, validate=(not args.no_validate), timestamp=timestamp, meta=meta)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmdet/apis/train.py", line 122, in train_detector
    runner.run([data_loaders_L, data_loaders_U], cfg.workflow, cfg.total_epochs)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 192, in run
    epoch_runner([data_loaders[i], data_loaders_u[i]], **kwargs)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 60, in train
    outputs = self.model.train_step(X_L, self.optimizer, **kwargs)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 31, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 228, in train_step
    losses = self(**data)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmdet/core/fp16/decorators.py", line 51, in new_func
    return old_func(*args, **kwargs)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 162, in forward
    return self.forward_train(x, img_metas, **kwargs)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmdet/models/detectors/single_stage.py", line 83, in forward_train
    losses = self.bbox_head.forward_train(x, img_metas, y_loc_img, y_cls_img, y_loc_img_ignore)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 81, in forward_train
    loss = self.L_wave_min(*loss_inputs, y_loc_img_ignore=y_loc_img_ignore)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmdet/core/fp16/decorators.py", line 131, in new_func
    return old_func(*args, **kwargs)
  File "/home/zx/anaconda3/envs/active/lib/python3.7/site-packages/mmdet/models/dense_heads/MIAOD_head.py", line 479, in L_wave_min
    if y_loc_img[0][0][0] < 0:
IndexError: index 0 is out of bounds for dimension 0 with size 0

Answer 1 · 2021-09-13T03:24:40.000Z

As described here, one possible solution is to change

if y_loc_img[0][0][0] < 0:

on line 479, in the L_wave_min function of mmdet/models/dense_heads/MIAOD_head.py, to:

if y_loc_img[0][0] < 0:

If that doesn't work, could you set a breakpoint at the failing line and print y_loc_img, y_loc_img[0], y_loc_img[0][0] and y_loc_img[0][0][0]?
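
For the inspection requested here, a small helper can summarize the structure of a nested container instead of indexing into it level by level. This is a minimal sketch, not code from the MI-AOD repository: the name describe_nested is hypothetical, and it only assumes that tensor-like objects expose a shape attribute, as PyTorch tensors do. FakeTensor is a stand-in so the sketch runs without PyTorch.

```python
def describe_nested(obj, name="y_loc_img"):
    """Return one line per element describing a nested list/tuple of tensor-like objects."""
    lines = []
    if isinstance(obj, (list, tuple)):
        lines.append(f"{name}: {type(obj).__name__} of length {len(obj)}")
        for i, item in enumerate(obj):
            lines.extend(describe_nested(item, f"{name}[{i}]"))
    elif hasattr(obj, "shape"):
        # Tensor-like leaf: report its shape rather than its values.
        lines.append(f"{name}: shape {tuple(obj.shape)}")
    else:
        lines.append(f"{name}: {obj!r}")
    return lines


class FakeTensor:
    """Hypothetical stand-in for a tensor; only carries a shape."""

    def __init__(self, shape):
        self.shape = shape


if __name__ == "__main__":
    batch = [FakeTensor((1, 4)), FakeTensor((4, 4))]
    print("\n".join(describe_nested(batch)))
```

Calling describe_nested(y_loc_img) at the breakpoint would show at a glance whether any per-image tensor has zero rows.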

Answer 2 · 2021-09-13T08:55:07.000Z

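Thank you for your reply. I set a breakpoint and ran print(y_loc_img[0]); it printed output in the following format:

tensor([[556.6667, 64.4960, 568.5186, 80.4347]], device='cuda:0')

Printing y_loc_img[0][0] shows two similar lists, and printing y_loc_img[0][0][0] raises an error. Changing the original code to y_loc_img[0][0] also raises an error, so I don't know what to change now.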

Answer 3 · 2021-09-13T09:15:22.000Z

Could you also provide the output of print(y_loc_img[0][0]) at the breakpoint, and the error log after changing the code to y_loc_img[0][0]?

Answer 4 · 2021-09-13T09:35:46.000Z

After changing it to y_loc_img[0][0] < 0, the error is: Boolean value of Tensor with more than one value is ambiguous

Last time, print(y_loc_img[0][0]) printed a series of lists like [tensor([[556.6667, 64.4960, 568.5186, 80.4347]], device='cuda:0'), tensor([914.8148, 435.9040, 924.8148, 460.3680], device='cuda:0')].
But just now it raised an error:
print(y_loc_img[0][0])
IndexError: index 0 is out of bounds for dimension 0 with size 0
The printed list was the same as for print(y_loc_img[0]).

Answer 5 · 2021-09-13T10:36:47.000Z

I tried print(y_loc_img[0][0]) again, and the output had this format: tensor([-1., -1., -1., -1.], device='cuda:0')

I then tried print(y_loc_img[0][0][0]) again, and this time it did print a value, something like tensor(785.1852, device='cuda:0')

Answer 6 · 2021-09-13T11:12:40.000Z

Are these y_loc_img[0][0] and y_loc_img[0][0][0] outputs and the corresponding errors all from the same data batch? If not, can you say which output corresponds to which error? Alternatively, you could print these two variables only when the error occurs.

Answer 7 · 2021-09-13T11:23:55.000Z

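They are all from the first data batch, after epoch [3] of the first cycle. I am now rerunning with the dataset you provided to check whether the output at the breakpoint has the same format as mine; it is currently training.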

Answer 9 · 2021-09-13T11:28:51.000Z

What I mean is that I would like results from one and the same data batch, not from runs where the batch changes each time. Ideally it is the batch that triggers the error. If you cannot set breakpoints with an IDE such as PyCharm, you can add the following lines to the code to catch the exception:

try:
    # the line that may raise the error
except:
    print(y_loc_img[0][0][0])
    print(y_loc_img[0][0])
    # repeat the line that may raise the error here
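
One caveat with this pattern: if the except block repeats the same index expression that failed, the print fails in the same way and the data is never shown. A minimal sketch that prints the whole container and then re-raises the original error follows. The function name first_coord_is_negative is hypothetical and does not exist in MIAOD_head.py; plain nested lists stand in for the tensors so the example runs without PyTorch.

```python
def first_coord_is_negative(y_loc_img):
    """Evaluate the failing check, dumping the whole input if indexing fails."""
    try:
        return bool(y_loc_img[0][0][0] < 0)
    except IndexError:
        # Print the container itself; indexing into it again would raise again.
        print("y_loc_img =", y_loc_img)
        raise  # re-raise so the original traceback is preserved


if __name__ == "__main__":
    # A box filled with -1 passes the check.
    print(first_coord_is_negative([[[-1.0, -1.0, -1.0, -1.0]]]))
    try:
        # First image has no boxes at all, so indexing fails.
        first_coord_is_negative([[], [[107.4, 305.8, 116.7, 320.3]]])
    except IndexError as err:
        print("IndexError:", err)
```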

Answer 10 · 2021-09-13T12:34:45.000Z

Following your suggestion, this is the error this time:
if y_loc_img[0][0][0] < 0:
IndexError: index 0 is out of bounds for dimension 0 with size 0

During handling of the above exception, another exception occurred:
...
print(y_loc_img[0][0][0])
IndexError: index 0 is out of bounds for dimension 0 with size 0

After I commented out print(y_loc_img[0][0][0]), the error is:
if y_loc_img[0][0][0] < 0:
IndexError: index 0 is out of bounds for dimension 0 with size 0

During handling of the above exception, another exception occurred:
...
print(y_loc_img[0][0])
IndexError: index 0 is out of bounds for dimension 0 with size 0

Answer 11 · 2021-09-13T12:40:45.000Z

What is the result if the code in the except block is changed to:

print(y_loc_img)

Answer 12 · 2021-09-13T12:52:47.000Z

The last log line is: 2021-09-13 20:47:11,988 - mmdet - INFO - Epoch [1][90/126] lr: 1.000e-03, eta: 0:00:23, time: 0.316, data_time: 0.009, memory: 2087, l_det_cls: 0.5413, l_det_loc: 0.3209, l_wave_dis: 0.0020, l_imgcls: 0.1809, L_wave_min: 1.0451

The printed result is:
[tensor([], device='cuda:0', size=(0, 4)), tensor([[107.4074, 305.8000, 116.6667, 320.2560],
        [437.7778, 442.2053, 447.7778, 451.1013],
        [678.5185, 445.9120, 691.4814, 455.1786],
        [778.5185, 114.1653, 788.1481, 130.8453]], device='cuda:0')]

The error is:
if y_loc_img[0][0][0] < 0:
IndexError: index 0 is out of bounds for dimension 0 with size 0

During handling of the above exception, another exception occurred:
...
if y_loc_img[0][0][0] < 0:
IndexError: index 0 is out of bounds for dimension 0 with size 0
Why does the list contain no values?
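
The printed result shows that the first tensor has size (0, 4), meaning that image carries no boxes, so any index into it fails. A minimal sketch for locating such images in a batch is given here. find_empty_box_indices is a hypothetical helper, not part of MI-AOD, and it assumes each element supports len(), which holds for PyTorch tensors as well as for the lists used below.

```python
def find_empty_box_indices(y_loc_img):
    """Return the indices of images whose box container has zero rows."""
    return [i for i, boxes in enumerate(y_loc_img) if len(boxes) == 0]


if __name__ == "__main__":
    # First image has no boxes; second image has one box.
    batch = [[], [[107.4074, 305.8000, 116.6667, 320.2560]]]
    print(find_empty_box_indices(batch))  # prints [0]
```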

Answer 13 · 2021-09-14T01:42:01.000Z

The likely cause is that, when using your own dataset, you did not add a dummy bounding box to the unlabeled images. MI-AOD requires every image in the unlabeled set to have one arbitrary bounding box (its location does not need to be accurate). For details, see Question 3 in the Custom Modifications part of the FAQ and the corresponding issue.
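
As an illustration of adding such a box, here is a minimal sketch that assumes Pascal VOC-style XML annotations; the thread does not say which annotation format the custom dataset uses, so treat that as an assumption. The function add_dummy_box and the default coordinates are hypothetical, and class_name should be a class that your dataset config actually defines.

```python
import xml.etree.ElementTree as ET


def add_dummy_box(xml_text, class_name, box=(1, 1, 2, 2)):
    """Add one arbitrary <object> to a VOC-style annotation that has none."""
    root = ET.fromstring(xml_text)
    if root.find("object") is not None:
        return xml_text  # already has at least one box; leave unchanged
    obj = ET.SubElement(root, "object")
    ET.SubElement(obj, "name").text = class_name
    ET.SubElement(obj, "difficult").text = "0"
    bndbox = ET.SubElement(obj, "bndbox")
    # The location is arbitrary; it only has to exist.
    for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
        ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    empty = "<annotation><filename>a.jpg</filename></annotation>"
    print(add_dummy_box(empty, "car"))
```

Applied to each annotation file of the unlabeled images before training, this would guarantee that no image reaches the loss with an empty box tensor.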

Answer 14 · 2021-09-14T01:47:48.000Z

OK, thank you for your help.
