yuantn/MI-AOD

How do I fix errors when running this program on my own single-class VOC dataset?

yushiyundelei opened this issue · 18 comments

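I ran this program on a VOC-format dataset I made myself, using the default detector. The first cycle ran through and produced an evaluation table with AP, and then part of the data was selected for the second cycle. However, my VOC dataset has only one class, so during the first training on 5% of the data the training loss l_imgcls: 0.0000 stayed at 0 throughout. After a further 2.5% of the data was selected, training in the second cycle failed with index 0 is out of bounds for dimension 0 with size 0. When moving to data with only one class, i.e. when l_imgcls: 0.0000, what should I change to make it run? Thanks~
The error in the second cycle is: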
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/models/dense_heads/MIAOD_head.py", line 479, in L_wave_min
if y_loc_img[0][0][0] < 0:
IndexError: index 0 is out of bounds for dimension 0 with size 0

I looked at Question 3 of the custom data questions, but my data follows the VOC format exactly: each jpg image has one xml label file, and every image has annotated boxes. It still fails at if y_loc_img[0][0][0] < 0:
IndexError: index 0 is out of bounds for dimension 0 with size 0
Then, following your earlier answer, I changed it to if y_loc_img[0][0] < 0:
and the error became boolean value of Tensor with more than one value is ambiguous
Printing y_loc_img[0][0] gives tensor([254.6875, 229.6875, 378.1250, 293.7500], device='cuda:0')
Printing y_loc_img[0][0][0] gives tensor(254.6875, device='cuda:0')

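Following Question 2 of the custom data questions, since my data has only one class I tried adding a second class. There are now two classes, but the new class has no data at all. I tried training: it ran longer than with a single class and reached the third cycle, but the evaluation mAP of the first and second cycles both became 0.000, as shown below:
+-----------+-----+-------+--------+-------+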
| class | gts | dets | recall | ap |
+-----------+-----+-------+--------+-------+
| cancer | 973 | 54606 | 0.000 | 0.000 |
| no cancer | 0 | 0 | 0.000 | 0.000 |
+-----------+-----+-------+--------+-------+
| mAP | | | | 0.000 |
+-----------+-----+-------+--------+-------+
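Earlier, with a single class, training only ran for one cycle but gave a normal evaluation result of mAP=0.55, along with the two problems described above.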

This time, with two classes, both of these values became l_wave_dis_minus: nan, L_wave_max: nan near the end of the third cycle, as shown below:
2022-02-24 17:51:50,185 - mmdet - INFO - Epoch [1][50/749] lr: 1.000e-03, eta: 0:03:39, time: 0.157, data_time: 0.006, memory: 1496, l_det_cls: 0.7274, l_det_loc: 0.3047, l_wave_dis_minus: 1.2446, L_wave_max: 2.2767
2022-02-24 17:52:05,592 - mmdet - INFO - Epoch [1][100/749] lr: 1.000e-03, eta: 0:02:32, time: 0.156, data_time: 0.006, memory: 1496, l_det_cls: nan, l_det_loc: 0.3166, l_wave_dis_minus: nan, L_wave_max: nan
Then, on the next cycle, the following error appeared:
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [0,0,0] Assertion input_val >= zero && input_val <= one failed.
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/core/anchor/anchor_generator.py", line 225, in grid_anchors
self.base_anchors[i].to(device),
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
I have read the FAQ document and tried everything in it, but it still does not run. I hope you can take a look and help.

  1. For the y_loc_img[0][0][0] < 0 problem, you can refer to the fix in Issue #40, or print y_loc_img[0][0] and y_loc_img[0][0][0] only at the point where the error is raised, not at the first batch. If the printed y_loc_img values look wrong, the dataset annotations are the likely cause, and you can trace the problem back from there.

  2. For l_det_cls becoming nan after adding a new class, you can refer to Issue #35 to check whether it is a dataset annotation problem, as with the first problem.
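
Since both answers point back to the annotations, a quick offline scan of the XML files can narrow things down before retraining. The sketch below is not part of MI-AOD or MMDetection; the function name `find_bad_voc_annotations` is made up for illustration, and it assumes standard VOC tags (`object`, `bndbox`, `xmin`, `ymin`, `xmax`, `ymax`). It flags files that would give an image no usable box.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_bad_voc_annotations(annotation_dir):
    """Return {xml_file_name: reason} for annotations with no usable box.

    Hypothetical helper for illustration; assumes standard VOC XML tags.
    """
    problems = {}
    for xml_path in sorted(Path(annotation_dir).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        objects = root.findall("object")
        if not objects:
            problems[xml_path.name] = "no <object> element"
            continue
        for obj in objects:
            box = obj.find("bndbox")
            if box is None:
                problems[xml_path.name] = "object without <bndbox>"
                break
            try:
                xmin, ymin, xmax, ymax = (
                    float(box.findtext(tag))
                    for tag in ("xmin", "ymin", "xmax", "ymax")
                )
            except (TypeError, ValueError):
                # findtext returned None or the text was not a number
                problems[xml_path.name] = "missing or non-numeric coordinate"
                break
            if xmax <= xmin or ymax <= ymin:
                problems[xml_path.name] = "degenerate box"
                break
    return problems
```

Calling `find_bad_voc_annotations("VOC2007/Annotations")` (the path is only an example) returns an empty dict when every file has at least one well-formed box. Anything it reports is a candidate for the empty y_loc_img seen in the traceback.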

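Hello, I rebuilt my dataset and added a check to guarantee that every generated annotation contains a bounding box. I then retrained in the single-class setting, and this time all 7 cycles ran to completion, but the mAP evaluation in the first cycle showed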
+-----------+-----+-------+--------+-------+
| class | gts | dets | recall | ap |
+-----------+-----+-------+--------+-------+
| cancer | 973 | 0 | 0.000 | 0.000 |
+-----------+-----+-------+--------+-------+
| mAP | | | | 0.000 |
+-----------+-----+-------+--------+-------+
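In the following 6 cycles, the evaluation results after selecting 2.5% more data and training were all displayed normally, i.e. each cycle produced an mAP value. My question is why the mAP of the initial training in the first cycle is 0 while the later cycles run to completion and give correct mAP values. The mAP of the last six cycles rises from about 0.55 to 0.65.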

Hello, I think the initial labeled set in the first cycle may be too small, resulting in poor performance. You can consider enlarging the labeled set by 2.5% in the first cycle.

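Thanks for the reply. It was indeed a matter of too few samples in the first cycle; after increasing the first cycle to 10%, the results display normally. I would like to ask: if all 7 cycles run under your default MIAOD parameters with correct mAP evaluations, does that mean my VOC dataset is fine?
When I tuned the parameters and ran the program again, following your paper I changed the default (λ=0.5 k=10000) to (λ=1.0 k=10000). An error occurred in the fifth cycle: three of the losses became nan, and the program terminated after finishing that epoch. The error is shown below. I then ran MIAOD with (λ=0.5 k=2000), and the same error occurred in the fourth cycle.
The error is: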
2022-03-08 03:33:09,887 - mmdet - INFO - Epoch [1][1100/1123] lr: 1.000e-03, eta: 0:00:06, time: 0.150, data_time: 0.005, memory: 1492, l_det_cls: nan, l_det_loc: 0.1330, l_wave_dis_minus: nan, L_wave_max: nan
2022-03-08 03:33:16,739 - mmdet - INFO - Saving checkpoint at 1 epochs
2022-03-08 03:33:21,037 - mmdet - INFO - Start running, host: stu1@dell-server-hgh, work_directory: /home/stu1/Documents/datademo/MI-AOD-master/work_dirs/MI-AOD/20220307_231625
2022-03-08 03:33:21,038 - mmdet - INFO - workflow: [('train', 1)], max: 3 epochs
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [0,0,0] Assertion input_val >= zero && input_val <= one failed.
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [1,0,0] Assertion input_val >= zero && input_val <= one failed.
Traceback (most recent call last):
File "tools/train.py", line 257, in
main()
File "tools/train.py", line 232, in main
distributed=distributed, validate=args.no_validate, timestamp=timestamp, meta=meta)
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/apis/train.py", line 120, in train_detector
runner.run(data_loaders_L, cfg.workflow, cfg.total_epochs)
File "/home/stu1/Documents/datademo/mmcv-1.0.5/mmcv/runner/epoch_based_runner.py", line 161, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/stu1/Documents/datademo/mmcv-1.0.5/mmcv/runner/epoch_based_runner.py", line 33, in train
outputs = self.model.train_step(X_L, self.optimizer, **kwargs)
File "/home/stu1/Documents/datademo/mmcv-1.0.5/mmcv/parallel/data_parallel.py", line 31, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/models/detectors/base.py", line 228, in train_step
losses = self(**data)
File "/home/stu1/anaconda3/envs/miaod/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/core/fp16/decorators.py", line 51, in new_func
return old_func(*args, **kwargs)
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/models/detectors/base.py", line 162, in forward
return self.forward_train(x, img_metas, **kwargs)
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/models/detectors/single_stage.py", line 83, in forward_train
losses = self.bbox_head.forward_train(x, img_metas, y_loc_img, y_cls_img, y_loc_img_ignore)
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/models/dense_heads/base_dense_head.py", line 64, in forward_train
L_det_2 = self.L_det(*loss_inputs, y_loc_img_ignore=y_loc_img_ignore)
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/core/fp16/decorators.py", line 131, in new_func
return old_func(*args, **kwargs)
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/models/dense_heads/MIAOD_head.py", line 397, in L_det
x_i, valid_flag_list = self.get_anchors(featmap_sizes, img_metas, device=device)
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/models/dense_heads/MIAOD_head.py", line 165, in get_anchors
multi_level_anchors = self.anchor_generator.grid_anchors(featmap_sizes, device)
File "/home/stu1/Documents/datademo/MI-AOD-master/mmdet/core/anchor/anchor_generator.py", line 225, in grid_anchors
self.base_anchors[i].to(device),
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
————————————————————————————————————————————————————
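I then searched online for a solution based on the error, and in mmdet/models/dense_heads/MIAOD_head.py
I changed self.l_imgcls = nn.BCELoss()
to self.l_imgcls = nn.BCEWithLogitsLoss()
Running again with (λ=1.0 k=10000), the same three losses still became nan at the same point in the fifth cycle. This time, however, the program did not terminate after the current epoch: several losses stayed nan for the rest of the fifth cycle, the fifth cycle reported an incorrect mAP=0.00, and then the sixth cycle of training went on with normal values. My point is that changing this line did not fix the nan error, but the program could keep running until the end.
Under your default optimal parameters my VOC data ran all the way through, with correct results in every cycle and no errors, and the final mAP is decent. Since it runs through correctly, that should show my dataset is fine. However, when I changed a single parameter, λ or k, as in the paper, the error above appeared. Is there a way to solve this?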

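To add: after changing to self.l_imgcls = nn.BCEWithLogitsLoss() the program does run through, but from the fifth cycle on, several losses for maximizing instance uncertainty are nan, i.e. the mAP of cycles 5, 6 and 7 is the incorrect value 0.000.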
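
A note on why the swap hides the crash without fixing the nan: nn.BCELoss expects probabilities and asserts that its input lies in [0, 1] (the Loss.cu assertion in the log), while nn.BCEWithLogitsLoss applies a sigmoid to raw scores first and has no such range check. The pure-Python sketch below only illustrates that difference; the function names are illustrative, and it assumes the head was feeding probabilities to the original loss, which the use of nn.BCELoss suggests but the thread does not confirm.

```python
import math


def bce(probability, target):
    """Binary cross-entropy on a probability, with a [0, 1] range check."""
    if not 0.0 <= probability <= 1.0:  # this condition is also true for NaN
        raise ValueError(f"probability {probability} is outside [0, 1]")
    eps = 1e-12  # keep log() away from exactly 0
    p = min(max(probability, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit, target):
    """Binary cross-entropy that squashes a raw score with a sigmoid first."""
    if math.isnan(logit):
        return math.nan  # no range check here: NaN just propagates
    return bce(1.0 / (1.0 + math.exp(-logit)), target)


score = 0.9
print(bce(score, 1.0))              # 0.9 read as a probability
print(bce_with_logits(score, 1.0))  # 0.9 read as a raw logit: a different loss
```

Under that assumption, the swap changes the loss value for every sample (the score is squashed a second time) and lets a nan input pass through silently, which would match the behaviour described above: no crash, but nan losses and mAP 0.000.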

As long as one cycle runs successfully under some combination of parameters, the custom dataset is fine.

The two hyper-parameters λ and k need to be set to appropriate values to keep model training and sample selection working.

λ balances maximizing/minimizing instance uncertainty against the normal supervised training of the model. If it is too large, the model leans toward maximizing/minimizing uncertainty and neglects learning from the labeled samples.

k controls how instance uncertainty is selectively passed into the image uncertainty calculation during sample selection. If it is too small, an image is judged highly uncertain on the basis of only a few highly uncertain instances. Because the model's predictions are not necessarily accurate for every instance, a fairly large number of instances should take part in the image uncertainty calculation to cancel out the influence of individual unusual instances.

It is also possible that rerunning the failing parameters will give a different result. Active learning is inherently very random, which is why we train several times in our experiments and average the results.
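
To make the role of k concrete, here is a minimal sketch that assumes image uncertainty is the mean of the k largest instance uncertainties. That aggregation rule and the function name `image_uncertainty` are assumptions for illustration, not a copy of the code in mmdet/apis/test.py.

```python
import heapq


def image_uncertainty(instance_uncertainties, k):
    """Mean of the k largest instance uncertainties (all of them if fewer than k)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not instance_uncertainties:
        return 0.0
    top = heapq.nlargest(k, instance_uncertainties)
    return sum(top) / len(top)


# One unusual instance among nine confident ones.
scores = [0.9] + [0.1] * 9
print(image_uncertainty(scores, k=1))   # decided entirely by the outlier
print(image_uncertainty(scores, k=10))  # the outlier is averaged with the rest
```

With a small k the single outlier sets the image score; with a larger k its influence is diluted, which is the effect described above.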

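OK, thanks. I will try more parameter settings and observe the results. Could I ask which parameters in MIAOD should be adjusted to reproduce Tables 1 and 3 of the Ablation Study in Section 3.3 of the paper? In other words, can the code run comparison experiments with different training methods and different sample selection methods, and which parameters should be changed for the random sample selection experiment?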

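Hello. Earlier, with the default k=10000 λ=0.5, my data ran but the first cycle gave mAP=0 because there was too little data (5%, 371).
Following your suggestion, I enlarged the initial set in the first cycle to 7.5% (570) and 10% (750) and ran two experiments (k=10000 λ=0.5),
but the program stopped in the fifth and fourth cycles respectively, with the same error as the earlier interruptions.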
2022-03-15 02:38:30,162 - mmdet - INFO - Epoch [1][1500/1512] lr: 1.000e-03, eta: 0:00:03, time: 0.147, data_time: 0.006, memory: 1492, l_det_cls: nan, l_det_loc: 0.2582, l_wave_dis_minus: nan, L_wave_max: nan
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [0,0,0] Assertion input_val >= zero && input_val <= one failed.
RuntimeError: CUDA error: device-side assert triggered
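With 5% initial data the program ran through under the default parameters, but once the initial data was increased it terminated with an error.
When the amount of data changes and the same parameters no longer run, do I need to run many experiments with different parameters to find ones that fit and run through? Is there a concrete fix for this kind of error?
It seems unreasonable to me that merely adding some initial data, with the parameters unchanged, makes the experiment fail again, yet similar errors do keep appearing over repeated experiments.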

For the Training part of Table 1 in the paper, you need to modify mmdet/models/dense_heads/MIAOD_head.py. For the Sample Selection part of Table 1 and for Table 3, you need to modify mmdet/apis/test.py.

For the termination caused by increasing the initial data, you can reduce λ somewhat, for the reason mentioned earlier. You can also set a breakpoint at the point where l_det_cls becomes nan to see whether the data itself or the training method is causing the nan loss.
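
One way to land the breakpoint on the right step is to test the logged losses for non-finite values as soon as they are computed. The sketch below is a generic helper, not an MMDetection hook; `first_non_finite` is a hypothetical name, and it assumes the losses have already been converted to plain floats (for example with `.item()` on a scalar tensor). The sample values are copied from the log earlier in the thread.

```python
import math


def first_non_finite(losses):
    """Return the name of the first loss that is NaN or infinite, else None."""
    for name, value in losses.items():
        if not math.isfinite(value):
            return name
    return None


step_losses = {
    "l_det_cls": float("nan"),
    "l_det_loc": 0.3166,
    "l_wave_dis_minus": float("nan"),
    "L_wave_max": float("nan"),
}
bad = first_non_finite(step_losses)
if bad is not None:
    print(f"non-finite loss: {bad}")  # a natural place to call breakpoint()
```

Because CUDA reports errors asynchronously, the traceback line in anchor_generator.py is probably not where the assertion actually fired. Rerunning with the environment variable CUDA_LAUNCH_BLOCKING=1 usually makes the traceback point at the failing call.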

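Hello, how exactly should these two files be modified for the experiments in the two tables? Could you explain how to train using only random sample selection, without the IUL and IUR modules? I would like to run the comparison experiments. If it is complicated, could you share your email and briefly describe the steps? Sorry for the trouble.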

To select samples randomly:

  1. Modify the return values of the functions L_wave_min and L_wave_max in mmdet/models/dense_heads/MIAOD_head.py: multiply every variable related to IUL and IUR by 0 before returning.

  2. Modify the return value of the function calculate_uncertainty in mmdet/apis/test.py: shuffle it before returning.
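
The two steps can be sketched in plain Python as follows. The helper names `disable_terms` and `shuffled` are hypothetical, and the loss keys are taken from the training log in this thread; which of them count as IUL/IUR terms is an assumption to check against MIAOD_head.py. In the real code the values are tensors, so the same idea would be applied to them.

```python
import random


def disable_terms(losses, names):
    """Multiply the named loss terms by 0 and leave the others untouched."""
    return {
        key: value * 0 if key in names else value
        for key, value in losses.items()
    }


def shuffled(uncertainties, seed=None):
    """Return the scores in random order, so ranking by them is random."""
    result = list(uncertainties)  # copy, so the caller's list is not changed
    random.Random(seed).shuffle(result)
    return result


losses = {"l_det_cls": 0.7274, "l_det_loc": 0.3047, "l_wave_dis_minus": 1.2446}
print(disable_terms(losses, {"l_wave_dis_minus"}))
print(shuffled([0.1, 0.5, 0.9, 0.3], seed=0))
```

Multiplying by 0 rather than deleting the entries keeps the set of returned keys unchanged, so code that logs or sums them should not need further edits. Note that a term that is already nan stays nan after multiplication by 0.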

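Thanks for the reply. I read the paper more closely. If my data has only one class, so the image classification loss is 0 during training, does that mean the IUR (instance uncertainty re-weighting) step has no real effect on the images and cannot narrow the gap between instance uncertainty and image uncertainty?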

An image classification loss of 0 during training does mean that IUR has no effect. With only one class, however, you can add some negative samples so that IUR performs re-weighting.
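
A small numeric sketch may help show why the loss can sit at exactly 0 with one class. It assumes the image-level score for a class is the sum over instances of a softmax across classes multiplied by a softmax across instances. This is an assumption about the multiple-instance formulation, not a copy of the repository code, and the function names are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def image_scores(class_logits, instance_logits):
    """Per-class image score from per-instance logits.

    Both arguments are lists of rows: one row per instance, one column per class.
    """
    n_instances = len(class_logits)
    n_classes = len(class_logits[0])
    over_classes = [softmax(row) for row in class_logits]
    over_instances = [
        softmax([row[c] for row in instance_logits]) for c in range(n_classes)
    ]
    return [
        sum(over_classes[i][c] * over_instances[c][i] for i in range(n_instances))
        for c in range(n_classes)
    ]


# Single class, three instances with arbitrary logits.
print(image_scores([[2.0], [-1.0], [0.5]], [[0.3], [1.7], [-2.2]]))
```

With a single class, the softmax across classes is 1 for every instance and the softmax across instances sums to 1, so under this assumption the image score is pinned at about 1 regardless of the logits. A positive image then gives a cross-entropy of about 0, which is consistent with the l_imgcls: 0.0000 reported at the start of the thread.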

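Hello, regarding testing and evaluation: apart from recall and mAP, are there other evaluation metrics in the experiments? Testing a single image gives the image score, the detection boxes and the classes. Beyond these, are there other metrics, such as PR or ROC, for evaluating a trained detection model or for evaluating during training?
How were the heatmaps in the paper drawn, and how should I produce them from an already trained model? Sorry for the trouble.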

Hello, this repository is a modification of MMDetection. You can refer to that repository for other evaluation metrics.

The procedure for drawing the heatmaps in the paper is described in Issue #41; please refer to that issue.
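
For the PR curve asked about above, the points can also be computed by hand once each detection has been marked as a true or false positive. The sketch below is a generic illustration, not MMDetection's API: the function name `precision_recall_curve` and its input format are assumptions, and the IoU matching that decides true positives is taken as already done.

```python
def precision_recall_curve(detections, num_gt):
    """Return (precisions, recalls) after each detection, highest score first.

    detections: iterable of (score, is_true_positive) pairs for one class.
    num_gt: number of ground-truth boxes for that class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    true_pos = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_pos += 1 if is_tp else 0
        precisions.append(true_pos / rank)
        recalls.append(true_pos / num_gt if num_gt else 0.0)
    return precisions, recalls


dets = [(0.9, True), (0.8, False), (0.7, True), (0.3, False)]
precisions, recalls = precision_recall_curve(dets, num_gt=2)
print(list(zip(recalls, precisions)))
```

Plotting precision against recall from these two lists gives the PR curve for the class; AP is an area-style summary of that curve.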