MendelXu/ANN

train on my own datasets

andrewwyl opened this issue · 32 comments

@MendelXu Sorry for asking questions again. I want to train this model on my own dataset, which has only two classes. What should I do? Thanks.

Change "num_classes" to 2, delete line 9 of the config (the "label_list" entry), and add one line: "reduce_zero_label": false.

"num_classes": 19,
"label_list": [7, 8, 11, 12, 13, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 31, 32, 33],

@MendelXu Thanks for the quick reply. Is the preprocessing program the same as cityscapes_seg_generator.py? Thanks.

ANN/hypes/seg/cityscapes/fs_annn_cityscapes_seg.json
"details": {
"color_list": [[128, 64, 128], [244, 35, 232], [70, 70, 70], [102, 102, 156], [190, 153, 153],
[153, 153, 153], [250, 170, 30], [220, 220, 0], [107, 142, 35], [152, 251, 152],
[70, 130, 180], [220, 20, 60], [255, 0, 0], [0, 0, 142], [0, 0, 70], [0, 60, 100],
[0, 80, 100], [0, 0, 230], [119, 11, 32]]
},
Does this need to be changed?

@MendelXu Thanks for the quick reply. Is the preprocessing program the same as cityscapes_seg_generator.py? Thanks.
Yes. If you don't need any extra operations, it should be the same.
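
If it helps, such a generator usually just arranges the files into the layout the dataloader expects. A minimal sketch follows; the train/val split with image and label subfolders is my assumption, so please verify the exact folder names against cityscapes_seg_generator.py:

# Sketch of a minimal generator for a custom two-class dataset.
# Assumes (not confirmed here) output folders train/image, train/label,
# val/image, val/label, as in the Cityscapes generator.
import os
import shutil

def generate(image_dir, label_dir, save_dir, split='train'):
    image_out = os.path.join(save_dir, split, 'image')
    label_out = os.path.join(save_dir, split, 'label')
    os.makedirs(image_out, exist_ok=True)
    os.makedirs(label_out, exist_ok=True)
    for name in os.listdir(image_dir):
        shutil.copy(os.path.join(image_dir, name), os.path.join(image_out, name))
    for name in os.listdir(label_dir):
        shutil.copy(os.path.join(label_dir, name), os.path.join(label_out, name))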

ANN/hypes/seg/cityscapes/fs_annn_cityscapes_seg.json
"details": {
"color_list": [[128, 64, 128], [244, 35, 232], [70, 70, 70], [102, 102, 156], [190, 153, 153],
[153, 153, 153], [250, 170, 30], [220, 220, 0], [107, 142, 35], [152, 251, 152],
[70, 130, 180], [220, 20, 60], [255, 0, 0], [0, 0, 142], [0, 0, 70], [0, 60, 100],
[0, 80, 100], [0, 0, 230], [119, 11, 32]]
},
Does this need to be changed?

You can change it to match the number of your categories. As you only have two classes, you can use just two colors for visualization. But it's OK if you don't change it.
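
For instance, a two-class "color_list" could simply be the following (colors chosen arbitrarily for illustration):

"color_list": [[0, 0, 0], [255, 0, 0]]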

@MendelXu When I followed your suggestion, I encountered the following problem:

Traceback (most recent call last):
File "/home/data/venv/torch1x_tf1x/lib/python3.5/site-packages/torch/utils/cpp_extension.py", line 949, in _build_extension_module
check=True)
File "/usr/lib/python3.5/subprocess.py", line 708, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/data/andrew/ANN-master/mian_train.py", line 182, in
runner = method_selector.select_seg_method()
File "/home/data/andrew/ANN-master/methods/method_selector.py", line 104, in select_seg_method
return SEG_METHOD_DICTkey
File "/home/data/andrew/ANN-master/methods/seg/fcn_segmentor.py", line 44, in init
self._init_model()
File "/home/data/andrew/ANN-master/methods/seg/fcn_segmentor.py", line 47, in _init_model
self.seg_net = self.seg_model_manager.semantic_segmentor()
File "/home/data/andrew/ANN-master/models/seg/model_manager.py", line 38, in semantic_segmentor
model = SEG_MODEL_DICTmodel_name
File "/home/data/andrew/ANN-master/models/seg/nets/annn.py", line 17, in init
self.backbone = BackboneSelector(configer).get_backbone()
File "/home/data/andrew/ANN-master/models/backbones/backbone_selector.py", line 31, in get_backbone
model = ResNetBackbone(self.configer)(**params)
File "/home/data/andrew/ANN-master/models/backbones/resnet/resnet_backbone.py", line 176, in call
orig_resnet = self.resnet_models.deepbase_resnet101()
File "/home/data/andrew/ANN-master/models/backbones/resnet/resnet_models.py", line 256, in deepbase_resnet101
norm_type=self.configer.get('network', 'norm_type'), **kwargs)
File "/home/data/andrew/ANN-master/models/backbones/resnet/resnet_models.py", line 107, in init
('bn1', ModuleHelper.BatchNorm2d(norm_type=norm_type)(64)),
File "/home/data/andrew/ANN-master/models/tools/module_helper.py", line 89, in BatchNorm2d
from encoding.nn import BatchNorm2d
File "/home/data/venv/torch1x_tf1x/lib/python3.5/site-packages/encoding/init.py", line 13, in
from . import nn, functions, parallel, utils, models, datasets, transforms
File "/home/data/venv/torch1x_tf1x/lib/python3.5/site-packages/encoding/nn/init.py", line 12, in
from .encoding import *
File "/home/data/venv/torch1x_tf1x/lib/python3.5/site-packages/encoding/nn/encoding.py", line 18, in
from ..functions import scaled_l2, aggregate, pairwise_cosine
File "/home/data/venv/torch1x_tf1x/lib/python3.5/site-packages/encoding/functions/init.py", line 2, in
from .encoding import *
File "/home/data/venv/torch1x_tf1x/lib/python3.5/site-packages/encoding/functions/encoding.py", line 14, in
from .. import lib

Process finished with exit code 1

Is this due to the environment? My environment is Ubuntu 16.04, Python 3.5, gcc 5.4.0, CUDA 9.0. Or is it a torch-encoding version problem? My version is 1.0.0.

Yes. I think it is mostly due to the versions of torch-encoding and CUDA. Please install CUDA 9.2.

By the way, the new version of torchcv doesn't have this problem with sync-bn; maybe you can use that.

@MendelXu Thanks for your reply. What is your version of torch-encoding?

The newest version.

I still cannot import torch-encoding, and I can't upgrade CUDA for the time being. Do you have any other solutions? Thanks.

Sorry, I have no idea. Maybe you can ask the author of pytorch-encoding.

OK, thank you very much.

Sorry to ask again. I solved the question above, and now I hit this one:
Traceback (most recent call last):
File "/home/data/andrew/ANN-master/mian_train.py", line 197, in
Controller.train(runner)
File "/home/data/andrew/ANN-master/methods/tools/controller.py", line 40, in train
runner.train()
File "/home/data/andrew/ANN-master/methods/seg/fcn_segmentor.py", line 92, in train
loss = self.pixel_loss(outputs, targets, gathered=self.configer.get('network', 'gathered'))
File "/home/data/venv/torch1x_tf1x/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/data/andrew/ANN-master/extensions/tools/parallel/data_parallel.py", line 130, in forward
outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
File "/home/data/andrew/ANN-master/extensions/tools/parallel/data_parallel.py", line 183, in _criterion_parallel_apply
raise output
File "/home/data/andrew/ANN-master/extensions/tools/parallel/data_parallel.py", line 158, in _worker
output = module(input, *target, **kwargs)
File "/home/data/venv/torch1x_tf1x/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/data/andrew/ANN-master/models/seg/loss/seg_modules.py", line 116, in forward
seg_loss = self.ohem_ce_loss(seg_out, targets)
File "/home/data/venv/torch1x_tf1x/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/data/andrew/ANN-master/models/seg/loss/seg_modules.py", line 93, in forward
sort_prob, sort_indices = prob.contiguous().view(-1, )[mask].contiguous().sort()
RuntimeError: copy_if failed to synchronize: device-side assert triggered

It is quite weird. I have never met this problem before. Maybe you can refer to this link.
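
One possible cause worth checking (my assumption, not something confirmed in this thread): a device-side assert inside the cross-entropy/OHEM loss is often triggered by label values outside the range [0, num_classes - 1] (apart from the ignore index). A quick sketch to list the values actually present in your label images:

# Sketch: list the unique pixel values in the label masks (paths are placeholders).
# Values other than 0..num_classes-1, plus the ignore index, would explain
# a device-side assert in the loss.
import glob
import numpy as np
from PIL import Image

values = set()
for path in glob.glob('/path/to/your/label/*.png'):  # adjust to your layout
    mask = np.array(Image.open(path))
    values.update(np.unique(mask).tolist())
print(sorted(values))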

@MendelXu In my annotation data, 255 represents the annotation object and 0 represents the background. Does this meet the requirements?

@MendelXu In my annotation data, 255 represents the annotation object and 0 represents the background. How should I set fs_annn_cityscapes_seg.json? I set it as shown in the screenshots I attached, but the loss function doesn't converge, and I don't know why.

Firstly, you should change 255 to 1 in your annotations, or you can add one line "label_list":[0,255].
Secondly, two samples per batch won't work for segmentation. Please change it to at least 8.
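
If you go the remapping route, a minimal sketch would be the following (paths are placeholders; it assumes single-channel PNG masks where 255 is the object and 0 is the background):

# Sketch: remap foreground pixels from 255 to 1 so labels are 0 (background)
# and 1 (object), matching num_classes = 2.
import glob
import os
import numpy as np
from PIL import Image

src_dir = '/path/to/original/label'
dst_dir = '/path/to/remapped/label'
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, '*.png')):
    mask = np.array(Image.open(path))
    mask[mask == 255] = 1  # object: 255 -> 1; background stays 0
    Image.fromarray(mask.astype(np.uint8)).save(os.path.join(dst_dir, os.path.basename(path)))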

I have only two Titan Xp GPUs. When I set the batch size to 4, I get this:
RuntimeError: CUDA out of memory. Tried to allocate 36.88 MiB (GPU 0; 11.90 GiB total capacity; 10.99 GiB already allocated; 10.94 MiB free; 47.83 MiB cached)

You can switch to resnet50.

Another technique you can try is to fix the parameters of BatchNorm. That way, you can use a smaller batch size.

You mean I can select one of the other three options: 'batchnorm', 'encsync_batchnorm', 'instancenorm'?
Which one should I select?

Fixing the batch norm parameters means operations like:

bn.eval()
bn.weight.requires_grad=False
bn.bias.requires_grad=False

Sorry to ask, but I'm a beginner; can you tell me where to add these operations? Thanks.

You can refer to
https://github.com/donnyyou/torchcv/blob/e64bb833a8b88b4531ae28f3367eb344e771e062/runner/tools/runner_helper.py#L185
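
The linked helper shows the idea; as a self-contained sketch (my own, not the repository's exact code), you would put every BatchNorm module into eval mode and stop its affine parameters from receiving gradients:

import torch.nn as nn

def freeze_batchnorm(model):
    """Sketch: freeze all BatchNorm layers in any nn.Module."""
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.eval()  # stop updating the running mean/var statistics
            if module.weight is not None:
                module.weight.requires_grad = False
            if module.bias is not None:
                module.bias.requires_grad = False

Note that model.train() switches BatchNorm layers back into training mode, so the freezing has to be reapplied after every call to .train() (which is why the linked runner_helper does it inside the training loop).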

@MendelXu I added it as you said, but I still cannot increase the batch size.

In my training, I found that the Pixel Acc and Mean IoU on the val dataset rose steadily despite the fluctuations of the loss value. I want to ask: is this normal?

It is normal.

@andrewwyl Hey did you manage to train on human segmentation? Got any results? I am trying to do the same.

This method is still very effective for segmentation; you can give it a try.

Can you share your code so I can see how you did it? Can you also share some of your insights from training?

I did not change the code; it is the author's original code. I think the most important thing is to prepare the data.