Visual-Attention-Network/SegNeXt

Encoder-Decoder downsamples to 1/8 resolution, which is too coarse for 'seg_logits'

chenxinfeng4 opened this issue · 3 comments

My original input image is 830(H)x1280(W), but I find that the feature map feeding seg_logits in ham_head is 1024(C)x104(H)x160(W), i.e. downsampled 8x (the pipeline pads the input to 832x1280, and 832/8 = 104, 1280/8 = 160). That is too coarse.

As you can see, the animal boundaries in the segmentation are not very sharp. This is probably caused by too much downsampling. I would appreciate some guidance.

Also, the background occupies most of the image, which makes it hard to optimize the segmentation of the other classes. How can I tune the model to overcome this, for example with class_weight?

[image: example segmentation result with blurred animal boundaries and a large background region]

# tools/dist_train.sh segnext.large.ratmetric.py 4
# python tools/train.py segnext.large.ratmetric.py
_base_ = [
    'local_configs/segnext/large/segnext.large.512x512.coco_stuff164k.80k.py'
]

num_classes = 3
# load_from = None
load_from = 'work_dirs/segnext.large.ratmetric/latest.pth'

model = dict(
    backbone=dict(init_cfg=dict(type='Pretrained', checkpoint='pretrained/segnext_large_512x512_ade_160k.pth')),
    decode_head=dict(
        num_classes=num_classes,
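        # down-weight class index 0 (the dominant background) 50x in the loss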
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, class_weight=[1.0/50, 1.0, 1.0], loss_weight=1.0))
)

runner = dict(type='IterBasedRunner', max_iters=6400)
checkpoint_config = dict(by_epoch=False, interval=800)
evaluation = dict(interval=800, metric='mIoU')

data_root = 'data_rat_metric'
img_dir='images'
ann_dir='annotations'
img_wh = (1280,832)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', reduce_zero_label=True),
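    # reduce_zero_label=True remaps label 0 to 255 (ignored) and shifts the
    # remaining labels down by one, so the 3 classes here are former labels 1-3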
    dict(type='Resize', img_scale=img_wh, ratio_range=(0.7, 1.5)),
    dict(type='RandomCrop', crop_size=img_wh[::-1], cat_max_ratio=1.0, ignore_index=0),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=img_wh[::-1], pad_val=0, seg_pad_val=255),  # pad labels with the ignore index, not a real class id
    # dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=img_wh,
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=4,
    train=dict(
        type='COCOStuffDatasetRat',
        data_root=data_root,
        img_dir=img_dir,
        ann_dir=ann_dir,
        pipeline=train_pipeline),
    val=dict(
        type='COCOStuffDatasetRat',
        data_root=data_root,
        img_dir=img_dir,
        ann_dir=ann_dir,
        pipeline=test_pipeline),
    test=dict(
        type='COCOStuffDatasetRat',
        data_root=data_root,
        img_dir=img_dir,
        ann_dir=ann_dir,
        pipeline=test_pipeline))

You can use class_weight to address the sample-imbalance problem, or use OHEM. You can also change in_index to [0, 1, 2, 3] so the decode head receives the stride-4 features as well; see the sketch below.
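For later readers, here is a minimal sketch of both suggestions as a config override. Two assumptions to verify against your versions: OHEMPixelSampler is mmseg's built-in hard-pixel sampler, and [64, 128, 320, 512] are the MSCAN-L stage widths (check your backbone config).

# Sketch only: imbalance + resolution overrides (assumptions noted above).
model = dict(
    decode_head=dict(
        # include stage 0 so the head fuses features at stride 4, not stride 8
        in_index=[0, 1, 2, 3],
        in_channels=[64, 128, 320, 512],  # assumed MSCAN-L stage widths
        # OHEM keeps the hardest pixels (confidence below thresh),
        # at least min_kept per image
        sampler=dict(type='OHEMPixelSampler', thresh=0.7, min_kept=100000)))

Moving fusion to stride 4 roughly quadruples the decode head's activation memory, so you may need to lower samples_per_gpu.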

OK, I see the imbalance solution. But how do I change the feature-map size in ham_head? I think it is too coarse.

I ran into this question too. A simple way to solve it is to modify the source code at the downsampling location.
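To make that concrete: in mmseg-style decode heads such as LightHamHead, the selected backbone levels are all resized to the spatial size of the first selected level before being concatenated, so the output stride is set by inputs[0]. A sketch of that pattern, assuming mmseg 0.x (where the resize helper lives in mmseg.ops); the actual ham_head.py may differ in detail:

# Sketch of the level-fusion step in a HamHead-style forward().
# Assumption: mmseg 0.x; resize is mmseg's bilinear-resize helper.
import torch
from mmseg.ops import resize

def fuse_levels(inputs, align_corners=False):
    # inputs: feature maps for the stages chosen by in_index. The decode
    # resolution follows inputs[0]: stride 8 with in_index=[1, 2, 3],
    # stride 4 with in_index=[0, 1, 2, 3].
    target_size = inputs[0].shape[2:]
    inputs = [
        resize(x, size=target_size, mode='bilinear',
               align_corners=align_corners) for x in inputs
    ]
    return torch.cat(inputs, dim=1)

Note that mmseg already upsamples seg_logits bilinearly back to the input size at inference, so the 1/8 stride mainly costs boundary sharpness; fusing at stride 4 recovers some of that at the price of memory.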