BraTS dice seems low?
bhralzz opened this issue · 11 comments
Dear Shaohua,
While checking the training process, I see the dice is around 9% partway through the total epochs.
The total training on 8 NVIDIA RTX 8000 GPUs is estimated to take around 25 hours.
What is the cause of such a low dice value?
Waiting to hear from you.
Thanks
Hi first my name is Shaohua Li 😃
During the training process, what's printed is the dice loss (1 - dice score), so you should observe it gradually decrease to a small number. I guess the 9% you saw is the loss? It's normal to have this value at the end of training.
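For reference, here is a minimal sketch of that relation (an illustration only, not necessarily this repo's exact implementation):

import torch

# Soft dice score; the printed dice loss is 1 - score.
def dice_score(pred_probs: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # pred_probs: per-voxel probabilities in [0, 1]; target: binary mask of the same shape.
    inter = (pred_probs * target).sum()
    union = pred_probs.sum() + target.sum()
    return (2 * inter + eps) / (union + eps)

def dice_loss(pred_probs: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return 1.0 - dice_score(pred_probs, target)

So a printed value of 0.09 corresponds to a dice score of about 0.91.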
I have no idea why the total training takes 25 hours on 8 RTX 8000 GPUs. 10000 iterations took around 10 hours on 2× Titan RTX (24 GB each). Are the GPUs at near 100% utilization?
Apologies for the mix-up. I searched for you on the net, but it seems I found the wrong person.
OK, so it's good news that it's reaching this score.
I have 8×48 GB of GPU memory at 100% utilization.
So maybe it's because my network configuration, with num_attractors=1024, is heavy?
Is this a good setting for BraTS?
Never mind. There are a lot of "Shaohua" 😄
I used num_attractors=1024 for my training, so this shouldn't be an issue.
Are you using one transformer layer? What's the batch size, and how much GPU RAM is used?
Actually, all GPU memory and processors are used.
These are my train3d.py settings:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--task', dest='task_name', type=str, default='brats', help='Name of the segmentation task.')
parser.add_argument('--ds', dest='train_ds_names', type=str, default=None, help='Dataset folders. Can specify multiple datasets (separated by ",")')
parser.add_argument('--split', dest='ds_split', type=str, default='train',
choices=['train', 'all'], help='Split of the dataset')
parser.add_argument('--maxiter', type=int, default=10000, help='maximum training iterations')
parser.add_argument('--saveiter', type=int, default=500, help='save model snapshot every N iterations')
parser.add_argument('--lrwarmup', dest='lr_warmup_steps', type=int, default=500, help='Number of LR warmup steps')
parser.add_argument('--dicewarmup', dest='dice_warmup_steps', type=int, default=0, help='Number of dice warmup steps (0: disabled)')
parser.add_argument('--bs', dest='batch_size', type=int, default=50, help='Total batch_size on all GPUs')
parser.add_argument('--opt', type=str, default=None, help='optimization algorithm')
parser.add_argument('--lr', type=float, default=-1, help='learning rate')
parser.add_argument('--decay', type=float, default=-1, help='weight decay')
parser.add_argument('--gradclip', dest='grad_clip', type=float, default=-1, help='gradient clip')
parser.add_argument('--attnclip', dest='attn_clip', type=int, default=500, help='Segtran attention clip')
parser.add_argument('--cp', dest='checkpoint_path', type=str, default=None, help='Load this checkpoint')
parser.add_argument("--local_rank", default=0, type=int)
parser.add_argument("--locprob", dest='localization_prob', default=0.5,
type=float, help='Probability of doing localization during training')
parser.add_argument("--tunebn", dest='tune_bn_only', action='store_true',
help='Only tune batchnorms for domain adaptation, and keep model weights unchanged.')
parser.add_argument('--diceweight', dest='MAX_DICE_W', type=float, default=0.5,
help='Weight of the dice loss.')
parser.add_argument('--deterministic', type=int, default=1, help='whether use deterministic training')
parser.add_argument('--seed', type=int, default=1337, help='random seed')
parser.add_argument("--debug", dest='debug', action='store_true', help='Debug program.')
parser.add_argument('--schedule', dest='lr_schedule', default='linear', type=str,
choices=['linear', 'constant', 'cosine'],
help='AdamW learning rate scheduler.')
parser.add_argument('--net', type=str, default='segtran', help='Network architecture')
parser.add_argument('--bb', dest='backbone_type', type=str, default=None,
help='Backbone of Segtran / Encoder of other models')
parser.add_argument("--nopretrain", dest='use_pretrained', action='store_false',
help='Do not use pretrained weights.')
parser.add_argument('--ibn', dest='ibn_layers', type=str, default=None, help='IBN layers')
parser.add_argument("--translayers", dest='num_translayers', default=1,
type=int, help='Number of Cross-Frame Fusion layers.')
parser.add_argument('--layercompress', dest='translayer_compress_ratios', type=str, default=None,
help='Compression ratio of channel numbers of each transformer layer to save RAM.')
parser.add_argument("--baseinit", dest='base_initializer_range', default=0.02,
type=float, help='Base initializer range of transformer layers.')
parser.add_argument("--nosqueeze", dest='use_squeezed_transformer', action='store_false',
help='Do not use attractor transformers (Default: use to increase scalability).')
parser.add_argument("--attractors", dest='num_attractors', default=512,
type=int, help='Number of attractors in the squeezed transformer.')
parser.add_argument("--noqkbias", dest='qk_have_bias', action='store_false',
help='Do not use biases in Q, K projections (Using biases leads to better performance on BraTS).')
parser.add_argument('--pos', dest='pos_code_type', type=str, default='lsinu',
choices=['lsinu', 'zero', 'rand', 'sinu', 'bias'],
help='Positional code scheme')
parser.add_argument('--posw', dest='pos_code_weight', type=float, default=1.0)
parser.add_argument('--posr', dest='pos_bias_radius', type=int, default=7,
help='The radius of positional biases')
parser.add_argument('--perturbposw', dest='perturb_posw_range', type=float, default=0.,
help='The range of added random noise to pos_code_weight during training')
parser.add_argument("--poslayer1", dest='pos_code_every_layer', action='store_false',
help='Only add pos codes to the first transformer layer input (Default: add to every layer).')
parser.add_argument("--posattonly", dest='pos_in_attn_only', action='store_true',
help='Only use pos embeddings when computing attention scores (K, Q), and not use them in the input for V or FFN.')
parser.add_argument("--squeezeuseffn", dest='has_FFN_in_squeeze', action='store_true',
help='Use the full FFN in the first transformer of the squeezed attention '
'(Default: only use the first linear layer, i.e., the V projection)')
parser.add_argument("--into3", dest='inchan_to3_scheme', default=None,
choices=['avgto3', 'stemconv', 'dup3', 'bridgeconv'],
help='Scheme to convert input into pseudo-RGB format')
parser.add_argument("--dup", dest='out_fpn_upsampleD_scheme', default='conv',
choices=['conv', 'interpolate', 'none'],
help='Depth output upsampling scheme')
parser.add_argument("--infpn", dest='in_fpn_layers', default='34',
choices=['234', '34', '4'],
help='Specs of input FPN layers')
parser.add_argument("--outfpn", dest='out_fpn_layers', default='1234',
choices=['1234', '234', '34'],
help='Specs of output FPN layers')
parser.add_argument("--outdrop", dest='out_fpn_do_dropout', action='store_true',
help='Do dropout on out fpn features.')
parser.add_argument("--inbn", dest='in_fpn_use_bn', action='store_true',
help='Use BatchNorm instead of GroupNorm in input FPN.')
parser.add_argument("--nofeatup", dest='bb_feat_upsize', action='store_false',
help='Do not upsize backbone feature maps by 2.')
parser.add_argument('--insize', dest='orig_input_size', type=str, default=None,
help='Use images of this size (among all cropping sizes) for training. Set to 0 to use all sizes.')
parser.add_argument('--patch', dest='orig_patch_size', type=str, default=None,
help='Crop input images to this size for training.')
parser.add_argument('--scale', dest='input_scale', type=str, default=None,
help='Scale input images by this ratio for training.')
parser.add_argument('--dgroup', dest='D_groupsize', type=int, default=-1,
help='For 2.5D segtran, group the depth dimension of the input images and merge into the batch dimension.')
parser.add_argument('--dpool', dest='D_pool_K', type=int, default=-1,
help='Scale input images by this ratio for training.')
parser.add_argument("--segtran", dest='segtran_type',
default='3d',
choices=['25d', '3d'],
type=str, help='Use 3D or 2.5D of segtran.')
# Using random scaling as augmentation usually hurts performance. Not sure why.
parser.add_argument("--randscale", type=float, default=0, help='Do random scaling augmentation.')
parser.add_argument("--affine", dest='do_affine', action='store_true', help='Do random affine augmentation.')
parser.add_argument('--dropout', type=float, dest='dropout_prob', default=-1, help='Dropout probability')
parser.add_argument('--modes', type=int, dest='num_modes', default=-1, help='Number of transformer modes')
parser.add_argument('--modedim', type=int, dest='attention_mode_dim', default=-1, help='Dimension of transformer modes')
parser.add_argument('--mod', dest='chosen_modality', type=int, default=-1, help='The modality to use if images are of multiple modalities')
parser.add_argument('--focus', dest='focus_class', type=int, default=-1, help='The class that is particularly predicted by the current modality (with higher loss weight)')
parser.add_argument('--multihead', dest='ablate_multihead', action='store_true',
help='Ablation to multimode transformer (using multihead instead)')
The batch size of 50 is greater than your default setting, but that just means a bigger batch of samples is processed in one step; with a fixed total number of samples, this shouldn't increase the processing time, it should only affect accuracy, right?
If I'm wrong, please correct me.
Wow, a batch size of 50 is huge! No wonder it uses so much RAM and is slow. I'm not sure how many iterations you'll need with such a large batch size. Probably somewhere between 1000 and 3000?
Is it possible to overfit with 10000 iterations?
Definitely. When I trained with a batch size of 4, it already slightly overfit at 10000 iterations. I usually use the checkpoint at the 8000th iteration to create the masks for submission.
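As a rough back-of-the-envelope check (just the assumption that the total number of training samples seen should stay comparable, not an official recipe), the iteration count can be scaled down as the batch size grows:

# Rough sketch: keep batch_size * iterations (samples seen) roughly constant.
# Reference point: the batch size of 4 and 10000 iterations mentioned above.
default_bs, default_iters = 4, 10000
my_bs = 50
equiv_iters = default_bs * default_iters // my_bs
print(equiv_iters)  # 800 -- the same order of magnitude as the 1000~3000 estimate above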
OK,
so is it better to set the iterations to around 3000?
Or maybe 2500?
Yeah, I think that's sufficient. You can also stop the training at any time with Ctrl-C.
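In case it helps, a hypothetical invocation using only flags from the train3d.py listing above (the exact values are illustrative, and how you launch it across your 8 GPUs is left as in your current setup):

# Hypothetical example: keep the batch size of 50 but stop at ~3000 iterations,
# saving snapshots every 500 iterations so an earlier checkpoint can be picked if it overfits.
python train3d.py --task brats --split train --bs 50 --maxiter 3000 --saveiter 500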
OK,
thanks for all of your help.
You were as great as always.
You are welcome! I hope you have fun playing with it.