megvii-model/ShuffleNet-Series

ShuffleNetV2+ does not converge when setting shuffle=False in train_loader

songkq opened this issue · 4 comments

Parameters:
model-size=Large, auto_continue=False, batch-size=128, num_workers=8, and other default params.
Environment:
Ubuntu 16.04, PyTorch 1.2, single RTX 2080 Ti GPU

When I train ShuffleNetV2+ on the ImageNet-1K dataset, it does not converge when shuffle=False is set in the train_loader. Could you give some advice?

[30 02:15:34] TRAIN Iter 20: lr = 0.499978,	loss = 2.621804,	Top-1 err = 0.162891,	Top-5 err = 0.145703,	data_time = 0.006651,	train_time = 2.031535

[30 02:16:09] TRAIN Iter 40: lr = 0.499956,	loss = 4.062627,	Top-1 err = 0.154688,	Top-5 err = 0.096875,	data_time = 0.006521,	train_time = 1.751516

[30 02:16:41] TRAIN Iter 60: lr = 0.499933,	loss = 48.428082,	Top-1 err = 0.465625,	Top-5 err = 0.323047,	data_time = 0.006557,	train_time = 1.589813

[30 02:16:47] TRAIN Iter 80: lr = 0.499911,	loss = nan,	Top-1 err = 0.628516,	Top-5 err = 0.564063,	data_time = 0.006574,	train_time = 0.323313

[30 02:16:52] TRAIN Iter 100: lr = 0.499889,	loss = nan,	Top-1 err = 1.000000,	Top-5 err = 1.000000,	data_time = 0.006572,	train_time = 0.255579
nmaac commented

Hi @ipScore ,

From the Top-1 error it looks like you are fine-tuning; if you modified the model, please train from scratch.

songkq commented

Hi @nmaac ,
Not really, I am training from scratch. When I set shuffle=True in the train_loader, everything is fine. However, it fails to converge with shuffle=False in the train_loader. It seems a little strange.

nmaac commented

I misunderstood the "shuffle" you mentioned; I was referring to the "channel shuffle" operator in the block.
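(For readers unfamiliar with the operator nmaac refers to: channel shuffle is the reshape-transpose-reshape trick from the ShuffleNet papers. A minimal NumPy sketch follows; the actual repo implements this in PyTorch, so this is only an illustration of the idea, not the project's code.)

```python
import numpy as np

def channel_shuffle(x, groups):
    """Reshape-transpose-reshape: interleave channels across groups."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channels must divide evenly into groups"
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)  # swap the group axis and the within-group axis
    return x.reshape(n, c, h, w)

# 4 channels, 2 groups: channel order [0, 1, 2, 3] becomes [0, 2, 1, 3]
x = np.arange(4).reshape(1, 4, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())  # [0, 2, 1, 3]
```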

However, if you are referring to shuffling the training data, that is a standard configuration: the training samples in the ImageNet list are ordered by class, so without shuffling each batch is dominated by a single class. We do not suggest changing it to False.
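(The effect nmaac describes can be sketched without PyTorch. The toy listing below is hypothetical, but it mimics a class-sorted ImageNet file list: with sequential indices every batch contains one class, which matches the diverging loss in the logs above, while a permuted index order mixes classes within each batch.)

```python
import random

# Hypothetical class-sorted listing: 10 classes, 128 images each,
# so consecutive indices share the same label.
labels = [cls for cls in range(10) for _ in range(128)]

def batches(indices, batch_size=128):
    """Yield the labels of each batch drawn in the given index order."""
    for i in range(0, len(indices), batch_size):
        yield [labels[j] for j in indices[i:i + batch_size]]

# shuffle=False analogue: sequential order, first batch is a single class
first_sequential = next(batches(list(range(len(labels)))))
print(len(set(first_sequential)))  # 1

# shuffle=True analogue: permute indices each epoch, batches mix classes
rng = random.Random(0)
perm = list(range(len(labels)))
rng.shuffle(perm)
first_shuffled = next(batches(perm))
print(len(set(first_shuffled)) > 1)  # True
```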

@nmaac Thanks.