JiahuiYu/slimmable_networks

reproducing CIFAR10 results for AutoSlim

RudyChin opened this issue · 8 comments

Hi Jiahui,

Thanks for the great work. I'm trying to reproduce the AutoSlim results for CIFAR-10 (Table 2).
Could you please share the detailed hyperparameters you used?

I'm able to train the baseline MobileNetV2 1.0x to 7.9% Top-1 error using the following hyperparameters:

  • 0.1 initial learning rate
  • linear learning rate decay
  • 128 batch size
  • 300 epochs of training
  • 5e-4 weight decay
  • 0.9 nesterov momentum
  • no label smoothing
  • no weight decay for bias and gamma
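As a minimal sketch of the last bullet, the snippet below splits parameters into a group that receives weight decay and a group that does not. It assumes that biases and BatchNorm gamma/beta are exactly the parameters with at most one dimension, which is a common heuristic and not something stated in this thread. `FakeParam` and `split_weight_decay_groups` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Stand-in for a tensor; only the number of dimensions matters here."""
    ndim: int


def split_weight_decay_groups(named_params, weight_decay=5e-4):
    """Split parameters into a decayed group and a non-decayed group.

    Assumption: biases and BatchNorm gamma/beta are the parameters with
    ndim <= 1, while conv and linear weights have ndim >= 2.
    """
    decay, no_decay = [], []
    for _name, param in named_params:
        (no_decay if param.ndim <= 1 else decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


toy_params = [
    ("conv.weight", FakeParam(ndim=4)),
    ("bn.weight", FakeParam(ndim=1)),   # BatchNorm gamma
    ("bn.bias", FakeParam(ndim=1)),
    ("fc.weight", FakeParam(ndim=2)),
    ("fc.bias", FakeParam(ndim=1)),
]
groups = split_weight_decay_groups(toy_params)
print(len(groups[0]["params"]), len(groups[1]["params"]))  # 2 3
```

With PyTorch, the same function should accept `model.named_parameters()`, and the returned list can be passed to `torch.optim.SGD(groups, lr=0.1, momentum=0.9, nesterov=True)`, since per-group `weight_decay` values override the optimizer default.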

To train AutoSlim, I use MobileNetV2 1.5x with exactly the same hyperparameters, but train for only 50 epochs on 80% of the original training set. Then, during greedy slimming, I use the remaining 20% of the training set as a validation set to decide channel counts. For greedy slimming, I shrink each layer in steps of 10%, which gives the 10 groups mentioned in the paper.
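The procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: `split_train_val` and `greedy_slim` are hypothetical helpers, `evaluate` and `flops` are placeholder callbacks for validation accuracy and the FLOPs count, and the toy functions at the bottom exist only so the loop can run.

```python
import random


def split_train_val(num_samples, val_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction for validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_val = int(num_samples * val_fraction)
    return indices[num_val:], indices[:num_val]


def greedy_slim(num_layers, evaluate, flops, budget, num_groups=10):
    """Repeatedly drop one channel group from the layer that hurts accuracy least.

    `evaluate(groups)` and `flops(groups)` are hypothetical callbacks;
    `groups[i]` is how many of the `num_groups` groups layer i still keeps.
    """
    groups = [num_groups] * num_layers
    while flops(groups) > budget:
        best_layer, best_acc = None, float("-inf")
        for i in range(num_layers):
            if groups[i] <= 1:          # always keep at least one group
                continue
            trial = list(groups)
            trial[i] -= 1               # shrink layer i by one 10% step
            acc = evaluate(trial)
            if acc > best_acc:
                best_layer, best_acc = i, acc
        if best_layer is None:          # nothing left to shrink
            break
        groups[best_layer] -= 1
    return groups


# Toy stand-ins: layer 0 matters three times as much as layer 1.
def toy_evaluate(groups):
    return 0.3 * groups[0] + 0.1 * groups[1]


def toy_flops(groups):
    return 10 * (groups[0] + groups[1])


train_idx, val_idx = split_train_val(50000)   # CIFAR-10 has 50,000 training images
result = greedy_slim(2, toy_evaluate, toy_flops, budget=150)
print(len(train_idx), len(val_idx), result)   # 40000 10000 [10, 5]
```

In the toy run, every step removes a group from layer 1 because that costs less accuracy, and the loop stops once the FLOPs stand-in reaches the budget.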

The final architecture is trained with the same hyperparameters listed above, but I could not reach the 6.8% Top-1 error reported in the paper; I'm getting around 7.8%.

Could you please share the final architecture for AutoSlim-MobileNetV2 on CIFAR-10 with 88M FLOPs? It would also be great if you could tell me the hyperparameters you used for the CIFAR experiments.

Thanks,
Rudy

Hi Rudy, when I run greedy slimming on the network, I find that the output_channels of SlimmableConv2d don't change. Did you encounter the same problem?

Hi dada,

I actually implemented AutoSlim myself and cross-referenced this code.

I could be wrong, but I noticed some lines of code that I believe are bugs:

  • In train.py line 559, the BN calibration process is initialized with the full model as the input argument, while the definition of bn_calibration_init takes a BN module rather than the full model.

  • In train.py line 592, the code uses a divisor attribute on each layer, but I couldn't find a definition of layers[i].divisor in SlimmableConv2d.

Hi Rudy, thank you for your reply!

I did encounter some problems when running the code at v3.0.0. I ran

python -m torch.distributed.launch train.py app:apps/autoslim_resnet_train_val.yml

with autoslim: True set in autoslim_resnet_train_val.yml.

However, SlimmableConv2d does not define the attribute us, so in the function get_conv_layers the length of layers is zero.

As a result, it prints "Totally 0 layers to slim".

Do I need to replace SlimmableConv2d with USConv2d in the network?
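The symptom reported above can be illustrated with a toy example. This is not the repository's `get_conv_layers`; it only assumes, as the comment suggests, that layers are selected by the presence of a `us` attribute. All class and function names below are hypothetical stand-ins.

```python
class ToyFixedWidthConv:
    """Stand-in for a layer type that does not define a `us` attribute."""


class ToyUniversalConv:
    """Stand-in for a layer type that defines `us`."""

    def __init__(self, us):
        self.us = us


def collect_slimmable_layers(modules):
    """Keep only modules exposing a `us` attribute (assumed selection rule)."""
    return [m for m in modules if hasattr(m, "us")]


fixed_net = [ToyFixedWidthConv(), ToyFixedWidthConv()]
universal_net = [ToyUniversalConv([True, True]), ToyUniversalConv([True, False])]

print(len(collect_slimmable_layers(fixed_net)))      # 0
print(len(collect_slimmable_layers(universal_net)))  # 2
```

Under that assumption, a network built only from layers without `us` yields an empty list, which matches the "0 layers" message. Whether swapping layer types is the intended fix is not confirmed in this thread.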

Hi Rudy,
Could you show me your MobileNetV2 code for CIFAR-10?

Hi All,

Sorry for the late reply. While I fully understand that ImageNet requires more compute than many researchers have, results on CIFAR are usually misleading for Neural Architecture Search, especially for efficient neural networks. That's part of the reason why I didn't include the CIFAR config in this code. But I can post the configs here for your reference:

num_hosts_per_job: 1  # number of hosts each job needs
num_cpus_per_host: 36  # number of CPUs each job needs
memory_per_host: 380  # memory each job needs
gpu_type: 'nvidia-tesla-p100'

app:
  # data
  dataset: cifar10
  dataset_id: 0
  dataset_dir: /home/jiahuiyu/.git/mobile/data
  data_transforms: cifar10_basic
  data_loader: cifar10_basic
  data_loader_workers: 36
  drop_last: False

  # info
  num_classes: 10
  test_resize_image_size: 32
  image_size: 32
  topk: [1]
  num_epochs: 100

  # optimizer
  optimizer: sgd
  momentum: 0.9
  weight_decay: 0.0001
  nesterov: True

  # lr
  lr: 0.1
  lr_scheduler: multistep
  multistep_lr_milestones: [30, 60, 90]
  multistep_lr_gamma: 0.1

  # model profiling
  profiling: [gpu]

  # pretrain, resume, test_only
  test_only: False

  # seed
  random_seed: 1995

  # model
  reset_parameters: True

  # app defaults
  optimizer: mobile_sgd
  num_gpus_per_host: 8
  batch_size_per_gpu: 128
  distributed: True
  distributed_all_reduce: True
  num_epochs: 250
  slimmable_training: True
  calibrate_bn: True
  inplace_distill: True
  cumulative_bn_stats: True
  bn_cal_batch_num: 32  # effective batch num is batch_num/gpu_num
  num_sample_training: 4
  lr: 0.5
  lr_scheduler: linear_decaying
  lr_warmup: True
  lr_warmup_epochs: 5

run:
  shell_command: "'python -m torch.distributed.launch --nproc_per_node={} --nnodes={} --node_rank={} --master_addr={} --master_port=2234 train.py'.format(nproc_per_node, nnodes, rank, master_addr)"
  jobs:
    # - name: mobilenet_v1_0.2_1.1_nonuniform_50epochs_dynamic_divisor12
      # app_override:
        # model: models.us_mobilenet_v1
        # width_mult_list_test: [0.2, 1.1]
        # width_mult_range: [0.2, 1.1]
        # universally_slimmable_training: True
        # nonuniform: True
        # num_epochs: 50
        # dataset: cifar10_val5k
        # inplace_distill: True
        # dynamic_divisor: 12
        # nonuniform_diff_seed: True
        # # lr: 1.5
        # # batch_size_per_gpu: 48
        # # num_hosts_per_job: 8
        # lr: 0.125
        # batch_size_per_gpu: 32
        # num_hosts_per_job: 1
        # data_loader_workers: 4
        # # num_gpus_per_host: 1

    # - name: mobilenet_v2_0.15_1.5_nonuniform_50epochs_dynamic_divisor12
      # app_override:
        # model: models.us_mobilenet_v2
        # width_mult_list_test: [0.15, 1.5]
        # width_mult_range: [0.15, 1.5]
        # universally_slimmable_training: True
        # nonuniform: True
        # num_epochs: 50
        # dataset: cifar10_val5k
        # inplace_distill: True
        # dynamic_divisor: 12
        # nonuniform_diff_seed: True
        # lr: 0.5
        # batch_size_per_gpu: 128
        # num_hosts_per_job: 1
        # data_loader_workers: 4

    # - name: mnasnet_0.15_1.5_nonuniform_50epochs_dynamic_divisor12_ngc
      # app_override:
        # model: models.us_mnasnet
        # width_mult_list_test: [0.15, 1.5]
        # width_mult_range: [0.15, 1.5]
        # universally_slimmable_training: True
        # nonuniform: True
        # batch_size_per_gpu: 32
        # num_epochs: 50
        # dataset: imagenet1k_val50k_lmdb
        # inplace_distill: True
        # dynamic_divisor: 12
        # nonuniform_diff_seed: True
        # # lr: 2.0
        # # batch_size_per_gpu: 64
        # # lr: 1.
        # # num_hosts_per_job: 8
        # lr: 0.125
        # num_hosts_per_job: 1
        # dataset_dir: /data/imagenet
        # data_loader_workers: 4
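To make the schedule-related keys above easier to read, here is a minimal sketch of a warmup followed by linear decay, plus the global batch size implied by the GPU settings. It assumes that the values under "app defaults" (lr 0.5, 250 epochs, 5 warmup epochs) are the ones in effect, that warmup ramps linearly up to the base rate, and that decay runs linearly to zero over the remaining epochs. The repository may compute this differently, for example per iteration. `lr_at_epoch` is a hypothetical helper.

```python
def lr_at_epoch(epoch, base_lr=0.5, num_epochs=250, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch under the assumed schedule.

    Linear ramp to `base_lr` during warmup, then linear decay that
    reaches zero at `num_epochs`.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    remaining = num_epochs - warmup_epochs
    return base_lr * (num_epochs - epoch) / remaining


# Global batch size implied by num_gpus_per_host and batch_size_per_gpu on one host.
num_gpus_per_host = 8
batch_size_per_gpu = 128
global_batch_size = num_gpus_per_host * batch_size_per_gpu
print(global_batch_size)  # 1024

for epoch in (0, 4, 5, 127, 250):
    print(epoch, lr_at_epoch(epoch))
```

Note that the commented-out job entries override some of these values (for example num_epochs: 50), so the numbers used here apply only under the stated assumption.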

Please also note that the latest version is released on the v3.0.0 branch, not the master branch.

(I am keeping this issue open and marking it as good first issue)

Hi, I also ran into the "Totally 0 layers to slim" problem with SlimmableConv2d. How did you solve it?