CDP: Towards Optimal Filter Pruning via Class-wise Discriminative Power (ACM MM '21)


We propose a novel filter pruning strategy based on class-wise discriminative power (CDP). CDP quantifies the discriminative power of each filter across classes by bringing Term Frequency-Inverse Document Frequency (TF-IDF) into deep learning. Specifically, CDP regards the output of a filter as a word and the whole feature map as a document. TF-IDF then scores the relevance between the words (filters) and all documents (classes): filters that consistently receive low TF-IDF scores are less discriminative and are therefore pruned. CDP requires no iterative training or search process, which makes it simple and straightforward.
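To make the word/document analogy concrete, here is a minimal pure-Python sketch of a TF-IDF-style filter score. It is an illustration under stated assumptions, not the implementation shipped in this repository: the function name `tfidf_filter_scores`, the use of a per-sample mean response as the "word count", the "above the sample mean" rule for deciding whether a filter fired, the smoothed IDF, and the max-over-classes aggregation are all hypothetical choices.

```python
import math

def tfidf_filter_scores(activations, labels, num_classes):
    """Score each filter by its best class-wise TF-IDF value (illustrative only).

    activations[i][f] is the non-negative mean response of filter f on sample i;
    labels[i] is the class index of sample i. Low scores mark pruning candidates.
    """
    num_samples = len(activations)
    num_filters = len(activations[0])

    # IDF (assumed form): count the samples on which a filter responds above
    # that sample's mean response, then apply a smoothed logarithm.
    fired = [0] * num_filters
    for row in activations:
        mean = sum(row) / num_filters
        for f, a in enumerate(row):
            if a > mean:
                fired[f] += 1
    idf = [math.log((1 + num_samples) / (1 + n)) for n in fired]

    # TF (assumed form): the share of a sample's total response that belongs
    # to filter f, accumulated per class.
    tf_sum = [[0.0] * num_filters for _ in range(num_classes)]
    count = [0] * num_classes
    for row, c in zip(activations, labels):
        total = sum(row) or 1.0
        count[c] += 1
        for f, a in enumerate(row):
            tf_sum[c][f] += a / total

    # A filter is kept if it is discriminative for at least one class,
    # so take the maximum class-wise score.
    scores = []
    for f in range(num_filters):
        per_class = [tf_sum[c][f] / count[c] * idf[f]
                     for c in range(num_classes) if count[c]]
        scores.append(max(per_class))
    return scores

# Toy data: filter 0 responds to class 0, filter 1 to class 1, filter 2 is weak everywhere.
toy_activations = [[0.9, 0.0, 0.1], [0.8, 0.1, 0.1],
                   [0.0, 0.9, 0.1], [0.1, 0.8, 0.1]]
toy_labels = [0, 0, 1, 1]
toy_scores = tfidf_filter_scores(toy_activations, toy_labels, num_classes=2)
print(toy_scores)
```

In this toy setup the third filter receives the lowest score, which matches the intuition that a filter with a consistently low score for every class is the first candidate for pruning.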



  • python 3.6
  • pytorch 1.7.0
  • torchvision 0.7.0
  • NNI 2.0






CIFAR-10

Model     FLOPs (M)  Baseline FLOPs (M)  Accuracy (%)  Baseline Acc. (%)  MindSpore
VGG16     103.3      313.5               94.87         94.47              Link
ResNet20  20.76      40.6                92.49         92.57              Link
ResNet56  60.02      125                 94.63         93.93              -
ResNet56  49.84      125                 94.44         93.93              -


ImageNet

Model     FLOPs (M)  Baseline FLOPs (M)  Accuracy (%)  Baseline Acc. (%)
ResNet18  920        1820                68.76         69.76
ResNet50  2089       4089                75.71         76.83


There are three parts in our experiments:

  1. Record the feature maps that the model produces on sampled data
  2. Use the recorded feature maps to build the CDP pruner and prune the model
  3. Retrain the pruned model
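The pruning step in the list above comes down to ranking filters by their score and dropping the lowest-ranked ones. The sketch below shows that selection for a single layer. It assumes the sparsity is applied per layer and does not use the NNI pruner API; `select_filters_to_prune` is a hypothetical helper name.

```python
def select_filters_to_prune(scores, sparsity):
    """Return the sorted indices of the lowest-scoring filters in one layer.

    scores:   one importance score per filter (higher means more discriminative).
    sparsity: fraction of filters to remove, in [0, 1).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_pruned = int(len(scores) * sparsity)
    # Rank filter indices from least to most important.
    ranked = sorted(range(len(scores)), key=lambda f: scores[f])
    return sorted(ranked[:num_pruned])

layer_scores = [0.43, 0.05, 0.16, 0.40]
print(select_filters_to_prune(layer_scores, sparsity=0.5))
```

The remaining filters would then be copied into a smaller layer, or masked, before retraining.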


Because CIFAR-10 is relatively small and collecting feature-map statistics is cheap, all three steps run in a single script.

# Options used below:
#   --model           model to prune
#   --pretrained_dir  pretrained baseline checkpoint
#   --dataroot        dataset directory
#   --gpus            GPUs to use
#   -j                number of dataloader workers
#   --stop_batch      number of batches sampled to record feature maps
#   --sparsity        model sparsity (not the same as the FLOPs reduction ratio)
#   --coe             CDP hyperparameter (optional, e.g. --coe 0)
python \
    --model resnet20 \
    --pretrained_dir "./ckpt/base/resnet20_model_best.pth.tar" \
    --dataroot "/gdata/cifar10/" \
    --gpus 0,1 \
    -j 4 \
    --stop_batch 200 \
    --sparsity 0.5

Other parameters are listed in
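The note that `--sparsity` is not the FLOPs reduction ratio can be checked with a little arithmetic. The sketch below counts multiply-accumulates for two stacked 3x3 convolutions before and after removing half of the filters in each. The layer sizes are made up for illustration, and real networks (residual connections, the classifier layer) impose constraints this toy example ignores.

```python
def conv_flops(c_in, c_out, kernel, out_h, out_w):
    """Multiply-accumulate count of a standard convolution (bias ignored)."""
    return c_in * c_out * kernel * kernel * out_h * out_w

def stack_flops(channels, kernel, out_h, out_w):
    """Total cost of consecutive conv layers; channels = [c0, c1, ..., cn]."""
    return sum(conv_flops(a, b, kernel, out_h, out_w)
               for a, b in zip(channels, channels[1:]))

# Two 3x3 conv layers on a 32x32 feature map (hypothetical sizes).
dense = stack_flops([3, 64, 64], kernel=3, out_h=32, out_w=32)
# Sparsity 0.5: each layer keeps 32 of its 64 filters.
pruned = stack_flops([3, 32, 32], kernel=3, out_h=32, out_w=32)

reduction = 1 - pruned / dense
print(f"sparsity 0.5 -> FLOPs reduced by {reduction:.1%}")
```

Removing half of the filters cuts FLOPs by roughly 74% here, because the second layer loses both half of its output filters and half of its input channels.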


Because ImageNet is large and collecting feature-map statistics is expensive, we run the experiment in two steps.

First, accumulate feature maps and prune the model.

# --dataroot: path to the dataset; -j: number of dataloader workers;
# --stop_batch: number of sampled batches; --sparsity: model sparsity.
python \
    --model resnet50 \
    --pretrained_dir "./ckpt/base/resnet50_model_best.pth.tar" \
    --dataroot "/gdata/image2012/" \
    --gpus 0,1 \
    -j 4 \
    --stop_batch 200 \
    --sparsity 0.5

Other parameters are listed in

Second, retrain the pruned model.

cd ./imagenet
python \
    --model resnet50 \
    --resume "./ckpt/resnet50_s4.pth" \
    --dataroot "/gdata/image2012/" \
    --gpus 0,1 \
    -j 4 \
    --batch-size 128 \
    --epochs 200 \
    --make-mask \
    --warmup 1 \
    --label-smoothing 0.1

Other parameters are listed in
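For readers unfamiliar with the `--label-smoothing 0.1` option, the sketch below shows one common form of label-smoothed cross-entropy in pure Python. It assumes the smoothing mass is spread uniformly over all classes, including the target. The retraining script in this repository may use a different variant, and `smoothed_cross_entropy` is a hypothetical name.

```python
import math

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed one-hot target (assumed convention)."""
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    k = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Every class gets smoothing / k; the target also gets 1 - smoothing.
        q = smoothing / k + (1.0 - smoothing if i == target else 0.0)
        loss -= q * lp
    return loss

print(smoothed_cross_entropy([2.0, 0.5, -1.0], target=0, smoothing=0.1))
```

With `smoothing=0` this reduces to the ordinary cross-entropy. A positive value penalizes over-confident predictions.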


The entire code is released under the MIT License.


