BVLC/caffe

Sparse convolutional neural networks

Opened this issue · 194 comments

Is anyone interested in utilizing sparsity to accelerate DNNs?

I am working on the fork https://github.com/wenwei202/caffe/tree/scnn and currently achieve, on average, ~5x CPU and ~3x GPU layer-wise speedups of the convolutional layers in AlexNet using off-the-shelf GEMM (after ~2% top-1 accuracy loss).

http://papers.nips.cc/paper/6504-learning-structured-sparsity-in-deep-neural-networks.pdf

@wenwei202 could you explain a bit further how to use your fork? Any example? I have convolution layers where 90% of the weights are zero; if I use your version of Caffe, will the computations automatically take advantage of this sparsity? If I use a dense matrix, will the computations be slower, or will it fall back to the normal way of computing? Thanks for sharing your work 👍

@jpiabrantes You can use conv_mode in each conv layer to indicate which method is used for the computation.
e.g.
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "norm1"
  top: "conv2"
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
    group: 2
    conv_mode: LOWERED_CSRMM # sparse weight matrix in CSR format * lowered feature maps
    # conv_mode: LOWERED_GEMM # default original matrix multiplication
  }
}

Thanks

I just tested the LeNet network from the MNIST example. I was able to achieve the following layer sparsities:

conv1 is 75.4 percent sparse
conv2 is 94.7 percent sparse
ip1 is 74.5 percent sparse
ip2 is 89.5 percent sparse

I used conv_mode: LOWERED_CSRMM and connectivity_mode: DISCONNECTED_GRPWISE. I used the GPU, and the sparse network was not faster; sometimes it was even slower. My batch size is 1.

@jpiabrantes In CPU mode, you need to use MKL. LOWERED_CSRMM is only implemented with MKL's sparse BLAS, since sparse BLAS is not supported by OpenBLAS or ATLAS.

@wenwei202 I used the GPU mode.

@jpiabrantes It is normal to achieve very limited 'speedup' on the GPU even when sparsity is higher than 90%, because GPUs are highly parallel and the irregular sparsity pattern hurts performance. I am working on structured sparsity to achieve speedups on the GPU.

@wenwei202 I am not able to complete compilation. 'make runtest' fails.

@Rupeshd @wenwei202
When running make runtest, use ATLAS instead of MKL (MKL seems to have problems passing some test cases) and export the following variable if you have more than one GPU:

export CUDA_VISIBLE_DEVICES=0 # use one GPU

To stabilize the sparsity during training, I zero out weights whose absolute values are smaller than 0.0001 after each weight update. So the precision of RMSPropSolverTest may not be enough to pass the test. You can comment out the following code if you do not want to zero out weights (but doing so is recommended during training to stabilize the sparsity).

template <typename Dtype>
void Net<Dtype>::Update() {
  for (int i = 0; i < learnable_params_.size(); ++i) {
    learnable_params_[i]->Update();
    learnable_params_[i]->Zerout(); //comment this if you do not want to zerout.
  }
}

The only failed (crashed) test case is "TYPED_TEST(ConvolutionLayerTest, Test0DConvolution)" in https://github.com/wenwei202/caffe/blob/scnn/src/caffe/test/test_convolution_layer.cpp#L311.
I don't know why. If you guys can figure it out, that would be great. Temporarily, I commented out the code within it and passed all other test cases. Test0DConvolution is not used for usual 2D or 3D convolution, so it might not be a concern.

Hope this helps.

-Wei

@wenwei202
Hello, I think you have implemented Liu's CVPR Sparse Convolutional Neural Networks. But in your fork (https://github.com/wenwei202/caffe/tree/scnn), I can't find any procedure that implements it (I know you implemented group lasso and so on, but how does your code implement the methods described in Liu's paper? Can you give me a simple tutorial?)
Thank you in advance.

@wenwei202
Besides, I can see you reference 'models/eilab_reference_sparsenet/deploy_scnn.prototxt' and so on in some Python files, but I can't find any of them. How can I generate them, or where can I find them?

@zhaishengfu That implementation was abandoned. It can hardly achieve good speedup unless the sparse weights are hardcoded in the source code as the paper did. I didn't try hardcoding weights, but you are free to try if you are interested. What the paper did was convert each conv layer into three smaller layers. You can use this to generate the equivalent net prototxt and this to generate the corresponding decomposed caffemodel. But the code is deprecated.

@wenwei202 Thank you for your reply. But I don't understand what you mean by 'hardcoded'; I didn't see it described in the paper. According to my understanding, you can get a speedup as long as your network is sparse and you implement the sparse-dense matrix multiplication described in the paper. Am I wrong?

@zhaishengfu Please refer to section 4 in the paper, e.g. "Therefore, the location of non-zero elements are known and can be encoded directly in the compiled multiplication code." The reproduction of that work was abandoned because of that tricky scheme. Our speedup is achieved by structured sparsity, which overcomes the irregular memory access pattern caused by the random distribution of sparse weights in memory. Hopefully, we can release our related paper soon.

@wenwei202 Thank you very much. Really looking forward to your paper. Can you let me know when you release it? (Or can you tell me its title?)

Hi @zhaishengfu @jpiabrantes @Rupeshd @pluskid @sergeyk, our paper related to this Caffe fork was just accepted at NIPS 2016. You are welcome to contribute, in case you still have interest in sparse convolutional neural networks. [paper] [GitHub code]

@wenwei202 Thank you very much!! I will read it carefully!! I really enjoy your contribution to this fork

@wenwei202 Hello, I have looked through your paper and code roughly. Is the code the same as your original code? I didn't see any difference (or maybe I should look more carefully).
Besides, I have used your original code to train my model (a regression problem). It is useful, but I lose some accuracy, and if I set the learning rate to >10^-5, it goes to "nan". So I can only set it to a small value, and convergence is very slow...

@zhaishengfu Please use the scnn fork; I have updated the tutorial. Hope that will help.

@wenwei202 OK, indeed I have used your code already. I used all of your related parameters to generate my prototxt as follows. I see that you don't use tensor decomposition.
layer {
  name: "conv1_1"
  type: "Convolution"
  bottom: "image"
  top: "conv1_1"
  param {
    lr_mult: 1
    decay_mult: 1
    breadth_decay_mult: 1.0
    kernel_shape_decay_mult: 1.0
    block_group_lasso {
      xdimen: 9
      ydimen: 64
      block_decay_mult: 1.0
    }
    regularization_type: "L1"
  }
  param {
    lr_mult: 1
    decay_mult: 1
    breadth_decay_mult: 0.0
    kernel_shape_decay_mult: 0.0
    regularization_type: "L1"
  }
  connectivity_mode: DISCONNECTED_ELTWISE
  convolution_param {
    num_output: 64
    bias_term: true
    pad: 1
    kernel_size: 3
    group: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
    }
  }
}

For the setting:

block_group_lasso {
  xdimen: 9
  ydimen: 64
  block_decay_mult: 1.0
}

what is the meaning of 9 and 64? Does it mean that it will reserve 9x64 groups of weights and zero out the others?

@hiyijian The code says it clearly: xdimen and ydimen represent the column and row dimensions of each block, respectively. For example, if the weight matrix has A rows and ydimen is B, then you will have A/B row-groups, and the regularization is applied within each group.

Thanks. Clear now.
Is there any guide to setting proper xdimen and ydimen in order to achieve a better trade-off between accuracy and speed?

@hiyijian Indeed, I also want to know the answer. In my training trials (my problem is regression, not classification), when the sparsity gets above ~60%, the accuracy decreases noticeably. I think the configuration of xdimen and ydimen depends on your network and problem. Maybe you can set the configuration as the paper says (such as xdimen equal to the number of columns of your convolution kernel and ydimen equal to the number of rows).

Thank you @zhaishengfu
Maybe the network could be fine-tuned without SSL to regain the accuracy, as the paper reports. I will give it a try.

@zhaishengfu @hiyijian The setups of xdimen and ydimen depend on what kind of structured sparsity you want. For example, if a weight matrix with many all-zero columns is expected, then xdimen = 1 and ydimen = the number of rows. For the trade-off between accuracy and sparsity, please train the network without SSL first to get the baseline, then train it with SSL, and finally fine-tune it without SSL. Make sure your training converges well at every phase.
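To make the three phases concrete, here is a rough sketch of the recipe (illustrative values only; block_group_decay defaults to 0.0):

# Phase 1 - baseline: train without SSL (leave block_group_decay at its default 0.0)
# Phase 2 - SSL: add block_group_lasso blocks to the target layers and set, e.g.,
#   block_group_decay: 0.005 in the solver (illustrative value; tune per network)
# Phase 3 - fine-tune: set block_group_decay back to 0.0 and keep the learned zeros,
#   e.g. via connectivity_mode: DISCONNECTED_GRPWISE in the sparsified layers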

Thanks @wenwei202. That helps a lot.
You introduce 5 ways of applying group Lasso:
1. filter-wise and channel-wise
2. shape-wise
3. depth-wise
4. 2D-filter-wise
5. filter-wise and shape-wise

Would you like to make it clearer how to put each of them into practice via the xdimen/ydimen controls?

Say we have a typical conv layer with nFilter * nChannel * nHeight * nWidth = 128 * 64 * 3 * 3:
1. filter-wise and channel-wise: xdimen = 9 and ydimen >= 1
2. shape-wise: xdimen != 9 and ydimen = 0
3. depth-wise: no idea
4. 2D-filter-wise: xdimen = 9 and ydimen = 1
5. filter-wise and shape-wise: xdimen != 9 and ydimen >= 1

Did I get anything obviously wrong?

@hiyijian In your example, since a filter is reshaped into a row of the weight matrix in Caffe, the setups would be:
filter-wise: xdimen = 64x3x3, ydimen = 1
channel-wise: xdimen = 3x3, ydimen = 128
shape-wise: (1, 128)
depth-wise: (64x3x3, 128)
2d-filter-wise: (3x3, 1)

You can set up multiple block_group_lasso blocks as you want.
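For instance, combining the channel-wise and filter-wise setups above for that 128x64x3x3 layer might look roughly like this (a sketch only; the decay multipliers are placeholders, and group Lasso only takes effect with block_group_decay > 0 in the solver):

param { # weights of the 128x64x3x3 conv layer
  lr_mult: 1
  decay_mult: 1
  block_group_lasso { # channel-wise: one group per input channel
    xdimen: 9    # 3x3 kernel positions
    ydimen: 128  # all 128 filters
    block_decay_mult: 1.0
  }
  block_group_lasso { # filter-wise: one group per filter (one row)
    xdimen: 576  # 64x3x3, a whole row
    ydimen: 1
    block_decay_mult: 1.0
  }
}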

@wenwei202 Cool. I got the point. Thanks

@wenwei202 Hello, I have used your method and trained the model to 90% sparsity, but when I run the model, there is no acceleration. I have set the following parameter:
conv_mode: LOWERED_CSRMM
and I compiled Caffe with MKL. What do you think the problem is?

@wenwei202 I tried all three methods, and their speed is almost the same (0.22 s). My network is almost all 3x3 convolutions and has 32 such layers, with no groups. But because my MKL can't compile well with your Caffe, I changed the MKL const parameters to non-const ones (that is, I changed const int M and so on to int M):

void caffe_cpu_sparse_dense2csr(int M, int N,
    float* A,
    float* A_nonzero_buf, int* A_nonzero_idx_buf, int* A_idx_pointer_buf) {

but I think this is not the reason for the lack of speedup. I have no ideas. My GPU is an NVIDIA GTX 770.

@zhaishengfu try to profile and locate the bottleneck.

@wenwei202 I found a really strange problem. When I use LOWERED_CSRMM and LOWERED_CCNMM, the speed is slower than LOWERED_GEMM (and LOWERED_CCNMM is far slower!!!). I ran the model under 4 configurations; the average time (the total time spent in Caffe forward()) with the same model as described previously was:

  1. CUDA 7.5 + cuDNN v4: average time 0.22 s
  2. CUDA 7.5 + LOWERED_GEMM (GPU mode): average time 0.18 s
  3. CUDA 7.5 + LOWERED_CCNMM (GPU mode): average time 2.3 s
  4. CUDA 7.5 + LOWERED_CSRMM (GPU mode): average time 0.24 s

But when I look at the convolution time of each layer, it is strange! The per-layer times are totally different from the overall time of the network (per layer, CCNMM is the fastest and CSRMM is the slowest, as depicted in the picture below; I tested the network multiple times, so you can see the regular time pattern; the x axis is each of my convolution layers, the y axis is the per-layer time in µs).
[per-layer convolution timing plot]

I will look deeper into why this happens. If you have any ideas, thank you for telling me!

@wenwei202 I read another paper of yours, "Holistic SparseCNN: Forging the Trident of Accuracy, Speed, and Size". It seems the direct sparse convolution technique introduced by that paper has already been integrated into scnn (via DIRECT_SCONV).
Two things are unclear to me:
1. "When sparsity is too high, direct sparse convolution becomes bandwidth bound, which leads to diminishing returns on the performance acceleration." Why is that?
2. When should we use DIRECT_SCONV?

@hiyijian Please use our intel branch for the paper "Holistic SparseCNN", which is mainly contributed by @jspark1105. And I guess the sparsity is defined as the ratio of nonzeros in that paper.

@zhaishengfu Some comments:

  1. To profile, use deploy.prototxt in Python and train_val.prototxt in caffe time ...; otherwise, there might be some bugs in the original Caffe code.
  2. cuDNN is only used for training; testing is profiled with Caffe's default engine, cuBLAS.
  3. https://github.com/wenwei202/caffe/blob/scnn/README.md#notes

@wenwei202 I used cuBLAS in GPU mode and disabled cuDNN, but I find the times are still strange. I used caffe time and only tested the forward() time, finding that GEMM is the fastest, only 100 ms, while CSRMM and CCNMM are both about 1200 ms. So what do you think the problem is? (By the way, I tested the sparse model in normal Caffe (not yours) and found that in GEMM GPU mode, your Caffe takes about 100 ms and the original Caffe about 130 ms.)

@wenwei202
Does the scnn branch support DIRECT_SCONV mode? I noticed the DIRECT_SCONV mode in caffe.proto.
I failed to compile the intel branch; my GNU g++ could not recognize the "-xhost" flag. I think it is hardware related. So what are the CPU requirements?

@hiyijian It is not fully supported by scnn. You need the Intel compiler to use the intel branch. Please ask Jongsoo Park for more details.

@hiyijian If you're interested in DIRECT_SCONV, please try https://github.com/IntelLabs/SkimCaffe . We've created a new git repository for direct sparse convolution (also described in https://arxiv.org/abs/1608.01409)

@wenwei202 I want to know whether net_pruner.py in your Caffe is used to extract the non-zero weights for further use; the code you wrote is hard for me to understand because it lacks comments.

@Paseam No, that's not the one. You can use connectivity_mode to disconnect zero-weighted connections.
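For example, a minimal sketch (layer names and sizes are arbitrary) that keeps zero weights disconnected:

layer {
  name: "conv2"
  type: "Convolution"
  bottom: "conv1"
  top: "conv2"
  connectivity_mode: DISCONNECTED_GRPWISE # or DISCONNECTED_ELTWISE
  convolution_param {
    num_output: 256
    kernel_size: 5
  }
}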

@wenwei202 When I'm testing ConvolutionLayerTest/1.TestGradient3D on CPU, there is always an error like: /test/test.testbin': malloc(): memory corruption: 0x000000000e48a3a0 ***. Do you have any idea what the possible problem is? (I'm using OpenBLAS.) Thanks in advance!

@knsong Please see the previous comment about make runtest; I did not have the TestGradient3D issue. If you figure it out, please let us know. Thanks.

@wenwei202 In section 4.2, "Group Lasso regularization is only enforced on the convolutional layers between each pair of shortcut endpoints, excluding the first convolutional layer and all convolutional shortcuts." What do you mean by "the convolutional layers between each pair of shortcut endpoints"? Could you give me an example? Thanks a lot!
For example, pool1->res1a->bn->res1b->bn->res1 (eltwise: res1b+pool1)->res2a->bn->... In this example, on which convolutional layer(s) should we enforce the group Lasso regularization?

@wenwei202 In the ResNet-20 experiment, after the group Lasso regularization is enforced and SSL converges, does the error drop compared to the original 8.82% (I know the final is 7.40%)? When you fine-tune, did you freeze the connectivity? Thank you very much!
In this experiment, I was wondering which step, the SSL step or the fine-tuning step, contributes more to the improvement. Of course, I think it is SSL, but how?

@xiyuyu We have an example of ResNet. We use 1x1 conv layers as the shortcuts when the dimensions of feature maps do not match; those conv shortcuts are not regularized by group Lasso. Moreover, the first conv layer is not between any shortcut endpoints, so it is not reasonable to add group Lasso on this layer to remove it. The SSL step is essential to learn the structured sparsity, and the fine-tuning process recovers accuracy a bit by disconnecting the zero weights (DISCONNECTED_GRPWISE in the case of ResNet).

@wenwei202 Got it! Thank you very much~~

@wenwei202 When you fine-tune the caffemodel after group-Lasso regularization on resnet_n3, did you zero out the parameters in the batchnorm and scale layers? If not, after fine-tuning, these parameters will not be zero and will produce outputs, which makes removing the zeroed layers from the network unworkable.

@xiyuyu An all-zero conv layer produces all-zero feature maps, and batchnorm on all-zero feature maps is trivial: the outputs of those batchnorm layers are constant regardless of the input images. You can simply add those constant values to the bias values of the next layer.

@wenwei202 make sense~~

@wenwei202 Does structured sparsity support the InnerProduct layer, too?

@irwenqiang Yes, it does.

make runtest failed; has anyone solved a similar runtest problem?

ConvolutionLayerTest/1.TestGradient3D
*** Aborted at 1480994031 (unix time) try "date -d @1480994031" if you are using GNU date ***
PC: @     0x7f4be7a7cc4e (unknown)
*** SIGSEGV (@0x7f4bfe0856c0) received by PID 7887 (TID 0x7f4bef483a80) from PID 18446744073676543680; stack trace: ***
    @     0x7f4be7dd53e0 (unknown)
    @     0x7f4be7a7cc4e (unknown)
    @     0x7f4be7a7e5d4 __libc_malloc
    @     0x7f4be8882f37 caffe::SyncedMemory::mutable_cpu_data()
    @     0x7f4be8819b2c caffe::Blob<>::InitializeConnectivity()
    @     0x7f4be8825be2 caffe::Blob<>::Reshape()
    @           0x488e4b caffe::GradientChecker<>::CheckGradientSingle()
    @           0x4899f3 caffe::GradientChecker<>::CheckGradientExhaustive()
    @           0x6d8fd7 caffe::ConvolutionLayerTest_TestGradient3D_Test<>::TestBody()
    @           0x91b043 testing::internal::HandleExceptionsInMethodIfSupported<>()
    @           0x91465a testing::Test::Run()
    @           0x9147a8 testing::TestInfo::Run()
    @           0x914885 testing::TestCase::Run()
    @           0x915b5f testing::internal::UnitTestImpl::RunAllTests()
    @           0x915e83 testing::UnitTest::Run()
    @           0x46e21d main
    @     0x7f4be7a1b830 __libc_start_main
    @           0x475c89 _start
    @                0x0 (unknown)
^CMakefile:541: recipe for target 'runtest' failed

Hello @wenwei202!
Would you tell me which files you changed compared to the original Caffe? I want to use your code with my Caffe. It's very kind of you! Thank you very much!

@mumaal A bunch of files were modified; please use git diff between the master and scnn branches to check the modifications. Thanks!

Hello @wenwei202
Thanks for the awesome concept. I was eager to know whether this works on larger nets like VGG and ResNet-101.

@legolas123 We did not try VGGNet or ResNet-101 in the NIPS paper, but applying it to VGGNet is in progress. I will update you when it's done. Please also post an update here if you get results. Thanks!

@wenwei202 Sure, will let you know when I get the results.
Thanks

@legolas123 Here is the result of training VGGNet-16 with SSL. We get modest column sparsity, even with ~2.7% improved top-1 accuracy after SSL without further fine-tuning. We are increasing the decay to explore how structurally sparse VGGNet can be with small accuracy loss. Stay tuned.

@wenwei202 Nice to hear that. I am also trying your sparsity code on detection with the Faster R-CNN framework, on the VGG-16 net. I used the weight decay you used for the CIFAR data, 0.003. I guess this is a large weight decay factor. Let's see how the results look. Will let you know when it converges.

@wenwei202 I have the results on VGG for detection. While the sparsity is around 65% on all convolutional layers from conv3_1 onwards, accuracy (mAP) dropped from 69 to 64. But the interesting thing is that inference has become considerably slower, which means I am doing something wrong at inference time. Don't know what.

@legolas123 To speed up, you need to remove the zero groups and concatenate the non-zeros into a smaller matrix. For accuracy, make sure you fine-tuned after sparsifying. Thanks!

Yeah, I did that. conv_mode: LOWERED_CCNMM in test.prototxt, right? Yes, I did that, and with exactly that it has become considerably slower. And I am using GPU mode, so no MKL problem either. Could it be that the convolutions are getting faster but some other layers, like spatial pooling, are getting slower? I don't see a reason, but I am still trying to profile with the Faster R-CNN Python wrapper to see if that is the case.

@wenwei202 I used your profile display flag in Makefile.config, and yes, the convolutions on which sparsity has been applied are more than 2 times faster than their dense counterparts. But since profiling does not show the FC layers, I cannot find the layer which is slowing down the sparse network. Can you think of anything that might be causing this issue? Thanks!

@legolas123 The tricky part is that, in GPU mode with conv_mode: LOWERED_CCNMM, we temporarily use CPU code to do the lowering, which slows down the computation. If we could modify conv_im2col_gpu to concatenate the feature matrix, it wouldn't be an issue.

@wenwei202 So it does this lowering operation in every iteration? Couldn't it be made to do the lowering only in the first iteration? Thanks!

@legolas123 The weight matrices are lowered before the first iteration here, which is a one-time computation, but feature maps are dynamic and must be lowered in each iteration. The overhead is small; we just need to skip some indices, as I did in CPU mode. The GPU code takes a while to hack, though.

@wenwei202 Will try to hack into that. My current concern is accuracy, because with 90% sparsity the mAP has gone down by 10 points, which further fine-tuning improves by just 1 point. This is weird, because you improve the accuracy of ImageNet classification by sparsifying (although you mentioned you had kept the sparsity intensity low for now), and I am seeing such a drop of accuracy in detection. Thanks!

@legolas123 Do you mind sending me your solver for both SSL and fine-tuning? weiwen.web@gmail.com

hi @wenwei202

I'm interested in trying your SCNN, but I'm a bit confused by the steps of your tutorial. From my understanding, the steps are (based on your CIFAR-10 tutorial):

  1. train CIFAR-10
  2. fine-tune

But in your README.md, you give an example using block_group_lasso, which I didn't find inside the prototxts (cifar10_full_train_test, cifar10_full_train_test_ft or cifar10_full_train_test_kernel_shape) of your CIFAR-10 example.

Any suggestion for me? Or is block_group_lasso automatically applied during training?

thanks

@marifnst You can add block_group_lasso based on what kind of structured sparsity you want to learn. See an example for reducing layers in ResNet, where all weights in a layer form one group.
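For reference, a depth-wise group of that kind for a 3x3 conv layer with 128 filters and 64 input channels (dimensions chosen only for illustration, following the 128x64x3x3 example earlier in this thread) would put the whole weight matrix into a single group:

param { # weights
  lr_mult: 1
  decay_mult: 1
  block_group_lasso { # the entire 128 x (64x3x3) weight matrix is one group
    xdimen: 576  # 64x3x3 columns
    ydimen: 128  # 128 filter rows
    block_decay_mult: 1.0
  }
}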

@wenwei202 I have the results for the sparsification of the Faster R-CNN detection network. The sparse training is carried out for 70k iterations at a 0.001 learning rate, and fine-tuning is performed for another 60k iterations at a 0.001 learning rate, so I increased the learning rate for tuning as you suggested. Now the final mAP is 65.5 in comparison to the original mAP of 68. Also, the sparsity is more than 80% in all convolutional layers above conv3.

@legolas123 Great job, Koustubh! Looking forward to seeing you share it with the community.

hi @wenwei202

Thank you for sharing.
Can you share which source file or library provides vsDivCheckZero used in math_functions.cpp?

regards

@marifnst Here it is.

./include/caffe/util/mkl_alternate.hpp:20:DEFINE_VSL_BINARY_FUNC(DivCheckZero, y[i] = (b[i]==0 ? 0: a[i] / b[i]));

Hi @wenwei202

Thank you very much.
I hope I can finish the integration process and share the results with you.

Best regards

hi @wenwei202

Sorry for always bothering you.
I have successfully integrated my algorithm with your layer.
Can you help me understand the meaning of the output below:

I0207 16:43:25.079854 8699 base_conv_layer.cpp:17] layer conv1 has sparsity of 0.00085034
I0207 16:43:25.082247 8699 base_conv_layer.cpp:171] ConvolutionParameter ConvMode: DEFAULT
I0207 16:43:25.083804 8699 base_conv_layer.cpp:17] layer conv2 has sparsity of 0.015612
I0207 16:43:25.132417 8699 base_conv_layer.cpp:171] ConvolutionParameter ConvMode: DEFAULT
I0207 16:43:25.134937 8699 base_conv_layer.cpp:17] layer conv3 has sparsity of 0.0147163
I0207 16:43:25.205013 8699 base_conv_layer.cpp:171] ConvolutionParameter ConvMode: DEFAULT
I0207 16:43:25.208621 8699 base_conv_layer.cpp:17] layer conv4 has sparsity of 0.00619092
I0207 16:43:25.311568 8699 base_conv_layer.cpp:171] ConvolutionParameter ConvMode: DEFAULT
I0207 16:43:25.313935 8699 base_conv_layer.cpp:17] layer conv5 has sparsity of 0.00529876
I0207 16:43:25.481933 8699 inner_product_layer.cpp:12] layer fc6 has sparsity of 0.0213456
I0207 16:43:29.525123 8699 inner_product_layer.cpp:12] layer fc7 has sparsity of 0.0181708

?

best regards

@marifnst
I0207 16:43:25.079854 8699 base_conv_layer.cpp:17] layer conv1 has sparsity of 0.00085034 means that 0.00085034 * 100% = 0.085034% of the weights in conv1 are zero (element-wise sparsity).

I0207 16:43:25.082247 8699 base_conv_layer.cpp:171] ConvolutionParameter ConvMode: DEFAULT means that, in layer conv1, the convolution computation mode conv_mode is the default GEMM. See more modes here.

hi @wenwei202

I have a prototxt like the one below:

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  #connectivity_mode: DISCONNECTED_GRPWISE
  param {
    lr_mult: 1
    decay_mult: 1
    block_group_lasso {
      xdimen: 7
      ydimen: 1
      block_decay_mult: 1.0
    }
  }
  param {
    lr_mult: 2
    decay_mult: 1
  }
  convolution_param {
    num_output: 48
    kernel_size: 7
    pad: 3
    stride: 2
  }
}

and there is no loss of accuracy even without fine-tuning.
Or maybe I have a mistake in my configuration, so your SCNN doesn't take effect? Any comment?

And I found from your code that:

  1. Your conv modes (CSRMM, CCNMM, etc.) are only for the testing phase, not for training.
  2. CCNMM is slower than CSRMM in the testing phase (per the explanation earlier in the thread: the lowering is only implemented on the CPU, and I use GPU mode).

So, any suggestion on how to verify that your layers are faster than the original layers?
From my understanding of your paper, you compare execution time per (conv) layer.

Sorry, I did not notice that you have updated https://github.com/wenwei202/caffe/blob/scnn/README.md. I will check it too, based on your explanation there.

Best Regards

@marifnst You need to configure block_group_decay in your solver to enable group Lasso, since block_decay_mult * block_group_decay is the group-Lasso hyper-parameter lambda. The default block_group_decay is 0.
Enable USE_PROFILE_DISPLAY := 1 in Makefile.config to print timing results, which essentially include only the time of the matrix-matrix multiplication.
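A minimal solver sketch along those lines (the net path is hypothetical and the values are placeholders; 0.005 is taken only from the example below):

net: "models/my_net/train_val.prototxt"   # hypothetical path
base_lr: 0.001
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.0005
block_group_decay: 0.005   # lambda = block_decay_mult * block_group_decay; 0 disables group Lasso
max_iter: 100000
snapshot_prefix: "models/my_net/ssl"
solver_mode: GPU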

Hi @wenwei202

Thank you very much for your responses.
Following your suggestion, my average accuracy dropped from 0.60 to 0.15 (without fine-tuning), using a block_group_decay of 0.005 in the solver and connectivity_mode set to DISCONNECTED_GRPWISE.

No problem with that; I am still working on understanding your SSL better and will fine-tune soon.

Best Regards

Hi @wenwei202

In section 4.3 (AlexNet on ImageNet) of your paper, you described the SSL method in 3 steps:

  1. train with structure regularization;
  2. remove the zero groups;
  3. fine-tune without SSL.

Now, I can only find your caffemodels from step 1. Would you please share the fine-tuned model from step 3?

Thanks! :)

@Roll920 The caffemodels in the model zoo are the fine-tuned ones.

@wenwei202 Thanks for your response, but I find the size of your caffemodel is exactly the same as the original AlexNet (232.6 MB). Since the zero groups are removed in step 2, it should be much smaller, with a new structure. Is there anything wrong with my understanding?

@Roll920 I updated the tutorial regarding your concern here.

Hi @wenwei202, I read your NIPS paper; it looks great! Just wondering what the performance would be if I start training from scratch using SSL. Would that model's accuracy be close enough to the original model trained from scratch without SSL? I cannot use a pretrained model for my project.

Hi @madnavs, training from scratch is fine, but fine-tuning can get a better trade-off between accuracy and sparsity. Thanks!

Thanks for the reply, @wenwei202. Just want to know your insights on how much worse training from scratch would be compared to fine-tuning.

@madnavs We didn't observe significant differences on small datasets, but we didn't try ImageNet. You should use a smaller lambda_g if you train from scratch.

hi @wenwei202

what if I set my prototxt as below:

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  #connectivity_mode: DISCONNECTED_GRPWISE
  param {
    lr_mult: 1
    decay_mult: 1
    block_group_lasso {
      xdimen: 7
      ydimen: 1
      block_decay_mult: 1.0
    }
  }
  param {
    lr_mult: 2
    decay_mult: 1
  }
  convolution_param {
    num_output: 48
    kernel_size: 7
    pad: 3
    stride: 2
  }
}

but I don't set a block_group_decay value in my solver? Will the training then proceed without SSL? You commented before that the parameter defaults to 0.

Thank you very much

@marifnst Yes, the default block_group_decay is 0.0, as set up in caffe.proto, so no group Lasso is applied. Note that the number of columns (rows) of the weight matrix must be divisible by xdimen (ydimen); for example, with a 3-channel input, the conv1 above has a 48 x (3x7x7) = 48 x 147 weight matrix, so xdimen: 7 and ydimen: 1 divide it evenly.

hi there @wenwei202, I have some detailed questions about the configuration of your ResNet-20 experiments on CIFAR-10 (related to depth-wise sparsity).
In your experiment, does a single group consist of the 2 conv layers in a residual block, or did you add 2 shortcuts in every block (so that a single conv layer is a group)?

I tried the second scheme and found that when one conv layer is removed, it is likely that the other conv layer in the residual block is removed as well. I trained the model from scratch with SSL on CIFAR-10 with the standard ResNet-20 configuration, but the accuracy can hardly be recovered by fine-tuning. I wonder if there is any mistake or trick that I missed in this experiment. Thanks!

@siberiamark Neither! In the paper, the residual block has the original structure, where there is only one shortcut across two layers. Group Lasso is enforced on each layer separately. Theoretically, if one layer in the residual block is regularized to all-zero, the other layer must also go to all-zero so as to minimize the objective function (because removing the remaining layer will not affect the data loss but will reduce the regularization term). This is also the phenomenon I observed. If you train from scratch, please use a smaller hyper-parameter lambda_g, but the trade-off may be better if you fine-tune with SSL.

Hi @wenwei202, I have been reading your paper and code for some time. I don't know the difference between breadth_decay and kernel_shape_decay in caffe.proto. What is their relation to block_group_decay? Could you give me a further explanation, please? Thanks a lot!

@ZouKaiwei Group Lasso regularization on each row or column can be specified by block_group_lasso with ydimen: 1 or xdimen: 1. However, we also implemented breadth_decay_mult & kernel_shape_decay_mult (in ParamSpec param) and breadth_decay & kernel_shape_decay (in SolverParameter) to simplify the configuration of group Lasso regularization on each row or column, respectively. For example, in conv1 of LeNet, kernel_shape_decay_mult: 1.5 is equivalent to

param { # weights
  lr_mult: 1
  block_group_lasso { # specify the group lasso regularization each column
    xdimen: 1 
    ydimen: 20 # The size of each column is the number of filters 
    block_decay_mult: 1.5 # the same with kernel_shape_decay_mult
  }
}

and breadth_decay_mult: 1.5 is equivalent to

param { # weights
  lr_mult: 1
  block_group_lasso { # specify the group lasso regularization each row
    xdimen: 75 # The size of each row is the size of filter 5*5*3
    ydimen: 1  
    block_decay_mult: 1.5 # the same with breadth_decay_mult
  }
}

Got it! @wenwei202 Your reply really helps me. Thank you very much!

Hi @wenwei202, I'm trying to get the same results as yours using an MLP. I have some doubts about your files. Suppose I need to run MLP 2:
(1) Which net file should I use?
(2) What values should I set for kernel_shape_decay and breadth_decay?
Besides, are the neuron numbers per layer in MLP 2 (469-294-166-10) set in mlp.prototxt, or are these values just the result of sparsification?
I'm looking forward to your reply. Thank you very much.