Ben-Louis/FisherPruning-Pytorch

runtime error in pruning resnet50

zhaoxin111 opened this issue · 10 comments

Thank you very much for your optimization. I tried to reproduce the pruning results on classification, but it reported an error. I suspect it is a torch version problem, but the error persists even after switching to the same version used in your experiments. Can you give any suggestions?
[screenshot of the error]

2021-11-18 21:53:59,417 - mmcls - INFO - Environment info:

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-11.1
NVCC: Build cuda_11.1.TC455_06.29069683_0
GCC: gcc (GCC) 5.4.0
PyTorch: 1.8.0+cu111
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.0+cu111
OpenCV: 4.5.3
MMCV: 1.3.17
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 11.1
MMClassification: 0.15.0+729c6c1

Sorry for this incompatibility. I am not familiar with how the pytorch autograd interface behaves under different environments, so I am not sure what causes this problem.
Can you print the op, parents and op.next_functions after line 496 in fisher_pruning.py?
On my computer the results are:

  • op:<AddmmBackward object at 0x7f2bcccbde20>
  • parents:[<CudnnConvolutionBackward object at 0x7f2bcccbdb80>, <CudnnConvolutionBackward object at 0x7f2bccd34f40>, <CudnnConvolutionBackward object at 0x7f2bccd34b80>, <CudnnConvolutionBackward object at 0x7f2bccd34a90>]
  • op.next_functions: ((<AccumulateGrad object at 0x7f2bcccbdeb0>, 0), (<ViewBackward object at 0x7f2bcccbddf0>, 0), (<TBackward object at 0x7f2bcccbdca0>, 0))

If you can find AccumulateGrad in op.next_functions, switching the index in line 497 might solve this problem.
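
If it helps, here is a minimal stand-alone sketch (not code from this repo) for checking where AccumulateGrad appears on your build; the node names are what torch 1.8 prints and may differ elsewhere:

```python
import torch
import torch.nn as nn

# Minimal sketch (not repo code): inspect the backward graph of a plain
# nn.Linear to see at which index AccumulateGrad (the bias node) shows up
# in op.next_functions on your torch build.
fc = nn.Linear(8, 4)
loss = fc(torch.randn(2, 8)).sum()

op = loss.grad_fn.next_functions[0][0]  # AddmmBackward on torch 1.8 for 2-D input
print(op)
print(op.next_functions)
# On torch 1.8 this prints something like
# ((AccumulateGrad, 0), (None, 0), (TBackward, 0));
# the middle slot is None here only because the random input does not require grad.
# If AccumulateGrad sits at a different index in your build, the index used
# around line 497 of fisher_pruning.py has to be adjusted accordingly.
```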

Thanks for your reply. I found that my op and related nodes are completely different from your results.
[screenshot: printed op, parents and op.next_functions]

You can try to replace line 421 in fisher_pruning.py with self.fc2ancest = self.find_module_ancestors(loss, FC) and check whether AddmmBackward appears in op2parents.keys() after line 472.

  • If yes, it means MmBackward may also appear in modules other than nn.Linear. Then you can simply skip MmBackward in the iteration that begins at line 492 (feel free to leave max_pattern_layer=-1, as this argument is only used to accelerate the dfs).
  • If not, there are two possible reasons:
  1. the fc layers you use have no bias; try turning it on.
  2. the autograd interface in your env is totally different from mine. You may need to figure out the computing graph of an fc layer and find the operator that corresponds to the bias (check op and op.next_functions recursively, starting from op = loss.grad_fn; see the small traversal sketch below).
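
A small hedged sketch (not repo code) of such a recursive check:

```python
# Minimal sketch (not repo code): recursively print the backward graph starting
# from loss.grad_fn, to locate the operator whose next_functions contain the
# AccumulateGrad node belonging to the fc bias in your environment.
def walk(fn, depth=0, seen=None):
    seen = set() if seen is None else seen
    if fn is None or fn in seen:
        return
    seen.add(fn)
    print("  " * depth + type(fn).__name__)
    for parent, _ in fn.next_functions:
        walk(parent, depth + 1, seen)

# usage, after the loss has been computed:
# walk(loss.grad_fn)
```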

Thank you for your reply.

First of all, the problem is definitely in finding the ancestor node of the fc layer. I printed the backpropagation graph following the loss node. As shown in the figure below, the nodes preceding the MmBackward node are UnsqueezeBackward and TBackward, and the bias comes after the TBackward node.

1. the fc layers have bias (the model is the mmcls resnet50; I didn't change anything)
2. I think the computing graph of the fc layer is not right in my experiment. In before_run, self has no in_mask from modified_forward_linear, so the features should not have been masked yet, and therefore the backward graph should not contain UnsqueezeBackward. I don't know if my understanding is correct.

The model, hook and train files are the same as yours, and the pytorch version is the same. Why does it report such an error? Your comments are welcome.
[screenshot: backpropagation graph]

Your problem has nothing to do with in_mask since the process of assigning in_mask (line 442) is after traversing the graph (line 422).
The computing graph on my computer is shown below.
[screenshot: computing graph]
It seems like the node AddmmBackward is replaced with a more detailed subgraph on your computer. I don't know why for now.
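One guess, purely an assumption on my side: the graph may depend on how the linear layer is called. A minimal sketch to compare the graphs nn.Linear produces:

```python
import torch
import torch.nn as nn

# Minimal sketch (an assumption, not a confirmed diagnosis): the backward graph
# produced by the same nn.Linear can differ with the way it is called, e.g. the
# input shape. On some builds a 2-D input gives a single AddmmBackward node,
# while other call patterns decompose into separate matmul / add (and
# view / unsqueeze) nodes.
fc = nn.Linear(8, 4)

for shape in [(2, 8), (2, 3, 8)]:
    out = fc(torch.randn(*shape))
    print(shape, "->", out.grad_fn)
    print("    next_functions:", out.grad_fn.next_functions)
```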
---
Thanks for your information, I have updated the code to support both cases. You can try if it works in your env.

Thank you for your patient reply; I have been able to train successfully. Could you post the reproduced results for classification and detection later, so that we can compare experiments properly? 😆

Hi, in my env I found the dfs should traverse from parents[0][0], otherwise fc cannot find any parents.
[screenshot]

Then the fc can be grouped with two conv layers.
[screenshot: fc grouped with two conv layers]

Hi, I found that different pytorch versions differ greatly in autograd behavior.
When pruning the detection model, I found that the feature and grad_feature dimensions in the figure below do not match. My test environment uses the same pytorch 1.8 as yours, but with 3090 graphics cards.

[screenshot: mismatched feature / grad_feature shapes]

I have tried printing the shapes of grad_input and grad_output. It is strange that the size of grad_input[0] matches the output shape of the module.
[screenshot: printed grad_input / grad_output shapes]

[screenshots]

I found this has been discussed here pytorch/pytorch#598

and grad_input[0] is exactly the grad of the feature map after conv2d but before the bias is added, while grad_input[1] is the grad of the bias. If I want to prune a conv2d, I have to set bias=False.
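
For reference, a minimal sketch (not the repo's hook code) that reproduces this kind of mismatch with the legacy backward hook; the exact shapes it prints depend on your build:

```python
import torch
import torch.nn as nn

# Minimal sketch (not repo code): with the legacy register_backward_hook and a
# biased Conv2d, grad_input can correspond to the inputs of the bias-add op
# rather than to the module input (see pytorch/pytorch#598).
# register_full_backward_hook (available since torch 1.8) reports the gradient
# w.r.t. the actual module input instead.
def report(module, grad_input, grad_output):
    print(type(module).__name__, "bias=True" if module.bias is not None else "bias=False")
    print("  grad_input :", [None if g is None else tuple(g.shape) for g in grad_input])
    print("  grad_output:", [None if g is None else tuple(g.shape) for g in grad_output])

for bias in (True, False):
    conv = nn.Conv2d(3, 8, 3, padding=1, bias=bias)
    conv.register_backward_hook(report)  # deprecated hook, used here to reproduce the issue
    x = torch.randn(1, 3, 16, 16, requires_grad=True)
    conv(x).sum().backward()
```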

Thanks for your reply. I found that my op and related nodes are completely different from your results.

@zhaoxin111, thank you very much for your previous discussions; they were very helpful to me!

I encounter the same problem with torch==1.8.1. I'm still a bit confused: why is this convolution grad with bias correct?

[screenshot]

I do not understand your question.