hughperkins/clnn

[Feature request] Groups support for SpatialConvolutionMM

szagoruyko opened this issue · 37 comments

To use https://github.com/szagoruyko/loadcaffe with clnn, the convolution needs groups support (for most of the networks).
An example is in cudnn: https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua

Ok. Reading a bit, and trying to figure out what groups are. Is it something like the following? Eg imagine we have 32 input planes, and 32 output planes. Without groups, with full plane connectivity, we would link every pair of input and output planes, ie each output plane would be output from one filter cube of 32 planes, connected to each of the 32 input planes, for a total of 32 * 32 = 1024 filter planes, ie 1024 input-output plane pairs.

But with groups, rather than connecting all to all, we split our input and output planes into groups. Eg if we have 4 groups, we will have 4 groups of 8 input planes, and 4 groups of 8 output planes.

Each group connects only 8 of the input planes to 8 of the output planes (in this example), for a total of 4 * 8 * 8 = 128 filter planes, ie 128 input-output plane pairs?

(And this is fully parallelizable across the 4 groups; each group is entirely independent, in fact?)
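ie, just to sanity-check the arithmetic (sizes from the example above):

nInputPlane, nOutputPlane, groups = 32, 32, 4

-- full connectivity: every input plane paired with every output plane
fullPairs = nInputPlane * nOutputPlane                                    -- 32 * 32 = 1024

-- grouped: connectivity only within each group
groupedPairs = groups * (nInputPlane / groups) * (nOutputPlane / groups)  -- 4 * 8 * 8 = 128

print(fullPairs, groupedPairs)  -- 1024   128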

yes, that's right, you reduce weight size and have offsets when calling im2col and blas. And then you can have one clCommandQueue per group in forward and backward.

Ok. Well, I will probably start with working-but-slow, and then look at optimizing later.

Hmmm, in the cudnn case, how do you construct this? Looks like you don't do something like nn.ConvLayer(..., groups=8):cuda(), but something more like cudnn.ConvLayer(..., groups=8) perhaps?

function SpatialConvolution:__init(nInputPlane, nOutputPlane,
                            kW, kH, dW, dH, padW, padH, groups)
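
So presumably something like this would be how it's used (just guessing from that signature; eg an AlexNet-style two-group layer)?

conv2 = cudnn.SpatialConvolution(96, 256, 5, 5, 1, 1, 2, 2, 2)  -- last argument: groups = 2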

but the nn version doesn't have groups I think?

no, because we have cudnn

Created a draft SpatialConvolutionMM:forward implementation: https://github.com/hughperkins/clnn/blob/grouped-conv/SpatialConvolutionMM.lua It basically just narrows the input and output
and feeds them to child unmodified SpatialConvolutionMMs. I haven't thought about how to test it yet, and there's no backprop yet. To run it:

th test/testGroupedConv.lua

It just prints a really big matrix, and the size, but at least it proves the output is the right size, isn't full of zeros, 1e-38 values, or nans, and it runs.

(by the way, this also adds the same functionality to standard nn, as far as I can see)
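
ie, something along these lines is the gist of the check, with made-up sizes (not the actual test file; assuming the draft branch's grouped nn.SpatialConvolutionMM):

require 'nn'

conv = nn.SpatialConvolutionMM(8, 8, 3, 3, 1, 1, 1, 1, 2)  -- last argument: 2 groups
output = conv:forward(torch.rand(4, 8, 16, 16))
print(#output)                     -- expect 4 x 8 x 16 x 16
print(output:ne(output):sum())     -- nan count, expect 0
print(output:clone():abs():max())  -- expect something sensibly non-tiny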

also done for updateGradInput (but entirely untested for now...)

you lose all the nice properties of it being a single module with weight, bias, gradWeight and gradBias; I think it has to be implemented low-level

Hmmm... weight and bias, good point. We can however slice those too of course....

(Of course, if you have a low-level implementation, I'm obviously open to that :-) )

basically cudnn.torch and caffe https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu have this low level implementation

Are we targeting the multiple-queue case, or the single-queue case? If we are targeting the multiple-queue case, then we will need one kernel launch per group?

Either way, custom kernels take a lot of time and effort, and I'm unlikely to have time to create one for groups in the near future, at least not without seeing some strong evidence that a simpler solution doesn't meet the requirements. If I create the simpler solution, will you be able to make use of that, or would you rather hold out for a custom-kernel solution sometime?

wait there is no need for custom kernels, it's only adding correct paddings (*offsets) before calling them

it seems like a simple solution to me, and the evidence is strong: if you add the offsets then loadcaffe and alexnet caffe-trained networks will work with clnn out of the box. I can add them sometime, don't know when though.

wait there is no need for custom kernels, it's only adding correct paddings (*offsets) before calling them

using torch.Narrow is exactly the same as adding offsets before calling them. However, until this sentence of yours saying there's no need for custom kernels, my understanding was that you want me to use one single kernel to run all groups, in one single kernel launch? As far as I can see, that would need a new kernel; it cannot reuse the existing SpatialConvolutionMM.cl kernel?

Note that torch.Narrow != nn.Narrow. nn.Narrow is a network module that involves at least one kernel launch in each direction, to do a copy. torch.Narrow merely changes the offset: it is a new view onto an existing Storage, so no copy is needed.
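
Eg:

x = torch.zeros(4, 32, 8, 8)
y = x:narrow(2, 9, 8)  -- planes 9..16: a view onto x's Storage, no copy
print(torch.pointer(y:storage()) == torch.pointer(x:storage()))  -- true: same Storage
y:fill(1)
print(x[{1, 9, 1, 1}], x[{1, 8, 1, 1}])  -- 1  0: writing to y wrote into x, so no copy happened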

Edit: by the way, by 'evidence', I mean measurements. eg, demonstrate that kernel launch time is dominating the training time. I have such evidence for LSTM for example. For now, I have no such evidence for the caffe zoo nets.

My understanding is that you want me to use one single kernel to run all groups?

no, the same kernel in a for loop.
I agree, using narrow is the same as adding offsets without copy.

Ok, so if you look at my lua solution, it is using torch.Narrow. What things about my current lua solution are you hoping could be changed? I'm fine with ensuring that .weight and .bias work as expected.

this logic has to go to the cpp side. You should use this constructor https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua#L6, leave the lua updateOutput, updateGradInput and accGradParameters unchanged from nn, and move the narrows from lua to cpp.

What benefits do you see in moving to the c++ side? Given that the time will most likely be dominated by the convolutional inner loops, why do you feel that there will be any noticeable benefit of moving an outer loop from lua into c++?

how will you make it work in lua? you cannot use the conv modules, because they're in lua and they have their own weights

You can see how it works: https://github.com/hughperkins/clnn/blob/grouped-conv/SpatialConvolutionMM.lua

function nn.SpatialConvolutionMM:__init(nInputPlane, nOutputPlane, kW, kH, dW, dH, padW, padH, groups)
  self.groups = groups
  self.groupInputPlane = nInputPlane / groups
  self.groupOutputPlane = nOutputPlane / groups
  self.convs = {}
  for group=1,groups do
    self.convs[group] = self.new(self.groupInputPlane, self.groupOutputPlane, kW, kH, dW, dH, padW, padH)
  end
  return self
end

function nn.SpatialConvolutionMM:updateOutput(input)
  -- (resize of self.output omitted here; see the full source in the link)
  for group=1,self.groups do
    local inputslice = input:narrow(2, self.groupInputPlane * (group-1) + 1, self.groupInputPlane)
    self.convs[group].output = self.output:narrow(2, self.groupOutputPlane * (group-1) + 1, self.groupOutputPlane)
    self.convs[group]:updateOutput(inputslice)
  end
  return self.output
end

(simplifying here, you can see the full sourcecode in the link)
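
updateGradInput follows the same narrowing pattern, roughly like this (a sketch, not the exact branch code):

function nn.SpatialConvolutionMM:updateGradInput(input, gradOutput)
  self.gradInput:resizeAs(input)
  for group=1,self.groups do
    local inputslice = input:narrow(2, self.groupInputPlane * (group-1) + 1, self.groupInputPlane)
    local gradOutputslice = gradOutput:narrow(2, self.groupOutputPlane * (group-1) + 1, self.groupOutputPlane)
    self.convs[group].gradInput = self.gradInput:narrow(2, self.groupInputPlane * (group-1) + 1, self.groupInputPlane)
    self.convs[group]:updateGradInput(inputslice, gradOutputslice)
  end
  return self.gradInput
end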

First, there is something wrong with it: there are two times more weights than needed. Second, where do you map the weights to the group modules?
Check it against cudnn, but I've got a feeling that I've already spent more time looking at this than it would take to implement the offsets in cunn and clnn :)

Second, where do you map the weights to the group modules?

I don't yet, but it will not be hard to do so.
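
Eg, roughly like this (untested sketch; mapParametersToGroups is just a made-up name, and gradWeight/gradBias would be handled the same way), so that there is a single weight tensor, of the grouped size:

-- one weight/bias on the parent, sized for grouped connectivity,
-- with each child module getting a narrowed view rather than its own storage
function nn.SpatialConvolutionMM:mapParametersToGroups(nOutputPlane, kW, kH)
  self.weight = torch.Tensor(nOutputPlane, self.groupInputPlane * kH * kW)
  self.bias = torch.Tensor(nOutputPlane)
  for group = 1, self.groups do
    local first = self.groupOutputPlane * (group - 1) + 1
    self.convs[group].weight = self.weight:narrow(1, first, self.groupOutputPlane)
    self.convs[group].bias = self.bias:narrow(1, first, self.groupOutputPlane)
  end
end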

I've got a feeling that I've already spent more time looking at this than it would take to implement the offsets in cunn and clnn :)

It's far faster to develop in lua than to write custom kernels, and maintenance is easier too. So, I would prefer to write it in lua first, and then see where the bottlenecks are. But obviously if such a module would go unused, then we can wait for me to write a custom kernel; but... firstly, I'm unlikely to have time to do that in the near future, and secondly, in my experience, it's significantly faster to write a custom kernel by starting with a simple cpu implementation first, since it gives one something to compare with, test against, and so on.

I added groups to cunn forward here: torch/cunn@master...szagoruyko:groups
It should be straightforward to extend it to clnn.

Sounds good. I didn't realize you were already working on this. Please ignore my interim lua solution, and go ahead with getting the C++ solution to work.

Closing this for now. If you wish to reopen it, please assign it to yourself. If you would like me to implement it, please provide any requirements in terms of api, behavior etc, and state that you are happy to accept my own choices in terms of language etc.

Yes I have started doing it for cunn and then it's the same change for clnn. Will do it sometime.

Sergey, I was vaguely looking at Chainer, and they're trying to unify their cpu and cuda bits in some common way, chainer/chainer#266 . Caffe also has a similar project in progress. torch already has this, really. It occurs to me that the common language in everything is C. Pretty much every language can call into C, so libraries written in C can be reused across Lua, Python, R, etc ... Could it be worth making a C-only version of torch, which could then be wrapped by a very thin sugar layer of lua/python/r/whatever?

hi Hugh, the thing is we will continue to train models in lua, so a C version of Torch would only be useful for running models in test mode. There are several ways to do it then: one is using the C/C++ interface of lua, another one is what you propose, e.g. https://github.com/szagoruyko/cunnproduction or https://github.com/soumith/nnjs; it's better to ask @soumith about this direction

You can call C from Lua. I'm surprised by your reply, since you often seem to me to prefer to implement everything in C/C++, rather than Lua :-)

I should probably do this sometime => reopening.

I think I shall close this. If anyone needs this, they can raise a new request and/or comment on this one. I can provide a lua implementation within a few man-hours, probably.