yaringal/DropoutUncertaintyCaffeModels

Problems in applying the suggested method to other types of convolutional neural networks


Hi,

First of all, thank you for sharing your research. I'm very interested in the approach your research suggests, so I'm trying to apply dropout to CNNs (and MC dropout at inference time) in other practical models (such as models for ImageNet), just for fun :)

However, I have some questions about using the results of your paper, since I'm confused about how dropout should be applied to the network layers.

1. As far as I understand from your paper (e.g. http://arxiv.org/pdf/1506.02158v6.pdf), it seems that dropout should be applied right before (or right after) each weight product, before the non-linearity is applied. I would appreciate it if you could let me know whether this understanding is correct. (Based on the code here, it seems my understanding is wrong...)
Some quotes from the paper:

Note that sampling from q(W_i) is identical to performing dropout on layer i in a network whose weights are (M_i)_i=1..M. The binary variable z_i,j = 0 corresponds to unit j in layer i − 1 being dropped out as an input to the i’th layer.
...
This is equivalent to an approximating distribution
modelling each kernel-patch pair with a distinct random variable, tying the means of the random
variables over the patches. This distribution randomly sets kernels to zero for different patches. This
is also equivalent to applying dropout for each element in the tensor y before pooling.

2. I also wonder why we don't have to apply dropout to the bias parameters in each layer. I assumed that the biases should be treated the same way as the weights (in the probabilistic sense), since the Bernoulli random variables seem to attach to the parameters of interest in the functions; because of that, I thought dropout should be applied to the biases too. (Again, based on the code here, my understanding seems wrong...) A small sketch of how I currently picture both questions is shown below.
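
To make the questions concrete, here is roughly how I picture them in a toy NumPy sketch (the shapes and names are mine, not from your paper or this repo):

```python
import numpy as np

np.random.seed(0)
p_keep = 0.5

x = np.random.randn(4)                      # input to layer i (i.e. output of layer i-1)
W = np.random.randn(3, 4)                   # weight matrix M_i of layer i
b = np.random.randn(3)                      # bias of layer i
z = np.random.binomial(1, p_keep, size=4)   # Bernoulli variables z_{i,j} over the input units

# Question 1: dropout applied to the input, right before the weight product
# (and before the non-linearity) ...
out_mask_input = W @ (z * x) + b

# ... is the same as zeroing the corresponding *columns* of W at random,
# while the bias b stays deterministic (which is what puzzles me in question 2).
out_drop_columns = (W * z) @ x + b

assert np.allclose(out_mask_input, out_drop_columns)
```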

I may not be understanding the mathematics in your paper correctly, so I would really appreciate any comments on these thoughts.

Sincerely,

Jaehyun

Hey!
I updated the section in the paper that you quote above - it was a bit confusing before. Putting dropout before or after a layer corresponds to different approximating distributions q(). Dropout just before an inner-product layer corresponds to setting columns of W to zero at random - the first paragraph you quote. Putting a dropout layer just after a convolution layer corresponds to a different approximating distribution q() that ties different patch-kernel pairs (the second paragraph you quote).
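Concretely, in a toy NumPy sketch (made-up shapes, nothing to do with the actual Caffe models in this repo):

```python
import numpy as np

np.random.seed(0)
p_keep = 0.5

# Dropout just before an inner-product layer:
# masking the input is the same as setting columns of W to zero at random.
x, W = np.random.randn(4), np.random.randn(3, 4)
z = np.random.binomial(1, p_keep, size=4)
assert np.allclose(W @ (z * x), (W * z) @ x)

# Dropout just after a convolution layer, before pooling:
# every element of the output tensor y gets its own Bernoulli variable,
# i.e. at each patch location a different subset of kernels is set to zero.
y = np.random.randn(8, 6, 6)                          # 8 kernels evaluated on a 6x6 grid of patches
z_conv = np.random.binomial(1, p_keep, size=y.shape)
y_dropped = y * z_conv
pooled = y_dropped.reshape(8, 3, 2, 3, 2).max(axis=(2, 4))   # 2x2 max pooling
```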
Hope that's clearer?
Let me know if you manage to get good results on ImageNet - I couldn't improve on it, but I didn't try too hard either. This reference might be useful for this task:
http://arxiv.org/abs/1511.02680
Alex applies dropout over the top convolutions (in the encoder), which apparently gives better results than putting dropout over the lower layers as well (which we know just extract Gabor filters).
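Roughly the kind of setup I mean, as a toy sketch (dense layers standing in for the lower and top convolution blocks, and made-up sizes; this is not Alex's actual code):

```python
import numpy as np

np.random.seed(0)
p_keep = 0.5
T = 50                                      # number of MC dropout samples at test time

# Toy "encoder": two layers standing in for the lower and top blocks.
W1, b1 = np.random.randn(16, 8), np.random.randn(16)
W2, b2 = np.random.randn(4, 16), np.random.randn(4)

def forward(x):
    h = np.maximum(W1 @ x + b1, 0.0)        # lower block: no dropout (Gabor-like features)
    z = np.random.binomial(1, p_keep, size=h.shape)
    h = h * z / p_keep                      # dropout over the top block only, kept on at test time
    return W2 @ h + b2

x = np.random.randn(8)
samples = np.stack([forward(x) for _ in range(T)])
mean, std = samples.mean(axis=0), samples.std(axis=0)   # predictive mean and uncertainty
```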
Yarin