trzy/FasterRCNN

step by step understanding approximate joint training method #192

Opened this issue · 12 comments

I don't exactly understand the approximate joint training method.
I know the RPN and the detector are merged into one network during training.
The forward path starts at the pre-trained conv network, passes through the RPN, and finally arrives at the Fast R-CNN layers. The loss is computed as:

RPN classification loss + RPN regression loss + Detection classification loss + Detection bounding-box regression loss.

But what is the backpropagation path? Does it go through the detector, then the RPN, and finally the pre-trained convnet?
In that case, how is differentiation performed in the decoder section of the RPN? The offsets produced by the 1x1 regression conv layer in the RPN are translated into proposals in the decoder.

trzy commented

I don't understand the question. Can you rephrase?

Backprop is performed in this function. There is not a complete path back from the detector losses to the image, if I recall correctly. RPN proposals are where the path is broken. The proposals are generated and fed into the detector stage but the gradient is not propagated back through the proposals and up into the RPN. This means that proposals fed into the detector are treated as constants in each step (that of course change in each step as well) that do not affect the gradient. You can see this in the PyTorch version here, where I detach the proposals from the computational graph. In the TensorFlow version, the equivalent code is here using tf.stop_gradient().
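As a rough sketch of what this detaching does (toy tensors here, not the repo's code): the detached tensor keeps its values, but autograd treats it as a constant, so nothing flows back into the branch that produced it.

```python
import torch

# Toy stand-ins: rpn_out plays the role of an RPN output, w a detector-side weight.
rpn_out = torch.randn(4, 4, requires_grad = True)
proposals = rpn_out * 2.0          # stand-in for proposal generation
detector_in = proposals.detach()   # the detector stage sees this as a constant

w = torch.randn(4, 4, requires_grad = True)
loss = (detector_in * w).sum()
loss.backward()

print(w.grad is not None)  # True: the detector-side parameter receives a gradient
print(rpn_out.grad)        # None: no gradient flows back into the RPN branch
```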

Thanks for the reply.
I mean the backward path.
Is the backward path as follows?

loss ----> sibling (reg & classifier) ----> FC ----> FC ----> RoIPooling ----> ((((DECODER)))) ----> 1x1_Conv_sibling (reg) ----> 3x3_Conv ----> last-vgg-layer ----> second-vgg-layer

I want to find the path along which this derivative can be computed:
dLoss/d(vgg_weights)

DECODER:
x = x_a + w_a * t_x
y = y_a + h_a * t_y
w = w_a * exp(t_w)
h = h_a * exp(t_h)
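In code, I picture this decoding step roughly like the following sketch (the (center_x, center_y, width, height) layout of anchors and deltas is my assumption, not taken from the repo):

```python
import torch

def decode_boxes(anchors, deltas):
  # anchors: (N, 4) as (x_a, y_a, w_a, h_a); deltas: (N, 4) as (t_x, t_y, t_w, t_h)
  xa, ya, wa, ha = anchors.unbind(dim = 1)
  tx, ty, tw, th = deltas.unbind(dim = 1)
  x = xa + wa * tx          # shift the anchor center by a fraction of the anchor size
  y = ya + ha * ty
  w = wa * torch.exp(tw)    # scale the anchor width and height exponentially
  h = ha * torch.exp(th)
  return torch.stack([ x, y, w, h ], dim = 1)
```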

I just want to know the gradient flow path at the break point (between the RPN and the detector).

Is what I understood correct?

1. Gradients start flowing from the last output layers of the detector (Fast R-CNN).

2. They pass through the two FC layers and reach the RoI pooling layer.

3. The gradients of the RoI pooling layer are computed with respect to the weights of the output layer of the RPN. During backpropagation, the gradients flow from the RoI pooling layer to the output layer of the RPN.

Sorry, is there a second, direct backward route from the RoI pooling layer to the VGG?
I know the forward path, but I am confused about the break point (regarding gradient flow in backpropagation).

trzy commented

You should look at the function I pointed out and draw a visual diagram.

Looking at the code, we can see that the gradient flows back from the detector through the RoI pool layer (which does not have any weights) but is then stopped. Looking forward, the input enters the backbone and then passes into RPN, which generates proposals, and also directly to the detector (for the RoI pooling to use alongside the proposal regions generated from the RPN). However, the gradient cannot flow through the proposal generation logic (I believe it is not differentiable?) and we explicitly stop it from doing so. So the backprop path is not symmetric to the forward path: that one branch of the network does not support backprop. Makes sense?
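Here is a rough, self-contained sketch of that asymmetry (toy backbone and torchvision's roi_pool, not the repo's code): the backbone still receives gradients through the feature-map branch of RoI pooling even though the proposals are detached.

```python
import torch
from torchvision.ops import roi_pool

conv = torch.nn.Conv2d(3, 8, kernel_size = 3, padding = 1)  # stand-in "backbone"
image = torch.randn(1, 3, 32, 32)
feature_map = conv(image)

# Pretend these came from the RPN; detached, so the detector treats them as constants.
proposals = torch.tensor([ [ 0.0, 0.0, 0.0, 16.0, 16.0 ] ]).detach()  # (batch_idx, x1, y1, x2, y2)

pooled = roi_pool(feature_map, proposals, output_size = (7, 7))
loss = pooled.sum()
loss.backward()

print(conv.weight.grad is not None)  # True: the gradient reached the backbone via the feature map
```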

Thanks, understood.
But why do we use rpn_output.detach()?
Is it possible to differentiate roi(feature_map, RoIs) with respect to the coordinates?

Does d(roi(feature_map, RoIs))/d{x1, y1, x2, y2} even exist?

Here I mean the cropping part of the RoI pool.

Does d(feature_map[x1:x2, y1:y2])/d{x1, y1, x2, y2} exist?

The coordinates in the feature map are just indexes.

trzy commented

This is commonly done in all Faster R-CNN implementations. RoI is probably differentiable (?) but I'm guessing that's not the real problem. Look at what happens prior to the detach() call.

    # Assign labels to proposals and take random sample (for detector training)
    proposals, gt_classes, gt_box_deltas = self._label_proposals(
      proposals = proposals,
      gt_boxes = gt_boxes[0], # for now, batch size of 1
      min_background_iou_threshold = 0.0,
      min_object_iou_threshold = 0.5
    )
    proposals, gt_classes, gt_box_deltas = self._sample_proposals(
      proposals = proposals,
      gt_classes = gt_classes,
      gt_box_deltas = gt_box_deltas,
      max_proposals = self._proposal_batch_size,
      positive_fraction = 0.25
    )

    # Make sure RoI proposals and ground truths are detached from computational
    # graph so that gradients are not propagated through them. They are treated
    # as constant inputs into the detector stage.
    proposals = proposals.detach()
    gt_classes = gt_classes.detach()
    gt_box_deltas = gt_box_deltas.detach()

I'm guessing that _label_proposals() and _sample_proposals() are a big part of the reason why the gradients cannot be propagated through. I would imagine these are not stable, differentiable operations, because they perform a lot of conditional filtering and shuffling of data.
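To make that concrete, here is a hypothetical sketch (invented names, not the actual _label_proposals()/_sample_proposals() code) of the kind of filtering and sampling they perform; the IoU matching, thresholding, boolean masking, and random subsampling are hard, discrete choices rather than smooth functions of the proposal coordinates.

```python
import torch
from torchvision.ops import box_iou

def label_and_sample(proposals, gt_boxes, max_samples = 128):
  ious = box_iou(proposals, gt_boxes)                  # (N, M) pairwise IoU
  best_iou, best_gt = ious.max(dim = 1)                # hard assignment of each proposal to a ground-truth box
  keep = best_iou >= 0.5                               # hard IoU threshold
  kept = proposals[keep]                               # boolean filtering drops proposals outright
  perm = torch.randperm(kept.shape[0])[:max_samples]   # random subsampling
  return kept[perm], best_gt[keep][perm]
```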

I apologize for my many questions, but I am confused and could not find an answer in my own research.
RoI pooling involves non-differentiable operations like indexing (quantizing a coordinate such as 3.5 to an integer like 3). So why do we detach the proposals? During backpropagation, how do the gradients flow from the detector back into the RPN and the feature extraction network? I don't understand: detaching the proposals seems unnecessary when gradients cannot flow from the RoI pooling layer to the RPN head and are stopped automatically.
On the other hand, unlike RoI align, the outputs of RoI pooling are not directly related to the coordinates (proposals). (Actually, I could not find a mathematical relationship between the RoI pooling output and the coordinate part of its inputs, i.e. between the RoI pooling outputs and {x1, y1, x2, y2}.)
So again, detaching the proposals seems unnecessary when there is no relationship between the RoI pooling output and the coordinate inputs.
If d(roi_pool_outputs)/d{x1, y1, x2, y2} does not even exist, why should we detach {x1, y1, x2, y2} to make them constant?

Make sure RoI proposals and ground truths are detached from the computational graph so that gradients are not propagated through them (using RPN_output.detach()).

  • My problem is that the above operation (RPN_output.detach()) seems unnecessary, because d(roi_pool_outputs)/d{x1, y1, x2, y2} does not even exist, so gradients automatically cannot propagate from the detector back through the RPN.

I am really confused.

trzy commented

Have you tried removing the detach statements and seeing what happens during training? What happens?

No, I did not

trzy commented

Give it a try and observe. If your hypothesis is correct and this is redundant, there should be no difference in training progression and performance. If PyTorch is attempting to automatically differentiate these functions anyway, I would expect that training would not proceed smoothly and would have difficulty converging.

Thank you for your attention.