nianticlabs/monodepth2

Pose (decoder) network - forward method mismatch with the first convolutional operation (pconv0, Table 5 in the paper)

giannisbdk opened this issue · 2 comments

Hello, I would like to follow up on issue #418, since it is closed and was not answered in detail.

Indeed, the output of the ResNet encoder is a list containing the intermediate outputs of all of the ResNet's stages:

def forward(self, input_image):
        self.features = []
        x = (input_image - 0.45) / 0.225
        x = self.encoder.conv1(x)
        x = self.encoder.bn1(x)
        self.features.append(self.encoder.relu(x))
        self.features.append(self.encoder.layer1(self.encoder.maxpool(self.features[-1])))
        self.features.append(self.encoder.layer2(self.features[-1]))
        self.features.append(self.encoder.layer3(self.features[-1]))
        self.features.append(self.encoder.layer4(self.features[-1]))
        # My comment:
        # The result here is:
        # list = [tensor[N, 64, 1/2*s, 1/2*s], tensor[N, 64, 1/4*s, 1/4*s],
        #         tensor[N, 128, 1/8*s, 1/8*s], tensor[N, 256, 1/16*s, 1/16*s],
        #         tensor[N, 512, 1/32*s, 1/32*s]]
        # where s: the original image size (h, w respectively)
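
If I understand correctly, a dummy forward pass along these lines should reproduce those shapes (an untested sketch of mine, assuming the monodepth2 repo is on the Python path and the default 640x192 input resolution):

import torch
from networks import ResnetEncoder

encoder = ResnetEncoder(num_layers=18, pretrained=False)
features = encoder(torch.randn(1, 3, 192, 640))   # N, 3, H, W
for f in features:
    print(f.shape)
# torch.Size([1, 64, 96, 320])
# torch.Size([1, 64, 48, 160])
# torch.Size([1, 128, 24, 80])
# torch.Size([1, 256, 12, 40])
# torch.Size([1, 512, 6, 20])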

This list of features is then fed into the pose decoder network, whose forward method does the following:

def forward(self, input_features):
        last_features = [f[-1] for f in input_features]

        cat_features = [self.relu(self.convs["squeeze"](f)) for f in last_features]
        cat_features = torch.cat(cat_features, 1)

        out = cat_features
        for i in range(3):
            out = self.convs[("pose", i)](out)
            if i != 2:
                out = self.relu(out)

        out = out.mean(3).mean(2)

        out = 0.01 * out.view(-1, self.num_frames_to_predict_for, 1, 6)

        axisangle = out[..., :3]
        translation = out[..., 3:]

        return axisangle, translation

However, I cannot see how self.relu(self.convs["squeeze"](f)) can work, since this conv was declared in the constructor as:

self.convs[("squeeze")] = nn.Conv2d(self.num_ch_enc[-1], 256, 1)
# My comment:
# self.num_ch_enc = [64, 64, 128, 256, 512], thus, self.num_ch_enc[-1] = 512

It seems that the pose decoder's constructor declares a convolutional operation that expects an input volume whose number of channels equals that of the ResNet's last stage (which indeed matches the paper, see Table 5), i.e. 512. However, from my point of view, there is a mismatch between the pose network introduced in the paper and the actual implementation, regarding the forward method and this first convolutional operation. (?)
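
To make the concern concrete, this is roughly what I would expect to break if the encoder's feature list were passed to the decoder as-is (an untested, hypothetical snippet of mine, not taken from the repo):

import torch
from networks import ResnetEncoder, PoseDecoder

encoder = ResnetEncoder(num_layers=18, pretrained=False)
pose_decoder = PoseDecoder(encoder.num_ch_enc, num_input_features=1,
                           num_frames_to_predict_for=2)

features = encoder(torch.randn(1, 3, 192, 640))   # list of 5 tensors, 64 to 512 channels
# If this list were passed directly, `f[-1]` would index the batch dimension
# of every stage, and the 1x1 "squeeze" conv (built for 512 input channels)
# would also be applied to the 64/128/256-channel maps, so I would expect
#     axisangle, translation = pose_decoder(features)
# to fail with a channel-mismatch RuntimeError.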

Please note that I have not run the code, so I cannot provide any other information; I have only read the paper and tried to investigate the code. Also, I do not claim that my analysis is correct, I have just described one point of view.

Hi @johnbdk ,
The result of ResnetEncoder is not passed directly to PoseDecoder; PoseDecoder typically receives a list of tensor[N, 512, 1/32*s, 1/32*s] features.
See:
https://github.com/nianticlabs/monodepth2/blob/master/trainer.py#L262
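
Concretely, PoseDecoder only ever looks at the last entry of each feature list it receives, e.g. (a toy sketch with random tensors, assuming the default 640x192 resolution; not the actual trainer code):

import torch
from networks import PoseDecoder

num_ch_enc = [64, 64, 128, 256, 512]
decoder = PoseDecoder(num_ch_enc, num_input_features=1, num_frames_to_predict_for=2)

# one feature list, as produced by the pose encoder for a pair of frames
fake_features = [torch.randn(1, c, 192 // s, 640 // s)
                 for c, s in zip(num_ch_enc, [2, 4, 8, 16, 32])]
axisangle, translation = decoder([fake_features])   # note the outer [ ]
print(axisangle.shape, translation.shape)           # torch.Size([1, 2, 1, 3]) each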

Please try running the code and specify what options you are using.

@daniyar-niantic Well, it seems that I missed the outer square brackets on line https://github.com/nianticlabs/monodepth2/blob/master/trainer.py#L285 when investigating the code.

Below, I point out the exact line and its output in case any future reader has the same query.

pose_inputs = [self.models["pose_encoder"](torch.cat(pose_inputs, 1))]
# My comment:
#
# With num_pose_frames == 2 and pose_model_type == 'separate_resnet'
# (i.e. one pair of frames is processed per call of the decoder), this
# wraps the output of the pose encoder's forward method in a list:
#
# list_of_list = [ [ tensor[N, 64, 1/2*s, 1/2*s],
#                    tensor[N, 64, 1/4*s, 1/4*s],
#                    tensor[N, 128, 1/8*s, 1/8*s],
#                    tensor[N, 256, 1/16*s, 1/16*s],
#                    tensor[N, 512, 1/32*s, 1/32*s] ] ]
# where s: the original image size (h, w respectively)
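
For completeness, an end-to-end version of this call pattern could look like the following (my own untested snippet mirroring the separate_resnet setup, not taken from the trainer):

import torch
from networks import ResnetEncoder, PoseDecoder

# the pose encoder takes two RGB frames concatenated along the channel axis
pose_encoder = ResnetEncoder(num_layers=18, pretrained=False, num_input_images=2)
pose_decoder = PoseDecoder(pose_encoder.num_ch_enc, num_input_features=1,
                           num_frames_to_predict_for=2)

two_frames = torch.randn(1, 6, 192, 640)        # N, 2*3, H, W
pose_inputs = [pose_encoder(two_frames)]        # outer [ ] -> list of one feature list
axisangle, translation = pose_decoder(pose_inputs)
print(axisangle.shape, translation.shape)       # torch.Size([1, 2, 1, 3]) each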

Thank you for the clarification!