confused about the use of reference points

MVDeTr/multiview_detector/models/mvdetr.py

Line 130 in 87783a7

    
           reference_points = create_reference_map(dataset, n_points).repeat([dataset.num_cam, 1, 1, 1])

Do these reference points correspond to the 2D pixel coordinates of the BEV grid?
When you build the world feature from the camera features (B * N * C * H * W), you apply a deformable transformer to that world feature, so the final features have nothing to do with the image features. What do the reference points mean, then?

Thank you for your interest.

Yes, they are on the BEV world feature map. The reference points are the default attention locations (a uniform grid on the BEV map). Starting from these locations, MVDeTr then learns where to look (sampling location = reference point + learned offset) for multiview feature aggregation.
I would not agree that the final features have nothing to do with the image features, since the BEV world features are projected directly from the image features.
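
For illustration only, here is a minimal NumPy sketch of the "reference point + offset" idea described above. It is not the repository's code: the function names `uniform_reference_grid` and `sampling_locations`, the normalized [0, 1] coordinates, and the random offsets standing in for learned ones are all assumptions made for this example.

```python
import numpy as np


def uniform_reference_grid(height, width):
    """Return normalized (x, y) cell centers of an H x W grid, shape (H, W, 2)."""
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_x, grid_y], axis=-1)


def sampling_locations(reference_points, offsets):
    """Add per-location offsets to the reference points.

    reference_points: (H, W, 2)
    offsets:          (H, W, n_points, 2)
    returns:          (H, W, n_points, 2)
    """
    return reference_points[:, :, None, :] + offsets


# Hypothetical sizes, chosen only for the example.
H, W, n_points = 4, 6, 4
ref = uniform_reference_grid(H, W)

# Random values stand in for the offsets a network would learn.
rng = np.random.default_rng(0)
offsets = 0.05 * rng.standard_normal((H, W, n_points, 2))

loc = sampling_locations(ref, offsets)
print(ref.shape, loc.shape)
```

With all offsets set to zero, every sampling location coincides with its reference point, which is why the reference points act as the default places to look.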

Please see our paper for more details.

Best,
Yunzhong

By saying that the final features have nothing to do with the image features, I mean that the deformable transformer operates on the world feature map. But the reference points are not uniform grids on the BEV map: they are calculated by the function create_reference_map, which uses the projection matrices between the camera and world axes. @hou-yz, that is the point I am confused about. The reference points obtained from create_reference_map are pixel coordinates in the image domain, but the offsets should be learned in the BEV world domain.

        elif world_feat_arch == 'deform_trans':
            n_points = 4
            reference_points = create_reference_map(dataset, n_points).repeat([dataset.num_cam, 1, 1, 1])
            self.world_feat = DeformTransWorldFeat(dataset.num_cam, dataset.Rworld_shape, base_dim,
                                                   n_points=n_points, stride=2, reference_points=reference_points)

Please run the code and you'll see that the reference points are indeed a uniform grid.
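
One way to check this numerically is sketched below. This is a hedged example, not part of the repository: the helper `is_uniform_grid` is hypothetical, and it assumes the points for one view have been converted to a NumPy array of shape (H, W, 2) holding (x, y) pairs, with H and W both at least 2. The actual layout of the tensor returned by create_reference_map may differ, so it may need slicing or reshaping first.

```python
import numpy as np


def is_uniform_grid(points, tol=1e-6):
    """Return True if an (H, W, 2) array of (x, y) points forms a uniform grid.

    Uniform here means: x depends only on the column, y depends only on
    the row, and the step between neighbors is constant along each axis.
    """
    x, y = points[..., 0], points[..., 1]
    dx = np.diff(x, axis=1)  # x step between neighboring columns
    dy = np.diff(y, axis=0)  # y step between neighboring rows
    constant_steps = (dx.max() - dx.min() <= tol) and (dy.max() - dy.min() <= tol)
    # x must not change down a column, y must not change along a row.
    axis_aligned = (
        np.abs(np.diff(x, axis=0)).max() <= tol
        and np.abs(np.diff(y, axis=1)).max() <= tol
    )
    return bool(constant_steps and axis_aligned)


# Synthetic data for the example: a regular grid and a warped copy of it.
ys = np.linspace(0.5, 3.5, 4)
xs = np.linspace(0.5, 5.5, 6)
grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
grid = np.stack([grid_x, grid_y], axis=-1)

warped = grid.copy()
warped[..., 0] = warped[..., 0] ** 2  # non-constant x spacing

print(is_uniform_grid(grid), is_uniform_grid(warped))
```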