yuweihao/KERN

Some questions about the input

karenyun opened this issue · 11 comments

Hi @yuweihao , sorry, I haven't read the code yet, but while reading the paper I had some questions about the input to the graph nodes.

  1. After the detector produces the object bounding boxes, do you use only the RoI-pooled features as the input to the graph nodes? Or did you also train a feature extraction network to extract box features and then concatenate them with the RoI-pooled features (or something else)?

I am confused about this and not sure how you construct the input. Could you give me some advice? Thanks very much!

Hi @karenyun , thanks for your interest in our work. For object nodes in GGNN_obj and GGNN_rel, we use the features after ROI pooling, followed by a mapping function (which maps the dimensionality from 4096 to 512). For relationship nodes in GGNN_rel, we use features that also encode box information; please refer to this line of code:

rects_np = draw_union_boxes(pair_rois, self.pooling_size*4-1) - 0.5

This feature extraction code is from @rowanz's repo neural-motifs. Thanks to them for sharing such nice code.
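For readers following along: below is a minimal sketch of the object-node initialization described above, assuming the mapping function is a plain linear projection from the 4096-d ROI-pooled features down to 512-d. The name obj_proj is illustrative, not the repo's actual identifier.

import torch
import torch.nn as nn

# Hypothetical sketch: project 4096-d ROI-pooled features down to 512-d
# initial states for the object nodes in GGNN_obj / GGNN_rel.
obj_proj = nn.Linear(4096, 512)

roi_feats = torch.randn(10, 4096)    # 10 detected objects, 4096-d each
obj_node_init = obj_proj(roi_feats)  # (10, 512) initial object-node states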

Thanks very much!

So did you use only the 'visual feature' (appearance) of the union box, rather than a 'relationship representation' (extracted by a relationship classification network), for the relationship nodes?

As for relationships, there are two types of feature maps. One type comes from ROIAlign over the union box; the other comes from the box masks of the subject and object (the input has two channels: one channel is the subject box mask, the other is the object box mask, see https://github.com/yuweihao/KERN/blob/master/lib/get_union_boxes.py#L49 . After this sequence of convolutions https://github.com/yuweihao/KERN/blob/master/lib/get_union_boxes.py#L31-L39 , we get feature maps of the box masks whose size is the same as the ROIAlign output). After element-wise addition of these two types of feature maps, we obtain new feature maps that encode both visual information and box (position) information. Then we use a few fully connected layers, as in VGG, to map them to 4096 dimensions. Next, we use another mapping function (a fully connected layer) to map the dimensionality from 4096 to 512 to initialize the relationship nodes.
Please refer to the following (a rough sketch of the whole pipeline is given after these references):
[1] get the two types of feature maps and add them together: https://github.com/yuweihao/KERN/blob/master/lib/get_union_boxes.py#L15-L53
[2] map the new feature maps to 4096:

KERN/lib/kern_model.py

Lines 241 to 248 in 0250f39

else:
    roi_fmap = [
        Flattener(),
        load_vgg(use_dropout=False, use_relu=False, use_linear=pooling_dim == 4096, pretrained=False).classifier,
    ]
    if pooling_dim != 4096:
        roi_fmap.append(nn.Linear(4096, pooling_dim))
    self.roi_fmap = nn.Sequential(*roi_fmap)

[3] map 4096 to 512:
vr = self.rel_proj(vr)
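Putting [1]-[3] together, here is a rough, self-contained sketch of the relationship-feature pipeline. The layer names (mask_conv, vgg_fc, rel_proj) and the exact conv hyperparameters are assumptions for illustration, not the repo's exact code.

import torch
import torch.nn as nn

num_pairs, C, P = 6, 256, 7        # object pairs, channels, pooled size

# [1a] union-box appearance features from ROIAlign (stand-in tensor here)
union_feats = torch.randn(num_pairs, C, P, P)

# [1b] conv stack over the 2-channel subject/object box masks; the layers
# are chosen so the output spatial size matches the ROIAlign output
mask_conv = nn.Sequential(
    nn.Conv2d(2, C // 2, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    nn.Conv2d(C // 2, C, kernel_size=3, stride=1, padding=1), nn.ReLU(),
)
box_masks = torch.rand(num_pairs, 2, 27, 27).round()  # stand-in 0/1 masks
mask_feats = nn.functional.adaptive_max_pool2d(mask_conv(box_masks), P)

# element-wise addition fuses appearance and box (position) information
fused = union_feats + mask_feats

# [2] VGG-like fully connected head mapping the fused maps to 4096-d
vgg_fc = nn.Sequential(nn.Flatten(), nn.Linear(C * P * P, 4096), nn.ReLU())

# [3] final projection from 4096 to 512 to initialize relationship nodes
rel_proj = nn.Linear(4096, 512)
vr = rel_proj(vgg_fc(fused))       # (num_pairs, 512)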

Many thanks! Got it!

(^_^)

Hi @yuweihao , could you shed some light on the details of 'draw_union_boxes' and the main idea behind it? Please!

My understanding is that the union box mask is filled with 0/1 (or 0/255) to indicate where the objects are; do I misunderstand it?

Thanks a lot!

Hi @karenyun ,

def draw_union_boxes(bbox_pairs, pooling_size, padding=0):
For each object pair, there are two boxes. For each box, we can draw a mask map in which pixels inside the box are 1 and pixels outside are 0. Since there are two object boxes, we get two mask maps. Regarding each map as a channel, we can treat the pair of masks as an image with two channels.
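A minimal sketch of this masking idea (not the repo's exact draw_union_boxes, which also rescales boxes to the pooled grid and handles batches of pairs); draw_pair_masks is a hypothetical helper name:

import numpy as np

# For one subject/object pair, draw a 2-channel 0/1 image,
# one box mask per channel.
def draw_pair_masks(subj_box, obj_box, size):
    """subj_box/obj_box are (x1, y1, x2, y2) in [0, size) coordinates."""
    masks = np.zeros((2, size, size), dtype=np.float32)
    for ch, (x1, y1, x2, y2) in enumerate((subj_box, obj_box)):
        masks[ch, int(y1):int(y2) + 1, int(x1):int(x2) + 1] = 1.0
    return masks

masks = draw_pair_masks((2, 3, 10, 12), (8, 1, 20, 9), size=27)
print(masks.shape)  # (2, 27, 27): channel 0 = subject, channel 1 = object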

Hi @yuweihao , thanks for your quick reply!

Why do you use 0/1 for the mask values rather than 0/255? Is there any special meaning to this format?

When I use a 0/1 mask to learn the spatial relation between two parts, similar to the way you did, the several conv layers do not learn well from this input format, maybe because of the indirect ground truth.

Hi @karenyun , because of this line of code, the values actually range from -0.5 to 0.5:

rects_np = draw_union_boxes(pair_rois, self.pooling_size*4-1) - 0.5

This repo is based on the repo neural-motifs, and this part of the code comes from their repo, where the values range from -0.5 to 0.5.
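For completeness, the subtraction just recenters the 0/1 masks around zero; a tiny check:

import numpy as np

# Zero-center the 0/1 masks so the conv layers see inputs in [-0.5, 0.5].
masks = np.array([[0.0, 1.0], [1.0, 0.0]], dtype=np.float32)
centered = masks - 0.5
print(centered.min(), centered.max())  # -0.5 0.5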

Okay, many thanks!