micts/acgcn

query about code (●'◡'●)

Closed this issue · 2 comments

Hello,
Thanks for your excellent code! I'd like to ask about this statement: "Both models use I3D as a backbone (pre-trained on ImageNet + Kinetics-400) and augmented with a RoI pooling layer for action detection." Is the I3D model used as-is (pre-trained), or is it re-trained? If the pre-trained model is used, are the bounding boxes extracted in advance? Finally, if the output feature map of I3D is very small, does that affect the original bounding boxes? I am a novice and hope to get your reply!

Best,
jun0

micts commented

Hi jun0,

Is this not a pre-trained i3D model, but a re-trained one?

Both the Baseline and GCN models use the I3D backbone, which is pre-trained on ImageNet and Kinetics-400. We then further train our models (Baseline and GCN, including the backbone) on the DALY dataset.
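For concreteness, here is a minimal sketch of what "further training" looks like in PyTorch: the backbone is loaded with pre-trained weights and kept trainable, so one optimizer fine-tunes it together with the new head. The class name InceptionI3d, the weights file, and the head below are illustrative placeholders, not this repo's actual code.

```python
import torch
from i3d import InceptionI3d  # hypothetical I3D implementation, not this repo's module

# Load ImageNet + Kinetics-400 pre-trained weights (file name is illustrative).
backbone = InceptionI3d(num_classes=400)
backbone.load_state_dict(torch.load('i3d_kinetics400.pt'))

# New task head, e.g. for the 10 DALY action classes
# (832 = number of channels at I3D's Mixed_4f output).
head = torch.nn.Linear(832, 10)

# The backbone is NOT frozen: a single optimizer over backbone + head
# fine-tunes everything end to end on DALY.
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(head.parameters()),
    lr=1e-3, momentum=0.9,
)
```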

is the BBox extracted in advance?

Yes, the bounding boxes are extracted in advance. We provide the boxes along with their labels in annotated_data.pkl, which is available for download (see https://github.com/micts/acgcn#download-human-tubes-and-annotations). Although the file is called annotated_data, it actually contains both detected boxes and annotated boxes, along with their corresponding labels. The process of generating the detected boxes is described in the original DALY paper (https://arxiv.org/pdf/1605.05197.pdf); the authors shared the detected boxes with us, and they are included in annotated_data.pkl.
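If it helps, a quick way to get a feel for the file is simply to unpickle it and peek at the top level. Note that the dict layout assumed below is my guess for illustration; check the repo's README for the actual structure.

```python
import pickle

# Load the provided boxes/labels file and inspect its top-level structure.
with open('annotated_data.pkl', 'rb') as f:
    annotated_data = pickle.load(f)

print(type(annotated_data))
if isinstance(annotated_data, dict):  # assumed layout: a dict of videos/classes
    for key in list(annotated_data)[:5]:
        print(key, type(annotated_data[key]))
```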

Finally, if the output feature map of i3D is too small, will it affect the original BBox?

We apply RoI pooling on the output of the Mixed_4f layer of I3D, which is a good trade-off between retaining fine-grained spatial detail and extracting deep features. We follow prior work here; applying RoI pooling on the output of Mixed_4f is very common. The original bounding box will indeed be larger than the feature map, since it is given in image coordinates, so the common practice is to rescale the box proportionally to the size of the output feature map. In our code, the rescaling is implemented in the following lines:

```python
person_boxes_frame[:, [0, 2]] = (np.copy(person_boxes_frame[:, [0, 2]]) / W) * OW  # rescale box width to feature map size
person_boxes_frame[:, [1, 3]] = (np.copy(person_boxes_frame[:, [1, 3]]) / H) * OH  # rescale box height to feature map size
```
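To make the rescaling concrete, here is a self-contained sketch with illustrative sizes (a 224x224 input frame and a 14x14 Mixed_4f map with 832 channels); the pooling step at the end uses torchvision's roi_align rather than our exact code.

```python
import numpy as np
import torch
from torchvision.ops import roi_align

W, H = 224, 224   # input frame size (illustrative)
OW, OH = 14, 14   # Mixed_4f feature map size for a 224x224 input

# One person box in (x1, y1, x2, y2) image coordinates.
person_boxes_frame = np.array([[32.0, 48.0, 160.0, 208.0]])

# Rescale the box from image coordinates to feature map coordinates.
person_boxes_frame[:, [0, 2]] = (np.copy(person_boxes_frame[:, [0, 2]]) / W) * OW
person_boxes_frame[:, [1, 3]] = (np.copy(person_boxes_frame[:, [1, 3]]) / H) * OH
print(person_boxes_frame)  # [[ 2.  3. 10. 13.]]

# Pool the rescaled box from a Mixed_4f-like feature map;
# torchvision expects boxes as [batch_index, x1, y1, x2, y2].
features = torch.randn(1, 832, OH, OW)
boxes = torch.cat(
    [torch.zeros(1, 1), torch.from_numpy(person_boxes_frame).float()], dim=1
)
pooled = roi_align(features, boxes, output_size=(7, 7))
print(pooled.shape)  # torch.Size([1, 832, 7, 7])
```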

Hope I have answered your questions. Let me know if something is not clear and/or if you have any other questions.

Hi (●'◡'●),
No questions at the moment. Thank you very much for your answer; you are a really responsible and kind person! o( ̄▽ ̄)ブ

Best,
jun0