Gorilla-Lab-SCUT/frustum-convnet

Refinement stage query

Closed this issue · 5 comments

Hi,
Awesome work! Thanks for sharing it.
I had a couple of questions regarding the refinement stage:

  1. In the paper you mention that you take the output of the 1st F-ConvNet and feed it (after some adjustments) into a 2nd F-ConvNet. Do I understand this correctly? So if we were to use the refined network, it would in fact be two F-ConvNets joined end-to-end?
  2. When training the refined network, does the resulting model_best.pth contain BOTH of these F-ConvNets, or does it only contain the 2nd F-ConvNet that uses the results of the 1st one?
  3. In the case of training the model for cars, pedestrians and cyclists, the number of classes for F-ConvNet would still remain 2, right (object, background)? Is the classification then taken from the 2D object detection model?

Cheers!

All you said is right.

  1. Yes. We use the predictions of the 1st F-ConvNet as the input to our 2nd F-ConvNet during evaluation. The command `python kitti/prepare_data_refine.py --car_only --gen_val_rgb_detection` saves the predictions of the first stage. The whole pipeline includes two F-ConvNet networks.
  2. It only contains the parameters of the 2nd F-ConvNet.
  3. Yes. We encode the class prediction from the 2D detection as a one-hot label and use it as an intermediate feature in our network, so we only need to distinguish foreground from background (see the sketch after this list). In fact, we found in our experiments that the RGB features are more discriminative than the point clouds.
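To make item 3 concrete, here is a minimal sketch (not the repo's actual code; all names and shapes are illustrative) of how a 2D detector's class prediction could be encoded as a one-hot vector and appended to every point's features in a frustum:

```python
import torch

NUM_2D_CLASSES = 3  # e.g. Car, Pedestrian, Cyclist on KITTI

def append_class_onehot(point_features: torch.Tensor, class_id: int) -> torch.Tensor:
    """point_features: (N, C) per-point features for one frustum."""
    one_hot = torch.zeros(NUM_2D_CLASSES)
    one_hot[class_id] = 1.0
    # Broadcast the same one-hot vector to every point in the frustum.
    one_hot = one_hot.unsqueeze(0).expand(point_features.size(0), -1)
    # The network itself only separates foreground from background; the
    # object category comes from the 2D detector via this appended feature.
    return torch.cat([point_features, one_hot], dim=1)  # (N, C + NUM_2D_CLASSES)
```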

Thanks for the quick reply back 👍

Hi!
Thanks for sharing this information. @sharma-n @zhixinwang

I have some further questions about this issue.
If the classification is taken directly from the 2D detector, doesn't this mean that frustum-convnet does not train category classification at all? So this architecture only trains localization?

For training, do you use the ground truth 2D bounding boxes and ground truth classification to get the frustums as input?

From what I understand, FConvNet is responsible for the 3D box regression. The two classification classes, background/foreground, allow us to get a kind of "confidence score" from FConvNet.
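As a rough illustration of that idea, a softmax over the two foreground/background logits yields a per-proposal score; this is a hypothetical sketch, not the repo's scoring code:

```python
import torch
import torch.nn.functional as F

def box_confidence(fg_bg_logits: torch.Tensor) -> torch.Tensor:
    """fg_bg_logits: (B, 2) logits per proposal; index 1 = foreground."""
    probs = F.softmax(fg_bg_logits, dim=1)
    # The foreground probability doubles as the 3D box confidence score.
    return probs[:, 1]
```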

You could also potentially train 3 separate FConvNets, one per class (for slightly higher accuracy?). This would require you to route each frustum to the right network at inference time based on the 2D detection class, but you'd be using 3 times the GPU memory to keep the 3 networks loaded.
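A hedged sketch of that dispatch logic, assuming three already-trained networks (all names here are hypothetical):

```python
import torch.nn as nn

class PerClassFConvNet(nn.Module):
    """Routes each frustum to the network matching its 2D detection label."""

    def __init__(self, nets_by_class: dict):
        super().__init__()
        # e.g. {"Car": car_net, "Pedestrian": ped_net, "Cyclist": cyc_net};
        # keeping all three loaded roughly triples GPU memory.
        self.nets_by_class = nn.ModuleDict(nets_by_class)

    def forward(self, frustum_points, detected_class: str):
        return self.nets_by_class[detected_class](frustum_points)
```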

And yes, to train FConvNet you'll need 2D detection boxes and classifications. Once you have those, the code provided by @zhixinwang automatically computes the frustums. @zhixinwang also already provides those RGB detections for the KITTI dataset (thanks!), stored in `frustum-convnet/kitti/rgb_detections/rgb_detections_train.txt`.
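For reference, a minimal parser for such a detection file, assuming each line holds an image identifier, an integer class id, a confidence, and the 2D box corners (this format is an assumption; check kitti/prepare_data.py in the repo for the authoritative layout):

```python
def load_rgb_detections(path: str):
    """Assumed line format: <image> <class_id> <score> <xmin> <ymin> <xmax> <ymax>."""
    detections = []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            detections.append({
                "image": tokens[0],
                "class_id": int(tokens[1]),
                "score": float(tokens[2]),
                "box2d": [float(v) for v in tokens[3:7]],
            })
    return detections
```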

@sharma-n Thanks for your quick response and answer!!