DaLi-Jack/SSR-code

2D & 3D object detection module

Closed this issue · 4 comments

Hi, it's great work! I would like to ask about the implementation of the 2D and 3D object detection networks. Do you use the same ODN module as in Total3D (or InstPIFu), i.e., Faster R-CNN as the 2D detector and Total3D's ODN as the 3D detector? And do you train the object detectors separately, or train the detectors and reconstruction module end-to-end? If you train the object detectors separately, would the performance decrease if they were trained together?

Thank you very much! Looking forward to your kind reply.

Hi! This is a good question.
We follow InstPIFu and use the Im3D detector (Im3D in turn followed Total3D's ODN detector). We train the 3D detector and the 3D reconstruction model separately, but joint training is worth a try: it may make convergence harder, but it could potentially improve performance.
Furthermore, I'd like to share our experience with the choice of 3D detector. We first tried Omni3D as the 3D detector (Faster R-CNN is built into Omni3D as its 2D detector). The officially provided checkpoints were trained on large-scale datasets and achieve impressive results in zero-shot single-view 3D detection. However, the Omni3D detections may not be accurate enough in object pose, because Omni3D uses a chamfer-type loss to penalize the 3D bounding-box corners and object rotation, and accurate object poses are important for reconstruction. We tried changing the 3D loss and fine-tuning Omni3D on FRONT3D data, but the predicted object poses were still not good enough. So we finally chose the ODN detector: while its generalization to out-of-domain datasets may not be as good as Omni3D's, it gives significantly better object-pose predictions on the in-domain dataset.
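To illustrate why a chamfer-type corner loss can hurt pose accuracy, here is a minimal NumPy sketch (my own simplification, not Omni3D's actual code): the loss matches nearest corners, so it is blind to corner permutations, and a box with a square footprint rotated by 90 degrees incurs almost no loss despite a completely wrong yaw.

```python
import numpy as np

def box_corners(center, size, yaw):
    """Return the 8 corners of a 3D box rotated by `yaw` about the up (y) axis."""
    l, h, w = size
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * (l / 2.0)
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * (h / 2.0)
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * (w / 2.0)
    corners = np.stack([x, y, z], axis=1)             # (8, 3) in the box frame
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # rotation about y
    return corners @ R.T + np.asarray(center, dtype=float)

def chamfer_corner_loss(pred, gt):
    """Symmetric chamfer distance between two 8-corner sets.
    Each corner is matched to its nearest neighbor, so the loss is
    invariant to permutations of the corner ordering."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (8, 8)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# A square-footprint box (l == w) rotated by 90 degrees maps its corner set
# onto itself, so the chamfer loss is ~0 even though the yaw is wrong.
gt = box_corners((0, 0, 0), (2.0, 1.0, 2.0), 0.0)
pred = box_corners((0, 0, 0), (2.0, 1.0, 2.0), np.pi / 2)
print(chamfer_corner_loss(pred, gt))  # ~0.0 despite a 90-degree yaw error
```

A direct loss on the yaw angle (or a disentangled per-parameter loss) does not have this ambiguity, which is one reason the ODN-style detector gave us better poses in-domain.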


Hi @DaLi-Jack , thank you very much for your quick reply! If I understand correctly, you first train Faster R-CNN on the 2D bounding-box annotations and use it to predict 2D bounding boxes. Then you store the predicted 2D boxes and use them to train the 3D object detection model (since we already have the predicted 2D boxes, we only need to predict the depth and rotation).
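To make sure I'm reading it right, here is a sketch of the two-stage pipeline I have in mind (all names here are mine for illustration, not the repo's actual API):

```python
def detect_2d(image):
    """Stand-in for a trained Faster R-CNN: scored 2D boxes for one image."""
    return [{"bbox": (10, 20, 110, 220), "score": 0.9}]

def build_detection_cache(images):
    """Stage 1: run the frozen 2D detector once and store its predictions."""
    return {name: detect_2d(img) for name, img in images.items()}

def targets_for_3d_head(cache, name):
    """Stage 2: the 3D head takes the cached boxes as fixed proposals and
    only regresses the remaining 3D quantities (e.g. depth and rotation)."""
    return [{"bbox": d["bbox"], "regress": ("depth", "rotation")}
            for d in cache[name]]

cache = build_detection_cache({"scene_0001": None})
print(targets_for_3d_head(cache, "scene_0001"))
```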

I also have a question about the ODN training details: if we predict the 2D bounding boxes with Faster R-CNN, how do we guarantee that each predicted 2D bounding box has a corresponding 3D bounding-box annotation? For example, if Faster R-CNN predicts a desk in the corner but the 3D annotations have none there, or if Faster R-CNN predicts a wrong position, how are such cases handled?

Btw, do you have quantitative results for the 2D & 3D object detection tasks and the 3D reconstruction task on SUNRGBD? I am a little confused about the object detection part. Looking forward to your kind reply!

Hi! I'm sorry, but I'm not deeply involved in the field of 3D object detection; I mainly followed previous work for the 3D detector without adding anything new, so I did not produce quantitative results for object detection.
As far as I know, a 3D object detector generally uses a pre-trained Faster R-CNN module for 2D detection; for each proposed 2D bbox (selected after NMS), a 3D head predicts the depth and rotation (or other values), and the corresponding GT 3D bbox is obtained by a matching algorithm. So the desk in your question may either be matched to a wrong GT 3D bbox or be removed due to low confidence when the proposed 2D bboxes are selected.
The above is just my understanding; for more details you can read the papers Disentangling Monocular 3D Object Detection and Omni3D. I hope my answer helps.
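For concreteness, the matching step I described usually looks like this greedy IoU assignment (a generic sketch from my understanding, not the exact code of any of these detectors):

```python
import numpy as np

def iou_2d(a, b):
    """IoU of two axis-aligned 2D boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_proposals_to_gt(proposals, gt_boxes, iou_thresh=0.5):
    """Assign each proposed 2D box to the GT 2D box with the highest IoU.
    The matched GT index then selects the GT 3D box used as the regression
    target. Proposals below the threshold (e.g. a spurious 'desk') get no
    target and are simply dropped from the 3D-head loss."""
    matches = []
    for i, p in enumerate(proposals):
        ious = [iou_2d(p, g) for g in gt_boxes]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= iou_thresh:
            matches.append((i, j))
    return matches

proposals = [(0, 0, 10, 10), (100, 100, 110, 110)]  # second box is spurious
gt_boxes = [(1, 1, 11, 11)]
print(match_proposals_to_gt(proposals, gt_boxes))  # [(0, 0)]
```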

Okay, thank you very much for your help! I will go through the Total3D code and the Omni3D paper to understand the ODN part in detail.