THU-DA-6D-Pose-Group/CATRE

Question about the paper about not using CAD models during training

Closed this issue · 6 comments

Thanks for sharing great work along with the code.
I have a simple question about the paper. It emphasizes that CATRE does not use CAD models during training or inference and instead relies on shape priors from SPD. As far as I understand, the SPD shape priors are generated by applying Farthest Point Sampling (FPS) to exact CAD models. I agree that exact CAD models are not used during inference, but I am not convinced the same holds for training. Can you explain why training is also mentioned? Have I missed something?

To solve this dilemma, we propose a novel method for CATegory-level object pose REfinement (CATRE), leveraging the abstract shape prior information instead of exact CAD models.

First, it relies on CAD models to provide supervision signals (i.e., NOCS map) for training, while CATRE does not need exact
CAD models during training or inference.

https://github.com/mentian/object-deformnet/blob/a2dcdb87dd88912c6b51b0f693443212fde5696e/preprocess/shape_data.py#L33

Hi.

A mean shape is reconstructed from the mean latent embedding for each category, which is trained on a large number of synthetic point cloud models from ShapeNet using an encoder-decoder framework. (Sec. 3.2)
So only the models from ShapeNet are used rather than the exact CAD models of the training/testing set.

Moreover, for CATRE, you can use 3D bounding boxes or axes as the shape prior. These do not rely on any models.
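To make the bounding-box case concrete, here is a minimal sketch of a model-free prior built only from per-category box dimensions. The function name `box_corners` and the choice of the 8 corners as prior points are assumptions for illustration, not the construction used in the CATRE code.

```python
import itertools

import numpy as np


def box_corners(size):
    """Return the 8 corners of an axis-aligned box centered at the origin.

    `size` is (x_extent, y_extent, z_extent). Hypothetical helper: it only
    illustrates that such a prior needs box dimensions, not a CAD model.
    """
    size = np.asarray(size, dtype=np.float64)
    # Every combination of -0.5 / +0.5 along the three axes gives one corner.
    signs = np.array(list(itertools.product([-0.5, 0.5], repeat=3)))
    return signs * size


corners = box_corners([2.0, 4.0, 6.0])  # array of shape (8, 3)
```

The corners depend only on three extents, so no mesh or point cloud of any instance is involved.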

(1) As you know, the CAMERA25 dataset is built from ShapeNet, as stated in the original NOCS paper:

we introduce a spatially context-aware mixed reality method to automatically generate large amounts of data (275K training, 25K testing) composed of realistic-looking synthetic objects from ShapeNetCore [8] composited with real tabletop scenes.

(2) The SPD paper states that the mean shapes are trained on the instances of the CAMERA training set:

We collect all the instances in the CAMERA training dataset to train the autoencoder. Shape priors are learned from this collection and used in all experiments.

(3) From my understanding, the points used to train the mean shapes come from ShapeNet CAD models. SPD applies farthest point sampling (FPS) to the ShapeNet CAD models and trains the encoder-decoder framework on the sampled points.
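For readers unfamiliar with FPS, a minimal NumPy sketch of the generic algorithm follows. It is not the SPD preprocessing code; the function name, the `start_index` parameter, and the toy points are assumptions for illustration.

```python
import numpy as np


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick `num_samples` indices so that each new point is the one
    farthest from all points selected so far (generic FPS, not SPD's code)."""
    points = np.asarray(points, dtype=np.float64)
    n = points.shape[0]
    if not 0 < num_samples <= n:
        raise ValueError("num_samples must be in [1, number of points]")

    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = start_index
    # Squared distance from every point to its nearest selected point.
    min_sq_dist = np.sum((points - points[start_index]) ** 2, axis=1)

    for i in range(1, num_samples):
        next_index = int(np.argmax(min_sq_dist))
        selected[i] = next_index
        sq_dist = np.sum((points - points[next_index]) ** 2, axis=1)
        min_sq_dist = np.minimum(min_sq_dist, sq_dist)
    return selected


# Toy "model surface": four points on a line.
toy_points = np.array([[0, 0, 0], [1, 0, 0], [0.1, 0, 0], [10, 0, 0]])
subset = toy_points[farthest_point_sampling(toy_points, 3)]
```

The sampled subset is a sparse copy of the input surface, which is why the question of whether it still counts as "using the CAD model" comes up.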

I agree that 3D bounding boxes or axes do not involve CAD models during training, but your final proposed CATRE model uses the mean shape. Given observations (1)-(3), I wonder whether you can still claim that CAD models are not used, in contrast to NOCS. Or do you train the mean shape on a different dataset than SPD does? What do you think?

Thanks for the discussion. I think the description "does not need exact CAD models" might mislead some readers. As you said, the shape prior from SPD was trained on points obtained by farthest point sampling (FPS) from the ShapeNet CAD models. In my opinion, FPS-sampled models are not "exact" models. In theory, shape priors can also be obtained from non-exact models, such as inaccurately reconstructed models (this applies to training SPD as well). In the extreme case, they can be 3D bboxes or axes. Moreover, we do not explicitly rely on any "exact" CAD models when training CATRE. So I think the argument in the paper is still valid.

I agree with Gu's opinion.
Actually, the "mean shape" is a non-exact model shared within a category, so it can be extracted from other instances and still work for the training/testing instances.
As evidence, the main experiments in the paper are conducted on the REAL275 dataset rather than the CAMERA25 dataset, yet CATRE works on both datasets.

I really appreciate your replies, @wangg12 and @shanice-l.
I have some follow-up questions, and I apologize for asking so many while trying to understand your method. I think my confusion comes from not clearly understanding what "exact CAD models" means.

Q1. If FPS-sampled models are not exact models, how does your method differ from existing FPS-based methods (e.g., SPD) in terms of not using exact models? Does your method introduce anything new for avoiding exact models, or does the advantage of not using a CAD model come from SPD?

Q2. The NOCS map is generated from rendered points, so the CAD model is not used explicitly as a training signal. It could also be seen as not using an exact model. What do you think? What is your definition of using exact CAD models?

Q3. To me, whether a CAD model is used at all seems more important than whether an "exact" CAD model is used. What do you think?

I think I have answered the question in the title. If you have any further questions, please open a new issue :)