chenhsuanlin/signed-distance-SRN

Pascal Evaluation

stalkerrush opened this issue · 12 comments

Hi @chenhsuanlin, thanks for the great work! I have a few questions about your Pascal evaluation. I noticed you mention in the paper that you apply ICP for the Pascal point-cloud comparison but not for ShapeNet. What is the specific reason for using ICP on Pascal? Also, are the segmentation masks you use the unoccluded versions obtained from mesh projection?
Thanks!

You're correct about the segmentation masks -- we treat them as the input silhouettes in this case.
Since the actual data for training SDF-SRN is cropped from the raw image data with the provided 2D bounding boxes, the resulting 3D shape prediction (aligned with the 2D masks) would (most likely) not live in the same canonical space as the ground-truth CAD model; in other words, there would be an unknown (but rather small) translational offset between ground truth and prediction. For this reason, we run ICP to align them together for evaluation. We don't need ICP for evaluating ShapeNet because we already know the images are rendered from the CAD models in the same canonical space.
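For illustration, here is a minimal sketch of the align-then-evaluate idea described above: a basic point-to-point ICP that recovers a small residual rigid transform (mostly translation) before the shapes are compared. This is not the repository's evaluation code; the function name and iteration count are assumptions for this example.

```python
# Minimal point-to-point ICP sketch (illustration only, not the repo's evaluation code).
# Assumes both point clouds are roughly pre-centered, so only a small residual
# rigid transform (mostly translation) needs to be recovered.
import numpy as np
from scipy.spatial import cKDTree

def icp_align(pred, gt, num_iters=50):
    """Align `pred` (Nx3) to `gt` (Mx3) with rigid point-to-point ICP."""
    src = pred.copy()
    tree = cKDTree(gt)
    for _ in range(num_iters):
        # 1. find the closest ground-truth point for every predicted point
        _, idx = tree.query(src)
        tgt = gt[idx]
        # 2. solve for the best rigid transform (Kabsch / Procrustes)
        src_mean, tgt_mean = src.mean(0), tgt.mean(0)
        H = (src - src_mean).T @ (tgt - tgt_mean)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:  # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = tgt_mean - R @ src_mean
        # 3. apply the update
        src = src @ R.T + t
    return src
```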

Hope this helps!

Thanks for the prompt reply!
I'm still a little confused, though. When you say cropping, does that also include resizing? If it were only cropping, couldn't you recover the canonical shape with a translation in shape space? Resizing does make sense to me as a source of mismatch, but I still feel that could be undone by resampling the points for marching cubes. Please let me know if I'm missing something here :)
Another thing I'm wondering about: you mention in the paper that a weak-perspective camera model is used. Is the scale already annotated in the dataset, or do you learn it instead?

Sorry, I hadn't looked into the code carefully before. I realize now that you are using an orthographic camera where the object size depends on the input. In this case, I guess there will be a size mismatch between the Pascal annotation and the predicted shape, since ICP doesn't handle scaling (please correct me if I'm wrong). Could you share some thoughts on how this is handled for evaluation? I also see that you do something specific for car cropping; not sure if that is relevant. (Please ignore my previous comment, since I didn't take the size issue into consideration, which seems to make perfect alignment impossible anyway.)

Sorry for the late reply. Since we don't know (just from the images) where the actual 3D centers of the objects are defined, we can only heuristically normalize them by centering them with the bounding box. There will still be some translational misalignment, and that's why we run ICP when comparing the predicted/GT shapes.

For scale, a weak-perspective camera model can be thought of as an orthographic camera plus scale handling. So yes, we do use the camera-to-object distances, but only for evaluation purposes -- they are not given during the training process. You can find more details in Section B.2 of the supplementary document.
The car cropping part isn't too relevant here; it's just a different way of rescaling the objects (assuming they have a common height) before centering them in the image.
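As an aside, the orthographic-plus-scale view can be written compactly. The sketch below is an illustration with assumed variable names, not the paper's notation:

```python
# Weak-perspective projection: orthographic projection followed by a single
# global scale that depends on the camera-to-object distance (illustration only).
import numpy as np

def weak_perspective_project(points_3d, focal, cam_dist):
    """Project Nx3 camera-frame points with a weak-perspective model."""
    scale = focal / cam_dist          # one scale shared by the whole object
    return scale * points_3d[:, :2]   # drop depth (orthographic), then scale
```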

Hope this helps!

Thanks for the reply; it looks like I misunderstood the train/eval pipeline of your method. Based on what you said and the supplementary document, I guess that for training you use an orthographic camera without scaling, but because the input/mask has been normalized with the bounding box, you could roughly say that the reconstructed shape is canonical in [-1,1]^3. During evaluation, the GT shape from Pascal3D is the fixed-scale canonical shape (I don't know the range though; based on 2/64, is it [-32,32]?), and thus so are the sampled point clouds. Then for every sample you rescale the point cloud with cam_dist/f of that sample, so it lies in [-1,1]^3 as well. To better align the predicted and GT points you use ICP. Is all of this correct?

If so, I assume the success of the evaluation relies on the objects being sphere-like (i.e., leading to similarly sized 2D bounding boxes from different viewing angles during training). Then for cars, height*3 is the heuristic you found to lead to a roughly [-1,1]^3 shape (by hypothesizing that the GT car CADs all have similar heights around 0.67), right?

Yes, this all looks reasonable to me! Indeed the underlying assumption is that the objects are "sphere-like" and are observed from different viewpoints (looking into the objects) across the dataset. Since we don't have a principled way of rescaling with respect to the actual 3D ground truth during training, we do need to rely on some sort of heuristics for 2D rescaling.
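To make the normalization discussed above concrete, here is a rough sketch of the evaluation-time rescaling described in this thread (scale the GT points by cam_dist/f and center both shapes heuristically, leaving residual misalignment to ICP). The function name is an assumption for illustration, not the repository's actual code.

```python
# Illustrative sketch of the evaluation-time normalization discussed above
# (not the repository's actual code; names are assumptions).
import numpy as np

def normalize_for_eval(pred_points, gt_points, cam_dist, focal):
    """Bring GT points into the same normalized space as the prediction.

    The GT shape is rescaled by cam_dist / focal so it roughly lives in
    [-1, 1]^3, and both clouds are centered heuristically since the true
    3D object center is unknown; residual misalignment is then left to ICP.
    """
    gt_norm = gt_points * (cam_dist / focal)
    gt_norm = gt_norm - gt_norm.mean(axis=0)
    pred_norm = pred_points - pred_points.mean(axis=0)
    return pred_norm, gt_norm
```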

Got it this makes sense, thanks!

Hi @chenhsuanlin, another quick question: I couldn't seem to find the definition of the Chamfer distance you used in your paper. Could you please provide the detailed formula? The Chamfer distance people use often differs in the details, so I want to understand your numbers better... Thank you!

Hi @stalkerrush, here's the definition of the "uncombined" (unidirectional) Chamfer distance I used for evaluation:

$$ d(\mathcal{S}_1, \mathcal{S}_2) = \frac{1}{|\mathcal{S}_1|} \sum_{x \in \mathcal{S}_1} \min_{y \in \mathcal{S}_2} \| x - y \|_2 $$

where S1 = prediction and S2 = ground truth for shape accuracy, and vice versa for surface coverage (completeness). Note that this metric is not squared (as opposed to when it is used as a loss function for optimization), since it measures the average 3D Euclidean distance.
Please also see Sec 1.4 of the supplementary document of DVR, which has a very nice description of the metric, as well as discussions on their definition (autonomousvision/differentiable_volumetric_rendering#10).
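For concreteness, here is a minimal sketch of the metric as described above (unidirectional, not squared, averaging 3D Euclidean distances). This is an illustration, not the repository's evaluation code; the function name is an assumption.

```python
# Unidirectional ("uncombined") Chamfer distance sketch: for each point in S1,
# find the nearest point in S2 and average the (unsquared) Euclidean distances.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_unidirectional(S1, S2):
    """Average nearest-neighbor Euclidean distance from S1 (Nx3) to S2 (Mx3)."""
    dists, _ = cKDTree(S2).query(S1)  # unsquared L2 distances
    return dists.mean()

# accuracy:     S1 = predicted points, S2 = ground-truth points
# completeness: S1 = ground-truth points, S2 = predicted points
```

Per the follow-up below, the reported numbers also apply the same 10x multiplication as DVR and sample 30000 ground-truth points.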

Thanks for the reply, it makes sense now. I saw you also apply the 10x multiplication, so the only difference from DVR is that you sample 30000 points instead of 100000 on the GT, right?

Yes that's right 🙂

Got it thanks!