eigenvivek/DiffPose

Question about the mTRE

Pragyanstha opened this issue · 5 comments

Hello @eigenvivek, thanks for sharing the code for your awesome work!
I have a query regarding the calculation of the mean Target Registration Error (mTRE) as discussed in your arXiv paper compared to its implementation in the code.

In the paper, the mTRE is described as the mean error of projected points on the detector, which suggests a measurement directly in the detector plane. However, in the code, it appears that these projected points are backprojected into DeepFluoro's coordinate system prior to calculating the mean error, aligning the metric with anatomical coordinates.

def project(self, pose):
    extrinsic = convert_diffdrr_to_deepfluoro(self.specimen, pose)
    x = perspective_projection(extrinsic, self.intrinsic, self.fiducials)
    x = -self.specimen.focal_len * torch.einsum(
        "ij, bnj -> bni",
        self.intrinsic_inv,
        pad(x, (0, 1), value=1),  # Convert to homogenous coordinates
    )
    extrinsic = (
        self.flip_xz.inverse().compose(self.translate.inverse()).compose(pose)
    )
    return extrinsic.transform_points(x)
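For reference, here is a minimal, hypothetical sketch (not the DiffPose API) of how conventional RPE is usually computed directly in the detector plane, next to a simplified stand-in for the lifting step quoted above; the tensor shapes, pixel spacing constant, and function names are assumptions for illustration only.

import torch
from torch.nn.functional import pad

PIXEL_SPACING = 0.194  # mm / pixel, isotropic (DeepFluoro detector)

def rpe_mm(px_est, px_true):
    # Conventional RPE: mean 2D distance between estimated and ground-truth
    # fiducial projections in the detector plane, converted from pixels to mm.
    # px_est, px_true: (B, N, 2) fiducial locations in pixels.
    return (px_est - px_true).norm(dim=-1).mean(dim=-1) * PIXEL_SPACING

def lifted_error(px_est, px_true, intrinsic_inv, focal_len):
    # Simplified stand-in for the lifting step quoted above: pixel coordinates
    # are mapped through the inverse intrinsic and scaled by the focal length
    # (in mm) before the mean distance is taken; the subsequent rigid
    # world-frame transform is omitted here.
    lift = lambda px: -focal_len * torch.einsum(
        "ij, bnj -> bni", intrinsic_inv, pad(px, (0, 1), value=1.0)
    )
    return (lift(px_est) - lift(px_true)).norm(dim=-1).mean(dim=-1)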

Additionally, within the Evaluator.__call__ method, the reprojection error calculated in this manner is subsequently multiplied by a factor of 0.194 mm/px.

registration_error *= 0.194 # Pixel spacing is 0.194 mm / pixel isotropic

Could you please clarify whether this conversion is intended to standardize the error measurement to physical units for clinical relevance, or whether there might be an oversight in this part of the implementation?

Hi @Pragyanstha, thank you for your kind words and the interesting question! I thought your SCR approach was very interesting and enjoyed stopping by your poster at MICCAI last year.

After reading a lot more 2D/3D registration papers since the publication of DiffPose, I agree that the evaluation metric we use is sort of a hybrid between reprojection error (RPE) and what other authors traditionally refer to as mTRE. Originally, I was evaluating the registrations using RPE (projecting fiducials into the image plane and then converting from pixels to spatial units by multiplying by the pixel spacing), but I found that this wasn't sensitive to large misestimations of the camera's depth. Therefore, I backprojected the 2D fiducials to effectively measure discrepancies between the estimated and ground-truth image planes, and since this is just a series of matrix multiplications, the scalar multiplication by the pixel spacing commutes, so I applied it at the end of the calculation. However, after reading many more works, I agree that this is not how the field traditionally measures camera pose error, so in future works I plan to instead report RPE, traditional 3D mTRE, and perhaps even geodesic distances between camera poses (i.e., the loss terms used for training the pose regression network).

Also, the estimated poses produced by all methods in the DiffPose paper were evaluated using the same Evaluator class that you linked above, so the quantitative accuracies of the different registration algorithms in the paper are comparable.
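As a point of reference for the metrics mentioned above, here is a minimal sketch (assumed shapes and names, not the DiffPose API) of traditional 3D mTRE and of the geodesic rotation distance between camera poses.

import torch

def mtre_3d(landmarks, T_est, T_true):
    # Traditional 3D mTRE: map anatomical landmarks (N, 3), in mm, through the
    # estimated and ground-truth 4x4 rigid transforms and take the mean 3D distance.
    homog = torch.cat([landmarks, torch.ones(len(landmarks), 1)], dim=-1)
    pts_est = (homog @ T_est.T)[:, :3]
    pts_true = (homog @ T_true.T)[:, :3]
    return (pts_est - pts_true).norm(dim=-1).mean()

def geodesic_rotation_distance(R_est, R_true):
    # Geodesic distance on SO(3): the angle (in radians) of the relative rotation.
    R_rel = R_est.T @ R_true
    cos = ((torch.trace(R_rel) - 1) / 2).clamp(-1.0, 1.0)
    return torch.acos(cos)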

@eigenvivek Thanks a lot for getting back to me, I really appreciate the detailed explanation you provided. It's too bad we couldn't meet in person last year at MICCAI (maybe this year?).
I totally agree with your view that RPE is less sensitive to errors in the estimated depth, and I understand the shift from RPE to a more hybrid approach. However, I'm still a bit puzzled about the need to multiply by the pixel spacing after scaling by the focal length, given that the focal length is already defined in millimeters.

x = -self.specimen.focal_len * torch.einsum(
    "ij, bnj -> bni",
    self.intrinsic_inv,
    pad(x, (0, 1), value=1),  # Convert to homogenous coordinates
)

Since the intrinsic matrix is already defined in pixels, its inverse essentially gives a unitless ratio, i.e., the direction, in the normalized coordinate system, of the 3D point that projects to a given 2D pixel. Multiplying by the focal length, which is defined in mm, then brings us back to a world coordinate system defined in mm (please correct me if this is wrong). So could the multiplication by the pixel spacing after this step be introducing an unconventional unit (mm^2/px) that might complicate comparison with other existing results in this field? Any clarification on this would be greatly appreciated.
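To make the unit argument concrete, here is a small numeric check under the assumption that the intrinsic matrix stores its focal length in pixels; all specific values below are hypothetical and chosen only for illustration.

import torch

pixel_spacing = 0.194                       # mm / pixel (DeepFluoro value)
focal_len_mm = 1020.0                       # hypothetical source-to-detector distance in mm
focal_len_px = focal_len_mm / pixel_spacing
cx = cy = 716.0                             # hypothetical principal point in pixels

K = torch.tensor([
    [focal_len_px, 0.0, cx],
    [0.0, focal_len_px, cy],
    [0.0, 0.0, 1.0],
])

# Two detector points exactly one pixel apart.
p0 = torch.tensor([100.0, 200.0, 1.0])
p1 = torch.tensor([101.0, 200.0, 1.0])

# K^{-1} gives a unitless normalized direction; scaling by the focal length
# in mm already places the point on the image plane in mm.
x0 = focal_len_mm * (K.inverse() @ p0)
x1 = focal_len_mm * (K.inverse() @ p1)

print((x1 - x0).norm())  # ~0.194, i.e., one pixel spacing in mm -- no extra factor needed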

I have the same question as @Pragyanstha. I think there is no need to multiply by the physical pixel size.

Hi @Pragyanstha, thank you for the detailed explanation. I’ve gone back and worked through all the camera projection formulas & units very carefully, and I agree with you that post-multiplying by the pixel spacing is erroneous. Thank you very very much for spotting this mistake in the implementation! I am going to fix this bug, recompute the registration errors, and post an updated version on arXiv. May I please acknowledge you for finding the mistake in the updated arXiv version?

Importantly, this fix will not change the rankings of the methods compared in the paper since all methods were evaluated using the same Evaluator class. For all methods, the mTRE values just need to be multiplied by the same scalar.

I’m most likely not going to attend MICCAI this year but I hope that we get the opportunity to meet in person some time and discuss other research problems!

@eigenvivek Thanks for the update! Apologies for making you go through the trouble, I'm glad to hear that my comments were helpful, and of course, I'd be honored to be acknowledged in your updated arXiv paper.