facebookresearch/PoseDiffusion

how to convert the pose in the real world to ndc coordinate?

wendyliuyejia opened this issue · 6 comments

Thank you so much for releasing the wonderful code.
I'm very interested in the diffusion part, and here is my confusion. Given some images with GT poses in the real world and some other images with predicted poses, how can I use the diffusion model to refine the predicted poses? I suppose I should replace the initialization with the GT poses and the predicted poses. Of course, I may need to convert the GT poses to NDC coordinates before the diffusion part. So how do I convert a pose in the real world to NDC coordinates?

jytime commented

Hi,

There are two things to be careful about if you want to feed additional GT poses to the model: (1) change your GT poses to the coordinate convention "+X points left, +Y points up, and +Z points out from the image plane" (as discussed here); (2) normalize the translation vectors (as done by the code of the dev branch here, L107-L110) and transform the cameras so that the first camera is the pivot one (here).
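The two steps above can be sketched roughly as follows. This is a minimal NumPy sketch, not the repository's code. It assumes the source poses are world-to-camera extrinsics in the OpenCV/COLMAP convention (`x_cam = R @ X + t`, with +X right, +Y down, +Z forward) and that the target is the PyTorch3D row-vector convention (`x_cam = X @ R + T`). The function names are hypothetical, and dividing by the mean translation norm is only one plausible normalization; the linked dev-branch code may use a different one.

```python
import numpy as np


def opencv_to_pytorch3d(R_cv, t_cv):
    """Convert (N, 3, 3) / (N, 3) OpenCV-style extrinsics to PyTorch3D style.

    Assumption: inputs follow x_cam = R_cv @ X + t_cv.
    Output follows x_cam = X @ R + T with +X left, +Y up, +Z out of the image plane.
    """
    R = np.transpose(R_cv, (0, 2, 1)).copy()  # column-vector -> row-vector form
    T = t_cv.copy()
    R[:, :, :2] *= -1  # flip the camera X and Y axes
    T[:, :2] *= -1
    return R, T


def first_camera_as_pivot(R, T):
    """Re-express all cameras so that camera 0 has R = I and T = 0."""
    R0_inv = R[0].T
    R_new = np.einsum("ij,njk->nik", R0_inv, R)      # R_0^T @ R_i
    T_new = T - np.einsum("j,njk->nk", T[0], R_new)  # T_i - T_0 @ R_i'
    return R_new, T_new


def normalize_translations(T, eps=1e-8):
    """Scale translations to unit mean norm (an assumed normalization)."""
    scale = max(float(np.linalg.norm(T, axis=1).mean()), eps)
    return T / scale, scale
```

A typical order would be: convert the convention, make the first camera the pivot, then normalize the translations. Keep the returned `scale` if you need to map refined poses back to metric units.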

By the way, although we have not tried similar applications before, if you would like to feed given camera poses to the model, I personally think it would be better to provide them at sampling steps T between 30 and 10. If the camera poses are provided too early in the process (e.g., T=100), the model will treat them as pure noise, while feeding them to the model too late (e.g., T=0) will likely yield negligible refinement.
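One way to read the suggestion above is as a sampling loop that overwrites the known cameras only inside a late window of steps. The sketch below is a toy illustration under several assumptions: `denoise_step` is a hypothetical callable standing in for one reverse-diffusion step (it is not a PoseDiffusion API), poses are a generic `(N, D)` array, and the known poses are re-injected at every step in the window without adding noise. Whether to inject once or repeatedly, and whether to noise the known poses first, is untested here.

```python
import numpy as np


def sample_with_known_poses(denoise_step, init, known, known_mask,
                            num_steps=100, inject_from=30, inject_until=10):
    """Run a reverse-diffusion loop from t = num_steps down to t = 1.

    denoise_step: hypothetical callable (x, t) -> x for one sampling step.
    init:         (N, D) initial noisy pose encodings.
    known:        (N, D) array holding the given poses (rows where mask is True).
    known_mask:   (N,) boolean array marking cameras with given poses.
    """
    x = init.copy()
    for t in range(num_steps, 0, -1):
        # Only overwrite inside the window; earlier the model would treat
        # the given poses as noise, later there is little left to refine.
        if inject_until <= t <= inject_from:
            x[known_mask] = known[known_mask]
        x = denoise_step(x, t)
    return x
```

The window bounds 30 and 10 are taken from the comment above; they are a starting point to tune, not validated values.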

Thank you so much for your quick and helpful reply. It really works.

@jytime Hi, I plan to build my GT poses in the CO3D format, but when I tried to reproduce the CO3D pose results with the same data, I ran into some alignment problems, mainly in zero-centering and scale normalization. Your data preprocessing seems to already normalize the scale of the poses and take the first frame as the camera reference frame, so do I still need to align my poses perfectly with the CO3D trajectory? Do I need to consider the rotation and scale of the poses? If not, will relying on this preprocessing affect the performance of PoseDiffusion? (I have done a coordinate conversion from COLMAP to PyTorch3D.)

jytime commented

I am not sure what you mean by "align the co3d trajectory". As long as your cameras are consistent within the same coordinate system, our pre-processing strategy should be able to deal with it. But I would suggest visualizing all the cameras.
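For a quick consistency check before plotting, the camera centers and viewing directions can be recovered from the extrinsics. This is a minimal sketch that assumes the PyTorch3D row-vector convention (`x_cam = X @ R + T`); the function name is hypothetical.

```python
import numpy as np


def camera_centers_and_directions(R, T):
    """Return world-space camera centers and viewing directions.

    Assumes x_cam = X @ R + T with R of shape (N, 3, 3) and T of shape (N, 3).
    The center C satisfies C @ R + T = 0, so C = -T @ R^T.
    The viewing direction is the camera +Z axis expressed in world coordinates.
    """
    centers = -np.einsum("nj,nkj->nk", T, R)  # -T_n @ R_n^T
    directions = R[:, :, 2]                   # (0, 0, 1) @ R_n^T
    return centers, directions
```

Plotting `centers` as a 3D scatter with short arrows along `directions` usually makes flipped axes or mixed conventions easy to spot: cameras orbiting an object should form a smooth ring and all look toward it.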

The CO3D result is the teddy bear point cloud in the middle and the red trajectory at the top (maybe in a different coordinate system); my result is the teddy bear point cloud at the bottom and the two circular trajectories in the middle (yellow is the pose converted to the PyTorch3D coordinate system, green is the original COLMAP pose). Do the poses need to be perfectly aligned in position?

jytime commented

No, you don't need to align them; just ensure that all your cameras are consistent.