facebookresearch/nonrigid_nerf

Working with Multiview Data

MatthewGong opened this issue · 4 comments

Hello,

Thank you for the great repo.

I've been trying to use this on a multi-view dataset and I'm having some trouble getting the network to converge on good results.

The data I'm training on comes from ~20-30 synced cameras (depending on how many COLMAP finds during SfM) set up semi-evenly around a room. The cameras are static, but the scene is dynamic, albeit slow-moving. I modified the data loading to take a JSON that contains frames from each camera. When building the training set, I assumed that the order in which images are loaded is the order in which the model expects frames in time. Frames are picked sequentially from each camera, e.g. if there are 30 cameras and 150 frames, camera 1 will contribute frames 1, 31, 61, 91, etc.

I've gotten the network to run and train on the dataset, and the outputs are recognizable, but there's a lot of artifacts. Any help building intuition or advice on how to improve the quality of the outputs would be much appreciated.

Original image:
[input frame]

Outputs after 250k iterations:
[rendered frame (001), disparity (disp_001), disparity with jet colormap (disp_jet_001), and error map (error_001)]
This looks very typical of using the wrong coordinate system conventions. Towards the end of the README, there's an explanation of what the coordinate system should look like. Also, the COLMAP wrappers do some processing on the raw COLMAP results before they return the poses to the dataset loading functions, so you should make sure that you take the extrinsics after they really are fully processed (which requires writing some code somewhere). You can also take a look at logs/cameras.obj to help a bit with debugging; it gives you an idea whether your cameras are remotely reasonable (they aren't, from what it looks like). The depth makes it clear that nothing remotely correct is learned in 3D. The network completely overfits to each input camera, with artifacts, because the camera images are inconsistent with each other.
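For a quick sanity check along those lines, something like the sketch below can dump the camera centers to an OBJ point cloud right after the loading function returns, so you can eyeball the rig layout in MeshLab or Blender. It assumes the poses are camera-to-world matrices with the camera center in the last column (the usual NeRF/LLFF-style convention); `poses` and the output path are placeholders.

```python
import numpy as np

def dump_camera_centers(poses, path="camera_centers.obj"):
    # poses: (N, 3, 4) or (N, 4, 4) camera-to-world matrices (assumption);
    # in that convention the camera center is the translation column.
    with open(path, "w") as f:
        for pose in poses:
            pose = np.asarray(pose)
            center = pose[:3, 3]
            f.write("v {} {} {}\n".format(center[0], center[1], center[2]))

# Call this right after the dataset loading function returns the poses, then
# check that the ~20-30 camera positions form the room layout you expect.
```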

Just to add: although I don't believe it's the immediate cause of these results, there is no assumption about the order of frames. For multi-view, you need to provide an image_to_camera_id_and_timestep.json that says which timestep an image belongs to. Even if the cameras weren't perfectly synchronized, the result would only look somewhat blurry, but the 3D would be there. Currently, the 3D is completely broken.

What I would do to make sure that the extrinsics are in the right format:

  • Create a dataset with frames from only one timestep, i.e. one image per camera.
  • Then use the provided preprocessing pipeline to run colmap on those frames and get extrinsics and intrinsics etc.
  • Then run a dummy training where you write the extrinsics to disk that are returned by the loading function before the main train loop (see the sketch after this list). Maybe even run an actual training and render it; the depth should be good, not broken like this. Then it's clear that the extrinsics work.
  • Rewrite the loading function such that it uses those stored extrinsics for the actual, full dataset and returns them without modifying them (e.g. flipping axes or whatever). This part will take the most time.
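A minimal version of the "write the extrinsics to disk" step could look like the sketch below. It assumes you have access to whatever the loading function returns right before the train loop (here called `poses` and `intrinsics`, both placeholder names) and simply serializes them with numpy so the rewritten loader for the full dataset can read them back unchanged.

```python
import numpy as np

# In the dummy run (single-timestep dataset): right after the loading function
# returns, and before the main train loop, dump exactly what it returned.
def save_calibration(poses, intrinsics, out_prefix="calibration"):
    np.save(out_prefix + "_poses.npy", np.asarray(poses))
    np.save(out_prefix + "_intrinsics.npy", np.asarray(intrinsics))

# In the rewritten loader for the full multi-view dataset: read the stored
# values back and return them as-is, without flipping axes or rescaling.
def load_calibration(out_prefix="calibration"):
    poses = np.load(out_prefix + "_poses.npy")
    intrinsics = np.load(out_prefix + "_intrinsics.npy")
    return poses, intrinsics
```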

Thank you for the fast replies and suggestions! I'll make sure that my coordinate system follows the same conventions.

When populating the image_to_camera_id_and_timestep.json, how should time be represented? (0.0-1.0 relative to the clip, frame number, etc.)

Frame number, as in a time index. All images taken at the same timestep should have the same integer assigned to them. The timesteps should start at 0 and ideally not leave a hole (although I think that's not a problem if it happens). So the first 20 frames, all taken at the same time by 20 cameras, would have timestep=0, the next 20 frames taken by the 20 cameras at the same time have timestep=1, etc.
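To make that concrete for the setup described above (N synced cameras, each contributing one frame per timestep), a small script along these lines could generate the integer timestep assignment. The exact JSON schema of image_to_camera_id_and_timestep.json should be checked against the example shipped with the repo; the filename → {camera_id, timestep} structure used here is only an assumption for illustration.

```python
import json

# frames_per_camera: assumed dict mapping camera id -> list of image filenames
# in temporal order, e.g. {0: ["cam00_f000.png", "cam00_f001.png", ...], ...}
def build_mapping(frames_per_camera):
    mapping = {}
    for camera_id, filenames in frames_per_camera.items():
        for timestep, filename in enumerate(filenames):
            # All images shot at the same moment get the same integer timestep,
            # starting at 0 and counting up without holes.
            mapping[filename] = {"camera_id": camera_id, "timestep": timestep}
    return mapping

if __name__ == "__main__":
    # Toy example: 3 cameras, 2 timesteps -> 6 images with timesteps 0 and 1.
    frames = {c: ["cam{:02d}_f{:03d}.png".format(c, t) for t in range(2)]
              for c in range(3)}
    with open("image_to_camera_id_and_timestep.json", "w") as f:
        json.dump(build_mapping(frames), f, indent=2)
```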