facebookresearch/nonrigid_nerf

How to use colmap to generate `calibration.json`

SuX97 opened this issue · 7 comments

SuX97 commented

Hi,

I am wondering how I can get the extrinsics/intrinsics from colmap and store them in calibration.json.

What I have done is use the colmap GUI to get the parameters and export them to .txt:

# Camera list with one line of data per camera:
# CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]
# Number of cameras: 1
1 SIMPLE_RADIAL 1920 1080 4442.37 960 540 -0.0920246

# IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME

1 0.97039 -0.238968 0.0104339 -0.0335974 -1.07221 -2.75473 2.19396 1 down.jpg

2 0.997979 -0.0634341 0.00200262 -0.00320611 -0.830256 -0.54455 0.924563 1 front.jpg

3 0.969716 -0.0510358 -0.238216 0.0172955 2.78444 -0.361813 0.919001 1 left_30.jpg

4 0.924454 -0.0686466 -0.370858 -0.056008 4.44329 -0.847429 1.52017 1 left_60.jpg

5 0.96706 -0.0668973 0.244203 0.0261598 -4.22796 -0.607595 1.86112 1 right_30.jpg

6 0.924637 -0.0629251 0.373441 0.0403556 -5.59472 -0.66673 2.52344 1 right_60.jpg

7 0.993836 0.110606 -0.00483683 -0.00566639 -0.826491 1.70379 1.96883 1 top.jpg

What more do I need to do to use the multi-view feature?

  1. How can I get min_bound and max_bound? There is no such information in the colmap output.
  2. Do TX, TY, TZ correspond to the translation vector?
  3. Do QW, QX, QY, QZ correspond to the rotation matrix? But your sample shows a 3x3 rotation matrix, while mine is 1x4.
  4. My camera model is SIMPLE_RADIAL; I wonder how I can get yours? Or, if all my cameras are the same, can I just use the same intrinsics array for all of them?

Thanks!

The multi-view feature isn't integrated with colmap; it assumes that you somehow have access to the camera parameters and have them converted correctly (see the bottom of the Readme for coordinate system details, for example).

If you want to use colmap, for questions 1-3, it's probably best to essentially call the preprocessing step yourself: https://github.com/Fyusion/LLFF/blob/c6e27b1ee59cb18f054ccb0f87a90214dbe70482/llff/poses/pose_utils.py#L259 I'm not sure if that code supports different camera intrinsics or assumes all of them to be the same, so you'd need to check the result. In any case, the NR-NeRF code uses colmap results that were passed through that preprocessing (it gives the min_bound and max_bound and converts colmap's extrinsics such that they are compatible with NR-NeRF).
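
For orientation, here is a minimal sketch of how the output of that preprocessing could be read. It assumes the standard poses_bounds.npy file that LLFF's gen_poses() writes (15 values for a 3x5 pose/intrinsics matrix plus two depth bounds per image); whether min_bound/max_bound are simply the global extremes of those per-image bounds is an assumption you'd need to verify against the NR-NeRF preprocessing code.

import numpy as np

# poses_bounds.npy: one row per image, 15 values for a 3x5 matrix
# ([R | t | (H, W, focal)]) followed by two depth bounds (near, far)
# estimated from the sparse colmap reconstruction.
data = np.load('scene_dir/poses_bounds.npy')  # shape (N, 17)

poses = data[:, :15].reshape(-1, 3, 5)  # per-image 3x4 pose plus H/W/focal column
bounds = data[:, 15:]                   # per-image (near, far)

# Assumption: min_bound/max_bound are the global extremes of these bounds.
min_bound = float(bounds.min())
max_bound = float(bounds.max())
print(min_bound, max_bound)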

Regarding 4, the GUI allows you to change the camera model to SIMPLE_PINHOLE, if I remember correctly.

You could also consider using the preprocessing wrapper from NR-NeRF, especially if all cameras have the same intrinsics. If they are static cameras, you could consider running preprocessing on only one timestep (with multiple images). That way you'd get the extrinsics and intrinsics in the correct format for those images. You'd then need to write code to store them in calibration.json.
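
If you do convert the raw images.txt values by hand instead, keep in mind that colmap stores world-to-camera extrinsics. A rough sketch of the inversion (this is only an illustration of the math, not the exact conversion NR-NeRF's preprocessing performs, and the axis conventions still need to be matched to what the Readme describes):

import numpy as np

def colmap_w2c_to_c2w(qw, qx, qy, qz, tx, ty, tz):
    # colmap's images.txt gives the world-to-camera rotation (quaternion,
    # w first) and translation. The camera-to-world pose is the inverse:
    # R_c2w = R_w2c^T and camera center C = -R_w2c^T @ t.
    R = np.array([
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qw * qz), 2 * (qx * qz + qw * qy)],
        [2 * (qx * qy + qw * qz), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qw * qx)],
        [2 * (qx * qz - qw * qy), 2 * (qy * qz + qw * qx), 1 - 2 * (qx * qx + qy * qy)],
    ])
    t = np.array([tx, ty, tz])
    return R.T, -R.T @ t  # rotation and camera center in world coordinates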

SuX97 commented

Thanks for your reply @edgar-tr !

> The multi-view feature isn't integrated with colmap; it assumes that you somehow have access to the camera parameters and have them converted correctly (see the bottom of the Readme for coordinate system details, for example).

> If you want to use colmap, for questions 1-3, it's probably best to essentially call the preprocessing step yourself: https://github.com/Fyusion/LLFF/blob/c6e27b1ee59cb18f054ccb0f87a90214dbe70482/llff/poses/pose_utils.py#L259 I'm not sure if that code supports different camera intrinsics or assumes all of them to be the same, so you'd need to check the result. In any case, the NR-NeRF code uses colmap results that were passed through that preprocessing (it gives the min_bound and max_bound and converts colmap's extrinsics such that they are compatible with NR-NeRF).

Thanks! I have tried the preprocessing code, treating the multi-view pictures of one time step as the input to reconstruct the scene model. However, the bundle adjustment always fails to converge there, while it does converge in my Colmap GUI, so I suspect it's due to the hyper-parameters used for feature extraction and point matching.

> Regarding 4, the GUI allows you to change the camera model to SIMPLE_PINHOLE, if I remember correctly.

Yes, I will have a try.

> You could also consider using the preprocessing wrapper from NR-NeRF, especially if all cameras have the same intrinsics. If they are static cameras, you could consider running preprocessing on only one timestep (with multiple images). That way you'd get the extrinsics and intrinsics in the correct format for those images. You'd then need to write code to store them in calibration.json.

I wrote a snippet to convert the .txt output of colmap to calibration.json, in case anyone has the same issue.

import json
import numpy as np


def quaternion2matrix(q):
    # Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix.
    q0 = q[0]
    q1 = q[1]
    q2 = q[2]
    q3 = q[3]
    print(q)
    m = np.zeros((3, 3))
    m[0, 0] = q0 * q0 + q1 * q1 - q2 * q2 - q3 * q3
    m[0, 1] = 2 * (q1 * q2 - q0 * q3)
    m[0, 2] = 2 * (q1 * q3 + q0 * q2)

    m[1, 0] = 2 * (q1 * q2 + q0 * q3)
    m[1, 1] = q0 * q0 - q1 * q1 + q2 * q2 - q3 * q3
    m[1, 2] = 2 * (q2 * q3 - q0 * q1)

    m[2, 0] = 2 * (q1 * q3 - q0 * q2)
    m[2, 1] = 2 * (q2 * q3 + q0 * q1)
    m[2, 2] = q0 * q0 - q1 * q1 - q2 * q2 + q3 * q3

    return m.tolist()

# This part is copy-pasted from the .txt output of colmap.
intrinsics = [1920, 1080, 4442.37, 960, 540, -0.0920246]
extrinsics = [
    '1 0.97039 -0.238968 0.0104339 -0.0335974 -1.07221 -2.75473 2.19396 1 down.jpg',
    '2 0.997979 -0.0634341 0.00200262 -0.00320611 -0.830256 -0.54455 0.924563 1 front.jpg',
    '3 0.969716 -0.0510358 -0.238216 0.0172955 2.78444 -0.361813 0.919001 1 left_30.jpg',
    '4 0.924454 -0.0686466 -0.370858 -0.056008 4.44329 -0.847429 1.52017 1 left_60.jpg',
    '5 0.96706 -0.0668973 0.244203 0.0261598 -4.22796 -0.607595 1.86112 1 right_30.jpg',
    '6 0.924637 -0.0629251 0.373441 0.0403556 -5.59472 -0.66673 2.52344 1 right_60.jpg',
    '7 0.993836 0.110606 -0.00483683 -0.00566639 -0.826491 1.70379 1.96883 1 top.jpg'
]
camera_param_dict = {}
for ex in extrinsics:
    camera_name = ex.split(' ')[-1].split('.')[0]
    # Each row is: IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME.
    # Note: colmap stores world-to-camera extrinsics, so depending on the
    # convention calibration.json expects, a conversion to camera-to-world
    # may still be needed (see the Readme for the coordinate system).
    (image_id, qw, qx, qy, qz, tx, ty, tz) = map(float, ex.split(' ')[:-2])
    camera_param_dict[camera_name] = {
        'translation': [
            tx, ty, tz
        ],
        'rotation': quaternion2matrix((qw, qx, qy, qz)),
        # intrinsics is WIDTH, HEIGHT, FOCAL, CX, CY, K for SIMPLE_RADIAL;
        # the radial distortion K is dropped here.
        'center_x': intrinsics[3],
        'center_y': intrinsics[4],
        'focal_x': intrinsics[2],
        'focal_y': intrinsics[2],
        'height': intrinsics[1],
        'width': intrinsics[0]
    }

template = {
    "min_bound": 0.0,
    "max_bound": 2.0189487179595886,
    "0": {
        "translation": [
            -0.041070333333333334,
            1.1753333333333333,
            0.49935666666666667
        ],
        "rotation": [
            [
                0.0577962,
                -0.997661,
                -0.0364925
            ],
            [
                0.558001,
                0.00197212,
                0.829838
            ],
            [
                -0.827825,
                -0.0683243,
                0.55681
            ]
        ],
        "center_x": 2572.48,
        "center_y": 1875.78,
        "focal_x": 5363.46,
        "focal_y": 5363.46,
        "height": 3840,
        "width": 5120
    }
}
template.pop('0')
for camera_name in camera_param_dict.keys():
    template[camera_name] = camera_param_dict[camera_name]
print(template)

with open("calibration.json", "w") as json_file:
    json.dump(template, json_file, indent=4)

Thanks for sharing the snippet :)

The NR-NeRF preprocessing code assumes a SIMPLE_PINHOLE camera model. I have encountered issues with convergence when the images were even slightly distorted. In my case, SIMPLE_PINHOLE wasn't enough but SIMPLE_RADIAL worked via the Colmap GUI. (If the GUI fails with SIMPLE_PINHOLE, I'd suspect some slight distortion to be the reason.) Because the camera model wasn't just SIMPLE_PINHOLE, it wasn't directly compatible with the training code. So I went the way of undistorting the images first (via the Colmap GUI) and then using the standard preprocessing code on them. Alternatively, one could modify the get_rays() functions in nerf_helpers.py to work with a non-pinhole camera model, which should be relatively straightforward. The rest of the code doesn't assume any particular camera model; it takes whatever rays are returned by the get_rays() functions.
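
For anyone who goes the get_rays() route instead of undistorting, here is a rough sketch of what handling SIMPLE_RADIAL could look like. It assumes the usual NeRF camera-axis convention of (x, -y, -z) and colmap's SIMPLE_RADIAL parameters f, cx, cy, k; the axis signs and the simple fixed-point inversion of the distortion would need to be checked against the actual get_rays() in nerf_helpers.py.

import numpy as np

def get_rays_simple_radial(H, W, focal, cx, cy, k, c2w, iters=10):
    # Pixel grid -> normalized (distorted) image coordinates.
    j, i = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    xd = (i - cx) / focal
    yd = (j - cy) / focal
    # Fixed-point iteration to undo colmap's x_d = x * (1 + k * r^2).
    x, y = xd.copy(), yd.copy()
    for _ in range(iters):
        r2 = x * x + y * y
        x = xd / (1 + k * r2)
        y = yd / (1 + k * r2)
    # From here on, identical to the pinhole case.
    dirs = np.stack([x, -y, -np.ones_like(x)], axis=-1)
    rays_d = dirs @ c2w[:3, :3].T
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)
    return rays_o, rays_d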

SuX97 commented

Got it, it seems I should solve the distortion and get the intrinsics with a PINHOLE model first.

I am now treating the FOCAL_LENGTH of the SIMPLE_RADIAL model as both FOCAL_X and FOCAL_Y (I am not very familiar with camera models, but the colmap tutorial suggests that FOCAL_X is almost equivalent to the focal length of SIMPLE_RADIAL).

# Camera list with one line of data per camera:
#   CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]
# Number of cameras: 3
1 SIMPLE_PINHOLE 3072 2304 2559.81 1536 1152
2 PINHOLE 3072 2304 2560.56 2560.56 1536 1152
3 SIMPLE_RADIAL 3072 2304 2559.69 1536 1152 -0.0218531

I guess this may introduce errors, since get_rays() may return a wrong o and d; I'll post an update if it works.
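
For reference, colmap's PARAMS[] layout differs per model, so a tiny helper along these lines makes the mapping explicit (layouts as in the colmap docs; treating SIMPLE_RADIAL as a pinhole camera simply drops k, which is exactly the approximation described above):

def colmap_params_to_pinhole(model, params):
    # PARAMS[] per colmap camera model:
    #   SIMPLE_PINHOLE: f, cx, cy
    #   PINHOLE:        fx, fy, cx, cy
    #   SIMPLE_RADIAL:  f, cx, cy, k  (k is dropped, i.e. treated as 0)
    if model == 'SIMPLE_PINHOLE':
        f, cx, cy = params
        return f, f, cx, cy
    if model == 'PINHOLE':
        fx, fy, cx, cy = params
        return fx, fy, cx, cy
    if model == 'SIMPLE_RADIAL':
        f, cx, cy, _k = params
        return f, f, cx, cy
    raise ValueError('unsupported camera model: ' + model)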

SuX97 commented

Some updates here. I have fixed the colmap reconstruction and am now using a PINHOLE model calibration, but the multi-view setup is still not working.

When I use the single view, the model generalizes well to unseen latent codes.

M039_single_view_test_with_diff_latend_training_200000_rgb.3.mp4

However, when I train with the multi-view setting, the result becomes static and blurry for seen views with unseen latent codes, and it cannot generalize to seen latent codes (with deformation set to train) or unseen views (with the spiral).

less_reg_train_fixed_video_rgb.1.mp4
train_spiral_video_rgb.1.mp4

Could you please give some hints? Maybe there are too few views (I have only 7 views, with fixed camera poses rather than a moving sequence). I have tried using less regularization loss as suggested in this issue, but it does not work.

This looks similar to what I get when the camera extrinsics are in the wrong coordinate system. In such a case, it overfits to each input camera individually using a ton of artifacts (as seen in the spiral video). But because it overfits to each view, it is less sharp. There is no remotely meaningful 3D model learned because the images are inconsistent with how the cameras are positioned in NR-NeRF's coordinate system.

Can you look at logs/cameras.obj to see whether the cameras look sensible? Note that even if the cameras look sensible, it could still be that the images at those cameras are rotated weirdly due to some axis flips. I've had to deal with that issue a lot.

You could test this by creating a small dataset from only a single timestep, using the normal preprocessing code without any modifications, and training the model on that but with ray_bending=None (and turning off the regularization losses because they'll throw errors without ray bending, I assume). That trains a standard, rigid NeRF on a single timestep. Novel view synthesis should work somewhat in that case, even though seven static cameras are already quite few. If it does create recognizable novel views, then the issue is the camera coordinate systems being wrong.
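
If it helps with checking, the cameras from a hand-written calibration.json can also be dumped as a point cloud and viewed next to logs/cameras.obj (a minimal sketch; it assumes the 'translation' entries are camera positions in world space, which is exactly the convention that needs verifying):

import json

# Write each camera's translation as an OBJ vertex; open the result in
# MeshLab/Blender together with logs/cameras.obj to compare the two.
with open('calibration.json') as f:
    calibration = json.load(f)

with open('calibration_cameras.obj', 'w') as obj_file:
    for name, cam in calibration.items():
        if name in ('min_bound', 'max_bound'):
            continue
        x, y, z = cam['translation']
        obj_file.write('v {} {} {}\n'.format(x, y, z))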

SuX97 commented

Thanks! It turned out to be wrong intrinsics and poses. I tried a single timestep with rigid NeRF, and it failed in a similar way to this one. Then I tried NeRF--, which optimizes the intrinsics and poses via backpropagation. The results are starting to look reasonable.

depth-custom-1.mp4
img-custom-1.mp4