NVlabs/Deep_Object_Pose

Question regarding data generation pipeline

Opened this issue · 4 comments

Hello,

As you know from some other issues, I have been trying to generate some ground truth data for centerpose using dope's pipeline. Here is a json data containing the annotation for cup model from objectron dataset (generated using the methods mentioned inside the original centerpose repo) :

{
    "AR_data": {
        "plane_center": [
            0.026276886463165283,
            0.03733876347541809,
            -0.42468586564064026
        ],
        "plane_normal": [
            -0.7663699388504028,
            0.09618763625621796,
            0.6351574063301086
        ]
    },
    "camera_data": {
        "camera_projection_matrix": [
            [
                1.6554118394851685,
                0.0,
                0.019000232219696045,
                0.0
            ],
            [
                0.0,
                2.2072157859802246,
                -0.004737734794616699,
                0.0
            ],
            [
                0.0,
                0.0,
                -0.9999997615814209,
                -0.0009999998146668077
            ],
            [
                0.0,
                0.0,
                -1.0,
                0.0
            ]
        ],
        "camera_view_matrix": [
            [
                -0.26714298129081726,
                -0.7525513172149658,
                -0.6019145250320435,
                -0.055233731865882874
            ],
            [
                -0.9542973041534424,
                0.11975152790546417,
                0.27381736040115356,
                0.18261873722076416
            ],
            [
                -0.13398146629333496,
                0.6475539207458496,
                -0.7501484751701355,
                -0.00018225116946268827
            ],
            [
                0.0,
                0.0,
                0.0,
                1.0
            ]
        ],
        "height": 800,
        "intrinsics": {
            "cx": 298.370361328125,
            "cy": 392.1915690104167,
            "fx": 662.1647135416667,
            "fy": 662.1647135416667
        },
        "location_world": [
            0.15949289500713348,
            -0.06331707537174225,
            -0.08338691294193268
        ],
        "quaternion_world_xyzw": [
            -0.583792361067781,
            0.7309314302392926,
            0.3151360368846342,
            0.1600468734586658
        ],
        "width": 600
    },
    "objects": [
        {
            "class": "cup",
            "keypoints_3d": [
                [
                    -0.027369018644094467,
                    0.04407183825969696,
                    -0.38022491335868835
                ],
                [
                    -0.0025680139660835266,
                    -0.032148003578186035,
                    -0.44896677136421204
                ],
                [
                    0.05524785816669464,
                    -0.021686285734176636,
                    -0.38079142570495605
                ],
                [
                    -0.10985984653234482,
                    -0.01868179440498352,
                    -0.3600447177886963
                ],
                [
                    -0.05204399302601814,
                    -0.008220136165618896,
                    -0.2918694317340851
                ],
                [
                    -0.002694040536880493,
                    0.09636379778385162,
                    -0.4685803949832916
                ],
                [
                    0.05512183904647827,
                    0.10682550072669983,
                    -0.40040507912635803
                ],
                [
                    -0.10998588055372238,
                    0.10982996970415115,
                    -0.37965837121009827
                ],
                [
                    -0.05217001214623451,
                    0.12029168009757996,
                    -0.3114830255508423
                ]
            ],
            "location": [
                -0.02736901868021091,
                0.044071842568774056,
                -0.38022490531119946
            ],
            "mug": true,
            "mug_handle_visible": true,
            "name": "cup_0",
            "projected_cuboid": [
                [
                    378,
                    344
                ],
                [
                    253,
                    388
                ],
                [
                    263,
                    488
                ],
                [
                    266,
                    189
                ],
                [
                    282,
                    273
                ],
                [
                    437,
                    388
                ],
                [
                    478,
                    483
                ],
                [
                    493,
                    200
                ],
                [
                    557,
                    281
                ]
            ],
            "provenance": "objectron",
            "quaternion_xyzw": [
                0.19061728062246633,
                0.29139854153287625,
                0.6446484091470697,
                0.6805735602451516
            ],
            "scale": [
                0.12999999523162842,
                0.14000000059604645,
                0.09000000357627869
            ]
        }
    ]
}

Now, here is the data that has been generated by dope's pipeline (nvisii):

{
    "camera_data": {
        "camera_look_at": {
            "at": [
                1.0,
                0.0,
                0.0
            ],
            "eye": [
                0.0,
                0.0,
                0.0
            ],
            "up": [
                0.0,
                0.0,
                1.0
            ]
        },
        "camera_view_matrix": [
            [
                0.0,
                0.0,
                1.0,
                0.0
            ],
            [
                -1.0,
                0.0,
                0.0,
                0.0
            ],
            [
                0.0,
                -1.0,
                0.0,
                0.0
            ],
            [
                0.0,
                0.0,
                0.0,
                1.0
            ]
        ],
        "height": 720,
        "intrinsics": {
            "cx": 640.0,
            "cy": 360.0,
            "fx": 521.6779174804688,
            "fy": 521.6779174804688
        },
        "location_worldframe": [
            -0.0,
            0.0,
            -0.0
        ],
        "quaternion_xyzw_worldframe": [
            -0.5,
            0.5,
            -0.5,
            0.5
        ],
        "width": 1280
    },
    "objects": [
        {
            "bounding_box_minx_maxx_miny_maxy": [
                484,
                520,
                538,
                581
            ],
            "class": "mug",
            "local_cuboid": null,
            "local_to_world_matrix": [
                [
                    -0.5250795483589172,
                    0.8295515179634094,
                    0.19009393453598022,
                    -0.0
                ],
                [
                    -0.5906684398651123,
                    -0.5160276293754578,
                    0.6203438639640808,
                    0.0
                ],
                [
                    0.612700879573822,
                    0.2134474217891693,
                    0.7609454393386841,
                    -0.0
                ],
                [
                    1.7121046781539917,
                    0.42501959204673767,
                    -0.6440945863723755,
                    1.0
                ]
            ],
            "location": [
                -0.42501959204673767,
                0.6440945863723755,
                1.7121046781539917
            ],
            "location_worldframe": [
                1.7121046781539917,
                0.42501959204673767,
                -0.6440945863723755
            ],
            "name": "mug_0",
            "projected_cuboid": [
                [
                    502,
                    557
                ],
                [
                    482,
                    545
                ],
                [
                    508,
                    545
                ],
                [
                    493,
                    532
                ],
                [
                    519,
                    533
                ],
                [
                    484,
                    582
                ],
                [
                    511,
                    582
                ],
                [
                    495,
                    570
                ],
                [
                    522,
                    570
                ]
            ],
            "provenance": "nvisii",
            "px_count_all": 0,
            "px_count_visib": 0,
            "quaternion_xyzw": [
                -0.21100440621376038,
                -0.38616928458213806,
                0.8750137686729431,
                -0.20174583792686462
            ],
            "quaternion_xyzw_worldframe": [
                0.23979294300079346,
                0.2490515112876892,
                0.8369666337966919,
                0.4242163896560669
            ],
            "segmentation_id": 4,
            "visibility": 1,
            "scale": [
                0.12563899159431458,
                0.09907500445842743,
                0.09066499769687653
            ]
        },
        {
            "bounding_box_minx_maxx_miny_maxy": [
                514,
                547,
                477,
                512
            ],
            "class": "mug_model",
            "local_cuboid": null,
            "local_to_world_matrix": [
                [
                    -0.10810719430446625,
                    0.8605160713195801,
                    -0.49782025814056396,
                    0.0
                ],
                [
                    -0.6656666994094849,
                    0.3092678487300873,
                    0.679147481918335,
                    0.0
                ],
                [
                    0.7383770942687988,
                    0.404803067445755,
                    0.539382815361023,
                    -0.0
                ],
                [
                    1.935137152671814,
                    0.3712746202945709,
                    -0.4784972667694092,
                    1.0
                ]
            ],
            "location": [
                -0.3712746202945709,
                0.4784972667694092,
                1.935137152671814
            ],
            "location_worldframe": [
                1.935137152671814,
                0.3712746202945709,
                -0.4784972667694092
            ],
            "name": "mug_1",
            "projected_cuboid": [
                [
                    530,
                    491
                ],
                [
                    521,
                    492
                ],
                [
                    542,
                    480
                ],
                [
                    509,
                    479
                ],
                [
                    531,
                    466
                ],
                [
                    530,
                    517
                ],
                [
                    551,
                    504
                ],
                [
                    517,
                    503
                ],
                [
                    540,
                    490
                ]
            ],
            "provenance": "nvisii",
            "px_count_all": 0,
            "px_count_visib": 0,
            "quaternion_xyzw": [
                -0.14164598286151886,
                -0.33278805017471313,
                0.9052680134773254,
                0.22288650274276733
            ],
            "quaternion_xyzw_worldframe": [
                0.10397374629974365,
                0.46850624680519104,
                0.5784077644348145,
                0.6596482992172241
            ],
            "segmentation_id": 5,
            "visibility": 1,
            "scale": [
                0.12563899159431458,
                0.09907500445842743,
                0.09066499769687653
            ]
        }
    ]
}

I have modified a little bit to the nvisii script to add the scale, rename and change projected cuboid orientation to match the objectron dataset coordintate frame. Now I have 3 questions:

  1. You can clearly see the missing field in the synthetic data generation pipeline is the "camera_projection_matrix". Now, usually I think we have this in pixel but in this case, it seems like it's in some other format. Also, as this is coming from the real dataset for objectron dataset, each camera is different so is the projection matrix. How to set this/solve this problem for synthetic dataset where i am using the same camera? Attaching the camera settings for synthetic dataset generation pipeline:
{
    "camera_settings": [
        {
            "name": "camera",
            "intrinsic_settings": {
                "resX": 1280,
                "resY": 720,
                "fx": 521.6779174804688,
                "fy": 521.6779174804688,
                "cx": 640.0,
                "cy": 360.0,
                "s": 0.0
            },
            "captured_image_size": {
                "width": 1280,
                "height": 720
            }
        }
    ]
}
  1. In the centerpose json data, Inside AR_data field, there is plane_center and plane_normal which is missing in the synthetic data generation pipeline. Also, the keypoints_3d ( actual location) value is missing in the synthetic datasets. Are these values needed if I am trying to train a single instance of mug which has no axis symmetry for centerpose ??

  2. In vast amount of the centerpose annotation json data, you have the z value of the "location" field to be negetive like this:
    -0.38022490531119946 but for the synthetic data generation pipeline, the data is more or less bounded by a volume like this:

xmin = -0.75, xmax = 0.75, 
ymin = -0.75, ymax = 0.75 
z min = 0.15, zmax = 2.0  

where z value is never negetive for synthetic data generation pipeline (which also makes sense as in front of camera, the z depth should be positive not negetive). Why is the discrepency and how to mitigate through this problem ??

  1. Nvisii has a set camera intrinsics that uses opencv style cam int. You should be able to change that.
  2. Depends what you want to do, you could export the 3d poses, use the cuboid_ child of the object you want to export to export their 3d position, follow the same workflow as the main object.
  3. yeah the z value is negative because the camera is an opengl camera.
    image
    You can see that everything in front of the camera will have z negative, and if you pay attention the y should also be inverted. Check chatgpt on how to go from opengl coordinate frame to opencv.

I think overall you are on the right direction and you are making good progress, good job. advice with 3d, take your time and make one change at the time and then debug it. Easy to introduce errros that are hard to recover from.

Hello @TontonTremblay ,

Thanks for the reply. I understood 1 & 3 but couldn't understand point 2.

I can see the https://github.com/NVlabs/Deep_Object_Pose/blob/master/scripts/nvisii_data_gen/single_video_pybullet.py#608
in this line, you have commented out the cuboids. I don't see any definition for this one inside the single_video_pybullet.py file.

Inside the utils.py file, in the export_to_ndds_files function definition, it's set to none: https://github.com/NVlabs/Deep_Object_Pose/blob/master/scripts/nvisii_data_gen/utils.py#1103.

I do see some cuboid definitions inside the utils.py file as add_cuboid here: https://github.com/NVlabs/Deep_Object_Pose/blob/master/scripts/nvisii_data_gen/utils.py#913

And here: https://github.com/NVlabs/Deep_Object_Pose/blob/master/scripts/nvisii_data_gen/utils.py#998

Lastly, I see a logic here: https://github.com/NVlabs/Deep_Object_Pose/blob/master/scripts/nvisii_data_gen/utils.py#1233

        if not cuboids is None:
            cuboid = cuboids[obj_name]
        else:
            cuboid = None

I am somewhat confused as there are some things parallely going on as well as I need those cuboid keypoints with respect to camera frame (not world frame). To ensure everything is on right track, it's important that I don't do anything wrong.

Also, just wanted to confirm that have you used opengl camera convention for centerpose prediction ?? I believe centerpose is using objectron dataset and in that logic, objectron dataset is also following opengl convention. If that's the case, inside utils.py, if I commented out the cam_matrix 180 degree rotation part, and add camera_projection_matrix like the same way as cam_matrix to produce everything in opengl format, is that ok ??

# camera view camera
    cam_matrix = visii.entity.get(camera_name).get_transform().get_world_to_local_matrix()

    # rotate camera by 180° around x
    # from: X right, Y up and Z out of the image towards the viewer
    # to: X right, Y down and Z into the image away from the viewer (= OpenCV / NDDS / ROS "optical" frame)
    # cam_matrix = visii.mat4(1, 0, 0, 0,
    #                         0, -1, 0, 0,
    #                         0, 0, -1, 0,
    #                         0, 0, 0, 1) * cam_matrix

    cam_matrix_export = []
    for row in cam_matrix:
        cam_matrix_export.append([row[0],row[1],row[2],row[3]])
    
    # camera_projection_matrix
    cam_proj_matrix = visii.entity.get(camera_name).get_camera().get_projection()

    cam_proj_matrix_export = []
    for row in cam_proj_matrix:
        cam_proj_matrix_export.append([row[0],row[1],row[2],row[3]])

After that, here is how my json file was updated:

{
    "camera_data": {
        "camera_look_at": {
            "at": [
                1.0,
                0.0,
                0.0
            ],
            "eye": [
                0.0,
                0.0,
                0.0
            ],
            "up": [
                0.0,
                0.0,
                1.0
            ]
        },
        "camera_projection_matrix": [
            [
                0.8151217699050903,
                0.0,
                0.0,
                0.0
            ],
            [
                0.0,
                1.4491053819656372,
                0.0,
                0.0
            ],
            [
                0.013498933985829353,
                0.014036557637155056,
                -1.0010005235671997,
                -1.0
            ],
            [
                0.0,
                0.0,
                -0.10005002468824387,
                0.0
            ]
        ],
        "camera_view_matrix": [
            [
                0.0,
                0.0,
                -1.0,
                0.0
            ],
            [
                -1.0,
                0.0,
                0.0,
                0.0
            ],
            [
                0.0,
                1.0,
                0.0,
                0.0
            ],
            [
                0.0,
                0.0,
                0.0,
                1.0
            ]
        ],
        "height": 720,
        "intrinsics": {
            "cx": 640.0,
            "cy": 360.0,
            "fx": 521.6779174804688,
            "fy": 521.6779174804688
        },
        "location_worldframe": [
            -0.0,
            0.0,
            -0.0
        ],
        "quaternion_xyzw_worldframe": [
            0.5,
            -0.5,
            -0.5,
            0.5
        ],
        "width": 1280
    },
    "objects": [
        {
            "bounding_box_minx_maxx_miny_maxy": [
                492,
                533,
                129,
                177
            ],
            "class": "mug",
            "local_cuboid": null,
            "local_to_world_matrix": [
                [
                    -0.2890670597553253,
                    -0.8582406044006348,
                    0.4241040050983429,
                    -0.0
                ],
                [
                    0.09283934533596039,
                    0.41579586267471313,
                    0.904707133769989,
                    -0.0
                ],
                [
                    -0.9527969360351562,
                    0.30089423060417175,
                    -0.040514618158340454,
                    -0.0
                ],
                [
                    1.481677532196045,
                    0.3374212980270386,
                    0.6008970737457275,
                    1.0
                ]
            ],
            "location": [
                -0.3374212980270386,
                0.6008970737457275,
                -1.481677532196045
            ],
            "location_worldframe": [
                1.481677532196045,
                0.3374212980270386,
                0.6008970737457275
            ],
            "name": "mug_0",
            "projected_cuboid": [
                [
                    518.2252883911133,
                    149.8012661933899
                ],
                [
                    491.6792297363281,
                    167.93272018432617
                ],
                [
                    477.51625061035156,
                    136.42764329910278
                ],
                [
                    503.6228942871094,
                    117.82564401626587
                ],
                [
                    540.2409744262695,
                    165.09751796722412
                ],
                [
                    515.386848449707,
                    181.55617475509644
                ],
                [
                    502.17926025390625,
                    152.40899562835693
                ],
                [
                    526.6552734375,
                    135.54736375808716
                ],
                [
                    509.7786331176758,
                    151.17395639419556
                ]
            ],
            "provenance": "nvisii",
            "px_count_all": 0,
            "px_count_visib": 0,
            "quaternion_xyzw": [
                0.013572394847869873,
                0.15302777290344238,
                -0.21785865724086761,
                -0.9638134241104126
            ],
            "quaternion_xyzw_worldframe": [
                -0.28967729210853577,
                0.6605637073516846,
                0.456277459859848,
                -0.5211083292961121
            ],
           "scale": [
                0.12563899159431458,
                0.09907500445842743,
                0.09066499769687653
            ]
        }
    ]
}

When we did objectron, I am afraid I do not remember exactly how we parsed the data. So I am afraid I wont be able to help, but normally we train algorithms on opencv conventions to make our life a little easier on the robot.

But to answer your question, yes, if you remove the rotation you will get the camera pose in opengl coordinate frame. nvisii uses opengl coordinate frames (and going to opencv is pretty simple).

In the script when you pass the cuboid data structure, you will export the cuboid in its local frame, I did not include it in the final version of the code to avoid confusion with the exported data with the projected_cuboid. But also it is a data structure to keep track of. If you check the add_cuboid in the utils, it returns the cuboid in the object coordinate frame, you can use that to export the normalize cuboid centerpose needs. I hope this helps. https://github.com/NVlabs/Deep_Object_Pose/blob/master/scripts/nvisii_data_gen/utils.py#L996 I think the cuboid to export_ndds needs object_name:cuboid dictionary.