NVlabs/Deep_Object_Pose

Custom training model won't recognise anything

Closed this issue · 11 comments

Hello,

Firstly, thank you very much for your responsiveness and for maintaining DOPE! I am trying to train DOPE with new synthetic data to test whether it will improve the detection of our objects (which have a slightly different appearance than the ones you originally trained DOPE with). However, there is some strange behaviour that I will describe below.

Details about my training process

  • I use this script to generate the synthetic data.
    • I modified the above script, as well as the one it calls (single_video_pybullet.py), so that they do not put any other random objects in the scene except my custom object (a Cheezit with a custom texture). I specify my custom object using --path_single_obj path/to/my/model.obj. That is, all synthetic images include only our object (or occasionally no object, presumably because the randomiser places it far away from or outside the image boundaries?).
    • I generate 20,000 images under dope/scripts/nvisii_data_gen/output/dataset. This contains 100 directories numbered 000, 001, etc., and each numbered directory contains 200 data points (with these files per data point: a .png file, a .depth.exr file, a .json file, and a .seg.exr file).
    • I then split these 100 directories into two sets: I keep 80% under dataset and move the remaining 20% to a new dope/scripts/nvisii_data_gen/output/test_data directory (see the split sketch after this list).
  • I then use this script to train DOPE. The exact command I use is the following:
python3 train.py \
        --data nvisii_data_gen/output/dataset/ \
        --datatest nvisii_data_gen/output/test_data/ \
        --object single_obj_0 \
        --epochs 10 \
        --batchsize 24
  • This training results in a directory train_tmp/ which includes: header.txt, loss_test.csv, loss_train.csv, test_metric.csv, and a .pth file for each epoch.
  • I pick one of the .pth files from train_tmp and place it under dope/weights. I then modify the config/config_pose.yaml to let DOPE know about the new weights file.
  • I then use the launch file to start DOPE.
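
For reference, this is roughly how I did the 80/20 split mentioned above (a rough sketch using the paths from my setup):

# Rough sketch of the 80/20 split: keep the first 80 numbered directories in
# dataset/ and move the remaining 20 to test_data/.
import shutil
from pathlib import Path

src = Path("nvisii_data_gen/output/dataset")
dst = Path("nvisii_data_gen/output/test_data")
dst.mkdir(parents=True, exist_ok=True)

dirs = sorted(d for d in src.iterdir() if d.is_dir())
split = int(0.8 * len(dirs))            # 80 of the 100 directories stay
for d in dirs[split:]:
    shutil.move(str(d), str(dst / d.name))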

What happens with a real image view

When I use the new .pth file, DOPE fails to recognize any valid points, even though a Cheezit is clearly within the RealSense camera's view. However, if I alter the config_pose.yaml file to replace my custom .pth file with the pre-trained cracker_60.pth that you provide, DOPE successfully detects our Cheezit despite slight differences in appearance between our Cheezit and yours!

Sanity checks - strange behaviour

Considering these observations, I created a mock 'realsense' ROS script that constantly publishes an image from the synthetic dataset. I then ran DOPE using my custom .pth file to check whether it would work with an image from the training dataset. However, the issue persisted: DOPE failed to detect any valid points with my custom .pth file. I then used the dummy script again, but switched from my custom file to your cracker_60.pth. With cracker_60.pth it kind of works; however, it repeatedly throws the following error:

...

cv2.solvePnP failed with an error
9 valid points found
cv2.solvePnP failed with an error
9 valid points found
cv2.solvePnP failed with an error
9 valid points found
cv2.solvePnP failed with an error
9 valid points found

...
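
For context, my understanding is that the detected 2D cuboid keypoints get fed to cv2.solvePnP together with the 3D cuboid corners and the camera intrinsics, roughly like this (a simplified sketch with made-up numbers, not the actual cuboid_pnp_solver.py code):

# Simplified illustration of the PnP step: 8 cuboid corners + centroid in the
# object frame are matched, in order, with the 9 detected 2D keypoints and
# solved against the camera intrinsics. All numbers below are placeholders,
# so the resulting pose is meaningless; the point is that a wrong keypoint
# ordering or wrong cuboid size breaks this call.
import cv2
import numpy as np

cuboid_3d = np.array([                      # object-frame corners (metres), made up
    [-0.08, -0.10, -0.03], [0.08, -0.10, -0.03],
    [0.08, 0.10, -0.03], [-0.08, 0.10, -0.03],
    [-0.08, -0.10, 0.03], [0.08, -0.10, 0.03],
    [0.08, 0.10, 0.03], [-0.08, 0.10, 0.03],
    [0.0, 0.0, 0.0],                        # centroid
], dtype=np.float64)

keypoints_2d = np.array([                   # detected pixel coordinates, made up
    [300.0, 220.0], [360.0, 225.0], [355.0, 310.0], [295.0, 305.0],
    [310.0, 230.0], [370.0, 235.0], [365.0, 320.0], [305.0, 315.0],
    [332.0, 270.0],
], dtype=np.float64)

camera_matrix = np.array([[615.0, 0.0, 320.0],   # fx, 0, cx (example intrinsics)
                          [0.0, 615.0, 240.0],   # 0, fy, cy
                          [0.0, 0.0, 1.0]])

ok, rvec, tvec = cv2.solvePnP(cuboid_3d, keypoints_2d, camera_matrix, None)
print(ok, tvec.ravel())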

For reference, here is the image my dummy script constantly publishes (i.e., an image from the synthetic dataset):

image
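
Here is a minimal sketch of the kind of dummy publisher I used (the topic name, rate, and file path are placeholders and need to match whatever DOPE subscribes to in your setup):

#!/usr/bin/env python3
# Repeatedly publish one image from the synthetic dataset as a fake camera feed.
import cv2
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

rospy.init_node("fake_realsense")
pub = rospy.Publisher("/camera/color/image_raw", Image, queue_size=1)
bridge = CvBridge()

img = cv2.imread("nvisii_data_gen/output/dataset/000/00000.png")  # example path
msg = bridge.cv2_to_imgmsg(img, encoding="bgr8")

rate = rospy.Rate(10)  # publish at 10 Hz
while not rospy.is_shutdown():
    msg.header.stamp = rospy.Time.now()
    pub.publish(msg)
    rate.sleep()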

I would appreciate any pointers to address this.

Thank you in advance.

It sounds like you are doing the right thing. Have you visualized the belief maps? If you are using inference.py I think you can pass --show_beliefs. For the ROS node you need to enable it in the config, I think; then in ROS you have access to a topic for the beliefs. If the belief maps look good, then you can start debugging the PnP part; possibly the cuboid size is creating problems. It also might be that the point ordering in the version you trained is wrong (which I hope is not the case, it would be annoying to debug, but it is possible). You can also look at these two lines: https://github.com/NVlabs/Deep_Object_Pose/blob/master/src/dope/inference/cuboid_pnp_solver.py#L100-L101. Report back with the belief maps and then we can see.
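
If you don't want to go through the ROS topic, a quick-and-dirty option is to dump the belief tensor to images yourself; something like this (untested sketch, assuming you can grab a [9, H, W] belief tensor from the network output):

# Dump each of the 9 belief channels as a normalized grayscale image so you
# can eyeball whether the keypoints are being detected at all.
import cv2
import numpy as np

def save_belief_maps(beliefs, prefix="belief"):
    maps = beliefs.detach().cpu().numpy()     # expects shape [9, H, W]
    for i, m in enumerate(maps):
        m = m - m.min()
        if m.max() > 0:
            m = m / m.max()
        cv2.imwrite("{}_{}.png".format(prefix, i), (m * 255).astype(np.uint8))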

Hello @TontonTremblay,

Thanks for getting back to me. Here is the belief using my custom .pth file:

image

Here is the belief using your cracker_60.pth file:

image

I had a look at the training losses, and they look a little suspicious (too good to be true?): the loss during testing goes down to 1.14e-08!
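
This is roughly how I plotted the curves (a quick sketch; I just take the last column of each CSV as the loss value):

# Plot the train/test loss curves written by train.py into train_tmp/.
import csv
import matplotlib.pyplot as plt

def read_losses(path):
    losses = []
    with open(path) as f:
        for row in csv.reader(f):
            try:
                losses.append(float(row[-1]))   # last column assumed to be the loss
            except (ValueError, IndexError):
                continue                        # skip header / malformed lines
    return losses

plt.plot(read_losses("train_tmp/loss_train.csv"), label="train")
plt.plot(read_losses("train_tmp/loss_test.csv"), label="test")
plt.yscale("log")
plt.legend()
plt.savefig("loss_curves.png")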

This almost makes me think that the training script isn't reading the ground truth properly. Here is an example .json file from the synthetic data:

{
    "camera_data": {
        "camera_look_at": {
            "at": [
                1.0,
                0.0,
                0.0
            ],
            "eye": [
                0.0,
                0.0,
                0.0
            ],
            "up": [
                0.0,
                0.0,
                1.0
            ]
        },
        "camera_view_matrix": [
            [
                0.0,
                0.0,
                1.0,
                0.0
            ],
            [
                -1.0,
                0.0,
                0.0,
                0.0
            ],
            [
                0.0,
                -1.0,
                0.0,
                0.0
            ],
            [
                0.0,
                0.0,
                0.0,
                1.0
            ]
        ],
        "height": 500,
        "intrinsics": {
            "cx": 250.0,
            "cy": 250.0,
            "fx": 603.5535278320312,
            "fy": 603.5535278320312
        },
        "location_worldframe": [
            -0.0,
            0.0,
            -0.0
        ],
        "quaternion_xyzw_worldframe": [
            -0.5,
            0.5,
            -0.5,
            0.5
        ],
        "width": 500
    },
    "objects": [
        {
            "bounding_box_minx_maxx_miny_maxy": [
                80,
                289,
                361,
                500
            ],
            "class": "obj",
            "local_cuboid": null,
            "local_to_world_matrix": [
                [
                    1.8920022249221802,
                    -0.16087310016155243,
                    -0.6280502080917358,
                    -0.0
                ],
                [
                    -0.6219491362571716,
                    -0.9973906874656677,
                    -1.6181448698043823,
                    0.0
                ],
                [
                    -0.18304753303527832,
                    1.7260745763778687,
                    -0.9935604333877563,
                    -0.0
                ],
                [
                    1.5750807523727417,
                    -0.03503342717885971,
                    -0.4613076448440552,
                    1.0
                ]
            ],
            "location": [
                0.03503342717885971,
                0.4613076448440552,
                1.5750807523727417
            ],
            "location_worldframe": [
                1.5750807523727417,
                -0.03503342717885971,
                -0.4613076448440552
            ],
            "name": "single_obj_0",
            "projected_cuboid": [
                [
                    95.4332947731018,
                    444.68867778778076
                ],
                [
                    76.83844864368439,
                    444.17738914489746
                ],
                [
                    134.66638326644897,
                    572.567343711853
                ],
                [
                    149.38680827617645,
                    561.771810054779
                ],
                [
                    231.4603179693222,
                    363.9290928840637
                ],
                [
                    225.31068325042725,
                    356.3215136528015
                ],
                [
                    292.0815348625183,
                    475.01134872436523
                ],
                [
                    292.8772568702698,
                    472.88691997528076
                ],
                [
                    187.41339445114136,
                    458.8717818260193
                ]
            ],
            "provenance": "nvisii",
            "px_count_all": 0,
            "px_count_visib": 0,
            "quaternion_xyzw": [
                0.3544142246246338,
                0.9339839220046997,
                -0.006243467330932617,
                -0.927586019039154
            ],
            "quaternion_xyzw_worldframe": [
                1.104870319366455,
                -0.17712989449501038,
                -0.18352781236171722,
                -0.7566995620727539
            ],
            "segmentation_id": 2,
            "visibility": 1
        }
    ]
}

Based on the above .json file, I used the following command when training DOPE:

python3 train.py \
        --data nvisii_data_gen/output/dataset/ \
        --datatest nvisii_data_gen/output/test_data/ \
        --object single_obj_0 \
        --epochs 10 \
        --batchsize 24

That is, am I right to point the script to the single_obj_0 name in the .json file?
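
As an extra sanity check, I also load one of the .json files directly and print what is annotated for single_obj_0 (quick sketch; the file path is just an example from my dataset):

# Print the object names and their projected cuboid keypoints from one
# annotation file, to confirm single_obj_0 is actually in there.
import json

with open("nvisii_data_gen/output/dataset/000/00000.json") as f:
    data = json.load(f)

for obj in data["objects"]:
    print(obj["name"], obj["projected_cuboid"])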

Yeah, your model is not working, sorry. Can you check the TensorBoard images in the training dir? You will see the annotations from your data there. I think you did not train long enough, tbh. It is a big network to train.

Hello @TontonTremblay, I am not sure how to get those. The train.py script just creates a train_tmp dir, which only includes .pth files and a few .csv files.

Can you try with train2.py?

Unfortunately, that script isn't runnable on my machine due to CUDA issues. I will try a few more things to get it running, but I have the feeling that my CUDA version (12.0) and the PyTorch version it supports conflict with the code, which throws exceptions. I know it's been a while since you ran that script, but do you know which CUDA / PyTorch versions you were using?

So if you check train2.py, there is a part where TensorBoard is used to save belief maps. Just take that part and manually save belief maps as you train; this will help you know when the model has trained enough.
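
Something along these lines (untested sketch; adapt it to how your training loop and network output actually look):

# Log a grid of the 9 belief channels for the first image of a batch; here I
# assume the belief maps come back as a [B, 9, H, W] tensor, which may not
# match the real network output exactly.
import torch
import torchvision.utils as vutils
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("train_tmp/tensorboard")

def log_beliefs(net, images, step):
    net.eval()
    with torch.no_grad():
        out = net(images)
        beliefs = out[0] if isinstance(out, (tuple, list)) else out  # assumed [B, 9, H, W]
    grid = vutils.make_grid(beliefs[0].unsqueeze(1),  # 9 single-channel maps
                            normalize=True)
    writer.add_image("belief_maps", grid, step)
    net.train()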

Hi @TontonTremblay,

A quick update: everything works now. When I used train.py I was passing --object single_obj_0 to tell it which object to train on, but I then realised that the CSV parser in the script was failing to pick up anything with that flag. I simply removed the flag when training and everything seems to work.

Comparing my custom-trained model (which uses a different Cheezit texture, matching the object we have in the lab) with yours, I realised that yours still performs better. I am trying to understand what I am doing wrong:

  1. Reading your DOPE paper, I see that you generated 60k photorealistic images and mixed them with 60k domain-randomised images.
  2. I generated 120,000 images using your nvisii script here and trained with these. I thought that the script would generate noisy images as a proxy for domain-randomised images?

Can I please ask whether you used a different approach to train DOPE? Did you also use the nvisii script to generate all the data? There are some flags in that script, like --motionblur, which I did not use, for example.

Many thanks.

Wow, amazing job; sorry it took a lot of time to get there. I think the problem is more related to data diversity. A domain-randomization-style dataset works well, but it will never work as well as mixing data styles. I think the original DOPE paper has some experiments on mixing different percentages of DR and photorealistic data.

When recreating this with your own data, the problem becomes recreating the same sort of data, and here I only shared half of the solution. DR will get you to a solution that works fine for most cases. For example, for our HOPE models I only used DR, as generating this data is quite simple. Generating photorealistic data is a lot more work; finding 3D scenes that are correctly lit, for example, is harder. I had started working on some solutions in nvisii for this, but ended up having to put it aside.

So what solution do I have for you? I have one, but you will have to learn a new tool; sorry, this is somewhat my fault. A while ago I wanted to learn more about Blender, and I rewrote part of the data export we used in nvisii in Blender. You can check this one: https://github.com/TontonTremblay/mvs_objaverse#falling-scene. These scenes are not quite the same as what we used for DR, but they would probably get you where you want to go.

If you want my real two cents, I would not go down this path. I would keep the weights you have and use them as a detector, then use something like MegaPose and/or Diff-DOPE to get a really good pose. Or, if you want something faster, you could check https://github.com/nv-nguyen/gigaPose, which runs quite fast; the code is not yet available, but it should be in the next couple of days. Anyway, I would say DOPE is accessible, but in general it takes an older approach to pose estimation. Sorry this is messy; I am working on a new pipeline to simplify the process.

Hi @TontonTremblay,

Thank you very much for all the information! I will have a look around and see how to move forward. I have seen your HOPE paper, and we ordered those objects so we can use your models; hopefully we can get more accurate DOPE predictions with the HOPE objects, since they should be identical to the ones you trained with.

I would like to thank you for being so responsive and for offering so much to the community, including high-quality code and projects. Thanks!

It is a pleasure to be as helpful as possible. I appreciate the kind words; it motivates me to continue what we are doing. ❤️