NVlabs/Deep_Object_Pose

YCB 003_cracker_box estimation not good with official weights

wetoo-cando opened this issue · 7 comments

I ran inference on my dataset using the cracker_60.pth weights provided in the Google Drive linked from this repository, together with the train2/inference.py script.

The scene is quite simple; however, the estimation results are not that great (see the GIFs below). The belief maps are splattered all over the place.

How should I investigate further to improve the results?

[GIF: imgs]

[GIF: bel_maps]

Yeah, the results of DOPE on this dataset are not that great for some reason, maybe because the objects are somewhat small in the image. You can see the belief maps are not bad, but not very precise. I would recommend running https://github.com/NVlabs/FoundationPose after the detections, or something like that.

@TontonTremblay thanks for responding with always helpful suggestions!

I'm wondering about "maybe because the objects are somewhat small". I am not sure how your official weights were obtained. I am currently running nvisii to generate synthetic data, and at least in the images rendered so far, the objects are quite well spread out across the image and also in depth. Do you think training on such data may give better results than what you see above?
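To be concrete about "well spread out", here is a minimal sketch (plain NumPy, independent of my actual nvisii scripts) of the kind of pose sampling I mean: positions drawn uniformly over the camera frustum and over a depth range, so objects appear at all image locations and apparent sizes:

```python
import numpy as np

def sample_positions(n, fov_deg=60.0, z_near=0.3, z_far=3.0, seed=0):
    """Sample n object positions uniformly over the view frustum.

    Depth is drawn uniformly in [z_near, z_far]; x/y are drawn within
    the frustum cross-section at that depth, so rendered objects cover
    the whole image and a wide range of apparent scales.
    """
    rng = np.random.default_rng(seed)
    half = np.tan(np.radians(fov_deg) / 2.0)
    z = rng.uniform(z_near, z_far, n)
    x = rng.uniform(-1.0, 1.0, n) * half * z
    y = rng.uniform(-1.0, 1.0, n) * half * z
    return np.stack([x, y, z], axis=1)
```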

Great that FoundationPose has been released; I'll definitely check it out.

I think the architecture might be limiting here: you go from 400 to 50, a 1/8 reduction, so it is hard to be precise. I would think that a new architecture, like a transformer, could help here. I tried a couple of different architectures 4 years ago and saw some improvements for this case, but I never quantified them. Training data that matches this distribution of object sizes would probably help too. But I think using something like a pose refiner afterwards would probably make it snap; diff-dope would help. In general, there is still no clean solution for detection and initial pose that runs at a decent frame rate.
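To make the resolution argument concrete, here is a minimal sketch (plain NumPy; the function and window size are illustrative, not the repo's actual post-processing) of why a 50x50 belief map only localizes a keypoint to an 8-pixel grid at 400x400, and how a local weighted centroid can recover some sub-cell precision:

```python
import numpy as np

def peak_with_subpixel(belief, stride=8, window=1):
    """Find the belief-map peak and refine it with a local centroid.

    belief : (H, W) array at 1/stride of the input resolution.
    Returns (naive_xy, refined_xy) in input-image pixel coordinates.
    """
    r, c = np.unravel_index(np.argmax(belief), belief.shape)
    # Naive mapping: one belief cell covers `stride` image pixels,
    # so the keypoint is only localized to an 8-px grid.
    naive = ((c + 0.5) * stride, (r + 0.5) * stride)

    # Weighted centroid over a small window recovers sub-cell precision.
    r0, r1 = max(r - window, 0), min(r + window + 1, belief.shape[0])
    c0, c1 = max(c - window, 0), min(c + window + 1, belief.shape[1])
    patch = belief[r0:r1, c0:c1]
    ys, xs = np.mgrid[r0:r1, c0:c1]
    w = patch.sum()
    refined = (((xs * patch).sum() / w + 0.5) * stride,
               ((ys * patch).sum() / w + 0.5) * stride)
    return naive, refined
```

Even with centroid refinement, a peak smeared over several cells caps the attainable precision, which is the limitation described above.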

@TontonTremblay Thanks for your insights.

Yes, the downsampling to 50x50 probably degrades performance here. I plan to explore the "full" network to try to deal with this, at least initially.

Here are a few points that make me (and perhaps others) persist with DOPE:

  • relatively simple architecture,
  • the promise of real-time,
  • poses of multiple objects can be estimated simultaneously (although a separate model needs to be run per object, each model is ca. 200 MB, so in principle it should be possible to run 10-15 models in parallel on a SOTA GPU; see the sketch after this list),
  • good traction in the community (evident from the number of GitHub issues),
  • and GREAT support from you 😃
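On the parallel-models point, a minimal sketch of what I have in mind (PyTorch; the checkpoint names and the build_net hook are placeholders, not the repo's exact API):

```python
import torch

# Hypothetical per-object checkpoints; the repo ships one .pth per YCB object.
CKPTS = {"cracker": "cracker_60.pth", "soup": "soup_60.pth"}

def load_models(build_net, device="cuda"):
    """build_net() should construct the DOPE architecture (placeholder hook)."""
    models = {}
    for name, path in CKPTS.items():
        net = build_net()
        net.load_state_dict(torch.load(path, map_location=device))
        models[name] = net.to(device).eval()
    return models

@torch.no_grad()
def infer_all(models, frame):
    """Run every per-object model on one frame; returns outputs per object."""
    return {name: net(frame) for name, net in models.items()}
```

Actual throughput would of course depend on GPU memory and on whether the forward passes are batched or streamed.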

Pose refiners like diff-dope will help once DOPE does its job. But they are iterative, which slows things down.

Transformers may perform better but have pretty cumbersome architectures (difficult to understand and train). Also, I have doubts about how fast they are at inference time. From the FoundationPose paper:
"Intel i9-10980XE CPU and NVIDIA RTX 3090 GPU ... pose estimation takes about 1.3 s for one object, where pose initialization takes 4 ms, refinement takes 0.88 s, pose selection takes 0.42 s. Tracking runs much faster at ∼32 Hz, since only pose refinement is needed and there are not multiple pose hypotheses. In practice, we can run pose estimation once for initialization and switch to tracking mode for real-time performance."

So refinement takes ca. 1 s per object! I wonder how the "Tracking runs much faster at ∼32 Hz" claim relates to this ~1 s refinement. NVlabs/FoundationPose#29
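My back-of-the-envelope reading of those numbers (an assumption on my part; the paper does not spell this out here): estimation refines many pose hypotheses at once, while tracking refines a single hypothesis seeded from the previous frame, which is how ~32 Hz can coexist with ~1 s refinement.

```python
# Rough arithmetic under a naive linear-scaling assumption
# (the hypothesis count is assumed, not taken from the paper).
refine_all_s = 0.88      # refinement over all hypotheses during estimation
tracking_hz = 32.0

per_frame_tracking_s = 1.0 / tracking_hz          # ~0.031 s per frame
# If tracking equals one single-hypothesis refinement, the batched
# refinement cost would match the timings at roughly this many hypotheses:
implied_hypotheses = refine_all_s / per_frame_tracking_s
print(f"~{implied_hypotheses:.0f} hypotheses would match the timings")  # ~28
```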

When dealing with multiple objects in the scene, the FoundationPose author suggests running the model "sequentially" over each object (NVlabs/FoundationPose#5). This sounds far from real-time.

Thank you for your kind words about DOPE; I really appreciate the sentiment. I have always tried to seek simple and accessible solutions to problems. As for speed, someone on the TensorRT team back in the day made a demo of the current DOPE architecture running at 30 fps through some optimizations, but I never followed up on this.
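For anyone wanting to reproduce that kind of speedup today, the usual route would be an ONNX export followed by trtexec. A minimal sketch, assuming the DOPE network is a plain torch.nn.Module (input size and tensor names are placeholders):

```python
import torch

def export_onnx(net, path="dope.onnx", size=400):
    """Export the network to ONNX; trtexec can then build a TensorRT engine:
        trtexec --onnx=dope.onnx --saveEngine=dope.plan --fp16
    """
    net.eval()
    dummy = torch.randn(1, 3, size, size)  # placeholder input resolution
    torch.onnx.export(net, dummy, path,
                      input_names=["image"], output_names=["beliefs"],
                      opset_version=17)
```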

As the nature of research goes, people tend to seek newer and fancier solutions to the same problems, or fancier and newer problems, which is why this repo has not changed much in a couple of years, e.g., I have been focusing on other problems, or fancier solutions to similar problems.

The biggest limitation of DOPE is its training requirements (data and compute). Having you generate a dataset is complicated (we are working on a Blender solution, since nvisii is not well supported -- hopefully this will help remove some friction), but training the network for ~2 days is a big drawback. Could DINOv2 or another pretrained network help here by providing DOPE with a better init? Maybe something like LoFTR plus some fine-tuning on object data could help; I had written the outline of a short project that could do that (outline, initial_code).
I would love to find solutions to both of these problems.
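To sketch the DINOv2 idea (my own sketch, not the linked outline; the torch.hub entry point is real, but the head and shape handling are assumptions): freeze a pretrained ViT backbone and train only a small head that maps patch tokens to the 9 belief maps (8 cuboid corners + centroid), which could cut training time substantially.

```python
import torch
import torch.nn as nn

class DinoBeliefHead(nn.Module):
    """Frozen DINOv2 backbone + small trainable head predicting belief maps."""
    def __init__(self, n_keypoints=9):
        super().__init__()
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        for p in self.backbone.parameters():
            p.requires_grad = False  # train the head only
        self.head = nn.Sequential(
            nn.Conv2d(384, 128, 3, padding=1), nn.ReLU(),  # 384 = ViT-S/14 dim
            nn.Conv2d(128, n_keypoints, 1),
        )

    def forward(self, x):                        # x: (B, 3, H, W), H,W % 14 == 0
        tokens = self.backbone.forward_features(x)["x_norm_patchtokens"]
        b, n, c = tokens.shape
        side = int(n ** 0.5)                     # square input assumed
        fmap = tokens.transpose(1, 2).reshape(b, c, side, side)
        return self.head(fmap)
```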

I sadly do not have first-hand experience with FoundationPose; my understanding of it and its modalities is quite limited (a brief read of the paper 2 months ago). So I will trust your assessment. In the lab I tried the refiner from MegaPose last summer to compare with diff-dope, and when using MegaPose only on RGB, the real-world results were quite similar to DOPE, about 2 cm off the ground truth.

Anyway if you are interested in improving upon DOPE, e.g., architecture, for speed and training time, I would love to collaborate.

@TontonTremblay I do have some ideas for improvement, but I need to spend some more time with DOPE to understand its capabilities/limitations more comprehensively. I will write to you once I am ready.

BTW, I have quite some experience with both Blender and BlenderProc (see: https://github.com/wetoo-cando/blender-hoisynth). I am mainly interested in tracking objects while hands interact with them. Let me know if you have any questions.

These are really cool renders. Tracking is for sure a different problem than the detection DOPE is solving. BundleSDF, the MegaPose refiner, and FoundationPose all do very good tracking. CenterPoseTrack is also a work with a simplified architecture that can do tracking.