facebookresearch/votenet

training on kitti?

lucasjinreal opened this issue · 13 comments

Has anyone tried training on KITTI?

I am playing around with KITTI right now. After some tweaks to the backbone network architecture and some minor parameter changes in the voting module, I am at around 50 AP @ 0.25 IoU for Car and still improving after 100 epochs (Pedestrian ~25 AP, Cyclist ~10 AP), though I still need to do a proper eval with KITTI metrics later. What I see, however, is that the positive vote ratio (although growing very slowly) remains very low at about 0.005, probably due to the poor signal-to-noise ratio of the LiDAR data. I am curious whether BoxNet would perform better out of the box. I have not yet implemented any augmentations or scene in-painting for the KITTI dataset, have not touched the training params, am still on vanilla single-scale grouping layers in the backbone, and will probably test some kind of pre-segmentation to deal with the low positive vote rate. All in all, the very first results are promising.

@alar0330 Will you open-source it? Looking forward to your speed and visualization results.

Yes, I am planning to do that once the bulk of what I want to test is implemented.

@alar0330 When can we expect the open-source release of it on KITTI?

I don't think VoteNet would work well on KITTI but it's still worth a try.

The main contribution of VoteNet, the voting module, tries to solve the problem that there is sometimes too much empty space around the centers of objects. As the paper says:

As depth sensors only capture surfaces of objects, 3D object centers are likely to be in empty space, far away from any point. As a result, point based networks have difficulty aggregating scene context in the vicinity of object centers.

This problem usually occurs with indoor data like SUN-RGB-D because the objects are relatively large in comparison with the whole scene. But for outdoor data like KITTI, most objects are relatively tiny compared to the range of the LiDAR. So the voting module would probably bring no benefit.

What's more, the PointNet++ backbone would be too computationally expensive if the point cloud were not downsampled. Take SUN-RGB-D as an example: the original point cloud has around 350K points, and the backbone only takes a random 20K of them as input. For indoor scenarios such downsampling is fine because each object still has a decent number of points after the sampling. But for KITTI, aggressive downsampling would probably make some objects invisible, like a pedestrian with only a few points on the surface.
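To make the downsampling argument concrete, here is a minimal NumPy sketch of uniform random sampling to a fixed input size. The helper names `random_sample_points` and `expected_points_kept` are hypothetical and the scene is synthetic; the repository's own sampling utility may differ in detail.

```python
import numpy as np


def random_sample_points(points, num_points, rng=None):
    """Randomly pick `num_points` rows from an (N, C) point cloud.

    Samples without replacement when the cloud is large enough and
    with replacement otherwise, so the output size is always fixed.
    """
    rng = np.random.default_rng() if rng is None else rng
    replace = points.shape[0] < num_points
    idx = rng.choice(points.shape[0], size=num_points, replace=replace)
    return points[idx], idx


def expected_points_kept(object_points, total_points, num_points):
    """Average number of an object's points that survive uniform sampling."""
    return object_points * min(1.0, num_points / total_points)


# Synthetic stand-in (assumption) for an indoor scene with ~350K xyz points.
rng = np.random.default_rng(0)
scene = rng.uniform(-5.0, 5.0, size=(350_000, 3))
sampled, idx = random_sample_points(scene, 20_000, rng)
print(sampled.shape)  # (20000, 3)

# An object covered by only 40 points before sampling (made-up number).
print(expected_points_kept(40, 350_000, 20_000))
```

With these made-up numbers, an object that starts with 40 points keeps only about 2.3 points on average, which illustrates why thin, sparsely covered objects can effectively disappear after aggressive sampling.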

@lilanxiao Good point! For smaller objects like pedestrians or bicycles, VoteNet may not work well. But I'm curious how it performs at detecting cars/trucks at short range. Has anyone tried?

Well, it is true that VoteNet does not do a fine job on the KITTI dataset out of the box.

The problem with KITTI is not even the downsampling of the point cloud (in fact, after projecting the point cloud onto the image plane of the front camera, the resulting frustum has, on average, ~16300 points), but the sparsity and scale of the outdoor LiDAR scenes, which result in a low signal-to-noise ratio (i.e. the ratio of positive votes to negative ones). For example, the ratio of positive seeds to all seed points is about 0.5%. Actually, even BoxNet could struggle with such "slim" classes as Pedestrian and Cyclist, where the number of foreground points is rather small. The situation gets even worse for distant objects (the depth range of KITTI reaches beyond 70 meters along the forward axis), where some objects have only a few foreground points left.
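A rough way to measure this kind of ratio is to count the fraction of points that fall inside any ground-truth box. The sketch below is only an illustration under assumptions: boxes are given as `(cx, cy, cz, length, width, height, yaw)` in a z-up frame with yaw about the vertical axis, which is not the native KITTI label layout, and both function names are hypothetical.

```python
import numpy as np


def points_in_box(points, box):
    """Boolean mask of points inside one oriented box.

    box = (cx, cy, cz, length, width, height, yaw), yaw about the z-axis.
    This box convention is an assumption made for the example.
    """
    cx, cy, cz, length, width, height, yaw = box
    shifted = points[:, :3] - np.array([cx, cy, cz])
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotate by -yaw to express the points in the box frame.
    local_x = c * shifted[:, 0] + s * shifted[:, 1]
    local_y = -s * shifted[:, 0] + c * shifted[:, 1]
    local_z = shifted[:, 2]
    return (
        (np.abs(local_x) <= length / 2)
        & (np.abs(local_y) <= width / 2)
        & (np.abs(local_z) <= height / 2)
    )


def positive_ratio(points, boxes):
    """Fraction of points lying inside at least one ground-truth box."""
    mask = np.zeros(points.shape[0], dtype=bool)
    for box in boxes:
        mask |= points_in_box(points, box)
    return float(mask.mean())


# Four hand-picked points and one car-sized box centered at the origin.
pts = np.array([
    [0.0, 0.0, 0.0],   # inside
    [1.9, 0.0, 0.0],   # inside (within half-length 2.0)
    [2.1, 0.0, 0.0],   # outside along the length
    [0.0, 1.9, 0.0],   # outside along the width (half-width 1.0)
])
car = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0)
print(positive_ratio(pts, [car]))  # 0.5
```

On a real scene, the same function applied to the seed points and the labeled boxes would give a number comparable in spirit to the 0.5% quoted above, although the exact definition used in the training logs may differ.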

Some results are shown below:

image

As you can see, with some hyperparameter tuning and enhancements with multi-scale grouping (MSG) layers, one could squeeze out up to 55.2 AP for the Car class, 31.2 AP for Pedestrian, and 21.0 AP for Cyclist (and that is at IoU 0.25, not as "hard" as in the official KITTI 3D object detection leaderboard).
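To illustrate how much looser an IoU threshold of 0.25 is, here is a small sketch that computes the 3D IoU of two axis-aligned boxes. It ignores box heading, which a real evaluation would not, and the function name and example boxes are made up. For reference, the official KITTI benchmark requires 0.7 overlap for cars and 0.5 for pedestrians and cyclists.

```python
import numpy as np


def axis_aligned_iou_3d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, dx, dy, dz).

    Simplification (assumption): heading angles are ignored.
    """
    a = np.asarray(box_a, dtype=float)
    b = np.asarray(box_b, dtype=float)
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    # Per-axis overlap, clipped at zero for disjoint boxes.
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0.0, None)
    intersection = overlap.prod()
    union = a[3:].prod() + b[3:].prod() - intersection
    return float(intersection / union)


# A car-sized ground-truth box and a prediction shifted by 1.0 m and 0.3 m.
ground_truth = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5)
prediction = (1.0, 0.3, 0.0, 4.0, 2.0, 1.5)
iou = axis_aligned_iou_3d(ground_truth, prediction)
print(round(iou, 3))  # 0.468
print(iou >= 0.25, iou >= 0.7)  # True False
```

A prediction that is off by a full meter along the car's length still counts as a hit at 0.25 but would be a miss at 0.7.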

And here are some visualizations of the 3D-detections with VoteNet:

image

As you can see, when objects are relatively close to the LiDAR, we generally get better results since the number of foreground points is relatively large. The model also manages to predict well the oriented amodal 3D boxes of partially occluded objects, e.g. parked cars. The common failure mode, however, is related to distant objects, where the number of foreground points is substantially smaller. Furthermore, as seen in Fig. 4 (2-d), the model sometimes wrongly labels narrow tall objects like trees as pedestrians (but curiously gets the cyclist correctly predicted in the same scene). It also appears challenging for the model to correctly label multiple instances of the same object packed close to each other, e.g. Fig. 4 (1-d).

Hope that addresses the majority of the questions :)

@alar0330 Wow! People are having a really juicy discussion here! Thank you very much for sharing your work!
Yeah, the sparsity and scale of the outdoor LiDAR scenes are the problem. Perhaps that's why many leading works on the KITTI benchmark use Bird's Eye View or fusion with RGB images. And perhaps it's also the reason why the earlier Frustum PointNet uses a 2D detector to generate region proposals for its PointNet-based 3D detector.
BTW, have you tried removing the ground? I've noticed that a lot of seed points are on the ground. Removing them might improve the SNR. Or maybe one can try to concatenate some RGB features to the input points like this work does...
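One simple way to try ground removal is a basic RANSAC plane fit: repeatedly fit a plane through three random points, keep the plane with the most inliers, and drop those inliers. The sketch below is a bare-bones illustration on synthetic data, not something tested on KITTI; the function name, the 0.2 m threshold, and the iteration count are all assumptions. It also assumes the ground is the dominant plane in the scene, which may not hold next to large walls.

```python
import numpy as np


def remove_ground_ransac(points, distance_threshold=0.2, iterations=100, rng=None):
    """Drop points close to the dominant plane found by a basic RANSAC fit.

    Returns (non_ground_points, ground_mask). Threshold and iteration
    count are illustrative defaults, not tuned values.
    """
    rng = np.random.default_rng() if rng is None else rng
    xyz = points[:, :3]
    best_mask = np.zeros(xyz.shape[0], dtype=bool)
    for _ in range(iterations):
        p0, p1, p2 = xyz[rng.choice(xyz.shape[0], size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:  # degenerate (collinear) sample, try again
            continue
        normal /= norm
        # Perpendicular distance of every point to the candidate plane.
        distances = np.abs((xyz - p0) @ normal)
        mask = distances < distance_threshold
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return points[~best_mask], best_mask


# Synthetic scene (assumption): a noisy flat ground plus one raised obstacle.
rng = np.random.default_rng(0)
ground = np.column_stack([
    rng.uniform(0.0, 70.0, 2000),
    rng.uniform(-20.0, 20.0, 2000),
    rng.normal(0.0, 0.02, 2000),
])
obstacle = np.column_stack([
    rng.uniform(9.0, 11.0, 200),
    rng.uniform(-1.0, 1.0, 200),
    rng.uniform(1.0, 2.0, 200),
])
cloud = np.vstack([ground, obstacle])
kept, ground_mask = remove_ground_ransac(cloud, rng=rng)
print(cloud.shape[0], "->", kept.shape[0])
```

A plain threshold like this also removes the lowest points of objects standing on the ground (wheels, feet), so the threshold trades off leftover ground points against lost object points.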

Thanks for sharing your results and insights during the process, @alar0330! It's really interesting to see how it performs across the different categories and distance ranges. Seems to validate the intuition that the closer the objects the better it works. I was planning to experiment with a driving dataset as well. Is there any reason why you chose KITTI over another driving dataset (e.g. Waymo, nuScenes, Lyft L5)? Did you encounter any challenges training on KITTI? Would you mind sharing the script you use to prepare the dataset?

I agree with you, @lilanxiao, that filtering out ground points may improve results. Btw, not sure if you've seen this already, but the authors just released a new version of VoteNet that fuses point clouds with RGB: ImVoteNet.

@ilopezfr Thank you! I haven't seen it before.
You actually saved my life! In the last few weeks I was struggling with combining VoteNet with RGB features. This ImVoteNet makes my work look bad :( . What I did was a naive attempt to concatenate low-level RGB features from a 2D detector with the point cloud. I could only improve the result by about 1% mAP @ 0.25 IoU. Thankfully you showed me this paper before I put too much effort into this direction.
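For readers wondering what such a naive concatenation could look like, here is one possible sketch; it is not necessarily what was done here, and it is not how ImVoteNet works. It assumes the points are already in the camera frame, that `P` is a 3x4 projection matrix, and that `feature_map` is an `(H, W, C)` array of per-pixel features from some 2D network. All names and the toy calibration are hypothetical, and with real KITTI data the LiDAR points would first have to be transformed into the rectified camera frame using the calibration files.

```python
import numpy as np


def project_to_image(points_cam, P):
    """Project (N, 3) camera-frame points with a 3x4 matrix P.

    Returns pixel coordinates (N, 2) and depth (N,). Points with
    non-positive depth get placeholder pixels and must be masked out.
    """
    homogeneous = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])
    uvw = homogeneous @ P.T
    depth = uvw[:, 2]
    safe_depth = np.where(depth > 0, depth, 1.0)  # avoid division by zero
    uv = uvw[:, :2] / safe_depth[:, None]
    return uv, depth


def append_image_features(points_cam, feature_map, P):
    """Concatenate per-pixel features to each point that projects into the image.

    Points behind the camera or outside the image get zero features.
    """
    height, width, channels = feature_map.shape
    uv, depth = project_to_image(points_cam, P)
    cols = np.floor(uv[:, 0])
    rows = np.floor(uv[:, 1])
    visible = (depth > 0) & (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    features = np.zeros((points_cam.shape[0], channels), dtype=feature_map.dtype)
    features[visible] = feature_map[rows[visible].astype(int), cols[visible].astype(int)]
    return np.hstack([points_cam, features]), visible


# Toy calibration and feature map (assumptions, not KITTI values).
P = np.array([
    [100.0, 0.0, 50.0, 0.0],
    [0.0, 100.0, 50.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
feature_map = np.zeros((100, 100, 3), dtype=np.float32)
feature_map[50, 50] = [1.0, 2.0, 3.0]
points = np.array([
    [0.0, 0.0, 10.0],   # projects to pixel (50, 50)
    [0.0, 0.0, -5.0],   # behind the camera
    [20.0, 0.0, 10.0],  # projects outside the image
])
fused, visible = append_image_features(points, feature_map, P)
print(fused.shape)        # (3, 6)
print(visible.tolist())   # [True, False, False]
```

The `visible` mask is also what a frustum crop amounts to: keeping only `points[visible]` restricts the cloud to what the camera sees.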

@lilanxiao No problem! Glad it's helpful. :) Have you also implemented VoteNet on KITTI? I'd be curious to see your results. If so, is there a way I can reach out to ask you some questions about the implementation?

@ilopezfr Sorry, I haven't implemented it yet. I'm still working on SUN-RGB-D.

@lilanxiao @ilopezfr thanks guys for your feedback and all the suggestions.

It has been my project for the CS230 course at Stanford. However, since I've been taking it alongside my full-time job, I only had literally 4-5 weekends to digest all the PointNet papers, take apart the VoteNet scripts, and try to push through with my ideas. So there is still a lot of room for improvement.

I agree that in order to mitigate the poor signal-to-noise ratio in such large-scale outdoor scenes as KITTI, one could implement some kind of point filtering or foreground pre-segmentation step. I just did not have time to implement any of that. Removing the ground points might improve the performance overall, but I doubt that it will be enough. Consider scenes like the one below, where background noise is spread all over the vertical axis of the scene.

snapshot

I chose KITTI mainly because I could recycle much of the codebase for extracting, transforming, and visualizing KITTI scenes (i.e. data preparation) found across GitHub and move quickly on to experimenting with the deep learning part of the project. Besides that, other publicly available LiDAR datasets are, AFAIK, way heavier than KITTI (e.g. Waymo is about 1 TB), resulting in much slower experiment cycles. Training on ~3700 KITTI scenes already took about 9 h for 180 epochs on a V100. And that was already burning through the sponsored AWS credits quite quickly, you know ;)

By the way, my project work was recently published; you can find the final report here and the poster here.