In this project, my focus is on understanding how LiDAR and camera systems are combined, bridging 2D and 3D data, and gaining a deep understanding of the advanced Sensor Fusion algorithms used by top self-driving car companies.
In this project, I practiced some of the most advanced skills that Self Driving Car Engineers use to fuse LiDARs with cameras. LiDAR is showing up in more and more robotics startups, and merging it with a camera is really cool!
Note: There is no such thing as a "Self Driving Car Engineer". There are Perception Engineers, Sensor Fusion Engineers, Control Engineers, and so on.
In this project, I built my understanding around the three most important aspects of Sensor Fusion:
- Sensors and Sensor Fusion
- Early Fusion
- Late Fusion
My focus here is on Cameras and LiDARs (IMAGES and POINT CLOUDS), which are used by almost every Self Driving Car Company.
Today, cameras are very scalable and are integrated into every Self Driving Car.
In the picture above, which is taken from here, you can see Tesla's Autopilot. Notice that we have 3 cameras: one on the left, one on the right, and one in the center. The side cameras generally form stereo pairs, which makes it easy to estimate the distance to an obstacle.
Another example is my OpenCV AI Kit with Depth (OAK-D), a stereo camera architected by the late Brandon Gilles, founder of the robotic vision startup Luxonis. Just notice the Object Detection and Depth Perception (X, Y, Z) we get with the OAK-D.
LiDAR (Light Detection and Ranging)
In the picture above, which is taken from here, you can see Waymo's autonomous vehicle.
So, this is how LiDAR sensors work:
- They send out a beam of light.
- This light beam hits an object and bounces back.
- LiDAR measures how long it takes for the light to bounce back.
- Using the time it takes (time-of-flight), LiDAR calculates the distance to the object.
So, LiDAR uses the time it takes for light to hit something and come back to figure out how far away that thing is. This helps create detailed maps of the surroundings and is used in technologies like autonomous cars and mapping. The picture above is taken from here.
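As a quick worked example, the distance follows directly from the time of flight. Here is a minimal sketch in Python (the 66.7 ns value is just an illustrative number, not a measurement from a real sensor):

```python
# Speed of light in meters per second
SPEED_OF_LIGHT = 299_792_458.0

def distance_from_time_of_flight(round_trip_time_s: float) -> float:
    """Convert a round-trip time of flight (seconds) into a distance (meters).

    The beam travels to the object and back, so we divide by 2.
    """
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

# A return received after ~66.7 nanoseconds corresponds to roughly 10 meters.
print(distance_from_time_of_flight(66.7e-9))  # ~10.0
```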
LiDARs generate point clouds. Point clouds are collections of points, and each point has an XYZ position. This allows us to accurately know the depth and distance of every obstacle or object in 3D space.
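To make this concrete, here is a minimal sketch of what a point cloud looks like in code, assuming a KITTI-style Velodyne `.bin` scan (a flat buffer of float32 values stored as x, y, z, reflectance); the file name is just an example:

```python
import numpy as np

# A KITTI-style Velodyne scan: float32 values laid out as (x, y, z, reflectance)
scan = np.fromfile("000000.bin", dtype=np.float32).reshape(-1, 4)

points_xyz = scan[:, :3]                      # N x 3 array of 3D positions in meters
ranges = np.linalg.norm(points_xyz, axis=1)   # Euclidean distance of every point to the sensor

print(points_xyz.shape, ranges.min(), ranges.max())
```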
Let's take a look at the following video to understand more about point clouds. I captured these point clouds with a Basler Blaze 101 Time-of-Flight camera. To learn more about this, check my project Safe Rail.
A self-driving car typically has:
Multiple Cameras: These cameras have different angles and capabilities. Some can see far with a narrow view, while others have a wider view but shorter range.
LiDARs: These sensors detect objects around the car and precisely estimate their 3D positions.
To perform early fusion, I followed three steps (a small sketch of the fusion step follows the list):
- Project the Point Clouds (3D) to the Image(2D)
- Detect Obstacles in 2D (Camera)
- Fuse the Results
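The third step is where the fusion actually happens: once the LiDAR points are projected into the image, we keep the points that fall inside each 2D bounding box and use them to estimate the obstacle's distance. A minimal sketch, assuming the points have already been projected (the names and shapes are my own, not from a specific library):

```python
import numpy as np

def fuse_box_with_points(box_xyxy, pixels, depths):
    """Estimate an obstacle's distance from the LiDAR points projected inside its 2D box.

    box_xyxy : (x1, y1, x2, y2) bounding box from the 2D detector
    pixels   : N x 2 array of projected LiDAR points in image coordinates (u, v)
    depths   : N array of corresponding depths in the camera frame (meters)
    """
    x1, y1, x2, y2 = box_xyxy
    inside = (
        (pixels[:, 0] >= x1) & (pixels[:, 0] <= x2) &
        (pixels[:, 1] >= y1) & (pixels[:, 1] <= y2)
    )
    if not inside.any():
        return None  # no LiDAR return inside this box
    # The median is robust to points that pass through the box and hit the background.
    return float(np.median(depths[inside]))
```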
So, here is the Self Driving Car setup from the KITTI Vision Benchmark Suite. It works well for learning Sensor Fusion.
This Self Driving Car setup has one Velodyne LiDAR (HDL-64E laser scanner) and a total of 4 cameras: 2 color cameras and 2 grayscale cameras, with one color and one grayscale camera on the left side and one of each on the right side. The sensors' orientations are not the same, their coordinate systems are different, and their positions are different too. https://velodynelidar.com/blog/hdl-64e-lidar-sensor-retires/
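KITTI ships the calibration between these sensors as plain-text files (e.g. `calib/000000.txt` in the object-detection split), with one line per matrix: `P2`, `R0_rect`, `Tr_velo_to_cam`, and so on. A minimal parser sketch, assuming that format (the path is just an example):

```python
import numpy as np

def read_kitti_calib(path):
    """Read a KITTI calibration file into a dict of flat NumPy arrays keyed by matrix name."""
    calib = {}
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue
            key, values = line.split(":", 1)
            calib[key.strip()] = np.array([float(v) for v in values.split()])
    return calib

calib = read_kitti_calib("calib/000000.txt")
P2 = calib["P2"].reshape(3, 4)                          # projection matrix of the left color camera
R0 = calib["R0_rect"].reshape(3, 3)                     # stereo rectification rotation
Tr_velo_to_cam = calib["Tr_velo_to_cam"].reshape(3, 4)  # Velodyne -> camera extrinsics
```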
Projecting a LiDAR Point (3D) in a Camera Image (2D)
Consider an image and a point cloud; this is the output of the projection!
To do this projection, we need information about the physical position of the sensors, and we need to know how to convert from one coordinate frame to another.
So here is the list of sensors whose data we are using:
- 1 X Velodyne HDL-64E Laserscanner
- 4 X FLIR Point Cameras
The LiDAR and the cameras see the world differently and aren't in the same spot: we're seeing an obstacle from two different positions in two different frames. We'll therefore need to convert the point seen by the LiDAR to the camera space, and then to the image space!
Projection Formula: Here is the formula to convert a point X in 3D into a point Y in 2D.
The formula above depends heavily on our sensor setup.
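Written out in homogeneous coordinates, the same formula looks like this (the division by the last component at the end recovers pixel coordinates):

$$
Y = P \cdot R_0 \cdot [R|t] \cdot X,
\qquad
X = \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}_{\text{LiDAR}},
\qquad
Y = \begin{bmatrix} s\,u \\ s\,v \\ s \end{bmatrix}
\;\Rightarrow\;
(u, v) = \left(\tfrac{Y_1}{Y_3},\ \tfrac{Y_2}{Y_3}\right)
$$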
P:
A camera produces images by capturing light from the environment through its lens and converting it into a digital image. When you buy a camera, it typically comes pre-calibrated, meaning it's set up to produce images accurately without requiring you to adjust complex settings right away. This calibration ensures that the images you take are properly exposed and in focus.
Calibration is the process of teaching your camera how to translate a point in the real world (3D) into a pixel on the camera sensor. The intrinsic calibration matrix, P, is a critical component in this process. It helps the camera understand the relationship between the 3D world and the 2D image it captures.
In the matrix above, f is the focal length (fx, fy) and c is the optical center (cx, cy). This matrix is used to transform from the camera's perspective (camera frame) to the image's perspective (image frame).
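For reference, the standard pinhole form of this intrinsic matrix is shown below (in KITTI, `P2` is actually stored as a 3×4 projection matrix that embeds these intrinsics plus a small offset for the camera's position relative to the reference camera):

$$
K = \begin{bmatrix}
f_x & 0 & c_x \\
0 & f_y & c_y \\
0 & 0 & 1
\end{bmatrix}
$$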
R0:
In stereo vision, the aim is to make the left and right images line up perfectly, so that you can draw a horizontal line across them and find the same objects aligned with each other, as shown below.
We refer to that horizontal line as the "Epipolar Line."
In Stereo Vision, the matrix R0 aligns the left and right images, making them match perfectly. This step isn't necessary with a single camera, but it's crucial for stereo vision to work correctly.
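In the KITTI calibration, R0 is provided as the 3×3 rotation `R0_rect`. To chain it with the other matrices in homogeneous coordinates (as in the code further below), it is typically padded to 4×4:

$$
R_0^{(4\times4)} =
\begin{bmatrix}
R_{0\,(3\times3)} & \mathbf{0} \\
\mathbf{0}^\top & 1
\end{bmatrix}
$$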
R|t:
R|t is the transformation from the Velodyne frame of reference to the Camera frame of reference. It describes how data measured in the Velodyne's coordinate system can be translated and rotated to align with the coordinate system of the Camera.
In the picture above, the Velodyne LiDAR coordinate system is different from the Camera coordinate system.
So we have to go from the Velodyne coordinate system to the Camera coordinate system. Turning the axes of one frame to match the other is called rotation, and physically moving the origin of the coordinate system is called translation.
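Putting the two together, the rotation R and the translation t are stacked into a single extrinsic matrix (KITTI stores this as the 3×4 `Tr_velo_to_cam`):

$$
[R|t] =
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_x \\
r_{21} & r_{22} & r_{23} & t_y \\
r_{31} & r_{32} & r_{33} & t_z
\end{bmatrix},
\qquad
X_{\text{cam}} = [R|t]
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}_{\text{LiDAR}}
$$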
Keypoints:
- To project a point obtained from a LiDAR into an image, we need to understand clearly how our sensors are positioned and what their coordinate systems are.
- The projection formula is Y = P * R0 * R|t * X
- R|t is the matrix that converts a point from the Velodyne frame to the camera frame. R0 is the matrix that rectifies the stereo cameras. P is the intrinsic calibration matrix that takes a point from the camera frame to the image (pixel) frame.
Applying the Projection Formula:
Projecting a 3D Point to 2D Image
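Here is a minimal NumPy sketch of the full chain, assuming the KITTI matrices loaded earlier (`P2`, `R0`, `Tr_velo_to_cam`); the function and variable names are my own, not the project's exact code:

```python
import numpy as np

def project_velo_to_image(points_velo, P2, R0, Tr_velo_to_cam):
    """Project N x 3 Velodyne points into pixels using Y = P * R0 * [R|t] * X.

    Returns an M x 2 array of (u, v) pixels and an M array of camera-frame depths,
    keeping only the points located in front of the camera.
    """
    n = points_velo.shape[0]
    X = np.hstack([points_velo, np.ones((n, 1))])   # homogeneous LiDAR points, N x 4

    # Pad R0 (3x3) and Tr_velo_to_cam (3x4) to 4x4 so the whole chain can be multiplied.
    R0_h = np.eye(4)
    R0_h[:3, :3] = R0
    Tr_h = np.eye(4)
    Tr_h[:3, :4] = Tr_velo_to_cam

    # Apply Y = P * R0 * [R|t] * X to all points at once.
    Y = (P2 @ R0_h @ Tr_h @ X.T).T                  # N x 3 homogeneous pixel coordinates
    depths = Y[:, 2]
    in_front = depths > 0                           # discard points behind the camera
    pixels = Y[in_front, :2] / depths[in_front][:, None]   # perspective division
    return pixels, depths[in_front]
```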
This is what we do in object detection: given an image, we output bounding boxes, and nothing more.
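For the 2D part, any off-the-shelf detector that returns boxes will do. A minimal sketch assuming the `ultralytics` YOLO package and a pretrained COCO model (not necessarily the exact detector used in this project; the image path is just an example):

```python
from ultralytics import YOLO

# Load a small pretrained COCO detector; any model that outputs 2D boxes works here.
model = YOLO("yolov8n.pt")

results = model("image_2/000000.png")[0]        # run detection on one image
boxes_xyxy = results.boxes.xyxy.cpu().numpy()   # N x 4 boxes as (x1, y1, x2, y2)
scores = results.boxes.conf.cpu().numpy()       # detection confidences
classes = results.boxes.cls.cpu().numpy()       # class indices (COCO labels)
```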
Implementation of Object Detection and Sensor Fusion
To perform late fusion, I followed five steps (a sketch of the box-fusion step follows the list):
- Detecting objects in 2D
- Detecting objects in 3D
- Projecting the 3D Bounding Box in the Image
- Fusing the Bounding Boxes
- Building the Ultimate 3D Object
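The fusion step matches each 2D camera box with the projected 3D LiDAR box it overlaps the most, using Intersection over Union (IoU). A minimal sketch; the box format and the threshold are assumptions, not the project's exact implementation:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def fuse_boxes(camera_boxes, projected_lidar_boxes, iou_threshold=0.3):
    """Greedily match every 2D camera box to the projected 3D box with the highest IoU."""
    matches = []
    for i, cam_box in enumerate(camera_boxes):
        ious = [iou(cam_box, lidar_box) for lidar_box in projected_lidar_boxes]
        if ious and max(ious) >= iou_threshold:
            j = int(np.argmax(ious))
            matches.append((i, j))  # camera box i and LiDAR box j describe the same object
    return matches
```

Each matched pair can then keep the class label from the camera detection and the 3D position and size from the LiDAR box, which is what the last step, building the final 3D object, is about.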