UAV Navigation in 3D Gaussian Splatting Environments using ML-Agents

Plan of Action

  1. Pre-requisites
  2. NeRF in a nutshell
  3. 3D Gaussian Splatting

2. NeRF in a nutshell

In a previous project - Assessing NeRF's Efficacy in 3D Model Reconstruction: A Comparative Analysis with Blender - I explained the mechanics behind NeRF and how to train a NeRF from scratch. In this section, I will briefly recap the workings of NeRF.

In a nutshell, NeRF reconstructs a 3D scene and synthesizes photorealistic novel views from a series of 2D images. The scene is represented by a continuous neural radiance field, which is parameterized by a neural network.

But how is this different from traditional methods? Well, traditional 3D reconstruction methods such as point clouds, voxel grids, and meshes are discrete representations.

  1. Point clouds: Each point in 3D space has its own set of (x, y, z) coordinates.
  2. Voxel grids: Represent the 3D geometry as a regular grid of cubic voxels (volumetric pixels), where each voxel is either occupied or empty.
  3. Meshes: A mesh consists of a set of vertices (points in 3D space), edges (lines connecting vertices), and faces (typically triangles formed by vertices and edges).

In all three of these methods, the continuous 3D geometry is approximated by a finite set of discrete elements: points, voxels, or vertices and faces.
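To make the contrast concrete, here is a minimal NumPy sketch of the three discrete representations; the shapes and values are purely illustrative:

```python
import numpy as np

# Point cloud: N points, each with its own (x, y, z) coordinates.
points = np.array([[0.0, 0.0, 0.0],
                   [0.5, 0.2, 1.0],
                   [1.0, 1.0, 1.0]])          # shape (N, 3)

# Voxel grid: a regular 32^3 grid where each cell is occupied or empty.
voxels = np.zeros((32, 32, 32), dtype=bool)
voxels[10:20, 10:20, 10:20] = True            # mark a cubic region as occupied

# Mesh: vertices (points in 3D) plus faces (index triples forming triangles).
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])        # shape (V, 3)
faces = np.array([[0, 1, 2]])                 # shape (F, 3), indices into vertices
```

However finely we sample, each of these stores a finite list of elements; NeRF instead stores a function that can be queried at any point in space.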

In NeRF, the 3D scene is represented by a continuous function that maps a 3D position coordinate (x, y, z) and a viewing direction (θ, φ) to a color (RGB) and a density value. This function is modeled by a neural network trained on a series of 2D images and their corresponding camera poses (intrinsic and extrinsic parameters).
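Below is a minimal PyTorch sketch of that mapping, just to show the function signature. The real NeRF applies a positional encoding to its inputs, uses a deeper MLP with a skip connection, and conditions density on position only; the `TinyNeRF` name and layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Toy sketch of the NeRF mapping: (x, y, z, theta, phi) -> (RGB, density)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),   # 3 position + 2 direction inputs
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 4)       # 3 color channels + 1 density

    def forward(self, xyz, view_dir):
        out = self.head(self.trunk(torch.cat([xyz, view_dir], dim=-1)))
        rgb = torch.sigmoid(out[..., :3])      # colors squashed into [0, 1]
        sigma = torch.relu(out[..., 3:])       # density must be non-negative
        return rgb, sigma

model = TinyNeRF()
xyz = torch.rand(1024, 3)        # batch of 3D sample positions
view = torch.rand(1024, 2)       # (theta, phi) viewing directions
rgb, sigma = model(xyz, view)    # (1024, 3) colors, (1024, 1) densities
```

In the actual paper, color also depends on the viewing direction while density does not, which is what lets NeRF capture view-dependent effects such as specular highlights.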

Again, how is this different from the traditional rendering of computer graphics?

  1. Rasterization: We take 3D meshes and convert them into 2D pixels. The meshes are effectively projected onto the 2D image plane and filled in with color values.
  2. Ray Tracing: Ray tracing calculates how light interacts with objects in a 3D scene. Rays are traced from the viewer's eye through each pixel on the screen, accounting for complex interactions such as reflections and refractions.

With NeRF, we reconstruct the scene inside a neural network: the network's weights are an encoded representation of the actual scene. NeRF therefore renders images by querying the network at various points along camera rays and accumulating the returned color and density values. Below depicts the process of sampling along a ray through a scene - Milo the lazy cat.

In ray tracing, we send out a ray from the camera through each pixel on the screen; when that ray hits an object in the 3D scene, we calculate the pixel's color by simulating how light bounces off and interacts with the object's surface. With NeRF, by contrast, we have a trained renderer: the entire multi-camera scene is represented as a neural network. We still send out camera rays and sample multiple points along each ray, but at each point the network is queried with the 3D position and viewing direction to obtain color and density values, which are then accumulated to produce the final pixel color.
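That accumulation step is the discrete volume rendering sum from the original NeRF paper. For N samples along a ray r, with color c_i, density σ_i, and spacing δ_i at sample i:

```latex
C(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right)\mathbf{c}_i,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)
```

Here T_i is the transmittance, the fraction of light that survives to reach sample i; once the ray passes through dense material, T_i drops toward zero and samples behind it stop contributing.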

As shown below, the first sample lands in empty space, so the density is 0 and the sample is drawn as white (255, 255, 255). We continue to the next sample, check the density, and set the color. At the 7th sample we hit the object and get a density of 1, so we set the sample to the object's color.
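Here is a minimal NumPy sketch of that walk along a single ray, implementing the sum above. The `field(points, direction)` function is a stand-in for the trained network, and the toy red sphere, near/far bounds, and sample count are all illustrative assumptions.

```python
import numpy as np

def render_ray(field, origin, direction, near=2.0, far=6.0, n_samples=64):
    """Walk along one camera ray and alpha-composite the samples.
    `field(points, direction)` returns per-sample RGB in [0, 1]
    and a non-negative density sigma."""
    t = np.linspace(near, far, n_samples)              # sample depths along the ray
    points = origin + t[:, None] * direction           # (n_samples, 3) positions
    rgb, sigma = field(points, direction)              # query the "network"

    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2])) # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)               # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha)))[:-1]  # light reaching each sample
    weights = trans * alpha                            # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)        # final pixel color

# Toy field: a solid red sphere of radius 1 at the origin.
def toy_field(points, direction):
    inside = np.linalg.norm(points, axis=-1) < 1.0
    sigma = np.where(inside, 10.0, 0.0)                # dense inside, vacuum outside
    rgb = np.tile([1.0, 0.0, 0.0], (len(points), 1))   # constant red
    return rgb, sigma

color = render_ray(toy_field, origin=np.array([0.0, 0.0, -4.0]),
                   direction=np.array([0.0, 0.0, 1.0]))
print(color)  # close to [1, 0, 0]: the ray hits the sphere and picks up its color
```

Note how an empty-space sample (sigma = 0) gets alpha = 0 and contributes nothing, matching the vacuum samples above, while a dense hit pushes alpha toward 1 and quickly extinguishes the transmittance for everything behind it.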

When we shoot rays from multiple camera positions, the rays intersect, as shown below. Through triangulation, we can figure out where our object sits in 3D space. This is why NeRF typically needs around 200 or more images to build a good 3D representation of a scene.

The problem with NeRF is that it is too slow for real-time rendering, mainly because:

  • To render an image, NeRF needs to sample multiple points along each camera ray passing through the pixels.
  • For each sampled point along a ray, NeRF must evaluate the neural network to obtain the color and density values (see the back-of-the-envelope sketch after this list).
  • NeRF does not rely on meshes or polygons, which graphics hardware can process efficiently.
  • Unlike traditional rasterization, which leverages hardware acceleration on GPUs, NeRF uses volume rendering to accumulate the color and density values sampled along each ray, and this accumulation is computationally costly.
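To see why this adds up, here is a back-of-the-envelope count of network evaluations for a single frame, using the image size and per-ray sample counts (64 coarse + 128 fine) from the original NeRF paper:

```python
# Back-of-the-envelope cost of one NeRF frame:
# an 800x800 image with 64 coarse + 128 fine samples per ray.
width, height = 800, 800
samples_per_ray = 64 + 128
queries = width * height * samples_per_ray
print(f"{queries:,} MLP evaluations per frame")   # 122,880,000
```

Over a hundred million MLP evaluations per frame is why vanilla NeRF takes on the order of seconds to minutes per image, and why rasterization-friendly representations such as 3D Gaussian Splatting are attractive for real-time use.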

3. 3D Gaussian Splatting

