Learning 3D Vision with Inverse Graphics - Part I

Plan of Action

  1. Meshing Around
  2. Single View to 3D
  3. Occupancy Network

1. Meshing Around

To define a mesh, let's start with a point cloud, which is an unordered set of points {p_1, p_2, ..., p_N}. When we represent a 3D model with a point cloud, such as the red sphere shown below, we have no explicit connectivity information. So how do we know whether a point lies inside or outside the surface? With points alone we cannot, which is why we need connectivity, and meshes provide it.

Meshes are piecewise linear approximations of the underlying surface, which means they are discrete parametrizations of a 3D scene. We start from our point cloud, whose points are now called vertices, and join them by edges to form faces. Thus, we establish connectivity by using 3 vertices to make a face. So now we ask the question again: how do we know if a point lies inside or outside the surface? It turns out that we can now answer this question thanks to the "watertight" property of meshes. That is, if we filled the mesh with water, we would have no leakage. Therefore, if our mesh is watertight, we can indeed define "inside" and "outside".

Let's build our mesh with a base triangular polygon. We need to establish the vertices in x,y,z coordinates in a [3, 3] tensor and our faces in a [1, 3]. Note that the elements in the face tensor are just the indices of the vertices tensor. However, PyTorch3D expects our tensor to be batched so we unsqueeze them later to become [1, 3, 3] and [1, 1, 3] respectively. We then use pytorch3d.structures.Meshes to create our mesh. The MeshGifRenderer class has a function to render our mesh from multiple viewpoints.

# Triangle Mesh
vertices = torch.tensor([[-1, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=torch.float32)
faces = torch.tensor([[0, 1, 2]], dtype=torch.int64)
filename = "triangle_mesh.gif"
num_views = 30
triangle_mesh = MeshGifRenderer(vertices=vertices, faces=faces)
triangle_mesh.gif_renderer(filename=filename, num_views=num_views)

1.1 Building mesh by mesh

Now that we have built a triangular mesh, we can use it as a base to create more complex 3D models such as a cube. Note that we need two triangles to represent one square face of the cube. Our cube will have 8 vertices and 12 triangular faces. Below is a step-by-step view of joining all 12 faces to form the final cube:

[Figure: square_mesh_0 to square_mesh_11, the cube after adding each of the 12 triangular faces]
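
To tie the cube back to the "watertight" property, here is a minimal sketch in plain Python (no PyTorch3D needed) that builds the 8 vertices and 12 triangles and checks that every edge is shared by exactly two faces, which is a common necessary condition for a closed mesh. The vertex ordering and the `quads` list are my own choices for illustration, and the winding order of the triangles is not made consistent here, which matters for rendering but not for the edge count.

```python
from collections import Counter
from itertools import product

# 8 cube corners; the index of corner (x, y, z) is 4*x + 2*y + z.
vertices = [tuple(float(c) for c in corner) for corner in product((0, 1), repeat=3)]

# 6 square sides, each listed as a loop of 4 corner indices (illustrative ordering).
quads = [
    (0, 1, 3, 2), (4, 5, 7, 6),  # x = 0, x = 1
    (0, 1, 5, 4), (2, 3, 7, 6),  # y = 0, y = 1
    (0, 2, 6, 4), (1, 3, 7, 5),  # z = 0, z = 1
]

# Split every square along one diagonal into two triangles.
faces = []
for a, b, c, d in quads:
    faces.append((a, b, c))
    faces.append((a, c, d))

# Count how many triangles use each undirected edge.
edge_counts = Counter()
for face in faces:
    for i in range(3):
        edge = tuple(sorted((face[i], face[(i + 1) % 3])))
        edge_counts[edge] += 1

# In a closed triangle mesh, every edge is shared by exactly two faces.
is_watertight = all(count == 2 for count in edge_counts.values())
print(len(vertices), len(faces), len(edge_counts), is_watertight)  # 8 12 18 True
```

Removing any one of the 12 triangles leaves three edges with a single adjacent face, so the same check would then report a leak.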

1.1 Render Mesh with Texture

Although we showed how our 3D models are made up of triangular meshes, we jumped ahead when it came to rendering a mesh. Now let's look step by step at how we can import a .obj file with its texture from a .mtl file and render it.

1.1.1 Load data

We first start by loading our data using the load_obj function from pytorch3d.io. This returns the vertices of shape [N_v, 3], the face_props tuple which contains the vertex indices (verts_idx) of shape [N_f, 3] and texture indices (textures_idx) of similar shape [N_f, 3], and the aux tuple which contains the uv coordinate per vertex (verts_uvs) of shape [N_t, 2].

vertices, face_props, aux = load_obj(data_file)
print(vertices.shape) #[N_v, 3]

faces = face_props.verts_idx #[N_f, 3]
faces_uvs = face_props.textures_idx #[N_f, 3]

verts_uvs = aux.verts_uvs #[N_t, 2]

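Note that all PyTorch3D elements need to be batched. Unsqueezing adds the leading batch dimension; the snippets below call unsqueeze(0) inline wherever a batched tensor is needed.

print(vertices.unsqueeze(0).shape)  # [1, N_v, 3]
print(faces.unsqueeze(0).shape)  # [1, N_f, 3]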

1.1.2 Load Texture

PyTorch3D mainly supports 3 texture formats: TexturesUV, TexturesVertex and TexturesAtlas. TexturesVertex stores one color per vertex. TexturesUV stores a uv coordinate for each corner of a face, together with a texture map (an image) for the whole mesh. TexturesAtlas stores a small texture map per face. The 3D object file (.obj) points to the material file (.mtl), and the material file points to the texture file (.png). So if we only have a .obj file, we can still render our mesh using a texture of our choice:

texture_rgb = torch.ones_like(vertices.unsqueeze(0)) # [1 x N_v X 3]
texture_rgb = texture_rgb * torch.tensor([0.7, 0.7, 1])

We use TexturesVertex to define a texture for the rendering:

textures = pytorch3d.renderer.TexturesVertex(texture_rgb)

However if we do have a texture map, we can load it as a normal image and visualize it:

texture_map = plt.imread("cow_texture.png") #(1024, 1024, 3)
plt.imshow(texture_map)
plt.show()

We then use TexturesUV, which is an auxiliary data structure for storing vertex uv coordinates and texture maps for meshes.

textures = pytorch3d.renderer.TexturesUV(
                        maps=torch.from_numpy(texture_map).unsqueeze(0),  # [1, H, W, 3]
                        faces_uvs=faces_uvs.unsqueeze(0),
                        verts_uvs=verts_uvs.unsqueeze(0)).to(device)

1.1.3 Create Mesh

Next, we create an instance of a mesh using pytorch3d.structures.Meshes. Our arguments are the vertices and faces batched, and the textures.

meshes = pytorch3d.structures.Meshes(
    verts=vertices.unsqueeze(0), # batched tensor or a list of tensors
    faces=faces.unsqueeze(0),
    textures=textures)

1.1.4 Position a Camera

We want to be able to generate images of our 3D model so we set up a camera. Below are the 4 coordinate systems for 3D data:

  1. World Coordinate System: The environment where the object or scene exists.
  2. Camera View Coordinate System: Originates at the image plane with the Z-axis perpendicular to this plane, and orientations are such that +X points left, +Y points up, and +Z points outward. A rotation (R) and translation (T) transform this from the world system.
  3. NDC (Normalized Device Coordinate) System: Normalizes the coordinates within a view volume, with specific mappings for the corners based on aspect ratios and the near and far planes. This transformation uses the camera projection matrix (P).
  4. Screen Coordinate System: Maps the view volume to pixel space, where (0,0) and (W,H) represent the top left and bottom right corners of the viewable screen, respectively.

Image source: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

We use the pytorch3d.renderer.FoVPerspectiveCameras class to create a camera. Our 3D object lives in world coordinates and we want to visualize it in image coordinates. We first need a rotation matrix and a translation vector to build the extrinsic matrix of the camera; the intrinsic matrix will be supplied by PyTorch3D.

R = torch.eye(3).unsqueeze(0) # [1, 3, 3]
T = torch.tensor([[0.0, 0.0, 3.0]]) # [1, 3], float to match R

cameras = pytorch3d.renderer.FoVPerspectiveCameras(
    R=R,
    T=T,
    fov=60,
    device=device)

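Below we have the extrinsic matrix, which combines the rotation matrix and the translation vector in homogeneous coordinates. PyTorch3D multiplies points as row vectors on the left of the matrix, which is why the translation appears in the last row rather than the last column.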

transform = cameras.get_world_to_view_transform()
print(transform.get_matrix()) # [1, 4, 4]
tensor([[[ 1.,  0.,  0.,  0.],
         [ 0.,  1.,  0.,  0.],
         [ 0.,  0.,  1.,  0.],
         [ 0.,  0., 3.,  1.]]], device='cuda:0')
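
As a small worked example of what this matrix does, the sketch below applies the world-to-view transform to single points in plain Python, using the row-vector form X_view = X_world R + T. The helper name `world_to_view` is hypothetical and only illustrates the arithmetic; it is not part of PyTorch3D.

```python
def world_to_view(point, R, T):
    """Apply X_view = X_world @ R + T to one 3D point (row-vector convention)."""
    return tuple(
        sum(point[k] * R[k][j] for k in range(3)) + T[j]
        for j in range(3)
    )

R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation
T = [0.0, 0.0, 3.0]  # same translation as in the camera above

origin_in_view = world_to_view((0.0, 0.0, 0.0), R, T)
corner_in_view = world_to_view((1.0, 0.0, 0.0), R, T)
print(origin_in_view)  # (0.0, 0.0, 3.0)
print(corner_in_view)  # (1.0, 0.0, 3.0)
```

With the identity rotation, the world origin ends up 3 units in front of the camera along +Z, which matches the 3 in the last row of the printed matrix.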

In my project Pseudo-LiDARs with Stereo Vision, I explain more about the camera coordinate system.

Now when rendering an image, we may find that our rendered image is blank (white) because the camera is not facing our mesh. We have 2 solutions for this: move the mesh or move the camera.

We rotate our mesh 90 degrees clockwise. Notice how the camera is always facing towards the x-axis.

relative_rotation = pytorch3d.transforms.euler_angles_to_matrix(torch.tensor([0, np.pi/2, 0]), "XYZ") # [3, 3]
vertices_rotate = vertices @ relative_rotation # [N_v, 3]

[Figure: the rendered mesh before rotation (left) and after rotation (right)]

Or we rotate the camera. Notice how the camera is now facing towards the z-axis:

relative_rotation = pytorch3d.transforms.euler_angles_to_matrix(torch.tensor([0, np.pi/2, 0]), "XYZ") # [3, 3]
R_rotate = relative_rotation.unsqueeze(0) # [1, 3, 3]

[Figure: the rendered view before rotating the camera (left) and after rotating the camera (right)]

1.1.5 Create a renderer

To create a renderer we need a rasterizer, which determines which triangles correspond to a given pixel, and a shader, which determines how the pixel should be colored given the triangle, texture, lighting, etc.

image_size = 512

# Rasterizer
raster_settings = pytorch3d.renderer.RasterizationSettings(image_size=image_size)
rasterizer = pytorch3d.renderer.MeshRasterizer(
    raster_settings=raster_settings)

# Shader
shader = pytorch3d.renderer.HardPhongShader(device=device)
# Renderer
renderer = pytorch3d.renderer.MeshRenderer(
    rasterizer=rasterizer,
    shader=shader)

1.1.6 Set up light

Our image will be pretty dark if we do not set up a light source in our world.

lights = pytorch3d.renderer.PointLights(location=[[0, 0, -3]], device=device)

1.1.7 Render Mesh

image = renderer(meshes, cameras=cameras, lights=lights)
plt.imshow(image[0].cpu().numpy())
plt.show()

1.2 Rendering Generic 3D Representations

1.2.1 Rendering Point Clouds from RGB-D Images

Our dataset contains 3 images of the same plant. We have the RGB image, a depth map, a mask, and a PyTorch3D camera corresponding to the pose that the image was taken from. First, we want to convert the depth map into a point cloud. For that, we make use of the unproject_depth_image function, which uses the camera intrinsics and extrinsics to cast a ray from every pixel in the image into world coordinate space. The ray's final distance is the depth value at that pixel, and the color of each point is taken from the corresponding image pixel.
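
To make the unprojection idea concrete, here is a minimal sketch in plain Python. It is not the unproject_depth_image helper mentioned above: it assumes a simple pinhole camera with hypothetical intrinsics fx, fy, cx, cy, uses the usual image axes (x to the right, y down), and stops at camera space, so the extrinsics that would move the points into world space are left out.

```python
def unproject_depth(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels into 3D camera-space points (pinhole model)."""
    points = []
    for v, row in enumerate(depth):      # v: pixel row
        for u, z in enumerate(row):      # u: pixel column, z: depth at that pixel
            if not mask[v][u]:
                continue                 # skip background pixels
            x = (u - cx) * z / fx        # walk along the pixel's ray until depth z
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# Tiny 2 x 2 depth map; the bottom-right pixel is masked out.
depth = [[2.0, 2.0], [4.0, 0.0]]
mask = [[True, True], [True, False]]
points = unproject_depth(depth, mask, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(points)  # [(-1.0, -1.0, 2.0), (1.0, -1.0, 2.0), (-2.0, 2.0, 4.0)]
```

Each output point would be paired with the RGB value of the pixel it came from to color the point cloud.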

1.2.2 Parametric Functions

We can define a 3D object as a parametric function, sample points along its surface, and render these points. A sphere with center (x_0, y_0, z_0) and radius R satisfies the equation:

(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2 = R^2

The same sphere can be written as a parametric function of the elevation angle (theta) and the azimuth angle (phi). By sampling values of theta and phi, we can generate a sphere point cloud.

Below are the rendered point clouds where we sampled 50, 300 and 1000 points on the surface respectively.
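
The sampling step can be sketched in plain Python as follows. The angle convention is an assumption on my part (theta measured from the +z axis, phi around the z axis); other conventions only permute the coordinates. Note also that sampling theta uniformly places more points near the poles, so this is not an area-uniform sampling.

```python
import math
import random

def sample_sphere(num_points, center=(0.0, 0.0, 0.0), radius=1.0, seed=0):
    """Sample points on a sphere from random (theta, phi) pairs."""
    rng = random.Random(seed)
    x0, y0, z0 = center
    points = []
    for _ in range(num_points):
        theta = rng.uniform(0.0, math.pi)       # angle from the +z axis
        phi = rng.uniform(0.0, 2.0 * math.pi)   # azimuth around the z axis
        points.append((
            x0 + radius * math.sin(theta) * math.cos(phi),
            y0 + radius * math.sin(theta) * math.sin(phi),
            z0 + radius * math.cos(theta),
        ))
    return points

points = sample_sphere(50, center=(1.0, 2.0, 3.0), radius=2.0)
# Every sampled point satisfies the sphere equation up to rounding error.
print(all(math.isclose(math.dist(p, (1.0, 2.0, 3.0)), 2.0) for p in points))  # True
```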

1.2.3 Implicit Surfaces

An implicit function is a way to define a shape without explicitly listing its coordinates. The function F(x, y, z) describes the surface by its "zero level-set," which means all points (x, y, z) that satisfy F(x, y, z) = 0 belong to the surface.

To visualize a shape defined by an implicit function, we start by discretizing 3D space into a grid of voxels (volumetric pixels). We then evaluate the function F at each voxel's coordinates. The sign of F tells us on which side of the surface the voxel lies, and the surface itself is where F crosses zero. The result of this process is stored in a voxel grid, a 3D array where each value indicates whether the corresponding voxel is inside or outside the shape.

To reconstruct the mesh, we use the marching cubes algorithm, which helps us extract surfaces at a specific threshold level (0-level set). In practice, we can create our voxel grid using torch.meshgrid, which helps in setting up coordinates for each voxel in our space. We use these coordinates to evaluate our mathematical function. After setting up the voxel grid, we apply the mcubes library to transform this grid into a triangle mesh.

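The implicit function for a torus, written here in one standard form with the z-axis as the axis of symmetry, major radius R (distance from the center of the torus to the center of the tube) and minor radius r (radius of the tube):

F(x, y, z) = (\sqrt{x^2 + y^2} - R)^2 + z^2 - r^2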

Below we have the torus with voxel size 20, 30, and 80 respectively.

So how are these tori different from the point cloud ones? With implicit surfaces, once meshed, we have connectivity between the vertices, whereas point clouds have no connectivity.
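
The voxel-grid evaluation described in this section can be sketched in plain Python as below. This is only an illustration under stated assumptions: the torus is taken around the z-axis with hypothetical radii 1.0 and 0.4, the grid bounds are my own choice, nested lists stand in for torch.meshgrid, and the marching cubes step (done with mcubes in the project) is not included.

```python
import math

def torus_implicit(x, y, z, major_radius=1.0, minor_radius=0.4):
    """F < 0 inside the torus, F = 0 on its surface, F > 0 outside."""
    return (math.sqrt(x * x + y * y) - major_radius) ** 2 + z * z - minor_radius ** 2

def voxelize(implicit_fn, resolution, lo=-1.6, hi=1.6):
    """Evaluate implicit_fn at the voxel centers of a resolution^3 grid; True means inside."""
    step = (hi - lo) / resolution
    coords = [lo + (i + 0.5) * step for i in range(resolution)]
    return [[[implicit_fn(x, y, z) <= 0.0 for z in coords] for y in coords] for x in coords]

grid = voxelize(torus_implicit, resolution=20)
occupied = sum(cell for plane in grid for row in plane for cell in row)

print(torus_implicit(0.0, 0.0, 0.0) > 0)  # True: the hole in the middle is outside
print(torus_implicit(1.0, 0.0, 0.0) < 0)  # True: the center of the tube is inside
print(occupied, "of", 20 ** 3, "voxels are inside")
```

A finer grid samples F more densely near the zero level set, which is why the torus extracted from a larger grid looks smoother.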

1.2.4 Sampling Points on Meshes

One way to convert meshes into point clouds would be simply to use the vertices. But this can be problematic if the triangular faces of the mesh are of different sizes. A better method is uniform sampling of the surface through stratified sampling. Below is the process:

  1. Choose a triangle to sample from based on its size; larger triangles (larger area) have a higher chance of being chosen.
  2. Inside the chosen triangle, pick a random spot. This is done using something called barycentric coordinates, which help in defining a point in relation to the triangle’s corners.
  3. Calculate the exact position of this random spot on the triangle to get a uniform spread of points across the entire mesh.
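
The three steps above can be sketched in plain Python as follows. The function names are hypothetical, the two-triangle mesh is made up for the example, and the square-root trick for drawing uniform barycentric coordinates is a standard choice that I am assuming here rather than something the project states.

```python
import math
import random

def triangle_area(a, b, c):
    """Area of a 3D triangle from half the norm of the cross product of two edges."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (ab[1] * ac[2] - ab[2] * ac[1],
             ab[2] * ac[0] - ab[0] * ac[2],
             ab[0] * ac[1] - ab[1] * ac[0])
    return 0.5 * math.sqrt(sum(v * v for v in cross))

def sample_points(vertices, faces, num_samples, seed=0):
    rng = random.Random(seed)
    # Step 1: pick faces with probability proportional to their area.
    areas = [triangle_area(*(vertices[i] for i in face)) for face in faces]
    chosen = rng.choices(range(len(faces)), weights=areas, k=num_samples)
    points = []
    for face_index in chosen:
        a, b, c = (vertices[i] for i in faces[face_index])
        # Step 2: uniform barycentric coordinates (they are positive and sum to 1).
        r1, r2 = rng.random(), rng.random()
        s = math.sqrt(r1)
        w_a, w_b, w_c = 1.0 - s, s * (1.0 - r2), s * r2
        # Step 3: blend the three corners to get the 3D position.
        points.append(tuple(w_a * a[i] + w_b * b[i] + w_c * c[i] for i in range(3)))
    return points

# Two triangles in the z = 0 plane: a small one (area 0.5) and a large one (area 1.5).
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-3.0, 0.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
points = sample_points(vertices, faces, 4000)
fraction_on_large = sum(1 for p in points if p[0] < 0) / len(points)
print(round(fraction_on_large, 2))  # close to 0.75, the large triangle's share of the area
```

Using only the vertices would give this mesh four points regardless of face size, whereas the area weighting spreads the samples evenly over the surface.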

Below is an example where we take a triangle mesh and a number of samples and output a point cloud. We randomly sample 1000, 10000, and 100000 points respectively.


2. Single View to 3D

2.1 Fitting a Voxel Grid

To fit a voxel grid, we will first generate a randomly initialized voxel grid of size [b x h x w x d] and define a binary cross entropy (BCE) loss that helps us fit a 3D binary voxel grid using the Adam optimizer.

In a 3D voxel grid, a value of 0 indicates an empty cell, while 1 signifies an occupied cell. Thus, when fitting a voxel grid to a target, the process essentially involves solving a binary classification problem aimed at maximizing the log-likelihood of the ground-truth label in each voxel. That is, we will be predicting an occupancy score for every point in the voxel grid and we compare that with the binary occupancy in our ground truths.

In summary, the BCE loss function is the mean of the voxel-wise binary cross entropies between the reconstructed object and the ground truth. In the equation below, N is the number of voxels in the ground truth, y_i is the ground-truth occupancy of voxel i, and \hat{y}_i (y-hat) is the predicted occupancy.

L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} [ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) ]

We define a binary cross entropy loss with logits, which combines a Sigmoid layer and the BCELoss in a single class. The pos_weight factor weights the occupied voxels based on the average value of the target voxels: it is 0.5 divided by the fraction of occupied voxels, so the weight grows as occupied voxels become rarer. For example, if 10% of the target voxels are occupied, pos_weight is 5. This addresses the class imbalance where we have more unoccupied cells than occupied ones.

def voxel_loss(voxel_src: torch.Tensor, voxel_tgt: torch.Tensor) -> torch.Tensor:
    # voxel_src: b x h x w x d
    # voxel_tgt: b x h x w x d
    pos_weight = (0.5 / voxel_tgt.mean())
    criterion = torch.nn.BCEWithLogitsLoss(reduction='mean', pos_weight=pos_weight)
    loss = criterion(voxel_src, voxel_tgt)
    return loss
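
To see what pos_weight does numerically, here is a small worked example in plain Python that mirrors the arithmetic of the weighted loss on four voxels. It is a sketch of the formula, not the PyTorch implementation, and the helper name is hypothetical.

```python
import math

def weighted_bce_with_logits(logits, targets):
    """Mean BCE on raw logits, with occupied voxels weighted by 0.5 / mean(targets)."""
    pos_weight = 0.5 / (sum(targets) / len(targets))
    total = 0.0
    for x, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))  # sigmoid turns the logit into a probability
        total += -(pos_weight * y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(targets), pos_weight

# One occupied voxel out of four, and an undecided prediction (logit 0) everywhere.
logits = [0.0, 0.0, 0.0, 0.0]
targets = [1.0, 0.0, 0.0, 0.0]
loss, pos_weight = weighted_bce_with_logits(logits, targets)
print(pos_weight)      # 2.0
print(round(loss, 4))  # 0.8664
```

With 25% occupancy the single occupied voxel counts twice, so it contributes 2 log 2 to the sum while each empty voxel contributes log 2, giving a mean of 1.25 log 2.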

Below is the code to fit a voxel:

# Generate voxel source with randomly initialized values
voxels_src = torch.rand(feed_cuda["voxels"].shape, requires_grad=True, device=args.device)

# Initialize optimizer to optimize voxel source
optimizer = torch.optim.Adam([voxels_src], lr=args.lr)

for step in tqdm(range(start_iter, args.max_iter)):
    # Calculate loss
    loss = voxel_loss(voxels_src, voxels_tgt)
    # Zero the gradients before backpropagation.
    optimizer.zero_grad()
    # Backpropagate the loss to compute the gradients.
    loss.backward()
    # Update the model parameters based on the computed gradients.
    optimizer.step()

We optimize for 10000 iterations and observe that the loss steadily decreases to about 0.1.

Below are the visualizations of the ground truth, the fitted voxels, and the optimization progress.

[Figure: ground truth (left), fitted voxel grid (middle), optimization progress (right)]

2.2 Image to voxel grid

Fitting a voxel grid is easy, but now we want to reconstruct a 3D voxel grid from a single image only. For that, we will make use of an auto-encoder, which first encodes the image into a latent code using a 2D encoder. We use a pre-trained ResNet-18 model from torchvision to extract features from the image. The final classification layer is removed to make it a feature encoder. Our image will be transformed into a latent code.

Our input image is of size [batch_size, 137, 137, 3]. The encoder transforms it into a latent code of size [batch_size, 512]. Next, we need to reconstruct the latent code into a voxel grid. For that, we first build a decoder using multi-layer perceptron (MLP) only as shown below.

self.decoder = torch.nn.Sequential(
    nn.Linear(512, 1024),
    nn.PReLU(),
    nn.Linear(1024, 32*32*32)
)

Secondly, we change our decoder to follow the architecture of the Pix2Vox paper, which uses a 3D de-convolutional network (transposed convolutions) to upsample a small feature volume into an N x N x N voxel grid. Note that the latent code is what actually encodes the scene (the image), and decoding the latent code gives us a scene representation (3D model). The input of the decoder is of size [batch_size, 512] and its output is of size [batch_size x 32 x 32 x 32].

# Input: b x 512
# Output: b x 32 x 32 x 32

self.fc = nn.Linear(512, 128 * 4 * 4 * 4)
self.decoder = nn.Sequential(
    nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm3d(64),
    nn.ReLU(),
    nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm3d(32),
    nn.ReLU(),
    nn.ConvTranspose3d(32, 8, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm3d(8),
    nn.ReLU(),
    nn.Conv3d(8, 1, kernel_size=1),
    nn.Sigmoid()
)
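
To check that this decoder really ends at 32 x 32 x 32, the sketch below applies the transposed-convolution output-size formula to each layer. It assumes the output of self.fc is reshaped to [b, 128, 4, 4, 4] before entering the decoder, which the 128 * 4 * 4 * 4 size suggests but the excerpt does not show.

```python
def conv_transpose_out(size, kernel_size, stride, padding, output_padding=0, dilation=1):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1

sizes = [4]  # assumed starting resolution after reshaping the fc output
for _ in range(3):  # the three ConvTranspose3d layers
    sizes.append(conv_transpose_out(sizes[-1], kernel_size=4, stride=2, padding=1))
print(sizes)  # [4, 8, 16, 32]
```

With kernel_size=4, stride=2 and padding=1, every layer exactly doubles the resolution, and the final 1 x 1 x 1 Conv3d leaves it unchanged.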

# Set model to training mode
model.train()
# Initialize the Adam optimizer with model parameters and learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

# Loop through the training steps
for step in range(start_iter, args.max_iter):
    # Restart the iterator when a new epoch begins
    if step % len(train_loader) == 0:
        train_loader = iter(loader)

    # Fetch the next batch of data
    feed_dict = next(train_loader)
    # Preprocess the data into the required format
    images_gt, ground_truth_3d = preprocess(feed_dict, args)  # [32, 137, 137, 3], [32, 1, 32, 32, 32]
    # Generate predictions from the model
    prediction_3d = model(images_gt, args)  # voxels_pred: [32, 1, 32, 32, 32]
    # Calculate the loss between the prediction and the ground truth
    loss = calculate_loss(prediction_3d, ground_truth_3d, args)
    # Zero the parameter gradients
    optimizer.zero_grad()
    # Backpropagate to compute gradients
    loss.backward()
    # Update model parameters
    optimizer.step()

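After training for 3000 epochs with a batch size of 32 and a learning rate of 4e-4, we achieve a loss of 0.395. We got worse results with the deconvolutional network. One possible cause, assuming the training loss is the voxel_loss defined above: the deconvolutional decoder ends with nn.Sigmoid, while BCEWithLogitsLoss expects raw logits and applies a sigmoid itself, so the sigmoid would be applied twice. In the paper, the authors also describe their decoder as a coarse voxel generator whose output is passed to a refiner. We will continue with the MLP network for evaluation.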

[Figure: reconstruction with the MLP decoder (left) and with the 3D de-conv decoder (right)]

The first row shows the single-view image and the ground-truth mesh, and the second row shows the predicted voxels. Note that although we do not have a perfect 3D reconstruction, we can still see how the structure of the chairs in the voxel grid matches the single-view image and the 3D mesh.

2.3 Fitting a Point Cloud

Similarly to fitting a voxel grid, we generate a point cloud with random xyz values. We define the Chamfer loss function, which allows us to fit the random point cloud to our target point cloud, again using the Adam optimizer.

Note that a point cloud represents a set of P points in 3D space. It can represent fine structures without a huge number of points, as we see in the visualizations below, which use only 1000 points. However, it does not explicitly represent the surface of a shape, hence we need to extract a mesh from the point cloud. I explain more about point clouds in my other projects: Point Clouds: 3D Perception with Open3D and Robotic Grasping Detection with PointNet.

To fit a point cloud, we need a differentiable way to compare pointclouds as sets because order does not matter! Therefore, we need a permutation invariant learning objective. We use the Chamfer distance which is the sum of L2 distance to each point's nearest neighbor in the other set.

# Generate pointcloud source with randomly initialized values
pointclouds_src = torch.randn([1, args.n_points, 3], requires_grad=True, device=args.device)
# Initialize optimizer to optimize pointcloud source
optimizer = torch.optim.Adam([pointclouds_src], lr=args.lr)
for step in range(start_iter, args.max_iter):
    # Calculate loss
    loss = losses.chamfer_loss(pointclouds_src, pointclouds_tgt)
    # Zero the gradients before backpropagation.
    optimizer.zero_grad()
    # Backpropagate the loss to compute the gradients.
    loss.backward()
    # Update the model parameters based on the computed gradients.
    optimizer.step()

Suppose we have a ground-truth point cloud S_1 and a predicted point cloud S_2. For each point in the ground-truth set, we find its nearest neighbor in the predicted set and compute the distance between them. These distances are summed to form the first term of the equation below. Similarly, for each predicted point, we find the nearest ground-truth point, and these distances are summed to form the second term.

L_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2

The Chamfer loss is the total of these two sums and measures the overall mismatch between the two sets. Note that the dists returned by PyTorch3D's knn_points are squared Euclidean distances, so the implementation below sums squared distances. A zero Chamfer loss, signifying perfect alignment, occurs only when each point in one set exactly coincides with a point in the other set. In summary, the Chamfer loss guides the learning process by comparing the predicted point cloud against a ground-truth set, regardless of the order of points in either set.

from pytorch3d.ops import knn_points
def chamfer_loss(point_cloud_src: torch.Tensor, point_cloud_tgt: torch.Tensor) -> torch.Tensor:
    # point_cloud_src: b x n_points x 3  
    # point_cloud_tgt: b x n_points x 3  
    dist1 = knn_points(point_cloud_src, point_cloud_tgt)
    dist2 = knn_points(point_cloud_tgt, point_cloud_src)
    loss_chamfer = torch.sum(dist1.dists) + torch.sum(dist2.dists)
    return loss_chamfer
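
For intuition, the same computation can be written as a brute-force sketch in plain Python. It sums squared nearest-neighbor distances in both directions, as the function above does, but the helper names are mine and the quadratic loops are only suitable for tiny point sets.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_sq_dists(source, target):
    """For every source point, the squared distance to its nearest target point."""
    return [min(squared_distance(p, q) for q in target) for p in source]

def chamfer(source, target):
    return sum(nearest_sq_dists(source, target)) + sum(nearest_sq_dists(target, source))

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

print(chamfer(a, b))        # 4.0: only the extra point (1, 2, 0) has no exact match
print(chamfer(a, a[::-1]))  # 0.0: reordering the points does not change the loss
```

The second call shows the permutation invariance we asked for: the loss only depends on the sets, not on the order in which the points are stored.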

We optimize for 10000 iterations and observe that the loss steadily decreases to about 0.014. This value is lower than the loss we reached when fitting a voxel grid, although the two losses are measured on different scales.

Below are the visualizations of the ground truth, the fitted point cloud, and the optimization progress.

[Figure: ground truth (left), fitted point cloud (middle), optimization progress (right)]

2.4 Image to Point Cloud

For single view image to 3D reconstruction, we will use a similar approach to that for the image-to-voxelgrid as shown above. We will have the ResNet18 encode the image into a latent code and build an MLP to decode the latter into N x 3 output. Recall that the output of the encoder is of size [batch_size x 512] and the output of the decoder will be of size [batch_size x n_points x 3]. Note that explicit prediction yields a fixed size point cloud denoted as n_points here.

Our MLP starts with an input feature vector of size 512 and uses a series of fully connected layers of increasing size (1024, 2048, and 4096), each followed by a LeakyReLU activation with a negative slope of 0.1. The final layer expands the output to n_points * 3, where n_points is the number of points, each represented by three coordinates (x, y, z).

# Input: b x 512
# Output: b x args.n_points x 3 # b x N x 3

self.n_point = args.n_points
self.decoder = torch.nn.Sequential(
    torch.nn.Linear(512, 1024),
    torch.nn.LeakyReLU(0.1),
    torch.nn.Linear(1024, 2048),
    torch.nn.LeakyReLU(0.1),
    torch.nn.Linear(2048, 4096),
    torch.nn.LeakyReLU(0.1),
    torch.nn.Linear(4096, self.n_point * 3),
)

We train our model for 3000 epochs with n_points = 5000. The loss curve shows a rapid initial decrease, after which the loss fluctuates around 0.1.

The first row shows the single-view image and the ground-truth mesh, and the second row shows the predicted point cloud. Notice that the 3D reconstruction is poor since we trained for only 3000 epochs. We will do a comparative analysis later on.

2.5 Fitting a Mesh

Finally, we want to fit a mesh by deforming an initial generic shape to fit a target mesh. The process is a bit different from fitting a voxelgrid or pointcloud.

Below are the steps for iterative mesh refinement:

  1. We start by creating an ico-sphere mesh which will be our source mesh: mesh_src.
  2. We initialize deform_vertices_src as a random or zero tensor with requires_grad=True to make it a learnable parameter that can be optimized. The Adam optimizer is set up to update deform_vertices_src during training.
  3. Within each step in the training loop, we create a new mesh new_mesh_src by offsetting the vertices of mesh_src using the learned deformation values - deform_vertices_src.
    # Start from an icosphere mesh
    mesh_src = ico_sphere(5, args.device)

    # Randomly initialized
    deform_vertices_src = torch.randn(mesh_src.verts_packed().shape, requires_grad=True, device=args.device)

    # Initialize the Adam optimizer with model parameters and learning rate
    optimizer = torch.optim.Adam([deform_vertices_src], lr=args.lr)

    for step in range(start_iter, args.max_iter):

        # Create a new mesh with vertices offset by deform_vertices_src
        new_mesh_src = mesh_src.offset_verts(deform_vertices_src)

        # Sample points from the target and source meshes
        sample_trg = sample_points_from_meshes(mesh_tgt, args.n_points)
        sample_src = sample_points_from_meshes(new_mesh_src, args.n_points)

        # Calculate the Chamfer loss between the sampled points
        loss_reg = chamfer_loss(sample_src, sample_trg)

        # Calculate the smoothness loss for the new mesh
        loss_smooth = smoothness_loss(new_mesh_src)

        # Combine losses with weighting factors
        loss = args.w_chamfer * loss_reg + args.w_smooth * loss_smooth

        # Zero the gradients before backpropagation
        optimizer.zero_grad()
        # Compute gradients through backpropagation
        loss.backward()
        # Update the deformable vertex offsets
        optimizer.step()

Note that the same shape can be represented with different meshes. For example, we can represent a square face of a cube with 2 triangles or with 4 smaller triangles. Taking this into account, how can we define a loss function between the predicted and ground-truth mesh that is invariant to the way we represent the shape with triangles? We want the loss function to depend only on the underlying shape. To do that, we convert our meshes into point clouds and then compute the loss!

We sample points from the surface of the ground-truth mesh (offline) and from the surface of the predicted mesh (online), and compute the loss between these two sets of points using the Chamfer distance. However, minimizing only the Chamfer distance between the predicted and the target mesh leads to a non-smooth shape. We therefore add a smoothness loss based on Laplacian smoothing, which ensures that the deformations do not produce overly sharp or disjointed geometries and keeps the surface smooth. This smoothing term constrains each vertex to stay close to its neighbors. We combine both losses with weighting factors.
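
The smoothness_loss used in the training loop is not shown in this write-up, so the following is only a sketch of one common variant, the uniform-weight Laplacian, in plain Python: each vertex is penalized by its distance to the average of its neighbors. The small fan mesh and the function name are made up for the example.

```python
import math

def uniform_laplacian_loss(vertices, faces):
    """Mean distance between each vertex and the centroid of its neighboring vertices."""
    neighbors = {i: set() for i in range(len(vertices))}
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    total = 0.0
    for i, v in enumerate(vertices):
        ring = neighbors[i]
        if not ring:
            continue  # isolated vertex: nothing to compare against
        centroid = [sum(vertices[j][k] for j in ring) / len(ring) for k in range(3)]
        total += math.sqrt(sum((centroid[k] - v[k]) ** 2 for k in range(3)))
    return total / len(vertices)

# A fan of four triangles around vertex 0, first flat and then with vertex 0 pulled up.
faces = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1)]
ring = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
flat_loss = uniform_laplacian_loss([(0.0, 0.0, 0.0)] + ring, faces)
spiky_loss = uniform_laplacian_loss([(0.0, 0.0, 1.0)] + ring, faces)
print(round(flat_loss, 4), round(spiky_loss, 4))  # the spike has the larger loss
```

The flat patch still has a non-zero value because its boundary vertices are pulled towards the interior; on a closed mesh such as the icosphere there is no boundary, and the term mainly penalizes spikes like the one in the second mesh.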

Below is the loss curve starting with an icosphere of level 4.

Note that an icosphere is a sphere based on an icosahedron, a three-dimensional shape with 20 equilateral triangular faces, 12 vertices and 30 edges. This is our icosphere of level 0. Below we try icospheres of level 1 to 5, which have 80, 320, 1280, 5120 and 20480 faces and 42, 162, 642, 2562 and 10242 vertices respectively.
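
These counts follow from the subdivision rule, which can be checked with a few lines of plain Python. The sketch assumes the usual scheme where each level splits every triangle into four by adding one new vertex at the midpoint of every edge; the function name is mine.

```python
def icosphere_counts(level):
    """Vertices, edges and faces of an icosphere after `level` subdivisions."""
    vertices, edges, faces = 12, 30, 20  # level 0: the icosahedron
    for _ in range(level):
        # One new vertex per edge; every edge is split in two and every face gains
        # three interior edges; every face becomes four faces.
        vertices, edges, faces = vertices + edges, 2 * edges + 3 * faces, 4 * faces
    return vertices, edges, faces

for level in range(6):
    v, e, f = icosphere_counts(level)
    print(level, f, v)
# 0 20 12
# 1 80 42
# 2 320 162
# 3 1280 642
# 4 5120 2562
# 5 20480 10242
```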

We observe that an icosphere of low level gives us a poorly fitted mesh. This is because we have a limited number of faces and vertices. When fitting a mesh, we are not creating new vertices but only optimizing their positions. Hence, an icosphere of higher level gives us a more finely detailed fitted mesh.

[Figure: for each icosphere level, the ground truth, the fitted mesh, and the optimization progress]

Instead of starting from an icosphere, we can also start from a randomly generated mesh. In the third row, we have the same number of vertices as an icosphere of level 5, and we observe that it takes more than 10000 iterations to get a better fitted mesh than when starting from an icosphere, and the result has more overly sharp geometries. We will do more comparative analysis later on.

[Figure: ground truth, fitted mesh, and optimization progress when starting from a randomly generated mesh]

2.6 Image to Mesh

Single view 3D reconstruction for mesh is also different in the sense that we are not explicitly predicting meshes but instead deforming a source mesh. For voxelgrid and pointcloud, we are encoding an image and then we reshape the output of the decoder so that it fits the shape of either the voxelgrid or pointcloud. For meshes, we will start from an icosphere mesh and then offset each vertex using learnable parameters such that the predicted mesh is closer to the target mesh at each optimization step.

For the decoder, I used a similar MLP architecture to that of the image-to-point-cloud model above, except that I used ReLU instead of LeakyReLU.

# Input: b x 512
# Output: b x mesh_pred.verts_packed().shape[0] x 3

# Initialize source mesh
mesh_pred = ico_sphere(4, self.device)
self.mesh_pred = pytorch3d.structures.Meshes(mesh_pred.verts_list() * args.batch_size, mesh_pred.faces_list() * args.batch_size)

self.decoder = torch.nn.Sequential(
    torch.nn.Linear(512, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 4096),
    torch.nn.ReLU(),
    nn.Linear(4096, self.mesh_pred.verts_list()[0].shape[0] * 3))

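def forward(self, images, args):
    # encoded_feat is the b x 512 latent code produced from `images` by the
    # ResNet-18 encoder (that step is omitted in this excerpt)
    deform_vertices_pred = self.decoder(encoded_feat)
    # Offset every vertex of the batched icosphere by the predicted deformation
    mesh_pred = self.mesh_pred.offset_verts(deform_vertices_pred.reshape([-1, 3]))
    return mesh_pred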

Similarly, we train our model for 3000 epochs with loss weights of w_chamfer=1.0 and w_smooth=0.1. The loss curve shows a sharp decrease, and the graph is less noisy than the voxel grid and point cloud ones.

The first row shows the single-view image and the ground-truth mesh, and the second row shows the predicted mesh.

Now an important note about mesh deformation. It gives good results, but the topology is fixed by the initial mesh. That is, our starting mesh constrains the topology of the 3D shapes we can output. Mesh deformation only works if the shape we need to output has the same topology as the initial source mesh. But what do I mean by the same topology?

To a topologist, a doughnut, a coffee mug, and a straw are all the same! They all have one hole and are said to be homeomorphic to each other because we can continuously deform one into the other. However, a doughnut is not homeomorphic to a sphere. Consider polyhedra such as a cube, a pyramid and the icosahedron we saw above. They all have an Euler characteristic (X) of 2, given by the formula below, where V, E and F are the numbers of vertices, edges and faces.

X = V - E + F

The same holds for a sphere: however finely we subdivide its surface into faces, edges and vertices, its Euler characteristic is still 2. If two objects are homeomorphic to each other, then they have the same Euler characteristic. For closed orientable surfaces such as the ones we consider here, the converse also holds, which means that we can deform one object into another as long as they have the same Euler characteristic.

Video source: Topology joke

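Similarly, a doughnut, which has a hole in the middle, has the shape of a torus and an Euler characteristic of 0. This is why it cannot be deformed into a sphere, which has an Euler characteristic of 2. In our project, we are trying to model a chair from an icosphere mesh. Consider a chair where the legs, seat, and backrest are all modeled as rectangular prisms. A rectangular prism has 8 vertices, 12 edges and 6 faces, so applying Euler's formula to a single prism gives 8 - 12 + 6 = 2. Simply adding up separate prisms overcounts (six prisms would give 12), because once they are joined into one connected surface we need to account for the shared vertices, edges and faces at their junctions. The joined chair has an Euler characteristic of 2 if it has no holes, and each hole that passes through it (for example, a gap in the backrest) lowers the characteristic by 2.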
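
The difference between the two topologies can be checked directly on triangle meshes. The sketch below, in plain Python, computes V - E + F from a face list for a tetrahedron (a sphere-like shape) and for a torus built by wrapping a triangulated grid around in both directions. The grid construction and function names are my own, and only the connectivity is needed, not the vertex positions.

```python
def euler_characteristic(faces):
    """V - E + F for a triangle mesh given as a list of vertex-index triples."""
    vertices = {index for face in faces for index in face}
    edges = {tuple(sorted((face[i], face[(i + 1) % 3]))) for face in faces for i in range(3)}
    return len(vertices) - len(edges) + len(faces)

def torus_faces(n, m):
    """Connectivity of an n x m grid of quads that wraps around in both directions."""
    def vid(i, j):
        return (i % n) * m + (j % m)
    faces = []
    for i in range(n):
        for j in range(m):
            a, b, c, d = vid(i, j), vid(i + 1, j), vid(i + 1, j + 1), vid(i, j + 1)
            faces.append((a, b, c))  # each quad is split into two triangles
            faces.append((a, c, d))
    return faces

tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(euler_characteristic(tetrahedron))        # 2, like the sphere and the icosphere
print(euler_characteristic(torus_faces(4, 6)))  # 0, like the doughnut
```

Offsetting vertices never changes the face list, so a deformed icosphere keeps a characteristic of 2 no matter how far the vertices move.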

In our dataset we have a variety of chairs with and without holes, which means they have different Euler characteristics. However, we have been using an icosphere to model all of them. This is why we have not been able to produce holes in our predicted meshes, and it contributes to the poor quality of the 3D reconstructed meshes.

2.7 Evaluation Metrics

Now to evaluate if our predictions closely match the ground truths, we will need a metric.

1. Volumetric IoU

3D IoU is defined as the volume of the intersection of two meshes divided by the volume of their union. However, it is not always straightforward to use:

  • It cannot be used if the meshes are not watertight.
  • For meshes, we need to voxelize or sample.
  • It cannot capture thin structures.
  • It cannot be used for point clouds, as they have no connectivity.
  • It is not very meaningful at low values.
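
Once both shapes are voxelized onto the same grid, the IoU reduces to counting cells. A minimal sketch in plain Python, with the occupancy grids flattened into lists and a hypothetical helper name:

```python
def voxel_iou(pred, target):
    """Intersection over union of two flattened binary occupancy grids of equal size."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return intersection / union if union else 1.0  # two empty grids count as identical

pred = [1, 1, 1, 0, 0, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0, 0, 0]
print(voxel_iou(pred, target))  # 0.5: 2 shared voxels out of 4 occupied in either grid
```

The example also hints at the thin-structure problem: shifting a one-voxel-thick shape by a single cell already removes most of the overlap.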

I explain more about IoU in my other project: Real-time Multi-Object Tracking for Rescue Operations

2. Chamfer Distance

Above, we used the Chamfer distance to fit point clouds. One disadvantage is that, since it relies on the L2 distance, it is sensitive to outliers!

3. F1-score@t

To calculate the F1-score, we need to sample points from the surface of our predictions and ground truths. The F1-score is a better metric, as it is robust to outliers, as shown below.

Precision@t is defined as the fraction of predicted points within some range t of some ground-truth point. From the image below, the precision@t is 3/4.

Recall@t is defined as the fraction of ground-truth points within some range t of some predicted point. From the image below, the recall@t is 2/3.

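Hence, the F1-score@t is the harmonic mean of the two: F1@t = 2 * Precision@t * Recall@t / (Precision@t + Recall@t). For our example, the F1-score@t is 2 * (3/4) * (2/3) / (3/4 + 2/3) = 12/17, or approximately 0.71.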
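
The definitions above translate into a short brute-force sketch in plain Python. The point coordinates and threshold are made up so that they reproduce the same precision of 3/4 and recall of 2/3 as the example; the helper names are hypothetical.

```python
import math

def fraction_within(source, target, t):
    """Fraction of source points that lie within distance t of some target point."""
    hits = sum(1 for p in source if min(math.dist(p, q) for q in target) <= t)
    return hits / len(source)

def f1_at_t(pred, gt, t):
    precision = fraction_within(pred, gt, t)
    recall = fraction_within(gt, pred, t)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
pred = [(0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (1.05, 0.0, 0.0), (3.0, 0.0, 0.0)]

precision, recall, f1 = f1_at_t(pred, gt, t=0.1)
print(precision, round(recall, 3), round(f1, 3))  # 0.75 0.667 0.706
```

The stray prediction at (3, 0, 0) costs exactly one miss in the precision, however far away it is, which is the robustness to outliers that the Chamfer distance lacks.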


3. Occupancy Network

