[CV_3D] PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
Paper Review
1. Introduction
- Previous research : to enable weight sharing and kernel optimization, the irregularly formatted point cloud is transformed into a 3D voxel grid or collections of images before being fed to the network → Result : quantization artifacts
- PointNet
- Input : Point clouds
- Simple and unified architecture → easy to learn
- A set of points → Invariant to permutations & rigid motions
- Output : class labels for the entire input, or per-point segment/part labels for each point of the input
- Max pooling : single symmetric function
- FC layers : aggregate the learnt optimal values into a global shape descriptor (shape classification) or predict per-point labels (shape segmentation)
- Data-dependent STN : to canonicalize the data before PointNet processes it
- Can approximate any continuous set function
- Summarizes the input point cloud by a sparse set of key points
- Robust to small perturbations of input points (corruption by outliers or missing data)
- Key contributions
- Model Design : Deep network for unordered point sets in 3D
- Tasks : 3D shape classification, shape part segmentation, scene semantic parsing
- Analysis : empirical and theoretical analysis of stability and efficiency
- Experiment : illustration of the 3D features computed by selected neurons in the network
2. Related Work
Point Cloud Features
- Previous methods : encode certain statistical properties → invariant to certain transformations
- e.g. intrinsic or extrinsic / local or global
DL on 3D Data
- Volumetric CNNs : 3D CNNs → limited by data sparsity and the computation cost of 3D convolution
- FPNN, Vote3D : sparse volumes still make very large point clouds hard to handle
- Multiview CNNs : render the 3D point cloud or shape into 2D images, then apply 2D convolution
- Spectral CNNs : restricted to manifold meshes; hard to extend to non-isometric shapes
- Feature-based DNNs : convert 3D data into a vector of extracted shape features, then classify with FC layers
DL on Unordered Sets
- Point cloud = unordered set of vectors, whereas most DL works target regular input representations
- Read-process-write network with attention : handles sorting for generic sets and NLP → lacks the notion of geometry
3. Problem Statement
- Channels of each point in the point cloud
- (x, y, z) + extra feature channels (e.g. color, normal, ...)
- Implementation : only the (x, y, z) coordinates, for simplicity
- Object classification task
- Input point cloud : directly sampled from a shape, or pre-segmented from a scene point cloud
- Output : k scores (k : number of candidate classes)
- Semantic segmentation task
- Input : a single object (for part region segmentation) or a sub-volume of a 3D scene (for object region segmentation)
- Output : n x m scores (n : number of points, m : number of semantic sub-categories)
4. Deep Learning on Point Sets
4.1 Properties of Point Sets in R^n
- Unordered : a set of N 3D points → the network needs to be invariant to all N! input permutations
- Interaction among points : meaningful local structures from nearby points
- Invariance under transformations : transformations (e.g. rotation, translation) should not change the category or the segmentation of the points
4.2 PointNet Architecture
Full network = Classification network + Segmentation network
[ 3 Key modules ]
❶ Max pooling layer : Symmetry Function for Unordered Input
- Goal : to aggregate information from all points → make the model invariant to all N! input permutations
- Input : n vectors → Output : a new vector [f_1, ..., f_K] (invariant to input order)
- Key idea : approximate a general function $f$ defined on a point set by a symmetric function on transformed elements : $f(\{x_1, ..., x_n\}) \approx g(h(x_1), ..., h(x_n))$
- Implementation : approximate $h$ by an MLP and $g$ by a composition of a single-variable function and a max pooling function (toy sketch below)
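A toy sketch (not the paper's full network; the shared MLP `h` and its layer sizes here are made up) showing why max pooling over per-point features is permutation invariant:

```python
import torch
import torch.nn as nn

# h: a shared per-point MLP (hypothetical sizes), g: max pooling over points
h = nn.Sequential(nn.Conv1d(3, 16, 1), nn.ReLU(), nn.Conv1d(16, 32, 1))

points = torch.randn(1, 3, 8)               # (batch, channels, n_points)
perm = points[:, :, torch.randperm(8)]      # same point set, different order

f1 = torch.max(h(points), dim=2)[0]         # f ≈ g(h(x_1), ..., h(x_n))
f2 = torch.max(h(perm), dim=2)[0]
print(torch.allclose(f1, f2))               # True: invariant to point order
```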
❷ Local and Global Information Aggregation [Segmentation]
- Max pooling output [f_1, ..., f_K] : carries only global information → sufficient for the classification task
- Goal : obtain both local and global information for the point segmentation task
- Implementation (input) : concatenate the global feature (1024-dim) with each per-point local feature (64-dim) → extract new per-point features (e.g. per-point normals); see the sketch below
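A minimal sketch of this concatenation, assuming the paper's feature sizes (64-dim local, 1024-dim global) and dummy tensors:

```python
import torch

local_feat = torch.randn(2, 64, 500)    # (N, 64, n_points) per-point features
global_feat = torch.randn(2, 1024)      # (N, 1024) max-pooled global feature

# broadcast the global feature to every point, then concatenate channel-wise
global_expanded = global_feat.unsqueeze(2).repeat(1, 1, 500)  # (N, 1024, n_points)
seg_input = torch.cat([local_feat, global_expanded], dim=1)   # (N, 1088, n_points)
```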
❸ T-Net : Joint Alignment Network
- Goal : invariance to certain transformations (e.g. rigid transformations)
- Implementation : predict an affine transformation matrix with a mini-network (T-net) → apply the transformation to the coordinates of the input points
- Result check : if the semantic labeling stays the same after transforming the input, the network is invariant
- Idea from STN (compute a transformation matrix for the input image, then multiply the original input image by it to obtain a canonical, undistorted output image)
- T-net : composed of basic modules of point-independent feature extraction + max pooling + FC layers
- Feature space alignment : insert another transformation matrix to align features from different input point clouds; this 64×64 matrix is regularized toward an orthogonal matrix via $L_{reg} = \|I - AA^T\|_F^2$
5. Experiment
5.1 Applications (3D recognition)
1) 3D Object Classification
- Goal : To learn global point cloud feature
- Dataset : ModelNet40 (12311 CAD models from 40 man-made object categories) → 75% Train + 25% Test
- Input point cloud : 1024 points uniformly sampled from mesh faces → normalized into a unit sphere
- Data augmentation : random rotation along the up-axis, jittering the position of each point with Gaussian noise
- Result : with only FC layers and max pooling, fast inference speed and easy parallelization on CPU
2) 3D Object Part Segmentation
- Part Segmentation : given a 3D scan or mesh model → point labels = assign an object part category label to each point or face
- Dataset : ShapeNet part dataset (16881 shapes from 16 categories, annotated with 50 parts)
- Idea : Part-point Classification
- Evaluation metric : mIoU on points (per-shape mIoU; a sketch follows this list)
- Result : 2.3% mean IoU improvement
- Robustness Test (simulated Kinect scans) : loses only 5.3% mIoU
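A minimal sketch of the per-shape mIoU described in the paper: IoU is computed for each part type of the shape's object category, and a part absent from both prediction and ground truth counts as IoU 1. The label arrays here are hypothetical:

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """Average IoU over the part types belonging to the shape's object category."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)  # empty part -> IoU 1
    return np.mean(ious)

pred = np.array([0, 0, 1, 1, 2])  # hypothetical per-point predictions
gt = np.array([0, 1, 1, 1, 2])    # hypothetical ground truth
print(shape_miou(pred, gt, part_ids=[0, 1, 2]))
```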
3) Semantic Segmentation in Scenes
- Point labels : semantic object classes
- Dataset : Stanford 3D semantic parsing dataset (3D scans of 271 rooms across 6 areas, annotated with 13 categories)
- Point representation : 12-dim vector = 9 dims (XYZ, RGB, normalized location) + 3 dims (local point density, local curvature, normal)
- Classifier : standard MLP
- Result : smooth predictions, robustness to missing points and occlusions
- 3D Object Detection system : the semantic segmentation output can further be used to build a 3D object detection pipeline
5.2 Architecture Design Analysis
- Dataset : ModelNet40 shape classification problem for comparisons
Comparison with Alternative Order-invariant Methods
- 3 Approaches
- MLP (unsorted / sorted input) : points as n×3 arrays
- LSTM : points as a sequence
- Symmetry operation : Attention sum, Average pooling, Max pooling
- Result : Max pooling = Best performance (Acc 87.1%)
Effectiveness of Input and Feature Transformations
Robustness Test
- Robust to various input corruptions
- Model : Max pooling network / Input points : normalized into a unit sphere
- Result : with 50% of input points missing, accuracy drops by only 2.4% / 3.8% w.r.t. furthest / random input sampling
- Robust to outliers
5.3 Visualizing PointNet
- Critical point sets $C_S$ and upper-bound shapes $N_S$ for sample shapes $S$
- Critical point set $C_S$ : the points that contribute to the max-pooled feature (summarized skeleton of the shape)
- Upper-bound shape $N_S$ : the largest possible point cloud that gives the same global shape feature $f(S)$
- Result : losing some non-critical points does not change $f(S)$ (robustness); see the sketch below
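A minimal sketch of how $C_S$ can be read off: the critical points are those that win the max in at least one of the 1024 global-feature dimensions. `per_point_feat` is a stand-in for the mlp2 output of a trained model:

```python
import torch

per_point_feat = torch.randn(1, 1024, 500)   # hypothetical (N, 1024, n_points) features
winners = per_point_feat.max(dim=2)[1]       # (1, 1024): the point achieving each max
critical_idx = torch.unique(winners[0])      # indices of the critical point set C_S
print(critical_idx.numel(), "critical points out of 500")
```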
5.4 Time and Space Complexity Analysis
Code Review
Dataloader
```python
import numpy as np
import torch
from torch.utils.data import Dataset

class PointCloudDataset(Dataset):
    def __init__(self, npoints=1024):
        self.npoints = npoints
        ...  # self.point_list / self.label_list are loaded here (elided)

    def __getitem__(self, index):
        points = self.point_list[index]
        # randomly sample a fixed number of points so that samples can be batched
        choice = np.random.choice(points.shape[0], self.npoints, replace=True)
        points = points[choice, :]
        # normalize to the unit sphere
        points = points - np.expand_dims(np.mean(points, axis=0), 0)  # center
        dist = np.max(np.sqrt(np.sum(points ** 2, axis=1)), 0)        # max distance from center
        points = points / dist                                        # scale
        points = self.data_augmentation(points)
        label = self.label_list[index]
        return torch.from_numpy(points).float(), torch.tensor(label)

    def data_augmentation(self, points):
        theta = np.random.uniform(0, np.pi * 2)  # random angle in [0, 360) degrees
        rotation_matrix = np.array([[np.cos(theta), -np.sin(theta)],
                                    [np.sin(theta), np.cos(theta)]])
        points[:, [0, 2]] = points[:, [0, 2]].dot(rotation_matrix)  # rotate around the up (y) axis
        points += np.random.normal(0, 0.02, size=points.shape)      # Gaussian jitter
        return points
```
- Point cloud : each sample has a different number of points; to train in batches, all samples must have the same number of points → set n_points and randomly sample that many points from each sample
- The sampled points are normalized into the unit sphere
- Data augmentation : random rotation around the y (up) axis, jittering based on Gaussian noise (a usage sketch follows)
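A hypothetical usage sketch (assuming `point_list` / `label_list` have been populated in `__init__`): since every sample is resampled to `npoints`, a standard DataLoader can batch them directly.

```python
from torch.utils.data import DataLoader

dataset = PointCloudDataset(npoints=1024)  # assumes its point/label lists are loaded
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for points, labels in loader:
    print(points.shape)  # torch.Size([32, 1024, 3])
    break
```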
Main network
```python
import torch
import torch.nn as nn

# mlpblock, fcblock, and TNet are defined in the snippets below
class PointNetCls(nn.Module):
    def __init__(self, num_classes=2):
        super(PointNetCls, self).__init__()
        self.tnet = TNet(dim=3)
        self.mlp1 = mlpblock(3, 64)
        self.tnet_feature = TNet(dim=64)
        self.mlp2 = nn.Sequential(
            mlpblock(64, 128),
            mlpblock(128, 1024, act_f=False)
        )
        self.mlp3 = nn.Sequential(
            fcblock(1024, 512),
            fcblock(512, 256, dropout_rate=0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        """
        :input size: (N, n_points, 3)
        :output size: (N, num_classes)
        """
        x = x.transpose(2, 1)                                         # N, 3, n_points
        trans = self.tnet(x)                                          # N, 3, 3
        x = torch.bmm(x.transpose(2, 1), trans).transpose(2, 1)       # N, 3, n_points
        x = self.mlp1(x)                                              # N, 64, n_points
        trans_feat = self.tnet_feature(x)                             # N, 64, 64
        x = torch.bmm(x.transpose(2, 1), trans_feat).transpose(2, 1)  # N, 64, n_points
        x = self.mlp2(x)                                              # N, 1024, n_points
        x = torch.max(x, 2, keepdim=False)[0]                         # N, 1024 (global feature)
        x = self.mlp3(x)                                              # N, num_classes
        return x, trans_feat
```
- (1) Compute a transformation matrix from the input features with T-Net → apply it via matrix multiplication
- (2) Shared mlp1 maps the feature dimension 3 → 64
- (3) Apply another T-Net transformation (matrix multiplication) to the 64-dim output of mlp1
- (4) Shared mlp2 maps the feature dimension 64 → 128 → 1024
- (5) Max pooling extracts a 1024-dim global feature vector
- (6) The last mlp3 performs classification
mlpblock, fcblock
```python
def mlpblock(in_channels, out_channels, act_f=True):
    layers = [
        nn.Conv1d(in_channels, out_channels, 1),  # kernel size 1 -> shared per-point MLP
        nn.BatchNorm1d(out_channels),
    ]
    if act_f:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def fcblock(in_channels, out_channels, dropout_rate=None):
    layers = [
        nn.Linear(in_channels, out_channels),
    ]
    if dropout_rate is not None:
        layers.append(nn.Dropout(p=dropout_rate))
    layers += [
        nn.BatchNorm1d(out_channels),
        nn.ReLU()
    ]
    return nn.Sequential(*layers)
```
- Shared MLP : implemented as a 1D conv layer with kernel size 1 (equivalence check below)
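A quick sanity-check sketch of this equivalence: a kernel-size-1 Conv1d applies the same weights to every point, i.e. it acts as a Linear layer shared across points (dummy tensors, made-up sizes):

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(3, 64, 1)
lin = nn.Linear(3, 64)
lin.weight.data = conv.weight.data.squeeze(2)  # (64, 3, 1) -> (64, 3)
lin.bias.data = conv.bias.data

x = torch.randn(2, 3, 100)  # (N, C, n_points)
out_conv = conv(x)
out_lin = lin(x.transpose(2, 1)).transpose(2, 1)
print(torch.allclose(out_conv, out_lin, atol=1e-6))  # True
```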
T-Net
```python
class TNet(nn.Module):
    def __init__(self, dim=64):
        super(TNet, self).__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            mlpblock(dim, 64),
            mlpblock(64, 128),
            mlpblock(128, 1024)
        )
        self.fc = nn.Sequential(
            fcblock(1024, 512),
            fcblock(512, 256),
            nn.Linear(256, dim*dim)
        )

    def forward(self, x):
        x = self.mlp(x)                        # N, 1024, n_points
        x = torch.max(x, 2, keepdim=True)[0]   # N, 1024, 1 (symmetric max pooling)
        x = x.view(-1, 1024)
        x = self.fc(x)                         # N, dim*dim
        # add the flattened identity so the predicted transform starts near identity
        idt = torch.eye(self.dim, dtype=torch.float32).flatten().unsqueeze(0).repeat(x.size()[0], 1)
        idt = idt.to(x.device)
        x = x + idt
        x = x.view(-1, self.dim, self.dim)     # N, dim, dim
        return x
```
- Computes the transformation matrix that maps the input into a canonical space; adding the flattened identity initializes the predicted transform near the identity matrix
Train
```python
import torch
import torch.nn as nn

def feature_transform_regularizer(trans):
    # encourage the predicted feature transform A to be close to orthogonal: ||I - A A^T||
    D = trans.size()[1]
    I = torch.eye(D)[None, :, :]
    I = I.to(trans.device)
    loss = torch.mean(torch.norm(torch.bmm(trans, trans.transpose(2, 1)) - I, dim=(1, 2)))
    return loss

# sample data
points = torch.rand(5, 1024, 3)
target = torch.empty(5, dtype=torch.long).random_(10)

model = PointNetCls(num_classes=10)
loss_f = nn.CrossEntropyLoss()

pred, trans_feat = model(points)
loss = loss_f(pred, target)
loss += feature_transform_regularizer(trans_feat) * 0.001
```
- Defines the regularization loss for the feature transform, pushing the predicted 64×64 matrix toward an orthogonal one
- Loss : cross-entropy loss + 0.001 × feature transform regularization
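A minimal training-step sketch to round out the snippet above; the optimizer choice and learning rate are assumptions (Adam with lr 0.001, as in the paper's setup), not part of the original post:

```python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # assumed optimizer/lr

optimizer.zero_grad()
pred, trans_feat = model(points)
loss = loss_f(pred, target) + feature_transform_regularizer(trans_feat) * 0.001
loss.backward()
optimizer.step()
```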