[CV_3D] PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
Paper Review
1. Introduction
- Previous research : to enable weight sharing and kernel optimization, the irregularly formatted point cloud is transformed into a 3D voxel grid or collections of images before being fed to the network → Result : quantization artifacts
- PointNet
- Input : Point clouds
- Simple and unified architecture → easy to learn
- A set of points → Invariant to permutations & rigid motions
- Output : class labels for the entire input, or per-point segment/part labels for each point of the input
- Max pooling : single symmetric function
- FC layers : aggregate the learnt optimal values into a global shape descriptor (shape classification) or predict per-point labels (shape segmentation)
- Data-dependent STN : to canonicalize the data before PointNet processes it
- Can approximate any continuous set function
- Summarizes the input point cloud by a sparse set of key points
- Robust to small perturbations of input points (corruption by outliers or missing data)
- Key contributions
- Model Design : Deep network for unordered point sets in 3D
- Tasks : 3D shape classification, shape part segmentation, scene semantic parsing
- Analysis : empirical and theoretical analysis of stability and efficiency
- Experiment : illustration of the 3D features computed by selected neurons in the network
2. Related Work
Point Cloud Features
- Previous methods : encode certain statistical properties → invariant to certain transformations
- e.g. intrinsic or extrinsic / local or global
DL on 3D Data
- Volumetric CNNs : 3D CNNs → limited by data sparsity and the computation cost of 3D convolution
- FPNN, Vote3D : sparse volumes still make very large point clouds hard to handle
- Multiview CNNs : render the 3D point cloud or shape into 2D images, then apply 2D convolution
- Spectral CNNs : restricted to manifold meshes; hard to extend to non-isometric shapes
- Feature-based DNNs : convert 3D data into a vector of extracted shape features, then classify with FC layers
DL on Unordered Sets
- Point cloud = unordered set of vectors, whereas most DL works target regular input representations
- Read-process-write network with attention : handles sorting for generic sets and NLP → lacks the notion of geometry
3. Problem Statement
- Channels of each point in the point cloud
- (x, y, z) + extra feature channels (e.g. color, normal, ...)
- Implementation : only the (x, y, z) coordinates, for simplicity
- Object classification task
- Input point cloud : directly sampled from a shape, or pre-segmented from a scene point cloud
- Output : k scores (k : number of candidate classes)
- Semantic segmentation task
- Input : a single object (for part region segmentation) or a sub-volume of a 3D scene (for object region segmentation)
- Output : n x m scores (n : number of points, m : number of semantic sub-categories)
4. Deep Learning on Point Sets
4.1 Properties of Point Sets in R^n
- Unordered : a set of N 3D points → the network needs to be invariant to all N! input permutations
- Interaction among points : meaningful local structures from nearby points
- Invariance under transformations : transformations (e.g. rotation, translation) should not change the category or the segmentation of the points
4.2 PointNet Architecture
Full network = Classification network + Segmentation network
[ 3 Key modules ]
❶ Max pooling layer : Symmetry Function for Unordered Input
- Goal : to aggregate information from all points → make the model invariant to all N! input permutations
- Input : n vectors → Output : a new vector [f_1, ..., f_K] (invariant to input order)
- Key idea : approximate a general function $f$ defined on a point set by a symmetric function on transformed elements : $f(\{x_1, ..., x_n\}) \approx g(h(x_1), ..., h(x_n))$
- Implementation : approximate $h$ by an MLP and $g$ by a composition of a single-variable function and a max pooling function (toy sketch below)
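A toy sketch (not the paper's full network; the shared MLP `h` and its layer sizes here are made up) showing why max pooling over per-point features is permutation invariant:

```python
import torch
import torch.nn as nn

# h: a shared per-point MLP (hypothetical sizes), g: max pooling over points
h = nn.Sequential(nn.Conv1d(3, 16, 1), nn.ReLU(), nn.Conv1d(16, 32, 1))

points = torch.randn(1, 3, 8)               # (batch, channels, n_points)
perm = points[:, :, torch.randperm(8)]      # same point set, different order

f1 = torch.max(h(points), dim=2)[0]         # f ≈ g(h(x_1), ..., h(x_n))
f2 = torch.max(h(perm), dim=2)[0]
print(torch.allclose(f1, f2))               # True: invariant to point order
```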
❷ Local and Global Information Aggregation [Segmentation]
- Max pooling output [f_1, ..., f_K] : carries only global information → sufficient for the classification task
- Goal : obtain both local and global information for the point segmentation task
- Implementation (input) : concatenate the global feature (1024-dim) with each per-point local feature (64-dim) → extract new per-point features (e.g. per-point normals); see the sketch below
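A minimal sketch of this concatenation, assuming the paper's feature sizes (64-dim local, 1024-dim global) and dummy tensors:

```python
import torch

local_feat = torch.randn(2, 64, 500)    # (N, 64, n_points) per-point features
global_feat = torch.randn(2, 1024)      # (N, 1024) max-pooled global feature

# broadcast the global feature to every point, then concatenate channel-wise
global_expanded = global_feat.unsqueeze(2).repeat(1, 1, 500)  # (N, 1024, n_points)
seg_input = torch.cat([local_feat, global_expanded], dim=1)   # (N, 1088, n_points)
```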
❸ T-Net : Joint Alignment Network
- Goal : invariance to certain transformations (e.g. rigid transformations)
- Implementation : predict an affine transformation matrix with a mini-network (T-net) → apply the transformation to the coordinates of the input points
- Result check : if the semantic labeling stays the same after transforming the input, the network is invariant
- Idea from STN (compute a transformation matrix for the input image, then multiply the original input image by it to obtain a canonical, undistorted output image)
- T-net : composed of basic modules of point-independent feature extraction + max pooling + FC layers
- Feature space alignment : insert another transformation matrix to align features from different input point clouds; this 64×64 matrix is regularized toward an orthogonal matrix via $L_{reg} = \|I - AA^T\|_F^2$
5. Experiment
5.1 Applications (3D recognition)
1) 3D Object Classification
- Goal : To learn global point cloud feature
- Dataset : ModelNet40 (12311 CAD models from 40 man-made object categories) → 75% Train + 25% Test
- Input point cloud : 1024 points uniformly sampled from mesh faces → normalized into a unit sphere
- Data augmentation : random rotation along the up-axis, jittering the position of each point with Gaussian noise
- Result : with only FC layers and max pooling, fast inference speed and easy parallelization on CPU
2) 3D Object Part Segmentation
- Part Segmentation : given a 3D scan or mesh model → point labels = assign an object part category label to each point or face
- Dataset : ShapeNet part dataset (16881 shapes from 16 categories, annotated with 50 parts)
- Idea : Part-point Classification
- Evaluation metric : mIoU on points (per-shape mIoU; a sketch follows this list)
- Result : 2.3% mean IoU improvement
- Robustness Test (simulated Kinect scans) : loses only 5.3% mIoU
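A minimal sketch of the per-shape mIoU described in the paper: IoU is computed for each part type of the shape's object category, and a part absent from both prediction and ground truth counts as IoU 1. The label arrays here are hypothetical:

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """Average IoU over the part types belonging to the shape's object category."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)  # empty part -> IoU 1
    return np.mean(ious)

pred = np.array([0, 0, 1, 1, 2])  # hypothetical per-point predictions
gt = np.array([0, 1, 1, 1, 2])    # hypothetical ground truth
print(shape_miou(pred, gt, part_ids=[0, 1, 2]))
```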
3) Semantic Segmentation in Scenes
- Point labels : semantic object classes
- Dataset : Stanford 3D semantic parsing dataset (3D scans of 271 rooms across 6 areas, annotated with 13 categories)
- Point representation : 12-dim vector = 9 dims (XYZ, RGB, normalized location) + 3 dims (local point density, local curvature, normal)
- Classifier : standard MLP
- Result : smooth predictions, robustness to missing points and occlusions
- 3D Object Detection system : the semantic segmentation output can further be used to build a 3D object detection pipeline
5.2 Architecture Design Analysis
- Dataset : ModelNet40 shape classification problem for comparisons
Comparison with Alternative Order-invariant Methods
- 3 Approaches
- MLP (unsorted / sorted input) : points as n×3 arrays
- LSTM : points as a sequence
- Symmetry operation : Attention sum, Average pooling, Max pooling
- Result : Max pooling = Best performance (Acc 87.1%)
Effectiveness of Input and Feature Transformations
Robustness Test
- Robust to various input corruptions
- Model : Max pooling network / Input points : normalized into a unit sphere
- Result : with 50% of input points missing, accuracy drops by only 2.4% / 3.8% w.r.t. furthest / random input sampling
- Robust to outliers
5.3 Visualizing PointNet
- Critical point sets $C_S$ and upper-bound shapes $N_S$ for sample shapes $S$
- Critical point set $C_S$ : the points that contribute to the max-pooled feature (summarized skeleton of the shape)
- Upper-bound shape $N_S$ : the largest possible point cloud that gives the same global shape feature $f(S)$
- Result : losing some non-critical points does not change $f(S)$ (robustness); see the sketch below
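A minimal sketch of how $C_S$ can be read off: the critical points are those that win the max in at least one of the 1024 global-feature dimensions. `per_point_feat` is a stand-in for the mlp2 output of a trained model:

```python
import torch

per_point_feat = torch.randn(1, 1024, 500)   # hypothetical (N, 1024, n_points) features
winners = per_point_feat.max(dim=2)[1]       # (1, 1024): the point achieving each max
critical_idx = torch.unique(winners[0])      # indices of the critical point set C_S
print(critical_idx.numel(), "critical points out of 500")
```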
5.4 Time and Space Complexity Analysis
Code Review
Dataloader
```python
import numpy as np
import torch
from torch.utils.data import Dataset

class PointCloudDataset(Dataset):
    def __init__(self, npoints=1024):
        self.npoints = npoints
        ...  # self.point_list / self.label_list are loaded here (elided)

    def __getitem__(self, index):
        points = self.point_list[index]
        # randomly sample a fixed number of points so that samples can be batched
        choice = np.random.choice(points.shape[0], self.npoints, replace=True)
        points = points[choice, :]
        # normalize to the unit sphere
        points = points - np.expand_dims(np.mean(points, axis=0), 0)  # center
        dist = np.max(np.sqrt(np.sum(points ** 2, axis=1)), 0)        # max distance from center
        points = points / dist                                        # scale
        points = self.data_augmentation(points)
        label = self.label_list[index]
        return torch.from_numpy(points).float(), torch.tensor(label)

    def data_augmentation(self, points):
        theta = np.random.uniform(0, np.pi * 2)  # random angle in [0, 360) degrees
        rotation_matrix = np.array([[np.cos(theta), -np.sin(theta)],
                                    [np.sin(theta), np.cos(theta)]])
        points[:, [0, 2]] = points[:, [0, 2]].dot(rotation_matrix)  # rotate around the up (y) axis
        points += np.random.normal(0, 0.02, size=points.shape)      # Gaussian jitter
        return points
```
- Point cloud : each sample has a different number of points; to train in batches, all samples must have the same number of points → set n_points and randomly sample that many points from each sample
- The sampled points are normalized into the unit sphere
- Data augmentation : random rotation around the y (up) axis, jittering based on Gaussian noise (a usage sketch follows)
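A hypothetical usage sketch (assuming `point_list` / `label_list` have been populated in `__init__`): since every sample is resampled to `npoints`, a standard DataLoader can batch them directly.

```python
from torch.utils.data import DataLoader

dataset = PointCloudDataset(npoints=1024)  # assumes its point/label lists are loaded
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for points, labels in loader:
    print(points.shape)  # torch.Size([32, 1024, 3])
    break
```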
Main network
```python
import torch
import torch.nn as nn

# mlpblock, fcblock, and TNet are defined in the snippets below
class PointNetCls(nn.Module):
    def __init__(self, num_classes=2):
        super(PointNetCls, self).__init__()
        self.tnet = TNet(dim=3)
        self.mlp1 = mlpblock(3, 64)
        self.tnet_feature = TNet(dim=64)
        self.mlp2 = nn.Sequential(
            mlpblock(64, 128),
            mlpblock(128, 1024, act_f=False)
        )
        self.mlp3 = nn.Sequential(
            fcblock(1024, 512),
            fcblock(512, 256, dropout_rate=0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        """
        :input size: (N, n_points, 3)
        :output size: (N, num_classes)
        """
        x = x.transpose(2, 1)                                         # N, 3, n_points
        trans = self.tnet(x)                                          # N, 3, 3
        x = torch.bmm(x.transpose(2, 1), trans).transpose(2, 1)       # N, 3, n_points
        x = self.mlp1(x)                                              # N, 64, n_points
        trans_feat = self.tnet_feature(x)                             # N, 64, 64
        x = torch.bmm(x.transpose(2, 1), trans_feat).transpose(2, 1)  # N, 64, n_points
        x = self.mlp2(x)                                              # N, 1024, n_points
        x = torch.max(x, 2, keepdim=False)[0]                         # N, 1024 (global feature)
        x = self.mlp3(x)                                              # N, num_classes
        return x, trans_feat
```
- (1) Compute a transformation matrix from the input features with T-Net → apply it via matrix multiplication
- (2) Shared mlp1 maps the feature dimension 3 → 64
- (3) Apply another T-Net transformation (matrix multiplication) to the 64-dim output of mlp1
- (4) Shared mlp2 maps the feature dimension 64 → 128 → 1024
- (5) Max pooling extracts a 1024-dim global feature vector
- (6) The last mlp3 performs classification
mlpblock, fcblock
```python
def mlpblock(in_channels, out_channels, act_f=True):
    layers = [
        nn.Conv1d(in_channels, out_channels, 1),  # kernel size 1 -> shared per-point MLP
        nn.BatchNorm1d(out_channels),
    ]
    if act_f:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def fcblock(in_channels, out_channels, dropout_rate=None):
    layers = [
        nn.Linear(in_channels, out_channels),
    ]
    if dropout_rate is not None:
        layers.append(nn.Dropout(p=dropout_rate))
    layers += [
        nn.BatchNorm1d(out_channels),
        nn.ReLU()
    ]
    return nn.Sequential(*layers)
```
- Shared MLP : implemented as a 1D conv layer with kernel size 1 (equivalence check below)
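A quick sanity-check sketch of this equivalence: a kernel-size-1 Conv1d applies the same weights to every point, i.e. it acts as a Linear layer shared across points (dummy tensors, made-up sizes):

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(3, 64, 1)
lin = nn.Linear(3, 64)
lin.weight.data = conv.weight.data.squeeze(2)  # (64, 3, 1) -> (64, 3)
lin.bias.data = conv.bias.data

x = torch.randn(2, 3, 100)  # (N, C, n_points)
out_conv = conv(x)
out_lin = lin(x.transpose(2, 1)).transpose(2, 1)
print(torch.allclose(out_conv, out_lin, atol=1e-6))  # True
```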
T-Net
```python
class TNet(nn.Module):
    def __init__(self, dim=64):
        super(TNet, self).__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            mlpblock(dim, 64),
            mlpblock(64, 128),
            mlpblock(128, 1024)
        )
        self.fc = nn.Sequential(
            fcblock(1024, 512),
            fcblock(512, 256),
            nn.Linear(256, dim*dim)
        )

    def forward(self, x):
        x = self.mlp(x)                        # N, 1024, n_points
        x = torch.max(x, 2, keepdim=True)[0]   # N, 1024, 1 (symmetric max pooling)
        x = x.view(-1, 1024)
        x = self.fc(x)                         # N, dim*dim
        # add the flattened identity so the predicted transform starts near identity
        idt = torch.eye(self.dim, dtype=torch.float32).flatten().unsqueeze(0).repeat(x.size()[0], 1)
        idt = idt.to(x.device)
        x = x + idt
        x = x.view(-1, self.dim, self.dim)     # N, dim, dim
        return x
```
- Computes the transformation matrix that maps the input into a canonical space; adding the flattened identity initializes the predicted transform near the identity matrix
Train
```python
import torch
import torch.nn as nn

def feature_transform_regularizer(trans):
    # encourage the predicted feature transform A to be close to orthogonal: ||I - A A^T||
    D = trans.size()[1]
    I = torch.eye(D)[None, :, :]
    I = I.to(trans.device)
    loss = torch.mean(torch.norm(torch.bmm(trans, trans.transpose(2, 1)) - I, dim=(1, 2)))
    return loss

# sample data
points = torch.rand(5, 1024, 3)
target = torch.empty(5, dtype=torch.long).random_(10)

model = PointNetCls(num_classes=10)
loss_f = nn.CrossEntropyLoss()

pred, trans_feat = model(points)
loss = loss_f(pred, target)
loss += feature_transform_regularizer(trans_feat) * 0.001
```
- Defines the regularization loss for the feature transform, pushing the predicted 64×64 matrix toward an orthogonal one
- Loss : cross-entropy loss + 0.001 × feature transform regularization
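A minimal training-step sketch to round out the snippet above; the optimizer choice and learning rate are assumptions (Adam with lr 0.001, as in the paper's setup), not part of the original post:

```python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # assumed optimizer/lr

optimizer.zero_grad()
pred, trans_feat = model(points)
loss = loss_f(pred, target) + feature_transform_regularizer(trans_feat) * 0.001
loss.backward()
optimizer.step()
```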