3D Object Classification With Selective Multi-View Fusion And Shape Rendering

This repository is for the following paper "3D Object Classification With Selective Multi-View Fusion And Shape Rendering" introduced by Mona Alzahrani, Muhammad Usman, Randah Alharbi, Saeed Anwar, Ajmal Mian, and Tarek Helmy, DICTA 2024.

Requirements:

The model is built in the Visual Studio Code editor using:

  • Python 3.9.16
  • TensorFlow-GPU 2.7
  • PyTorch 2.0.1
  • PyTorch-CUDA 11.7
  • Keras 2.6
  • Transformers 4.38.2
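
A quick way to confirm that your environment matches these versions is a short check in a notebook cell; this snippet is only a convenience of ours and is not part of the repository:

import sys, tensorflow, torch, transformers
print(sys.version)               # expect Python 3.9.16
print(tensorflow.__version__)    # expect 2.7.x
print(torch.__version__)         # expect 2.0.1
print(transformers.__version__)  # expect 4.38.2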

Content:

  1. Introduction
  2. Architecture
  3. Dataset
  4. Getting Started
  5. Feature Extraction
  6. Training and Testing
  7. Results

Introduction:

3D object classification is complex and challenging because of the high dimensionality of the data, the intricate nature of spatial relationships, and viewpoint variations. We fill a gap in view-based 3D object classification by examining the factors that influence classification effectiveness: we compare CNN-based and Transformer-based backbone networks side by side to determine their respective merits for feature extraction. Our research extends to evaluating various fusion strategies to determine the most effective method for integrating multiple views, and to ascertaining the optimal number of views that balances classification accuracy and computation. We also probe the effectiveness of the different feature types produced by rendering techniques in accurately depicting 3D objects. This investigation is supported by an extensive experimental framework incorporating the diverse set of 3D objects in the ModelNet40 dataset. Finally, based on this analysis, we present a Selective Multi-View deep model (SelectiveMV) that performs efficiently and provides high accuracy given only a few views.

Architecture:

The architecture of the proposed Selective Multi-View deep model (SelectiveMV) consists of five phases:
(A) View rendering: m views are captured from different viewpoints of a given 3D object. Depending on the rendering technique, the rendered views are grayscale, shaded, or depth maps.
(B) Feature extraction: each rendered view is input to a pre-trained backbone network to extract the corresponding feature set.
(C) Vectorization: the extracted feature sets are flattened into vectors.
(D) Selective fusion: the feature vectors are compared based on their similarity using Cosine Similarity, and an importance score is obtained for each view and normalized. The views with the highest scores are selected and fused using a fusion technique to generate a global descriptor.
(E) Classification: the global descriptor of the object is fed into a classifier to predict its class.

The architecture of the proposed Selective Multi-View deep model (SelectiveMV). The blue boxes (in addition to the number of selected views) are the key variables whose impact on classification performance we evaluate.
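
As a rough illustration of phase (D), the following sketch scores views by average cosine similarity, selects the top-scoring ones, and fuses them. The function and variable names (e.g., select_and_fuse) are ours and not taken from the repository, so treat it as a hedged sketch rather than the exact implementation.

import numpy as np

def select_and_fuse(view_features, num_selected=6, fusion="max"):
    # view_features: (m, d) array, one flattened feature vector per rendered view.
    normed = view_features / np.linalg.norm(view_features, axis=1, keepdims=True)
    similarity = normed @ normed.T            # (m, m) pairwise cosine similarity
    scores = similarity.mean(axis=1)          # importance score per view
    scores = scores / scores.sum()            # normalize the scores
    selected = np.argsort(scores)[::-1][:num_selected]   # keep the highest-scoring views
    chosen = view_features[selected]
    if fusion == "max":                       # early max-pooling fusion
        return chosen.max(axis=0)
    if fusion == "sum":                       # early sum-pooling fusion
        return chosen.sum(axis=0)
    raise ValueError(f"unknown fusion: {fusion}")

# Example: 12 views with 2048-D features produce one global descriptor
descriptor = select_and_fuse(np.random.rand(12, 2048), num_selected=6)
print(descriptor.shape)  # (2048,)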

Dataset:

All experiments in this study utilize the widely recognized ModelNet40 dataset. It comprises 12,311 3D objects categorized into 40 classes, with standard training and test splits: 9,843 objects for training and 2,468 objects for testing. The number of objects varies across categories, resulting in an imbalanced distribution; therefore, two metrics, Overall Accuracy (OA) and Average Accuracy (AA), are reported.
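
Because of this imbalance, OA and AA answer slightly different questions. A minimal sketch of both metrics computed from a confusion matrix (our own illustration, not code from the notebooks) is:

import numpy as np

def overall_and_average_accuracy(confusion):
    # confusion[i, j] = number of class-i test objects predicted as class j
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    oa = np.diag(confusion).sum() / confusion.sum()   # Overall Accuracy: per-object
    aa = per_class.mean()                             # Average Accuracy: mean per-class accuracy
    return oa, aa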

In order to capture multiple views from each 3D object, a circular camera setup is employed, similar to MVCNN. In many related studies, such as MVCNN, RotationNet, view-GCN, and MVTN, 12 virtual cameras are positioned around the object, resulting in the extraction of 12 rendered views:
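
For intuition, the 12 circular viewpoints correspond to cameras placed every 30 degrees of azimuth at a fixed elevation. The helper below is a hypothetical sketch; the radius and the 30-degree elevation are assumptions in the spirit of the MVCNN-style setup, not values taken from the repository.

import numpy as np

def circular_camera_positions(num_views=12, elevation_deg=30.0, radius=2.5):
    # Cameras on a circle around the object: one every 360/num_views degrees of azimuth.
    azimuths = np.deg2rad(np.arange(num_views) * 360.0 / num_views)
    elev = np.deg2rad(elevation_deg)
    x = radius * np.cos(elev) * np.cos(azimuths)
    y = radius * np.cos(elev) * np.sin(azimuths)
    z = radius * np.sin(elev) * np.ones(num_views)
    return np.stack([x, y, z], axis=1)   # (num_views, 3) camera centers

print(circular_camera_positions().round(2))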

Furthermore, this study investigates different views with distinct feature types using various image rendering techniques. The following three representation views with varying types of features are explored:

  • Grayscale Views: These views employ surface normal maps generated using the Phong reflection model. The mesh polygons are rendered under a perspective projection, and the shape appears grayscale, like the original 3D object; the grayscale value of each pixel is determined by interpolating the reflected intensity of the polygon vertices. The shapes are uniformly scaled to fit within the viewing volume. Samples of Grayscale views are:

    For a fair comparison, we used the same grayscale views rendered by MVCNN:

    • GreyscaleModelNet40v1 Training can be downloaded from here.
    • GreyscaleModelNet40v1 Testing can be downloaded from here.
  • Shaded Views: These views are also rendered using the Phong reflection model, but the resulting images are grayscale with a black background. The camera's field of view is adjusted to encapsulate the 3D object within the image canvas tightly. Samples of Shaded views are:

    For a fair comparison, we used the same shaded views rendered by MVCNN-new:

    • ShadedModelNet40 Training can be downloaded from here.
    • ShadedModelNet40 Testing can be downloaded from here.
  • Depth Views: In this case, the views solely record the depth value of each pixel. Samples of Depth views are:

For a fair comparison, we used the same depth views rendered by MVCNN-new:

  • DepthModelNet40 Training can be downloaded from here.
  • DepthModelNet40 Testing can be downloaded from here.

Getting Started:

Since we will experiment with different rendering techniques, backbone networks, numbers of selected views, fusion strategies, and classifiers, we do the following to organize the data and results:

  • Prepare two folders:

    • data folder: put all the unzipped dataset folders inside it. Note that each dataset has two folders, one for training and the other for testing. The data folder will also be used later by our code to save the extracted features.
    • Results folder: create new folders inside it and name them after the datasets (one folder for each dataset).
      • Inside each dataset folder, create new folders and name them after the backbone networks (pre-trained CNN or Transformer).

    The folders will look like this:

    ---------data
              --------modelnet40v1_train
              --------modelnet40v1_test
              --------shaded_modelnet40v1_train
              --------shaded_modelnet40v1_test
              --------depth_modelnet40v1_train
              --------depth_modelnet40v1_test
    
    ---------Results
              --------modelnet40v1
                        --------VGG16
                        --------VGG19
                        --------ResNet50
                        --------ResNet152
                        --------EfficientNetB0
                        --------ViT
                        --------BEiT
              --------shaded_modelnet40v1
                                  .
                                  .
              --------depth_modelnet40v1
                                  .
                                  .
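
If it helps, the snippet below creates that layout programmatically. The folder names mirror the tree above; the script itself is our own convenience and is not part of the repository.

import os

datasets = ["modelnet40v1", "shaded_modelnet40v1", "depth_modelnet40v1"]
backbones = ["VGG16", "VGG19", "ResNet50", "ResNet152", "EfficientNetB0", "ViT", "BEiT"]

# data/ holds the unzipped <dataset>_train and <dataset>_test folders
for name in datasets:
    os.makedirs(os.path.join("data", f"{name}_train"), exist_ok=True)
    os.makedirs(os.path.join("data", f"{name}_test"), exist_ok=True)

# Results/<dataset>/<backbone>/ stores the outputs of each experiment
for name in datasets:
    for backbone in backbones:
        os.makedirs(os.path.join("Results", name, backbone), exist_ok=True)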
    

Feature Extraction:

For feature extraction, we used the following seven pre-trained backbones separately: VGG-16, VGG-19, ResNet-50, ResNet-152, EfficientNet-B0, ViT-B, and BEiT-B.
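
As an illustration of using a pre-trained CNN backbone as a frozen feature extractor, here is a minimal Keras sketch; the preprocessing and pooling choices are assumptions, and the notebooks may load the backbones differently.

import numpy as np
import tensorflow as tf

Img_Size = 224

# ResNet-152 pre-trained on ImageNet, without the classification head;
# pooling="avg" returns one flattened 2048-D vector per input view.
backbone = tf.keras.applications.ResNet152(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(Img_Size, Img_Size, 3))
backbone.trainable = False

def extract_view_features(views):
    # views: (m, 224, 224, 3) array holding the rendered views of one 3D object
    views = tf.keras.applications.resnet.preprocess_input(views.astype("float32"))
    return backbone.predict(views, verbose=0)   # (m, 2048) feature vectors

features = extract_view_features(np.random.randint(0, 255, (12, 224, 224, 3)))
print(features.shape)  # (12, 2048)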

Training and Testing:

To run an experiment, use the following guidelines to determine which files to run for training and testing (a short sketch of late Majority-Voting fusion follows the list):

  • Single View Experiment: Use the following files:

    • For training: Training-SV+Voting.ipynb
    • For testing: Testing-SV.ipynb
      Note: all samples are used for training, and no fusion is needed in testing.
  • Majority-Voting Multi-view Experiment: Use the following files:

    • For training: Training-SV+Voting.ipynb
    • For testing: Testing-MV-Voting.ipynb
      Note: all samples are used for training, and late Majority-Voting fusion is applied in testing.
  • Max-pooling Multi-view Experiment: Use the following files:

    • For training: Training-MV-Max+Sum.ipynb
    • For testing: Testing-MV-Max+Sum.ipynb
      Note: early Max-pooling fusion is applied in training and testing.
  • Sum-pooling Multi-view Experiment: Use the following files:

    • For training: Training-MV-Max+Sum.ipynb
    • For testing: Testing-MV-Max+Sum.ipynb
      Note: early Sum-pooling fusion is applied in training and testing.
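
For clarity, late Majority-Voting fusion aggregates per-view class predictions rather than per-view features. The sketch below shows the idea; the function name and shapes are our own illustration, not code from the notebooks.

import numpy as np

def majority_vote(view_logits):
    # view_logits: (num_selected_views, num_classes) classifier outputs
    per_view_pred = view_logits.argmax(axis=1)   # one predicted class per view
    return np.bincount(per_view_pred).argmax()   # the most frequently voted class wins

# Example: 6 selected views voting over the 40 ModelNet40 classes
print(majority_vote(np.random.rand(6, 40)))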

The following need to be specified, in all training and testing files, before running an experiment:

  1. Track and replace the paths of the data and Results folders with your own paths:
    "./Results/"
    "./data/"
  2. Choose the dataset version and path:
dataset_version= 'original_modelnet40v1'  
dataset_train = './data/original_modelnet40v1_train'

OR

dataset_version= 'shaded_modelnet40v1'    
dataset_train = './data/shaded_modelnet40v1_train' 

OR

dataset_version= 'depth_modelnet40v1' 
dataset_train = './data/depth_modelnet40v1_train'
  3. Specify the image size (here 224×224):
Img_Size= 224
  4. Specify the backbone feature extractor (here BEiT) and run its corresponding code cells. Note: we only experimented with seven backbones, but more options are included in the code.
all_model_name_txt = ["BEiT"]
  5. Specify the BATCH_SIZE, Mini_BATCH_SIZE, EPOCHS, and learning_rate:
BATCH_SIZE = 384
Mini_BATCH_SIZE = 32
EPOCHS = 30
learning_rate = 0.0001
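
For reference, the two classifier heads compared later (FCL vs. FCN) could look roughly like the following Keras sketch. The hidden size and dropout in the FCN are illustrative assumptions, not the exact configuration used in the notebooks.

import tensorflow as tf

NUM_CLASSES = 40          # ModelNet40
FEATURE_DIM = 768         # e.g., size of a BEiT-B global descriptor (assumed)
learning_rate = 0.0001

def build_fcl():
    # FCL: a single fully-connected softmax layer on the fused global descriptor
    inputs = tf.keras.Input(shape=(FEATURE_DIM,))
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(inputs)
    return tf.keras.Model(inputs, outputs, name="fcl")

def build_fcn():
    # FCN: a deeper fully-connected head (layer sizes are illustrative)
    inputs = tf.keras.Input(shape=(FEATURE_DIM,))
    x = tf.keras.layers.Dense(512, activation="relu")(inputs)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="fcn")

model = build_fcn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])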

Results:

We start with a comparison against existing 3D object classification models. We then perform a series of experiments to investigate the variables affecting classification performance. Finally, based on this analysis, we present the best-selected variables for SelectiveMV and analyze its predicted classes, followed by visualization and analysis of the underlying selection mechanism.

Comparison with 3D Classification Models:

Our SelectiveMV model is benchmarked against existing state-of-the-art techniques within both view-based and model-based categories. SelectiveMV demonstrates exceptional performance, outperforming the alternatives, which is a testament to its robustness and effective design. It adeptly handles the input data's intricacies, whether in the form of multi-angle views or complex 3D models, further establishing its superiority in the current landscape of 3D object classification methodologies:

Ablation Study:

We analyze the effect of backbone networks, rendering techniques, fusion strategy, classifiers, and the number of selected views.

The Effect of the Backbone Network:

This experiment focuses on the accuracy of the different backbone architectures and rendering techniques used in SelectiveMV for feature extraction. Each backbone processed all the views, whose features were combined using max-pooling as the fusion strategy to generate the global descriptors, which were then classified using an FCL. The detailed results are reported in the following table:

In general, ResNet-152 and BEiT-B stood out as the most effective CNN-based and Transformer-based models, respectively, across all rendering techniques. ResNet-152 performed particularly well when fed with shaded views, scoring an OA of 91.82%, which made it the most proficient among its CNN-based peers, whereas VGG-16, followed by VGG-19, were the worst-performing CNN backbones for all rendering views. On the Transformer side, BEiT-B achieved an OA of 90.72%, edging out ViT-B. Interestingly, these top models maintained high performance across all the rendering techniques we considered, including grayscale, shaded, and depth. Given their standout performance, we concentrate on ResNet-152 and BEiT-B in the subsequent experiments.

The Effect of the Rendering Technique:

From the above table, the shaded technique was superior, followed by depth and then grayscale, for all backbone networks, including the powerful ResNet-152 and BEiT-B; the only exception is the ViT-B backbone, for which depth outperforms the others. Shaded views may be better because they convey a more comprehensive set of visual information that helps the networks learn to recognize and classify objects more accurately. Depth views also provide valuable spatial information but may lack some surface detail, while grayscale views might omit important visual information that could be crucial for distinguishing between similar objects. In the context of 3D object recognition, several works, such as MVA-CNN, MVDAN, and MVCVT, used the shaded technique as the only rendering technique for their proposed models. However, it is important to note that the best rendering technique can be context-dependent.

The Effect of the Fusion Strategy:

The following table details the classification accuracies achieved by ResNet-152 and BEiT-B when subjected to various fusion methods with shaded rendering. Max-pooling emerged as a highly effective strategy for both architectures, although majority voting displayed competitive accuracies, mainly when a smaller number of views were utilized. For BEiT-B, the application of majority-voting led to an increase in performance, with OA reaching 92.54% and 92.79% upon fusing 3 and 6 views, respectively. In the case of ResNet-152, max-pooling generally yielded the highest accuracies across different view counts. However, an exception was noted with 12 views, where majority voting slightly improved the OA from 91.82% with max-pooling to 91.94%. Conversely, the sum-pooling technique resulted in a marginal decrease in classification performance for both neural network backbones.

The Effect of the Classifier:

Analyzing the classifier results in the above table, it is clear that ResNet-152 and BEiT-B demonstrate varying degrees of compatibility with different classifiers. BEiT-B consistently performs best with an FCN when majority voting, the best-performing fusion strategy, is utilized; this suggests that the FCN's more elaborate structure is beneficial for interpreting the consensus-based features derived from BEiT-B when predicting classes for majority voting. With max-pooling, however, BEiT-B performs better with an FCL. This distinction underscores BEiT-B's flexible adaptability to different fusion approaches, optimizing its classification performance with the appropriate combination of fusion strategy and classifier architecture.

On the other hand, ResNet-152 prefers an FCL with max-pooling, its best-performing fusion strategy, when working with 6 or 12 views, but switches to an FCN when the view count is reduced to 3 or just a single view. These insights suggest that while ResNet-152 has a strong capacity for feature extraction, the optimal pairing with a classifier depends on the amount of viewpoint data available. With fewer views, the FCN can exploit the deep, complex features provided by ResNet-152, while with more views the more straightforward FCL may be sufficient for achieving high accuracy.

However, BEiT-B can be considered better than ResNet-152 and all other backbones since it has the highest OA of 92.02% with just 3 views in shaded settings. It even outperformed other models with a single view, reaching an OA of 91.98%. This efficiency with minimal views arguably places BEiT-B at the top of the leaderboard, surpassing ResNet-152 and all other models we tested.

The Effect of the Number of Views:

In this experiment, each rendering technique is considered separately with different numbers of selected views. This approach allows an in-depth analysis of how the number of perspectives within each rendering technique impacts the classification performance on the 3D objects. The relationship between the number of views and the classification accuracy of various 3D classification models, including our SelectiveMV model with a BEiT-B backbone, Ma et al., Pairwise, MVCNN, and 3DShapeNets, is illustrated in the following figure:

Effect of the selected number of views. The number of views vs. overall accuracy of different 3D classification models, including our SelectiveMV, is plotted. G, Sh, and D refer to grayscale, shaded, and depth, respectively.

Based on the analysis of the data presented, it can be observed that the performance of SelectiveMV models with a selected number of views depends on the rendering techniques. SelectiveMV (Shaded) presents the highest OA with 92.79%, suggesting its strategy is particularly effective with a limited number of views, such as 6, while OA drops to 90.88% when all the views are selected. These findings highlight the effectiveness of the proposed model in capturing essential features and suggest that even a smaller number of carefully chosen views can yield comparable results to a more extensive set of views. SelectiveMV (Depth) follows the same pattern with the number of views and an OA of 90.36% with 6 views, indicating it also handles a few view scenarios well, but with slightly less efficiency than SelectiveMV (Shaded). On the other hand, SelectiveMV (Grayscale) has the lowest OA among them and achieved its highest performance at 89.25% when all the views were utilized, which might imply that its approach is less suited to fewer-view situations or that it requires more views to leverage its full potential. For those interested in the performance of the other backbones, with varying rendering techniques and number of views including single view, we have included those details in the supplementary material.

The comparative analysis illustrated above clearly underscores the exceptional performance of our SelectiveMV (Shaded) model in 3D object classification. The model consistently outperforms established benchmarks such as Ma et al., Pairwise, MVCNN, and 3DShapeNets, as evidenced by higher classification accuracy across varying numbers of views. Notably, SelectiveMV (Shaded) maintains its lead in accuracy irrespective of whether the input consists of fewer or more views. The only exception arises in the comparison with Ma et al.'s model when utilizing 12 views; in this case, their model slightly edges out ours. Intriguingly, however, our SelectiveMV (Shaded) with just a single view can surpass the accuracy of Ma et al.'s 12-view model. This capability attests to the robustness of SelectiveMV (Shaded), particularly when harnessing shaded views, and these findings not only validate the efficacy of our model but also position it as a frontrunner in 3D object classification.

SelectiveMV Analysis:

Based on the above analysis, we reach the best settings for the SelectiveMV model by using shaded views, BEiT-B, majority voting, and an FCN as the chosen rendering technique, backbone network, fusion strategy, and classifier, respectively. This setting achieves the highest 3D classification accuracies of 91.98%, 92.54%, and 92.79% OA with only 1, 3, and 6 selected views, respectively. SelectiveMV achieves 100% accuracy in 10 classes and more than 90% in 18 classes. The confusion matrix of SelectiveMV with the best settings is shown:

The confusion matrix achieved by the Transformer-based BEiT backbone.

SelectiveMV could classify remarkably similar classes, such as dresser and night-stand, with 88% and 80% accuracy, respectively. However, the dataset includes nearly identical objects that are labeled under different classes, such as flower-pot and plant, which confused our classifier: it succeeds on the class with more samples (plant, with 240 training objects, reaches 88% accuracy) but fails on the class with fewer samples (flower-pot, with 149 training objects, drops to 20% accuracy):

Some of the similar 3D objects from the ModelNet40 training set that belong to different classes: A) Dresser and night-stand, B) Flower-pot and plant.

Selection Mechanism:

In this analysis, we seek to visualize the selective approach with various selected views, each offering a distinct insight into the object's structure. These selected views span from a single view to a collection of twelve. The qualitative results highlighting the selected views of the piano's 3D model representation with Shaded views are presented:

Qualitative results of selected views of a piano 3D object based on the features extracted by BEiT-B. The selected numbers of views are 1, 3, 6, and all 12.

3D objects of a rounded nature (e.g., bottles) are characterized by their rotationally symmetric geometries. Due to this symmetry, multiple views rendered around these objects are expected to have a high degree of similarity. An example in which our selection approach assigns almost equal importance scores, due to the high similarity between the rendered views, is shown:

Multi-view representations of rounded nature objects (Bottle) in Shaded views. The similarity of views leads to almost equal importance scoring.

Citation:

For those who find the provided code beneficial for their research or work, we kindly request citing the following papers:

Original Paper:

@InProceedings{SelectiveMV2024,
    author    = {Alzahrani, Mona and Usman, Muhammad and Anwar, Saeed and Helmy, Tarek},
    title     = {Selective Multi-View Deep Model for 3D Object Classification},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {728-736}
}

Extended work:

@inproceedings{SelectiveMV2-2024,
  title={3D Object Classification With Selective Multi-View Fusion And Shape Rendering},
  author={Alzahrani, Mona and Usman, Muhammad and Alharbi, Randah and Anwar, Saeed and Mian, Ajmal and Helmy, Tarek},
  booktitle={International Conference on Digital Image Computing: Techniques and Applications (DICTA)},
  month={November},
  year={2024},
  organization={IEEE}
}

For proper attribution, we kindly request citing the original work that inspired this one; that paper introduced the selective multi-view deep model for enhancing 3D object classification using only a single view. Your acknowledgment through citations is greatly valued.

"Please note that the paper is forthcoming. Once the paper is officially published, we will update the citation details accordingly."

Acknowledgement:

This project is funded by the Interdisciplinary Research Center for Intelligent Secure Systems at King Fahd University of Petroleum & Minerals (KFUPM) under Grant Number INSS2305.