Selective Multi-View Deep Model for 3D Object Classification (SelectiveMV)

This repository accompanies the paper "Selective Multi-View Deep Model for 3D Object Classification (SelectiveMV)" by Mona Alzahrani, Muhammad Usman, Saeed Anwar, and Tarek Helmy, presented at the CVPR 2024 Workshops. (pdf) (supp) (bibtex)

Requirements:

The model is built in the Visual Studio Code editor using the following (a quick GPU sanity check follows the list):

  • Tensorflow-gpu 2.10
  • Cuda 11.2
  • cuDNN 8.1
  • Python 3.9
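
As an optional sanity check (not part of the repository), the following Python snippet verifies that this TensorFlow build can see the GPU:

import tensorflow as tf

print(tf.__version__)                           # expect 2.10.x
print(tf.config.list_physical_devices('GPU'))   # should list at least one GPU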

Content:

  1. Introduction
  2. Architecture
  3. Dataset
  4. Getting Started
  5. Training
  6. Testing
  7. Results

Introduction:

3D object classification has emerged as a practical technology with applications in various domains, such as medical image analysis, automated driving, intelligent robots, and crowd surveillance. Among the different approaches, multi-view representations for 3D object classification have shown the most promising results, achieving state-of-the-art performance. However, there are certain limitations in current view-based 3D object classification methods. One observation is that using all captured views for classifying 3D objects can confuse the classifier and lead to misleading results for certain classes. Additionally, some views may contain more discriminative information for object classification than others. These observations motivate the development of smarter and more efficient selective multi-view classification models. In this work, we propose a Selective Multi-View Deep Model that extracts multi-view images from 3D data representations and selects the most influential view by assigning importance scores using the cosine similarity method based on visual features detected by a pre-trained CNN. The proposed method is evaluated on the ModelNet40 dataset for the task of 3D classification. The results demonstrate that the proposed model achieves an overall accuracy of 88.13% using only a single view when employing a shading technique for rendering the views, pre-trained ResNet-152 as the backbone CNN for feature extraction, and a Fully Connected Network (FCN) as the classifier.

Illustration of the proposed framework. The proposed framework operates in five phases to predict the class of a 3D object: A) It generates _m_ multi-view images from the 3D object. B) Feature maps are extracted from each view. C) These feature maps are converted into feature vectors, and D) importance scores are assigned based on their cosine similarity. The feature vector with the highest importance score, known as the Most Similar View (MSV), is selected as the global descriptor. E) Finally, the global descriptor is utilized to classify the object using a pre-trained classifier.

Architecture:

The architecture of the proposed selective multi-view deep model contains five phases:
(A) Multi-view extraction: from a given 3D object, m multiple views are extracted from different viewpoints and angles.
(B) Feature extraction: each extracted view is fed to a pre-trained CNN to extract the corresponding feature stack of the detected visual features.
(C) Vectorization: the detected m feature stacks are converted to m feature vectors.
(D) View selection: the m feature vectors are compared pairwise using cosine similarity, and each view receives an importance score that is subsequently normalized. Based on these scores, the most discriminative view is selected as the global descriptor (a short sketch of this step is given below the architecture figure).
(E) Object classification: the global descriptor of the object is fed to a classifier to predict its class.

The architecture of the proposed selective multi-view deep model.
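
The selection logic of phases C and D can be summarized with the short sketch below. It is an illustrative approximation rather than the repository's exact code, and it assumes the per-view feature stacks have already been flattened into an m × d matrix:

import numpy as np

def select_view(feature_vectors, mode="MSV"):
    # Score each view by its mean cosine similarity to the other views of the
    # same object, then keep the Most Similar View (MSV) or, alternatively,
    # the Most Dissimilar View (MDV).
    X = np.asarray(feature_vectors, dtype=np.float32)         # (m, d) feature vectors
    X = X / np.linalg.norm(X, axis=1, keepdims=True)          # L2-normalize each view
    sim = X @ X.T                                             # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)                                # ignore self-similarity
    scores = sim.sum(axis=1) / (len(X) - 1)                   # mean similarity per view
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)  # normalize scores
    idx = int(np.argmax(scores) if mode == "MSV" else np.argmin(scores))
    return idx, scores[idx], X[idx]                           # selected index, its score, global descriptor

The returned descriptor then plays the role of the global descriptor that phase E feeds to the classifier.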

Dataset:

ModelNet is a large-scale 3D dataset provided in 2014 by Wu et al. from Princeton University’s Computer Science Department. ModelNet40 contains manually cleaned 3D objects without color information that belong to 40 class categories. For a fair comparison, all of our experiments use two versions of this dataset, following the camera settings used in the literature:

  • ModelNet40v1 (Balanced and aligned dataset): this version uses the same training and testing splits of ModelNet40 as 3DShapeNets, MVCNN, RotationNet, and DeepCCFV. For each category, the first 80 training objects (or all of them, if fewer than 80 exist) are used for training, and the first 20 testing objects are used for balanced testing. A circular camera configuration is used to extract 12 aligned views per object, yielding 3,983 objects in 40 categories: 3,183 training objects (38,196 views) and 800 testing objects (9,600 views).

    Circular configuration (12 views).

    • ModelNet40v1 Training can be downloaded from here.
    • ModelNet40v1 Testing can be downloaded from here.
  • ModelNet40v2 (Imbalanced and unaligned dataset): this version uses the whole ModelNet40, as in RotationNet, view-GCN, and MVTN. The original dataset is imbalanced, with a varying number of objects across categories. It contains 12,311 3D objects split into 9,843 for training and 2,468 for testing. Following the literature, a spherical configuration is used to extract 20 unaligned views from each object, giving a total of 196,860 training views and 49,360 testing views.

    Spherical configuration (20 views).

    • ModelNet40v2 Training can be downloaded from here.
    • ModelNet40v2 Testing can be downloaded from here.

Additionally, we investigate the effect of the shape representation used to render 3D objects on single-view classification. For this experiment, we utilized the ModelNet40v2 dataset with 12 views per 3D object, but rendered each 3D object using the Phong shading technique. Shading techniques have been demonstrated to improve performance in models such as MVDAN and MVCNN. The rendered views were grayscale images with dimensions of 224 × 224 pixels and black backgrounds. The camera's field of view was adjusted so that the image canvas tightly encapsulated the 3D object.

Shaded multi-view images (12 views).

  • ShadedModelNet40v1 Training can be downloaded from here.
  • ShadedModelNet40v1 Testing can be downloaded from here.

Getting Started:

Since we will experiment with different versions of the datasets, feature extractors, and classifiers, we do the following to organize the data and results (a small helper script is sketched after this list):

  • Prepare two folders:

    • data folder: put all the unzipped dataset folders inside it. Note that each dataset has two folders, one for training and the other for testing. The data folder will also be used later by our code to save the extracted features.

    • Results folder: create new folders inside it and rename them with the dataset names (one folder for each dataset).

      • Inside each dataset folder, create new folders and rename them with the feature extractor names (pre-trained CNN names).

      • Inside each feature extractor folder, create new folders and rename them with the number of epochs used for training (here we trained for 20 and 30 epochs).
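
If convenient, this folder layout can also be created programmatically. The helper below is hypothetical (it is not part of the repository), and the dataset, extractor, and epoch names are placeholders to adapt to your own experiments:

from pathlib import Path

# Placeholder names; replace with the datasets, pre-trained CNNs, and epoch
# counts you actually experiment with.
datasets = ["modelnet40v1", "modelnet40v2", "shaded_modelnet40v2"]
extractors = ["ResNet152"]            # add the other feature extractor names here
epochs = ["20", "30"]

Path("data").mkdir(exist_ok=True)
for ds in datasets:
    for ext in extractors:
        for ep in epochs:
            Path("Results", ds, ext, ep).mkdir(parents=True, exist_ok=True)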

Training:

In the training phase, we train the following five feature extractors (pre-trained CNNs) separately:

And we train the following two classifiers:

To train the feature extractors and classifiers, run the Training.ipynb Jupyter Notebook after modifying the following (an illustrative training sketch follows this list):

  1. Track and replace the paths of data and Results folders with your paths:
    "./Results/"
    "./data/"
  2. Choose the dataset version and paths:
dataset_version= 'modelnet40v1' 
dataset_train = 'C:/Users/mona/Desktop/data/modelnet40v1_train'

OR

dataset_version= 'modelnet40v2'
dataset_train = './data/modelnet40v2_train'

OR

dataset_version= 'shaded_modelnet40v2'
dataset_train = './data/modelnet40v2_train'
  3. Specify the image size (here 224 × 224):
Img_Size= 224 
  4. Specify the feature extractor (here ResNet152). Note: we experimented with only five feature extractors, but more options are included in the code.
all_deep_models = [ResNet152]
all_model_name_txt = ["ResNet152"]
  5. Specify the BATCH_SIZE, EPOCHS, and learning_rate:
BATCH_SIZE = 384
EPOCHS = 20
learning_rate = 0.0001 
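
For orientation, the sketch below shows roughly what such a training run looks like in Keras: a frozen pre-trained ResNet152 backbone with a small fully connected head. It is not the notebook's exact code; the head architecture and preprocessing are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras.applications import ResNet152
from tensorflow.keras.applications.resnet import preprocess_input

Img_Size, BATCH_SIZE, EPOCHS, learning_rate = 224, 384, 20, 0.0001
dataset_train = './data/modelnet40v1_train'     # one sub-folder per class

train_ds = tf.keras.utils.image_dataset_from_directory(
    dataset_train, image_size=(Img_Size, Img_Size), batch_size=BATCH_SIZE)
num_classes = len(train_ds.class_names)
train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))

backbone = ResNet152(include_top=False, weights='imagenet', pooling='avg',
                     input_shape=(Img_Size, Img_Size, 3))
backbone.trainable = False                      # use the pre-trained CNN purely as a feature extractor

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(512, activation='relu'),            # illustrative FCN head
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_ds, epochs=EPOCHS)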

Testing:

For testing, run the Testing.ipynb Jupyter Notebook after modifying Steps 1 to 4 as above, and specify one of the following selection mechanisms (an illustrative test-time sketch follows this list):

  1. Most Similar View (MSV):
    selection_mechanism = 'MSV'
    OR
  2. Most Dissimilar View (MDV):
    selection_mechanism = 'MDV'
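
At test time, the chosen selection mechanism is applied to the per-view features of each object before classification. The routine below is a hypothetical illustration of that flow; it assumes a feature-extraction backbone and a trained image classifier such as those sketched in the Training section, and it is not the notebook's exact code:

import numpy as np

def classify_object(views, feature_extractor, classifier, selection_mechanism='MSV'):
    # views: the m rendered views of one object, shape (m, 224, 224, 3),
    # already preprocessed for the backbone.
    feats = feature_extractor.predict(views, verbose=0)        # (m, d) feature vectors
    X = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = X @ X.T                                              # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)
    scores = sim.mean(axis=1)                                  # importance score per view
    idx = int(np.argmax(scores) if selection_mechanism == 'MSV' else np.argmin(scores))
    probs = classifier.predict(views[idx:idx + 1], verbose=0)  # classify only the selected view
    return int(np.argmax(probs, axis=-1)[0])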

Results:

Quantitative Results:

The classification accuracy results of the proposed models on the ModelNet40v1 and ModelNet40v2 datasets, when the models are trained for 30 epochs, are summarized in the table below. This table presents the outcomes of various experiments conducted under different settings. It is worth noting that the proposed approach achieves its best results, an Overall Accuracy (OA) of 83.63% and an Average Accuracy (AA) of 83.63%, when only a single view is used for classifying 3D objects. This is observed when the pre-trained ResNet-152 model is employed for feature extraction and the FCN is used as the classifier, trained with 12 views from the ModelNet40v1 dataset (model M13). Additionally, when the same feature extractor is trained with 20 views from the ModelNet40v2 dataset, the proposed approach with the FCL classifier demonstrates competitive performance, achieving an OA of 83.7% but an AA of 80.39% (model M15).

The classification accuracy of our proposed model on the ModelNet40v1 and ModelNet40v2 datasets, rendered as 12 and 20 views per object, respectively. Each model is trained for 30 epochs. The best results are shown in bold and underlined:

We also investigate the effect of the shape representation used to render 3D objects on single-view classification. For this experiment, we utilized the ModelNet40v2 dataset with 12 views per 3D object, but rendered each 3D object using the Phong shading technique, as described in the Dataset section. Shading techniques have been demonstrated to improve performance in models such as MVDAN and MVCNN. The rendered views were grayscale images with dimensions of 224 × 224 pixels and black backgrounds, as depicted in the figure below. The camera's field of view was adjusted so that the image canvas tightly encapsulated the 3D object.

Results of the proposed model with shading as the rendering technique:

Comparison with selective view-based 3D object classification methods evaluated with a single view. OA is overall accuracy, and AA is average accuracy. The best results are shown in bold and underlined:

Visual Results:

In this work, we consider and select the most discriminative view in two different ways. The first selection technique treats the Most Similar View (MSV) as the most discriminative view, since it is likely to contain most of the features present in the other views of the same object; the MSV is the view with the highest cosine similarity (highest importance score). The second technique treats the Most Dissimilar View (MDV) as the most discriminative view, on the grounds that it captures unique, non-redundant features not shared by the other views of the same object; the MDV is the view with the lowest cosine similarity (lowest importance score).

The set of 12 circular views obtained from sample objects and their corresponding importance scores are displayed. Views with the highest importance scores, representing the Most Similar Views (MSV), are highlighted with green boxes. Conversely, views with the lowest importance scores, representing the Most Dissimilar Views (MDV), are enclosed in brown boxes.

Here, we use the Grad-CAM technique to analyze the predicted labels and highlight the regions of the views responsible for the classification. We show some views correctly predicted by the proposed model, with their corresponding feature maps highlighted by Guided Grad-CAM to indicate the regions that led to the correct classification (a minimal Grad-CAM sketch is given below the figure). These feature maps show how the proposed model selects views that contain distinguishing features, such as the shelves of bookshelves and the circular edges of bowls.

Samples of feature maps belong to correctly classified labels highlighted with the Grad-CAM technique to show the responsible regions that led to the classification.
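
For reference, the core Grad-CAM computation looks roughly like the sketch below. It is illustrative only: it assumes a functional Keras CNN model, the name of its last convolutional layer, and a preprocessed image batch x, and it is not the exact Guided Grad-CAM pipeline used to produce the figures:

import tensorflow as tf

def grad_cam(model, x, last_conv_layer_name, class_index=None):
    # Expose both the last convolutional feature maps and the final predictions.
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(last_conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(x)
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))             # explain the predicted class
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)               # d(class score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))               # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)                                      # keep only positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()[0]      # normalized heatmap for the first image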

It has been found that the top confusions occur when: i) "flower pot" is predicted as "plant", ii) "dresser" is predicted as "night stand", and iii) "plant" is predicted as "flower pot". Even for human observers, distinguishing between these specific pairs of classes can be challenging due to the ambiguity present.

Multi-view samples from the ModelNet40v1 dataset of the objects most frequently misclassified by the proposed model.

Since the Most Similar View (MSV) gives better results, the input and output of the proposed model are as follows: given a 3D object as input, our proposed model generates m multi-view images from the 3D object and assigns importance scores based on their cosine similarity; the view with the highest importance score is selected as the global descriptor, which is used to classify the object and, finally, predict its category as output.

Input and output of the proposed model.

Citation:

If you find the provided code beneficial for your research or work, we kindly request that you cite the following papers:

Original Paper:

@InProceedings{SelectiveMV2024,
    author    = {Alzahrani, Mona and Usman, Muhammad and Anwar, Saeed and Helmy, Tarek},
    title     = {Selective Multi-View Deep Model for 3D Object Classification},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {728-736}
}

Extended work:

@inproceedings{SelectiveMV2-2024,
  title={3D Object Classification With Selective Multi-View Fusion And Shape Rendering},
  author={Alzahrani, Mona and Usman, Muhammad and Alharbi, Randah and Anwar, Saeed and Mian, Ajmal and Helmy, Tarek},
  booktitle={International Conference on Digital Image Computing: Techniques and Applications (DICTA)},
  month={November},
  year={2024},
  organization={IEEE}
}

This extended work delves deeper into 3D object classification with selective multi-view fusion and shape rendering. Your citations and acknowledgments are greatly appreciated.

Acknowledgement:

The authors would like to acknowledge the support received from Saudi Data and AI Authority (SDAIA) and King Fahd University of Petroleum and Minerals (KFUPM) under SDAIA-KFUPM Joint Research Center for Artificial Intelligence Grant no. JRC-AI-RFP-19.