State shortly what you did during each week. Just a table with the main results is enough. Remind to upload a brief presentation (pptx?) at virtual campus. Do not modify the previous weeks code. If you want to reuse it, just copy it on the corespondig week folder.
This week we have solved an image classification task using the idea of a Bag of Visual Words to create the image features. We have explored a basic machine learning pipeline, and we have tried to improve creating more complex features (Dense SIFT, Spatial Pyramid, Fisher Vectors), introducing some preprocessing techniques (Normalization, scaling and dimensionality reduction) and finally trying different models (SVMs with different kernels, KNN and Logistic Regression). Also, we have performed many hyperparameter searches following both naive and optimal approaches (Optuna).
From all the testing, we have obtained the following model configuration:
Hyperparameter | Value | Hyperparameter | Value |
---|---|---|---|
Descriptor: | Dense SIFT | Dimensionality reduction algorithm: | No |
Descriptor parameter: | 771 | Number of components: | None |
Scale factor (Dense SIFT): | 8 | Model type: | SVM |
Step size (Dense SIFT): | 10 | Regularization parameter (C): | 0.13146 |
K (Codebook): | 348 | Kernel: | Histogram Intersection |
Normalization: | No | Alpha parameter (HI): | 0.7 |
Scaling: | No | Pyramid level: | 2 |
And we have obtained the following results on the test set:
F1-Score | Accuracy | Precision | Recall |
---|---|---|---|
0.9318 | 0.8724 | 0.9412 | 0.9227 |
This week we have explored the same classification problem as the previous week, but now using a MLP in different ways: A classifier and a feature extractor.
The MLP network trained is used as a classification algorithm.
The MLP network trained is used as a feature extractor, and the embeddings of the last hidden layer are used as the image descriptors that will be used to train a SVM classifier.
First we have trained a MLP model to classify random patches of the original MIT_split dataset -> 0.4857 accuracy score on test set Then, divide dataset images into patches, computed the embeddings using the trained network, created a BoVW, and then trained an SVM model:
Approach | MLP | K (Codebook) | Acc (Training) | Acc (Test) |
---|---|---|---|---|
Approach 1 | 3 layers (64, 128, 64) | - | 0.6871 | 0.4821 |
Approach 2 | 4 layers (512, 512, 512, 512) | - | 0.4583 | 0.4002 |
Approach 3 | 2 layers (1024, 2048) | 64 | 0.7602 | 0.6245 |
This week we have worked with a pretrained ResNet50 model to solve the classification problem in an n-to-n way. Since this model can easily achieve near-perfect score on our data, the goal is to fine-tune the model using only 40 samples per class.
To this end, we explore the effect of a variety of things, from data augmentation techniques to all the different hyperparameters that can be set and changed. Also, we look into the topology of the network by introducing drop-out and batch normalization layers, as well as regularizing the weights with penalties to the loss. Finally, we train the model while unfreezing more and more layers, and finish the experiments with an ablation studiy on the size of the architecture by directly removing entire blocks.
Train a CNN model from scratch using the reduced dataset MIT_small_train_1 (400 training images, 8 labels).
- MODEL 1: 880 weights, 65.22% accuracy.
- MODEL 2: 19848 weights, 77.80% accuracy.
Model | Performance ratio (*) | Distance (**) |
---|---|---|
MODEL 1 | 0.7411 | 0.9462 |
MODEL 2 | 0.3920 | 19.8492 |
(*) Performance ratio is accuracy/(number of weights/100000). (**) Distance to the top left corner of the plot.
Model 1 is preferred since it has the higher performance ratio, leading to the lowest distance.